Web search activity data accurately predict population chronic disease risk in the USA
Background: The WHO framework for non-communicable disease (NCD) describes risks and outcomes comprising the majority of the global burden of disease. These factors are complex and interact at biological, behavioural, environmental and policy levels presenting challenges for population monitoring an...
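| Main Authors: | Nguyen, T.; Tran, The Truyen; Luo, W.; Gupta, S.; Rana, S.; Phung, D.; Nichols, M.; Millar, L.; Venkatesh, S.; Allender, S. |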
|---|---|
| Format: | Journal Article |
| Published: | BMJ Publishing Group 2015 |
| Online Access: | http://hdl.handle.net/20.500.11937/45061 |
| _version_ | 1848757177568722944 |
|---|---|
| author | Nguyen, T.; Tran, The Truyen; Luo, W.; Gupta, S.; Rana, S.; Phung, D.; Nichols, M.; Millar, L.; Venkatesh, S.; Allender, S. |
| author_sort | Nguyen, T. |
| building | Curtin Institutional Repository |
| collection | Online Access |
| description | Background: The WHO framework for non-communicable disease (NCD) describes risks and outcomes comprising the majority of the global burden of disease. These factors are complex and interact at biological, behavioural, environmental and policy levels presenting challenges for population monitoring and intervention evaluation. This paper explores the utility of machine learning methods applied to population-level web search activity behaviour as a proxy for chronic disease risk factors. Methods: Web activity output for each element of the WHO's Causes of NCD framework was used as a basis for identifying relevant web search activity from 2004 to 2013 for the USA. Multiple linear regression models with regularisation were used to generate predictive algorithms, mapping web search activity to Centers for Disease Control and Prevention (CDC) measured risk factor/disease prevalence. Predictions for subsequent target years not included in the model derivation were tested against CDC data from population surveys using Pearson correlation and Spearman's r. Results: For 2011 and 2012, predicted prevalence was very strongly correlated with measured risk data ranging from fruits and vegetables consumed (r=0.81; 95% CI 0.68 to 0.89) to alcohol consumption (r=0.96; 95% CI 0.93 to 0.98). Mean difference between predicted and measured differences by State ranged from 0.03 to 2.16. Spearman's r for state-wise predicted versus measured prevalence varied from 0.82 to 0.93. Conclusions: The high predictive validity of web search activity for NCD risk has potential to provide real-time information on population risk during policy implementation and other population-level NCD prevention efforts. |
| first_indexed | 2025-11-14T09:23:57Z |
| format | Journal Article |
| id | curtin-20.500.11937-45061 |
| institution | Curtin University Malaysia |
| institution_category | Local University |
| last_indexed | 2025-11-14T09:23:57Z |
| publishDate | 2015 |
| publisher | BMJ Publishing Group |
| recordtype | eprints |
| repository_type | Digital Repository |
| spelling | curtin-20.500.11937-45061 2018-04-09T05:05:25Z 2015 Journal Article http://hdl.handle.net/20.500.11937/45061 10.1136/jech-2014-204523 BMJ Publishing Group restricted |
| title | Web search activity data accurately predict population chronic disease risk in the USA |
| url | http://hdl.handle.net/20.500.11937/45061 |
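
The abstract in this record reports agreement between predicted and measured state-level prevalence as Pearson correlations with 95% confidence intervals, plus Spearman's r. The following is a minimal sketch of how such statistics can be computed, not the authors' code: the function names, the sample values, and the use of the Fisher z-transformation for the interval are all assumptions, since the record does not say how the intervals were derived.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_r(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))


def fisher_ci(r, n, z_crit=1.96):
    """Approximate 95% CI for a Pearson r via the Fisher z-transformation.

    Assumption: this is one common way to build such an interval; the
    record does not state which method the paper used.
    """
    z = math.atanh(r)
    se = 1 / math.sqrt(n - 3)
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


# Hypothetical prevalence values (percent) for six regions, for illustration
# only. These are not data from the paper or from the CDC.
measured = [10.2, 12.5, 9.8, 15.1, 11.0, 13.4]
predicted = [10.0, 12.9, 10.1, 14.6, 11.5, 13.0]

r = pearson_r(measured, predicted)
rho = spearman_r(measured, predicted)
low, high = fisher_ci(r, len(measured))
print(f"Pearson r = {r:.2f} (95% CI {low:.2f} to {high:.2f}); Spearman = {rho:.2f}")
```

As a rough plausibility check, `fisher_ci(0.96, 51)` gives an interval of about 0.93 to 0.98, in line with the alcohol consumption figure quoted in the abstract. The sample size of 51 (50 states plus the District of Columbia) is an assumption, not a value given in this record.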