Revision as of 23:12, 17 October 2018
C A L M
D A T A & R E S U L T S
Cholera Epidemiological Data
Cholera case and death statistics are reported by the World Health Organization (WHO), whose health experts and researchers work directly with Yemeni health authorities at both the country and local level. Through this direct connection, the WHO is able to record all reported cholera cases and cholera deaths (WHO presence in Yemen, 2018). The data collected by the WHO was accessed through the Humanitarian Data Exchange (https://data.humdata.org/group/yem). It provided reports of accumulated new cholera cases and deaths per governorate from May 22, 2017, to February 18, 2018.
Past cholera cases and deaths were included on the simple assumption that they would be predictive of future cases. Vibrio cholerae requires aquatic environments and can pass between humans through bodily fluids. Thus, the incidence of cholera in one region can indicate the contamination of several food and water sources, and therefore a further spread of cholera (Cholera - Vibrio cholerae infection).
Rainfall Data
As Vibrio cholerae is indigenous to aquatic environments, rainfall is a significant predictor of cholera transmission. In areas exposed to heavy rainfall, the collapse of sanitary and health infrastructure increases contact between contaminated water and human activity, which can result in an epidemic (Jutla et al., 2013). Yemen fits this scenario: heavy rainfall combined with deteriorating health facilities was followed by a surge in cholera cases (Camacho et al., 2018). Researchers writing in The Lancet Global Health, analyzing surveillance data for the Yemen cholera outbreak from 2016 to 2018, found a positive and nonlinear association between weekly rainfall and suspected cholera incidence: the relative risk of cholera 10 days after a weekly rainfall of 25 mm was 42% higher than after a week without rain (Camacho et al., 2018). Although rainfall has not been established as a cause of the increase in cholera outbreaks, the use of unsafe water sources during the dry season, contamination of water sources during the rainy season, and changing levels of zooplankton and iron in water (which help cholera bacteria survive) may all contribute to higher levels of cholera during the rainy season (Camacho et al., 2018). These correlations motivate including rainfall in the machine learning model: rainfall signals changes in climate conditions and the corresponding human response, and thereby indicates the spread of cholera in Yemen.
Daily rainfall data for Yemen from January 1st, 2017 to March 30th, 2018 was accessed through NASA’s Goddard Earth Sciences Data and Information Services Center (GES DISC), which provides Global Precipitation Measurement data through the Simple Subset Wizard (SSW) database. The Global Precipitation Measurement (GPM) mission, launched on February 27th, 2014, is an international network of satellites that use microwave imagers and precipitation radars to measure the volume of rainfall in several regions of the world (Global Precipitation Measurement, 2011). The rainfall data was initially in NetCDF4 format; the 452 files were parsed and converted to comma-separated values (CSV). As there were individual data points for every 0.25 degrees of both latitude and longitude, reverse geocoding was performed to match coordinates with the corresponding Yemeni governorates.
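The coordinate-to-governorate matching described above can be illustrated with a point-in-polygon test. This is only a minimal sketch, not the pipeline that was actually used: the function names are hypothetical, and the single rectangular boundary is a made-up placeholder standing in for real governorate boundaries, which would normally be loaded from a boundary file.

```python
from typing import Dict, List, Optional, Tuple

Point = Tuple[float, float]  # (longitude, latitude) in degrees


def point_in_polygon(point: Point, polygon: List[Point]) -> bool:
    """Ray-casting test: cast a ray toward +longitude and count edge crossings."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the point's latitude can be crossed.
        if (y1 > y) != (y2 > y):
            # Longitude at which the edge crosses the point's latitude.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def governorate_for(point: Point, boundaries: Dict[str, List[Point]]) -> Optional[str]:
    """Return the first governorate whose boundary contains the point, else None."""
    for name, polygon in boundaries.items():
        if point_in_polygon(point, polygon):
            return name
    return None


# Hypothetical boundary: a made-up rectangle, NOT a real governorate outline.
BOUNDARIES = {
    "Example governorate": [(44.0, 15.0), (45.0, 15.0), (45.0, 16.0), (44.0, 16.0)],
}

# Grid points on a 0.25-degree lattice, as in the rainfall files.
grid_points = [(44.25, 15.25), (44.50, 15.75), (46.00, 15.25)]
labels = [governorate_for(p, BOUNDARIES) for p in grid_points]
```

Grid points that fall outside every boundary (for example over the sea or a neighbouring country) come back as `None` and can be dropped before rainfall is averaged per governorate.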
Conflict Data (Yemeni Civil War)
Yemen is currently in the grip of a devastating civil war, which is heavily impacting the country’s cholera crisis (Camacho et al., 2018). While cholera is preventable and treatable under normal circumstances, the collapse of Yemen’s health, water, and sanitation sectors amidst the ongoing armed conflict has fueled the spread of cholera across the country. With direct attacks against hospitals and the bombing of water supplies, the conflict has dissolved 55% of the country's medical, wastewater, and solid waste management infrastructure, making access to clean water and healthcare difficult and expensive (Camacho et al., 2018; Yemen’s Cholera Crisis: Fighting Disease During Armed Conflict, 2017; Yemen: The Forgotten War, 2018). The conflict has left 15 million Yemenis in need of water and sanitation assistance (Camacho et al., 2018). Information on the status of ongoing conflicts, namely their severity in terms of death toll, was collected in the hope that it would be predictive of the ability of a region’s infrastructure to provide cholera treatment in the following weeks. Data gathered by the Armed Conflict Location and Event Data Project (ACLED) was retrieved from the Humanitarian Data Exchange (https://data.humdata.org/group/yem). ACLED reported the type of conflict, agents, locations, dates, and other characteristics of the politically charged conflict from January 1, 2016, to June 6, 2018 (Raleigh and Dowd, 2017). The number of casualties due to conflict on any given day and in any given Yemeni governorate was used as a metric for civil-war-related violence.
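The casualty metric described above amounts to summing fatalities over all conflict events that share a governorate and a date. The sketch below shows that aggregation under stated assumptions: the record layout and key names (`event_date`, `governorate`, `fatalities`) are hypothetical stand-ins for the columns of the downloaded file, and the event values are invented for illustration only.

```python
from collections import defaultdict
from datetime import date
from typing import Dict, List, Tuple

# Invented example records; the key names are assumptions, not the real schema.
events = [
    {"event_date": date(2017, 6, 1), "governorate": "Taizz", "fatalities": 4},
    {"event_date": date(2017, 6, 1), "governorate": "Taizz", "fatalities": 2},
    {"event_date": date(2017, 6, 1), "governorate": "Al Hudaydah", "fatalities": 1},
    {"event_date": date(2017, 6, 2), "governorate": "Taizz", "fatalities": 0},
]


def daily_casualties(records: List[dict]) -> Dict[Tuple[str, date], int]:
    """Sum fatalities per (governorate, day) across all recorded events."""
    totals: Dict[Tuple[str, date], int] = defaultdict(int)
    for record in records:
        key = (record["governorate"], record["event_date"])
        totals[key] += record["fatalities"]
    return dict(totals)


casualties = daily_casualties(events)
```

Days with no recorded event are simply absent from the result; whether to treat them as zero casualties is a modelling choice that this sketch leaves open.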
Diagram of Environmental and Demographic Datasets
Diagram of Factors and Processes
Results
Figure 1: Cross-validation and Hold-out Error for four XGBoost forecasting models
Cross-validation and hold-out error for each of our four models. Cross-validation error was obtained by taking the root of the mean of the model’s performance across five rolling-window cross-validation folds. Hold-out error was obtained by calculating the root mean squared error for predictions only on the hold-out set.
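One way to read the cross-validation error in the caption is: compute the mean squared error on each of the five rolling-window folds, average those values, and take the square root. The sketch below follows that reading. It is an illustration under assumptions rather than the evaluation code that was used: the window sizes, the helper names, and the `mean_forecaster` placeholder (which stands in for an XGBoost model) are all hypothetical, and the series is made up.

```python
import math
from typing import Callable, List, Sequence, Tuple


def rolling_window_splits(
    n_samples: int, n_folds: int, train_size: int, test_size: int
) -> List[Tuple[List[int], List[int]]]:
    """Build folds whose training window slides forward in time.

    Every test index is later than every training index in the same fold,
    so no fold trains on the future.
    """
    splits = []
    for k in range(n_folds):
        start = k * test_size
        train = list(range(start, start + train_size))
        test = list(range(start + train_size, start + train_size + test_size))
        if test[-1] >= n_samples:
            raise ValueError("not enough samples for the requested folds")
        splits.append((train, test))
    return splits


def mse(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean squared error between two equal-length sequences."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def cross_validation_rmse(
    series: Sequence[float],
    n_folds: int,
    train_size: int,
    test_size: int,
    fit_predict: Callable[[List[float], int], List[float]],
) -> float:
    """Square root of the mean of the per-fold mean squared errors."""
    fold_mses = []
    for train_idx, test_idx in rolling_window_splits(
        len(series), n_folds, train_size, test_size
    ):
        train = [series[i] for i in train_idx]
        actual = [series[i] for i in test_idx]
        fold_mses.append(mse(actual, fit_predict(train, len(actual))))
    return math.sqrt(sum(fold_mses) / len(fold_mses))


def mean_forecaster(train: List[float], horizon: int) -> List[float]:
    """Placeholder model: predicts the training-window mean at every step."""
    return [sum(train) / len(train)] * horizon


series = [float(v) for v in range(20)]  # made-up, steadily rising series
splits = rolling_window_splits(len(series), n_folds=5, train_size=5, test_size=2)
cv_error = cross_validation_rmse(series, 5, 5, 2, mean_forecaster)
```

The hold-out error would be computed the same way as a single fold, `math.sqrt(mse(actual, predicted))`, on data that never entered any training or tuning window.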
Our models predict the number of cholera cases any given governorate in Yemen will experience across multiple two-week intervals, and all four predict within a margin of 5 cholera cases per 10,000 people on the hold-out set. Hold-out error represents our models’ performance in real-world simulation, as the hold-out dataset was left untouched until final model evaluation. Our cross-validation error, similarly low, represents our models’ performance on a reliable but not entirely untouched dataset, as the cross-validation dataset was used for hyperparameter tuning and feature selection. The mean number of cases any given governorate in Yemen experienced within a two-week span was approximately 19.148, with a standard deviation of 21.311. Because, in real-world simulation, all four of our predictive models predict within around ⅕ of a standard deviation of the number of cases, our predictions are robust and reliable across all time frames. However, as the predictive timeframe extends farther into the future, the cross-validation error decreases and the hold-out error increases. This could be seen as a sign of marginal overfitting, but it can also be attributed to the time shift in the data as the predictive range moves farther ahead: cholera 6-8 weeks ahead of a given date can look different from cholera 2-4 weeks ahead, though 4 weeks later the 2-4 week model will see the 6-8 week data.
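The comparison between prediction error and the spread of the data can be checked with a short calculation. The sketch below uses only the figures quoted above and assumes, as the text does, that the error margin and the standard deviation are expressed in the same units; the variable names are illustrative.

```python
# Figures quoted in the text above.
MEAN_CASES = 19.148   # mean cases per governorate per two-week span
STD_CASES = 21.311    # standard deviation of the same quantity
ERROR_MARGIN = 5.0    # upper bound on hold-out error for all four models

# Error expressed as a fraction of one standard deviation.
error_in_std_units = ERROR_MARGIN / STD_CASES

# Error expressed relative to the mean, for comparison.
error_relative_to_mean = ERROR_MARGIN / MEAN_CASES
```

Because 5 is an upper bound on the margin, the ratio of about 0.23 is a ceiling: any model whose hold-out error is below 5 sits closer to one fifth of a standard deviation.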