Difference between revisions of "Team:Lambert GA/CALM BACKGROUND SOLUTION"

 
<br><br>
 
 
<div id="content2">
 
<div id="content2">
Feature engineering is the crux of applied machine learning, so we went through an exhaustive feature extraction and selection process to arrive at our final features. First, we extracted 45,000 potentially relevant features using the tsFresh package, which calculates an expansive array of time series features from our data (Christ et al., 2018). We calculated this many features in the hope of capturing ideal representations of our data: while the majority would not be used in the final model, covering such a broad set made it more likely that the best features would be found. We also calculated features over a series of overlapping time frames to provide varying frames of reference and lags: 8 weeks prior, 6 weeks prior, 4 weeks prior, 2 weeks prior, and 1 week prior. Features describing geographically neighboring governorates (by taking their mean) were also calculated. While having more data is usually beneficial, in this case the number of features far exceeded our number of training examples, so a demanding feature selection process was required. Using tsFresh’s scalable hypothesis tests with a false discovery rate of 0.001, we identified the features statistically relevant to each time-range prediction, giving us four sets of ~15,000 features, one for each time-range prediction. Next, we removed collinear features, those that were 97% correlated with each other, as these features would be redundant to our model. This left sets of ~10,000 features to narrow further. We trained and tuned an extreme gradient boosting model, XGBoost, to rank the features in order of importance for each time-range prediction. Using the ranking produced, we iteratively added features according to whether they improved our cross-validation loss (the root mean square error across all five cross-validation folds). 
<b>This allowed us to arrive at the best 30-50 features</b> for each time range. All in all, we were able to remove ~99.9% of our original features.
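The collinearity filter and the ranked feature-addition step described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the team's actual code: the feature names are invented, `drop_collinear` and `forward_select` are hypothetical helpers, and `toy_cv_loss` is a stand-in for training XGBoost and computing the five-fold root mean square error.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def drop_collinear(columns, threshold=0.97):
    """Keep a feature only if |r| < threshold against every feature already kept."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept


def forward_select(ranked, cv_loss):
    """Walk features in importance order; keep one only if it lowers the CV loss."""
    selected, best = [], float("inf")
    for name in ranked:
        loss = cv_loss(selected + [name])
        if loss < best:
            selected.append(name)
            best = loss
    return selected, best


# Invented example data: the second column is an exact multiple of the first.
columns = {
    "rain_mean_4w": [1.0, 2.0, 3.0, 4.0, 5.0],
    "rain_sum_4w": [4.0, 8.0, 12.0, 16.0, 20.0],
    "cases_lag_2w": [5.0, 1.0, 4.0, 2.0, 3.0],
}
kept = drop_collinear(columns)


def toy_cv_loss(features):
    # Hypothetical stand-in for "train XGBoost, return mean RMSE over five folds":
    # each feature shifts a baseline loss of 2.0 by a fixed amount.
    effect = {"rain_mean_4w": -0.5, "cases_lag_2w": -0.25, "noise": 0.125}
    return 2.0 + sum(effect[f] for f in features)


# The ranking here is assumed; in practice it would come from model importances.
selected, best = forward_select(["rain_mean_4w", "noise", "cases_lag_2w"], toy_cv_loss)
print(kept, selected, best)
```

In this toy run the redundant `rain_sum_4w` column is dropped by the correlation filter, and the `noise` feature is skipped during selection because it raises the loss.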
 
<br><br>
 
 
<div style="text-align:center"><img src="https://static.igem.org/mediawiki/2018/d/d3/T--Lambert_GA--CALMFeatureTuningResults.png"></div>
 

Revision as of 00:42, 18 October 2018

CALM

BACKGROUND & SOLUTION

Cholera in Yemen


Cholera is a waterborne disease caused by the bacterium Vibrio cholerae; it has plagued mankind for centuries and continues to do so despite the advances of modern medicine. The ongoing cholera outbreak in Yemen, which began in October 2016, has been deemed “the largest documented cholera outbreak” in a comprehensive analysis of cholera surveillance data by Camacho et al. (2018). Enabled by a devastating civil war, cholera has spread rampantly across the country, with the World Health Organization’s weekly bulletins reporting that, as of April 2017, there had been 1,055,788 suspected and 612,703 confirmed cases of cholera, causing 2,255 confirmed deaths (World Health Organization, 2017). While several effective interventions against cholera exist, including oral cholera vaccines (OCVs) with an 80.2% prevention rate (Azman et al., 2016), the inefficient and untimely distribution of medicine has been the primary cause of cholera mortality (Camacho et al., 2018). This is because the Yemeni outbreak has been largely sporadic, occurring in waves spawned by a variety of environmental (rainfall), political (civil war conflicts), and epidemiological (cholera incidence and mortality) factors (Camacho et al., 2018). Studies suggest that cholera transmission may resurge in a third wave during the rainy season of 2018, creating an urgent need for a forecast that details precisely when, where, and how many people will contract the disease (Camacho et al., 2018). With a comprehensive, actionable forecast, health organizations have the opportunity to deploy prevention methods in a highly targeted, efficient fashion, helping to mitigate the outbreak (Camacho et al., 2018).


Map of cholera outbreak in Yemen in 2017 (Al Jazeera, 2017).


Our Solution


Unique to the Yemeni outbreak has been the availability of expansive epidemiological datasets. Unlike outbreaks in nations such as Haiti, the Dominican Republic, and various African nations, the Yemeni outbreak has regular and reliable reporting of cholera and various related factors. This wealth of data has opened the possibility of using machine learning to predict cholera outbreaks. Thus, we have been able to construct CALM, the Cholera Artificial Learning Model, a system composed of four extreme-gradient-boosting (XGBoost) machine learning models that, working together, forecast the exact number of cholera cases any given Yemeni governorate will experience over multiple time intervals ranging from 2 weeks to 2 months. With extensive engineering of predictive features, the models draw on a wide span of relevant datasets, including multiple mathematical representations of rainfall, past cholera incidence and mortality, and civil war fatalities. By predicting the exact number of new cases (per 10,000 people) each governorate will experience over the next two months in 2-week intervals, CALM provides a comprehensive and accurate forecast of the Yemen cholera outbreak, allowing necessary preventative action to be taken. Furthermore, the geographic divisions (governorates) for which incidence is predicted are specific enough that practical measures can be taken to distribute medicines to those in need. For reference, YE-AM (Amran), the governorate with the greatest cumulative cholera case count (normalized by population), has an area of 9,587 square kilometers (Yemen, 2014).
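To make the prediction targets concrete, the sketch below shows one way to compute "new cases per 10,000 people" for four 2-week windows, one per model. It is an illustration under assumptions rather than CALM's actual code: it assumes weekly case counts, assumes the four windows cover weeks 0-2, 2-4, 4-6, and 6-8 after the forecast date, and uses a hypothetical helper named `horizon_targets` with invented numbers.

```python
def horizon_targets(weekly_cases, population, week,
                    horizons=((0, 2), (2, 4), (4, 6), (6, 8))):
    """New cases per 10,000 people in each future window, counted from `week`.

    weekly_cases[i] is the number of new cases reported in week i. Each
    (start, end) pair is a window of weeks after `week`. A window that runs
    past the end of the data yields None instead of a partial sum.
    """
    targets = []
    for start, end in horizons:
        lo, hi = week + 1 + start, week + 1 + end
        if hi > len(weekly_cases):
            targets.append(None)
            continue
        # Multiply before dividing so the normalization stays exact for integers.
        targets.append(sum(weekly_cases[lo:hi]) * 10_000 / population)
    return targets


# Invented weekly counts for a single governorate of 100,000 people.
cases = [10, 20, 30, 40, 50, 60, 70, 80, 90]
print(horizon_targets(cases, 100_000, week=0))
print(horizon_targets(cases, 100_000, week=1))
```

Each of the four values would serve as the training target for one model, which is why a forecast date too close to the end of the data leaves the longest window undefined.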


Results of Feature Tuning