C A L M
B A C K G R O U N D & S O L U T I O N
Cholera in Yemen
Cholera is a waterborne disease caused by the bacterium Vibrio cholerae; it has plagued humanity for centuries and continues to do so despite the advances of modern medicine. The ongoing cholera outbreak in Yemen, which began in October 2016, has been deemed “the largest documented cholera outbreak” by a comprehensive analysis of cholera surveillance data (Camacho et al., 2018). Enabled by a devastating civil war, cholera has spread rampantly across the country, with the World Health Organization’s weekly bulletins reporting that, as of April 2017, there had been 1,055,788 suspected and 612,703 confirmed cases of cholera, causing 2,255 confirmed deaths (World Health Organization, 2017). While cholera has several effective interventions, including oral cholera vaccines (OCVs) with an 80.2% prevention rate (Azman et al., 2016), the inefficient and untimely distribution of medicine has been the primary cause of cholera mortality (Camacho et al., 2018). This is because the Yemeni outbreak has been largely sporadic, occurring in waves driven by a variety of environmental (rainfall), political (civil war conflicts), and epidemiological (cholera incidence and mortality) factors (Camacho et al., 2018). Studies suggest that a third wave of cholera transmission may resurge during the rainy season of 2018, creating an urgent need for a forecast that details precisely when, where, and how many people will contract the disease (Camacho et al., 2018). With a comprehensive, actionable forecast, health organizations could deploy prevention methods in a highly targeted, efficient fashion, mitigating the outbreak (Camacho et al., 2018).
Datasets and Training
Our Solution
Feature engineering is the crux of applied machine learning, so we went through an exhaustive feature extraction and selection process to arrive at our final features. First, we extracted 45,000 potentially relevant features using the tsFresh package, which calculates an expansive array of time series features on our data (Christ et al., 2018). The objective of calculating so many features was to capture ideal representations of our data: while the majority of these features would not be used in the final model, covering this expansive set ensured the best features would be found. We also calculated features over a series of overlapping time frames to provide varying frames of reference and lags: 8 weeks prior, 6 weeks prior, 4 weeks prior, 2 weeks prior, and 1 week prior. We additionally calculated features describing geographically neighboring governorates (by taking their mean). While having more data is usually beneficial, in this case our features far outnumbered our training examples, so a demanding feature selection process was required.

Using tsFresh’s scalable hypothesis tests with a false discovery rate of 0.001, we identified the features statistically relevant to each time-range prediction, yielding four sets of ~15,000 features, one per time frame. Next, we removed collinear features, i.e., those more than 97% correlated with one another, as these would be redundant to our model. This left sets of ~10,000 features to narrow further. We then trained and tuned an extreme gradient boosting model, XGBoost, to rank the features in order of importance for each time-range prediction. Using the ranking produced, we recursively added features, keeping each one only if it improved our cross-validation loss (the root mean square error averaged across all five cross-validation folds). This allowed us to arrive at the best 30-50 features for each time range. All in all, we removed ~99.9% of our original features.
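The lagged and neighbor-mean features described above can be sketched in a few lines. This is an illustrative toy example, not the project's code: the governorate labels, adjacency map, and weekly case counts below are all hypothetical.

```python
# Sketch of lagged and neighbor-mean feature construction for per-governorate
# weekly case counts. All names and numbers here are hypothetical toy data.

NEIGHBORS = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}  # hypothetical adjacency

# hypothetical weekly suspected-case counts per governorate
cases = {"A": [3, 5, 8, 13, 21, 34, 55, 89, 144],
         "B": [2, 2, 4, 6, 10, 16, 26, 42, 68],
         "C": [1, 1, 2, 3, 5, 8, 13, 21, 34]}

def lag_feature(series, lag):
    """Value of the series `lag` weeks before each week (None if unavailable)."""
    return [series[i - lag] if i >= lag else None for i in range(len(series))]

def neighbor_mean(gov, week):
    """Mean of the same week's counts across geographically adjacent governorates."""
    vals = [cases[n][week] for n in NEIGHBORS[gov]]
    return sum(vals) / len(vals)

# lags mirroring the time frames in the text: 1, 2, 4, 6, and 8 weeks prior
features_A = {f"lag_{k}": lag_feature(cases["A"], k) for k in (1, 2, 4, 6, 8)}
features_A["neighbor_mean"] = [neighbor_mean("A", w) for w in range(len(cases["A"]))]
```

In the real pipeline these per-week values would feed into tsFresh's feature calculators rather than being used directly.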
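The collinearity filter can be illustrated with a minimal pure-Python sketch. The 0.97 threshold matches the text; the toy feature columns are invented, and the real pipeline ran over sets of ~15,000 tsFresh features.

```python
# Hedged sketch of the collinearity filter: drop one of any pair of features
# whose absolute Pearson correlation exceeds the threshold.
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def drop_collinear(features, threshold=0.97):
    """Keep a feature only if it is below `threshold` correlation with every kept one."""
    kept = {}
    for name, col in features.items():
        if all(abs(pearson(col, kc)) < threshold for kc in kept.values()):
            kept[name] = col
    return kept

toy = {
    "f1": [1, 2, 3, 4, 5],
    "f2": [2, 4, 6, 8, 10],   # perfectly correlated with f1 -> dropped
    "f3": [5, 3, 6, 2, 7],    # weakly related -> kept
}
kept = drop_collinear(toy)    # -> keeps f1 and f3
```

This greedy first-come-first-kept scan is one common way to prune redundant features; which member of a correlated pair survives depends on iteration order.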
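The recursive addition step amounts to greedy forward selection over the importance-ranked features. In the sketch below, `cv_rmse` is a hypothetical stand-in for the five-fold cross-validated RMSE of the tuned XGBoost model, and the feature names are invented.

```python
# Sketch of greedy forward selection driven by cross-validation loss:
# walk features in importance-ranked order, keeping each only if it
# lowers the CV RMSE.

def cv_rmse(feature_set):
    """Hypothetical stand-in for the mean RMSE over five CV folds of a model
    trained on `feature_set`; lower is better."""
    useful = {"f_rain_lag4": 2.0, "f_cases_lag1": 3.0, "f_neighbor_mean": 1.0}
    base = 10.0
    return base - sum(useful.get(f, 0.0) for f in feature_set)

def forward_select(ranked_features):
    selected, best = [], cv_rmse([])
    for feat in ranked_features:          # assumed ranked by XGBoost importance
        score = cv_rmse(selected + [feat])
        if score < best:                  # keep only if CV loss improves
            selected.append(feat)
            best = score
    return selected, best

ranked = ["f_cases_lag1", "f_noise_a", "f_rain_lag4", "f_noise_b", "f_neighbor_mean"]
selected, loss = forward_select(ranked)   # noise features are rejected
```

Running the importance ranking first makes this single greedy pass cheap relative to exhaustive subset search, which is intractable at ~10,000 candidate features.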
Results of Feature Tuning