Team:Lambert GA/CALM MODEL

CALM MODEL

OVERVIEW

The ongoing Yemeni cholera outbreak, triggered by a devastating civil war, has been deemed “the worst outbreak in history”, with the primary cause of mortality being the inefficient allocation of medicines. This resource allocation problem persists because there is no comprehensive forecast of when, where, and in what numbers cholera cases will occur; such a forecast would allow relief supplies to be distributed in a targeted way and outbreaks to be mitigated. In response, we have constructed CALM, the Cholera Artificial Learning Model, a system of four extreme gradient boosting (XGBoost) machine learning models that, in tandem, forecast the number of new cholera cases (with an error margin of 4.787 cases per 10,000 people) that any given Yemeni governorate will experience over multiple time intervals ranging from 2 weeks to 2 months.

MODEL

We used XGBoost, a decision-tree-based extreme gradient boosting algorithm, to construct each of our four models. The task was time series forecasting, framed as regression.

Figure 1: XGBoost Stream.

XGBoost builds an ensemble of many (often hundreds of) decision trees and addresses both major sources of error. To limit variance-related error (overfitting), it borrows from bootstrap aggregation: each tree can be trained on a random subset of the rows and features, and the final prediction combines the outputs of all trees. To limit bias-related error (underfitting), it uses gradient boosting, in which each new decision tree is fit with a greater focus on the samples the prior trees had difficulty with (Chen and Guestrin, 2016). Unlike the simpler regression techniques used by other models (see Uniqueness of Approach), XGBoost can capture nonlinear relationships in the data while distinguishing them from noise, making it a more robust choice of algorithm.
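The boosting idea can be illustrated with a toy example. The sketch below is not XGBoost and not our training code: it is a minimal pure-Python gradient booster for squared error that uses one-split trees ("stumps") on a single made-up feature. All names and data are hypothetical. XGBoost itself adds regularization, second-order gradient information, subsampling, and far faster split finding.

```python
# Toy gradient boosting for squared-error regression with depth-1 trees.
# Each new stump is fit to the residuals left by the trees before it.

def fit_stump(xs, residuals):
    """Find the threshold on one feature that best splits the residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return threshold, left_mean, right_mean


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_boosted(xs, ys, n_trees=50, learning_rate=0.3):
    base = sum(ys) / len(ys)              # start from the mean prediction
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)  # focus on what is still wrong
        stumps.append(stump)
        preds = [p + learning_rate * predict_stump(stump, x)
                 for p, x in zip(preds, xs)]
    return base, stumps, learning_rate


def predict_boosted(model, x):
    base, stumps, learning_rate = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)


def mse(model, xs, ys):
    return sum((predict_boosted(model, x) - y) ** 2
               for x, y in zip(xs, ys)) / len(ys)


# Made-up data with a nonlinear shape (rise, plateau, fall).
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2.0, 2.5, 3.0, 9.0, 9.5, 10.0, 4.0, 3.5]
one_tree = fit_boosted(xs, ys, n_trees=1)
many_trees = fit_boosted(xs, ys, n_trees=50)
```

Comparing `mse(one_tree, xs, ys)` with `mse(many_trees, xs, ys)` shows the training error falling as trees are added. This is also why overfitting controls matter in practice.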

We constructed four separate models for four forecast ranges: 0-2 weeks, 2-4 weeks, 4-6 weeks, and 6-8 weeks in the future. The combination of these four models allows us to produce a comprehensive cholera forecast, detailing exactly how many new cases each Yemeni governorate will experience in each of the aforementioned time frames.
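As a rough illustration of how one target per forecast range could be derived from a single case series, the sketch below sums new cases over a two-week window shifted by each range's offset. It assumes weekly case counts, which the text does not specify. The function name, the `HORIZONS` mapping, and the numbers are hypothetical.

```python
# Offsets (in weeks) at which each two-week forecast window starts.
HORIZONS = {"0-2": 0, "2-4": 2, "4-6": 4, "6-8": 6}


def horizon_targets(weekly_cases, offset_weeks, window_weeks=2):
    """For each week t, sum new cases in weeks t+offset+1 .. t+offset+window.

    Returns None where the window runs past the end of the observed data.
    """
    targets = []
    for t in range(len(weekly_cases)):
        start = t + offset_weeks + 1
        end = start + window_weeks
        if end > len(weekly_cases):
            targets.append(None)          # future not yet observed
        else:
            targets.append(sum(weekly_cases[start:end]))
    return targets


# Made-up weekly counts for one governorate.
weekly = [3, 5, 8, 13, 21, 34, 21, 13, 8, 5]
targets = {name: horizon_targets(weekly, offset)
           for name, offset in HORIZONS.items()}
```

Each of the four models would then be trained against its own entry in `targets`, using only features available at week `t`.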

Figure 2: Diagram of CALM Conceptual Structure.

DATA

Lambert iGEM used multiple distinct time series in the design of CALM. Features were calculated for each governorate and for the governorates neighboring it. Features were also calculated over multiple time frames: 8 weeks prior, 6 weeks prior, 4 weeks prior, 2 weeks prior, and 1 week prior. The data from which features were calculated include conflict fatalities, rainfall, past cases, and past deaths.
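The sketch below shows one way such look-back and neighbor features could be computed. It assumes weekly series and uses only a mean and a sum per window; the actual feature set was far larger. The data values, the adjacency list, and the function names are hypothetical.

```python
from statistics import mean

LOOKBACKS = [8, 6, 4, 2, 1]  # weeks prior, as listed in the text


def window_features(series, t, prefix):
    """Summaries of `series` over each look-back window ending at week t."""
    feats = {}
    for weeks in LOOKBACKS:
        window = series[max(0, t - weeks + 1): t + 1]
        feats[f"{prefix}_mean_{weeks}w"] = mean(window)
        feats[f"{prefix}_sum_{weeks}w"] = sum(window)
    return feats


def neighbor_mean_series(data, neighbors):
    """Week-by-week mean of a series across neighboring governorates."""
    return [mean(values) for values in zip(*(data[n] for n in neighbors))]


# Made-up weekly case series; the adjacency list is an assumption.
cases = {
    "YE-SN": [1, 2, 3, 4, 5, 6, 7, 8],
    "YE-AM": [2, 2, 2, 2, 4, 4, 4, 4],
    "YE-MW": [0, 0, 0, 0, 2, 2, 2, 2],
}
neighbors = {"YE-SN": ["YE-AM", "YE-MW"]}

t = 7  # index of the most recent observed week
features = window_features(cases["YE-SN"], t, "cases")
neighbor_series = neighbor_mean_series(cases, neighbors["YE-SN"])
features.update(window_features(neighbor_series, t, "nbr_cases"))
```

The same pattern would be repeated for the other data sources (conflict fatalities, rainfall, past deaths), giving one feature row per governorate and date.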

FEATURE ENGINEERING

Feature engineering is the crux of applied machine learning, so we went through an exhaustive feature extraction and selection process to arrive at our final features. First, we extracted 45,000 potentially relevant features using the tsFresh package, which calculates an expansive array of time series features on our data (Christ et al., 2018). The objective of calculating this many features was to capture ideal representations of our data: while the majority of these features would not be used in the final model, covering such an expansive set made it more likely that the best features would be found. We also calculated features over a series of overlapping time frames to provide varying frames of reference and lags: 8 weeks prior, 6 weeks prior, 4 weeks prior, 2 weeks prior, and 1 week prior. Features describing geographically neighboring governorates (obtained by taking the mean) were also calculated.

While having more data is usually beneficial, in this case the number of features far outnumbered our training examples, so a demanding feature selection process was required. Using tsFresh’s scalable hypothesis tests with a false discovery rate of 0.001, we identified the features statistically relevant to each time-range prediction, giving us four sets of ~15,000 features, one for each time-range prediction. Next, we removed collinear features, that is, those that were 97% correlated with each other, as these features would be redundant to our model. This left us with sets of ~10,000 features to narrow further.

We then trained and tuned an extreme gradient boosting model, XGBoost, to rank the features in order of importance for each time-range prediction. Using the ranking produced, we added features one at a time, keeping each feature only if it improved our cross-validation loss (the root mean square error across all five cross-validation folds). This allowed us to arrive at the best 30-50 features for each time range. All in all, we were able to remove ~99.9% of our original features.
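The last two selection steps can be sketched in a few lines. This is a minimal illustration under assumptions, not our pipeline: the correlation filter keeps the first feature of any highly correlated pair, the forward pass takes the cross-validation loss as a callable, and `toy_cv_loss` is a made-up stand-in for training a model and averaging RMSE over five folds.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0                        # constant column: treat as uncorrelated
    return cov / sqrt(var_a * var_b)


def drop_collinear(columns, threshold=0.97):
    """Keep a feature only if |r| with every kept feature is below threshold."""
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept


def forward_select(ranked, cv_loss):
    """Walk features in importance order; keep one only if it lowers the loss."""
    selected, best = [], float("inf")
    for name in ranked:
        loss = cv_loss(selected + [name])
        if loss < best:
            selected.append(name)
            best = loss
    return selected, best


# Made-up feature columns: "b" is almost a rescaled copy of "a".
columns = {
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10.1],
    "c": [5, 1, 4, 2, 3],
}
kept = drop_collinear(columns)


def toy_cv_loss(features):
    # Hypothetical loss: "a" and "c" help, anything else hurts slightly.
    gain = {"a": 3.0, "c": 1.0}
    return 10.0 - sum(gain.get(f, -0.5) for f in features)


selected, best_loss = forward_select(["a", "noise", "c"], toy_cv_loss)
```

Here `kept` drops the redundant column and `selected` skips the feature that does not lower the loss, mirroring the two steps described above.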

TUNING

We used Bayesian Optimization to find optimal hyperparameters for our models. In contrast with a brute-force search over a defined set of hyperparameters, Bayesian Optimization uses prior evaluations to build a probabilistic model of the objective function over the hyperparameters, allowing informed choices about which hyperparameters to try next (Snoek et al., 2012). This allowed us to converge on optimal hyperparameters with far greater efficiency.
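To make "informed choices" concrete, the sketch below computes expected improvement, an acquisition function discussed by Snoek et al. (2012). The text does not state which acquisition function or library we used, so this is an assumption for illustration. The surrogate means and standard deviations are made up; in practice they would come from a probabilistic model fit to past evaluations.

```python
from math import erf, exp, pi, sqrt


def normal_pdf(z):
    return exp(-0.5 * z * z) / sqrt(2 * pi)


def normal_cdf(z):
    return 0.5 * (1 + erf(z / sqrt(2)))


def expected_improvement(mu, sigma, best_loss):
    """Expected improvement when minimizing a loss.

    mu, sigma: surrogate's predicted mean and standard deviation of the loss.
    best_loss: lowest loss observed so far.
    """
    if sigma <= 0:
        return max(best_loss - mu, 0.0)   # no uncertainty left
    z = (best_loss - mu) / sigma
    return (best_loss - mu) * normal_cdf(z) + sigma * normal_pdf(z)


# Hypothetical surrogate predictions (mean loss, std) for three settings.
candidates = {
    "max_depth=3": (5.2, 0.1),
    "max_depth=6": (5.0, 0.2),
    "max_depth=9": (5.3, 0.8),   # poorly explored, so highly uncertain
}
best_so_far = 5.0
scores = {name: expected_improvement(mu, sd, best_so_far)
          for name, (mu, sd) in candidates.items()}
next_trial = max(scores, key=scores.get)
```

In this made-up case the most uncertain setting scores highest even though its predicted mean is worst. This is the exploration behavior a brute-force grid lacks.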

RESULTS

Our models predict the number of cases any given governorate in Yemen will experience across multiple two-week intervals, with every model predicting within a margin of 5 cholera cases per 10,000 people on the hold-out set. Hold-out error represents our model’s performance in a real-world simulation, as the hold-out dataset was left untouched until final model evaluation. Our cross-validation error, similarly low, represents our model’s performance on a reliable but not entirely untouched dataset, as the cross-validation dataset was used for hyperparameter tuning and feature selection. The mean number of cases any given governorate in Yemen experienced within a two-week span was approximately 19.148, with a standard deviation of 21.311. Because, in this real-world simulation, all four of our predictive models have an error of roughly ⅕ of a standard deviation of the number of cases, our predictions are robust and reliable across all time frames. However, as the predictive time frame extends farther into the future, the cross-validation error decreases and the hold-out error increases. This could be seen as a sign of marginal overfitting, but it can also be attributed to the time shift in the data as the predictive range moves farther ahead: cholera 6-8 weeks ahead of a given date can look different from cholera 2-4 weeks ahead, though 4 weeks later the 2-4 week model will see the 6-8 week data.
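The "roughly ⅕ of a standard deviation" figure can be checked from the numbers quoted on this page. The sketch below also shows how a root mean square error is computed. The three sample values passed to `rmse` are made up, and the variable names are hypothetical.

```python
from math import sqrt


def rmse(actual, predicted):
    """Root mean square error between two equal-length sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared) / len(squared))


# Figures quoted in the text (cases per 10,000 people, two-week windows).
reported_error = 4.787     # error margin quoted in the overview
upper_bound_error = 5.0    # "within a margin of 5" for all four models
std_cases = 21.311         # standard deviation of two-week case counts

ratio = reported_error / std_cases          # about 0.22
worst_ratio = upper_bound_error / std_cases  # about 0.23

# Made-up example of the error metric itself.
example_rmse = rmse([10, 20, 30], [12, 18, 33])
```

Both ratios sit a little above 0.2, consistent with the ⅕ figure stated above.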

Figure 3: XGBoost Predictions of New Cases 0 to 2, 2 to 4, 4 to 6, and 6 to 8 Weeks in Advance for Five Governorates: Our forecasts for each time frame vs. a sliding window of real cases. The window is a two-week interval corresponding to the forecast range; it slides over the single data points and sums the cholera cases falling within the interval. For example, in the 0-2 week forecast plot, on September 15th we predicted there would be 92 new cholera cases per 10,000 people in the next two weeks in the governorate of YE-SN (Sana’a). Two weeks later, Sana’a had experienced ~92 cases, shown as the red true value. However, the date for the true-value data point remains September 15th, as the value describes the number of cases 0-2 weeks in the future. The red value is the true value: the number of new cholera cases actually experienced by the respective governorate in the corresponding time range (2-4 weeks from present, 4-6 weeks from present, etc.). Cross-validation predictions (green) were made with a rolling-window method, as described earlier (see methods), and the hold-out predictions (blue) were made separately, with the model training on all data prior to the hold-out set.
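For readers unfamiliar with rolling-window validation, the sketch below generates time-ordered folds in which every test block comes strictly after its training data. It is an expanding-window variant chosen for illustration. The exact scheme we used is described in the methods and may differ. The function name and the `min_train` parameter are hypothetical.

```python
def rolling_origin_splits(n_samples, n_folds=5, min_train=10):
    """Yield (train_indices, test_indices) pairs in time order.

    Each test block immediately follows its training data, so the model
    is never evaluated on dates earlier than those it was trained on.
    """
    fold_size = (n_samples - min_train) // n_folds
    if fold_size < 1:
        raise ValueError("not enough samples for the requested folds")
    for k in range(n_folds):
        train_end = min_train + k * fold_size
        test_end = train_end + fold_size
        yield list(range(train_end)), list(range(train_end, test_end))


# 30 time-ordered samples split into five folds.
splits = list(rolling_origin_splits(n_samples=30, n_folds=5, min_train=10))
```

The hold-out evaluation follows the same principle with a single split: train on everything before the hold-out period, then predict the hold-out period once.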

Our four-model system is able to accurately and comprehensively forecast cholera outbreaks across 21 governorates exhibiting heterogeneous behaviors. Five governorates were chosen to represent the entire range of behavior exhibited by all 21 of Yemen’s governorates. YE-AM (Amran), YE-DA (Dhale), and YE-MW (Al Mahwit) experienced some of the greatest cumulative numbers of cholera cases from May 22nd to February 18th, making them three of the four governorates most affected by cholera overall in the given time frame. It should be noted that the 3rd most affected governorate, YE-SA (Amanat Al Asimah), was not included, as it is technically a municipality enclosed by the YE-SN governorate (which is already among the 5 governorates chosen).

YE-RA (Raymah) and YE-SN (Sana’a) were chosen because of the sudden spike in cholera cases both experienced. Although it was affected less than other governorates, YE-RA saw a sudden increase in cholera cases between January and February (part of the hold-out set), peaking at ~20 new cases every two weeks (per 10,000 people). YE-SN experienced the most sudden outbreak of all governorates, with incidence rates reaching as high as 92 cases every two weeks (per 10,000 people) between mid-September and October. With this representative sample of five governorates, we show that CALM performs well on multiple governorates exhibiting a diverse range of behaviors.

It is worth noting that YE-SN and YE-RA present a rare and interesting case: a sudden outbreak. As our predictive range increases, it becomes more difficult to predict sudden spikes, either because information is lacking many weeks prior or because the events preceding a sharp outbreak have not yet occurred. As a result, longer-range models seem to predict sharp outbreaks with a certain lag. This can be seen in the 4-6 week model’s forecast of the YE-RA outbreak, which shows an upward trend but not a full spike.

However, this is where the combination of all four of our models becomes most useful. While long-range models cannot easily predict outbreaks, our shorter-range models are able to pick up the slack as the outbreak draws closer. Specifically, our 0-2 week forecasting model is able to predict the incidence spikes in YE-RA and YE-SN, so even if our long-range models were unable to predict an outbreak immediately, we would still detect it at a later date.