Team:Lambert GA/Software


Cholera in Yemen

Cholera is a waterborne disease caused by the bacterium Vibrio cholerae, which has plagued mankind for centuries and continues to do so despite the advances of modern medicine. The ongoing cholera outbreak in Yemen, which began in October of 2016, has been deemed “the largest documented cholera outbreak” through a comprehensive analysis of cholera surveillance data by Camacho et al. (2018). Enabled by a devastating civil war, cholera has spread rampantly across the country, with the World Health Organization’s weekly bulletins reporting that, as of April 2017, there had been 1,055,788 suspected and 612,703 confirmed cases of cholera, causing 2,255 confirmed deaths (World Health Organization, 2017). While cholera has several effective treatments and preventive measures, including oral cholera vaccines (OCVs) with an 80.2% prevention rate (Azman et al., 2016), the inefficient and untimely distribution of medicine has been the primary cause of cholera mortality (Camacho et al., 2018). This is because the Yemeni outbreak has been largely sporadic, occurring in waves driven by a variety of environmental (rainfall), political (civil war conflicts), and epidemiological (cholera incidence and mortality) factors (Camacho et al., 2018). Studies suggest that a third wave of cholera transmission may emerge during the rainy season of 2018, resulting in an urgent need for a forecast that details when, where, and how many people will contract the disease (Camacho et al., 2018). With a comprehensive, actionable forecast, health organizations have the opportunity to deploy prevention methods in a highly targeted, efficient fashion, allowing for the mitigation of the outbreak (Camacho et al., 2018).

Map of cholera outbreak in Yemen in 2017 (Al Jazeera, 2017).

Our Solution

Unique to the Yemeni outbreak has been the availability of expansive epidemiological datasets. As opposed to outbreaks in nations such as Haiti, the Dominican Republic, and various African nations, the Yemeni outbreak has regular and reliable reporting of cholera and various related factors. This wealth of data has opened the possibility of using machine learning to predict cholera outbreaks. Thus, we have been able to construct CALM, the Cholera Artificial Learning Model, a system composed of four extreme-gradient-boosting (XGBoost) machine learning models that, working together, forecast the exact number of cholera cases any given Yemeni governorate will experience for multiple time intervals ranging from 2 weeks to 2 months. With extensive engineering of predictive features, the models utilize a large span of relevant datasets, including multiple mathematical representations of rainfall, past cholera incidence and mortality, and civil war mortalities. By predicting the exact number of new cases (per 10,000 people) each governorate will experience in the next two months in 2-week intervals, CALM provides a comprehensive and accurate forecast of the Yemen cholera outbreak, allowing for necessary preventative action to be taken. Furthermore, the geographic divisions (governorates) for which incidence is predicted are specific enough that practical measures can be taken to distribute medicines to those in need. For reference, YE-AM (Amran), the governorate with the greatest cumulative cholera case count (normalized by population), has an area of 9,587 square kilometers (Yemen, 2014).

Diagram of CALM Conceptual Structure.

Cholera Incidence and Mortality

Past cholera cases and deaths were included with the assumption that they would be predictive of future cases. Vibrio cholerae thrives in aquatic environments and can transfer between humans through the transfer of bodily fluids. Thus, the incidence of cholera in one region can indicate the contamination of several food and water sources and further the spread of cholera. With this in mind, cholera case and death data from the World Health Organization (WHO) reports were included as a primary feature in CALM.

Rainfall Data

Analysis of surveillance data for the Yemeni cholera outbreak from 2016 to 2018 found a positive and nonlinear association between weekly rainfall and suspected cholera incidence: the relative risk of cholera 10 days after a weekly rainfall of 25 mm was found to be 42% higher than after a week without rain (Camacho et al., 2018). Although rainfall has not been established as a direct cause of the increase in cholera outbreaks, the use of unsafe water sources during the dry season, contamination of water sources during the rainy season, and changing levels of zooplankton and iron in water (which help cholera bacteria survive) may contribute to the increasing levels of cholera during the rainy season (Camacho et al., 2018). Thus, rainfall data from NASA GPM satellites were included in CALM.

Conflict Data (Yemeni Civil War)

While cholera is preventable and treatable under stable circumstances, the collapse of Yemen’s health, water, and sanitation sectors amidst the ongoing armed conflict has fueled the spread of cholera across the country. With direct attacks against hospitals and the bombing of water supplies, the conflict has destroyed 55% of the country's medical, wastewater, and solid waste management infrastructure, making access to clean water and healthcare difficult and expensive (Camacho et al., 2018; Yemen’s Cholera Crisis: Fighting Disease During Armed Conflict, 2017; Yemen: The Forgotten War, 2018). The number of daily casualties due to conflict in each Yemeni governorate was used as a metric for civil-war-related violence.

Diagram of Lambert's Approach


In order to produce models that did not solely rely on seasonal trends and were able to predict spikes in cholera cases, our objective became to predict new cholera cases in any given governorate in Yemen from week to week. With this objective, the case and death report time series were made stationary through temporal differencing. Our four target variables were also calculated: the number of new cholera cases 0-2 weeks from the present day, 2-4 weeks from the present, 4-6 weeks from the present, and 6-8 weeks from the present.
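The differencing step and the four forward-looking targets can be sketched as follows. This is a minimal illustration rather than our production pipeline: it assumes a daily cumulative case series for a single governorate, and the names `difference`, `forward_window_target`, `build_targets`, and `HORIZONS_DAYS` are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# Assumed horizons in days (hypothetical constant): 0-2, 2-4, 4-6, 6-8 weeks.
HORIZONS_DAYS: Dict[str, Tuple[int, int]] = {
    "0-2 weeks": (0, 14),
    "2-4 weeks": (14, 28),
    "4-6 weeks": (28, 42),
    "6-8 weeks": (42, 56),
}


def difference(cumulative: List[float]) -> List[float]:
    """First-order temporal differencing: cumulative totals -> new cases per step."""
    return [later - earlier for earlier, later in zip(cumulative, cumulative[1:])]


def forward_window_target(
    new_cases: List[float], start: int, stop: int
) -> List[Optional[float]]:
    """For each index t, sum the new cases that fall `start`..`stop` steps after t.

    The window covers steps t+start+1 through t+stop. Positions whose window
    runs past the end of the series get None, since the future is unknown there.
    """
    targets: List[Optional[float]] = []
    for t in range(len(new_cases)):
        window = new_cases[t + 1 + start : t + 1 + stop]
        targets.append(sum(window) if len(window) == stop - start else None)
    return targets


def build_targets(new_cases: List[float]) -> Dict[str, List[Optional[float]]]:
    """One target column per forecast horizon."""
    return {
        name: forward_window_target(new_cases, start, stop)
        for name, (start, stop) in HORIZONS_DAYS.items()
    }
```

Each of the four models would then be trained against one of these target columns, dropping the rows where the target is None.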

Our dataset was split into three portions: training, cross-validation, and a hold-out test set. The hold-out set was left untouched until the completion of our methods to provide an accurate real-world simulation of our models’ performance. Our cross-validation dataset was used with a rolling window forecast (see supplementary section - methods for more information) for feature selection and hyperparameter optimization.
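One common way to build rolling-window folds is sketched below. The exact scheme we used is described in the supplementary methods, so the fixed-size training window, the fold sizes, and the function name `rolling_window_splits` here are assumptions for illustration only. The key property is that every training index precedes every validation index, so no fold trains on the future.

```python
from typing import List, Tuple


def rolling_window_splits(
    n_samples: int, n_folds: int, train_size: int, val_size: int
) -> List[Tuple[List[int], List[int]]]:
    """Return (train_indices, validation_indices) pairs for time-ordered data.

    The training window has a fixed length and slides forward by `val_size`
    samples per fold; the validation block always follows it immediately.
    """
    splits = []
    for fold in range(n_folds):
        train_start = fold * val_size
        train_end = train_start + train_size
        val_end = train_end + val_size
        if val_end > n_samples:
            raise ValueError("not enough samples for the requested folds")
        splits.append(
            (list(range(train_start, train_end)), list(range(train_end, val_end)))
        )
    return splits
```

For example, `rolling_window_splits(10, 3, 4, 2)` trains on samples 0-3 and validates on 4-5, then trains on 2-5 and validates on 6-7, then trains on 4-7 and validates on 8-9.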

Feature Engineering

Feature engineering is at the core of applied machine learning, and so we went through an exhaustive feature extraction and selection process in order to arrive at our final features. We extracted 45,000 potentially relevant features using the tsFresh package, which calculates an expansive array of time series features on our data (Christ et al., 2018). The objective of calculating so many features was to capture ideal representations of our data: while the majority of these features would not be used in the final model, covering such an expansive set of candidates made it far more likely that the best ones would be found. We also calculated features over multiple time frames and for geographically neighboring governorates. Through a demanding feature selection process (see supplementary section - methods for more information), we were able to arrive at the best 30-50 features for each time-range model. All in all, we were able to remove ~99.9% of our original features.
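To give a flavor of what "multiple time frames" and "neighboring governorates" mean in practice, here is a small hand-written sketch. It does not use tsFresh and computes only one simple statistic; the helper names and the adjacency mapping passed to `neighbor_mean` are hypothetical.

```python
from statistics import mean
from typing import Dict, List, Optional


def trailing_mean(series: List[float], window: int) -> List[Optional[float]]:
    """Mean of the last `window` values up to and including each time step.

    Returns None for steps that do not yet have `window` values of history.
    """
    out: List[Optional[float]] = []
    for t in range(len(series)):
        if t + 1 < window:
            out.append(None)
        else:
            out.append(mean(series[t + 1 - window : t + 1]))
    return out


def neighbor_mean(
    series_by_governorate: Dict[str, List[float]],
    neighbors: Dict[str, List[str]],
    governorate: str,
) -> List[float]:
    """Element-wise mean of a series across a governorate's listed neighbors."""
    columns = [series_by_governorate[name] for name in neighbors[governorate]]
    return [mean(values) for values in zip(*columns)]
```

Calling `trailing_mean` with several window lengths (for instance 7, 14, and 28 days) on rainfall, cases, and conflict casualties yields one candidate feature per combination, which is how the number of candidates grows so quickly before selection.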

Results of Feature Tuning


We utilized XGBoost, a decision-tree-based, extreme gradient boosting algorithm, to construct each of our models. XGBoost addresses bias-related error (underfitting) through gradient boosting: decision trees are added one at a time, with each new tree fit to the errors left by the trees before it, so that later trees concentrate on the samples the prior trees had difficulty with (Chen and Guestrin, 2016). It addresses the converse, variance-related error (overfitting), through regularization of tree complexity, shrinkage of each tree's contribution, and random subsampling of rows and columns, a technique borrowed from random forests (Chen and Guestrin, 2016). As opposed to the simpler regression techniques utilized by previous models (see “Uniqueness of Approach” supplementary section), XGBoost is able to capture nonlinear relations in the data while distinguishing them from noise, making it a more robust choice of algorithm.
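The core idea of gradient boosting can be shown with a toy example. The sketch below is not XGBoost: it boosts one-split decision stumps on a single feature with squared-error loss (where the negative gradient is simply the residual) and omits XGBoost's regularization, second-order terms, and subsampling. All function names are hypothetical, and the input is assumed to contain at least two distinct feature values.

```python
from typing import List, Tuple

Stump = Tuple[float, float, float]  # (threshold, left_value, right_value)
Model = Tuple[float, float, List[Stump]]  # (base_prediction, learning_rate, stumps)


def fit_stump(xs: List[float], residuals: List[float]) -> Stump:
    """Find the single threshold split that minimizes squared error on the residuals."""
    best = None
    distinct = sorted(set(xs))
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((r - left_mean) ** 2 for r in left)
        sse += sum((r - right_mean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return threshold, left_mean, right_mean


def fit_boosted(
    xs: List[float], ys: List[float], n_rounds: int = 50, learning_rate: float = 0.3
) -> Model:
    """Start from the mean, then repeatedly fit a stump to the current residuals."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps: List[Stump] = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        threshold, left, right = fit_stump(xs, residuals)
        stumps.append((threshold, left, right))
        # Shrink each stump's contribution by the learning rate before adding it.
        preds = [
            p + learning_rate * (left if x <= threshold else right)
            for x, p in zip(xs, preds)
        ]
    return base, learning_rate, stumps


def predict(model: Model, x: float) -> float:
    """Sum the base prediction and every stump's shrunken contribution."""
    base, learning_rate, stumps = model
    total = base
    for threshold, left, right in stumps:
        total += learning_rate * (left if x <= threshold else right)
    return total
```

Each round reduces the remaining error a little, which is why the number of rounds and the learning rate are among the hyperparameters that need tuning.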


Hyperparameters are parameters whose values are set before the learning process begins, and they can greatly affect a model’s performance. We utilized Bayesian Optimization to find optimal hyperparameters for our model. In contrast with a brute-force search over a defined set of hyperparameters, Bayesian Optimization uses prior evaluations to build a probabilistic model of the objective function over the hyperparameter space, allowing informed choices to be made about which hyperparameters to try next (Snoek et al., 2012). This allowed us to converge on optimal hyperparameters with far greater efficiency.


Figure 1: Cross-validation and Holdout Error for four XGBoost forecasting models

Cross-validation and hold-out error for each of our four models. Cross-validation error was obtained by taking the square root of the mean of the model’s squared error across five rolling-window cross-validation folds. Hold-out error was obtained by calculating the root mean squared error for predictions on the hold-out set only.
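The two error figures can be computed as sketched below. This assumes the per-fold performance measure is mean squared error, which is then averaged over folds before taking the root; the function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def mean_squared_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Average squared difference between true and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def cross_validation_rmse(
    folds: List[Tuple[Sequence[float], Sequence[float]]]
) -> float:
    """Root of the mean of the per-fold mean squared errors.

    `folds` holds one (true values, predictions) pair per validation fold.
    """
    fold_errors = [mean_squared_error(y_true, y_pred) for y_true, y_pred in folds]
    return math.sqrt(sum(fold_errors) / len(fold_errors))


def holdout_rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean squared error computed on the hold-out set only."""
    return math.sqrt(mean_squared_error(y_true, y_pred))
```

Because both values are in the same units as the target (new cases per 10,000 people), they can be compared directly with the mean and standard deviation of the case counts.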

Our models are able to predict the exact number of cases any given governorate in Yemen will experience across multiple two-week intervals, with all four models predicting within a margin of 5 cholera cases per 10,000 people on the hold-out set. Hold-out error represents our models’ performance in a real-world simulation, as the hold-out dataset was left untouched until final model evaluation. Our cross-validation error, similarly low, represents our models’ performance on a reliable, but not entirely untouched, dataset, as the cross-validation dataset was used for hyperparameter tuning and feature selection. The mean number of cases any given governorate in Yemen experienced within a two-week span was approximately 19.148, with a standard deviation of 21.311. Because all four of our predictive models predict, in real-world simulation, to within roughly ⅕ of a standard deviation of the number of cases, our predictions are robust and reliable across all time frames. However, as our predictive timeframe passes farther into the future, the cross-validation error decreases and the hold-out error increases. This could be seen as a sign of marginal overfitting, but can also be attributed to the time-shift in data as the predictive range is farther ahead: cholera 6-8 weeks ahead of a given date can look different from cholera 2-4 weeks ahead, though 4 weeks later the 2-4 week model will see the 6-8 week data.

Figure 2: XGBoost Predictions of New Cases 0 to 2, 2 to 4, 4 to 6, and 6 to 8 Weeks in Advance for Five Governorates
Our forecasts for each time frame vs a sliding window of real cases. The window represents a two-week interval corresponding to the forecast range with single data points, sliding over the data and summing up the cholera cases falling in the interval. For example, in the 0-2 week forecast plot, on September 15th we predicted there would be 92 new cholera cases per 10,000 people in the next two weeks in the governorate of YE-SN (Sana’a). Then, two weeks later, shown as the red true-value, Sana’a experienced ~92 cases. However, the date for the true-value datapoint remains September 15th, as the value describes the number of cases 0-2 weeks in the future. The red value refers to the true value, or the number of new cholera cases actually experienced by the respective governorate in the corresponding time range (2-4 weeks from present, 4-6 weeks from present, etc.). Cross-validation predictions (green) were completed with a rolling-window method, as described earlier (see methods), and the hold-out predictions (blue) were done separately, with the model training on all data previous to the holdout set.
Our four-model system is able to accurately and comprehensively forecast cholera outbreaks across 21 governorates exhibiting heterogeneous behaviors. Five governorates were chosen to represent the entire range of behavior exhibited by all 21 of Yemen’s governorates. YE-AM (Amran), YE-DA (Dhale), and YE-MW (Al Mahwit) experienced some of the greatest cumulative numbers of cholera cases from May 22nd to February 18th, making them three of the four governorates most affected by cholera overall in the given timeframe. These three governorates exhibit behavior highly similar to that of the other governorates, albeit at a higher scale. Our accurate delineation of the outbreak across all four time frames in these three governorates shows the usefulness of our models in the general case. On the other hand, YE-SA and YE-RA present a rare and interesting case: a sudden outbreak. As our predictive range increases, it becomes more difficult to predict sudden spikes, due to either a lack of information many weeks prior or the events preceding a sharp outbreak not having occurred yet. As a result, longer-range models seem to predict sharp outbreaks with a certain lag. This can be seen in the 4-6 week model’s forecast of the YE-RA outbreak, which shows an upward trend but not a full spike. However, this is where the combination of all four of our models becomes most useful. While long-range models cannot easily predict outbreaks, our shorter-range models are able to pick up the slack once the outbreak draws closer. Specifically, our 0-2 week forecasting model is able to predict incidence spikes in YE-RA and YE-SA, so even if our long-range models were unable to predict the outbreak immediately, we would still detect the outbreak at a later date.

Figure 3: Cumulative Cholera Cases for 5 Governorates.

Other models look to forecast the total cholera cases Yemen may experience (Nishiura, 2017), with some models providing governorate-specific (or even geographically smaller) predictions (Cole, 2018). While the objective of our models is to predict new cases (so as not to simply follow linear trends and to be able to predict outbreak spikes), we are able to convert our predictions to total cases through simple reverse differencing. Using this method, we are able to accurately delineate the course of the Yemeni outbreak. Figure 3 illustrates this on a representative sample of five Yemeni governorates.
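Reverse differencing amounts to a running sum that starts from the last known cumulative total. A minimal sketch, with a hypothetical function name and made-up numbers in the usage note:

```python
from itertools import accumulate
from typing import List


def reverse_difference(
    last_known_total: float, predicted_new_cases: List[float]
) -> List[float]:
    """Convert predicted new cases per interval back into cumulative totals.

    The running sum is seeded with the last observed cumulative count, and
    that seed is dropped from the result so only forecast totals are returned.
    """
    totals = accumulate(predicted_new_cases, initial=last_known_total)
    return list(totals)[1:]
```

For instance, a last observed total of 100 followed by predicted new cases of 5, 7, and 3 gives forecast totals of 105, 112, and 115.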


Cholera has killed millions of people and, without proper action, could continue to do so for many years. Many have modeled and analyzed cholera outbreaks globally to predict where they will next strike, or to forecast cholera cases and/or risk during one particular outbreak. Lambert iGEM has itself created a system of models, CALM, the Cholera Artificial Learning Model, to forecast exact cholera incidence, with the proof-of-concept forecasting cases in Yemen. Consisting of four XGBoost models working in conjunction, CALM is able to make predictions that are highly accurate and reliable, utilizing a broad range of predictive features and robust machine learning techniques. In addition, by using a system of four models, CALM is not only able to provide a comprehensive two-month report in 2-week intervals, but also ensures that none of its models “fall behind” by making a prediction too early or too late. Sudden spikes in cholera cases may not be captured accurately by the longer-range models, but the shorter-range models, such as the predictor for 0-2 weeks, can effectively account for sudden changes in cholera cases. CALM has been shown, within the data used, to be a reliable, powerful way to reduce the number of people suffering from cholera by providing advance notice of outbreaks and allowing for the distribution of medical supplies to pre-empt an outbreak.

CALMWatch: an SMS bot

Lambert iGEM has also developed an SMS bot utilizing the Twilio API and the Flask web framework for gathering health and sanitation data from areas affected by a cholera outbreak. This bot, named CALMWatch, allows a healthcare organization or government agency to distribute an SMS survey to a given population so that affected people can report data related to an ongoing cholera outbreak, such as cleanliness of water sources, water storage, and waste management. This data can then be fed into the CALM model in real time to increase the accuracy of the model and increase the size of its databases. This bot is based on RatWatch, an open-source SMS-based rat reporting service for the Atlanta area developed by M. Koohang (Zegura & DiSalvo, 2018), who graciously allowed Lambert iGEM to modify it for the purposes of the 2018 project.
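The survey logic behind such a bot can be sketched without any web framework. The example below is not the CALMWatch source: the questions, the in-memory `sessions` store, and the function names are hypothetical. It only shows how each incoming message could advance a per-sender survey and produce a TwiML reply of the form Twilio expects from a messaging webhook. In a deployment, a Flask route would pass the sender and body of the incoming request to `handle_sms` and return the resulting XML.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

# Hypothetical survey questions, for illustration only.
QUESTIONS = [
    "Is your main water source treated? (yes/no)",
    "Is drinking water stored in a covered container? (yes/no)",
    "Is household waste collected regularly? (yes/no)",
]
THANKS = "Thank you. Your report has been recorded."


def twiml_message(text: str) -> str:
    """Wrap reply text in a TwiML <Response><Message> document."""
    response = ET.Element("Response")
    ET.SubElement(response, "Message").text = text
    return ET.tostring(response, encoding="unicode")


def handle_sms(sender: str, body: str, sessions: Dict[str, List[str]]) -> str:
    """Advance the survey for `sender` and return the TwiML reply.

    `sessions` maps a phone number to the answers collected so far; the
    first message from a new number starts the survey and is not stored.
    """
    answers = sessions.get(sender)
    if answers is None:
        sessions[sender] = []
        return twiml_message(QUESTIONS[0])
    if len(answers) >= len(QUESTIONS):
        # Survey already complete: acknowledge without recording anything more.
        return twiml_message(THANKS)
    answers.append(body.strip().lower())
    if len(answers) < len(QUESTIONS):
        return twiml_message(QUESTIONS[len(answers)])
    return twiml_message(THANKS)
```

Keeping the survey state separate from the web layer makes the question flow easy to test and lets the collected answers be exported to the model's datasets on whatever schedule suits the deployment.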

Example CALMWatch Interaction

Further Development

While the efficacy of the model has only been demonstrated in Yemen, we expect that, with further development and adaptation, CALM could be used to predict disease outbreaks around the world. As development on the project progresses, Lambert iGEM hopes to construct a fully autonomous web-based software platform comprising data collection bots that gather data from major health and sanitation sources, scripts capable of syncing data from the ColorQ app and from CALMWatch surveys with CALM’s databases, and an online platform that coordinates global usage of the model so that users can share and distribute results and model improvements more easily. In terms of CALM itself, Lambert iGEM also hopes to engineer more features for the model by acquiring data for more environmental factors, possibly including algal blooms, migration data, and OCV (oral cholera vaccine) campaign data.