Team:Lambert GA/CALM SUPPLEMENTARY

CALM SUPPLEMENTARY

Uniqueness of Approach

We have tried to represent the literature on existing models of the Yemeni cholera outbreak as comprehensively as possible. At the time of writing, others have constructed various models, some incorporating machine learning but most not. Many of these models are accurate within their specific use cases, such as the one constructed by Jutla, Akanda, and Islam (2010), but they are applied in areas where cholera is seasonal and non-sporadic, such as Bangladesh (Jutla, Akanda, Unnikrishnan, Huq, & Colwell, 2015), and are therefore fairly simple, often using some form of regression or logistic model and capturing only linear relationships. Cholera is generally seasonal but is also subject to non-seasonal influences (Emch et al., 2008). The Yemeni outbreak in particular has been subject to many non-seasonal, sporadic influences, chiefly the Yemeni civil war, and so calls for a more complex model that can capture these nonlinear, non-seasonal relationships (Camacho et al., 2018). Our extreme gradient boosting (XGBoost) approach provides this: it is a robust, principled method used widely by data scientists to achieve state-of-the-art results on many machine learning challenges (Chen & Guestrin, 2016). Moving beyond regression is key, because by drawing on the full breadth of available data CALM can deliver a more useful forecast. Although more complex machine learning algorithms like XGBoost carry a risk of overfitting, complex models can be built without overfitting, as Pezeshki et al. (2016) demonstrated by predicting cholera in Chabahar City, Iran, using an artificial neural network.
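To illustrate why a boosted model can follow an abrupt, non-seasonal shift that a straight-line fit cannot, here is a minimal sketch in pure Python. It is not XGBoost and not the CALM implementation: it boosts one-split decision stumps on squared error, without XGBoost's regularization or second-order terms. The function names (`fit_stump`, `boost`, `fit_line`, `mse`) and the step-shaped toy series are hypothetical and chosen only for illustration.

```python
from typing import Callable, List, Tuple

Stump = Tuple[float, float, float]  # (threshold, left value, right value)


def fit_stump(x: List[float], resid: List[float]) -> Stump:
    """Pick the single split on x that minimizes squared error on the residuals.

    Assumes x contains at least two distinct values.
    """
    best_sse, best_stump = None, None
    xs = sorted(set(x))
    for lo, hi in zip(xs, xs[1:]):
        thr = (lo + hi) / 2  # candidate threshold between neighbouring x values
        left = [r for xi, r in zip(x, resid) if xi <= thr]
        right = [r for xi, r in zip(x, resid) if xi > thr]
        lv, rv = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best_sse is None or sse < best_sse:
            best_sse, best_stump = sse, (thr, lv, rv)
    return best_stump


def boost(x: List[float], y: List[float], rounds: int = 20,
          lr: float = 0.5) -> Callable[[float], float]:
    """Gradient boosting for squared error: each stump fits the current residuals."""
    base = sum(y) / len(y)  # start from the mean
    pred = [base] * len(y)
    stumps: List[Stump] = []
    for _ in range(rounds):
        resid = [yi - pi for yi, pi in zip(y, pred)]
        thr, lv, rv = fit_stump(x, resid)
        stumps.append((thr, lv, rv))
        # Shrink each stump's contribution by the learning rate.
        pred = [pi + lr * (lv if xi <= thr else rv) for xi, pi in zip(x, pred)]

    def predict(xi: float) -> float:
        return base + sum(lr * (lv if xi <= thr else rv) for thr, lv, rv in stumps)

    return predict


def fit_line(x: List[float], y: List[float]) -> Callable[[float], float]:
    """Ordinary least-squares straight line, for comparison."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
             / sum((xi - mx) ** 2 for xi in x))
    intercept = my - slope * mx
    return lambda xi: intercept + slope * xi


def mse(model: Callable[[float], float], x: List[float], y: List[float]) -> float:
    """Mean squared error of a fitted model on (x, y)."""
    return sum((model(xi) - yi) ** 2 for xi, yi in zip(x, y)) / len(y)


# Hypothetical toy series: counts jump abruptly at week 5, as a sudden
# non-seasonal shock might cause. These are not real outbreak data.
weeks = [float(w) for w in range(10)]
cases = [0.0] * 5 + [10.0] * 5

line_error = mse(fit_line(weeks, cases), weeks, cases)
boosted_error = mse(boost(weeks, cases), weeks, cases)
```

On this toy series the straight line has to compromise across the jump, while the boosted stumps place a split at the jump and drive the training error toward zero. A fit this tight on training data is also exactly how overfitting can arise, which is why held-out evaluation matters for the more flexible model.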

Additionally, forecasts produced by other models are often less comprehensive, lacking details on when an outbreak might strike and how many people will be affected (for example, Jutla et al. developed a model that predicts cholera risk rather than case counts (Cole, 2018)). In contrast, CALM forecasts the number of cholera cases that any given Yemeni governorate will experience in 2-week intervals ranging from 2 weeks to 2 months ahead, giving an aid organization or government official fundamentally different information than a risk indicator or a broad cumulative incidence count.
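One way to picture forecast targets of this shape is sketched below. It assumes weekly new-case counts for a single governorate and reads "2 weeks to 2 months" as four consecutive 2-week windows covering roughly 8 weeks; both the reading and the names (`horizon_targets`, `window_weeks`, `n_windows`) are assumptions for illustration, not CALM's actual code.

```python
from typing import Dict, List, Optional


def horizon_targets(weekly_cases: List[int], window_weeks: int = 2,
                    n_windows: int = 4) -> List[Dict[str, Optional[int]]]:
    """For each week t, total the new cases in each future window.

    Window k (1-based) covers weeks t+(k-1)*w+1 through t+k*w, where w is
    window_weeks. A window that runs past the end of the series is None,
    because its true total is not yet known.
    """
    n = len(weekly_cases)
    rows: List[Dict[str, Optional[int]]] = []
    for t in range(n):
        row: Dict[str, Optional[int]] = {}
        for k in range(1, n_windows + 1):
            start = t + (k - 1) * window_weeks + 1
            end = t + k * window_weeks  # inclusive index
            label = f"cases_weeks_{start - t}_to_{end - t}"
            row[label] = sum(weekly_cases[start:end + 1]) if end < n else None
        rows.append(row)
    return rows


# Hypothetical weekly counts, not real surveillance data.
targets = horizon_targets([10, 20, 30, 40, 50, 60, 70, 80, 90])
# targets[0] holds the totals for weeks 1-2, 3-4, 5-6 and 7-8 after week 0.
```

Each window is a separate regression target, so a forecaster built this way reports a case count per window rather than a single risk score.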

Finally, existing models often do not use the full breadth of cholera-predictive data available, usually relying on only seasonal environmental factors or only cholera incidence. Because Yemen is currently in a civil war, we propose incorporating civil war fatality data alongside environmental and epidemiological data to span the range of factors that can affect cholera. Paired with extensive feature engineering, CALM’s use of rainfall, past cholera cases and deaths, and civil war fatalities allows it to find key patterns in cholera incidence in Yemen and to model the nonlinear trends of cholera.
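The text does not list the specific engineered features, so the following is only a sketch of one common kind of feature engineering for time series: lagged values and trailing means that use past weeks only. The function name `lag_features`, the column names, and the choice of lags are hypothetical.

```python
from typing import Dict, List, Optional, Sequence


def lag_features(series: Dict[str, List[float]], lags: Sequence[int] = (1, 2),
                 roll: int = 2) -> List[Dict[str, Optional[float]]]:
    """Build one feature row per week from several aligned weekly series.

    For every series, emit the value `lag` weeks earlier and the mean of the
    `roll` weeks ending just before the current week. Only past values are
    used, so no information from the forecast period leaks into the features.
    Entries without enough history are None.
    """
    names = list(series)
    n = len(series[names[0]])
    if any(len(series[name]) != n for name in names):
        raise ValueError("all series must have the same length")

    rows: List[Dict[str, Optional[float]]] = []
    for t in range(n):
        row: Dict[str, Optional[float]] = {}
        for name in names:
            values = series[name]
            for lag in lags:
                row[f"{name}_lag{lag}"] = values[t - lag] if t - lag >= 0 else None
            if t - roll >= 0:
                row[f"{name}_mean{roll}"] = sum(values[t - roll:t]) / roll
            else:
                row[f"{name}_mean{roll}"] = None
        rows.append(row)
    return rows


# Hypothetical weekly inputs for one governorate, not real data.
features = lag_features({
    "rainfall_mm": [0.0, 5.0, 10.0, 0.0],
    "cases": [1, 3, 5, 7],
    "deaths": [0, 0, 1, 1],
    "conflict_fatalities": [2, 0, 4, 6],
})
```

Rows like these, joined with the corresponding case-count targets, are the kind of tabular input a gradient-boosted tree model can train on. Tree-based learners such as XGBoost can also handle missing entries, though that behavior should be checked against the library version in use.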