Line 113: | Line 113: | ||
} | } | ||
#target4{ | #target4{ | ||
+ | position: relative; | ||
+ | top: -80px; | ||
+ | } | ||
+ | #target5{ | ||
+ | position: relative; | ||
+ | top: -80px; | ||
+ | } | ||
+ | #target6{ | ||
+ | position: relative; | ||
+ | top: -80px; | ||
+ | } | ||
+ | #target7{ | ||
+ | position: relative; | ||
+ | top: -80px; | ||
+ | } | ||
+ | #target8{ | ||
+ | position: relative; | ||
+ | top: -80px; | ||
+ | } | ||
+ | #target9{ | ||
+ | position: relative; | ||
+ | top: -80px; | ||
+ | } | ||
+ | #target10{ | ||
+ | position: relative; | ||
+ | top: -80px; | ||
+ | } | ||
+ | #target5{ | ||
+ | position: relative; | ||
+ | top: -80px; | ||
+ | } | ||
+ | #target5{ | ||
+ | position: relative; | ||
+ | top: -80px; | ||
+ | } | ||
+ | #target5{ | ||
position: relative; | position: relative; | ||
top: -80px; | top: -80px; | ||
Line 261: | Line 297: | ||
<br><br> | <br><br> | ||
− | <div id=" | + | <div id="target1"></div> |
− | <div id=" | + | <div id="subheading1"> |
<b>OVERVIEW | <b>OVERVIEW | ||
</b> | </b> | ||
</div> | </div> | ||
<br><br> | <br><br> | ||
− | <div id=" | + | <div id="content1"> |
The ongoing Yemeni cholera outbreak has been deemed one of the worst cholera outbreaks in history, with over a million people impacted and thousands dead. Triggered by a civil war, the outbreak has been shaped by various political, environmental, and epidemiological factors and is continuing to accelerate.While cholera has several effective treatments, the untimely and ultimately inefficient distribution of existing medicines has been the primary cause of cholera mortality. With the hope of facilitating resource allocation, various mathematical models have been made to track the Yemeni outbreak and identify at-risk governorates (administrative divisions. These models, while useful, are not powerful enough to accurately and consistently forecast exact cholera cases per governorate over multiple timeframes. To address the need for a complex, reliable model, the Lambert iGEM team presents CALM, the Cholera Artificial Learning Model; a system of four extreme-gradient-boosting (XGBoost) machine learning models that forecast the exact number of cholera cases a Yemeni governorate will experience from a time range of 2 weeks to 2 months. CALM provides a novel machine learning approach that makes use of rainfall data, past cholera cases and deaths data, civil war fatalities, and inter-governorate interactions represented across multiple timeframes. Additionally, the use of machine learning, along with extensive feature engineering, allows CALM to easily learn complex non-linear relations apparent in an epidemiological phenomenon. CALM is able to forecast cholera incidence 6-8 weeks ahead within a margin of 4.607 cholera cases per 10,000 people in real world simulation. Similarly, CALM achieved a mean error of 3.921 for a 0-2 week forecast, 4.034 for a 2-4 week forecast, and 4.737 for a 4-6 week forecast. The model’s forecast system provides advanced notice of outbreaks on multiple levels to facilitating the timely allocation of cholera relief supplies and outbreak prevention. | The ongoing Yemeni cholera outbreak has been deemed one of the worst cholera outbreaks in history, with over a million people impacted and thousands dead. Triggered by a civil war, the outbreak has been shaped by various political, environmental, and epidemiological factors and is continuing to accelerate.While cholera has several effective treatments, the untimely and ultimately inefficient distribution of existing medicines has been the primary cause of cholera mortality. With the hope of facilitating resource allocation, various mathematical models have been made to track the Yemeni outbreak and identify at-risk governorates (administrative divisions. These models, while useful, are not powerful enough to accurately and consistently forecast exact cholera cases per governorate over multiple timeframes. To address the need for a complex, reliable model, the Lambert iGEM team presents CALM, the Cholera Artificial Learning Model; a system of four extreme-gradient-boosting (XGBoost) machine learning models that forecast the exact number of cholera cases a Yemeni governorate will experience from a time range of 2 weeks to 2 months. CALM provides a novel machine learning approach that makes use of rainfall data, past cholera cases and deaths data, civil war fatalities, and inter-governorate interactions represented across multiple timeframes. Additionally, the use of machine learning, along with extensive feature engineering, allows CALM to easily learn complex non-linear relations apparent in an epidemiological phenomenon. CALM is able to forecast cholera incidence 6-8 weeks ahead within a margin of 4.607 cholera cases per 10,000 people in real world simulation. Similarly, CALM achieved a mean error of 3.921 for a 0-2 week forecast, 4.034 for a 2-4 week forecast, and 4.737 for a 4-6 week forecast. The model’s forecast system provides advanced notice of outbreaks on multiple levels to facilitating the timely allocation of cholera relief supplies and outbreak prevention. | ||
</div> | </div> | ||
− | <div id=" | + | <br> |
− | <div id=" | + | <br> |
+ | <div id="target2"></div> | ||
+ | <div id="subheading2"> | ||
<b>DATASET PREPARATION</b> | <b>DATASET PREPARATION</b> | ||
</div> | </div> | ||
<br> | <br> | ||
<br> | <br> | ||
− | <div id=" | + | <div id="content2"> |
− | With the objective of predicting new cholera cases in any given governorate in Yemen from week to week, there were a number of steps taken to prepare the data. In order to produce models that did not simply rely on seasonal trends and were able to predict spikes in cholera cases, the case and death report time series were made stationary through temporal differencing. It should be noted that the country of Yemen encompasses 21 governorates or administrative divisions. While the CALM models were trained on data from all 21 governorates, data preparation on each governorate was performed separately to preserve each governorate’s unique time series. As the interval between each WHO cholera case/death report was not standard, the data was linearly interpolated into a daily time series. The Yemeni Cholera outbreak is seasonal and endemic as outbreaks spike during the rainy season (April-August) - however, the outbreaks rely on non-seasonal factors such as conflict and damage to health and sanitation (Camacho et al., 2018a). Parsing data required finding the number of new cholera cases in a single day, given the total number of cases in the previous day. The values were then normalized by the population of each governorate (e.g new cases per 10,000 people). Finally, we calculated our four target variables: the number of new cholera cases 0-2 weeks from the present day, 2-4 weeks from the present, 4-6 weeks from the present, and 6-8 weeks from the present. | + |  With the objective of predicting new cholera cases in any given governorate in Yemen from week to week, there were a number of steps taken to prepare the data. In order to produce models that did not simply rely on seasonal trends and were able to predict spikes in cholera cases, the case and death report time series were made stationary through temporal differencing. It should be noted that the country of Yemen encompasses 21 governorates or administrative divisions. While the CALM models were trained on data from all 21 governorates, data preparation on each governorate was performed separately to preserve each governorate’s unique time series. As the interval between each WHO cholera case/death report was not standard, the data was linearly interpolated into a daily time series. The Yemeni Cholera outbreak is seasonal and endemic as outbreaks spike during the rainy season (April-August) - however, the outbreaks rely on non-seasonal factors such as conflict and damage to health and sanitation (Camacho et al., 2018a). Parsing data required finding the number of new cholera cases in a single day, given the total number of cases in the previous day. The values were then normalized by the population of each governorate (e.g new cases per 10,000 people). Finally, we calculated our four target variables: the number of new cholera cases 0-2 weeks from the present day, 2-4 weeks from the present, 4-6 weeks from the present, and 6-8 weeks from the present. |
<br> | <br> | ||
− | Our dataset was split into three portions: training, cross-validation, and a hold-out test set. The hold-out set was left untouched until the completion of our methods to provide an accurate real-world simulation of our models’ performance. Our base training set was defined from July 1 to August 15th. While WHO reports extended back as far as May 22, we chose to start on July 1 in order to have enough prior data for feature calculation. Our cross-validation dataset was defined from August 15 to November 10. Finally, our hold-out set started from November 11 and extended to a final date in January/February, which varied for each defined target variable depending on the respective range: a 6-8 week forecast implies a larger time frame between current and forecast date than a 2-4 week forecast, and so the 6-8 week forecast holdout set would end prior to the 2-4 week forecast. It may seem that the cross-validation set outweighs the training set significantly, but this was mitigated with the use of a rolling window forecast - a gold standard for cross-validation in time series forecasting. Rolling window cross-validation is easiest understood with the following example. Given a dataset spanning four weeks, a rolling window forecast would dictate that we train on the first week, predict on the second week, then train on the first two weeks, predict on the third, and finally train on the first three weeks and predict on the fourth. In this example, the first week would be the base-training set (as it was never predicted on and was included in the training set of each fold), the second and third weeks the cross-validation set (as they varied between prediction and training sets), and the fourth the week the hold-out set (as it was never trained on). Our five cross-validation sets were defined as follows: August 16 to August 31, August 31 to September 15, September 15 to September 30, September 30 to October 15, and Finally October 15 to October 30 (it should be noted that the final fold included data from October 30 to November 10 as a prediction set, though this does not cross into the hold-out set). The cross-validation sets were used to select features and find optimal hyperparameters for our model, and the hold-out set was used to simulate real-world performance of our model. | + |  Our dataset was split into three portions: training, cross-validation, and a hold-out test set. The hold-out set was left untouched until the completion of our methods to provide an accurate real-world simulation of our models’ performance. Our base training set was defined from July 1 to August 15th. While WHO reports extended back as far as May 22, we chose to start on July 1 in order to have enough prior data for feature calculation. Our cross-validation dataset was defined from August 15 to November 10. Finally, our hold-out set started from November 11 and extended to a final date in January/February, which varied for each defined target variable depending on the respective range: a 6-8 week forecast implies a larger time frame between current and forecast date than a 2-4 week forecast, and so the 6-8 week forecast holdout set would end prior to the 2-4 week forecast. It may seem that the cross-validation set outweighs the training set significantly, but this was mitigated with the use of a rolling window forecast - a gold standard for cross-validation in time series forecasting. Rolling window cross-validation is easiest understood with the following example. Given a dataset spanning four weeks, a rolling window forecast would dictate that we train on the first week, predict on the second week, then train on the first two weeks, predict on the third, and finally train on the first three weeks and predict on the fourth. In this example, the first week would be the base-training set (as it was never predicted on and was included in the training set of each fold), the second and third weeks the cross-validation set (as they varied between prediction and training sets), and the fourth the week the hold-out set (as it was never trained on). Our five cross-validation sets were defined as follows: August 16 to August 31, August 31 to September 15, September 15 to September 30, September 30 to October 15, and Finally October 15 to October 30 (it should be noted that the final fold included data from October 30 to November 10 as a prediction set, though this does not cross into the hold-out set). The cross-validation sets were used to select features and find optimal hyperparameters for our model, and the hold-out set was used to simulate real-world performance of our model. |
<br><br> | <br><br> | ||
Line 288: | Line 326: | ||
<br><br> | <br><br> | ||
− | <div id=" | + | <div id="target3"></div> |
− | <div id=" | + | <div id="subheading3"> |
<b>Feature Engineering</b> | <b>Feature Engineering</b> | ||
</div> | </div> | ||
<br><br> | <br><br> | ||
− | <div id=" | + | <div id="content3"> |
− | Feature engineering is the crux of applied machine learning, and so we went through an exhaustive feature extraction and selection process in order to arrive at our final features. First, we extracted 45,000 potentially relevant features using the tsFresh package, which calculates an expansive array of time series features on our data (Christ et al., 2018). The objective of calculating these many features was the hope to capture ideal representations of our data: while the majority of these features would not be used in the final model, our coverage of this expansive set allowed us to ensure the best features would be found. We also calculated features over a series of overlapping time frames in order to provide varying frames of reference and lags: 8 weeks prior, 6 weeks prior, 4 weeks prior, 2 weeks prior, and 1 week prior. Features describing geographically neighboring governorates (through taking the mean) were also calculated. While having more data is usually beneficial, in this case, our number of training examples was far outnumbered by the number of features. Therefore, a demanding feature selection process was required. Using tsFresh’s scalable hypothesis tests with a false discovery rate of 0.001, we were able to calculate features statistically relevant to each time-range prediction, providing us with four sets of features ~15,000 in number for each time-frame prediction. Next, we removed collinear features, or those that were 97% correlated with each other, as these features would be redundant to our model. This provided us with sets of ~10,000 features to further narrow. We trained and tuned an extreme gradient boosting model, XGBoost, to rank the features in order of importance for each time-range prediction. Utilizing the ranking produce, we recursively added features based on if they added to our cross-validation loss (the root mean square error across all five cross-validation folds). <b>This allowed us to arrive at the best 30-50 features</b> for each time-range. All in all, we were able to remove ~99.9% of our original features. | + |  Feature engineering is the crux of applied machine learning, and so we went through an exhaustive feature extraction and selection process in order to arrive at our final features. First, we extracted 45,000 potentially relevant features using the tsFresh package, which calculates an expansive array of time series features on our data (Christ et al., 2018). The objective of calculating these many features was the hope to capture ideal representations of our data: while the majority of these features would not be used in the final model, our coverage of this expansive set allowed us to ensure the best features would be found. We also calculated features over a series of overlapping time frames in order to provide varying frames of reference and lags: 8 weeks prior, 6 weeks prior, 4 weeks prior, 2 weeks prior, and 1 week prior. Features describing geographically neighboring governorates (through taking the mean) were also calculated. While having more data is usually beneficial, in this case, our number of training examples was far outnumbered by the number of features. Therefore, a demanding feature selection process was required. Using tsFresh’s scalable hypothesis tests with a false discovery rate of 0.001, we were able to calculate features statistically relevant to each time-range prediction, providing us with four sets of features ~15,000 in number for each time-frame prediction. Next, we removed collinear features, or those that were 97% correlated with each other, as these features would be redundant to our model. This provided us with sets of ~10,000 features to further narrow. We trained and tuned an extreme gradient boosting model, XGBoost, to rank the features in order of importance for each time-range prediction. Utilizing the ranking produce, we recursively added features based on if they added to our cross-validation loss (the root mean square error across all five cross-validation folds). <b>This allowed us to arrive at the best 30-50 features</b> for each time-range. All in all, we were able to remove ~99.9% of our original features. |
<br><br> | <br><br> | ||
<div style="text-align:center"><img src="https://static.igem.org/mediawiki/2018/d/d3/T--Lambert_GA--CALMFeatureTuningResults.png"></div> | <div style="text-align:center"><img src="https://static.igem.org/mediawiki/2018/d/d3/T--Lambert_GA--CALMFeatureTuningResults.png"></div> | ||
Line 304: | Line 342: | ||
− | <div id=" | + | <div id="target4"></div> |
− | <div id=" | + | <div id="subheading4"> |
<b>Model</b> | <b>Model</b> | ||
</div> | </div> | ||
<br><br> | <br><br> | ||
− | <div id=" | + | <div id="content4"> |
− | We utilized XGBoost, a random forest-based, extreme gradient boosting algorithm, to construct each of our models. Through bootstrap aggregation, the construction of multiple (often hundreds) of decision trees that are trained on random subsets of the data and then collectively vote for the final prediction, XGBoost is able to address variance-related error (overfitting). XGBoost also addresses the converse, bias-related error (underfitting), through gradient boosting: the process by which each decision tree is constructed with a greater focus on the samples the prior trees had difficulties with (Chen and Guestrin, 2016). As opposed to simpler regression techniques utilized by previous models (refer to the background), XGBoost is able to gain a far deeper understanding of the data through nonlinear relations (while being able to distinguish from noise), making it an ultimately more robust choice of algorithm. | + |  We utilized XGBoost, a random forest-based, extreme gradient boosting algorithm, to construct each of our models. Through bootstrap aggregation, the construction of multiple (often hundreds) of decision trees that are trained on random subsets of the data and then collectively vote for the final prediction, XGBoost is able to address variance-related error (overfitting). XGBoost also addresses the converse, bias-related error (underfitting), through gradient boosting: the process by which each decision tree is constructed with a greater focus on the samples the prior trees had difficulties with (Chen and Guestrin, 2016). As opposed to simpler regression techniques utilized by previous models (refer to the background), XGBoost is able to gain a far deeper understanding of the data through nonlinear relations (while being able to distinguish from noise), making it an ultimately more robust choice of algorithm. |
<br> | <br> | ||
<div style="text-align:center"><img src="https://static.igem.org/mediawiki/2018/4/4f/T--Lambert_GA--CALMmodelPic.png"></div> | <div style="text-align:center"><img src="https://static.igem.org/mediawiki/2018/4/4f/T--Lambert_GA--CALMmodelPic.png"></div> | ||
Line 318: | Line 356: | ||
</div> | </div> | ||
− | <div id=" | + | <div id="target5"></div> |
− | <div id=" | + | <div id="subheading5"> |
<b>Tuning</b> | <b>Tuning</b> | ||
</div> | </div> | ||
<br><br> | <br><br> | ||
− | <div id=" | + | <div id="content5"> |
− | We utilized Bayesian Optimization to find optimal hyperparameters for our model. In contrast with a brute-force search over a defined set of hyperparameters, Bayesian Optimization tracks prior evaluations to form probabilistic assumptions on an objective function given a set of hyperparameters, allowing informed choices to be made on which hyperparameters to try (Snoek et al., 2012). This allowed us to converge at optimal hyperparameters with far greater efficiency. | + |  We utilized Bayesian Optimization to find optimal hyperparameters for our model. In contrast with a brute-force search over a defined set of hyperparameters, Bayesian Optimization tracks prior evaluations to form probabilistic assumptions on an objective function given a set of hyperparameters, allowing informed choices to be made on which hyperparameters to try (Snoek et al., 2012). This allowed us to converge at optimal hyperparameters with far greater efficiency. |
<br><br> | <br><br> | ||
</div> | </div> | ||
</br> | </br> | ||
− | <div id=" | + | <div id="content5"> |
 Cholera case and death statistics are reported by the World Health Organization (WHO) where health experts and researchers work directly with Yemeni health authorities at both the country and local level. Through this direct connection, the WHO is able to record all reported cholera cases and deaths caused by cholera (WHO presence in Yemen, 2018). The data, collected by the WHO, was accessed through the Humanitarian Data Exchange (https://data.humdata.org /group/yem). It provided reports of accumulated new cholera cases and deaths per governorate from up to May 22, 2017, to February 18, 2018. |  Cholera case and death statistics are reported by the World Health Organization (WHO) where health experts and researchers work directly with Yemeni health authorities at both the country and local level. Through this direct connection, the WHO is able to record all reported cholera cases and deaths caused by cholera (WHO presence in Yemen, 2018). The data, collected by the WHO, was accessed through the Humanitarian Data Exchange (https://data.humdata.org /group/yem). It provided reports of accumulated new cholera cases and deaths per governorate from up to May 22, 2017, to February 18, 2018. | ||
Revision as of 23:48, 17 October 2018
C A L M O V E R V I E W
OVERVIEW
The ongoing Yemeni cholera outbreak has been deemed one of the worst cholera outbreaks in history, with over a million people impacted and thousands dead. Triggered by a civil war, the outbreak has been shaped by various political, environmental, and epidemiological factors and is continuing to accelerate.While cholera has several effective treatments, the untimely and ultimately inefficient distribution of existing medicines has been the primary cause of cholera mortality. With the hope of facilitating resource allocation, various mathematical models have been made to track the Yemeni outbreak and identify at-risk governorates (administrative divisions. These models, while useful, are not powerful enough to accurately and consistently forecast exact cholera cases per governorate over multiple timeframes. To address the need for a complex, reliable model, the Lambert iGEM team presents CALM, the Cholera Artificial Learning Model; a system of four extreme-gradient-boosting (XGBoost) machine learning models that forecast the exact number of cholera cases a Yemeni governorate will experience from a time range of 2 weeks to 2 months. CALM provides a novel machine learning approach that makes use of rainfall data, past cholera cases and deaths data, civil war fatalities, and inter-governorate interactions represented across multiple timeframes. Additionally, the use of machine learning, along with extensive feature engineering, allows CALM to easily learn complex non-linear relations apparent in an epidemiological phenomenon. CALM is able to forecast cholera incidence 6-8 weeks ahead within a margin of 4.607 cholera cases per 10,000 people in real world simulation. Similarly, CALM achieved a mean error of 3.921 for a 0-2 week forecast, 4.034 for a 2-4 week forecast, and 4.737 for a 4-6 week forecast. The model’s forecast system provides advanced notice of outbreaks on multiple levels to facilitating the timely allocation of cholera relief supplies and outbreak prevention.
DATASET PREPARATION
With the objective of predicting new cholera cases in any given governorate in Yemen from week to week, there were a number of steps taken to prepare the data. In order to produce models that did not simply rely on seasonal trends and were able to predict spikes in cholera cases, the case and death report time series were made stationary through temporal differencing. It should be noted that the country of Yemen encompasses 21 governorates or administrative divisions. While the CALM models were trained on data from all 21 governorates, data preparation on each governorate was performed separately to preserve each governorate’s unique time series. As the interval between each WHO cholera case/death report was not standard, the data was linearly interpolated into a daily time series. The Yemeni Cholera outbreak is seasonal and endemic as outbreaks spike during the rainy season (April-August) - however, the outbreaks rely on non-seasonal factors such as conflict and damage to health and sanitation (Camacho et al., 2018a). Parsing data required finding the number of new cholera cases in a single day, given the total number of cases in the previous day. The values were then normalized by the population of each governorate (e.g new cases per 10,000 people). Finally, we calculated our four target variables: the number of new cholera cases 0-2 weeks from the present day, 2-4 weeks from the present, 4-6 weeks from the present, and 6-8 weeks from the present.
Our dataset was split into three portions: training, cross-validation, and a hold-out test set. The hold-out set was left untouched until the completion of our methods to provide an accurate real-world simulation of our models’ performance. Our base training set was defined from July 1 to August 15th. While WHO reports extended back as far as May 22, we chose to start on July 1 in order to have enough prior data for feature calculation. Our cross-validation dataset was defined from August 15 to November 10. Finally, our hold-out set started from November 11 and extended to a final date in January/February, which varied for each defined target variable depending on the respective range: a 6-8 week forecast implies a larger time frame between current and forecast date than a 2-4 week forecast, and so the 6-8 week forecast holdout set would end prior to the 2-4 week forecast. It may seem that the cross-validation set outweighs the training set significantly, but this was mitigated with the use of a rolling window forecast - a gold standard for cross-validation in time series forecasting. Rolling window cross-validation is easiest understood with the following example. Given a dataset spanning four weeks, a rolling window forecast would dictate that we train on the first week, predict on the second week, then train on the first two weeks, predict on the third, and finally train on the first three weeks and predict on the fourth. In this example, the first week would be the base-training set (as it was never predicted on and was included in the training set of each fold), the second and third weeks the cross-validation set (as they varied between prediction and training sets), and the fourth the week the hold-out set (as it was never trained on). Our five cross-validation sets were defined as follows: August 16 to August 31, August 31 to September 15, September 15 to September 30, September 30 to October 15, and Finally October 15 to October 30 (it should be noted that the final fold included data from October 30 to November 10 as a prediction set, though this does not cross into the hold-out set). The cross-validation sets were used to select features and find optimal hyperparameters for our model, and the hold-out set was used to simulate real-world performance of our model.
Our dataset was split into three portions: training, cross-validation, and a hold-out test set. The hold-out set was left untouched until the completion of our methods to provide an accurate real-world simulation of our models’ performance. Our base training set was defined from July 1 to August 15th. While WHO reports extended back as far as May 22, we chose to start on July 1 in order to have enough prior data for feature calculation. Our cross-validation dataset was defined from August 15 to November 10. Finally, our hold-out set started from November 11 and extended to a final date in January/February, which varied for each defined target variable depending on the respective range: a 6-8 week forecast implies a larger time frame between current and forecast date than a 2-4 week forecast, and so the 6-8 week forecast holdout set would end prior to the 2-4 week forecast. It may seem that the cross-validation set outweighs the training set significantly, but this was mitigated with the use of a rolling window forecast - a gold standard for cross-validation in time series forecasting. Rolling window cross-validation is easiest understood with the following example. Given a dataset spanning four weeks, a rolling window forecast would dictate that we train on the first week, predict on the second week, then train on the first two weeks, predict on the third, and finally train on the first three weeks and predict on the fourth. In this example, the first week would be the base-training set (as it was never predicted on and was included in the training set of each fold), the second and third weeks the cross-validation set (as they varied between prediction and training sets), and the fourth the week the hold-out set (as it was never trained on). Our five cross-validation sets were defined as follows: August 16 to August 31, August 31 to September 15, September 15 to September 30, September 30 to October 15, and Finally October 15 to October 30 (it should be noted that the final fold included data from October 30 to November 10 as a prediction set, though this does not cross into the hold-out set). The cross-validation sets were used to select features and find optimal hyperparameters for our model, and the hold-out set was used to simulate real-world performance of our model.
Datasets and Training
Feature Engineering
Feature engineering is the crux of applied machine learning, and so we went through an exhaustive feature extraction and selection process in order to arrive at our final features. First, we extracted 45,000 potentially relevant features using the tsFresh package, which calculates an expansive array of time series features on our data (Christ et al., 2018). The objective of calculating these many features was the hope to capture ideal representations of our data: while the majority of these features would not be used in the final model, our coverage of this expansive set allowed us to ensure the best features would be found. We also calculated features over a series of overlapping time frames in order to provide varying frames of reference and lags: 8 weeks prior, 6 weeks prior, 4 weeks prior, 2 weeks prior, and 1 week prior. Features describing geographically neighboring governorates (through taking the mean) were also calculated. While having more data is usually beneficial, in this case, our number of training examples was far outnumbered by the number of features. Therefore, a demanding feature selection process was required. Using tsFresh’s scalable hypothesis tests with a false discovery rate of 0.001, we were able to calculate features statistically relevant to each time-range prediction, providing us with four sets of features ~15,000 in number for each time-frame prediction. Next, we removed collinear features, or those that were 97% correlated with each other, as these features would be redundant to our model. This provided us with sets of ~10,000 features to further narrow. We trained and tuned an extreme gradient boosting model, XGBoost, to rank the features in order of importance for each time-range prediction. Utilizing the ranking produce, we recursively added features based on if they added to our cross-validation loss (the root mean square error across all five cross-validation folds). This allowed us to arrive at the best 30-50 features for each time-range. All in all, we were able to remove ~99.9% of our original features.
Results of Feature Tuning
Model
We utilized XGBoost, a random forest-based, extreme gradient boosting algorithm, to construct each of our models. Through bootstrap aggregation, the construction of multiple (often hundreds) of decision trees that are trained on random subsets of the data and then collectively vote for the final prediction, XGBoost is able to address variance-related error (overfitting). XGBoost also addresses the converse, bias-related error (underfitting), through gradient boosting: the process by which each decision tree is constructed with a greater focus on the samples the prior trees had difficulties with (Chen and Guestrin, 2016). As opposed to simpler regression techniques utilized by previous models (refer to the background), XGBoost is able to gain a far deeper understanding of the data through nonlinear relations (while being able to distinguish from noise), making it an ultimately more robust choice of algorithm.
XGBoost Depiction
Tuning
We utilized Bayesian Optimization to find optimal hyperparameters for our model. In contrast with a brute-force search over a defined set of hyperparameters, Bayesian Optimization tracks prior evaluations to form probabilistic assumptions on an objective function given a set of hyperparameters, allowing informed choices to be made on which hyperparameters to try (Snoek et al., 2012). This allowed us to converge at optimal hyperparameters with far greater efficiency.
Cholera case and death statistics are reported by the World Health Organization (WHO) where health experts and researchers work directly with Yemeni health authorities at both the country and local level. Through this direct connection, the WHO is able to record all reported cholera cases and deaths caused by cholera (WHO presence in Yemen, 2018). The data, collected by the WHO, was accessed through the Humanitarian Data Exchange (https://data.humdata.org /group/yem). It provided reports of accumulated new cholera cases and deaths per governorate from up to May 22, 2017, to February 18, 2018.
Past cholera cases and deaths were included with the simple assumption that they would be predictive of future cases. Vibrio cholerae requires aquatic environments and can transfer between humans through the transfer of bodily fluids. Thus, the incidence of cholera in one region can indicate the contamination of several food and water sources and therefore indicate a further spread of cholera (Cholera - Vibrio cholerae infection).
Past cholera cases and deaths were included with the simple assumption that they would be predictive of future cases. Vibrio cholerae requires aquatic environments and can transfer between humans through the transfer of bodily fluids. Thus, the incidence of cholera in one region can indicate the contamination of several food and water sources and therefore indicate a further spread of cholera (Cholera - Vibrio cholerae infection).
Rainfall Data
As Vibrio cholerae is indigenous to aquatic environments, rainfall is a significant predictor of the transmission of cholera. In areas exposed to heavy rainfall, through the collapse of sanitary and health infrastructure, interaction between contaminated water and human activities accelerates, resulting in an epidemic (Jutla et al., 2013). Yemen represents this scenario, where when exposed to heavy rainfall and deterioration of health facilities, there was a surge in cholera cases (Camacho et al., 2018). Global Lancet Researchers analyzing surveillance date for the Yemen Cholera Outbreak from 2016 to 2018 have found a positive and nonlinear association between weekly rainfall and suspected cholera incidence: the relative risk of cholera 10 days after a weekly rainfall of 25 mm is 42% higher than compared with a week without rain (Camacho et al., 2018). Despite the inability to establish that rainfall is causal to the increase in cholera outbreaks, the use of unsafe water sources during the drought season, contamination of water sources during the rainy season, and changing levels of zooplankton and iron in water (which help cholera bacteria survive), may contribute to the increasing levels of cholera during the rainy season (Camacho et al., 2018). These correlations demonstrate the need to measure rainfall in the machine learning model, as rainfall is a predictor for possible climate changes and the corresponding human response and subsequently indicates the spread of cholera in Yemen.
Daily rainfall data for Yemen from January 1st ,2017 to March 30th, 2018 was accessed through NASA’s Goddard Earth Sciences Data and Information Services Center (GES DISC), which provides Global Precipitation Measurement data through the Simple Subset Wizard (SSW) database. The Global Precipitation Measurement mission (GPM), launched on February 27th, 2014, is an international network of satellites that use microwave imagers and precipitation radars to measure the volume of rainfall in several regions of the world (Global Precipitation Measurement, 2011). The rainfall data was initially in a netcdf4 format. The 452 files were then parsed and converted to comma-separated-values (CSV). As there were individual data points for every .25 degrees of both latitude and longitude, Reverse geolocation was performed to match coordinates with corresponding Yemeni governorates.
Daily rainfall data for Yemen from January 1st ,2017 to March 30th, 2018 was accessed through NASA’s Goddard Earth Sciences Data and Information Services Center (GES DISC), which provides Global Precipitation Measurement data through the Simple Subset Wizard (SSW) database. The Global Precipitation Measurement mission (GPM), launched on February 27th, 2014, is an international network of satellites that use microwave imagers and precipitation radars to measure the volume of rainfall in several regions of the world (Global Precipitation Measurement, 2011). The rainfall data was initially in a netcdf4 format. The 452 files were then parsed and converted to comma-separated-values (CSV). As there were individual data points for every .25 degrees of both latitude and longitude, Reverse geolocation was performed to match coordinates with corresponding Yemeni governorates.
Conflict Data (Yemeni Civil War)
Yemen is currently in the grip of a devastating civil war, which is heavily impacting the cholera crisis in Yemen; (Camacho et al., 2018). while cholera is preventable and treatable under normal circumstances, the collapse of Yemen’s health, water, and sanitation sectors amidst the ongoing armed conflict have fueled the spread of cholera across the country, and with direct attacks against hospitals and the bombing of water supplies, the conflict has dissolved 55% of the country's medical, wastewater, and solid waste management infrastructure, making access to clean water and healthcare difficult and expensive (Camacho et al, 2018; Yemen’s Cholera Crisis: Fighting Disease During Armed Conflict, 2017; Yemen: The Forgotten War, 2018). Conflict has led to 15 million Yemenis in need of water and sanitation assistance (Camacho et al., 2018). Information regarding the status of ongoing conflicts, namely the severity in terms of death toll, was collected with the hope of it being predictive of the region’s infrastructure ability to provide treatments in cholera in the following weeks. Data gathered by the Armed Conflict Location and Event Data Project (ACLED) was retrieved from the Humanitarian Data Exchange (https://data.humdata.org/group/yem). ACLED reported the type of conflict, agents, locations, dates, and other characteristics of the politically charged conflict from January 1, 2016, to June 6, 2018 (Raleigh and Dowd, 2017). The number of casualties due to conflict on any given day and in any given Yemeni governorate was used as a metric for civil war related violence.
Diagram of Environmental and Demographic Datasets
Diagram of Factors and Processes
Results
Figure 1: Cross-validation and Holdout Error for four XGBoost forecasting models
Cross-validation and hold-out error for each of our four models. Cross-validation error was obtained by taking the root of the mean of the model’s performance across five rolling-window cross-validation folds. Hold out error was obtained by calculating the root mean squared error for predictions only on the holdout set.
Our models are able to predict the exact number of cases any given governorate in Yemen will experience across multiple two-week intervals, with all of our models being able to predict within a margin of 5 cholera cases per 10,000 people in the hold-out set. Hold-out error represents our model’s performance in real-world simulation, as the hold-out dataset was left untouched until final model evaluation. Our cross-validation error, similarly low, represents our model’s performance on a reliable, but not entirely untouched dataset, as the cross-validation dataset was used for hyperparameter tuning and feature selection. The mean number of cases any given governorate in Yemen experienced within a two-week span was approximately 19.148, with the standard deviation being 21.311. As, in real-world simulation, all four of our predictive models are able to predict around ⅕ of a standard deviation of the number cases, our predictions are robust and reliable across all time frames. However, as our predictive timeframe passes farther into the future, the cross-validation error decreases and the hold-out error increases. This could be seen as a sign of marginal overfitting, but can also be attributed to the time-shift in data as the predictive range is farther ahead: cholera 6-8 weeks ahead of a given date can look different than 2-4 weeks ahead, though 4 weeks later the 2-4 week model will see the 6-8 week data.
Figure 2: XGBoost Predictions of New Cases 0 to 2, 2 to 4, 4 to 6, and 6 to 8 Weeks in Advance for Five Governorates
Our forecasts for each time frame vs a sliding window of real cases. The window represents a two-week interval corresponding to the forecast range with single data points, sliding over the data and summing up the cholera cases falling in the interval. For example, in the 0-2 week forecast plot, on September 15th we predicted there would be 92 new cholera cases per 10,000 people in the next two weeks in the governorate of YE-SN (Sana’a). Then, two weeks later, shown as the red true-value, Sana’a experienced ~92 cases. However, the date for the true-value datapoint remains September 15th, as the value describes the number of cases 0-2 weeks in the future. The red value refers to the true value, or the number of new cholera cases actually experienced by the respective governorate in the corresponding time range (2-4 weeks from present, 4-6 weeks from present, etc.). Cross-validation predictions (green) were completed with a rolling-window method, as described earlier (see methods), and the hold-out predictions (blue) were done separately, with the model training on all data previous to the holdout set.
Our four model system is able to accurately and comprehensively forecast cholera outbreaks across 21 governorates exhibiting heterogeneous behaviors. Five governorates were chosen to represent the entire range of behavior exhibited by all 21 of Yemen’s governorates. YE-AM (Amran), YE-DA (Dhale), and YE-MW (Al Mahwit) experienced the greatest cumulative number of cholera cases from May 22nd to February 18th, respectively, making them three of the four governorates most affected by cholera overall in the given timeframe. It should be noted that the 3rd most affected governorate, YE-SA (Amanat Al Asimah) was not included, as it is technically a municipality that is enclosed by the YE-SN governorate (which is already included in the 5 governorates chosen). YE-RA (Raymah) and YE-SN (Sana’a) were chosen due to the sudden spike in cholera cases both experienced. While it was affected less than other governorates, between January and February (part of the hold-out set), YE-RA saw a sudden increase in cholera cases, peaking at ~20 new cases every two weeks (per 10,000 people). YE-SA experienced the most sudden outbreak of all governorates, with incidence rates reaching as a high as 92 cases every two weeks (per 10,000 people) between mid-September and October. With this representative sample of five governorates, we show that CALM performs well on multiple governorates exhibiting a diverse range of behaviors. It is worth noting that YE-SA and YE-RA present a rare and interesting case - a sudden outbreak. As our predictive range increases, it becomes more difficult to predict sudden spikes, due to either a lack of information many weeks prior or the events preceding a sharp outbreak not having occurred yet. As a result, longer range models seem to predict sharp outbreaks with a certain lag. This can be seen from the 4-6 week model’s forecasting of the YE-RA outbreak, which shows an upward trend, but not a full spike being predicted. However, this is where the combination of all four of our models becomes most useful. While long-range models cannot easily predict outbreaks, our shorter-range models are able to pick up the slack once the outbreak becomes closer. Specifically, our 0-2 week forecasting model is able to predict incidence spikes in YE-RA and YE-SA, so even if our long-range models were unable to predict the outbreak immediately, we would still detect the outbreak at a later date.
Summary
Cholera has killed millions of people, and without proper action, could continue to do so for many years. Many have modeled and analyzed cholera outbreaks globally to predict where they will next strike, or to forecast cholera cases and/or risk during one particular outbreak. Lambert iGEM has itself created a system of models, CALM, the Cholera Artificial Learning Model, to forecast exact cholera incidence, with the proof-of-concept forecasting cases in Yemen. Consisting of four XGBoost models working in conjunction, CALM is able to make predictions that are highly accurate and reliable, utilizing features, machine learning techniques, and predictor for variables that other machine learning models either do not do or do not do enough of. In addition, by using a system of four models, CALM is not only able to provide a comprehensive 2 month report with 2-week intervals, but also ensures that none of its models “fall behind” by making a prediction too early or too late. Sudden spikes in the cholera prediction may not be modeled accurately by models of later time frames, but more immediate aspects of the model, such as the predictor for 0-2 weeks, can effectively account for sudden changes in the cholera cases. but As demonstrated in figures 1, 2, and 3, CALM has been shown, at least within the data used for this year’s iGEM project, to be a reliable, powerful way to reduce the number of people suffering from cholera by providing a means for cholera to be predicted so that medical supplies can be distributed to pre-empt an outbreak.
CALMWatch- an SMS bot
Lambert iGEM has also developed an SMS bot utilizing the Twilio API and the Flask web framework for gathering health and sanitation data from areas affected by a cholera outbreak by directly talking to those that are affected, providing a highly detailed and specific dataset, as data is being directly gathered from the people. This bot, named CALMWatch, allows for a healthcare organization or government agency to distribute an SMS survey to a given population so that affected people can report data related to an ongoing cholera outbreak such as cleanliness of water sources, water storage, and waste management. This data can then be fed into the CALM model in real time to increase the accuracy of the model and increase the size of its databases. This bot is based on RatWatch, an open-source SMS-based rat reporting service for the Atlanta area developed by M. Koohang (Zegura & DiSalvo 2018), who graciously allowed Lambert iGEM to modify it for the purposes of the 2018 project.
Example CALMBot Interaction
Further Development
While the efficacy of the model has only been proven in Yemen, it is expected that with further development and adaptation CALM will be used to predict outbreaks around the world. As development on the project progresses, Lambert iGEM hopes to construct a fully autonomous web-based software platform comprising of data collection bots that collect data from major health and sanitation sources, software that is capable of syncing data from the ColorQ app and from CALMWatch surveys with CALM’s databases, and software that coordinates global usage of the model so that users can share and distribute results and model improvements more easily. In terms of CALM itself, Lambert iGEM also hopes to engineer more features for the model by acquiring data for more environmental factors, possibly including algal blooms, migration data, and OCV campaign data.