Team:Lambert GA/Software

<div id="content3">
The ongoing Yemeni cholera outbreak has been deemed one of the worst cholera outbreaks in history, with over a million people impacted and thousands dead. Triggered by a civil war, the outbreak has been shaped by various political, environmental, and epidemiological factors and is continuing to accelerate. While cholera has several effective treatments, the untimely and ultimately inefficient distribution of existing medicines has been the primary cause of cholera mortality. With the hope of facilitating resource allocation, various mathematical models have been made to track the Yemeni outbreak and identify at-risk governorates (administrative divisions). These models, while useful, are not powerful enough to accurately and consistently forecast exact cholera cases per governorate over multiple timeframes. To address the need for a complex, reliable model, the Lambert iGEM team presents CALM, the Cholera Artificial Learning Model: a system of four extreme-gradient-boosting (XGBoost) machine learning models that forecast the exact number of cholera cases a Yemeni governorate will experience over time ranges of 2 weeks to 2 months. CALM provides a novel machine learning approach that makes use of rainfall data, past cholera case and death data, civil war fatalities, and inter-governorate interactions represented across multiple timeframes. Additionally, the use of machine learning, along with extensive feature engineering, allows CALM to learn the complex non-linear relations apparent in an epidemiological phenomenon. CALM is able to forecast cholera incidence 6-8 weeks ahead within a margin of 4.607 cholera cases per 10,000 people in real-world simulation. Similarly, CALM achieved a mean error of 3.921 for the 0-2 week forecast, 4.034 for the 2-4 week forecast, and 4.737 for the 4-6 week forecast. The model’s forecast system provides advance notice of outbreaks at multiple time horizons, facilitating the timely allocation of cholera relief supplies and outbreak prevention.
 
</div>
<div id="target1"></div>
<div id="subheading1">
<b>DATASET PREPARATION</b>
</div>
<br>
<br>
<div id="content1">
With the objective of predicting new cholera cases in any given governorate in Yemen from week to week, we took a number of steps to prepare the data. In order to produce models that did not simply rely on seasonal trends and were able to predict spikes in cholera cases, the case and death report time series were made stationary through temporal differencing. It should be noted that the country of Yemen encompasses 21 governorates, or administrative divisions. While the CALM models were trained on data from all 21 governorates, data preparation was performed on each governorate separately to preserve each governorate’s unique time series. As the interval between WHO cholera case/death reports was not standard, the data was linearly interpolated into a daily time series. The Yemeni cholera outbreak is seasonal and endemic, as outbreaks spike during the rainy season (April-August); however, the outbreaks also depend on non-seasonal factors such as conflict and damage to health and sanitation infrastructure (Camacho et al., 2018a). Parsing the data required finding the number of new cholera cases on a given day from the total number of cases reported the previous day. The values were then normalized by the population of each governorate (e.g., new cases per 10,000 people). Finally, we calculated our four target variables: the number of new cholera cases 0-2 weeks from the present day, 2-4 weeks from the present, 4-6 weeks from the present, and 6-8 weeks from the present.
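<br><br>
As an illustration, the following is a minimal sketch of this preparation pipeline in Python with pandas. The column name, function names, and exact windowing details are assumptions made for illustration, not the team's actual code:
<pre>
import pandas as pd

def prepare_governorate(reports: pd.DataFrame, population_10k: float) -> pd.DataFrame:
    """reports: one governorate's WHO reports indexed by report date,
    with a cumulative 'cases' column (names are assumptions)."""
    # Reports arrive at irregular intervals: resample to a daily grid
    # and linearly interpolate the cumulative counts between reports.
    daily = reports["cases"].resample("D").mean().interpolate()
    # Temporal differencing makes the series stationary: new cases
    # per day instead of a growing cumulative total.
    new_cases = daily.diff().dropna() / population_10k  # per 10,000 people
    # Four targets: total new cases 0-2, 2-4, 4-6, and 6-8 weeks ahead.
    out = new_cases.to_frame("new_cases")
    for lo, hi in [(0, 14), (14, 28), (28, 42), (42, 56)]:
        out[f"target_{lo}_{hi}d"] = [
            new_cases.iloc[i + lo : i + hi].sum() if i + hi <= len(new_cases)
            else float("nan")  # not enough future data near the series end
            for i in range(len(new_cases))
        ]
    return out
</pre>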
<br>
Our dataset was split into three portions: training, cross-validation, and a hold-out test set. The hold-out set was left untouched until the completion of our methods to provide an accurate real-world simulation of our models’ performance. Our base training set was defined from July 1 to August 15. While WHO reports extended back as far as May 22, we chose to start on July 1 in order to have enough prior data for feature calculation. Our cross-validation dataset was defined from August 15 to November 10. Finally, our hold-out set started on November 11 and extended to a final date in January/February that varied for each target variable depending on its range: a 6-8 week forecast implies a larger gap between the current and forecast dates than a 2-4 week forecast, so the 6-8 week hold-out set ends earlier than the 2-4 week one. It may seem that the cross-validation set significantly outweighs the training set, but this was mitigated by the use of a rolling window forecast, a gold standard for cross-validation in time series forecasting. Rolling window cross-validation is easiest to understand with an example. Given a dataset spanning four weeks, a rolling window forecast would dictate that we train on the first week and predict on the second, then train on the first two weeks and predict on the third, and finally train on the first three weeks and predict on the fourth. In this example, the first week would be the base training set (as it was never predicted on and was included in the training set of each fold), the second and third weeks the cross-validation set (as they varied between prediction and training sets), and the fourth week the hold-out set (as it was never trained on). Our five cross-validation sets were defined as follows: August 16 to August 31, August 31 to September 15, September 15 to September 30, September 30 to October 15, and finally October 15 to October 30 (it should be noted that the final fold included data from October 30 to November 10 in its prediction set, though this does not cross into the hold-out set). The cross-validation sets were used to select features and find optimal hyperparameters for our models, and the hold-out set was used to simulate their real-world performance. A sketch of these folds in code follows.
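<br><br>
The rolling-window folds above can be expressed as a short helper. The year is assumed to be 2017 (the wiki gives only month and day), and the function name is illustrative:
<pre>
import pandas as pd

# Fold boundaries from the section above (year assumed to be 2017).
# Each fold trains on everything from July 1 up to its start date
# and predicts on the following window; the hold-out set (Nov 11
# onward) is never touched during cross-validation.
BASE_TRAIN_START = pd.Timestamp("2017-07-01")
FOLDS = [
    ("2017-08-16", "2017-08-31"),
    ("2017-08-31", "2017-09-15"),
    ("2017-09-15", "2017-09-30"),
    ("2017-09-30", "2017-10-15"),
    ("2017-10-15", "2017-11-10"),  # final fold predicts through Nov 10
]

def rolling_window_folds(df: pd.DataFrame):
    """Yield (train, validation) splits; df has a DatetimeIndex."""
    for start, end in FOLDS:
        train = df[(df.index >= BASE_TRAIN_START) & (df.index < start)]
        valid = df[(df.index >= start) & (df.index < end)]
        yield train, valid
</pre>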
<br><br>
<div style="text-align:center"><img src="https://static.igem.org/mediawiki/2018/6/66/T--Lambert_GA--CALMDataSplit.png"></div>
<br>
<div style="font-size:12px; text-align:center;"><i>Datasets and Training</i></div>
</div>
<br><br>
<div id="target2"></div>
<div id="subheading2">
<b>Feature Engineering</b>
</div>
<br><br>
<div id="content2">
Feature engineering is the crux of applied machine learning, and so we went through an exhaustive feature extraction and selection process to arrive at our final features. First, we extracted 45,000 potentially relevant features using the tsFresh package, which calculates an expansive array of time series features on our data (Christ et al., 2018). The objective of calculating so many features was to capture ideal representations of our data: while the majority of these features would not be used in the final model, covering such an expansive set ensured that the best features would be found. We also calculated features over a series of overlapping time frames in order to provide varying frames of reference and lags: 8 weeks prior, 6 weeks prior, 4 weeks prior, 2 weeks prior, and 1 week prior. Features describing geographically neighboring governorates (taken as a mean) were also calculated. While having more data is usually beneficial, in this case our number of training examples was far outnumbered by the number of features, so a demanding feature selection process was required. Using tsFresh’s scalable hypothesis tests with a false discovery rate of 0.001, we identified the features statistically relevant to each time-range prediction, providing us with four sets of ~15,000 features, one per time-frame prediction. Next, we removed collinear features, those that were 97% or more correlated with one another, as such features would be redundant to our model. This left us with sets of ~10,000 features to narrow further. We trained and tuned an extreme gradient boosting model, XGBoost, to rank the features in order of importance for each time-range prediction. Utilizing the ranking produced, we recursively added features, keeping each one only if it improved our cross-validation loss (the root mean square error across all five cross-validation folds). This allowed us to arrive at the best 30-50 features for each time range. All in all, we were able to remove ~99.9% of our original features.
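<br><br>
A minimal sketch of this selection funnel for one target horizon follows. The tsFresh <i>select_features</i> call and its <i>fdr_level</i> argument are real; the <i>cv_rmse</i> helper (RMSE across the five cross-validation folds) and the untuned XGBoost settings are assumptions for illustration:
<pre>
import numpy as np
import pandas as pd
from tsfresh import select_features
from xgboost import XGBRegressor

def narrow_features(X: pd.DataFrame, y: pd.Series, cv_rmse) -> list:
    """X: ~45,000 tsFresh features; y: one target horizon.
    cv_rmse(feature_list) -> RMSE over the five CV folds (assumed helper)."""
    # 1) tsFresh's scalable hypothesis tests, FDR = 0.001 (~15,000 left).
    X = select_features(X, y, fdr_level=0.001)
    # 2) Drop one of each pair of collinear features (|r| >= 0.97).
    corr = X.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    X = X.drop(columns=[c for c in upper.columns if (upper[c] >= 0.97).any()])
    # 3) Rank the ~10,000 survivors by XGBoost feature importance.
    importances = XGBRegressor().fit(X, y).feature_importances_
    ranked = X.columns[np.argsort(importances)[::-1]]
    # 4) Recursively add features, keeping those that lower the CV RMSE.
    kept, best = [], float("inf")
    for feat in ranked:
        loss = cv_rmse(kept + [feat])
        if loss < best:
            kept, best = kept + [feat], loss
    return kept  # typically the best 30-50 features
</pre>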
<br><br>
<div style="text-align:center"><img src="https://static.igem.org/mediawiki/2018/d/d3/T--Lambert_GA--CALMFeatureTuningResults.png"></div>
<br>
<div style="font-size:12px; text-align:center"><i>Results of Feature Tuning</i></div>
<br>
</div>
<div id="target3"></div>
<div id="subheading3">
<b>Model</b>
</div>
<br><br>
<div id="content3">
We utilized XGBoost, an extreme gradient boosting algorithm built on ensembles of decision trees, to construct each of our models. Through bootstrap aggregation, the construction of many (often hundreds of) decision trees that are trained on random subsets of the data and then collectively vote on the final prediction, XGBoost is able to address variance-related error (overfitting). XGBoost also addresses the converse, bias-related error (underfitting), through gradient boosting: the process by which each decision tree is constructed with a greater focus on the samples the prior trees had difficulty with (Chen and Guestrin, 2016). As opposed to the simpler regression techniques utilized by previous models (refer to the background), XGBoost is able to gain a far deeper understanding of the data by capturing nonlinear relations (while still distinguishing signal from noise), making it an ultimately more robust choice of algorithm.
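<br><br>
A minimal sketch of the four-model setup, one XGBoost regressor per forecast horizon; the hyperparameter values shown are placeholders, not the team's tuned settings:
<pre>
from xgboost import XGBRegressor

def make_model() -> XGBRegressor:
    """One regressor per forecast horizon (placeholder hyperparameters)."""
    return XGBRegressor(
        n_estimators=500,       # number of boosted trees
        max_depth=4,            # depth of each tree
        learning_rate=0.05,     # shrinkage on each tree's contribution
        subsample=0.8,          # row subsampling per tree (bagging-like)
        colsample_bytree=0.8,   # feature subsampling per tree
        objective="reg:squarederror",
    )

# Four independent models, one per target horizon; each is then fit on
# that horizon's selected features and target, e.g.:
# models["0-2wk"].fit(X_train_02, y_train_02)
models = {h: make_model() for h in ("0-2wk", "2-4wk", "4-6wk", "6-8wk")}
</pre>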
<br>
<div style="text-align:center"><img src="https://static.igem.org/mediawiki/2018/4/4f/T--Lambert_GA--CALMmodelPic.png"></div>
<br>
<div style="font-size:12px; text-align:center;"><i>XGBoost Depiction</i></div>
</div>
<div id="target4"></div>
<div id="subheading4">
<b>Tuning</b>
</div>
<br><br>
<div id="content4">
We utilized Bayesian optimization to find optimal hyperparameters for our models. In contrast with a brute-force search over a defined set of hyperparameters, Bayesian optimization tracks prior evaluations to form probabilistic assumptions about the objective function given a set of hyperparameters, allowing informed choices to be made about which hyperparameters to try next (Snoek et al., 2012). This allowed us to converge on optimal hyperparameters with far greater efficiency.
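<br><br>
A sketch of such a search using the open-source <i>bayes_opt</i> package; the wiki does not name the specific library, and the <i>cv_rmse</i> helper (five-fold cross-validation RMSE) and search bounds are assumptions for illustration:
<pre>
from bayes_opt import BayesianOptimization
from xgboost import XGBRegressor

def objective(max_depth, learning_rate, subsample):
    """Score one hyperparameter combination on the CV folds."""
    model = XGBRegressor(
        max_depth=int(max_depth),  # the optimizer samples continuous values
        learning_rate=learning_rate,
        subsample=subsample,
        objective="reg:squarederror",
    )
    # bayes_opt maximizes, so return the negated CV error.
    return -cv_rmse(model)  # cv_rmse: assumed helper, not shown

optimizer = BayesianOptimization(
    f=objective,
    pbounds={"max_depth": (2, 10),
             "learning_rate": (0.01, 0.3),
             "subsample": (0.5, 1.0)},
    random_state=42,
)
optimizer.maximize(init_points=5, n_iter=25)
print(optimizer.max)  # best hyperparameters found
</pre>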
<br><br>
</div>
<br>
 
</div>
 
</div>
  
