CALM
DATA & METHODS
Cholera Incidence and Mortality
Past cholera cases and deaths were included under the assumption that they would be predictive of future cases. Vibrio cholerae thrives in aquatic environments and can spread between humans through the exchange of bodily fluids. The incidence of cholera in one region can therefore indicate contamination of multiple food and water sources and drive further spread of the disease. With this in mind, cholera case and death data from World Health Organization (WHO) reports were included as a primary feature in CALM.
Rainfall Data
Analysis of surveillance data for the Yemeni cholera outbreak from 2016 to 2018 found a positive, nonlinear association between weekly rainfall and suspected cholera incidence: the relative risk of cholera 10 days after a weekly rainfall of 25 mm was found to be 42% higher than after a week without rain (Camacho et al., 2018). Although rainfall could not be established as a direct cause of the increase in cholera outbreaks, the use of unsafe water sources during the drought season, contamination of water sources during the rainy season, and changing levels of zooplankton and iron in water (which help cholera bacteria survive) may all contribute to rising levels of cholera during the rainy season (Camacho et al., 2018). Thus, rainfall data from NASA GPM satellites were included in CALM.
Conflict Data (Yemeni Civil War)
While cholera is preventable and treatable under stable circumstances, the collapse of Yemen's health, water, and sanitation sectors amid the ongoing armed conflict has fueled the spread of cholera across the country. With direct attacks against hospitals and the bombing of water supplies, the conflict has destroyed 55% of the country's medical, wastewater, and solid waste management infrastructure, making access to clean water and healthcare difficult and expensive (Camacho et al., 2018; Yemen's Cholera Crisis: Fighting Disease During Armed Conflict, 2017; Yemen: The Forgotten War, 2018). The number of daily casualties due to conflict in each Yemeni governorate was used as a metric for civil-war-related violence.
Diagram of Lambert's Approach
DATASET PREPARATION
In order to produce models that did not rely solely on seasonal trends and could predict spikes in cholera cases, our objective became to predict new cholera cases in any given governorate in Yemen from week to week. With this objective, the case and death report time series were made stationary through temporal differencing. Our four target variables were also calculated: the number of new cholera cases 0-2, 2-4, 4-6, and 6-8 weeks from the present.
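As an illustration only, the differencing and target construction might look roughly like the sketch below in pandas; the table layout and column names ('governorate', 'date', 'cases') are placeholders rather than CALM's actual schema.

import pandas as pd

def add_targets(df):
    # df: one row per governorate per week, with a cumulative 'cases' column
    out = df.sort_values(["governorate", "date"]).copy()
    # Temporal differencing: weekly new cases rather than the trending cumulative series
    out["new_cases"] = out.groupby("governorate")["cases"].diff()
    # Target variables: new cases 0-2, 2-4, 4-6, and 6-8 weeks from the present
    for start, end in [(0, 2), (2, 4), (4, 6), (6, 8)]:
        out[f"target_{start}_{end}wk"] = out.groupby("governorate")["cases"].transform(
            lambda s, a=start, b=end: s.shift(-b) - s.shift(-a)
        )
    return out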
Our dataset was split into three portions: training, cross-validation, and a hold-out test set. The hold-out set was left untouched until the completion of our methods to provide an accurate real-world simulation of our models’ performance. Our cross-validation dataset was used with a rolling window forecast (see supplementary section - methods for more information) for feature selection and hyperparameter optimization.
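A minimal sketch of the rolling-window idea is shown below, with illustrative window sizes rather than the ones used in CALM. Each split trains on a contiguous block of past weeks and evaluates on the weeks immediately following it, mimicking how the model would be used in practice.

import numpy as np

def rolling_window_splits(n_weeks, train_size=52, test_size=2, step=2):
    # Yield (train_idx, test_idx) pairs that always test on weeks after the training window
    start = 0
    while start + train_size + test_size <= n_weeks:
        yield (np.arange(start, start + train_size),
               np.arange(start + train_size, start + train_size + test_size))
        start += step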
Feature Engineering
Feature engineering is at the core of applied machine learning, so we went through an exhaustive feature extraction and selection process to arrive at our final features. We extracted 45,000 potentially relevant features using the tsFresh package, which calculates an expansive array of time-series features on our data (Christ et al., 2018). The objective of calculating so many features was to capture ideal representations of our data: while the majority of these features would not be used in the final model, covering this expansive set of candidates ensured that the best ones would be found. We also calculated features over multiple time frames and for geographically neighboring governorates. Through a demanding feature selection process (see supplementary section - methods for more information), we arrived at the best 30-50 features for each time-range model. In all, we removed ~99.9% of our original features.
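A minimal sketch of what such an extraction could look like with tsFresh is given below, assuming a long-format table long_df with one row per governorate per week; the column names are illustrative, not CALM's actual schema.

from tsfresh import extract_features
from tsfresh.feature_extraction import ComprehensiveFCParameters

# long_df: long-format weekly data with the columns referenced below
features = extract_features(
    long_df[["governorate", "date", "new_cases"]],
    column_id="governorate",                           # one feature row per governorate
    column_sort="date",                                # chronological order within each series
    default_fc_parameters=ComprehensiveFCParameters()  # the full tsFresh feature catalogue
)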
Results of Feature Tuning
Model
We utilized XGBoost, a decision-tree-based, extreme gradient boosting algorithm, to construct each of our models. Through bootstrap aggregation, the construction of many (often hundreds of) decision trees that are trained on random subsets of the data and then collectively vote on the final prediction, XGBoost is able to address variance-related error (overfitting). XGBoost also addresses the converse, bias-related error (underfitting), through gradient boosting: the process by which each decision tree is constructed with a greater focus on the samples that prior trees had difficulty with (Chen and Guestrin, 2016). In contrast to the simpler regression techniques utilized by previous models (see “Uniqueness of Approach” supplementary section), XGBoost can capture nonlinear relationships in the data while still distinguishing signal from noise, making it an ultimately more robust choice of algorithm.
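As a rough illustration of fitting one time-range model, the sketch below assumes feature matrices X_train and X_cv and a target vector y_train produced by the preparation steps above; the hyperparameter values shown are placeholders, not CALM's tuned settings.

from xgboost import XGBRegressor

model = XGBRegressor(
    n_estimators=300,       # number of boosted trees
    max_depth=4,            # depth of each individual tree
    learning_rate=0.05,     # shrinkage applied to each tree's contribution
    subsample=0.8,          # row subsampling per tree (the bagging-style component)
    colsample_bytree=0.8,   # feature subsampling per tree
    objective="reg:squarederror",
)
model.fit(X_train, y_train)        # y_train: one target, e.g. new cases 0-2 weeks ahead
predictions = model.predict(X_cv)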
Tuning
We utilized Bayesian optimization to find optimal hyperparameters for our models. In contrast to a brute-force search over a defined set of hyperparameters, Bayesian optimization tracks prior evaluations to form probabilistic assumptions about the objective function given a set of hyperparameters, allowing informed choices of which hyperparameters to try next (Snoek et al., 2012). This allowed us to converge on optimal hyperparameters far more efficiently.
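The sketch below illustrates the idea using the bayes_opt package (an assumption; other Bayesian optimization libraries work similarly), together with the hypothetical rolling_window_splits helper and X, y objects from the earlier sketches.

from bayes_opt import BayesianOptimization
from xgboost import XGBRegressor

def cv_score(max_depth, learning_rate, subsample):
    # Mean squared error over the rolling-window splits, negated so that higher is better
    errors = []
    for train_idx, test_idx in rolling_window_splits(len(X)):
        model = XGBRegressor(max_depth=int(max_depth), learning_rate=learning_rate,
                             subsample=subsample, objective="reg:squarederror")
        model.fit(X.iloc[train_idx], y.iloc[train_idx])
        pred = model.predict(X.iloc[test_idx])
        errors.append(((pred - y.iloc[test_idx].values) ** 2).mean())
    return -sum(errors) / len(errors)

optimizer = BayesianOptimization(
    f=cv_score,
    pbounds={"max_depth": (3, 8), "learning_rate": (0.01, 0.3), "subsample": (0.5, 1.0)},
    random_state=42,
)
optimizer.maximize(init_points=5, n_iter=25)   # random exploration first, then guided search
best_params = optimizer.max["params"]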