Difference between revisions of "Team:Lambert GA/CALM METHODS"

Latest revision as of 01:18, 18 October 2018

C A L M

D A T A & M E T H O D S

Cholera Incidence and Mortality

Past cholera cases and deaths were included on the assumption that they would be predictive of future cases. Vibrio cholerae thrives in aquatic environments and can spread between people through contact with contaminated bodily fluids. Thus, cholera incidence in one region can indicate that several food and water sources are contaminated, which in turn furthers the spread of the disease. With this in mind, cholera case and death data from World Health Organization (WHO) reports were included as primary features in CALM.

Rainfall Data

Analysis of surveillance data for the Yemeni cholera outbreak from 2016 to 2018 found a positive and nonlinear association between weekly rainfall and suspected cholera incidence: the relative risk of cholera 10 days after a weekly rainfall of 25 mm was 42% higher than after a week without rain (Camacho et al., 2018). Although rainfall has not been established as a direct cause of increased cholera incidence, several mechanisms may contribute to the rise in cholera during the rainy season: the use of unsafe water sources during the dry season, contamination of water sources during the rainy season, and changing levels of zooplankton and iron in water, which help cholera bacteria survive (Camacho et al., 2018). Thus, rainfall data from NASA GPM satellites were included in CALM.

Conflict Data (Yemeni Civil War)

While cholera is preventable and treatable under stable circumstances, the collapse of Yemen’s health, water, and sanitation sectors amid the ongoing armed conflict has fueled the spread of cholera across the country. Through direct attacks on hospitals and the bombing of water supplies, the conflict has destroyed 55% of the country's medical, wastewater, and solid waste management infrastructure, making access to clean water and healthcare difficult and expensive (Camacho et al., 2018; Yemen’s Cholera Crisis: Fighting Disease During Armed Conflict, 2017; Yemen: The Forgotten War, 2018). The number of daily casualties due to conflict in each Yemeni governorate was used as a metric for civil-war-related violence.

Diagram of Lambert's Approach

DATASET PREPARATION

To produce models that did not rely solely on seasonal trends and could predict spikes in cholera cases, we set our objective as predicting new cholera cases in any given governorate in Yemen from week to week. To this end, the case and death report time series were made stationary through temporal differencing. We then calculated our four target variables: the number of new cholera cases 0-2 weeks from the present day, 2-4 weeks from the present, 4-6 weeks from the present, and 6-8 weeks from the present.
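The differencing and target construction described above can be sketched as follows. This is a minimal illustration, not the CALM code: it assumes a cumulative case count at daily resolution for a single governorate, and the function names `difference` and `window_targets` are hypothetical.

```python
def difference(cumulative):
    """First-order temporal differencing: turn cumulative counts into new cases per day."""
    return [later - earlier for earlier, later in zip(cumulative, cumulative[1:])]


def window_targets(daily_new, today, horizons=((0, 14), (14, 28), (28, 42), (42, 56))):
    """Sum new cases in each future window, given as (start, end) day offsets from `today`.

    The default horizons correspond to 0-2, 2-4, 4-6, and 6-8 weeks ahead.
    A window that runs past the end of the series yields None.
    """
    targets = []
    for start, end in horizons:
        window = daily_new[today + start: today + end]
        targets.append(sum(window) if len(window) == end - start else None)
    return targets


# Toy data (assumption): cumulative cases growing by 2 per day for 60 days.
cumulative_cases = [2 * day for day in range(60)]
daily_new_cases = difference(cumulative_cases)
targets_day0 = window_targets(daily_new_cases, today=0)
targets_day10 = window_targets(daily_new_cases, today=10)
```

Each governorate would get its own series, so the four targets are computed once per governorate per day.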

Our dataset was split into three portions: training, cross-validation, and a hold-out test set. The hold-out set was left untouched until our methods were finalized, so that it could provide a realistic estimate of our models’ real-world performance. Our cross-validation dataset was used with a rolling window forecast (see supplementary section - methods for more information) for feature selection and hyperparameter optimization.
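One common way to set up a rolling window forecast is sketched below. The exact procedure is described in the supplementary section, so the window sizes, the step, and the function name `rolling_window_splits` here are assumptions chosen for illustration only.

```python
def rolling_window_splits(n_samples, train_size, test_size, step=None):
    """Yield (train_indices, test_indices) pairs that slide forward through time.

    Every test block comes strictly after its training window, so a model is
    never evaluated on data that precedes what it was trained on.
    """
    step = step or test_size
    splits = []
    start = 0
    while start + train_size + test_size <= n_samples:
        train_idx = list(range(start, start + train_size))
        test_idx = list(range(start + train_size, start + train_size + test_size))
        splits.append((train_idx, test_idx))
        start += step
    return splits


# Toy example (assumption): 10 time steps, train on 4, validate on the next 2.
splits = rolling_window_splits(n_samples=10, train_size=4, test_size=2)
```

Averaging a model's error over all such splits gives a single score that can be compared across candidate feature sets or hyperparameter settings.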

Feature Engineering

Feature engineering is at the core of applied machine learning, so we went through an exhaustive feature extraction and selection process to arrive at our final features. We extracted 45,000 potentially relevant features using the tsfresh package, which calculates an expansive array of time series features (Christ et al., 2018). The aim of calculating this many features was to capture the most useful representations of our data: while the majority of these features would not be used in the final model, covering such an expansive set of candidates made it more likely that the best ones would be found. We also calculated features over multiple time frames and for geographically neighboring governorates. Through a demanding feature selection process (see supplementary section - methods for more information), we arrived at the best 30-50 features for each time-range model. All in all, we removed ~99.9% of our original features.

Results of Feature Tuning
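The idea of computing features over multiple time frames and for neighboring governorates can be illustrated with a small sketch. It does not use tsfresh and does not reproduce the CALM feature set: the trailing-mean feature, the window lengths, the governorate labels, and the names `rolling_mean` and `build_features` are all assumptions made for illustration.

```python
def rolling_mean(series, window):
    """Trailing mean over the last `window` values; None until enough history exists."""
    out = []
    for i in range(len(series)):
        if i + 1 < window:
            out.append(None)
        else:
            chunk = series[i + 1 - window: i + 1]
            out.append(sum(chunk) / window)
    return out


def build_features(cases_by_gov, neighbors, gov, windows=(7, 14)):
    """Compute one feature per window for `gov` and for each of its neighbors."""
    features = {}
    for window in windows:
        features[f"{gov}_mean_{window}d"] = rolling_mean(cases_by_gov[gov], window)
        for neighbor in neighbors.get(gov, []):
            features[f"{neighbor}_neighbor_mean_{window}d"] = rolling_mean(
                cases_by_gov[neighbor], window
            )
    return features


# Toy data (assumption): two governorates "A" and "B", where B borders A.
cases = {"A": list(range(20)), "B": [1] * 20}
adjacency = {"A": ["B"]}
features_for_a = build_features(cases, adjacency, "A")
```

Repeating this over many feature functions, windows, and neighbors is how a candidate pool can grow into the tens of thousands before selection prunes it.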

Model

We used XGBoost, a gradient boosting algorithm built on decision trees, to construct each of our models. XGBoost builds an ensemble of many (often hundreds of) decision trees whose outputs are summed to give the final prediction. It addresses bias-related error (underfitting) through gradient boosting: each new tree is fit to the errors left by the trees before it, so it focuses on the samples the prior trees had difficulty with (Chen and Guestrin, 2016). It addresses the converse, variance-related error (overfitting), through regularization, shrinkage, and random subsampling of rows and columns, the last of which is borrowed from random forests (Chen and Guestrin, 2016). As opposed to the simpler regression techniques used by previous models (see “Uniqueness of Approach” supplementary section), XGBoost can capture nonlinear relationships in the data while distinguishing them from noise, making it a more robust choice of algorithm.
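The core idea of gradient boosting can be shown with a small, self-contained sketch. This is not XGBoost itself, which adds second-order gradient information, regularization, and subsampling (Chen and Guestrin, 2016); it is a plain squared-error booster over one-split decision stumps on a single feature, and the names `fit_stump` and `gradient_boost` are hypothetical.

```python
def fit_stump(xs, residuals):
    """Find the one-threshold split on a 1-D feature that minimizes squared error.

    Requires at least two distinct values in `xs`.
    """
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = sum((r - left_mean) ** 2 for r in left)
        error += sum((r - right_mean) ** 2 for r in right)
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def gradient_boost(xs, ys, n_trees=50, learning_rate=0.3):
    """Fit stumps one at a time, each to the residuals left by the ensemble so far."""
    base = sum(ys) / len(ys)
    stumps = []
    preds = [base] * len(ys)
    for _ in range(n_trees):
        # For squared error, the negative gradient is simply the residual.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for p, x in zip(preds, xs)]

    def predict(x):
        return base + learning_rate * sum(stump(x) for stump in stumps)

    return predict


# Toy data (assumption): a nonlinear target that a single straight line fits poorly.
xs = list(range(10))
ys = [x ** 2 for x in xs]
model = gradient_boost(xs, ys)

mean_y = sum(ys) / len(ys)
base_mse = sum((y - mean_y) ** 2 for y in ys) / len(ys)
boosted_mse = sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(ys)
```

The learning rate shrinks each tree's contribution, which is the same shrinkage idea XGBoost uses to limit overfitting.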

Tuning

Hyperparameters are parameters whose values are set before the learning process begins rather than learned from the data, and they can greatly affect a model’s performance. We used Bayesian Optimization to find optimal hyperparameters for our model. In contrast with a brute-force search over a defined set of hyperparameters, Bayesian Optimization uses prior evaluations to build a probabilistic model of the objective function over the hyperparameter space, allowing informed choices to be made about which hyperparameters to try next (Snoek et al., 2012). This allowed us to converge on optimal hyperparameters with far greater efficiency.
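A minimal sketch of this loop for a single hyperparameter is given below. It is an illustration under stated assumptions, not the optimizer used for CALM: the surrogate is a small Gaussian process with an RBF kernel, the acquisition rule is a lower confidence bound (one of several possible choices), and the objective `toy_validation_error` is a hypothetical stand-in for a real cross-validation score.

```python
import math


def rbf(a, b, length_scale=0.2):
    """Squared-exponential kernel between two scalar hyperparameter values."""
    return math.exp(-((a - b) ** 2) / (2 * length_scale ** 2))


def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def gp_posterior(xs, ys, x_new, noise=1e-6):
    """Posterior mean and standard deviation of the surrogate at x_new."""
    k_matrix = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(xs)]
                for i, a in enumerate(xs)]
    k_star = [rbf(x_new, a) for a in xs]
    mean_y = sum(ys) / len(ys)
    alpha = solve(k_matrix, [y - mean_y for y in ys])
    mean = mean_y + sum(k * a for k, a in zip(k_star, alpha))
    weights = solve(k_matrix, k_star)
    variance = rbf(x_new, x_new) - sum(k * w for k, w in zip(k_star, weights))
    return mean, math.sqrt(max(variance, 0.0))


def bayes_opt(objective, candidates, n_init=3, n_iter=10, kappa=2.0):
    """Minimize `objective` over `candidates`, choosing each new point from the surrogate."""
    step = max(len(candidates) // n_init, 1)
    xs = candidates[::step][:n_init]
    ys = [objective(x) for x in xs]
    for _ in range(n_iter):
        untried = [c for c in candidates if c not in xs]
        if not untried:
            break

        def lower_confidence_bound(c):
            mean, std = gp_posterior(xs, ys, c)
            return mean - kappa * std  # low predicted error or high uncertainty

        next_x = min(untried, key=lower_confidence_bound)
        xs.append(next_x)
        ys.append(objective(next_x))
    best = min(range(len(xs)), key=lambda i: ys[i])
    return xs[best], ys[best], len(xs)


def toy_validation_error(x):
    """Hypothetical objective: pretend validation error is lowest at x = 0.3."""
    return (x - 0.3) ** 2


grid = [i / 50 for i in range(51)]
best_x, best_error, n_evaluations = bayes_opt(toy_validation_error, grid)
```

Here the optimizer evaluates at most 13 of the 51 candidate values, whereas a brute-force search would evaluate all of them.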

@@ Line 117: / Line 117: @@
 }
 #target5{
+position: relative;
+top: -80px;
+}
+#target6{
+position: relative;
+top: -80px;
+}
+#target7{
 position: relative;
 top: -80px;
@@ Line 153: / Line 161: @@
 font-size: 36px;
 font-family: 'Montserrat', sans-serif;
+}
+#content4 {
+font-family: 'Lora', serif;
+font-size: 15px;
+line-height: 1.75em;
 }
 #subheading5 {
 font-size: 36px;
 font-family: 'Montserrat', sans-serif;
+}
+#content5 {
+font-family: 'Lora', serif;
+font-size: 15px;
+line-height: 1.75em;
+}
+#subheading6 {
+font-size: 36px;
+font-family: 'Montserrat', sans-serif;
+}
+#content6 {
+font-family: 'Lora', serif;
+font-size: 15px;
+line-height: 1.75em;
+}
+#subheading7 {
+font-size: 36px;
+font-family: 'Montserrat', sans-serif;
+}
+#content7 {
+font-family: 'Lora', serif;
+font-size: 15px;
+line-height: 1.75em;
 }
 #footer {
@@ Line 220: / Line 260: @@
       <div id="overlay"></div>
            <div id="bigtitle">
-                C A L M <br><br><br><br> M E T H O D S
+                C A L M <br><br><br><br>  D A T A  &  M E T H O D S
            </div>
 </div>
@@ Line 243: / Line 283: @@
 <br><br>
                        <div id="link">
-                             <b><a style="color:black; text-decoration: none; line-height:1.1;" href="#target1">OVERVIEW</a></b>
+                             <b><a style="color:black; text-decoration: none; line-height:1.1;" href="#target1">CHOLERA INCIDENCE AND MORTALITY</a></b>
                        </div>
 <br><br>
                        <div id="link">
-                             <b><a style="color:black; text-decoration: none; line-height:1.1;" href="#target2">COLOR Q APP</a></b>
+                             <b><a style="color:black; text-decoration: none; line-height:1.1;" href="#target2">RAINFALL DATA</a></b>
                        </div>
 <br><br>
                        <div id="link">
-                             <b><a style="color:black; text-decoration: none; line-height:1.1;" href="#target3">CALM</a></b>
+                             <b><a style="color:black; text-decoration: none; line-height:1.1;" href="#target3">CONFLICT DATA</a></b>
-                       </div>
+                      </div>
+<br><br>
+                      <div id="link">
+                            <b><a style="color:black; text-decoration: none; line-height:1.1;" href="#target4">DATASET PREPARATION</a></b>
+                      </div>
+<br><br>
+<div id="link">
+                            <b><a style="color:black; text-decoration: none; line-height:1.1;" href="#target5">FEATURE ENGINEERING</a></b>
+                      </div>
+<br><br>
+<div id="link">
+                            <b><a style="color:black; text-decoration: none; line-height:1.1;" href="#target6">MODEL</a></b>
+                      </div>
+<br><br>
+<div id="link">
+                            <b><a style="color:black; text-decoration: none; line-height:1.1;" href="#target7">TUNING</a></b>
+                       </div>
 <br><br>
        </div>
 </div>
@@ Line 261: / Line 316: @@
 <div id="target1"></div>
 <div id="subheading1">
-<b>DATASET PREPERATION</b>
+<b>Cholera Incidence and Mortality</b>
 </div>
 <br>
 <br>
 <div id="content1">
-With the objective of predicting new cholera cases in any given governorate in Yemen from week to week, there were a number of steps taken to prepare the data. In order to produce models that did not simply rely on seasonal trends and were able to predict spikes in cholera cases, the case and death report time series were made stationary through temporal differencing. It should be noted that the country of Yemen encompasses 21 governorates or administrative divisions. While the CALM models were trained on data from all 21 governorates, data preparation on each governorate was performed separately to preserve each governorate’s unique time series. As the interval between each WHO cholera case/death report was not standard, the data was linearly interpolated into a daily time series. The Yemeni Cholera outbreak is seasonal and endemic as outbreaks spike during the rainy season (April-August) - however, the outbreaks rely on non-seasonal factors such as conflict and damage to health and sanitation (Camacho et al., 2018a).  Parsing data required finding the number of new cholera cases in a single day, given the total number of cases in the previous day. The values were then normalized by the population of each governorate (e.g new cases per 10,000 people). Finally, we calculated our four target variables: the number of new cholera cases 0-2 weeks from the present day, 2-4 weeks from the present, 4-6 weeks from the present, and 6-8 weeks from the present.
+Past cholera cases and deaths were included with the assumption that they would be predictive of future cases. Vibrio cholerae thrives in aquatic environments and can transfer between humans through the transfer of bodily fluids. Thus, the incidence of cholera in one region can indicate the contamination of several food and water sources and further the spread of cholera. With this in mind, cholera case and death data from the World Health Organization (WHO) reports were included as a primary feature in CALM.
 <br><br>
 </div>
@@ Line 274: / Line 329: @@
 <div id="target2"></div>
 <div id="subheading2">
-<b>Color Q App</b>
+<b>Rainfall Data</b>
 </div>
-<br><br>
-<div style="text-align:center"><img style="width:800px;height:500px;" src="https://static.igem.org/mediawiki/2018/8/86/T--Lambert_GA--ColorQMap.jpg" /></div>
 <br><br>
 <div id="content2">
-Color Q is a free mobile application developed in Java for the Google Play Store. The app was developed in the Android Studio v3.2.1 integrated development environment. It works alongside the Chrome-Q hardware also developed by Lambert iGEM in order to effectively quantify the result of a biological reporter, similar to the function of a plate reader. The app is able to use circle detection in order to find the samples on the base of the Chrome-Q hardware and then detect the red, green, and blue values (RGB) of the center of each circle. The way the circles are arranged allow for a range of values to be generated. The first row contains 4 circles. The average RGB values of the first two circles are calculated as the negative control and the average RGB values of the second pair of circles are calculated as the positive control. The distance between the positive and negative control is calculated in the 3D-coordinate plane using the following formula:
+Analysis of surveillance data for the Yemeni cholera outbreak from 2016 to 2018 found a positive and nonlinear association between weekly rainfall and suspected cholera incidence: the relative risk of cholera 10 days after a weekly rainfall of 25 mm was found to be 42% higher than compared with a week without rain (Camacho et al., 2018). Despite the inability to establish that rainfall is directly causal to the increase in cholera outbreaks, the use of unsafe water sources during the drought season, contamination of water sources during the rainy season, and changing levels of zooplankton and iron in water (which help cholera bacteria survive), may contribute to the increasing levels of cholera during the rainy season (Camacho et al., 2018). Thus, rainfall data from NASA GPM satellites were included in CALM.
+</div>
 <br><br>
-<div style="text-align:center"><img src="https://static.igem.org/mediawiki/2018/e/eb/T--Lambert_GA--distance.png" /></div>
+<div id="target3"></div>
+<div id="subheading3">
+<b>Conflict Data (Yemeni Civil War)
+</b>
+</div>
+<br><br>
+<div id="content3">
+While cholera is preventable and treatable under stable circumstances, the collapse of Yemen’s health, water, and sanitation sectors amidst the ongoing armed conflict have fueled the spread of cholera across the country, and with direct attacks against hospitals and the bombing of water supplies, the conflict has dissolved 55% of the country's medical, wastewater, and solid waste management infrastructure, making access to clean water and healthcare difficult and expensive (Camacho et al, 2018; Yemen’s Cholera Crisis: Fighting Disease During Armed Conflict, 2017; Yemen: The Forgotten War, 2018). The number of daily casualties due to conflict in each Yemeni governorate was used as a metric for civil war related violence.
 <br>
-The Hough Circle Transform is used as the method of circle detection found in the app. Open Computer Vision (OpenCV) is a library that can be imported into Android Studio in order to perform image analysis-based methods. The image taken by the smartphone camera is converted into a grayscale photo. This essentially makes the image more readable in terms of edge detection and "round" estimation. There are several parameters that can be modified and calibrated in order to detect an accurate amount of circles:
+<div style="text-align:center"><img src="https://static.igem.org/mediawiki/2018/a/ab/T--Lambert_GA--approach.png"></div>
+<br>
+<div style="font-size:12px; text-align:center;"><i>Diagram of Lambert's Approach</i></div>
 <br><br>
-<ul>
+</div>
-<li><i>Maximum Radius: The smallest value for the radius of a detected circle</i></li>
+<div id="target4"></div>
-<li><i>Minimum Radius: The largest value for the radius of a detected circle</i></li>
+<div id="subheading4">
-<li><i>Minimum Distance: The smallest distance between the centers of any two detected circles</i></li>
+<b>DATASET PREPARATION</b>
-<li><i>Edge Gradient Value: The roundness of each detected circle</i></li>
+</div>
-<li><i>Threshold Value: The amount of memory the system has to store the detected circles</i></li>
-</ul>
 <br>
-<div style="text-align:center;">
+<br>
-<img style="text-align:left;vertical-align:top;width:337px;height:599px;" src="https://static.igem.org/mediawiki/2018/e/ea/T--Lambert_GA--display.png">
+<div id="content4">
-<img style="text-align:right;vertical-align:top;width:337px;height:599px;" src="https://static.igem.org/mediawiki/2018/e/e1/T--Lambert_GA--results.png">
+In order to produce models that did not solely rely on seasonal trends and were able to predict spikes in cholera cases, our objective became to predict new cholera cases in any given governorate in Yemen from week to week. With this objective, the case and death report time series were made stationary through temporal differencing. Our four target variables were also calculated: the number of new cholera cases 0-2 weeks from the present day, 2-4 weeks from the present, 4-6 weeks from the present, and 6-8 weeks from the present.
+<br><br>
+Our dataset was split into three portions: training, cross-validation, and a hold-out test set. The hold-out set was left untouched until the completion of our methods to provide an accurate real-world simulation of our models’ performance. Our cross-validation dataset was used with a rolling window forecast (see supplementary section - methods for more information) for feature selection and hyperparameter optimization.
+<br><br>
+<div id="target5"></div>
+<div id="subheading5">
+<b>Feature Engineering</b>
 </div>
 <br><br>
-The 6 by 6 grid located underneath the first row can be loaded with experimental samples and a percentage value can be determined on a scale from the negative control to the positive control. The circle detection process loops through until the maximum radius reaches 120 pixels. If anywhere from 35 to 40 circles are detected in total, then the loop stops. However, if there are fewer circles detected, then the loops restarts to finish through the maximum radius until anywhere from 25 to 40 circles are detected properly. If less than 25 circles are detected, then an error is caught and another picture is requested to be used. Zooming in or zooming out could possibly make the circle detection process easier for the system and more efficient. The following formula is used to calculate the relative percentage values:
+<div id="content5">
+Feature engineering is at the core of applied machine learning, and so we went through an exhaustive feature extraction and selection process in order to arrive at our final features. We extracted 45,000 potentially relevant features using the tsFresh package, which calculates an expansive array of time series features on our data (Christ et al., 2018). The objective of calculating these many features was the hope to capture ideal representations of our data: while the majority of these features would not be used in the final model, our coverage of this expansive set of possible features allowed us to ensure the best ones would be found. We also calculated features over multiple time frames and for geographically neighboring governorates. Through demanding feature selection process (see supplementary section - methods for more information), we were able to arrive at the best 30-50 features for each time-range model. All in all, we were able to remove ~99.9% of our original features.
 <br><br>
+<div style="text-align:center"><img src="https://static.igem.org/mediawiki/2018/d/d3/T--Lambert_GA--CALMFeatureTuningResults.png"></div>
+<br>
+<div style="font-size:12px; text-align:center"><i>Results of Feature Tuning</i></div>
-<div style="text-align:center"><img src="https://static.igem.org/mediawiki/2018/0/0b/T--Lambert_GA--value.png"></div>
 <br>
+</div>
-The results are then displayed based upon the row in which the circles fall in on the base of the Chrome-Q hardware. The app is able to determine the row in which the circles fall in by comparing y-coordinates. If the y-values are similar to each other, then the circles are classified as being on the same row. The relative values are then transferred to another page within the app where the user is able to enter information that could help contribute to our machine learning model, CALM. The application uses the latitude, longitude, and timestamp values obtained from the phone's GPS to effectively determine where and when the test was run. When the user submits the data, the results are sent to a MySQL database, which is a part of the Relational Database Service (RDS) as a part of the Amazon Web Services (AWS) platform.
-</br>
+<div id="target6"></div>
+<div id="subheading6">
+<b>Model</b>
 </div>
 <br><br>
-<div id="target3"></div>
+<div id="content6">
-<div id="subheading3">
+We utilized XGBoost, a random forest-based, extreme gradient boosting algorithm, to construct each of our models. Through bootstrap aggregation, the construction of multiple (often hundreds) of decision trees that are trained on random subsets of the data and then collectively vote for the final prediction, XGBoost is able to address variance-related error (overfitting). XGBoost also addresses the converse, bias-related error (underfitting), through gradient boosting: the process by which each decision tree is constructed with a greater focus on the samples the prior trees had difficulties with (Chen and Guestrin, 2016). As opposed to simpler regression techniques utilized by previous models (see “Uniqueness of Approach” supplementary section), XGBoost is able to gain a far deeper understanding of the data through nonlinear relations (while being able to distinguish from noise), making it an ultimately more robust choice of algorithm.
-<b>CALM
-</b>
-</div>
 <br><br>
-<div id="content3">
+</div>
-There are two main components of the CALM platform; the SMS component and the machine learning component. The entirety of the platform is written in Python 3.6+, and several libraries, including the pandas, numpy, scikit-learn, beautifulsoup, xgboost, and flask libraries are utilized. In order to make predictions, the machine learning aspect of CALM ___
-To distribute SMS notifications, Michael Koohang graciously allowed Lambert iGEM to modify his RatWatch project (developed at Georgia Tech) to create CALM’s SMS component. The code for SMS distribution is located on a server using the Flask microframework for logic and computation. The Flask server interacts with Twilio’s (an SMS-survey provider) Python API in order to send out text messages to a specified population. The population’s survey results are aggregated and stored on the Flask server using pandas.
+<div id="target7"></div>
+<div id="subheading7">
+<b>Tuning</b>
+</div>
 <br><br>
-We hope to see CALM in use throughout the cholera field within the next few years as medical organizations begin using it to prevent outbreaks and better distribute medical supplies. As cholera already has a cure, a machine-learning based approach to predicting and preventing cholera, especially one that is open-source and free to use, will drastically reduce the time, energy, and money required to treat an infected population. Finally, we believe the CALM project will not only treat millions of people affected with cholera, but will also begin efforts to use CALM’s foundation to predict other diseases such as malaria and parasitic infections.
+<div id="content7">
-<br><br>
+Hyperparameters are characterized as those whose value is set before the learning process begins, and so can greatly affect a model’s performance. We utilized Bayesian Optimization to find optimal hyperparameters for our model. In contrast with a brute-force search over a defined set of hyperparameters, Bayesian Optimization tracks prior evaluations to form probabilistic assumptions on an objective function given a set of hyperparameters, allowing informed choices to be made on which hyperparameters to try (Snoek et al., 2012). This allowed us to converge at optimal hyperparameters with far greater efficiency.
-CALM began as a subcomponent of Lambert’s 2018 project and rapidly developed throughout the beginning of the 2018 season. In late May Lambert participated in the Day One Challenge, an Atlanta-based AI competition, and won. Through further collaboration and outreach with the Day One organization Lambert has been able to receive feedback and advice from professionals in a variety of fields, such as epidemiology, computer science, machine learning, and business. As CALM develops further, we hope to not only see other teams adopt the platform to address other issues, but also for healthcare organizations across the world to utilize CALM and adapt it to other diseases.
 <br><br>
+</div>
+</br>
+</div>
 </div>