Difference between revisions of "Team:Lambert GA/CALM RESULTS"

 
(36 intermediate revisions by 2 users not shown)
Line 153: Line 153:
 
font-size: 36px;
 
font-size: 36px;
 
font-family: 'Montserrat', sans-serif;
 
font-family: 'Montserrat', sans-serif;
 +
}
 +
#content4 {
 +
font-family: 'Lora', serif;
 +
font-size: 15px;
 +
line-height: 1.75em;
 +
 
}
 
}
 
#subheading5 {
 
#subheading5 {
 
font-size: 36px;
 
font-size: 36px;
 
font-family: 'Montserrat', sans-serif;
 
font-family: 'Montserrat', sans-serif;
 +
}
 +
#content5 {
 +
font-family: 'Lora', serif;
 +
font-size: 15px;
 +
line-height: 1.75em;
 +
 
}
 
}
 
#footer {
 
#footer {
Line 215: Line 227:
 
<div id="holder">
 
<div id="holder">
 
     <video style= "z-index:1;" id="oceanvideo"  autoplay muted loop>
 
     <video style= "z-index:1;" id="oceanvideo"  autoplay muted loop>
               <source src="https://static.igem.org/mediawiki/2018/a/a3/T--Lambert_GA--software.mp4" type="video/mp4">
+
               <source src="https://static.igem.org/mediawiki/2018/3/3c/T--Lambert_GA--dataresults.mp4" type="video/mp4">
 
               Your browser does not support video.
 
               Your browser does not support video.
 
             </video>
 
             </video>
 
     <div id="overlay"></div>
 
     <div id="overlay"></div>
 
           <div id="bigtitle">
 
           <div id="bigtitle">
               C A L M <br><br><br><br> D A T A  &  R E S U L T S
+
               C A L M <br><br><br><br> R E S U L T S
 
           </div>
 
           </div>
 
</div>
 
</div>
Line 242: Line 254:
 
<div id="sidebar">
 
<div id="sidebar">
 
<br><br>
 
<br><br>
 +
                   
 
                       <div id="link">
 
                       <div id="link">
                             <b><a style="color:black; text-decoration: none; line-height:1.1;" href="#target1">OVERVIEW</a></b>
+
                             <b><a style="color:black; text-decoration: none; line-height:1.1;" href="#target4">RESULTS</a></b>
 
                       </div>
 
                       </div>
 
<br><br>
 
<br><br>
                      <div id="link">
 
                            <b><a style="color:black; text-decoration: none; line-height:1.1;" href="#target2">COLOR Q APP</a></b>
 
                      </div>
 
<br><br>
 
                      <div id="link">
 
                            <b><a style="color:black; text-decoration: none; line-height:1.1;" href="#target3">CALM</a></b>
 
                      </div>
 
<br><br>
 
 
 
       </div>
 
       </div>
 
</div>
 
</div>
 
   <div id="maincontent">
 
   <div id="maincontent">
 
<br><br><br>
 
<br><br><br>
<div id="target1"></div>
+
 
<div id="subheading1">
+
 
<b>Cholera Epidemiological Data</b>
+
<br><br>
 +
<div id="target4"></div>
 +
<div id="subheading4">
 +
<b>Results
 +
</b>
 
</div>
 
</div>
 +
<br><br>
 +
<div id="content4">
 +
<div style="font-size:12px">Figure 1: Cross-validation and Holdout Error for four XGBoost forecasting models </div>
 +
<div style="text-align:center"><img src="https://static.igem.org/mediawiki/2018/9/90/T--Lambert_GA--CALMResults.png" style = "width:80%;"></div>
 
<br>
 
<br>
 +
<div style="font-size:12px;"><i>Cross-validation and hold-out error for each of our four models. Cross-validation error was obtained by taking the root of the mean of the model’s performance across five rolling-window cross-validation folds. Hold-out error was obtained by calculating the root mean squared error for predictions on the hold-out set only. </i></div>
 
<br>
 
<br>
<div id="content1">
+
Our models predict the number of new cholera cases any given governorate in Yemen will experience across multiple two-week intervals, with all four models predicting within a margin of 5 cholera cases per 10,000 people on the hold-out set. Hold-out error represents our <b>model’s performance in real-world simulation</b>, as the hold-out dataset was left untouched until final model evaluation. Our cross-validation error, similarly low, represents our model’s performance on a reliable but not entirely untouched dataset, as the cross-validation dataset was used for hyperparameter tuning and feature selection. The mean number of cases any given governorate in Yemen experienced within a two-week span was approximately 19.148, with a standard deviation of 21.311. Because, in real-world simulation, all four of our predictive models <b>predict to within around ⅕ of a standard deviation of the number of cases</b>, our predictions are <b>robust and reliable across all time frames</b>. However, as the predictive timeframe extends farther into the future, the cross-validation error decreases and the hold-out error increases. This could be seen as a sign of marginal overfitting, but it can also be attributed to the time shift in the data as the predictive range moves farther ahead: cholera 6-8 weeks ahead of a given date can look different from cholera 2-4 weeks ahead, though 4 weeks later the 2-4 week model will see the 6-8 week data.
&emsp;Cholera case and death statistics are reported by the World Health Organization (WHO), whose health experts and researchers work directly with Yemeni health authorities at both the country and local level. Through this direct connection, the WHO is able to record all reported cholera cases and deaths caused by cholera (WHO presence in Yemen, 2018). The data, collected by the WHO, was accessed through the Humanitarian Data Exchange (https://data.humdata.org/group/yem). It provided reports of accumulated new cholera cases and deaths per governorate from May 22, 2017, to February 18, 2018.
+
  
<br><br>
 
&emsp;Past cholera cases and deaths were included with the simple assumption that they would be predictive of future cases. Vibrio cholerae requires aquatic environments and can transfer between humans through the transfer of bodily fluids. Thus, the incidence of cholera in one region can indicate the contamination of several food and water sources and therefore indicate a further spread of cholera (Cholera - Vibrio cholerae infection).
 
<br><br>
 
  
</div>
+
<br>
 +
<br>
 +
<div style="font-size:12px">Figure 2: XGBoost Predictions of New Cases 0 to 2, 2 to 4, 4 to 6, and 6 to 8 Weeks in Advance for Five Governorates </div>
 +
<div style="text-align:center"><img src="https://static.igem.org/mediawiki/2018/8/86/T--Lambert_GA--CALMLegend.png" style = "width:30%;"></div>
 +
<div style="text-align:center"><img src="https://static.igem.org/mediawiki/2018/d/dd/T--Lambert_GA--CALMcombined_figure.png" style = "width:70%;"></div>
 +
<div style="font-size:12px;"><i>Our forecasts for each time frame versus a sliding window of real cases. The window represents a two-week interval corresponding to the forecast range; it slides over the single data points and sums the cholera cases falling within the interval. For example, in the 0-2 week forecast plot, on September 15th we predicted there would be 92 new cholera cases per 10,000 people in the next two weeks in the governorate of YE-SN (Sana’a). Over the following two weeks, Sana’a experienced ~92 cases, shown as the red true value. The true-value data point is still plotted at September 15th, because the value describes the number of cases 0-2 weeks in the future. The red value is the true value: the number of new cholera cases actually experienced by the respective governorate in the corresponding time range (2-4 weeks from present, 4-6 weeks from present, etc.). Cross-validation predictions (green) were made with a rolling-window method, as described earlier (see methods), and the hold-out predictions (blue) were made separately, with the model trained on all data prior to the hold-out set. </i></div>
  
 +
Our four-model system is able to accurately and comprehensively forecast cholera outbreaks across 21 governorates exhibiting heterogeneous behaviors. Five governorates were chosen to represent the entire range of behavior exhibited by all 21 of Yemen’s governorates. YE-AM (Amran), YE-DA (Dhale), and YE-MW (Al Mahwit) experienced some of the greatest cumulative numbers of cholera cases from May 22nd to February 18th, making them three of the four governorates most affected by cholera overall in the given timeframe. These three governorates exhibit behavior highly similar to that of the other governorates, albeit at a higher scale. Our accurate delineation of the outbreak across all four time frames in these three governorates shows the usefulness of our models in the general case. On the other hand, YE-SA and YE-RA present a rare and interesting case: a sudden outbreak. As our predictive range increases, it becomes more difficult to predict sudden spikes, either because information is lacking many weeks in advance or because the events preceding a sharp outbreak have not yet occurred. As a result, longer-range models seem to predict sharp outbreaks with a certain lag. This can be seen in the 4-6 week model’s forecast of the YE-RA outbreak, which shows an upward trend but not a full spike. However, this is where the combination of all four of our models becomes most useful. While long-range models cannot easily predict outbreaks, our shorter-range models are able to pick up the slack as the outbreak draws closer. Specifically, our 0-2 week forecasting model is able to predict incidence spikes in YE-RA and YE-SA, so even if our long-range models were unable to predict the outbreak immediately, we would still detect the outbreak at a later date.
  
<div id="target2"></div>
 
<div id="subheading2">
 
<b>Rainfall Data</b>
 
</div>
 
<br><br>
 
<div style="text-align:center"><img style="width:800px;height:500px;" src="https://static.igem.org/mediawiki/2018/8/86/T--Lambert_GA--ColorQMap.jpg" /></div>
 
<br><br>
 
<div id="content2">
 
As <i>Vibrio cholerae</i> is indigenous to aquatic environments, rainfall is a significant predictor of the transmission of cholera. In areas exposed to heavy rainfall, the collapse of sanitary and health infrastructure accelerates interaction between contaminated water and human activities, resulting in an epidemic (Jutla et al., 2013). Yemen represents this scenario: exposure to heavy rainfall, combined with the deterioration of health facilities, was followed by a surge in cholera cases (Camacho et al., 2018). Researchers publishing in The Lancet Global Health, analyzing surveillance data for the Yemen cholera outbreak from 2016 to 2018, found a positive and nonlinear association between weekly rainfall and suspected cholera incidence: the relative risk of cholera 10 days after a weekly rainfall of 25 mm is 42% higher than after a week without rain (Camacho et al., 2018). Although rainfall has not been established as a cause of the increase in cholera outbreaks, the use of unsafe water sources during the dry season, contamination of water sources during the rainy season, and changing levels of zooplankton and iron in water (which help cholera bacteria survive) may contribute to the increasing levels of cholera during the rainy season (Camacho et al., 2018). These correlations demonstrate the need to include rainfall in the machine learning model, as rainfall is a predictor of possible climate changes and the corresponding human response, and thus indicates the spread of cholera in Yemen.
 
<br><br>
 
<div style="text-align:center"><img src="https://static.igem.org/mediawiki/2018/e/eb/T--Lambert_GA--distance.png" /></div>
 
 
<br>
 
<br>
The Hough Circle Transform is the circle-detection method used in the app. Open Computer Vision (OpenCV) is a library that can be imported into Android Studio to perform image analysis. The image taken by the smartphone camera is converted to grayscale, which makes the image easier to process for edge detection and roundness estimation. Several parameters can be modified and calibrated in order to detect the correct number of circles:
 
<br><br>
 
<ul>
 
<li><i>Maximum Radius: The largest value for the radius of a detected circle</i></li>

<li><i>Minimum Radius: The smallest value for the radius of a detected circle</i></li>
 
<li><i>Minimum Distance: The smallest distance between the centers of any two detected circles</i></li>
 
<li><i>Edge Gradient Value: The roundness of each detected circle</i></li>
 
<li><i>Threshold Value: The amount of memory the system has to store the detected circles</i></li>
 
</ul>
 
 
<br>
 
<br>
<div style="text-align:center;">
+
<div style="font-size:12px">Figure 3: Cumulative Cholera Cases for 5 Governorates. </div>
<img style="text-align:left;vertical-align:top;width:337px;height:599px;" src="https://static.igem.org/mediawiki/2018/e/ea/T--Lambert_GA--display.png">
+
<div style="text-align:center"><img src="https://static.igem.org/mediawiki/2018/6/61/T--Lambert_GA--fig3.gif" style = "width:60%;"></div>
<img style="text-align:right;vertical-align:top;width:337px;height:599px;" src="https://static.igem.org/mediawiki/2018/e/e1/T--Lambert_GA--results.png">
+
</div>
+
<br><br>
+
The 6 by 6 grid located underneath the first row can be loaded with experimental samples, and a percentage value can be determined on a scale from the negative control to the positive control. The circle detection process loops until the maximum radius reaches 120 pixels. If anywhere from 35 to 40 circles are detected in total, then the loop stops. However, if fewer circles are detected, then the loop restarts and runs through the maximum radius until anywhere from 25 to 40 circles are detected properly. If fewer than 25 circles are detected, then an error is caught and another picture is requested. Zooming in or out could make the circle detection process easier and more efficient for the system. The following formula is used to calculate the relative percentage values:
+
<br><br>
+
 
+
<div style="text-align:center"><img src="https://static.igem.org/mediawiki/2018/0/0b/T--Lambert_GA--value.png"></div>
+
 
<br>
 
<br>
 
+
Other models look to forecast the total cholera cases Yemen may experience (Nishiura, 2017), with some models providing governorate-specific (or even geographically smaller) predictions (Cole, 2018). While the objective of our models is to predict new cases (so as not to simply follow linear trends, and to be able to predict outbreak spikes), we are able to convert our predictions to total cases through simple reverse differencing. Using this method, we are able to accurately delineate the course of the Yemeni outbreak. Figure 3 illustrates this on a representative sample of five Yemeni governorates.
The results are then displayed based upon the row in which the circles fall in on the base of the Chrome-Q hardware. The app is able to determine the row in which the circles fall in by comparing y-coordinates. If the y-values are similar to each other, then the circles are classified as being on the same row. The relative values are then transferred to another page within the app where the user is able to enter information that could help contribute to our machine learning model, CALM. The application uses the latitude, longitude, and timestamp values obtained from the phone's GPS to effectively determine where and when the test was run. When the user submits the data, the results are sent to a MySQL database, which is a part of the Relational Database Service (RDS) as a part of the Amazon Web Services (AWS) platform.  
+
 
+
<br>
+
  
 
</div>
 
</div>
<br><br>
 
<div id="target3"></div>
 
<div id="subheading3">
 
<b>CALM
 
</b>
 
</div>
 
<br><br>
 
<div id="content3">
 
There are two main components of the CALM platform: the SMS component and the machine learning component. The entire platform is written in Python 3.6+ and uses several libraries, including pandas, numpy, scikit-learn, beautifulsoup, xgboost, and flask. In order to make predictions, the machine learning aspect of CALM ___
 
To distribute SMS notifications, Michael Koohang graciously allowed Lambert iGEM to modify his RatWatch project (developed at Georgia Tech) to create CALM’s SMS component. The code for SMS distribution is located on a server using the Flask microframework for logic and computation. The Flask server interacts with Twilio’s (an SMS-survey provider) Python API in order to send out text messages to a specified population. The population’s survey results are aggregated and stored on the Flask server using pandas.
 
  
 
<br><br>
 
We hope to see CALM in use throughout the cholera field within the next few years as medical organizations begin using it to prevent outbreaks and better distribute medical supplies. As cholera already has a cure, a machine-learning based approach to predicting and preventing cholera, especially one that is open-source and free to use, will drastically reduce the time, energy, and money required to treat an infected population. Finally, we believe the CALM project will not only treat millions of people affected with cholera, but will also begin efforts to use CALM’s foundation to predict other diseases such as malaria and parasitic infections.
 
<br><br>
 
CALM began as a subcomponent of Lambert’s 2018 project and rapidly developed throughout the beginning of the 2018 season. In late May Lambert participated in the Day One Challenge, an Atlanta-based AI competition, and won. Through further collaboration and outreach with the Day One organization Lambert has been able to receive feedback and advice from professionals in a variety of fields, such as epidemiology, computer science, machine learning, and business. As CALM develops further, we hope to not only see other teams adopt the platform to address other issues, but also for healthcare organizations across the world to utilize CALM and adapt it to other diseases.
 
<br><br>
 
 
</div>
 
  
 
<div id="footer">
 
<div id="footer">

Latest revision as of 01:34, 18 October 2018

C A L M



R E S U L T S

Results


Figure 1: Cross-validation and Holdout Error for four XGBoost forecasting models

Cross-validation and hold-out error for each of our four models. Cross-validation error was obtained by taking the root of the mean of the model’s performance across five rolling-window cross-validation folds. Hold-out error was obtained by calculating the root mean squared error for predictions on the hold-out set only.
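
The rolling-window procedure in the caption can be sketched as follows. This is a minimal illustration, not the team's code: it assumes time-ordered samples, equal-sized blocks with a fixed-length training window, and that the per-fold "performance" is a mean squared error. The function names `rolling_window_folds` and `cross_validation_error` are hypothetical.

```python
import math

def rolling_window_folds(n_samples, n_folds=5):
    """Yield (train, validation) index ranges over time-ordered samples.

    Assumption: the series is cut into n_folds + 1 equal blocks and each
    fold trains on one block and validates on the block right after it,
    so validation data always lies later in time than training data.
    """
    block = n_samples // (n_folds + 1)
    for k in range(n_folds):
        train = range(k * block, (k + 1) * block)
        validation = range((k + 1) * block, (k + 2) * block)
        yield train, validation

def cross_validation_error(fold_mses):
    """Root of the mean of the per-fold mean squared errors (assumed metric)."""
    return math.sqrt(sum(fold_mses) / len(fold_mses))

# Example: 60 time steps give five folds of 10 training and 10 validation steps.
folds = list(rolling_window_folds(60))
```

An expanding training window (all blocks up to the validation block) would be an equally plausible reading of "rolling-window"; the key property in either case is that no fold validates on data that precedes its training data.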

Our models predict the number of new cholera cases any given governorate in Yemen will experience across multiple two-week intervals, with all four models predicting within a margin of 5 cholera cases per 10,000 people on the hold-out set. Hold-out error represents our model’s performance in real-world simulation, as the hold-out dataset was left untouched until final model evaluation. Our cross-validation error, similarly low, represents our model’s performance on a reliable but not entirely untouched dataset, as the cross-validation dataset was used for hyperparameter tuning and feature selection. The mean number of cases any given governorate in Yemen experienced within a two-week span was approximately 19.148, with a standard deviation of 21.311. Because, in real-world simulation, all four of our predictive models predict to within around ⅕ of a standard deviation of the number of cases, our predictions are robust and reliable across all time frames. However, as the predictive timeframe extends farther into the future, the cross-validation error decreases and the hold-out error increases. This could be seen as a sign of marginal overfitting, but it can also be attributed to the time shift in the data as the predictive range moves farther ahead: cholera 6-8 weeks ahead of a given date can look different from cholera 2-4 weeks ahead, though 4 weeks later the 2-4 week model will see the 6-8 week data.
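
To make the comparison between error and spread concrete, the short sketch below computes a root mean squared error and expresses an error in units of the reported standard deviation (21.311). The helper names are hypothetical, and the 5-case figure is used only as the upper bound quoted above, not as a measured error for any particular model.

```python
import math

STD_CASES = 21.311  # standard deviation of two-week case counts, from the text

def rmse(predicted, actual):
    """Root mean squared error between two equal-length sequences."""
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))

def error_in_std_units(error, std=STD_CASES):
    """Express an error as a fraction of the standard deviation."""
    return error / std

# The 5-case margin is an upper bound on the hold-out error, so this ratio
# (about 0.23) is an upper bound on the error in standard-deviation units.
upper_bound = error_in_std_units(5.0)
```

An error somewhat below 5 cases, for example about 4.3, corresponds to one fifth of a standard deviation, which is consistent with the "around ⅕" figure.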

Figure 2: XGBoost Predictions of New Cases 0 to 2, 2 to 4, 4 to 6, and 6 to 8 Weeks in Advance for Five Governorates
Our forecasts for each time frame versus a sliding window of real cases. The window represents a two-week interval corresponding to the forecast range; it slides over the single data points and sums the cholera cases falling within the interval. For example, in the 0-2 week forecast plot, on September 15th we predicted there would be 92 new cholera cases per 10,000 people in the next two weeks in the governorate of YE-SN (Sana’a). Over the following two weeks, Sana’a experienced ~92 cases, shown as the red true value. The true-value data point is still plotted at September 15th, because the value describes the number of cases 0-2 weeks in the future. The red value is the true value: the number of new cholera cases actually experienced by the respective governorate in the corresponding time range (2-4 weeks from present, 4-6 weeks from present, etc.). Cross-validation predictions (green) were made with a rolling-window method, as described earlier (see methods), and the hold-out predictions (blue) were made separately, with the model trained on all data prior to the hold-out set.
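
The sliding-window true values described in the caption can be sketched as below. This is an illustration under stated assumptions: the input is taken to be a list of new cases at daily resolution (the source does not state the sampling interval), and `forward_window_sums` is a hypothetical helper name. Each output value is indexed by the date the forecast is made, not the date the cases occur, which is why the true value for the next two weeks is plotted at the forecast date.

```python
def forward_window_sums(new_cases, start_offset, window=14):
    """Sum of new cases in [t + start_offset, t + start_offset + window) for each t.

    start_offset = 0 gives the 0-2 week target, 14 the 2-4 week target,
    28 the 4-6 week target, and 42 the 6-8 week target (daily data assumed).
    Dates too close to the end of the series to have a full window are dropped.
    """
    last_start = len(new_cases) - start_offset - window
    return [
        sum(new_cases[t + start_offset : t + start_offset + window])
        for t in range(last_start + 1)
    ]

# Toy series of 30 days with 1, 2, ..., 30 new cases per day.
toy_cases = list(range(1, 31))
target_0_2 = forward_window_sums(toy_cases, start_offset=0)
target_2_4 = forward_window_sums(toy_cases, start_offset=14)
```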
Our four-model system is able to accurately and comprehensively forecast cholera outbreaks across 21 governorates exhibiting heterogeneous behaviors. Five governorates were chosen to represent the entire range of behavior exhibited by all 21 of Yemen’s governorates. YE-AM (Amran), YE-DA (Dhale), and YE-MW (Al Mahwit) experienced some of the greatest cumulative numbers of cholera cases from May 22nd to February 18th, making them three of the four governorates most affected by cholera overall in the given timeframe. These three governorates exhibit behavior highly similar to that of the other governorates, albeit at a higher scale. Our accurate delineation of the outbreak across all four time frames in these three governorates shows the usefulness of our models in the general case. On the other hand, YE-SA and YE-RA present a rare and interesting case: a sudden outbreak. As our predictive range increases, it becomes more difficult to predict sudden spikes, either because information is lacking many weeks in advance or because the events preceding a sharp outbreak have not yet occurred. As a result, longer-range models seem to predict sharp outbreaks with a certain lag. This can be seen in the 4-6 week model’s forecast of the YE-RA outbreak, which shows an upward trend but not a full spike. However, this is where the combination of all four of our models becomes most useful. While long-range models cannot easily predict outbreaks, our shorter-range models are able to pick up the slack as the outbreak draws closer. Specifically, our 0-2 week forecasting model is able to predict incidence spikes in YE-RA and YE-SA, so even if our long-range models were unable to predict the outbreak immediately, we would still detect the outbreak at a later date.

Figure 3: Cumulative Cholera Cases for 5 Governorates.

Other models look to forecast the total cholera cases Yemen may experience (Nishiura, 2017), with some models providing governorate-specific (or even geographically smaller) predictions (Cole, 2018). While the objective of our models is to predict new cases (so as not to simply follow linear trends, and to be able to predict outbreak spikes), we are able to convert our predictions to total cases through simple reverse differencing. Using this method, we are able to accurately delineate the course of the Yemeni outbreak. Figure 3 illustrates this on a representative sample of five Yemeni governorates.
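
Reverse differencing amounts to a running sum: new-case counts for consecutive, non-overlapping intervals are added to the last known cumulative total. The sketch below is a minimal illustration with made-up numbers; `reverse_difference` and `baseline_total` are hypothetical names, not part of the CALM code.

```python
def reverse_difference(new_case_forecasts, baseline_total=0.0):
    """Turn per-interval new-case counts into cumulative totals.

    new_case_forecasts: new cases for consecutive, non-overlapping intervals
        (for example the 0-2, 2-4, 4-6, and 6-8 week forecasts).
    baseline_total: cumulative cases already observed at the forecast date.
    """
    totals = []
    running = baseline_total
    for new_cases in new_case_forecasts:
        running += new_cases
        totals.append(running)
    return totals

# Illustrative values only: 10 cumulative cases so far, then 3, 5, and 2 new cases.
example_totals = reverse_difference([3, 5, 2], baseline_total=10)
```

Because the four forecast ranges tile the next eight weeks without overlap, summing them in order gives projected totals at 2, 4, 6, and 8 weeks ahead.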