M E T H O D S

With the objective of predicting new cholera cases in any given governorate of Yemen from week to week, several steps were taken to prepare the data. To produce models that did not simply rely on seasonal trends and were able to predict spikes in cholera cases, the case and death report time series were made stationary through temporal differencing. Yemen comprises 21 governorates (administrative divisions). While the CALM models were trained on data from all 21 governorates, data preparation was performed on each governorate separately to preserve its unique time series. As the interval between WHO cholera case/death reports was not standard, the data was linearly interpolated into a daily time series. The Yemeni cholera outbreak is seasonal and endemic: outbreaks spike during the rainy season (April-August), but they also depend on non-seasonal factors such as conflict and damage to health and sanitation infrastructure (Camacho et al., 2018a). Parsing the data required finding the number of new cholera cases on a single day, given the cumulative number of cases through the previous day. The values were then normalized by the population of each governorate (e.g., new cases per 10,000 people). Finally, we calculated our four target variables: the number of new cholera cases 0-2 weeks from the present day, 2-4 weeks from the present, 4-6 weeks from the present, and 6-8 weeks from the present.
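To make the pipeline concrete, here is a minimal sketch in Python/pandas of the per-governorate preparation described above. The DataFrame layout, column names, and function name are assumptions for illustration, not the project's actual code.

```python
import pandas as pd

def prepare_governorate(reports: pd.DataFrame, population: int) -> pd.DataFrame:
    """Illustrative preparation for one governorate's WHO cholera reports.

    Assumes `reports` has a DatetimeIndex and a cumulative 'cases' column
    (hypothetical names); death reports would be handled the same way.
    """
    # Resample the irregularly spaced reports to a daily series and
    # linearly interpolate the cumulative case counts.
    daily = reports.resample("D").mean().interpolate(method="linear")

    # Difference the cumulative series to get new cases per day, which
    # also makes the series (approximately) stationary.
    daily["new_cases"] = daily["cases"].diff()

    # Normalize by population: new cases per 10,000 people.
    daily["new_cases_per_10k"] = daily["new_cases"] / population * 10_000

    # Target variables: new cases 0-2, 2-4, 4-6, and 6-8 weeks ahead.
    s = daily["new_cases_per_10k"]
    for start, end in [(0, 14), (14, 28), (28, 42), (42, 56)]:
        # Sum of daily new cases in the future window [start, end) days.
        daily[f"target_{start}_{end}d"] = [
            s.iloc[i + start : i + end].sum() if i + end <= len(s) else None
            for i in range(len(s))
        ]
    return daily
```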
Our dataset was split into three portions: training, cross-validation, and a hold-out test set. The hold-out set was left untouched until the completion of our methods to provide an accurate real-world simulation of our models’ performance. Our base training set was defined from July 1 to August 15. While WHO reports extended back as far as May 22, we chose to start on July 1 in order to have enough prior data for feature calculation. Our cross-validation dataset was defined from August 15 to November 10. Finally, our hold-out set started on November 11 and extended to a final date in January/February, which varied for each target variable depending on its range: a 6-8 week forecast implies a larger gap between the current and forecast dates than a 2-4 week forecast, so the 6-8 week hold-out set ends earlier than the 2-4 week one. The cross-validation set may seem to outweigh the training set significantly, but this was mitigated by the use of a rolling window forecast, a gold standard for cross-validation in time series forecasting. Rolling window cross-validation is most easily understood with an example. Given a dataset spanning four weeks, a rolling window forecast would dictate that we train on the first week and predict on the second, then train on the first two weeks and predict on the third, and finally train on the first three weeks and predict on the fourth. In this example, the first week is the base training set (it was never predicted on and was included in the training set of each fold), the second and third weeks are the cross-validation set (they varied between prediction and training sets), and the fourth week is the hold-out set (it was never trained on). Our five cross-validation sets were defined as follows: August 16 to August 31, August 31 to September 15, September 15 to September 30, September 30 to October 15, and finally October 15 to October 30 (the final fold also included data from October 30 to November 10 in its prediction set, though this does not cross into the hold-out set). The cross-validation sets were used to select features and find optimal hyperparameters for our model, and the hold-out set was used to simulate the real-world performance of our model.
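A minimal sketch of these rolling-window folds, assuming a pandas DataFrame indexed by date; the 2017 calendar year is inferred from the dates above and the fold boundaries are taken directly from the text:

```python
import pandas as pd

# Fold boundaries from the text; each fold trains on everything from
# July 1 up to the fold's start and predicts the fold itself.
FOLDS = [
    ("2017-08-16", "2017-08-31"),
    ("2017-08-31", "2017-09-15"),
    ("2017-09-15", "2017-09-30"),
    ("2017-09-30", "2017-10-15"),
    ("2017-10-15", "2017-11-10"),  # final fold predicts through Nov 10
]

def rolling_window_folds(df: pd.DataFrame):
    """Yield (train, validation) pairs for rolling-window cross-validation."""
    for start, end in FOLDS:
        train = df.loc["2017-07-01":start]  # all data before the fold
        valid = df.loc[start:end]           # the window being predicted
        yield train, valid
```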
The app uses the Hough Circle Transform for circle detection. Open Computer Vision (OpenCV) is a library that can be imported into Android Studio to perform image analysis. The image taken by the smartphone camera is first converted to grayscale, which makes edges easier to detect and circle candidates easier to estimate. Several parameters can be modified and calibrated to detect an accurate number of circles (see the sketch after this list):
- Maximum Radius: The largest allowed radius of a detected circle
- Minimum Radius: The smallest allowed radius of a detected circle
- Minimum Distance: The smallest allowed distance between the centers of any two detected circles
- Edge Gradient Value: How strong an edge must be for a detected circle to count as "round"
- Threshold Value: The amount of memory the system allocates to store the detected circles
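The app performs this detection on Android with OpenCV; the Python sketch below uses OpenCV's equivalent HoughCircles API to show how the parameters above map onto a call. The numeric values are illustrative, not the app's calibrated settings.

```python
import cv2
import numpy as np

# Load the photo and convert to grayscale, as described above.
image = cv2.imread("chrome_q_plate.jpg")
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
gray = cv2.medianBlur(gray, 5)  # light smoothing helps edge detection

# Hough Circle Transform: param1 is the edge gradient value and
# param2 the detection threshold in OpenCV's naming.
circles = cv2.HoughCircles(
    gray,
    cv2.HOUGH_GRADIENT,
    dp=1,           # accumulator resolution (same as the image)
    minDist=40,     # minimum distance between detected centers
    param1=100,     # edge gradient value
    param2=30,      # detection threshold
    minRadius=20,   # smallest allowed circle radius
    maxRadius=120,  # largest allowed circle radius
)

if circles is not None:
    circles = np.round(circles[0]).astype(int)  # each row is (x, y, r)
    print(f"Detected {len(circles)} circles")
```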
The 6 by 6 grid located underneath the first row can be loaded with experimental samples, and a percentage value can be determined for each sample on a scale from the negative control to the positive control. The circle detection process loops, increasing the maximum radius until it reaches 120 pixels. If 35 to 40 circles are detected in total, the loop stops. If fewer circles are detected, the loop restarts and sweeps through the maximum radius again until 25 to 40 circles are detected. If fewer than 25 circles are detected, an error is raised and the user is asked to take another picture; zooming in or out can make the circle detection process easier and more reliable. The following formula is used to calculate the relative percentage values:
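The formula itself did not survive transcription. Assuming the percentage is a linear scale between the two controls, a plausible reconstruction (symbols are assumptions) is:

\[
\text{relative percentage} = \frac{I_{\text{sample}} - I_{\text{negative}}}{I_{\text{positive}} - I_{\text{negative}}} \times 100\%
\]

where \(I\) denotes the measured color intensity of a detected circle.

A sketch of the retry loop described above follows; the 35-40, 25-40, and 25-circle thresholds come from the text, while the radius sweep and Hough parameters are illustrative:

```python
import cv2

def detect_plate_circles(gray):
    """Sweep maxRadius up to 120 px and apply the acceptance rules above."""
    for accept_low, accept_high in [(35, 40), (25, 40)]:  # strict, then loose
        for max_radius in range(40, 121, 10):
            circles = cv2.HoughCircles(
                gray, cv2.HOUGH_GRADIENT, dp=1, minDist=40,
                param1=100, param2=30, minRadius=20, maxRadius=max_radius,
            )
            count = 0 if circles is None else circles.shape[1]
            if accept_low <= count <= accept_high:
                return circles  # enough circles found; stop the sweep
    # Fewer than 25 circles on both passes: request a new photo.
    raise ValueError("Fewer than 25 circles detected; please retake the photo")
```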
The results are then displayed based on the row in which the circles fall on the base of the Chrome-Q hardware. The app determines a circle's row by comparing y-coordinates: if the y-values are similar, the circles are classified as being in the same row. The relative values are then transferred to another page within the app, where the user can enter information that could help contribute to our machine learning model, CALM. The application uses the latitude, longitude, and timestamp values obtained from the phone's GPS to determine where and when the test was run. When the user submits the data, the results are sent to a MySQL database hosted on the Relational Database Service (RDS), part of the Amazon Web Services (AWS) platform.
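A minimal sketch of this row-classification step, grouping circles whose y-coordinates fall within an assumed pixel tolerance:

```python
def group_by_row(circles, tolerance=25):
    """Group detected (x, y, r) circles into rows by similar y-values.

    `tolerance` (pixels) is an assumed threshold for "similar"; the app's
    actual value would come from calibration.
    """
    rows = []
    for x, y, r in sorted(circles, key=lambda c: c[1]):  # top to bottom
        # Place the circle in the first existing row whose y is close enough.
        for row in rows:
            if abs(row[0][1] - y) <= tolerance:
                row.append((x, y, r))
                break
        else:
            rows.append([(x, y, r)])  # no match: start a new row
    # Sort each row left to right so columns line up with the 6x6 grid.
    return [sorted(row, key=lambda c: c[0]) for row in rows]
```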
We hope to see CALM in use throughout the cholera field within the next few years as medical organizations begin using it to prevent outbreaks and better distribute medical supplies. As cholera already has a cure, a machine-learning-based approach to predicting and preventing cholera, especially one that is open-source and free to use, will drastically reduce the time, energy, and money required to treat an infected population. Finally, we believe the CALM project will not only help treat millions of people affected by cholera, but will also lay the groundwork for using CALM’s foundation to predict other diseases such as malaria and parasitic infections.
CALM began as a subcomponent of Lambert’s 2018 project and rapidly developed throughout the beginning of the 2018 season. In late May, Lambert participated in and won the Day One Challenge, an Atlanta-based AI competition. Through further collaboration and outreach with the Day One organization, Lambert has received feedback and advice from professionals in a variety of fields, such as epidemiology, computer science, machine learning, and business. As CALM develops further, we hope not only to see other teams adopt the platform to address other issues, but also for healthcare organizations across the world to utilize CALM and adapt it to other diseases.