MODELING
CALM MODEL
OVERVIEW
The ongoing Yemeni cholera outbreak, triggered by a devastating civil war, has been deemed “the worst outbreak in history”, with the primary cause of mortality being the inefficient allocation of medicines. This resource allocation problem persists because there is no comprehensive forecast detailing when, where, and with how many cases cholera will strike; such a forecast would allow for the targeted distribution of relief supplies and outbreak mitigation. In response to this problem, we have constructed CALM, the Cholera Artificial Learning Model, a system of four extreme-gradient-boosting (XGBoost) machine learning models that, in tandem, forecast the number of cholera cases (with an error margin of 4.787 cases per 10,000 people) any given Yemeni governorate will experience over multiple time intervals ranging from 2 weeks to 2 months.
MODEL
We utilized XGBoost, a decision-tree-based extreme gradient boosting algorithm, to construct each of our four models. The task was time series forecasting, framed as regression.
Figure 1: XGBoost Stream.
XGBoost constructs an ensemble of multiple (often hundreds of) decision trees. By training each tree on a random subset of the samples and features, an idea borrowed from bootstrap aggregation, and by regularizing tree complexity, XGBoost is able to address variance-related error (overfitting). XGBoost also addresses the converse, bias-related error (underfitting), through gradient boosting: each new decision tree is fit to the errors left by the prior trees, so that it focuses on the samples those trees had difficulties with, and the outputs of all trees are summed to give the final prediction (Chen and Guestrin, 2016). As opposed to the simpler regression techniques utilized by other models (see Uniqueness of Approach), XGBoost is able to capture nonlinear relations in the data while distinguishing them from noise, making it a more robust choice of algorithm.
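The boosting idea can be illustrated with a minimal sketch in plain Python. This is not XGBoost itself, which adds second-order gradients, regularization, and subsampling; it is a toy booster over one feature that uses one-split trees ("stumps") and squared-error loss, where the negative gradient is simply the residual. All function names here are hypothetical.

```python
from statistics import mean

def fit_stump(xs, residuals):
    """Find the single threshold split on x that minimizes squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_value, right_value = mean(left), mean(right)
        error = (sum((r - left_value) ** 2 for r in left)
                 + sum((r - right_value) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_value, right_value)
    return best[1:]  # (threshold, left_value, right_value)

def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value

def fit_boosted(xs, ys, n_rounds=50, learning_rate=0.3):
    """Each round fits a stump to the residuals left by all earlier rounds."""
    base = mean(ys)
    predictions = [base] * len(xs)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * predict_stump(stump, x)
                       for p, x in zip(predictions, xs)]
    return base, stumps, learning_rate

def predict_boosted(model, x):
    """The final prediction is the base value plus the scaled sum of all stumps."""
    base, stumps, learning_rate = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)
```

Because every round shrinks the remaining residual by a fixed fraction, the training error of this toy model falls as rounds are added, which is the bias-reducing behavior described above.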
We constructed four separate models for four forecast ranges: 0-2 weeks, 2-4 weeks, 4-6 weeks, and 6-8 weeks in the future. The combination of these four models allows us to produce a comprehensive cholera forecast, detailing exactly how many new cases each Yemeni governorate will experience in each of the aforementioned time frames.
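As a minimal sketch of how one regression target per forecast range could be built, the helper below sums the new cases that fall in each two-week window after a given date. It assumes a weekly case series for a single governorate; the weekly granularity, the function name, and the label format are assumptions made for illustration, not details taken from the CALM code.

```python
def horizon_targets(weekly_cases, week_index,
                    horizons=((0, 2), (2, 4), (4, 6), (6, 8))):
    """Sum the new cases in each forecast window that follows week_index."""
    targets = {}
    for start, end in horizons:
        first = week_index + 1 + start
        window = weekly_cases[first:first + (end - start)]
        # A window that runs past the end of the data has no complete target.
        complete = len(window) == end - start
        targets[f"{start}-{end} weeks"] = sum(window) if complete else None
    return targets
```

Each of the four models would then be trained against one of these targets, with windows near the end of the record left out because their true values are not yet known.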
Figure 2: Diagram of CALM Conceptual Structure.
DATA
Lambert iGEM used multiple unique time series in the design of CALM. Features were calculated for each governorate and for the governorates neighboring it. Features were also calculated over multiple time frames: 8 weeks prior, 6 weeks prior, 4 weeks prior, 2 weeks prior, and 1 week prior. The data from which features were calculated include conflict fatalities, rainfall, past cases, and past deaths.
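A small sketch of the two ideas in this paragraph, trailing time frames and neighbor averaging, is given below. It interprets "N weeks prior" as a trailing window ending at the current week and uses a simple sum as the feature; both are assumptions, as are the function names and the placeholder governorate codes used with them.

```python
def lagged_sums(series, week_index, lags=(8, 6, 4, 2, 1)):
    """Sum the values over the `lag` weeks ending at week_index (inclusive)."""
    features = {}
    for lag in lags:
        start = max(0, week_index - lag + 1)
        features[f"sum_{lag}w"] = sum(series[start:week_index + 1])
    return features

def neighbor_mean_features(series_by_gov, neighbors, gov, week_index):
    """Average each lagged feature over the governorates bordering `gov`."""
    per_neighbor = [lagged_sums(series_by_gov[n], week_index)
                    for n in neighbors[gov]]
    return {
        f"neighbor_{name}": sum(f[name] for f in per_neighbor) / len(per_neighbor)
        for name in per_neighbor[0]
    }
```

The same pattern would apply to any of the source series (conflict fatalities, rainfall, past cases, past deaths), with the adjacency map supplied separately.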
FEATURE ENGINEERING
Feature engineering is the crux of applied machine learning, and so we went through an exhaustive feature extraction and selection process in order to arrive at our final features. First, we extracted 45,000 potentially relevant features using the tsFresh package, which calculates an expansive array of time series features on our data (Christ et al., 2018). The objective of calculating this many features was to capture ideal representations of our data: while the majority of these features would not be used in the final model, covering such an expansive set allowed us to ensure the best features would be found. We also calculated features over a series of overlapping time frames in order to provide varying frames of reference and lags: 8 weeks prior, 6 weeks prior, 4 weeks prior, 2 weeks prior, and 1 week prior. Features describing geographically neighboring governorates (through taking the mean) were also calculated.
While having more data is usually beneficial, in this case the number of features far outnumbered our training examples. Therefore, a demanding feature selection process was required. Using tsFresh’s scalable hypothesis tests with a false discovery rate of 0.001, we identified the features statistically relevant to each time-range prediction, providing us with four sets of ~15,000 features, one for each time-frame prediction. Next, we removed collinear features, or those that were at least 97% correlated with another feature, as these features would be redundant to our model. This provided us with sets of ~10,000 features to narrow further. We trained and tuned an extreme gradient boosting model, XGBoost, to rank the features in order of importance for each time-range prediction. Utilizing the ranking produced, we added features one at a time, keeping each only if it reduced our cross-validation loss (the root mean square error across all five cross-validation folds). This allowed us to arrive at the best 30-50 features for each time-range. All in all, we were able to remove ~99.9% of our original features.
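The last two selection stages can be sketched in plain Python as follows. This is an illustrative outline under stated assumptions, not the CALM code: `drop_collinear` keeps the first feature of any highly correlated pair, and `forward_select` takes a caller-supplied `cv_loss` function (hypothetical) that would return the cross-validated RMSE of a model trained on a given feature list.

```python
from math import sqrt

def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if one is constant)."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / sqrt(var_a * var_b)

def drop_collinear(columns, threshold=0.97):
    """Keep a feature only if it is not too correlated with any kept feature."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept

def forward_select(ranked_features, cv_loss):
    """Add features in importance order, keeping each only if CV loss drops."""
    selected, best_loss = [], float("inf")
    for feature in ranked_features:
        loss = cv_loss(selected + [feature])
        if loss < best_loss:
            selected.append(feature)
            best_loss = loss
    return selected, best_loss
```

Walking the ranking once from most to least important keeps the number of model fits linear in the number of candidate features, which matters when roughly 10,000 candidates remain.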
TUNING
We utilized Bayesian Optimization to find optimal hyperparameters for our model. In contrast with a brute-force search over a defined set of hyperparameters, Bayesian Optimization uses prior evaluations to build a probabilistic model of the objective function given a set of hyperparameters, allowing informed choices to be made on which hyperparameters to try next (Snoek et al., 2012). This allowed us to converge on optimal hyperparameters with far greater efficiency.
RESULTS
Our models are able to predict the number of cases any given governorate in Yemen will experience across multiple two-week intervals, with all of our models predicting within a margin of 5 cholera cases per 10,000 people on the hold-out set. Hold-out error represents our model’s performance in real-world simulation, as the hold-out dataset was left untouched until final model evaluation. Our cross-validation error, similarly low, represents our model’s performance on a reliable, but not entirely untouched dataset, as the cross-validation dataset was used for hyperparameter tuning and feature selection. The mean number of cases any given governorate in Yemen experienced within a two-week span was approximately 19.148, with a standard deviation of 21.311. Because, in real-world simulation, all four of our predictive models predict to within roughly ⅕ of a standard deviation of the number of cases, our predictions are robust and reliable across all time frames. However, as our predictive time frame moves farther into the future, the cross-validation error decreases and the hold-out error increases. This could be seen as a sign of marginal overfitting, but can also be attributed to the time-shift in the data as the predictive range moves farther ahead: cholera 6-8 weeks ahead of a given date can look different from cholera 2-4 weeks ahead, though 4 weeks later the 2-4 week model will see the 6-8 week data.
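The "⅕ of a standard deviation" comparison can be checked with a short worked calculation. It assumes the 4.787 cases per 10,000 quoted in the overview is a hold-out root mean square error and that the mean and standard deviation above are in the same units; the symbols are introduced here for illustration.

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2},
\qquad
\frac{\mathrm{RMSE}}{\sigma} \approx \frac{4.787}{21.311} \approx 0.225
```

Here \hat{y}_i is a forecast, y_i the observed two-week case count, and \sigma the standard deviation of the observed counts. A ratio of about 0.225 is slightly above one-fifth, consistent with the statement above.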
Figure 3: XGBoost Predictions of New Cases 0 to 2, 2 to 4, 4 to 6, and 6 to 8 Weeks in Advance for Five Governorates: Our forecasts for each time frame vs a sliding window of real cases. The window represents a two-week interval corresponding to the forecast range, sliding over the data and summing up the cholera cases falling in the interval into a single data point. For example, in the 0-2 week forecast plot, on September 15th we predicted there would be 92 new cholera cases per 10,000 people in the next two weeks in the governorate of YE-SN (Sana’a). Then, two weeks later, shown as the red true-value, Sana’a had experienced ~92 cases. However, the date for the true-value datapoint remains September 15th, as the value describes the number of cases 0-2 weeks in the future. The red value refers to the true value, or the number of new cholera cases actually experienced by the respective governorate in the corresponding time range (2-4 weeks from present, 4-6 weeks from present, etc.). Cross-validation predictions (green) were completed with a rolling-window method (see Methods), and the hold-out predictions (blue) were done separately, with the model training on all data previous to the hold-out set.
Our four-model system is able to accurately and comprehensively forecast cholera outbreaks across 21 governorates exhibiting heterogeneous behaviors. Five governorates were chosen to represent the entire range of behavior exhibited by all 21 of Yemen’s governorates. YE-AM (Amran), YE-DA (Dhale), and YE-MW (Al Mahwit) experienced some of the greatest cumulative numbers of cholera cases from May 22nd to February 18th, making them three of the four governorates most affected by cholera overall in the given time frame. It should be noted that the 3rd most affected governorate, YE-SA (Amanat Al Asimah), was not included, as it is technically a municipality enclosed by the YE-SN governorate (which is already included in the 5 governorates chosen). YE-RA (Raymah) and YE-SN (Sana’a) were chosen due to the sudden spike in cholera cases both experienced. While it was affected less than other governorates, between January and February (part of the hold-out set), YE-RA saw a sudden increase in cholera cases, peaking at ~20 new cases every two weeks (per 10,000 people). YE-SN experienced the most sudden outbreak of all governorates, with incidence rates reaching as high as 92 cases every two weeks (per 10,000 people) between mid-September and October. With this representative sample of five governorates, we show that CALM performs well on multiple governorates exhibiting a diverse range of behaviors.
It is worth noting that YE-SN and YE-RA present a rare and interesting case: a sudden outbreak. As our predictive range increases, it becomes more difficult to predict sudden spikes, due to either a lack of information many weeks prior or the events preceding a sharp outbreak not having occurred yet. As a result, longer-range models seem to predict sharp outbreaks with a certain lag. This can be seen in the 4-6 week model’s forecast of the YE-RA outbreak, which shows an upward trend but not a full spike. However, this is where the combination of all four of our models becomes most useful. While long-range models cannot easily predict outbreaks, our shorter-range models are able to pick up the slack once the outbreak draws closer. Specifically, our 0-2 week forecasting model is able to predict the incidence spikes in YE-RA and YE-SN, so even if our long-range models were unable to predict an outbreak immediately, we would still detect it at a later date.
ElectroPen Model
Voltage outputs from any piezoelectric-based system are dependent on the force exerted on the crystal, which reorganizes the ions in the crystal and creates the corresponding output [32]. In order to characterize the capability of a small lighter to generate high voltages (on the order of kilovolts), the underlying mechanism within the lighter must first be described. The mechanism (which will henceforth be referred to as the “hammer action”) comprises two springs, a hammer (the metal piece striking the crystal), and the PZT crystal itself connected to a metal conductor. The hammer action functions in three phases: a loading phase, a release phase, and a relaxation phase, each directed by various components. During the loading phase, the hammer is held in a locked position as the lower and upper springs are compressed by the force the user exerts. The entire upper portion of the casing moves upwards, while the hammer is still locked in a loading position. The time interval of the loading phase is dependent on the force exerted by the user, and has no effect on the output voltage, as the hammer's movement is dependent on the spring release. In the release phase, once the action approaches its critical point, the lower casing with a connected wedge pushes the hammer out of the latch while the spring is at maximum compression, beginning to release the hammer. The hammer then switches from the locked to the unlocked state, allowing the lower spring to extend and project the hammer towards and onto the piezoelectric crystal, with the upper spring remaining compressed. The relaxation phase then consists of the user pulling back, extending the upper spring and forcing the hammer downwards into its original state with no effect on the lower spring. Analysis of high-speed videos of the hammer releasing indicates that the hammer is able to reach a maximum velocity of 8 m/s and a peak acceleration of 30,000 m/s².
Further analysis indicates this rapid acceleration produces jerk of up to 300,000,000 m/s³ from this small hammer action, indicating the extreme nature of the design and allowing for the production of a powerful resultant force striking the crystal. In an effort to characterize the correlation between the experimentally obtained voltage outputs using the ElectroPen and the theoretical outputs, a piezoelectric static voltage theoretical model was used. Through this model and the data values established as constants, the theoretical voltage output of the crystal found within the lighter is a maximum of 2699 Volts, with a lower output under normal conditions due to the strain on the crystal from expanding towards its maximum length, as well as the resistance caused by the copper wires and metal conductors in the lighter. As the values measured from the described lighter are of the same order of magnitude, the theoretical model is consistent with the values obtained in the experimental trials.
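The velocity, acceleration, and jerk figures above come from differentiating the hammer's tracked position over time. A minimal sketch of that calculation is shown below, assuming evenly spaced position samples taken from the video; the function names and the frame rate passed in are hypothetical, and no smoothing is applied, although real tracking data would usually need it before differentiating three times.

```python
def finite_difference(values, dt):
    """Forward-difference derivative of evenly sampled values."""
    return [(b - a) / dt for a, b in zip(values, values[1:])]

def hammer_kinematics(positions_m, frame_rate_hz):
    """Peak speed, acceleration, and jerk from tracked positions (needs >= 4 samples)."""
    dt = 1.0 / frame_rate_hz
    velocity = finite_difference(positions_m, dt)
    acceleration = finite_difference(velocity, dt)
    jerk = finite_difference(acceleration, dt)
    return {
        "peak_velocity": max(abs(v) for v in velocity),
        "peak_acceleration": max(abs(a) for a in acceleration),
        "peak_jerk": max(abs(j) for j in jerk),
    }
```

Each differentiation step amplifies tracking noise, so the jerk estimate is the least certain of the three quantities.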
After conducting electroporation trials with the described protocol (refer to Protocols below), growth of colonies as well as GFP expression were analyzed. Plates with growth were isolated and inoculated into liquid cultures, which were quantified using the plate reader. GFP expression levels from the ElectroPen were then compared with the outputs from an industrial electroporator (BioRad MicroPulser). With the negative control serving as the baseline for validating positive expression of GFP, there is a clear and significant difference in fluorescence intensity (represented as Fluorescence/OD600) between the negative and positive controls, confirming that the GFP-encoding DNA was successfully electroporated. The experimental samples were electroporated using the ElectroPen, and the positive control using the BioRad electroporator. The fluorescence intensity values for the ElectroPen trials are similar to the outputs produced by the standard electroporator, indicating successful electroporation, uptake of DNA, and expression of GFP by the E. coli bacteria. The transformation efficiency obtained in the experimental trial (similar to that of the BioRad MicroPulser) additionally indicates the functionality of the ElectroPen, presenting it as a powerful device for in-field and low-resource settings.
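The two quantities compared in this analysis can be written as short helper functions. This is a sketch of the conventional definitions, not the team's analysis script: fluorescence is normalized by optical density after subtracting a media blank, and transformation efficiency is taken as colony-forming units per microgram of DNA, corrected for the fraction of the transformation that was plated. All names and the example numbers in the usage line are illustrative.

```python
def normalized_fluorescence(fluorescence, od600,
                            blank_fluorescence=0.0, blank_od600=0.0):
    """Blank-corrected fluorescence per unit optical density (Fluorescence/OD600)."""
    corrected_od = od600 - blank_od600
    if corrected_od <= 0:
        raise ValueError("OD600 must exceed the blank reading")
    return (fluorescence - blank_fluorescence) / corrected_od

def transformation_efficiency(colonies, dna_ug, fraction_plated=1.0):
    """Colony-forming units per microgram of DNA."""
    if dna_ug <= 0 or fraction_plated <= 0:
        raise ValueError("DNA amount and plated fraction must be positive")
    return colonies / (dna_ug * fraction_plated)
```

For example, `transformation_efficiency(50, 0.001, 0.1)` describes 50 colonies from 1 ng of DNA with one tenth of the culture plated, which works out to 5 × 10⁵ CFU/µg.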
Figure 4: Mechanical model of the ElectroPen. a Depiction of the piezoelectric ignition found within a conventional lighter. b Parts of the mechanism, including from top to bottom: metal conductor housing crystal, loading/relaxation phase springs, hammer, release phase spring, and applied force casing. c Image trace of the hammer action throughout the three phases. The point of focus (marked in red) is the hammer arm. d Position graph of the hammer arm in comparison with the applied force casing through the three separate phases. e-g Motion of hammer during the release phase. e Displacement of hammer arm during release phase. f Velocity of the hammer during the release phase, reaching a peak of 8 m/s. g Acceleration of the hammer during the release phase, reaching a peak of 30,000 m/s², producing a corresponding force of 10 N.
Figure 5: Derived mathematical model for calculating theoretical maximum voltage for the ElectroPen. Using previously described models, a final model was obtained incorporating dimensions and the disc-orientation of the ElectroPen, resulting in an amplified voltage output. Using established parameters for piezoelectric constants, we arrived at a theoretical maximum of 2.699 kV, with the output of our system expected to be lower due to strain constraints and wire resistance.
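For orientation, the textbook open-circuit relation for a piezoelectric element loaded along its thickness is sketched below. This is not the derived model of Figure 5, which is not reproduced in the text and includes additional geometry and disc-orientation terms; the symbols are introduced here as assumptions.

```latex
V_{\mathrm{oc}} = g_{33}\,\frac{F\,t}{A},
\qquad
F = m\,a
\;\Rightarrow\;
m = \frac{F}{a} \approx \frac{10\ \mathrm{N}}{30{,}000\ \mathrm{m/s^2}} \approx 3.3\times10^{-4}\ \mathrm{kg}
```

Here g_{33} is the piezoelectric voltage constant, F the force applied along the poling axis, t the crystal thickness, and A the electrode area. The second relation uses the force and peak acceleration reported in Figure 4, which together imply an effective hammer mass of roughly a third of a gram.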
Figure 6: a Fluorescence expression from the trials conducted with the electroporator and ElectroPen. GFP expression from the ElectroPen is comparable to that from the electroporator (BioRad MicroPulser), indicating successful transformation and uptake of DNA. b-c Efficiency of the trials. The ElectroPen has transformation efficiencies similar to those of the electroporator, demonstrating its functionality.