Model
Introduction
Our team generated mathematical models in order to better characterize our overall workflow. The primary task was to use these rigorous approaches in order to determine specific disease-specific markers (i.e. regions of promoter methylation that were consistent across all patients for a particular disease). Although literature provided some biomarkers, our team believed that due to the novelty of our approach, characterizing existing datasets and measurements of beta methylation values to derive a new panel of biomarkers would be more effective.
Cancer patients are known to show abnormally high levels of cell-free DNA in the plasma which is typically derived from cancer cells that have undergone apoptosis or programmed cell death. The optimal means to characterize a tumor through the methylation profile of the plasma DNA is to derive a set of genes that are all incurring high levels of hyper-methylation. Our team used measurements of 485,000 unique CpG markers to guide our DNA probe design and help characterize our understanding of the disease state fundamentals. Along the way, we expanded the modularity of our approach so that it can be used for other diseases and enabled this functionality within our biomarker discovery tool.
Key achievements from our modeling were
-
Developed a model to inform DNA probe design for any methylome data
-
Created a primer design script to generate a list of all possible methylation positions
-
Made our Random Forest algorithms publicly available to other iGEM teams
-
Expanded cross-platform functionality for any TCGA methylome data set
-
Made a significant advance in using epigenetic parameters to guide wetlab circuit design
Using Unsupervised Machine Learning to Inform our Project Design
Using a large sample size of both affected and control tissue (n=2191), a HCC classifier was constructed with high specificity and sensitivity. A classifier determines which biomarkers within a specific tissue type can be used to predict the overall disease state of the sample. The goal of these algorithms was to identify a subset of hyper-methylated promoter regions that could accurately identify tumorous tissue samples based on the methylation readouts at these individual loci. To increase sensitivity and specificity, it was critical to select a set of markers that had high specificity to avoid including healthy patients in the diagnostic outcome and to ensure all diseased patients were being accounted for.
Overview of dataset
This study involved the analysis of 485,000 unique CpG markers to generate a final set of CpG markers that were enhanced in HCC patients. The following data represent the sample characteristics in the training and validation cohort used by the algorithm. This data was integrated with bisulfite sequencing measurements from IlluminaSeq 450 microarray platform and formed the starting point of our data set. To get an extract of the methylation dataset, please email ucsdigem@gmail.com. We are not making the dataset public at the moment due to a confidentiality agreement with one of our laboratory collaborators. The following screenshots show the first and last five rows respectively of the post-processed data.
PatientType - This is the binary classifier used to distinguish whether or not the tissue sample is HCC(1) or normal(0). The data is organized such that all of the HCC patient data (1221 samples) is first and is followed by the healthy patient data (970 samples).
Explaining the Theory Behind our Model
In order for our team to develop the model, we had to make a foundational assumption that CpG markers that have a maximal difference in methylation between the two sample types would most likely demonstrate detectable methylation differences in the cfDNA of HCC patients . Our team then used Random Forest algorithms and LASSO (Least Absolute Shrinkage and Selection Operator) to further reduce the biomarker list. In selecting the markers, the t-statistic method with Empirical Bayes was used to shrink the variance and Benjamini-Hochberg procedure was used to control the false discovery rate at a significance level of 0.05. Using these methods and values, a panel of top 1000 markers were generated from which 401 with good experimental amplification profiles were selected. To generate the methylation profiles of the training and validation cohorts, we used and generated bisulfite measurements: this method changes unmethylated cytosines into thymines whereas the methylated cytosines are protected . The final ratio calculated evaluates the total amount of unaffected cytosines divided by the total number of unaffected cytosines and base-modified cytosines.
Outcomes and Analysis
The two analyses yielded overlapping markers where used to test sensitivity and specificity against the training and validation data sets.
Random Forest Analysis : # of training observations: 1752, # of test observations: 439
In a Random Forest prediction model, feature importances give a sense of which variables have the most effect in these models. The sklearn package has a .feature_importances_ attribute that returns an array of each features importance in the model. To identify the top 30 positions that contribute to the model, we take the top 30 feature weights and organize them in descending order below.
Random Forest Methylation Biomarkers - (Marker, Feature Weight)
Biomarker ID | Feature Weight |
X4.1324870 | 0.09226192627385134 |
X17.80358809 | 0.059085931369960085 |
X5.171538557 | 0.057999353985870285 |
X4.1324877 | 0.03303274292587381 |
X11.47624801 | 0.03174275980572196 |
X3.101808857 | 0.02760567525917319 |
X17.49295615 | 0.02695179769504988 |
X10.135072960 | 0.026502243602910897 |
X7.151106060 | 0.025569011566135204 |
X10.103534546 | 0.017009344232785523 |
X6.31527889 | 0.016819558850001068 |
X10.88684020 | 0.013249692773994187 |
X2.113931518 | 0.012937163887263686 |
X7.151106022 | 0.011656198731186432 |
X21.36421467 | 0.011245567212164166 |
X10.134141823 | 0.01078311019638933 |
X8.53851151 | 0.010115600105707685 |
X10.14701815 | 0.00988707517481865 |
X1.226296852 | 0.00823810513451597 |
X17.47286802 | 0.008169863614495784 |
X4.1324849 | 0.007778442275209609 |
X6.41528449 | 0.007534845972666597 |
X7.27155000 | 0.007265779808743397 |
X11.69707285 | 0.007172146198231417 |
X8.2879889 | 0.0067021680826400585 |
X8.48656343 | 0.006613567990276688 |
X18.61143902 | 0.006435928106157796 |
X4.1324842 | 0.006322098138107917 |
X6.16729606 | 0.006154748779256597 |
X12.20522379 | 0.0061085520577785464 |
Confusion Matrix Structure
Negative Test | Positive Test | Total | |
Disease Absent | True Negative (TN) | False Positive (FP) | TN + FP |
Disease Present | False Negative (FN) | True Positive (TP) | FN + TP |
Total | TN + FN | FP + TP | FP + TP |
Confusion Matrix & Classification Report: A confusion matrix is a summary of the prediction results generated from the model on a classification problem.
TP : Case where the disease is present and the test predicts the state accurately.
FP : Case where the disease is absent, but the test fails and predicts it as present.
FN : Case where disease is present, but the test fails and predicts it as absent.
TN : Case where the disease is absent and the test predicts the state accurately.
Accuracy = ( (TP + TN) / Total)
Precision = (TP / (TP + FP) )
Sensitivity = (TP / (TP+FN) )
Specificity = (TN / (TN + FP) )
F1 Score - Measure of a binary classification’s accuracy by taking a weighted average of the recall and precision values.
ROC Analysis : Receiving Operator Characteristic (ROC) curves are typically used to study the output of a classifier for a binary classification problem. In this model, the outputs were labeled as 1 for HCC patient samples and 0 for healthy patient samples. The ROC curve maps the true positive rate (y-axis) against the false positive rate (x-axis) where the top left corner is the ideal point as this indicates a true positive rate of 1 and a false positive rate of 0. We often observe and try to optimize the area under this curve (AUC) which is some value between 0 and 1. In the ROC curve generated by this random forest model, the AUC = 0.9876165690739536.
Random Forest Feature Estimator : As the name suggests, the Random Forest algorithm produces a unique forest of decision trees each time it is run. As a result, each run of the algorithm produces a different set of features with unique weights. The purpose of the feature estimator is to take the output of 50 random forest algorithm instances and generate a weighted list of top features/biomarkers. Since the features included in each individual output may not be the same and may be in different orders depending upon the decision tree that was generated, it is essential to weight all these outputs. The weighted and ranked features have been listed in the table below:
Note - The biomarkers that overlap with the ones produced in the above random forest analysis are labeled in bold
A majority of the biomarkers in the features estimator are bolded indicating they match the features identified in the example random forest run. This is expected since the estimator should encompass the most likely features that will be found in any given Random Forest.
Making our tool accessible to researchers and iGEM community
Although our original intent was to use this tool to aid in biomarker discovery for hepatocellular carcinoma markers, our team realized that the approach and theory behind our model were universal and could be applicable to any existing methylome data . As such, we expanded the modularity of our approach to take methylome data from the TCGA and generate a subsequent classifier. In the proposed state, methylation data from the TCGA or separate Illumina sequencing platform can be ingested into the platform which will output a list of biomarkers to evaluate further. This output will be linked to a primer design and ordering tool which will allow scientists to efficiently find biomarkers for patient samples. The TCGA database contains data from over 8500 tumor samples of 33 tumor types and the Gene Expression Omnibus database contains an additional 60000 samples including several thousand normal blood and tissue samples that can be abstracted as controls. The Illumina Human Methylation 450 microarray-based assay that is used to generate methylation datasets covers over 450,000 loci that include human CpG islands and gene promoter regions that are frequently aberrantly methylated in cancer cells. The TCGA integration will make this machine-learning model applicable to cancers outside of HCC. This enables other teams and researchers to rapidly identify potentially hyper-methylated markers for various cancer types and perform a further wet-lab investigation at these markers . With the optimally designed set of biomarkers for these cancers, teams can design probes to assay patient methylation levels at these loci and perhaps implement targeted demethylation techniques. It is also likely that as new patient data is ingested into the platform, new markers that were initially not statistically significant to include will become more relevant across the datasets. Lastly, it is also possible to use the set of all markers across cancers to make predictions on whether a patient is at a predisposition for a combination of multiple common cancers. Overall the availability of these markers will enable more rapid cancer diagnostics alongside the advancement of liquid biopsy sampling
Conclusion
We believe that this epigenetics-driven modeling approach will prove useful to many iGEM teams and the cancer research community as a whole going forward. Using the overlapping technique with LASSO and Random Forest help provide unparalleled discovery ability and its modularity for any disease with existing methylome data makes it a powerful tool for others in their research as well
References
- http://matthewrocklin.com/blog/work/2015/03/16/Fast-Serialization
- https://ianlondon.github.io/blog/pickling-basics/
- https://www.datacamp.com/community/tutorials/random-forests-classifier-python
- https://medium.com/usf-msds/intuitive-interpretation-of-random-forest-2238687cae45
- https://stackabuse.com/random-forest-algorithm-with-python-and-scikit-learn/
- https://chrisalbon.com/machine_learning/trees_and_forests/random_forest_classifier_example/
- https://stats.stackexchange.com/questions/132777/what-does-auc-stand-for-and-what-is-it