Team:UC San Diego/Model

Model

Introduction

Our team generated mathematical models in order to better characterize our overall workflow. The primary task was to use these rigorous approaches in order to determine specific disease-specific markers (i.e. regions of promoter methylation that were consistent across all patients for a particular disease). Although literature provided some biomarkers, our team believed that due to the novelty of our approach, characterizing existing datasets and measurements of beta methylation values to derive a new panel of biomarkers would be more effective.

Cancer patients are known to show abnormally high levels of cell-free DNA in the plasma which is typically derived from cancer cells that have undergone apoptosis or programmed cell death. The optimal means to characterize a tumor through the methylation profile of the plasma DNA is to derive a set of genes that are all incurring high levels of hyper-methylation. Our team used measurements of 485,000 unique CpG markers to guide our DNA probe design and help characterize our understanding of the disease state fundamentals. Along the way, we expanded the modularity of our approach so that it can be used for other diseases and enabled this functionality within our biomarker discovery tool.

Key achievements from our modeling were

Developed a model to inform DNA probe design for any methylome data
Created a primer design script to generate a list of all possible methylation positions
Made our Random Forest algorithms publicly available to other iGEM teams
Expanded cross-platform functionality for any TCGA methylome data set
Made a significant advance in using epigenetic parameters to guide wetlab circuit design

Using Unsupervised Machine Learning to Inform our Project Design

Using a large sample size of both affected and control tissue (n=2191), a HCC classifier was constructed with high specificity and sensitivity. A classifier determines which biomarkers within a specific tissue type can be used to predict the overall disease state of the sample. The goal of these algorithms was to identify a subset of hyper-methylated promoter regions that could accurately identify tumorous tissue samples based on the methylation readouts at these individual loci. To increase sensitivity and specificity, it was critical to select a set of markers that had high specificity to avoid including healthy patients in the diagnostic outcome and to ensure all diseased patients were being accounted for.

Overview of dataset

This study involved the analysis of 485,000 unique CpG markers to generate a final set of CpG markers that were enhanced in HCC patients. The following data represent the sample characteristics in the training and validation cohort used by the algorithm. This data was integrated with bisulfite sequencing measurements from IlluminaSeq 450 microarray platform and formed the starting point of our data set. To get an extract of the methylation dataset, please email ucsdigem@gmail.com. We are not making the dataset public at the moment due to a confidentiality agreement with one of our laboratory collaborators. The following screenshots show the first and last five rows respectively of the post-processed data.

PatientType - This is the binary classifier used to distinguish whether or not the tissue sample is HCC(1) or normal(0). The data is organized such that all of the HCC patient data (1221 samples) is first and is followed by the healthy patient data (970 samples).

Explaining the Theory Behind our Model

In order for our team to develop the model, we had to make a foundational assumption that CpG markers that have a maximal difference in methylation between the two sample types would most likely demonstrate detectable methylation differences in the cfDNA of HCC patients . Our team then used Random Forest algorithms and LASSO (Least Absolute Shrinkage and Selection Operator) to further reduce the biomarker list. In selecting the markers, the t-statistic method with Empirical Bayes was used to shrink the variance and Benjamini-Hochberg procedure was used to control the false discovery rate at a significance level of 0.05. Using these methods and values, a panel of top 1000 markers were generated from which 401 with good experimental amplification profiles were selected. To generate the methylation profiles of the training and validation cohorts, we used and generated bisulfite measurements: this method changes unmethylated cytosines into thymines whereas the methylated cytosines are protected . The final ratio calculated evaluates the total amount of unaffected cytosines divided by the total number of unaffected cytosines and base-modified cytosines.

Outcomes and Analysis

The two analyses yielded overlapping markers where used to test sensitivity and specificity against the training and validation data sets.

Random Forest Analysis : # of training observations: 1752, # of test observations: 439

In a Random Forest prediction model, feature importances give a sense of which variables have the most effect in these models. The sklearn package has a .feature_importances_ attribute that returns an array of each features importance in the model. To identify the top 30 positions that contribute to the model, we take the top 30 feature weights and organize them in descending order below.

Random Forest Methylation Biomarkers - (Marker, Feature Weight)

Biomarker ID	Feature Weight
X4.1324870	0.09226192627385134
X17.80358809	0.059085931369960085
X5.171538557	0.057999353985870285
X4.1324877	0.03303274292587381
X11.47624801	0.03174275980572196
X3.101808857	0.02760567525917319
X17.49295615	0.02695179769504988
X10.135072960	0.026502243602910897
X7.151106060	0.025569011566135204
X10.103534546	0.017009344232785523
X6.31527889	0.016819558850001068
X10.88684020	0.013249692773994187
X2.113931518	0.012937163887263686
X7.151106022	0.011656198731186432
X21.36421467	0.011245567212164166
X10.134141823	0.01078311019638933
X8.53851151	0.010115600105707685
X10.14701815	0.00988707517481865
X1.226296852	0.00823810513451597
X17.47286802	0.008169863614495784
X4.1324849	0.007778442275209609
X6.41528449	0.007534845972666597
X7.27155000	0.007265779808743397
X11.69707285	0.007172146198231417
X8.2879889	0.0067021680826400585
X8.48656343	0.006613567990276688
X18.61143902	0.006435928106157796
X4.1324842	0.006322098138107917
X6.16729606	0.006154748779256597
X12.20522379	0.0061085520577785464

Confusion Matrix Structure

	Negative Test	Positive Test	Total
Disease Absent	True Negative (TN)	False Positive (FP)	TN + FP
Disease Present	False Negative (FN)	True Positive (TP)	FN + TP
Total	TN + FN	FP + TP	FP + TP

Confusion Matrix & Classification Report: A confusion matrix is a summary of the prediction results generated from the model on a classification problem.

TP : Case where the disease is present and the test predicts the state accurately.

FP : Case where the disease is absent, but the test fails and predicts it as present.

FN : Case where disease is present, but the test fails and predicts it as absent.

TN : Case where the disease is absent and the test predicts the state accurately.

Accuracy = ( (TP + TN) / Total)

Precision = (TP / (TP + FP) )

Sensitivity = (TP / (TP+FN) )

Specificity = (TN / (TN + FP) )

F1 Score - Measure of a binary classification’s accuracy by taking a weighted average of the recall and precision values.

ROC Analysis : Receiving Operator Characteristic (ROC) curves are typically used to study the output of a classifier for a binary classification problem. In this model, the outputs were labeled as 1 for HCC patient samples and 0 for healthy patient samples. The ROC curve maps the true positive rate (y-axis) against the false positive rate (x-axis) where the top left corner is the ideal point as this indicates a true positive rate of 1 and a false positive rate of 0. We often observe and try to optimize the area under this curve (AUC) which is some value between 0 and 1. In the ROC curve generated by this random forest model, the AUC = 0.9876165690739536.

Random Forest Feature Estimator : As the name suggests, the Random Forest algorithm produces a unique forest of decision trees each time it is run. As a result, each run of the algorithm produces a different set of features with unique weights. The purpose of the feature estimator is to take the output of 50 random forest algorithm instances and generate a weighted list of top features/biomarkers. Since the features included in each individual output may not be the same and may be in different orders depending upon the decision tree that was generated, it is essential to weight all these outputs. The weighted and ranked features have been listed in the table below:

Note - The biomarkers that overlap with the ones produced in the above random forest analysis are labeled in bold

A majority of the biomarkers in the features estimator are bolded indicating they match the features identified in the example random forest run. This is expected since the estimator should encompass the most likely features that will be found in any given Random Forest.

Making our tool accessible to researchers and iGEM community

Although our original intent was to use this tool to aid in biomarker discovery for hepatocellular carcinoma markers, our team realized that the approach and theory behind our model were universal and could be applicable to any existing methylome data . As such, we expanded the modularity of our approach to take methylome data from the TCGA and generate a subsequent classifier. In the proposed state, methylation data from the TCGA or separate Illumina sequencing platform can be ingested into the platform which will output a list of biomarkers to evaluate further. This output will be linked to a primer design and ordering tool which will allow scientists to efficiently find biomarkers for patient samples. The TCGA database contains data from over 8500 tumor samples of 33 tumor types and the Gene Expression Omnibus database contains an additional 60000 samples including several thousand normal blood and tissue samples that can be abstracted as controls. The Illumina Human Methylation 450 microarray-based assay that is used to generate methylation datasets covers over 450,000 loci that include human CpG islands and gene promoter regions that are frequently aberrantly methylated in cancer cells. The TCGA integration will make this machine-learning model applicable to cancers outside of HCC. This enables other teams and researchers to rapidly identify potentially hyper-methylated markers for various cancer types and perform a further wet-lab investigation at these markers . With the optimally designed set of biomarkers for these cancers, teams can design probes to assay patient methylation levels at these loci and perhaps implement targeted demethylation techniques. It is also likely that as new patient data is ingested into the platform, new markers that were initially not statistically significant to include will become more relevant across the datasets. Lastly, it is also possible to use the set of all markers across cancers to make predictions on whether a patient is at a predisposition for a combination of multiple common cancers. Overall the availability of these markers will enable more rapid cancer diagnostics alongside the advancement of liquid biopsy sampling

Conclusion

We believe that this epigenetics-driven modeling approach will prove useful to many iGEM teams and the cancer research community as a whole going forward. Using the overlapping technique with LASSO and Random Forest help provide unparalleled discovery ability and its modularity for any disease with existing methylome data makes it a powerful tool for others in their research as well

References

http://matthewrocklin.com/blog/work/2015/03/16/Fast-Serialization
https://ianlondon.github.io/blog/pickling-basics/
https://www.datacamp.com/community/tutorials/random-forests-classifier-python
https://medium.com/usf-msds/intuitive-interpretation-of-random-forest-2238687cae45
https://stackabuse.com/random-forest-algorithm-with-python-and-scikit-learn/
https://chrisalbon.com/machine_learning/trees_and_forests/random_forest_classifier_example/
https://stats.stackexchange.com/questions/132777/what-does-auc-stand-for-and-what-is-it