Team:UC San Diego/Model



Our team generated mathematical models in order to better characterize our overall workflow. The primary task was to use these rigorous approaches in order to determine specific disease-specific markers (i.e. regions of promoter methylation that were consistent across all patients for a particular disease). Although literature provided some biomarkers, our team believed that due to the novelty of our approach, characterizing existing datasets and measurements of beta methylation values to derive a new panel of biomarkers would be more effective.

Cancer patients are known to show abnormally high levels of cell-free DNA in the plasma which is typically derived from cancer cells that have undergone apoptosis or programmed cell death. The optimal means to characterize a tumor through the methylation profile of the plasma DNA is to derive a set of genes that are all incurring high levels of hyper-methylation. Our team used measurements of 485,000 unique CpG markers to guide our DNA probe design and help characterize our understanding of the disease state fundamentals. Along the way, we expanded the modularity of our approach so that it can be used for other diseases and enabled this functionality within our biomarker discovery tool.


Key achievements from our modeling were

  1. Developed a model to inform DNA probe design for any methylome data

  2. Created a primer design script to generate a list of all possible methylation positions

  3. Made our Random Forest algorithms publicly available to other iGEM teams

  4. Expanded cross-platform functionality for any TCGA methylome data set

  5. Made a significant advance in using epigenetic parameters to guide wetlab circuit design

Using Unsupervised Machine Learning to Inform our Project Design

Using a large sample size of both affected and control tissue (n=2191), a HCC classifier was constructed with high specificity and sensitivity. A classifier determines which biomarkers within a specific tissue type can be used to predict the overall disease state of the sample. The goal of these algorithms was to identify a subset of hyper-methylated promoter regions that could accurately identify tumorous tissue samples based on the methylation readouts at these individual loci. To increase sensitivity and specificity, it was critical to select a set of markers that had high specificity to avoid including healthy patients in the diagnostic outcome and to ensure all diseased patients were being accounted for.


Overview of dataset

This study involved the analysis of 485,000 unique CpG markers to generate a final set of CpG markers that were enhanced in HCC patients. The following data represent the sample characteristics in the training and validation cohort used by the algorithm. This data was integrated with bisulfite sequencing measurements from IlluminaSeq 450 microarray platform and formed the starting point of our data set. To get an extract of the methylation dataset, please email We are not making the dataset public at the moment due to a confidentiality agreement with one of our laboratory collaborators. The following screenshots show the first and last five rows respectively of the post-processed data.

PatientType - This is the binary classifier used to distinguish whether or not the tissue sample is HCC(1) or normal(0). The data is organized such that all of the HCC patient data (1221 samples) is first and is followed by the healthy patient data (970 samples).

Explaining the Theory Behind our Model

In order for our team to develop the model, we had to make a foundational assumption that CpG markers that have a maximal difference in methylation between the two sample types would most likely demonstrate detectable methylation differences in the cfDNA of HCC patients. A second foundational assumption was that we would only consider hypermethylated markers and ignore hypomethylated markers as these are less abundant and overall global hypomethylation is difficult to ascertain with certain techniques. Our team then used Random Forest algorithms and LASSO (Least Absolute Shrinkage and Selection Operator) to further reduce the biomarker list. In selecting the markers, the t-statistic method with Empirical Bayes was used to shrink the variance and Benjamini-Hochberg procedure was used to control the false discovery rate at a significance level of 0.05. Using these methods and values, the data was processed to roughly 450 markers. By reducing the markers to a smaller subset, we can ensure that only the markers that are most differentially methylated and have a dynamic methylation range are used for the clustering.

The Random Forest algorithm is a supervised machine learning algorithm that can be used to solve both classification and regression problems. It is a robust algorithm that avoids over-fitting by creating decisions trees from randomly selected data samples and generating predictions from each tree. The algorithm then takes predictions from all of the trees to generate the relative feature importance for the model. It was essential to perform several data processing steps prior to running this algorithm to separate the data into feature variables (methylation markers) and target variables (HCC or healthy). The input and output of Random Forest are listed below. Note that the accuracy of the model improves if the number of estimators is increased, but this also introduces higher computational burden and the chance of overfitting the data.


  1. Number of Estimators - the number of decision trees in the forest

  2. Number of Features - the total number of significant features to output from the resulting decision tree


  1. Summary Statistics - error metrics, confusion matrix, and classification report

  2. Sensitivity Analysis - AUROC curve

  3. Feature List - list of identified features and their respective weights

Similar to the Random Forest procedure, LASSO is used to shrink the total number of features being used to model the data. This algorithm uses a process called regularization which penalizes features in the data in order to keep only the most important factors. The most important parameter in a LASSO regression is the alpha value. When alpha=0, a simple linear regression is being performed on the data. When alpha approaches 1 however, most of the feature coefficients reach 0 which indicates they have no weight on the model. The only downside of LASSO is that it tends to overfit the data and has a lower predictive capability when compared to Random Forest.


Consider the alpha value - The regularization parameter, ranges from 0-1 where alpha = 0 is equivalent to a linear regression and alpha = 1 causes all features to be dropped from the model


  1. Test Statistics - for each value of alpha, the number of features used and the test/training score for the model

  2. Feature List - list of identified features and their coefficients

By running both selection algorithms, we are able to derive a subset of biomarkers that strongly contribute to the classification model. Taking an overlap of the two lists produced from each method yields a set of computationally likely disease-causing methylation loci.

Outcomes and Analysis

The two analyses yielded overlapping markers where used to test sensitivity and specificity against the training and validation data sets.

Random Forest Analysis : # of training observations: 1752, # of test observations: 439

In a Random Forest prediction model, feature importances give a sense of which variables have the most effect in these models. The sklearn package has a .feature_importances_ attribute that returns an array of each features importance in the model. To identify the top 30 positions that contribute to the model, we take the top 30 feature weights and organize them in descending order below.

Random Forest Methylation Biomarkers - (Marker, Feature Weight)

Biomarker ID Feature Weight
X4.1324870 0.09226192627385134
X17.80358809 0.059085931369960085
X5.171538557 0.057999353985870285
X4.1324877 0.03303274292587381
X11.47624801 0.03174275980572196
X3.101808857 0.02760567525917319
X17.49295615 0.02695179769504988
X10.135072960 0.026502243602910897
X7.151106060 0.025569011566135204
X10.103534546 0.017009344232785523
X6.31527889 0.016819558850001068
X10.88684020 0.013249692773994187
X2.113931518 0.012937163887263686
X7.151106022 0.011656198731186432
X21.36421467 0.011245567212164166
X10.134141823 0.01078311019638933
X8.53851151 0.010115600105707685
X10.14701815 0.00988707517481865
X1.226296852 0.00823810513451597
X17.47286802 0.008169863614495784
X4.1324849 0.007778442275209609
X6.41528449 0.007534845972666597
X7.27155000 0.007265779808743397
X11.69707285 0.007172146198231417
X8.2879889 0.0067021680826400585
X8.48656343 0.006613567990276688
X18.61143902 0.006435928106157796
X4.1324842 0.006322098138107917
X6.16729606 0.006154748779256597
X12.20522379 0.0061085520577785464

Confusion Matrix Structure

Confusion Matrix & Classification Report: A confusion matrix is a summary of the prediction results generated from the model on a classification problem.

Negative Test Positive Test Total
Disease Absent True Negative (TN) False Positive (FP) TN + FP
Disease Present False Negative (FN) True Positive (TP) FN + TP
Total TN + FN FP + TP FP + TP

TP : Case where the disease is present and the test predicts the state accurately.

FP : Case where the disease is absent, but the test fails and predicts it as present.

FN : Case where disease is present, but the test fails and predicts it as absent.

TN : Case where the disease is absent and the test predicts the state accurately.

Accuracy = ( (TP + TN) / Total)

Precision = (TP / (TP + FP) )

Sensitivity = (TP / (TP+FN) )

Specificity = (TN / (TN + FP) )

F1 Score - Measure of a binary classification’s accuracy by taking a weighted average of the recall and precision values.

Micro Average - Aggregates the contributions of all classes to compute the average metric, used when class imbalance is suspected. (More examples of one class than another)

Macro Average - Computes the metric independently for each class and then takes the average, thus treating all classes equally.

Weighted Average - Each class contribution to the average is weighted by the relative number of examples available for it.

Confusion Matrix - Model Data

Negative Test Positive Test Total
Disease Absent 167 (TN) 7 (FP) 174 (TN + FP)
Disease Present 15 (FN) 250 (TP) 265 (FN + TP)
Total 182 (TN + FN) 257 (FP+TP) 439
Accuracy = (167+250 / 439) = 0.949886
Sensitivity = (250 / 250+15) = 0.943396
Specificity = (167 / 167+7) = 0.959770

Classification Report- Model Data

Precision Recall F1-score Support (# of Samples)
0 (Healthy) 0.92 (167/192) 0.96 ( 167/174) 0.94 174
1 (Disease) 0.97 (250/257) 0.94 (250/265) 0.96 265
micro average 0.94 0.94 0.94 439
macro average 0.93 0.94 0.94 439
weighted average 0.94 0.94 0.94 439

ROC Analysis : Receiving Operator Characteristic (ROC) curves are typically used to study the output of a classifier for a binary classification problem. In this model, the outputs were labeled as 1 for HCC patient samples and 0 for healthy patient samples. The ROC curve maps the true positive rate (y-axis) against the false positive rate (x-axis) where the top left corner is the ideal point as this indicates a true positive rate of 1 and a false positive rate of 0. We often observe and try to optimize the area under this curve (AUC) which is some value between 0 and 1. In the ROC curve generated by this random forest model, the AUC = 0.9876165690739536.

Random Forest Feature Estimator : As the name suggests, the Random Forest algorithm produces a unique forest of decision trees each time it is run. As a result, each run of the algorithm produces a different set of features with unique weights. The purpose of the feature estimator is to take the output of 50 random forest algorithm instances and generate a weighted list of top features/biomarkers. Since the features included in each individual output may not be the same and may be in different orders depending upon the decision tree that was generated, it is essential to weight all these outputs. The weighted and ranked features have been listed in the table below:

Note - The biomarkers that overlap with the ones produced in the above random forest analysis are labeled in bold

Feature Estimator - (Rank, Marker ID)

Rank Biomarker ID
1 X4.1324870 (#1)
2 X4.1324849 (#21)
3 X4.1324842 (#28)
4 X11.64644496
5 X10.88684020 (#12)
6 X11.47624801 (#5)
7 X17.80358809 (#2)
8 X4.1324877 (#4)
9 X11.1102570
10 X7.151106060 (#9)
11 X7.151106022 (#14)
12 X21.36421467 (#15)
13 X5.171538557 (#3)
14 X6.41528449 (#22)
15 X3.101808857 (#6)
16 X7.35301161
17 X17.49295615 (#7)
18 X7.140267061
19 X10.6162175
20 X12.20522379 (#30)
21 X19.14993513
22 X10.135072960 (#8)
23 X6.170494251
24 X22.37813041
25 X15.55569496
26 X1.226296852 (#19)
27 X7.139929764
28 X8.2879889 (#25)
29 X8.124332861
30 X10.3823804

A majority of the biomarkers in the features estimator are bolded indicating they match the features identified in the example random forest run. This is expected since the estimator should encompass the most likely features that will be found in any given Random Forest.

LASSO Feature List:

The LASSO method was run on the dataset using several different alpha parameters including alpha = [0.0001, 0.01, 0.05, 0.1, and 1]. The training score, test score, and number of features used for each alpha level are indicated below.

Alpha Level Training Score Test Score Number of Features
Alpha=1 0 -0.004419100127760922 0
Alpha=0.1 0.15677933815289924 0.13977060185368662 2
Alpha=0.05 0.4381200357272692 0.42151384889843724 8
Alpha=0.01 0.6807028476075634 0.6585068207316982 26
Alpha=0.0001 0.8291845505629482 0.6949522662970736 393

We can immediately notice that as the value of alpha approaches 0, the number of features used in the model increases and the prediction accuracy also improves. This is accurate with how the alpha parameter works as a value of Alpha=0 is equivalent to running a linear regression and including all of the features. The visualization below is generated by the LASSO analysis and shows the various features with different coefficient magnitudes at the varying alpha levels. The spread for alpha=1 is a straight line along zero coefficient magnitude because all of the features are dropped at this alpha level. The spread for alpha=0.0001 is very similar to the linear regression spread which is the lower bound of alpha. For further analysis, we sort the top 20 features produced from the Alpha=0.01 run and compare them to the results from Random Forest.

Biomarker ID Alpha=0.0001 Alpha=0.01 Alpha=0.05
X17.80358809 0.066168 0.14503 0
X10.6162175 0.06678 0.123655 0
X11.1102570 0.105655 0.118012 0.082344
X21.36421467 0.382851 0.113295 0
X1.226296852 0.064078 0.099545 0.002631
X3.101808857 0.053619 0.096017 0.161699
X14.23291079 0.088821 0.093794 0.029928
X19.18874658 0.045012 0.057451 0.003233
X19.36233435 0.111286 0.053625 0
X8.2879889 0.017818 0.04995 0.014396
X6.41528449 0.070034 0.045128 0
X5.172982345 0.033804 0.039364 0
X11.69707285 0.021109 0.033732 0
X17.55456535 0.08954 0.022602 0
X1.53794245 0.093721 0.017271 0
X6.134491421 0.037699 0.005953 0
X2.127933601 0.023571 0.003084 0
X7.83479508 0.01037 0.002575 0
X6.3247680 -0.020572 0.000515 0
X3.150321178 0 0 0

It is important to note that as we increase the alpha level, some of the features that have a significant weight at a lower level drop out of the model. After sorting based on the alpha=0.01 column, we find an overlapping 9 biomarkers between the Random Forest and LASSO outputs.

Making our tool accessible to researchers and iGEM community

Although our original intent was to use this tool to aid in biomarker discovery for hepatocellular carcinoma markers, our team realized that the approach and theory behind our model were universal and could be applicable to any existing methylome data . As such, we expanded the modularity of our approach to take methylome data from the TCGA and generate a subsequent classifier. In the proposed state, methylation data from the TCGA or separate Illumina sequencing platform can be ingested into the platform which will output a list of biomarkers to evaluate further. This output will be linked to a primer design and ordering tool which will allow scientists to efficiently find biomarkers for patient samples. The TCGA database contains data from over 8500 tumor samples of 33 tumor types and the Gene Expression Omnibus database contains an additional 60000 samples including several thousand normal blood and tissue samples that can be abstracted as controls. The Illumina Human Methylation 450 microarray-based assay that is used to generate methylation datasets covers over 450,000 loci that include human CpG islands and gene promoter regions that are frequently aberrantly methylated in cancer cells.

The TCGA integration will make this machine-learning model applicable to cancers outside of HCC. This enables other teams and researchers to rapidly identify potentially hyper-methylated markers for various cancer types and perform a further wet-lab investigation at these markers . With the optimally designed set of biomarkers for these cancers, teams can design probes to assay patient methylation levels at these loci and perhaps implement targeted demethylation techniques. It is also likely that as new patient data is ingested into the platform, new markers that were initially not statistically significant to include will become more relevant across the datasets. Lastly, it is also possible to use the set of all markers across cancers to make predictions on whether a patient is at a predisposition for a combination of multiple common cancers. Overall the availability of these markers will enable more rapid cancer diagnostics alongside the advancement of liquid biopsy sampling

Once our algorithm generates a list of hypermethylated biomarkers, the primer design tool takes the output loci and allows the researcher to design primers for each position. The primers, which are typically 24-36 basepairs in length, will have a methylated cytosine at the position specified by algorithms. This methylated base must also be 5-10 base pairs within the 3' end of the primer. Due to those restrictions, the primer design tool enables the researcher to generate a variety of different primers for a given locus based on their own experimental design. Future applications of this tool would involve creating a recommendation system for optimal primer design based on the biomarker output. Lastly, this tool can be linked with a primer purchasing platform such as IDT to provide a step by step workflow for the researchers.


We believe that this epigenetics-driven modeling approach will prove useful to many iGEM teams and the cancer research community as a whole going forward. Using the overlapping technique with LASSO and Random Forest help provide unparalleled discovery ability and its modularity for any disease with existing methylome data makes it a powerful tool for others in their research as well