Team:UC San Diego/Model

Model

Introduction

Our team built mathematical models to better characterize our overall workflow. The primary goal was to use these approaches to identify disease-specific markers (i.e. regions of promoter methylation that were consistent across all patients with a particular disease). Although the literature provided some biomarkers, we believed that, given the novelty of our approach, deriving a new panel of biomarkers from existing datasets of methylation beta values would be more effective.

Our post-doctoral supervisors in the Zhang Lab used a methylation marker panel specific to hepatocellular carcinoma (HCC), for which they compared the methylation profiles of HCC tissue against normal leukocytes. Cancer patients are known to show abnormally high levels of cell-free DNA (cfDNA) in the plasma, which is typically derived from cancer cells that have undergone apoptosis (programmed cell death). Extracting cfDNA is minimally invasive for patients relative to conventional tissue biopsy. In addition, routine blood samples can be drawn from patients for real-time monitoring. The optimal way to characterize a tumor through the methylation profile of plasma DNA is to derive a set of genes that all show high levels of hyper-methylation.

[Figure: general workflow]

Using Unsupervised Machine Learning to Inform our Project Design

Using a large sample of both affected and control tissue (n=2100), an HCC classifier was constructed with high specificity and sensitivity. A classifier uses biomarkers measured in a specific tissue type to predict the overall disease state of the sample. The goal of these algorithms was to identify a subset of hyper-methylated promoter regions that could accurately identify tumorous tissue samples based on the methylation readouts at these individual loci. It was critical to select a set of markers with high specificity, to avoid classifying healthy patients as diseased, and high sensitivity, to ensure that all diseased patients were detected.
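As a minimal sketch of how the two metrics discussed above are computed, the snippet below counts true and false calls for a binary classifier. The function name and the toy labels are hypothetical and are not taken from our data set; 1 stands for an HCC sample and 0 for a normal sample.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels (1 = HCC, 0 = normal)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # diseased, called diseased
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # diseased, missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # healthy, called healthy
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # healthy, called diseased
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Toy labels invented for illustration only.
truth = [1, 1, 1, 1, 0, 0, 0, 0]
calls = [1, 1, 1, 0, 0, 0, 0, 1]
sens, spec = sensitivity_specificity(truth, calls)
print(f"sensitivity={sens:.2f} specificity={spec:.2f}")
```

A marker panel that misses diseased patients lowers sensitivity, while one that flags healthy patients lowers specificity, which is why both were tracked when selecting markers.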

This study analyzed 485,000 unique CpG markers to generate a final set of CpG markers that were enhanced in HCC patients. The following data represent the sample characteristics in the training and validation cohorts used by the algorithm. These data were integrated with bisulfite-based methylation measurements from the Illumina Human Methylation 450 microarray platform and formed the starting point of our data set.

In order to develop the model, our team had to make one foundational assumption:
  1. CpG markers with the largest difference in methylation between the two sample types (HCC tissue and normal leukocytes) are the most likely to show detectable methylation differences in the cfDNA of HCC patients.
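The assumption above suggests a simple first-pass ranking: order markers by the absolute difference in mean beta value between the two sample types. The sketch below illustrates that idea only; the function name, the marker IDs, and the beta values are invented for illustration and do not reproduce our actual selection procedure.

```python
from statistics import mean


def rank_by_mean_difference(tumor, normal, top_n):
    """Rank CpG markers by |mean tumor beta - mean normal beta|.

    tumor, normal: dicts mapping a marker ID to a list of beta values (0 to 1).
    Returns the top_n (marker ID, signed difference) pairs, largest gap first.
    """
    diffs = {
        cpg: mean(tumor[cpg]) - mean(normal[cpg])
        for cpg in tumor
        if cpg in normal
    }
    ranked = sorted(diffs.items(), key=lambda item: abs(item[1]), reverse=True)
    return ranked[:top_n]


# Hypothetical marker IDs and beta values.
tumor_beta = {
    "cg_A": [0.90, 0.80, 0.85],
    "cg_B": [0.50, 0.55, 0.45],
    "cg_C": [0.20, 0.10, 0.15],
}
normal_beta = {
    "cg_A": [0.10, 0.20, 0.15],
    "cg_B": [0.50, 0.50, 0.50],
    "cg_C": [0.60, 0.70, 0.65],
}
top = rank_by_mean_difference(tumor_beta, normal_beta, top_n=2)
for cpg, diff in top:
    print(cpg, round(diff, 2))
```

A positive difference indicates hyper-methylation in the tumor samples and a negative one indicates hypo-methylation, so a panel restricted to hyper-methylated markers would keep only the positive entries.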

Explaining the Theory Behind our Model

Our team then used Random Forest algorithms and LASSO (Least Absolute Shrinkage and Selection Operator) to further reduce the biomarker list. In selecting the markers, a t-statistic method with Empirical Bayes was used to shrink the variance, and the Benjamini-Hochberg procedure was used to control the false discovery rate at a significance level of 0.05. Using these methods, a panel of the top 1000 markers was generated, from which 401 with good experimental amplification profiles were selected. Since methylation sites in close proximity are likely to be co-methylated, methylation correlated blocks (MCBs) were generated using a Pearson correlation method. To generate the methylation profiles of the training and validation cohorts, we used bisulfite measurements: bisulfite treatment converts unmethylated cytosines to uracils, which are read as thymines after amplification, whereas methylated cytosines are protected. The final ratio (the beta value) is the number of unconverted (methylated) cytosines divided by the total number of unconverted and converted cytosines.
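Two of the steps above can be sketched in a few lines of plain Python: the Benjamini-Hochberg step-up procedure and the grouping of neighboring markers into correlated blocks. This is an illustrative sketch under stated assumptions, not our production code. The p-values are assumed to come from the moderated t-statistics (not computed here), the rule of joining adjacent markers when their Pearson r is at least 0.5 is an assumed threshold, and all function names, marker IDs, and numbers are hypothetical.

```python
import math


def benjamini_hochberg(p_values, alpha=0.05):
    """Return the indices of hypotheses rejected at false discovery rate alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    # Find the largest rank k (1-based) whose p-value satisfies p_(k) <= k/m * alpha.
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff = rank
    return sorted(order[:cutoff])


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def correlated_blocks(markers, betas, r_min=0.5):
    """Group genomically adjacent markers whose beta values correlate.

    markers: marker IDs sorted by genomic position on one chromosome.
    betas: dict mapping a marker ID to beta values across the same samples.
    r_min: assumed threshold; a lower correlation starts a new block.
    """
    blocks = []
    for marker in markers:
        if blocks and pearson(betas[blocks[-1][-1]], betas[marker]) >= r_min:
            blocks[-1].append(marker)
        else:
            blocks.append([marker])
    return blocks


# Hypothetical p-values and beta values for illustration.
p_vals = [0.041, 0.001, 0.6, 0.008, 0.039]
significant = benjamini_hochberg(p_vals, alpha=0.05)

betas = {
    "cg1": [0.1, 0.5, 0.9, 0.3],
    "cg2": [0.2, 0.6, 0.8, 0.4],
    "cg3": [0.9, 0.1, 0.5, 0.5],
}
blocks = correlated_blocks(["cg1", "cg2", "cg3"], betas)
print(significant, blocks)
```

In this toy input, only the two smallest p-values pass the step-up threshold, and the first two markers merge into one block because their beta values rise and fall together across samples.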

Outcomes and Analysis

The markers that overlapped between the two analyses were used to test sensitivity and specificity against the training and validation data sets.

In the end, our team identified the following gene panel as an accurate indicator of HCC and used it to design probes for methylation analysis.

Making our tool accessible to researchers and the iGEM community

Although our original intent was to use this tool to aid biomarker discovery for hepatocellular carcinoma, our team realized that the approach and theory behind our model are universal and could be applied to any existing methylome data. As such, we expanded the modularity of our approach to take methylome data from The Cancer Genome Atlas (TCGA) and generate a corresponding classifier. In the proposed state, methylation data from the TCGA or a separate Illumina sequencing platform can be ingested into the platform, which will output a list of biomarkers to evaluate further. This output will be linked to a primer design and ordering tool, which will allow scientists to efficiently find biomarkers for patient samples.

The TCGA database contains data from over 8500 tumor samples of 33 tumor types, and the Gene Expression Omnibus database contains an additional 60000 samples, including several thousand normal blood and tissue samples that can be used as controls. The Illumina Human Methylation 450 microarray-based assay used to generate methylation datasets covers over 450,000 loci, including human CpG islands and gene promoter regions that are frequently aberrantly methylated in cancer cells.

The TCGA integration will make this machine-learning model applicable to cancers other than HCC. This enables other teams and researchers to rapidly identify potentially hyper-methylated markers for various cancer types and perform further wet-lab investigation of these markers. With an optimally designed set of biomarkers for these cancers, teams can design probes to assay patient methylation levels at these loci and perhaps implement targeted demethylation techniques. It is also likely that, as new patient data is ingested into the platform, markers that were initially not statistically significant enough to include will become more relevant across the datasets.

Lastly, it is also possible to use the set of all markers across cancers to predict whether a patient is predisposed to a combination of multiple common cancers. Overall, the availability of these markers will enable more rapid cancer diagnostics alongside the advancement of liquid biopsy sampling.
