Difference between revisions of "Team:UC San Diego/Model"

Revision as of 10:58, 17 October 2018

Model

Introduction

Our team built mathematical models to better characterize our overall workflow. The primary task was to use these approaches to determine disease-specific markers (i.e. regions of promoter methylation that were consistent across all patients for a particular disease). Although the literature provided some biomarkers, our team believed that, given the novelty of our approach, deriving a new panel of biomarkers from existing datasets of methylation beta values would be more effective.

Our post-doctoral supervisors in the Zhang Lab used an HCC-specific (hepatocellular carcinoma) methylation marker panel in which they compared the methylation profiles of HCC tissue against normal leukocytes. Cancer patients are known to show abnormally high levels of cell-free DNA (cfDNA) in the plasma, which is typically derived from cancer cells that have undergone apoptosis, i.e. programmed cell death. Extracting cfDNA is minimally invasive for patients relative to conventional tissue biopsy. In addition, routine blood samples can be drawn from patients for real-time monitoring. The optimal means to characterize a tumor through the methylation profile of the plasma DNA is to derive a set of genes that all show high levels of hyper-methylation.

Using Unsupervised Machine Learning to Inform our Project Design

Using a large sample size of both affected and control tissue (n=2100), an HCC classifier was constructed with high specificity and sensitivity. A classifier determines which biomarkers within a specific tissue type can be used to predict the overall disease state of the sample. The goal of these algorithms was to identify a subset of hyper-methylated promoter regions that could accurately identify tumorous tissue samples based on the methylation readouts at these individual loci. It was critical to select a set of markers with high specificity, to avoid classifying healthy patients as diseased, and high sensitivity, to ensure that all diseased patients were detected.
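The two quantities named above can be made concrete with a small calculation. The following is a minimal sketch, assuming binary labels (1 = HCC, 0 = normal); the function name and the toy labels are illustrative and are not taken from our data.

```python
def sensitivity_specificity(true_labels, predicted_labels):
    """Return (sensitivity, specificity) for binary labels (1 = HCC, 0 = normal)."""
    tp = fn = tn = fp = 0
    for truth, predicted in zip(true_labels, predicted_labels):
        if truth == 1 and predicted == 1:
            tp += 1  # diseased sample correctly detected
        elif truth == 1 and predicted == 0:
            fn += 1  # diseased sample missed
        elif truth == 0 and predicted == 0:
            tn += 1  # healthy sample correctly excluded
        else:
            fp += 1  # healthy sample wrongly called diseased
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one diseased and one healthy sample")
    sensitivity = tp / (tp + fn)  # fraction of diseased samples detected
    specificity = tn / (tn + fp)  # fraction of healthy samples excluded
    return sensitivity, specificity


# Toy labels (hypothetical): 4 HCC samples followed by 6 normal samples.
toy_truth = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
toy_calls = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
toy_sensitivity, toy_specificity = sensitivity_specificity(toy_truth, toy_calls)
```

In this toy case one HCC sample is missed and one normal sample is miscalled, so sensitivity is 3/4 and specificity is 5/6.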

This study involved the analysis of 485,000 unique CpG markers to generate a final set of CpG markers that were enhanced in HCC patients. The following data represent the sample characteristics in the training and validation cohorts used by the algorithm. These data were integrated with bisulfite-based methylation measurements from the Illumina Human Methylation 450 microarray platform and formed the starting point of our data set.

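In order to develop the model, our team made one foundational assumption: CpG markers that have a maximal difference in methylation between the two sample types (HCC tissue and normal leukocytes) would most likely demonstrate detectable methylation differences in the cfDNA of HCC patients.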
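The assumption above suggests ranking markers by how far their average methylation differs between the two sample types. The following is a minimal sketch of that ranking, assuming beta values are stored per marker as plain lists; the marker IDs and values are hypothetical, and this is not the exact procedure used in the study.

```python
from statistics import mean


def rank_by_methylation_difference(tumor_betas, normal_betas):
    """Rank markers by |mean tumor beta - mean normal beta|, largest first.

    Both arguments map a marker ID to a list of beta values (0 to 1).
    Returns a list of (marker, signed difference) pairs.
    """
    differences = {
        marker: mean(tumor_betas[marker]) - mean(normal_betas[marker])
        for marker in tumor_betas
    }
    return sorted(differences.items(), key=lambda item: abs(item[1]), reverse=True)


# Hypothetical beta values for three markers in each sample type.
toy_tumor = {
    "cg_A": [0.90, 0.80, 0.85],
    "cg_B": [0.50, 0.55, 0.45],
    "cg_C": [0.30, 0.20, 0.25],
}
toy_normal = {
    "cg_A": [0.10, 0.20, 0.15],
    "cg_B": [0.50, 0.45, 0.50],
    "cg_C": [0.60, 0.70, 0.65],
}
toy_ranking = rank_by_methylation_difference(toy_tumor, toy_normal)
```

A positive difference marks a locus that is hyper-methylated in tumor tissue, which is the direction of interest for this project; the absolute value is used only for ordering.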

Explaining the Theory Behind our Model

Our team then used Random Forest algorithms and LASSO (Least Absolute Shrinkage and Selection Operator) to further reduce the biomarker list. In selecting the markers, a t-statistic method with Empirical Bayes was used to shrink the variance, and the Benjamini-Hochberg procedure was used to control the false discovery rate at a significance level of 0.05. Using these methods and values, a panel of the top 1,000 markers was generated, from which 401 with good experimental amplification profiles were selected. Since methylation sites that occur in close proximity are likely to be co-methylated, methylation correlated blocks (MCBs) were generated using a Pearson correlation method. To generate the methylation profiles of the training and validation cohorts, we used bisulfite measurements: bisulfite treatment converts unmethylated cytosines into uracils, which are read as thymines after amplification, whereas methylated cytosines are protected. The final methylation ratio is the number of protected (unconverted) cytosines divided by the total number of protected and converted cytosines.
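The false discovery rate control mentioned above can be illustrated with the standard Benjamini-Hochberg step-up rule. The following is a minimal sketch, assuming the per-marker p-values have already been computed (the moderated t-statistics themselves are not reproduced here); the function name and toy p-values are illustrative.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans, True where a marker passes FDR control at alpha.

    Step-up rule: sort the m p-values, find the largest rank k such that
    p_(k) <= (k / m) * alpha, and reject every hypothesis with rank <= k.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    largest_k = 0
    for rank, index in enumerate(order, start=1):
        if p_values[index] <= rank / m * alpha:
            largest_k = rank
    rejected = [False] * m
    for rank, index in enumerate(order, start=1):
        if rank <= largest_k:
            rejected[index] = True
    return rejected


# Hypothetical p-values for four markers, in their original (unsorted) order.
toy_p_values = [0.041, 0.001, 0.205, 0.008]
toy_selected = benjamini_hochberg(toy_p_values, alpha=0.05)
```

Because the rule looks for the largest passing rank, a marker can be selected even if its own p-value is above its rank threshold, provided a marker with a larger p-value passes. In practice an established statistics library would normally be used instead of a hand-written version.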

Outcomes and Analysis

The two analyses yielded overlapping markers, which were used to test sensitivity and specificity against the training and validation data sets.
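Taking the overlap of two marker selections amounts to a set intersection. The following is a minimal sketch, assuming the two analyses are the Random Forest and LASSO selections and that each produces a list of marker IDs; the IDs shown are hypothetical.

```python
def overlapping_markers(first_selection, second_selection):
    """Return the markers chosen by both analyses, sorted for stable output."""
    return sorted(set(first_selection) & set(second_selection))


# Hypothetical marker IDs selected by each analysis.
toy_random_forest = ["cg_A", "cg_C", "cg_D", "cg_F"]
toy_lasso = ["cg_B", "cg_C", "cg_F", "cg_G"]
toy_overlap = overlapping_markers(toy_random_forest, toy_lasso)
```

Keeping only markers supported by both methods trades panel size for robustness, since a marker must be informative under two different selection criteria.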

In the end, our team was able to identify the following gene panel as an accurate indicator of HCC, and used it to design probes for methylation analysis.

Making our tool accessible to researchers and the iGEM community

Although our original intent was to use this tool to aid in biomarker discovery for hepatocellular carcinoma, our team realized that the approach and theory behind our model were general and could be applied to any existing methylome data. As such, we expanded the modularity of our approach to take methylome data from The Cancer Genome Atlas (TCGA) and generate a corresponding classifier. In the proposed state, methylation data from the TCGA or from a separate Illumina platform can be ingested into the platform, which will output a list of biomarkers to evaluate further. This output will be linked to a primer design and ordering tool, which will allow scientists to efficiently find biomarkers for patient samples. The TCGA database contains data from over 8,500 tumor samples of 33 tumor types, and the Gene Expression Omnibus database contains an additional 60,000 samples, including several thousand normal blood and tissue samples that can be used as controls. The Illumina Human Methylation 450 microarray-based assay that is used to generate methylation datasets covers over 450,000 loci, including human CpG islands and gene promoter regions that are frequently aberrantly methylated in cancer cells. The TCGA integration will make this machine-learning model applicable to cancers other than HCC. This enables other teams and researchers to rapidly identify potentially hyper-methylated markers for various cancer types and perform further wet-lab investigation at these markers. With an optimized set of biomarkers for these cancers, teams can design probes to assay patient methylation levels at these loci and perhaps implement targeted demethylation techniques. It is also likely that, as new patient data is ingested into the platform, markers that were initially not statistically significant enough to include will become more relevant across the datasets.
Lastly, it is also possible to use the set of all markers across cancers to predict whether a patient is predisposed to a combination of multiple common cancers. Overall, the availability of these markers will enable more rapid cancer diagnostics alongside the advancement of liquid biopsy sampling.

@@ Line 1: / Line 1: @@
 {{UC_San_Diego}}
 <html>
+  <head>
+  </head>
+  <body>
+    <div id="wrapper">
+      <div class="section">
-<div class="clear"></div>
+        <h2>Model</h2>
+        <img src="" alt="" class="hcIconMain" />
+        <h3>Introduction</h3>
-<div class="column full_size">
+        <p>Our team generated mathematical models in order to better characterize our overall workflow. The primary task was to use these rigorous approaches in order to determine specific disease-specific markers (i.e. regions of promoter methylation that were consistent across all patients for a particular disease). Although literature provided some biomarkers, our team believed that due to the novelty of our approach, characterizing existing datasets and measurements of beta methylation values to derive a new panel of biomarkers would be more effective. </p>
-<h1> Modeling</h1>
+        <p>Our post-doctoral supervisors in the Zhang Lab used a HCC-specific methylation marker panel where they compared the methylation profiles of HCC tissue against normal leukocytes. Cancer patients are known to show abnormally high levels of cell-free DNA in the plasma which is typically derived from cancer cells that have undergone apoptosis or programmed cell death. This cfDNA extraction technique is minimally invasive for patients relative to conventional tissue biopsy. In addition, routine blood samples can be drawn from patients for real-time monitoring. The optimal means to characterize a tumor through the methylation profile of the plasma DNA is to derive a set of genes that are all incurring high levels of hyper-methylation. </p>
+        <img src="" alt="generalworkflow" class="mlImg" />
-<p>Mathematical models and computer simulations provide a great way to describe the function and operation of BioBrick Parts and Devices. Synthetic Biology is an engineering discipline, and part of engineering is simulation and modeling to determine the behavior of your design before you build it. Designing and simulating can be iterated many times in a computer before moving to the lab. This award is for teams who build a model of their system and use it to inform system design or simulate expected behavior in conjunction with experiments in the wetlab.</p>
+        <h3>Using Unsupervised Machine Learning to Inform our Project Design</h3>
+        <p>Using a large sample size of both affected and control tissue (n=2100), a HCC classifier was constructed with high specificity and sensitivity. A classifier determines which biomarkers within a specific tissue type can be used to predict the overall disease state of the sample. The goal of these algorithms was to identify a subset of hyper-methylated promoter regions that could accurately identify tumorous tissue samples based on the methylation readouts at these individual loci. To increase sensitivity and specificity, it was critical to select a set of markers that had high specificity to avoid including healthy patients in the diagnostic outcome and to ensure all diseased patients were being accounted for. </p>
-</div>
+        <p>This study involved the analysis of 485,000 unique CpG markers to generate a final set of CpG markers that were enhanced in HCC patients. The following data represent the sample characteristics in the training and validation cohort used by the algorithm. This data was integrated with bisulfite sequencing measurements from IlluminaSeq 450 microarray platform and formed the starting point of our data set.</p>
-<div class="clear"></div>
+        <ol>In order for our team to develop the model, we had to make a foundational assumption that
+          <li>CpG markers that have a maximal difference in methylation between the two sample types would most likely demonstrate detectable methylation differences in the cfDNA of HCC patients. </li>
-<div class="column full_size">
+          <li></li>
-<h3> Gold Medal Criterion #3</h3>
+        </ol>
-<p>
+        <h3>Explaining the Theory Behind our Model</h3>
-Convince the judges that your project's design and/or implementation is based on insight you have gained from modeling. This could be either a new model you develop or the implementation of a model from a previous team. You must thoroughly document your model's contribution to your project on your team's wiki, including assumptions, relevant data, model results, and a clear explanation of your model that anyone can understand.
+        <p>Our team then used Random Forest algorithms and LASSO (Least Absolute Shrinkage and Selection Operator) to further reduce the biomarker list. In selecting the markers, t-statistic method with Empirical Bayes was used to shrink the variance and Benjamini-Hochberg procedure was used to control the false discovery rate at a significance level of 0.05. Using these methods and values, a panel of top 1000 markers were generated from which 401 with good experimental amplification profiles were selected. Since methylation sites occur in close proximity and are likely to be co-methylated, methylation correlated blocks (MCBs) were generated using a Pearson correlation method. To generate the methylation profiles of the training and validation cohorts, we used bisulfite measurements: this method changes unmethylated cytosines into thymines whereas the methylated cytosines are protected. The final ratio calculated evaluates the total amount of unaffected cytosines divided by the total number of unaffected cytosines and base-modified cytosines. </p>
-<br><br>
+        <h3>Outcomes and Analysis</h3>
-The model should impact your project design in a meaningful way. Modeling may include, but is not limited to, deterministic, exploratory, molecular dynamic, and stochastic models. Teams may also explore the physical modeling of a single component within a system or utilize mathematical modeling for predicting function of a more complex device.
+        <p>The two analyses yielded overlapping markers where used to test sensitivity and specificity against the training and validation data sets. </p>
-</p>
+        <p>In the end, our team was able to identify the following gene panel as an accurate indicator for HCC markers, and used this to specifically design probes for methylation analysis. </p>
+        <h3>Making our tool accessible to researchers and iGEM community</h3>
-<p>
+        <p>Although our original intent was to use this tool to aid in biomarker discovery for hepatocellular carcinoma markers, our team realized that the approach and theory behind our model was universal and could be applicable to any existing methylome data. As such, we expanded the modularity of our approach to take methylome data from the TCGA and generate a subsequent classifier. In the proposed state, methylation data from the TCGA or separate Illumina sequencing platform can be ingested into the platform which will output a list of biomarkers to evaluate further. This output will be linked to a primer design and ordering tool which will allow scientists to efficiently find biomarkers for patient samples. The TCGA database contains data from over 8500 tumor samples of 33 tumor types and the Gene Expression Omnibus database contains an additional 60000 samples including several thousand normal blood and tissue samples that can be abstracted as controls. The Illumina Human Methylation 450 microarray based assay that is used to generate methylation datasets covers over 450,000 loci that include human CpG islands and gene promoter regions that are frequently aberrantly methylated in cancer cells. The TCGA integration will make this machine-learning model applicable to cancers outside of HCC.  This enables other teams and researchers to rapidly identify potentially hyper-methylated markers for various cancer types and perform further wet-lab investigation at these markers. With the optimally designed set of biomarkers for these cancers, teams can design probes to assay patient methylation levels at these loci and perhaps implement targeted demethylation techniques. It is also likely that as new patient data is ingested into the platform, new markers that were initially not statistically significant to include will become more relevant across the datasets. 
Lastly, it is also possible to use the set of all markers across cancers to make predictions on whether a patient is at a predisposition for a combination of multiple common cancers. Overall the availability of these markers will enable more rapid cancer diagnostics alongside the advancement of liquid biopsy sampling. </p>
-Please see the <a href="https://2018.igem.org/Judging/Medals"> 2018
+        <h3>References</h3>
- Medals Page</a> for more information.
+      </div>
-</p>
+    </div>
-</div>
+  </body>
-<div class="column two_thirds_size">
-<h3>Best Model Special Prize</h3>
-<p>
-To compete for the <a href="https://2018.igem.org/Judging/Awards">Best Model prize</a>, please describe your work on this page  and also fill out the description on the <a href="https://2018.igem.org/Judging/Judging_Form">judging form</a>. Please note you can compete for both the gold medal criterion #3 and the best model prize with this page.
-<br><br>
-You must also delete the message box on the top of this page to be eligible for the Best Model Prize.
-</p>
-</div>
-<div class="column third_size">
-<div class="highlight decoration_A_full">
-<h3> Inspiration </h3>
-<p>
-Here are a few examples from previous teams:
-</p>
-<ul>
-<li><a href="https://2016.igem.org/Team:Manchester/Model">2016 Manchester</a></li>
-<li><a href="https://2016.igem.org/Team:TU_Delft/Model">2016 TU Delft</li>
-<li><a href="https://2014.igem.org/Team:ETH_Zurich/modeling/overview">2014 ETH Zurich</a></li>
-<li><a href="https://2014.igem.org/Team:Waterloo/Math_Book">2014 Waterloo</a></li>
-</ul>
-</div>
-</div>
 </html>