Team:McMaster/Model


Dry Lab Project: Literature Search

After working out the basics of the wet lab experiment, we knew we needed to find small proteins that aggregate into plaques. The dry lab team therefore wrote a script that searches online protein databases for annotations matching user-provided search queries and filters out any proteins exceeding a user-specified length. Using those queries and the matching protein names, the script also returns a list of relevant studies from PubMed, which we used to identify appropriate proteins for the experiment and to inform the wet lab team’s decisions and future plans.
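
As an illustration, the sketch below shows how this kind of combined protein and PubMed search could be run through Biopython's Entrez interface. The query term, length cutoff, and email address are placeholder assumptions; the team's actual script is not reproduced here and may have used different databases or filtering logic.

```python
# A minimal sketch of a protein-database and PubMed search using Biopython's
# Entrez interface. The query, length cutoff, and email are placeholders.
from Bio import Entrez, SeqIO

Entrez.email = "your.name@example.com"   # NCBI requires a contact email (placeholder)

QUERY = "amyloid aggregation"            # example user-provided search query
MAX_LENGTH = 150                         # example user-specified length cutoff (residues)

# 1. Find candidate proteins whose annotations match the query.
handle = Entrez.esearch(db="protein", term=QUERY, retmax=20)
protein_ids = Entrez.read(handle)["IdList"]
handle.close()

# 2. Fetch the sequences and keep only the sufficiently short ones.
handle = Entrez.efetch(db="protein", id=protein_ids, rettype="fasta", retmode="text")
short_proteins = [rec for rec in SeqIO.parse(handle, "fasta") if len(rec.seq) <= MAX_LENGTH]
handle.close()

# 3. For each remaining protein, pull related studies from PubMed.
for rec in short_proteins:
    name = rec.description.split("[")[0].strip()   # rough extraction of the annotated name
    handle = Entrez.esearch(db="pubmed", term=f"{QUERY} AND {name}", retmax=5)
    pmids = Entrez.read(handle)["IdList"]
    handle.close()
    print(f"{rec.id} ({len(rec.seq)} aa): PubMed IDs {', '.join(pmids) or 'none found'}")
```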

Dry Lab Project: NGS Modelling

The wet lab team worked closely with us so that we understood the data they expected to provide: millions of short reads of mutated sequences derived from amyloid beta cDNA, separated into distinct timepoints. If the experiment worked as intended, patterns in the data would become apparent and would strengthen over time.

 

Our predictive model was designed to identify these patterns given a few timepoints’ worth of data and to predict the shape of the data at future timepoints: whether or not a given protein was conserved, whether a specific mutation was favoured, and so on. If these predictions matched the results of later timepoints in the experiment, we would have validated the model’s accuracy, i.e. its ability to predict real outcomes.

 

First, given millions of short reads in FASTA or FASTQ format, we needed to align them to the correct places on the reference sequence in order to identify whether a given nucleotide in a read originated from a newly introduced mutation or from the unmutated sequence. We performed alignment using Minimap2 (1), a recently developed aligner that outperforms older, more commonly used tools like Bowtie 2 and BWA-MEM. We then processed the resulting output with SAMtools (2), with slight modifications so that the data could be easily visualized in the Integrative Genomics Viewer (3), as shown here:

[Figure: Aligned reads visualized in the Integrative Genomics Viewer (https://static.igem.org/mediawiki/2018/thumb/3/3c/T--McMaster--NGS1.png/800px-T--McMaster--NGS1.png)]

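The sketch below outlines how this alignment and post-processing step could be driven from Python. The file names (abeta_ref.fa, reads.fastq) and the short-read preset are assumptions rather than the team's documented settings, and both minimap2 and samtools must already be installed.

```python
# A minimal sketch of the alignment and post-processing steps, run from Python.
# File names and the short-read preset are assumptions.
import subprocess

REF = "abeta_ref.fa"       # reference amyloid beta cDNA sequence (placeholder name)
READS = "reads.fastq"      # NGS short reads (placeholder name)

# Align the short reads to the reference with minimap2's short-read preset (-ax sr).
with open("aln.sam", "w") as sam:
    subprocess.run(["minimap2", "-ax", "sr", REF, READS], stdout=sam, check=True)

# Sort and index the alignments so they can be loaded into IGV (and read by pysam).
subprocess.run(["samtools", "sort", "-o", "aln.sorted.bam", "aln.sam"], check=True)
subprocess.run(["samtools", "index", "aln.sorted.bam"], check=True)
```
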
We also wrote scripts using Pysam (4) to identify the most common mutations and then isolate the reads carrying those mutations for further investigation, as shown here:

[Figure: Output of the Pysam mutation-analysis scripts (https://static.igem.org/mediawiki/2018/thumb/1/1a/T--McMaster--NGS2.png/800px-T--McMaster--NGS2.png)]

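A minimal sketch of this kind of Pysam analysis, under the assumption that the sorted and indexed BAM file and reference FASTA from the previous step are available, might look as follows; the team's actual scripts could differ in how they handle indels and quality filtering.

```python
# A minimal sketch of counting point mutations with pysam and isolating the reads
# that carry the most common one. File names follow the alignment sketch above.
from collections import Counter
import pysam

ref = pysam.FastaFile("abeta_ref.fa")
refseq = ref.fetch(ref.references[0]).upper()

mutation_counts = Counter()
with pysam.AlignmentFile("aln.sorted.bam", "rb") as bam:
    for read in bam.fetch():
        if read.is_secondary or read.is_supplementary or read.query_sequence is None:
            continue
        seq = read.query_sequence
        # (query position, reference position) pairs for aligned bases only
        for qpos, rpos in read.get_aligned_pairs(matches_only=True):
            if seq[qpos] != refseq[rpos]:
                mutation_counts[(rpos, refseq[rpos], seq[qpos])] += 1

# Report the ten most common substitutions, e.g. "A21G: 1203 reads".
for (pos, ref_base, alt_base), n in mutation_counts.most_common(10):
    print(f"{ref_base}{pos + 1}{alt_base}: {n} reads")

# Isolate the reads carrying the single most common mutation into their own BAM.
(top_pos, top_ref, top_alt), _ = mutation_counts.most_common(1)[0]
with pysam.AlignmentFile("aln.sorted.bam", "rb") as bam, \
     pysam.AlignmentFile("top_mutation.bam", "wb", template=bam) as out:
    for read in bam.fetch():
        if read.query_sequence is None:
            continue
        pairs = {r: q for q, r in read.get_aligned_pairs(matches_only=True)}
        qpos = pairs.get(top_pos)
        if qpos is not None and read.query_sequence[qpos] == top_alt:
            out.write(read)
```
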
Next, in order to make those predictions, we needed synthetic data to develop and test against. So we wrote a script to quickly generate millions of short reads: given a reference sequence, a list of mutations, and their mutation rates, it produces a FASTA file of mutated reads and, for convenience, a text file containing the protein encoded by each read.
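
A rough sketch of such a generator is shown below. The reference fragment, mutation list, rates, read length, and read count are all illustrative placeholders, and the protein translation uses Biopython.

```python
# A minimal sketch of the synthetic-read generator described above. All values
# here are placeholders; the team's actual script may sample mutations differently.
import random
from Bio.Seq import Seq

REFERENCE = "ATGGATGCAGAATTCCGACATGACTCAGGATATGAAGTTCATCATCAAAAATTGGTGTTC"  # illustrative fragment
MUTATIONS = {10: ("T", 0.05), 25: ("G", 0.20), 47: ("G", 0.10)}  # position -> (alt base, rate)
READ_LENGTH = 36
N_READS = 1000   # scaled down from "millions" for this sketch

with open("synthetic_reads.fasta", "w") as fasta, open("proteins.txt", "w") as proteins:
    for i in range(N_READS):
        seq = list(REFERENCE)
        for pos, (alt, rate) in MUTATIONS.items():
            if random.random() < rate:
                seq[pos] = alt
        mutated = "".join(seq)
        # Take a random window of the mutated sequence as one short read.
        start = random.randint(0, len(mutated) - READ_LENGTH)
        fasta.write(f">read_{i}\n{mutated[start:start + READ_LENGTH]}\n")
        # Record the protein encoded by the full mutated sequence for convenience.
        proteins.write(str(Seq(mutated).translate()) + "\n")
```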

 

We also generated a plot for each timepoint showing the fraction of reads that conserved each amino acid of the reference protein. The spikes in these plots correspond to amino acids encoded by multiple codons, which makes them more likely to be conserved even when the underlying nucleotide sequence is mutated. Here is one such plot:

[Figure: Fraction of reads conserving each amino acid of the reference protein at one timepoint (https://static.igem.org/mediawiki/2018/5/56/T--McMaster--NGS3.png)]

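The sketch below shows one way such a plot could be produced with matplotlib from the proteins.txt file written by the generator sketch above; the plots on this page may have been produced differently.

```python
# A minimal sketch of the per-residue conservation plot for a single timepoint.
import matplotlib.pyplot as plt

REFERENCE_PROTEIN = "MDAEFRHDSGYEVHHQKLVF"   # translation of the example reference above

with open("proteins.txt") as fh:
    proteins = [line.strip() for line in fh if line.strip()]

# Fraction of reads conserving each amino acid of the reference protein.
fractions = []
for i, ref_aa in enumerate(REFERENCE_PROTEIN):
    conserved = sum(1 for p in proteins if len(p) > i and p[i] == ref_aa)
    fractions.append(conserved / len(proteins))

plt.bar(range(1, len(REFERENCE_PROTEIN) + 1), fractions)
plt.xlabel("Amino acid position")
plt.ylabel("Fraction of reads conserving the residue")
plt.title("Amino acid conservation at one timepoint")
plt.savefig("conservation_t1.png", dpi=150)
```
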
Finally, we had to make the actual predictions. We refrained from using more sophisticated machine learning methods like neural networks because we were not certain whether a network trained purely on our synthetic data, with no real data for testing or validation, would generalize to real data. While pre-trained networks are useful for identifying broad-level features (5, 6), and while there are regularization methods to minimize overfitting (7), we simply had no guarantee that any of the features in our synthetic data would be present in the real data. So, we used basic linear regression from scikit-learn (8) to predict the fraction of reads that would conserve each amino acid of the reference protein. Here is a plot showing synthetic data for three timepoints and a predicted fourth timepoint:

[Figure: Synthetic conservation data for three timepoints and the predicted fourth timepoint (https://static.igem.org/mediawiki/2018/7/76/T--McMaster--NGS4.png)]

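A minimal sketch of this prediction step with scikit-learn's LinearRegression is shown below. The conservation fractions are illustrative values, not the team's data; in practice each row would come from the per-timepoint analysis above.

```python
# A minimal sketch of extrapolating per-residue conservation to a future timepoint.
import numpy as np
from sklearn.linear_model import LinearRegression

# rows = timepoints 1-3, columns = amino acid positions (illustrative values only)
conservation = np.array([
    [0.98, 0.95, 0.99, 0.90, 0.97],
    [0.95, 0.90, 0.98, 0.82, 0.94],
    [0.93, 0.86, 0.97, 0.75, 0.92],
])
timepoints = np.array([[1], [2], [3]])

# Fit one linear trend per position and extrapolate to timepoint 4.
model = LinearRegression().fit(timepoints, conservation)
predicted_t4 = np.clip(model.predict(np.array([[4]]))[0], 0.0, 1.0)  # keep fractions in [0, 1]
print(np.round(predicted_t4, 3))
```
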
Our future goal for this model is to refine it based on real data from the wet lab team. If there are broad-level features which can be mimicked by synthetic data, then it might be feasible to implement deep learning by pretraining on synthetic data, and then training and testing the model on real data. Ultimately, this model should help to identify the specific nucleotides implicated in amyloid plaque formation.


References

  1. Li H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics. 2018 Sep 15;34(18):3094–100.
  2. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009 Aug 15;25(16):2078–9.
  3. Robinson JT, Thorvaldsdóttir H, Winckler W, Guttman M, Lander ES, Getz G, et al. Integrative Genomics Viewer. Nature Biotechnology. 2011 Jan;29(1):24–6.
  4. Heger A, Jacobs K, et al. Pysam [software]. GitHub; 2009. Available from: https://github.com/pysam-developers/pysam
  5. Erhan D, Bengio Y, Courville A, Manzagol PA, Vincent P, Bengio S. Why does unsupervised pre-training help deep learning? Journal of Machine Learning Research. 2010;11(Feb):625–60.
  6. Yosinski J, Clune J, Bengio Y, Lipson H. How transferable are features in deep neural networks? Advances in Neural Information Processing Systems. 2014. p. 3320–8.
  7. Srivastava N, Hinton G, Krizhevsky A, Sutskever I, Salakhutdinov R. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research. 2014;15(1):1929–58.
  8. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit-learn: machine learning in Python. Journal of Machine Learning Research. 2011;12(Oct):2825–30.