Team:McMaster/Model

Dry Lab Project: Literature Search

After working out the basics of the wet lab experiment, we knew that we needed to find small proteins that aggregate into plaques. So, the dry lab team wrote a script that searched online protein databases for annotations matching user-provided search queries, and filtered out any proteins that exceeded a user-specified length. Using those search queries and the resulting protein names, the script also returned a list of relevant studies from PubMed, which we used to identify appropriate proteins for the experiment. This informed the wet lab team’s decisions and future plans.
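
The filtering step can be illustrated with a minimal sketch. It is not the team's script: the `ProteinRecord` type, the `search_proteins` function, and the toy records are hypothetical, and the sketch assumes the database annotations have already been downloaded as plain text.

```python
from dataclasses import dataclass


@dataclass
class ProteinRecord:
    """Hypothetical container for one database entry."""
    name: str
    annotation: str
    sequence: str


def search_proteins(records, queries, max_length):
    """Keep records whose annotation mentions any query term
    and whose sequence is no longer than max_length residues."""
    lowered = [query.lower() for query in queries]
    hits = []
    for record in records:
        text = record.annotation.lower()
        short_enough = len(record.sequence) <= max_length
        if short_enough and any(query in text for query in lowered):
            hits.append(record)
    return hits


# Toy records, invented purely for illustration.
records = [
    ProteinRecord("toy_peptide_A", "forms amyloid plaques in vitro", "DAEFRHDSGY"),
    ProteinRecord("toy_protein_B", "amyloid-associated, large", "M" * 500),
    ProteinRecord("toy_peptide_C", "unrelated transport peptide", "MKTLLV"),
]

hits = search_proteins(records, ["amyloid", "plaque"], max_length=100)
print([record.name for record in hits])  # ['toy_peptide_A']
```

The names of the surviving proteins could then be combined with the original queries to search PubMed, which this sketch leaves out.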

Dry Lab Project: NGS Modelling

The wet lab team worked closely with us to help us understand the data that they expected to provide: millions of short reads of mutated sequences derived from amyloid beta cDNA, separated into distinct timepoints. If the experiment worked as intended, patterns in the data would become apparent, and would strengthen with time.

 

Our predictive model was designed to identify these patterns (given a few timepoints’ worth of data) and predict the shape of the data in future timepoints (whether or not a given protein was conserved, whether a specific mutation was favoured, and so on). If these predictions matched the results from the later timepoints of the experiment, then we would have validated the model’s accuracy, i.e. its ability to predict reality.

 

First, given millions of short reads in FASTA or FASTQ format, we needed to align them to the correct places on the reference genome, in order to identify whether a given nucleotide in a read originated from a newly introduced mutation or from the unmutated sequence. We performed alignment using Minimap2 (1), a recently developed aligner that its authors report to be faster than older, more commonly used aligners like Bowtie 2 and BWA-MEM, at comparable accuracy on short reads. We manipulated the resulting output using Samtools (2), making slight modifications to allow the data to be easily visualized with the Integrative Genomics Viewer (3), as shown here:


 
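For readers who want to reproduce the alignment step, the sketch below assembles one plausible sequence of commands. It is an assumption about what the "slight modifications" were: sorting and indexing the alignments, which the Integrative Genomics Viewer expects. The function name and file names are hypothetical, and the commands are only printed, not executed.

```python
import shlex


def alignment_commands(reference, reads, prefix):
    """Build (but do not run) commands that align short reads with
    minimap2, then sort and index the result with samtools."""
    sam = f"{prefix}.sam"
    bam = f"{prefix}.sorted.bam"
    return [
        # -a: output SAM; -x sr: minimap2's short-read preset
        ["minimap2", "-ax", "sr", "-o", sam, reference, reads],
        # Sort alignments by coordinate and write BAM
        ["samtools", "sort", "-o", bam, sam],
        # Create the .bai index next to the sorted BAM
        ["samtools", "index", bam],
    ]


commands = alignment_commands("reference.fasta", "timepoint1.fastq", "timepoint1")
for command in commands:
    print(shlex.join(command))
```

Each list could be passed to `subprocess.run(command, check=True)` on a machine where both tools are installed.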

We also wrote scripts using Pysam (4) to identify the most common mutations, and then to isolate the reads carrying these mutations so that we could investigate them further, as shown here:


 
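The counting and isolation logic can be sketched without Pysam. The example below is a simplified stand-in, not the team's script: it assumes each aligned read has already been reduced to a 0-based start position and a gap-free sequence (no insertions, deletions, or clipping), and the function names and toy data are hypothetical.

```python
from collections import Counter


def count_mutations(reference, aligned_reads):
    """Count mismatches as (position, reference base, read base).

    aligned_reads is an iterable of (start, sequence) pairs with a
    0-based start and a gap-free alignment (an assumption made here).
    """
    counts = Counter()
    for start, sequence in aligned_reads:
        for offset, base in enumerate(sequence):
            position = start + offset
            if position < len(reference) and base != reference[position]:
                counts[(position, reference[position], base)] += 1
    return counts


def reads_with_mutation(aligned_reads, mutation):
    """Return the reads that cover the mutated position and carry the variant base."""
    position, _reference_base, variant = mutation
    return [
        (start, sequence)
        for start, sequence in aligned_reads
        if start <= position < start + len(sequence)
        and sequence[position - start] == variant
    ]


reference = "GATGCAGAATTC"  # toy reference
aligned_reads = [
    (0, "GATGCA"),  # matches the reference
    (2, "TGCTGA"),  # A->T at position 5
    (4, "CTGAAT"),  # A->T at position 5
    (6, "GAATTG"),  # C->G at position 11
]

counts = count_mutations(reference, aligned_reads)
top_mutation, top_count = counts.most_common(1)[0]
print(top_mutation, top_count)  # (5, 'A', 'T') 2
print(reads_with_mutation(aligned_reads, top_mutation))
```

With real BAM files, the start position and sequence of each read would come from the alignment records instead of a hand-written list.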

Next, in order to make predictions, we needed synthetic data to help inform those predictions. So, we wrote a script to quickly generate millions of short reads. The script takes a reference sequence, a list of mutations, and a set of mutation rates, and generates a FASTA file containing mutated reads. For convenience, it also outputs a text file containing the protein encoded by each read.
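
A generator of this kind might look roughly like the sketch below. Several details are assumptions made for illustration: each mutation is a (position, new base) pair with its own rate, every read spans the whole toy reference, and the output is returned as strings instead of being written to files. All function names are hypothetical.

```python
import itertools
import random

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(itertools.product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate complete codons of a DNA string into one-letter amino acids."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def generate_reads(reference, mutations, rates, n_reads, seed=0):
    """Apply each (position, base) mutation to each read with its own probability."""
    rng = random.Random(seed)
    reads = []
    for _ in range(n_reads):
        bases = list(reference)
        for (position, new_base), rate in zip(mutations, rates):
            if rng.random() < rate:
                bases[position] = new_base
        reads.append("".join(bases))
    return reads


def to_fasta(reads):
    """Format reads as FASTA text with sequential read names."""
    return "".join(f">read_{i}\n{read}\n" for i, read in enumerate(reads))


reference = "GATGCAGAATTC"  # toy reference encoding D, A, E, F
mutations = [(1, "T"), (5, "G")]  # GAT->GTT (D->V); GCA->GCG (synonymous)
reads = generate_reads(reference, mutations, rates=[0.3, 0.05], n_reads=5)

print(to_fasta(reads), end="")
print("\n".join(translate(read) for read in reads))
```

The second mutation changes a codon without changing the amino acid, so reads that carry it still translate to the reference protein.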

 

We also generated a plot for each timepoint, showing the fraction of reads that conserved each amino acid from the reference protein. The spikes in this plot correspond to amino acids that are encoded by multiple different codons, which makes them more likely to be translated correctly even when the underlying nucleotide sequence is mutated. Here is one such plot:


 
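The quantity plotted for each position can be computed as in the following minimal sketch. It assumes every read has already been translated into a protein string of the same length as the reference protein; the function name and the toy proteins are hypothetical.

```python
def conserved_fractions(reference_protein, proteins):
    """For each position, return the fraction of proteins that carry
    the same amino acid as the reference protein."""
    matches = [0] * len(reference_protein)
    for protein in proteins:
        for i, (reference_aa, observed_aa) in enumerate(zip(reference_protein, protein)):
            if observed_aa == reference_aa:
                matches[i] += 1
    return [count / len(proteins) for count in matches]


reference_protein = "DAEF"
proteins = ["DAEF", "VAEF", "DAEF", "DAKF"]  # toy translations of four reads

fractions = conserved_fractions(reference_protein, proteins)
print(fractions)  # [0.75, 1.0, 0.75, 1.0]
```

Plotting `fractions` against position, one curve per timepoint, gives the kind of plot described above.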

Finally, we had to make the actual predictions. We refrained from using more sophisticated machine learning algorithms like neural networks, because we were not certain whether a neural network trained purely on our synthetic data (with no real data for testing/validation) would be able to generalize to real data. While pre-trained networks are useful for identifying broad-level features (5, 6), and while there are regularization methods to minimize overfitting (7), we simply had no guarantee that any of the features in our synthetic data would be present in the real data. So, we used basic linear regression from scikit-learn (8) to predict the fraction of reads that would conserve each amino acid from the reference protein. Here is a plot showing synthetic data for three timepoints and a predicted fourth timepoint:


 
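The prediction step can be sketched as one straight-line fit per amino acid position. To keep the example dependency-free, it uses the closed-form least-squares solution in plain Python in place of scikit-learn; for a single predictor this should give the same line that ordinary least squares produces. Clipping predictions to the range 0 to 1 is an added assumption, and the function names and toy fractions are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict_next(timepoints, fractions_by_timepoint, next_timepoint):
    """Extrapolate the conserved fraction at each position to a later timepoint.

    fractions_by_timepoint holds one list of per-position fractions
    for each entry in timepoints.
    """
    predictions = []
    for series in zip(*fractions_by_timepoint):  # one series per position
        slope, intercept = fit_line(timepoints, series)
        value = slope * next_timepoint + intercept
        predictions.append(min(1.0, max(0.0, value)))  # keep within [0, 1]
    return predictions


timepoints = [1, 2, 3]
fractions_by_timepoint = [
    [1.0, 0.9, 0.5],  # timepoint 1: fractions for positions 0, 1, 2
    [1.0, 0.8, 0.5],  # timepoint 2
    [1.0, 0.7, 0.5],  # timepoint 3
]

predicted = predict_next(timepoints, fractions_by_timepoint, next_timepoint=4)
print(predicted)  # approximately 1.0, 0.6, 0.5
```

A straight line cannot capture saturation or sudden shifts, which is one reason the fit would need to be revisited once real data are available.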

Our future goal for this model is to refine it based on real data from the wet lab team. If there are broad-level features which can be mimicked by synthetic data, then it might be feasible to implement deep learning by pretraining on synthetic data, and then training and testing the model on real data. Ultimately, this model should help to identify the specific nucleotides implicated in amyloid plaque formation.


References

  1. Li H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics. 2018 Sep 15;34(18):3094–100.
  2. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009 Aug 15;25(16):2078–9.
  3. Robinson JT, Thorvaldsdóttir H, Winckler W, Guttman M, Lander ES, Getz G, et al. Integrative Genomics Viewer. Nature Biotechnology. 2011 Jan;29(1):24–6.
  4. Heger A, Jacobs K, et al. Pysam. GitHub. 2009. Available from: https://github.com/pysam-developers/pysam
  5. Erhan D, Bengio Y, Courville A, Manzagol PA, Vincent P, Bengio S. Why does unsupervised pre-training help deep learning? Journal of Machine Learning Research. 2010;11(Feb):625–60.
  6. Yosinski J, Clune J, Bengio Y, Lipson H. How transferable are features in deep neural networks? Advances in Neural Information Processing Systems 27. 2014. p. 3320–8.
  7. Srivastava N, Hinton G, Krizhevsky A, Sutskever I, Salakhutdinov R. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research. 2014 Jan 1;15(1):1929–58.
  8. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit-learn: machine learning in Python. Journal of Machine Learning Research. 2011;12(Oct):2825–30.