Team:TUDelft/Model/Software

Modeling Software Tool

Overview

As part of our sequence verification method, we need a means to quickly and accessibly classify sequenced DNA as natural EPO genes, artificial EPO genes, or other DNA. We created a software tool in R, an open-source programming language that can be augmented with packages that are convenient for working with genomic data (R core team, 2015). The tool can perform the classification for any gene based on comparison of the input sequence to a custom database. Our database contains natural sequences with variations corresponding to misreads, and artificial sequences with codon mutations around exon-exon junctions. Natural genes do not have exon-exon junctions, which allows their distinction from artificial DNA. Our tool calculates alignment scores, based on which an input sequence (fasta file or typed sequence) is classified using the k-nearest neighbor machine learning algorithm (Torgo et al., 2010). The database accommodates every new artificial EPO DNA that is analyzed. In this way, evolved EPO gene doping sequences update the database resulting in ever-increasing accuracy. The second branch of software we developed is able to find a minimal set of guide RNAs (gRNAs) to detect the plethora of variations that can be made to a gene. The repository to our software can be found here.

Timeline Figure 1. Schematic of our software tool.

Our software tool has two main goals:
- Distinguish gene doping sequences from endogenous DNA.
- Design a minimal set of gRNAs to most efficiently detect the high amount of variance that can be present in gene doping genes.


The tool we developed meets these goals and has the following features:
- Our tool can currently tell apart more than 80% of gene doping sequences from any other DNA.*
- Our tool allows us to reduce the required guide RNAs to detect gene doping from 10^104 to only 12 guide RNAs.


*Due to the implementation of a machine learning algorithm the tool will develop even better abilities towards distinguishing these sequences in the future.

1. Sequence Classification Tool

Method

Database set-up

There is a wide variety of genes that are potential targets for gene doping. Our main focus is on the Erythropoietin (EPO) gene, which is one of the genes that is most likely to be used for gene doping, according to professor Hidde Haisma, gene doping expert at the University of Groningen. Our software tool can easily be extended to any other gene. Our custom database contains three groups of reference sequences: endogenous human EPO DNA, gene doping EPO DNA and DNA from endogenous sources other than EPO.

The reference sequences for normal human EPO DNA are the actual human EPO sequence (Clark et. al., 2016) together with 300 sequences that we simulated with 6% false reads, which has been specified as the accuracy of the MinIon device in the case of a single read-through that we use in our method (Loman et al., 2015).

We loaded many possible gene doping EPO DNA sequences. To start these include human EPO cDNA (Clark et. al., 2016) and its codon substituted versions around the junctions. We also included animal cDNA EPO sequences from Genbank (Clark et. al., 2016) as possible EPO doping based on a suggestion by Eli van der Sluis, technician at the Bionanoscience Department at Delft University of Technology.

Furthermore, we included sequences designed by our team which involve further codon mutations. We do not include all codon substituted versions, because there would be roughly 10^104 of those, which is computationally impossible to generate and unnecessary. Therefore, we provide the algorithm with a basic set of sequences that contain all codon variations within 9 bps around the first three junctions, a set of 72 sequences. The code for the generation of these sequences is given at downloads.

Directed learning will take place after in real life. Thereby, the database is further extended with sequences that appear in athlete samples and might give a prediction from where further development will take place.

Additionally, we added 30 DNA strings of 150 bps as DNA from other endogenous sources than EPO. The 30 DNA strings were randomly chosen fragements from four other human genes: ACTN3 (Beggs A.H. et al., 1992), Growth Hormone (Seeburg, P.H., 1982), Vascular Endothelial Growth Factor (Tischer, E. et al., 1991) and Oxytocin Receptor (Inoue, T. et al., 1994). The chosen genes all have GC contents that are similar to the human EPO gene (57%)(Clark et. al., 2016, Biologics International Corp, 2018). Therefore, they are especially important genomic DNA fragments to use for classification reference as they would be hardest to distinguish based on combinatorics. The reference genes were furthermore all chosen to be of length 150 bps, since that is the expected cell free DNA length for genomic DNA (Fan et al., 2008, Lo et al., 2010). In this way we are able to accurately distinguish between gene doping DNA and DNA that does not code for EPO as is described in results.

Data Simulation

To train our gene doping detection algorithm, we need to accurately predict the data our algorithm would come into contact with in a working setting. We need to take into account both the type of DNA and the length of DNA. We generated all sequences using R (R core team, 2015). Packages used for data simulation are seqinR (Charif and Lobry, 2007), Rbioinf by (Gentleman, 2018), Biostrings (H. Pages et al.) and Magicfor (K. Makiyama, 2016).


We generated DNA in sizes that are also expected when sequencing with Oxford Nanopore technologies (Li et. al., 2018). The code takes as input a fasta file containing DNA, the amount of cut sequences that need to be generated and a minimal length threshold that can be set to prevent sequences below a certain length being generated. When using this code, pay attention to the fact that it assumes the origin of the sequence as being human. In this case, an exponential distribution can be assumed. See Li et. al., 2018, for distributions as a result of other organisms. The output sequences are generated at a certain point in the gene and for a certain length. The starting point of the sequences is determined using a random distribution normalised to the length of the input sequence. The length of the sequences is determined using an exponential distribution normalised to the length of the input sequence, with the sequences below the minimal length treshold removed from the list. Then the starting point and sequences length are both randomly sampled for the amount of times as output sequences are required, creating the output sequences. These are saved into one fasta file of multiple sequences. See sequence cutter at the bottom of this page for the code.

Alignment and Scoring

Based on the reference sequences, we align input sequences in two steps. Firstly, we align the input sequence with a selection of the natural human human EPO and it’s 6% misread (Loman et al., 2015) versions to see whether the sequence is the natural EPO or a breakdown component of this. Subsequently, if we find the sequence is non-natural human EPO, we add a second step in which we align the input sequence to the doping EPO sequences in the database. In each step the average scores are compared and the machine learning algorithm k-nearest neighbors (Torgo et al., 2010) is used to classify the outcome scores. These scores are determined based on a combination of two main alignment algorithms: the Smith-Waterman Algorithm and the Needleman-Wunsch Algorithm that were build in the BioStrings package in R (Pagès H, et al., 2018). Both determine local and global sequences alignment respectively.

The Smith-Waterman Algorithm (Smith, et al., 1981) is a dynamic programming algorithm that guarantees to find the optimal local alignment within the scoring system used. One starts with the two sequences that need to be aligned, that are put on the axis of the empty scoring matrix $H$. Then a series of 4 subsequent steps follows.

1. At this stage a substitution matrix is chosen that includes the scores adhered to aligning and non-aligning bases (in case of nucleotide sequences as we use here). In our case we chose the substitution matrix $S(a,b)$ as follows.

$$S(a,b)= \left\{ \begin{array}{ll} 1 & for & a_{i}= b_{j} \\ -3 & for & a_{i}\neq b_{j} \\ \end{array} \right. $$



To account for insertions and deletions, another input required in this algorithm is the gap extension W_n, which is the penalty for a gap with length n. The gap penalty is the point reduction for each gap in the alignment. We use an affine gap penalty, which is subdivided into a gapOpening penalty for a first gap induction, and a gapExtension penalty for every base that the gap is extended for.

$$W_{n}=C*n+D, (C>0,D>0)$$



$$ gapOpening = D = 5$$
$$gapExtension = C = 2 $$



2. Constructing a scoring matrix

Subsequently, the scoring matrix $H$ is constructed. The matrix $H$ consists of $(p+1)*(q+1)$ dimensions for sequences of length $p$ and $q$ respectively.

$$ H_{n0} = H_{0m} = 0, for 0 \leq n \leq p and 0 \leq m \leq q $$


3. Filling the scoring matrix

The scoring matrix is filled as follows, where $H_{i-1,j-1}+s(a_{i},b_{j})$ gives the score of aligning elements, $H_{i-n,j}-W_{n}$ gives the score for aligning element $a_{i}$ at the end of a gap of length $n$, and $H_{i,j-m}-W_{m}$ gives the score for $b_{j}$ at the end of a gap length $m$. Lastly, $0$ is for elements that do now show any similarity.

$$ H_{i,j}= \begin{cases} \ max(H_{i-1,j-1}+s(a_{i},b_{j}), & (1\leq i\leq p, 1\leq j\leq q) \\ max_{n≥1} (H_{i-n,j}-W_{n}) \\ max_{m≥1} (H_{i,j-m}-W_{m})\\ 0 \end{cases} $$



4. Tracing back

After filling the complete scoring matrix $H$, a traceback ensues starting at the highest score in the matrix and ending in a cell with score $0$. To obtain a second best alignment, the same is done for the second highest score.

The Needleman-Wunsch (Needleman, et al., 1970) algorithm differs from the Smith-Waterman algorithm (Smith, et al., 1981) on initialisation, scoring and traceback and computes global instead of local alignment. In the Needleman-Wunsch algorithm, the first row and first column are subject to the gap penalty. Also, in the Needleman-Wunsch algorithm, the scores can be negative. Lastly, the traceback always starts at the lower right corner of the matrix and ends at the upper left corner.

Combination of Scoring

We use the Smith-Waterman algorithm (Smith, et al., 1981) and the Needleman-Wunsch algorithm (Needleman, et al., 1970) to compute the local and global alignment of the input sequence with respect to our custom database. Additionally, we use a hybrid form, semi-global alignment that has been implemented in the BioStrings package in R (Pagès H, et al., 2018). We then study the effect of each alignment algorithm on the gene doping classification. We add all alignment scores for a sequence to give a final total alignment score. We varied the weights of the different alignment scores to identify their influence on the characterisation of gene doping and found varying outcomes, as can be seen in the results section.

Machine Learning

We use the k-nearest neighbor algorithm to classify the input sequence as natural EPO, artificial EPO or other DNA. The k-nearest neighbor algorithm is a non-parametric pattern recognition method that can classify elements based on their value compared to direct nearest neighbors. In this instance, we chose to classify based on the four nearest neighbours. If no distinction can be made based on the four neighbors, the algorithm as has been implemented in R (Torgo, L., 2010) will tell. The classification is carried out in two steps. Firstly, the input sequence is aligned to natural EPO DNA. In a second step, the input DNA is aligned to gene doping DNA, resulting in the distinction between gene doping DNA and any other non-EPO DNA.

The k-NN algorithm (Torgo, L., 2010) takes a training and testing set as input, where the training set is the database. The most important input parameters are:



- $$Cl1$$ is a vector with classification labels, in which the number of elements match the number of classes in the training data set. This means that this vector contains elements corresponding to the amount of classes and the amount of corresponding elements to be classified in these classes for the training data set. In this case $$cl1$$ is a vector of factors that are ‘false’ for the first $$45$$ elements of the normalized_score matrix, meaning classified as gene doping, and ‘true’ for the subsequent elements, meaning not classified as gene doping.

- $$k$$ is the amount of nearest neighbors taken into account per point to judge the classification.

- Less than $$k-l$$ contradicting votes are allowed. By setting $$l=0$$, we make sure that sequenes will only be classified as one of each group when all k nearest neighbors of that sequence belong to the same category.

We opened up our sequence classification tool to people lacking bioinformatics knowledge by the creation of a Graphical User Interface that can be used to input hand-typed sequences. These sequences are then stored in an Excel file and can then be run through the subsequent stages of the program. Alternatively, users can load fasta files, the general output of sequencing systems.

Based on the input coming from loaded sequences, the alogrithm accommodates new gene doping sequences it finds into the database. Thereby, the tool can evolve with the sequences designed by gene doping creators.

Accurate alignment algorithms are generally computationally intensive. Our algorithm also runs every input sequence along the sequences in the expanding custom database. Therefore, computation time may increase in time. At present, it takes approximately 10 minutes to run the code for 5 input sequences on 4 parallel processor cores when the workspace has been downloaded. Complete running without workspace takes around 1.5 hours on 6 parallel processor cores. To speed up the running time, we therefore include the basic workspace. The code checks whether the core elements in the workspace are present and determines what parts to still run based on this. Also, we included packages for parallel core running (Calaway, R., 2017, Weston, S., 2017, Calaway, R., 2018). In the future, we recommend complementation of the code with a pattern recognition based sequence choice to align with to prevent extreme increases in running time.

Results and Conclusions

Based on the scores obtained in the steps described before, we constructed figure 3. In figure 3, the classification in the three categories, natural EPO, artificial EPO and other DNA, is displayed. Here, the alignment scores to natural human DNA are displayed on the x-axis and the alignment scores to artificial EPO are put on the y-axis. The blue sector shows the natural human EPO sequence and derivatives with a 6% misread rate (Loman et al., 2015) resulting from sequencing. All have a positive alignment score to natural EPO.

The green points represent the DNA sequences of 150 bps from the genes other than EPO. The red points are then de gene doping DNA which consists of a diverse set of codon varied EPO sequences as well as animal EPO sequences.

From figure 3, it becomes apparent we chose the k-nearest neighbor algorithm because of the local structure that make the algorithm should be responsive to correctly classify the relatively spread group of EPO gene doping DNA. Additionally, the value of a learning algorithm is highlighted because of the variable properties of gene doping DNA.

Data Data
Data Data
Figure 3. The classification of natural EPO with 6% misread rate (Loman et al., 2015), artificial EPO and any other DNA based on alignment scores with natural EPO and artificial EPO sequences in the database. Left top based on equal weighting of local, global and semi-global alignment scores respectively. Right top: 10-1-1, Left bottom:1-10-1, Right Bottom:1-1-10.

From figure 3 and the data it can be deducted that especially the global alignment score is an important factor in distinguishing gene doping DNA. Therefore, we recommend to further optimize the weights starting with relatively higher global alignment weights.

We challenged our method at the Hackathon, where participants believed that the optimal doping strategy would be to design sequences on the boundary between doping sequences and other DNA. However, these sequences should then still result in functional EPO proteins, which will entail highly extensive further research and is a questionable effort.

Within the team we also had a competition to beat the algorithm by doping sequence design. Many of us submitted a sequence and the algorithm was able to distinguish more than 80% of them as gene doping. Additionally, the input of any other natural DNA than EPO has always been classified as such. All in all, the algorithm has a great predictive power. The extending database will in the future only even extend this power, which makes us confident that gene doping can in this way rightly be distinguished.

Further Applications

The diagnosis early stage diagnosis of frontotemporal dementia and parkinsonism have been a big challenge. When these defects are a result of FTDP-17 defects, that may lead to this disease is a junction defect at exon 10, whereby splicing is complicated (Buee et al., 2000). After mRNA sequencing, these splicing complications may be included in a database based on which our software can tell whether a patient should be diagnosed with these diseases as one example.

2. gRNA Array Minimization Tool

Downloads

References

  1. Beggs A.H. et al.(1992). Cloning and characterization of two human skeletal muscle alpha-actinin genes located on chromosomes 1 and 11. J. Biol. Chem. 267 (13), 9281-9288.
  2. Biologics International Corp (2018). GC Content Calculator. Used on 20-09-2018 at: https://www.biologicscorp.com/tools/GCContent#.W8W_0hMzY1I.
  3. Bryan, J. and Zhao, J. (2018). Manage Google Spreadsheets from R. R package version 0.3.0.
  4. Buee L. et al. (2000). Tau protein isoforms, phosphorylation and role in neurodegenerative disorders. Brain Res. Brain Res. Rev. 33:95–130.
  5. Calaway, R. (2017). R package doSNOW. License: GPL2.
  6. Calaway, R. (2018). Foreach Parallel Adaptor for the 'parallel' Package. License: GPL2.
  7. Chang, W. and Borges Ribeiro, B. (2018). shinydashboard: Create Dashboards with 'Shiny'. R package version 0.7.0.
  8. Charif D., Lobry J.R. (2007) SeqinR 1.0-2: A Contributed Package to the R Project for Statistical Computing Devoted to Biological Sequences Retrieval and Analysis. In: Bastolla U., Porto M., Roman H.E., Vendruscolo M. (eds) Structural Approaches to Sequence Evolution. Biological and Medical Physics, Biomedical Engineering. Springer, Berlin, Heidelberg.
  9. Clark, K., Karsch-Mizrachi, I., Lipman, D. J., Ostell, J., and Sayers, E. W. (2016). GenBank. Nucleic Acids Research, 44(Database issue), D67–D72. http://doi.org/10.1093/nar/gkv1276.
  10. Durinck, S. (n.d.). biomaRt v2.28.0. Interface to BioMart databases.
  11. Fan, H.C. et al. (2008). Noninvasive diagnosis of fetal aneuploidy by shotgun sequencing DNA from maternal blood.Proc. Natl. Acad. Sci. USA. 105: 16266-16271.
  12. Gentleman, R. (2018). RBioinf: RBioinf. R package version 1.40.0.
  13. Inoue, T. et al. (1994). Structural organization of the human oxytocin receptor gene. J. Biol. Chem. 269 (51), 32451-32456.
  14. Li, Y., et al. (2018). DeepSimulator: a deep simulator for Nanopore sequencing, Bioinformatics, Volume 34, Issue 17, Pages 2899–2908. https://doi.org/10.1093/bioinformatics/bty223.
  15. Lo, Y.M.D. et al. (2010). Maternal plasma DNA sequencing reveals the genome-wide genetic and mutational profile of the fetus.Sci. Transl. Med. 2: 61ra91.
  16. Loman, N.J. et al. (2015). A complete bacterial genome assembled de novo using only nanopore sequencing data Nat Methods, 12, pp. 733-735.
  17. Makiyama, K. (2016). magicfor: Magic Functions to Obtain Results from for Loops. R package version 0.1.0.
  18. Needleman, S.B. and Wunsch, C.D. (1970). A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology. 48 (3): 443–53. doi:10.1016/0022-2836(70)90057-4.
  19. Pagès H, Aboyoun P, Gentleman R, DebRoy S (2018). Biostrings: Efficient manipulation of biological strings. R package version 2.48.0.
  20. R Core Team (2015). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria.
  21. Ripley, B. D. (1996) Pattern Recognition and Neural Networks.Cambridge.
  22. Seeburg,P.H. (1982). The human growth hormone gene family: nucleotide sequences show recent divergence and predict a new polypeptide hormone. DNA 1 (3), 239-249.
  23. Smith, Temple F. and Waterman, Michael S. (1981). Identification of Common Molecular Subsequences. Journal of Molecular Biology. 147: 195–197. doi:10.1016/0022-2836(81)90087-5.
  24. Tischer, E. et al. (1991). The human gene for vascular endothelial growth factor. Multiple protein forms are encoded through alternative exon splicing. J. Biol. Chem. 266 (18), 11947-11954.
  25. Torgo, L. (2010) Data Mining using R (knn): learning with case studies, CRC Press (ISBN: 9781439810187).
  26. Venables, W. N. and Ripley, B. D. (2002) Modern Applied Statistics with S. Fourth edition. Springer.
  27. Weston, S. (2017). Foreach Parallel Adaptor for the Rmpi Package. License: GPL-2.
  28. Xie, Y. (2018). DT: A Wrapper of the JavaScript Library 'DataTables'. R package version 0.4.