Overview
As part of our sequence verification method, we need a means to quickly and accessibly classify sequenced DNA as natural EPO genes, artificial EPO genes, or other DNA. We created a software tool in R, an open-source programming language that can be augmented with packages that are convenient for working with genomic data (R core team, 2015). The tool can perform the classification for any gene based on comparison of the input sequence to a custom database. Our database contains natural sequences with variations corresponding to misreads, and artificial sequences with codon mutations around exon-exon junctions. Natural genes do not have exon-exon junctions, which allows their distinction from artificial DNA. Our tool calculates alignment scores, based on which an input sequence (fasta file or typed sequence) is classified using the k-nearest neighbor machine learning algorithm (Torgo et al., 2010). The database accommodates every new artificial EPO DNA that is analyzed. In this way, evolved EPO gene doping sequences update the database resulting in ever-increasing accuracy. The second branch of software we developed is able to find a minimal set of guide RNAs (gRNAs) to detect the plethora of variations that can be made to a gene. The repository to our software can be found here.
Our software tool has two main goals:
- Distinguish gene doping sequences from endogenous DNA.
- Design a minimal set of gRNAs to most efficiently detect the high amount of variance that can be present in gene doping genes.
The tool we developed meets these goals and has the following features:
- Our tool can currently tell apart more than 80% of gene doping sequences from any other DNA.*
- Our tool allows us to reduce the required guide RNAs to detect gene doping from 10104 to only 12 guide RNAs.
*Due to the implementation of a machine learning algorithm the tool will develop even better abilities towards distinguishing these sequences in the future.
1. Sequence Classification Tool
Method
Database set-up
There is a wide variety of genes that are potential targets for gene doping. Our main focus is on the Erythropoietin (EPO) gene, which is one of the genes that is most likely to be used for gene doping, according to professor Hidde Haisma, gene doping expert at the University of Groningen. Our software tool can easily be extended to any other gene. Our custom database contains three groups of reference sequences: endogenous human EPO DNA, gene doping EPO DNA and DNA from endogenous sources other than EPO.
The reference sequences for normal human EPO DNA are the actual human EPO sequence (Clark et. al., 2016) together with 300 sequences that we simulated with 6% false reads, which has been specified as the accuracy of the MinIon device in the case of a single read-through that we use in our method (Loman et al., 2015).
We loaded many possible gene doping EPO DNA sequences. To start these include human EPO cDNA (Clark et. al., 2016) and its codon substituted versions around the junctions. We also included animal cDNA EPO sequences from Genbank (Clark et. al., 2016) as possible EPO doping based on a suggestion by Eli van der Sluis, technician at the Bionanoscience Department at Delft University of Technology.
Furthermore, we included sequences designed by our team which involve further codon mutations. We do not include all codon substituted versions, because there would be roughly 10^104 of those, which is computationally impossible to generate and unnecessary. Therefore, we provide the algorithm with a basic set of sequences that contain all codon variations within 9 bps around the first three junctions, a set of 72 sequences. The code for the generation of these sequences is given at downloads.
Directed learning will take place after classification. The database is further extended with sequences that appear in athlete samples. As the detected sequences form a start for gene doping developers to continue development, the addition of these sequences to the database is highly valuable to improve our detection accuracy.
Additionally, we added 30 DNA strings of 150 bps as DNA from other endogenous sources than EPO. The 30 DNA strings were randomly chosen fragements from four other human genes: ACTN3 (Beggs A.H. et al., 1992), Growth Hormone (Seeburg, P.H., 1982), Vascular Endothelial Growth Factor (Tischer, E. et al., 1991) and Oxytocin Receptor (Inoue, T. et al., 1994). The chosen genes all have GC contents that are similar to the human EPO gene (57%)(Clark et. al., 2016, Biologics International Corp, 2018). Therefore, they are especially important genomic DNA fragments to use for classification reference as they would be hardest to distinguish based on combinatorics. The reference genes were furthermore all chosen to be of length 150 bps, since that is the expected cell free DNA length for genomic DNA (Fan et al., 2008, Lo et al., 2010). In this way we are able to accurately distinguish between gene doping DNA and DNA that does not code for EPO as is described in results.
Data Simulation
To train our gene doping detection algorithm, we need to accurately predict the data our algorithm would come into contact with in a working setting. We need to take into account both the type of DNA and the length of DNA. We generated all sequences using R (R core team, 2015). Packages used for data simulation are seqinR (Charif and Lobry, 2007), Rbioinf by (Gentleman, 2018), Biostrings (H. Pages et al.) and Magicfor (K. Makiyama, 2016).
We generated DNA in sizes that are also expected when sequencing with Oxford Nanopore technologies (Li et. al., 2018). The code takes as input a fasta file containing DNA, the amount of cut sequences that need to be generated and a minimal length threshold that can be set to prevent sequences below a certain length being generated. When using this code, pay attention to the fact that it assumes the origin of the sequence as being human. In this case, an exponential distribution can be assumed. See Li et. al., 2018, for distributions as a result of other organisms. The output sequences are generated at a certain point in the gene and for a certain length. The starting point of the sequences is determined using a random distribution normalised to the length of the input sequence. The length of the sequences is determined using an exponential distribution normalised to the length of the input sequence, with the sequences below the minimal length treshold removed from the list. Then the starting point and sequences length are both randomly sampled for the amount of times as output sequences are required, creating the output sequences. These are saved into one fasta file of multiple sequences. See sequence cutter at the bottom of this page for the code.
Alignment and Scoring
Based on the reference sequences, we align input sequences in two steps. Firstly, we align the input sequence with a selection of the natural human human EPO and it’s 6% misread (Loman et al., 2015) versions to see whether the sequence is the natural EPO or a breakdown component of this. Subsequently, if we find the sequence is non-natural human EPO, we add a second step in which we align the input sequence to the doping EPO sequences in the database. In each step the average scores are compared and the machine learning algorithm k-nearest neighbors (Torgo et al., 2010) is used to classify the outcome scores. These scores are determined based on a combination of two main alignment algorithms: the Smith-Waterman Algorithm and the Needleman-Wunsch Algorithm that were build in the BioStrings package in R (Pagès H, et al., 2018). Both determine local and global sequences alignment respectively.
The Smith-Waterman Algorithm (Smith, et al., 1981) is a dynamic programming algorithm that guarantees to find the optimal local alignment within the scoring system used. One starts with the two sequences that need to be aligned, that are put on the axis of the empty scoring matrix $H$. Then a series of 4 subsequent steps follows.
1. At this stage a substitution matrix is chosen that includes the scores adhered to aligning and non-aligning bases (in case of nucleotide sequences as we use here). In our case we chose the substitution matrix $S(a,b)$ as follows.
To account for insertions and deletions, another input required in this algorithm is the gap extension W_n, which is the penalty for a gap with length n. The gap penalty is the point reduction for each gap in the alignment. We use an affine gap penalty, which is subdivided into a gapOpening penalty for a first gap induction, and a gapExtension penalty for every base that the gap is extended for.
$$gapExtension = C = 2 $$
2. Constructing a scoring matrix
Subsequently, the scoring matrix $H$ is constructed. The matrix $H$ consists of $(p+1)*(q+1)$ dimensions for sequences of length $p$ and $q$ respectively.
3. Filling the scoring matrix
The scoring matrix is filled as follows, where $H_{i-1,j-1}+s(a_{i},b_{j})$ gives the score of aligning elements, $H_{i-n,j}-W_{n}$ gives the score for aligning element $a_{i}$ at the end of a gap of length $n$, and $H_{i,j-m}-W_{m}$ gives the score for $b_{j}$ at the end of a gap length $m$. Lastly, $0$ is for elements that do now show any similarity.
4. Tracing back
After filling the complete scoring matrix $H$, a traceback ensues starting at the highest score in the matrix and ending in a cell with score $0$. To obtain a second best alignment, the same is done for the second highest score.
The Needleman-Wunsch (Needleman, et al., 1970) algorithm differs from the Smith-Waterman algorithm (Smith, et al., 1981) on initialisation, scoring and traceback and computes global instead of local alignment. In the Needleman-Wunsch algorithm, the first row and first column are subject to the gap penalty. Also, in the Needleman-Wunsch algorithm, the scores can be negative. Lastly, the traceback always starts at the lower right corner of the matrix and ends at the upper left corner.
Combination of Scoring
We use the Smith-Waterman algorithm (Smith, et al., 1981) and the Needleman-Wunsch algorithm (Needleman, et al., 1970) to compute the local and global alignment of the input sequence with respect to our custom database. Additionally, we use a hybrid form, semi-global alignment that has been implemented in the BioStrings package in R (Pagès H, et al., 2018). We then study the effect of each alignment algorithm on the gene doping classification. We add all alignment scores for a sequence to give a final total alignment score. We varied the weights of the different alignment scores to identify their influence on the characterisation of gene doping and found varying outcomes, as can be seen in the results section.
Machine Learning
We use the k-nearest neighbor algorithm to classify the input sequence as natural EPO, artificial EPO or other DNA. The k-nearest neighbor algorithm is a non-parametric pattern recognition method that can classify elements based on their value compared to direct nearest neighbors. In this instance, we chose to classify based on the four nearest neighbours. If no distinction can be made based on the four neighbors, the algorithm as has been implemented in R (Torgo, L., 2010) will tell. The classification is carried out in two steps. Firstly, the input sequence is aligned to natural EPO DNA. In a second step, the input DNA is aligned to gene doping DNA, resulting in the distinction between gene doping DNA and any other non-EPO DNA.
The k-NN algorithm (Torgo, L., 2010) takes a training and testing set as input, where the training set is the database. The most important input parameters are:
- $Cl1$ is a vector with classification labels, in which the number of elements match the number of classes in the training data set. This means that this vector contains elements corresponding to the amount of classes and the amount of corresponding elements to be classified in these classes for the training data set. In this case $cl1$ is a vector of factors that are ‘false’ for the first $45$ elements of the normalized_score matrix, meaning classified as gene doping, and ‘true’ for the subsequent elements, meaning not classified as gene doping.
- $k$ is the amount of nearest neighbors taken into account per point to judge the classification.
- Less than $k-l$ contradicting votes are allowed. By setting $l=0$, we make sure that sequenes will only be classified as one of each group when all k nearest neighbors of that sequence belong to the same category.
We opened up our sequence classification tool to people lacking bioinformatics knowledge by the creation of a Graphical User Interface that can be used to input hand-typed sequences. These sequences are then stored in an Excel file and can then be run through the subsequent stages of the program. Alternatively, users can load fasta files, the general output of sequencing systems.
Based on the input coming from loaded sequences, the alogrithm accommodates new gene doping sequences it finds into the database. Thereby, the tool can evolve with the sequences designed by gene doping creators.
Accurate alignment algorithms are generally computationally intensive. Our algorithm also runs every input sequence along the sequences in the expanding custom database. Therefore, computation time may increase in time. At present, it takes approximately 10 minutes to run the code for 5 input sequences on 4 parallel processor cores when the workspace has been downloaded. Complete running without workspace takes around 1.5 hours on 6 parallel processor cores. To speed up the running time, we therefore include the basic workspace. The code checks whether the core elements in the workspace are present and determines what parts to still run based on this. Also, we included packages for parallel core running (Calaway, R., 2017, Weston, S., 2017, Calaway, R., 2018). In the future, we recommend complementation of the code with a pattern recognition based sequence choice to align with to prevent extreme increases in running time.
Results and Conclusions
Based on the scores obtained in the steps described before, we constructed figure 2. In figure 2, the classification in the three categories, natural EPO, artificial EPO and other DNA, is displayed. Here, the alignment scores to natural human DNA are displayed on the x-axis and the alignment scores to artificial EPO are put on the y-axis. The blue sector shows the natural human EPO sequence and derivatives with a 6% misread rate (Loman et al., 2015) resulting from sequencing. All have a positive alignment score to natural EPO.
The green points represent the DNA sequences of 150 bps from the genes other than EPO. The red points are the gene doping DNA which consists of a diverse set of codon varied EPO sequences as well as animal EPO sequences.
From figure 2, it becomes apparent we chose the k-nearest neighbor algorithm. The k-nearest neighbours algorithm is good at identifying local structures in datasets, which is essential with the large spread of scores that have been attributed to the gene doping sequences. Additionally, the value of a learning algorithm is highlighted because of the variable properties of gene doping DNA. The algorithm was tested with alignments based on four scoring weight distributions (figures 2-5). For the figures with all weight distributions, please see the dropdown below figure 2.
In order to determine the influence of the alignment algorithm on the classification abilities of the program we varied the weights attached to the scores from each algorithm systematically. The best performing algorithm is shown in figure 2 and adheres most value to the global alignment score. With this weighting, we were able to distinguish more than 80% of the sequences designed by our team and the cyber security specialists at the Hackathon. The classification resulting from the other weights was worse. The following figures display the scoring with the different weighings.
From figure 2-5, and the data it can be deducted that especially the global alignment score is an important factor in distinguishing gene doping DNA. Therefore, we recommend to further optimize the weights starting with relatively higher global alignment weights.
We challenged our method at the Hackathon, where participants believed that the optimal doping strategy would be to design sequences on the boundary between doping sequences and other DNA. It will nevertheless be hard for doping creators to design further sequences that go undetected by our tool, given that the protein should still be functional as well.
Within the team we also had a competition to beat the algorithm by doping sequence design. Many of us submitted a sequence and the algorithm was able to distinguish more than 80% of them as gene doping. Additionally, the input of any other natural DNA than EPO has always been classified as such. All in all, the algorithm has a great predictive power. The extending database will in the future only even extend this power, which makes us confident that gene doping can in this way rightly be distinguished.
Further Applications
The early stage diagnosis of frontotemporal dementia has been a big challenge. Frontotemporal dementia has been associated with splicing defects in gene FTDP-17 (Buee et al., 2000), resulting in mRNA coding for disfunctional proteins. Our tool can be extended to classify the mRNA sequences and thereby may provide many patients with an early stage diagnosis.
2. gRNA Array Minimization Tool
The second software tool we made can be linked to the first one and also used independently. It is a gRNA generation tool with a unique functionality: it generates all possible gRNA’s at a single position, including all codon substituted versions (figure 6).
This is very useful for catching gene dopers, since they might change the codons in their doping gene to avoid detection. It can also be useful for any other application that deals with high amounts of synonymous mutations in the gene they want to find.
For our application, this gRNA generation tool is an addition to the sequence classification tool discussed above. The sequence classification tool adds doping sequences to it’s database. The sequences have either an amino acid sequence which is already known by us, or a different amino acid sequence. If it is a protein sequence which is already known, the database improvement stops there. This is because gRNA’s have already been made for this protein sequence, so generating the gRNA’s again is not necessary.
However, if it is a new protein sequence, this sequence is fed into the gRNA generation tool. New gRNA’s are generated and added to a gRNA database. This databases consists of all gRNA’s which we have designed to catch gene doping athletes. The gRNA’s stored in the database are used to bind to our target with our fusion protein. By updating the database in this way, we increase the chance of catching gene doping users with every run.
Method
We have developed the gRNA generator tool in R, adapted from the gRNA model. The figure below guides you through the steps of the code. For a more detailed description of the tool, please click the dropdown below the figure.
The tool takes as input:
- A .fasta file containing a DNA sequence.
- A vector with target region(s). These are the bps that the tool searches for gRNA positions. They can be exon-exon junctions as used in our application, but also regions in e.g. the active site of the protein.
- Properties of the Cas protein that will be used. These include the PAM sequence (PAM), the prefered gRNA size (gRNA size) and the amount of base pairs in the gRNA that perfectly should match the score (Cas on target). Please note that the tool currently only supports single base pair PAM sequences and assumes the Cas on target is about the bps directly attached to the PAM.
The tool first translates the input gene into an amino acid sequence and into a vector with the amount of possible codons per amino acid position. This vector is used to identify regions with little codon substitutions possible. Using the target region and Cas property input parameters, the target region(s) of the gene is or are scanned for possible gRNA positions.
Combining the information on the possible gRNA locations and the vector with the possible codons, gRNA positions are selected where the least amount of codon substitutions are possible. This is an essential step, because selecting these positions greatly decreases the amount of gRNA’s that need to be generated, while keeping the same amount of (doping) sequences that we can detect. Supplying more or larger target regions is beneficial to the result of the program: the possibility increases that a target region overlaps with a region with low amounts of possible codons. This means it is more likely that a gRNA position is found where only very few gRNA’s need to be generated. Increasing the target region does increase the running time of the program. At these positions, gRNA’s are then generated and stored. This output can be exported as a fasta file.
The tool supports frame-independent translation, PAM identification on both strands and varying gRNA lengths. It currently does not support PAM sequences of more than 1 specific nucleotide or direct input of amino acid sequences. The programming language we used for this tool, R, is an open-source programming language that can be augmented with packages that are convenient for working with genomic data (R core team, 2015). Packages used are seqinR (Charif and Lobry, 2007), Biostrings (H. Pages et. al.) and Magicfor (K. Makiyama, 2016).
We used this tool to generate all possible gRNA’s at exon-exon junctions in EPO, taking into account all codon substitutions possible. The tool is far more versatile, however. It can easily generate gRNA’s at a position of choice for all possible codon substitutions. This can also be useful for other applications that deal with high amounts of synonymous mutations in the gene they want to sequence.
Downloads
The R files coding for the tools described in this page can be found here.
The files for the sequence classification tool consist of the final code (final code sequence classification) and a workspace. Loading this workspace into your R console greatly reduces running time of the sequence classification tool.
The files for the database generation can be found here. They include the random sequence generator, the sequence mutator, the codon substituted junction regions code and the sequence cutter code.
The file for the gRNA generator tool can be found here.
References
- Beggs A.H. et al.(1992). Cloning and characterization of two human skeletal muscle alpha-actinin genes located on chromosomes 1 and 11. J. Biol. Chem. 267 (13), 9281-9288.
- Biologics International Corp (2018). GC Content Calculator. Used on 20-09-2018 at: https://www.biologicscorp.com/tools/GCContent#.W8W_0hMzY1I.
- Bryan, J. and Zhao, J. (2018). Manage Google Spreadsheets from R. R package version 0.3.0.
- Buee L. et al. (2000). Tau protein isoforms, phosphorylation and role in neurodegenerative disorders. Brain Res. Brain Res. Rev. 33:95–130.
- Calaway, R. (2017). R package doSNOW. License: GPL2.
- Calaway, R. (2018). Foreach Parallel Adaptor for the 'parallel' Package. License: GPL2.
- Chang, W. and Borges Ribeiro, B. (2018). shinydashboard: Create Dashboards with 'Shiny'. R package version 0.7.0.
- Charif D., Lobry J.R. (2007) SeqinR 1.0-2: A Contributed Package to the R Project for Statistical Computing Devoted to Biological Sequences Retrieval and Analysis. In: Bastolla U., Porto M., Roman H.E., Vendruscolo M. (eds) Structural Approaches to Sequence Evolution. Biological and Medical Physics, Biomedical Engineering. Springer, Berlin, Heidelberg.
- Clark, K., Karsch-Mizrachi, I., Lipman, D. J., Ostell, J., and Sayers, E. W. (2016). GenBank. Nucleic Acids Research, 44(Database issue), D67–D72. http://doi.org/10.1093/nar/gkv1276.
- Durinck, S. (n.d.). biomaRt v2.28.0. Interface to BioMart databases.
- Fan, H.C. et al. (2008). Noninvasive diagnosis of fetal aneuploidy by shotgun sequencing DNA from maternal blood.Proc. Natl. Acad. Sci. USA. 105: 16266-16271.
- Gentleman, R. (2018). RBioinf: RBioinf. R package version 1.40.0.
- Inoue, T. et al. (1994). Structural organization of the human oxytocin receptor gene. J. Biol. Chem. 269 (51), 32451-32456.
- Li, Y., et al. (2018). DeepSimulator: a deep simulator for Nanopore sequencing, Bioinformatics, Volume 34, Issue 17, Pages 2899–2908. https://doi.org/10.1093/bioinformatics/bty223.
- Lo, Y.M.D. et al. (2010). Maternal plasma DNA sequencing reveals the genome-wide genetic and mutational profile of the fetus.Sci. Transl. Med. 2: 61ra91.
- Loman, N.J. et al. (2015). A complete bacterial genome assembled de novo using only nanopore sequencing data Nat Methods, 12, pp. 733-735.
- Makiyama, K. (2016). magicfor: Magic Functions to Obtain Results from for Loops. R package version 0.1.0.
- Needleman, S.B. and Wunsch, C.D. (1970). A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology. 48 (3): 443–53. doi:10.1016/0022-2836(70)90057-4.
- Pagès H, Aboyoun P, Gentleman R, DebRoy S (2018). Biostrings: Efficient manipulation of biological strings. R package version 2.48.0.
- R Core Team (2015). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria.
- Ripley, B. D. (1996) Pattern Recognition and Neural Networks.Cambridge.
- Seeburg,P.H. (1982). The human growth hormone gene family: nucleotide sequences show recent divergence and predict a new polypeptide hormone. DNA 1 (3), 239-249.
- Smith, Temple F. and Waterman, Michael S. (1981). Identification of Common Molecular Subsequences. Journal of Molecular Biology. 147: 195–197. doi:10.1016/0022-2836(81)90087-5.
- Tischer, E. et al. (1991). The human gene for vascular endothelial growth factor. Multiple protein forms are encoded through alternative exon splicing. J. Biol. Chem. 266 (18), 11947-11954.
- Torgo, L. (2010) Data Mining using R (knn): learning with case studies, CRC Press (ISBN: 9781439810187).
- Venables, W. N. and Ripley, B. D. (2002) Modern Applied Statistics with S. Fourth edition. Springer.
- Weston, S. (2017). Foreach Parallel Adaptor for the Rmpi Package. License: GPL-2.
- Xie, Y. (2018). DT: A Wrapper of the JavaScript Library 'DataTables'. R package version 0.4.