Revision as of 14:37, 1 October 2018

Bioinformatics

Identifying a target

In early discussions about the project, we considered targeting the aphid microbiome using sgRNAs, hoping that the bacteria aphids carry have a CRISPR system that we could hijack. We quickly abandoned this idea after running a short bioinformatic test to check whether the bacteria have a CRISPR system. The results can be seen below, and show that there is no identifiable native CRISPR system in Buchnera aphidicola.

The above screenshot shows "3 Analysed Sequences" as the bacterial genome FASTA sequence was assembled into 3 contigs. This genome can be found on NCBI, or here.

Integrating Human Practices - Protecting Local Fauna

For our human practices, we spoke with several stakeholders about our project and about their general knowledge and opinions of GM. One concern we encountered, especially when talking to retirees, was the well-being of local fauna. We first spoke to an amateur beekeeper, who raised an issue we had not previously considered: wasps feed on aphid honeydew and have been known to attack bees when food is scarce. Any risk to bee populations would make the project unacceptable, so we took the issue to the Welsh Beekeepers’ Association (WBKA). These professional beekeepers quickly informed us that honeybees themselves feed on aphid honeydew. The loss of a food source for these bees is of minor concern; the more immediate and interesting concern is the risk posed by the siRNAs secreted in the aphid honeydew.

To address these concerns and test whether our project is safe for the field, we analysed the binding sites of every potential siRNA produced from our RNAi gene constructs by running a BLASTn of each siRNA against the transcriptomes of aphid predators, bees, and humans. Where transcriptome sequences were not available, we used genomic sequences. It is important to note that siRNAs only bind RNA, so they can only affect transcribed sequences; many of the matches to genomic sequences will therefore not fall in transcribed regions. This is why we used transcriptomic sequences where available.

In addition, genomes were not available for many direct aphid predators. Rather than excluding these species, we used the genome sequence of the closest available relative(s), moving up taxonomic ranks as needed, on the assumption that genomic regions are at least somewhat conserved between these relatives and the actual predators and so still serve as reasonable indicators. Finally, we did not attempt to analyse the genomes of organisms that may eat the honeydew but not the aphid itself, as information on these species is sparse and uncertain. However, anyone interested can download the program from the link at the bottom of this page and run the analysis themselves.
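As a rough illustration of the search described above, the sketch below builds a `blastn` command for a file of siRNA queries and runs it against each target database. It assumes NCBI BLAST+ is installed and that each genome or transcriptome has already been formatted with `makeblastdb`. The file names, database names, the `-task blastn-short` setting and the tabular output format are assumptions made for illustration, and may differ from what the team's script actually uses.

```python
import shutil
import subprocess


def build_blastn_command(query_fasta, database, evalue=0.05):
    """Build a blastn argument list tuned for short queries (BLAST+ options)."""
    return [
        "blastn",
        "-task", "blastn-short",   # scoring preset intended for short queries
        "-query", query_fasta,
        "-db", database,
        "-evalue", str(evalue),    # discard hits above this E-value
        "-outfmt", "6",            # tab-separated output, one hit per line
    ]


def run_searches(query_fasta, databases, evalue=0.05):
    """Run one search per database and return {database: tabular output}."""
    if shutil.which("blastn") is None:
        raise RuntimeError("blastn not found on PATH; install NCBI BLAST+ first")
    results = {}
    for database in databases:
        command = build_blastn_command(query_fasta, database, evalue)
        completed = subprocess.run(
            command, capture_output=True, text=True, check=True
        )
        results[database] = completed.stdout
    return results


if __name__ == "__main__":
    # Hypothetical file and database names, used only to show the command shape.
    print(" ".join(build_blastn_command("sirnas.fasta", "apis_mellifera_rna")))
```

Keeping the command builder separate from the function that executes it makes it easy to inspect the exact parameters before running a search against many databases.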

Modelling toxicity in silico

To analyse the potential toxicity of each siRNA, we entered the full FASTA sequence of each construct (BCR3, SP3, and C002) into the program, which splits it into a full complement of 21-nucleotide siRNAs using a single-base sliding window (21 nucleotides is assumed to be the minimum siRNA length and so gives the maximum estimate of toxicity). Each of these output siRNAs was then run against each genome or transcriptome of interest with BLASTn, parameterised for short-sequence searches. Standard BLAST alignment scoring favours ungapped alignments without mismatches, so to ensure accurate alignment we applied a strict E-value cutoff of <0.05. As a result, all reported matches are 16-21 nucleotides long with 100% identity; under the standard BLAST model, adding mismatches would rapidly raise the E-values to non-significant levels. In the future, we would run a motif analysis that allows for gaps and mismatches, which may increase the number of hits. Anyone running the script themselves can change these values, and more results appear as the E-value cutoff is raised. Below you can see the output windows, with some annotation, for each of the three siRNAs.
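The single-base sliding window described above can be sketched in a few lines. This is a minimal illustration, not the team's script: the function names and the toy sequence are hypothetical, and the FASTA reader only handles a single record.

```python
def read_fasta_sequence(text):
    """Join the sequence lines of a single-record FASTA string, skipping the header."""
    return "".join(
        line.strip()
        for line in text.splitlines()
        if line.strip() and not line.startswith(">")
    )


def sliding_windows(sequence, length=21, step=1):
    """Return every window of `length` nucleotides, moving `step` bases at a time."""
    if length <= 0 or step <= 0:
        raise ValueError("length and step must be positive")
    seq = "".join(sequence.split()).upper()
    # A sequence shorter than `length` yields an empty list.
    return [seq[i:i + length] for i in range(0, len(seq) - length + 1, step)]


if __name__ == "__main__":
    # Toy 24-nucleotide record, not one of the real constructs.
    demo = ">toy_record\nACGTACGTACGTACGTACGTACGT\n"
    windows = sliding_windows(read_fasta_sequence(demo))
    print(len(windows), windows[0])
```

A sequence of N nucleotides yields N - 20 candidate siRNAs at the default length of 21, which is why the shorter C002 sequence produces fewer queries than BCR3 or SP3.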

BCR3

The full unedited output for BCR3 can be downloaded here. This shows where each siRNA has matches in the host genome, allowing for more detailed analysis.

For example, the above PDF shows a single match against Apis mellifera, the Western honeybee. Harming this species could have huge ecological consequences, so before our insecticides could be applied we would need to ensure that they cause no harm to essential species. When we look at where the siRNA binds in the transcriptome of the bee, we find the following result:

The locus that the siRNA matched has since been removed from the dataset, because a newer annotation released after we ran the bioinformatic analysis no longer predicts it. According to the current sequencing data (01/10/2018), this siRNA is therefore safe for the honeybee.

SP3

The full xlsx file showing where each siRNA hits within each genome can be downloaded here.

This siRNA has no hits against the honeybee genome. The outputs at the bottom of the PDF were once hits against transcripts, but annotation since added on NCBI marks these as non-functional, so we consider them safe.

C002 (positive control)

The full file can be downloaded here. This is the positive control, an siRNA that has been used in research before and targets transcripts in the salivary glands of aphids. It is a much shorter sequence than BCR3 and SP3, so fewer siRNAs are produced, reducing the potential for off-target effects. In future, we would use this bioinformatic analysis to optimise our BCR3 and SP3 sequences, creating a shorter pre-siRNA from each gene and thus reducing the chance of off-target effects, potentially by including only sequences that occur solely in aphid genomes. Of course, as more complete sequencing data becomes available, these results have the potential to change.

Conclusions

As you can see, there are potential risks in deploying our siRNAs. However, for annotated genomes such as that of Apis mellifera, when we analyse the locus of the siRNA match we find no alignment to known genes (some previous gene models did contain matches, but these are now defunct), which suggests the siRNA is safe for bees. Initially, we wanted to analyse the other loci where the siRNAs matched, but quickly found that these genomes are currently (01/10/2018) either completely unannotated or so poorly annotated that it is impossible to determine the function of the DNA at the matched loci. This was possible for Apis mellifera because its genome has been extensively analysed and because transcriptomic data was available for us to use. Because the other organisms' sequences are genomic rather than transcriptomic, there is a high chance that the regions where our siRNAs matched are not transcribed, which would make the siRNA safe. However, given the lack of annotation, we cannot completely rule out an effect. In the future, as more annotation becomes available, a clearer picture will emerge and more accurate conclusions can be drawn.
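Checking whether a matched locus falls inside an annotated gene amounts to an interval-overlap test. The sketch below is one way this could be done, assuming gene coordinates have been taken from an annotation file as 1-based inclusive (name, start, end) tuples on the same sequence as the hit. The gene names and coordinates are invented placeholders, not real annotation.

```python
def overlapping_features(hit_start, hit_end, features):
    """Return names of features whose inclusive range overlaps the hit.

    BLAST reports minus-strand hits with start > end, so the two
    coordinates are sorted before comparing.
    """
    low, high = sorted((hit_start, hit_end))
    return [
        name
        for name, start, end in features
        if start <= high and end >= low
    ]


if __name__ == "__main__":
    # Placeholder features on one hypothetical scaffold.
    genes = [("gene_A", 100, 200), ("gene_B", 500, 600)]
    print(overlapping_features(190, 210, genes))  # overlaps gene_A
    print(overlapping_features(320, 300, genes))  # minus-strand hit, no overlap
```

An empty result only means that no annotated feature covers the locus. For an unannotated genome it says nothing about whether the region is transcribed, which is the limitation described above.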

Of course, many more organisms are likely to come into contact with these siRNAs through extensive food webs, but we decided to focus on the most important species and those most likely to encounter them. We also set the E-value cutoff to 0.05. As you can see, all of the hits against the host aphid's genome (Myzus persicae) have the lowest E-values, 0.002, and are therefore the most statistically significant matches. When you raise the cutoff, say to 0.5, hits do come up against other genomes, including the human genome. This is because the E-value is calculated from the bit-score, a normalised alignment score based on the number of matches, mismatches, and gaps between the sequences, so a more permissive cutoff admits shorter or less perfect alignments. These bit-scores can be seen in the original xlsx files that can be downloaded.
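To make the relationship between bit-score and E-value concrete, the sketch below uses the textbook approximation E = m × n × 2^(−S'), where m is the query length, n the database length and S' the bit-score. BLAST itself applies effective-length corrections, so its reported values will differ somewhat. The filter assumes the default 12-column tabular output (`-outfmt 6`), and the two hit lines are invented toy rows rather than results from our analysis.

```python
def expected_value(bit_score, query_length, database_length):
    """Approximate E-value: E = m * n * 2^(-S'), ignoring length corrections."""
    return query_length * database_length * 2.0 ** (-bit_score)


def filter_hits(tabular_text, max_evalue=0.05):
    """Keep tabular BLAST hits whose E-value is strictly below max_evalue."""
    hits = []
    for line in tabular_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        # Default outfmt 6 columns: 0 query, 1 subject, 2 % identity,
        # 3 alignment length, 10 E-value, 11 bit-score.
        evalue = float(fields[10])
        if evalue < max_evalue:
            hits.append({
                "query": fields[0],
                "subject": fields[1],
                "identity": float(fields[2]),
                "length": int(fields[3]),
                "evalue": evalue,
                "bit_score": float(fields[11]),
            })
    return hits


if __name__ == "__main__":
    # Invented rows for illustration only.
    toy = (
        "siRNA_1\ttranscript_A\t100.000\t21\t0\t0\t1\t21\t300\t320\t0.002\t42.1\n"
        "siRNA_2\ttranscript_B\t100.000\t15\t0\t0\t3\t17\t88\t102\t0.5\t30.2\n"
    )
    print([h["query"] for h in filter_hits(toy)])
    print([h["query"] for h in filter_hits(toy, max_evalue=1.0)])
```

Because the E-value scales with database size, the same 21-nucleotide match is less significant against a large genome than against a small transcriptome, which is worth bearing in mind when comparing hits across organisms.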

For those who are interested or want to try their own analysis, a link to the bioinformatics script can be found on Dr. Daniel Pass's GitHub. This script could prove to be a very useful tool for anyone who wishes to analyse potential siRNA toxicity. Both the input sequence to be split into siRNAs and the siRNA length can be changed. The default length is 21 because siRNAs are usually 21-25 nucleotides long; 21 nucleotides gives the highest estimate of toxicity, since shorter sequences are likely to produce more matches and will include all matches for longer siRNAs. In addition, the genomic or transcriptomic sequences that the BLASTn is run against can be changed. This makes the tool very flexible, and potentially useful to other projects that use siRNAs, whether for iGEM or not. Finally, the script comes with annotated help for all the variables and how to change them in the command. To see this, in the command window, enter the script name with the "-h" or "-help" switch (note, you don't need the quotation marks!).

@@ Line 30: / Line 30: @@
 <br><br><br>
 <h2 style="color:green !important"><center>Modelling toxicity <i>in silico</i></center></h2><br>
-<p>To analyse the toxicity of each siRNA, we entered each of their full FASTA sequences (BCR3, SP3, and C002) into the program, which splits them into 21 nucleotide probes (for maximum siRNA toxicity, assuming the minimum length for an siRNA is 21 nucleotides), each sliding by 1 nucleotide each time. Each of these output siRNAs was then run against each of the genomes or transcriptomes we were interested in with a BLASTn. Here it is important to note that due to our filtering of the E-value, the output only shows regions with a 100% match, anywhere from about 16 nucleotides to 21 nucleotides. This is because these then have no mismatches, which lowers the E-value. In reality, many of these sequences will be near perfect (but not quite) matches across the entire 21 nucleotide region, but the display only shows the 100% matches as these have low E-values. Adding mismatches would rapidly increase the E-values to non-significant levels. However, if one desires to run the script themselves, these values can be changed, showing a greater number of results as the minimum limit for the E-value increases. Below you can see the output windows with some annotation for each of the three siRNAs.
+<p>To analyse the potential toxicity of each siRNA, we entered each of their full FASTA sequences (BCR3, SP3, and C002) into the program, which splits them a full complement of 21 nucleotide siRNAs (for maximum siRNA toxicity, assuming the minimum length for an siRNA is 21 nucleotides) using a single base sliding window. Each of these output siRNAs was then run against each of the genomes or transcriptomes we were interested in with a BLASTn, appropriately paramitised for short sequence searches. Standard BLAST alignment scoring optimises for non-gapped and non-mismatched results, so to ensure accurate alignment we implemented a strict E-value cutoff of <0.05. As a result, all matches are of 16-21 nucleotides in length with 100% alignment.  Adding mismatches would rapidly increase the E-values to non-significant levels using the standard BLAST model. In the future, we would run a motif analysis that would allow for gaps and mismatches, and may increase the number of hits as a result. However, if one desires to run the script themselves, these values can be changed, showing a greater number of results as the minimum limit for the E-value increases. Below you can see the output windows with some annotation for each of the three siRNAs.
 <br><br><br>
 <h3 style="color:green !important">BCR3</h3><br><br>
@@ Line 51: / Line 51: @@
 <object width="80%" height="500px" data="https://static.igem.org/mediawiki/2018/b/b2/T--Cardiff_Wales--SP3_processed.pdf"></object></center>
 <br><br>
-One again, the full xlsx file showing where each siRNA hits within each genome can be downloaded <a href="https://2018.igem.org/File:T--Cardiff_Wales--SP3_unprocessed.xlsx">here.</a>
+The full xlsx file showing where each siRNA hits within each genome can be downloaded <a href="https://2018.igem.org/File:T--Cardiff_Wales--SP3_unprocessed.xlsx">here.</a>
 <br><br>
 This siRNA has no hits against the honeybee genome. The outputs at the bottom of the PDF were once hits against transcripts, but NCBI has had annotation added to them, and considers them non-functional, and thereby safe.
@@ Line 58: / Line 58: @@
 <object width="80%" height="500px" data="https://static.igem.org/mediawiki/2018/1/12/T--Cardiff_Wales--C002_processed.pdf"></object></center>
 <br><br>
-Again, the full file can be downloaded <a href="https://2018.igem.org/File:T--Cardiff_Wales--C002_unprocessed.xlsx">here.</a> This is the positive control, an siRNA that has been used in research before and targets transcripts in the salivary glands of aphids. It is a much shorter sequence than BCR3 and SP3, meaning fewer siRNAs are produced, reducing the potential off-target effects. In future, we would optimise our BCR3 and SP3 sequences using this bioinformatic analysis, to create a shorter pre-siRNA from these genes, thus reducing the chance of off-target effects, potentially to include sequences that only occur in the aphid genomes. Of course, as more complete sequencing data becomes available, these results will likely change.
+The full file can be downloaded <a href="https://2018.igem.org/File:T--Cardiff_Wales--C002_unprocessed.xlsx">here.</a> This is the positive control, an siRNA that has been used in research before and targets transcripts in the salivary glands of aphids. It is a much shorter sequence than BCR3 and SP3, meaning fewer siRNAs are produced, reducing the potential off-target effects. In future, we would optimise our BCR3 and SP3 sequences using this bioinformatic analysis, to create a shorter pre-siRNA from these genes, thus reducing the chance of off-target effects, potentially to include sequences that only occur in the aphid genomes. Of course, as more complete sequencing data becomes available, these results have the potential to change.
 <br><br>
 <h2 style="color:green !important"><center>Conclusions</center></h2><br><br>
-As you can see, there are potential risks of the deployment of our siRNAs. However, for genomes with annotation, such as the <i>Apis mellifera</i> genome, when we analyse the locus of the siRNA match, we find that is has been removed and is not considered functional, thereby making it safe to bees. Initially, we wanted to analyse other loci where the siRNAs matched, but quickly found that these genomes are currently (01/10/2018) completely unannotated, or have such poor functional annotation that it is impossible to actually look at the function of the DNA at the matched loci. It was possible for <i>Apis mellifera</i> as this genome has been extensively analysed, and because we used the transcriptomic data due to it being available. Due to the other organisms sequences being genomic as opposed to transcriptomic, there is a high chance that the regions where our siRNAs matched are actually not transcribed, making the siRNA safe. However, because of the lack of annotation, we cannot completely rule this out. In the future, as more annotation becomes available, a clearer picture will be painted, and more accurate conclusions can be drawn.
+As you can see, there are potential risks of the deployment of our siRNAs. However, for genomes with annotation, such as the <i>Apis mellifera</i> genome, when we analyse the locus of the siRNA match, we find no alignment to known genes (some previous gene models did identify matches, but are now defunct), thereby making it safe to bees. Initially, we wanted to analyse other loci where the siRNAs matched, but quickly found that these genomes are currently (01/10/2018) completely unannotated, or have such poor functional annotation that it is impossible to actually look at the function of the DNA at the matched loci. It was possible for <i>Apis mellifera</i> as this genome has been extensively analysed, and because we used the transcriptomic data due to it being available. Due to the other organisms sequences being genomic as opposed to transcriptomic, there is a high chance that the regions where our siRNAs matched are actually not transcribed, making the siRNA safe. However, because of the lack of annotation, we cannot completely rule this out. In the future, as more annotation becomes available, a clearer picture will be painted, and more accurate conclusions can be drawn.
-<br><br> Of course there are likely many more organisms that may come in contact with these siRNAs due to massive food webs, but it is impractical to attempt to build these and analyse them all. We also set the minimum e-value to be 0.05 (5% chance that the hit is due to chance, if you like). As you can see, all of the hits against the host aphids genome (<i>Myzus persicae</i>) have the lowest e-values, of 0.002. When you increase the minimum e-value, say to 0.5, hits do come up against other genomes, including the human genome. However, with an e-value of 0.5, there is a 50% chance that this is not a true hit. This is because the e-value is calculated from the bit-score, a complex algorithm that gives each base a value depending on whether it is a match, mismatch, or gap between sequences. These bit-scores can be seen on the original xlsx files that can be downloaded.
+<br><br> Of course there are likely many more organisms that may come in contact with these siRNAs due to massive food webs, but we decided to focus our attention on the most important species, or those that are most likely to come in contact with them. We also set the minimum e-value to be 0.05. As you can see, all of the hits against the host aphids genome (<i>Myzus persicae</i>) have the lowest e-values of 0.002, and therefore have the highest accuracy. When you increase the minimum e-value, say to 0.5, hits do come up against other genomes, including the human genome. This is because the e-value is calculated from the bit-score, a complex algorithm calculated based on the number of matches, mismatches, or gaps between sequences. These bit-scores can be seen on the original xlsx files that can be downloaded.
 <br><br> Fortunately, for those who are interested or want to try their own analysis, a link to the bioinformatics script can be found on Dr. Daniel Pass's <a href="https://github.com/passdan/IGEMCardiff2018">GitHub.</a> This script could prove to be a very useful tool for anyone who wishes to analyse potential siRNA toxicity. The input sequence to be made into siRNA sequences can be changed, the siRNA length (the default is 21 because siRNAs are usually 21-25 nucleotides in length, and so 21 nucleotides provides the highest toxicity, as shorter sequences are likely to come up with more matches, and will include all matches for siRNAs of larger lengths). In addition, the genomic or transcriptomic sequences to have a BLASTn run against can be changed. Thus, this tool is very flexible, and potentially useful to other siRNA using projects, be them for iGEM or not. Finally, the script comes with annotated help for all the variables and how to change them in the command. To see this, in the command window, enter the script name with the "-h" or "-help" switch (note, you don't need the quotation marks!).

Difference between revisions of "Team:Cardiff Wales/Bioinformatics"