Team:Cardiff Wales/Bioinformatics


Identifying a target

When early discussions about the project took place, we considered targeting the aphid microbiome using single guide RNAs (sgRNAs), hoping that the bacteria they hold have a CRISPR system that we could essentially hijack. However, this idea was quickly put to rest as we ran a short bioinformatic test to identify whether the bacteria had a CRISPR system. These results can be seen below, and show that there is no identifiable CRISPR system naturally in Buchnera aphidicola.

Figure 1. Bioinformatic analysis showing that Buchnera aphidicola lack a CRISPR system.

The above screenshot shows "3 Analysed Sequences" as the bacterial genome FASTA sequence was assembled into 3 contigs. This genome can be found on NCBI, or here.

Integrating Human Practices - Protecting Local Fauna

For our human practices, we communicated with several stakeholders about our project, and their general knowledge and opinions of GM. One concern we found, especially when talking to retirees, was for the well-being of the local fauna. More specifically, we initially spoke to an amateur beekeeper, who expressed concerns about honeybee welfare by raising a previously unknown issue. That is, they suggested that wasps feed on aphid honeydew, and have been known to attack bees when they are low on food. Obviously any risk to bee populations would make the project completely unfavourable, so we took up the issue with the Welsh Beekeepers’ Association (WBKA). These professional beekeepers rapidly informed us that the honeybees themselves actually feed on aphid honeydew. The reduction of a food source for these bees is of minor concern; instead, the risk of the siRNAs that are secreted in the aphid honeydew raises more immediate and interesting concerns.

To address these concerns and prove that our project is safe for the field, we analysed binding sites of every potential siRNA produced from our RNAi gene constructs, and ran a BLASTn of each of the produced siRNAs against any of the transcriptomes of any aphid predators, bees, and humans. Where the transcriptome sequences were not available, we used the genomic sequences. Here, it is important to note that as siRNAs only bind to RNA sequences, they can only affect transcribed sequences, so much of the matches to genomic sequences will not be in coding regions. This is why we used transcriptomic sequences where available. For our target organism, Myzus persicae we used both genomic and transcriptomic sequences to allow comparison between other samples of the same type.

In addition, many genomes were not available for direct aphid predators, so instead of ruling these out, we took the genome sequence of the next closely available organism(s), moving up taxa, using the assumption that the genomic regions between these relatives and the actual predators must be at least somewhat conserved, and thus still serve as good indicators. Finally, we did not attempt to analyse the genomes of organisms that may eat the honeydew, but not the aphid directly, as this information is fairly sparse and indefinite. However, if interested, one can download the program from the link at the bottom of this page, and run the analysis themselves.

Modelling toxicity in silico

To analyse the potential toxicity of each siRNA, we entered each of their full FASTA sequences (BCR3, SP3, and C002) into the program, which splits them into a full complement of 21 nucleotide siRNAs (for maximum siRNA toxicity, assuming the minimum length for an siRNA is 21 nucleotides) using a single base sliding window. Each of these output siRNAs was then run against each of the genomes or transcriptomes we were interested in with a BLASTn, appropriately paramitised for short sequence searches. Standard BLAST alignment scoring optimises for non-gapped and non-mismatched results, so to ensure accurate alignment we implemented a strict E-value cutoff of <0.05. As a result, all matches are of 16-21 nucleotides in length with 100% alignment. We also specified that these must be unique matches, which removes multiple hits from the same siRNA of slightly different lengths (16-21 nucleotides) in the same sequence. Adding mismatches would rapidly increase the E-values to non-significant levels using the standard BLAST model. In the future, we would run a motif analysis that would allow for gaps and mismatches, and may increase the number of hits as a result. However, if one desires to run the script themselves, these values can be changed, showing a greater number of results as the minimum limit for the E-value increases. Below you can see the output windows with some annotation for each of the three siRNAs.


The full unedited output for BCR3 can be downloaded as an xlsx file here. This shows where each siRNA has matches in the host genome, allowing for more detailed analysis. In addition, should the above PDF not load, it can be downloaded here. Despite us running the analysis against both the genome and transcriptome of Myzus persicae, hits only showed up against the genome for the BCR3 RNAi construct, unlike the other two RNAi constructs which both had hits against the genome and transcriptome. A graphical output is seen in Figure 2 below.

Figure 2. Graphical representation showing the number of hits from a BLASTn search of each possible siRNA produced from BCR3. Note that the Y axis is logarithmic.

For example, the above PDF shows a single match against Apis mellifera, the Western honeybee. Of course, harming this species could have huge ecological consequences, so before our insecticides could be applied we would need to ensure that it could cause no harm to essential species. When we look at where the siRNA binds in the transcriptome of the bee, we find the following result:

Figure 3. Bioinformatic analysis showing transcriptomic hit location against Apis mellifera. The region has been annotated and has been decided it is non-functional, and the siRNA is thereby safe.

The locus in the dataset where the siRNA matched to has been removed as it was not predicted in a newer annotation since running the bioinformatic analysis. Therefore, this siRNA is safe to the honeybee, according to its current sequencing data (01/10/2018).


The full xlsx file showing where each siRNA hits within each genome can be downloaded here. If the above PDF doesn't load, this can also be downloaded here. A graphical output is seen in Figure 4 below.

Figure 4. Graphical representation showing the number of hits from a BLASTn search of each possible siRNA produced from SP3. Note that the Y axis is logarithmic.

This siRNA has no hits against the honeybee genome. The outputs at the bottom of the PDF were once hits against transcripts, but NCBI has had annotation added to them, and considers them non-functional, and thereby safe.

C002 (positive control)

The full xlsx file can be downloaded here, and the above PDF, should it not load, can be downloaded here. This is the positive control, an siRNA that has been used in research before and targets transcripts in the salivary glands of aphids. A graphical output is seen in Figure 5 below.

Figure 5. Graphical representation showing the number of hits from a BLASTn search of each possible siRNA produced from the positive control C002.

The C002 siRNA is a much shorter sequence than BCR3 and SP3, meaning fewer siRNAs are produced, reducing the potential off-target effects. In future, we would optimise our BCR3 and SP3 sequences using this bioinformatic analysis, to create a shorter pre-siRNA from these genes, thus reducing the chance of off-target effects, potentially to include sequences that only occur in the aphid genomes. Of course, as more complete sequencing data becomes available, these results have the potential to change.


As you can see, there are potential risks of the deployment of our siRNAs. However, for genomes with annotation, such as the Apis mellifera genome, when we analyse the locus of the siRNA match, we find no alignment to known genes (some previous gene models did identify matches, but are now defunct), thereby making it safe to bees. Initially, we wanted to analyse other loci where the siRNAs matched, but quickly found that these genomes are currently (01/10/2018) completely unannotated, or have such poor functional annotation that it is impossible to actually look at the function of the DNA at the matched loci. It was possible for Apis mellifera as this genome has been extensively analysed, and because we used the transcriptomic data due to it being available. Due to the other organisms sequences being genomic as opposed to transcriptomic, there is a high chance that the regions where our siRNAs matched are actually not transcribed, making the siRNA safe. However, because of the lack of annotation, we cannot completely rule this out. In the future, as more annotation becomes available, a clearer picture will be painted, and more accurate conclusions can be drawn.

Of course there are likely many more organisms that may come in contact with these siRNAs due to massive food webs, but we decided to focus our attention on the most important species, or those that are most likely to come in contact with them. We also set the minimum e-value to be 0.05. As you can see, all of the hits against the host aphids genome (Myzus persicae) have the lowest e-values of 0.002, and therefore have the highest accuracy. When you increase the minimum e-value, to 0.5 for example, hits do come up against other genomes, including the human genome. This is because the e-value is calculated from the bit-score, a complex algorithm calculated based on the number of matches, mismatches, or gaps between sequences. These bit-scores can be seen on the original xlsx files that can be downloaded.

Fortunately, for those who are interested or want to try their own analysis, a link to the bioinformatics script can be found on Dr. Daniel Pass's GitHub. This script could prove to be a very useful tool for anyone who wishes to analyse potential siRNA toxicity. The input sequence to be made into siRNA sequences can be changed, the siRNA length (the default is 21 because siRNAs are usually 21-25 nucleotides in length, and so 21 nucleotides provides the highest toxicity, as shorter sequences are likely to come up with more matches, and will include all matches for siRNAs of larger lengths). In addition, the genomic or transcriptomic sequences to have a BLASTn run against can be changed. Thus, this tool is very flexible, and potentially useful to other siRNA using projects, be them for iGEM or not. Finally, the script comes with annotated help for all the variables and how to change them in the command. To see this, in the command window, enter the script name with the "-h" or "-help" switch (note, you don't need the quotation marks!).