Difference between revisions of "Team:Munich/DataAnalysis"

Line 433: Line 433:
 
             contamination detection using sequ-into.
 
             contamination detection using sequ-into.
 
         </p>
 
         </p>
 
  <blockquote class="quote-card blue-card">
 
 
   
 
  </blockquote>
 
 
 
 
     </div>
 
     </div>
 
     </div>
 
     </div>

Revision as of 01:46, 18 October 2018

Phactory

Data Analysis

Data Analysis

One of the most essential aspects when it comes to manufacturing therapeutics and, thus, a major consideration while designing Phactory is quality control. To tackle this issue from a bioinformatician’s point-of-view, we analysed minION Oxford Nanopore sequencing data in order to draw conclusions on purity, origin, and functionality of the bacteriophages genome.

For this purpose, the wetlab team sequenced several phage genomes: T7, 3S, NES, FFP. Analysis of the results has been performed using an in-house developed software poreSTAT [1]. For each sequencing sample, it is determined how many bases are sequenced, what base-pair yield has been achieved and how many minION pores were used.

Considering the sequence in which the reads have been acquired, it can nicely be seen how the used chip gets worn out with each sequencing experiment. While for the first sequencing experiment many pores are show high average read lengths, the used pores average read lengths decreases with the number of sequencing experiments.

T7
3S
NES
FFP

The read and base-pair yield is summarized in the following table:

Experiment Sequencing Time Reads Sequenced Base-pairs sequenced
T7 12h 02min 424,198 1.27 × 109
3S 3h 42min 77,092 2.53 × 108
NES 3h 58min 39,501 2.31 × 108
NFFP 5h 42min 27,633 1.23 × 108

Phage Genome Assembly

Genome assembly refers to aligning and merging fragments in order to reconstruct the original sequence. Particularly Nanopore sequencing is well suitable for genome assembly since its long reads allow to reduce the ambiguity of highly similar and repetitive sequences. The read distributions of the sequencing experiments are shown in Figure 2.

T7
3S
NES
FFP

During the initial screening it was observable that the read lengths do not approach the expected length of the phage genomes in the 50-70 kbp range for T7 and 100 kbp range for the remaining phages. Smaller reads generally lead to more ambiguity and but are inevitable due to experimental limitations.

The usual assembly algorithms require the reference genome. As bacteriophage genomes can be highly mosaic, i.e. the genome of many phage species appear to be composed of numerous individual modules, we could not use the reference from the database. Consequently, we were forced to perform a de novo assembly, a method of creating the original sequence by aligning and merging fragments without the aid of a reference genome.

There are several tools available for de novo genome assembly. Unfortunately, most approaches have been developed for short-read genome assembly (2nd generation sequencing, Illumina) and employ a de bruijn graph approach [10].

Here we have 3rd generation sequencing data which must be handled totally different from old short-read sequencing data: the reads are less perfect in terms of sequencing errors. While short-reads nowadays have error-rates of about 1% (e.g. 1 base out of 100 is reported incorrectly), this error is up to 15% for nanopore sequencing data using newest sequencing chemistry (R9.4 at the time of wiki-freeze). *T*T*REFS if time*T*T*T

Assemblers suitable for 3rd generation sequencing data are nominal, canu [2] and miniasm [3] being the most prevalent ones. Both rely on a overlap-layout-consensus approach, which, historically, can be seen as the father of all assemblers [10] (see celera assembler [5]).

We first used canu to assemble our genomes. The main limitation to any approach in 3rd generation sequencing data analysis is contamination. Particularly for assembly, contamination is disastrous because it can lead the assembler into faulty assemblies – depending on the phylogenetic distance of the original sample and the contamination. In the worst case, the assembler is the collapses all organism sequences or gives no output sequence at all.

We thus developed sequ-into to first detect the contamination and also get rid of contamination-originated reads. More on the performance and finding while using sequ-into can be found at our Software page.

After eliminating contamination we noticed non-uniform coverage of the sequence in the phage genome assemblies after re-aligning the reads to the assembly, which can be seen in Figure 3.

FFP

In theory, we should observe uniform coverage over the full genome since there is no bias for read template generation during sample preparation (due to random primers).

However, the first half of the sequence has a lower coverage than the remaining part and there is a high-coverage region in the middle and end of the assembled genome.

Terminal repeats are common in phage genomes. Therefore, it can happen that double coverage within the sequence occurs due to missassembly of the genome. As a result, initial and terminal residues of the sequences are forcematched in the interim of the assembled sequence rather than at its edges.

We thus tried to use the other assembler, miniasm, which is known for very fast assemblies but little error correction. However, this error correction can be achieved by combining miniasm with minimap [4] for read mapping and racon [6] for polishing the sequences.

#!/usr/bin/env sh

INREADS=$1
ASMFOLDER=$2
ASMPREFIX=$3

THREADS=$4

if [ -z "$4" ]
then
    THREADS=4
fi

# path to used executables
MINIMAP2=minimap2
MINIASM=miniasm
GRAPHMAP=graphmap
RACON=racon

# first we must overlap all reads with each other
$MINIMAP2 -x ava-ont -t$THREADS $INREADS $INREADS > $ASMFOLDER/$ASMPREFIX.paf

# then miniasm can create alignment
$MINIASM -f $INREADS $ASMFOLDER/$ASMPREFIX.paf > $ASMFOLDER/$ASMPREFIX.gfa

# extract unitigs (an assembly of fragments for which there are no #competing choices in terms of internal overlaps) from miniasm
awk '$1 ~/S/ {print ">"$2"\n"$3}' $ASMFOLDER/$ASMPREFIX.gfa > $ASMFOLDER/$ASMPREFIX.unitigs.fasta

# align reads with unitigs
$MINIMAP2 $ASMFOLDER/$ASMPREFIX.unitigs.fasta $INREADS > $ASMFOLDER/$ASMPREFIX.unitigs.paf

# find contigs (contiguous join of unitigs)
$RACON $INREADS $ASMFOLDER/$ASMPREFIX.unitigs.paf $ASMFOLDER/$ASMPREFIX.unitigs.fasta > $ASMFOLDER/$ASMPREFIX.contigs.fasta

~/progs/minimap2/minimap2 -x map-ont -a -t$THREADS $ASMFOLDER/$ASMPREFIX.contigs.fasta $INREADS > $ASMFOLDER/$ASMPREFIX.reads.mm2.sam
$GRAPHMAP align -r $ASMFOLDER/$ASMPREFIX.contigs.fasta -d $INREADS -o $ASMFOLDER/$ASMPREFIX.reads.gm.sam
                                    
                            

And can be started simply from the command-line using: ./assemble.sh FQ_file PATH_TO_ASM_FOLDER PREFIX_of_output>

This finally led to a reasonable assembly after rearranging the aforementioned sequences which initially were overrepresented in the sequence.

Phage Genome Annotation

After having a core genome we want to check how many protein-coding genes we can find in the genome. Two of the programs widely used in the field are glimmer [7] and genemark [8].

Since genermark has been shown to perform better in benchmark tests [9], we used this tool for the genome annotation. We ran the tool on the assembled genome in FASTA format generating a gene annotation file (gff3) for the genome highlighting all coding sequences. For easier and more compact usage, we transformed the genome in fasta format with the annotation in gff3 into the embl flat file format.

In conclusion, we can describe the assembled genomes as follows:

Genome #Genes reference #Genes #Genes+ #Genes- %CG content Genome Length
T7 60 70 69 1 0.4833 39,684
3S 277 423 305 118 0.4019 164,860
NES 969 821 148 0.3396 373,576
FFP 534 353 181 0.3522 368,471

It can be seen that the number of detected genes using genemark is higher than the number of the genes in the reference. This is most likely due to incorrectly sequenced bases leading to an early stop codon. More sophisticated polishing steps or higher quality Illumina sequencing reads could possibly avoid this.

Using the embl flat file format we visualized the phage genomes in a circular genome diagram plot (Figure 5).

A generic square placeholder image with rounded corners in a figure.
3S
A generic square placeholder image with rounded corners in a figure.
FFP
A generic square placeholder image with rounded corners in a figure.
NES

Here we must note several things. For the 3S genome, we can see that at certain positions we see a high decrease in the coverage (at 58kbp, 72kbp and 110kbp). At these positions no reads align to the reference genome. Since this realignment was performed using graphmap [], while the inital assembly pipeline uses minimap2 internally, this could also be an alignment artifact.

For the NES genome we can see a similar behaviour at 330kbp. Additionally we can see two spikes at the ends of the genome. Finally the FFP genome again has the same problems as the NES genome ends.

Conclusion

Sequ-into has been shown to be able to allow rapid genome assembly from a Nanopore sequencing sample such as encountered in Phactory manufacturing experiments. We use state-of-the-art 3rd generation sequencing data analysis tools within our framework to overcome difficulties frequently experienced by scientists in Nanopore applications. In result, we not only introduce our step by step evolved results of the analysis, assembly and finally annotation of the bacteriophage genomes but also make clear how to deal with Nanopore sequencing data already widely used in synthetic biology and rising. Furthermore our custom phage genome of the bacteriophage 3S allows further analysis and usage in the manufacturing cycle, for example, a fast contamination detection using sequ-into.

1

References

  1. Abudayyeh, O.O. et al., 2016. C2c2 is a single-component programmable RNA-guided RNA-targeting CRISPR effector. Science 353, aaf5573.
  2. Boothby, T.C., Tapia, H., Brozena, A.H., Piszkiewicz, S., Smith, A.E., Giovannini, I., Rebecchi, L., Pielak, G.J., Koshland, D., and Goldstein, B., 2017. Tardigrades use intrinsically disordered proteins to survive desiccation. Mol Cell 65, 975ñ984.
  3. Centers for Disease Control and Prevention, 2017. Antibiotic/Antimicrobial Resistance.
  4. GDDiergezondheid, 2017. Mastitis (uierontsteking).
  5. Gootenberg, J.S. et al., 2017. Nucleic acid detection with CRISPR-Cas13a/C2c2. Science 356, 438ñ442.
  6. Jia, B. et al., 2017. CARD 2017: Expansion and model-centric curation of the comprehensive antibiotic resistance database. Nucleic Acids Research 45(D1), D566ñD573.
  7. Jinek, M., Chylinski, K., Fonfara, I., Hauer, M., Doudna, J.A., Charpentier, E., 2012. A Programmable Dual-RNA-Guided DNA Endonuclease in Adaptive Bacterial Immunity. Science 337, 816-821.
  8. Liu, L. et al., 2017. The molecular architecture for RNA-guided RNA cleavage by Cas13a. Cell 170, 714ñ726.e10.
  9. McArthur, A.G. et al., 2013. The comprehensive antibiotic resistance database. Antimicrobial Agents and Chemotherapy 57, 3348ñ3357.
  10. McArthur, A.G. and Wright, G.D., 2015. Bioinformatics of antimicrobial resistance in the age of molecular epidemiology. Current Opinion in Microbiology 27, 45ñ50.
  11. Schwechheimer, C., and Kuehn, M.J., 2015. Outer-membrane vesicles from Gram-negative bacteria: biogenesis and functions. Nature Reviews: Microbiology 13, 605-619.
  12. Sloan, D., Batista, A., and Loeb, A., 2017. The Resilience of Life to Astrophysical Events. Scientific Reports 7, 5419-5424.
  13. Statistics Netherlands, 2015. Dutch dairy in figures.
  14. World Health Organization, 2016. Antibiotic resistance.
  15. Zetsche, B, Gootenberg, J.S., Abudayyeh, O.O., Slaymaker, I.M., Makarova, K.S., Essletzbichler, P., Volz, S.E., van der Oost, J., Regev, Aviv, Koonin, E.V., Zhang, F., 2015. Cpf1 Is a Single RNA-Guided Endonuclease of a Class 2 CRISPR-Cas System. Cell 163, 759-771.