Data Analysis
One of the most essential aspects when it comes to manufacturing therapeutics and, thus, a major consideration while designing Phactory is quality control. To tackle this issue from a bioinformatician’s point-of-view, we analysed minION Oxford Nanopore sequencing data in order to draw conclusions on purity, origin, and functionality of the bacteriophages genome.
For this purpose, the wetlab team sequenced several phage genomes: T7, 3S, NES, FFP. Analysis of the results has been performed using an in-house developed software poreSTAT [1]. For each sequencing sample, it is determined how many bases are sequenced, what base-pair yield has been achieved and how many minION pores were used.
Considering the sequence in which the reads have been acquired, it can nicely be seen how the used chip gets worn out with each sequencing experiment. While for the first sequencing experiment many pores are show high average read lengths, the used pores average read lengths decreases with the number of sequencing experiments.
Experiment | Sequencing Time | Reads Sequenced | Base-pairs sequenced |
---|---|---|---|
T7 | 12h 02min | 424,198 | 1.27 × 109 |
3S | 3h 42min | 77,092 | 2.53 × 108 |
NES | 3h 58min | 39,501 | 2.31 × 108 |
NFFP | 5h 42min | 27,633 | 1.23 × 108 |
Phage Genome Assembly
Genome assembly refers to aligning and merging fragments in order to reconstruct the original sequence. Particularly Nanopore sequencing is well suitable for genome assembly since its long reads allow to reduce the ambiguity of highly similar and repetitive sequences. The read distributions of the sequencing experiments are shown in Figure 2.
During the initial screening it was observable that the read lengths do not approach the expected length of the phage genomes in the 50-70 kbp range for T7 and 100 kbp range for the remaining phages. Smaller reads generally lead to more ambiguity and but are inevitable due to experimental limitations.
The usual assembly algorithms require the reference genome. As bacteriophage genomes can be highly mosaic, i.e. the genome of many phage species appear to be composed of numerous individual modules, we could not use the reference from the database. Consequently, we were forced to perform a de novo assembly, a method of creating the original sequence by aligning and merging fragments without the aid of a reference genome.
There are several tools available for de novo genome assembly. Unfortunately, most approaches have been developed for short-read genome assembly (2nd generation sequencing, Illumina) and employ a de bruijn graph approach [10].
Here we have 3rd generation sequencing data which must be handled totally different from old short-read sequencing data: the reads are less perfect in terms of sequencing errors. While short-reads nowadays have error-rates of about 1% (e.g. 1 base out of 100 is reported incorrectly), this error is up to 15% for nanopore sequencing data using newest sequencing chemistry (R9.4 at the time of wiki-freeze). *T*T*REFS if time*T*T*T
Assemblers suitable for 3rd generation sequencing data are nominal, canu [2] and miniasm [3] being the most prevalent ones. Both rely on a overlap-layout-consensus approach, which, historically, can be seen as the father of all assemblers [10] (see celera assembler [5]).
We first used canu to assemble our genomes. The main limitation to any approach in 3rd generation sequencing data analysis is contamination. Particularly for assembly, contamination is disastrous because it can lead the assembler into faulty assemblies – depending on the phylogenetic distance of the original sample and the contamination. In the worst case, the assembler is the collapses all organism sequences or gives no output sequence at all.
We thus developed sequ-into to first detect the contamination and also get rid of contamination-originated reads. More on the performance and finding while using sequ-into can be found at our Software page.
After eliminating contamination we noticed non-uniform coverage of the sequence in the phage genome assemblies after re-aligning the reads to the assembly, which can be seen in Figure 3.
In theory, we should observe uniform coverage over the full genome since there is no bias for read template generation during sample preparation (due to random primers).
However, the first half of the sequence has a lower coverage than the remaining part and there is a high-coverage region in the middle and end of the assembled genome.
Terminal repeats are common in phage genomes. Therefore, it can happen that double coverage within the sequence occurs due to missassembly of the genome. As a result, initial and terminal residues of the sequences are forcematched in the interim of the assembled sequence rather than at its edges.
We thus tried to use the other assembler, miniasm, which is known for very fast assemblies but little error correction. However, this error correction can be achieved by combining miniasm with minimap [4] for read mapping and racon [6] for polishing the sequences.
#!/usr/bin/env sh INREADS=$1 ASMFOLDER=$2 ASMPREFIX=$3 THREADS=$4 if [ -z "$4" ] then THREADS=4 fi # path to used executables MINIMAP2=minimap2 MINIASM=miniasm GRAPHMAP=graphmap RACON=racon # first we must overlap all reads with each other $MINIMAP2 -x ava-ont -t$THREADS $INREADS $INREADS > $ASMFOLDER/$ASMPREFIX.paf # then miniasm can create alignment $MINIASM -f $INREADS $ASMFOLDER/$ASMPREFIX.paf > $ASMFOLDER/$ASMPREFIX.gfa # extract unitigs (an assembly of fragments for which there are no #competing choices in terms of internal overlaps) from miniasm awk '$1 ~/S/ {print ">"$2"\n"$3}' $ASMFOLDER/$ASMPREFIX.gfa > $ASMFOLDER/$ASMPREFIX.unitigs.fasta # align reads with unitigs $MINIMAP2 $ASMFOLDER/$ASMPREFIX.unitigs.fasta $INREADS > $ASMFOLDER/$ASMPREFIX.unitigs.paf # find contigs (contiguous join of unitigs) $RACON $INREADS $ASMFOLDER/$ASMPREFIX.unitigs.paf $ASMFOLDER/$ASMPREFIX.unitigs.fasta > $ASMFOLDER/$ASMPREFIX.contigs.fasta ~/progs/minimap2/minimap2 -x map-ont -a -t$THREADS $ASMFOLDER/$ASMPREFIX.contigs.fasta $INREADS > $ASMFOLDER/$ASMPREFIX.reads.mm2.sam $GRAPHMAP align -r $ASMFOLDER/$ASMPREFIX.contigs.fasta -d $INREADS -o $ASMFOLDER/$ASMPREFIX.reads.gm.sam
And can be started simply from the command-line using: ./assemble.sh FQ_file PATH_TO_ASM_FOLDER PREFIX_of_output>
This finally led to a reasonable assembly after rearranging the aforementioned sequences which initially were overrepresented in the sequence.
Phage Genome Annotation
After having a core genome we want to check how many protein-coding genes we can find in the genome. Two of the programs widely used in the field are glimmer [7] and genemark [8].
Since genermark has been shown to perform better in benchmark tests [9], we used this tool for the genome annotation. We ran the tool on the assembled genome in FASTA format generating a gene annotation file (gff3) for the genome highlighting all coding sequences. For easier and more compact usage, we transformed the genome in fasta format with the annotation in gff3 into the embl flat file format.
Genome | #Genes reference | #Genes | #Genes+ | #Genes- | %CG content | Genome Length |
---|---|---|---|---|---|---|
T7 | 60 | 70 | 69 | 1 | 0.4833 | 39,684 |
3S | 277 | 423 | 305 | 118 | 0.4019 | 164,860 |
NES | 969 | 821 | 148 | 0.3396 | 373,576 | |
FFP | 534 | 353 | 181 | 0.3522 | 368,471 |
It can be seen that the number of detected genes using genemark is higher than the number of the genes in the reference. This is most likely due to incorrectly sequenced bases leading to an early stop codon. More sophisticated polishing steps or higher quality Illumina sequencing reads could possibly avoid this.
Using the embl flat file format we visualized the phage genomes in a circular genome diagram plot (Figure 5).
Here we must note several things. For the 3S genome, we can see that at certain positions we see a high decrease in the coverage (at 58kbp, 72kbp and 110kbp). At these positions no reads align to the reference genome. Since this realignment was performed using graphmap [], while the inital assembly pipeline uses minimap2 internally, this could also be an alignment artifact.
For the NES genome we can see a similar behaviour at 330kbp. Additionally we can see two spikes at the ends of the genome. Finally the FFP genome again has the same problems as the NES genome ends.
Conclusion
Sequ-into has been shown to be able to allow rapid genome assembly from a Nanopore sequencing sample such as encountered in Phactory manufacturing experiments. We use state-of-the-art 3rd generation sequencing data analysis tools within our framework to overcome difficulties frequently experienced by scientists in Nanopore applications. In result, we not only introduce our step by step evolved results of the analysis, assembly and finally annotation of the bacteriophage genomes but also make clear how to deal with Nanopore sequencing data already widely used in synthetic biology and rising. Furthermore our custom phage genome of the bacteriophage 3S allows further analysis and usage in the manufacturing cycle, for example, a fast contamination detection using sequ-into.