− | {{Munich/Phactory}}
| |
− | <html>
| |
− | <style>
| |
| | | |
− | .pictureTitle{
| |
− | background: linear-gradient(rgba(0,0,0,.5), rgba(0,0,0,.8)), url("https://static.igem.org/mediawiki/2018/9/9a/T--Munich--ProjectTitle.jpg");
| |
− | background-repeat: no-repeat;
| |
− | background-size: cover;
| |
− | background-position:center;
| |
− | }
| |
− |
| |
− |
| |
− |
| |
− | @media only screen and (max-width: 575.98px) {}
| |
− |
| |
− | @media only screen and (max-width: 767.98px) {}
| |
− |
| |
− | @media only screen and (max-width: 991.98px) {}
| |
− |
| |
− | @media only screen and (max-width: 1199.98px) {}
| |
− | </style>
| |
− |
| |
− | <div class="pictureTitle container-fluid text-center mb-0 align-items-center text-light">
| |
− |
| |
− | <div class="display-2 mb-0">
| |
− | Data Analysis
| |
− | </div>
| |
− | <!--<h4>First Blick of Phactory</h4>-->
| |
− |
| |
− | </div>
| |
− |
| |
− |
| |
− | <div class="phaContainer">
| |
− | <aside id="phaContentsOuter">
| |
− | <aside id="phaContents" class="table-of-contents">
| |
− | <!-- will be generated with JS -->
| |
− | </aside>
| |
− | </aside>
| |
− |
| |
− | <main class="post-content">
| |
− |
| |
− | <!-- <h1>Data Analysis</h1>
| |
− | <p>The very top title, I am not sure if we still need that</p>-->
| |
− |
| |
− |
| |
− |
| |
− | <h2>Data Analysis</h2>
| |
− | <div class="row">
| |
− | <div class="col-12">
| |
− | <p>One of the most essential aspects of manufacturing therapeutics and, thus, a major
| |
− | consideration while designing Phactory is quality control. To tackle this issue from a bioinformatician’s
| |
− | point of view, we analysed <a href="https://nanoporetech.com/products/minion">Oxford Nanopore MinION</a>
| |
− | sequencing data in order to draw conclusions on the purity, origin, and functionality of the bacteriophage genomes.
| |
− | </p>
| |
− | <p>For this purpose, the wetlab team sequenced several phage genomes: T7, 3S, NES and FFP. The results were
| |
− | analysed using the in-house software poreSTAT [1]. For each sequencing sample, it determines
| |
− | how many reads were sequenced, what base-pair yield was achieved and how many MinION pores
| |
− | were used.
| |
− | </p>
| |
− | <p>
| |
− | Looking at the experiments in the order in which they were run, it can clearly be seen how the chip
| |
− | wears out with each sequencing experiment. While many pores show high average read lengths in the first
| |
− | sequencing experiment, the average read length per used pore decreases with the number of
| |
− | sequencing experiments.
| |
− | </p>
| |
− | <div></div>
| |
− | <div class="row">
| |
− | <div class="col-12 col-md-6">
| |
− | <figure class="figure">
| |
− | <img src="https://static.igem.org/mediawiki/2018/e/e7/T--munich--t7_summary.png" class="figure-img img-fluid rounded">
| |
− | <figcaption class="figure-caption">T7</figcaption>
| |
− | </figure>
| |
− | </div>
| |
− | <div class="col-12 col-md-6">
| |
− | <figure class="figure">
| |
− | <img src="https://static.igem.org/mediawiki/2018/6/66/T--munich--3S_summary.png" class="figure-img img-fluid rounded">
| |
− | <figcaption class="figure-caption">3S</figcaption>
| |
− | </figure>
| |
− | </div>
| |
− | </div>
| |
− | <div class="row">
| |
− | <div class="col-12 col-md-6">
| |
− | <figure class="figure">
| |
− | <img src="https://static.igem.org/mediawiki/2018/b/bd/T--munich--NES_summary.png" class="figure-img img-fluid rounded">
| |
− | <figcaption class="figure-caption">NES</figcaption>
| |
− | </figure>
| |
− | </div>
| |
− | <div class="col-12 col-md-6">
| |
− | <figure class="figure">
| |
− | <img src="https://static.igem.org/mediawiki/2018/b/b3/T--munich--ffp_summary.png" class="figure-img img-fluid rounded">
| |
− | <figcaption class="figure-caption">FFP</figcaption>
| |
− | </figure>
| |
− | </div>
| |
− | </div>
| |
− | </div>
| |
− | </div>
| |
− | <div class="row">
| |
− | <div class="col-12">
| |
− | <p>The read and base-pair yield is summarized in the following table:</p>
| |
− | <table class="table">
| |
− | <thead>
| |
− | <tr>
| |
− | <th scope="col">Experiment</th>
| |
− | <th scope="col">Sequencing Time</th>
| |
− | <th scope="col">Reads Sequenced</th>
| |
− | <th scope="col">Base-pairs sequenced</th>
| |
− | </tr>
| |
− | </thead>
| |
− | <tbody>
| |
− | <tr>
| |
− | <th scope="row">T7</th>
| |
− | <td>12h 02min</td>
| |
− | <td>424,198</td>
| |
− | <td>1.27 × 10<sup>9</sup></td>
| |
− | </tr>
| |
− | <tr>
| |
− | <th scope="row">3S</th>
| |
− | <td>3h 42min</td>
| |
− | <td>77,092</td>
| |
− | <td>2.53 × 10<sup>8</sup></td>
| |
− | </tr>
| |
− | <tr>
| |
− | <th scope="row">NES</th>
| |
− | <td>3h 58min</td>
| |
− | <td>39,501</td>
| |
− | <td>2.31 × 10<sup>8</sup></td>
| |
− | </tr>
| |
− | <tr>
| |
− | <th scope="row">FFP</th>
| |
− | <td>5h 42min</td>
| |
− | <td>27,633</td>
| |
− | <td>1.23 × 10<sup>8</sup></td>
| |
− | </tr>
| |
− | </tbody>
| |
− | </table>
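<p>As a rough cross-check of the table, the mean read length of each experiment can be estimated by dividing the sequenced base pairs by the number of reads. The following is a minimal sketch, not part of poreSTAT; <code>mean_read_length</code> is a hypothetical helper, and the base-pair values are the rounded table entries written out in full, so the results are approximate.</p>

```shell
#!/usr/bin/env sh
# Mean read length = sequenced base pairs / sequenced reads.
# mean_read_length is a hypothetical helper for illustration only.
mean_read_length() {
    # $1: base pairs sequenced, $2: reads sequenced
    awk -v bp="$1" -v reads="$2" 'BEGIN { printf "%.0f\n", bp / reads }'
}

# Values copied from the yield table (base pairs rounded to three digits).
echo "T7  $(mean_read_length 1270000000 424198) bp"
echo "3S  $(mean_read_length 253000000 77092) bp"
echo "NES $(mean_read_length 231000000 39501) bp"
echo "FFP $(mean_read_length 123000000 27633) bp"
```

<p>With the rounded table values this gives roughly 3.0 kbp (T7), 3.3 kbp (3S), 5.8 kbp (NES) and 4.5 kbp (FFP) per read.</p>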
| |
− | <h2>Phage Genome Assembly</h2>
| |
− | <div class="row">
| |
− | <div class="col-12">
| |
− | <p>
| |
− | Genome assembly refers to aligning and merging sequence fragments in order to reconstruct the original sequence.
| |
− | Nanopore sequencing is particularly well suited for genome assembly since its long reads reduce
| |
− | the ambiguity of highly similar and repetitive sequences. The read distributions of the sequencing experiments
| |
− | are shown in Figure 2.
| |
− | </p>
| |
− | <div class="row">
| |
− | <div class="col-12 col-md-6">
| |
− | <figure class="figure">
| |
− | <img src="https://static.igem.org/mediawiki/2018/c/cd/T--munich--T7_all_reads.png" class="figure-img img-fluid rounded">
| |
− | <figcaption class="figure-caption">T7</figcaption>
| |
− | </figure>
| |
− | </div>
| |
− | <div class="col-12 col-md-6">
| |
− | <figure class="figure">
| |
− | <img src="https://static.igem.org/mediawiki/2018/2/26/T--munich--3S_all_reads.png" class="figure-img img-fluid rounded">
| |
− | <figcaption class="figure-caption">3S</figcaption>
| |
− | </figure>
| |
− | </div>
| |
− | </div>
| |
− | <div class="row">
| |
− | <div class="col-12 col-md-6">
| |
− | <figure class="figure">
| |
− | <img src="https://static.igem.org/mediawiki/2018/8/83/T--munich--NES_all_reads.png" class="figure-img img-fluid rounded">
| |
− | <figcaption class="figure-caption">NES</figcaption>
| |
− | </figure>
| |
− | </div>
| |
− | <div class="col-12 col-md-6">
| |
− | <figure class="figure">
| |
− | <img src="https://static.igem.org/mediawiki/2018/6/6a/T--munich--FFP_all_reads.png" class="figure-img img-fluid rounded">
| |
− | <figcaption class="figure-caption">FFP</figcaption>
| |
− | </figure>
| |
− | </div>
| |
− | </div>
| |
− | </div>
| |
− | </div>
| |
− | <p>
| |
− | During the initial screening we observed that the read lengths do not approach the expected
| |
− | length of the phage genomes, in the 50-70 kbp range for T7 and the 100 kbp range for the remaining phages.
| |
− | Shorter reads generally lead to more ambiguity but are inevitable due to experimental limitations.
| |
− | </p>
| |
− | <p>
| |
− | Reference-guided assembly algorithms require a reference genome. As bacteriophage genomes can be highly mosaic,
| |
− | i.e. the genomes of many phage species appear to be composed of numerous individual modules, we could not use
| |
− | a reference from the database. Consequently, we had to perform a <i>de novo</i> assembly, a method of
| |
− | reconstructing the original sequence by aligning and merging fragments without the aid of a reference genome.
| |
− | </p>
| |
− | <p>
| |
− | There are several tools available for <i>de novo</i> genome assembly. Unfortunately, most approaches have been
| |
− | developed for short-read genome assembly (2nd generation sequencing, Illumina) and employ a de Bruijn graph
| |
− | approach [10].
| |
− | </p>
| |
− | <p>
| |
− | Here we have 3rd generation sequencing data, which must be handled very differently from short-read
| |
− | sequencing data because the reads contain far more sequencing errors. While short reads nowadays have
| |
− | error rates of about 1% (i.e. 1 base out of 100 is reported incorrectly), the error rate is up to 15% for
| |
− | nanopore sequencing data using the newest sequencing chemistry (R9.4 at the time of wiki-freeze).
| |
− | </p>
| |
− | <p>
| |
− | Assemblers suitable for 3rd generation sequencing data are few, with canu [2] and miniasm [3] being the
| |
− | most prevalent ones. Both rely on an overlap-layout-consensus approach, which, historically, can be seen
| |
− | as the father of all assemblers [10] (see Celera Assembler [5]).
| |
− | </p>
| |
− | <p>
| |
− | We first used canu to assemble our genomes. The main limitation of any approach in 3rd generation
| |
− | sequencing data analysis is contamination. Particularly for assembly, contamination is disastrous
| |
− | because it can lead the assembler into faulty assemblies, depending on the phylogenetic distance between
| |
− | the original sample and the contamination. In the worst case, the assembler collapses the sequences of all
| |
− | organisms or gives no output sequence at all.
| |
− | </p>
| |
− | <p>
| |
− | We thus developed <a href="https://2018.igem.org/Team:Munich/Software">sequ-into</a> to first detect
| |
− | contamination and then remove the reads originating from it.
| |
− | More on the performance of sequ-into and our findings while using it can be found on
| |
− | our <a href="https://2018.igem.org/Team:Munich/Software">Software</a> page.
| |
− | </p>
| |
− | <p>
| |
− | After eliminating contamination and re-aligning the reads to the assembly, we noticed non-uniform
| |
− | coverage across the phage genome assemblies, as can be seen in Figure 3.
| |
− | </p>
| |
− |
| |
− | <div class="col-12">
| |
− | <figure class="figure">
| |
− | <img src="https://static.igem.org/mediawiki/2018/6/68/T--munich--ffp_before_fix.png" class="figure-img img-fluid rounded">
| |
− | <figcaption class="figure-caption">FFP</figcaption>
| |
− | </figure>
| |
− | </div>
| |
− | <p>
| |
− | In theory, we should observe uniform coverage over the full genome since there is no bias for read
| |
− | template generation during sample preparation (due to random primers).
| |
− | </p>
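<p>For orientation, the average coverage expected under uniform sampling is simply the total number of sequenced bases divided by the genome length. The sketch below is an illustration only: <code>expected_coverage</code> is a hypothetical helper, and it combines the rounded base-pair yield with the assembled genome length reported on this page, so it is an upper bound that ignores reads removed as contamination.</p>

```shell
#!/usr/bin/env sh
# Expected mean coverage under uniform sampling = total bases / genome length.
# expected_coverage is a hypothetical helper for illustration only.
expected_coverage() {
    # $1: total sequenced bases, $2: genome length in bp
    awk -v bp="$1" -v len="$2" 'BEGIN { printf "%.1f\n", bp / len }'
}

# FFP: 1.23 x 10^8 sequenced bases, 368,471 bp assembled genome length.
echo "FFP expected coverage: $(expected_coverage 123000000 368471)x"
```

<p>A region whose observed coverage is about twice this average is a candidate for a collapsed repeat; a region far below it may point to a misassembly or an alignment problem.</p>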
| |
− | <p>
| |
− | However, the first half of the sequence has a lower coverage than the remaining part and there is
| |
− | a high-coverage region in the middle and end of the assembled genome.
| |
− | </p>
| |
− | <p>
| |
− | Terminal repeats are common in phage genomes. Therefore, double coverage within
| |
− | the sequence can occur due to misassembly of the genome: the initial and terminal residues
| |
− | of the sequence are force-matched in the interior of the assembled sequence rather than at its edges.
| |
− | </p>
| |
− | <p>
| |
− | We thus tried the other assembler, miniasm, which is known for very fast assemblies but little
| |
− | error correction. However, error correction can be achieved by combining miniasm with minimap2 [4]
| |
− | for read mapping and racon [6] for polishing the sequences.
| |
− | </p>
| |
− | <div class="card mt-2">
| |
− |
| |
− | <div class="card-header" id="heading2" style="background-color: #018DCD;">
| |
− | <h5 class="mb-0">
| |
− | <button class="btn btn-link collapsed" data-toggle="collapse" data-target="#collapse2"
| |
− | aria-expanded="false" aria-controls="collapse2">
| |
− | <span class="text-white font-weight-bold">The assembly pipeline</span>
| |
− | </button>
| |
− | </h5>
| |
− | </div>
| |
− |
| |
− | <div id="collapse2" class="collapse" aria-labelledby="heading2" data-parent="#accordion">
| |
− | <div class="card-body">
| |
− | <div class="row mt-2">
| |
− | <pre>
| |
− | #!/usr/bin/env sh
| |
− |
| |
− | INREADS=$1
| |
− | ASMFOLDER=$2
| |
− | ASMPREFIX=$3
| |
− |
| |
− | THREADS=$4
| |
− |
| |
− | if [ -z "$4" ]
| |
− | then
| |
− | THREADS=4
| |
− | fi
| |
− |
| |
− | # path to used executables
| |
− | MINIMAP2=minimap2
| |
− | MINIASM=miniasm
| |
− | GRAPHMAP=graphmap
| |
− | RACON=racon
| |
− |
| |
− | # first we must overlap all reads with each other
| |
− | $MINIMAP2 -x ava-ont -t "$THREADS" "$INREADS" "$INREADS" > "$ASMFOLDER/$ASMPREFIX.paf"
| |
− |
| |
− | # then miniasm can build the assembly graph (layout) from the overlaps
| |
− | $MINIASM -f "$INREADS" "$ASMFOLDER/$ASMPREFIX.paf" > "$ASMFOLDER/$ASMPREFIX.gfa"
| |
− |
| |
− | # extract unitigs (assemblies of fragments without competing choices in terms of internal overlaps) from the miniasm GFA segment (S) lines
| |
− | awk '$1 == "S" {print ">"$2"\n"$3}' "$ASMFOLDER/$ASMPREFIX.gfa" > "$ASMFOLDER/$ASMPREFIX.unitigs.fasta"
| |
− |
| |
− | # align reads to the unitigs
| |
− | $MINIMAP2 -x map-ont -t "$THREADS" "$ASMFOLDER/$ASMPREFIX.unitigs.fasta" "$INREADS" > "$ASMFOLDER/$ASMPREFIX.unitigs.paf"
| |
− |
| |
− | # polish the unitigs with racon to obtain consensus contigs
| |
− | $RACON -t "$THREADS" "$INREADS" "$ASMFOLDER/$ASMPREFIX.unitigs.paf" "$ASMFOLDER/$ASMPREFIX.unitigs.fasta" > "$ASMFOLDER/$ASMPREFIX.contigs.fasta"
| |
− |
| |
− | # re-align the reads to the contigs with minimap2 and graphmap to inspect the coverage
| |
− | $MINIMAP2 -x map-ont -a -t "$THREADS" "$ASMFOLDER/$ASMPREFIX.contigs.fasta" "$INREADS" > "$ASMFOLDER/$ASMPREFIX.reads.mm2.sam"
| |
− | $GRAPHMAP align -r "$ASMFOLDER/$ASMPREFIX.contigs.fasta" -d "$INREADS" -o "$ASMFOLDER/$ASMPREFIX.reads.gm.sam"
| |
− |
| |
− | </pre>
| |
− | <p>The pipeline can be started from the command line using: <code>./assemble.sh FQ_file PATH_TO_ASM_FOLDER PREFIX_of_output [THREADS]</code></p>
| |
− | </div>
| |
− | </div>
| |
− | </div>
| |
− | </div>
| |
− | </div>
| |
− | <div></div>
| |
− | <p>
| |
− | This finally led to a reasonable assembly after rearranging the aforementioned sequences which initially
| |
− | were overrepresented in the sequence.
| |
− | </p>
| |
− |
| |
− | <h2>Phage Genome Annotation</h2>
| |
− | <div class="row">
| |
− | <div class="col-12">
| |
− | <p>
| |
− | With an assembled genome in hand, we wanted to check how many protein-coding genes it contains.
| |
− | Two of the programs widely used in the field are glimmer [7] and genemark [8].
| |
− | </p>
| |
− | <p>
| |
− | Since genemark has been shown to perform better in benchmark tests [9], we used this tool for the
| |
− | genome annotation. We ran the tool on the assembled genome in FASTA format, generating a gene annotation file
| |
− | (gff3) for the genome that lists all coding sequences.
| |
− |
| |
− | For easier and more compact usage, we combined the genome in FASTA format and the annotation in gff3
| |
− | into the embl flat file format.
| |
− | </p>
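<p>An annotation file of this kind can be summarised with a few lines of awk, for example by counting coding sequences per strand. This is a minimal sketch under stated assumptions: <code>count_cds_by_strand</code> is a hypothetical helper, the input is assumed to be standard tab-separated GFF3 (feature type in column 3, strand in column 7), and the feature type is assumed to be <code>CDS</code>, which may differ in the actual genemark output.</p>

```shell
#!/usr/bin/env sh
# Count CDS features per strand in a GFF3 file read from stdin.
# count_cds_by_strand is a hypothetical helper for illustration only.
count_cds_by_strand() {
    # GFF3 columns: 3 = feature type, 7 = strand; lines starting with # are comments
    awk -F '\t' '!/^#/ && $3 == "CDS" { n[$7]++ }
                 END { printf "total=%d plus=%d minus=%d\n", n["+"] + n["-"], n["+"], n["-"] }'
}

# Tiny made-up example with three coding sequences (not real data).
printf '##gff-version 3\nctg1\tdemo\tCDS\t10\t100\t.\t+\t0\tID=g1\nctg1\tdemo\tCDS\t150\t300\t.\t-\t0\tID=g2\nctg1\tdemo\tCDS\t400\t900\t.\t+\t0\tID=g3\n' | count_cds_by_strand
```

<p>For a real annotation, the file would be piped in instead, e.g. <code>count_cds_by_strand &lt; annotation.gff3</code>, where the file name is a placeholder.</p>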
| |
− |
| |
− | <div class="row">
| |
− | <div class="col-12">
| |
− | <p>
| |
− | In conclusion, we can describe the assembled genomes as follows:
| |
− | </p>
| |
− | <table class="table">
| |
− | <thead>
| |
− | <tr>
| |
− | <th scope="col">Genome</th>
| |
− | <th scope="col">#Genes reference</th>
| |
− | <th scope="col">#Genes</th>
| |
− | <th scope="col">#Genes+</th>
| |
− | <th scope="col">#Genes-</th>
| |
− | <th scope="col">GC content (fraction)</th>
| |
− | <th scope="col">Genome length (bp)</th>
| |
− | </tr>
| |
− | </thead>
| |
− | <tbody>
| |
− | <tr>
| |
− | <th scope="row">T7</th>
| |
− | <td>60</td>
| |
− | <td>70</td>
| |
− | <td>69</td>
| |
− | <td>1</td>
| |
− | <td>0.4833</td>
| |
− | <td>39,684</td>
| |
− | </tr>
| |
− | <tr>
| |
− | <th scope="row">3S</th>
| |
− | <td>277</td>
| |
− | <td>423</td>
| |
− | <td>305</td>
| |
− | <td>118</td>
| |
− | <td>0.4019</td>
| |
− | <td>164,860</td>
| |
− | </tr>
| |
− | <tr>
| |
− | <th scope="row">NES</th>
| |
− | <td></td>
| |
− | <td>969</td>
| |
− | <td>821</td>
| |
− | <td>148</td>
| |
− | <td>0.3396</td>
| |
− | <td>373,576</td>
| |
− | </tr>
| |
− | <tr>
| |
− | <th scope="row">FFP</th>
| |
− | <td></td>
| |
− | <td>534</td>
| |
− | <td>353</td>
| |
− | <td>181</td>
| |
− | <td>0.3522</td>
| |
− | <td>368,471</td>
| |
− | </tr>
| |
− | </tbody>
| |
− | </table>
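<p>The GC content in the table is given as a fraction of all bases. As an illustration of how such a value can be computed from an assembly in FASTA format, here is a minimal sketch; <code>gc_fraction</code> is a hypothetical helper and not necessarily the method used to produce the table.</p>

```shell
#!/usr/bin/env sh
# GC fraction of all sequence lines in a FASTA file read from stdin.
# gc_fraction is a hypothetical helper for illustration only.
gc_fraction() {
    # Skip header lines (starting with ">"); gsub returns the number of G/C bases it removed.
    awk '!/^>/ { total += length($0); gc += gsub(/[GCgc]/, "") }
         END { if (total > 0) printf "%.4f\n", gc / total }'
}

# Tiny made-up example: 4 of 8 bases are G or C.
printf '>demo\nGGCC\nAATT\n' | gc_fraction
```

<p>For a real assembly, the FASTA file would be piped in instead, e.g. <code>gc_fraction &lt; contigs.fasta</code>, where the file name is a placeholder.</p>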
| |
− | <p>It can be seen that the number of genes detected by genemark is higher than the number of genes
| |
− | in the reference. This is most likely due to incorrectly sequenced bases introducing premature stop codons, which split one gene into several shorter predicted genes.
| |
− | More sophisticated polishing steps or additional higher-quality Illumina sequencing reads could possibly avoid this.
| |
− | </p>
| |
− | <p>
| |
− | Using the embl flat file format we visualized the phage genomes in a circular genome diagram plot (Figure 5).
| |
− | </p>
| |
− | <div class="row">
| |
− | <div class="col-12 col-md-4">
| |
− | <figure class="figure">
| |
− | <img src="https://static.igem.org/mediawiki/2018/6/6a/T--Munich--Results_3S_edit_phage.png" class="figure-img img-fluid rounded" alt="Circular genome diagram of the 3S phage genome">
| |
− | <figcaption class="figure-caption">3S</figcaption>
| |
− | </figure>
| |
− | </div>
| |
− | <div class="col-12 col-md-4">
| |
− | <figure class="figure">
| |
− | <img src="https://static.igem.org/mediawiki/2018/9/99/T--Munich--results_FFP_edit_phage.png" class="figure-img img-fluid rounded" alt="Circular genome diagram of the FFP phage genome">
| |
− | <figcaption class="figure-caption">FFP</figcaption>
| |
− | </figure>
| |
− | </div>
| |
− | <div class="col-12 col-md-4">
| |
− | <figure class="figure">
| |
− | <img src="https://static.igem.org/mediawiki/2018/8/80/T--Munich--Results_NES_phage.png" class="figure-img img-fluid rounded" alt="Circular genome diagram of the NES phage genome">
| |
− | <figcaption class="figure-caption">NES</figcaption>
| |
− | </figure>
| |
− | </div>
| |
− | </div>
| |
− | <div></div>
| |
− | <p>
| |
− | Here we must note several things. For the 3S genome, the coverage drops sharply at certain
| |
− | positions (at 58 kbp, 72 kbp and 110 kbp). At these positions no reads align to the reference genome.
| |
− | Since this realignment was performed using graphmap [], while the initial assembly pipeline uses minimap2 internally,
| |
− | this could also be an alignment artifact.
| |
− | </p>
| |
− | <p>
| |
− | For the NES genome we can see similar behaviour at 330 kbp. Additionally, there are two coverage spikes at the ends of
| |
− | the genome. Finally, the FFP genome shows the same problems at its ends as the NES genome.
| |
− | </p>
| |
− |
| |
− |
| |
− | <h2>Conclusion</h2>
| |
− | <div class="row">
| |
− | <div class="col-12">
| |
− | <p>
| |
− | Sequ-into has been shown to allow rapid genome assembly from a Nanopore sequencing sample such
| |
− | as those encountered in Phactory manufacturing experiments. We use state-of-the-art 3rd generation sequencing
| |
− | data analysis tools within our framework to overcome difficulties frequently experienced by scientists in
| |
− | Nanopore applications. As a result, we not only present the step-by-step results of the analysis,
| |
− | assembly and finally annotation of the bacteriophage genomes, but also show how to deal with Nanopore
| |
− | sequencing data, which is already widely used in synthetic biology and increasingly so. Furthermore, our custom genome of
| |
− | the bacteriophage 3S allows further analysis and usage in the manufacturing cycle, for example fast
| |
− | contamination detection using sequ-into.
| |
− | </p>
| |
− | </div>
| |
− | </div>
| |
− |
| |
− |
| |
− |
| |
− | <div id="phareferences" class="row">
| |
− | <h2>References</h2>
| |
− | <div class="col-12">
| |
− | <ol>
| |
− | <li><a href="https://academic.oup.com/bfg/article/11/1/25/191455">Zhenyu Li, Yanxiang Chen, Desheng Mu, Jianying Yuan, Yujian Shi, Hao Zhang, Jun Gan, Nan Li, Xuesong Hu, Binghang Liu, Bicheng Yang, Wei Fan; Comparison of the two major classes of assembly algorithms: overlap–layout–consensus and de-bruijn-graph, Briefings in Functional Genomics, Volume 11, Issue 1, 1 January 2012, Pages 25–37. </a></li>
| |
− | </ol>
| |
− | </div>
| |
− | </div>
| |
− |
| |
− |
| |
− |
| |
− | </main>
| |
− | </div>
| |
− |
| |
− | <script type="text/javascript" src="https://2018.igem.org/Template:Munich/PhactoryContentsJS?action=raw&ctype=text/javascript"></script>
| |
− |
| |
− | </html>
| |
− |
| |
− | {{Munich/PhactoryFooter}}
| |