Difference between revisions of "Team:Munich/DataAnalysis"

{{Munich/Phactory}}
 
<html>
 
<style>
 
  
    .pictureTitle{
 
        background: linear-gradient(rgba(0,0,0,.5), rgba(0,0,0,.8)), url("https://static.igem.org/mediawiki/2018/9/9a/T--Munich--ProjectTitle.jpg");
 
        background-repeat: no-repeat;
 
        background-size: cover;
 
        background-position:center;
 
    } 
 
       
 
 
 
@media only screen and (max-width: 575.98px) {}
 
 
@media only screen and (max-width: 767.98px) {}
 
 
@media only screen and (max-width: 991.98px) {}
 
 
@media only screen and (max-width: 1199.98px) {}
 
</style>
 
 
<div class="pictureTitle container-fluid text-center mb-0 align-items-center text-light">
 
 
                    <div class="display-2 mb-0">
 
                    Data Analysis
 
                    </div>
 
        <!--<h4>First Blick of Phactory</h4>-->
 
 
</div>
 
 
   
 
<div class="phaContainer">
 
  <aside id="phaContentsOuter">
 
  <aside id="phaContents" class="table-of-contents">
 
    <!-- will be generated with JS -->
 
  </aside>
 
  </aside>
 
 
  <main class="post-content">
 
     
 
    <!-- <h1>Data Analysis</h1>
 
    <p>The very top title, I am not sure if we still need that</p>-->
 
 
   
 
     
 
    <h2>Data Analysis</h2>
 
    <div class="row">
 
    <div class="col-12">
 
        <p>Quality control is one of the most essential aspects of manufacturing therapeutics and was therefore a major
            consideration in the design of Phactory. To address it from a bioinformatician's
            point of view, we analysed <a href="https://nanoporetech.com/products/minion">Oxford Nanopore MinION</a>
            sequencing data in order to draw conclusions on the purity, origin, and functionality of the bacteriophage genomes.
 
        </p>
 
        <p>For this purpose, the wetlab team sequenced several phage genomes: T7, 3S, NES, and FFP. The results were
            analysed with poreSTAT [1], a software package developed in-house. For each sequencing sample, it
            determines how many reads were sequenced, what base-pair yield was achieved, and how many MinION pores
            were used.
 
        </p>
 
        <p>
 
            Considering the order in which the experiments were run, it can clearly be seen how the chip
            wears out with each sequencing experiment. In the first sequencing experiment many pores
            show high average read lengths, whereas the average read length per pore decreases with the number of
            sequencing experiments.
 
        </p>
 
        <div></div>
 
        <div class="row">
 
                <div class="col-12 col-md-6">
 
                  <figure class="figure">
 
                    <img src="https://static.igem.org/mediawiki/2018/e/e7/T--munich--t7_summary.png" class="figure-img img-fluid rounded">
 
                    <figcaption class="figure-caption">T7</figcaption>
 
                      </figure>
 
                    </div>
 
                <div class="col-12 col-md-6">
 
                    <figure class="figure">
 
                        <img src="https://static.igem.org/mediawiki/2018/6/66/T--munich--3S_summary.png" class="figure-img img-fluid rounded">
 
                        <figcaption class="figure-caption">3S</figcaption>
 
                            </figure>
 
                </div>
 
            </div>
 
            <div class="row">
 
                    <div class="col-12 col-md-6">
 
                      <figure class="figure">
 
                        <img src="https://static.igem.org/mediawiki/2018/b/bd/T--munich--NES_summary.png" class="figure-img img-fluid rounded">
 
                        <figcaption class="figure-caption">NES</figcaption>
 
                          </figure>
 
                        </div>
 
                    <div class="col-12 col-md-6">
 
                        <figure class="figure">
 
                            <img src="https://static.igem.org/mediawiki/2018/b/b3/T--munich--ffp_summary.png" class="figure-img img-fluid rounded">
 
                            <figcaption class="figure-caption">FFP</figcaption>
 
                                </figure>
 
                    </div>
 
                </div>
 
    </div>
 
    </div>
 
    <div class="row">
 
    <div class="col-12">
 
    <p>The read and base-pair yields are summarized in the following table:</p>
    <table class="table">
 
    <thead>
 
        <tr>
 
            <th scope="col">Experiment</th>
 
            <th scope="col">Sequencing Time</th>
 
            <th scope="col">Reads Sequenced</th>
 
            <th scope="col">Base-pairs sequenced</th>
 
        </tr>
 
        </thead>
 
        <tbody>
 
        <tr>
 
            <th scope="row">T7</th>
 
            <td>12h 02min</td>
 
            <td>424,198</td>
 
            <td>1.27 &times; 10<sup>9</sup></td>
 
        </tr>
 
        <tr>
 
            <th scope="row">3S</th>
 
            <td>3h 42min</td>
 
            <td>77,092</td>
 
            <td>2.53 &times; 10<sup>8</sup></td>
 
        </tr>
 
        <tr>
 
            <th scope="row">NES</th>
 
            <td>3h 58min</td>
 
            <td>39,501</td>
 
            <td>2.31 &times; 10<sup>8</sup></td>
 
        </tr>
 
        <tr>
 
            <th scope="row">FFP</th>
 
            <td>5h 42min</td>
 
            <td>27,633</td>
 
            <td>1.23 &times; 10<sup>8</sup></td>
 
        </tr>
 
        </tbody>
 
    </table>
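    <p>As a rough cross-check of the table, the mean read length of each experiment can be estimated by dividing
        the base pairs sequenced by the number of reads. The following is only a minimal sketch and not part of
        poreSTAT; the helper name <code>mean_read_length</code> is hypothetical, and the inputs are the rounded
        values from the table, so the results are approximate.</p>

```shell
#!/usr/bin/env sh
# Mean read length = base pairs sequenced / reads sequenced.
# Hypothetical helper for illustration; not part of poreSTAT.
mean_read_length() {
    # $1 = reads sequenced, $2 = base pairs sequenced
    awk -v reads="$1" -v bases="$2" 'BEGIN { printf "%.0f\n", bases / reads }'
}

# Values copied from the yield table (base-pair yields are rounded).
for row in "T7 424198 1.27e9" "3S 77092 2.53e8" "NES 39501 2.31e8" "FFP 27633 1.23e8"; do
    set -- $row
    printf '%s\t%s bp\n' "$1" "$(mean_read_length "$2" "$3")"
done
```

    <p>These are averages over all reads of an experiment and say nothing about the per-pore read lengths shown
        in the figures above.</p>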
 
    <h2>Phage Genome Assembly</h2>
 
    <div class="row">
 
    <div class="col-12">
 
        <p>
 
            Genome assembly refers to aligning and merging fragments in order to reconstruct the original sequence.
            Nanopore sequencing is particularly well suited to genome assembly because its long reads reduce
            the ambiguity caused by highly similar and repetitive sequences. The read length distributions of the
            sequencing experiments are shown in Figure 2.
 
        </p>
 
        <div class="row">
 
                <div class="col-12 col-md-6">
 
                  <figure class="figure">
 
                    <img src="https://static.igem.org/mediawiki/2018/c/cd/T--munich--T7_all_reads.png" class="figure-img img-fluid rounded">
 
                    <figcaption class="figure-caption">T7</figcaption>
 
                      </figure>
 
                    </div>
 
                <div class="col-12 col-md-6">
 
                    <figure class="figure">
 
                        <img src="https://static.igem.org/mediawiki/2018/2/26/T--munich--3S_all_reads.png" class="figure-img img-fluid rounded">
 
                        <figcaption class="figure-caption">3S</figcaption>
 
                            </figure>
 
                </div>
 
            </div>
 
            <div class="row">
 
                    <div class="col-12 col-md-6">
 
                      <figure class="figure">
 
                        <img src="https://static.igem.org/mediawiki/2018/8/83/T--munich--NES_all_reads.png" class="figure-img img-fluid rounded">
 
                        <figcaption class="figure-caption">NES</figcaption>
 
                          </figure>
 
                        </div>
 
                    <div class="col-12 col-md-6">
 
                        <figure class="figure">
 
                            <img src="https://static.igem.org/mediawiki/2018/6/6a/T--munich--FFP_all_reads.png" class="figure-img img-fluid rounded">
 
                            <figcaption class="figure-caption">FFP</figcaption>
 
                                </figure>
 
                    </div>
 
                </div>
 
        </div>
 
        </div>
 
        <p>
 
            During the initial screening we observed that the read lengths do not approach the expected
            length of the phage genomes, which is in the 50-70 kbp range for T7 and the 100 kbp range for the remaining phages.
            Shorter reads generally lead to more ambiguity but are inevitable due to experimental limitations.
 
        </p>
 
        <p>
 
            The usual assembly algorithms require a reference genome. As bacteriophage genomes can be highly mosaic,
            i.e. the genomes of many phage species appear to be composed of numerous individual modules, we could not use
            a reference from the database. Consequently, we had to perform a <i>de novo</i> assembly, a method of
            reconstructing the original sequence by aligning and merging fragments without the aid of a reference genome.
 
        </p>
 
        <p>
 
            There are several tools available for <i>de novo</i> genome assembly. Unfortunately, most approaches have been
            developed for short-read genome assembly (2nd generation sequencing, Illumina) and employ a de Bruijn graph
            approach [10].
 
        </p>
 
        <p>
 
            Here we have 3rd generation sequencing data, which must be handled very differently from older short-read
            sequencing data because the reads contain more sequencing errors. While short reads nowadays have
            error rates of about 1% (i.e. 1 base out of 100 is reported incorrectly), the error rate is up to 15% for
            nanopore sequencing data, even with the newest sequencing chemistry (R9.4 at the time of wiki-freeze).
 
        </p>
 
        <p>
 
            Assemblers suitable for 3rd generation sequencing data are few in number, canu [2] and miniasm [3] being the
            most prevalent ones. Both rely on an overlap-layout-consensus approach, which, historically, can be seen
            as the father of all assemblers [10] (see Celera Assembler [5]).
 
        </p>
 
        <p>
 
            We first used canu to assemble our genomes. The main limitation of any approach to 3rd generation
            sequencing data analysis is contamination. For assembly in particular, contamination is disastrous
            because it can lead the assembler into faulty assemblies, depending on the phylogenetic distance between
            the original sample and the contamination. In the worst case, the assembler collapses the sequences of all
            organisms or gives no output sequence at all.
 
        </p>
 
        <p>
 
            We thus developed <a href="https://2018.igem.org/Team:Munich/Software">sequ-into</a> to first detect
            the contamination and then remove the reads that originate from it.
            More on the performance of sequ-into and our findings while using it can be found on
            our <a href="https://2018.igem.org/Team:Munich/Software">Software</a> page.
 
        </p>
 
        <p>
 
            After eliminating contamination we noticed non-uniform coverage of the sequence 
 
            in the phage genome assemblies after re-aligning the reads to the assembly, which can be seen in Figure 3.
 
        </p>
 
       
 
        <div class="col-12">
 
                <figure class="figure">
 
                  <img src="https://static.igem.org/mediawiki/2018/6/68/T--munich--ffp_before_fix.png" class="figure-img img-fluid rounded">
 
                  <figcaption class="figure-caption">FFP</figcaption>
 
                    </figure>
 
                  </div>
 
        <p>
 
            In theory, we should observe uniform coverage over the full genome since there is no bias for read
 
            template generation during sample preparation (due to random primers).
 
        </p>
 
        <p>
 
            However, the first half of the sequence has a lower coverage than the remaining part and there is
 
            a high-coverage region in the middle and end of the assembled genome.
 
        </p>
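        <p>For comparison with the uneven profile described above, the mean depth expected under perfectly uniform
            coverage can be estimated as the base pairs sequenced divided by the genome length. This is only a
            back-of-the-envelope sketch: the helper name <code>expected_depth</code> is hypothetical, the yields and
            assembled genome lengths are taken from the tables on this page, and the yields are assumed to still
            include contaminating reads, so the values are upper bounds.</p>

```shell
#!/usr/bin/env sh
# Expected mean depth under uniform coverage = base pairs sequenced / genome length.
# Hypothetical helper for illustration only.
expected_depth() {
    # $1 = base pairs sequenced, $2 = genome length in bp
    awk -v bases="$1" -v glen="$2" 'BEGIN { printf "%.0f\n", bases / glen }'
}

# Yields from the sequencing table, lengths from the assembly table on this page.
for row in "T7 1.27e9 39684" "3S 2.53e8 164860" "NES 2.31e8 373576" "FFP 1.23e8 368471"; do
    set -- $row
    printf '%s\t%sx\n' "$1" "$(expected_depth "$2" "$3")"
done
```

        <p>Regions whose observed depth lies far above or below such an estimate are candidates for
            misassembly or alignment artifacts.</p>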
 
        <p>
 
            Terminal repeats are common in phage genomes. They can cause doubled coverage within
            the sequence when the genome is misassembled: the initial and terminal stretches
            of the sequence are force-matched in the interior of the assembled sequence rather than at its ends.
 
        </p>
 
        <p>
 
            We thus tried the other assembler, miniasm, which is known for very fast assemblies but performs little
            error correction. However, error correction can be achieved by combining miniasm with minimap2 [4]
            for read mapping and racon [6] for polishing the sequences.
 
        </p>
 
        <div class="card mt-2">
 
 
                <div class="card-header" id="heading2" style="background-color: #018DCD;">
 
                    <h5 class="mb-0">
 
                        <button class="btn btn-link collapsed" data-toggle="collapse" data-target="#collapse2"
 
                            aria-expanded="false" aria-controls="collapse2">
 
                            <span class="text-white font-weight-bold">The assembly pipeline</span>
 
                        </button>
 
                    </h5>
 
                </div>
 
   
 
                <div id="collapse2" class="collapse" aria-labelledby="heading2" data-parent="#accordion">
 
                    <div class="card-body">
 
                        <div class="row mt-2">
 
                            <pre>
#!/usr/bin/env sh
# Usage: ./assemble.sh READS.fastq ASM_FOLDER ASM_PREFIX [THREADS]

INREADS=$1
ASMFOLDER=$2
ASMPREFIX=$3
THREADS=${4:-4}   # default to 4 threads if none are given

# paths to the executables used
MINIMAP2=minimap2
MINIASM=miniasm
GRAPHMAP=graphmap
RACON=racon

OUT="$ASMFOLDER/$ASMPREFIX"
mkdir -p "$ASMFOLDER"

# 1. overlap all reads with each other
$MINIMAP2 -x ava-ont -t"$THREADS" "$INREADS" "$INREADS" > "$OUT.paf"

# 2. let miniasm build the assembly graph from the overlaps
$MINIASM -f "$INREADS" "$OUT.paf" > "$OUT.gfa"

# 3. extract the unitigs (assemblies of fragments for which there are no
#    competing choices in terms of internal overlaps) from the S lines of the GFA file
awk '$1 == "S" {print ">"$2"\n"$3}' "$OUT.gfa" > "$OUT.unitigs.fasta"

# 4. align the reads to the unitigs
$MINIMAP2 -t"$THREADS" "$OUT.unitigs.fasta" "$INREADS" > "$OUT.unitigs.paf"

# 5. polish the unitigs with racon to obtain the consensus contigs
$RACON -t "$THREADS" "$INREADS" "$OUT.unitigs.paf" "$OUT.unitigs.fasta" > "$OUT.contigs.fasta"

# 6. re-align the reads to the polished contigs with both mappers
$MINIMAP2 -x map-ont -a -t"$THREADS" "$OUT.contigs.fasta" "$INREADS" > "$OUT.reads.mm2.sam"
$GRAPHMAP align -r "$OUT.contigs.fasta" -d "$INREADS" -o "$OUT.reads.gm.sam"
                            </pre>
 
                            <p>The script can be started from the command line using: <code>./assemble.sh FQ_FILE PATH_TO_ASM_FOLDER OUTPUT_PREFIX [THREADS]</code></p>
 
                        </div>
 
                    </div>
 
                </div>
 
            </div>
 
    </div>
 
<div></div>
 
<p>
 
    This finally led to a reasonable assembly, after rearranging the aforementioned regions that were initially
    overrepresented in the sequence.
 
</p>
 
     
 
    <h2>Phage Genome Annotation</h2>
 
    <div class="row">
 
    <div class="col-12">
 
    <p>
 
        Once we had an assembled genome, we checked how many protein-coding genes it contains.
        Two of the programs widely used in the field are glimmer [7] and genemark [8].
 
    </p>
 
    <p>
 
        Since genemark has been shown to perform better in benchmark tests [9], we used this tool for the
        genome annotation. We ran it on the assembled genome in FASTA format, generating a gene annotation file
        (GFF3) that marks all coding sequences in the genome.
        For easier and more compact usage, we combined the genome in FASTA format and the annotation in GFF3 format
        into a single file in the EMBL flat file format.
 
    </p>
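    <p>The per-strand gene counts reported below can be derived directly from such a GFF3 file. The following is
        a minimal sketch, assuming the predicted genes are recorded as <code>CDS</code> features with the strand in
        column 7, as the GFF3 format specifies; the function name <code>count_cds_by_strand</code> and the demo
        annotation are made up for illustration and are not our actual data.</p>

```shell
#!/usr/bin/env sh
# Count predicted coding sequences per strand in a GFF3 file.
# GFF3 is tab-separated: column 3 = feature type, column 7 = strand.
count_cds_by_strand() {
    awk -F '\t' '
        /^#/ { next }                      # skip header and comment lines
        $3 == "CDS" { n[$7]++; total++ }   # tally CDS features by strand
        END { printf "total=%d plus=%d minus=%d\n", total, n["+"], n["-"] }
    ' "$1"
}

# Tiny made-up annotation, only to demonstrate the call.
demo=$(mktemp)
printf '##gff-version 3\n' > "$demo"
printf 'contig1\tdemo\tCDS\t10\t90\t.\t+\t0\tID=g1\n' >> "$demo"
printf 'contig1\tdemo\tCDS\t120\t300\t.\t-\t0\tID=g2\n' >> "$demo"
printf 'contig1\tdemo\tCDS\t350\t500\t.\t+\t0\tID=g3\n' >> "$demo"
count_cds_by_strand "$demo"
rm -f "$demo"
```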
 
   
 
    <div class="row">
 
            <div class="col-12">
 
            <p>
                In conclusion, we can describe the assembled genomes as follows:
            </p>
            <table class="table">
 
              <thead>
 
                <tr>
 
                  <th scope="col">Genome</th>
                  <th scope="col">Genes (reference)</th>
                  <th scope="col">Genes (predicted)</th>
                  <th scope="col">Genes (+ strand)</th>
                  <th scope="col">Genes (- strand)</th>
                  <th scope="col">GC content (fraction)</th>
                  <th scope="col">Genome length (bp)</th>
 
                </tr>
 
              </thead>
 
              <tbody>
 
                <tr>
 
                    <th scope="row">T7</th>
 
                    <td>60</td>
 
                    <td>70</td>
 
                    <td>69</td>
 
                    <td>1</td>
 
                    <td>0.4833</td>
 
                    <td>39,684</td>
 
                </tr>
 
                <tr>
 
                    <th scope="row">3S</th>
 
                    <td>277</td>
 
                    <td>423</td>
 
                    <td>305</td>
 
                    <td>118</td>
 
                    <td>0.4019</td>
 
                    <td>164,860</td>
 
                </tr>
 
                <tr>
 
                    <th scope="row">NES</th>
 
                    <td></td>
 
                    <td>969</td>
 
                    <td>821</td>
 
                    <td>148</td>
 
                    <td>0.3396</td>
 
                    <td>373,576</td>
 
                </tr>
 
                <tr>
 
                    <th scope="row">FFP</th>
 
                    <td></td>
 
                    <td>534</td>
 
                    <td>353</td>
 
                    <td>181</td>
 
                    <td>0.3522</td>
 
                    <td>368,471</td>
 
                </tr>
 
              </tbody>
 
            </table>
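    <p>The GC content and genome length columns can be reproduced from an assembled FASTA file with a few lines
        of awk. This is a minimal sketch rather than the exact procedure we used; the function name
        <code>gc_content</code> and the demo sequence are hypothetical.</p>

```shell
#!/usr/bin/env sh
# Report total sequence length and GC fraction of a FASTA file.
gc_content() {
    awk '
        /^>/ { next }                      # skip FASTA header lines
        {
            seq = toupper($0)
            len += length(seq)             # count all bases on this line
            gc  += gsub(/[GC]/, "", seq)   # gsub returns the number of G/C bases removed
        }
        END { if (len > 0) printf "length=%d gc=%.4f\n", len, gc / len }
    ' "$1"
}

# Made-up two-line sequence, only to demonstrate the call.
demo=$(mktemp)
printf '>demo\nATGCGC\nTTAA\n' > "$demo"
gc_content "$demo"
rm -f "$demo"
```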
 
    <p>Where a reference is available, the number of genes detected by genemark is higher than the number of genes
        in the reference. This is most likely due to incorrectly sequenced bases that introduce premature stop codons,
        so that a single gene is split into several shorter predicted genes.
        More sophisticated polishing steps or additional higher-quality Illumina sequencing reads could possibly avoid this.
 
    </p>
 
    <p>
 
        Using the embl flat file format we visualized the phage genomes in a circular genome diagram plot (Figure 5).
 
    </p>
 
    <div class="row">
 
        <div class="col-12 col-md-4">
 
        <figure class="figure">
 
        <img src="https://static.igem.org/mediawiki/2018/6/6a/T--Munich--Results_3S_edit_phage.png" class="figure-img img-fluid rounded" alt="Circular genome diagram of the assembled 3S phage genome">
 
        <figcaption class="figure-caption">3S</figcaption>
 
            </figure>
 
    </div>
 
        <div class="col-12 col-md-4">
 
        <figure class="figure">
 
        <img src="https://static.igem.org/mediawiki/2018/9/99/T--Munich--results_FFP_edit_phage.png" class="figure-img img-fluid rounded" alt="Circular genome diagram of the assembled FFP phage genome">
 
        <figcaption class="figure-caption">FFP</figcaption>
 
            </figure>
 
        </div>
 
        <div class="col-12 col-md-4">
 
        <figure class="figure">
 
        <img src="https://static.igem.org/mediawiki/2018/8/80/T--Munich--Results_NES_phage.png" class="figure-img img-fluid rounded" alt="Circular genome diagram of the assembled NES phage genome">
 
        <figcaption class="figure-caption">NES</figcaption>
 
            </figure>
 
        </div>
 
        </div>
 
    <div></div>
 
    <p>
 
        Several things must be noted here. For the 3S genome, the coverage drops sharply
        at certain positions (at 58 kbp, 72 kbp and 110 kbp). At these positions no reads align to the reference genome.
        Since this realignment was performed using graphmap [], while the initial assembly pipeline uses minimap2 internally,
        this could also be an alignment artifact.
 
    </p>
 
    <p>
 
        For the NES genome we can see a similar drop at 330 kbp. Additionally, there are two coverage spikes at the ends of
        the genome. Finally, the FFP genome shows the same problem at its ends as the NES genome.
 
    </p>
 
     
 
   
 
    <h2>Conclusion</h2>
 
    <div class="row">
 
    <div class="col-12">
 
        <p>
 
            Sequ-into has been shown to enable rapid genome assembly from a Nanopore sequencing sample such
            as those encountered in Phactory manufacturing experiments. Within our framework we use state-of-the-art
            3rd generation sequencing data analysis tools to overcome difficulties frequently experienced by scientists in
            Nanopore applications. As a result, we not only present the step-by-step results of the analysis,
            assembly, and annotation of the bacteriophage genomes, but also show how to handle Nanopore
            sequencing data, which is already widely used in synthetic biology and increasingly so. Furthermore, our custom
            genome of the bacteriophage 3S allows further analysis and use in the manufacturing cycle, for example
            for fast contamination detection using sequ-into.
 
        </p>
 
    </div>
 
    </div>
 
     
 
     
 
 
     
 
<div id="phareferences" class="row">
 
<h2>References</h2>
 
<div class="col-12">
 
<ol start="10">
 
<li><a href="https://academic.oup.com/bfg/article/11/1/25/191455">Zhenyu Li, Yanxiang Chen, Desheng Mu, Jianying Yuan, Yujian Shi, Hao Zhang, Jun Gan, Nan Li, Xuesong Hu, Binghang Liu, Bicheng Yang, Wei Fan; Comparison of the two major classes of assembly algorithms: overlap–layout–consensus and de-bruijn-graph, Briefings in Functional Genomics, Volume 11, Issue 1, 1 January 2012, Pages 25–37. </a></li>
 
</ol>
 
</div>
 
</div>
 
 
 
   
 
 
</main>
 
</div>
 
 
<script type="text/javascript" src="https://2018.igem.org/Template:Munich/PhactoryContentsJS?action=raw&amp;ctype=text/javascript"></script>
 
 
</html>
 
 
{{Munich/PhactoryFooter}}
 

Latest revision as of 03:21, 18 October 2018