Latest revision as of 23:37, 17 October 2018

Bioinformatics

After a successful sequencing run you are left with raw data containing millions and millions (and millions) of lines of base sequences, all of which need to be processed and interpreted. This is where the interdisciplinary field of bioinformatics comes in. A vast range of software tools is available, tailored both to different kinds of analysis and to the different sequencing methods being used.

Most of the tools we used were available through the free website Usegalaxy.org, which also let us run the processing on its servers. Because we also used nanopore sequencing, we relied on tools tailored to MinION data, which were available from the nanopore community hub and could be run from a terminal window.

Experiment

We decided to create our bioinformatics pipeline from scratch. This was not an easy task, however, as nanopore technology is novel and many of the available pipelines are tailored to Illumina sequencing. Generally, though, a basic transcriptomics pipeline consists of the following steps: alignment to a reference genome, gene counting and differential gene expression analysis [1]. For the nanopore data, a couple of data processing steps were needed beforehand, namely demultiplexing and adapter trimming.

Demultiplexing and Adapter Trimming

Because the sequencing run uses pooled samples containing both the barcoded cultured-group and control-group samples, the data produced needs to be demultiplexed, i.e. separated into files containing the reads from the respective groups. Because the barcode used to fingerprint each group is itself a base sequence, it also had to be removed, or “trimmed”, from the data, leaving us with the pure mRNA sequences. This was achieved using a free nanopore community tool called Porechop.
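The idea behind demultiplexing and trimming can be illustrated with a small Python sketch. It is not how Porechop works internally: the barcode sequences, group names and reads below are invented for illustration, and only exact barcode matches at the start of a read are recognised.

```python
from collections import defaultdict

# Hypothetical barcodes for illustration only; real barcodes are longer.
BARCODES = {
    "cultured": "ACGTAC",
    "control": "TGCATG",
}


def demultiplex_and_trim(reads, barcodes=BARCODES):
    """Sort reads into groups by their leading barcode and strip the barcode.

    Only exact matches at the start of a read are recognised. Reads without
    a matching barcode are collected under "unclassified" and left untrimmed.
    """
    groups = defaultdict(list)
    for read in reads:
        for name, barcode in barcodes.items():
            if read.startswith(barcode):
                # Keep only the part of the read after the barcode.
                groups[name].append(read[len(barcode):])
                break
        else:
            groups["unclassified"].append(read)
    return dict(groups)


# Made-up reads: one per group plus one without a known barcode.
reads = ["ACGTACGGGTTTAAA", "TGCATGCCCAAATTT", "NNNNNNGGGCCC"]
demultiplexed = demultiplex_and_trim(reads)
print(demultiplexed)
```

A real tool also has to tolerate sequencing errors inside the barcode and look for barcodes at both ends of a read, which this sketch leaves out.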

Genome Alignment

The base sequences need to be aligned to the reference genome of the sequenced species for the downstream data analysis. This is important because we want to know where each sequence lies in the genome and which gene it corresponds to. Genome alignment was done using another community tool called minimap2.
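As a rough sketch of what running the alignment from a script could look like, the Python snippet below assembles a minimap2 command line. The file names are hypothetical, and the options shown (`-a` for SAM output and the `map-ont` preset for nanopore reads) are an assumption about a reasonable setup, not necessarily the settings we used.

```python
import shlex
import subprocess


def build_minimap2_command(reference, reads, preset="map-ont"):
    """Return the argument list for aligning reads to a reference.

    "-a" asks minimap2 for SAM output and "-x" selects a preset;
    "map-ont" is the preset intended for nanopore reads.
    """
    return ["minimap2", "-a", "-x", preset, reference, reads]


def run_alignment(reference, reads, output_sam):
    """Run minimap2 and write its standard output (SAM) to a file.

    Requires minimap2 to be installed; not called in this sketch.
    """
    with open(output_sam, "w") as handle:
        subprocess.run(build_minimap2_command(reference, reads),
                       stdout=handle, check=True)


# Hypothetical file names, for illustration only.
command = build_minimap2_command("reference.fasta", "cultured.fastq")
print(shlex.join(command))
```

Building the argument list separately from running it makes it easy to inspect the exact command before anything is executed.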

Figure 1. Running demultiplexing and barcode trimming from the terminal. The programme first separates the reads according to barcode and then searches for available possible barcodes to be trimmed off.

Gene Counting

Gene counting means counting how many of the mRNA sequences aligned in the previous step fall on each gene. The count reflects the expression level of that gene, and comparing counts between conditions shows how strongly the gene is up- or down-regulated. Many different tools were available for gene counting; we chose “featureCounts”, run through Galaxy.

After the differential gene expression analysis was done, the data was filtered twice: first for the lowest adjusted p-values and subsequently for the largest fold changes (meaning the strongest changes in expression). What remained were a couple of candidate genes that could easily be identified by their gene ID in databases such as NCBI.
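The counting step can be sketched in a few lines of Python. This is a simplified illustration rather than the featureCounts implementation: the gene coordinates and alignments are made up, and a read is counted only when it overlaps exactly one gene.

```python
from collections import Counter

# Hypothetical annotation: gene name -> (start, end), 1-based and inclusive.
GENES = {"geneA": (100, 500), "geneB": (600, 900)}


def count_reads_per_gene(alignments, genes=GENES):
    """Count aligned reads per gene.

    alignments is a list of (start, end) positions on the reference.
    A read is assigned to a gene when the two intervals share at least one
    base; reads overlapping no gene or more than one gene are skipped.
    """
    counts = Counter({name: 0 for name in genes})
    for read_start, read_end in alignments:
        hits = [name for name, (gene_start, gene_end) in genes.items()
                if read_start <= gene_end and read_end >= gene_start]
        if len(hits) == 1:
            counts[hits[0]] += 1
    return dict(counts)


# Made-up alignments: two on geneA, one ambiguous, one on geneB, one on no gene.
alignments = [(120, 300), (450, 520), (480, 650), (700, 800), (950, 1000)]
counts = count_reads_per_gene(alignments)
print(counts)
```

Running the same count on the cultured and control samples gives the per-gene numbers that the differential expression analysis then compares.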
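The two filtering steps can be sketched as follows. The gene names and numbers are invented, the 0.05 cut-off on the adjusted p-value is the conventional threshold, and the minimum absolute log2 fold change of 1.0 is an assumed value chosen only for the example.

```python
def filter_candidates(results, padj_max=0.05, min_abs_log2fc=1.0):
    """Keep significant genes with a large fold change.

    results is a list of dicts with the keys "gene", "log2fc" and "padj".
    Genes are kept when padj <= padj_max and |log2fc| >= min_abs_log2fc,
    then sorted by absolute log2 fold change, largest first.
    """
    kept = [
        row for row in results
        if row["padj"] is not None
        and row["padj"] <= padj_max
        and abs(row["log2fc"]) >= min_abs_log2fc
    ]
    return sorted(kept, key=lambda row: abs(row["log2fc"]), reverse=True)


# Made-up differential expression results, for illustration only.
results = [
    {"gene": "gene1", "log2fc": 3.2, "padj": 0.001},
    {"gene": "gene2", "log2fc": -1.5, "padj": 0.04},
    {"gene": "gene3", "log2fc": 4.0, "padj": 0.60},   # large change, not significant
    {"gene": "gene4", "log2fc": 0.3, "padj": 0.01},   # significant, small change
]
candidates = [row["gene"] for row in filter_candidates(results)]
print(candidates)
```

Filtering on the adjusted p-value first ensures that a large fold change backed by too little data, such as gene3 here, is not mistaken for a candidate.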

Figure 2. Results of a differential gene expression analysis using DESeq2 on test files. The genes (shown with their gene ID) as well as their base mean (mean normalized read count) and several statistical results can be seen.

Result

Validating our Transcriptomics Pipeline

The transcriptomics pipeline was tested and validated using publicly available read files from the internet. The files consisted of two datasets of E. coli (in triplicate), cultured in regular LB and in a sugar solution respectively.

Figure 3. Results of the differential gene expression analysis using DESeq2 on test files. The genes (shown with their gene ID) as well as their base mean (mean normalized read count) and several statistical results can be seen.

Figure 4. Results of the differential gene expression after filtering for statistical significance and fold change.

Searching for the genes in the NCBI database showed that the most highly expressed gene in the sugar-cultured E. coli is involved in a type of sugar system, indicating that the pipeline was working.

Figure 5. Highly expressed gene produced by the pipeline, matching a glucose-specific gene.

Figure 6. Results of the differential gene expression analysis done on our own data.

Analyzing Our Own Sequencing Data

Table 1. The first few genes from the differential gene expression analysis shown in Figure 6, together with their promoter sequence and function in the organism.

Gene ID	Gene name	Promoter sequence	Function	Fold change
ER3413_45	apaG	ggcaccatgcagggtcactacgaaatgatcgatgaaa	protein associated with Co2+ and Mg2+ efflux	0.40
ER3413_70	leuA	ttgacatccgtttttgtatccagtaactctaaaagc	2-isopropylmalate synthase	0.40
ER3413_87	murG	-	N-acetylglucosaminyl transferase	0.40
ER3413_126	panD	tagacactaaacaaaaatcgggcaatactgcgtga	putative inner membrane protein	0.40
ER3413_173	frr	ttacccgtaatatgtttaatcagggctatacttagcac	inner membrane protein, UPF0118 family	0.40

The results from our own runs were unfortunately not as good as those seen above. The effect of the major issues with sequencing, and with generating enough data, can be seen in Figure 6. Judging by the adjusted p-values, it is clear that even though the genes can be identified, as seen in Table 1, their statistical significance is extremely uncertain (a result is conventionally accepted only at an adjusted p-value <= 0.05). No up- or down-regulation with a fold change of interest could be identified either. The absence of any major fold change and the low significance can most likely be attributed to too little data being generated in the prior sequencing step. For these reasons, no gene could be identified as a possible candidate for our reporter system.

References

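[1] Galaxyproject, 2018. Reference-based RNA-Seq data analysis. Galaxyproject. https://galaxyproject.github.io/training-material/topics/transcriptomics/tutorials/ref-based/tutorial.html. Date of visit: 2018-10-15.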

@@ Line 5: / Line 5: @@
 {{Uppsala/javascript/scroll-button}}
 {{Uppsala/javascript/redirect_js}}
+{{Uppsala/buttons}}
 <html>
@@ Line 10: / Line 12: @@
          <style type="text/css">
+            .parallax {
+                /* The image used */
+                background-image: url("https://static.igem.org/mediawiki/2018/9/99/T--Uppsala--Transcriptomics-HEADER_2.jpeg");
+            }
+            .side-img a{
+              display: inline-block;
+              color: black;
+              padding-left: 5px;
+              text-decoration: none;
+             }
+              .inner-card-text a{
+              display: inline-block;
+              color: black;
+              padding-left: 5px;
+              text-decoration: none;
+             }
          </style>
@@ Line 21: / Line 45: @@
-  <div class="svg-wrapper">
+  <div class="svg-wrapper" id="Project_Description">
@@ Line 158: / Line 182: @@
      <div class="body">
          <div class="parallax"></div>
-        <div class="igem-icon"><a href="https://2018.igem.org/Main_Page"><img src="https://static.igem.org/mediawiki/2018/b/b0/T--Uppsala--graylogo.png"></a></div>
+   <div class="igem-icon"><a href="https://2018.igem.org/Team:Uppsala"><img src="https://static.igem.org/mediawiki/2018/c/cf/T--Uppsala--WormBusterLogo_Black.png"></a></div>
+        <div class= "content blur-box" style="font-size:16px;">
-        <div class ="scroll-down-button">
-             <section id="section02" class="demo">
-                <h1></h1>
-                <a href="#scrolldown"><span></span></a>
-             </section>
+<!-- CONTENT OF WHATS ON THE PAGE -->
+         <div id="toc" class="toc">
+             <div id="toctitle"></div>
+            <ul>
+                <li class="toclevel tocsection"><a href="#Project_Description" class="scroll"> <span id="whereYouAre"> Bioinformatics</span> </a>
+                        <ul>
+                            <li class="toclevel nav-item active"><a href="#Exp" class="nav-link scroll"> Experiment</a></li>
+                            <li class="toclevel nav-item"><a href="#Results" class="nav-link scroll">  Results</a></li>
+                            <li class="toclevel nav-item"><a href="#References" class="nav-link scroll"> References </a></li>
+                        </ul>
+                 </li>
+             </ul>
          </div>
-        <div class= "content blur-box" style="font-size:16px;">
              <div class ="content-text" id="scrolldown" >
                  <div style="height:5em;"></div>
                  <!-- FROM THIS POINT DOWNWARDS YOU START ADDING YOUR STUFF -->
@@ Line 173: / Line 236: @@
 <div class="card-holder">
-     <div class="content-card-heading">
          <h1>Bioinformatics</h1>
-    </div>
@@ Line 181: / Line 243: @@
 <p>After a succesfull sequencing has been performed and you’re left with raw data containing millions and millions (and millions) of lines of base sequences, all of this needs to be processed and interpreted. This is where the interdisciplinary field of bioinformatics comes in. A vast range of software tools are available, tailored to different kinds of analysis as well as being unique to the different sequencing methods being used.<br><br>
-Most of the tools we used were available through the free website Usegalaxy.org which as well let us do the processing on their servers. Because we also made use of nanopore sequencing, tailored tools used for the MinION data were available from their community hub which could be run from a terminal window. </p><br><br>
+Most of the tools we used were available through the free website Usegalaxy.org which as well let us do the processing on their servers. Because we also made use of nanopore sequencing, tailored tools used for the MinION data were available from their community hub which could be run from a terminal window. </p>
-<h2>Experiment</h2>
+<h2 id="Exp">Experiment</h2>
-<p>We decided to create our bioinformatics pipeline from scratch. Generally, a basic transcriptomics pipeline looks like the following: Alignment to a reference genome, gene counting and differential gene expression. However a couple of data processing steps were needed for the nanopore data beforehand such as demultiplexing and adapter trimming.</p><br>
+<p>We decided to create our bioinformatics pipeline from scratch. This was not an easy task however as nanopore technology is novel and many of the available pipelines are tailored to illumina sequencing. Generally though, a basic transcriptomics pipeline looks like the following: Alignment to a reference genome, gene counting and differential gene expression [1]. However a couple of data processing steps were needed for the nanopore data beforehand such as demultiplexing and adapter trimming.</p><br>
+<h3>Demultiplexing and Adapter Trimming</h3>
-<h3>Demultiplexing and adapter trimming</h3>
-<p>Because the sequencing itself runs pooled samples containing both the barcoded cultured- and control-group samples, the data produced needs to be demultiplexed i.e separated into files containing the reads from respective groups. Because the barcodes used to fingerprint each group is made up of its own base sequence, this also had to be removed or ”trimmed” from the data, leaving us with the pure mRNA sequences. This was achieved using a free nanopore community tool called porechop.</p>
                  </div>
@@ Line 197: / Line 259: @@
                          <div class="side-text">
                              <!-- Here you put your paragraphs -->
-                             <p><b>Figure 1:</b> Running demultiplexing and barcode trimming from the terminal. The programme first separates the reads according to barcode and then searches for available possible barcodes to be trimmed off.</p>
+                             <p>Because the sequencing itself runs pooled samples containing both the barcoded cultured- and control-group samples, the data produced needs to be demultiplexed i.e separated into files containing the reads from respective groups. Because the barcodes used to fingerprint each group is made up of its own base sequence, this also had to be removed or ”trimmed” from the data, leaving us with the pure mRNA sequences. This was achieved using a free nanopore community tool called porechop.</p><br>
+                             <h3>Genome Alignment</h3>
+<p>The base sequences needs to be aligned to the reference genome of the sequenced species in question for the downstream data analysis. This is important because we want to know where each sequence actually lies in the genome and which genes they correspond to. Genome alignment was done using another community tool called minimap2.</p>
@@ Line 205: / Line 268: @@
                          <div class="side-img" style="background-color:darkolivegreen;">
                             <!-- Here goes the big image to the right -->
-                           <img src="https://static.igem.org/mediawiki/2018/3/3b/T--Uppsala--Transcriptomics-Demultiplexing.png">
+                          <img src="https://static.igem.org/mediawiki/2018/3/3b/T--Uppsala--Transcriptomics-Demultiplexing.png">
+                            <a href="https://static.igem.org/mediawiki/2018/3/3b/T--Uppsala--Transcriptomics-Demultiplexing.png"><p><b>Figure 1.</b> Running demultiplexing and barcode trimming from the terminal. The programme first separates the reads according to barcode and then searches for available possible barcodes to be trimmed off.</p></a>
                          </div>
                      </div>
+                </div>
                  <!--End of template with side picture -->
-<br><br>
-<h3>Genome alignment</h3>
-<p>The base sequences needs to be aligned to the reference genome of the sequenced species in question for the downstream data analysis. This is important because we want to know where each sequence actually lies in the genome and which genes they correspond to. Genome alignment was done using another community tool called minimap2.</p>
-<h3>Gene counting</h3>
+<div class="card-holder">
-<p>Gene counting basically means that you count how many times each mRNA sequence (aligned over a gene from the previous step) occurs. This in turn directly correlates to the amount of up- or down-regulation of that particular gene. A lot of different tools were available for gene counting but ”featureCounts” was chosen through galaxy.</p>
+<h3>Gene Counting</h3>
                        </div>
@@ Line 227: / Line 291: @@
                          <div class="side-text">
                              <!-- Here you put your paragraphs -->
-                             <p><b>Figure 2:</b> Results of a differential gene expression analysis using Deseq2 on test files. The genes (shown with their gene ID) as well as their mean base length and several statistical results can be seen.</p>
+                             <p>Gene counting basically means that you count how many times each mRNA sequence (aligned over a gene from the previous step) occurs. This in turn directly correlates to the amount of up- or down-regulation of that particular gene. A lot of different tools were available for gene counting but ”featureCounts” was chosen through galaxy.</p><br><br>
+                            <p>After the differential gene expression analysis is done the data was filtered twice, one time for the best adjusted P-value and subsequently for the highest (meaning the most significant) fold changes. Left were a couple of candidate genes which could be easily identified by their gene ID through various databases such as NCBI.</p>
@@ Line 235: / Line 301: @@
                          <div class="side-img" style="background-color:darkolivegreen;">
                             <!-- Here goes the big image to the right -->
-                           <img src="https://static.igem.org/mediawiki/2018/a/a9/T--Uppsala--Transcriptomics-Bioinformatics2.png">
+                      <img src="https://static.igem.org/mediawiki/2018/a/a9/T--Uppsala--Transcriptomics-Bioinformatics2.png">
+                            <a href="https://static.igem.org/mediawiki/2018/a/a9/T--Uppsala--Transcriptomics-Bioinformatics2.png"><p><b>Figure 2.</b> Results of a differential gene expression analysis using Deseq2 on test files. The genes (shown with their gene ID) as well as their mean base length and several statistical results can be seen.</p></a>
                          </div>
@@ Line 242: / Line 310: @@
                  <!--End of template with side picture -->
 <br><br>
-<p>After the differential gene expression analysis is done the data was filtered twice, one time for the best adjusted P-value and subsequently for the highest (meaning the most significant) fold changes. Left were a couple of candidate genes which could be easily identified by their gene ID through various databases such as NCBI.</p>
-<h2>Result</h2>
-<p>The transcriptomics pipeline was tried out and validated using read files available from the internet. The files consisted of two datasets of E. Coli (triplicates) cultured in regular LB and a sugar solution respectively.</p><br><br>
+<h2 id="Results">Result</h2>
-                      </div>
+<h3>Validating our Transcriptomics Pipeline</h3>
-<!--Start of template with side picutre -->
+<p>The transcriptomics pipeline was tried out and validated using read files available from the internet. The files consisted of two datasets of <i>E. coli</i> (triplicates) cultured in regular LB and a sugar solution respectively.</p><br>
-                <div class="card-holder">
-                     <div class="content-card pic-next-to-text">
-                        <div class="side-text">
+                    <!---------------NEW TEMPLATE---------------->
-                            <!-- Here you put your paragraphs -->
-                             <p><b>Figure 3:</b> Results of the differential gene expression analysis using Deseq2 on test files. The genes (shown with their gene ID) as well as their mean base length and several statistical results can be seen.</p>
+       <div class="card-holder">
+                    <div class="content-card content-card-2">
+                        <div class="inner-card left-card">
+                             <br>
+                             <!--change src to that of the image you want-->
+                            <img class="content-card-img" src="https://static.igem.org/mediawiki/2018/a/a9/T--Uppsala--Transcriptomics-Bioinformatics2.png">
+                            <div class="inner-card-text">
+                                <!-- start of paragraph-->
+                                <a href="https://static.igem.org/mediawiki/2018/a/a9/T--Uppsala--Transcriptomics-Bioinformatics2.png"><p><b>Figure 3.</b> Results of the differential gene expression analysis using Deseq2 on test files. The genes (shown with their gene ID) as well as their mean base length and several statistical results can be seen.</p></a>
+                            </div>
+                            <!-- end of paragraph -->
                          </div>
+                        <div class="inner-card right-card">
+                            <br>
+                            <img class="content-card-img" src="https://static.igem.org/mediawiki/2018/8/81/T--Uppsala--Transcriptomics-Bioinformatics3.png">
+                            <div class="inner-card-text">
+                                <!-- start of paragraph -->
+                               <a href="https://static.igem.org/mediawiki/2018/8/81/T--Uppsala--Transcriptomics-Bioinformatics3.png">           <p><b>Figure 4.</b> Results of the differential gene expression after filtering for statistical significance and fold change.</p></a>
+                                <!-- End of paragraphs -->
+                            </div>
-                        <div class="side-img" style="background-color:darkolivegreen;">
-                           <!-- Here goes the big image to the right -->
-                           <img src="https://static.igem.org/mediawiki/2018/a/a9/T--Uppsala--Transcriptomics-Bioinformatics2.png">
                          </div>
                      </div>
+                </div>
                  <!--End of template with side picture -->
-<br><br>
+<p>The results after searching for the genes in the NCBI database showed that the most expressed gene from the sugar-cultured <i>E. coli</i> was shown to be involved in a type of sugar system, proving that the pipeline was indeed working.</p><br>
                  </div>
-<!--Start of template with side picutre -->
-                 <div class="card-holder">
+  <div class="card-holder">
-                     <div class="content-card pic-next-to-text">
+                     <div class="content-card content-card-2">
-                         <div class="side-text">
+                         <div class="inner-card left-card">
-                             <!-- Here you put your paragraphs -->
-                             <p><b>Figure 4:</b> Results of the differential gene expression after filtering for statistical significance and fold change.</p>
+                             <br>
+                             <!--change src to that of the image you want-->
+                           <img class="content-card-img" src="https://static.igem.org/mediawiki/2018/4/4c/T--Uppsala--Transcriptomics-Bioinformatics4.png">
+                             <div class="inner-card-text">
+                                <!-- start of paragraph-->
+                                <a href="https://static.igem.org/mediawiki/2018/4/4c/T--Uppsala--Transcriptomics-Bioinformatics4.png"><p><b>Figure 5.</b> Highly expressed gene produced from the pipeline matching a glucose specific gene.</p></a>
+                            </div>
+                            <!-- end of paragraph -->
+                        </div>
+                        <div class="inner-card right-card">
+                            <br>
+                       <img class="content-card-img" src="https://static.igem.org/mediawiki/2018/3/3c/T--Uppsala--Transcriptomics-Bioinformatics5.png">
+                            <div class="inner-card-text">
+                                <!-- start of paragraph -->
+                               <a href="https://static.igem.org/mediawiki/2018/3/3c/T--Uppsala--Transcriptomics-Bioinformatics5.png"> <p><b>Figure 6.</b> Results of the differential gene expression done on our own data.</p></a>
+                                <!-- End of paragraphs -->
+                            </div>
                          </div>
+                    </div>
+                </div>
+      <div class="card-holder">
-                        <div class="side-img" style="background-color:darkolivegreen;">
+                <br>
-                           <!-- Here goes the big image to the right -->
-                           <img src="https://static.igem.org/mediawiki/2018/8/81/T--Uppsala--Transcriptomics-Bioinformatics3.png">
-                        </div>
-                    </div>
+            <h3>Analyzing Our Own Sequencing Data</h3>
+            <p><b>Table 1.</b> The first few genes as a result of the differential gene expression analysis seen in Figure 6
-                 <!--End of template with side picture -->
+                 together with their promotor sequence and function in the organism.</p>
-<br><br>
+                 <!--Start of template with side picutre -->
-<p>The results after searching for the genes in the NCBI database showed that the most expressed gene from the sugar-cultured E. Coli was shown to be involved in a type of sugar system, proving that the pipeline was indeed working.</p><br><br>
-                </div>
-<!--Start of template with side picutre -->
-                <div class="card-holder">
-                    <div class="content-card pic-next-to-text">
-                        <div class="side-text">
                              <!-- Here you put your paragraphs -->
-                             <p><b>Figure 5:</b> Highly expressed gene produced from the pipeline matching a glucose specific gene.</p>
+                             <table class="pgrouptable tablesorter our-table" style="width: 100%;" cellspacing="0" cellpadding="0">
+                        <thead><tr>
+                    <th style= “width: auto”>Gene ID</th>
+                    <th style= “width: auto” >Gene name</th>
+                    <th style= “width: auto” >Promotor sequence</th>
+                    <th style= “width: auto” >Function</th>
+                    <th style= “width: auto” >Fold change</th>
+                    </tr></thead>
+                    <tbody><tr>
+                    <td>
+                    ER3413_45<br>
+                    ER3413_70<br>
+                    ER3413_87<br>
+                    ER3413_126<br>
+                    ER3413_173
+                    </td>
+                    <td >
+                    apaG<br>
+                    leuA<br>
+                    murG<br>
+                    panD<br>
+                    frr
+                    </td>
+                    <td>
+                    ggcaccatgcagggtcactacgaaatgatcgatgaaa<br>
+                    ttgacatccgtttttgtatccagtaactctaaaagc<br>
+                    <p>-</p><br>
+                    tagacactaaacaaaaatcgggcaatactgcgtga<br>
+                    ttacccgtaatatgtttaatcagggctatacttagcac
+                    </td>
+                    <td>
+                    protein associated with Co2+ and Mg2+ efflux<br>
+                    2-isopropylmalate synthase<br>
+                    N-acetylglucosaminyl transferase<br>
+                    putative inner membrane protein<br>
+                    inner membrane protein, UPF0118 family
+                    </td>
+                    <td>
+                    0.40<br>
+                    0.40<br>
+                    0.40<br>
+                    0.40<br>
+                    0.40
+                    </td>
+                    </tr><tr>
+                    </tr></tbody></table>
-                        </div>
-                        <div class="side-img" style="background-color:darkolivegreen;">
-                           <!-- Here goes the big image to the right -->
-                           <img src="https://static.igem.org/mediawiki/2018/4/4c/T--Uppsala--Transcriptomics-Bioinformatics4.png">
-                        </div>
-                    </div>
+<!-- End of Code For TABLE -->
-                <!--End of template with side picture -->
-<br><br>
-                    <p>The resuts from our runs unfortunately did not produce as good results as seen above. Due to the major issues with sequencing and actually generating enough data, it can be seen in figure 4 what kind of effect it had. Judging by the adjusted p-values it is clear that even though the genes can indeed be identified the statistical significance is extremely uncertain (the minimal accepted threshold is an adjusted p-value of &#62; 0.05). Any up-or down regulation of fold-change of interest was not able to be identified either. Looking at these errors it can be assumed that no major change in fold-change as well as low significancy is due to simply not enough data being generated from the prior sequencing step. Because of these facts no gene could be identified even as a candidate.</p>
+          <p>The resuts from our runs unfortunately did not produce as good results as seen above. Due to the major issues with sequencing and actually generating enough data, it can be seen in figure 6 what kind of effects it had. Judging by the adjusted p-values it is clear that even though the genes can indeed be identified as seen in Table 1 the statistical significance is extremely uncertain (the minimal accepted threshold is an adjusted p-value of &#60;&#61; 0.05). Any up-or down regulation of fold-change of interest was not able to be identified either. Looking at these errors it can be assumed that no major change in fold-change as well as low significancy is due to simply not enough data being generated from the prior sequencing step. Because of these facts no gene could be identified as a possible candidate for our reporter system.</p>
-                    <h1>References</h1>
-                    <p><b>[1]</b> Reference here</p>
-                    <p><b>[2]</b> Reference here</p>
                  </div>
+<div class="card-holder">
+<h2 id="References">References</h2>
+<p><b>[1]</b> Galaxyproject, 2018. Reference-based RNA-Seq data analysis <a href="https://galaxyproject.github.io/training-material/topics/transcriptomics/tutorials/ref-based/tutorial.html">Galaxyproject</a> Date of visit 2018-10-15</p>
+</div>
                  <!-- HERE ENDS THE PORTION WHERE YOU PUT IN YOUR CONTENT-->
                  <div style="height:5em;"></div>
              </div>
          </div>
-    </div>
+        </div>
-                    </body>
+    </body>
 </html>

Difference between revisions of "Team:Uppsala/Transcriptomics/Bioinformatics"