Difference between revisions of "Team:Montpellier/Model"

Line 20: Line 20:
 
<section>
 
<section>
  
<p>Bacteria such as <i>E. coli</i> are very well characterized, and we know the features and the typical nucleotide patterns that have to be modified in order to tune gene expression. However, <i>L. jensenii</i> is a little-known bacterium and researchers only recently started to engineer its genome for biomedical applications [1]. With this study, we propose a general pipeline to identify promoters found in the genome of <i>L. jensenii</i>, a first step towards the characterisation of this organism.</p>
+
<h2>Modeling approach to identify regulatory elements</h2><hr/>
  
<p>The aim of our modeling was to predict which natural pre-gene sequences from <i>L. jensenii</i> were most likely to mediate strong gene expression. To do so, we used bioinformatic tools to identify recognizable patterns in those pre-gene sequences, such as the Shine-Dalgarno sequence for initiation of translation and the -10 and -35 sequences for initiation of transcription. We then selected sequences having those patterns to test their gene expression strength with RFP as an output signal. We wanted to find natural sequences mediating different gene expression levels to build a toolbox of promoter sequences for <i>L. jensenii</i>.</p>
+
<p>In model bacteria such as <i>E. coli</i>, decades of detailed molecular studies have identified the main sequence motifs controlling gene expression [1-3] and shown how they can be fine-tuned or standardized [4-6]. For most bacteria, however, these patterns are unknown, which drastically limits the prospects for bioengineering. For example, the potential of <i>L. jensenii</i> for biomedical applications has been realized only recently [7], and the bacterium remains poorly characterized.</p>
  
<p>First, we needed to extract those pre-gene sequences from the full genome, which we downloaded from NCBI (<i>L. jensenii</i> JV-V16). For this we followed the protocol of an article that aimed to identify promoter sequences in <i>L. plantarum</i>: "Genome-wide prediction and validation of sigma70 promoters in <i>Lactobacillus plantarum</i> WCFS1". We developed a Python script to do this, selecting the region upstream of each gene sequence with a maximum length of 100 bp, a minimum length of 25 bp and no overlap.</p>
+
<p>As the scope of synthetic biology expands to less well-known organisms, identifying and testing the main regulatory motifs that control transcription and translation has become a common problem. To the best of our knowledge, no simple tool has been developed to facilitate this task.</p>
  
<p>Then we used "MEME Suite: tools for motif discovery and searching" to identify relevant patterns in those sequences. From the results we were able to identify different known patterns present in multiple pre-gene sequences (number of hits):</p>
+
<p>Here, we report a generic pipeline to identify the most common promoters, Shine-Dalgarno elements and terminators from a genomic sequence alone. We implemented the entire pipeline in a <a class="lien" href="https://github.com/iGEMMontpellier/Pre-genes-sequences-extraction">Python script</a>, which we have made publicly available. We applied this tool to a reference genome sequence of <i>L. jensenii</i> strain JV-V16 (NCBI RefSeq: <a class="lien" href="https://www.ncbi.nlm.nih.gov/nuccore/NZ_CP018809.1">NZ_CP018809.1</a>).</p>
 +
 
 +
<p>Our aim is to establish a list of putative regulatory elements from which to select candidates for further experimental characterization. This represents a powerful approach for establishing collections of standard regulatory elements without prior knowledge.</p>
 +
 
 +
<h2>Identification of promoters and RBSs from genomic sequence alone</h2><hr/>
 +
 
 +
<p>We sought to identify repeated sequence motifs located upstream of coding sequences, hypothesizing that this approach would be sufficient to extract the signature of the most widespread promoter (sigma70) and that of Shine-Dalgarno motifs. Encouragingly, we found that a similar approach had been successfully applied to the genome of <i>Lactobacillus plantarum</i> [8].</p>
 +
 
 +
<p>We extracted the 100 nucleotides upstream of each coding sequence annotated in the reference genome (or the region up to the previous coding sequence, excluding sequences that were shorter than 25 nt). We wrote the resulting sequences into a multi-FASTA file, which we used as input for the MEME web service. MEME (Multiple EM for Motif Elicitation) is a well-established tool for discovering biological sequence motifs with no prior expectation [9]. The algorithm automatically discovers motifs that are repeated in the input sequences, which is what we expect for the sigma70 promoter and the RBS in our extracted sequences.</p>
 +
 
 +
<p>We first verified our script by running it on the <i>Lactobacillus plantarum</i> genome studied previously [8]. We found very similar results. For example, Figure 1 compares our data with the -10 promoter motif previously identified in <i>Lactobacillus plantarum</i>.</p>
 +
 
 +
<figure>
 +
<img class="image_figure" src=""/><br/>
 +
<figcaption><span class="underline">Figure 1:</span> Comparison of the sequence logo for the -10 motif of the sigma70 promoter previously identified (top) and from our own analysis (bottom). The logo represents the information content at each position of the aligned sequences. For example, positions with 2 bits of information are conserved in all sequences. If half of the sequences contain one letter and the other half another letter, the information content is 1 bit. The size of each letter is proportional to its conservation, and letters are ranked with the most conserved on top.</figcaption>
 +
</figure>
 +
 
 +
<p>We next ran the 1251 input sequences extracted from the genome of <i>L. jensenii</i>. MEME identified 176 hits for a motif similar to a promoter (Figure 2). The -10 motif (TATAAT) is clearly identifiable and closely resembles that of <i>L. plantarum</i> (see above), as well as that of <i>E. coli</i> [2] (Figure 2A). The -35 motif is less marked (with only two Ts showing some signal). The spacer is AT-rich. We submitted the entire motif to the MAST (Motif Alignment &amp; Search Tool) web service. This algorithm is part of the MEME Suite and reports the exact location of the motifs in the input sequences [9]. Parsing the MAST output allowed us to plot the distribution of the motifs with respect to the gene’s start codon (Figure 2B).</p>
 +
 
 +
<figure>
 +
<img class="image_figure" src=""/><br/>
 +
<figcaption><span class="underline">Figure 2:</span> A: Sequence logo for the putative promoter motif of <i>L. jensenii</i> (JV-V16) based on 176 hits identified by MEME. The typical -10 motif TATAAT is clearly identifiable. The -35 motif is hard to identify. B: Location of the putative promoter motif relative to the start codon of the genes. The promoters are not always located immediately upstream of the RBS (beginning at -15 on average); this may be due to the presence of natural spacers or regulatory elements in the pre-gene sequence.</figcaption>
 +
</figure>
 +
 
 +
<p>MEME also identified 531 hits for a motif strongly resembling the SD motifs characteristic of other Lactobacilli and, again, <i>E. coli</i> [1] (Figure 3A). Parsing the MAST output revealed that most of the hits were located at the expected distance of -3 nucleotides with respect to the start codon of the gene (Figure 3B). Inconsistent positions might be due to erroneous annotation of the start codon.</p>
 +
 
 +
<figure>
 +
<img class="image_figure" src="https://static.igem.org/mediawiki/2018/b/b7/T--Montpellier--promoter_mtp.png"/><br/>
 +
<figcaption><span class="underline">Figure 3:</span> A: Sequence logo for the putative SD motif of <i>L. jensenii</i> (JV-V16) based on 531 hits identified by MEME. B: Location of the RBS motif relative to the start codon of the genes.</figcaption>
 +
</figure>
 +
 
 +
<h2>Modeling-driven sequence design</h2><hr/>
 +
 
 +
<p>Ideally, the next step would have been to relate the sequence motifs we identified to functional genomic data. Proteomics data are hard to generate and usually rare; indeed, we could not find any such dataset for <i>Lactobacillus jensenii</i> or related strains. Transcriptomics data (RNA-seq) are easier to generate. Unfortunately, we could not find any publicly available dataset either, and we did not have the funding to run an RNA-seq experiment ourselves.<br/>
 +
We did identify a metatranscriptomic study of the vaginal microflora in which <i>L. jensenii</i> was among the dominant strains [10]. For lack of better data, we tried to use it, but the sequencing reads were not publicly available and we were unable to reach the authors of the study. The other data provided in the article were not useful for our purpose.<br/>
 +
RNA-seq data would have allowed us to score our putative promoters and choose a subset spanning a range of measured strengths for further experimental characterization in a standardized context. We could also have used this analysis to develop a model of a promoter and use it to design new promoters. Likewise, RNA-seq data could have been used to better map terminators and score their efficiency.</p>
 +
 
 +
<p>In the absence of such data, we used the position weight matrices (PWMs) of the identified motifs to score our genetic elements and selected them on this basis. We used MAST [9] to map and score the motifs on our original input sequences, and extracted the sequence from 100 bp upstream to the start codon. We selected 4 sequences showing both promoter and RBS hits (best candidates), 3 showing only the promoter motif (medium candidates), 2 showing only the RBS motif (low candidates) and 1 without any hit (null candidate) (Table 1). In addition, we created a synthetic sequence bearing the consensus sequences of the promoter and RBS motifs. We also chose the rpsU promoter, previously shown to be active in <i>L. jensenii</i>, as a positive control [11].</p>
 +
 
 +
 
 +
<p>We sent these sequences for synthesis and cloned them into the <i>L. jensenii</i> plasmid pLEM415 so that they control the expression of RFP. Cloning was successful. Transformation into <i>L. jensenii</i> was more problematic than expected (see Toolbox section). Unfortunately, we were not able to transform these plasmids and therefore could not measure the activity of these regulatory sequences.</p>
 +
 
 +
<p>Annotated scripts to perform these analyses on any genome sequence are available here.</p>
 +
<p>**************************************************************************************************</p>
  
 
<p>The RBS pattern of the bacteria, a Shine Dalgarno sequence specific of <i>L. jensenii</i>:
 
<p>The RBS pattern of the bacteria, a Shine Dalgarno sequence specific of <i>L. jensenii</i>:

Revision as of 22:22, 17 October 2018

Modeling

Modeling approach to identify regulatory elements


In model bacteria such as E. coli, decades of detailed molecular studies have identified the main sequence motifs controlling gene expression [1-3] and shown how they can be fine-tuned or standardized [4-6]. For most bacteria, however, these patterns are unknown, which drastically limits the prospects for bioengineering. For example, the potential of L. jensenii for biomedical applications has been realized only recently [7], and the bacterium remains poorly characterized.

As the scope of synthetic biology expands to less well-known organisms, identifying and testing the main regulatory motifs that control transcription and translation has become a common problem. To the best of our knowledge, no simple tool has been developed to facilitate this task.

Here, we report a generic pipeline to identify the most common promoters, Shine-Dalgarno elements and terminators from a genomic sequence alone. We implemented the entire pipeline in a Python script, which we have made publicly available. We applied this tool to a reference genome sequence of L. jensenii strain JV-V16 (NCBI RefSeq: NZ_CP018809.1).

Our aim is to establish a list of putative regulatory elements from which to select candidates for further experimental characterization. This represents a powerful approach for establishing collections of standard regulatory elements without prior knowledge.

Identification of promoters and RBSs from genomic sequence alone


We sought to identify repeated sequence motifs located upstream of coding sequences, hypothesizing that this approach would be sufficient to extract the signature of the most widespread promoter (sigma70) and that of Shine-Dalgarno motifs. Encouragingly, we found that a similar approach had been successfully applied to the genome of Lactobacillus plantarum [8].

We extracted the 100 nucleotides upstream of each coding sequence annotated in the reference genome (or the region up to the previous coding sequence, excluding sequences that were shorter than 25 nt). We wrote the resulting sequences into a multi-FASTA file, which we used as input for the MEME web service. MEME (Multiple EM for Motif Elicitation) is a well-established tool for discovering biological sequence motifs with no prior expectation [9]. The algorithm automatically discovers motifs that are repeated in the input sequences, which is what we expect for the sigma70 promoter and the RBS in our extracted sequences.
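The extraction step described above can be sketched roughly as follows. This is a minimal illustration, not the team's actual script: the gene tuple layout, the function names and the 0-based, end-exclusive coordinates are all assumptions made for the example, and genome circularity is ignored.

```python
from typing import List, Tuple

# Hypothetical gene record: (name, start, end, strand), 0-based, end-exclusive.
Gene = Tuple[str, int, int, str]

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def upstream_regions(genome: str, genes: List[Gene],
                     max_len: int = 100, min_len: int = 25) -> List[Tuple[str, str]]:
    """Collect up to max_len nt upstream of each gene, stopping at the
    neighbouring coding sequence and dropping regions shorter than min_len."""
    regions = []
    for name, start, end, strand in sorted(genes, key=lambda g: g[1]):
        others = [g for g in genes if g[0] != name]
        if strand == "+":
            # Closest coding sequence ending before (or overlapping) this start.
            barrier = max((min(g[2], start) for g in others if g[1] < start), default=0)
            seq = genome[max(start - max_len, barrier):start]
        else:
            # On the reverse strand, "upstream" lies after `end` on the forward strand.
            barrier = min((max(g[1], end) for g in others if g[2] > end),
                          default=len(genome))
            seq = reverse_complement(genome[end:min(end + max_len, barrier)])
        if len(seq) >= min_len:
            regions.append((name, seq))
    return regions


def to_multifasta(regions: List[Tuple[str, str]]) -> str:
    """Format (name, sequence) pairs as multi-FASTA text."""
    return "".join(f">{name}\n{seq}\n" for name, seq in regions)
```

Each region is oriented so that it ends immediately before the start codon, which keeps motif positions comparable between genes on both strands.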

We first verified our script by running it on the Lactobacillus plantarum genome studied previously [8]. We found very similar results. For example, Figure 1 compares our data with the -10 promoter motif previously identified in Lactobacillus plantarum.


Figure 1: Comparison of the sequence logo for the -10 motif of the sigma70 promoter previously identified (top) and from our own analysis (bottom). The logo represents the information content at each position of the aligned sequences. For example, positions with 2 bits of information are conserved in all sequences. If half of the sequences contain one letter and the other half another letter, the information content is 1 bit. The size of each letter is proportional to its conservation, and letters are ranked with the most conserved on top.
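The information content quoted in the caption can be reproduced with a short calculation. The sketch below assumes the simplest definition for DNA, 2 bits minus the Shannon entropy of the column, and leaves out the small-sample correction that logo software may apply; the function names are illustrative only.

```python
import math
from collections import Counter
from typing import List


def information_content(column: str) -> float:
    """Information content (bits) of one alignment column of DNA letters."""
    n = len(column)
    entropy = -sum((c / n) * math.log2(c / n) for c in Counter(column).values())
    return 2.0 - entropy


def logo_bits(aligned: List[str]) -> List[float]:
    """Information content at every position of equal-length aligned sequences."""
    return [information_content("".join(col)) for col in zip(*aligned)]
```

A fully conserved column such as "AAAA" gives 2 bits, a column split evenly between two letters such as "AATT" gives 1 bit, and a column with all four letters in equal proportion gives 0 bits.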

We next ran the 1251 input sequences extracted from the genome of L. jensenii. MEME identified 176 hits for a motif similar to a promoter (Figure 2). The -10 motif (TATAAT) is clearly identifiable and closely resembles that of L. plantarum (see above), as well as that of E. coli [2] (Figure 2A). The -35 motif is less marked (with only two Ts showing some signal). The spacer is AT-rich. We submitted the entire motif to the MAST (Motif Alignment & Search Tool) web service. This algorithm is part of the MEME Suite and reports the exact location of the motifs in the input sequences [9]. Parsing the MAST output allowed us to plot the distribution of the motifs with respect to the gene’s start codon (Figure 2B).
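Once the hit locations are available, converting them to positions relative to the start codon is a small step. The sketch below does not parse MAST output; it assumes the hits have already been reduced to hypothetical (sequence name, 0-based start) pairs and that each upstream region ends immediately before the start codon.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def relative_position(hit_start: int, region_length: int) -> int:
    """Position of a hit relative to the start codon; the last nucleotide
    of the upstream region is at -1."""
    return hit_start - region_length


def position_histogram(hits: Iterable[Tuple[str, int]],
                       region_lengths: Dict[str, int]) -> Counter:
    """Count how many hits begin at each relative position."""
    return Counter(relative_position(start, region_lengths[name])
                   for name, start in hits)
```

The resulting counter can be passed to any plotting library to draw a distribution such as the one in Figure 2B.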


Figure 2: A: Sequence logo for the putative promoter motif of L. jensenii (JV-V16) based on 176 hits identified by MEME. The typical -10 motif TATAAT is clearly identifiable. The -35 motif is hard to identify. B: Location of the putative promoter motif relative to the start codon of the genes. The promoters are not always located immediately upstream of the RBS (beginning at -15 on average); this may be due to the presence of natural spacers or regulatory elements in the pre-gene sequence.

MEME also identified 531 hits for a motif strongly resembling the SD motifs characteristic of other Lactobacilli and, again, E. coli [1] (Figure 3A). Parsing the MAST output revealed that most of the hits were located at the expected distance of -3 nucleotides with respect to the start codon of the gene (Figure 3B). Inconsistent positions might be due to erroneous annotation of the start codon.


Figure 3: A: Sequence logo for the putative SD motif of L. jensenii (JV-V16) based on 531 hits identified by MEME. B: Location of the RBS motif relative to the start codon of the genes.

Modeling-driven sequence design


Ideally, the next step would have been to relate the sequence motifs we identified to functional genomic data. Proteomics data are hard to generate and usually rare; indeed, we could not find any such dataset for Lactobacillus jensenii or related strains. Transcriptomics data (RNA-seq) are easier to generate. Unfortunately, we could not find any publicly available dataset either, and we did not have the funding to run an RNA-seq experiment ourselves.
We did identify a metatranscriptomic study of the vaginal microflora in which L. jensenii was among the dominant strains [10]. For lack of better data, we tried to use it, but the sequencing reads were not publicly available and we were unable to reach the authors of the study. The other data provided in the article were not useful for our purpose.
RNA-seq data would have allowed us to score our putative promoters and choose a subset spanning a range of measured strengths for further experimental characterization in a standardized context. We could also have used this analysis to develop a model of a promoter and use it to design new promoters. Likewise, RNA-seq data could have been used to better map terminators and score their efficiency.

In the absence of such data, we used the position weight matrices (PWMs) of the identified motifs to score our genetic elements and selected them on this basis. We used MAST [9] to map and score the motifs on our original input sequences, and extracted the sequence from 100 bp upstream to the start codon. We selected 4 sequences showing both promoter and RBS hits (best candidates), 3 showing only the promoter motif (medium candidates), 2 showing only the RBS motif (low candidates) and 1 without any hit (null candidate) (Table 1). In addition, we created a synthetic sequence bearing the consensus sequences of the promoter and RBS motifs. We also chose the rpsU promoter, previously shown to be active in L. jensenii, as a positive control [11].
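To illustrate what scoring a sequence with a PWM means, here is a minimal sketch. It is not MAST's scoring scheme, which also reports statistical significance; it uses a plain log-odds score against a uniform background, and the pseudocount, the matrix layout (one dictionary of base probabilities per position) and all names are assumptions made for the example.

```python
import math
from typing import Dict, List, Optional, Tuple

PWM = List[Dict[str, float]]  # one {base: probability} dictionary per motif position


def log_odds_score(window: str, pwm: PWM,
                   background: float = 0.25, pseudocount: float = 0.01) -> float:
    """Sum of log2(p / background) over the positions of one window."""
    return sum(math.log2((column.get(base, 0.0) + pseudocount) / background)
               for base, column in zip(window, pwm))


def best_hit(sequence: str, pwm: PWM) -> Optional[Tuple[float, int]]:
    """Best-scoring window as (score, 0-based start), or None if the sequence is too short."""
    width = len(pwm)
    if len(sequence) < width:
        return None
    return max((log_odds_score(sequence[i:i + width], pwm), i)
               for i in range(len(sequence) - width + 1))


def candidate_tier(has_promoter: bool, has_rbs: bool) -> str:
    """Map motif hits to the four candidate classes used in the text."""
    if has_promoter and has_rbs:
        return "best"
    if has_promoter:
        return "medium"
    if has_rbs:
        return "low"
    return "null"
```

In practice a score threshold would decide whether a sequence counts as having a promoter or RBS hit; that threshold is not stated in the text and is left to the caller here.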

We sent these sequences for synthesis and cloned them into the L. jensenii plasmid pLEM415 so that they control the expression of RFP. Cloning was successful. Transformation into L. jensenii was more problematic than expected (see Toolbox section). Unfortunately, we were not able to transform these plasmids and therefore could not measure the activity of these regulatory sequences.

Annotated scripts to perform these analyses on any genome sequence are available here.

**************************************************************************************************

The RBS pattern of the bacterium, a Shine-Dalgarno sequence specific to L. jensenii:


Figure 1: JV-V16 Shine-Dalgarno (RBS) motif, based on 531 MEME hits.

The pattern of L. jensenii promoters:


Figure 2: JV-V16 promoter motif, based on 176 hits.

We can clearly see the -10 pattern TATAAT, which resembles that of some other Lactobacilli.

Furthermore, we identified the -35 pattern by running MEME again on only the sequences that had a promoter pattern:


Figure 3: JV-V16 -35 motif, based on 42 hits.

We also found a pattern resembling terminator tails:


Figure 4: Motif of the putative terminator tails, based on 222 hits.

With this information on what RBSs, promoters and terminator tails look like in L. jensenii, we then ran MAST (MEME Suite) to find those patterns in our pre-gene sequences. We then selected some of them to test their promoter strength with an RFP signal: 4 with both RBS and promoter patterns, 3 with only the promoter pattern, 2 with only the RBS pattern, and 1 without either. In addition, we created an artificial sequence containing a promoter and an RBS built from the most common letter at each position of the position weight matrix of each pattern. We also took a characterized promoter from L. jensenii as a positive control, and a sequence with only RFP as a negative control.
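Taking the most common letter at each position of a position weight matrix amounts to the following. The matrix layout (one dictionary of base probabilities per position) and the toy values in the usage example are hypothetical and are not the matrices obtained from MEME.

```python
from typing import Dict, List


def consensus(pwm: List[Dict[str, float]]) -> str:
    """Most probable base at each position; on a tie, the base listed first wins."""
    return "".join(max(column, key=column.get) for column in pwm)


# Usage with a made-up three-position matrix:
toy_pwm = [
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
    {"A": 0.8, "C": 0.0, "G": 0.1, "T": 0.1},
    {"A": 0.2, "C": 0.1, "G": 0.1, "T": 0.6},
]
toy_consensus = consensus(toy_pwm)
```

A consensus built this way ignores how strongly each position is conserved, so weakly conserved positions are filled in just as confidently as strongly conserved ones.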

We then cloned those sequences into the plasmid for L. jensenii (pLEM415) and prepared them for transformation into L. jensenii. Unfortunately, there were some issues with the transformation of L. jensenii and we were not able to obtain results for this part of the project.

References
[1] Angela Marcobal, Xiaowen Liu, Wenlei Zhang, Antony S. Dimitrov, Letong Jia, Peter P. Lee, Timothy R. Fouts, Thomas P. Parks, and Laurel A. Lagenaur (2016). Expression of Human Immunodeficiency Virus Type 1 Neutralizing Antibody Fragments Using Human Vaginal Lactobacillus. AIDS Research and Human Retroviruses, Volume 32, Number 10/11.
[2]