Modeling approach to identify regulatory elements
In model bacteria such as E. coli, decades of detailed molecular studies have identified the main sequence motifs controlling gene expression [1-3] and shown how they might be fine-tuned or standardized [4-6]. For most bacteria, however, these patterns are unknown, which drastically limits the prospects for bioengineering. For example, the potential of L. jensenii for biomedical applications has been realized only recently [7], and the bacterium remains poorly characterized.
As the scope of synthetic biology expands to less well-known organisms, identifying and testing the main regulatory motifs that control transcription and translation has become a common problem. To the best of our knowledge, no simple tool has been developed to facilitate this task.
Here, we report a generic pipeline to identify the most common promoters, Shine-Dalgarno elements and terminators from a genomic sequence alone. We implemented the entire pipeline in a Python script, which we have made publicly available. We applied this tool to a reference genome sequence of L. jensenii strain JV-V16 (NCBI RefSeq: NZ_CP018809.1).
Our aim is to establish a list of putative regulatory elements from which to select candidates for further experimental characterization. This represents a powerful approach to establishing collections of standard regulatory elements without prior knowledge.
Identification of promoters and RBSs from genomic sequence alone
We sought to identify repeated sequence motifs located upstream of coding sequences, hypothesizing that this approach would be sufficient to extract the signature of the most widespread promoter (sigma70) and that of Shine-Dalgarno motifs. Encouragingly, we found that a similar approach had been successfully applied to the genome of Lactobacillus plantarum [8].
We extracted 100 nucleotides upstream of each coding sequence annotated in the reference genome (or up to the previous coding sequence, excluding sequences shorter than 25 nt). We wrote the resulting sequences into a multi-FASTA file, which we used as input for the MEME web service. MEME (Multiple EM for Motif Elicitation) is a well-established tool for the discovery of biological sequence motifs with no prior expectation [9]. The algorithm can automatically discover motifs that are repeated in the input sequences, as we expect for the sigma70 promoter and the RBS in our extracted sequences.
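The extraction step described above could be sketched as follows. This is a minimal illustration, not the project's actual script: the `CDS` tuple, the function names and the toy genome are hypothetical, coordinates are assumed to be 0-based, the genome is treated as linear (no wrap-around at the origin), and the "previous coding sequence" is taken to be the nearest annotated neighbour on the upstream side regardless of its strand.

```python
from typing import Iterator, List, NamedTuple, Tuple


class CDS(NamedTuple):
    """Hypothetical minimal annotation record for one coding sequence."""
    name: str
    start: int   # 0-based, inclusive
    end: int     # 0-based, exclusive
    strand: int  # +1 (forward) or -1 (reverse)


_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.translate(_COMPLEMENT)[::-1]


def upstream_regions(genome: str, cds_list: List[CDS],
                     max_len: int = 100,
                     min_len: int = 25) -> Iterator[Tuple[str, str]]:
    """Yield (name, upstream sequence) read 5'->3' towards the start codon."""
    ordered = sorted(cds_list, key=lambda c: c.start)
    for i, cds in enumerate(ordered):
        if cds.strand == 1:
            # Upstream lies to the left; stop at the left neighbour's end.
            prev_end = ordered[i - 1].end if i > 0 else 0
            lo = max(cds.start - max_len, prev_end, 0)
            region = genome[lo:cds.start]
        else:
            # Upstream lies to the right; stop at the right neighbour's start.
            next_start = (ordered[i + 1].start
                          if i + 1 < len(ordered) else len(genome))
            hi = min(cds.end + max_len, next_start, len(genome))
            region = reverse_complement(genome[cds.end:hi])
        # Overlapping genes give an empty region, which is dropped here too.
        if len(region) >= min_len:
            yield cds.name, region


def to_fasta(records: List[Tuple[str, str]]) -> str:
    """Format (name, sequence) pairs as multi-FASTA text."""
    return "".join(f">{name}\n{seq}\n" for name, seq in records)


# Toy example: g2 sits only 10 nt after g1, so its region is discarded.
genome = "ACGT" * 100
annotations = [CDS("g1", 150, 200, 1),
               CDS("g2", 210, 260, 1),
               CDS("g3", 300, 350, -1)]
records = list(upstream_regions(genome, annotations))
print(to_fasta(records))
```

The resulting text can be saved to a file and submitted to MEME. Reverse-strand genes are reverse-complemented so that every record ends immediately before its start codon, which keeps motif positions comparable across strands.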
We first validated our script by running it on the Lactobacillus plantarum genome studied previously [8] and obtained very similar results. For example, Figure 1 compares our data with the -10 promoter motif previously identified in Lactobacillus plantarum.
We next ran the 1251 input sequences extracted from the genome of L. jensenii. MEME identified 176 hits for a motif similar to a promoter (Figure 2). The -10 motif (TATAAT) is clearly identifiable and closely resembles that of L. plantarum (see above), as well as that of E. coli [2] (Figure 2A). The -35 motif is less marked (with only two Ts showing some signal). The spacer is AT-rich. We submitted the entire motif to the MAST (Motif Alignment & Search Tool) web service. This algorithm is part of the MEME suite and reports the exact location of the motifs in the input sequences [9]. Parsing the MAST output allowed us to plot the distribution of the motifs with respect to the gene's start codon (Figure 2B).
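As an illustration of how hit locations translate into a distribution relative to the start codon, here is a minimal sketch. It assumes the MAST hits have already been parsed into `(hit_start, motif_width, region_length)` tuples; that tuple layout, the function names and the toy values are hypothetical, and parsing of the actual MAST output is not shown.

```python
from collections import Counter
from typing import Iterable, Tuple

Hit = Tuple[int, int, int]  # (0-based hit start, motif width, upstream region length)


def position_of_motif_end(hit_start: int, motif_width: int, region_len: int) -> int:
    """Position of the motif's last base relative to the start codon.

    The base immediately upstream of the start codon is position -1,
    so the result is always negative for a hit inside the region.
    """
    last_index = hit_start + motif_width - 1
    return last_index - region_len


def position_histogram(hits: Iterable[Hit]) -> Counter:
    """Count how many hits end at each position relative to the start codon."""
    return Counter(position_of_motif_end(*hit) for hit in hits)


# Toy data: three hits of a 6-nt motif in 100-nt upstream regions.
toy_hits = [(91, 6, 100), (91, 6, 100), (80, 6, 100)]
histogram = position_histogram(toy_hits)

# Crude text plot, most upstream position first.
for position in sorted(histogram):
    print(f"{position:>5} {'#' * histogram[position]}")
```

Because the extracted regions have different lengths (some are truncated by the neighbouring gene), positions are computed from the end of each region rather than from its beginning.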
MEME also identified 531 hits for a motif strongly resembling the SD motifs characteristic of other Lactobacilli and, again, E. coli [1] (Figure 3A). Parsing the MAST output revealed that most of the hits were located at the expected distance of -3 nucleotides with respect to the start codon of the gene (Figure 3B). Inconsistent positions might be due to erroneous annotation of the start codon.
Modeling-driven sequence design
Ideally, the next step would have been to relate the sequence motifs we identified to functional genomic data. Proteomics data are hard to generate and usually rare. Indeed, we could not find any such dataset for Lactobacillus jensenii or related strains. In contrast, transcriptomics data (RNA-seq) are easier to generate. Unfortunately, we could not find any publicly available dataset either, and did not have funding to run an RNA-seq experiment ourselves.
We did identify a metatranscriptomic study of the vaginal microflora in which L. jensenii was among the dominant strains [10]. For lack of better data, we tried to use this study, but the sequencing reads were not publicly available and we were unable to reach the authors. Other data provided in the article were not useful for our purpose.
RNA-seq data would have allowed us to score our putative promoters and choose a subset spanning a range of measured strengths for further experimental characterization in a standardized context. We could also have used this analysis to develop a promoter model and use it to design new promoters. Likewise, RNA-seq data could have been used to better map terminators and score terminator efficiency.
In the absence of such data, we used the position weight matrices (PWMs) of the identified motifs to score our genetic elements and selected them on this basis. We used MAST [9] to map and score the motifs on our original input sequences, and extracted the sequence from 100 bp upstream to the start codon. We selected 4 sequences showing both promoter and RBS hits (best candidates), 3 showing only the promoter motif (medium candidates), 2 with only the RBS pattern (low candidates) and 1 without any hits (null candidate) (Table 1). In addition, we created a synthetic sequence bearing the consensus sequences from the promoter and RBS motifs. We also chose the promoter of rpsU, previously shown to be active in L. jensenii, as a positive control [11].
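To make the PWM-based selection concrete, the sketch below scores a sequence against a motif using log-odds values and sorts it into the four candidate classes described above. It is an illustration under stated assumptions, not a reimplementation of MAST, which reports p-values rather than raw scores: the toy matrices, the uniform background, the pseudocount and the threshold of 8.0 are all hypothetical, and sequences are assumed to contain only uppercase A, C, G and T.

```python
import math
from typing import Dict, List, Optional, Tuple

BASES = "ACGT"
PWM = List[Dict[str, float]]  # one {base: probability} dict per motif position


def log_odds(pwm: PWM, background: float = 0.25,
             pseudocount: float = 0.01) -> PWM:
    """Convert per-position base probabilities into log2-odds scores."""
    total = 1.0 + 4 * pseudocount
    return [{b: math.log2((col.get(b, 0.0) + pseudocount) / total / background)
             for b in BASES}
            for col in pwm]


def best_hit(seq: str, lom: PWM) -> Tuple[Optional[float], Optional[int]]:
    """Return (score, start) of the best window, or (None, None) if too short."""
    width = len(lom)
    best_score, best_start = None, None
    for i in range(len(seq) - width + 1):
        score = sum(lom[j][seq[i + j]] for j in range(width))
        if best_score is None or score > best_score:
            best_score, best_start = score, i
    return best_score, best_start


def classify(seq: str, promoter_lom: PWM, rbs_lom: PWM,
             threshold: float = 8.0) -> str:
    """Assign a candidate class from the presence of promoter and RBS hits."""
    p_score, _ = best_hit(seq, promoter_lom)
    r_score, _ = best_hit(seq, rbs_lom)
    has_promoter = p_score is not None and p_score >= threshold
    has_rbs = r_score is not None and r_score >= threshold
    if has_promoter and has_rbs:
        return "best"
    if has_promoter:
        return "medium"
    if has_rbs:
        return "low"
    return "null"


def toy_pwm(consensus: str, p: float = 0.85) -> PWM:
    """Hypothetical PWM: probability p for the consensus base, rest shared."""
    rest = (1.0 - p) / 3
    return [{b: (p if b == c else rest) for b in BASES} for c in consensus]


promoter_lom = log_odds(toy_pwm("TATAAT"))  # toy -10 box
rbs_lom = log_odds(toy_pwm("AGGAGG"))       # toy Shine-Dalgarno motif

for s in ["CCCTATAATCCCCAGGAGGCCC", "CCCTATAATCCCCCCCCC",
          "CCCCAGGAGGCCCC", "CCCCCCCCCCCC"]:
    print(s, classify(s, promoter_lom, rbs_lom))
```

With real motifs, the probabilities would come from the MEME output rather than from `toy_pwm`, and the threshold would need to be chosen per motif, since the maximum attainable score depends on motif width and information content.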
We sent these sequences for synthesis and cloned them into the L. jensenii plasmid pLEM415 in such a way as to control expression of RFP. Cloning was successful. Transformation of L. jensenii was more problematic than expected (see Toolbox section). Unfortunately, we were not able to transform these plasmids successfully and therefore could not measure the activity of these regulatory sequences.
Annotated scripts to perform these analyses on any genome sequence are available here.
**************************************************************************************************
The RBS pattern of the bacterium, a Shine-Dalgarno sequence specific to L. jensenii:
The pattern of L. jensenii promoters:
We can clearly see the -10 pattern, TATAAT, which resembles the patterns of some other Lactobacilli.
Furthermore, we identified the -35 pattern by running MEME again on only the sequences that contained a promoter pattern:
We also found patterns resembling terminator tails:
With this information on what RBSs, promoters and terminator tails look like in L. jensenii, we then ran MAST (MEME Suite) to find these patterns in the sequences upstream of genes. We then selected some to test their promoter strength with an RFP signal: 4 with both RBS and promoter patterns, 3 with only a promoter pattern, 2 with only an RBS pattern, and one without any. In addition, we created an artificial sequence with a promoter and an RBS built from the most common letter at each position of the position weight matrix of each pattern. We also took a characterized promoter from L. jensenii as a positive control, and a sequence with only RFP as a negative control.
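Deriving a consensus from a position weight matrix amounts to taking the most frequent base in each column. The following minimal sketch shows the idea; the function name and the toy matrix are hypothetical and do not reproduce the motifs found in L. jensenii.

```python
from typing import Dict, List


def consensus(pwm: List[Dict[str, float]]) -> str:
    """Pick the most probable base at each motif position.

    Ties are resolved in favour of the first base in A, C, G, T order.
    """
    return "".join(max("ACGT", key=lambda b: col.get(b, 0.0)) for col in pwm)


# Toy matrix: one {base: probability} dict per motif position.
toy_matrix = [
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
    {"A": 0.8, "C": 0.0, "G": 0.1, "T": 0.1},
    {"A": 0.2, "C": 0.1, "G": 0.1, "T": 0.6},
    {"A": 0.6, "C": 0.1, "G": 0.1, "T": 0.2},
    {"A": 0.5, "C": 0.2, "G": 0.1, "T": 0.2},
    {"A": 0.1, "C": 0.1, "G": 0.0, "T": 0.8},
]
print(consensus(toy_matrix))
```

A consensus built this way ignores how strongly each position is conserved, so weakly conserved positions (such as those in the spacer) are filled with whichever base is marginally more frequent.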
We then cloned these sequences into the plasmid for L. jensenii (pLEM415) and prepared them for transformation into L. jensenii. Unfortunately, there were some issues with the transformation of L. jensenii, and we were not able to obtain results for this part of the project.
References | |
---|---|
[1] | Angela Marcobal, Xiaowen Liu, Wenlei Zhang, Antony S. Dimitrov, Letong Jia, Peter P. Lee, Timothy R. Fouts, Thomas P. Parks, and Laurel A. Lagenaur. (2016). Expression of Human Immunodeficiency Virus Type 1 Neutralizing Antibody Fragments Using Human Vaginal Lactobacillus. AIDS Research and Human Retroviruses, Volume 32, Number 10/11. |
[2] |