Modeling approach to identify regulatory elements

In model bacteria such as E. coli decades of detailed molecular studies have permitted to identify the main sequence motifs controlling gene expression [1-3] and how they might be fined-tuned or standardized [4-6]. For most bacteria, however, these patterns are unknown and this drastically limits perspective of bioengineering. For example, the potential of L. jensenii for biomedical applications has been realized only recently [7] and the bacterium remain poorly characterized.

As the scope of synthetic biology expand to less known organism, identifying and testing the main regulatory motifs involved in controlling transcription and translation has become a common problem. To the best of our knowledge, no simple tool have been developed to facilitate this task.

Here, we report a generic pipeline to identify the most common promoters, Shine-Dalgarno elements and terminators from a genomic sequence alone. We implemented the entire pipeline in a Python script, which we have made publicly available. We applied this tool to a reference genome sequence of L. jensenii strain JV-V16 (NCBI RefSeq : NZ_CP018809.1).

Our aim is to establish a list of putative regulatory elements from which to selected candidates for further experimental characterization. This represent a powerful approach to establish collections of standard regulatory elements without prior knowledge.

Identification of promoters and RBSs from genomic sequence alone

We sought to identify repeated sequence motifs located upstream of coding-sequences, hypothesizing that this approach would be sufficient to extract the signature of the most widespread promoter (sigma70) and that Shine-Dalgarno motifs. Encouragingly, we found that a similar approach had been successfully applied to the genome of Lactobacillus plantarum [8].

We extracted 100 nucleotides upstream of each coding sequences annotated in the reference genome (or up to the previous coding sequence, excluding sequences the were less than 25 nts). We wrote the resulting sequence into a multifasta file which we used as an input for the MEME web service. MEME (Multiple Em for Motif Elicitation) is a well-established tool for the discovery of biological sequence motifs with no prior expectation [9]. The algorithm is capable to automatically discover motifs that are repeated in the input sequences, which we expect for sigma70 promoter and an RBS in our extracted sequences.

We first verified our script by running the Lactobacillus plantarum genome studied previously [8]. We found very similar results. For example, Figure 1 provides a comparison of our data with the -10 promoter previously identified in Lactobacillus plantarum (Figure 1).

Figure 1: Comparison of the sequence LOGO for the -10 motif of sigma 70 promoter previously identified (up) and from our own analysis (down). Logo represent the information content at each position of the aligned sequences. For example, positions with 2 bits of information are conserved in all sequences. If half of the sequence contain one letter and the other half another letter, the information content is 1.The size of the letter is proportional to its conservation and letter are ranked with the maximum conservation on top.

We next ran the the 1251 input sequences extracted from the genome of L. jensenii. MEME identified 176 hits for a motif similar to a promoter (Figure 2). The -10 motif (TATAAT) is clearly identifiable and closely resemble that of L. plantarurm (see above), as well as that of E. coli [2] (Figure 2A). The -35 motif is less marked (with only two TT showing some signal). The spacer is AT-rich. We submitted the entire motif to the MAST (Motif Alignment & Search Tool) webservice. This algorithm is part of the MEME suite and reports the exact location of the motifs in the input sequences [9]. Parsing the MAST output allowed us to plot the distribution of the motifs with respect to the gene’s start codon (Figure 2B).

Figure 2: A: Sequence LOGO for the putative promoter motif of L. Jensenii (JV-V16) based on 176 hits identified by MEME. The typical -10 motif TATAAT is clearly identifiable. The -35 motifs is hard to identify. B: Localisation of the motif of the putative promoter from the start codon of the genes. We can clearly see the promoters are not always located just after the RBS (beginning in -15 on average) this may be due to the presence of natural spacers or regulatory elements on the pre-gene sequence.

MEME also identified 531 hits for a motif strongly resembling the SD motifs characteristic to other Lactobacil and, again, E. coli [1] (Figure 3A). Parsing of the MAST output revealed that most of the hits were located at the expected distance of -3 nucleotides with respect to the start codon of the gene (Figure 1B). Inconsistent positions might be due to erroneous annotation of the start.

Figure 3: A: Sequence LOGO for the putative SD motif of L. Jensenii (JV-V16) based on 531 hits identified by MEME. B: Localisation of the motif of the RBS from the start codon of the genes.

Modeling-driven sequence design

Ideally, the next step would have been to relate the sequence motifs we were able to identify with functional genomic data. Proteomics data are hard to generate and usually rare. Indeed, we could not find any such dataset for Lactobacillus jensenii or related strains. In contrast, transcriptomics data are easier to generate (RNA-seq). Unfortunately, we could not find any publicly available dataset either, and did not have funding to run an RNA-Seq experiment ourselves.
We did identify a metatranscriptomic study on the vaginal microflora, in which L. jensenii was amongst the dominant strains [10]. For lack of better data, we tried to investigate this but the sequencing reads were not publicly available and we were unable to reach the authors of the study. Other data provided in the articles were not useful for our purpose.
RNA-seq data would have allowed us to score our putative promoters and chose a subset spanning a range of measured strength for further experimental characterization in a standardized context. We also could have used this analysis to develop a model a promoter and use it to design new promoter. Likewise, RNA-seq data could have been used to better map terminator and score terminator efficiency.

In the absence of such data, we used the position weight matrices (PWM) of the identified motifs to score our genetic elements and selected them on this basis . We used MAST [9] map and score the motifs on our original input sequences, and extracted the sequence from 100bp to the start codon. We selected 4 sequences showing both promoter and RBS hits (best candidate), 3 showing only the promoter motif (medium candidate), 2 with only RBS pattern (low candidate) and 1 without any hits (null candidate) (Table 1). In addition, we created a synthetic sequence bearing the consensus sequences from the promoter and RBS motifs. We also chose the promoter of rpsU previously shown to be active in L. jensenii as positive control [11].

We sent these sequence for synthesis and cloned them into L. jensenii’s plasmid pLEM415 in such a way as to control expression of RFP. Cloning were successful. Transformations in L. jensenii was more problematic than expected (see Toolbox section). Unfortunately, we were not able to successfully transform these plasmid to measure the activity of these regulatory sequences.

Annotated scripts to perform these analysis on any genome sequence are available here.

[1] Shultzaberger, R. K., Bucheimer, R. E., Rudd, K. E., & Schneider, T. D. (2001). Anatomy of Escherichia coli ribosome binding sites1. Journal of molecular biology, 313(1), 215-228.
[2] Shultzaberger, R. K., Chen, Z., Lewis, K. A., & Schneider, T. D. (2006). Anatomy of Escherichia coli σ 70 promoters. Nucleic acids research, 35(3), 771-788.
[3] Carafa, Y. D. A., Brody, E., & Thermes, C. (1990). Prediction of rho-independent Escherichia coli transcription terminators: a statistical analysis of their RNA stem-loop structures. Journal of molecular biology, 216(4), 835-858.
[4] Mutalik, V. K., Guimaraes, J. C., Cambray, G., Mai, Q. A., Christoffersen, M. J., Martin, L., ... & Keasling, J. D. (2013). Quantitative estimation of activity and quality for collections of functional genetic elements. Nature methods, 10(4), 347.
[5] Mutalik, V. K., Guimaraes, J. C., Cambray, G., Lam, C., Christoffersen, M. J., Mai, Q. A., ... & Endy, D. (2013). Precise and reliable gene expression via standard transcription and translation initiation elements. Nature methods, 10(4), 354.
[6] Cambray, G., Guimaraes, J. C., Mutalik, V. K., Lam, C., Mai, Q. A., Thimmaiah, T., ... & Endy, D. (2016). Measurement and modeling of intrinsic transcription terminators. Nucleic acids research, 44(14), 7006.
[7] Marcobal, A., Liu, X., Zhang, W., Dimitrov, A. S., Jia, L., Lee, P. P., ... & Lagenaur, L. A. (2016). Expression of human immunodeficiency virus type 1 neutralizing antibody fragments using human vaginal Lactobacillus. AIDS research and human retroviruses, 32(10-11), 964-971.
[8] Todt, T. J., Wels, M., Bongers, R. S., Siezen, R. S., Van Hijum, S. A., & Kleerebezem, M. (2012). Genome-wide prediction and validation of sigma70 promoters in Lactobacillus plantarum WCFS1. PloS one, 7(9), e45097.
[9] Bailey, T. L., Johnson, J., Grant, C. E., & Noble, W. S. (2015). The MEME suite. Nucleic acids research, 43(W1), W39-W49.
[10] Macklaim, J. M., Fernandes, A. D., Di Bella, J. M., Hammond, J. A., Reid, G., & Gloor, G. B. (2013). Comparative meta-RNA-seq of the vaginal microbiota and differential expression by Lactobacillus iners in health and dysbiosis. Microbiome, 1(1), 12.
[11] Marcobal, A., Liu, X., Zhang, W., Dimitrov, A. S., Jia, L., Lee, P. P., ... & Lagenaur, L. A. (2016). Expression of human immunodeficiency virus type 1 neutralizing antibody fragments using human vaginal Lactobacillus. AIDS research and human retroviruses, 32(10-11), 964-971.