From a systems biology point of view, biological systems are highly complex interactomes. The rational design of individual basic parts, such as the choice of promoter, is crucial for predictable modulation of gene expression, especially in prokaryotes. The behaviour of a prokaryotic promoter is determined by multiple elements, mainly the promoter core, which consists of the -35 box, the -10 box, and the transcription start site (+1). The interplay between the promoter core and operators or genetic circuits is complex. Although the iGEM database already holds a large number of promoter parts, we still need principled predictive approaches to measure and characterize the regulatory abilities of prokaryotic promoter parts that are difficult to test experimentally. The aim of optimizing an identified promoter core region is to eliminate unwanted non-functional promoter sequence. Our team proposes a new computational tool that predicts promoter intensity in E. coli and lets us quickly identify the sequence sites or units that play a key role in determining promoter strength. We hope our method will support the bottom-up design of genetic circuits and speed up our understanding of the control networks that precisely regulate gene expression. We believe it can serve as an efficient tool for systems biology and synthetic biology.
Prediction of promoter intensity is a long-standing question. Intensive efforts have gone into interpreting and dissecting the modes of interplay in order to develop a rational, computational basis for promoter engineering. Weller and Recknagel used occurrence frequencies of consensus patterns with an empirical regression model to predict promoter intensity. Ashok Palaniappan, Ramit B and Keshav Aditya RP (2017 iGEM team from Sri Venkateswara College of Engineering, Anna University) developed a website that uses sequence information from the -35 hexamer and the -10 hexamer to predict sigma 70 promoter activity. Although these methods can predict promoter strength, their predictive performance is still unsatisfactory. In addition, they cannot measure how much the individual sites or units embedded in a promoter sequence contribute to its activity.
We use sequence information to predict promoter activity. Our training set includes both original E. coli promoters and recomposed promoters derived from E. coli promoters infected by lambda bacteriophage. For each promoter we select a 68 bp DNA fragment spanning positions −49 to +19 relative to the transcription start site. This fragment includes the RNA polymerase binding site (Sextama box, −35 region), the double-helix unwinding site (Pribnow box, −10 region), and other linking and auxiliary sequence fragments. We combined a new feature encoding method with a machine learning algorithm to predict promoter strength and to find the consensus units or specific sites that may correlate directly with promoter intensity.
Feature encoding and XGBoost training
As in protein-protein interaction (PPI) prediction, one of the main computational challenges is to find a suitable way to fully describe the information in a sequence. Following the feature encoding used in PPI prediction, we used a similar descriptor, the conjoint triad (three bases as one unit), and calculated the occurrence frequency of each unit type, which projects each promoter sequence into a homogeneous vector space.
In addition, we used position-based information to describe each site. We map each base of the sequence (“ATCG”) to a number (“1234”) to obtain a second feature vector, and concatenate the two feature vectors into a 132-dimensional feature vector (64 triad frequencies plus 68 site codes) to predict promoter activity. The encoding process is described below:
Fig. 1. Flowchart of feature encoding. We calculate the unit frequencies and encode each sequence site in numeric format, then concatenate the two vectors into a 132-dimensional feature vector to predict promoter activity.
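The encoding can be sketched in a few lines of Python. This is a minimal illustration rather than our exact implementation: the function name `encode_promoter`, the ordering of the 64 triads, the use of overlapping windows, and the use of raw counts instead of normalized frequencies are all assumptions made for the example.

```python
from itertools import product

BASES = "ATCG"  # order taken from the text: "ATCG" -> "1234"
BASE_CODE = {base: i + 1 for i, base in enumerate(BASES)}
# All 4^3 = 64 three-base units; this fixed ordering is an assumption.
TRIADS = ["".join(p) for p in product(BASES, repeat=3)]
SEQ_LEN = 68  # fragment from -49 to +19


def encode_promoter(seq):
    """Return a 132-dimensional feature vector for one 68 bp promoter."""
    seq = seq.upper()
    if len(seq) != SEQ_LEN or set(seq) - set(BASES):
        raise ValueError("expected a 68 bp sequence over A, T, C, G")
    # Part 1: occurrence count of each triad over overlapping windows
    # (overlap and raw counts are assumptions of this sketch).
    counts = dict.fromkeys(TRIADS, 0)
    for i in range(len(seq) - 2):
        counts[seq[i:i + 3]] += 1
    triad_part = [counts[t] for t in TRIADS]
    # Part 2: numeric code of the base at each of the 68 positions.
    site_part = [BASE_CODE[b] for b in seq]
    # Concatenate: 64 triad features + 68 site features = 132.
    return triad_part + site_part
```

Calling `encode_promoter` on any valid 68 bp sequence returns a list of length 132, whose first 64 entries sum to 66, the number of overlapping three-base windows in a 68 bp fragment.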
We used the machine learning framework XGBoost to train our promoter intensity prediction model. In the training step, we ran a grid search over a limited parameter range to reduce overfitting, and evaluated the training set by leave-one-out cross-validation. The parameters were selected using a prediction accuracy measure based on the mean squared error (MSE) and R^2.
Here N is the number of leave-one-out cross-validation rounds, and the measure compares the predicted value with the raw value of promoter strength in each round.
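The selection procedure can be sketched as follows. The sketch uses the standard textbook definitions of MSE and R^2 and ranks parameter settings by leave-one-out MSE alone, which is an assumption: the exact accuracy measure we used may combine the two quantities differently. The function names, and the `make_fit_predict` factory that stands in for training an XGBoost regressor, are hypothetical.

```python
def mse(y_true, y_pred):
    """Mean squared error over N held-out predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def loocv_predictions(X, y, fit_predict):
    """Hold out each sample once, train on the rest, predict the held-out one.

    fit_predict(X_train, y_train, x_test) is a placeholder for any model.
    """
    preds = []
    for i in range(len(y)):
        X_train = X[:i] + X[i + 1:]
        y_train = y[:i] + y[i + 1:]
        preds.append(fit_predict(X_train, y_train, X[i]))
    return preds


def grid_search(X, y, make_fit_predict, grid):
    """Return (best_mse, best_params) over a list of parameter dicts."""
    best = None
    for params in grid:
        preds = loocv_predictions(X, y, make_fit_predict(params))
        score = mse(y, preds)
        if best is None or score < best[0]:
            best = (score, params)
    return best
```

With N promoters in the training set, each parameter setting requires N model fits, which is why restricting the grid to a limited range keeps the search tractable.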
In our results, the predicted promoter intensity is highly correlated with the measured promoter activity.
Fig. 2. Prediction performance on the test dataset.
One advantage of XGBoost and other boosting methods (for instance, GBDT) is that they can measure the importance of each feature for the prediction task: once the boosted trees have been built, an importance score for each attribute can be obtained directly. In general, the importance score measures how valuable a feature is in constructing the decision trees of the model. The more often a feature is used to build decision trees, the more important it is.
Finally, the scores of an attribute in all the boosted trees are weighted, summed, and then averaged to obtain its importance score. We show the 15 most important features with their scores.
Fig. 3. Importance of the top 15 features.
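The idea behind the importance score can be illustrated with a toy calculation. The sketch below assumes a simplified, hypothetical representation in which each boosted tree is a list of `(feature_name, gain)` splits. It is not XGBoost's internal format. It computes two common scores: how often a feature is used to split, and the average gain of those splits. XGBoost itself reports comparable quantities through `Booster.get_score` with `importance_type="weight"` or `"gain"`.

```python
from collections import defaultdict


def feature_importance(trees):
    """Map each feature to (split_count, mean_gain) across all trees.

    trees: list of trees, each a list of (feature_name, gain) splits
    (a simplified stand-in for a trained boosting model).
    """
    count = defaultdict(int)
    total_gain = defaultdict(float)
    for tree in trees:
        for feature, gain in tree:
            count[feature] += 1
            total_gain[feature] += gain
    return {f: (count[f], total_gain[f] / count[f]) for f in count}


def top_features(trees, k=15, by="count"):
    """Return the k highest-scoring features as (name, score) pairs."""
    scores = feature_importance(trees)
    idx = 0 if by == "count" else 1
    ranked = sorted(scores.items(), key=lambda kv: kv[1][idx], reverse=True)
    return [(name, s[idx]) for name, s in ranked[:k]]
```

For example, with three toy trees in which a feature named `"CCG"` appears in every tree, `top_features(trees, k=1)` ranks `"CCG"` first by split count. The gains in such a toy input are made-up numbers for illustration, not values from our model.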
As shown in Fig. 3, the frequency of the CCG unit plays the most important role in predicting promoter strength. Moreover, the AAA, ACT, and AAT units also play important roles in regulating promoter activity, which agrees with the TATAAT consensus of the Pribnow box (-10 region). Among the individual sequence sites, position 51 (+2), which lies next to the transcription start site, is the most important. Position 36 (-14) is near the Pribnow box, and positions 4 (-46) and 7 (-43) are near the Sextama box (around the -35 region).
Using our model, we predicted the activities of a wide range of E. coli promoters from the E. coli promoter database ‘PromEC’, which includes 471 promoter sequences (see predict_promEC.txt). From these predictions, we randomly selected three promoters, rplJ (predicted value: 1.723286152), dapA (predicted value: 1.235077024), and caiF (predicted value: 0.68780911), and fused each of them with green fluorescent protein (GFP) to test promoter strength and verify our prediction.
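The position numbers above are 1-based indices into the 68 bp fragment that runs from -49 to +19. Because there is no position 0 (the transcription start site is +1), the conversion to promoter coordinates has a one-step jump at the start site. A small helper makes the mapping explicit. The function names are hypothetical and only illustrate the arithmetic.

```python
FRAGMENT_LEN = 68   # positions -49 .. -1 (49 sites) and +1 .. +19 (19 sites)
UPSTREAM_LEN = 49   # number of sites upstream of the transcription start site


def index_to_coordinate(pos):
    """Convert a 1-based fragment index to a coordinate relative to the TSS."""
    if not 1 <= pos <= FRAGMENT_LEN:
        raise ValueError("position must be between 1 and 68")
    if pos <= UPSTREAM_LEN:
        return pos - (UPSTREAM_LEN + 1)   # 1 -> -49, 49 -> -1
    return pos - UPSTREAM_LEN             # 50 -> +1, 68 -> +19


def format_coordinate(pos):
    """Render the coordinate with an explicit sign, e.g. '+2' or '-14'."""
    return f"{index_to_coordinate(pos):+d}"
```

Applied to the positions discussed above, `format_coordinate(51)` gives `'+2'`, `format_coordinate(36)` gives `'-14'`, and positions 4 and 7 give `'-46'` and `'-43'`.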
Materials and protocol
1. Clone the eGFP gene (720 bp) and purify it with the D205 StarPrep Gel Extraction Kit.
2. Digest rplJ-pET28(a), dapA-pET28(a), and caiF-pET28(a) (three new vectors that we created with prplJ, based on the pET28(a) vector) with the DNA endonuclease BamH I (Thermo).
3. Ligate the fragments by Gibson assembly.
1) Culture the positive clones in LB medium for 12 hours.
2) Adjust the OD600 of each culture to 2.1.
3) Add 1 mL of each culture to 30 mL of fresh LB medium.
4) Culture for another 12 hours.
Fluorescence intensity measurement
1) Pipette 15 microliters of bacterial culture onto a slide.
2) Examine the slides under an inverted fluorescence microscope and capture fluorescence images.
3) Analyze the fluorescence images with ImageJ to obtain a plot of the fluorescence intensity.
Fig. 4. GFP test results, showing that the order of promoter activity is the same as in our prediction.
Discussion and future work
Our method offers a promising way to define the minimal functional promoter and to identify the most important regions in a promoter sequence. In the future, we want to truncate or introduce point mutations into the sequences of interest and compare the influence of different mutation types on promoter activity, to further verify our prediction model.
Our prediction results are attached and available for download.
Zong Y, Zhang H M, Cheng L, et al. Insulated transcriptional elements enable precise design of genetic circuits[J]. Nature Communications, 2017, 8(1):52.
Weller K, Recknagel R D. Promoter strength prediction based on occurrence frequencies of consensus patterns[J]. Journal of Theoretical Biology, 1994, 171(4):355-359.
Liang G, Li Z. Scores of generalized base properties for quantitative sequence-activity modelings for E. coli promoters based on support vector machine[J]. Journal of Molecular Graphics & Modelling, 2007, 26(1):269-281.
Friedman J H. Greedy function approximation: A gradient boosting machine[J]. Annals of Statistics, 2001, 29(5):1189-1232.