Zhaowenxue (Talk | contribs)
Background
We have built a library of native promoters of various strengths in Pf-5. We characterized promoter strength with a fluorescent reporter gene, <latin>firefly luciferase</latin>. Having collected these data, we asked which elements determine promoter strength. It is well known that strong promoters usually have high affinity for RNA polymerase, which is related to elements such as the UP element and the −35 and −10 sequences. We therefore wondered how a promoter's sequence relates to its strength, and used the resulting fluorescence data to build a model linking the two.
There are four reasons for our modeling:
1. To facilitate promoter engineering of <latin>Pseudomonas fluorescens</latin> at new sites;
2. To reveal the correlation between promoter sequence weights and promoter strength;
3. To improve the parameter settings of position-weight-based promoter strength prediction in species other than <latin>E. coli</latin>;
4. To provide a reference for promoter strength prediction modeling in other microbes.
Overview
<p>We set up a model to facilitate work in the new chassis bacterium <latin>Pseudomonas fluorescens</latin> Pf-5, such as building a promoter library. In our model, we encode each three-base sequence as a quaternary number (A, T, C, G represent the quaternary digits 0, 1, 2, 3), and then obtain a 64-dimensional vector, from AAA to GGG, that counts the frequency of each triplet in the DNA sequence.</p>
<p>We preliminarily assume a simple linear relationship between the 64 components of this vector and the measured strength, and we first test this hypothesis by solving a sparse system of equations and comparing calculated values with measured values. Subsequently, we strengthened the model with a neural network, which is more stable and has more parameters, and obtained an algorithm to predict promoter strength.</p>
<p>We tested the strength of 25 promoters with firefly luciferase; the promoter sequences were taken from the Pf-5 genome in NCBI. House-keeping genes often express strongly and consistently, so we took the 70 base pairs upstream of several such genes as strong promoters in <latin>Pf-5</latin>. As for weak promoters, we selected promoters whose expression changes insignificantly under different ionic stresses according to the transcriptome data of Lim, C. K. et al.[1] The construction method is the same as for the strong promoters.</p>
<p>By encoding DNA as triplet counts, solving a sparse system of equations, and reinforcing the linear relationship between the input and the promoter strength with a neural network, we found a good correlation between promoter sequence and promoter strength in <latin>P. fluorescens</latin>. Given a specific sequence, the model predicts its promoter strength, and it has been verified with other parts. This approach is also a valuable reference for future teams' projects.</p>
Assumptions
1. The promoter strength is related only to the core DNA sequence. In other words, the DNA sequence uniquely determines the promoter strength.
2. The laboratory environment has no impact on the test results. In other words, our test data fully reflect the promoter strength.
3. There is a simple monotonic relationship between the measured fluorescence and the promoter strength: the larger the fluorescence value, the stronger the promoter.
DNA Sequence Encoding
Each promoter contributes 70 bp of DNA information in the 5'~3' direction. We currently have 25 promoter sequences and need to encode these 70 bases in order to establish a mathematical model. Note that there are only four types of bases (A, T, C and G), and simply counting each type is not enough to reflect interactions between adjacent bases.[2] Thus, we consider 3 consecutive bases at a time (Fig 1).
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
Fig 1. Coding method of triple bases
We encode each three-base sequence as a quaternary number (A, T, C, G represent the quaternary digits 0, 1, 2, 3), and by running a C++ program (the full code is shown in the appendix) we obtain a 64-dimensional vector, from AAA to GGG, that counts the frequency of each triplet in the DNA sequence.
After encoding, every DNA sequence is transformed into a 64-dimensional vector. The entire data set is stored in the file “data.xlsx”. For example, the vectors of promoters ampC and araA are shown below:
Table 1. Coded vector schematic table
ampC: 3 1 5 1 1 0 1 0 2 0 4 2 1 0 1 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 4 1 0 2 0 0 0 1 4 0 1 3 0 1 4 2 2 0 1 0 1 0 0 0 1 1 3 2 2 0 2 4
araA: 0 0 1 0 1 1 1 0 2 0 2 2 0 1 2 1 0 1 1 2 0 2 1 2 1 3 1 0 1 2 0 1 1 0 2 1 3 1 2 2 0 2 4 1 2 0 0 1 0 2 2 1 1 1 1 0 1 2 0 0 2 0 1 0
At the end of the encoding step, we obtain a 25×64 matrix that reflects the features of each promoter's DNA sequence. This encoding has a deep internal connection with the characteristics of the DNA sequence: it simultaneously captures the order of the sequence and its local details. With this matrix, a mathematical model can be established.
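The counting rule above can be illustrated directly. Note that the components of the ampC vector in Table 1 sum to 68 = 70 − 2, which indicates overlapping windows. Below is a minimal Python sketch of the encoding step; the team's actual implementation is the C++ program in the appendix, and the function and variable names here are our own:

```python
# Map A, T, C, G to the quaternary digits 0, 1, 2, 3, as in the text.
DIGIT = {"A": 0, "T": 1, "C": 2, "G": 3}

def encode(seq):
    """Count the overlapping 3-base windows of `seq` into a 64-dim vector.

    The component index of triplet b0 b1 b2 is the quaternary number
    16*d(b0) + 4*d(b1) + d(b2), so AAA -> index 0 and GGG -> index 63.
    """
    vec = [0] * 64
    for i in range(len(seq) - 2):
        vec[16 * DIGIT[seq[i]] + 4 * DIGIT[seq[i + 1]] + DIGIT[seq[i + 2]]] += 1
    return vec

# A 70 bp promoter yields 70 - 2 = 68 overlapping triplets in total.
v = encode("ATCG" * 5)   # 20 bp toy sequence -> 18 triplets
assert sum(v) == 18
assert v[6] == 5         # triplet ATC (quaternary 012 -> index 6) occurs 5 times
```

Stacking one such vector per promoter, row by row, yields the 25×64 matrix described above.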
Model establishment and solution
To simplify the problem, we assume that the connection between the matrix and the promoter strength is linear; in other words, a vector \({x}\) exists such that the equation \({y=Ax}\) is satisfied.
Through repeated experiments we measured the promoter strengths; each datum is the average of three parallel experiments, and the twenty-two valid values are shown below.
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
Table 2. Corresponding table for natural logarithm processing of data
Promoter   y data       ln(y)         Promoter   y data       ln(y)
p1         4407487.7    15.29881524   p14        5318.7       8.578977924
p2         3249342.7    14.99396328   p15        INVALID DATA
p3         711219.0     13.47473568   p16        637745.3     13.36569432
p4         649617.0     13.38413824   p17        1750913.0    14.37564792
p5         62342.7      11.04040133   p18        1360771.0    14.12356201
p6         375156.7     12.835099     p19        INVALID DATA
p7         223456.0     12.34114972   p20        1798665.0    14.40255528
p8         2187057.0    14.59806736   p21        355658.0     12.78172487
p9         85833.0      11.36015883   p22        35051.0      10.46455942
p10        198750.0     12.19980303   p23        INVALID DATA
p11        945168.3     13.75911832   p24        726362.3     13.495803792
p12        27812.7      10.23324683   p25        2045946.7    14.53137116
p13        59438.3      10.99269464
The range of the original data is too wide, so we take the natural logarithm of the original data as the components of the vector \({y}\). Our task is now to recover the linear-relationship vector \({x}\) from the linear transformation matrix \({A}\) and the resulting vector \({y}\), where \({A}\), \({x}\) and \({y}\) satisfy \({y=Ax}\), i.e. formally \({x=A^{-1}y}\).
This is a linear mapping of 64 variables, but there are only 22 data points. In other words, the system is underdetermined and cannot be solved exactly as an inhomogeneous linear system. In the following we adopt an iterative sparse-solution algorithm for underdetermined equations[3], which is an effective way to recover a deterministic signal from fewer measured values.
Our goal is to make the recovered signal as accurate as possible. Consider the following optimization problem:
$$\min \left \| x \right \|_{0}\quad s.t.\; \; \; y=Ax$$
\(\left \| x \right \|_{0}\) is called the \(l^{0}\) norm, which is the number of nonzero components of the vector x. This value is equivalent to the following expression:
<span class="image fit">
<img src="https://static.igem.org/mediawiki/2018/f/f3/T--SKLMT-China--model_because.png" alt="" />
</span>
$$\left \| x \right \|_{0}=\lim_{\varepsilon \rightarrow 0}\sum_{i=1}^{64}\frac{x_{i}^{2}}{x_{i}^{2}+\varepsilon },\ because\ \frac{x_{i}^{2}}{x_{i}^{2}+\varepsilon }\rightarrow \begin{cases} 1, & x_{i}\neq 0 \\ 0, & x_{i}=0 \end{cases}\ (\varepsilon \rightarrow 0)$$
The following iterative steps are obtained from the perspective of numerical calculation:
⑴ First, set the initial values \(x^{(0)}=0,\ \varepsilon ^{(0)}=1\).
⑵ Then, repeatedly compute the current values \(x^{(n+1)},\ \varepsilon ^{(n+1)}\) from the previous values \(x^{(n)},\ \varepsilon ^{(n)}\) with these formulas:
$$x^{(n+1)}=arg\, \min\left \{ \sum_{i=1}^{64}\frac{x_{i}^{2}}{(x_{i}^{(n)})^{2}+\varepsilon^{(n)} } \right\},\ (y=Ax);$$
$$\varepsilon ^{(n+1)}=\min\left \{ \varepsilon ^{(n)} ,\frac{\min(x_{i}^{(n)})}{N}\right \}$$
⑶ Finally, if \(\varepsilon ^{(n)}\) is less than a given bound such as 1e-6, stop the program and regard the current \(x^{(n)}\) as the solution vector of this underdetermined equation.
The explicit iterative form of the above steps is given below. We use the Lagrange multiplier method to solve the minimization; a 22-dimensional multiplier vector \({\lambda }\) is introduced:
$$f(x,\lambda )=\sum_{i=1}^{64}\frac{x_{i}^{2}}{(x_{i}^{(n)})^{2}+\varepsilon^{(n)}}+\lambda ^{T}(Ax-y)$$
We calculate the partial derivative with respect to each component of \({x}\) and solve the equation:
$$\frac{\partial f}{\partial x_{i}}=0$$
We find that the vector \({x}\) can be iterated by this formula:
$$x^{(n+1)}=D^{(n)}{A}'(AD^{(n)}{A}')^{-1}y,$$
where we construct the matrix \({D^{(n)} }\) as follows:
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
$$D^{(n)}_{ij}= \begin{cases}
0, & i\neq j \\
(x_{i}^{(n)})^{2}+\varepsilon ^{(n)}, & i=j
\end{cases}$$
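The full iteration, steps ⑴ to ⑶ combined with the closed-form update \(x^{(n+1)}=D^{(n)}{A}'(AD^{(n)}{A}')^{-1}y\), can be sketched as follows. This is an illustrative NumPy version under our own assumptions, not the team's MATLAB script: we use \(|x_i|\) in the ε update so that ε stays positive, and we stop once ε drops below the bound.

```python
import numpy as np

def sparse_solve(A, y, tol=1e-6, max_iter=100):
    """Iterative sparse solution of the underdetermined system y = A x,
    using x_{n+1} = D A^T (A D A^T)^{-1} y with D = diag((x_i^n)^2 + eps_n)."""
    m, n = A.shape
    x = np.zeros(n)
    eps = 1.0
    for _ in range(max_iter):
        d = x ** 2 + eps                     # diagonal of D^(n)
        AD = A * d                           # A @ D for a diagonal D
        x = d * (A.T @ np.linalg.solve(AD @ A.T, y))
        eps = min(eps, np.abs(x).min() / n)  # shrink eps (|x_i| keeps it >= 0)
        if eps < tol:                        # stopping bound from step (3)
            break
    return x

# Toy example: 3 equations, 8 unknowns, a 2-sparse ground truth.
rng = np.random.default_rng(0)
A = rng.standard_normal((3, 8))
x_true = np.zeros(8)
x_true[1], x_true[5] = 2.0, -1.0
y = A @ x_true
x_hat = sparse_solve(A, y)
assert np.allclose(A @ x_hat, y)             # every iterate satisfies y = A x
```

By construction, each iterate satisfies the constraint \(y=Ax\) exactly, while the reweighting drives most components of \(x\) toward zero.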
With the help of MATLAB, we wrote a program to find the final value of \({x}\) (the full script and vector are shown in the appendix). The calculated and measured values of \({y}\) are compared below:
Table 3. Corresponding table of equation solution and test data
Promoter   Measured data   Calculated data   Promoter   Measured data   Calculated data
p1         15.29881524     15.2988           p14        8.578977924     8.579
p2         14.99396328     14.994            p15        INVALID DATA
p3         13.47473568     12.6034           p16        13.36569432     12.6364
p4         13.38413824     13.3035           p17        14.37564792     14.3756
p5         11.04040133     11.1562           p18        14.12356201     14.1236
p6         12.835099       12.3592           p19        INVALID DATA
p7         12.341156       11.628            p20        14.40255528     14.4026
p8         14.59806736     14.5981           p21        12.78172487     13.6054
p9         11.36015883     11.0234           p22        10.46455942     10.4646
p10        12.19980303     11.9291           p23        14.58418488     14.5842
p11        13.75911832     14.1454           p24        INVALID DATA
p12        10.23324683     10.2332           p25        14.53137116     14.5314
p13        10.99269464     10.9927
The values not shown in bold were used to solve the equation, so they match exactly; the comparison of the bold values shows that the fit is good, with an average fitting accuracy above 95%.
The solution of the matrix equation is good, which confirms the basic reliability of the assumed linear relationship. We therefore use a neural network model to further improve the accuracy and stability of the prediction.
Model Improvement -- Algorithm optimization
From the sparse-solution algorithm we conclude that there is indeed a linear relationship between the 64 triplet counts (AAA-GGG) and the promoter strength. Note that we used merely 22 data sets to approximately solve a sparse system in 64 unknowns with a fitting accuracy above 95%; as more data are provided, the fitting accuracy of the model should increase greatly.
However, this algorithm still has several problems:
(1) Known data cannot be reused. The sparse solution is uniquely determined by the data, so the data are not fully exploited, and it is difficult to improve on the original results.
(2) The algorithm does not adapt well to extensions of the data set. Every time a new set of data is added, all matrix operations change substantially, so the method is not stable under data updates.
On the basis of the linear structure of the original algorithm, we add more parameters to make the algorithm more stable and reliable. The new algorithm adopts a neural network structure and improves data utilization through supervised learning.
The construction of the neural network is shown in Figure 2:
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | + | ||
− | + | ||
+ | |||
+ | |||
Fig.2 Schematic diagram of the neural network's structure
There are 6 neurons in the hidden layer, and each neuron is connected to 32 input terminals. The correspondence is as follows: the binary representation of the integers 1~63 has exactly 6 bits (padded with leading zeros where necessary), and input \(j\) is connected to hidden neuron \(i\) when bit \(i\) of the binary number \(j\) is 1. Port 64 is treated separately.
We then assign a weight \(\omega _{ij}\) to each edge between an input and a hidden neuron, and a weight \(p_{i}\) to each edge between a hidden neuron and the output terminal.
The final output can be obtained by the following formula:
$$y_{output}=\sum_{i=1}^{6}y_{i}\cdot p_{i}\, ,\, y_{i}=\sum_{j=1}^{64}\omega _{ji}x_{j}$$
where \(y_{i}\) represents the input (output) data of hidden neuron \(i\).
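The bit-mask wiring rule described above can be checked with a short sketch; the function name is ours, and input 64 is handled separately in the original model, so it is excluded here:

```python
def connected(j, i):
    """True if input j (1..63) feeds hidden neuron i (bit index 0..5)."""
    return (j >> i) & 1 == 1

# Each of the 6 hidden neurons is wired to exactly 32 of inputs 1..63,
# matching the "32 input terminals" stated in the text.
for i in range(6):
    assert sum(connected(j, i) for j in range(1, 64)) == 32
```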
We use the gradient descent method to determine the learning direction. The objective function is the squared error between the output and the ideal value:
$$e=\frac{1}{2}(y_{output}-\bar{y})^{2}$$
Our goal is to minimize this function over generations of iterations. The difference between the weights \((\omega ,p)\) and their next generation is calculated by:
$$\Delta \omega _{ij}=-\eta \frac{\partial e}{\partial \omega _{ij}}\, ,\, \Delta p_{i}=-\eta \frac{\partial e}{\partial p_{i}}$$
where \(\eta \) represents the learning speed. We use the chain rule to calculate the specific value of the gradient:
$$\begin{cases}
\frac{\partial e}{\partial \omega_{ij} }= (y_{output}-\bar{y})\frac{\partial y_{output}}{\partial \omega _{ij}}=(y_{output}-\bar{y})x_{i}p_{j}\\
\frac{\partial e}{\partial p_{i}}= (y_{output}-\bar{y})\frac{\partial y_{output}}{\partial p_{i}}=(y_{output}-\bar{y})y_{i}
\end{cases}$$
At the same time, a reasonable learning speed is chosen so as to accelerate convergence while still guaranteeing it. After several convergence tests we fixed the learning speed. This completes our mathematical model.
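The whole training loop, a single all-linear hidden layer of 6 neurons updated by the gradient formulas above, can be sketched as follows. This is an illustrative Python version under our own assumptions (a dense weight matrix, a toy learning speed, and a single toy sample); it is not the team's program:

```python
import numpy as np

rng = np.random.default_rng(1)

# 64 inputs -> 6 linear hidden neurons -> 1 linear output, as in the text.
W = rng.normal(scale=0.1, size=(6, 64))  # omega_{ji}: input j -> hidden i
p = np.ones(6)                           # p_i: hidden i -> output

def forward(x):
    y_hidden = W @ x                     # y_i = sum_j omega_{ji} x_j
    return y_hidden, y_hidden @ p        # y_output = sum_i y_i p_i

def train_step(x, y_bar, eta=5e-5):
    """One gradient-descent step on e = (y_output - y_bar)^2 / 2."""
    global W, p
    y_hidden, y_out = forward(x)
    err = y_out - y_bar
    W -= eta * err * np.outer(p, x)      # de/dW[i,j] = err * p_i * x_j
    p -= eta * err * y_hidden            # de/dp[i]   = err * y_i

# Toy data: one triplet-count vector and a target log-strength.
x = rng.integers(0, 5, size=64).astype(float)
target = 13.0
for _ in range(1000):
    train_step(x, target)
assert abs(forward(x)[1] - target) < 1e-3
```

With many promoters, one "generation" in the text's sense corresponds to one `train_step` per known (vector, strength) pair.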
Model analysis and evaluation
After the above improvement of the algorithm, the advantages of the model are as follows:
1. High stability. The algorithm adopts a deep-learning approach, which shortens the step length of each weight update and expands the number of linear parameters to 206, ensuring the fitting precision and greatly improving the stability of the model.
2. The input of the model can be extended conveniently. When the promoter library is expanded, the latest measured data need only be entered through our interface; they are automatically imported into the database, and the model can immediately use them for training.
3. The method can serve as a reference for other subjects. The coding method and mathematical treatment are consistent with how the DNA sequence affects promoter strength and can be transplanted to similar studies in other organisms such as <latin>E. coli</latin>.
At the same time, the model still has deficiencies that need to be improved:
1. The model is not robust to erroneous data. When a group of incorrect fluorescence values appears in the input data, the fitting precision of the whole program is greatly affected, although this interference gradually weakens as the amount of correct data increases.
2. Convergence speed and accuracy still need balancing. If the number of training rounds is too high, the program runs too slowly; if it is too low, the results are often unsatisfactory. We need to find a balance between speed and accuracy.
Results
Through the linear-relationship test with sparse equations, we showed that there is a strong linear correlation between the promoter strength and the counts of the different triple bases, and we strengthened this linear relationship with deep learning.
We call one pass through all known data a generation. The fitting results of the model under different numbers of generations are given below:
+ | |||
+ | |||
+ | |||
+ | |||
+ | |||
+ | |||
+ | |||
+ | |||
+ | |||
Fig.3 Error statistics after multi-generation training
<span class="image fit">
<img src="https://static.igem.org/mediawiki/2018/c/ca/T--SKLMT-China--Fig3Errorstatisticsaftermultigenerationtraining.png" alt="Fig.3 Error statistics after multi-generation training" />
</span>
It can be seen that the more generations we compute, the better the data fit, and the absolute error even drops below 0.01 after 250,000 rounds of training. This indicates a very high fitting precision.
After setting the number of learning rounds to 250,000 generations, we obtained the coefficients between the neurons of the input layer, hidden layer and output layer; the complete table is placed in the appendix. As long as a core DNA sequence is input, our model immediately outputs the promoter strength using the existing parameters. The specific calculation is as follows:
$$y_{output}=\sum_{i=1}^{6}y_{i}\cdot p_{i}\, ,\, y_{i}=\sum_{j=1}^{64}\omega _{ji}x_{j}$$
We notice a significant difference in the magnitude of the measurements between our group and other groups. In order to better contribute our results to other teams, we therefore use the magnitude of our own measurements and classify the promoters by numerical size.
The promoter intensity levels corresponding to the data are as follows:
Using this standard, we quickly categorized 23 known promoters:
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
Fig.4 The strength level of different promoters
With the expansion of the promoter data, we can quickly calculate the strength level of a new promoter with the help of our model.
Further application
To be of greater help to other teams, we attach the source code of all programs in the appendix for reading and verification. We also designed Python programs so that our results can interact well with other teams:
1. Convenient data insertion. When a new set of data is measured, simply submit the core sequence and the measured value through our interface (please take the natural logarithm of the original data first), and the data will be entered into our database.
2. Easy control of the speed and accuracy of deep learning by changing the learning generation. After a new generation number is set, the program automatically updates all parameters.
3. Promoter strength prediction. After the core sequence is entered, our Python program calculates the strength level of the promoter for you.
References
[1] Lim, C. K., Hassan, K. A., Tetu, S. G., Loper, J. E., Paulsen, I. T. The effect of iron limitation on the transcriptome and proteome of Pseudomonas fluorescens Pf-5. PLoS One, June 2012, Volume 7, Issue 6, e39139, 2-18.
[2] Xuan Zhou, Xin Zhou, Zhaojie Zhong. Prediction of start-up strength of E. coli promoter. Computers and Applied Chemistry, Jan. 28, 2014, Volume 31, Issue 1, 101-103.
[3] Xiaoliang Cheng, Xuan Zheng, Weimin Han. An algorithm for solving sparse solutions of underdetermined linear equations. College Journal of Applied Mathematics, 2013, 28(2): 235-248.
Revision as of 21:57, 17 October 2018
Background We have built a promoter library of inner promoters with various strength in pf5. We characterized the strength by a fluorescent reporter gene, <latin>firefly luciferase</latin. Having gotten so many statistics, we thought about determine element for the strength of promoters. As we all know, strength composite promoters usually have high affinity with RNA polymerase related to the UP elements like −35 sequences and −10 sequences. We wonder the connection between the promoter sequence and its strength. So we use the resulting fluorescence data to construct a model with the promoter sequence. There are four reasons for our modeling: 1. To facilitate the promoter transformation of pseudomonas fluorescence in new site; 2. To reveal the correlation between promoter weights and their strength; 3. To improve the parameter setting of the promoter strength prediction in the species outside the large intestine with the method of position weight; 4. To provide reference for other microbe promoter strength prediction modeling. Overview
We set up a model for facilitating the transformation of new chassis bacteria <latin>Pseudomonas fluorescenspf-5</latin>, such as building a promoter library. In our model,encode each three-base sequence with quaternary number(A,T,C,G represents quaternary number 0,1,2,3 ),and then get a vector with 64 dimensions from AAA to GGG, which calculates the frequency of appearance for each type in DNA sequence.
We preliminarily assume that there is a simple linear relationship between the 64 components of this vector and the final result, and preliminarily confirm our hypothesis by comparing the difference between calculated value and measured value by solving sparse equations. Subsequently, we further strengthened our model by using the neural network model with stronger stability and more relevant parameters, and obtained an algorithm to predict the promoter strength.
We tested 25 promoters’ strength with firefly luciferase and another variable goes to promoter sequence gotten from pf5 genome in ncbi. As we know, house-keeping genes sometiomes express strongly and consistently. So we picked up 70 base pairs up from several genes of them as strong promoters in <latin>pf-5</latin>. As for the weak promoter, we selected weak promoters that change insignificantly under different ionic stresses from the transcriptome information in <latin>Lim, C. K.[1] </latin>The method is the same as the construction of strong promoter.
By combining three bases to encode DNA, solving a sparse system of equations,and strengthening the linear relationship between system input and promoter strength by neural network, a good correlation was found between the promoter sequence and promoter strength. It is through this mutually-informing modelling that we were able to figure out the relationship between promoter sequence and strength in P. fluorescence. This model informs us a promoter strength based on its specific sequence and it has also been verified by using other parts. This approach is also of valuable reference for future teams’ project.
Assumptions 1. It is assumed that the promoter strength is only related to the core DNA sequence.In other words, the DNA sequence can uniquely determine the promoter strength. 2. It is assumed that the laboratory environment has no impact on the test results.In other words, our test data can fully reflect the promoter strength 3.It is assumed that there is a simple monotonic relationship between the obtained fluorescence data and the promoter strength: the larger the fluorescence value is, the stronger the promoter can be considered. DNA Sequence Encoding The 5’~3’ region includes 70bp DNA information in each promoter. Currently, we have got 25 promoter sequences. We need to encode these 70 bases in order to establish a mathematical model. Notice that there’re only four types of bases(A,T,C and G),and calculating the numbers of each type is not enough to reflect interaction between adjacent bases.[2] Thus, we consider 3 continuous bases (Fig 1).
Fig 1. Coding method of triple bases
We encode each three-base sequence with quaternary number(A,T,C,G represents quaternary number 0,1,2,3 ), and by running a C++ program (the whole code is shown in appendix), we get a vector with 64 dimensions from AAA to GGG, which calculates the frequency of appearance for each type in DNA sequence.
After encoding, all DNA sequences are transformed into a vector with 64 dimensions. The entire data is stored in the document “data.xlsx”. For example, the vectors of promoters ampC and araA are shown as follows:
Table 1. coded vector schematic table
ampC: 3 1 5 1 1 0 1 0 2 0 4 2 1 0 1 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 4 1 0 2 0 0 0 1 4 0 1 3 0 1 4 2 2 0 1 0 1 0 0 0 1 1 3 2 2 0 2 4
araA: 0 0 1 0 1 1 1 0 2 0 2 2 0 1 2 1 0 1 1 2 0 2 1 2 1 3 1 0 1 2 0 1 1 0 2 1 3 1 2 2 0 2 4 1 2 0 0 1 0 2 2 1 1 1 1 0 1 2 0 0 2 0 1 0
At the end of the encoding step, we get an 25*64 matrix which reflects the features of these DNA sequences of each promoters. This encoding method has deep internal connection with the characteristics of DNA sequence, and it simultaneously contains the sequential relationship and details of DNA sequence. With this matrix, a mathematical model can be established.
Model establishment and solution
In order to simplify the problem, we assume that the connect between matrix and promoter strength is linear, in other words, the vector \({x}\) exists when equation \({y=Ax}\) is satisfied.
Now, through a lot of experiments, we have measured a total of 24 values, and all twenty-two valid values are shown below. Each data is the average of the results of three parallel experiments.
Table 2. Corresponding table for natural logarithm processing of data
Promoter y data the value of ln(y) Promoter y data the value of ln(y)
P1 4407487.7 15.29881524 p14 5318.7 8.578977924
p2 3249342.7 14.99396328 p15 INVALID DATA
p3 711219.0 13.47473568 p16 637745.3 13.36569432
p4 649617.0 13.38413824 p17 1750913.0 14.37564792
p5 62342.7 11.04040133 p18 1360771.0 14.12356201
p6 375156.7 12.835099 p19 INVALID DATA
p7 223456.0 12.34114972 p20 1798665.0 14.40255528
p8 2187057.0 14.59806736 p21 355658.0 12.78172487
p9 85833.0 11.36015883 p22 35051.0 10.46455942
p10 198750.0 12.19980303 p23 INVALID DATA
p11 945168.3 13.75911832 p24 726362.3 13.495803792
p12 27812.7 10.23324683 p25 2045946.7 14.53137116
p13 59438.3 10.99269464
The gap in original data is too wide, so we take the natural log of the original data as the value of the vector . Now, what we're going to do is to invert the linear relationship vector \({x}\) by taking the linear transformation matrix \({A}\) and the resulting vector \({y}\).\({A,x}\) and \({y}\)satisfy the following equation: \({y=Ax}\),or \({x=A-1y}\).
This is a linear mapping of 64 variables but there are only 22 data points. In other words, this is an underdetermined equation which cannot be solved accurately by solving inhomogeneous linear equations. The sparse solution algorithm of underdetermined equations based on iteration3 will be adopted in the following part.
This is an effective algorithm to restore deterministic signals with fewer premeasured values.
Our goal is to improve the precision of signal reduction as much as possible. Consider the following optimization problem:
$$min \left \| x \right \|_{0}$$
s.t.\; \; \; y=Ax
\(\left \| x \right \|_{0}\)is called \(l^{0}\)norm, which represents the number of nonzero components in vector x in mathematics. This value is equivalent to this expression:
<img src="" alt="" />
$$\lim_{\varepsilon \rightarrow 0}\sum_{i=1}^{64}\frac{x_{i}^{2}}{x_{i}^{2}+\varepsilon }, because\frac{x_{i}^{2}}{x_{i}^{2}+\varepsilon }=$$
The following iterative steps are obtained from the perspective of numerical calculation:
⑴At first , set the initial \(x^{(0)}=0,\varepsilon ^{(0)}=1)\value
⑵After that, loop to calculate the current \(x^{(n+1)},\varepsilon ^{(n+1)})\value based on the previous value \(x^{(n)},\varepsilon ^{(n)})\with this formula:
$$x^{(n+1)}=arg\, min\left \{ \sum_{i=1}^{64}\frac{x_{i}^{2}}{(x_{i}^{(n)})^{2}+\varepsilon^{(0)} } \right\},(y=Ax);$$
$$\varepsilon ^{(0)}=min\left \{ \varepsilon ^{(n)} ,\frac{min(x_{i}^{(n)})}{N}\right \}$$
⑶Finally, if is less than a given bound such as 1e-6, abort this program, and regard current \(x^{(n)})\ as the solution vector of this underdetermined equation.
The explicit iterative form of the above steps is given below:
$$f(x,\lambda )=\sum_{i=1}^{64}\frac{x_{i}^{2}}{(x_{i}^{(n)})^{2}+\varepsilon^{(0)}=\sum_{i=1}^{64}\frac{x_{i}^{2}}{(x_{i}^{(n)})^{2}+\varepsilon^{(0)}+\lambda (Ax-y)$$ we use Lagrange multiplier method to solve the minimum variable. Vector ({\lambda }\)with 22 dimensions is introduced. So calculate the partial derivative to each component of \({x}\),then solve this equation:
$$\frac{\partial f}{\partial x_{i}}=0$$ We find that the vector \({x}\)can be iterated by this formula:
$$x^{(n+1)}=D^{(n)}{A}'(AD^{(n)}{A}')^{-1}y.$$ where we construct matrix \({D^{(n)} }\) as follow:
$$D^{(n)}_{ij}= \begin{cases}
& \text{ 0, } i\neq j \\ & \text (x_{i}^{(n)})^{2}+\varepsilon ^{n},i=j
\end{cases}$$
With the help of MATLAB, we wrote the program to find the final value (the whole script and vector is shown in appendix), and the corresponding y and measured values are compared as follows: Table 3. Corresponding table of equation solution and test data Promoter Measured data Calculated data Promoter Measured data Calculated data p1 15.29881524 15.2988 p14 8.578977924 8.579 p2 14.99396328 14.994 p15 INVALID DATA p3 13.47473568 12.6034 p16 13.36569432 12.6364 p4 13.38413824 13.3035 p17 14.37564792 14.3756 p5 11.04040133 11.1562 p18 14.12356201 14.1236 p6 12.835099 12.3592 p19 INVALID DATA p7 12.341156 11.628 p20 14.40255528 14.4026 p8 14.59806736 14.5981 p21 12.78172487 13.6054 p9 11.36015883 11.0234 p22 10.46455942 10.4646 p10 12.19980303 11.9291 p23 14.58418488 14.5842 p11 13.75911832 14.1454 p24 INVALID DATA p12 10.23324683 10.2332 p25 14.53137116 14.5314 p13 10.99269464 10.9927 Part of the value without bold is used to solve the equation, so it is a perfect match; From the comparison of data in bold, it can be seen that the result of data fitting is good, with the average fitting degree higher than 95%. The solution results of the matrix are good, which confirms the basic reliability of the assumed linear relationship. Therefore, we will use the neural network model to further improve the accuracy and stability of the model prediction. Model Improvement--Algorithm optimization By solving the matrix sparse solution algorithm, we can conclude that there is indeed a linear relationship between the total 64 data of AAA-GGG and the promoter strength. Notice that we used merely 22 sets of data to approximately solve the sparse solutions of 64 equations with a fitting precision higher than 95%, therefore, when we provide as much data as possible, the fitting precision of the model will greatly increase. However, there are still many problems with this algorithm: (1) All known data cannot be reused. 
The sparse solution is uniquely determined once the data are fixed, so the data are not fully exploited, and it is difficult to improve on the original results.

(2) The algorithm does not adapt well to extensions of the data set. Every time a new set of data is added, all the matrix computations change, so the method is unstable under data updates.

On the basis of the linear structure of the original algorithm, we therefore add more parameters to make the algorithm more stable and reliable. The new algorithm adopts a neural network structure and improves the utilization of data through supervised learning. The construction of the neural network is shown in Figure 2:
Fig.2 Schematic diagram of the neural network structure
There are 6 neurons in the hidden layer, and each neuron is connected to 32 of the input terminals. The wiring rule is as follows: write each integer from 1 to 63 as a 6-bit binary number, padding with leading zeros where necessary; input terminal $j$ is connected to hidden neuron $i$ exactly when bit $i$ of that binary number is 1. Input terminal 64 is considered separately. We then assign a weight $\omega_{ji}$ to each edge between input terminal $j$ and hidden neuron $i$, and a weight $p_{i}$ to each edge between hidden neuron $i$ and the output terminal. The final output is obtained by the following formula:

$$y_{output}=\sum_{i=1}^{6}y_{i}\cdot p_{i}\, ,\qquad y_{i}=\sum_{j=1}^{64}\omega _{ji}x_{j}$$

where $x_{j}$ is the data at input terminal $j$ and $y_{i}$ is the output of hidden neuron $i$.
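This wiring and forward pass can be sketched in Python. Treat it as an illustration only: the text above says terminal 64 is considered separately without specifying how, so connecting it to every hidden neuron here is an assumption.

```python
import numpy as np

def build_mask():
    # mask[i, j] = 1 when input terminal j+1 (for j+1 in 1..63) has bit i
    # set in its 6-bit binary representation. Terminal 64's wiring is not
    # specified above, so fully connecting it is an assumption.
    mask = np.zeros((6, 64))
    for j in range(1, 64):
        for i in range(6):
            if (j >> i) & 1:
                mask[i, j - 1] = 1
    mask[:, 63] = 1  # terminal 64: assumed connected to all hidden neurons
    return mask

def forward(x, w, p, mask):
    # x: 64-dim triplet vector; w[i, j]: weight from input j to hidden i;
    # p[i]: weight from hidden neuron i to the output terminal
    y_hidden = (w * mask) @ x   # y_i = sum over connected j of w[i, j] * x_j
    return y_hidden @ p         # y_output = sum_i y_i * p_i
```

With this mask, each hidden neuron is indeed connected to exactly 32 of terminals 1 through 63, matching the description above.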
We use the gradient descent method to determine the learning direction. The objective function is the squared error between the output and the ideal (measured) value:

$$e=\frac{1}{2}(y_{output}-\bar{y})^{2}$$

Our goal is to minimize the value of this function through generations of iterations. The difference between the weight vectors $\omega$, $p$ and their next generation is calculated by this formula:

$$\omega_{ij}^{(n+1)}=\omega_{ij}^{(n)}-\eta \frac{\partial e}{\partial \omega_{ij}}\, ,\qquad p_{i}^{(n+1)}=p_{i}^{(n)}-\eta \frac{\partial e}{\partial p_{i}}$$

where $\eta$ represents the learning speed. We use the chain rule to calculate the specific value of the gradient:
$$\begin{cases}
\dfrac{\partial e}{\partial \omega_{ij}} = (y_{output}-\bar{y})\dfrac{\partial y_{output}}{\partial \omega_{ij}} = (y_{output}-\bar{y})\,x_{i}p_{j}\\[2mm]
\dfrac{\partial e}{\partial p_{i}} = (y_{output}-\bar{y})\dfrac{\partial y_{output}}{\partial p_{i}} = (y_{output}-\bar{y})\,y_{i}
\end{cases}$$
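A single gradient-descent update implementing these two formulas might look like the following sketch. The weight layout is an assumption: w[i, j] stores the weight from input terminal j to hidden neuron i, and the mask encodes the connectivity, so the index order differs from the formulas above but the quantities are the same.

```python
import numpy as np

def train_step(x, y_bar, w, p, mask, eta=1e-3):
    # one gradient-descent update on a single training example;
    # w: (6, 64) input->hidden weights, p: (6,) hidden->output weights
    y_hidden = (w * mask) @ x
    y_out = y_hidden @ p
    err = y_out - y_bar
    # chain rule: de/dw[i, j] = err * p[i] * x[j];  de/dp[i] = err * y_hidden[i]
    w = w - eta * err * np.outer(p, x) * mask
    p = p - eta * err * y_hidden
    return w, p, 0.5 * err**2
```

Repeating this step drives the squared error downward, which is the learning direction described above.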
At the same time, we designed a reasonable learning speed for the program, which accelerates convergence while still guaranteeing it. After several convergence tests, we fixed a suitable learning speed. This completes our mathematical model.

Model analysis and evaluation

After the improvements above, the advantages of the model are as follows:

1. High stability. The algorithm adopts a deep-learning style of training, which shortens the step length of each weight update and expands the number of linear parameters to 206, ensuring the precision of the fit and greatly improving the stability of the model.

2. The input of the model can be extended conveniently. When the promoter library is expanded, the latest measured data only need to be entered into our system; they are automatically imported into the database, and the model can immediately use them for training.

3. The method can be reused in other studies. The coding scheme and mathematical treatment reflect the effect of the DNA sequence on promoter strength and can be transplanted to similar studies in other organisms such as <latin>E. coli</latin>.

At the same time, the model still has deficiencies that need to be improved:

1. The model is not robust to erroneous data. When a group of incorrect fluorescence values appears in the input data, the fitting precision of the whole program is strongly affected; this interference gradually weakens as the amount of correct data increases.

2. Convergence speed and accuracy still need to be balanced. If the number of training rounds is too high, the program runs slowly; when it is too low, the results are often unsatisfactory. We need to find a balance between speed and accuracy.
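Putting the pieces together, the supervised-learning loop can be sketched as follows, with the number of passes over the data acting as the speed/accuracy knob discussed above. The initialization scale, learning speed, and weight layout (w[i, j] from input j to hidden neuron i) are illustrative assumptions:

```python
import numpy as np

def train(X, y, mask, eta=0.01, generations=1000, seed=0):
    # X: (m, 64) encoded promoters; y: (m,) measured strengths (ln scale);
    # mask: (6, 64) connectivity between input terminals and hidden neurons
    rng = np.random.default_rng(seed)
    w = 0.1 * rng.standard_normal((6, 64)) * mask
    p = 0.1 * rng.standard_normal(6)
    for _ in range(generations):          # one generation = one full pass
        for x_k, y_k in zip(X, y):
            y_hidden = (w * mask) @ x_k
            err = y_hidden @ p - y_k
            w -= eta * err * np.outer(p, x_k) * mask
            p -= eta * err * y_hidden
    return w, p
```

Raising the generation count improves the fit at the cost of running time, which is exactly the trade-off noted in the deficiencies above.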
Results

Through the test of the linear relationship via sparse equations, we have shown that there is a strong linear correlation between promoter strength and the counts of the different base triplets, and we strengthen this linear relationship by deep learning. We call one round of learning that uses all known data in turn a generation. The fitting results of the model under different numbers of generations are given below:
Fig.3 Error statistics after multi-generation training
It can be seen that the more generations we train, the better the data fit becomes; after 250,000 rounds of training the absolute error is even lower than 0.01. This indicates that the fitting precision is very high.
After setting the number of learning rounds to 250,000 generations, we obtained the coefficients between the neurons of the input layer, hidden layer and output layer; the complete table is placed in the appendix. As long as a core DNA sequence is input, our model can immediately output the promoter strength using the existing parameters. The specific calculation process is as follows:
$$y_{output}=\sum_{i=1}^{6}y_{i}\cdot p_{i}\, ,\qquad y_{i}=\sum_{j=1}^{64}\omega _{ji}x_{j}$$

We notice that there is a significant difference in the order of magnitude of the measurements between our group and other groups. In order to make our results more useful to other teams, we keep our own measurement scale and classify the promoters by numerical size. The promoter intensity levels corresponding to the data ranges are as follows:
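As a concrete illustration of this calculation, the prediction step can be sketched end to end: encode a core sequence as the 64-dimensional triplet vector described in the Overview (A, T, C, G as quaternary digits 0, 1, 2, 3), then apply the trained coefficients. Whether triplets overlap and how the counts are normalized to frequencies are our assumptions here, and the weight values below are placeholders; the real coefficients are the ones tabulated in the appendix.

```python
import numpy as np

BASE = {'A': 0, 'T': 1, 'C': 2, 'G': 3}

def encode(seq):
    # 64-dim frequency vector of overlapping base triplets, indexed by the
    # quaternary number of the triplet (AAA = 0, ..., GGG = 63)
    v = np.zeros(64)
    for k in range(len(seq) - 2):
        a, b, c = (BASE[ch] for ch in seq[k:k + 3])
        v[16 * a + 4 * b + c] += 1
    return v / max(len(seq) - 2, 1)   # frequency rather than raw count

def predict_strength(seq, w, p):
    # w: (6, 64) trained input->hidden weights (assumed to already respect
    # the network's connectivity); p: (6,) hidden->output weights
    x = encode(seq)
    y_hidden = w @ x                  # y_i = sum_j w_ji x_j
    return y_hidden @ p               # y_output = sum_i y_i p_i
```

For example, encode("ATCG") places weight 0.5 on the triplets ATC (index 6) and TCG (index 27) and zero elsewhere.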
Using this standard, we quickly categorized 23 known promoters:
Fig.4 The strength level of different promoters
With the expansion of the promoter data, we can quickly calculate the strength level of a new promoter with the help of our model.

Further application

To be of greater help to other teams, we attach the source code of all the programs in the appendix for you to read and verify. We also designed Python programs so that our research results can interact well with other teams' work:

1. Convenient data insertion. When a new set of data is measured, simply submit the core sequence and the measured value through our interface (please take the natural logarithm of the original data first), and the data will be entered into our database.

2. Easy control of the speed and accuracy of deep learning by changing the learning generation. After a new generation count is set, the program automatically updates all parameters.

3. Promoter strength prediction. After you enter a core sequence, our Python program calculates the strength level of the promoter for you.

References

[1] Lim, C. K., Hassan, K. A., Tetu, S. G., Loper, J. E., Paulsen, I. T. The effect of iron limitation on the transcriptome and proteome of Pseudomonas fluorescens Pf-5. PLoS One, 2012, 7(6): e39139, 2-18.

[2] Xuan Zhou, Xin Zhou, Zhaojie Zhong. Prediction of start-up strength of E. coli promoter. Computers and Applied Chemistry, 2014, 31(1): 101-103.

[3] Xiaoliang Cheng, Xuan Zheng, Weimin Han. An algorithm for solving sparse solutions of underdetermined linear equations. College Journal of Applied Mathematics, 2013, 28(2): 235-248.