1. MODEL INTRODUCTION

    DNAs or RNAs are linear macromolecules which are made up of five-membered deoxyribose rings combined with A, C, G, and T or U by phosphodiester bond. So, various positions and properties of bases will lead to different functions of sequences.
    Promoters, as specific DNA sequences, play an essential role in genetic information transferring. Here we use 38 promoter sequences from dataset and the logarithm of the promoter strength is used because of the large variation in strength.
    In this project we have building a model to predict promoter strength only by DNA sequence which based on DWT (Discrete Wavelet Transformation) and SVR (Support Vector Regression) with these promoter sequences.

1.1 DNA walk

    The biological function of DNA depends on the base distribution of its sequence, so we can map a DNA sequence to one or more discrete time sequences by some rule and express them as real or complex numbers, this method is called DNA walk which name comes from the concept of random walk in statistics.
    We obtain the 2-D discrete digital sequence using two-dimensional mapping.
    the formula is as follows:

Promoter sequences in dataset are 68 bp DNA fragment from transcription start site position -49 to position +19. After mapping, each dimension has 68 discrete numbers, and then we can get a one-dimensional sequence of 136 numbers by combining the X and Y.

1.2 Discrete Wavelet Transformation

The Haar transform is one of the oldest transform functions, proposed in 1910 by the Hungarian mathematician Alfréd Haar. As a special case of the Daubechies wavelet, the Haar wavelet is also known as Db1. It is found effective in applications such as signal and image compression in electrical and computer engineering as it provides a simple and computationally efficient approach for analyzing the local aspects of a signal.
In mathematics, the Haar wavelet is a sequence of rescaled "square-shaped" functions which together form a wavelet family or basis. Wavelet analysis is similar to Fourier analysis

    in that it allows a target function over an interval to be represented in terms of an orthonormal basis.     We don't need to understand the mathematical process in detail, because there are many packaged programs that can implement this algorithm. For example-PyWavelets, it is very easy to use and get started with. Just install the package, open the Python interactive shell and type:
    >>> import pywt
    >>> cA, cD = pywt.dwt([1, 2, 3, 4], 'db1')

1.3 Three-bases element

If we use only a single base to represent the DNA feature, we could not get the information of base interactions in the sequence. So, we choose three bases to characterize the DNA sequence.
Let P be a DNA sequence containing L bases.

P = R₁R₂R₃R₄...R_L

R_irepresents the base of the i position in the sequence. R₁R₂R₃ is the first corresponding three base element, R₂R₃R₄ is the second and so on. We can get a 1-D sequence with 64 elements with the format like:

S = {f₁f₂f₃f₄...f₆₄}

f₁ is the probability of AAA appearing in fragments. f₂ is the probability of AAG appearing in fragments. f₃ is the probability of AAC appearing in fragments, and so on.

1.4 Support Vector Regressiont

    The method of Support Vector Classification can be extended to solve regression problems.
    This method is called Support Vector Regression.
    The model produced by support vector classification only depends on a subset of the training data, because the cost function for building the model does not care about training points that lie beyond the margin. Analogously, the model produced by Support Vector Regression depends only on a subset of the training data, because the cost function for building the model ignores any training data close to the model prediction.
    We can use SVR package in Scikit-learn, a machine learning toolkit in Python, to do our job.

2. DETAIL

In part 1.1, we get a one-dimensional sequence. In order to remove noise, we can use Haar transform in 1.2 to get approximation and detail coefficients, and we choose decomposition level as 5 by 5-fold cross validation. Then we can get a sequence of 138 numbers.

To get more information about DNA sequence, we use three-bases element to characterize the interaction between bases. As we described in 1.3, we can get another sequence of the probability of 64 element (AAA, AAG, AAC, …, TTT). By combing the two sequence together, we can get an eigenvector of a DNA sequence which containing 202 numbers. After we get eigenvector, we can use SVR with Radial Basis Function to complete our model. The free parameters in the model are C, γ and ε. By 5-fold cross validation, we can confirm that C=10, γ=0.1 and ε=0.1. We choose 20% as test data randomly.

3. Result

We choose 20% as test data randomly and we get Mean Squared Error as follows:

MSE_train = 0.0083

MSE_test = 0.047

The results of the model are excellent.

References:
    https://en.wikipedia.org/wiki/Haar_wavelet
    https://en.wikipedia.org/wiki/Discrete_wavelet_transform
    https://en.wikipedia.org/wiki/SVR
    https://pywavelets.readthedocs.io/en/latest/ref/dwt-discrete-wavelet-transform.html
    http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html
    Liang G, Li Z. Scores of generalized base properties for quantitative sequence-activity modelings for E. coli promoters based on support vector machine[J]. Journal of Molecular Graphics & Modelling, 2007, 26(1):269.