Revision as of 02:04, 15 October 2018
miniToe Family
1. Enzymes Mutation
1.1 The Four Keys in miniToe System
The wet lab members gave us four important sites in the protein Csy4 (Gln104, Tyr176, Phe155 and His29) that play important roles in binding and cleavage. Considering the 20 kinds of amino acids, there are 80 mutants to explore and choose from even if only one site is mutated at a time.
Before designing the protein mutants, we first look into the working process of the miniToe structure to find out which keys are the most important in our system.
Fig.1-1 The working process of miniToe system
(1)The miniToe structure is produced and accumulates.
(2)Csy4 is produced upon IPTG induction.
(3)Csy4 binds to the miniToe structure and forms the Csy4-miniToe complex.
(4)Csy4 cleaves the specific site and divides the miniToe structure into two parts: the Csy4-crRNA complex and the mRNA of sfGFP.
(5)sfGFP is produced.
From the description above, we can identify four key problems that must be solved for our system to work:
(1)Does Csy4 dock correctly with the miniToe structure (hairpin)?
(2)How strong is the binding between Csy4 and the miniToe structure (hairpin)?
(3)How efficient is the cleavage of the miniToe structure (hairpin) by Csy4?
(4)Is the crRNA released from the RBS?
The most convincing way to explore these four problems is to model our system at the atomic level by molecular dynamics, and there is already a considerable body of work exploring the Csy4-RNA complex by molecular dynamics.
1.2 Molecular Dynamics
Molecular dynamics (MD) is a computer simulation method for studying the physical movements of atoms and molecules. The atoms and molecules are allowed to interact for a fixed period of time, giving a view of the dynamic evolution of the system. In the most common version, the trajectories of atoms and molecules are determined by numerically solving Newton's equations of motion for a system of interacting particles, where the forces between the particles and their potential energies are often calculated using interatomic potentials or molecular mechanics force fields.
For a system consisting of molecules or atoms, the total energy includes kinetic energy and potential energy, which can be described by the formula below: $E = {E_{kin}} + U$
where ${E_{kin}}$ denotes the kinetic energy and $U$ denotes the potential energy.
In a molecular system, the total potential energy can be calculated by adding together the non-bonded potential energy ${U_{nb}}$, the bond stretching potential energy ${U_b}$, the angle bending potential energy ${U_\theta }$, the torsion angle potential energy ${U_\phi }$, the out-of-plane potential energy ${U_\chi }$ and the cross terms ${U_{cross}}$, which can be described by the formula below: $U = {U_{nb}} + {U_b} + {U_\theta } + {U_\phi } + {U_\chi } + {U_{cross}}$
The formula above is also called the force field in molecular dynamics theory. Many force fields exist, based on statistical thermodynamics and empirical results. For research on proteins and nucleic acids, the Amber force field is among the most widely used, so we choose Amber as our force field. Its functional form is shown below: \[\begin{array}{l} E = \sum\limits_{bond} {K_b}{({r_{ij}} - {r_0})^2} + \sum\limits_{angle} {K_\theta }{(\theta - {\theta _0})^2} + \sum\limits_{dihedral} \frac{{K_\phi }[1 + \cos (n\phi - {\phi _0})]}{2} \\ + \sum\limits_{impr} \frac{{K_\chi }[1 + \cos (n\chi - {\chi _0})]}{2} + \sum\limits_{nonbond} {\varepsilon _{ij}}\left[ {\left( \frac{R_{ij}^0}{R_{ij}} \right)^{12}} - 2{\left( \frac{R_{ij}^0}{R_{ij}} \right)^6} \right] + \sum\limits_{nonbond} \frac{{q_i}{q_j}}{R_{ij}} \end{array}\]
The terms in the formula refer, in order, to the bond stretching, angle bending, dihedral angle, improper dihedral (out-of-plane) angle, van der Waals interaction and Coulombic interaction terms.
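To make the two non-bonded terms concrete, here is a minimal Python sketch that evaluates them for a single atom pair. The function names, the example parameter values and the unit constant `k` are illustrative assumptions, not values taken from the Amber parameter files.

```python
def lennard_jones(r, eps, r0):
    """Van der Waals term in the 12-6 form used above:
    eps * [(r0/r)^12 - 2 * (r0/r)^6]. The minimum is -eps at r == r0."""
    ratio6 = (r0 / r) ** 6
    return eps * (ratio6 * ratio6 - 2.0 * ratio6)


def coulomb(r, qi, qj, k=1.0):
    """Coulombic term q_i * q_j / r. k is a unit-conversion constant,
    left at 1.0 here as an assumption."""
    return k * qi * qj / r


def nonbonded_pair_energy(r, eps, r0, qi, qj, k=1.0):
    """Sum of the two non-bonded terms for one atom pair at distance r."""
    return lennard_jones(r, eps, r0) + coulomb(r, qi, qj, k)


# Hypothetical pair: well depth 0.2, optimal distance 3.5, opposite charges.
print(nonbonded_pair_energy(3.5, eps=0.2, r0=3.5, qi=0.4, qj=-0.4))
```

In a full force field this pair energy would be summed over all non-bonded atom pairs, alongside the bonded terms.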
Now consider a system of molecules or atoms in classical mechanics. Each atom can be characterized as follows:
The position of atom $i$ is ${\vec r_i}$, its velocity is ${\vec v_i} = \frac{{d{{\vec r}_i}}}{{dt}}$ and its acceleration is ${\vec a_i} = \frac{{d{{\vec v}_i}}}{{dt}}$.
After integration, we get two formulas for ${\vec v_i}$ and ${\vec r_i}$: ${\vec v_i} = \vec v_i^0 + {\vec a_i}t$ and ${\vec r_i} = \vec r_i^0 + \vec v_i^0t + \frac{1}{2}{\vec a_i}{t^2}$
where $\vec v_i^0$ refers to the initial velocity and $\vec r_i^0$ refers to the initial position.
According to classical mechanics, the force applied to an atom is the negative gradient of the potential energy: ${\vec F_i} = - {\nabla _i}U = - \left( {\frac{\partial }{{\partial x}}\vec i + \frac{\partial }{{\partial y}}\vec j + \frac{\partial }{{\partial z}}\vec k} \right)U$
By Newton’s second law, the acceleration of the atom can also be described as: ${\vec a_i} = \frac{{{{\vec F}_i}}}{{{m_i}}}$
Combining all the formulas above fully describes the molecular dynamics process. Fig.1-2 shows the flow chart of MD as implemented in a program.
Fig.1-2 The flow chart of MD in program
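The loop in the flow chart can be sketched for a single particle in one dimension. This is only an illustration under stated assumptions: the integrator is velocity Verlet (a common choice; the text does not say which integrator was used), the force is obtained by a numerical derivative of the potential, and the harmonic potential stands in for a real force field.

```python
import math


def harmonic(x, k=1.0):
    """Toy potential U(x) = 0.5 * k * x^2 (stand-in for a force field)."""
    return 0.5 * k * x * x


def force(potential, x, h=1e-6):
    """F = -dU/dx, estimated with a central finite difference."""
    return -(potential(x + h) - potential(x - h)) / (2.0 * h)


def simulate(potential, x0, v0, mass, dt, n_steps):
    """Integrate Newton's equation of motion with velocity Verlet."""
    x, v = x0, v0
    a = force(potential, x) / mass              # a = F / m
    trajectory = [x]
    for _ in range(n_steps):
        x = x + v * dt + 0.5 * a * dt * dt      # position update
        a_new = force(potential, x) / mass      # force at the new position
        v = v + 0.5 * (a + a_new) * dt          # velocity update
        a = a_new
        trajectory.append(x)
    return trajectory, v


# One particle released at x = 1 with zero velocity, integrated to t = 10.
positions, final_v = simulate(harmonic, x0=1.0, v0=0.0, mass=1.0, dt=0.01, n_steps=1000)
print(positions[-1], math.cos(10.0))  # numerical vs analytical position
```

Real MD packages apply the same cycle (compute forces, update positions and velocities, repeat) to every atom in three dimensions.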
1.3 Logic Line
When choosing the Csy4 mutants, we take the four problems discussed before as the four keys, and we choose molecular dynamics as our main tool to look into our system at the atomic level. So what is the logic line between them?
We approach the mutant selection problem as follows:
What we know, and what has been proved by experiment, is that wild-type Csy4 works well with the miniToe system, which means that none of the key problems we discussed exists for wild-type Csy4. The wild-type Csy4 docks correctly with the miniToe structure and binds and cleaves it well, and finally the crRNA is released from the RBS. So we choose wild-type Csy4 as the standard, and every Csy4 mutant can be checked against the four key problems by comparison with wild-type Csy4.
So the four key problems turn into two main questions: how can the four key problems be described in mathematical form, and how can the wild type and a mutant be compared mathematically? These are discussed in the following two parts.
1.4 How to describe four key problems in mathematical form
In this part, we examine the molecular dynamics of wild-type Csy4 in the miniToe system to show how to describe the four key problems in mathematical form.
For the first problem, we define a matrix called the interaction matrix, which describes the likelihood of interaction between every amino acid of the protein and every nucleotide of the hairpin. This interaction matrix can be calculated by catRAPID graphic: we submit the wild-type Csy4 and the miniToe structure to the online service, and it returns the interaction matrix. Fig.1-3 is the heatmap of the interaction matrix for wild-type Csy4 in the hairpin region.
Fig.1-3 The heatmap of interaction matrix for wild-type Csy4.
The remaining three problems are addressed by molecular dynamics. Our molecular dynamics work is mostly based on the work of Jiří Šponer, with some differences.
To explore the remaining three key problems, we prepare two structures for molecular dynamics. The geometry of our miniToe system is based on the X-ray structure of the Csy4/RNA complex with the cleaved RNA (PDB ID: 4AL5, resolution 2.0 Å), shown in Fig.1-4.
Fig.1-4 the X-ray structure of Csy4/RNA complex
The first structure, called the precursor complex, is prepared for the second and third problems: the binding ability and the cleavage ability between Csy4 and the miniToe structure (hairpin). The precursor complex consists of two parts: wild-type Csy4 and the miniToe structure before cleavage. It describes the state after Csy4 has bound the hairpin but before it has cleaved the hairpin at the specific site. The Csy4 structure comes from the X-ray structure mentioned before, while the miniToe structure is built entirely by rational modeling: we put the sequence into mFold to generate the secondary structure of the hairpin, and the tertiary structure is then produced by RNAComposer. The molecular docking between Csy4 and the miniToe structure is carried out with PatchDock. The precursor complex of wild-type Csy4 is shown in Fig.1-5.
Fig.1-5. The precursor complex of wild-type Csy4
After obtaining the precursor complex, we prepare the simulation. Missing hydrogen atoms are added by PDBFixer, and the force field we use is amber ff98SB. The system is immersed in a rectangular TIP3P water box. After minimizing the energy of the protein/RNA system, we apply restraints to the RNA chain to make sure the structure does not become unreasonable when the temperature rises. All simulations run under periodic boundary conditions (PBC) in the NPT ensemble at 300 K and 1 atm. The time step is 2 fs and the total simulation time is 50 ns. Gromacs and OpenMM are the software packages we used most, on Ubuntu 16.04 and Windows 10. The hardware is an Intel i7 6700HQ with an NVIDIA GTX 960M (4 GB), which can simulate about 100-150 ns per day with GPU acceleration. The resulting trajectories are analyzed with PyMOL and MDAnalysis.
For the second problem, the simulation data give us the protein binding free energy, which describes the binding ability. We use the data in 30-50 ns to calculate it; the 20 ns of data at the beginning are discarded to make sure the structure is stable for the calculation. The result of binding free energy for wild-type Csy4 is .
For the third problem, the simulation data give us the distances of key interactions in the active site of Csy4, which describe the cleavage ability. Jiří Šponer points out several key interactions in the active site, including Ser148(OG)-G20(O2’), Ser150(OG)-G20(O3’) and Ser151(OG)-G20(N2’). Following Jiří Šponer’s work, we choose the Ser151(OG)-G20(N2’) distance as the mathematical form for the third problem. The distance curve of Ser151(OG)-G20(N2’) for wild-type Csy4 is shown in Fig.1-6. Our result is similar to that of Jiří Šponer’s work.
Fig.1-6. The distance of Ser151(OG)-G20(N2’) in wild-type Csy4
The second structure, called the product complex, is prepared for the fourth problem: is the crRNA released from the RBS? The product complex consists of two parts: wild-type Csy4 and the miniToe structure after cleavage. It describes the state after Csy4 has bound and cleaved the hairpin at the specific site. The Csy4 structure comes from the X-ray structure mentioned before, while the cleaved miniToe structure is built entirely by rational modeling: we put the two RNA sequences into SimRNAweb to dock the two RNA chains and generate the tertiary structure. The molecular docking between Csy4 and the miniToe structure is carried out with PatchDock. The product complex of wild-type Csy4 is shown in Fig.1-7.
Fig.1-7 The product complex of wild-type Csy4
We also explore the product complex by molecular dynamics following the protocol described before, but this time we only restrain the RBS chain, while the crRNA chain is free to move.
For the fourth problem, the simulation data give us the RMSD, which describes the structural movement of the crRNA chain, as the mathematical form. The RMSD is shown in Fig.1-8. The RMSD is unstable, which is consistent with the experimental observation that the crRNA is released from the RBS.
Fig.1-8 The RMSD of the product complex of wild-type Csy4
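As a reminder of what the RMSD curve measures, the sketch below computes the root-mean-square deviation of each trajectory frame from a reference frame. It assumes the frames are already superimposed on the reference (analysis tools usually do this alignment first) and uses hypothetical toy coordinates; it is not the analysis script used for Fig.1-8.

```python
import math


def rmsd(coords, reference):
    """RMSD between two equally sized lists of (x, y, z) atom positions.
    No superposition is performed: frames are assumed to be pre-aligned."""
    if len(coords) != len(reference):
        raise ValueError("both frames must contain the same number of atoms")
    total = 0.0
    for atom, ref_atom in zip(coords, reference):
        total += sum((a - b) ** 2 for a, b in zip(atom, ref_atom))
    return math.sqrt(total / len(coords))


def rmsd_series(trajectory, reference):
    """RMSD of every frame against the reference, i.e. one discrete curve."""
    return [rmsd(frame, reference) for frame in trajectory]


# Toy two-atom trajectory that drifts away from its first frame.
frames = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.0, 0.5, 0.0), (1.0, 0.5, 0.0)],
    [(0.0, 1.0, 0.0), (1.0, 1.0, 0.0)],
]
print(rmsd_series(frames, frames[0]))
```

A curve that keeps rising or fluctuating strongly, rather than reaching a plateau, is what the text calls an unstable RMSD.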
1.5 How to compare four mathematical forms between wild-type and mutants
We now examine the molecular dynamics of the miniToe system with a mutant Csy4 to show how to compare the four mathematical forms chosen before between wild-type Csy4 and Csy4 mutants. In the following description, we use the mutant Q104A as an example to show how the comparison is made.
We first use SwissModel to generate the tertiary structure with the template Csy4 (PDB ID: 4AL5, resolution 2.0 Å). Using molecular dynamics, we then obtain the four mathematical forms shown in Fig.1-9.
Fig.1-9 The four key problems in mathematical forms for Csy4-Q104A
Now we divide the four mathematical forms into two kinds of data: matrices and numerical values. The interaction matrix and the curves can be regarded as matrices because the curves are discrete, while the binding free energy is just a numerical value.
For the matrices, we use the Euclidean distance to describe the difference between two matrices: $D(p,{q^{WT}}) = \sqrt {\sum\limits_{i = 1}^m {\sum\limits_{j = 1}^n {{{({p_{i,j}} - q_{i,j}^{WT})}^2}} } } $ For the binding free energy, we use the formula below to calculate the difference between the wild type and the mutant: $\ln ({K_{drel}}) = \ln (\frac{{{K_{dWT}}}}{{{K_{dMUT}}}}) = {G_{binding}}$
According to the description above, we define four values used to compare the four key problems between a mutant and the wild type: ${D_1}$ (interaction matrix), $\ln ({K_{drel}})$ (binding free energy), ${D_3}$ (Ser151-G20 distance curve) and ${D_4}$ (RMSD).
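The Euclidean distance between a mutant matrix and the wild-type matrix can be sketched in a few lines of Python. The function name and the toy matrices are illustrative assumptions; a discrete curve is simply passed as a matrix with a single row.

```python
import math


def matrix_distance(p, q_wt):
    """Euclidean distance between two matrices of identical shape,
    given as lists of rows: sqrt(sum_ij (p_ij - q_ij)^2)."""
    if len(p) != len(q_wt) or any(len(a) != len(b) for a, b in zip(p, q_wt)):
        raise ValueError("matrices must have the same shape")
    total = 0.0
    for row_p, row_q in zip(p, q_wt):
        total += sum((a - b) ** 2 for a, b in zip(row_p, row_q))
    return math.sqrt(total)


# Toy 2x2 interaction matrices (hypothetical values).
wild_type = [[0.0, 0.0], [0.0, 0.0]]
mutant = [[3.0, 0.0], [0.0, 4.0]]
print(matrix_distance(mutant, wild_type))  # 5.0

# A discrete curve is treated as a 1 x n matrix.
print(matrix_distance([[1.0, 2.0, 3.0]], [[1.0, 2.0, 3.0]]))  # 0.0
```

By construction the wild type has distance 0 to itself, which is why the WT row of the comparison table is all zeros.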
For the mutant Q104A, the four values are shown in the following table:
Csy4-Mutant | ${D_1}$ | $\ln ({K_{drel}})$ | ${D_3}$ | ${D_4}$ |
---|---|---|---|---|
Q104A | 0.483 | 2483 | 9.48 | 30.82 |
1.6 Conclusion and Result
From the discussion in 1.4 and 1.5, we have the tools to express the four key problems in mathematical form and a method to compare mutants with the wild type. The next step is to find the mutants that fit our needs.
In order to save the computer resource, we only choose the top 10 mutants in to explore the molecular dynamics. And we mainly focus on the which describe the ability of cleavage and while the is an alternative value being considered. And for the we need limitation in the right border to make sure we will not choose the mutant which is totally inactive in cleavage. Luckily, we find that the Csy4-H29A is an inactive protein which has been proved. So we choose 13.41, which is the value for Csy4-H29A as the right border.
Csy4-Mutant | ${D_1}$ | $\ln ({K_{drel}})$ | ${D_3}$ | ${D_4}$ |
---|---|---|---|---|
WT | 0 | 0 | 0 | 0 |
Q104A | 0.483 | 2483 | 9.48 | 30.82 |
Y176F | 0.592 | -382 | 11.61 | 40.62 |
F155A | 0.233 | -1627 | 13.41 | 35.71 |
H29A | 0.173 | 833 | 15.29 | 316.22 |
2. Hairpin Mutation
2.1 The Large Mutant Library
We start with the hairpin we want to design. The hairpin used in the miniToe structure comes from the Repeat Area of the CRISPR type I-F system; it can be recognized and cleaved by Csy4, and it is also called the DR in the CRISPR system. Fig.2-1 shows a standard CRISPR array. The yellow line between two DRs is the guide RNA (gRNA), which is captured from foreign DNA.
Fig.2-1 the structure of CRISPR Array
Fig.2-2 The structure of miniToe
2.2 Pre-processing Algorithm
Combining bioinformatics and machine learning, we present an algorithm to pre-process our large mutant library. Fig.2-3 is the flow chart of the pre-processing algorithm.
Fig.2-3 The flow chart of this pre-processing algorithm
The first step of the pre-processing algorithm is to find all the Repeat Areas in the genomes as training input. We download as many genomes as possible from NCBI. With the help of PILER-CR, a bioinformatics tool for finding Repeat Areas, we can extract the Repeat Areas and CRISPR arrays from the genomes quickly. We focus only on Repeat Areas whose length is 28 or 29 bp, because research shows that the Repeat Areas of the CRISPR type I-F system are 28 or 29 bp long. In this first step, we downloaded about 5000 genomes and found 119 Repeat Areas of 28 or 29 bp. Fig.2-4 shows the 48 Repeat Areas of 28 bp that we found.
Fig.2-4 The Repeat Area which is 28bp
The second step of the pre-processing algorithm is to score the hairpins obtained in the first step. We create a score called the DR-Score to evaluate the quality of a Repeat Area compared to the wild-type hairpin. It is calculated as follows: \[{\rm{DR - Score}}_{PA}^i = {E_H} \bullet \sum\limits_n {\frac{{D{R_{size}}}}{{D{R_{size}} + {N_{mismatch}}}}} \]
This formula can be divided into two parts. The first part is ${E_H}$, the regularized Levenshtein distance between the evaluated hairpin and the wild-type hairpin, an index that describes sequence similarity in bioinformatics. The second part is the possibility value, $\sum\limits_n {\frac{{D{R_{size}}}}{{D{R_{size}} + {N_{mismatch}}}}}$, which describes the quality of the Repeat Area within the genome it belongs to. $D{R_{size}}$ refers to the length of the Repeat Area, which is 28 or 29 bp in this case. ${N_{mismatch}}$ refers to the number of mutated nucleotides in a Repeat Area of the CRISPR array compared to the common one. $n$ refers to the number of times the common Repeat Area appears in the CRISPR array.
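The Levenshtein part of the score can be illustrated with the standard dynamic-programming algorithm. The normalization used here, 1 minus the distance divided by the longer sequence length, is an assumption made for illustration; the text does not spell out how the distance is regularized.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn sequence a into sequence b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]


def normalized_similarity(a, b):
    """Assumed regularization: 1 - distance / max(len), so identical
    sequences score 1.0 and completely different ones score 0.0."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


print(levenshtein("ACGT", "ACGA"))            # 1
print(normalized_similarity("ACGT", "ACGA"))  # 0.75
```

With this convention a candidate hairpin close to the wild type scores near 1, and an unrelated sequence of similar length scores near 0.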
Now take the CRISPR locus in Fig.2-5 as an example to calculate the possibility value.
Fig.2-5 one CRISPR array comes from CRISPR database
The CRISPR array shown in Fig.2-5 has 9 Repeat Areas of 3 kinds:
DR1: ‘ACTGTACCATGCCTTACTTTGGATTCAAGGCAAAAC’,
DR2: ‘ACTGTACCATGCCTGATTTTGGATTCGAGGCAAAAC’,
DR3: ‘ACTGTACCATGCCTTACTTTGGATTCAAGTAAATCG’,
The first Repeat Area is the most common, while the other two carry some mutations. The 3 possibility values are listed below:
\[DR1{\rm{ - PossibilityValue}} = \sum\limits_n {\frac{{D{R_{size}}}}{{D{R_{size}} + {N_{mismatch}}}}} = \sum\limits_9 {\frac{{36}}{{36 + 0}}} = 9\]
\[DR2{\rm{ - PossibilityValue}} = \sum\limits_n {\frac{{D{R_{size}}}}{{D{R_{size}} + {N_{mismatch}}}}} = \sum\limits_9 {\frac{{36}}{{36 + 3}}} \approx 8.31\]
\[DR3{\rm{ - PossibilityValue}} = \sum\limits_n {\frac{{D{R_{size}}}}{{D{R_{size}} + {N_{mismatch}}}}} = \sum\limits_9 {\frac{{36}}{{36 + 5}}} \approx 7.90\]
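The three possibility values can be reproduced with a short Python sketch. It follows the worked example literally: mismatches are counted position by position against the most common repeat (DR1), and the sum runs over the 9 repeats of the array. The function names are illustrative.

```python
def count_mismatches(repeat, common_repeat):
    """Number of positions at which two equally long repeats differ."""
    if len(repeat) != len(common_repeat):
        raise ValueError("repeats must have the same length")
    return sum(1 for a, b in zip(repeat, common_repeat) if a != b)


def possibility_value(repeat, common_repeat, n_repeats):
    """Sum over n_repeats of DR_size / (DR_size + N_mismatch)."""
    size = len(repeat)
    mismatches = count_mismatches(repeat, common_repeat)
    return n_repeats * size / (size + mismatches)


DR1 = "ACTGTACCATGCCTTACTTTGGATTCAAGGCAAAAC"
DR2 = "ACTGTACCATGCCTGATTTTGGATTCGAGGCAAAAC"
DR3 = "ACTGTACCATGCCTTACTTTGGATTCAAGTAAATCG"

for name, repeat in (("DR1", DR1), ("DR2", DR2), ("DR3", DR3)):
    value = possibility_value(repeat, DR1, n_repeats=9)
    print(name, count_mismatches(repeat, DR1), round(value, 2))
```

DR2 and DR3 differ from DR1 at 3 and 5 positions respectively, which gives the values 8.31 and 7.90 listed above.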
It is easy to understand why the DR-Score is divided into two parts. We need the Levenshtein distance to describe sequence similarity because Repeat Areas of 28 or 29 bp are not unique to the CRISPR type I-F system, so we need it to filter out hairpins that are entirely wrong. We also need the possibility value to describe the quality of the hairpin we get, and it is built on the following assumption: the more often a repeat occurs, the better it is.
If the same Repeat Area occurs in different species, we use the weighted mean to calculate the DR-Score. Assume that there are ${n_i}$ repeats in the $i$-th of $j$ species; the DR-Score can then be calculated by the formula below: $DR{\rm{ - }}Score = \sum\limits_{i = 1}^j {\left( {\frac{{{n_i}}}{{\sum\limits_{m = 1}^j {{n_m}} }}DR{\rm{ - }}Scor{e_i}} \right)} $
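The weighted mean over species can be sketched as follows. The counts and per-species scores in the example are hypothetical and only show how the weights behave.

```python
def weighted_dr_score(counts, scores):
    """Weighted mean of per-species DR-Scores, where the weight of each
    species is its share of the total number of repeats."""
    if len(counts) != len(scores):
        raise ValueError("one count and one score are needed per species")
    total = sum(counts)
    if total == 0:
        raise ValueError("at least one repeat is required")
    return sum(n / total * score for n, score in zip(counts, scores))


# Hypothetical case: the repeat occurs once in species 1 (score 4.0)
# and three times in species 2 (score 8.0).
print(weighted_dr_score([1, 3], [4.0, 8.0]))  # 7.0
```

The species that contributes more repeats pulls the combined score toward its own value.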
The third step of the pre-processing algorithm is training the SVM model. The SVM (support vector machine) is a machine learning algorithm used for classification and regression. It can capture a complex relationship between input and output with the help of a kernel function, and it has been used successfully to predict promoter strength. In this step we choose the sequence of the Repeat Area as input and the DR-Score as output.
Before training the SVM model, we first convert the sequence into a mathematical representation using the following method.
The original sequence data, which is coded by ‘A’, ‘G’, ‘C’ and ‘T’, can be transformed into a matrix by the formula below: \[x = \left\{ \begin{array}{ll} \left[ {1,0,0,0} \right] & se{q_i} = A\\ \left[ {0,1,0,0} \right] & se{q_i} = G\\ \left[ {0,0,1,0} \right] & se{q_i} = C\\ \left[ {0,0,0,1} \right] & se{q_i} = T \end{array} \right.\]
For example, the sequence ‘AGCTA’ is transformed into the matrix: \[\left[ \begin{array}{cccc} 1&0&0&0\\ 0&1&0&0\\ 0&0&1&0\\ 0&0&0&1\\ 1&0&0&0 \end{array} \right]\]
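The encoding rule can be written as a small lookup table in Python. The names `ONE_HOT` and `encode_sequence` are illustrative; the column order A, G, C, T follows the formula above.

```python
# One row per base, columns in the order A, G, C, T as defined above.
ONE_HOT = {
    "A": [1, 0, 0, 0],
    "G": [0, 1, 0, 0],
    "C": [0, 0, 1, 0],
    "T": [0, 0, 0, 1],
}


def encode_sequence(sequence):
    """Turn a DNA string into a len(sequence) x 4 one-hot matrix."""
    try:
        return [list(ONE_HOT[base]) for base in sequence.upper()]
    except KeyError as error:
        raise ValueError(f"unexpected base: {error.args[0]}") from None


def flatten(matrix):
    """Concatenate the rows into one feature vector, a common input
    shape for regression models (an assumption, not stated in the text)."""
    return [value for row in matrix for value in row]


for row in encode_sequence("AGCTA"):
    print(row)
```

A 28 bp Repeat Area therefore becomes a 28 x 4 matrix, or a vector of 112 values after flattening.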
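After processing the sequence data, we divide the data into two parts, training data and test data, and then train the model.
The complex relationship between the Repeat sequence matrix and its DR-Score can be mapped by an SVM regression function $f(x)$. To achieve our goals, the SVM model is constructed following Vapnik et al. The performance of the SVM model is evaluated by the squared correlation coefficient (${R^2}$): \[{R^2} = \frac{{{{\left( {n\sum\limits_{i = 1}^n {f({x_i}){y_i}} - \sum\limits_{i = 1}^n {f({x_i})} \sum\limits_{i = 1}^n {{y_i}} } \right)}^2}}}{{\left( {n\sum\limits_{i = 1}^n {f{{({x_i})}^2}} - {{\left( {\sum\limits_{i = 1}^n {f({x_i})} } \right)}^2}} \right)\left( {n\sum\limits_{i = 1}^n {y_i^2} - {{\left( {\sum\limits_{i = 1}^n {{y_i}} } \right)}^2}} \right)}}\] where $f({x_i})$ is the predicted DR-Score of sample $i$, ${y_i}$ is its actual DR-Score and $n$ is the number of samples.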
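The squared correlation coefficient used to evaluate the model can be computed directly from the predicted and actual scores. The sketch below implements the standard squared Pearson correlation from running sums; the example numbers are hypothetical and are not results from our model.

```python
def squared_correlation(predicted, actual):
    """Squared Pearson correlation between predictions f(x_i) and targets y_i."""
    if len(predicted) != len(actual) or len(predicted) < 2:
        raise ValueError("need two equally long lists with at least 2 values")
    n = len(predicted)
    sum_f = sum(predicted)
    sum_y = sum(actual)
    sum_fy = sum(f * y for f, y in zip(predicted, actual))
    sum_ff = sum(f * f for f in predicted)
    sum_yy = sum(y * y for y in actual)
    numerator = (n * sum_fy - sum_f * sum_y) ** 2
    denominator = (n * sum_ff - sum_f ** 2) * (n * sum_yy - sum_y ** 2)
    if denominator == 0:
        raise ValueError("correlation is undefined for constant input")
    return numerator / denominator


# Hypothetical predictions versus actual DR-Scores.
print(squared_correlation([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # 1.0
print(squared_correlation([1.0, 2.0, 3.0], [1.0, 3.0, 2.0]))  # 0.25
```

A value close to 1 on the held-out test data indicates that the predicted scores track the actual scores closely.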