Difference between revisions of "Team:OUC-China/miniToe Family"

 
(45 intermediate revisions by 3 users not shown)
Line 12: Line 12:
  
 
#bodyContent{
 
#bodyContent{
margin:0;
+
margin:0;  
 
padding:0;     
 
padding:0;     
 
position:absolute;
 
position:absolute;
Line 148: Line 148:
 
<section class="box features">
 
<section class="box features">
 
<h2 class="major"><span>miniToe Family</span></h2>
 
<h2 class="major"><span>miniToe Family</span></h2>
+
<br/>
 +
In the miniToe family, both the protein and the hairpin were mutated to achieve different levels of regulation. In this part, the model helps us design the mutants. Importantly, we used different strategies to design Csy4 and the hairpin: molecular dynamics played an important role in designing the protein mutants, while bioinformatics and machine learning helped us find the hairpin mutants of interest.
 +
<br/><br/>
 
<p>
 
<p>
  
<h3>1. Enzymes Mutation</h3>
+
<h3 id='wer'>1. Enzyme Mutation</h3>
<h4>1.1 The Four Keys in miniToe System</h4>
+
<h4>1.1 The Four Key Points in the miniToe System</h4>
<br />  The wet lab members give us four important sites, Gln104, Tyr176, Phe155, His29, which play important roles in binding and cleavage in protein Csy4. Considering 20 kinds of amino acids, we have 80 mutants to explore and choose if we only have one site mutated.
+
<br />  The wet lab members gave us four important sites, Gln104, Tyr176, Phe155 and His29, which play important roles in RNA binding and cleavage by the protein Csy4. With 20 kinds of amino acids possible at each site, there are up to 4 × 20 = 80 single-site variants to explore and choose from (counting the wild-type residue at each site).
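<br /><br />The size of this single-site search space can be checked with a short enumeration. The sketch below is only illustrative: the function name and the label format (such as Q104A) are assumptions, and it simply crosses the four sites named above with the 20 standard amino acids.

```python
# One-letter codes of the 20 standard amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# The four key sites named in the text, as (wild-type residue, position).
KEY_SITES = [("Q", 104), ("Y", 176), ("F", 155), ("H", 29)]


def single_site_variants(sites=KEY_SITES, residues=AMINO_ACIDS,
                         include_wild_type=True):
    """Return labels such as 'Q104A' for every residue choice at every site."""
    variants = []
    for wild_type, position in sites:
        for residue in residues:
            # Optionally skip the "mutation" that keeps the wild-type residue.
            if residue == wild_type and not include_wild_type:
                continue
            variants.append(f"{wild_type}{position}{residue}")
    return variants


all_variants = single_site_variants()
true_mutants = single_site_variants(include_wild_type=False)
print(len(all_variants), len(true_mutants))
```

Counting every residue choice gives the 80 variants quoted above; excluding the four wild-type residues leaves 76 true substitutions.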
<br /><br />Before designing the protein mutants, we first looking into the working process of miniToe structure to find the most important keys in our system.
+
<br /><br />Before designing the protein mutants, we looked into the working process of miniToe to find the most important key points in our system.
 
<div align="center"><img src="https://static.igem.org/mediawiki/2018/b/bd/T--OUC-China--mf1.jpg" height="450">
 
<div align="center"><img src="https://static.igem.org/mediawiki/2018/b/bd/T--OUC-China--mf1.jpg" height="450">
 
</div>
 
</div>
 
<div align="center"><p>Fig.1-1 The working process of miniToe system</p></div>
 
<div align="center"><p>Fig.1-1 The working process of miniToe system</p></div>
 
   All the reactions that happen in our first system, miniToe, can be described chronologically by the following five main steps[1]:
 
   All the reactions that happen in our first system, miniToe, can be described chronologically by the following five main steps[1]:
<br /> <br />(1)The miniToe structure is produced and accumulated.
+
<br /> <br />(1)MiniToe is produced and accumulated.
<br /> (2)The Csy4 is produced with IPTG induced.
+
<br /> (2)Csy4 is produced upon IPTG induction.
<br /> (3)The Csy4 binds to the miniToe structure and form the Csy4-miniToe complex
+
<br /> (3)Csy4 binds to the miniToe structure and forms Csy4-miniToe complex.
<br /> (4)The Csy4 cleave the special site and divide the miniToe structure into two parts: the Csy4-crRNA complex and the mRNA of sfGFP.
+
<br /> (4)Csy4 cleaves the miniToe structure at the specific site into two parts: the Csy4-crRNA (cis-repressive RNA) complex and the mRNA encoding sfGFP.
 
<br /> (5)The sfGFP is produced.
 
<br /> (5)The sfGFP is produced.
<br /><br />  From the description above, we can get four key problems in our system to make sure that our system can work successfully:
+
<br /><br />  Four key points in our miniToe system determine whether the system can work successfully:
<br /><br /> (1)Does the Csy4 dock correctly with the miniToe structure (hairpin)?
+
<br /><br /> (1)Does Csy4 dock correctly with the miniToe structure?
<br /> (2)How about the binding ability between the Csy4 and miniToe structure (hairpin)?
+
<br /> (2)How strong is the binding between Csy4 and the miniToe structure?
<br /> (3)How about the cleavage ability between the Csy4 and miniToe structure (hairpin)?
+
<br /> (3)How efficient is the cleavage of the miniToe structure by Csy4?
 
<br /> (4)Is the cis-repressive RNA released from the RBS?
 
<br /> (4)Is the cis-repressive RNA released from the RBS?
<br /><br />  The most impressive way to explore four problems is to model our system at the atom level by molecular dynamics[2]. And there are lots of work in exploring the Csy4-RNA complex by molecular dynamics[3][4][5]. <br /><br />
+
<br /><br />  A powerful way to explore these four points is to model our system at the atomic level by molecular dynamics[2]. There is also a lot of existing work exploring the Csy4-crRNA complex by molecular dynamics[3][4][5]. <br /><br />
 
<h4>1.2 Molecular Dynamics</h4>
 
<h4>1.2 Molecular Dynamics</h4>
  <br /> Molecular dynamics (MD)[6] is a computer simulation method for studying the physical movements of atoms and molecules. The atoms and molecules are allowing to interact for a fixed period of time, giving a view of the dynamic evolution of the system. In the most common version, the trajectories of atoms and molecules are determined by numerically solving Newton's equations of motion for a system of interacting particles, whose forces between the particles and their potential energies are often calculated using interatomic potentials or molecular mechanics force fields.
+
  <br /> Molecular dynamics (MD)[6] is a computer simulation method for studying the physical movements of atoms and molecules. The atoms and molecules are allowed to interact for a fixed period of time, giving a view of the dynamic evolution of the system. In the most common version, the trajectories of atoms and molecules are determined by numerically solving Newton's equations of motion for a system of interacting particles, where the forces between the particles and their potential energies are often calculated using interatomic potentials or molecular mechanics force fields.
<br /><br />To a system which consists of molecule or atoms, the total energy of a system includes kinetic energy and potential energy,which can be described by the formula below: <br /><br />
+
<br /><br />For a system of molecules or atoms, the total energy consists of kinetic energy and potential energy, which can be described by the formula below: <br /><br />
 
 
 
<div align="center">
 
<div align="center">
Line 185: Line 187:
 
</math></dir></div>
 
</math></dir></div>
  
<br />where the E<sub>kin</sub> donates the kinetic energy and the U donates the potential energy.  
+
<br />where the E<sub>kin</sub> denotes the kinetic energy and the U denotes the potential energy.  
 
<br /><br />In a molecular system, the total potential energy can be calculated by adding together the non-bonded potential energy U<sub>nb</sub>, the bond stretching potential energy U<sub>b</sub>, the angle bending potential energy U<sub>&#x03B8;</sub>, the torsion angle potential energy U<sub>&#x03D5;</sub>, the out-of-plane potential energy U<sub>&#x03C7;</sub> and other cross effects U<sub>cross</sub>, which can also be described by the formula below: <br /><br />
 
<br /><br />In a molecular system, the total potential energy can be calculated by adding together the non-bonded potential energy U<sub>nb</sub>, the bond stretching potential energy U<sub>b</sub>, the angle bending potential energy U<sub>&#x03B8;</sub>, the torsion angle potential energy U<sub>&#x03D5;</sub>, the out-of-plane potential energy U<sub>&#x03C7;</sub> and other cross effects U<sub>cross</sub>, which can also be described by the formula below: <br /><br />
 
 
Line 222: Line 224:
 
</div>
 
</div>
  
<br /><br />This formula above also called the force field in molecular dynamics’ theory. There are many force field in the world that based on the statistical thermodynamics and empirical result. In the research of the protein and nucleic acid, the Amber force field is one of the best force field in the world. So we choose Amber as our force field and the formula of this field show below:
+
<br /><br />The formula above is also called the force field in the theory of molecular dynamics. Many force fields exist, based on statistical thermodynamics and empirical results. For the study of proteins and nucleic acids, the Amber force field is one of the most widely used, so we chose Amber as our force field. Its functional form is shown below:
\[\begin{array}{l}
E = \sum\limits_{bond} {{K_b}{{({r_{ij}} - {r_0})}^2}}  + \sum\limits_{angle} {{K_\theta }{{(\theta  - {\theta _0})}^2}}  + \sum\limits_{dihedral} {\frac{{{K_\phi }[1 + \cos (n\phi  - {\phi _0})]}}{2}} \\
+ \sum\limits_{impr} {\frac{{{K_\chi }[1 + \cos (n\chi  - {\chi _0})]}}{2}}  + \sum\limits_{nonbond} {{\varepsilon _{ij}}\left[ {{{\left( {\frac{{R_{ij}^0}}{{{R_{ij}}}}} \right)}^{12}} - 2{{\left( {\frac{{R_{ij}^0}}{{{R_{ij}}}}} \right)}^6}} \right]}  + \sum\limits_{nonbond} {\frac{{{q_i}{q_j}}}{{{R_{ij}}}}}
\end{array}\]
+
 
<br /><br /> The terms in the formula refer, in order, to the bond stretching, angle bending, dihedral angle, improper dihedral (out-of-plane) angle, van der Waals interaction and Coulombic interaction potentials.
 
<br /><br /> The terms in the formula refer, in order, to the bond stretching, angle bending, dihedral angle, improper dihedral (out-of-plane) angle, van der Waals interaction and Coulombic interaction potentials.
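<br /><br />To make the individual terms concrete, the sketch below evaluates a few of them for a single bond, dihedral or atom pair, exactly in the form written above. It is a minimal illustration, not the Amber implementation: the function and parameter names are assumptions, units are left to the caller, and the Coulomb term omits any electrostatic constant, as the formula above does.

```python
import math


def bond_energy(k_b, r, r0):
    """Harmonic bond stretching term: K_b * (r - r0)^2."""
    return k_b * (r - r0) ** 2


def dihedral_energy(k_phi, n, phi, phi0):
    """Periodic dihedral term: K_phi * [1 + cos(n*phi - phi0)] / 2 (angles in radians)."""
    return 0.5 * k_phi * (1.0 + math.cos(n * phi - phi0))


def lennard_jones_energy(epsilon, r_min, r):
    """12-6 van der Waals term: eps * [(R0/R)^12 - 2*(R0/R)^6]."""
    ratio6 = (r_min / r) ** 6
    return epsilon * (ratio6 ** 2 - 2.0 * ratio6)


def coulomb_energy(q_i, q_j, r):
    """Coulombic term q_i*q_j / R, with the electrostatic constant left out."""
    return q_i * q_j / r


# At R = R0 the 12-6 term reaches its minimum value of -epsilon.
lj_at_minimum = lennard_jones_energy(epsilon=0.2, r_min=3.5, r=3.5)
print(lj_at_minimum)
```

Written this way, the van der Waals term has its minimum of -ε at R = R<sub>0</sub>, which is why the factor 2 appears in front of the sixth-power term.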
  <br /><br /> Now considering the system which contains of consists of N molecule or atoms in the classic mechanics, the atom or molecule can be characterized as follow:
+
  <br /><br /> Consider a system of N molecules or atoms in classical mechanics. Each atom or molecule can be characterized as follows:
                                 <br /><br />The position of atom i is r<sub>i</sub> , the speed of atom i is v<sub>i</sub> , the acceleration of atom  i is a.
+
                                 <br /><br />The position of atom i is r<sub>i</sub>, the velocity of atom i is v<sub>i</sub>, and the acceleration of atom i is a<sub>i</sub>.
 
<br /><br />After integration, we obtain two formulas for v<sub>i</sub> and r<sub>i</sub>: <br />
 
<br /><br />After integration, we obtain two formulas for v<sub>i</sub> and r<sub>i</sub>: <br />
 
<br />
 
<br />
Line 415: Line 413:
 
 
 
<h4>1.3 Logic Line</h4>
 
<h4>1.3 Logic Line</h4>
<br />  When choosing the Csy4 mutants, we choose the four problems which discuss before as the four keys, and we choose the molecular dynamics as our main tools to look into our system in atom level. So what’s is the logic line between them?  
+
<br />  When choosing Csy4 mutants, we considered the four points discussed before, and we chose molecular dynamics as our main tool to study the system at the atomic level. What is the logic line between them?
<br /><br />We considering the mutants choosing-problem by following description:
+
<br /><br />What we know and proved by the experiment is that the wild-type Csy4 with the miniToe system is working well, which means that all the important key problems we discussion did not exist in the wild-type Csy4. The wild-type Csy4 can dock correctly with the miniToe structure and the Csy4 have a good ability to bind and cleave the miniToe structure, finally the crRNA release from the RBS. So we choose the wild-type Csy4 as a standard, and all the Csy4 mutant can check the four key problems by comparing to wild-type Csy4.
+
<br /><br /> So the four key problems transform into two main problems: how to describe four key problems and compare between the wild-type and mutant in mathematical? That’s what we discuss in the following two parts.
+
  
<br /><br />
 
  
  <h4>1.4 How to describe four key problems in mathematical form </h4>
+
 
<br /><br />We are going to check the molecular dynamics of wild-type in the miniToe system to show you how to describe four key problems in mathematical form in this part.
+
<br /><br />We considered the problem of choosing mutants as follows:
<br /><br />For the first problem, we define a matrix called interaction matrix which can describe the interaction possibility between every amino acid of the protein and every nucleic acid of the hairpin, this interaction matrix can be calculated by the catRAPID graphic[7]. We submit the wild-type Csy4 and miniToe structure respectively to online service and it can return us the interaction matrix. The Fig.1-3 is the heat map of the interaction matrix for wild-type Csy4 in the hairpin region. <div align="center"><img src="https://static.igem.org/mediawiki/2018/d/d7/T--OUC-China--mf3.jpg" height="450">
+
 
 +
 
 +
<br />
 +
In experiments, strain-Csy4-Wide Type works well, which confirms that all four key points discussed above are satisfied for wild-type Csy4: the Csy4 in strain-Csy4-Wide Type docks correctly with the miniToe structure, it binds and cleaves the miniToe structure well, and the cis-repressive RNA is finally released from the RBS. We therefore chose strain-Csy4-Wide Type as the reference against which the four key points can be checked in all Csy4 mutants.
 +
 
 +
<br /><br /> This leaves two main problems: how to describe the four key points mathematically, and how to compare wild-type Csy4 with mutant Csy4. We discuss these in the following two parts.
 +
 
 +
<br /><br />
 +
<br />
 +
  <h4 id='JC1'>1.4 How to describe four key points in mathematical form </h4>
 +
<br />In this part, we examine the molecular dynamics of wild-type Csy4 in the miniToe system to show how to describe the four key points in mathematical form.
 +
<br /><br />For the first point, we defined a matrix called the interaction matrix. It describes the interaction propensity between every amino acid of the protein and every nucleotide of the hairpin, and it can be calculated by catRAPID graphic[7]. We submitted the wild-type Csy4 and the miniToe structure to the online service, which returned the interaction matrix. Fig.1-3 is the heat map of the interaction matrix for wild-type Csy4 in the hairpin region. <div align="center"><img src="https://static.igem.org/mediawiki/2018/d/d7/T--OUC-China--mf3.jpg" height="450">
 
</div>
 
</div>
<div align="center"><p>Fig.1-3 The heatmap of interaction matrix for wild-type Csy4.</p></div> <br /><br />And the rest three problems is solved by the molecular dynamics. The work of molecular dynamics is mostly based on the Jiří Šponer’s work[8], but still something different.
+
<div align="center"><p>Fig.1-3 The heatmap of interaction matrix for wild-type Csy4.</p></div> <br />The remaining three points are addressed by molecular dynamics. Some of our molecular dynamics work is based on Jiří Šponer’s work[8], but most of it was done by us.
<br /><br />In order to explore the rest three key problems, we prepare two structure for the molecular dynamics. And the geometries of our miniToe system is based on the X-ray structure of Csy4/RNA complex with the cleaved RNA (PDB ID: 4AL5, resolution 2.0 A), it can be seen in the Fig.1-4.
+
<br /><br />In order to explore the remaining three key points, we prepared two structures for molecular dynamics. The geometry of our miniToe system is based on the X-ray structure of the Csy4/RNA complex with the cleaved RNA (PDB ID: 4AL5, resolution 2.0 Å), which can be seen in Fig.1-4.
 
<div align="center"><img src="https://static.igem.org/mediawiki/2018/5/5f/T--OUC-China--mf4.jpg" height="450">
 
<div align="center"><img src="https://static.igem.org/mediawiki/2018/5/5f/T--OUC-China--mf4.jpg" height="450">
 
</div>
 
</div>
 
<div align="center"><p>Fig.1-4 the X-ray structure of Csy4/RNA complex</p></div>
 
<div align="center"><p>Fig.1-4 the X-ray structure of Csy4/RNA complex</p></div>
<br />The first structure called precursor complex is prepared for the second and third problem: how about the binding ability and cleavage ability between the Csy4 and miniToe structure (hairpin)? The precursor complex consists of two part: wild-type Csy4 and miniToe structure that before cleaved. It describes the structure in the period the after the Csy4 binding to hairpin but didn’t cleave the hairpin in the special site. The Csy4 structure is coming from the X-ray structure we mentioned before while the miniToe structure is constructed totally by the rational model: we put the sequence into the mFold[9] to generate the secondary structure of hairpin then the tertiary structure is produced by the RNAComposer[10]. The molecular docking between Csy4 and miniToe structure is carried out by PatchDock[11]. The precursor complex of wild-type Csy4 can seem in the Fig.1-5
+
<br />The first structure, called the precursor complex, is prepared for the second and third points: how well does Csy4 bind and cleave the miniToe structure? The precursor complex consists of two parts: wild-type Csy4 and the uncleaved miniToe structure. It describes the state after Csy4 has bound to the hairpin but before it has cleaved the hairpin at the specific site. The Csy4 structure originates from the X-ray structure mentioned above, while the miniToe structure is constructed entirely by rational modeling: we put the sequence into mFold[9] to generate the secondary structure of the hairpin, and the tertiary structure is then produced by RNAComposer[10]. The molecular docking between Csy4 and the miniToe structure is carried out by PatchDock[11]. The precursor complex of wild-type Csy4 can be seen in Fig.1-5.
 
<div align="center"><img src="https://static.igem.org/mediawiki/2018/a/ae/T--OUC-China--mf5.jpg" height="300">
 
<div align="center"><img src="https://static.igem.org/mediawiki/2018/a/ae/T--OUC-China--mf5.jpg" height="300">
 
</div>
 
</div>
 
<div align="center"><p>Fig.1-5. The precursor complex of wild-type Csy4</p></div>
 
<div align="center"><p>Fig.1-5. The precursor complex of wild-type Csy4</p></div>
<br />   After getting the precursor complex, we begin to prepare for the simulation. Missing hydrogen atoms were added by PDBFixer[12], Forcefield we used is amber ff98SB[13]. The system is immersed in a rectangular TIP3P water box. After minimizing the energy of Protein/RNA system, we give some restriction to the RNA chain to make sure that the structure will not become an unreasonable structure when the temperature rises. All the reaction runs with the PBC condition under 300K and 1 atm in the NPT. The time step is 2 fs while the total simulation time is 50 ns. Gromacs[14] and OpenMM[15] are the most common software we used in Ubuntu 16.04 and Window10. The equipment we used is Intel i7 6700HQ with the NVIDIA GTX 960M 4G, it can simulate about 100-150 ns per day under the GPU acceleration. And the trajectories result is analyzed by Pymol[16] and MDAnalysis[17].  
+
<br />   Having obtained the structure of the precursor complex, we began to prepare for the simulation. Missing hydrogen atoms were added by PDBFixer[12], and the force field we used is amber ff98SB[13]. The system is immersed in a rectangular TIP3P water box. After minimizing the energy of the protein/RNA system, we applied restraints to the RNA chain so that the structure does not become unreasonable when the temperature rises. All simulations run with periodic boundary conditions (PBC) at 300 K and 1 atm in the NPT ensemble. The time step is 2 fs and the total simulation time is 50 ns. Gromacs[14] and OpenMM[15] are the software packages we used most, on Ubuntu 16.04 and Windows 10. The hardware we used is an Intel i7 6700HQ with an NVIDIA GTX 960M 4G, which can simulate about 100-150 ns per day with GPU acceleration. The resulting trajectories are analyzed with Pymol[16] and MDAnalysis[17].
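<br /><br />The simulation settings above imply a fixed number of integration steps and a rough wall-clock cost per run. The following is a small worked calculation using only the numbers quoted in the text; the variable and function names are illustrative.

```python
TIME_STEP_FS = 2          # integration time step, in femtoseconds
TOTAL_TIME_NS = 50        # total simulated time, in nanoseconds
FS_PER_NS = 1_000_000     # femtoseconds per nanosecond

# Number of integration steps needed for one 50 ns trajectory.
n_steps = TOTAL_TIME_NS * FS_PER_NS // TIME_STEP_FS


def wall_clock_hours(simulated_ns, ns_per_day):
    """Hours of wall-clock time for a run at a given throughput (ns/day)."""
    return 24.0 * simulated_ns / ns_per_day


# Throughput quoted in the text: about 100-150 ns per day.
slowest_hours = wall_clock_hours(TOTAL_TIME_NS, 100)
fastest_hours = wall_clock_hours(TOTAL_TIME_NS, 150)
print(n_steps, slowest_hours, fastest_hours)
```

Each 50 ns run therefore takes 25 million steps and roughly 8 to 12 hours on the hardware described.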
<br /><br />  For the second problem, what we can get from the simulation data is protein binding free energy to describe the ability of binding. We use the data in 30-50 ns to calculate it, the 20 ns data in the beginning is aborted to make sure the structure is smooth when calculating. The result of binding free energy for wild-type Csy4 is <math>
+
<br /><br />  For the second point, what we can get from the simulation data is the binding free energy, which describes the capacity of binding. The data from the beginning of the trajectory (20 ns) are discarded, and we use the data in 30-50 ns, so that the structure has equilibrated when the calculation is made. The result of binding free energy for wild-type Csy4 is <math>
 
  <mrow>
 
  <mrow>
 
   <mo>&#x25B3;</mo><msub>
 
   <mo>&#x25B3;</mo><msub>
Line 450: Line 456:
 
</math>
 
</math>
 
  .
 
  .
<br /><br />  For the third problem, what we can get from the simulation data is some significant distance of key interaction in the active site of Csy4 to describe the ability of cleavage. Jiří Šponer[8] points out some important key interactions of the active site including Ser148(OG)-G20(O2’)、Ser150(OG)-G20(O3’)、Ser151(OG)-G20(N2’). By exploring Jiří Šponer’s work, we finally choose the Ser151(OG)-G20(N2’) as our mathematical form in the third problem. The distance curve of Ser151(OG)-G20(N2’) for wild-type Csy4 can be seen in Fig.1-6. We get the similar result comparing to Jiří  Šponer’s workp[8]. <div align="center"><img src="https://static.igem.org/mediawiki/2018/0/05/T--OUC-China--mf6.jpg" height="450">
+
<br /><br />  For the third point, what we can get from the simulation data is the distance of some key interactions in the active site of Csy4, which describes the capacity of cleavage. Jiří Šponer[8] points out some important key interactions of the active site, including Ser148(OG)-G20(O2’), Ser150(OG)-G20(O3’) and Thr151(OG)-G20(N2’). Based on Jiří Šponer’s work, we finally chose Thr151(OG)-G20(N2’) as our mathematical form for the third point. The distance curve of Thr151(OG)-G20(N2’) for wild-type Csy4 can be seen in Fig.1-6. We get a result similar to that of Jiří Šponer’s work[8]. <div align="center"><img src="https://static.igem.org/mediawiki/2018/0/05/T--OUC-China--mf6.jpg" height="450">
 
</div>
 
</div>
<div align="center"><p>Fig.1-6. The distance of Ser151(OG)-G20(N2’) in wild-type Csy4</p></div><br /><br />   The second structure called product structure is prepared for the fourth problem: does crRNA release from the RBS. The product complex consists of two part: wild-type Csy4 and miniToe structure that after cleaved. It describes the structure in the period the after the Csy4 binding and cleaving the hairpin in the special site. The Csy4 structure is coming from the X-ray structure we mentioned before while the miniToe structure which is cleaved constructed totally by the rational model: we put two RNA sequence into the SimRNAweb[18] to finish the molecular docking of two chains RNA and generate the tertiary structure. And the molecular docking between Csy4 and miniToe structure is carried out by PatchDock[11]. The product complex of wild-type Csy4 can seem in the Fig.1-7
+
<div align="center"><p>Fig.1-6. The distance of Thr151(OG)-G20(N2’) in wild-type Csy4</p></div><br /><br />   The second structure, called the product complex, is prepared for the fourth point: is the cis-repressive RNA released from the RBS? The product complex consists of two parts: wild-type Csy4 and the miniToe structure after cleavage. It describes the state after Csy4 has bound and cleaved the hairpin at the specific site. The Csy4 structure comes from the X-ray structure mentioned above, while the cleaved miniToe structure is constructed entirely by rational modeling: we put the two RNA sequences into SimRNAweb[18] to dock the two RNA chains and generate the tertiary structure. The molecular docking between Csy4 and the miniToe structure is carried out by PatchDock[11]. The product complex of wild-type Csy4 can be seen in Fig.1-7.
 
<div align="center"><img src="https://static.igem.org/mediawiki/2018/7/7a/T--OUC-China--mf7.jpg" height="300">
 
<div align="center"><img src="https://static.igem.org/mediawiki/2018/7/7a/T--OUC-China--mf7.jpg" height="300">
 
</div>
 
</div>
 
<div align="center"><p>Fig.1-7 The product complex of wild-type Csy4</p></div>
 
<div align="center"><p>Fig.1-7 The product complex of wild-type Csy4</p></div>
<br /><br />  We also explore the product complex by molecular dynamics follow the protocol mentioned before, but this time we only set the restriction to RBS chain while the crRNA chain is free in moving.
+
<br /><br />  We also explored the product complex by molecular dynamics following the protocol mentioned above, but this time we only set restraints on the RBS chain, while the crRNA chain is free to move.
<br /><br />For the fourth problem, what we can get from the simulation data is the RMSD describing the structure movement for the crRNA chain to be the mathematical form. We can see the RMSD in Fig.1-8. The RMSD is unstable which give an explanation to experiment that crRNA is release from RBS.
+
<br /><br />For the fourth point, what we can get from the simulation data is the RMSD, which describes the movement of the crRNA chain in mathematical form. We can see the RMSD in Fig.1-8. The RMSD is unstable, which explains the experimental observation that the crRNA is released from the RBS.
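<br /><br />As a reminder of what the RMSD curve measures, the sketch below computes the root-mean-square deviation of each frame from a reference structure. It is a simplified illustration with hypothetical names and toy coordinates: it assumes the frames are already aligned to the reference and does no superposition or mass weighting, which analysis packages normally handle.

```python
import math


def rmsd(frame, reference):
    """RMSD between two lists of (x, y, z) coordinates of equal length."""
    if len(frame) != len(reference):
        raise ValueError("frame and reference must contain the same atoms")
    total = 0.0
    for (x, y, z), (xr, yr, zr) in zip(frame, reference):
        total += (x - xr) ** 2 + (y - yr) ** 2 + (z - zr) ** 2
    return math.sqrt(total / len(reference))


def rmsd_curve(trajectory, reference):
    """RMSD of every frame in a trajectory against the same reference."""
    return [rmsd(frame, reference) for frame in trajectory]


# Toy data: two atoms, two frames; the second frame is shifted by (0, 3, 4).
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
trajectory = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.0, 3.0, 4.0), (1.0, 3.0, 4.0)],
]
curve = rmsd_curve(trajectory, reference)
print(curve)
```

A chain that stays bound gives a curve that levels off, whereas a chain that drifts away from its starting position gives a curve that keeps changing.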
 
<div align="center"><img src="https://static.igem.org/mediawiki/2018/b/b9/T--OUC-China--mf8.jpg" height="450">
 
<div align="center"><img src="https://static.igem.org/mediawiki/2018/b/b9/T--OUC-China--mf8.jpg" height="450">
 
</div>
 
</div>
Line 463: Line 470:
  
 
<h4>1.5 How to compare four mathematical forms between wild-type and mutants</h4>
 
<h4>1.5 How to compare four mathematical forms between wild-type and mutants</h4>
<br />We are going to check the molecular dynamics of the miniToe system with the mutant Csy4 to show you how to compare the four mathematical forms we choose before between the wild-type Csy4 and Csy4 mutants. In the following description, we give an example using the mutant Q104A to show you how to make comparing.
+
<br />In this part, we examine the molecular dynamics of the miniToe system with Csy4 mutants to show how to compare the four mathematical forms chosen above between wild-type Csy4 and Csy4 mutants. In the following description, we use the mutant Q104A as an example of how to make the comparison.
<br /><br />We first use the SwissModel[19] to generate the tertiary structure with the template Csy4 (PDB ID: 4AL5, resolution 2.0 A). And using the molecular dynamics, we can get the four mathematical form showing in the Fig.1-9. <div align="center"><img src="https://static.igem.org/mediawiki/2018/a/a3/T--OUC-China--mf9.jpg" height="600">
+
<br /><br />First, we use SwissModel[19] to generate the tertiary structure of the mutant with Csy4 as the template (PDB ID: 4AL5, resolution 2.0 Å). Using molecular dynamics, we then obtain the four mathematical forms shown in Fig.1-9. <div align="center"><img src="https://static.igem.org/mediawiki/2018/a/a3/T--OUC-China--mf9.jpg" height="600">
 
</div>
 
</div>
<div align="center"><p>Fig.1-9 The four key problems in mathematical forms for Csy4-Q104A</p></div> <br /><br />   Now we divided the four curves into two kind of data: the matrix and the numerical value. The interaction matrix and the curve can be regard as matrix because the curve is discrete, and the binding free energy is just an numerical value.
+
<div align="center"><p>Fig.1-9 The four key points in mathematical forms for Csy4-Q104A</p></div> <br /><br />   We now divide the four mathematical forms into two kinds of data: matrices and numerical values. The interaction matrix and the curves can be regarded as matrices because the curves are discrete, while the binding free energy is a single numerical value.
<br /> <br /> For the matrix we can use Euclidean distance to describe the difference between two matric: <br /><br />
+
<br /> <br /> For matrices, we can use the Euclidean distance to describe the difference between two matrices: <br /><br />
 
<div align="center">
 
<div align="center">
 
<math>
 
<math>
Line 519: Line 526:
 
</math>
 
</math>
 
</div>
 
</div>
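<br />A minimal sketch of such a matrix distance is given below. It assumes the usual element-wise definition (the square root of the summed squared differences) and uses a hypothetical function name; a discrete curve can be passed as a matrix with a single row.

```python
import math


def euclidean_distance(matrix_a, matrix_b):
    """Element-wise Euclidean distance between two equally shaped matrices."""
    if len(matrix_a) != len(matrix_b):
        raise ValueError("matrices must have the same number of rows")
    total = 0.0
    for row_a, row_b in zip(matrix_a, matrix_b):
        if len(row_a) != len(row_b):
            raise ValueError("matrices must have the same number of columns")
        for a, b in zip(row_a, row_b):
            total += (a - b) ** 2
    return math.sqrt(total)


# Toy matrices standing in for a wild-type and a mutant interaction matrix.
wild_type = [[0.0, 0.0], [0.0, 0.0]]
mutant = [[3.0, 0.0], [0.0, 4.0]]
distance = euclidean_distance(wild_type, mutant)
print(distance)
```

A distance of zero means the mutant reproduces the wild-type matrix or curve exactly, and larger values mean a larger deviation from the wild type.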
  <br />For the free bind ing energy, we used the formula below to calculate the difference between the wild type and mutant[20]: <br /><br />
+
  <br />For the binding free energy, we can use the formula below to calculate the difference between the wild type and the mutant[20]: <br /><br />
 
<div align="center">
 
<div align="center">
 
   <math>
 
   <math>
Line 551: Line 558:
 
   </mrow>
 
   </mrow>
 
</math>
 
</math>
</div> <br /><br />  According to description above, we define four value used to compare four key problems between mutant and wild-type:
+
</div> <br /><br />  According to the descriptions above, we defined four values used to compare the four key points between the mutant and the wild type:
 
<math>
 
<math>
 
  <mrow>
 
  <mrow>
Line 585: Line 592:
 
   <mn>3</mn>
 
   <mn>3</mn>
 
   </msub>
 
   </msub>
   <mo stretchy='false'>(</mo><mi>S</mi><mi>e</mi><mi>r</mi><mn>151</mn><mo>&#x2212;</mo><mi>G</mi><mn>20</mn><mover>
+
   <mo stretchy='false'>(</mo><mi>T</mi><mi>h</mi><mi>r</mi><mn>151</mn><mo>&#x2212;</mo><mi>G</mi><mn>20</mn><mover>
 
   <mrow></mrow>
 
   <mrow></mrow>
 
   <mrow></mrow>
 
   <mrow></mrow>
Line 601: Line 608:
 
</math>
 
</math>
 
.
 
.
<br /><br />   For the mutant Q104A, the four is showing in the following chart
+
<br /><br />   For the mutant Q104A, the four values are shown in the following table:
 
<table width="200" border="1">
 
<table width="200" border="1">
 
   <tbody>
 
   <tbody>
Line 655: Line 662:
 
<br /><br />
 
<br /><br />
 
<h4>1.6 Conclusion and Result</h4>
 
<h4>1.6 Conclusion and Result</h4>
<br /><br />  From the discussion in 1.4 and 1.5, we have the tools to evaluate the four key problems in mathematical form and find the method to compare between mutant and wild-type. The next step is to find out the mutant which fit our needs.
+
<br /><br />  From the discussions in 1.4 and 1.5, we have the tools to evaluate the four key points in mathematical form and a method to compare mutants with the wild type. The next step is to find the mutants that meet our needs.
   <br /><br /> In order to save the computer resource, we only choose the top 10  mutants in D<sub>1</sub> to explore the molecular dynamics. And we mainly focus on the D<sub>3</sub> which describe the ability of cleavage while the D<sub>4</sub> and ln(K<sub>rel</sub>) is an alternative value being considered. And for the D<sub>3</sub> we need limitation in the right border to make sure we will not choose the mutant which is totally inactive in cleavage. Luckily, we find that the Csy4-H29A is an inactive protein which has been proved[21]. So we choose 13.41, which is the D<sub>3</sub> value for Csy4-H29A as the right border.  
+
   <br /><br /> We chose the top 10 mutants in D<sub>1</sub> for molecular dynamics. We mainly focused on D<sub>3</sub>, which describes the capacity of cleavage, while D<sub>4</sub> and ln(K<sub>rel</sub>) are considered as secondary values. The value of D<sub>3</sub> needs an upper limit to make sure that a mutant which is totally inactive in cleavage is not chosen. Fortunately, Csy4-H29A has been proved to be an inactive protein[21], so we chose 13.41, the D<sub>3</sub> value of Csy4-H29A, as the threshold.
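<br /><br />The selection rule described above can be sketched as a two-stage filter. This is only an illustration under stated assumptions: the function name, the dictionary layout and all scores except the 13.41 threshold are hypothetical, and the ranking direction for D<sub>1</sub> is left as a parameter because the text does not say whether "top" means the smallest or the largest values.

```python
D3_THRESHOLD = 13.41  # D3 of the inactive Csy4-H29A, taken from the text


def select_candidates(scores, top_n=10, d3_threshold=D3_THRESHOLD,
                      smallest_d1_first=True):
    """Shortlist mutants by D1, then keep those with D3 below the threshold."""
    ranked = sorted(scores, key=lambda name: scores[name]["D1"],
                    reverse=not smallest_d1_first)
    shortlist = ranked[:top_n]
    return [name for name in shortlist if scores[name]["D3"] < d3_threshold]


# Made-up scores for illustration only; they are not results from the model.
toy_scores = {
    "MUT_A": {"D1": 1.0, "D3": 5.0},
    "MUT_B": {"D1": 2.0, "D3": 20.0},
    "H29A": {"D1": 3.0, "D3": 13.41},
    "MUT_C": {"D1": 4.0, "D3": 1.0},
}
selected = select_candidates(toy_scores, top_n=3)
print(selected)
```

In this toy run MUT_C is dropped at the shortlist stage, while MUT_B and H29A are dropped because their D<sub>3</sub> is not below the threshold.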
 
   <br />  
 
   <br />  
According to all the things we have discussed, the five mutants were chosen in the following table: <br />
+
According to the discussion above, five mutants were chosen, as listed in the following table: <br />
 
     <table width="200" border="1">
 
     <table width="200" border="1">
 
   <tbody>
 
   <tbody>
Line 745: Line 752:
 
<p>
 
<p>
  
<h3>2. Hairpin Mutation</h3><br />
+
<h3 id='werr'>2. Hairpin Mutation</h3><br />
 
<h4>2.1 The Large Mutant Library</h4>
 
<h4>2.1 The Large Mutant Library</h4>
<br /><br /> Starting with the hairpin we want to design. We know that the hairpin used in miniToe structure is coming from the Repeat Area in CRISPR type I-F system, it can be recognized and cleaved by Csy4, and it also called DR in CRISPR system. The Fig.2-1 shows a stander CRISPR array. The yellow line between two DR is guide RNA(gRNA)which is caught from foreign DNA[22].
+
<br /><br /> We start with the hairpins we want to design. The hairpin used in the miniToe structure comes from the Repeat Area of the CRISPR type I-F system. It can be recognized and cleaved by Csy4, and it is also called the DR in the CRISPR system. Fig.2-1 shows a standard CRISPR array. The yellow line between two DRs is the guide RNA (gRNA), which is captured from foreign DNA[22].
 
<div align="center"><img src="https://static.igem.org/mediawiki/2018/8/8f/T--OUC-China--mf21.jpg" height="150"> </div>
 
<div align="center"><img src="https://static.igem.org/mediawiki/2018/8/8f/T--OUC-China--mf21.jpg" height="150"> </div>
 
<div align="center"><p>Fig.2-1 the structure of CRISPR Array</p></div>
 
<div align="center"><p>Fig.2-1 the structure of CRISPR Array</p></div>
When design the hairpin mutations, it is impossible to use the MD method to find out which hairpin can fit our need because the mutation library is to large. In our miniTioe structure in Fig.2-2, except for the two important nucleic acids----G20 and C21, the mutant library is about. If we still use molecular dynamics to explore all the hairpin mutations, too much time and computer resource is wasted.  
+
When designing the hairpin mutations, it is impossible to use the MD method to find out which hairpins meet our needs because the mutation library is too large. In our miniToe structure in Fig.2-2, excluding the two important bases, G20 and C21, the mutant library contains about 4<sup>20</sup> sequences. Exploring all the hairpin mutations with molecular dynamics would waste too much time and too many computing resources.  
 
<div align="center"><img src="https://static.igem.org/mediawiki/2018/d/db/T--OUC-China--mf22.jpg" height="400"> </div>
 
<div align="center"><img src="https://static.igem.org/mediawiki/2018/d/db/T--OUC-China--mf22.jpg" height="400"> </div>
 
<div align="center"><p>Fig.2-2 The structure of miniToe</p></div>
 
<div align="center"><p>Fig.2-2 The structure of miniToe</p></div>
 
<h4>2.2 Pre-processing Algorithm</h4>
 
<h4>2.2 Pre-processing Algorithm</h4>
<br />  Combining the bioinformatics and machine learning, we present an algorithm to pre-processing our big mutation library. Fig.2-3 is the flow chart of the pre-processing algorithm. <br /><br />
+
<br />  Combining the bioinformatics and machine learning, we present a pre-processing algorithm for our big mutation library. Fig.2-3 is the flow chart of the pre-processing algorithm. <br /><br />
 
<div align="center"><img src="https://static.igem.org/mediawiki/2018/1/1f/T--OUC-China--mf23.jpg" height="450"> </div>
 
<div align="center"><img src="https://static.igem.org/mediawiki/2018/1/1f/T--OUC-China--mf23.jpg" height="450"> </div>
<div align="center"><p>Fig.2-3 The flow chart of this pre-processing algorithm</p></div><br /><br /> The first step of pre-processing algorithm is to find all the repeat area in the genome as training input. We download the genome as much as possible from NCBI. With the help of PILER-CR[23], which is an algorithm used to find the Repeat Area in bioinformatics, we can get the Repeat Area and CROSPR array from genome quickly. And we only focus on the Repeat Area whose length are 28 and 29 bp because there are some research showing that the length of Repeat Area in CRISPR type I-F system are 28 and 29 bp[24]. In the first step, we download about 5000 genomes and find out 119 Repeat Areas which are 28 and 29 bp. The Fig.2-4 shows the 48 Repeat Areas whose length is 28 bp we find.
+
<div align="center"><p>Fig.2-3 The flow chart of this pre-processing algorithm</p></div><br /><br /> The first step of the pre-processing algorithm is to find all the Repeat Areas in the genomes as training input. We downloaded as many genomes as possible from NCBI. With the help of PILER-CR[23], an algorithm used to find Repeat Areas, we can quickly get the Repeat Areas and CRISPR arrays from the genomes. We only focus on Repeat Areas whose lengths are 28 or 29 bp, because research shows that the Repeat Areas in the CRISPR type I-F system are 28 or 29 bp long[24]. In this first step, we downloaded about 5000 genomes and found 119 Repeat Areas of 28 or 29 bp. Fig.2-4 shows the 48 Repeat Areas of 28 bp that we found.
 
<div align="center"><img src="https://static.igem.org/mediawiki/2018/c/c9/T--OUC-China--mf24.jpg" height="1200"> </div>
 
<div align="center"><img src="https://static.igem.org/mediawiki/2018/c/c9/T--OUC-China--mf24.jpg" height="1200"> </div>
 
<div align="center"><p>Fig.2-4 The Repeat Area which is 28bp</p></div>  
 
<div align="center"><p>Fig.2-4 The Repeat Area which is 28bp</p></div>  
The second step of pre-processing algorithm is to scoring the hairpin we get in the first step. We create a score called DR-Score to evaluate the quality of Repeat area comparing to the wild-type hairpin. The calculation method is below:
+
The second step of pre-processing algorithm is to score the hairpin we get in the first step. We created a score called DR-Score to evaluate the quality of Repeat area comparing to the wild-type hairpin. The calculation method is below:
 
<math display='block'>
 
<math display='block'>
 
  <mrow>
 
  <mrow>
Line 784: Line 791:
 
   <mrow>
 
   <mrow>
 
     <mfrac>
 
     <mfrac>
    <mrow>
 
      <mi>D</mi><msub>
 
      <mi>R</mi>
 
      <mrow>
 
        <mtext>size</mtext></mrow>
 
      </msub>
 
      </mrow>
 
 
     <mrow>
 
     <mrow>
 
       <mi>D</mi><msub>
 
       <mi>D</mi><msub>
Line 803: Line 803:
 
       </msub>
 
       </msub>
 
       </mrow>
 
       </mrow>
    </mfrac>
 
    </mrow>
 
  </mstyle></mrow>
 
</math>
 
 
 
<br />This formula can be divided into two parts. The first part is E<sub>H</sub> , regularized Levenshetein Distance value between evaluated hairpin and wild-type hairpin, which is an index to describe sequence similarity in bioinformatics. The second part is possibility value,<math display='block'>
 
<mrow>
 
  <mstyle displaystyle='true'>
 
  <munder>
 
    <mo>&#x2211;</mo>
 
    <mtext>n</mtext>
 
  </munder>
 
  <mrow>
 
    <mfrac>
 
 
     <mrow>
 
     <mrow>
 
       <mi>D</mi><msub>
 
       <mi>D</mi><msub>
Line 823: Line 808:
 
       <mrow>
 
       <mrow>
 
         <mtext>size</mtext></mrow>
 
         <mtext>size</mtext></mrow>
      </msub>
 
      </mrow>
 
    <mrow>
 
      <mi>D</mi><msub>
 
      <mi>R</mi>
 
      <mrow>
 
        <mtext>size</mtext></mrow>
 
      </msub>
 
      <mtext>+</mtext><msub>
 
      <mi>N</mi>
 
      <mrow>
 
        <mtext>mismatcn</mtext></mrow>
 
 
       </msub>
 
       </msub>
 
       </mrow>
 
       </mrow>
Line 841: Line 814:
 
   </mstyle></mrow>
 
   </mstyle></mrow>
 
</math>
 
</math>
  which describe the quality of Repeat area in the genome it belongs on its own. DR<sub>size</sub> refers to the length of Repeat Area, which is 28 or 29 bp in this case. The N<sub>mismatch</sub> refers to the number of mutated nucleic acids in Repeat Area in the CRISPR array comparing to the common one. The n refers to the times that the common Repeat Areas apereas in the CRISPR array.
+
 
<br /><br />  Now take the CRISPR loci in Fig2-5 as an example to calculated the possibility value. <div align="center"><img src="https://static.igem.org/mediawiki/2018/9/95/T--OUC-China--mf25.jpg" height="200"> </div>
+
 
 +
 +
<br />This formula can be divided into two parts. The first part is E<sub>H</sub>, the regularized Levenshtein distance between the evaluated hairpin and the wild-type hairpin, which is an index of sequence similarity in bioinformatics. The second part is the possibility value, \(\sum\limits_{\text{n}} {\frac{{D{R_{{\text{size}}}}{\text{ + }}{N_{{\text{mismatch}}}}}}{{D{R_{{\text{size}}}}}}} \), which describes the quality of a Repeat Area within its own genome. DR<sub>size</sub> refers to the length of the Repeat Area, which is 28 or 29 bp in this case. N<sub>mismatch</sub> refers to the number of mutated nucleotides in a Repeat Area of the CRISPR array compared to the common one. n refers to the number of times that the common Repeat Area appears in the CRISPR array.
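The Levenshtein distance behind E<sub>H</sub> can be sketched as follows. This is a minimal illustration, not the team's implementation: the function names are hypothetical, and dividing by the longer sequence length is only one possible reading of "regularized", since the exact normalization is not given here.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion from a
                               current[j - 1] + 1,       # insertion into a
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def normalized_levenshtein(a, b):
    """Distance scaled to [0, 1]; dividing by the longer length is an assumption."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0
```

With this convention, identical sequences score 0 and completely different sequences of equal length score 1.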
 +
<br /><br />  Now take the CRISPR loci in Fig2-5 as an example to calculate the possibility value. <div align="center"><img src="https://static.igem.org/mediawiki/2018/9/95/T--OUC-China--mf25.jpg" height="200"> </div>
 
<div align="center"><p>Fig.2-5 one CRISPR array comes from CRISPR database</p></div>
 
<div align="center"><p>Fig.2-5 one CRISPR array comes from CRISPR database</p></div>
 
 
Line 867: Line 843:
 
       <mrow>
 
       <mrow>
 
         <mtext>size</mtext></mrow>
 
         <mtext>size</mtext></mrow>
 +
      </msub>
 +
      <mtext>+</mtext><msub>
 +
      <mi>N</mi>
 +
      <mrow>
 +
        <mtext>mismatch</mtext></mrow>
 
       </msub>
 
       </msub>
 
       </mrow>
 
       </mrow>
Line 874: Line 855:
 
       <mrow>
 
       <mrow>
 
         <mtext>size</mtext></mrow>
 
         <mtext>size</mtext></mrow>
      </msub>
 
      <mtext>+</mtext><msub>
 
      <mi>N</mi>
 
      <mrow>
 
        <mtext>mismatcn</mtext></mrow>
 
 
       </msub>
 
       </msub>
 
       </mrow>
 
       </mrow>
Line 890: Line 866:
 
   <mrow>
 
   <mrow>
 
     <mfrac>
 
     <mfrac>
    <mrow>
 
      <mn>36</mn></mrow>
 
 
     <mrow>
 
     <mrow>
 
       <mn>36</mn><mtext>+</mtext><mn>0</mn></mrow>
 
       <mn>36</mn><mtext>+</mtext><mn>0</mn></mrow>
 +
    <mrow>
 +
      <mn>36</mn></mrow>
 
     </mfrac>
 
     </mfrac>
 
     </mrow>
 
     </mrow>
Line 914: Line 890:
 
       <mrow>
 
       <mrow>
 
         <mtext>size</mtext></mrow>
 
         <mtext>size</mtext></mrow>
 +
      </msub>
 +
      <mtext>+</mtext><msub>
 +
      <mi>N</mi>
 +
      <mrow>
 +
        <mtext>mismatch</mtext></mrow>
 
       </msub>
 
       </msub>
 
       </mrow>
 
       </mrow>
Line 921: Line 902:
 
       <mrow>
 
       <mrow>
 
         <mtext>size</mtext></mrow>
 
         <mtext>size</mtext></mrow>
      </msub>
 
      <mtext>+</mtext><msub>
 
      <mi>N</mi>
 
      <mrow>
 
        <mtext>mismatcn</mtext></mrow>
 
 
       </msub>
 
       </msub>
 
       </mrow>
 
       </mrow>
Line 937: Line 913:
 
   <mrow>
 
   <mrow>
 
     <mfrac>
 
     <mfrac>
    <mrow>
 
      <mn>36</mn></mrow>
 
 
     <mrow>
 
     <mrow>
 
       <mn>36</mn><mtext>+</mtext><mn>3</mn></mrow>
 
       <mn>36</mn><mtext>+</mtext><mn>3</mn></mrow>
 +
    <mrow>
 +
      <mn>36</mn></mrow>
 
     </mfrac>
 
     </mfrac>
 
     </mrow>
 
     </mrow>
   </mstyle><mo>&#x2248;</mo><mn>8.31</mn></mrow>
+
   </mstyle><mo>&#x2248;</mo><mn>1.083</mn></mrow>
 
</math>
 
</math>
 
</div>
 
</div>
Line 961: Line 937:
 
       <mrow>
 
       <mrow>
 
         <mtext>size</mtext></mrow>
 
         <mtext>size</mtext></mrow>
 +
      </msub>
 +
      <mtext>+</mtext><msub>
 +
      <mi>N</mi>
 +
      <mrow>
 +
        <mtext>mismatch</mtext></mrow>
 
       </msub>
 
       </msub>
 
       </mrow>
 
       </mrow>
Line 968: Line 949:
 
       <mrow>
 
       <mrow>
 
         <mtext>size</mtext></mrow>
 
         <mtext>size</mtext></mrow>
      </msub>
 
      <mtext>+</mtext><msub>
 
      <mi>N</mi>
 
      <mrow>
 
        <mtext>mismatcn</mtext></mrow>
 
 
       </msub>
 
       </msub>
 
       </mrow>
 
       </mrow>
Line 984: Line 960:
 
   <mrow>
 
   <mrow>
 
     <mfrac>
 
     <mfrac>
 +
    <mrow>
 +
      <mn>36</mn><mtext>+</mtext><mn>5</mn></mrow>
 
     <mrow>
 
     <mrow>
 
       <mn>36</mn></mrow>
 
       <mn>36</mn></mrow>
    <mrow>
 
      <mn>36</mn><mtext>+</mtext><mn>5</mn></mrow>
 
 
     </mfrac>
 
     </mfrac>
 
     </mrow>
 
     </mrow>
   </mstyle><mo>&#x2248;</mo><mn>7.90</mn></mrow>
+
   </mstyle><mo>&#x2248;</mo><mn>1.14</mn></mrow>
 
</math>
 
</math>
</div>
+
</div>
+
<br /><br />  It is quite easy to understand why the DR-Score is divided into two part. We need the Levenshtein Distance to describe the sequence similarly because not just the length of Repeat Area in CRISPR type I-F system is 28 and 29 bp[26], we need it to distinguish the hairpins which is totally faulted. And we also need the possibility value to describe the quality of hairpins we get, and it is constructed by the following assumptions: the more you occur, the better you are.
<br /><br />  It is quite easy to understand why the DR-Score will divide into two part. We need the Levenshtein Distance to describe the sequence similarity because not just the length of Repeat Area in CRISPR type I-F system is 28 and 29 bp[26], we need it to distinguish the hairpin which is totally faulted. And we also need the possibility value to describe the quality of hairpin we get, and it is constructed by the following assumption: the more you occur, the better you are.
+
<br /><br /> And If the same Repeat Areas occurs in j different species,we use the weight means to calculate the DR-Score. Just assume that there are n<sub>i</sub> repeats in the in the i th species, the DR-Score can be calculated by the formula below: <br /><br /> <div align="center"> <math>
<br /><br /> And If the same Repeat Areas occur in j different species,we use the weight means to calculate the DR-Score. Just assume that there are n<sub>i</sub> repeats in the in the i th species, the DR-Score can be calculated by the formula below: <br /><br /> <div align="center"> <math>
+
 
  <mrow>
 
  <mrow>
 
   <mi>D</mi><mi>R</mi><mo>&#x2212;</mo><mi>S</mi><mi>c</mi><mi>o</mi><mi>r</mi><mi>e</mi><mo>=</mo><mstyle displaystyle='true'>
 
   <mi>D</mi><mi>R</mi><mo>&#x2212;</mo><mi>S</mi><mi>c</mi><mi>o</mi><mi>r</mi><mi>e</mi><mo>=</mo><mstyle displaystyle='true'>
Line 1,039: Line 1,014:
 
</div>
 
</div>
 
 
<br /><br />   The third step of pre-processing algorithm is training the SVM model. The SVM (support vector machine) is a machine learning algorithm which is used to classification and regression[25]. It can construct the complex relationship between input and output with the help of kernel function. And it has been successfully used in predicting the strength of promoter. In this step, we choose the sequence of Repeat Area as input and the DR-Score as output.  
+
<br /><br />   The third step of the pre-processing algorithm is training the SVM model. The SVM (support vector machine) is a machine learning algorithm used for classification and regression[25]. With the help of a kernel function, it can model complex relationships between input and output, and it has been successfully used to predict promoter strength. In this step, we chose the sequence of the Repeat Area as input and the DR-Score as output.  
  <br /><br />  Before we training the SVM model, we should first change the sequence into mathematical representation using the following method.
+
  <br /><br />  Before training the SVM model, we first convert the sequence into a mathematical representation using the following method.
 
<br /> The original sequence data Seq='seq<sbu>1</sub>seq<sbu>2</sub>...seq<sbu>n-1</sub>seq<sbu>n</sub>'  which is coded by 'A', 'G', 'C'and 'T' can be transformed into the matrix x by the formula below:
 
<br /> The original sequence data Seq='seq<sbu>1</sub>seq<sbu>2</sub>...seq<sbu>n-1</sub>seq<sbu>n</sub>'  which is coded by 'A', 'G', 'C'and 'T' can be transformed into the matrix x by the formula below:
 
<br />
 
<br />
Line 1,252: Line 1,227:
  
 
<br /> For example, the ‘AGCTA’ can be transformed into the matrix: [1,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1,1,0,0,0] .
 
<br /> For example, the ‘AGCTA’ can be transformed into the matrix: [1,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1,1,0,0,0] .
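The encoding in this example can be sketched in a few lines. The column order A, G, C, T is inferred from the 'AGCTA' example above; the function name is hypothetical and the error handling is an illustrative choice.

```python
BASES = "AGCT"  # column order implied by the 'AGCTA' example in the text


def one_hot(sequence):
    """Flatten a DNA sequence into a 0/1 vector with four entries per base."""
    vector = []
    for base in sequence.upper():
        if base not in BASES:
            raise ValueError(f"unexpected base: {base!r}")
        vector.extend(1 if base == candidate else 0 for candidate in BASES)
    return vector


encoded = one_hot("AGCTA")
print(encoded)
```

A 28 bp Repeat Area therefore becomes a vector of 112 entries, and a 29 bp one a vector of 116 entries.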
  <br /><br /> After processing the sequence data, we divide the data into two part: training data and test data, then try to training the model.
+
  <br /><br /> After processing the sequence data, we divided the data into two parts, training data and test data, and then trained the model.
<br /><br />The complex relationship between the Repeats sequence matrix x and its DR-Score y can be mapped by an SVM regression function y=f(x). In order to achieve our goals, the SVM model is constructed followed Vapnik et al[25]:
+
<br /><br />The complex relationship between the Repeat sequence matrix x and its DR-Score y can be mapped by an SVM regression function y=f(x). To achieve our goals, the SVM model is constructed following Vapnik et al[25]:
 
<br /> <br />The performance of the SVM model is evaluated by two Squared correlation coefficient (R<sup>2</sup>):
 
<br /> <br />The performance of the SVM model is evaluated by two Squared correlation coefficient (R<sup>2</sup>):
 
<br />
 
<br />
Line 1,388: Line 1,363:
 
   </mrow>
 
   </mrow>
 
</math>
 
</math>
<br /> Where the f(x<sub>i</sub>) and y<sub>i</sub> are the values of the prediction and of DR-score of the Repeat. Many kernel functions have been tested including polynomial function, the sigmoid function, and Gaussian radial basis function (RBF). The RBF is the best choice of our model finally. Fig.2-6 shows the training result by giving the R<sup>2</sup> of training data and test data which both have high R<sup>2</sup>.
+
<br /> Here f(x<sub>i</sub>) and y<sub>i</sub> are the predicted value and the DR-Score of the Repeat, respectively. Many kernel functions were tested, including the polynomial function, the sigmoid function and the Gaussian radial basis function (RBF). The RBF was finally the best choice for our model. Fig.2-6 shows the training result: the R<sup>2</sup> values of the training data and the test data are both high.
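As a rough illustration of how such an R<sup>2</sup> can be computed from predictions and DR-Scores, the sketch below uses the squared Pearson correlation, which is the form commonly reported for SVM regression. That this matches the formula used here is an assumption, and the function name and sample numbers are hypothetical.

```python
def squared_correlation(predicted, observed):
    """Squared Pearson correlation between predictions f(x_i) and targets y_i."""
    n = len(predicted)
    if n != len(observed) or n < 2:
        raise ValueError("need two equally long sequences with at least 2 values")
    sum_f = sum(predicted)
    sum_y = sum(observed)
    sum_ff = sum(f * f for f in predicted)
    sum_yy = sum(y * y for y in observed)
    sum_fy = sum(f * y for f, y in zip(predicted, observed))
    numerator = (n * sum_fy - sum_f * sum_y) ** 2
    denominator = (n * sum_ff - sum_f ** 2) * (n * sum_yy - sum_y ** 2)
    if denominator == 0:
        raise ValueError("R^2 is undefined when either sequence is constant")
    return numerator / denominator


# Made-up numbers, only to show the call.
r2 = squared_correlation([1, 2, 3, 4], [1, 3, 2, 4])
print(r2)
```

A value close to 1 on both the training and the test data indicates that the model reproduces the ranking and spread of the DR-Scores; a large gap between the two would suggest overfitting.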
 
 
 
<div align="center"><img src="https://static.igem.org/mediawiki/2018/5/5f/T--OUC-China--mf26.jpg" height="300"> </div>
 
<div align="center"><img src="https://static.igem.org/mediawiki/2018/5/5f/T--OUC-China--mf26.jpg" height="300"> </div>
 
<div align="center"><p>Fig.2-6 The training result</p></div>
 
<div align="center"><p>Fig.2-6 The training result</p></div>
  The last step of pre-processing algorithm is to evaluated the big mutation library using the SVM model we train in the third step before. And by the model we can give a score to the hairpin then
+
  The last step of the pre-processing algorithm is to evaluate the big mutation library using the SVM model trained in the third step. With this model we can give a score to each hairpin.
 
<br /><br />
 
<br /><br />
 
<h4>2.3 Re-Check and Result</h4>
 
<h4>2.3 Re-Check and Result</h4>
<br />  After we pre-processing the big mutant library, we choose 10 hairpins whose score is high but also in gradient, which can make sure that we can achieve the goal that have different regulation level. Then we re-check the four key problems to  
+
<br />  After pre-processing the big mutant library, we chose 10 hairpins whose scores are high but also form a gradient, which ensures that we can achieve different expression levels. Then we re-check the four key points to  
  <br /> All in all, we finally choose five hairpin mutants in the following chart:
+
  <br /> All in all, we finally chose five hairpin mutants in the following chart:
 
 
 
<table width="200" border="1">
 
<table width="200" border="1">
Line 1,439: Line 1,414:
  
 
<h3>3.Comparing to the experiment</h3>
 
<h3>3.Comparing to the experiment</h3>
<br />After designing the protein mutant and hairpin mutants, the wet lab members test the all the Csy4 mutants and hairpin mutants. The result can see in the Fig.3-1.
+
<br />After designing the protein mutants and hairpin mutants, the wet lab members tested all the Csy4 mutants and hairpin mutants. The result can be seen in Fig.3-1.
 
<div align="center"><img src="https://static.igem.org/mediawiki/2018/9/93/T--OUC-China--mf31.jpg" height="450"> </div>
 
<div align="center"><img src="https://static.igem.org/mediawiki/2018/9/93/T--OUC-China--mf31.jpg" height="450"> </div>
 
<div align="center"><p>Fig.3-1 The experimental result of mutants</p></div>
 
<div align="center"><p>Fig.3-1 The experimental result of mutants</p></div>
<br />  And we try to give a comparison between the special value we used before for evaluate the mutant and experimental result to check our model.
+
<br />  We then compared the values used above to evaluate the mutants with the experimental results to check our model.
  <br /><br /> For the protein mutants, we give a comparison between D<sub>3</sub> and experimental result. Fig.3-2 is the result. <div align="center"><img src="https://static.igem.org/mediawiki/2018/c/cb/T--OUC-China--JCMODELPDB.png" height="450"> </div>
+
  <br /><br /> For the protein mutants, we compared D<sub>3</sub> with the experimental results. Fig.3-2 shows the result. <div align="center"><img src="https://static.igem.org/mediawiki/2018/c/cb/T--OUC-China--JCMODELPDB.png" height="450"> </div>
<div align="center"><p>Fig.3-2 The comparison between model and experiment for protein mutant</p></div> <br />  As we can see in the Fig.3-2, we can find the inner relationship between D<sub>3</sub> and experiment result: the D<sub>3</sub> value describe the cleavage ability between the wild-type and mutant. The higher D<sub>3</sub> value means that it will have an big weaker cleavage ability than the wild-type Csy4.
+
<div align="center"><p>Fig.3-2 The comparison between model and experiment for protein mutants</p></div> <br />  As Fig.3-2 shows, there is a relationship between D<sub>3</sub> and the experimental results: the D<sub>3</sub> value describes the difference in cleavage capacity between the wild type and the mutant. A higher D<sub>3</sub> value means a weaker cleavage capacity than that of the wild-type Csy4.
  <br /><br /> For the hairpin mutants, we give a comparison between DR-Score and experimental result. Fig.3-3 is the result.
+
  <br /><br /> For the hairpin mutants, we compared the DR-Score with the experimental results. Fig.3-3 shows the result.
<div align="center"><img src="https://static.igem.org/mediawiki/2018/6/65/T--OUC-China--mf33.jpg" height="450"> </div>
+
<div align="center"><img src="https://static.igem.org/mediawiki/2018/1/1f/T--OUC-China--JCMODELhDB.png" height="450"> </div>
<div align="center"><p>Fig.3-3 The comparison between model and experiment for hairpin mutant</p></div> <br />  As we can see in the Fig.3-3 we can also can find the inner relationship between DR-Score and experiment result except for the miniToe 5. It is reasonable because the machine learning is quit sensitive to the data amounts and the R<sup>2</sup> is not 1 in our training result of SVM model.  
+
<div align="center"><p>Fig.3-3 The comparison between model and experiment for hairpin mutant</p></div> <br />  As Fig.3-3 shows, there is also a relationship between the DR-Score and the experimental results, except for miniToe 1. This is reasonable because machine learning is quite sensitive to the amount of data, and the R<sup>2</sup> of our trained SVM model is not 1. <br />
 
  <br /> After all, our wet lab member test 30 combinations of our Csy4 and hairpin. Fig.3-4 is the heatmap result of it.
 
  <br /> After all, our wet lab member test 30 combinations of our Csy4 and hairpin. Fig.3-4 is the heatmap result of it.
 
<div align="center"><img src="https://static.igem.org/mediawiki/2018/7/71/T--OUC-China--mf34.jpg" height="450"> </div>
 
<div align="center"><img src="https://static.igem.org/mediawiki/2018/7/71/T--OUC-China--mf34.jpg" height="450"> </div>
<div align="center"><p>Fig.3-4 The heatmap result of 30 combination</p></div>
+
<div align="center"><p>Fig.3-4 The heatmap results of 30 combinations</p></div>
 
 
 
 
Line 1,461: Line 1,436:
 
<p>
 
<p>
  
<h3>4.Future Work</h3>
+
<h3 id='aa'>4.Future Work</h3>
<br/>1) The QM/MM molecular dynamics is used to design protein mutant more and more frequently[26], we are going to use it to explore more and more details in Csy4/RNA complex.
+
<br/>1) QM/MM molecular dynamics is used more and more frequently to design protein mutants[26]; we are going to use it to explore more details of the Csy4/RNA complex. There are also some counterintuitive results in Fig.3-4, for example the high expression of miniToe4-H29A. We think these may be explained with the QM/MM molecular dynamics method.
<br/><br />2) We are gonging to find more and more Repeat Area to increase the data amount for our pre-processing algorithm.
+
<br/><br />2) We are going to find more Repeat Areas to increase the amount of data for our pre-processing algorithm.
 
</p>
 
</p>
  
Line 1,525: Line 1,500:
  
  
 
+
<br /><br /><br /><br />
  
  
Line 1,561: Line 1,536:
  
  
  <div class="copyright1">Contact Us : oucigem@163.com  | &copy;2018 OUC IGEM.All Rights Reserved.  |  ………… </div>
+
    <div class="copyright1">Contact Us : oucigem@163.com  | &copy;2018 OUC IGEM.All Rights Reserved. <br />
 +
<img src="https://static.igem.org/mediawiki/2017/b/b4/T--OUC-China--foot1.jpeg"alt="banner"width="80px">
 +
<img src="https://static.igem.org/mediawiki/2017/6/62/T--OUC-China--foot2.jpeg"alt="banner"width="80px">
 +
<img src="https://static.igem.org/mediawiki/2018/f/f3/T--OUC-China--lalala.png"alt="banner"width="80px">
 +
<img src="https://static.igem.org/mediawiki/2017/5/51/T--OUC-China--NSG.png"alt="banner"height="65px">
 +
<img src="https://static.igem.org/mediawiki/2017/2/2a/T--OUC-China--ML.png"alt="banner"height="65px">&emsp;
 +
  </div>
 
 
  

Latest revision as of 17:17, 5 December 2018

Team OUC-China: Main

miniToe Family


In the miniToe family, the protein and the hairpin were mutated to achieve different regulation levels. In this part, the model helps us design the mutants. Importantly, we used different strategies to design Csy4 and the hairpin: molecular dynamics played an important role in designing the protein mutants, while bioinformatics and machine learning helped us find the hairpin mutants of interest.

1. Enzymes Mutation

1.1 The Four Key Points in miniToe System


The wet lab members gave us four important sites, Gln104, Tyr176, Phe155 and His29, which play important roles in binding and cleavage in the protein Csy4. Considering the 20 kinds of amino acids, there are 80 candidates (4 sites × 20 amino acids) to explore and choose from if only one site is mutated.
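The size of this search space can be sketched by enumerating the single-site variants. This is only an illustration: the function name is hypothetical, and the naming scheme (wild-type residue, position, new residue, as in H29A) follows the usual convention. The count of 80 includes the choice that leaves each site unchanged; excluding those leaves 76 true mutants.

```python
# Wild-type residues at the four key sites named in the text (one-letter codes).
KEY_SITES = {104: "Q", 176: "Y", 155: "F", 29: "H"}
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def single_site_variants(include_wild_type=False):
    """List variant names such as 'H29A' for every key site."""
    names = []
    for position, wild in KEY_SITES.items():
        for residue in AMINO_ACIDS:
            if residue == wild and not include_wild_type:
                continue  # skip the substitution that changes nothing
            names.append(f"{wild}{position}{residue}")
    return names


all_choices = single_site_variants(include_wild_type=True)
true_mutants = single_site_variants()
print(len(all_choices), len(true_mutants))
```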

Before designing the protein mutants, we looked into the working process of miniToe to find the most important key points in our system.

Fig.1-1 The working process of miniToe system

All the reactions that happen in our first system, miniToe, can be described chronologically by the following five main steps[1]:

(1)MiniToe is produced and accumulated.
(2)Csy4 is produced under IPTG conditions.
(3)Csy4 binds to the miniToe structure and forms Csy4-miniToe complex.
(4)Csy4 cleaves the miniToe structure at the specific site into two parts: the Csy4-crRNA (cis-repressive RNA) complex and the mRNA encoding sfGFP.
(5)The sfGFP is produced.

There are four key points in our miniToe system, which determine whether the system can work successfully:

(1)Does Csy4 dock correctly with the miniToe structure?
(2)How strong is the binding capacity between Csy4 and the miniToe structure?
(3)How strong is the cleavage capacity between Csy4 and the miniToe structure?
(4)Is the cis-repressive RNA released from the RBS?

The most effective way to explore these four points is to model our system at the atomic level by molecular dynamics[2]. There is also a considerable body of work exploring the Csy4-crRNA complex by molecular dynamics[3][4][5].

1.2 Molecular Dynamics


Molecular dynamics (MD)[6] is a computer simulation method for studying the physical movements of atoms and molecules. The atoms and molecules are allowed to interact for a fixed period of time, giving a view of the dynamic evolution of the system. In the most common version, the trajectories of atoms and molecules are determined by numerically solving Newton's equations of motion for a system of interacting particles, where the forces between the particles and their potential energies are often calculated using interatomic potentials or molecular mechanics force fields.

For a system of molecules or atoms, the total energy consists of kinetic energy and potential energy, which can be described by the formula below:

\[ E = E_{\text{kin}} + U \]

where \(E_{\text{kin}}\) denotes the kinetic energy and \(U\) denotes the potential energy.

In a molecular system, the total potential energy is the sum of the non-bonded interaction energy \(U_{nb}\), the bond stretching energy \(U_{b}\), the angle bending energy \(U_{\theta}\), the torsion angle energy \(U_{\phi}\), the out-of-plane energy \(U_{\chi}\) and the cross terms \(U_{cross}\), as described by the formula below:

\[ U = U_{nb} + U_{b} + U_{\theta} + U_{\phi} + U_{\chi} + U_{cross} \]


This formula is also called the force field in the theory of molecular dynamics. Many force fields have been developed from statistical thermodynamics and empirical results. In the study of proteins and nucleic acids, the Amber force field is one of the most widely used, so we chose Amber as our force field together with the formula of this field.

The terms in the formula are, in order: the bond stretching term, the angle bending potential, the dihedral angle potential, the out-of-plane angle potential, the improper dihedral angle potential, the van der Waals interaction and the Coulombic interaction.

Consider a system of N molecules or atoms in classical mechanics. Each atom or molecule can be characterized as follows:

The position of atom i is \(\mathbf{r}_i\), its velocity is \(\mathbf{v}_i\) and its acceleration is \(\mathbf{a}_i\).

After integration, we get two formulas for \(\mathbf{v}_i\) and \(\mathbf{r}_i\):

\[ \mathbf{v}_i = \mathbf{v}_i^{0} + \mathbf{a}_i t \]

\[ \mathbf{r}_i = \mathbf{r}_i^{0} + \mathbf{v}_i^{0} t + \frac{1}{2} \mathbf{a}_i t^{2} \]


where \(\mathbf{v}_i^{0}\) is the initial velocity and \(\mathbf{r}_i^{0}\) is the initial position.

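According to classical mechanics, the force applied to an atom is the negative gradient of the potential energy:

\[ \mathbf{F}_i = -\nabla_i U = -\left( \frac{\partial}{\partial x_i}\hat{\mathbf{i}} + \frac{\partial}{\partial y_i}\hat{\mathbf{j}} + \frac{\partial}{\partial z_i}\hat{\mathbf{k}} \right) U \]


By Newton’s second law, the acceleration of atom i can also be described as:

\[ \mathbf{a}_i = \frac{\mathbf{F}_i}{m_i} \]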


Taken together, the formulas above describe the full process of molecular dynamics, and Fig.1-2 shows the flow chart of MD in the program.


Fig.1-2 The flow chart of MD in program
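The loop in the flow chart can be sketched for a single particle in one dimension. This is a toy illustration under stated assumptions, not the code run by the simulation packages: the harmonic potential, the unit mass and spring constant, and the velocity Verlet update (one common way of applying the two formulas for position and velocity at every step) are all our own choices for the example.

```python
def force(x, k=1.0):
    """Force as the negative gradient of the toy potential U(x) = 0.5 * k * x**2."""
    return -k * x


def md_step(x, v, dt, mass=1.0, k=1.0):
    """Advance position and velocity by one time step dt (velocity Verlet)."""
    a = force(x, k) / mass                    # a = F / m
    x_new = x + v * dt + 0.5 * a * dt ** 2    # r = r0 + v0*t + 1/2*a*t^2
    a_new = force(x_new, k) / mass            # force at the new position
    v_new = v + 0.5 * (a + a_new) * dt        # v = v0 + a*t, with a averaged over the step
    return x_new, v_new


def total_energy(x, v, mass=1.0, k=1.0):
    """E = E_kin + U for the toy system."""
    return 0.5 * mass * v ** 2 + 0.5 * k * x ** 2


x, v = 1.0, 0.0
e0 = total_energy(x, v)
for _ in range(10_000):
    x, v = md_step(x, v, dt=0.01)
drift = abs(total_energy(x, v) - e0)
print(drift)
```

The total energy should stay close to its initial value over the run, which is a simple sanity check that the time step is small enough.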

1.3 Logic Line


When choosing Csy4 mutants, we considered the four points discussed above, and we chose molecular dynamics as the main tool to study our system at the atomic level. What is the logic line between them?

We framed the problem of choosing mutants as follows:
In experiments, the strain-Csy4-Wild Type works well, which confirms the four key points discussed above: Csy4 docks correctly with the miniToe structure, Csy4 has a good capacity for binding and cleaving the miniToe structure, and finally the cis-repressive RNA is released from the RBS. We therefore chose the strain-Csy4-Wild Type as the criterion against which the four key points can be checked in all Csy4 mutants.

This leaves two main problems: how to describe the four key points mathematically, and how to compare the wild-type Csy4 with a mutant Csy4. These are discussed in the following two parts.


1.4 How to describe four key points in mathematical form


In this part, we examine the molecular dynamics of wild-type Csy4 in the miniToe system and show how to describe the four key points in mathematical form.

For the first point, we defined a matrix called the interaction matrix. It describes the possibility of interaction between every amino acid of the protein and every nucleotide of the hairpin, and it can be calculated by catRAPID graphic[7]. We submitted the wild-type Csy4 and the miniToe structure to the online service, which returned the interaction matrix. Fig.1-3 is the heat map of the interaction matrix for wild-type Csy4 in the hairpin region.

Fig.1-3 The heatmap of interaction matrix for wild-type Csy4.


And the rest three points is explained by the molecular dynamics. Some of our work about molecular dynamics is based on Jiří Šponer’s work[8], but most of work were done by us.

To explore the remaining three key points, we prepared two structures for molecular dynamics. The geometry of our miniToe system is based on the X-ray structure of the Csy4/RNA complex with the cleaved RNA (PDB ID: 4AL5, resolution 2.0 Å), shown in Fig.1-4.

Fig.1-4 the X-ray structure of Csy4/RNA complex


The first structure, called the precursor complex, is prepared for the second and third points: what is the capacity of binding and cleavage between Csy4 and the miniToe structure? The precursor complex consists of two parts: wild-type Csy4 and the miniToe structure. It describes the state after Csy4 has bound the hairpin but before it has cleaved the hairpin at the specific site. The Csy4 structure originates from the X-ray structure mentioned before, while the miniToe structure is constructed entirely by rational modeling: we put the sequence into mFold[9] to generate the secondary structure of the hairpin, and the tertiary structure is then produced by RNAComposer[10]. The molecular docking between Csy4 and the miniToe structure is carried out by PatchDock[11]. The precursor complex of wild-type Csy4 can be seen in Fig.1-5.

Fig.1-5. The precursor complex of wild-type Csy4


Having obtained the structure of the precursor complex, we prepared the simulation. Missing hydrogen atoms were added by PDBFixer[12], and the force field we used is Amber ff99SB[13]. The system is immersed in a rectangular TIP3P water box. After minimizing the energy of the protein/RNA system, we applied restraints to the RNA chain so that the structure does not become unreasonable when the temperature rises. All simulations run with periodic boundary conditions (PBC) at 300 K and 1 atm in the NPT ensemble. The time step is 2 fs and the total simulation time is 50 ns. Gromacs[14] and OpenMM[15] are the software we used most, on Ubuntu 16.04 and Windows 10. The hardware is an Intel i7 6700HQ with an NVIDIA GTX 960M 4G, which can simulate about 100-150 ns per day with GPU acceleration. The resulting trajectories are analyzed with PyMOL[16] and MDAnalysis[17].

For the second point, what we obtain from the simulation data is the protein binding free energy, which describes the capacity of binding. The first 20 ns of data are discarded. We then use the data from 30-50 ns, so that the structure is stable when calculating. The resulting binding free energy for wild-type Csy4 is \(\Delta G_{binding} = 59154.9251\) kJ/mol.

For the third point, what we obtain from the simulation data is the distance of some key interactions in the active site of Csy4, which describes the capacity of cleavage. Jiří Šponer[8] points out several key interactions in the active site, including Ser148(OG)-G20(O2’), Ser150(OG)-G20(O3’) and Thr151(OG)-G20(N2’). Based on Jiří Šponer’s work, we chose the Thr151(OG)-G20(N2’) distance as our mathematical form for the third point. The distance curve of Thr151(OG)-G20(N2’) for wild-type Csy4 can be seen in Fig.1-6. Our result is similar to that of Jiří Šponer’s work[8].

Fig.1-6. The distance of Thr151(OG)-G20(N2’) in wild-type Csy4



The second structure, called the product complex, is prepared for the fourth point: is the cis-repressive RNA released from the RBS? The product complex consists of two parts after cleavage: wild-type Csy4 and the miniToe structure. It describes the state after Csy4 has bound and cleaved the hairpin at the specific site. The Csy4 structure comes from the X-ray structure mentioned before, while the miniToe structure is designed entirely by rational modeling: we put the two RNA sequences into SimRNAweb[18] to dock the two RNA chains and generate the tertiary structure. The molecular docking between Csy4 and the miniToe structure is carried out by PatchDock[11]. The product complex of wild-type Csy4 can be seen in Fig.1-7.

Fig.1-7 The product complex of wild-type Csy4



We also explored the product complex by molecular dynamics, following the protocol described above. In this part, however, we only applied restraints to the RBS chain, while the crRNA chain is free to move.

For the fourth point, what we obtain from the simulation data is the RMSD, which describes the movement of the crRNA chain in mathematical form. The RMSD is shown in Fig.1-8. It is unstable, which indicates that the crRNA is released from the RBS.

Fig.1-8 The RMSD of the product complex of wild-type Csy4
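As an illustration of the quantity plotted above, the following is a minimal sketch of an RMSD calculation over a trajectory. It assumes each frame is a list of (x, y, z) coordinates for the same atoms in the same order and that no structural superposition is applied; the function names `rmsd` and `rmsd_series` are hypothetical and are not taken from the analysis tools named in the text.

```python
import math


def rmsd(coords, reference):
    """Root-mean-square deviation between two equally sized lists of (x, y, z) points.

    Assumption: atoms are already matched one-to-one and no alignment is done.
    """
    if len(coords) != len(reference) or not coords:
        raise ValueError("coordinate sets must be non-empty and of equal length")
    total = 0.0
    for (x, y, z), (rx, ry, rz) in zip(coords, reference):
        total += (x - rx) ** 2 + (y - ry) ** 2 + (z - rz) ** 2
    return math.sqrt(total / len(coords))


def rmsd_series(trajectory, reference):
    """RMSD of every frame in a trajectory against one reference frame."""
    return [rmsd(frame, reference) for frame in trajectory]
```

A curve that keeps drifting instead of levelling off would correspond to the unstable behaviour described for the crRNA chain.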

1.5 How to compare four mathematical forms between wild-type and mutants


In this part, we examine the molecular dynamics of the miniToe system with Csy4 mutants to show how to compare the four mathematical forms chosen above between wild-type Csy4 and Csy4 mutants. In the following description, we use the mutant Q104A as an example.

Firstly, we use SwissModel[19] to generate the tertiary structure of the mutant with Csy4 as the template (PDB ID: 4AL5, resolution 2.0 Å). Using molecular dynamics, we then obtain the four mathematical forms shown in Fig.1-9.

Fig.1-9 The four key points in mathematical forms for Csy4-Q104A



We divide the four mathematical forms into two kinds of data: matrices and numerical values. The interaction matrix and the curves can be regarded as matrices because the curves are discrete, while the binding free energy is just a numerical value.

For matrices, we can use the Euclidean distance to describe the difference between two matrices:

\( D(p, q^{WT}) = \sqrt{\sum\limits_{i}^{m} \sum\limits_{j}^{n} \left( p_{i,j} - q^{WT}_{i,j} \right)^{2}} \)
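The Euclidean distance between a mutant matrix and the wild-type matrix can be sketched as follows. This is a minimal illustration, assuming both matrices are given as nested lists with identical shapes; a discrete curve is treated as a matrix with a single row. The function name `matrix_distance` is hypothetical.

```python
import math


def matrix_distance(p, q_wt):
    """Euclidean distance between two matrices of identical shape.

    p is the mutant matrix and q_wt the wild-type matrix; both are lists of rows.
    """
    if len(p) != len(q_wt) or any(len(a) != len(b) for a, b in zip(p, q_wt)):
        raise ValueError("matrices must have the same shape")
    squared_sum = sum(
        (a - b) ** 2
        for row_p, row_q in zip(p, q_wt)
        for a, b in zip(row_p, row_q)
    )
    return math.sqrt(squared_sum)
```

With this definition the wild type compared with itself gives 0, which matches the WT row of zeros in the result table.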

For the binding free energy, we can use the formula below to calculate the difference between the wild type and the mutant[20]:

\( \ln(K_{drel}) = \ln\left( \frac{K_{dWT}}{K_{dMUT}} \right) = \Delta\Delta G_{binding} \)


According to the descriptions above, we defined four values used to compare the four key points between a mutant and the wild type: \(D_1\) (interaction matrix), \(\ln(K_{drel})\), \(D_3\) (Thr151-G20 curve) and \(D_4\) (RMSD).

For the mutant Q104A, the four values are shown in the following chart:
Csy4-Mutant   D1      ln(K_drel)   D3      D4
Q104A         0.483   2483         9.48    30.82


1.6 Conclusion and Result



From discussions in 1.4 and 1.5, we had the tools to evaluate the four key points in mathematical form and found the method to compare the mutant and wild-type. The next step is to find out the mutant that meets our needs.

We chose the top 10 mutants in D1 for molecular dynamics. We mainly focused on D3, which describes the capacity of cleavage, while D4 and ln(K_drel) are secondary values to be considered. The value of D3 needs an upper limit so that a mutant that is totally inactive in cleavage is not chosen. Fortunately, Csy4-H29A has been proved to be an inactive protein[21], so we chose 13.41, the D3 value for Csy4-H29A, as the threshold.
According to the discussion above, the five variants in the following table were chosen:
Csy4-Mutant   D1      ln(K_drel)   D3      D4
WT            0       0            0       0
Q104A         0.483   2483         9.48    30.82
Y176F         0.592   -382         11.61   40.62
F155A         0.233   -1627        13.41   35.71
H29A          0.173   833          15.29   316.22

2. Hairpin Mutation


2.1 The Large Mutant Library



We start with the hairpins we want to design. The hairpin used in the miniToe structure comes from the Repeat Area of the CRISPR type I-F system. It can be recognized and cleaved by Csy4, and it is also called the DR in the CRISPR system. Fig.2-1 shows a standard CRISPR array. The yellow line between two DRs is the guide RNA (gRNA), which is captured from foreign DNA[22].

Fig.2-1 the structure of CRISPR Array

When designing the hairpin mutations, it is impossible to use the MD method to find out which hairpins meet our needs because the mutation library is too large. In our miniToe structure in Fig.2-2, excluding the two important bases, G20 and C21, the mutant library contains about \(4^{20}\) sequences. Exploring all of these hairpin mutations with molecular dynamics would waste too much time and computing resources.

Fig.2-2 The structure of miniToe

2.2 Pre-processing Algorithm


Combining bioinformatics and machine learning, we present a pre-processing algorithm for our large mutation library. Fig.2-3 is the flow chart of the pre-processing algorithm.

Fig.2-3 The flow chart of this pre-processing algorithm



The first step of the pre-processing algorithm is to find all the Repeat Areas in genomes as training input. We downloaded as many genomes as possible from NCBI. With the help of PILER-CR[23], a bioinformatics algorithm for finding Repeat Areas, we can quickly extract the Repeat Areas and CRISPR arrays from a genome. We focus only on Repeat Areas whose length is 28 or 29 bp, because studies show that the Repeat Area in the CRISPR type I-F system is 28 or 29 bp long[24]. In this step, we downloaded about 5000 genomes and found 119 Repeat Areas of 28 or 29 bp. Fig.2-4 shows the 48 Repeat Areas of 28 bp that we found.

Fig.2-4 The Repeat Area which is 28bp

The second step of the pre-processing algorithm is to score the hairpins obtained in the first step. We created a score called DR-Score to evaluate the quality of a Repeat Area compared to the wild-type hairpin. It is calculated as follows: \( DR\text{-}Score_{i}^{PA} = E_{H} \sum\limits_{n} \frac{DR_{size} + N_{mismatch}}{DR_{size}} \)
This formula can be divided into two parts. The first part is \(E_H\), the regularized Levenshtein distance between the evaluated hairpin and the wild-type hairpin, which is an index of sequence similarity in bioinformatics. The second part is the possibility value, \(\sum\limits_{n} \frac{DR_{size} + N_{mismatch}}{DR_{size}}\), which describes the quality of the Repeat Area in the genome on its own. \(DR_{size}\) is the length of the Repeat Area, which is 28 or 29 bp in this case. \(N_{mismatch}\) is the number of mutated nucleotides in a Repeat Area in the CRISPR array compared to the common one. \(n\) is the number of times the common Repeat Area appears in the CRISPR array.
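The Levenshtein distance used for the first part of the score can be sketched with the standard dynamic-programming recurrence. The text does not say how the distance is regularized, so dividing by the length of the longer sequence is an assumption made here for illustration only; the function names are hypothetical.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def normalized_levenshtein(a, b):
    """Distance scaled to [0, 1]; the scaling by the longer length is an assumption."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0
```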

Now we take the CRISPR locus in Fig.2-5 as an example to calculate the possibility value.

Fig.2-5 one CRISPR array comes from CRISPR database



The CRISPR array shown in Fig.2-5 has 9 Repeat Areas of 3 kinds:

DR1: ‘ACTGTACCATGCCTTACTTTGGATTCAAGGCAAAAC’
DR2: ‘ACTGTACCATGCCTGATTTTGGATTCGAGGCAAAAC’
DR3: ‘ACTGTACCATGCCTTACTTTGGATTCAAGTAAATCG’


The first Repeat Area is the most common, while the others carry some mutations. The 3 possibility values are listed below:

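\( DR1\ PossibilityValue = \sum\limits_{n} \frac{DR_{size} + N_{mismatch}}{DR_{size}} = 9 \times \frac{36+0}{36} = 9 \)
\( DR2\ PossibilityValue = \sum\limits_{n} \frac{DR_{size} + N_{mismatch}}{DR_{size}} = \frac{36+3}{36} \approx 1.083 \)
\( DR3\ PossibilityValue = \sum\limits_{n} \frac{DR_{size} + N_{mismatch}}{DR_{size}} = \frac{36+5}{36} \approx 1.14 \)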
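The three values above can be reproduced with a short sketch. Mismatches are counted position by position against DR1, the common repeat. The occurrence counts are an assumption inferred from the printed results: n = 9 for DR1 and n = 1 for DR2 and DR3. The function names are hypothetical.

```python
DR1 = "ACTGTACCATGCCTTACTTTGGATTCAAGGCAAAAC"
DR2 = "ACTGTACCATGCCTGATTTTGGATTCGAGGCAAAAC"
DR3 = "ACTGTACCATGCCTTACTTTGGATTCAAGTAAATCG"


def count_mismatches(repeat, common):
    """Number of positions at which a repeat differs from the common repeat."""
    if len(repeat) != len(common):
        raise ValueError("repeats must have the same length")
    return sum(1 for a, b in zip(repeat, common) if a != b)


def possibility_value(repeat, common, n):
    """Sum of (DR_size + N_mismatch) / DR_size over n identical occurrences."""
    dr_size = len(common)
    n_mismatch = count_mismatches(repeat, common)
    return n * (dr_size + n_mismatch) / dr_size


print(count_mismatches(DR2, DR1), count_mismatches(DR3, DR1))  # 3 5
print(possibility_value(DR1, DR1, 9))                          # 9.0
print(round(possibility_value(DR2, DR1, 1), 3))                # 1.083
print(round(possibility_value(DR3, DR1, 1), 3))                # 1.139
```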


It is easy to understand why the DR-Score is divided into two parts. We need the Levenshtein distance to describe sequence similarity because a length of 28 or 29 bp, as found for the Repeat Area in the CRISPR type I-F system[24], is not a sufficient criterion: the distance lets us filter out hairpins that are totally wrong. We also need the possibility value to describe the quality of the hairpins we get, and it is built on the following assumption: the more often a Repeat Area occurs, the better it is.

If the same Repeat Area occurs in j different species, we use the weighted mean to calculate the DR-Score. Assuming that there are \(n_i\) repeats in the i-th species, the DR-Score can be calculated by the formula below:

\( DRScore = \sum\limits_{i=1}^{j} \left( \frac{n_i}{\sum\limits_{m}^{j} n_m} DRScore_i \right) \)
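The weighted mean can be sketched as below, assuming the per-species repeat counts and per-species DR-Scores are supplied as two parallel lists. The function name `weighted_dr_score` and the example numbers in the comment are hypothetical.

```python
def weighted_dr_score(counts, scores):
    """Weighted mean of per-species DR-Scores, weighted by repeat count.

    counts[i] is the number of repeats in species i and scores[i] is the
    DR-Score computed for that species.
    """
    if len(counts) != len(scores) or not counts:
        raise ValueError("counts and scores must be non-empty and of equal length")
    total = sum(counts)
    if total == 0:
        raise ValueError("at least one repeat count must be non-zero")
    return sum(n_i / total * score_i for n_i, score_i in zip(counts, scores))


# Hypothetical example: 3 repeats scoring 10.0 and 1 repeat scoring 2.0
# give 0.75 * 10.0 + 0.25 * 2.0.
```

Species that contain the repeat more often contribute proportionally more to the final score.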


The third step of the pre-processing algorithm is to train the SVM model. The SVM (support vector machine) is a machine learning algorithm used for classification and regression[25]. With the help of a kernel function, it can model a complex relationship between input and output, and it has been used successfully to predict promoter strength. In this step, we chose the sequence of the Repeat Area as input and the DR-Score as output.

Before training the SVM model, we first convert the sequence into a mathematical representation using the following method.
The original sequence data Seq='seq1seq2...seqn-1seqn', which is coded by 'A', 'G', 'C' and 'T', can be transformed into the matrix x by the formula below:
\( x = \begin{cases} [1,0,0,0] & seq_i = A \\ [0,1,0,0] & seq_i = G \\ [0,0,1,0] & seq_i = C \\ [0,0,0,1] & seq_i = T \end{cases} \)
For example, ‘AGCTA’ is transformed into the matrix: [1,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1,1,0,0,0] .
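The encoding described above can be sketched in a few lines, using the same base order (A, G, C, T) as the formula. The function name `encode_sequence` is hypothetical, and rejecting any other character is a design choice made here, not something stated in the text.

```python
ONE_HOT = {
    "A": [1, 0, 0, 0],
    "G": [0, 1, 0, 0],
    "C": [0, 0, 1, 0],
    "T": [0, 0, 0, 1],
}


def encode_sequence(seq):
    """Concatenate the 4-element one-hot code of every base into one flat vector."""
    vector = []
    for base in seq.upper():
        if base not in ONE_HOT:
            raise ValueError(f"unexpected base: {base!r}")
        vector.extend(ONE_HOT[base])
    return vector


print(encode_sequence("AGCTA"))
# [1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0]
```

A 28 bp Repeat Area therefore becomes a vector of 112 values, and a 29 bp one a vector of 116 values.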

After processing the sequence data, we divided the data into two part: training data and test data, then tried to train the model.

The complex relationship between the Repeats sequence matrix x and its DR-Score y can be mapped by an SVM regression function y=f(x). In order to achieve our goals, the SVM model is constructed following Vapnik et al[25]:

The performance of the SVM model is evaluated by the squared correlation coefficient (\(R^2\)):
\( R^{2} = \frac{\left( n\sum\limits_{i=1}^{n} f(x_i) y_i - \sum\limits_{i=1}^{n} f(x_i) \sum\limits_{i=1}^{n} y_i \right)^{2}}{\left( n\sum\limits_{i=1}^{n} f(x_i)^{2} - \left( \sum\limits_{i=1}^{n} f(x_i) \right)^{2} \right)\left( n\sum\limits_{i=1}^{n} y_i^{2} - \left( \sum\limits_{i=1}^{n} y_i \right)^{2} \right)} \)
Here \(f(x_i)\) and \(y_i\) are the predicted value and the DR-Score of the Repeat, respectively. Several kernel functions were tested, including the polynomial function, the sigmoid function and the Gaussian radial basis function (RBF), and the RBF was finally the best choice for our model. Fig.2-6 shows the training results: the \(R^2\) of both the training data and the testing data is high.

Fig.2-6 The training result
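The squared correlation coefficient used to evaluate the model can be computed directly from predictions and targets. The sketch below follows the standard squared Pearson correlation and assumes two equally long lists of numbers; the function name `squared_correlation` is hypothetical.

```python
def squared_correlation(predicted, observed):
    """Squared Pearson correlation between predictions f(x_i) and targets y_i."""
    n = len(predicted)
    if n != len(observed) or n < 2:
        raise ValueError("need two equally long sequences with at least 2 values")
    sum_f = sum(predicted)
    sum_y = sum(observed)
    sum_fy = sum(f * y for f, y in zip(predicted, observed))
    sum_ff = sum(f * f for f in predicted)
    sum_yy = sum(y * y for y in observed)
    numerator = (n * sum_fy - sum_f * sum_y) ** 2
    denominator = (n * sum_ff - sum_f ** 2) * (n * sum_yy - sum_y ** 2)
    if denominator == 0:
        raise ValueError("correlation is undefined for constant input")
    return numerator / denominator
```

Computing this value separately on the training data and on the held-out test data gives the two numbers reported for the model.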

The last step of the pre-processing algorithm is to evaluate the large mutation library using the SVM model trained in the third step, which gives a score to each hairpin.

2.3 Re-Check and Result


After pre-processing the large mutant library, we chose 10 hairpins whose scores are high but also form a gradient, so that we can achieve different expression levels. Then we re-check the four key points to
All in all, we finally chose the five hairpin mutants in the following chart:
Hairpin-Mutant DRScore
miniToe1 76.6306
miniToe2 65.6278
miniToe3 66.7160
miniToe4 62.5537
miniToe5 52.9794

3. Comparing to the experiment


After the protein mutants and hairpin mutants were designed, the wet lab members tested all the Csy4 mutants and hairpin mutants. The result can be seen in Fig.3-1.

Fig.3-1 The experimental result of mutants


We then compared the values used above for evaluating the mutants with the experimental results to check our model.

For the protein mutants, we compared D3 with the experimental results. Fig.3-2 shows the results.

Fig.3-2 The comparison between model and experiment for protein mutants


As Fig.3-2 shows, there is a relationship between D3 and the experimental results: the D3 value describes the difference in the capacity of cleavage between the wild type and the mutant, and a higher D3 value means a weaker capacity of cleavage than wild-type Csy4.

For the hairpin mutants, we compared the DR-Score with the experimental results. Fig.3-3 shows the result.

Fig.3-3 The comparison between model and experiment for hairpin mutant


As Fig.3-3 shows, there is also a relationship between the DR-Score and the experimental results, except for miniToe 1. This is reasonable because machine learning is quite sensitive to the amount of data, and the \(R^2\) in the training result of our SVM model is not 1.

Finally, our wet lab members tested 30 combinations of our Csy4 variants and hairpins. Fig.3-4 is the heatmap of the results.

Fig.3-4 The heatmap results of 30 combination

4. Future Work


1) QM/MM molecular dynamics is increasingly used to design protein mutants[26], and we are going to use it to explore more details of the Csy4/RNA complex. There are also some counterintuitive results in Fig.3-4, for example the high expression of miniToe4-H29A. We think these may be explained with the QM/MM molecular dynamics method.

2) We are going to find more Repeat Areas to increase the amount of data for our pre-processing algorithm.

5. References

[1].Du P, Miao C, Lou Q, et al. Engineering Translational Activators with CRISPR-Cas System[J]. Acs Synthetic Biology, 2016, 5(1):74.

[2].Molecular Dynamics: , Lecture Notes in Physics, Volume 258. ISBN 978-3-540-16789-1. Springer-Verlag, 1986

[3].Tang Y, Nilsson L. Molecular dynamics simulations of the complex between human U1A protein and hairpin II of U1 small nuclear RNA and of free RNA in solution[J]. Biophysical Journal, 1999, 77(3):1284-1305.

[4].Reyes C M, Kollman P A. Structure and thermodynamics of RNA-protein binding: using molecular dynamics and free energy analyses to calculate the free energies of binding and conformational change[J]. Journal of Molecular Biology, 2000, 297(5):1145-1158.

[5].Jr M A, Nilsson L. Molecular dynamics simulations of nucleic acid-protein complexes.[J]. Curr Opin Struct Biol, 2008, 18(2):194-199.

[6].Karplus M. Molecular Dynamics Simulations of Biomolecules[J]. Acc Chem Res, 2002, 9(9):321-323.

[7].Agostini F, Zanzoni A, Klus P, et al. catRAPID omics: a web server for large-scale prediction of protein–RNA interactions[J]. Bioinformatics, 2013, 29(22):2928-2930.

[8].Estarellas C, Otyepka M, Koča J, et al. Molecular dynamic simulations of protein/RNA complexes: CRISPR/Csy4 endoribonuclease.[J]. Biochimica Et Biophysica Acta, 2015, 1850(5):1072-1090.

[9].M. Zuker. Mfold web server for nucleic acid folding and hybridization prediction. Nucleic Acids Res. 31 (13), 3406-3415, 2003.

[10].Popenda M, Szachniuk M, Antczak M, et al. Automated 3D structure composition for large RNAs[J]. Nucleic Acids Research, 2012, 40(14):e112-e112.

[11].Schneidmanduhovny D, Inbar Y, Nussinov R, et al. PatchDock and SymmDock: servers for rigid and symmetric docking[J]. Nucleic Acids Research, 2005, 33(Web Server issue):363-7.

[12].https://github.com/pandegroup/pdbfixe

[13].Samudravijaya K. Comparison of multiple Amber force fields and development of improved protein backbone parameters[J]. Proteins-structure Function & Bioinformatics, 2010, 65(3):712-725.

[14].Pronk S, Schulz R, Larsson P, et al. GROMACS 4.5[J]. Bioinformatics, 2013, 29(7):845-854.

[15].Eastman P, Pande V S. OpenMM: A Hardware Independent Framework for Molecular Simulations[M]. IEEE Educational Activities Department, 2010.

[16].Shringi R P. PyMol Software for 3D Visualization of Aligned Molecules[J]. Biomaterials, 2005, 26(1):63-72.

[17].Michaudagrawal N, Denning E J, Woolf T B, et al. MDAnalysis: a toolkit for the analysis of molecular dynamics simulations.[J]. Journal of Computational Chemistry, 2011, 32(10):2319-2327.

[18].Boniecki M J, Lach G, Dawson W K, et al. SimRNA: a coarse-grained method for RNA folding simulations and 3D structure prediction[J]. Nucleic Acids Research, 2016, 44(7):e63-e63.

[19].Schwede T, Kopp J, Guex N, et al. SWISS-MODEL: An automated protein homology-modeling server.[J]. Nucleic Acids Research, 2003, 31(13):3381-3385.

[20].Estimation of Relative Protein–RNA Binding Strengths from Fluctuations in the Bound State

[21].Lee H Y, Haurwitz R E, Apffel A, et al. RNA-protein analysis using a conditional CRISPR nuclease[J]. Proceedings of the National Academy of Sciences of the United States of America, 2013, 110(14):5416-5421.

[22].Przybilski R, Richter C, Gristwood T, et al. Csy4 is responsible for CRISPR RNA processing in Pectobacterium atrosepticum.[J]. Rna Biology, 2011, 8(3):517-528.

[23].Edgar R C. PILER-CR: Fast and accurate identification of CRISPR repeats[J]. Bmc Bioinformatics, 2007, 8(1):1-6.

[24].Crawley A B, Henriksen J R, Barrangou R. CRISPRdisco: An Automated Pipeline for the Discovery and Analysis of CRISPR-Cas Systems[J]. 2018, 1(2).

[25].Schölkopf B, Tsuda K, Vert J P. Support Vector Machine Applications in Computational Biology[C]// MIT Press, 2004:71-92.

[26].Wang X, Li R, Cui W, et al. QM/MM free energy Simulations of an efficient Gluten Hydrolase (Kuma030) Implicate for a Reactant-State Based Protein-Design Strategy for General Acid/Base Catalysis[J]. Scientific Reports, 2018, 8.





Contact Us : oucigem@163.com | ©2018 OUC IGEM.All Rights Reserved.