
RNA-based Thermosensors
Intelligent Screening System


Model

  • Introduction

    During our repeated experiments, we found that not all of the RNA-based thermosensors we obtained were desirable. To speed up our experiments and reduce the probability of obtaining undesirable thermosensors in subsequent rounds, we developed an intelligent screening system for RNA-based thermosensors based on machine learning.

  • Methodology

    The RNA-based Thermosensor Intelligent Screening System is based on the random forest algorithm, a perturb-and-combine technique[1] designed specifically for trees: a diverse set of classifiers is created by introducing randomness into the classifier construction, and the prediction of the ensemble is given as the averaged prediction of the individual classifiers.

    Figure 1. A simple random forest

    First, we perform feature engineering on the RNA sequences. Guided by the literature, we extract several strong features: GC content, stem length, loop size, sequence length, number of free bases, and free energy. GC content and sequence length are easy to obtain, but the number of free bases, stem length, and loop size require the RNA secondary structure to be known in advance. We therefore use a dynamic-programming algorithm based on the minimum-free-energy principle to predict the RNA secondary structure and its free energy.

    Figure 2. The features we extracted
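
    As a minimal sketch of this step (assuming the ViennaRNA Python bindings are installed; RNA.fold returns the dot-bracket structure and its minimum free energy), the simpler features can be computed as follows. Stem length and loop size would be parsed from the dot-bracket string and are omitted here for brevity:

        import RNA  # ViennaRNA Python bindings

        def extract_features(seq):
            # Predict the MFE secondary structure by dynamic programming.
            structure, mfe = RNA.fold(seq)
            return {
                "gc_content": (seq.count("G") + seq.count("C")) / len(seq),
                "sequence_length": len(seq),
                "num_free_bases": structure.count("."),  # unpaired positions
                "free_energy": mfe,                      # kcal/mol
            }

        print(extract_features("GGGCUUCGGCCCAAGGAGGUAAAAAUG"))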

    Like other classifiers, forest classifiers have to be fitted with two arrays: an array X of size [n_samples, n_features] holding the training samples, and an array Y of size [n_samples] holding the target class labels, where n_samples is the number of training samples and n_features is the number of features extracted from the RNA-based thermosensors.
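
    As a sketch, the feature dictionaries from the step above can be stacked into these two arrays (training_sequences and training_labels are hypothetical placeholders for our experimental data):

        import numpy as np

        FEATURES = ["gc_content", "sequence_length", "num_free_bases", "free_energy"]

        # Hypothetical data: one feature dict per candidate thermosensor,
        # labelled 1 (desirable) or 0 (undesirable) by experiment.
        records = [extract_features(s) for s in training_sequences]
        X = np.array([[r[f] for f in FEATURES] for r in records])  # [n_samples, n_features]
        Y = np.array(training_labels)                              # [n_samples]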

    In random forests, each tree in the ensemble is built from a sample drawn with replacement (i.e., a bootstrap sample) from the training set. In addition, when splitting a node during the construction of a tree, with the largest decrease in Gini impurity as the criterion, the split chosen is no longer the best split among all features; instead, it is the best split among a random subset of the features. As a result of this randomness, the bias of the forest usually increases slightly (with respect to the bias of a single non-random tree), but, due to averaging, its variance decreases, usually more than compensating for the increase in bias and hence yielding an overall better model.
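
    For reference, the impurity measure behind this criterion is the Gini impurity G = 1 - Σ p_k², where p_k is the fraction of samples of class k in a node; a candidate split is scored by how much it decreases this quantity:

        import numpy as np

        def gini(labels):
            # Gini impurity of a node: 1 minus the sum of squared class fractions.
            _, counts = np.unique(labels, return_counts=True)
            p = counts / counts.sum()
            return 1.0 - np.sum(p ** 2)

        print(gini([0, 0, 1, 1]))  # 0.5: maximally impure two-class node
        print(gini([1, 1, 1, 1]))  # 0.0: pure node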

    In contrast to the original publication[2], our implementation combines classifiers by averaging their probabilistic predictions (i.e., soft voting) instead of letting each classifier vote for a single class.
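
    A small self-contained check of this behaviour in scikit-learn (synthetic data stands in for our thermosensor features): the forest's predict_proba is exactly the per-tree probabilities averaged over the ensemble.

        import numpy as np
        from sklearn.datasets import make_classification
        from sklearn.ensemble import RandomForestClassifier

        X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)
        forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_demo, y_demo)

        # Soft voting: average the probabilistic predictions of the individual trees.
        averaged = np.mean([t.predict_proba(X_demo) for t in forest.estimators_], axis=0)
        assert np.allclose(forest.predict_proba(X_demo), averaged)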

    The main parameters to adjust when using these methods are n_estimators and max_features. The former is the number of trees in the forest: the larger the better, but also the longer the computation takes, and results stop improving significantly beyond a critical number of trees. We use n_estimators=1000 to train our model. The latter is the size of the random subset of features considered when splitting a node: the lower it is, the greater the reduction in variance, but also the greater the increase in bias. A good empirical default for classification tasks is max_features=sqrt(n_features), where n_features is the number of features extracted from the RNA-based thermosensors. We obtained the best results with the maximum tree depth set to None and the minimum number of samples required to split a node set to 2 (i.e., with fully developed trees). In addition, we use bootstrap sampling and estimate the generalization accuracy on the out-of-bag (OOB) samples[3].
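
    Put together, this configuration corresponds to the following scikit-learn sketch (X and Y as assembled above); oob_score=True reports the out-of-bag accuracy estimate:

        from sklearn.ensemble import RandomForestClassifier

        clf = RandomForestClassifier(
            n_estimators=1000,      # gains flatten beyond a critical number of trees
            max_features="sqrt",    # sqrt(n_features) candidate features per split
            max_depth=None,         # grow each tree fully...
            min_samples_split=2,    # ...down to two-sample nodes
            bootstrap=True,         # draw each tree's training set with replacement
            oob_score=True,         # generalization accuracy on out-of-bag samples
            n_jobs=1,               # single core; see the note on parallelism below
        )
        clf.fit(X, Y)
        print("OOB accuracy estimate:", clf.oob_score_)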

    Finally, random forests also feature parallel construction of the trees and parallel computation of the predictions. Instead of using all the cores of the computer, however, we used a single core to build our random forest, which avoided inter-process communication overhead.

    Figure 3. Number of cores used and the corresponding training time

    The relative rank of a feature used as a decision node in a tree can be used to assess the relative importance of that feature with respect to the predictability of the RNA-based thermosensor. Features used at the top of a tree contribute to the final prediction decision of a larger fraction of the input samples, so the expected fraction of samples a feature contributes to can serve as an estimate of its relative importance. Here, we combine the fraction of samples a feature contributes to with the decrease in impurity from splitting on it to create a normalized estimate of the predictive power of that feature.

    Figure 4. Feature importance evaluation
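
    In scikit-learn this normalized estimate is exposed as feature_importances_ on the fitted forest; assuming the forest above was fitted on all six features, ranking them is straightforward:

        feature_names = ["gc_content", "stem_length", "loop_size",
                         "sequence_length", "num_free_bases", "free_energy"]
        ranking = sorted(zip(feature_names, clf.feature_importances_),
                         key=lambda pair: pair[1], reverse=True)
        for name, score in ranking:
            print(f"{name}: {score:.3f}")  # importances sum to 1 across features
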
  • Result

    In the end, we increased the probability of obtaining a desirable RNA-based thermosensor from 47% to about 65%, and identified the two most important features of our designed RNA thermosensors: GC content and free energy.

  • References

    • [1] Breiman L. Arcing Classifiers[J]. Annals of Statistics, 1998, 26(3):801-824.
    • [2] Breiman L. Random Forests[J]. Machine Learning, 2001, 45(1):5-32.
    • [3] Wolpert D H, Macready W G. An Efficient Method To Estimate Bagging's Generalization Error[J]. Machine Learning, 1997, 35(1):41-55.