Modeling
Search & Recommendation System
Introduction
The users of our software are all innovative researchers in Synthetic Biology who are interested in many different biological fields. The purpose of our system is to recommend the most relevant projects to users, based on the research interests they offer to our search system.
Recommendation systems, most of which are built on matrix factorization for collaborative filtering, face some thorny problems when dealing with huge amounts of data. Empirical evidence shows that using deeper layers of neural networks offers better recommendation performance. In our search & recommendation system, we have therefore trained 2 deep neural networks, based on TensorFlow, to construct the recommendation model.
The overall strategy of our system is Collaborative Filtering, i.e. we first search our database for key words similar to the unknown word offered by the user, and then recommend related projects through the DNNs we have trained. To quantify the semantic similarity between words accurately, we use the word2Vec model, and to recommend similar projects efficiently, we use the Ball Tree Algorithm to implement the K Nearest Neighbors strategy.
Models used in the system
Encoder-Decoder Model
The Encoder-Decoder framework can be understood intuitively as a general model that transforms one sentence (or passage) into another. For a pair < X, Y >, our goal is to generate the target sentence Y with the Encoder-Decoder framework given the input sentence X. X and Y can be in either the same language or 2 different languages. We use neural machine translation so that the 1021 features of each reference project are better expressed.
Unlike traditional statistical machine translation, neural machine translation aims at building a single neural network that can be jointly tuned to maximize translation performance. The model consists of an encoder that encodes a source sentence into a fixed-length vector, from which a decoder generates a translation.
<img src="https://static.igem.org/mediawiki/2018/3/3e/T--SYSU-Software--modeling1.mp4" />
The flow of the 1021 features we process for each project is shown in the figure. In our search & recommendation system, we train the neural network so that the output better represents the inputs' features.
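As an illustration, here is a minimal sketch of such an encoder-decoder in TensorFlow 1.x, assuming an autoencoder-style setup that compresses the 1021 raw project features into a 64-dimensional code and reconstructs them; the intermediate layer size and hyperparameters are illustrative, not our exact configuration.
<pre>
import tensorflow as tf

# One row of 1021 raw features per project (placeholder batch input).
x = tf.placeholder(tf.float32, [None, 1021])

# Encoder: compress the 1021 features into a 64-dimensional representation.
h_enc = tf.layers.dense(x, 256, activation=tf.nn.sigmoid)
code = tf.layers.dense(h_enc, 64, activation=tf.nn.sigmoid)

# Decoder: reconstruct the original 1021 features from the 64-dimensional code.
h_dec = tf.layers.dense(code, 256, activation=tf.nn.sigmoid)
x_hat = tf.layers.dense(h_dec, 1021)

# Reconstruction error; minimizing it forces `code` to represent the inputs' features.
loss = tf.losses.mean_squared_error(labels=x, predictions=x_hat)
train_op = tf.train.AdamOptimizer(0.01).minimize(loss)
</pre>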
Word2Vec Algorithm
Word2vec is an algorithm that produces word embeddings, i.e. it converts a corpus of text into a high-dimensional real vector space (in our case, the dimension is 300), and each word in the corpus is assigned a vector in that space. If two words are semantically similar, they will be close under the cosine distance measure.
As an interface to word2vec, we go with a Python package called gensim, which is a popular NLP (Natural Language Processing) package with nice documentation and tutorials, including for word2vec.
Our Search System includes word vectors for a vocabulary of 3 million words and phrases that we have trained on roughly 100 billion words from a Google News dataset. The vector length is 300 features.
The reason why we use Word2vec is that it can distinguish the semantic meanings of words accurately through Deep Learning techniques, which greatly outperform traditional semantic analysis methods. When a user enters any direction he wants to query in our search interface, Word2vec helps the search system understand accurately what he wants to say. Besides, the model includes some commonly paired words and misspellings of words.
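A minimal sketch of how such a model can be queried through gensim is shown below; the file name of the pretrained Google News binary and the example words are only illustrative.
<pre>
from gensim.models import KeyedVectors

# Load the pretrained 300-dimensional Google News vectors (path is illustrative).
model = KeyedVectors.load_word2vec_format(
    'GoogleNews-vectors-negative300.bin', binary=True)

vector = model['promoter']                        # a 300-dimensional numpy vector
print(model.similarity('gene', 'promoter'))       # cosine similarity of two words
print(model.most_similar('bacteria', topn=5))     # semantically closest words
</pre>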
Deep Neural Network
The Neural Network Model takes inspiration from the biological nervous system to make its predictions. It is an appropriate strategy for modeling complex processes and is able to learn from experience. Neural Networks generally require a large amount of data to be fully trained.
In recent years, deep neural networks have yielded immense success. However, their use in recommender systems has received relatively less scrutiny: to model the key factor in collaborative filtering, most approaches still resort to matrix factorization and apply an inner product on the latent features of searches and items. In our collaborative recommendation system, the matrix constructed by inner products would be quite sparse, since a large number of projects in different styles exist. The traditional PageRank algorithm won't perform well in this recommendation setting. Empirical evidence shows that using deeper layers of neural networks offers better recommendation performance.
In our search & recommendation system, we use supervised learning: first, the samples and labels are defined, and the weight and bias variables of each layer are initialized. Then, we build a neural network model based on TensorFlow, an excellent framework for machine learning. During training, we use the back-propagation algorithm to minimize the cost function and adjust the model's parameters. The specific algorithm will be explained in the following part. For more details of neural networks, see: <a href="https://en.wikipedia.org/wiki/Artificial_neural_network"> wiki page of Artificial Neural Network </a>
Ball tree
Since the KD Tree algorithm can hardly perform well on high-dimensional data, we use the Ball Tree algorithm, which is rather efficient when searching for the most similar items. It expedites nearest-neighbor search queries, in which the objective is to find the k points in the tree that are closest to a given test point by some distance metric, so users get recommendations instantly after they enter a keyword of interest into our system.
<img class="img-same-line" src="https://static.igem.org/mediawiki/2018/e/e6/T--SYSU-Software--model4L.png" /> <img class="img-same-line" src="https://static.igem.org/mediawiki/2018/2/2c/T--SYSU-Software--model4R.png" />
A ball tree uses K-dimensional hyperspheres to cover the observation points and organizes them in a tree. Graph (a) shows a 2-D plane with 16 observation instances. Graph (b) is its corresponding ball tree, where the number in each node represents the number of observations it contains. Circles at different levels are drawn in different styles. Each node in the tree corresponds to a circle, and the number in the node represents the number of observation points in that region. When using a ball tree, the leaf node containing the target is first found from top to bottom, and the nearest observation point within it is located. This distance is the upper bound of the nearest-neighbor distance. We then check whether any sibling node contains an observation point closer than this upper bound.
In our Recommendation System, we use the Ball Tree implementation in scikit-learn, a simple and efficient open-source machine learning module in Python. For more details of the Ball Tree algorithm, see: <a href="https://en.wikipedia.org/wiki/Ball_tree"> wiki page of Ball Tree </a>
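A minimal usage sketch of scikit-learn's BallTree is shown below; the random project vectors and query vector simply stand in for the real 64-dimensional embeddings described later.
<pre>
import numpy as np
from sklearn.neighbors import BallTree

project_vectors = np.random.rand(2000, 64)   # placeholder: one 64-dim vector per project
tree = BallTree(project_vectors)              # Euclidean metric by default

query = np.random.rand(1, 64)                 # placeholder: vector for a user's keyword
dist, idx = tree.query(query, k=10)           # the 10 closest projects
print(idx[0])                                 # indices of the recommended projects
</pre>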
Algorithms used in the System
Deep Neural Network Architecture
Neural Network Basics
Details are omitted here. For readers who are interested, we have provided a link to Wikipedia in the section above.
Data preprocessing
Every year, hundreds of teams participate in the iGEM competition, and a large body of synthetic biology literature is published. We have collected them in our database, including projects from previous years' iGEM teams and well-known journals. Users can search for interesting projects by keywords and get inspiration for designing or optimizing their own designs.
- We extract thousands of keywords by putting the wiki page or description file of each project into Microsoft's Azure keyword extraction tool, and use all these keywords and projects to form a huge, sparse matrix. Formally, it is an n * m matrix A, where n is the number of key words and m is the number of projects. The element at row i, column j is the frequency of occurrence of the ith keyword in the jth project.
- All keywords extracted in step 1 are expressed as 300-dimensional word vectors through our Word2Vec model.
- The project corresponding to the maximum value of each row in matrix A is extracted as the label of that keyword, which also serves as the training sample for the supervised learning below (a minimal sketch of this preprocessing follows the list).
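The following sketch illustrates the preprocessing above; the keyword list, tokenised project texts and counting scheme are placeholders, and only the matrix layout and the argmax labelling follow the steps just described.
<pre>
import numpy as np

keywords = ['crispr', 'biosensor', 'promoter']        # placeholder: n extracted keywords
projects = [['crispr', 'cas9', 'promoter'],           # placeholder: m tokenised projects
            ['biosensor', 'promoter', 'promoter']]

n, m = len(keywords), len(projects)
A = np.zeros((n, m))                                   # keyword-by-project frequency matrix
for j, tokens in enumerate(projects):
    for i, kw in enumerate(keywords):
        A[i, j] = tokens.count(kw)                     # frequency of keyword i in project j

labels = A.argmax(axis=1)                              # label each keyword with its dominant project
</pre>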
We mainly trained 2 DNNs:
<img src="https://static.igem.org/mediawiki/2018/1/12/T--SYSU-Software--mathplus2.png" />
- Through the Encoder-Decoder Model, we have trained on the complete features of all the projects we collected, and express the 1021 features of each project as a 64-dimensional vector, so that our final recommendation tree can be based on a denser database.
- We have trained on the keywords and their corresponding projects.
Both DNNs have an input dimension equal to that of their feature vectors, 3 hidden layers of (512, 256, 64) and (256, 128, 64) neurons respectively, and a final output layer.
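A sketch of these two architectures in TensorFlow 1.x is given below; the input dimensions (1021 project features and 300-dimensional word vectors) and hidden layer sizes follow the text, while the 64-dimensional output layer and sigmoid activations are repeated here from later statements in this section and should be read as assumptions.
<pre>
import tensorflow as tf

def build_dnn(x, hidden_sizes, out_dim=64):
    """Stack fully-connected sigmoid layers, then a linear output layer."""
    h = x
    for units in hidden_sizes:
        h = tf.layers.dense(h, units, activation=tf.nn.sigmoid)
    return tf.layers.dense(h, out_dim)

# DNN 1: project features (1021-dim) -> hidden layers (512, 256, 64) -> 64-dim output.
x_proj = tf.placeholder(tf.float32, [None, 1021])
y_proj = build_dnn(x_proj, (512, 256, 64))

# DNN 2: word vectors (300-dim) -> hidden layers (256, 128, 64) -> 64-dim output.
x_word = tf.placeholder(tf.float32, [None, 300])
y_word = build_dnn(x_word, (256, 128, 64))
</pre>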
Training
- Initialize the weights ($w_{kj}$) and biases ($b$); these describe the trainable parameters of each layer.
- Forward propagation: Our input dimensions equal those of the team features and the word vectors respectively, all of which are passed through 3 hidden layers [(512, 256, 64) & (256, 128, 64)] for training. Each neuron is the weighted sum of the previous layer's inputs: $A_{ij} = \sum_{k=1}^n w_{kj}x_k + b$ The outputs of one layer are the inputs of the next layer, and the output layer is likewise the weighted sum of the values of the neurons in the layer before it. The weight calculation ($w_{kj}$) is a matrix multiplication, performed with "matmul" in TensorFlow.
- Each layer of the neural network has an activation function to provide a nonlinear effect. In building the search & recommendation system, we found after several attempts that the sigmoid function achieves the desired effect.
- Cost Function: The difference between the predicted result and the real value: $$J(\theta)=\frac1{2m}\sum_{i=1}^{m}(y^i-h_\theta(x^i))^2$$
- Gradient Descent: Use back-propagation to obtain the derivative of the cost function with respect to each parameter (each element of the weight matrices). $$\frac{\partial J(\theta)}{\partial\theta_j}=-\frac1m\sum_{i=1}^m(y^i-h_\theta(x^i))x^i_j $$ The optimization objective is to minimize our cost function. Our model was trained using the AdamGrad Optimizer with a batch size of 256 and an initial learning rate of 0.01. Training was stopped upon loss convergence. (A minimal training sketch follows this list.)
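The following is a self-contained training sketch of the steps above (TensorFlow 1.x), using the keyword network as an example; the random training data is a placeholder, and tf.train.AdamOptimizer is used as an assumption for the "AdamGrad" optimizer mentioned in the text.
<pre>
import numpy as np
import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 300])               # word-vector inputs
y_true = tf.placeholder(tf.float32, [None, 64])            # 64-dim training labels

h = x
for units in (256, 128, 64):                                # hidden layers with sigmoid activation
    h = tf.layers.dense(h, units, activation=tf.nn.sigmoid)
y_pred = tf.layers.dense(h, 64)                             # output layer

cost = tf.reduce_mean(tf.square(y_true - y_pred)) / 2       # J(theta) as defined above
train_op = tf.train.AdamOptimizer(0.01).minimize(cost)      # assumption: Adam stands in for "AdamGrad"

data_x = np.random.rand(10000, 300).astype(np.float32)      # placeholder training data
data_y = np.random.rand(10000, 64).astype(np.float32)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for epoch in range(1000):                                # in practice: stop when the cost converges
        batch = np.random.choice(len(data_x), 256)           # batch size 256
        _, c = sess.run([train_op, cost],
                        feed_dict={x: data_x[batch], y_true: data_y[batch]})
</pre>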
Our training process can be vividly visualized in the following figure through TensorBoard, a powerful visualization tool; it also shows that our model reaches a good result, with the cost function converging after thousands of training epochs.
<img src="https://static.igem.org/mediawiki/2018/1/1b/T--SYSU-Software--mathplus.mov" />
Collaborative Filtering
This is the final step in constructing our Recommendation System; overall, we use a collaborative filtering strategy to make predictions. Here is the algorithm:
Train word vector model
We fed our word vector model with the Google News corpus. After the training process is done, we have a model that can convert words into numerical vectors, i.e. $f:\Gamma\rightarrow R^{300}$, where $\Gamma$ denotes the corpus space.
Transform the user's input
We take the user's input word k and convert it into a normalized word vector v, i.e. $v=\frac{f(k)}{|f(k)|}$. That is, we feed the user's keyword into our pre-loaded word2Vec interface, and then pass the generated word vector through the trained neural network to get a 64-dimensional vector V that represents the keyword.
Search K most similar projects (Build the Ball Tree)
Through the neural networks, we compress the complete features of each project into 64-dimensional vectors and construct the Ball Tree from all these 64-dimensional project vectors. Details are omitted here; for readers who are interested, we have provided a link to Wikipedia in the section above.
Make Recommendation
Through the Ball Tree we constructed, we calculate the Euclidean distance between the vector V and the projects in our database. The system then recommends the 10 projects with the smallest Euclidean distance.
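Putting the pieces together, the sketch below outlines the full recommendation step; `word_vectors` (the gensim model), `keyword_net` (the trained keyword DNN wrapped as a callable) and `project_codes` (the 64-dimensional project vectors) are hypothetical handles to the components described above.
<pre>
import numpy as np
from sklearn.neighbors import BallTree

def recommend(keyword, word_vectors, keyword_net, project_codes, k=10):
    """Return the indices of the k projects closest to the user's keyword."""
    v = word_vectors[keyword]                  # 300-dim word vector f(k)
    v = v / np.linalg.norm(v)                  # normalize: v = f(k) / |f(k)|
    V = keyword_net(v.reshape(1, -1))          # 64-dim representation of the keyword
    tree = BallTree(project_codes)             # in practice the tree is built once and reused
    _, idx = tree.query(np.asarray(V).reshape(1, -1), k=k)
    return idx[0]                              # indices of the k recommended projects
</pre>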
<img src="https://static.igem.org/mediawiki/2018/3/32/T--SYSU-Software--redgirl.png" />