(11 intermediate revisions by 4 users not shown) | |||
Line 5: | Line 5: | ||
<meta name="viewport" content="width=device-width, initial-scale=1, maximum-scale=1"> | <meta name="viewport" content="width=device-width, initial-scale=1, maximum-scale=1"> | ||
<link href="https://2018.igem.org/wiki/index.php?title=Template:NEFU_China/CSS-menu&action=raw&ctype=text/css" rel="stylesheet" type="text/css"> | <link href="https://2018.igem.org/wiki/index.php?title=Template:NEFU_China/CSS-menu&action=raw&ctype=text/css" rel="stylesheet" type="text/css"> | ||
− | |||
<link href="https://2018.igem.org/wiki/index.php?title=Template:NEFU_China/CSS-software2&action=raw&ctype=text/css" rel="stylesheet" type="text/css"> | <link href="https://2018.igem.org/wiki/index.php?title=Template:NEFU_China/CSS-software2&action=raw&ctype=text/css" rel="stylesheet" type="text/css"> | ||
Line 125: | Line 124: | ||
<div id="menu" style="background-color:rgba(0,0,0,0.6)!important"> | <div id="menu" style="background-color:rgba(0,0,0,0.6)!important"> | ||
<li id="nav" style="left: 8%!important; width: 100%!important;">           | <li id="nav" style="left: 8%!important; width: 100%!important;">           | ||
− | + | ||
<ul class="firstmenu" style="float: left"> | <ul class="firstmenu" style="float: left"> | ||
Line 136: | Line 135: | ||
<ul id="sub_02"> | <ul id="sub_02"> | ||
<li><a href="https://2018.igem.org/Team:NEFU_China/Background" target="_self">BACKGROUND</a></li> | <li><a href="https://2018.igem.org/Team:NEFU_China/Background" target="_self">BACKGROUND</a></li> | ||
− | <li><a href="https://2018.igem.org/Team:NEFU_China/Description" target="_self">DESCRIPTION | + | <li><a href="https://2018.igem.org/Team:NEFU_China/Description" target="_self">DESCRIPTION & DESIGN</a></li> |
− | + | <li><a href="https://2018.igem.org/Team:NEFU_China/Coding book" target="_self">CODE BOOK</a></li> | |
− | <li><a href="https://2018.igem.org/Team:NEFU_China/Coding book" target="_self"> | + | |
</ul> | </ul> | ||
</li> | </li> | ||
− | |||
− | |||
<li class="mainlevel" id="mainlevel_03"> | <li class="mainlevel" id="mainlevel_03"> | ||
− | <a href="https://2018.igem.org/Team:NEFU_China/ | + | <a href="https://2018.igem.org/Team:NEFU_China/Demonstrate"><img id="parts" src="https://static.igem.org/mediawiki/2018/6/62/T--NEFU_China--_RESULTS.png">EXPERIMENTS</a> |
<ul id="sub_03"> | <ul id="sub_03"> | ||
<li><a href="https://2018.igem.org/Team:NEFU_China/Lock_Key" target="_self">LOCK & KEY</a></li> | <li><a href="https://2018.igem.org/Team:NEFU_China/Lock_Key" target="_self">LOCK & KEY</a></li> | ||
− | <li><a href="https://2018.igem.org/Team:NEFU_China/Suicide" target="_self"> | + | <li><a href="https://2018.igem.org/Team:NEFU_China/Suicide" target="_self">INFORMATION DESTRUCTION</a></li> |
− | <li><a href="https://2018.igem.org/Team:NEFU_China/Splicing" target="_self">SPLICING</a></li> | + | <li><a href="https://2018.igem.org/Team:NEFU_China/Splicing" target="_self">Pre-RNA SPLICING</a></li> |
<li><a href="https://2018.igem.org/Team:NEFU_China/Demonstrate" target="_self">DEMONSTRATE</a></li> | <li><a href="https://2018.igem.org/Team:NEFU_China/Demonstrate" target="_self">DEMONSTRATE</a></li> | ||
<hr> | <hr> | ||
Line 165: | Line 161: | ||
<ul id="sub_05"> | <ul id="sub_05"> | ||
<li><a href="https://2018.igem.org/Team:NEFU_China/Model" target="_self">OVERVIEW</a></li> | <li><a href="https://2018.igem.org/Team:NEFU_China/Model" target="_self">OVERVIEW</a></li> | ||
− | <li><a href="https://2018.igem.org/Team:NEFU_China/Model1" target="_self"> | + | <li><a href="https://2018.igem.org/Team:NEFU_China/Model1" target="_self">CORRESPONDING COEFFICIENT</a></li> |
− | <li><a href="https://2018.igem.org/Team:NEFU_China/Model2" target="_self"> | + | <li><a href="https://2018.igem.org/Team:NEFU_China/Model2" target="_self">KILLING MODEL</a></li> |
</ul> | </ul> | ||
</li> | </li> | ||
Line 183: | Line 179: | ||
<li><a href="https://2018.igem.org/Team:NEFU_China/Attributions" target="_self">ATTRIBUTIONS</a></li> | <li><a href="https://2018.igem.org/Team:NEFU_China/Attributions" target="_self">ATTRIBUTIONS</a></li> | ||
<li><a href="https://2018.igem.org/Team:NEFU_China/Members" target="_self">MEMBERS</a></li> | <li><a href="https://2018.igem.org/Team:NEFU_China/Members" target="_self">MEMBERS</a></li> | ||
− | |||
− | |||
</ul> | </ul> | ||
</li> | </li> | ||
Line 190: | Line 184: | ||
<a href="https://2018.igem.org/Team:NEFU_China/Human_Practices"><img id="humanpractice" src="https://static.igem.org/mediawiki/2018/9/91/T--NEFU_China--_HUMANPRACTICE.png">HUMAN PRACTICE</a> | <a href="https://2018.igem.org/Team:NEFU_China/Human_Practices"><img id="humanpractice" src="https://static.igem.org/mediawiki/2018/9/91/T--NEFU_China--_HUMANPRACTICE.png">HUMAN PRACTICE</a> | ||
<ul id="sub_08"> | <ul id="sub_08"> | ||
+ | <li><a href="https://2018.igem.org/Team:NEFU_China/Human_Practices" target="_self">OVERVIEW</a></li> | ||
<li><a href="https://2018.igem.org/Team:NEFU_China/Gold_integrated" target="_self">GOLD INTEGRATED</a></li> | <li><a href="https://2018.igem.org/Team:NEFU_China/Gold_integrated" target="_self">GOLD INTEGRATED</a></li> | ||
<li><a href="https://2018.igem.org/Team:NEFU_China/Silver" target="_self">SILVER</a></li> | <li><a href="https://2018.igem.org/Team:NEFU_China/Silver" target="_self">SILVER</a></li> | ||
Line 210: | Line 205: | ||
var context =canvas.getContext("2d"); | var context =canvas.getContext("2d"); | ||
var W = window.innerWidth; | var W = window.innerWidth; | ||
− | var H = | + | var H = 15000; |
//var H = window.innerHeight*1.5; | //var H = window.innerHeight*1.5; | ||
canvas.width = W; | canvas.width = W; | ||
Line 304: | Line 299: | ||
<br> | <br> | ||
<p> | <p> | ||
− | <strong>1.Preprocessing entered data.</strong><br> | + | <strong>1.Preprocessing entered data.</strong> |
+ | <br> | ||
Algorithm: Determine whether there are numbers in the entered data and whether there are empty data.<br> | Algorithm: Determine whether there are numbers in the entered data and whether there are empty data.<br> | ||
The English sentences that we entered have no Spaces. And they satisfy the following conditions:<br> | The English sentences that we entered have no Spaces. And they satisfy the following conditions:<br> | ||
Line 311: | Line 307: | ||
b. The length of English sentences is no longer than 50. | b. The length of English sentences is no longer than 50. | ||
As the length of English sentences grows longer, other introns appear in the codons, and this information will be cut in the organism. The probability of appearing introns increases with length. Therefore, we limit the length of the sentence to 50 characters. | As the length of English sentences grows longer, other introns appear in the codons, and this information will be cut in the organism. The probability of appearing introns increases with length. Therefore, we limit the length of the sentence to 50 characters. | ||
− | <br><br> | + | <br> |
+ | <br> | ||
<strong>2.Find all possible word combinations.</strong><br> | <strong>2.Find all possible word combinations.</strong><br> | ||
Algorithm: Achieve efficient word graph scanning based on a prefix tree structure. Build a DAG(Directed Acyclic Graph) to obtain all English word segmentation results.<br> | Algorithm: Achieve efficient word graph scanning based on a prefix tree structure. Build a DAG(Directed Acyclic Graph) to obtain all English word segmentation results.<br> | ||
Line 317: | Line 314: | ||
A prefix tree, is a kind of search tree. We use the prefix tree for word frequency statistics. In this prefix tree, the root is associated with the empty string, and all the descendants of a node have a common prefix. In the search, we start from the root node and connect each character passing through the road as the corresponding string of the node. Using the prefix tree structure allows us to improve the speed of query words and the efficiency of search.<br> | A prefix tree, is a kind of search tree. We use the prefix tree for word frequency statistics. In this prefix tree, the root is associated with the empty string, and all the descendants of a node have a common prefix. In the search, we start from the root node and connect each character passing through the road as the corresponding string of the node. Using the prefix tree structure allows us to improve the speed of query words and the efficiency of search.<br> | ||
The figure is a prefix tree structure of ever, very, every, go and good.<br> | The figure is a prefix tree structure of ever, very, every, go and good.<br> | ||
+ | <br> | ||
+ | <div align="center"> | ||
+ | <img src="https://static.igem.org/mediawiki/2018/e/e2/T--NEFU_China--tree.png"> | ||
+ | </div> | ||
+ | <p> | ||
After constructing the prefix tree, we build a directed acyclic graph. In our DAG, each node is a letter, and each letter points to its next letter. If we detect a word, we add a directed edge between the first letter and the last letter of the word. Therefore, we can traverse all the segmentation results according to the DAG we constructed.<br> | After constructing the prefix tree, we build a directed acyclic graph. In our DAG, each node is a letter, and each letter points to its next letter. If we detect a word, we add a directed edge between the first letter and the last letter of the word. Therefore, we can traverse all the segmentation results according to the DAG we constructed.<br> | ||
− | For example, if the sentence is "Iwanttostudybiologyeveryday", assumes that the word in the prefix tree as follows:<br> | + | For example, if the sentence is "Iwanttostudybiologyeveryday", assumes that the word in the prefix tree as follows: |
+ | </p> | ||
+ | <br> | ||
+ | <p> | ||
“and, biology, cat, day, ever, every, everyday, fat, guess, he, I, jeep, kangaroo, learn, mom, noon, opt, peach, quit, rust, study, to, you, very, want, xylan, yellow, zoom”<br> | “and, biology, cat, day, ever, every, everyday, fat, guess, he, I, jeep, kangaroo, learn, mom, noon, opt, peach, quit, rust, study, to, you, very, want, xylan, yellow, zoom”<br> | ||
− | Then the DAG is as shown in the figure.<br> | + | </p> |
− | <br><br> | + | <p> Then the DAG is as shown in the figure.<br></p> |
+ | <br> | ||
+ | <div align="center"> | ||
+ | <img src="https://static.igem.org/mediawiki/2018/d/d0/T--NEFU_China--dfs.png"> | ||
+ | </div> | ||
+ | <br> | ||
+ | <br> | ||
+ | <p> | ||
<strong>3.Find the most probable combination</strong><br> | <strong>3.Find the most probable combination</strong><br> | ||
Algorithm: Use IF-IDF(Term Frequency-Inverse Document Frequency) model and maximum sharding method to obtain the optimal word segmentation results.<br> | Algorithm: Use IF-IDF(Term Frequency-Inverse Document Frequency) model and maximum sharding method to obtain the optimal word segmentation results.<br> | ||
TF-IDF is a statistical method that can be used to assess how important a word is to this document. The importance of a word increases proportionately as it appears in a document, but decreases inversely as it appears in the corpus. Its formula is as follows:<br> | TF-IDF is a statistical method that can be used to assess how important a word is to this document. The importance of a word increases proportionately as it appears in a document, but decreases inversely as it appears in the corpus. Its formula is as follows:<br> | ||
+ | </p> | ||
+ | <br> | ||
+ | <div align="center"> | ||
+ | <img src="https://static.igem.org/mediawiki/2018/b/be/T--NEFU_China--software3-01.png"> | ||
+ | </div> | ||
+ | <br> | ||
+ | <p> | ||
+ | |||
We can obtain the possibility of each word segmentation method by calculating TF-IDF, so as to find the most probable combination. In order to improve the efficiency of calculation, we use the idea of dynamic programming to carry out calculation.<br> | We can obtain the possibility of each word segmentation method by calculating TF-IDF, so as to find the most probable combination. In order to improve the efficiency of calculation, we use the idea of dynamic programming to carry out calculation.<br> | ||
− | For example, for the word “everyday”, we have the possible word combinations: “ever, every, very, day, everyday”.<br> | + | For example, for the word “everyday”, we have the possible word combinations: “ever, every, very, day, everyday”. |
+ | |||
+ | </p> | ||
+ | <br> | ||
+ | <br> | ||
+ | <div align="center"> | ||
+ | <img src="https://static.igem.org/mediawiki/2018/9/90/T--NEFU_China--software3-02.png"> | ||
+ | </div> | ||
+ | <br> | ||
+ | <p> | ||
We need to calculate:<br> | We need to calculate:<br> | ||
+ | </p> | ||
+ | <p> | ||
(“xxx” means the TF-IDF value of the word xxx)<br> | (“xxx” means the TF-IDF value of the word xxx)<br> | ||
route[9]=0; e-v-e-r-y-d-a-y<br> | route[9]=0; e-v-e-r-y-d-a-y<br> | ||
Line 337: | Line 368: | ||
= “everyday”+route[9] everyday<br> | = “everyday”+route[9] everyday<br> | ||
So, the most probable combination is “everyday”. | So, the most probable combination is “everyday”. | ||
+ | </p> | ||
<br><br> | <br><br> | ||
</p> | </p> | ||
Line 345: | Line 377: | ||
</div> | </div> | ||
<hr> | <hr> | ||
+ | |||
<div class="row results"> | <div class="row results"> | ||
<div class="three columns header-col"> | <div class="three columns header-col"> | ||
Line 379: | Line 412: | ||
<br> | <br> | ||
+ | |||
+ | <div align="center"> | ||
+ | <img src="https://static.igem.org/mediawiki/2018/f/fe/T--NEFU_China--software3-04.png" style="width: 660px;"> | ||
+ | </div> | ||
+ | <br> | ||
</p> | </p> | ||
</div> | </div> | ||
+ | </div> | ||
<div class="row item"> | <div class="row item"> | ||
− | + | <div class="twelve columns"> | |
<h3>Print best result</h3> | <h3>Print best result</h3> | ||
<p> | <p> | ||
Line 420: | Line 459: | ||
<br> | <br> | ||
+ | |||
+ | <div align="center"> | ||
+ | <img src="https://static.igem.org/mediawiki/2018/b/bc/T--NEFU_China--software3-03.png" style="width: 660px;"> | ||
+ | </div> | ||
</p> | </p> | ||
Line 427: | Line 470: | ||
</div> | </div> | ||
</div> | </div> | ||
− | + | <hr> | |
<div class="row others"> | <div class="row others"> | ||
<div class="three columns header-col"> | <div class="three columns header-col"> | ||
Line 458: | Line 501: | ||
<a href="https://github.com/igemsoftware2018/Team_NEFU_China/tree/master/6.Visualization">2.Visualization</a><br> | <a href="https://github.com/igemsoftware2018/Team_NEFU_China/tree/master/6.Visualization">2.Visualization</a><br> | ||
<a href="https://github.com/igemsoftware2018/Team_NEFU_China/tree/master/8.MainWindow.exe">3.Visualizaiton.exe</a><br> | <a href="https://github.com/igemsoftware2018/Team_NEFU_China/tree/master/8.MainWindow.exe">3.Visualizaiton.exe</a><br> | ||
− | + | </p> | |
− | + | </div> | |
− | + | </div> | |
− | + | </div> | |
− | + | ||
− | + | </div> | |
</section> | </section> | ||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
</body> | </body> | ||
</html> | </html> |
Latest revision as of 12:45, 9 November 2018
intorduction
Aim
Segment English Sentences without Spaces.
In the Coding book, the space character has no corresponding codon. Therefore, we need to segment English sentences without Spaces. For example, the input sentence is "Iwanttostudybiologyeveryday", the output of the results should be: "i want to study biology everyday". In our program, firstly we implement word graph scanning based on prefix tree structure and construct DAG(Directed Acyclic Graph) to obtain all English word segmentation results, secondly we use IF-IDF(Term Frequency-Inverse Document Frequency) model and maximum sharding method to obtain the optimal word segmentation results.
Programming
1.Preprocessing entered data.
Algorithm: Determine whether there are numbers in the entered data and whether there are empty data.
The English sentences that we entered have no Spaces. And they satisfy the following conditions:
a.The English sentences do not contain any characters except letters, such as numbers, punctuation marks, etc.
In the construction of our Coding Book, we establish the correspondence between letters and codons. Therefore, the digital and punctuation information will be lost during the period of decryption. Our software does not support these characters.
b. The length of English sentences is no longer than 50.
As the length of English sentences grows longer, other introns appear in the codons, and this information will be cut in the organism. The probability of appearing introns increases with length. Therefore, we limit the length of the sentence to 50 characters.
2.Find all possible word combinations.
Algorithm: Achieve efficient word graph scanning based on a prefix tree structure. Build a DAG(Directed Acyclic Graph) to obtain all English word segmentation results.
In this process, we constructed a prefix tree tree and a directed acyclic graph.
A prefix tree, is a kind of search tree. We use the prefix tree for word frequency statistics. In this prefix tree, the root is associated with the empty string, and all the descendants of a node have a common prefix. In the search, we start from the root node and connect each character passing through the road as the corresponding string of the node. Using the prefix tree structure allows us to improve the speed of query words and the efficiency of search.
The figure is a prefix tree structure of ever, very, every, go and good.
After constructing the prefix tree, we build a directed acyclic graph. In our DAG, each node is a letter, and each letter points to its next letter. If we detect a word, we add a directed edge between the first letter and the last letter of the word. Therefore, we can traverse all the segmentation results according to the DAG we constructed.
For example, if the sentence is "Iwanttostudybiologyeveryday", assumes that the word in the prefix tree as follows:
“and, biology, cat, day, ever, every, everyday, fat, guess, he, I, jeep, kangaroo, learn, mom, noon, opt, peach, quit, rust, study, to, you, very, want, xylan, yellow, zoom”
Then the DAG is as shown in the figure.
3.Find the most probable combination
Algorithm: Use IF-IDF(Term Frequency-Inverse Document Frequency) model and maximum sharding method to obtain the optimal word segmentation results.
TF-IDF is a statistical method that can be used to assess how important a word is to this document. The importance of a word increases proportionately as it appears in a document, but decreases inversely as it appears in the corpus. Its formula is as follows:
We can obtain the possibility of each word segmentation method by calculating TF-IDF, so as to find the most probable combination. In order to improve the efficiency of calculation, we use the idea of dynamic programming to carry out calculation.
For example, for the word “everyday”, we have the possible word combinations: “ever, every, very, day, everyday”.
We need to calculate:
(“xxx” means the TF-IDF value of the word xxx)
route[9]=0; e-v-e-r-y-d-a-y
...
route[6]=max(“day”+route[9], “d”+route[7])=“day”+route[9] e-v-e-r-y-day
...
route[2]=max(“very+route[6], “v”+route[3])=“very+route[6] e-very-day
route[1]=max(“ever”+route[5], “every”+route[6], “everyday”+route[9])
= “everyday”+route[9] everyday
So, the most probable combination is “everyday”.
RESULTS
Warnings.
Input:
please input the sentences:
Output:
Warning: The entered data is empty.
Input:
please input the sentences:Iwanttostudy888biologyeveryday.
Output:
Warning: Please enter letters, you may enter some numbers.
Print DAG and all results.
Input:
please input the sentences:Iwanttostudybiologyeveryday.
Output:
DAG:
{0: [0], 1: [4], 2: [2], 3: [3], 4: [4], 5: [6], 6: [6], 7: [11], 8: [8], 9: [9], 10: [10], 11: [11], 12: [18], 13: [13], 14: [14], 15: [17], 16: [16], 17: [17], 18: [18], 19: [22, 23, 26], 20: [23], 21: [21], 22: [22], 23: [23], 24: [26], 25: [25], 26: [26], 27: [27]}
All results: i want to study biology log y ever every everyday very day .
Print best result
Route:
{
28: (0, 0),
27: (0, 27),
26: (0, 26),
25: (9.857849805800359e-08, 25),
24: (2.3106801949289814e-06, 26),
23: (2.3106801949289814e-06, 23),
22: (2.3106801949289814e-06, 22),
21: (2.3106801949289814e-06, 21),
20: (4.671241819750759e-06, 23),
19: (0.00011031439602868175, 26),
18: (0.00011031439602868175, 18),
17: (0.00011031439602868175, 17),
16: (0.00011031439602868175, 16),
15: (0.00022651239742305776, 17),
14: (0.00022651239742305776, 14),
13: (0.00022651239742305776, 13),
12: (0.0002681925141215141, 18),
11: (0.0002681925141215141, 11),
10: (0.0002681925141215141, 10),
9: (0.0002681925141215141, 9),
8: (0.0002681925141215141, 8),
7: (0.0002727594331865744, 11),
6: (0.0002727594331865744, 6),
5: (0.00027285757717578014, 6),
4: (0.00027285757717578014, 4),
3: (0.00027285757717578014, 3),
2: (0.00027295615567383817, 2),
1: (0.0002747994303250776, 4),
0: (0.0002747994303250776, 0)}
Best result: I want to study biology everyday .
Others
Visual Software
We developed a visual software. There are an input textbox, an output textbox, two radio buttons and a translate button in the software interface. We can choose radio buttons to select letters to codons or codons to letters. In addition to these, our software can also provide open files, copy files, cut files, save files, print files and other basic functions.
Software interface:
Letters to Codons:
Codons to letters: