Line 299: | Line 299: | ||
<br> | <br> | ||
<p> | <p> | ||
− | <strong>1.Preprocessing entered data.</strong><br> | + | <strong>1.Preprocessing entered data.</strong> |
+ | <br> | ||
Algorithm: Determine whether there are numbers in the entered data and whether there are empty data.<br> | Algorithm: Determine whether there are numbers in the entered data and whether there are empty data.<br> | ||
The English sentences that we entered have no Spaces. And they satisfy the following conditions:<br> | The English sentences that we entered have no Spaces. And they satisfy the following conditions:<br> | ||
Line 306: | Line 307: | ||
b. The length of English sentences is no longer than 50. | b. The length of English sentences is no longer than 50. | ||
As the length of English sentences grows longer, other introns appear in the codons, and this information will be cut in the organism. The probability of appearing introns increases with length. Therefore, we limit the length of the sentence to 50 characters. | As the length of English sentences grows longer, other introns appear in the codons, and this information will be cut in the organism. The probability of appearing introns increases with length. Therefore, we limit the length of the sentence to 50 characters. | ||
− | <br><br> | + | <br> |
+ | <br> | ||
<strong>2.Find all possible word combinations.</strong><br> | <strong>2.Find all possible word combinations.</strong><br> | ||
Algorithm: Achieve efficient word graph scanning based on a prefix tree structure. Build a DAG(Directed Acyclic Graph) to obtain all English word segmentation results.<br> | Algorithm: Achieve efficient word graph scanning based on a prefix tree structure. Build a DAG(Directed Acyclic Graph) to obtain all English word segmentation results.<br> | ||
Line 318: | Line 320: | ||
<p> | <p> | ||
After constructing the prefix tree, we build a directed acyclic graph. In our DAG, each node is a letter, and each letter points to its next letter. If we detect a word, we add a directed edge between the first letter and the last letter of the word. Therefore, we can traverse all the segmentation results according to the DAG we constructed.<br> | After constructing the prefix tree, we build a directed acyclic graph. In our DAG, each node is a letter, and each letter points to its next letter. If we detect a word, we add a directed edge between the first letter and the last letter of the word. Therefore, we can traverse all the segmentation results according to the DAG we constructed.<br> | ||
− | For example, if the sentence is "Iwanttostudybiologyeveryday", assumes that the word in the prefix tree as follows:</p> | + | For example, if the sentence is "Iwanttostudybiologyeveryday", assumes that the word in the prefix tree as follows: |
− | <br> | + | </p> |
− | <p> | + | <br> |
− | “and, biology, cat, day, ever, every, everyday, fat, guess, he, I, jeep, kangaroo, learn, mom, noon, opt, peach, quit, rust, study, to, you, very, want, xylan, yellow, zoom”<br></p> | + | <p> |
− | + | “and, biology, cat, day, ever, every, everyday, fat, guess, he, I, jeep, kangaroo, learn, mom, noon, opt, peach, quit, rust, study, to, you, very, want, xylan, yellow, zoom”<br> | |
+ | </p> | ||
+ | <p> Then the DAG is as shown in the figure.<br></p> | ||
<br> | <br> | ||
<div align="center"> | <div align="center"> | ||
Line 329: | Line 333: | ||
<br> | <br> | ||
<br> | <br> | ||
− | <p> | + | <p> |
<strong>3.Find the most probable combination</strong><br> | <strong>3.Find the most probable combination</strong><br> | ||
Algorithm: Use IF-IDF(Term Frequency-Inverse Document Frequency) model and maximum sharding method to obtain the optimal word segmentation results.<br> | Algorithm: Use IF-IDF(Term Frequency-Inverse Document Frequency) model and maximum sharding method to obtain the optimal word segmentation results.<br> | ||
TF-IDF is a statistical method that can be used to assess how important a word is to this document. The importance of a word increases proportionately as it appears in a document, but decreases inversely as it appears in the corpus. Its formula is as follows:<br> | TF-IDF is a statistical method that can be used to assess how important a word is to this document. The importance of a word increases proportionately as it appears in a document, but decreases inversely as it appears in the corpus. Its formula is as follows:<br> | ||
− | </p> | + | </p> |
<br> | <br> | ||
<div align="center"> | <div align="center"> | ||
<img src="https://static.igem.org/mediawiki/2018/b/be/T--NEFU_China--software3-01.png"> | <img src="https://static.igem.org/mediawiki/2018/b/be/T--NEFU_China--software3-01.png"> | ||
− | + | </div> | |
<br> | <br> | ||
− | <p> | + | <p> |
We can obtain the possibility of each word segmentation method by calculating TF-IDF, so as to find the most probable combination. In order to improve the efficiency of calculation, we use the idea of dynamic programming to carry out calculation.<br> | We can obtain the possibility of each word segmentation method by calculating TF-IDF, so as to find the most probable combination. In order to improve the efficiency of calculation, we use the idea of dynamic programming to carry out calculation.<br> | ||
For example, for the word “everyday”, we have the possible word combinations: “ever, every, very, day, everyday”. | For example, for the word “everyday”, we have the possible word combinations: “ever, every, very, day, everyday”. | ||
− | </p> | + | </p> |
− | <br> | + | <br> |
<br> | <br> | ||
<div align="center"> | <div align="center"> | ||
Line 351: | Line 355: | ||
</div> | </div> | ||
<br> | <br> | ||
− | <p> | + | <p> |
− | We need to calculate:<br></p> | + | We need to calculate:<br> |
− | <p> | + | </p> |
+ | <p> | ||
(“xxx” means the TF-IDF value of the word xxx)<br> | (“xxx” means the TF-IDF value of the word xxx)<br> | ||
route[9]=0; e-v-e-r-y-d-a-y<br> | route[9]=0; e-v-e-r-y-d-a-y<br> | ||
Line 362: | Line 367: | ||
route[1]=max(“ever”+route[5], “every”+route[6], “everyday”+route[9])<br> | route[1]=max(“ever”+route[5], “every”+route[6], “everyday”+route[9])<br> | ||
= “everyday”+route[9] everyday<br> | = “everyday”+route[9] everyday<br> | ||
− | So, the most probable combination is “everyday”.</p> | + | So, the most probable combination is “everyday”. |
+ | </p> | ||
<br><br> | <br><br> | ||
</p> | </p> | ||
Line 408: | Line 414: | ||
<div align="center"> | <div align="center"> | ||
− | <img src="https://static.igem.org/mediawiki/2018/f/fe/T--NEFU_China--software3-04.png" style=" | + | <img src="https://static.igem.org/mediawiki/2018/f/fe/T--NEFU_China--software3-04.png" style="width: 660px;"> |
− | + | ||
− | "> | + | |
</div> | </div> | ||
<br> | <br> | ||
Line 457: | Line 461: | ||
<div align="center"> | <div align="center"> | ||
− | <img src="https://static.igem.org/mediawiki/2018/b/bc/T--NEFU_China--software3-03.png" style=" | + | <img src="https://static.igem.org/mediawiki/2018/b/bc/T--NEFU_China--software3-03.png" style="width: 660px;"> |
− | + | ||
− | "> | + | |
</div> | </div> | ||
</p> | </p> | ||
Line 499: | Line 501: | ||
<a href="https://github.com/igemsoftware2018/Team_NEFU_China/tree/master/6.Visualization">2.Visualization</a><br> | <a href="https://github.com/igemsoftware2018/Team_NEFU_China/tree/master/6.Visualization">2.Visualization</a><br> | ||
<a href="https://github.com/igemsoftware2018/Team_NEFU_China/tree/master/8.MainWindow.exe">3.Visualizaiton.exe</a><br> | <a href="https://github.com/igemsoftware2018/Team_NEFU_China/tree/master/8.MainWindow.exe">3.Visualizaiton.exe</a><br> | ||
+ | </p> | ||
+ | </div> | ||
+ | </div> | ||
+ | </div> | ||
+ | </div> | ||
+ | </section> | ||
</body> | </body> | ||
</html> | </html> |
Latest revision as of 12:45, 9 November 2018
intorduction
Aim
Segment English Sentences without Spaces.
In the Coding book, the space character has no corresponding codon. Therefore, we need to segment English sentences without Spaces. For example, the input sentence is "Iwanttostudybiologyeveryday", the output of the results should be: "i want to study biology everyday". In our program, firstly we implement word graph scanning based on prefix tree structure and construct DAG(Directed Acyclic Graph) to obtain all English word segmentation results, secondly we use IF-IDF(Term Frequency-Inverse Document Frequency) model and maximum sharding method to obtain the optimal word segmentation results.
Programming
1.Preprocessing entered data.
Algorithm: Determine whether there are numbers in the entered data and whether there are empty data.
The English sentences that we entered have no Spaces. And they satisfy the following conditions:
a.The English sentences do not contain any characters except letters, such as numbers, punctuation marks, etc.
In the construction of our Coding Book, we establish the correspondence between letters and codons. Therefore, the digital and punctuation information will be lost during the period of decryption. Our software does not support these characters.
b. The length of English sentences is no longer than 50.
As the length of English sentences grows longer, other introns appear in the codons, and this information will be cut in the organism. The probability of appearing introns increases with length. Therefore, we limit the length of the sentence to 50 characters.
2.Find all possible word combinations.
Algorithm: Achieve efficient word graph scanning based on a prefix tree structure. Build a DAG(Directed Acyclic Graph) to obtain all English word segmentation results.
In this process, we constructed a prefix tree tree and a directed acyclic graph.
A prefix tree, is a kind of search tree. We use the prefix tree for word frequency statistics. In this prefix tree, the root is associated with the empty string, and all the descendants of a node have a common prefix. In the search, we start from the root node and connect each character passing through the road as the corresponding string of the node. Using the prefix tree structure allows us to improve the speed of query words and the efficiency of search.
The figure is a prefix tree structure of ever, very, every, go and good.
After constructing the prefix tree, we build a directed acyclic graph. In our DAG, each node is a letter, and each letter points to its next letter. If we detect a word, we add a directed edge between the first letter and the last letter of the word. Therefore, we can traverse all the segmentation results according to the DAG we constructed.
For example, if the sentence is "Iwanttostudybiologyeveryday", assumes that the word in the prefix tree as follows:
“and, biology, cat, day, ever, every, everyday, fat, guess, he, I, jeep, kangaroo, learn, mom, noon, opt, peach, quit, rust, study, to, you, very, want, xylan, yellow, zoom”
Then the DAG is as shown in the figure.
3.Find the most probable combination
Algorithm: Use IF-IDF(Term Frequency-Inverse Document Frequency) model and maximum sharding method to obtain the optimal word segmentation results.
TF-IDF is a statistical method that can be used to assess how important a word is to this document. The importance of a word increases proportionately as it appears in a document, but decreases inversely as it appears in the corpus. Its formula is as follows:
We can obtain the possibility of each word segmentation method by calculating TF-IDF, so as to find the most probable combination. In order to improve the efficiency of calculation, we use the idea of dynamic programming to carry out calculation.
For example, for the word “everyday”, we have the possible word combinations: “ever, every, very, day, everyday”.
We need to calculate:
(“xxx” means the TF-IDF value of the word xxx)
route[9]=0; e-v-e-r-y-d-a-y
...
route[6]=max(“day”+route[9], “d”+route[7])=“day”+route[9] e-v-e-r-y-day
...
route[2]=max(“very+route[6], “v”+route[3])=“very+route[6] e-very-day
route[1]=max(“ever”+route[5], “every”+route[6], “everyday”+route[9])
= “everyday”+route[9] everyday
So, the most probable combination is “everyday”.
RESULTS
Warnings.
Input:
please input the sentences:
Output:
Warning: The entered data is empty.
Input:
please input the sentences:Iwanttostudy888biologyeveryday.
Output:
Warning: Please enter letters, you may enter some numbers.
Print DAG and all results.
Input:
please input the sentences:Iwanttostudybiologyeveryday.
Output:
DAG:
{0: [0], 1: [4], 2: [2], 3: [3], 4: [4], 5: [6], 6: [6], 7: [11], 8: [8], 9: [9], 10: [10], 11: [11], 12: [18], 13: [13], 14: [14], 15: [17], 16: [16], 17: [17], 18: [18], 19: [22, 23, 26], 20: [23], 21: [21], 22: [22], 23: [23], 24: [26], 25: [25], 26: [26], 27: [27]}
All results: i want to study biology log y ever every everyday very day .
Print best result
Route:
{
28: (0, 0),
27: (0, 27),
26: (0, 26),
25: (9.857849805800359e-08, 25),
24: (2.3106801949289814e-06, 26),
23: (2.3106801949289814e-06, 23),
22: (2.3106801949289814e-06, 22),
21: (2.3106801949289814e-06, 21),
20: (4.671241819750759e-06, 23),
19: (0.00011031439602868175, 26),
18: (0.00011031439602868175, 18),
17: (0.00011031439602868175, 17),
16: (0.00011031439602868175, 16),
15: (0.00022651239742305776, 17),
14: (0.00022651239742305776, 14),
13: (0.00022651239742305776, 13),
12: (0.0002681925141215141, 18),
11: (0.0002681925141215141, 11),
10: (0.0002681925141215141, 10),
9: (0.0002681925141215141, 9),
8: (0.0002681925141215141, 8),
7: (0.0002727594331865744, 11),
6: (0.0002727594331865744, 6),
5: (0.00027285757717578014, 6),
4: (0.00027285757717578014, 4),
3: (0.00027285757717578014, 3),
2: (0.00027295615567383817, 2),
1: (0.0002747994303250776, 4),
0: (0.0002747994303250776, 0)}
Best result: I want to study biology everyday .
Others
Visual Software
We developed a visual software. There are an input textbox, an output textbox, two radio buttons and a translate button in the software interface. We can choose radio buttons to select letters to codons or codons to letters. In addition to these, our software can also provide open files, copy files, cut files, save files, print files and other basic functions.
Software interface:
Letters to Codons:
Codons to letters: