Latest revision as of 12:45, 9 November 2018

Software 3

intorduction

Aim

Segment English Sentences without Spaces.

In the Coding book, the space character has no corresponding codon. Therefore, we need to segment English sentences without Spaces. For example, the input sentence is "Iwanttostudybiologyeveryday", the output of the results should be: "i want to study biology everyday". In our program, firstly we implement word graph scanning based on prefix tree structure and construct DAG(Directed Acyclic Graph) to obtain all English word segmentation results, secondly we use IF-IDF(Term Frequency-Inverse Document Frequency) model and maximum sharding method to obtain the optimal word segmentation results.

Programming

1.Preprocessing entered data.
Algorithm: Determine whether there are numbers in the entered data and whether there are empty data.
The English sentences that we entered have no Spaces. And they satisfy the following conditions:
a.The English sentences do not contain any characters except letters, such as numbers, punctuation marks, etc. In the construction of our Coding Book, we establish the correspondence between letters and codons. Therefore, the digital and punctuation information will be lost during the period of decryption. Our software does not support these characters.
b. The length of English sentences is no longer than 50. As the length of English sentences grows longer, other introns appear in the codons, and this information will be cut in the organism. The probability of appearing introns increases with length. Therefore, we limit the length of the sentence to 50 characters.

2.Find all possible word combinations.
Algorithm: Achieve efficient word graph scanning based on a prefix tree structure. Build a DAG(Directed Acyclic Graph) to obtain all English word segmentation results.
In this process, we constructed a prefix tree tree and a directed acyclic graph.
A prefix tree, is a kind of search tree. We use the prefix tree for word frequency statistics. In this prefix tree, the root is associated with the empty string, and all the descendants of a node have a common prefix. In the search, we start from the root node and connect each character passing through the road as the corresponding string of the node. Using the prefix tree structure allows us to improve the speed of query words and the efficiency of search.
The figure is a prefix tree structure of ever, very, every, go and good.

After constructing the prefix tree, we build a directed acyclic graph. In our DAG, each node is a letter, and each letter points to its next letter. If we detect a word, we add a directed edge between the first letter and the last letter of the word. Therefore, we can traverse all the segmentation results according to the DAG we constructed.
For example, if the sentence is "Iwanttostudybiologyeveryday", assumes that the word in the prefix tree as follows:

“and, biology, cat, day, ever, every, everyday, fat, guess, he, I, jeep, kangaroo, learn, mom, noon, opt, peach, quit, rust, study, to, you, very, want, xylan, yellow, zoom”

Then the DAG is as shown in the figure.

3.Find the most probable combination
Algorithm: Use IF-IDF(Term Frequency-Inverse Document Frequency) model and maximum sharding method to obtain the optimal word segmentation results.
TF-IDF is a statistical method that can be used to assess how important a word is to this document. The importance of a word increases proportionately as it appears in a document, but decreases inversely as it appears in the corpus. Its formula is as follows:

We can obtain the possibility of each word segmentation method by calculating TF-IDF, so as to find the most probable combination. In order to improve the efficiency of calculation, we use the idea of dynamic programming to carry out calculation.
For example, for the word “everyday”, we have the possible word combinations: “ever, every, very, day, everyday”.

We need to calculate:

(“xxx” means the TF-IDF value of the word xxx)
route[9]=0; e-v-e-r-y-d-a-y
...
route[6]=max(“day”+route[9], “d”+route[7])=“day”+route[9] e-v-e-r-y-day
...
route[2]=max(“very+route[6], “v”+route[3])=“very+route[6] e-very-day
route[1]=max(“ever”+route[5], “every”+route[6], “everyday”+route[9])
= “everyday”+route[9] everyday
So, the most probable combination is “everyday”.

RESULTS

Warnings.

Input:
please input the sentences:
Output:
Warning: The entered data is empty.
Input:
please input the sentences:Iwanttostudy888biologyeveryday.
Output:
Warning: Please enter letters, you may enter some numbers.

Print DAG and all results.

Input:
please input the sentences:Iwanttostudybiologyeveryday.
Output:
DAG：
{0: [0], 1: [4], 2: [2], 3: [3], 4: [4], 5: [6], 6: [6], 7: [11], 8: [8], 9: [9], 10: [10], 11: [11], 12: [18], 13: [13], 14: [14], 15: [17], 16: [16], 17: [17], 18: [18], 19: [22, 23, 26], 20: [23], 21: [21], 22: [22], 23: [23], 24: [26], 25: [25], 26: [26], 27: [27]}
All results: i want to study biology log y ever every everyday very day .

Print best result

Route:
{
28: (0, 0),
27: (0, 27),
26: (0, 26),
25: (9.857849805800359e-08, 25),
24: (2.3106801949289814e-06, 26),
23: (2.3106801949289814e-06, 23),
22: (2.3106801949289814e-06, 22),
21: (2.3106801949289814e-06, 21),
20: (4.671241819750759e-06, 23),
19: (0.00011031439602868175, 26),
18: (0.00011031439602868175, 18),
17: (0.00011031439602868175, 17),
16: (0.00011031439602868175, 16),
15: (0.00022651239742305776, 17),
14: (0.00022651239742305776, 14),
13: (0.00022651239742305776, 13),
12: (0.0002681925141215141, 18),
11: (0.0002681925141215141, 11),
10: (0.0002681925141215141, 10),
9: (0.0002681925141215141, 9),
8: (0.0002681925141215141, 8),
7: (0.0002727594331865744, 11),
6: (0.0002727594331865744, 6),
5: (0.00027285757717578014, 6),
4: (0.00027285757717578014, 4),
3: (0.00027285757717578014, 3),
2: (0.00027295615567383817, 2),
1: (0.0002747994303250776, 4),
0: (0.0002747994303250776, 0)}
Best result: I want to study biology everyday .

Others

Visual Software

We developed a visual software. There are an input textbox, an output textbox, two radio buttons and a translate button in the software interface. We can choose radio buttons to select letters to codons or codons to letters. In addition to these, our software can also provide open files, copy files, cut files, save files, print files and other basic functions.
Software interface:
Letters to Codons:
Codons to letters:

Download

1.Englishwordsegment
2.Visualization
3.Visualizaiton.exe

Difference between revisions of "Team:NEFU China/Software3"