Document

Database
Data Collection
Interaction
Database structure
Search
Data Download
Promoter Prediction
Web Page
Reference

Database

Starting from the more than 20,000 biobricks in the iGEM Registry, we integrated multiple databases into a new database, BioMaster, to expand the information available for each biobrick. To describe biobricks more accurately, we chose large databases that are experimentally verified or highly reliable, such as UniProt, STRING and EPD, which also helps ensure that the data in BioMaster is trustworthy.

Fig. 1. Database integration

Data Collection

We found that many biobricks in the iGEM Registry carry little beyond their sequence, because basic information is missing and descriptions are unclear. This makes it hard to find related information about a given biobrick. We therefore used BLAST-based local sequence alignment to compare biobrick sequences against all sequences in the other databases. This let us locate the components of each biobrick and provide more detailed information.

BLAST is developed and maintained by NCBI and is one of the best-known tools for nucleic acid and protein sequence alignment. It runs on multiple platforms and is available both as a ready-to-use web version and as a self-contained, efficient and flexible local version.

Fig. 2. Expanding information with BLAST

To analyze large amounts of sequence data efficiently, we chose the local version, BLAST+ 2.7.1. Once we have the necessary data from the iGEM Registry and from the databases to be integrated, BLAST lets us quickly compare two sequence files locally.
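As a rough illustration of such a local comparison, the sketch below builds a BLAST+ command line and parses its tabular output. It assumes a nucleotide search with `blastn`, tabular output (`-outfmt 6`) and the column list shown. The file names, the e-value cutoff and the helper names are hypothetical, and the source does not state which program or options were actually used.

```python
# Columns requested from BLAST+ tabular output; this selection is an assumption.
FIELDS = ["qseqid", "sseqid", "pident", "length", "qlen",
          "qstart", "qend", "evalue", "bitscore"]


def build_blastn_command(query_fasta, subject_fasta, evalue=1e-5):
    """Return the argument list for a local blastn run comparing two FASTA files."""
    return [
        "blastn",
        "-query", query_fasta,        # e.g. sequences exported from the iGEM Registry
        "-subject", subject_fasta,    # e.g. sequences from a database to integrate
        "-evalue", str(evalue),       # hypothetical cutoff
        "-outfmt", "6 " + " ".join(FIELDS),
    ]


def parse_tabular(text):
    """Parse tab-separated BLAST output into a list of dicts with typed values."""
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        row = dict(zip(FIELDS, line.split("\t")))
        for key in ("pident", "evalue", "bitscore"):
            row[key] = float(row[key])
        for key in ("length", "qlen", "qstart", "qend"):
            row[key] = int(row[key])
        hits.append(row)
    return hits
```

With BLAST+ installed, the command could be run with `subprocess.run(build_blastn_command("parts.fasta", "target.fasta"), capture_output=True, text=True, check=True)` and its `stdout` passed to `parse_tabular`. For protein sequences, `blastp` would take the place of `blastn`.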

To retain as many high-confidence matches as possible, we filter the BLAST results in two steps:

1. A Python program removes the many duplicate matches and homologous sequences, keeping only the single best match for each matched segment.

2. We randomly sample some of the matches that pass this preliminary screening, verify each one manually and label it as correct or incorrect. We then take three BLAST parameters as features: e-value, pident and pqlen. Using logistic regression with polynomial features, tuned by grid search, we train a classifier to classify the BLAST results. For coding sequences, for example, the classifier reaches 98.5% accuracy in 3-fold cross-validation.
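Step 1 could be sketched roughly as follows. This is only one plausible reading of "one best match for each matching segment": hits are ranked by bit score, and a hit is dropped when it overlaps an already kept hit on the same query by more than a chosen fraction. The ranking criterion, the `max_overlap` value and the dict keys are assumptions, not the team's actual program.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length of the overlap between two closed intervals (0 if disjoint)."""
    return max(0, min(a_end, b_end) - max(a_start, b_start) + 1)


def best_hit_per_segment(hits, max_overlap=0.5):
    """Keep the highest-scoring hit for each region of each query sequence."""
    kept = []
    # Visit hits from best to worst so that stronger hits claim a region first.
    for hit in sorted(hits, key=lambda h: h["bitscore"], reverse=True):
        span = hit["qend"] - hit["qstart"] + 1
        clashes = any(
            k["qseqid"] == hit["qseqid"]
            and overlap(hit["qstart"], hit["qend"], k["qstart"], k["qend"])
            > max_overlap * span
            for k in kept
        )
        if not clashes:
            kept.append(hit)
    return kept
```

A greedy pass like this is simple and deterministic. A stricter or looser overlap fraction changes how aggressively nested or partially overlapping matches are merged.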

Fig. 3. BLAST results screening

Tests on a large number of samples show that the matches selected by these steps are highly consistent in function and usage, which gives reasonable assurance that the expanded data is accurate.
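The classifier described in step 2 can be illustrated with a small, dependency-free sketch: logistic regression on polynomial features, with a grid search over the polynomial degree and an L2 penalty, scored by 3-fold cross-validation. The source does not name the library or the hyperparameter grid, so everything here is an assumption, including the feature scaling, the grid values, the fold assignment and the toy data. In practice a machine-learning library would normally replace the hand-written training loop.

```python
import itertools
import math


def poly_features(row, degree):
    """Bias term plus every monomial of the inputs up to the given degree."""
    feats = [1.0]
    for d in range(1, degree + 1):
        for combo in itertools.combinations_with_replacement(row, d):
            feats.append(math.prod(combo))
    return feats


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def fit_logistic(X, y, l2, lr=0.5, epochs=1000):
    """Batch gradient descent on the L2-penalised logistic loss."""
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = sigmoid(dot(w, xi)) - yi
            for j, xj in enumerate(xi):
                grad[j] += err * xj
        w = [wj - lr * (gj / len(X) + l2 * wj) for wj, gj in zip(w, grad)]
    return w


def cv_accuracy(X, y, degree, l2, k=3):
    """Pooled accuracy over k folds (fold = sample index modulo k)."""
    Xp = [poly_features(row, degree) for row in X]
    correct = 0
    for fold in range(k):
        train = [i for i in range(len(X)) if i % k != fold]
        held_out = [i for i in range(len(X)) if i % k == fold]
        w = fit_logistic([Xp[i] for i in train], [y[i] for i in train], l2)
        correct += sum((dot(w, Xp[i]) >= 0) == bool(y[i]) for i in held_out)
    return correct / len(X)


def grid_search(X, y, degrees=(1, 2), l2_values=(0.0, 0.01, 0.1)):
    """Return ((degree, l2), accuracy) for the best-scoring combination."""
    scored = [((d, l2), cv_accuracy(X, y, d, l2))
              for d, l2 in itertools.product(degrees, l2_values)]
    return max(scored, key=lambda item: item[1])


# Hypothetical, pre-scaled features per hit:
# (scaled -log10(e-value), pident / 100, pqlen / 100); label 1 = verified match.
X_demo = [
    (0.90, 0.98, 0.95), (0.10, 0.40, 0.30), (0.80, 0.95, 0.90),
    (0.20, 0.35, 0.20), (0.95, 0.99, 1.00), (0.15, 0.45, 0.25),
    (0.85, 0.92, 0.88), (0.05, 0.30, 0.35), (0.90, 0.97, 0.93),
    (0.10, 0.38, 0.28), (0.88, 0.94, 0.97), (0.12, 0.42, 0.22),
]
y_demo = [1, 0] * 6

best_params, best_accuracy = grid_search(X_demo, y_demo)
```

The toy data is deliberately well separated, so it says nothing about the 98.5% figure reported above. It only shows how the three BLAST parameters, the polynomial expansion and the cross-validated grid search fit together.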

We retrieve the abstracts of related references from NCBI using a Python crawler. For documents of interest, users can follow the provided links to the full text to gain a deeper understanding of the biobricks.
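The source does not say which Python module or NCBI endpoint the crawler uses. As one possible approach, the sketch below assumes NCBI's public E-utilities `efetch` endpoint for PubMed and the standard library's `urllib`. The function names and the PubMed IDs in the test are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# NCBI E-utilities efetch endpoint (assumed here; the source names no endpoint).
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(pmids):
    """Build an efetch URL that returns plain-text abstracts for PubMed IDs."""
    params = {
        "db": "pubmed",
        "id": ",".join(pmids),
        "rettype": "abstract",
        "retmode": "text",
    }
    return EFETCH + "?" + urlencode(params)


def fetch_abstracts(pmids, timeout=30):
    """Download the abstracts as one text blob (requires network access)."""
    with urlopen(build_efetch_url(pmids), timeout=timeout) as response:
        return response.read().decode("utf-8")


def pubmed_link(pmid):
    """Link shown to the user; the PubMed record page links on to the full text."""
    return f"https://pubmed.ncbi.nlm.nih.gov/{pmid}/"
```

When fetching many records, requests should be batched and rate-limited in line with NCBI's usage guidelines.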

Interaction

Interactions between biobricks often lead to unpredictable consequences, so we list each biobrick's known interactions to alert users to how it may interact with other parts. With this information, users can avoid unwanted interactions in their gene circuits.

Interaction data is collected from the STRING database, and data about the parts comes from the iGEM Registry.

To represent the interactions between biobricks visually, we use the Cytoscape.js graphics library to draw interactive network graphs, which improves the user experience.
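Cytoscape.js takes its graph as a list of elements, where each node carries a `data.id` and each edge a `data.source` and `data.target`. A minimal sketch of preparing that list on the server side might look like the following. The part IDs, the `score` field and the function name are hypothetical, and the source does not describe how BioMaster actually builds its graph data.

```python
import json


def to_cytoscape_elements(interactions):
    """Convert (part_a, part_b, score) tuples into Cytoscape.js elements."""
    nodes = {}
    edges = []
    for a, b, score in interactions:
        for part in (a, b):
            # One node per distinct part, created on first sight.
            nodes.setdefault(part, {"data": {"id": part}})
        edges.append({
            "data": {"id": f"{a}--{b}", "source": a, "target": b, "score": score}
        })
    return list(nodes.values()) + edges


# Hypothetical interaction records, e.g. derived from STRING scores.
demo = [("BBa_A", "BBa_B", 0.9), ("BBa_B", "BBa_C", 0.7)]
payload = json.dumps(to_cytoscape_elements(demo))
```

The resulting JSON string can be handed to the page and passed as the `elements` option when the Cytoscape.js instance is created.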

Fig. 4. Interaction of biobricks

Database structure

To prevent data redundancy, we built a relational database on MySQL with a set of normalized relational tables. The database structure is as follows:

Fig. 5. Database structure

Search

To make searching more convenient, BioMaster provides several search methods. You can find previous projects, parts and biobricks by keyword. You can also use various IDs (such as iGEM_ID, Uniprot_ID, EPD_ID or gene name) to find the corresponding biobrick. In addition, a BLAST search lets you find matching biobricks directly from a sequence. Together, these options make the database more user-friendly.

Fig. 6. Different search methods

Keyword Search

iGEM encourages teams to stand on the shoulders of giants, so BioMaster shows information about the teams that submitted each biobrick. We collected team wikis from 2005 to 2018 and used the Microsoft Text Analytics service to extract keywords, making it easier for users to locate previous projects by keyword and draw inspiration from them.

In addition, 87.5% of our survey respondents hoped to search by function, so we also extract functions as keywords and link the keywords with basic part information, wikis and sites, interactions, and so on. All of this information is closely connected and stored in BioMaster.

Fig. 7. The keywords that users want to use to find parts.

ID Search

BioMaster links the different IDs that the source databases use for the same biobrick. It supports retrieval by the IDs of several databases, so users can find the desired biobricks in multiple ways and view the same data from multiple perspectives.
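One common way to support lookup by any of several IDs is a cross-reference table that maps external identifiers to a single internal key. The sketch below illustrates the idea with a hypothetical two-table layout and Python's built-in `sqlite3` as a stand-in for MySQL. The table names, column names and sample IDs are all invented for illustration and are not BioMaster's actual schema.

```python
import sqlite3

# Hypothetical schema: one row per biobrick, one row per external identifier.
SCHEMA = """
CREATE TABLE biobrick (
    igem_id     TEXT PRIMARY KEY,
    description TEXT
);
CREATE TABLE xref (
    igem_id     TEXT NOT NULL REFERENCES biobrick(igem_id),
    source_db   TEXT NOT NULL,
    external_id TEXT NOT NULL,
    PRIMARY KEY (igem_id, source_db, external_id)
);
"""


def find_biobricks(conn, any_id):
    """Return biobricks whose iGEM ID or any cross-referenced ID equals any_id."""
    query = """
        SELECT DISTINCT b.igem_id, b.description
        FROM biobrick AS b
        LEFT JOIN xref AS x ON x.igem_id = b.igem_id
        WHERE b.igem_id = ? OR x.external_id = ?
        ORDER BY b.igem_id
    """
    return conn.execute(query, (any_id, any_id)).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
# Made-up records used only to exercise the lookup.
conn.execute("INSERT INTO biobrick VALUES ('BBa_DEMO1', 'hypothetical coding part')")
conn.executemany(
    "INSERT INTO xref VALUES (?, ?, ?)",
    [("BBa_DEMO1", "Uniprot", "P00000"), ("BBa_DEMO1", "gene_name", "demoGene")],
)
```

Because each external ID is stored once in `xref` rather than as extra columns on every biobrick row, adding another source database does not require a schema change.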

Sequence Search

We found that many users hope to find parts by sequence, so we provide a BLAST search inside BioMaster. Users can locate the biobrick they want by supplying a sequence and setting a suitable threshold.

Data Download

For biobrick data, we provide downloads in FASTA format. For the whole database, we provide a download in SQL format, from which you can create an identical database directly.
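For readers unfamiliar with FASTA, each record is a header line starting with `>` followed by the sequence, usually wrapped to a fixed width. A minimal export sketch is shown below. The record layout, the 60-character line width and the function name are assumptions, not a description of BioMaster's export code.

```python
def to_fasta(records, width=60):
    """Format (identifier, description, sequence) tuples as FASTA text."""
    lines = []
    for identifier, description, sequence in records:
        # Header line: '>' followed by the ID and an optional description.
        lines.append(f">{identifier} {description}".rstrip())
        # Sequence lines, wrapped to a fixed width.
        for start in range(0, len(sequence), width):
            lines.append(sequence[start:start + width])
    return "\n".join(lines) + "\n"
```

For example, `to_fasta([("BBa_DEMO1", "hypothetical part", "ATGC" * 20)])` produces one header line followed by a 60-base line and a 20-base line.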

Promoter Prediction

We constructed a CNN-based promoter predictor. When preprocessing a sequence, we use one-hot encoding to convert the nucleotide sequence, composed of A, T, C and G, into numerical form.
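One-hot encoding turns each base into a four-element vector with a single 1. The sketch below shows the idea in plain Python. The column order A, C, G, T and the all-zero encoding for unknown bases such as N are assumptions, since the source does not specify either.

```python
ALPHABET = "ACGT"  # column order is an assumption


def one_hot(sequence):
    """Encode a nucleotide string as a list of 4-element 0/1 rows."""
    index = {base: i for i, base in enumerate(ALPHABET)}
    encoded = []
    for base in sequence.upper():
        row = [0, 0, 0, 0]
        if base in index:  # unknown symbols (e.g. N) stay all-zero
            row[index[base]] = 1
        encoded.append(row)
    return encoded
```

A sequence of length n thus becomes an n x 4 matrix, which is the kind of input a one-dimensional convolutional layer can scan along the sequence axis.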

A sliding window is used to identify promoter regions in large-scale sequences.
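The sliding-window scan can be sketched as follows: a fixed-length window moves along the sequence, each window is scored, and windows at or above a threshold are reported. The window length, step and threshold are hypothetical, and `toy_score` is only a stand-in for the trained CNN. It merely checks for the TATAAT motif so that the example runs without a model.

```python
def sliding_windows(sequence, window, step):
    """Yield (start, subsequence) for each full-length window."""
    for start in range(0, len(sequence) - window + 1, step):
        yield start, sequence[start:start + window]


def scan(sequence, score_fn, window, step, threshold=0.5):
    """Return (start, end, score) for every window scoring >= threshold."""
    found = []
    for start, sub in sliding_windows(sequence, window, step):
        score = score_fn(sub)
        if score >= threshold:
            found.append((start, start + window, score))
    return found


def toy_score(subsequence):
    """Placeholder for the CNN: 1.0 if the window contains TATAAT, else 0.0."""
    return 1.0 if "TATAAT" in subsequence else 0.0
```

In a real pipeline, `score_fn` would one-hot encode the window and return the model's predicted promoter probability. Overlapping positive windows could then be merged into a single predicted region.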

We used our promoter prediction tool to analyze the E. coli genome and built a database of the predicted results combined with Ensembl gene annotation.

Web Page

To implement our web database, we chose PHP as the back-end language because it is simple, flexible and works well with databases. Through PHP we query MySQL to extract, filter and sort the data in the database.

On this basis, we use ThinkPHP as the framework to write the front-end code. It passes back-end data to the front end accurately and reliably, and its structure is clear. Its encapsulated methods are convenient to call, which reduces the workload.

For the front-end display, we build on Bootstrap and adopt its attractive, practical layout, and we apply additional CSS style files to keep the database interface clear and concise. For further convenience, we use Angularplasmid, Cytoscape.js and other plug-ins, together with HTML and CSS, to visualize data. With these tools we can draw plasmid maps, association diagrams and sequence diagrams, which greatly extends BioMaster's functionality.

We used Nginx as our web server. Users access a public server, where Nginx acts as a reverse proxy to a private server that serves the resources, which strengthens security. In addition, we implemented load balancing, distributing requests across multiple servers to optimize resource usage, which also improves server stability.

Reference

McGinnis, Scott, and Thomas L. Madden. "BLAST: at the core of a powerful and diverse set of sequence analysis tools." Nucleic acids research 32.suppl_2 (2004): W20-W25.

Szklarczyk, Damian, et al. "The STRING database in 2017: quality-controlled protein–protein association networks, made broadly accessible." Nucleic acids research (2016): gkw937.

Apweiler, Rolf, et al. "UniProt: the universal protein knowledgebase." Nucleic acids research 32.suppl_1 (2004): D115-D119.

Artimo, Panu, et al. "ExPASy: SIB bioinformatics resource portal." Nucleic acids research 40.W1 (2012): W597-W603.

Gama-Castro, Socorro, et al. "RegulonDB version 9.0: high-level integration of gene regulation, coexpression, motif clustering and beyond." Nucleic acids research 44.D1 (2015): D133-D143.

Dimmer, Emily C., et al. "The UniProt-GO annotation database in 2011." Nucleic acids research 40.D1 (2011): D565-D570.

Hershberg, Ruti, et al. "PromEC: An updated database of Escherichia coli mRNA promoters with experimentally identified transcriptional start sites." Nucleic Acids Research 29.1 (2001): 277-00.