SARA: Software Aggregating Research Assistant
To increase the utility of our outreach and help the iGEM community, we sought to create a tool that will be useful for years to come. Each year, many teams come to the iGEM competition with software that they developed in conjunction with their research. These tools cater directly to lab work and synthetic biology, and are integral to the success of a team. At the start of our term, we looked for a tool that could either be applied directly to our work or be built upon to suit our needs. As we searched for software, we found it difficult to sort through the iGEM wikis due to the sheer number of projects, and the software descriptions were hard to pick out from the rest of the wiki content.
Inspired by the opportunity to develop a database of iGEM software, we created SARA, the Software Aggregating Research Assistant, to allow for the simplified searching and management of past iGEM software projects. SARA provides the opportunity for old software to be updated and improved to stay current, and decreases the likelihood that teams will create redundant software.
As can be seen from the figure below, the number of software tools developed by teams has increased drastically. This trend suggests that an organized, cohesive, searchable storage system is required to keep track of the ever-growing number of tools.
How does SARA work?
SARA uses a web scraper that finds software on past wikis by combining the standardized address for iGEM wiki pages with the desired year, the list of teams, and the Software suffix. The scraper visits the software page of every team in a given year, checks whether the page has content, and records the content if it is recognized as software. A similar algorithm is used to scrape the description pages of teams in the software track. A short description of each page is then generated and stored in an Excel file.
To generate an accurate description of the software, we attempted three strategies. The first was to take the first 500 words of the software page. However, inconsistencies in the format and presentation of information on the pages meant that the description usually consisted only of background information or did not fully capture the purpose or abilities of the software. The second was to use various machine learning algorithms to extract important sentences from the pages, but this did not produce a coherent narrative, as the sentences were often taken out of context. The last was to write descriptions by manually reading and paraphrasing the information on the software pages. This was the most accurate and complete method, and it also let us record the name of the software and any GitHub or download links, but it was also the most time-consuming. Ultimately, we found the manual approach to be the most reliable and robust, which was needed for a tool intended to be used by future teams.
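The page lookup and the first description strategy can be illustrated with a minimal Python sketch. It is not SARA's actual code: the URL pattern, the function names, and the 50-word content threshold are assumptions made for illustration, and fetching the page over the network is left out so the example stays self-contained.

```python
from html.parser import HTMLParser

# Assumed pattern for the "standardized address" of a team's software page.
URL_TEMPLATE = "https://{year}.igem.org/Team:{team}/Software"


def software_url(year, team):
    """Build the software page address for one team in one year."""
    return URL_TEMPLATE.format(year=year, team=team.replace(" ", "_"))


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def page_text(html):
    """Return the page's visible text with whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


def has_content(text, min_words=50):
    """Hypothetical check: treat very short pages as empty templates."""
    return len(text.split()) >= min_words


def first_words(text, limit=500):
    """First strategy from the text: keep only the first `limit` words."""
    return " ".join(text.split()[:limit])
```

Given the HTML of a downloaded page, `first_words(page_text(html))` would produce the kind of raw description that the first strategy relied on, which also makes its weakness visible: whatever appears first on the page, often background material, ends up in the description.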
How can I access SARA?
We chose to distribute our database in two ways. A local application with web scraping, parsing, and database capabilities was created, which can be downloaded and run as an .exe file. Two online versions of the database and scraper were also created. To start, we explored online hosting and database services and found two viable options. The first was Caspio, a free service for creating online databases, which allowed us to upload a database and customize its presentation on any webpage. However, we found that the search functions and the integration of the web scraper were limited. The second was independent hosting, which allowed complete control of the database, its appearance, and its capabilities, but was potentially expensive and difficult to implement. We ultimately chose independent hosting, as we felt it allowed for maximum usefulness and customizability.
For both the online and local versions of SARA, each entry includes the team name, the year that the team competed, the name of the software, an accurate description of the software, and a link to a GitHub repository or downloadable files. Users can search by team, year, or keyword. The database can be updated each year by running the web scraper with the latest year as a parameter. Edits or additions can be submitted manually and reviewed for accuracy by other users or administrators.
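As a rough illustration of the entry format and the search by team, year, or keyword, the sketch below models an entry as a small Python record and filters a list of entries in memory. The class name, field names, and sample entries are hypothetical placeholders, not SARA's actual schema or data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    team: str          # team name
    year: int          # year the team competed
    software: str      # name of the software
    description: str   # manually written description
    link: str          # GitHub repository or download link


def search(entries, team=None, year=None, keyword=None):
    """Return entries matching every filter that was supplied."""
    results = []
    for entry in entries:
        if team is not None and entry.team.lower() != team.lower():
            continue
        if year is not None and entry.year != year:
            continue
        if keyword is not None:
            # Keyword search looks at the software name and description.
            haystack = f"{entry.software} {entry.description}".lower()
            if keyword.lower() not in haystack:
                continue
        results.append(entry)
    return results


# Made-up sample entries, for illustration only.
SAMPLE_ENTRIES = [
    Entry("ExampleTeamA", 2014, "PlasmidPlanner",
          "Plans cloning steps for plasmid assembly.",
          "https://example.org/plasmidplanner"),
    Entry("ExampleTeamB", 2015, "GrowthFit",
          "Fits growth curves from plate reader data.",
          "https://example.org/growthfit"),
    Entry("ExampleTeamA", 2015, "PrimerHelper",
          "Suggests primers for plasmid assembly.",
          "https://example.org/primerhelper"),
]
```

For instance, `search(SAMPLE_ENTRIES, keyword="plasmid", year=2015)` would return only the PrimerHelper entry, since filters that are supplied together are combined.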
Click HERE to access SARA.
Ultimately, SARA makes it easier for teams to find iGEM software to use and to build upon. We hope that, as teams build on existing software and improve its functionality, iGEM software will be maintained and become more useful. Improved access to existing software through SARA will also reduce teams’ workload, as they will be able to use these tools effectively rather than recreating them.
Luhn, H. P. (1958). The Automatic Creation of Literature Abstracts. IBM Journal of Research and Development, 2(2), 159-165. doi:10.1147/rd.22.0159