Phactory

Software

sequ-into

Quality Control - Third Generation Sequencing Style

Phactory offers an alternative method for producing bacteriophages for therapeutic use that meets good manufacturing practice. A critical point is the quality of the bacteriophage DNA, which we were able to improve drastically, up to 96%. This was only possible through iterative engineering cycles and many parts of our project fitting seamlessly together.

That said, any protocol optimization can only be as sophisticated as its quality control and feedback loop. Furthermore, manufacturing to a certain quality standard can only be sustainable with the corresponding quality controls in place, which should allow effortless testing with clear results at the earliest possible point in time.

To achieve that, we implemented sequ-into.

Sequ-into, in the context of our project, detects contaminations in third generation sequencing data in a highly sensitive manner. Our app brings wetlab sequencing experiments in close proximity to drylab analysis and is therefore the ideal feedback tool for the crucial quality control in Phactory.

Please find our git repository here and our iGEM Judging release here.

About sequ-into

Sequ-into is a cross-platform desktop application with a straightforward user experience, created by the fusion of an intuitive graphical interface with state-of-the-art long-read alignment software.

Reads originating from unwanted sources are detected and summarized by a comprehensive statistical overview, but can also be filtered and exported in standardized FASTQ-format to facilitate custom evaluation of experimental findings. Additionally, it might be unclear whether a sequencing experiment produced reads of the intended target. Therefore the filtering we implemented also allows for a positive selection.

Third generation sequencing techniques have rapidly evolved into common practice in molecular biology. Great advances have been made in terms of feasibility, cost, throughput, and read length. However, sample contamination still poses a big issue: it complicates correct, high-quality downstream analysis of sequencing data and its use in medical applications.

… our run times aren’t fixed, unlike the other systems. Some people even have what they are looking for after a few minutes in real time, with success criteria not being based on total yield and an en-run analysis.

Clive G. Brown, CTO of Oxford Nanopore

This raises a simple question: How do you check, quickly and easily, whether you have what you are looking for?

The question can also be rephrased as: Is our sequencing run contaminated? In contrast to Illumina sequencing, long-read third generation sequencing (e.g. PacBio or MinION) always offers the possibility to abort a sequencing procedure, redo the library preparation and continue using the same chip. Especially when sequencing prokaryotic(-like) material, heavy contamination of the sample is possible. It may consist of human DNA/RNA from the library prep, ribosomal RNA left over because rRNA depletion did not work, or material from other organisms (the host organism for phages, etc.).

Thus the earlier such contaminations are detected, the better the sequencing chip can be conserved for future use. Therefore we implemented sequ-into.

Library preparation and sequencing are often in the hands of trained life scientists, while the downstream analysis is performed by trained computer scientists, so there is often a gap in the workflow. Sequ-into aims to close that gap as a convenient cross-platform tool, fusing an intuitive graphical user interface with state-of-the-art long-read alignment software.

Sequ-into is not an application for a thorough interpretive analysis of your sequencing data.

It can, however, be a very helpful addition to your sequencing lab routine. As it is easy to install and to use without much prior knowledge necessary, it is ideal for the very first assessment of your sequencing files.

In particular, it allows you to quickly tell whether your sequenced reads represent what you aimed to sequence or whether they in fact stem from an unwanted source. This allows for a fast reaction during long sequencing experiments and for early changes to your laboratory protocol prior to sequencing.

Later on, you might want to investigate your sequencing data further. Here, the extraction function of sequ-into lets you save reads from a wanted source and reads from unwanted sources separately. Filtering by many possible contaminations at once is also possible. GraphMap, the tool we employ for this in the background, is highly sensitive and specialized for third generation sequencing data.

How to use sequ-into

The philosophy behind sequ-into is to bring a sophisticated bioinformatic tool as close to the wetlab as possible.

We wanted to keep the barrier to using our software as low as possible. Therefore, we created an introductory YouTube video that shows the installation on a Mac OS system. It also gives a detailed introduction to the whole functionality of sequ-into (which is the same on Windows and Unix systems), the interpretation of the results, and the use of the read-filtering option.

The user will also find comprehensive information on the installation, the employed packages and the structure of sequ-into in our documentation.


Download the app here.

How to get sequ-into?

You can use sequ-into on Mac OS, Linux and Windows systems. Please follow the respective instructions in our Installation guide.

Get started

Step 1: Read files

FastQ, as well as Fast5, are suitable formats for evaluating your sequencing data with sequ-into.

In the first step, you can choose which files you would like to seek into. Each chosen file or folder will be handled separately. This is also true if you upload them twice.

If you wish to examine certain reads together, e.g. because they stem from the same experiment, make sure to save them in a folder and upload that folder via Choose Directory. In order to analyze a single file, upload it via Choose File.

As soon as you have chosen your files an output directory will be generated. You will find a temp folder where your read files reside. You can change that output directory and folder name at the bottom of the page if you click on the text field.

After that, click Next to proceed.

Step 2: Reference files

To check what your sequencing files truly consist of, you need a reference against which the reads will be mapped.

That reference might be a possible contamination, such as E. coli, or a known target genome of what you intended to sequence. Of course, you can also use shorter sequences instead of a whole genome as a reference. For details on possible technical limitations, please see GraphMap and Nature Communications.

Mapping is possible against RNA as well as DNA sequences, as long as they are in the FastA format. You can find sequences on NCBI, for example.

Click on Choose Reference to choose your reference files. You can select as many files as you wish. These files will still be present after you used Reset, but are deleted when you close the application.

If you work with certain references repeatedly, they can also be saved in the app so that they are available every time, even after you have closed sequ-into. Simply click Save Contaminants. Your own references can always be deleted from sequ-into later on; just click the trash can to do so.

Keep in mind that calculation time increases with file size and file quantity! Consider using the switches behind each reference to turn them off if you don’t need them for your current run. They will still be available after you used Reset.

After that, click Start to run the calculations.

Step 3: Results

The Results consist of two sections: a statistical overview on how your reads mapped to the reference(s) and the filter to extract and save only those reads you need for your downstream analysis.

Section 1:

For each combination of FastQ (file/directory) with FastA you will find one table and three plots.

The table lists read and base frequencies with respect to the reference FastA file. For reads, it reports how many were aligned and how many were not. Relying on read counts alone is not always sufficient for further analysis, because differing read lengths can lead to a wrong interpretation of the data: three contaminated reads of 50 bp each and three of 5000 bp each make a big difference, even though there are three reads in both cases. To draw proper conclusions about the data, it is useful to look at the bases as well. For bases, note that there are two different definitions: alignment bases and aligned bases.

Aligned reads consist of bases; all bases of an aligned read are called aligned bases. Alignment bases, on the other hand, are only those bases that are actually aligned, meaning mapped to a base in the reference and not skipped.
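The two definitions can be illustrated with a minimal sketch. It is not taken from sequ-into: the function name `base_counts` and the example read are made up, and the sketch assumes that an alignment is described by a SAM CIGAR string, in which the operations M, = and X pair a read base with a reference base.

```python
import re

# One CIGAR operation: a length followed by an operation letter.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def base_counts(sequence, cigar):
    """Return (aligned_bases, alignment_bases) for one mapped read."""
    # Aligned bases: every base of a read that aligned at all.
    aligned_bases = len(sequence)
    # Alignment bases: only read bases paired with a reference base.
    alignment_bases = sum(
        int(length)
        for length, op in CIGAR_OP.findall(cigar)
        if op in "M=X"
    )
    return aligned_bases, alignment_bases

# Hypothetical 20-base read: 5 soft-clipped, 10 matched, 2 inserted, 3 matched.
print(base_counts("ACGTACGTACGTACGTACGT", "5S10M2I3M"))  # (20, 13)
```

Note that this is only an approximation of what sequ-into reports: ContamTool.py takes the value from pysam's `alen`, which measures the span of the alignment on the reference and therefore also counts deleted reference positions.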

To support the statistical information in the table visually, we also added two pie charts that correspond to the relative and absolute values in the table. These two plots help you see how many bases and reads were found in a reference file and draw a conclusion about possible contamination.
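To make the 50 bp versus 5000 bp example from the table description concrete, here is a small illustrative calculation. The function name `contamination_fractions` and all read lengths are hypothetical and not part of sequ-into; the sketch only shows why read fractions and base fractions can diverge.

```python
def contamination_fractions(contaminant_lengths, clean_lengths):
    """Return (fraction of reads, fraction of bases) that are contaminated."""
    total_reads = len(contaminant_lengths) + len(clean_lengths)
    total_bases = sum(contaminant_lengths) + sum(clean_lengths)
    read_fraction = len(contaminant_lengths) / total_reads
    base_fraction = sum(contaminant_lengths) / total_bases
    return read_fraction, base_fraction

# Hypothetical run: 7 clean reads of 2000 bp plus 3 contaminated reads.
clean = [2000] * 7
short_case = contamination_fractions([50] * 3, clean)
long_case = contamination_fractions([5000] * 3, clean)
print(short_case)
print(long_case)
```

In both cases 30% of the reads are contaminated, but the contaminated share of the bases is about 1% with 50 bp reads and about 52% with 5000 bp reads.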

Additionally, there is a bar plot showing the read length distribution of the FastQ file you uploaded. This chart can be used to evaluate the sequencing quality and can help in forming hypotheses about files with filtered reads. For your convenience, all plots are saved in the output directory specified in Step 1.

Section 2:

In the section below you will find a filter that you can optionally use to extract and save distinct parts of the read FastQ file: reads that were mapped to the reference (aligned switch) and reads that were not (not aligned switch). If the reference FastA file is a possible contaminant, the aligned reads are possibly contaminated and the not aligned reads can be used for downstream analysis. If the reference is the FastA file of the organism you expect to sequence, the not aligned reads are the contamination.

If you uploaded multiple reference files, one more filter appears (All references): it selects the reads that aligned to all references or the reads that aligned to none of them.

With this filter, it is possible to refine the sequencing data and consequently achieve better results in downstream analysis. It can also give you a hint about the origin of a possible contamination, as the reads that did not map to the expected organism can be checked with BLAST.

Once again all files will be saved in your output directory specified in Step 1.

How sequ-into works

Sequ-into has the aim of bringing the sequencing data analysis and the laboratory protocol optimization in close proximity.

While highly specialized tools and pipelines for third generation sequencing data analysis are available, they are often neither handy nor convenient to use for a first assessment right after or during the sequencing run.

As a possible solution, we brought together a straightforward, intuitive interface built with Electron and React that gives the user easy access to the state-of-the-art long-read alignment tool GraphMap, which itself is highly specialized for nanopore sequencing.

To make this possible, we run a Python script in the background that relies on HTSeq as infrastructure for high-throughput data and on pysam to handle the genomic data sets.

What does sequ-into do?

To draw conclusions about the sequencing quality in general and about the composition of the data (contamination versus the true sequencing target), the reads are mapped to references. A reference is either a possible contamination, which leaves your desired reads unaligned, or your target sequence, in which case your desired reads are the ones that did align. The read length distribution of the original files and the results of these alignments are then summarized in a statistical overview and used to separate the reads you aimed for from those that were sequenced involuntarily.

How does sequ-into achieve this?

From a Typescript interface to functionality

The user interface of sequ-into is based on Electron and React and written in TypeScript. However, the functionality of our app depends on a Python script (ContamTool.py) in the background, which must be called according to the user's request.

Read Files

Sequ-into is able to deal with both the FastQ and the Fast5 format. If the latter is used, we extract the base-called sequences and convert them into the FastQ format.

Because the Fast5 format is in fact HDF5, a file format that can contain an unlimited variety of datatypes while allowing for input/output of complex data, it was possible to manipulate the files efficiently with the h5py Python interface. To prevent excessive runtimes of our app, there is currently a processing limit of 1000 reads per Fast5 file.

return OrderedDict([
    (Fast5TYPE.BASECALL_2D, '/Analyses/Basecall_2D_%03d/'),
    (Fast5TYPE.BASECALL_1D_COMPL, '/Analyses/Basecall_1D_%03d/'),
    (Fast5TYPE.BASECALL_1D, '/Analyses/Basecall_1D_%03d/'),
    (Fast5TYPE.BASECALL_RNN_1D, '/Analyses/Basecall_RNN_1D_%03d/'),
    (Fast5TYPE.BARCODING, '/Analyses/Barcoding_%03d/'),
    (Fast5TYPE.PRE_BASECALL, '/Analyses/EventDetection_%03d/')
])
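The following sketch shows how such path templates could be used. It rests on two assumptions that the snippet above does not confirm: that `%03d` is filled with a zero-padded index of the analysis run, and that the ordered mapping is probed from top to bottom until a group is found. The names `TEMPLATES` and `first_available` are hypothetical, and a plain Python set stands in for the groups of an opened HDF5 file.

```python
from collections import OrderedDict

# Shortened copy of the mapping above; the order expresses the assumed priority.
TEMPLATES = OrderedDict([
    ("BASECALL_2D", "/Analyses/Basecall_2D_%03d/"),
    ("BASECALL_1D", "/Analyses/Basecall_1D_%03d/"),
])

def first_available(groups, index=0):
    """Return (type, path) of the first template whose group exists, else None."""
    for type_name, template in TEMPLATES.items():
        path = template % index  # e.g. index 0 becomes "000"
        if path in groups:
            return type_name, path
    return None

# Stand-in for a Fast5 file that only contains 1D base calls.
print(first_available({"/Analyses/Basecall_1D_000/"}))
```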

After acquiring the sequencing data to be analyzed, sequ-into handles each uploaded file/folder as a separate call. In the case of a folder, sequ-into searches for each file in that directory down to the deepest level of the directory tree.

self.state.inputFiles.forEach(element => {

    var stats = fs.lstatSync(element.path)

    if (stats.isDirectory()){
        var allFilesInDir = fs.readdirSync(element.path);
        processFilesForElement[element.path] = [];

        allFilesInDir.forEach((myFile:any) => {
            if(myFile.toUpperCase().endsWith("FASTQ") || myFile.toUpperCase().endsWith("FQ"))
            {
                var pathToFile = self.normalizePath(path.join(element.path, myFile));
                processFilesForElement[element.path].push(pathToFile)
            }
        });

        if (processFilesForElement[element.path].length == 0){
            self.extractReadsForFolder(element.path);
        }
    }else{
        processFilesForElement[element.path] = [self.normalizePath(element.path)];
    }
});
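For readers more at home in Python, the folder handling can be sketched as follows. This is an illustration rather than code from sequ-into: the function name `collect_fastq_files` is made up, and `os.walk` is used here as one way to search a directory tree down to its deepest level.

```python
import os
import tempfile

FASTQ_SUFFIXES = (".FASTQ", ".FQ")

def collect_fastq_files(folder):
    """Return all FastQ files below folder, matched case-insensitively by suffix."""
    found = []
    for root, _dirs, files in os.walk(folder):
        for name in files:
            if name.upper().endswith(FASTQ_SUFFIXES):
                found.append(os.path.join(root, name))
    return sorted(found)

# Usage with a throwaway folder tree.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "sub"))
    for relative in ("a.fastq", "b.FQ", "notes.txt", os.path.join("sub", "c.fq")):
        open(os.path.join(tmp, relative), "w").close()
    found = [os.path.relpath(path, tmp) for path in collect_fastq_files(tmp)]
print(found)
```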

All files that are pooled in a folder are handled as one file in the further steps (ContamTool.py), resulting in a combined analysis of all the files in that folder.

Reference Files

The next step is to acquire the FastA files that are used as a reference for the alignment. As the user might have similar requests repeatedly, it is possible to save reference files in the app itself. To make these files available even after the app is closed, we use a JSON file to store their paths internally together with our default genome of Escherichia coli K-12 MG1655.
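A persistent reference store of this kind could look roughly like the sketch below. The layout of the real JSON file is not shown here, so the flat list of paths, the file name `references.json` and the functions `load_references` and `save_reference` are assumptions made for illustration.

```python
import json
import os
import tempfile

# Placeholder path for the bundled default genome.
DEFAULT_REFERENCES = ["ecoli_k12_mg1655.fasta"]

def load_references(store_path):
    """Return saved reference paths, falling back to the default genome."""
    if not os.path.exists(store_path):
        return list(DEFAULT_REFERENCES)
    with open(store_path, encoding="utf-8") as handle:
        return json.load(handle)

def save_reference(store_path, fasta_path):
    """Add one reference path to the store, keeping existing entries unique."""
    references = load_references(store_path)
    if fasta_path not in references:
        references.append(fasta_path)
    with open(store_path, "w", encoding="utf-8") as handle:
        json.dump(references, handle, indent=2)
    return references

# Usage with a throwaway store file.
with tempfile.TemporaryDirectory() as tmp:
    store = os.path.join(tmp, "references.json")
    save_reference(store, "/pathToReference/ref1.fasta")
    save_reference(store, "/pathToReference/ref1.fasta")  # no duplicate entry
    saved = load_references(store)
print(saved)
```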

Cross-Platform Compatibility

Now that the required data is accessible, the python script (ContamTool.py) handling the alignment, calculation and plotting can be called.

The alignment tool employed in our Python script runs asynchronously, but sequ-into has to make several calls, one for each file per reference. We therefore call the Python script sequentially.

child = spawnSync(
    program,
    programArgs,
    {
        cwd: process.cwd(),
        env: process.env,
        stdio: 'pipe',
        encoding: 'utf-8',
        shell: useShell
    })


To facilitate this on every platform sequ-into formulates the call command accordingly.

For a Unix system, this is simply:


var splitted_command = command.split(" ");
program = "python3";
programArgs = splitted_command;
useShell = true;

For Mac OS, the explicit PATH variable containing the location of the programs must be added manually:

var np = shellPath.sync();
process.env.PATH = np;

On Windows, however, it is necessary to make the call WSL compatible:

var splitCmd = ["-i", "-c", "python3 " + command];
program = "bash";
programArgs = splitCmd;
useShell = false;
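The platform switch above can be summarized in a short Python sketch. It is a simplified illustration, not the app's code: `build_call` and `run_sequentially` are hypothetical names, the platform strings follow Python's `sys.platform` values, the command line is a placeholder, and the Mac OS PATH adjustment and the shell option are left out.

```python
import subprocess

def build_call(platform, command):
    """Return (program, arguments) for running ContamTool.py on a platform."""
    if platform == "win32":
        # Windows: hand the whole command line to bash inside WSL.
        return "bash", ["-i", "-c", "python3 " + command]
    # Linux and Mac OS: call the interpreter directly.
    return "python3", command.split(" ")

def run_sequentially(platform, commands):
    """Run one call after another; each call blocks until it has finished."""
    results = []
    for command in commands:
        program, arguments = build_call(platform, command)
        results.append(
            subprocess.run([program] + arguments, capture_output=True, text=True)
        )
    return results

# Placeholder command line, not the real ContamTool.py arguments.
print(build_call("win32", "ContamTool.py reads.fastq"))
print(build_call("linux", "ContamTool.py reads.fastq"))
```

Like `spawnSync` in the app, `subprocess.run` only returns once the child process has exited, which is what makes the calls sequential.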

Script Output

The output of each python call - that is for each file per reference - is collected via another JSON file data structure. More details here.

ContamTool.py

As mentioned above, the functionality of sequ-into depends on the Python script ContamTool.py, which assesses the input read files, coordinates the alignment, interprets the alignment results and allows for read extraction according to the knowledge gained.

Read File Handling

All files that are pooled in a folder are handled as one FastQ file in the further steps to make the combined analysis possible.

fastqFile = os.path.join(output_dir, prefix + "complete.fastq")
os.system("cat " + ' '.join(read_file) + " > " + fastqFile)
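The same pooling can be done without a shell. The sketch below is an alternative to the `cat` call, not what ContamTool.py does: it uses `shutil.copyfileobj` from the standard library, which also copes with file names containing spaces, and the function name `concatenate_files` and the example records are made up.

```python
import os
import shutil
import tempfile

def concatenate_files(read_files, combined_path):
    """Append every input file to combined_path, in the given order."""
    with open(combined_path, "wb") as combined:
        for path in read_files:
            with open(path, "rb") as source:
                shutil.copyfileobj(source, combined)
    return combined_path

# Usage with two tiny throwaway FastQ files.
with tempfile.TemporaryDirectory() as tmp:
    parts = []
    for index, record in enumerate(["@r1\nACGT\n+\n!!!!\n", "@r2\nGG\n+\n!!\n"]):
        part = os.path.join(tmp, "part%d.fastq" % index)
        with open(part, "w") as handle:
            handle.write(record)
        parts.append(part)
    combined = concatenate_files(parts, os.path.join(tmp, "complete.fastq"))
    with open(combined) as handle:
        combined_text = handle.read()
print(combined_text)
```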

HTSeq allows for an efficient iteration over all reads from the now single input file.

reads = HTSeq.FastqReader(read_file)
for read in reads:
    ...
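What such an iteration can be used for is sketched below with the read length distribution as an example. To keep the example free of dependencies, a small stand-in parser replaces `HTSeq.FastqReader`; it assumes exactly four lines per record, and the names `fastq_records` and `read_length_counts` are hypothetical.

```python
import io
from collections import Counter

def fastq_records(handle):
    """Yield (name, sequence, quality) from a four-line-per-record FastQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line
        quality = handle.readline().rstrip("\n")
        yield header[1:], sequence, quality

def read_length_counts(handle):
    """Count how many reads have each length."""
    return Counter(len(sequence) for _name, sequence, _quality in fastq_records(handle))

example = io.StringIO("@r1\nACGT\n+\n!!!!\n@r2\nACGTAC\n+\n!!!!!!\n@r3\nTTGA\n+\n!!!!\n")
print(read_length_counts(example))
```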

Calling the Alignment Tool GraphMap

The idea behind sequ-into, which enables both finding possible contaminations and deciding whether a certain target was sequenced, is to map the raw reads from the sequencing files against a reference. This splits the original joint read file into two categories: the reads that aligned to the reference and those that did not.

Nanopore sequencing data, however, comes with certain obstacles that complicate alignments. On the one hand, because of nanopore sequencing's high-throughput nature, the data size makes commonly used alignment algorithms too slow, something that was overcome only at the cost of lower sensitivity. On the other hand, the variable error profile of ONT MinION sequencers made parameter tuning mandatory to gain high sensitivity and precision. What makes sequ-into a reliable tool nevertheless is GraphMap. This mapping algorithm is specifically designed to analyse nanopore sequencing reads: it handles potentially high error rates robustly and aligns long reads with speed and high precision thanks to a fast graph traversal (Sović et al., Nature Communications, 2016).

For each reference, GraphMap is called with the input read file, generating a Sequence Alignment Map.

for file in cont_file:
    sam_file_name = os.path.split(file)[1][:-6]+".sam"
    samFile = os.path.join(output_dir,prefix + sam_file_name)
    os.system("graphmap align -r "+file+" -d "+read_file+" -o "+samFile)
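A slightly more defensive variant of this call is sketched below. It is a suggestion, not the code used in ContamTool.py: the function name `graphmap_command` and the example paths are made up, while the `align`, `-r`, `-d` and `-o` arguments are taken from the call above. Building an argument list avoids problems with spaces in paths, and `os.path.splitext` also handles reference files that do not end in `.fasta`.

```python
import os

def graphmap_command(reference, read_file, output_dir, prefix=""):
    """Build the GraphMap call and the SAM path for one reference."""
    # Strip the extension, whatever its length (.fasta, .fa, ...).
    stem = os.path.splitext(os.path.basename(reference))[0]
    sam_path = os.path.join(output_dir, prefix + stem + ".sam")
    command = ["graphmap", "align", "-r", reference, "-d", read_file, "-o", sam_path]
    return command, sam_path

# Hypothetical paths for illustration.
command, sam_path = graphmap_command("/refs/ecoli.fa", "reads.fastq", "out", "run1_")
print(command)
```

The resulting list can be passed to `subprocess.run(command, check=True)`, which raises an error if GraphMap exits with a failure code.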

Evaluating the GraphMap Output

With the pysam interface it is now easy to count the features of interest directly from the corresponding sam file for each reference:

for aln in samFile:
    totalBases += len(aln.seq)
    totalReads += 1
    if not aln.is_unmapped:
        alignmentBases += aln.alen
        alignedLength += len(aln.seq)
        alignedReads += 1

ContamTool.py Output

The read file is assessed for each reference. ContamTool.py produces three images per reference from the generated data: a read length distribution of the original FastQ file(s) and two pie charts showing the percentage of aligned and not aligned reads or bases. The collected data, as well as the paths to the images, are dumped into a JSON file for easy handling in the further steps.

{
    "/pathToReference/ecoli_k12_mg1655.fasta":
    {
        "totalReads": 7,
        "alignedReads": 0,
        "totalBases": 62387,
        "alignmentBases": 0,
        "alignedLength": 0,
        "idAlignedReads": [],
        "idNotAlignedReads": ["c9a72623-c55c-4464-ac5e-d1e70cea8466", "4b57cb5c-0c3d-4650-9d57-c94cf4aea2ef", ...],
        "readLengthPlot": "/outputPath/file2_ecoli_k12_mg1655_ref1_ref2_reads_length.png",
        "readsPie": "/outputPath/file2_ecoli_k12_mg1655_ref1_ref2_read_pie.png",
        "basesPie": "/outputPath/file2_ecoli_k12_mg1655_ref1_ref2_bases_pie.png",
        "refs": ["/pathToReference/ecoli_k12_mg1655.fasta"]
    },

    "/pathToReference/ref1.fasta":
    {
        "totalReads": 7,
        "alignedReads": 0,
        ...
    },

    "/pathToReference/ref2.fasta":
    {
        "totalReads": 7,
        "alignedReads": 0,
        ...
    }
}

Extracting Read Files

Besides the contamination evaluation, sequ-into allows for a separation of the reads into those that aligned to the reference and those that did not. It generates new FastQ files according to the user's request, which can then be used in a more elaborate downstream analysis. One notable possibility is the extraction of reads against several references at once: only those reads are exported that represent the intersection (red), in the sense of set theory, of the reads aligned against all references or against none.
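The set logic behind the multi-reference filter can be sketched as follows. The keys `idAlignedReads` and `idNotAlignedReads` are those of the JSON output shown earlier; the function name `reads_for_all_references` and the read IDs are made up, and the sketch makes no claim about how ContamTool.py implements the filter.

```python
def reads_for_all_references(results, aligned=True):
    """Return read IDs aligned to every reference, or to none of them."""
    key = "idAlignedReads" if aligned else "idNotAlignedReads"
    id_sets = [set(entry[key]) for entry in results.values()]
    # Intersection: keep only the IDs present in the set of every reference.
    return set.intersection(*id_sets) if id_sets else set()

# Hypothetical results for four reads (a, b, c, d) and two references.
results = {
    "ref1.fasta": {"idAlignedReads": ["a", "b"], "idNotAlignedReads": ["c", "d"]},
    "ref2.fasta": {"idAlignedReads": ["b", "c"], "idNotAlignedReads": ["a", "d"]},
}
print(reads_for_all_references(results, aligned=True))   # {'b'}
print(reads_for_all_references(results, aligned=False))  # {'d'}
```

Reads a and c aligned to only one of the two references, so they appear in neither export.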

Sequencing Results

Our app led to two main results.

First, using our tool, the wetlab part of our project was able to reduce the amount of contamination significantly, to less than 2% of the reads, as can be seen in the following plots. This was a valuable step for Phactory, as high contamination levels impair phage assembly in a TX/TL system.

Shown is the percentage of reads of the 3S sequencing experiment that did not align against the 3S phage genome across different purification protocols.
Shown is the percentage of reads of the 3S sequencing experiment that did align against the E. coli genome across different purification protocols.

Second, we noticed that running sequ-into on only the very first subset of the sequenced reads is sufficient to get an upper bound on the expected contamination. Thus, using only the first 1000 reads of a sequencing experiment is enough to determine how well the library preparation worked. Moreover, we noticed that the contamination usually has a lower average read length and is higher at the very beginning of each sequencing experiment. At first sight this seems surprising. A possible explanation is secondary structure effects in longer reads, making the longer reads drop down faster than shorter reads.

More non-target reads sequenced in the first 10% of the sequencing time of each experiment.
Also in the first x sequenced reads.
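Restricting an analysis to the first reads of a run is easy to do on the FastQ level. The sketch below is an illustration, not a feature description of sequ-into: the function name `first_reads` is hypothetical, and it assumes that every record occupies exactly four lines and that the file is ordered by sequencing time.

```python
import io
from itertools import islice

def first_reads(handle, max_reads=1000):
    """Return the text of the first max_reads FastQ records (four lines each)."""
    return "".join(islice(handle, 4 * max_reads))

# Five made-up records; keep only the first two.
example = io.StringIO("".join("@r%d\nACGT\n+\n!!!!\n" % i for i in range(5)))
subset = first_reads(example, max_reads=2)
print(subset)
```

The resulting subset can be written to a new file and loaded into sequ-into like any other FastQ file.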

For more details on how our software influenced Phactory, please visit Measurement and Data Analysis.

References