Team:Munich/Measurement

Phactory

Measurement

Cell-Free Systems as Universal Expression Platform In Synthetic Biology

Genetic modification of microorganisms is an essential method in synthetic biology. By ‘thinking outside of the cell’, cell-free transcription/translation (TX-TL) systems have emerged as a true alternative, extending the capabilities of natural biological systems from in vivo to in vitro ¹. While commercial TX-TL systems give reproducible results, they are only available at high cost. Home-made TX-TL systems, by contrast, are affordable but suffer from high batch-to-batch variation and time-consuming preparation steps. Further optimization of the TX-TL preparation protocol is necessary to create an easily accessible platform for the research community. Our iGEM team developed a protocol for the easy, high-quality production of cell extract, which can be reproduced in any laboratory without the need for expensive equipment. This TX-TL system can be used to measure protein expression, genetic circuit performance and the behaviour of other synthetic biology tools.


Optimization of TX-TL Preparation and Quality Control

Translation Efficiency

To obtain a high-quality cell extract, we focused on optimizing several key factors. The culture conditions were optimized by screening for the optimal cell density at which to harvest the cells. After assessing several lysis methods, we identified sonication as the cheapest, most accessible and most reproducible way to produce our TX-TL protein expression system. Furthermore, we demonstrated the upscaling of TX-TL preparation by using a bioreactor for cell cultivation. Elaborate steps such as the removal of the endotoxin lipid A from the TX-TL were rendered obsolete by designing a suitable mutant strain for TX-TL preparation: the TX-TL was prepared from a strain lacking the msbB gene, which disrupts the lipid A biosynthesis pathway. This lowered the lipid A concentration by a factor of 49 compared to the cell extract prepared using the wild-type strain (0.06 EU/mL).

The produced TX-TL then passed further quality control checkpoints: a measurement of the protein content and an activity assay via expression level analysis. The protein concentration was determined using a commercial BCA protein assay kit. Our home-made TX-TL shows a protein content of about 16 mg/mL, similar to that of the commercial TX-TL.

However, a functional quality control also requires determining the protein expression rate. This was done by expressing fluorescent proteins (GFP, YFP, RFP, CFP) and recording the corresponding signal with standard lab equipment such as a plate reader. By extending the mRNA of the fluorescent protein mTurquoise2 (mTQ2) with the sequence of the malachite green aptamer (MG), we planned to determine the transcription and translation rates in parallel. Unfortunately, we could not express a functional fused mTQ2-MG construct and instead measured transcription and translation separately, with the two components on two different plasmids.

Fluorescence curves of the different fluorescent proteins in cell extract. Proteins analysed were GFP, YFP, RFP and mTQ2.

Analysing early results, we observed that functional GFP was expressed, but the measured signal did not show a coherent trend. This was due to several autofluorescent metabolites produced in the cell extract³. GFP was therefore excluded from further measurements.

Moreover, mTQ2 showed the shortest maturation time and the highest signal-to-noise ratio of the tested fluorescent proteins. Due to its long fluorescence lifetime (> 3.7 ns) and high quantum yield (> 0.8), it delivers significant results for the quality control of the translation performance of our TX-TL⁴.

Expression of mTurquoise2 gave a five-fold higher fluorescence signal in our home-made TX-TL than in the commercial TX-TL, indicating that our optimized protocol produces cell extract with high translation efficiency. Although the overall protein expression rate can be observed with this method, the transcription rate has to be determined using a different approach.

Transcription Efficiency

For a thorough quality control, the transcription efficiency of our home-made TX-TL was analyzed separately. We used a plasmid containing a malachite green aptamer downstream of a T7 promoter. The malachite green aptamer, a small RNA transcript, binds a specific ligand, malachite green, and enhances its fluorescence more than 2000-fold⁵.

By observing the fluorescence signal over time, the transcription levels of different TX-TL batches can be compared.
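One simple way to turn such a time course into a single number per batch is the steepest increase in fluorescence between consecutive measurements. The sketch below is only an illustration: the function name `max_slope`, the two-column input layout (time, fluorescence) and the toy values are assumptions, not the metric or data used for the measurements described here.

```sh
#!/usr/bin/env sh
# max_slope < timecourse.txt
# Reads "time fluorescence" pairs (whitespace-separated, increasing time)
# and prints the largest slope between two consecutive measurements.
max_slope() {
    awk '
        NR > 1 {
            s = ($2 - f) / ($1 - t)            # slope since the previous point
            if (NR == 2 || s > best) best = s
        }
        { t = $1; f = $2 }                     # remember the current point
        END { if (NR > 1) printf "%.2f\n", best }'
}

# Hypothetical toy time course (minutes, arbitrary fluorescence units).
printf '%s %s\n' 0 100 10 150 20 400 30 500 | max_slope
```

For the toy values the consecutive slopes are 5, 25 and 10 units per minute, so the script prints 25.00. Comparing this value between batches only makes sense if all batches are measured with the same plate reader settings.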

Cell Extract: A Platform for Qualified Comparison of BioBricks™

These results show that our home-made TX-TL is a high-quality expression platform with respect to both transcription and translation. It is therefore well suited for testing the biological parts listed in the iGEM Registry. A standardized expression system allows a more reliable comparison of BioBricks by iGEM teams around the world.

Data Analysis

One of the most essential aspects of manufacturing therapeutics, and thus a major consideration in the design of Phactory, is quality control. To tackle this issue from a bioinformatician’s point of view, we analysed Oxford Nanopore MinION sequencing data in order to draw conclusions on the purity, origin and functionality of the bacteriophage genomes.

For this purpose, the wetlab team sequenced several phage genomes: T7, 3S, NES and FFP. The results were analysed using poreSTAT, a software tool developed in-house. For each sequencing sample, it determines how many reads were sequenced, what base-pair yield was achieved and how many MinION pores were used.

Looking at the order in which the reads were acquired, one can clearly see how the chip wears out with each sequencing experiment. While many pores show high average read lengths in the first sequencing experiment, the average read length of the used pores decreases with the number of sequencing experiments.

The read and base-pair yield is summarized in the following table:

Experiment	Sequencing Time	Reads Sequenced	Base-pairs sequenced
T7	12h 02min	424,198	1.27 × 10⁹
3S	3h 42min	77,092	2.53 × 10⁸
NES	3h 58min	39,501	2.31 × 10⁸
FFP	5h 42min	27,633	1.23 × 10⁸
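As a rough cross-check of the table, the mean read length of each experiment can be estimated as base pairs divided by reads. The sketch below does this with awk; the helper name `mean_read_length` is introduced here for illustration, and because the base-pair counts in the table are rounded, the results are approximate.

```sh
#!/usr/bin/env sh
# mean_read_length READS BASES: print the mean read length in bp (rounded).
mean_read_length() {
    awk -v reads="$1" -v bases="$2" 'BEGIN { printf "%.0f\n", bases / reads }'
}

# Read and base-pair counts taken from the table above.
for row in "T7 424198 1.27e9" "3S 77092 2.53e8" "NES 39501 2.31e8" "FFP 27633 1.23e8"; do
    set -- $row                      # split the row into name, reads, bases
    printf '%s\t%s bp\n' "$1" "$(mean_read_length "$2" "$3")"
done
```

This gives mean read lengths of roughly 3.0 kbp (T7), 3.3 kbp (3S), 5.8 kbp (NES) and 4.5 kbp (FFP), all far below the length of a complete phage genome.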

Phage Genome Assembly

Genome assembly refers to aligning and merging fragments in order to reconstruct the original sequence. Nanopore sequencing is particularly well suited for genome assembly, since its long reads reduce the ambiguity caused by highly similar and repetitive sequences. The read length distributions of the sequencing experiments are shown in Figure 2.

During the initial screening we observed that the read lengths do not approach the expected length of the phage genomes, which lies in the 50-70 kbp range for T7 and in the 100 kbp range for the remaining phages. Shorter reads generally lead to more ambiguity, but are inevitable due to experimental limitations.

The usual assembly algorithms require a reference genome. As bacteriophage genomes can be highly mosaic, i.e. the genomes of many phage species appear to be composed of numerous individual modules, we could not use a reference from the database. Consequently, we had to perform a de novo assembly, a method of reconstructing the original sequence by aligning and merging fragments without the aid of a reference genome.

There are several tools available for de novo genome assembly. Unfortunately, most approaches have been developed for short-read genome assembly (2nd generation sequencing, Illumina) and employ a de Bruijn graph approach.

Here we have 3rd generation sequencing data, which must be handled very differently from short-read sequencing data because the reads contain many more sequencing errors. While short reads nowadays have error rates of about 1% (i.e. 1 base out of 100 is reported incorrectly), the error rate is up to 15% for nanopore sequencing data, even with the newest sequencing chemistry (R9.4 at the time of wiki-freeze).

Only a few assemblers are suitable for 3rd generation sequencing data, canu and miniasm being the most prevalent ones. Both rely on an overlap-layout-consensus approach, which, historically, can be seen as the father of all assemblers (see Celera Assembler).

We first used canu to assemble our genomes. The main limitation to any approach in 3rd generation sequencing data analysis is contamination. For assembly in particular, contamination is disastrous because it can lead the assembler into faulty assemblies, depending on the phylogenetic distance between the original sample and the contamination. In the worst case, the assembler collapses the sequences of all organisms into one or gives no output sequence at all.

We thus developed sequ-into to first detect the contamination and then remove the reads that originate from it. More on the performance of sequ-into and our findings while using it can be found on our Software page.
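To make the difference between these error rates concrete, the expected number of miscalled bases in a read is simply read length times error rate. The following sketch works this out for a hypothetical 3,000 bp read; the function name `expected_errors` and the read length are assumptions chosen for illustration.

```sh
#!/usr/bin/env sh
# expected_errors LENGTH RATE: expected number of miscalled bases in a read.
expected_errors() {
    awk -v len="$1" -v rate="$2" 'BEGIN { printf "%.0f\n", len * rate }'
}

# Hypothetical 3,000 bp read at the two error rates quoted above.
echo "short-read rate (1%):  $(expected_errors 3000 0.01) errors"
echo "nanopore rate (15%):   $(expected_errors 3000 0.15) errors"
```

At 1% the read would carry about 30 errors, at 15% about 450, which is why exact-match strategies such as k-mer graphs cope poorly with raw nanopore reads.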
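The overlap step of an overlap-layout-consensus assembler can be illustrated with a toy example that finds the longest suffix of one read matching the prefix of another and merges the two. This is a deliberately simplified sketch: it uses exact string matching and the hypothetical helpers `overlap` and `merge`, whereas real long-read assemblers such as canu or miniasm compute approximate, error-tolerant overlaps.

```sh
#!/usr/bin/env sh
# overlap A B: length of the longest suffix of A that equals a prefix of B.
overlap() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        max = length(a) < length(b) ? length(a) : length(b)
        for (k = max; k > 0; k--) {
            if (substr(a, length(a) - k + 1) == substr(b, 1, k)) { print k; exit }
        }
        print 0
    }'
}

# merge A B: join A and B, writing the overlapping part only once.
merge() {
    k=$(overlap "$1" "$2")
    printf '%s%s\n' "$1" "$(printf '%s' "$2" | cut -c"$((k + 1))"-)"
}

# Two toy reads that share the four bases TGCA.
overlap ACGTTGCA TGCAGGTC
merge ACGTTGCA TGCAGGTC
```

For the two toy reads the script reports an overlap of 4 and the merged sequence ACGTTGCAGGTC.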

After eliminating contamination and re-aligning the reads to the assembly, we noticed non-uniform coverage of the sequence in the phage genome assemblies, which can be seen in Figure 3.

Figure 3: non-uniform coverage of FFP sequence

In theory, we should observe uniform coverage over the full genome since there is no bias for read template generation during sample preparation (due to random primers).

However, the first half of the sequence has a lower coverage than the remaining part, and there are high-coverage regions in the middle and at the end of the assembled genome.

Terminal repeats are common in phage genomes. It can therefore happen that double coverage occurs within the sequence due to misassembly of the genome: the initial and terminal residues of the sequence are force-matched in the interior of the assembled sequence rather than at its edges.

We therefore tried the other assembler, miniasm, which is known for very fast assemblies but little error correction. Error correction can, however, be added by combining miniasm with minimap2 for read mapping and racon for polishing the sequences. The resulting pipeline is the following script:
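Coverage differences like these are easier to spot when the per-base depth is averaged over fixed-size windows. The sketch below assumes a three-column table (contig, position, depth), as written for example by `samtools depth -a`, for a single contig; the function name `window_coverage` and the toy input are illustrative assumptions, not part of the original analysis.

```sh
#!/usr/bin/env sh
# window_coverage WINDOW < depth.tsv
# Reads "contig<TAB>position<TAB>depth" lines for one contig and prints
# the start position and mean depth of each window of WINDOW bases.
window_coverage() {
    awk -v w="$1" '
        {
            bin = int(($2 - 1) / w)      # 0-based window index of this position
            sum[bin] += $3
            n[bin]++
            if (bin > last) last = bin
        }
        END {
            for (b = 0; b <= last; b++)
                if (n[b] > 0) printf "%d\t%.1f\n", b * w + 1, sum[b] / n[b]
        }'
}

# Toy input: six positions, the second half covered three times as deeply.
printf 'ctg1\t%s\t%s\n' 1 10 2 10 3 10 4 30 5 30 6 30 | window_coverage 3
```

With a window of 3 the toy input yields a mean depth of 10.0 for the window starting at position 1 and 30.0 for the window starting at position 4. For a real phage assembly a window of a few hundred to a few thousand bases would be more appropriate.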


#!/usr/bin/env sh
# Usage: ./assemble.sh FQ_file PATH_TO_ASM_FOLDER PREFIX_of_output [THREADS]

INREADS=$1
ASMFOLDER=$2
ASMPREFIX=$3
# number of threads, defaults to 4 if not given
THREADS=${4:-4}

# path to used executables
MINIMAP2=minimap2
MINIASM=miniasm
GRAPHMAP=graphmap
RACON=racon

# first we must overlap all reads with each other
"$MINIMAP2" -x ava-ont -t "$THREADS" "$INREADS" "$INREADS" > "$ASMFOLDER/$ASMPREFIX.paf"

# then miniasm can create the assembly layout (GFA graph) from the overlaps
"$MINIASM" -f "$INREADS" "$ASMFOLDER/$ASMPREFIX.paf" > "$ASMFOLDER/$ASMPREFIX.gfa"

# extract unitigs (an assembly of fragments for which there are no competing
# choices in terms of internal overlaps) from the segment (S) lines of the GFA
awk '$1 == "S" {print ">"$2"\n"$3}' "$ASMFOLDER/$ASMPREFIX.gfa" > "$ASMFOLDER/$ASMPREFIX.unitigs.fasta"

# align reads with unitigs
"$MINIMAP2" -x map-ont -t "$THREADS" "$ASMFOLDER/$ASMPREFIX.unitigs.fasta" "$INREADS" > "$ASMFOLDER/$ASMPREFIX.unitigs.paf"

# polish the unitigs with racon to obtain the consensus contigs
"$RACON" -t "$THREADS" "$INREADS" "$ASMFOLDER/$ASMPREFIX.unitigs.paf" "$ASMFOLDER/$ASMPREFIX.unitigs.fasta" > "$ASMFOLDER/$ASMPREFIX.contigs.fasta"

# re-align the reads to the contigs with both minimap2 and graphmap
"$MINIMAP2" -x map-ont -a -t "$THREADS" "$ASMFOLDER/$ASMPREFIX.contigs.fasta" "$INREADS" > "$ASMFOLDER/$ASMPREFIX.reads.mm2.sam"
"$GRAPHMAP" align -r "$ASMFOLDER/$ASMPREFIX.contigs.fasta" -d "$INREADS" -o "$ASMFOLDER/$ASMPREFIX.reads.gm.sam"

The script can be started from the command line using:
./assemble.sh FQ_file PATH_TO_ASM_FOLDER PREFIX_of_output [THREADS]

This finally led to a reasonable assembly after rearranging the aforementioned sequences, which were initially overrepresented in the assembly.

Phage Genome Annotation

Once we had a core genome, we wanted to check how many protein-coding genes it contains. Two of the programs widely used in the field are Glimmer and GeneMark.

Since GeneMark has been shown to perform better in benchmark tests, we used this tool for the genome annotation. We ran the tool on the assembled genome in FASTA format, generating a gene annotation file (GFF3) that marks all coding sequences. For easier and more compact usage, we combined the genome in FASTA format and the annotation in GFF3 format into a single EMBL flat file.
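Gene counts per strand can be read directly from such an annotation file, because GFF3 stores the feature type in column 3 and the strand in column 7. The sketch below is a minimal example under the assumption that coding sequences are annotated with the feature type `CDS`; the function name `count_cds_by_strand` and the toy records are illustrative only.

```sh
#!/usr/bin/env sh
# count_cds_by_strand < annotation.gff3
# Counts CDS features on the forward (+) and reverse (-) strand.
count_cds_by_strand() {
    awk -F '\t' '
        /^#/ { next }                      # skip header and comment lines
        $3 == "CDS" { n[$7]++ }            # column 3: type, column 7: strand
        END { printf "total=%d plus=%d minus=%d\n", n["+"] + n["-"], n["+"], n["-"] }'
}

# Toy GFF3 input with two forward-strand and one reverse-strand CDS.
{
    printf '##gff-version 3\n'
    printf 'phage\texample\tCDS\t%s\t%s\t.\t%s\t0\tID=%s\n' \
        100 400 + gene_1 450 900 + gene_2 1200 1500 - gene_3
} | count_cds_by_strand
```

For the toy input this prints total=3 plus=2 minus=1.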

In conclusion, we can describe the assembled genomes as follows:

Genome	#Genes (reference)	#Genes (predicted)	#Genes (+ strand)	#Genes (- strand)	GC content (fraction)	Genome length (bp)
T7	60	70	69	1	0.4833	39,684
3S	277	423	305	118	0.4019	164,860
NES		969	821	148	0.3396	373,576
FFP		534	353	181	0.3522	368,471
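The genome length and GC content columns can be recomputed from an assembly in FASTA format with a few lines of awk. The sketch below is one possible way to do so, not necessarily how the values in the table were obtained; the function name `gc_content` and the toy sequence are assumptions.

```sh
#!/usr/bin/env sh
# gc_content < genome.fasta
# Prints the total sequence length and the GC fraction (tab-separated).
gc_content() {
    awk '
        /^>/ { next }                          # skip FASTA header lines
        {
            seq = toupper($0)
            len += length(seq)
            gc += gsub(/[GC]/, "", seq)        # gsub returns the number of G/C bases
        }
        END { if (len > 0) printf "%d\t%.4f\n", len, gc / len }'
}

# Toy FASTA record: 8 bases, 6 of them G or C.
printf '>toy\nACGT\nGGCC\n' | gc_content
```

For the toy record the output is a length of 8 and a GC fraction of 0.7500. If a file contains several records, the sketch reports the totals over all of them.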

It can be seen that the number of genes detected by GeneMark is higher than the number of genes in the reference. This is most likely due to incorrectly sequenced bases introducing premature stop codons, which split a single gene into several shorter predicted genes. More sophisticated polishing steps or higher-quality Illumina sequencing reads could possibly avoid this.

Using the EMBL flat files, we visualized the phage genomes as circular genome diagrams.

Several things must be noted here. For the 3S genome, the coverage drops sharply at certain positions (at 58 kbp, 72 kbp and 110 kbp). At these positions no reads align to the reference genome. Since this realignment was performed using graphmap, while the initial assembly pipeline uses minimap2 internally, this could also be an alignment artifact.

For the NES genome we see similar behaviour at 330 kbp. Additionally, there are two spikes at the ends of the genome. Finally, the FFP genome shows the same problem at its ends as the NES genome.

Conclusion

Sequ-into has been shown to allow rapid genome assembly from a Nanopore sequencing sample such as those encountered in Phactory manufacturing experiments. We use state-of-the-art 3rd generation sequencing data analysis tools within our framework to overcome difficulties frequently experienced by scientists in Nanopore applications. As a result, we not only present the step-by-step results of the analysis, assembly and annotation of the bacteriophage genomes, but also show how to deal with Nanopore sequencing data, which is already widely and increasingly used in synthetic biology. Furthermore, our custom genome of the bacteriophage 3S allows further analysis and use in the manufacturing cycle, for example for fast contamination detection using sequ-into.