Team:Munich/Measurement

Phactory

Measurement

Cell-Free Systems as Universal Expression Platform In Synthetic Biology

Genetic modification of microorganisms is an essential method in synthetic biology. By ‘thinking outside of the cell’, cell-free transcription/translation (TX-TL) systems have emerged as a true alternative, expanding the capabilities of natural biological systems from in vivo to in vitro [1]. Commercial TX-TL systems give reproducible results but are only available at high cost. Home-made TX-TL systems, in contrast, are affordable but suffer from high batch-to-batch variation after time-consuming preparation steps. Further optimization of the TX-TL preparation protocol is necessary to create an easily accessible platform for the research community. Our iGEM team developed a protocol for the easy, high-quality production of cell extract that can be reproduced in any laboratory without expensive equipment. This TX-TL system can be used to measure protein expression, genetic circuit performance and the behaviour of other synthetic biology tools.


Optimization of TX-TL Preparation and Quality Control

Translation Efficiency

To obtain a high-quality cell extract, we optimized several key factors. The culture conditions were optimized by screening for the optimal cell density at harvest. After assessing several lysis methods, we identified sonication as the cheapest, most accessible and most reproducible way to produce our TX-TL protein expression system. Furthermore, we demonstrated upscaling of the TX-TL preparation by using a bioreactor for cell cultivation. Elaborate steps such as the removal of the endotoxin lipid A from the TX-TL were rendered obsolete by designing a suitable mutant strain: TX-TL was prepared from a strain lacking the msbB gene, which disrupts the lipid A biosynthesis pathway. This lowered the lipid A concentration by a factor of 49 compared to the cell extract prepared from the wild-type strain (0.06 EU/mL).

The produced TX-TL then passed further quality control checkpoints: a measurement of the protein content and an activity assay based on expression level analysis. The protein concentration was determined with a commercial BCA protein assay kit. Our home-made TX-TL shows a protein content of about 16 mg/mL, similar to the commercial TX-TL.

However, a meaningful functional quality control requires determining the protein expression rate. We did this by expressing fluorescent proteins (GFP, YFP, RFP, CFP) and recording the corresponding signal with standard lab equipment such as a plate reader. By extending the mRNA of the fluorescent protein mTurquoise2 (mTQ2) with the sequence of the malachite green aptamer (MG), we planned to determine the transcription and translation rates in parallel. Unfortunately, we could not express a functional mTQ2-MG fusion and instead measured transcription and translation separately, using the two components on two different plasmids.

Fluorescence curves of the different fluorescent proteins in cell extract. Proteins analysed were GFP, YFP, RFP and mTQ.

Analysing early results, we observed that functional GFP was expressed, but the measured signal did not show a coherent trend. This was due to several autofluorescent metabolites produced in the cell extract [3]. GFP was therefore excluded from further measurements.

Moreover, mTQ2 showed the fastest maturation and the highest signal-to-noise ratio of the tested fluorescent proteins. Owing to its long fluorescence lifetime (> 3.7 ns) and high quantum yield (> 0.8), it delivers reliable results for the quality control of the translation performance of our TX-TL [4].

The expression of mTurquoise2 gave a five times higher fluorescence signal in our home-made TX-TL than in the commercial TX-TL, which confirms that our optimized protocol produces cell extract with high translation efficiency. Although the overall protein expression rate can be observed with this method, the transcription rate has to be determined using a different approach.
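A fold difference such as the one above is usually computed from blank-corrected signals. The following is a minimal sketch of that arithmetic; the function name `fold_change` and the numbers in the usage line are hypothetical and are not measurements from this project.

```python
def fold_change(sample, reference, blank=0.0):
    """Ratio of blank-corrected fluorescence signals (sample / reference)."""
    corrected_reference = reference - blank
    if corrected_reference <= 0:
        raise ValueError("reference signal must exceed the blank")
    return (sample - blank) / corrected_reference


# Hypothetical plate-reader values in arbitrary units (not measured data).
print(fold_change(sample=5200.0, reference=1200.0, blank=200.0))
```

Subtracting the blank first matters because autofluorescence of the extract would otherwise shrink the apparent ratio.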

Transcription Efficiency

For a thorough quality control, the transcription efficiency of our home-made TX-TL was analyzed separately. We used a plasmid containing a malachite green aptamer downstream of a T7 promoter. The malachite green aptamer, a small RNA transcript, binds a specific ligand, malachite green, and enhances its fluorescence more than 2000-fold [5].

By observing the fluorescence signal over time the transcription levels of different TX-TL batches can be compared.
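One simple way to compare batches from such time courses is the steepest increase of the signal between consecutive measurements. The sketch below assumes this metric; it is an illustration rather than the analysis used for our data, and the time points and batch values are hypothetical.

```python
def max_rate(times, signal):
    """Largest slope between consecutive time points (signal units per time unit)."""
    if len(times) != len(signal) or len(times) < 2:
        raise ValueError("need at least two matching time/signal points")
    return max(
        (s1 - s0) / (t1 - t0)
        for t0, t1, s0, s1 in zip(times, times[1:], signal, signal[1:])
    )


# Hypothetical malachite green fluorescence time courses (arbitrary units).
times_min = [0, 10, 20, 30, 40]
batch_a = [0, 50, 250, 400, 420]
batch_b = [0, 20, 100, 180, 200]

print("batch A:", max_rate(times_min, batch_a))
print("batch B:", max_rate(times_min, batch_b))
```

A finite-difference slope is sensitive to noise, so smoothing the curve or fitting a line over a window may be preferable for real plate-reader data.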

Cell Extract – A Platform For Qualified Comparison Of BioBrick™

These results show that our home-made TX-TL is a high-quality expression platform with regard to both transcription and translation. It is therefore well suited for testing the various biological parts listed in the iGEM Registry. A standardized expression system allows a more reliable comparison of BioBricks by iGEM teams around the world.

Data Analysis

Quality control is one of the most essential aspects of manufacturing therapeutics and was therefore a major consideration while designing Phactory. To tackle this issue from a bioinformatician’s point of view, we analysed Oxford Nanopore MinION sequencing data in order to draw conclusions on the purity, origin and functionality of the bacteriophage genomes.

For this purpose, the wet lab team sequenced several phage genomes: T7, 3S, NES and FFP. The results were analysed with poreSTAT, a software developed in-house. For each sequencing sample, it determines how many reads were sequenced, what base-pair yield was achieved and how many MinION pores were used.

Considering the order in which the reads were acquired, one can clearly see how the chip wears out with each sequencing experiment. While many pores show high average read lengths in the first sequencing experiment, the average read length per pore decreases with the number of sequencing experiments.
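The per-pore statistic described above amounts to grouping reads by pore (channel) and averaging their lengths. The sketch below shows that grouping under the assumption that each read is available as a `(channel_id, read_length)` pair; it is not the poreSTAT implementation, and the example reads are hypothetical.

```python
from collections import defaultdict


def mean_read_length_per_pore(reads):
    """reads: iterable of (channel_id, read_length); returns {channel_id: mean length}."""
    totals = defaultdict(lambda: [0, 0])  # channel -> [summed length, read count]
    for channel, length in reads:
        totals[channel][0] += length
        totals[channel][1] += 1
    return {channel: total / count for channel, (total, count) in totals.items()}


# Hypothetical reads from two pores.
example_reads = [(1, 1000), (1, 3000), (2, 500)]
print(mean_read_length_per_pore(example_reads))
```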

Figure panels (one per sequencing experiment): T7, 3S, NES, FFP

The read and base-pair yield is summarized in the following table:

Experiment | Sequencing time | Reads sequenced | Base pairs sequenced
T7 | 12 h 02 min | 424,198 | 1.27 × 10^9
3S | 3 h 42 min | 77,092 | 2.53 × 10^8
NES | 3 h 58 min | 39,501 | 2.31 × 10^8
FFP | 5 h 42 min | 27,633 | 1.23 × 10^8
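Two derived quantities make the runs easier to compare: the mean read length (base pairs divided by reads) and the yield per hour. The sketch below computes both from the table values; the names `RUNS` and `summarize` are chosen for this illustration only. With these numbers the mean read lengths come out at roughly 3 to 6 kbp.

```python
# Values copied from the table above: (sequencing time in minutes, reads, base pairs).
RUNS = {
    "T7": (12 * 60 + 2, 424_198, 1.27e9),
    "3S": (3 * 60 + 42, 77_092, 2.53e8),
    "NES": (3 * 60 + 58, 39_501, 2.31e8),
    "FFP": (5 * 60 + 42, 27_633, 1.23e8),
}


def summarize(minutes, reads, bases):
    """Derive the mean read length (bp) and the yield per hour (bp/h) for one run."""
    return {
        "mean_read_length": bases / reads,
        "bases_per_hour": bases / (minutes / 60),
    }


SUMMARY = {name: summarize(*values) for name, values in RUNS.items()}

for name, stats in SUMMARY.items():
    print(f"{name}: {stats['mean_read_length']:.0f} bp mean read length, "
          f"{stats['bases_per_hour']:.2e} bp per hour")
```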

Phage Genome Assembly

Genome assembly refers to aligning and merging fragments in order to reconstruct the original sequence. Nanopore sequencing is particularly well suited for genome assembly, since its long reads reduce the ambiguity caused by highly similar and repetitive sequences. The read length distributions of the sequencing experiments are shown in Figure 2.
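A read length distribution is often condensed into a single number, the N50: the length L such that reads of length L or longer contain at least half of all sequenced bases. We do not report N50 values here; the sketch below only illustrates how such a summary could be computed from a list of read lengths, and the example lengths are hypothetical.

```python
def n50(lengths):
    """Length L such that reads of length >= L hold at least half of all bases."""
    if not lengths:
        raise ValueError("no read lengths given")
    half = sum(lengths) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half:
            return length


# Hypothetical read lengths in kbp.
print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))
```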

Figure 2a: T7
Figure 2b: 3S
Figure 2c: NES
Figure 2d: FFP

During the initial screening we observed that the read lengths do not approach the expected length of the phage genomes, which lies in the 50-70 kbp range for T7 and in the 100 kbp range for the remaining phages. Shorter reads generally lead to more ambiguity, but are inevitable due to experimental limitations.

Reference-based assembly approaches require a suitable reference genome. As bacteriophage genomes can be highly mosaic, i.e. the genomes of many phage species appear to be composed of numerous individual modules, we could not use the reference from the database. Consequently, we had to perform a de novo assembly, which reconstructs the original sequence by aligning and merging fragments without the aid of a reference genome.

Several tools are available for de novo genome assembly. Unfortunately, most of them were developed for short-read genome assembly (2nd generation sequencing, Illumina) and employ a de Bruijn graph approach.

Here we have 3rd generation sequencing data, which must be handled very differently from short-read sequencing data because the reads contain far more sequencing errors. While short reads nowadays have error rates of about 1% (i.e. 1 base out of 100 is reported incorrectly), the error rate is up to 15% for nanopore sequencing data, even with the newest sequencing chemistry (R9.4 at the time of wiki-freeze).
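The effect of these error rates on k-mer based (de Bruijn graph) assembly can be estimated with a short calculation. The sketch below assumes that errors are independent and uniformly distributed, which is a simplification, and uses k = 21 and a 3,000 bp read purely as illustrative values.

```python
def error_free_kmer_probability(error_rate, k):
    """Probability that k consecutive bases are all correct (independent errors)."""
    return (1.0 - error_rate) ** k


def expected_errors(error_rate, read_length):
    """Expected number of incorrect bases in a read of the given length."""
    return error_rate * read_length


for label, rate in (("short reads", 0.01), ("nanopore reads", 0.15)):
    p = error_free_kmer_probability(rate, k=21)
    errors = expected_errors(rate, read_length=3000)
    print(f"{label}: P(error-free 21-mer) = {p:.3f}, "
          f"expected errors in 3 kbp = {errors:.0f}")
```

Under these assumptions most 21-mers from a short read are correct, whereas only a few percent of the 21-mers from a nanopore read are, which is why exact k-mer matching works poorly on uncorrected long reads.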

Only a few assemblers are suitable for 3rd generation sequencing data, canu and miniasm being the most prevalent ones. Both rely on an overlap-layout-consensus approach which, historically, can be seen as the father of all assemblers (see Celera Assembler).
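The overlap and layout steps can be illustrated with a toy example. The sketch below is a greedy merger for error-free reads using exact suffix-prefix matches; it is not how canu or miniasm work internally (they use approximate overlaps on noisy reads), and it omits the consensus step entirely. The reads are hypothetical.

```python
def overlap(a, b, min_length=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if too short)."""
    for size in range(min(len(a), len(b)), min_length - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0


def greedy_assemble(reads, min_length=3):
    """Repeatedly merge the pair of sequences with the longest exact overlap."""
    contigs = list(reads)
    while len(contigs) > 1:
        best_size, best_i, best_j = 0, None, None
        # Overlap step: compare every ordered pair of sequences.
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    size = overlap(a, b, min_length)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            break  # no remaining overlaps, keep separate contigs
        # Layout step: join the best pair, keeping the shared bases once.
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


# Hypothetical error-free reads taken from the sequence ATGGCGTACGTTAGC.
print(greedy_assemble(["ATGGCGTA", "CGTACGTT", "CGTTAGC"]))
```

Greedy merging can produce wrong joins when repeats are longer than the reads, which is one reason long reads help with repetitive phage genomes.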

We first used canu to assemble our genomes. The main limitation of any approach to 3rd generation sequencing data analysis is contamination. For assembly in particular, contamination is disastrous because it can lead the assembler into faulty assemblies, depending on the phylogenetic distance between the original sample and the contamination. In the worst case, the assembler collapses all organism sequences into one or gives no output sequence at all.

We therefore developed sequ-into to first detect contamination and then remove the reads originating from it. More on the performance of sequ-into and our findings while using it can be found on our Software page.
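The general idea of removing contamination-derived reads can be sketched as follows: map all reads against a suspected contaminant genome, then drop every read that aligns over a large part of its length. The code below is an illustration of that idea only and is not the sequ-into implementation. It assumes alignments in PAF format (the first four columns are read name, read length, alignment start and alignment end on the read); the function names and the 0.5 threshold are hypothetical.

```python
def contaminant_read_ids(paf_lines, min_aligned_fraction=0.5):
    """IDs of reads whose alignment to the contaminant covers enough of the read."""
    flagged = set()
    for line in paf_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # not a complete PAF record
        name, length = fields[0], int(fields[1])
        start, end = int(fields[2]), int(fields[3])
        if length > 0 and (end - start) / length >= min_aligned_fraction:
            flagged.add(name)
    return flagged


def remove_contaminants(read_ids, paf_lines, min_aligned_fraction=0.5):
    """Keep only the reads that were not flagged as contaminant-derived."""
    flagged = contaminant_read_ids(paf_lines, min_aligned_fraction)
    return [read_id for read_id in read_ids if read_id not in flagged]


# Hypothetical PAF records: r1 aligns over 90% of its length, r2 over 10%.
example_paf = [
    "\t".join(["r1", "1000", "0", "900", "+", "host", "5000", "10", "910", "850", "900", "60"]),
    "\t".join(["r2", "1000", "0", "100", "+", "host", "5000", "10", "110", "90", "100", "60"]),
]
print(remove_contaminants(["r1", "r2", "r3"], example_paf))
```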

After eliminating the contamination and re-aligning the reads to the assemblies, we noticed non-uniform coverage of the phage genome sequences, as can be seen in Figure 3.

Figure 3: Non-uniform coverage of the FFP sequence

In theory, we should observe uniform coverage over the full genome since there is no bias for read template generation during sample preparation (due to random primers).

However, the first half of the sequence has a lower coverage than the remaining part, and there are high-coverage regions in the middle and at the end of the assembled genome.
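Coverage of this kind can be derived from the alignment intervals of the re-aligned reads. The sketch below computes the per-base depth with a difference array and compares the mean depth of the two genome halves; it assumes 0-based, end-exclusive `(start, end)` intervals, and the genome length and alignments in the example are hypothetical.

```python
def coverage_depth(genome_length, alignments):
    """Per-base depth from 0-based, end-exclusive (start, end) alignment intervals."""
    delta = [0] * (genome_length + 1)
    for start, end in alignments:
        start, end = max(start, 0), min(end, genome_length)
        if start < end:
            delta[start] += 1  # coverage rises where the alignment begins
            delta[end] -= 1    # and falls again right after it ends
    depth, running = [], 0
    for change in delta[:genome_length]:
        running += change
        depth.append(running)
    return depth


def mean_depth(depth, start, end):
    """Average depth over the half-open interval [start, end)."""
    window = depth[start:end]
    return sum(window) / len(window)


# Hypothetical 10 bp genome covered by three alignments.
depth = coverage_depth(10, [(0, 5), (3, 10), (4, 8)])
print(depth)
print(mean_depth(depth, 0, 5), mean_depth(depth, 5, 10))
```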

Terminal repeats are common in phage genomes. A misassembly of the genome can therefore produce double coverage within the sequence: the initial and terminal residues of the sequence are force-matched in the interior of the assembled sequence rather than at its edges.

We therefore tried the other assembler, miniasm, which is known for very fast assemblies but little error correction. Error correction can, however, be achieved by combining miniasm with minimap2 for read mapping and racon for polishing the sequences. We wrapped these steps in the following script:

#!/usr/bin/env sh
# Usage: ./assemble.sh READS ASM_FOLDER ASM_PREFIX [THREADS]

INREADS=$1
ASMFOLDER=$2
ASMPREFIX=$3

# number of threads (optional fourth argument, defaults to 4)
THREADS=${4:-4}

if [ -z "$INREADS" ] || [ -z "$ASMFOLDER" ] || [ -z "$ASMPREFIX" ]
then
    echo "usage: $0 READS ASM_FOLDER ASM_PREFIX [THREADS]" >&2
    exit 1
fi

mkdir -p "$ASMFOLDER"
OUT="$ASMFOLDER/$ASMPREFIX"

# path to used executables
MINIMAP2=minimap2
MINIASM=miniasm
GRAPHMAP=graphmap
RACON=racon

# first we must overlap all reads with each other
"$MINIMAP2" -x ava-ont -t "$THREADS" "$INREADS" "$INREADS" > "$OUT.paf"

# then miniasm can create the assembly graph (layout) from the overlaps
"$MINIASM" -f "$INREADS" "$OUT.paf" > "$OUT.gfa"

# extract unitigs (assemblies of fragments for which there are no
# competing choices in terms of internal overlaps) from the miniasm graph
awk '$1 == "S" {print ">"$2"\n"$3}' "$OUT.gfa" > "$OUT.unitigs.fasta"

# align reads with unitigs
"$MINIMAP2" -x map-ont -t "$THREADS" "$OUT.unitigs.fasta" "$INREADS" > "$OUT.unitigs.paf"

# polish the unitigs with racon to obtain consensus contigs
"$RACON" -t "$THREADS" "$INREADS" "$OUT.unitigs.paf" "$OUT.unitigs.fasta" > "$OUT.contigs.fasta"

# re-align the reads to the polished contigs with both mappers
"$MINIMAP2" -x map-ont -a -t "$THREADS" "$OUT.contigs.fasta" "$INREADS" > "$OUT.reads.mm2.sam"
"$GRAPHMAP" align -r "$OUT.contigs.fasta" -d "$INREADS" -o "$OUT.reads.gm.sam"

The script can be started from the command line using:
./assemble.sh FQ_file PATH_TO_ASM_FOLDER PREFIX_of_output [THREADS]

After rearranging the aforementioned sequences, which were initially over-represented in the assembly, this finally led to a reasonable assembly.

Phage Genome Annotation

With a core genome at hand, we wanted to check how many protein-coding genes can be found in it. Two programs widely used in the field are GLIMMER and GeneMark.

Since GeneMark has been shown to perform better in benchmark tests, we used this tool for the genome annotation. We ran it on the assembled genome in FASTA format, generating a gene annotation file (GFF3) that marks all coding sequences. For easier and more compact usage, we combined the genome in FASTA format and the annotation in GFF3 format into the EMBL flat file format.
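Gene counts per strand can be read directly from such an annotation file. The sketch below counts features per strand in GFF3 text, where column 3 holds the feature type and column 7 the strand. It assumes that the coding sequences are stored with the feature type `CDS`, which may differ depending on the GeneMark version and output options; the example lines are hypothetical.

```python
def count_cds_by_strand(gff3_lines, feature_type="CDS"):
    """Count features of the given type on the forward (+) and reverse (-) strand."""
    counts = {"+": 0, "-": 0}
    for line in gff3_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip directives, comments and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != feature_type:
            continue
        if fields[6] in counts:
            counts[fields[6]] += 1
    return counts


# Hypothetical annotation lines.
example_gff3 = [
    "##gff-version 3",
    "phage\tGeneMark\tCDS\t1\t90\t.\t+\t0\tID=gene_1",
    "phage\tGeneMark\tCDS\t100\t190\t.\t-\t0\tID=gene_2",
    "phage\tGeneMark\tCDS\t200\t290\t.\t+\t0\tID=gene_3",
]
print(count_cds_by_strand(example_gff3))
```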

In conclusion, we can describe the assembled genomes as follows:

Genome | #Genes reference | #Genes | #Genes+ | #Genes- | GC content (fraction) | Genome length (bp)
T7 | 60 | 70 | 69 | 1 | 0.4833 | 39,684
3S | 277 | 423 | 305 | 118 | 0.4019 | 164,860
NES | - | 969 | 821 | 148 | 0.3396 | 373,576
FFP | - | 534 | 353 | 181 | 0.3522 | 368,471
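The GC content in the table is given as a fraction of the genome length. As a minimal sketch, both values can be obtained from a single-record FASTA file as shown below; the helper names and the example record are hypothetical.

```python
def read_fasta_sequence(lines):
    """Concatenate the sequence lines of a single-record FASTA file (upper case)."""
    return "".join(
        line.strip() for line in lines if not line.startswith(">")
    ).upper()


def gc_fraction(sequence):
    """Fraction of G and C bases in the sequence."""
    if not sequence:
        raise ValueError("empty sequence")
    gc_count = sequence.count("G") + sequence.count("C")
    return gc_count / len(sequence)


# Hypothetical FASTA record.
example_fasta = [">phage_contig\n", "ATGCGC\n", "atat\n"]
genome = read_fasta_sequence(example_fasta)
print(len(genome), gc_fraction(genome))
```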

The number of genes detected by GeneMark is higher than the number of genes in the reference. This is most likely due to incorrectly sequenced bases that introduce premature stop codons and thereby split one gene into several shorter predictions. More sophisticated polishing steps or additional higher-quality Illumina sequencing reads could possibly avoid this.

Using the EMBL flat file format, we visualized the phage genomes as circular genome diagrams.

Figure: circular genome diagrams of the 3S, FFP and NES genomes.

Several things must be noted here. For the 3S genome, the coverage drops sharply at certain positions (at 58 kbp, 72 kbp and 110 kbp). At these positions no reads align to the reference genome. Since this re-alignment was performed using graphmap, while the initial assembly pipeline uses minimap2 internally, this could also be an alignment artifact.

The NES genome shows a similar behaviour at 330 kbp. Additionally, there are two coverage spikes at the ends of the genome. Finally, the FFP genome shows the same problem at its ends as the NES genome.
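Positions without any aligned reads can be located automatically once a per-base depth is available. The sketch below assumes the depth is given as a plain list of integers (one entry per base) and returns the uncovered stretches as 0-based, end-exclusive intervals; the depth values in the example are hypothetical.

```python
def zero_coverage_regions(depth):
    """Return (start, end) intervals, end-exclusive, where the depth is zero."""
    regions, start = [], None
    for position, value in enumerate(depth):
        if value == 0 and start is None:
            start = position              # an uncovered stretch begins
        elif value != 0 and start is not None:
            regions.append((start, position))
            start = None                  # the uncovered stretch has ended
    if start is not None:
        regions.append((start, len(depth)))
    return regions


# Hypothetical per-base depth values.
print(zero_coverage_regions([2, 0, 0, 3, 0]))
```

Running the same check on alignments from two different mappers would help to distinguish a true coverage gap from an alignment artifact.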

Conclusion

Sequ-into has been shown to enable rapid genome assembly from Nanopore sequencing samples such as those encountered in Phactory manufacturing experiments. Within our framework we use state-of-the-art 3rd generation sequencing data analysis tools to overcome difficulties frequently experienced by scientists in Nanopore applications. As a result, we not only present the step-by-step results of the analysis, assembly and annotation of the bacteriophage genomes, but also show how to deal with Nanopore sequencing data, which is already widely used in synthetic biology and is becoming more common. Furthermore, our custom genome of the bacteriophage 3S allows further analysis and use in the manufacturing cycle, for example for fast contamination detection using sequ-into.

References

  1. Hodgman, C. E., & Jewett, M. C. (2012). Cell-free synthetic biology: Thinking outside the cell. Metabolic Engineering, 14(3), 261–269.
  2. Kremers, G-J., Gilbert, S. G., Cranfill, P. J., Davidson, M. W. & Piston, D. W. (2011). Fluorescent proteins at a glance. J Cell Sci. 124(2): 157–160.
  3. Galbán, J., Sanz-Vicente, I., Navarro, J., & de Marcos, S. (2016). The intrinsic fluorescence of FAD and its application in analytical chemistry: a review. Methods and applications in fluorescence, 4(4), 042005.
  4. Goedhart, J., Von Stetten, D., Noirclerc-Savoye, M., Lelimousin, M., Joosen, L., Hink, M. A., ... & Royant, A. (2012). Structure-guided evolution of cyan fluorescent proteins towards a quantum yield of 93%. Nature communications, 3, 751.
  5. Babendure, J. R., Adams, S. R., & Tsien, R. Y. (2003). Aptamers switch on fluorescence of triphenylmethane dyes. Journal of the American Chemical Society, 125(48), 14716-14717.
  6. Koren S, Walenz BP, Berlin K, Miller JR, Bergman NH, Phillippy AM. Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome Research. 2017;27(5):722-736. doi:10.1101/gr.215087.116.
  7. Sovic, I., Šikić, M., Wilm, A., Fenlon, S.N., Chen, S.L., & Nagarajan, N. (2016). Fast and sensitive mapping of nanopore sequencing reads with GraphMap. Nature communications, 7, 11307.
  8. Heng Li; Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences, Bioinformatics, Volume 32, Issue 14, 15 July 2016, Pages 2103–2110.
  9. Gennady Denisov, Brian Walenz, Aaron L. Halpern, Jason Miller, Nelson Axelrod, Samuel Levy, Granger Sutton; Consensus generation and variant detection by Celera Assembler, Bioinformatics, Volume 24, Issue 8, 15 April 2008, Pages 1035–1040.
  10. Arthur L. Delcher, Douglas Harmon, Simon Kasif, Owen White, Steven L. Salzberg; Improved microbial gene identification with GLIMMER, Nucleic Acids Research, Volume 27, Issue 23, 1 December 1999, Pages 4636–4641.
  11. Besemer J, Borodovsky M. GeneMark: web software for gene finding in prokaryotes, eukaryotes and viruses. Nucleic Acids Research. 2005;33(Web Server issue):W451-W454. doi:10.1093/nar/gki487.
  12. Zhenyu Li, Yanxiang Chen, Desheng Mu, Jianying Yuan, Yujian Shi, Hao Zhang, Jun Gan, Nan Li, Xuesong Hu, Binghang Liu, Bicheng Yang, Wei Fan; Comparison of the two major classes of assembly algorithms: overlap–layout–consensus and de-bruijn-graph, Briefings in Functional Genomics, Volume 11, Issue 1, 1 January 2012, Pages 25–37.