Team:EPFL/Software

iGEM EPFL 2018


Ginga pipeline is a patient personalised genomic pipeline that integrates state of the art genomic tools together with the 2018 EPFL IGEM team developed software to create an streamline and fast process to translate the raw exome sequencing data of patients with melanoma and other cancers, to a library of neoantigens that can specifically activate the immune system to target the tumor.


Ginga, is named after the fundamental movement of the capoeira martial art. As in the martial art, Ginga, serves as the starting point for our project CAPOEIRA, and connects all the parts of the project back to the patient.


The main problem that Ginga targets is the discovery of neoantigens, peptide sequences that are presented uniquely in the surface of tumor cells. This sequences are often not recognised by the immune system as dangerous and hence the tumor is allow to develop without control. However, by training the immune system to recognise this specific neoantigen sequences, it is possible to use the defense system to target the tumor and eliminate it. Ginga, uses whole exome sequences extracted from patients to create a specific neoantigen library catered for each individual needs. Then, the next step in CAPOEIRA, the production of a vaccine system to effectively deliver the neoantigen sequence comes in.


The second problem that Ginga aims to address is a problem that constantly came up during our conversations with experts from the field of oncology, which is the lack of any rapid method to test the effectivity of the vaccine administered to the patient. This is very critical since delivering the incorrect therapy to the patient can delay the action of any measure against the tumor, allowing it to extend and develop. Furthermore, applying the incorrect therapy to the patient can lead to exposure to high levels of toxicity. CAPOEIRA aims to develop a rapid and non-invasive method to detect the vaccine efficiency by applying our CRISPR/Cas12a detection scheme. Ginga aims to create a link between the injected vaccine and the monitoring of its effect, by retrieving the point mutations in the DNA that code for the library of neoantigens delivered to the patient. This mutated DNA can be found in small quantities on the blood as ctDNA which we aim to detect with CAPOEIRA’s couple detection system.


Finally, the aim of CAPOEIRA is lifelong support to the patient diagnosed with cancer to avoid a possible cancer relapse and metastasis. In this part Ginga aims to detect the specific chromosomal rearrangements of the patient’s cancer in order to detect in advance if the cancer is relapsing.


Furthermore, Ginga’s pipeline also integrates with the vaccine monitoring and relapse follow-up parts of CAPOEIRA. Once, the library of specific neoantigens for each patient is identified, Ginga can translate this neoantigens and track their specific location in the genome and output the DNA sequences that can be used as targets to monitor the vaccine. This target DNA can then be detected in the blood ctDNA of the patient using CAPOEIRA’s CRISPR/CAS12a detection system.


Motivation

The main reason for attempting to create Ginga was the lack of any current standardized method of discovering neoantigens from patient sequencing data. Furthermore, after discussing the concept with researchers in the immunotherapy field, we found that there was a big gap between the computational genomic analysis used to extract and processed patient medical data and how this data was implemented in the common practice. With Ginga we wanted to bring both side of research closer by creating a single pipeline that can be easily use by all kinds of users. We aim to emphasize a user-friendly and intuitive pipeline for analysis of genomic data, detection of cancer mutations and prediction of the targeted sequences for a possible cancer vaccine therapy, while still taking advantage of currently developed open-source genomics packages such as BWA, SamTools or GATK packages. The pipeline borrows inspiration from other currently available pipeline such as TSNAD (Zhou, Zhan et al).
Ginga was conceive with three main objectives in mind:

  • Scalability
    Thousands of patient’s data can be analyzed in parallel
  • Fine-Tune Analysis
    Each sample characteristics can be fine tune.
  • Streamline Process
    Optimized data flow and compatibility between file formats and structure
  • The Pipeline


    The pipeline uses the most advanced tools currently available to perform exome analysis and mutation calling and integrates in a single process using the Ginga python scripts. The pipeline is meant to run natively in Linux operating system and was deployed using the latest release of Ubuntu (18.04 LTS). More information about the technical aspects of the project can be found in Ginga’s Github. In order to translate the patient exome data to the neoantigen library 11 distinct steps are required. Furthermore, apart from the main pipeline the software also has contains 2 additional workflows, for chromosomal rearrangement detection (Breakdancer) and for tracing back the DNA sequence and index of the candidate neoantigens (neoSearch). Here, it's the outline of Ginga’s pipeline:


    Quality Control - FastQC (v 0.11.7)


    FastQC (Andrews S. 2010) is a genomic analysis tool used to preprocess the sequence data and identify the quality and conditions of the reads. These metrics are useful downstream in the pipeline to assess relevance of the results.


    Sequencing Adapter Cut and Quality Control - Trimmomatic (v0.38) /BBmap (v38.36)


    Raw genomic sequences contain sequencing universal and index kmer adapters, that are used to sort and organised the reads. Furthermore, sequenced reads have heterogeneous read length. Trimmomatic (Bolger, Lohse and Usadel 2114-2120) allows to clip the adapters and remove artifacts. Alternatively, BBmap can be used to automatically search and clip a series of common sequencing adapters.


    Reference alignment - BWA (v.0.7.17)


    In order to identify the sites of the exome it is required to map the samples to a reference genome. BWA (McKenna et al. 1297-1303) is based on the Burrows-Wheeler transformation, which can efficiently align the reads.


    Sort and Indexing - Samtools (v 1.7)


    Samtools (Li et al. 2078-2079) can be used to sort the aligned reads from the SAM (sequence alignment/map) format to BAM (binary alignment/map) format. BAM files are more compressed and optimized that than SAM files, optimising the workflow of the pipeline. Furthermore, Samtools has the option of indexing the file.

    Remove Duplicate Reads - Picard (v 2.18.14)


    Duplicated sequences are commonly found in genes due to the enrichment of certain reads during the sequencing protocol. Picard MarkDuplicates tool allows to remove the duplicated sequences, removing possible bias and artifacts during the variant calling.


    Base Quality Assessment - GATK (v 4.0.9.0)


    The quality of the bases pairs of the read can condition the results of the variant calling analysis. GATK (McKenna et al. 1297-1303) BaseRecalibrator tool uses machine learning to calibrate the quality of the base pairs reducing the number of false positives.


    Mutation Calling - GATK (v 4.0.9.0) Mutect2


    GATK (McKenna et al. 1297-1303) Mutect2 identifies can detect SNPs and Indels in large reads.


    Filter Mutations - GATK (v 4.0.9.0)


    GATK (McKenna et al. 1297-1303) FilterMutect removes variants according to a series of filters, and contamination content of the samples.


    Gene-based Annotation - Annovar (v. 2018Apr16)


    Annovar functionally annotates the variants and protein coding changes of the filtered mutations.


    Mutated Peptide Extraction - neoExtract (v 1.0)


    neoExtract determines the neoantigen iterations possible that contain the functional variant change within the annotated proteins, simplifying the HLA-peptide binding affinity process.


    MHC-I-peptide binding affinity - NetMHC (v 4.0)


    NetMHC (Andreatta and Nielsen 511-517), uses machine learning algorithms to predict the binding affinity of the peptides to the MHC-I complexes on the surface of antigen presenting cells.


    Rank Binding Affinity and Search for Origin Mutations - neoSearch (v 1.0)


    neoSearch sort the HLA-binding Affinity according to their Rank in order to standardize the output. Furthermore, it uses the protein index to search the mutated DNA sequence encoding for the neoantigen sequence.


    Validation


    Validation of Ginga pipeline was perform with whole exome sequencing data of 8 melanoma patients. The guide and results of this validation can be found in the Result section of the wiki and in Ginga’s Githubhere


    Future Implementation


    Even though the current pipeline has been validated with real world data, providing very promising results, we believe that we just scratched the surface of the real potential of Ginga. We recently applied for a Research Grant from AWS to continue the development of the software to make it more accessible and user-friendly. We recently which will provide us with a full year of fund and support to develop the pipeline to work on fully developing Ginga. Furthermore, we have been already in contact with researchers from different Bioinformatic institutions, including Swiss Bioinformatics Institute other bioinformatics laboratory and cancer research groups at EPFL, UNIL and CHUV. We intend to implement Ginga using their tools, and iteratively improve the software wrapper and user interface to remove the annoyance and complexity of using Linux Command Line.

    Reference

    • Zhou, Zhan et al. "TSNAD: An Integrated Software For Cancer Somatic Mutation And Tumour-Specific Neoantigen Detection." Royal Society Open Science 4.4 (2017): 170050. Web.

    • Andrews S. (2010). FastQC: a quality control tool for high throughput sequence data. Available online at:http://www.bioinformatics.babraham.ac.uk/projects/fastqc

    • McKenna, A. et al. "The Genome Analysis Toolkit: A Mapreduce Framework For Analyzing Next-Generation DNA Sequencing Data." Genome Research 20.9 (2010): 1297-1303. Web.

    • Normal Tissue Controls." Genome Medicine 9.1 (2017): n. pag. Web.

    • Li, H. et al. "The Sequence Alignment/Map Format And Samtools." Bioinformatics 25.16 (2009): 2078-2079. Web
    • Li, H., and R. Durbin. "Fast And Accurate Short Read Alignment With Burrows-Wheeler Transform." Bioinformatics 25.14 (2009): 1754-1760. Web.