The Ginga pipeline is a patient-personalised genomic pipeline that integrates state-of-the-art genomic tools with software developed by the 2018 EPFL iGEM team. It provides a streamlined and fast process for translating raw exome sequencing data from patients with melanoma and other cancers into a library of neoantigens that can specifically activate the immune system to target the tumor.
Ginga is named after the fundamental movement of the capoeira martial art. As in the martial art, Ginga serves as the starting point for our project, CAPOEIRA, and connects all parts of the project back to the patient.
The main problem that Ginga targets is the discovery of neoantigens, peptide sequences that are presented uniquely on the surface of tumor cells. These sequences are often not recognised by the immune system as dangerous, and hence the tumor is allowed to develop without control. However, by training the immune system to recognise these specific neoantigen sequences, it is possible to use the body's defense system to target the tumor and eliminate it. Ginga uses whole exome sequences extracted from patients to create a neoantigen library tailored to each individual's needs. The next step in CAPOEIRA, the production of a vaccine system to effectively deliver the neoantigen sequences, then follows.
The second problem that Ginga aims to address came up repeatedly during our conversations with experts in oncology: the lack of any rapid method to test the efficacy of the vaccine administered to the patient. This is critical, since delivering the wrong therapy can delay effective action against the tumor, allowing it to grow and spread. Furthermore, applying the wrong therapy can expose the patient to high levels of toxicity. CAPOEIRA aims to develop a rapid and non-invasive method to measure vaccine efficacy by applying our CRISPR/Cas12a detection scheme. Ginga aims to link the injected vaccine to the monitoring of its effect by retrieving the point mutations in the DNA that code for the library of neoantigens delivered to the patient. This mutated DNA can be found in small quantities in the blood as circulating tumor DNA (ctDNA), which we aim to detect with CAPOEIRA’s coupled detection system.
Finally, CAPOEIRA aims to provide lifelong support to patients diagnosed with cancer, guarding against possible relapse and metastasis. For this part, Ginga aims to detect the chromosomal rearrangements specific to the patient’s cancer so that a relapse can be detected in advance.
Furthermore, Ginga’s pipeline also integrates with the vaccine monitoring and relapse follow-up parts of CAPOEIRA. Once the library of specific neoantigens for each patient is identified, Ginga can map these neoantigens back to their specific location in the genome and output the DNA sequences that can be used as targets to monitor the vaccine. This target DNA can then be detected in the ctDNA in the patient’s blood using CAPOEIRA’s CRISPR/Cas12a detection system.
The main reason for creating Ginga was the lack of any standardized method for discovering neoantigens from patient sequencing data. Furthermore, after discussing the concept with researchers in the immunotherapy field, we found a large gap between the computational genomic analysis used to extract and process patient medical data and the way this data is used in common practice. With Ginga we wanted to bring both sides of research closer together by creating a single pipeline that can be easily used by all kinds of users. We aim for a user-friendly and intuitive pipeline for the analysis of genomic data, the detection of cancer mutations and the prediction of target sequences for a possible cancer vaccine therapy, while still taking advantage of existing open-source genomics packages such as BWA, Samtools and GATK. The pipeline borrows inspiration from other currently available pipelines such as TSNAD (Zhou, Zhan et al).
The pipeline uses the most advanced tools currently available to perform exome analysis and mutation calling, and integrates them into a single process using the Ginga Python scripts. The pipeline is meant to run natively on Linux and was deployed on the latest release of Ubuntu (18.04 LTS). More information about the technical aspects of the project can be found in Ginga’s GitHub. Translating the patient exome data into the neoantigen library requires 11 distinct steps. Apart from the main pipeline, the software also contains 2 additional workflows: one for chromosomal rearrangement detection (Breakdancer) and one for tracing back the DNA sequence and index of the candidate neoantigens (neoSearch). The outline of Ginga’s pipeline is as follows:
FastQC (Andrews S. 2010) is a genomic analysis tool used to inspect the raw sequence data and report the quality and condition of the reads. These metrics are useful downstream in the pipeline to assess the reliability of the results.
Raw sequencing reads contain universal and index adapter sequences (k-mers) that are used to sort and organise the reads. Furthermore, sequenced reads have heterogeneous lengths. Trimmomatic (Bolger, Lohse and Usadel 2114-2120) clips the adapters and removes artifacts. Alternatively, BBmap can be used to automatically search for and clip a series of common sequencing adapters.
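To make the idea of adapter clipping concrete, here is a minimal Python sketch. It is not Trimmomatic's or BBmap's actual algorithm (both tolerate mismatches and take base quality into account); it only removes an exact adapter match, or a partial adapter left at the 3' end of a read. The function name, the `min_overlap` parameter and the example reads are assumptions made for illustration.

```python
def clip_adapter(read: str, adapter: str, min_overlap: int = 3) -> str:
    """Remove an exact adapter match, or a partial adapter at the 3' end, from a read."""
    # Case 1: the whole adapter occurs in the read; keep only the bases before it.
    idx = read.find(adapter)
    if idx != -1:
        return read[:idx]
    # Case 2: the read ends with a prefix of the adapter (longest prefix tried first).
    for k in range(min(len(adapter), len(read)) - 1, min_overlap - 1, -1):
        if read.endswith(adapter[:k]):
            return read[:-k]
    # No adapter found: return the read unchanged.
    return read


if __name__ == "__main__":
    adapter = "AGATCGGAAGAGC"  # example adapter sequence used only for this demo
    print(clip_adapter("ACGTACGTAGATCGGAAGAGCTTT", adapter))
    print(clip_adapter("ACGTACGTAGATC", adapter))
```

Requiring a minimum overlap avoids trimming reads that merely end in one or two bases that happen to match the start of the adapter.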
To locate each read within the exome, the samples must be mapped to a reference genome. BWA (McKenna et al. 1297-1303) is based on the Burrows-Wheeler transform, which allows the reads to be aligned efficiently.
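The Burrows-Wheeler transform itself can be illustrated in a few lines of Python. This is a naive sketch that sorts all rotations of the text; BWA builds the transform with far more memory-efficient index structures, so this is only meant to show what the transform computes and that it is reversible. The function names are ours.

```python
def bwt(text: str, sentinel: str = "$") -> str:
    """Burrows-Wheeler transform: last column of the sorted rotations of text + sentinel."""
    if sentinel in text:
        raise ValueError("text must not contain the sentinel character")
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(last_column: str, sentinel: str = "$") -> str:
    """Rebuild the original text by repeatedly prepending the last column and sorting."""
    n = len(last_column)
    table = [""] * n
    for _ in range(n):
        table = sorted(last_column[i] + table[i] for i in range(n))
    # The row that ends with the sentinel is the original text followed by the sentinel.
    original = next(row for row in table if row.endswith(sentinel))
    return original[:-1]


if __name__ == "__main__":
    transformed = bwt("GATTACA")
    print(transformed, inverse_bwt(transformed))
```

The transform tends to group identical characters together and, combined with an index over the sorted rotations, supports fast substring search against the reference.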
Samtools (Li et al. 2078-2079) can be used to sort the aligned reads and convert them from the SAM (sequence alignment/map) format to the BAM (binary alignment/map) format. BAM files are more compact and better optimized than SAM files, which speeds up the workflow of the pipeline. Furthermore, Samtools can index the file.
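A sort-and-index step of this kind could be wrapped from Python roughly as follows. This is a hedged sketch, not Ginga's actual script: the function names and file names are hypothetical, and it assumes `samtools` is installed and on the PATH. Only the `samtools sort` (with `-@` for additional threads and `-o` for the output file) and `samtools index` subcommands are used.

```python
import subprocess
from typing import List


def build_sort_index_commands(sam_path: str, bam_path: str, threads: int = 1) -> List[List[str]]:
    """Build the samtools commands that sort a SAM file into a BAM file and index it."""
    sort_cmd = ["samtools", "sort", "-@", str(threads), "-o", bam_path, sam_path]
    index_cmd = ["samtools", "index", bam_path]
    return [sort_cmd, index_cmd]


def run_commands(commands: List[List[str]]) -> None:
    """Run each command in order and stop at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical file names; printing the commands instead of running them.
    for cmd in build_sort_index_commands("patient01.sam", "patient01.sorted.bam", threads=4):
        print(" ".join(cmd))
```

Building the commands separately from running them makes the step easy to log and to test without the tools installed.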
Duplicated sequences are commonly found in sequencing data because certain reads are enriched during the sequencing protocol. The Picard MarkDuplicates tool removes the duplicated sequences, reducing possible bias and artifacts during variant calling.
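The principle behind duplicate marking can be sketched as follows. This is a simplification under stated assumptions: reads that share chromosome, start position and strand are treated as duplicates, and the copy with the highest summed base quality is kept. Picard's real logic works on read pairs and unclipped 5' positions; the `Read` record and function name here are hypothetical.

```python
from collections import defaultdict
from typing import Iterable, NamedTuple, Set


class Read(NamedTuple):
    name: str
    chrom: str
    pos: int
    strand: str
    base_quality_sum: int


def find_duplicates(reads: Iterable[Read]) -> Set[str]:
    """Return the names of reads that duplicate a better-quality read at the same locus."""
    groups = defaultdict(list)
    for read in reads:
        groups[(read.chrom, read.pos, read.strand)].append(read)

    duplicates = set()
    for group in groups.values():
        # Keep the read with the highest summed base quality as the representative.
        best = max(group, key=lambda r: r.base_quality_sum)
        duplicates.update(r.name for r in group if r is not best)
    return duplicates


if __name__ == "__main__":
    reads = [
        Read("r1", "chr1", 100, "+", 300),
        Read("r2", "chr1", 100, "+", 350),
        Read("r3", "chr1", 100, "-", 200),
        Read("r4", "chr2", 100, "+", 100),
    ]
    print(find_duplicates(reads))
```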
The quality of the base calls in each read can affect the results of the variant calling analysis. The GATK (McKenna et al. 1297-1303) BaseRecalibrator tool uses machine learning to recalibrate the base quality scores, reducing the number of false positives.
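Base qualities are Phred-scaled, meaning a score Q corresponds to an error probability of 10^(-Q/10). The sketch below shows that relationship and how an empirical quality can be estimated from observed mismatches at positions assumed to be non-variant. It does not reproduce BaseRecalibrator's covariate model; the function names and the add-one smoothing are illustrative assumptions.

```python
import math


def phred_to_error_prob(quality: float) -> float:
    """Convert a Phred quality score to the probability that the base call is wrong."""
    return 10 ** (-quality / 10)


def error_prob_to_phred(prob: float) -> float:
    """Convert an error probability to a Phred quality score."""
    return -10 * math.log10(prob)


def empirical_quality(mismatches: int, observations: int) -> float:
    """Estimate a Phred score from mismatch counts, smoothed to avoid log10(0)."""
    # Add-one smoothing is an assumption chosen for this sketch.
    prob = (mismatches + 1) / (observations + 2)
    return error_prob_to_phred(prob)


if __name__ == "__main__":
    # Bases reported as Q30 that mismatch the reference far more often than
    # 1 in 1000 times would be assigned a lower recalibrated quality.
    print(phred_to_error_prob(30))
    print(empirical_quality(mismatches=9, observations=998))
```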
GATK (McKenna et al. 1297-1303) Mutect2 detects SNPs and indels across large sets of reads. GATK FilterMutectCalls then removes variants according to a series of filters and the estimated contamination of the samples.
Annovar functionally annotates the variants and protein coding changes of the filtered mutations.
neoExtract determines all possible candidate neoantigen peptides that contain the functional variant change within the annotated proteins, simplifying the HLA-peptide binding affinity prediction.
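One plausible way to enumerate such candidates is a sliding window over the mutated protein, keeping every peptide of a given length that covers the mutated residue. The sketch below is an assumption about the approach, not neoExtract's actual code; the function name, the default peptide lengths and the example sequence are hypothetical.

```python
from typing import List, Sequence, Tuple


def mutated_peptides(
    protein: str, mut_index: int, lengths: Sequence[int] = (8, 9, 10, 11)
) -> List[Tuple[int, str]]:
    """Return (start, peptide) for every window that contains the residue at mut_index.

    protein   -- mutated protein sequence
    mut_index -- 0-based index of the mutated residue
    lengths   -- peptide lengths to enumerate
    """
    peptides = []
    for k in lengths:
        first = max(0, mut_index - k + 1)        # leftmost window still covering the mutation
        last = min(mut_index, len(protein) - k)  # rightmost window that fits in the protein
        for start in range(first, last + 1):
            peptides.append((start, protein[start:start + k]))
    return peptides


if __name__ == "__main__":
    # Toy sequence with the mutated residue at index 4.
    for start, peptide in mutated_peptides("MKTAYIAKQR", mut_index=4, lengths=(3,)):
        print(start, peptide)
```

Keeping the start index alongside each peptide makes it possible to trace a candidate back to its position in the protein later on.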
NetMHC (Andreatta and Nielsen 511-517) uses machine learning algorithms to predict the binding affinity of the peptides to the MHC-I complexes on the surface of antigen-presenting cells.
neoSearch sorts the HLA-binding affinity predictions by their rank in order to standardize the output. Furthermore, it uses the protein index to look up the mutated DNA sequence encoding each neoantigen.
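The two tasks described for neoSearch can be sketched as below: sort predictions by rank (in NetMHC output a lower %Rank indicates a stronger predicted binder) and use each peptide's protein index to slice the corresponding codons out of the mutated coding sequence. The record layout, field names and function name are assumptions for illustration, not neoSearch's real interface.

```python
from typing import Dict, List


def rank_and_locate(predictions: List[Dict], coding_sequence: str) -> List[Dict]:
    """Sort peptide predictions by rank and attach the DNA that encodes each peptide.

    predictions     -- dicts with 'peptide', 'start' (0-based protein index) and 'rank'
    coding_sequence -- mutated coding DNA, in frame, starting at the first codon
    """
    results = []
    for pred in sorted(predictions, key=lambda p: p["rank"]):
        dna_start = 3 * pred["start"]                  # three bases per amino acid
        dna_end = dna_start + 3 * len(pred["peptide"])
        results.append({**pred, "dna": coding_sequence[dna_start:dna_end]})
    return results


if __name__ == "__main__":
    cds = "ATGAAAACCGCGTAT"  # toy coding sequence for the peptide M K T A Y
    preds = [
        {"peptide": "KTA", "start": 1, "rank": 2.0},
        {"peptide": "TAY", "start": 2, "rank": 0.5},
    ]
    for row in rank_and_locate(preds, cds):
        print(row["peptide"], row["rank"], row["dna"])
```

The returned DNA sequences are the kind of target that could then be passed on to the ctDNA detection step.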
Validation of the Ginga pipeline was performed with whole exome sequencing data from 8 melanoma patients. The guide and results of this validation can be found in the Results section of the wiki and in Ginga’s GitHub.
Even though the current pipeline has been validated with real-world data, with very promising results, we believe that we have only scratched the surface of Ginga’s real potential. We recently applied for a Research Grant from AWS to continue developing the software and make it more accessible and user-friendly; the grant would provide us with a full year of funding and support to fully develop the pipeline. Furthermore, we have already been in contact with researchers from different bioinformatics institutions, including the Swiss Bioinformatics Institute, other bioinformatics laboratories and cancer research groups at EPFL, UNIL and CHUV. We intend to implement Ginga using their tools and iteratively improve the software wrapper and user interface to remove the complexity of using the Linux command line.
Motivation
Ginga was conceived with three main objectives in mind:
Data from thousands of patients can be analyzed in parallel
The characteristics of each sample can be fine-tuned
Optimized data flow and compatibility between file formats and structure
The Pipeline
Quality Control - FastQC (v0.11.7)
Sequencing Adapter Cut and Quality Control - Trimmomatic (v0.38) / BBmap (v38.36)
Reference Alignment - BWA (v0.7.17)
Sort and Indexing - Samtools (v1.7)
Remove Duplicate Reads - Picard (v2.18.14)
Base Quality Assessment - GATK (v4.0.9.0)
Mutation Calling - GATK (v4.0.9.0) Mutect2
Filter Mutations - GATK (v4.0.9.0)
Gene-based Annotation - Annovar (v2018Apr16)
Mutated Peptide Extraction - neoExtract (v1.0)
MHC-I-peptide Binding Affinity - NetMHC (v4.0)
Rank Binding Affinity and Search for Origin Mutations - neoSearch (v1.0)
Validation
Future Implementation
Reference