Difference between revisions of "Team:Tongji China/Programme"

Revision as of 15:41, 7 October 2018

Programme
Dry Lab
Programming
We obtain our data from the Genomic Data Commons (GDC) data portal, which hosts The Cancer Genome Atlas (TCGA) data, and use its listing of colorectal cancer mutations (cases whose primary site is Colon, Colorectal or Rectum).
The Genomic Data Commons (GDC) provides the cancer research community with a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine.

Fig.Program.1 Web page of the TCGA database filtered for colorectal cancer

In the picture above, a data table appears at the bottom right corner. It lists DNA changes, mutation types, consequences, affected cases in the cohort, affected cases across the GDC, the impact of the mutations and survival plots. Part of the table is shown below:

Table.Programming.1 Part of the data table we get from the TCGA database containing information of mutations of colorectal cancer.

The first column of the table tells us where and how each mutation occurs. For example, “chr12:g.25245350C>T” means that at position 25245350 of chromosome 12, “C” is changed into “T”. The second column gives the type of the mutation: substitution, deletion or insertion. The third column gives the consequence of the mutation. In the first row, the DNA change “chr12:g.25245350C>T” results in a missense mutation of the gene “KRAS”, and this mutation already has the name “KRAS G12D”. The fourth column gives the frequency of the mutation among the colorectal cancer patients. For example, “60 / 537, 11.17%” means that 60 of the 537 patients in the cohort carry that mutation (chr12:g.25245350C>T).
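The two notations described above can be read programmatically. The following is a minimal sketch, not the team's own script: the function names are hypothetical, and only the substitution notation quoted in the text is handled (deletions and insertions are written differently and would need their own patterns).

```python
import re

# Pattern for the substitution notation quoted above, e.g. "chr12:g.25245350C>T".
# Deletions and insertions use other notations and are not handled here.
SUBSTITUTION = re.compile(r"^(chr[0-9A-Za-z]+):g\.(\d+)([ACGT])>([ACGT])$")


def parse_dna_change(text):
    """Split a substitution string into (chromosome, position, ref, alt)."""
    match = SUBSTITUTION.match(text.strip())
    if match is None:
        raise ValueError(f"not a single-nucleotide substitution: {text!r}")
    chrom, pos, ref, alt = match.groups()
    return chrom, int(pos), ref, alt


def affected_fraction(cell):
    """Read a cell like '60 / 537, 11.17%' and return (affected, cohort, percent)."""
    counts = cell.split(",")[0]
    affected, cohort = (int(part) for part in counts.split("/"))
    # Recompute the percentage from the counts instead of trusting the text.
    return affected, cohort, round(100 * affected / cohort, 2)


print(parse_dna_change("chr12:g.25245350C>T"))
print(affected_fraction("60 / 537, 11.17%"))
```

Recomputing the percentage from the two counts gives a quick consistency check on each row of the table.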
We take the reference genome from the Encyclopedia of DNA Elements (ENCODE) dataset: GRCh38_no_alt_analysis_set_GCA_000001405.15 (https://www.encodeproject.org/files/GRCh38_no_alt_analysis_set_GCA_000001405.15/) in FASTA format. We also take the genome annotation file (set ENCSR425FOI, https://www.encodeproject.org/references/ENCSR425FOI/) from ENCODE, which describes the features of genes in “Gene Transfer Format” (GTF). Each line of this GTF file contains the fields seqname, source, feature, start, end, score, strand, frame and group. A line of the GTF file looks like this (shown as a table for readability):

Table.programming.2 An example line of the genome reference file in GTF format.

We use the DNA change in the first column of the data table and the gene named in the third column to find the corresponding lines in the GTF file. A matching line must refer to the same gene, have the feature “CDS”, and have start and end positions that enclose the mutation site. For example, for the DNA change “chr12:g.25245350C>T” of the gene KRAS, we find the following line in the GTF file (shown as a table for readability):

Table.programming.3 An example line of the genome reference file in GTF format.
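The three matching conditions (same gene, CDS feature, interval enclosing the mutation site) can be sketched as below. This is an illustration under assumptions, not the team's code: the example line is a simplified, hypothetical GTF line built only from the values quoted in the text, the gene is assumed to be stored in a `gene_name` attribute, and the function names are invented for the sketch.

```python
def parse_gtf_line(line):
    """Parse one tab-separated GTF line into a dictionary."""
    fields = line.rstrip("\n").split("\t")
    seqname, source, feature, start, end, score, strand, frame, group = fields
    attributes = {}
    # The ninth column holds 'key "value";' pairs.
    for item in group.split(";"):
        item = item.strip()
        if item:
            key, _, value = item.partition(" ")
            attributes[key] = value.strip('"')
    return {
        "seqname": seqname,
        "feature": feature,
        "start": int(start),
        "end": int(end),
        "strand": strand,
        "frame": frame,
        "attributes": attributes,
    }


def cds_encloses(record, chrom, gene, pos):
    """True if the record is a CDS of `gene` on `chrom` that contains `pos`."""
    return (
        record["feature"] == "CDS"
        and record["seqname"] == chrom
        and record["attributes"].get("gene_name") == gene
        and record["start"] <= pos <= record["end"]
    )


# Simplified, hypothetical line: real GTF lines carry a source name and
# many more attributes than the single one shown here.
EXAMPLE_LINE = 'chr12\t.\tCDS\t25245274\t25245384\t.\t-\t0\tgene_name "KRAS";'

record = parse_gtf_line(EXAMPLE_LINE)
print(cds_encloses(record, "chr12", "KRAS", 25245350))
```

GTF coordinates are 1-based and inclusive, so the comparison uses `<=` on both ends.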

This line belongs to the gene KRAS and has the feature CDS; its start position 25245274 lies before the mutation site 25245350 and its end position 25245384 lies after it. We then create a GFF file that looks like the example below. (Like GTF, the GFF format contains the fields seqname, source, feature, start, end, score, strand, frame and group.)
chr12 . chr12:g.25245350C>T|KRAS|25245274|25245384|-|0 25245274 25245384 . - 0 .
Table.programming.4 An example line of GFF format file we create.
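A line like the one in Table.programming.4 could be assembled as in the sketch below. It is not the tsv2gff.py script itself; the function name and argument order are assumptions, and the layout simply mirrors the example line, with the mutation details packed into the feature column and joined by “|”.

```python
def make_gff_line(dna_change, gene, chrom, start, end, strand, frame):
    """Build one tab-separated GFF line for the CDS that holds a mutation."""
    # Pack everything needed later into the third (feature) column.
    feature = "|".join(
        [dna_change, gene, str(start), str(end), strand, str(frame)]
    )
    columns = [
        chrom,        # seqname
        ".",          # source (unused)
        feature,      # feature
        str(start),   # start
        str(end),     # end
        ".",          # score (unused)
        strand,       # strand
        str(frame),   # frame
        ".",          # group (unused)
    ]
    return "\t".join(columns)


print(make_gff_line("chr12:g.25245350C>T", "KRAS", "chr12",
                    25245274, 25245384, "-", 0))
```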

tsv2gff.py (Python script): https://static.igem.org/mediawiki/2018/f/f7/T--Tongji_China--python-1-tsv2gff.txt




The first, fourth, fifth, seventh and eighth columns of the GFF file give the chromosome, start, end, strand and frame, respectively. We store the information that will be needed later in the third (feature) column.
Then we use the bedtools software (https://bedtools.readthedocs.io/en/latest/content/tools/getfasta.html?highlight=getfasta) to extract the DNA sequences from the reference genome (GRCh38).


By running the command below, we get a FASTA file containing the normal DNA sequences of the CDS regions in which the colorectal cancer mutations from the TCGA GDC data portal occur:
bedtools getfasta -fo ***.fasta \
 -fi GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta \
 -bed ***.gff \
 -s \
 -name

The normal FASTA file looks like this:

>chr12:g.25245350C>T|KRAS|25245274|25245384|-|0(-)
ATGACTGAATATAAACTTGTGGTAGTTGGAGCTGGTGGCGTAGGCAAGAGTGCCT
TGACGATACAGCTAATTCAGAATCATTTTGTGGACGAATATGATCCAACAATAGAG

nor2snv.py

Click to see the Python script
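The mutated sequence used in the next step can be derived from the normal one by locating the mutation inside the extracted CDS. The sketch below is an assumption about how this step can be done, not a copy of nor2snv.py, and the function name is hypothetical. It handles single-nucleotide substitutions only and assumes the sequence was extracted with the `-s` option, so that minus-strand sequences are already reverse-complemented.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def mutate_cds(sequence, start, end, strand, pos, ref, alt):
    """Apply a genomic substitution (plus-strand ref>alt) to a CDS sequence."""
    if not start <= pos <= end:
        raise ValueError("mutation site lies outside the CDS")
    if strand == "-":
        # The sequence is reverse-complemented, so count from the end
        # and complement both alleles.
        index = end - pos
        ref, alt = COMPLEMENT[ref], COMPLEMENT[alt]
    else:
        index = pos - start
    if sequence[index].upper() != ref:
        raise ValueError("reference base does not match the sequence")
    return sequence[:index] + alt + sequence[index + 1:]


# Normal KRAS CDS sequence from the FASTA example above.
NORMAL = (
    "ATGACTGAATATAAACTTGTGGTAGTTGGAGCTGGTGGCGTAGGCAAGAGTGCCT"
    "TGACGATACAGCTAATTCAGAATCATTTTGTGGACGAATATGATCCAACAATAGAG"
)

mutated = mutate_cds(NORMAL, 25245274, 25245384, "-", 25245350, "C", "T")
print(mutated[33:36])  # codon 12 after the mutation
```

For chr12:g.25245350C>T the C>T change on the plus strand becomes G>A in the extracted minus-strand sequence, which turns codon 12 from GGT (glycine) into GAT (aspartate), consistent with the name KRAS G12D.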



Then we translate the “mutated fasta file” into amino acid sequences and generate sliding windows (8 to 14 aa long) around the mutated sites.

Fig.programming.3 Generating sliding windows of 8 aa to 14 aa around the mutation sites.
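The translation and windowing step can be sketched as follows. This is an illustration, not the mkfrm.py script: the function names are hypothetical, the standard genetic code is assumed, and the mutated residue index (11, counted from 0) is taken from the KRAS example, where the changed nucleotide falls in the twelfth codon.

```python
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, built from the usual TCAG ordering.
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna, frame=0):
    """Translate a DNA string, skipping `frame` leading bases."""
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE[codon] for codon in codons)


def windows_around(protein, site, length):
    """All peptides of `length` residues that contain position `site`."""
    first = max(0, site - length + 1)
    last = min(site, len(protein) - length)
    return [protein[s:s + length] for s in range(first, last + 1)]


# Mutated KRAS CDS sequence (G>A at index 34 of the normal sequence).
MUTATED = (
    "ATGACTGAATATAAACTTGTGGTAGTTGGAGCTGATGGCGTAGGCAAGAGTGCCT"
    "TGACGATACAGCTAATTCAGAATCATTTTGTGGACGAATATGATCCAACAATAGAG"
)

protein = translate(MUTATED)
for length in range(8, 15):
    for peptide in windows_around(protein, 11, length):
        print(length, peptide)
```

Clamping the first and last start positions to the ends of the protein is why a window length of 14 gives fewer than 14 peptides here: the mutation sits close to the start of the sequence.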


Then we get a series of amino acid fasta files:


Windows of the chr12:g.25245350C>T mutation of 8aa:

>chr12:g.25245350C>T|KRAS|25245274|25245384|-|0(-)
KLVVVGAD
>chr12:g.25245350C>T|KRAS|25245274|25245384|-|0(-)
LVVVGADG
>chr12:g.25245350C>T|KRAS|25245274|25245384|-|0(-)
VVVGADGV
>chr12:g.25245350C>T|KRAS|25245274|25245384|-|0(-)
VVGADGVG
>chr12:g.25245350C>T|KRAS|25245274|25245384|-|0(-)
VGADGVGK
>chr12:g.25245350C>T|KRAS|25245274|25245384|-|0(-)
GADGVGKS
>chr12:g.25245350C>T|KRAS|25245274|25245384|-|0(-)
ADGVGKSA
>chr12:g.25245350C>T|KRAS|25245274|25245384|-|0(-)
DGVGKSAL



Windows of the chr12:g.25245350C>T mutation of 9aa:

>chr12:g.25245350C>T|KRAS|25245274|25245384|-|0(-)
YKLVVVGAD
>chr12:g.25245350C>T|KRAS|25245274|25245384|-|0(-)
KLVVVGADG
>chr12:g.25245350C>T|KRAS|25245274|25245384|-|0(-)
LVVVGADGV
>chr12:g.25245350C>T|KRAS|25245274|25245384|-|0(-)
VVVGADGVG
>chr12:g.25245350C>T|KRAS|25245274|25245384|-|0(-)
VVGADGVGK
>chr12:g.25245350C>T|KRAS|25245274|25245384|-|0(-)
VGADGVGKS
>chr12:g.25245350C>T|KRAS|25245274|25245384|-|0(-)
GADGVGKSA
>chr12:g.25245350C>T|KRAS|25245274|25245384|-|0(-)
ADGVGKSAL
>chr12:g.25245350C>T|KRAS|25245274|25245384|-|0(-)
DGVGKSALT
……


Windows of the chr12:g.25245350C>T mutation of 14aa:

>chr12:g.25245350C>T|KRAS|25245274|25245384|-|0(-)
MTEYKLVVVGADGV
>chr12:g.25245350C>T|KRAS|25245274|25245384|-|0(-)
TEYKLVVVGADGVG
>chr12:g.25245350C>T|KRAS|25245274|25245384|-|0(-)
EYKLVVVGADGVGK
>chr12:g.25245350C>T|KRAS|25245274|25245384|-|0(-)
YKLVVVGADGVGKS
>chr12:g.25245350C>T|KRAS|25245274|25245384|-|0(-)
KLVVVGADGVGKSA
>chr12:g.25245350C>T|KRAS|25245274|25245384|-|0(-)
LVVVGADGVGKSAL
>chr12:g.25245350C>T|KRAS|25245274|25245384|-|0(-)
VVVGADGVGKSALT
>chr12:g.25245350C>T|KRAS|25245274|25245384|-|0(-)
VVGADGVGKSALTI
>chr12:g.25245350C>T|KRAS|25245274|25245384|-|0(-)
VGADGVGKSALTIQ
>chr12:g.25245350C>T|KRAS|25245274|25245384|-|0(-)
GADGVGKSALTIQL
>chr12:g.25245350C>T|KRAS|25245274|25245384|-|0(-)
ADGVGKSALTIQLI
>chr12:g.25245350C>T|KRAS|25245274|25245384|-|0(-)
DGVGKSALTIQLIQ


mkfrm.py

Click to see the Python script.




Then we use NetMHC to predict the immunogenicity of these oligopeptides. We downloaded the NetMHC software to run it locally; it has the same functionality as the NetMHC 4.0 Server. We run the peptides separately for each allele type and peptide length. One of the command lines looks like this:

netMHC -l 8 -a HLA-A0101 aa_len8.fasta -s > len8_HLA-A0101.txt

This command runs the 8 aa peptides with the allele type set to “HLA-A0101”. We run the peptides of every length against all the HLA supertype representatives among the allele types that the NetMHC Server provides. The HLA supertype representatives, as selected by the NetMHC Server, are listed below:
HLA-A0101 HLA-A*01:01 (A1) HLA supertype representative
HLA-A0201 HLA-A*02:01 (A2) HLA supertype representative
HLA-A0301 HLA-A*03:01 (A3) HLA supertype representative
HLA-A2402 HLA-A*24:02 (A24) HLA supertype representative
HLA-A2601 HLA-A*26:01 (A26) HLA supertype representative
HLA-B0702 HLA-B*07:02 (B7) HLA supertype representative
HLA-B0801 HLA-B*08:01 (B8) HLA supertype representative
HLA-B2705 HLA-B*27:05 (B27) HLA supertype representative
HLA-B3901 HLA-B*39:01 (B39) HLA supertype representative
HLA-B4001 HLA-B*40:01 (B44) HLA supertype representative
HLA-B5801 HLA-B*58:01 (B58) HLA supertype representative
HLA-B1501 HLA-B*15:01 (B62) HLA supertype representative
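Running every peptide length against every supertype representative means one command per combination. The sketch below only generates the command strings; it reuses the options from the example command above and assumes, by extrapolation from that single example, that the input and output files follow the same naming pattern for every length and allele.

```python
# The twelve HLA supertype representatives listed above.
ALLELES = [
    "HLA-A0101", "HLA-A0201", "HLA-A0301", "HLA-A2402",
    "HLA-A2601", "HLA-B0702", "HLA-B0801", "HLA-B2705",
    "HLA-B3901", "HLA-B4001", "HLA-B5801", "HLA-B1501",
]
LENGTHS = range(8, 15)  # peptide lengths of 8 to 14 aa


def netmhc_command(length, allele):
    """Build one command line in the same form as the example above."""
    # File names are assumed to follow the pattern of the quoted example.
    return (
        f"netMHC -l {length} -a {allele} aa_len{length}.fasta -s "
        f"> len{length}_{allele}.txt"
    )


commands = [netmhc_command(n, allele) for n in LENGTHS for allele in ALLELES]
print(len(commands))
print(commands[0])
```

The resulting lines can be written to a shell script and run in sequence.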

The choice of allele depends on the source of the sequence, the ethnicity of the sample donors, and the MHC-I molecule or molecules to which the potential antigen is most likely to bind. Then we submit and run the jobs. You can download our files below.

Output files

In the output, the lower the Affinity value, the more readily the oligopeptide binds to the given MHC-I molecule. The software also flags the results for us: peptides marked “SB” (strong binding) are the ones we want most, and those marked “WB” (weak binding) are second best. Because we have plenty of candidate peptides, we do not need to settle for second best and adopt only the “SB” ones.
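Keeping only the strong binders can be sketched as below. The sketch assumes that NetMHC marks a binder by ending its result row with “<= SB” or “<= WB”; that layout is an assumption about the output files and should be checked against the real output before use. The function name is hypothetical.

```python
def strong_binders(lines):
    """Return the result rows that are flagged as strong binders."""
    hits = []
    for line in lines:
        tokens = line.split()
        # Assumed layout: a flagged row ends with the tokens "<=" and "SB".
        if len(tokens) >= 2 and tokens[-2] == "<=" and tokens[-1] == "SB":
            hits.append(line.rstrip("\n"))
    return hits


# Usage, assuming an output file produced by the command shown earlier:
#
#     with open("len8_HLA-A0101.txt") as handle:
#         for row in strong_binders(handle):
#             print(row)
```

Matching on the last two tokens rather than on a substring avoids picking up peptides or headers that happen to contain the letters “SB”.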
In this way, the DRY LAB selects candidate antigen peptides for colorectal cancer, which are then tested in the other part of our project, the WET LAB.



Acknowledgements:
National Cancer Institute (NCI)
The Cancer Genome Atlas (TCGA)
The Genomic Data Commons (GDC) data portal
Encyclopedia of DNA Elements (ENCODE)
NetMHC