Latest revision as of 19:01, 17 October 2018
Dry Lab
Programming
Phase 1. Get data from TCGA and preparation
We get our data from The Cancer Genome Atlas (TCGA) through the Genomic Data Commons (GDC) data portal, which provides information on colorectal cancer mutations. The GDC provides the cancer research community with a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine.
Fig.Program.1 Web page of the TCGA database filtered for colorectal cancer.
In the picture above, you can see a data table at the bottom right corner. The data table contains DNA changes, mutation types, consequences, affected cases in the cohort, affected cases across the GDC, the impact of the mutations and survival plots. Part of the table is shown below:
Table.Programming.1 Part of the data table we get from the TCGA database containing information of mutations of colorectal cancer.
The first column of the table tells us where and how the mutations (single nucleotide variations) occur. For example, "chr12:g.25245350C>T" means that at position 25245350 of chromosome 12, "C" is changed into "T". The second column gives the type of the mutation: substitution, deletion or insertion. We exclude mutations located in introns and UTRs, like the highlighted record in the table. The third column gives the consequence of the mutation. In the first row, the DNA change "chr12:g.25245350C>T" results in a missense mutation of the gene "KRAS", and this mutation is already known as "KRAS G12D". The fourth column gives the frequency of the mutation among the colorectal cancer patients. For example, "60 / 537, 11.17%" means that 60 of the 537 patients in the cohort carry that mutation (chr12:g.25245350C>T).
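The two text formats described above can be split into their parts with a few lines of Python. This is a minimal sketch, not part of the scripts linked on this page; the function names are hypothetical, and it assumes that only single-base substitutions need to be parsed.

```python
import re

# Pattern for a substitution such as "chr12:g.25245350C>T".
# Assumption: only single-base substitutions are handled here.
DNA_CHANGE = re.compile(r"^(chr[0-9XYM]+):g\.(\d+)([ACGT])>([ACGT])$")


def parse_dna_change(text):
    """Split a DNA change string into (chromosome, position, ref, alt)."""
    match = DNA_CHANGE.match(text.strip())
    if match is None:
        raise ValueError(f"not a single-base substitution: {text!r}")
    chrom, pos, ref, alt = match.groups()
    return chrom, int(pos), ref, alt


def parse_affected_cases(text):
    """Turn a string like "60 / 537, 11.17%" into (affected, total, fraction)."""
    counts = text.split(",")[0]
    affected, total = (int(part) for part in counts.split("/"))
    return affected, total, affected / total


change = parse_dna_change("chr12:g.25245350C>T")
cases = parse_affected_cases("60 / 537, 11.17%")
```

The fraction is recomputed from the two counts rather than read from the percentage, so a rounding difference in the table cannot propagate.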
We get the reference genome, GRCh38_no_alt_analysis_set_GCA_000001405.15, in FASTA format from the Encyclopedia of DNA Elements (ENCODE). We also use the genome annotation file (set ENCSR425FOI) from ENCODE, which describes gene features in Gene Transfer Format (GTF). Each line of this GTF file contains the fields seqname, source, feature, start, end, score, strand, frame and group. An example line from the GTF file is shown below (laid out as a table for readability):
Table.programming.2 An example line of the genome reference file in GTF format.
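The nine fields listed above are tab-separated, so one GTF line can be read as follows. This is a minimal sketch; the example line is illustrative only (its source and attribute values are made up, and only the KRAS coordinates come from this page), and the names GtfRecord and parse_gtf_line are hypothetical.

```python
from typing import NamedTuple


class GtfRecord(NamedTuple):
    seqname: str
    source: str
    feature: str
    start: int
    end: int
    score: str
    strand: str
    frame: str
    attributes: dict


def parse_gtf_line(line):
    """Parse one tab-separated GTF line into a GtfRecord."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 fields, got {len(fields)}")
    attributes = {}
    # The group column holds 'key "value";' pairs.
    for item in fields[8].split(";"):
        item = item.strip()
        if item:
            key, _, value = item.partition(" ")
            attributes[key] = value.strip('"')
    return GtfRecord(fields[0], fields[1], fields[2], int(fields[3]),
                     int(fields[4]), fields[5], fields[6], fields[7],
                     attributes)


# Illustrative line only: source and attribute values are made up.
example = 'chr12\tEXAMPLE\tCDS\t25245274\t25245384\t.\t-\t0\tgene_name "KRAS";'
record = parse_gtf_line(example)
```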
We use the DNA change in the first column of the data table and the gene ID in the third column to find the corresponding lines in the GTF file: lines that belong to the same gene, have the feature type CDS, and whose start and end positions enclose the mutation site. For example, for the DNA change “chr12:g.25245350C>T” of gene KRAS, we find the following line in the GTF file (laid out as a table for readability):
Table.programming.3 An example line of the genome reference file in GTF format.
This line belongs to the gene KRAS and is a CDS feature; its start position 25245274 lies before the mutation site 25245350 and its end position 25245384 lies after it. We then create a GFF file that looks like the one below. (Like GTF, the GFF format contains seqname, source, feature, start, end, score, strand, frame and group.)
Table.programming.4 An example line of GFF format file we create.
tsv2gff.py Click to see the Python script
The first, fourth, fifth, seventh and eighth columns of the GFF file give the chromosome, start, end, strand and frame, respectively. We put the information that will be needed later into the third (feature) column.
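The selection rule and the GFF layout described above can be sketched as follows. This is not the tsv2gff.py script itself: the function names and the dictionary layout are hypothetical, the second record is a made-up CDS that does not enclose the mutation, and the "." placeholders for the source, score and group columns are an assumption. The label written to the feature column follows the FASTA header format shown further down this page.

```python
def enclosing_cds(records, gene, position):
    """Return the CDS records of `gene` whose span encloses `position`."""
    return [r for r in records
            if r["feature"] == "CDS" and r["gene"] == gene
            and r["start"] <= position <= r["end"]]


def to_gff_line(dna_change, record):
    """Build one GFF line; the feature column carries a '|'-joined label."""
    label = "|".join([dna_change, record["gene"], str(record["start"]),
                      str(record["end"]), record["strand"], record["frame"]])
    # Assumption: source, score and group are not needed and are set to ".".
    fields = [record["seqname"], ".", label, str(record["start"]),
              str(record["end"]), ".", record["strand"], record["frame"], "."]
    return "\t".join(fields)


records = [
    {"seqname": "chr12", "feature": "CDS", "gene": "KRAS",
     "start": 25245274, "end": 25245384, "strand": "-", "frame": "0"},
    # Hypothetical second CDS, included only to show that it is filtered out.
    {"seqname": "chr12", "feature": "CDS", "gene": "KRAS",
     "start": 25200000, "end": 25200100, "strand": "-", "frame": "0"},
]
hits = enclosing_cds(records, "KRAS", 25245350)
gff_line = to_gff_line("chr12:g.25245350C>T", hits[0])
```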
Then we use the bedtools software to extract the DNA sequences from the reference genome (GRCh38).
By running the command below, we get a FASTA file containing the normal DNA sequences of the CDSs in which the colorectal cancer mutations from the TCGA GDC data portal occur:
bedtools getfasta -fo ***.fasta \
-fi GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta \
-bed ***.gff \
-s \
-name
The normal FASTA file looks like this:
>chr12:g.25245350C>T|KRAS|25245274|25245384|-|0(-)
ATGACTGAATATAAACTTGT
GGTAGTTGGAGCTGGTGGCG
TAGGCAAGAGTGCCTTGACGA
TACAGCTAATTCAGAATCAT
TTTGTGGACGAATATGATCC
AACAATAGAG
nor2snv.py Click to see the Python script
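The script name nor2snv.py and the “mutated fasta file” used in Phase 2 suggest that this step applies each mutation to the normal sequence. A minimal sketch of that step, not the script itself: it assumes the sequence was extracted strand-aware (the -s option above), so a mutation on a minus-strand CDS is indexed from the end coordinate and its bases are complemented. The function name apply_snv is hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def apply_snv(cds_seq, start, end, strand, position, ref, alt):
    """Return cds_seq with the substitution at genomic `position` applied.

    cds_seq is the strand-aware sequence; start and end are 1-based and
    inclusive; ref and alt are given on the forward strand.
    """
    if not start <= position <= end:
        raise ValueError("position lies outside the CDS")
    if strand == "-":
        # The sequence is reverse-complemented, so count back from `end`.
        index = end - position
        ref, alt = ref.translate(COMPLEMENT), alt.translate(COMPLEMENT)
    else:
        index = position - start
    if cds_seq[index] != ref:
        raise ValueError(f"expected {ref} at index {index}, found {cds_seq[index]}")
    return cds_seq[:index] + alt + cds_seq[index + 1:]


# Normal KRAS CDS sequence from the FASTA record shown above.
normal = ("ATGACTGAATATAAACTTGTGGTAGTTGGAGCTGGTGGCGTAGGCAAGAGTGCCTTGACGA"
          "TACAGCTAATTCAGAATCATTTTGTGGACGAATATGATCCAACAATAGAG")
mutated = apply_snv(normal, 25245274, 25245384, "-", 25245350, "C", "T")
```

For this record the change falls in the twelfth codon, which turns from GGT (glycine) into GAT (aspartic acid), in line with the name KRAS G12D.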
Phase 2. Make Peptide Windows
Then we translate the “mutated fasta file” into amino acid sequences and make sliding windows (8 to 14 aa long) around the mutated sites.
Fig.programming.3 Making sliding windows of 8 aa to 14 aa around the mutation sites.
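The translation and windowing can be sketched as below. This is not the mkfrm.py script: the function names are hypothetical, the standard genetic code is assumed, and a window is kept only if it contains the mutated residue and fits inside the translated CDS.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, with codons enumerated in TCAG order.
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(dna):
    """Translate a CDS sequence codon by codon, starting in frame 0."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def windows(protein, mutated_index, length):
    """Return every window of `length` residues that contains the mutated residue."""
    first = max(0, mutated_index - length + 1)
    last = min(mutated_index, len(protein) - length)
    return [protein[s:s + length] for s in range(first, last + 1)]


# KRAS CDS from Phase 1 with the chr12:g.25245350C>T change applied.
mutated = ("ATGACTGAATATAAACTTGTGGTAGTTGGAGCTGATGGCGTAGGCAAGAGTGCCTTGACGA"
           "TACAGCTAATTCAGAATCATTTTGTGGACGAATATGATCCAACAATAGAG")
protein = translate(mutated)
mutated_index = 34 // 3  # 0-based residue index of the changed base
windows_8 = windows(protein, mutated_index, 8)
windows_14 = windows(protein, mutated_index, 14)
```

Near the start of the protein fewer windows fit, which is why the 14 aa case yields 12 windows rather than 14.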
Then we get a series of amino acid fasta files:
Windows of the chr12:g.25245350C>T mutation of 8aa:
>chr12:g.25245350C>T|KRAS|25245274|25245384|-|0(-)
KLVVVGAD
>chr12:g.25245350C>T|KRAS|25245274|25245384|-|0(-)
LVVVGADG
>chr12:g.25245350C>T|KRAS|25245274|25245384|-|0(-)
VVVGADGV
>chr12:g.25245350C>T|KRAS|25245274|25245384|-|0(-)
VVGADGVG
>chr12:g.25245350C>T|KRAS|25245274|25245384|-|0(-)
VGADGVGK
>chr12:g.25245350C>T|KRAS|25245274|25245384|-|0(-)
GADGVGKS
>chr12:g.25245350C>T|KRAS|25245274|25245384|-|0(-)
ADGVGKSA
>chr12:g.25245350C>T|KRAS|25245274|25245384|-|0(-)
DGVGKSAL
......
Windows of the chr12:g.25245350C>T mutation of 9aa:
>chr12:g.25245350C>T|KRAS|25245274|25245384|-|0(-)
YKLVVVGAD
>chr12:g.25245350C>T|KRAS|25245274|25245384|-|0(-)
KLVVVGADG
>chr12:g.25245350C>T|KRAS|25245274|25245384|-|0(-)
LVVVGADGV
>chr12:g.25245350C>T|KRAS|25245274|25245384|-|0(-)
VVVGADGVG
>chr12:g.25245350C>T|KRAS|25245274|25245384|-|0(-)
VVGADGVGK
>chr12:g.25245350C>T|KRAS|25245274|25245384|-|0(-)
VGADGVGKS
>chr12:g.25245350C>T|KRAS|25245274|25245384|-|0(-)
GADGVGKSA
>chr12:g.25245350C>T|KRAS|25245274|25245384|-|0(-)
ADGVGKSAL
>chr12:g.25245350C>T|KRAS|25245274|25245384|-|0(-)
DGVGKSALT
……
Windows of the chr12:g.25245350C>T mutation of 14aa:
>chr12:g.25245350C>T|KRAS|25245274|25245384|-|0(-)
MTEYKLVVVGADGV
>chr12:g.25245350C>T|KRAS|25245274|25245384|-|0(-)
TEYKLVVVGADGVG
>chr12:g.25245350C>T|KRAS|25245274|25245384|-|0(-)
EYKLVVVGADGVGK
>chr12:g.25245350C>T|KRAS|25245274|25245384|-|0(-)
YKLVVVGADGVGKS
>chr12:g.25245350C>T|KRAS|25245274|25245384|-|0(-)
KLVVVGADGVGKSA
>chr12:g.25245350C>T|KRAS|25245274|25245384|-|0(-)
LVVVGADGVGKSAL
>chr12:g.25245350C>T|KRAS|25245274|25245384|-|0(-)
VVVGADGVGKSALT
>chr12:g.25245350C>T|KRAS|25245274|25245384|-|0(-)
VVGADGVGKSALTI
>chr12:g.25245350C>T|KRAS|25245274|25245384|-|0(-)
VGADGVGKSALTIQ
>chr12:g.25245350C>T|KRAS|25245274|25245384|-|0(-)
GADGVGKSALTIQL
>chr12:g.25245350C>T|KRAS|25245274|25245384|-|0(-)
ADGVGKSALTIQLI
>chr12:g.25245350C>T|KRAS|25245274|25245384|-|0(-)
DGVGKSALTIQLIQ
......
mkfrm.py Click to see the Python script
Phase 3. Test MHC-I Affinity
Then we use NetMHC to predict the immunogenicity of these oligopeptides in terms of their binding affinity to MHC-I molecules. We downloaded the stand-alone NetMHC software to run it locally; it has the same functionality as the NetMHC 4.0 Server. We run the peptides separately for each allele type and peptide length; one of the command lines looks like this:
netMHC -l 8 -a HLA-A0101 aa_len8.fasta -s > len8_HLA-A0101.txt
This line means that we use the 8 aa peptides and set the allele type to "HLA-A0101". We then run the peptides of all lengths against all the HLA supertype representatives among the allele types that the NetMHC Server provides. The HLA supertype representatives listed by the NetMHC Server are:
HLA-A0101 | HLA-A*01:01 (A1) | HLA supertype representative |
HLA-A0201 | HLA-A*02:01 (A2) | HLA supertype representative |
HLA-A0301 | HLA-A*03:01 (A3) | HLA supertype representative |
HLA-A2402 | HLA-A*24:02 (A24) | HLA supertype representative |
HLA-A2601 | HLA-A*26:01 (A26) | HLA supertype representative |
HLA-B0702 | HLA-B*07:02 (B7) | HLA supertype representative |
HLA-B0801 | HLA-B*08:01 (B8) | HLA supertype representative |
HLA-B2705 | HLA-B*27:05 (B27) | HLA supertype representative |
HLA-B3901 | HLA-B*39:01 (B39) | HLA supertype representative |
HLA-B4001 | HLA-B*40:01 (B44) | HLA supertype representative |
HLA-B5801 | HLA-B*58:01 (B58) | HLA supertype representative |
HLA-B1501 | HLA-B*15:01 (B62) | HLA supertype representative |
The choice of allele depends on the source of our sequence, the ethnicity of the samples, and the MHC-I molecule or molecules to which the potential antigen is most likely to bind. Then we submit and run. You can download our files below.
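Running every peptide length against every supertype representative means one command per allele and length. A minimal sketch that generates those commands in the form of the example line above; the function names are hypothetical, the input and output file names follow the pattern of that example, and run_all assumes that the netMHC executable is on the PATH.

```python
import subprocess

ALLELES = ["HLA-A0101", "HLA-A0201", "HLA-A0301", "HLA-A2402", "HLA-A2601",
           "HLA-B0702", "HLA-B0801", "HLA-B2705", "HLA-B3901", "HLA-B4001",
           "HLA-B5801", "HLA-B1501"]
LENGTHS = range(8, 15)  # peptide lengths 8 to 14 aa


def netmhc_jobs(alleles=ALLELES, lengths=LENGTHS):
    """Yield (argument list, output file name) for every allele/length pair."""
    for length in lengths:
        for allele in alleles:
            argv = ["netMHC", "-l", str(length), "-a", allele,
                    f"aa_len{length}.fasta", "-s"]
            yield argv, f"len{length}_{allele}.txt"


def run_all():
    """Run every job, writing each result to its own file (needs netMHC installed)."""
    for argv, out_name in netmhc_jobs():
        with open(out_name, "w") as handle:
            subprocess.run(argv, stdout=handle, check=True)


jobs = list(netmhc_jobs())
```

With 12 alleles and 7 lengths this gives 84 runs; run_all is defined but not called here, so the sketch can be inspected without the software installed.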
Output files:
length-8-wb.txt length-9-sb.txt length-9-wb.txt length-10-sb.txt
length-10-wb.txt length-11-sb.txt length-11-wb.txt length-12-sb.txt
length-12-wb.txt length-13-sb.txt length-13-wb.txt length-14-sb.txt
length-14-wb.txt
We choose four peptides: for each length of 10, 11, 12 and 13 aa, the peptide that binds most strongly to an MHC-I molecule.
Figures: NetMHC results for the strongest-binding 10 aa, 11 aa, 12 aa and 13 aa peptides.
Here we see that the lower the Affinity value is, the more easily the oligopeptide binds to the given MHC-I molecule(s). The software has also already classified the results for us. Generally speaking, peptides marked "SB" (strong binding) are the ones we want most, and those marked "WB" (weak binding) are second best. Because our peptide source is plentiful, we do not have to make do with second best and adopt only the "SB" peptides.
In this way, we have selected possible antigen peptides for colorectal cancer in the DRY LAB, to be tested in the other part of our project, the WET LAB.
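Keeping only the strong binders can be done with a small filter over the result lines. This is a minimal sketch under a stated assumption: that flagged rows in the output end with the marker "<= SB" (or "<= WB"). The sample rows are placeholders with made-up peptides and values, not real predictions, and the function name is hypothetical.

```python
def strong_binders(lines):
    """Return the whitespace-split rows that are flagged as strong binders."""
    hits = []
    for line in lines:
        fields = line.split()
        # Assumed layout: flagged rows end with the two tokens "<=" and "SB".
        if len(fields) >= 2 and fields[-2:] == ["<=", "SB"]:
            hits.append(fields)
    return hits


# Placeholder rows only; the peptides and numbers are not real results.
sample = [
    "0  HLA-A0101  AAAAAAAA  placeholder  1.00  <= SB",
    "1  HLA-A0101  CCCCCCCC  placeholder  2.00  <= WB",
    "2  HLA-A0101  DDDDDDDD  placeholder  3.00",
]
selected = strong_binders(sample)
```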
Phase 4. Improvement
# Further experiment proof
However, our programming workflow is not yet well suited to individualized therapy. Nowadays, it is easy to determine a person's MHC allele type. For individualized therapy, we can determine the alleles of a colorectal cancer patient and obtain sequencing data from the patient's cancer cells and blood. We can then align the two sets of sequencing data to find the mutation sites specific to that patient. Using the workflow we built, we can generate multiple peptides around the patient's mutation sites. Having determined the patient's allele type, we can choose the allele suited to that patient. Furthermore, we can use the model we built to predict which mutation site could produce the peptide that binds most strongly to the MHC-I molecule.
Acknowledgements:
National Cancer Institute (NCI)
The Cancer Genome Atlas (TCGA)
The Genomic Data Commons (GDC) data portal
Encyclopedia of DNA Elements (ENCODE)
NetMHC