Difference between revisions of "Team:Tongji China/Model"

Line 18: Line 18:
 
</div>
 
</div>
 
<div class="logoPicture">
 
<div class="logoPicture">
<img src="https://static.igem.org/mediawiki/2018/c/c1/T--Tongji_China--picture-drylab-model-0.png" width="15%" height="25%">
+
<img src="https://static.igem.org/mediawiki/2018/8/80/T--Tongji_China--picture-drylab-programming-0.png" width="15%" height="25%">
 
</div>
 
</div>
 
<div class="title_2">
 
<div class="title_2">
Model
+
Programming
 
</div>
 
</div>
                <div class="content">
+
<div class="content">
                <p><I>Acknowledgement: CPU China. This part was made by Team CPU China; we thank them for their collaboration!</I></p>
+
We get data from
                We use Bayesian statistics to predict which type of mutation is most likely to produce peptides that bind MHC strongly, based on the summed affinity for each mutation site and each allele type.<br>
+
<a href="https://portal.gdc.cancer.gov/exploration?filters=%7B%22op%22%3A%22and%22%2C%22content%22%3A%5B%7B%22op%22%3A%22in%22%2C%22content%22%3A%7B%22field%22%3A%22cases.primary_site%22%2C%22value%22%3A%5B%22Colon%22%2C%22Colorectal%22%2C%22Rectum%22%5D%7D%7D%5D%7D&searchTableTab=mutations&ssmsTable_size=100">The Cancer Genome Atlas (TCGA)’s Genomic Data Commons (GDC) data portal</a>, which provides information on colorectal cancer mutations.<br> The Genomic Data Commons (GDC) provides the cancer research community with a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine.<br>
Bayesian statistics is a theory in the field of statistics based on the Bayesian interpretation of probability where probability expresses a degree of belief in an event. The degree of belief may be based on prior knowledge about the event, such as the results of previous experiments, or on personal beliefs about the event. This differs from a number of other interpretations of probability, such as the frequentist interpretation that views probability as the limit of the relative frequency of an event after a large number of trials.<br>Bayes' theorem is a fundamental theorem in Bayesian statistics, as it is used by Bayesian methods to update probabilities, which are degrees of belief, after obtaining new data. Given two events A and B, the conditional probability of A given that B is true is expressed as follows:<br>
+
<p style="text-align:center"><img src="https://static.igem.org/mediawiki/2018/9/97/T--Tongji_China--picture-drylab-programming-1.png" width="90%" height="90%"></p>
                <p style="text-align:center"><img src="https://static.igem.org/mediawiki/2018/8/82/T--Tongji_China--picture-drylab-model-1.png" width="25%" height="25%"></p>
+
Although Bayes' theorem is a fundamental result of probability theory, it has a specific interpretation in Bayesian statistics. In the above equation, <I>A</I> usually represents a proposition (such as the statement that a coin lands on heads fifty percent of the time) and <I>B</I> represents the evidence, or new data that is to be considered (such as the result of a series of coin flips). <I>P(A)</I> is the prior probability of <I>A</I>, which expresses one's beliefs about <I>A</I> before evidence is considered. The prior probability may also quantify prior knowledge or information about <I>A</I>. <I>P(B|A)</I> is the likelihood function, which can be interpreted as the probability of the evidence <I>B</I> given that <I>A</I> is true. The likelihood quantifies the extent to which the evidence <I>B</I> supports the proposition <I>A</I>. <I>P(A|B)</I> is the posterior probability, the probability of the proposition <I>A</I> after taking the evidence <I>B</I> into account. Essentially, Bayes' theorem updates one's prior beliefs <I>P(A)</I> after considering the new evidence <I>B</I>.<br>
+
The probability of the evidence <I>P(B)</I> can be calculated using the law of total probability. If <I>{A1, A2, …, An}</I> is a partition of the sample space, which is the set of all outcomes of an experiment, then,<br>
+
                <p style="text-align:center"><img src="https://static.igem.org/mediawiki/2018/b/b8/T--Tongji_China--picture-drylab-model-2.png" width="90%" height="90%"></p>
+
The formulation of statistical models using Bayesian statistics has the identifying feature of requiring the specification of prior distributions for any unknown parameters. Indeed, parameters of prior distributions may themselves have prior distributions, leading to Bayesian hierarchical modeling, or may be interrelated, leading to Bayesian networks.<br><br>
+
We use Bayesian statistics to predict which type of mutation is most likely to produce peptides that bind MHC strongly, based on the summed affinity for each mutation site and each allele type. Bayesian statistics is a theory in the field of statistics based on the Bayesian interpretation of probability, where probability expresses a degree of belief in an event. If a colorectal cancer patient shows an immune response to our medicine, this model can predict which mutation plays the strongest role in the cancer. If a patient's mutation sites are already known, the model can also help predict which site is the best source of peptides for that patient, which supports individualized therapy. <br>The heat map below shows the summed affinity for each allele type and each mutation.<br>
+
                <p style="text-align:center"><img src="https://static.igem.org/mediawiki/2018/e/e4/T--Tongji_China--picture-drylab-model-3.png" width="75%" height="75%"></p>
+
 
<div class="instructionOfPicture">
 
<div class="instructionOfPicture">
Figure.Model.1 Heatmap of the affinity of each allele type and each mutation.
+
Fig.Program.1 Web page of the TCGA database filtered for colorectal cancer
 +
</div>
 +
<br>In the picture above, you can see a data table at the bottom right corner. The data table contains the DNA changes, mutation types, consequences, affected cases in the cohort, affected cases across the GDC, impact of the mutations and a survival plot for each mutation. Part of the table is shown below:
 +
<br>
 +
<p style="text-align:center"><img src="https://static.igem.org/mediawiki/2018/0/0c/T--Tongji_China--picture-drylab-programming-6.jpg" width="90%" height="90%"></p>
 +
<div class="instructionOfPicture">
 +
Table.Programming.1 Part of the data table we get from the TCGA database, containing information on colorectal cancer mutations.
  
 
</div>
 
</div>
<br><br>
+
<br>The first column of the table tells us where and how the mutations (single nucleotide variations) occur. For example, “chr12:g.25245350C>T” means that at position 25245350 of chromosome 12, “C” is changed into “T”. The second column gives the type of each mutation: substitution, deletion or insertion. The third column gives the consequence of each mutation. In the first row, the DNA change “chr12:g.25245350C>T” results in a missense mutation of the gene “KRAS”, known as “KRAS G12D”. The fourth column gives the frequency of each mutation among the colorectal cancer patients. For example, “60 / 537, 11.17%” means that 60 of the 537 patients studied carry that mutation (chr12:g.25245350C>T).
<a href="https://static.igem.org/mediawiki/2018/b/b1/T--Tongji_China--output-drylab-model-1.csv">affinity_conseqeunce.csv</a><font size=2>  Click to download the consequence file</font>
+
<br>We get the reference genome from the Encyclopedia of DNA Elements (ENCODE) dataset,
</div>
+
<a href="https://www.encodeproject.org/files/GRCh38_no_alt_analysis_set_GCA_000001405.15/">GRCh38_no_alt_analysis_set_GCA_000001405.15</a>, in FASTA format. We also get the genome reference file (
 +
<a href="https://www.encodeproject.org/references/ENCSR425FOI/">set ENCSR425FOI</a>) from the Encyclopedia of DNA Elements (ENCODE) dataset, which contains gene features in “Gene Transfer Format” (GTF). Each line of this GTF file contains the seqname, source, feature, start, end, score, strand, frame, and group fields. An example line from the GTF file is shown below (laid out as a table for readability):
 +
<br>
 +
<p style="text-align:center"><img src="https://static.igem.org/mediawiki/2018/6/6e/T--Tongji_China--picture-drylab-programming-2.jpg" width="90%" height="90%"></p>
 +
<div class="instructionOfPicture">
 +
Table.programming.2 An example line of the genome reference file in GTF format.
 +
</div>
 +
<br>We use the DNA change in the first column of the data table and the gene id in the third column to find the corresponding lines in the GTF file: lines that refer to the same gene, have the CDS feature, and whose start and end sites enclose the mutation site. For example, for the DNA change “chr12:g.25245350C>T” of gene KRAS, we find the following line in the GTF file (laid out as a table for readability):
 +
<br>
 +
<p style="text-align:center"><img src="https://static.igem.org/mediawiki/2018/7/7e/T--Tongji_China--picture-drylab-programming-3.jpg" width="90%" height="90%"></p>
 +
<div class="instructionOfPicture">
 +
Table.programming.3 An example line of the genome reference file in GTF format.
 +
 
 +
</div>
 +
<br>This line belongs to the gene KRAS and has the CDS feature; its start site 25245274 lies before the mutation site 25245350, and its end site 25245384 lies after it. We then create a GFF file like the one below. (Like GTF, the GFF format contains seqname, source, feature, start, end, score, strand, frame, and group columns.)
 +
<br>
 +
<table border="0" cellspacing="5px" cellpadding="5px">
 +
<tr>
 +
<td>chr12</td>
 +
<td>.</td>
 +
<td>chr12:g.25245350C>T|KRAS|25245274|25245384|-|0</td>
 +
<td>25245274</td>
 +
<td>25245384</td>
 +
<td>.</td>
 +
<td>-</td>
 +
<td>0</td>
 +
<td>.</td>
 +
</tr>
 +
</table>
 +
<div class="instructionOfPicture">
 +
Table.programming.4 An example line of GFF format file we create.
 +
 
 +
</div>
 +
<br>
 +
<a href="https://static.igem.org/mediawiki/2018/f/f7/T--Tongji_China--python-1-tsv2gff.txt">tsv2gff.py</a><font size=2>  Click to see the Python script</font>
 +
<br><br> The first, fourth, fifth, seventh and eighth columns of the GFF file give the chromosome, start, end, strand and frame, respectively. We store the information that will be needed later in the third (feature) column.
 +
<br>Then we use the
 +
<a href="https://bedtools.readthedocs.io/en/latest/content/tools/getfasta.html?highlight=getfasta">bedtools</a> software to extract the DNA sequences from the reference genome (GRCh38).
 +
<br>
 +
<p style="text-align:center"><img src="https://static.igem.org/mediawiki/2018/c/c6/T--Tongji_China--picture-drylab-programming-4.png" /></p>
 +
<div class="instructionOfPicture">
 +
<a href="https://bedtools.readthedocs.io/en/latest/content/tools/getfasta.html?highlight=getfasta">Figure source: bedtools website</a>
 +
</div>
 +
<br>By running the command below, we get a FASTA file containing the normal DNA sequences of the CDS regions in which the colorectal cancer mutations from the TCGA GDC data portal occur:<br>
 +
<div class="codeText"><div style="white-space: nowrap;">bedtools getfasta -fo ***.fasta \<br></div>
 +
<div style="white-space: nowrap;">&emsp;-fi&nbsp;GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta&nbsp;\</div>
 +
<div style="white-space: nowrap;">&emsp;-bed ***.gff \<br></div>
 +
<div style="white-space: nowrap;">&emsp;-s \<br></div>
 +
<div style="white-space: nowrap;">&emsp;-name<br></div><br></div>
 +
The normal FASTA file looks like this:<br><br>
 +
<div class="dataText">
 +
>chr12:g.25245350C>T|KRAS|25245274|25245384|-|0(-)<br>
 +
ATGACTGAATATAAACTTGTGGTAGTTGGAGCTGGTGGCGTAGGCAAGAGTGCCT<br>
 +
TGACGATACAGCTAATTCAGAATCATTTTGTGGACGAATATGATCCAACAATAGAG
 +
<br>
 +
</div>
 +
<br>
 +
<a href="https://static.igem.org/mediawiki/2018/2/2d/T--Tongji_China--python-2-nor2snv.txt">nor2snv.py</a><font size=2>  Click to see the Python script</font>
 +
<br>
 +
<br> Then we translate the “mutated fasta file” (the normal sequences with each mutation introduced) into amino acid sequences and make sliding windows (8 to 14 aa long) around the mutated sites.
 +
<br><br>
 +
<p style="text-align:center"><img src="https://static.igem.org/mediawiki/2018/b/bb/T--Tongji_China--picture-drylab-programming-5.jpg" width="90%" height="90%"></p>
 +
<div class="instructionOfPicture">
 +
Fig.programming.3 Making sliding windows of 8 aa to 14 aa around the mutation sites.
 +
 
 +
</div>
 +
<br><br> Then we get a series of amino acid fasta files:
 +
<br><br> Windows of the chr12:g.25245350C>T mutation of 8aa:<br><br>
 +
<div class="dataText">
 +
>chr12:g.25245350C>T|KRAS|25245274|25245384|-|0(-)<br> KLVVVGAD
 +
<br> >chr12:g.25245350C>T|KRAS|25245274|25245384|-|0(-)
 +
<br> LVVVGADG
 +
<br> >chr12:g.25245350C>T|KRAS|25245274|25245384|-|0(-)
 +
<br> VVVGADGV
 +
<br> >chr12:g.25245350C>T|KRAS|25245274|25245384|-|0(-)
 +
<br> VVGADGVG
 +
<br> >chr12:g.25245350C>T|KRAS|25245274|25245384|-|0(-)
 +
<br> VGADGVGK
 +
<br> >chr12:g.25245350C>T|KRAS|25245274|25245384|-|0(-)
 +
<br> GADGVGKS
 +
<br> >chr12:g.25245350C>T|KRAS|25245274|25245384|-|0(-)
 +
<br> ADGVGKSA
 +
<br> >chr12:g.25245350C>T|KRAS|25245274|25245384|-|0(-)
 +
<br> DGVGKSAL
 +
<br>......<br>
 +
</div>
 +
<br> Windows of the chr12:g.25245350C>T mutation of 9aa:<br><br>
 +
<div class="dataText">
 +
>chr12:g.25245350C>T|KRAS|25245274|25245384|-|0(-)<br> YKLVVVGAD
 +
<br> >chr12:g.25245350C>T|KRAS|25245274|25245384|-|0(-)
 +
<br> KLVVVGADG
 +
<br> >chr12:g.25245350C>T|KRAS|25245274|25245384|-|0(-)
 +
<br> LVVVGADGV
 +
<br> >chr12:g.25245350C>T|KRAS|25245274|25245384|-|0(-)
 +
<br> VVVGADGVG
 +
<br> >chr12:g.25245350C>T|KRAS|25245274|25245384|-|0(-)
 +
<br> VVGADGVGK
 +
<br> >chr12:g.25245350C>T|KRAS|25245274|25245384|-|0(-)
 +
<br> VGADGVGKS
 +
<br> >chr12:g.25245350C>T|KRAS|25245274|25245384|-|0(-)
 +
<br> GADGVGKSA
 +
<br> >chr12:g.25245350C>T|KRAS|25245274|25245384|-|0(-)
 +
<br> ADGVGKSAL
 +
<br> >chr12:g.25245350C>T|KRAS|25245274|25245384|-|0(-)
 +
<br> DGVGKSALT
 +
<br> ……
 +
<br>
 +
 
 +
</div>
 +
<br> Windows of the chr12:g.25245350C>T mutation of 14aa:<br><br>
 +
<div class="dataText">
 +
>chr12:g.25245350C>T|KRAS|25245274|25245384|-|0(-)<br> MTEYKLVVVGADGV
 +
<br> >chr12:g.25245350C>T|KRAS|25245274|25245384|-|0(-)
 +
<br> TEYKLVVVGADGVG
 +
<br> >chr12:g.25245350C>T|KRAS|25245274|25245384|-|0(-)
 +
<br> EYKLVVVGADGVGK
 +
<br> >chr12:g.25245350C>T|KRAS|25245274|25245384|-|0(-)
 +
<br> YKLVVVGADGVGKS
 +
<br> >chr12:g.25245350C>T|KRAS|25245274|25245384|-|0(-)
 +
<br> KLVVVGADGVGKSA
 +
<br> >chr12:g.25245350C>T|KRAS|25245274|25245384|-|0(-)
 +
<br> LVVVGADGVGKSAL
 +
<br> >chr12:g.25245350C>T|KRAS|25245274|25245384|-|0(-)
 +
<br> VVVGADGVGKSALT
 +
<br> >chr12:g.25245350C>T|KRAS|25245274|25245384|-|0(-)
 +
<br> VVGADGVGKSALTI
 +
<br> >chr12:g.25245350C>T|KRAS|25245274|25245384|-|0(-)
 +
<br> VGADGVGKSALTIQ
 +
<br> >chr12:g.25245350C>T|KRAS|25245274|25245384|-|0(-)
 +
<br> GADGVGKSALTIQL
 +
<br> >chr12:g.25245350C>T|KRAS|25245274|25245384|-|0(-)
 +
<br> ADGVGKSALTIQLI
 +
<br> >chr12:g.25245350C>T|KRAS|25245274|25245384|-|0(-)
 +
<br> DGVGKSALTIQLIQ
 +
<br>......<br>
 +
 
 +
</div>
 +
<br>
 +
<a href="https://static.igem.org/mediawiki/2018/9/91/T--Tongji_China--python-3-mkfrm.txt">mkfrm.py</a><font size=2>  Click to see the Python script</font><br><br> Then we use
 +
<a href="http://www.cbs.dtu.dk/services/NetMHC/">NetMHC</a> to predict how well these oligopeptides bind MHC-I molecules, which we use as an indicator of immunogenicity. We download the NetMHC software and run it locally; it offers the same functionality as the NetMHC 4.0 Server. We run the peptides separately for each allele type and peptide length; one of the command lines looks like this:
 +
<br><br>
 +
<div class="codeText">
 +
<div style="white-space: nowrap;" ><p style="text-align: center;">netMHC -l 8 -a HLA-A0101 aa_len8.fasta -s > len8_HLA-A0101.txt</p></div>
 +
</div>
 +
<br> This command uses the 8 aa peptides and sets the allele type to “HLA-A0101”. We then run the peptides of all lengths against every allele that the NetMHC Server marks as an HLA supertype representative. These representatives are listed below:
 +
<br>
 +
<div style="text-align: center;">
 +
<table border="0" cellspacing="5px">
 +
<tr>
 +
<td>HLA-A0101</td>
 +
<td>HLA-A*01:01 (A1)</td>
 +
<td>HLA supertype representative</td>
 +
</tr>
 +
<tr>
 +
<td>HLA-A0201</td>
 +
<td>HLA-A*02:01 (A2)</td>
 +
<td>HLA supertype representative</td>
 +
</tr>
 +
<tr>
 +
<td>HLA-A0301</td>
 +
<td>HLA-A*03:01 (A3)</td>
 +
<td>HLA supertype representative</td>
 +
</tr>
 +
<tr>
 +
<td>HLA-A2402</td>
 +
<td>HLA-A*24:02 (A24)</td>
 +
<td>HLA supertype representative</td>
 +
</tr>
 +
<tr>
 +
<td>HLA-A2601</td>
 +
<td>HLA-A*26:01 (A26)</td>
 +
<td>HLA supertype representative</td>
 +
</tr>
 +
<tr>
 +
<td>HLA-B0702</td>
 +
<td>HLA-B*07:02 (B7)</td>
 +
<td>HLA supertype representative</td>
 +
</tr>
 +
<tr>
 +
<td>HLA-B0801</td>
 +
<td>HLA-B*08:01 (B8)</td>
 +
<td>HLA supertype representative</td>
 +
</tr>
 +
<tr>
 +
<td>HLA-B2705</td>
 +
<td>HLA-B*27:05 (B27)</td>
 +
<td>HLA supertype representative</td>
 +
</tr>
 +
<tr>
 +
<td>HLA-B3901</td>
 +
<td>HLA-B*39:01 (B39)</td>
 +
<td>HLA supertype representative</td>
 +
</tr>
 +
<tr>
 +
<td>HLA-B4001</td>
 +
<td>HLA-B*40:01 (B44)</td>
 +
<td>HLA supertype representative</td>
 +
</tr>
 +
<tr>
 +
<td>HLA-B5801</td>
 +
<td>HLA-B*58:01 (B58)</td>
 +
<td>HLA supertype representative</td>
 +
</tr>
 +
<tr>
 +
<td>HLA-B1501</td>
 +
<td>HLA-B*15:01 (B62)</td>
 +
<td>HLA supertype representative</td>
 +
</tr>
 +
</table>
 +
 
 +
</div>
 +
<br>
 +
The choice of <b>allele</b> depends on the source of our sequences, the ethnicity of the sampled population, and the MHC-I molecule or molecules to which the potential antigen is most likely to bind. Then we run the predictions. You can download our files below.
 +
<br><br>
 +
<a href="">Output files</a>
 +
<br><br> In the output, the lower the <b>Affinity</b> value, the more easily the oligopeptide binds to the given MHC-I molecule(s). The software also labels the results for us. Peptides marked “SB (strong binding)” are the ones we want most, and those marked “WB (weak binding)” are second best. Because we have plenty of candidate peptides, we do not need to settle for second best and adopt only the “SB (strong binding)” ones.
 +
<br>In this way, the DRY LAB has selected candidate antigen peptides for colorectal cancer, to be tested in the other part of our project, the
 +
<a href="/Team:Tongji_China/WetLab">WET LAB</a>.<br><br><br><br>
 +
<div style="font-family: courier; font-size:30px; margin-bottom:20px;">Acknowledgements:</div>
 +
<a href="https://www.nih.gov/about-nih/what-we-do/nih-almanac/national-cancer-institute-nci">National Cancer Institute (NCI)</a><br>
 +
<a href="https://cancergenome.nih.gov">The Cancer Genome Atlas (TCGA)</a><br>
 +
<a href="https://gdc.cancer.gov/access-data/gdc-data-portal">The Genomic Data Commons (GDC) data portal</a><br>
 +
<a href="https://www.encodeproject.org">Encyclopedia of DNA Elements (ENCODE)</a><br>
 +
<a href="http://www.cbs.dtu.dk/services/NetMHC/">NetMHC</a><br>
 +
 
 +
</div>
 +
</div></div>
 +
</body>
 +
 
 +
</html>
 +
 
 +
{{Tongji_China/Footer_N}}

Revision as of 13:19, 9 October 2018

Programme
Dry Lab
Programming
We get data from The Cancer Genome Atlas (TCGA)’s Genomic Data Commons (GDC) data portal, which provides information on colorectal cancer mutations.
The Genomic Data Commons (GDC) provides the cancer research community with a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine.

Fig.Program.1 Web page of the TCGA database filtered for colorectal cancer

In the picture above, you can see a data table at the bottom right corner. The data table contains the DNA changes, mutation types, consequences, affected cases in the cohort, affected cases across the GDC, impact of the mutations and a survival plot for each mutation. Part of the table is shown below:

Table.Programming.1 Part of the data table we get from the TCGA database, containing information on colorectal cancer mutations.

The first column of the table tells us where and how the mutations (single nucleotide variations) occur. For example, “chr12:g.25245350C>T” means that at position 25245350 of chromosome 12, “C” is changed into “T”. The second column gives the type of each mutation: substitution, deletion or insertion. The third column gives the consequence of each mutation. In the first row, the DNA change “chr12:g.25245350C>T” results in a missense mutation of the gene “KRAS”, known as “KRAS G12D”. The fourth column gives the frequency of each mutation among the colorectal cancer patients. For example, “60 / 537, 11.17%” means that 60 of the 537 patients studied carry that mutation (chr12:g.25245350C>T).
We get the reference genome from the Encyclopedia of DNA Elements (ENCODE) dataset, GRCh38_no_alt_analysis_set_GCA_000001405.15, in FASTA format. We also get the genome reference file (set ENCSR425FOI) from the ENCODE dataset, which contains gene features in “Gene Transfer Format” (GTF). Each line of this GTF file contains the seqname, source, feature, start, end, score, strand, frame, and group fields. An example line from the GTF file is shown below (laid out as a table for readability):
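The two cell formats described above can be read programmatically. The following is a minimal sketch, not part of our pipeline scripts: the function names `parse_dna_change` and `parse_frequency` are made up for illustration, and only single-nucleotide substitutions are handled.

```python
import re

# Pattern for the "DNA change" notation used in the table, e.g. "chr12:g.25245350C>T".
# Assumption: only single-nucleotide substitutions are handled in this sketch.
SNV_PATTERN = re.compile(r"^(chr[0-9A-Za-z]+):g\.(\d+)([ACGT])>([ACGT])$")


def parse_dna_change(text):
    """Split a substitution such as 'chr12:g.25245350C>T' into its parts."""
    match = SNV_PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"not a single-nucleotide substitution: {text!r}")
    chrom, pos, ref, alt = match.groups()
    return {"chrom": chrom, "pos": int(pos), "ref": ref, "alt": alt}


def parse_frequency(text):
    """Read 'affected / total' from a cell such as '60 / 537, 11.17%'."""
    counts = text.split(",")[0]
    affected, total = (int(part) for part in counts.split("/"))
    return affected, total, 100.0 * affected / total


snv = parse_dna_change("chr12:g.25245350C>T")
affected, total, percent = parse_frequency("60 / 537, 11.17%")
print(snv)
print(f"{affected} of {total} cases = {percent:.2f}%")
```

Recomputing the percentage from the two counts is a cheap consistency check on the downloaded table.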

Table.programming.2 An example line of the genome reference file in GTF format.
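As a rough illustration of the nine GTF columns, a line can be split as sketched below. The sample line is made up for this example (only the KRAS CDS coordinates come from this page, and the `gene_id` value is a placeholder); the attribute layout `key "value";` is the usual GTF convention, which we assume here.

```python
GTF_COLUMNS = ["seqname", "source", "feature", "start", "end",
               "score", "strand", "frame", "attributes"]


def parse_gtf_line(line):
    """Turn one tab-separated GTF line into a dictionary keyed by column name."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(GTF_COLUMNS):
        raise ValueError(f"expected 9 columns, found {len(fields)}")
    record = dict(zip(GTF_COLUMNS, fields))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    # The last column holds 'key "value";' pairs (the "group" of the gene).
    attributes = {}
    for item in record["attributes"].split(";"):
        item = item.strip()
        if item:
            key, _, value = item.partition(" ")
            attributes[key] = value.strip('"')
    record["attributes"] = attributes
    return record


# Hypothetical line: the gene_id is a placeholder, not a real identifier.
example = 'chr12\t.\tCDS\t25245274\t25245384\t.\t-\t0\tgene_id "PLACEHOLDER_ID"; gene_name "KRAS";'
record = parse_gtf_line(example)
print(record["feature"], record["start"], record["end"], record["attributes"]["gene_name"])
```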

We use the DNA change in the first column of the data table and the gene id in the third column to find the corresponding lines in the GTF file: lines that refer to the same gene, have the CDS feature, and whose start and end sites enclose the mutation site. For example, for the DNA change “chr12:g.25245350C>T” of gene KRAS, we find the following line in the GTF file (laid out as a table for readability):

Table.programming.3 An example line of the genome reference file in GTF format.

This line belongs to the gene KRAS and has the CDS feature; its start site 25245274 lies before the mutation site 25245350, and its end site 25245384 lies after it. We then create a GFF file like the one below. (Like GTF, the GFF format contains seqname, source, feature, start, end, score, strand, frame, and group columns.)
chr12 . chr12:g.25245350C>T|KRAS|25245274|25245384|-|0 25245274 25245384 . - 0 .
Table.programming.4 An example line of GFF format file we create.
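The selection rule and the GFF line above can be sketched as follows. This is an illustration written for this page and not the linked tsv2gff.py; the record dictionaries and the names `enclosing_cds` and `gff_line` are assumptions, and the second record uses made-up coordinates to show a CDS that is rejected.

```python
def enclosing_cds(records, gene, chrom, pos):
    """Return the CDS records of `gene` on `chrom` whose [start, end] span contains `pos`."""
    return [
        rec for rec in records
        if rec["feature"] == "CDS"
        and rec["gene"] == gene
        and rec["seqname"] == chrom
        and rec["start"] <= pos <= rec["end"]
    ]


def gff_line(dna_change, rec):
    """Build one tab-separated GFF line; the feature column carries the information needed later."""
    label = "|".join([dna_change, rec["gene"], str(rec["start"]),
                      str(rec["end"]), rec["strand"], rec["frame"]])
    columns = [rec["seqname"], ".", label, str(rec["start"]),
               str(rec["end"]), ".", rec["strand"], rec["frame"], "."]
    return "\t".join(columns)


records = [
    {"seqname": "chr12", "feature": "CDS", "gene": "KRAS",
     "start": 25245274, "end": 25245384, "strand": "-", "frame": "0"},
    # Made-up record that does not enclose the mutation site.
    {"seqname": "chr12", "feature": "CDS", "gene": "KRAS",
     "start": 100, "end": 200, "strand": "-", "frame": "0"},
]
hits = enclosing_cds(records, "KRAS", "chr12", 25245350)
line = gff_line("chr12:g.25245350C>T", hits[0])
print(line)
```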

tsv2gff.py Click to see the Python script

The first, fourth, fifth, seventh and eighth columns of the GFF file give the chromosome, start, end, strand and frame, respectively. We store the information that will be needed later in the third (feature) column.
Then we use the bedtools software to extract the DNA sequences from the reference genome (GRCh38).


By running the command below, we get a FASTA file containing the normal DNA sequences of the CDS regions in which the colorectal cancer mutations from the TCGA GDC data portal occur:
bedtools getfasta -fo ***.fasta \
 -fi GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta \
 -bed ***.gff \
 -s \
 -name

The normal FASTA file looks like this:

>chr12:g.25245350C>T|KRAS|25245274|25245384|-|0(-)
ATGACTGAATATAAACTTGTGGTAGTTGGAGCTGGTGGCGTAGGCAAGAGTGCCT
TGACGATACAGCTAATTCAGAATCATTTTGTGGACGAATATGATCCAACAATAGAG

nor2snv.py Click to see the Python script
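To make the step from the normal sequence to the mutated sequence concrete, here is a minimal sketch (not the linked nor2snv.py; the function name `apply_snv` is made up). It assumes GFF coordinates are 1-based and inclusive, and that `bedtools getfasta -s` returns the reverse complement for a feature on the minus strand, so the genomic change C>T appears as G>A in the extracted sequence.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def apply_snv(seq, start, end, strand, pos, ref, alt):
    """Introduce a genomic substitution into a strand-aware extracted sequence."""
    if not start <= pos <= end:
        raise ValueError("mutation site lies outside the extracted region")
    if strand == "-":
        # On the minus strand the sequence starts at `end` and bases are complemented.
        index = end - pos
        ref, alt = ref.translate(COMPLEMENT), alt.translate(COMPLEMENT)
    else:
        index = pos - start
    if seq[index] != ref:
        raise ValueError(f"expected {ref} at index {index}, found {seq[index]}")
    return seq[:index] + alt + seq[index + 1:]


# Normal KRAS sequence from the FASTA record shown above.
normal = ("ATGACTGAATATAAACTTGTGGTAGTTGGAGCTGGTGGCGTAGGCAAGAGTGCCT"
          "TGACGATACAGCTAATTCAGAATCATTTTGTGGACGAATATGATCCAACAATAGAG")
mutated = apply_snv(normal, 25245274, 25245384, "-", 25245350, "C", "T")
print(normal[33:36], "->", mutated[33:36])  # codon 12 of this sequence
```

For this example the changed base falls in the twelfth codon, which turns GGT (Gly) into GAT (Asp), in line with the name “KRAS G12D”.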

Then we translate the “mutated fasta file” (the normal sequences with each mutation introduced) into amino acid sequences and make sliding windows (8 to 14 aa long) around the mutated sites.

Fig.programming.3 Making sliding windows of 8 aa to 14 aa around the mutation sites.


Then we get a series of amino acid fasta files:

Windows of the chr12:g.25245350C>T mutation of 8aa:

>chr12:g.25245350C>T|KRAS|25245274|25245384|-|0(-)
KLVVVGAD
>chr12:g.25245350C>T|KRAS|25245274|25245384|-|0(-)
LVVVGADG
>chr12:g.25245350C>T|KRAS|25245274|25245384|-|0(-)
VVVGADGV
>chr12:g.25245350C>T|KRAS|25245274|25245384|-|0(-)
VVGADGVG
>chr12:g.25245350C>T|KRAS|25245274|25245384|-|0(-)
VGADGVGK
>chr12:g.25245350C>T|KRAS|25245274|25245384|-|0(-)
GADGVGKS
>chr12:g.25245350C>T|KRAS|25245274|25245384|-|0(-)
ADGVGKSA
>chr12:g.25245350C>T|KRAS|25245274|25245384|-|0(-)
DGVGKSAL
......

Windows of the chr12:g.25245350C>T mutation of 9aa:

>chr12:g.25245350C>T|KRAS|25245274|25245384|-|0(-)
YKLVVVGAD
>chr12:g.25245350C>T|KRAS|25245274|25245384|-|0(-)
KLVVVGADG
>chr12:g.25245350C>T|KRAS|25245274|25245384|-|0(-)
LVVVGADGV
>chr12:g.25245350C>T|KRAS|25245274|25245384|-|0(-)
VVVGADGVG
>chr12:g.25245350C>T|KRAS|25245274|25245384|-|0(-)
VVGADGVGK
>chr12:g.25245350C>T|KRAS|25245274|25245384|-|0(-)
VGADGVGKS
>chr12:g.25245350C>T|KRAS|25245274|25245384|-|0(-)
GADGVGKSA
>chr12:g.25245350C>T|KRAS|25245274|25245384|-|0(-)
ADGVGKSAL
>chr12:g.25245350C>T|KRAS|25245274|25245384|-|0(-)
DGVGKSALT
……

Windows of the chr12:g.25245350C>T mutation of 14aa:

>chr12:g.25245350C>T|KRAS|25245274|25245384|-|0(-)
MTEYKLVVVGADGV
>chr12:g.25245350C>T|KRAS|25245274|25245384|-|0(-)
TEYKLVVVGADGVG
>chr12:g.25245350C>T|KRAS|25245274|25245384|-|0(-)
EYKLVVVGADGVGK
>chr12:g.25245350C>T|KRAS|25245274|25245384|-|0(-)
YKLVVVGADGVGKS
>chr12:g.25245350C>T|KRAS|25245274|25245384|-|0(-)
KLVVVGADGVGKSA
>chr12:g.25245350C>T|KRAS|25245274|25245384|-|0(-)
LVVVGADGVGKSAL
>chr12:g.25245350C>T|KRAS|25245274|25245384|-|0(-)
VVVGADGVGKSALT
>chr12:g.25245350C>T|KRAS|25245274|25245384|-|0(-)
VVGADGVGKSALTI
>chr12:g.25245350C>T|KRAS|25245274|25245384|-|0(-)
VGADGVGKSALTIQ
>chr12:g.25245350C>T|KRAS|25245274|25245384|-|0(-)
GADGVGKSALTIQL
>chr12:g.25245350C>T|KRAS|25245274|25245384|-|0(-)
ADGVGKSALTIQLI
>chr12:g.25245350C>T|KRAS|25245274|25245384|-|0(-)
DGVGKSALTIQLIQ
......
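The KRAS windows listed above can be reproduced with a short sketch like the one below. It is not the linked mkfrm.py: the names `translate` and `windows` are made up, the standard genetic code is assumed, and the mutated residue is taken to be at 0-based protein index 11 (the twelfth codon of this CDS segment).

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order for each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(dna):
    """Translate a coding sequence in frame 0, ignoring any trailing partial codon."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def windows(protein, site, length):
    """All peptides of `length` residues that contain the residue at index `site`."""
    first = max(0, site - length + 1)
    last = min(site, len(protein) - length)
    return [protein[s:s + length] for s in range(first, last + 1)]


# Mutated KRAS sequence (the normal sequence with G>A at index 34).
mutated = ("ATGACTGAATATAAACTTGTGGTAGTTGGAGCTGATGGCGTAGGCAAGAGTGCCT"
           "TGACGATACAGCTAATTCAGAATCATTTTGTGGACGAATATGATCCAACAATAGAG")
protein = translate(mutated)
for length in (8, 9, 14):
    peptides = windows(protein, 11, length)
    print(length, len(peptides), peptides[0], peptides[-1])
```

Near the start of the protein there are fewer windows than the window length (12 windows of 14 aa here), because a window cannot begin before the first residue.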

mkfrm.py Click to see the Python script.

Then we use NetMHC to predict how well these oligopeptides bind MHC-I molecules, which we use as an indicator of immunogenicity. We download the NetMHC software and run it locally; it offers the same functionality as the NetMHC 4.0 Server. We run the peptides separately for each allele type and peptide length; one of the command lines looks like this:

netMHC -l 8 -a HLA-A0101 aa_len8.fasta -s > len8_HLA-A0101.txt


This command uses the 8 aa peptides and sets the allele type to “HLA-A0101”. We then run the peptides of all lengths against every allele that the NetMHC Server marks as an HLA supertype representative. These representatives are listed below:
HLA-A0101 HLA-A*01:01 (A1) HLA supertype representative
HLA-A0201 HLA-A*02:01 (A2) HLA supertype representative
HLA-A0301 HLA-A*03:01 (A3) HLA supertype representative
HLA-A2402 HLA-A*24:02 (A24) HLA supertype representative
HLA-A2601 HLA-A*26:01 (A26) HLA supertype representative
HLA-B0702 HLA-B*07:02 (B7) HLA supertype representative
HLA-B0801 HLA-B*08:01 (B8) HLA supertype representative
HLA-B2705 HLA-B*27:05 (B27) HLA supertype representative
HLA-B3901 HLA-B*39:01 (B39) HLA supertype representative
HLA-B4001 HLA-B*40:01 (B44) HLA supertype representative
HLA-B5801 HLA-B*58:01 (B58) HLA supertype representative
HLA-B1501 HLA-B*15:01 (B62) HLA supertype representative
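Running every allele against every peptide length gives 12 × 7 = 84 runs. The sketch below only builds the command strings, following the pattern of the single command shown above; the file names for lengths other than 8 are extrapolated from that one example and are an assumption.

```python
# The twelve HLA supertype representatives from the table above.
ALLELES = [
    "HLA-A0101", "HLA-A0201", "HLA-A0301", "HLA-A2402", "HLA-A2601",
    "HLA-B0702", "HLA-B0801", "HLA-B2705", "HLA-B3901", "HLA-B4001",
    "HLA-B5801", "HLA-B1501",
]
LENGTHS = range(8, 15)  # peptide lengths 8 to 14 aa


def netmhc_command(allele, length):
    """Command line for one run, modelled on the example command above."""
    return (f"netMHC -l {length} -a {allele} aa_len{length}.fasta -s "
            f"> len{length}_{allele}.txt")


commands = [netmhc_command(allele, length)
            for length in LENGTHS for allele in ALLELES]
print(len(commands))
print(commands[0])
```

The resulting strings can be written to a shell script, which keeps the runs reproducible.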

The choice of allele depends on the source of our sequences, the ethnicity of the sampled population, and the MHC-I molecule or molecules to which the potential antigen is most likely to bind. Then we run the predictions. You can download our files below.

Output files

In the output, the lower the Affinity value, the more easily the oligopeptide binds to the given MHC-I molecule(s). The software also labels the results for us. Peptides marked “SB (strong binding)” are the ones we want most, and those marked “WB (weak binding)” are second best. Because we have plenty of candidate peptides, we do not need to settle for second best and adopt only the “SB (strong binding)” ones.
In this way, the DRY LAB has selected candidate antigen peptides for colorectal cancer, to be tested in the other part of our project, the WET LAB.
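Keeping only the strong binders can be sketched as a simple line filter. This is a sketch under an assumption: that each result row ends with its binding label, so strong binders end in the token “SB”. The rows below are placeholders written for this example, not real NetMHC output.

```python
def strong_binders(lines):
    """Keep the rows whose last whitespace-separated token is the label 'SB'."""
    kept = []
    for line in lines:
        tokens = line.split()
        if tokens and tokens[-1] == "SB":
            kept.append(line.rstrip("\n"))
    return kept


# Placeholder rows: the middle columns of a real result row are omitted here.
rows = [
    "PEPTIDE_1 (other columns omitted) <= SB",
    "PEPTIDE_2 (other columns omitted) <= WB",
    "PEPTIDE_3 (other columns omitted)",
]
selected = strong_binders(rows)
print(selected)
```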



Acknowledgements:
National Cancer Institute (NCI)
The Cancer Genome Atlas (TCGA)
The Genomic Data Commons (GDC) data portal
Encyclopedia of DNA Elements (ENCODE)
NetMHC