This algorithm determines the Shine-Dalgarno (SD) and Anti-Shine-Dalgarno (ASD) sequences for development of orthogonal ribosomes.
Steps 8, 9 and 10 Details
Get all TIRs from genome file, calculate binding energies of remaining ASDs with each TIR, then ranks candidates from highest to lowest binding energies. Higher binding energies are preferred, as this is another orthogonality constraint.
Responsible functions: getallTIRs, findCDSfromformatted, formatCDSfile
Figure S5. Overview: Obtaining all translation initiation regions from the Coding Region.
Function description: formatCDSfile
The CDS file for a particular organism contains lots of extraneous information, so formatCDSfile narrows down the information to only those lines that contain the starting index of the CDS for forward CDSs, or starting and ending indices if the CDS is on the complementary strand. First, the function creates a file path (specific to the organism whose CDS is being analyzed) where the formatted CDS will be stored. Two files are then opened, 1) the CDS file itself, and 2) the file which will be stored in the aforementioned file path where the formatted CDS will be written into. The function iterates through every line in the CDS file. It combines the previous line and the current line into a variable called currentandprevious to prevent any CDS locations that are found over two separate lines from being missed.
The general format of a CDS file specifying the indices of the start codon of that particular CDS are as follows. For CDSs on the forward strand, the data looks like: >lcl|NC_000913.3_cds_NP_414542.1_1 [gene=thrL] [protein=thr operon leader peptide] [protein_id=NP_414542.1] [location=190..255], where all we care about is bolded (the indices of the start and end of the CDS). If the CDS is found on the complementary strand, the location tag looks like [location=complement(10643..11356)]. For one, this function differentiates between a line that contains useful information about the start/ end location and a line that just lists the DNA sequence by searching currentandprevious for the key “location=”, which is only found directly before the indices are stated. If this key is not found in the line, then the line doesn’t actual contain any useful information thus it is not included in the formatted CDS file. The remaining cases are divided into the cases where the CDS is found on the complementary strand, the forward strand, or if the CDS is a transpliced “joined” sequence.
In the case that the CDS is on the complementary strand, we write the entire line up to the second index in complement() and store it in the formatted CDS file. The technique we use to determine the end of the information we need to write into the file is to find the “)]” that signifies the end of the complement tag (italicized above). In some cases there was some previous part of the line that contained the key “)]” which did not signify the end of the complement tag. In these cases we ignored the first instance of the key and only used the second instance of the key to write the line. Join cases with transpliced CDSs are ignored, because they constitute a very small portion of the entire list of CDSs. If the CDS is on the forward strand, we only care about the starting index so we only write the information up to the starting index. We did this by localizing the “..” key that separated the starting and ending indices in the forward CDS. Again, if there was some other key that did not signify the end of the forward CDS location tag, we ignored the first instance and wrote the line up to the second instance.
Function description: findCDSfromformatted
The purpose of this function is to read the formatted CDS file and create 2 lists, one containing the indices applicable to forward CDS, and one containing those where the CDS is on the complement strand. The function goes line by line and locates the index of the “L” in the key “location=”. To ensure that we are only looking at lines that contain information about the CDS, we first make sure that the line contains the “location=” key (which every line already does, since that’s how the formatted CDS file is initially made). For complementary cases, we take the second index (for example, in location=complement(10642..11356) we take the number 11356, subtract 1 from that number (to account for Python indexing), and store that number in a list called cdslist2. For forward cases we take the first index (in location=190...255 we take 190) and store it in cdslist1.