Revision as of 10:38, 13 October 2018


PROJECT NANOPORE

Motivation

After interviewing medical professionals, we recognized that the high cost of RNA sequencing would be a major hurdle to providing affordable gene therapy based on RNA editing. In addition, current next-generation sequencing methods do not provide unbiased direct reads of the transcriptome and ignore important modifications on the nucleobases. To improve the affordability and accuracy of RNA sequencing in the future, we therefore proposed to develop a high-throughput, unbiased, and modification-sensitive RNA sequencing method based on nanopore technology.

Background

The human transcriptome contains not only the four canonical nucleobases, adenine (A), uracil (U), cytosine (C), and guanine (G), but also non-canonical ones such as inosine (I), pseudouridine (Ψ), 5-methylcytosine (m5C), and many others. These non-canonical nucleobases are naturally present in our cells, where they regulate gene expression. However, there are also modified bases that are not supposed to occur in healthy cells, and their presence could lead to certain diseases.

Figure 1. Common Modified Bases in RNA

To cure diseases caused by modified bases at the transcriptome level, it is necessary to know the sequence of all the RNA in the cell. Since these diseases are caused by the presence of modified nucleobases, the sequencing technology has to be able to identify the modified bases. Illumina sequencing can identify the positions of modified bases in the transcriptome, but it lacks information on how the modified bases correlate with one another within a single RNA strand. Mass spectrometry of RNA is another way to determine the positions of the modified bases. This technique does provide information on their correlation, but its low throughput makes it unable to sequence the whole human transcriptome.

Figure 2. Mechanism of Nanopore Sequencing

In this project, we explored nanopore technology for identifying non-canonical nucleobases in RNA. Nanopore sequencing is a high-throughput direct sequencing technique that could provide information on the correlation among the modified bases. During nanopore sequencing, the RNA passes through the nanopore starting from its 3' end and produces electrical signals as it does so. Each electrical signal is determined by a 5-mer of the RNA sequence. We then compare the electrical signals generated by RNA with modified bases against those generated by normal RNA.
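Because each signal depends on a 5-mer, a single modified base can influence the signal at several neighboring positions. The following is a minimal sketch of that idea, assuming a simple sliding-window model; the function name `kmers_covering` and the example sequence are hypothetical and not part of our analysis pipeline.

```python
def kmers_covering(seq, pos, k=5):
    """Return (start, kmer) for every k-mer of seq that contains index pos.

    Assumption: the signal at each step depends only on the k bases
    currently in the window, so a base at `pos` affects up to k windows.
    """
    first = max(0, pos - k + 1)          # earliest window that still includes pos
    last = min(pos, len(seq) - k)        # latest window that starts at or before pos
    return [(s, seq[s:s + k]) for s in range(first, last + 1)]


# Hypothetical RNA fragment with a single G at index 5.
seq = "ACUUAGCUCAU"
windows = kmers_covering(seq, 5)
for start, kmer in windows:
    print(start, kmer)
```

Under this assumption, one modified base in the middle of a strand appears in five consecutive 5-mers, which is one possible reason why a signal change may show up a few positions away from the modified base itself.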

Identification of Inosine in RNA

Synthetic RNA samples were produced from PCR-amplified DNA gBlocks with a predefined sequence. The forward primer was designed to contain a T7 promoter overhang, while the reverse primer contained a poly(A) tail to enable binding of the adapters for nanopore sequencing. In vitro transcription (IVT) was performed on the amplified DNA using inosine (I) as the modified nucleotide to replace all the canonical guanosine (G), while A, U, and C remained unchanged. Another synthetic RNA containing only canonical nucleobases was also produced as the negative control.

The DNA gBlock sequences were designed so that the guanosine motifs were separated by 9 to 11 nucleotides other than G, with a total length of around 1 kb. We tried different variations of G patterns in the gBlocks, namely xxGxx, xxGGxx, xxGGGxx, xxGxGxx, and xxGxxGxx, where x is any canonical nucleotide other than G. We wanted to find out whether our method could differentiate among these G patterns.
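The layout described above can be illustrated with a short sketch. This is not the script used to design our gBlocks, and the sequences it produces are not our actual templates; the function `build_template`, the random spacer composition, and the seed are assumptions made only for illustration.

```python
import random

NON_G = "ACT"  # DNA nucleotides other than G


def build_template(motif, n_repeats, seed=0):
    """Build a toy DNA template: spacers of 9-11 non-G bases around each motif.

    `motif` uses 'G' for guanosine and 'x' for any non-G base,
    e.g. "G", "GG", "GGG", "GxG", "GxxG".
    """
    rng = random.Random(seed)

    def spacer():
        return "".join(rng.choice(NON_G) for _ in range(rng.randint(9, 11)))

    parts = []
    for _ in range(n_repeats):
        parts.append(spacer())
        # Replace each 'x' placeholder in the motif with a random non-G base.
        parts.append("".join(rng.choice(NON_G) if c == "x" else c for c in motif))
    parts.append(spacer())  # trailing spacer after the last motif
    return "".join(parts)


template = build_template("GG", n_repeats=3)
print(template)
```

Changing `motif` to "GxG" or "GxxG" gives the other variations; a real design would also need to reach roughly 1 kb and account for primer binding sites, which this sketch ignores.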

We were not sure whether I would produce a signal much different from that of G, so we also produced RNA samples labeled with acrylonitrile. The acrylonitrile attached only to the Is and Gs in the RNA samples, which we hoped would produce signals more distinct from those of normal guanosines and make it easier to distinguish between I and G in the RNA strands.

The current signals produced by all the samples were compared with those of the negative control, which is the normal RNA sample containing G without acrylonitrile. The electrical signal data were analyzed by machine learning to produce the % modification at each position in the RNA samples. The % modification is the percentage of nanopore electrical signals at a given position, generated by the RNA sample containing modified bases, that differ from those of the normal RNA. A peak indicates that the difference in the signals is much more apparent at that position than at the surrounding positions.
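As a rough illustration of the % modification metric and of what we mean by a peak, consider the sketch below. It assumes that a classifier has already labeled each read at each position as either differing from the control or not; the classifier itself is not shown, and the function names and the toy data are hypothetical.

```python
def percent_modification(calls):
    """Percentage of reads per position whose signal differs from the control.

    `calls` is a list of reads; each read is a list of booleans, one per
    position (True = signal classified as different from normal RNA).
    """
    n_reads = len(calls)
    n_pos = len(calls[0])
    return [100.0 * sum(read[i] for read in calls) / n_reads for i in range(n_pos)]


def find_peaks(values, min_height=0.0):
    """Indices whose value is strictly higher than both immediate neighbors."""
    return [
        i
        for i in range(1, len(values) - 1)
        if values[i] > values[i - 1]
        and values[i] > values[i + 1]
        and values[i] >= min_height
    ]


# Toy data: 4 reads, 5 positions (not real measurements).
calls = [
    [False, False, True, False, False],
    [False, True, True, False, False],
    [False, False, True, True, False],
    [False, False, False, False, False],
]
pct = percent_modification(calls)
peaks = find_peaks(pct)
print(pct, peaks)
```

The strict neighbor comparison is the simplest possible peak definition; a real analysis might require a minimum height or prominence, which is what the `min_height` parameter hints at.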

Figure 3. Percentage of modification in xxGxx variation with inosine as the modified base

We define the peaks that indicate the presence of inosine as those within 4 positions of the actual inosine position. In the case of the xxGxx sequences, we found that 61% of the peaks indicating the presence of inosine were located 1 or 2 positions behind the actual position. We also found that, in most cases, the sample with inosine labeled with acrylonitrile (I with ACN) produced a higher % modification than the sample with inosine without acrylonitrile (I no ACN), which is what we expected.
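The peak assignment rule above could be sketched as follows. The function `peak_offsets`, the sign convention (offset = peak position minus modified-base position), and the example positions are assumptions for illustration only and are not our measured data.

```python
from collections import Counter


def peak_offsets(peaks, modified_positions, window=4):
    """Tally the offset of each peak from its nearest modified base.

    A peak counts as indicating a modified base only if it lies within
    `window` positions of it. Offset = peak position - modified position.
    """
    offsets = []
    for p in peaks:
        nearest = min(modified_positions, key=lambda m: abs(p - m))
        if abs(p - nearest) <= window:
            offsets.append(p - nearest)
    return Counter(offsets)


# Hypothetical peak and inosine positions.
peaks = [8, 21, 29, 45]
modified = [10, 20, 30, 60]
counts = peak_offsets(peaks, modified)
print(counts)
```

Dividing each count by the total number of assigned peaks gives percentages of the kind reported in this section; the peak at position 45 is ignored here because it is more than 4 positions from any modified base.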

Figure 4. Percentage of modification in xxGGxx variation with inosine as the modified base

For the xxGGxx variation, we found that 45% of the peaks corresponding to inosine were located on the first inosine and 2 positions behind the first inosine. We speculate that the peak 2 positions behind the first inosine and the peak on the first inosine correspond to the first and second inosine, respectively. However, there are also sites with only 1 peak, which means that only 1 modified base was detected there.

Figure 5. Percentage of modification in xxGxGxx variation with inosine as the modified base

In the case of the xxGxGxx variation, 60% of the xxGxGxx sequences have a peak between the two inosines, and about half of them have another peak behind the first inosine. In this case, the second inosine is identified more easily than the first.

Figure 6. Percentage of modification in xxGGGxx variation with inosine as the modified base

Figure 7. Percentage of modification in xxGxxGxx variation with inosine as the modified base

In the xxGGGxx patterns, over 70% have a peak at the middle inosine position, but there is no obvious pattern for the other peaks near the inosines. As for the xxGxxGxx variation, we did not find any regular pattern in the peaks shown in the graph.

Identification of Other Modified Bases

We also tried to identify other non-canonical bases through nanopore sequencing. We used another predefined DNA gBlock as a template for IVT with other modified bases such as pseudouridine (Ψ), 1-methyladenosine (m1A), 6-methyladenosine (m6A), and 5-methylcytosine (m5C).

Figure 8. Percentage of modification in xxTxx and xxTTxx variation with pseudouridine as the modified base

For the identification of Ψ, the DNA template sequence was designed to have a single thymidine (T) (xxTxx) or a double T (xxTTxx) every 10 to 11 base pairs. Acrylonitrile labeling was also performed on the samples to further modify the modified bases. We observed that at 47% of the modified base positions, for both the single and double T variations, the peaks were located 2 positions behind the T (the first T for the double T variation).

Figure 9. Percentage of modification in xxCxx variation with 5-methylcytosine as the modified base

Figure 10. Percentage of modification in xxAxx variation with 1-methyladenosine as the modified base

Figure 11. Percentage of modification in xxAxx variation with 6-methyladenosine as the modified base

For the identification of m5C, m1A, and m6A, we used only the xxMxx variation in the DNA template sequence, where M is the corresponding non-modified base. As shown in the graphs above, there are many peaks, and we could not identify which of them correspond to the positions of the modified bases. We believe that nanopore sequencing itself introduces errors and that these errors significantly affect the results we obtained. A labeling strategy is still being explored to find a molecule that could bind specifically to those modified bases and better distinguish them from the normal ones.

Conclusion

Based on all the results we obtained, we found that our current analysis for identifying modified bases could detect signal changes at positions near the modified bases, indicating that the modifications have a structural effect on the nanopore signals. However, our current model is oversimplified, as shown by most of the peaks in the graphs not being located at the exact positions of the modified bases. Further adjustment of the analysis is required to take into account other factors affecting the nanopore electrical signals.


Difference between revisions of "Team:NTU-Singapore/Nanopore"