Team:NTU-Singapore/Nanopore

Project Nanopore 

 Motivation

After interviewing medical professionals, we recognized that the high cost of RNA sequencing would be a major hurdle to providing affordable gene therapy based on RNA editing. In addition, current next-generation sequencing methods do not provide unbiased direct reads of the transcriptome and ignore important modifications on the nucleobases. To improve the affordability and accuracy of RNA sequencing in the future, we proposed to develop a high-throughput, unbiased, and modification-sensitive RNA sequencing method based on nanopore technology.

 Background

The human transcriptome contains not only the four canonical nucleobases – adenine (A), uracil (U), cytosine (C), and guanine (G), but also non-canonical ones such as inosine (I), pseudouridine (Ψ), 5-methylcytosine (m5C), and many others. These non-canonical nucleobases are naturally present in our cells to regulate gene expression. However, there are also modified bases that are not supposed to occur in healthy cells, which in turn could lead to certain diseases.

Figure 1. Common Modified Bases in RNA

To cure diseases caused by modified bases at the transcriptome level, it is necessary to know the sequence of all the RNA in the cell. Because these diseases are caused by the presence of modified nucleobases, the sequencing technology must also be able to identify the modified bases. Illumina sequencing can identify the positions of modified bases in the transcriptome, but it does not capture how the modified bases correlate with one another within the same RNA strand. Mass spectrometry of RNA is another way to determine the positions of the modified bases. This technique provides information on their correlation, but its low throughput makes it unable to sequence the entire human transcriptome.

Figure 2. Mechanism of Nanopore Sequencing

In this project, we explored nanopore technology for identifying non-canonical nucleobases in RNA. Nanopore sequencing is a high-throughput direct sequencing technique that can provide information on the correlation among the modified bases. In nanopore sequencing, the RNA passes through the nanopore starting from its 3′ end and produces electrical signals as it does so. The electrical signal at each step is determined by a 5-mer (five consecutive nucleotides) of the RNA sequence. We then compare the electrical signals generated by RNA with modified bases against those generated by normal (unmodified) RNA.
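Because each signal depends on a 5-mer, a single modified base can influence every 5-mer window that contains it. The following is a minimal sketch of that idea, not the team's analysis code; the function name `kmers_covering` and the example sequence are hypothetical and chosen only for illustration.

```python
def kmers_covering(sequence: str, position: int, k: int = 5):
    """Return (start, kmer) for every k-mer window that contains `position`.

    Illustrative helper (not from the original project): a base at
    `position` sits inside up to k overlapping windows.
    """
    first = max(0, position - k + 1)          # earliest window that still covers the base
    last = min(position, len(sequence) - k)   # latest window that fits in the sequence
    return [(start, sequence[start:start + k]) for start in range(first, last + 1)]


# Hypothetical RNA fragment with a single G at index 5
rna = "ACUUAGCCAUU"
windows = kmers_covering(rna, 5)
for start, kmer in windows:
    print(start, kmer)
```

In this example the single G appears in five consecutive windows, which is one reason a signal change may show up a few positions away from the modified base itself.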

 Identification of Inosine in RNA

Synthetic RNA samples were produced from PCR-amplified DNA gBlocks with a predefined sequence. The forward primer was designed to contain a T7 promoter overhang, while the reverse primer contained a poly(A) tail to enable binding of the adapters for nanopore sequencing. In vitro transcription (IVT) was performed on the amplified DNA using inosine (I) as the modified nucleotide to replace all the canonical guanosine (G), while keeping A, U, and C unchanged. Another synthetic RNA sample, containing only canonical nucleobases and with exactly the same sequence as the DNA template, was produced as the negative control.

The DNA gBlock sequences were designed so that the guanosines were separated by 10 to 11 non-G nucleotides, with a total length of around 1 kb. We tested several G patterns in the gBlocks, namely xxGxx, xxGGxx, xxGGGxx, and xxGxGxx, where x is any canonical nucleotide other than G. We wanted to find out whether our method could differentiate these patterns.
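As an illustration of this layout, the sketch below assembles a toy DNA template in which G patterns are separated by spacers of 10 or 11 non-G nucleotides. It is an assumption-laden example rather than the actual gBlock design: the function `build_template`, the random choice of spacer bases, and the alternation of spacer lengths are all hypothetical.

```python
import random

NON_G = "ACT"  # canonical DNA nucleotides other than G


def build_template(motifs, spacer_lengths=(10, 11), seed=0):
    """Assemble a toy template: spacer, motif, spacer, motif, ...

    `motifs` uses 'G' for guanosine and 'x' for any non-G nucleotide.
    Returns the sequence and the 0-based indices of every G.
    Hypothetical helper, not the project's real design tool.
    """
    rng = random.Random(seed)
    parts, g_positions, pos = [], [], 0
    for i, motif in enumerate(motifs):
        # Alternate between the allowed spacer lengths (assumed scheme)
        spacer_len = spacer_lengths[i % len(spacer_lengths)]
        parts.append("".join(rng.choice(NON_G) for _ in range(spacer_len)))
        pos += spacer_len
        # Fill 'x' placeholders with a random non-G nucleotide
        filled = "".join(rng.choice(NON_G) if c == "x" else c for c in motif)
        g_positions.extend(pos + j for j, c in enumerate(filled) if c == "G")
        parts.append(filled)
        pos += len(filled)
    return "".join(parts), g_positions


# One motif per pattern family described in the text
seq, g_pos = build_template(["G", "GG", "GGG", "GxG"])
print(seq)
print(g_pos)
```

Recording the G indices at design time gives the "actual" modified positions against which peaks in the signal analysis can later be compared.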

We were not sure whether I would produce a signal much different from that of G, so we also produced RNA samples labeled with acrylonitrile. The acrylonitrile attached only to the Is and Gs in the RNA samples, which we hoped would produce signals more distinct from those of normal guanosines and make it easier to distinguish between I and G in the RNA strands.

Figure 3. Inosine labelling by acrylonitrile

The current signals produced by all the samples were compared to the negative control, which is the normal RNA sample containing G without acrylonitrile. The electrical signal data were analyzed by machine learning to produce the % modification at each position in the RNA samples. The % modification is the percentage of nanopore electrical signals at a given position, generated by the modified-base-containing RNA sample, that differ from those of the normal RNA. A peak indicates that the difference in the signals is much more apparent at that position than at the surrounding positions.
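To make the definition of % modification concrete, here is a simplified stand-in for the machine-learning step, which the text does not describe in detail. It assumes each read has already been reduced to one mean current value per position, and it calls a sample read "different" when it deviates from the control mean by more than a chosen number of control standard deviations. The function name, the threshold `n_sd`, and the toy current values are all assumptions.

```python
from statistics import mean, stdev


def percent_modification(control_reads, sample_reads, n_sd=2.0):
    """Per-position percentage of sample reads that differ from the control.

    Each read is a list with one current value per position (assumed
    pre-aligned). A threshold test stands in for the machine-learning
    classifier used in the actual analysis.
    """
    n_positions = len(control_reads[0])
    result = []
    for pos in range(n_positions):
        control = [read[pos] for read in control_reads]
        mu, sd = mean(control), stdev(control)
        deviating = sum(1 for read in sample_reads if abs(read[pos] - mu) > n_sd * sd)
        result.append(100.0 * deviating / len(sample_reads))
    return result


# Toy current values (arbitrary units) for three positions
control_reads = [[80, 100, 90], [82, 101, 91], [78, 99, 89], [80, 100, 90]]
sample_reads = [[80, 100, 90], [81, 115, 90], [79, 116, 91], [80, 114, 89]]
profile = percent_modification(control_reads, sample_reads)
print(profile)  # [0.0, 75.0, 0.0]
```

Here only the middle position stands out, which is what a peak in the % modification plots represents.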

Figure 4. Percentage of modification in xxGxx variation with inosine as the modified base

We define the peaks that indicate the presence of inosine as those within 4 positions of the actual inosine position. In the case of xxGxx sequences, we found that 61% of the peaks indicating the presence of inosine are located 1 or 2 positions behind the actual position. We also found that the sample with inosine labeled with acrylonitrile (I with ACN) produced a higher % modification than the sample with inosine without acrylonitrile (I no ACN) at most positions, which is what we expected.
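The peak assignment rule above can be sketched as follows. This is an illustrative reimplementation under stated assumptions, not the code used for the figures: a peak is taken to be a local maximum above a hypothetical `min_height`, and the sign convention for the offset (negative meaning a lower index than the true position) is chosen here for illustration, since the text does not define which direction "behind" corresponds to.

```python
from collections import Counter


def find_peaks(values, min_height=0.0):
    """Indices that are strictly higher than both neighbours and >= min_height."""
    return [
        i for i in range(1, len(values) - 1)
        if values[i] > values[i - 1] and values[i] > values[i + 1] and values[i] >= min_height
    ]


def peak_offsets(peaks, true_positions, window=4):
    """Count offsets (peak index minus nearest true position) within `window`."""
    offsets = Counter()
    if not true_positions:
        return offsets
    for peak in peaks:
        nearest = min(true_positions, key=lambda p: abs(p - peak))
        if abs(peak - nearest) <= window:
            offsets[peak - nearest] += 1
    return offsets


# Toy % modification profile with known modified positions at indices 5 and 12
profile_example = [2, 3, 2, 10, 30, 12, 5, 4, 3, 2, 2, 25, 8, 3, 2, 2, 2, 2, 2, 2, 18, 2]
peaks = find_peaks(profile_example, min_height=10)
offsets = peak_offsets(peaks, [5, 12])
print(peaks)    # [4, 11, 20]
print(offsets)  # Counter({-1: 2})
```

The peak at index 20 is discarded because it lies more than 4 positions from any known modified base, while the other two are each assigned an offset of one position from the true site.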

Figure 5. Percentage of modification in xxGGxx variation with inosine as the modified base

For the xxGGxx variation, we found that 45% of the peaks corresponding to the inosines were located on the first inosine and 2 positions behind the first inosine. We speculate that the peak 2 positions behind the first inosine and the peak on the first inosine correspond to the first and second inosine, respectively. However, there are also positions with only 1 peak, which means that only 1 modified base was detected there.

Figure 6. Percentage of modification in xxGGGxx variation with inosine as the modified base

In the xxGGGxx patterns, over 70% have a peak at the middle inosine position. However, there is no obvious pattern for the other peaks near the inosines.

Figure 7. Percentage of modification in xxGxGxx variation with inosine as the modified base

In the xxGxGxx variation, 60% of the sequences have a peak between the two inosines, and about half of them have another peak behind the first inosine. In this case, the second inosine is identified more easily than the first.

Identification of Pseudouridine in RNA

We also tried to identify another non-canonical nucleobase, pseudouridine (Ψ), through nanopore sequencing. The methods were exactly the same as those used to detect inosine, except that we used a different set of predefined DNA gBlocks with xxTxx and xxTTxx variations.

Figure 8. Percentage of modification in xxTxx and xxTTxx variation with pseudouridine as the modified base

In our Ψ-containing RNA samples, covering both the single-T and double-T variations, 47% of the modified-base positions showed a peak 2 positions behind the Ψ (or behind the first Ψ in the double-T variation).

Conclusion

In conclusion, we found that our current analysis could detect signal changes at positions near the modified bases, indicating that the modified bases do have a structural effect on the nanopore electrical signals. However, our current model is still oversimplified, as shown by the fact that most of the peaks in the graphs are not located at the exact positions of the modified bases. Further adjustment of the analysis is required to take into account other factors affecting the nanopore electrical signals.