Difference between revisions of "Team:NTU-Singapore/Nanopore"

 
(20 intermediate revisions by the same user not shown)
Line 12: Line 12:
 
<div class="row">
 
<div class="row">
 
<div class="col-sm-12">
 
<div class="col-sm-12">
<h1 class="tc-white  mg-md text-center">
+
<h1 class="tc-white  mg-md text-center" style="line-height:1.3em;">
<strong>PROJECT NANOPORE&nbsp;</strong><br>
+
<strong>Project Nanopore&nbsp;</strong><br>
 
</h1>
 
</h1>
 
</div>
 
</div>
Line 26: Line 26:
 
<div class="row">
 
<div class="row">
 
<div class="col-sm-11">
 
<div class="col-sm-11">
<h2 class="mg-md  tc-black">
+
<h2 class="mg-md  tc-black" style="margin-top: 0px;">
<span class="fa fa-chevron-right"></span>&nbsp;Motivation
+
<span class="fa fa-chevron-right" style="margin-top: 0px; margin-bottom: 25px;"></span>&nbsp;Motivation
 
</h2>
 
</h2>
<p class=" text-left">
+
<p class="text-left">
 
After our interview with the medical professionals, we recognized that the high cost of RNA sequencing would be a big hurdle in providing affordable gene therapy based on RNA editing. Also, the current next-generation sequencing methods do not provide unbiased direct reads of the transcriptome and ignore important modifications on the nucleobases. As such, to improve the affordability and the accuracy of RNA sequencing in the future, we proposed to develop a high-throughput, unbiased and modification-sensitive RNA sequencing method based on nanopore technologies.<br>
 
After our interview with the medical professionals, we recognized that the high cost of RNA sequencing would be a big hurdle in providing affordable gene therapy based on RNA editing. Also, the current next-generation sequencing methods do not provide unbiased direct reads of the transcriptome and ignore important modifications on the nucleobases. As such, to improve the affordability and the accuracy of RNA sequencing in the future, we proposed to develop a high-throughput, unbiased and modification-sensitive RNA sequencing method based on nanopore technologies.<br>
 
</p>
 
</p>
Line 38: Line 38:
 
<span class="fa fa-chevron-right"></span>&nbsp;Background
 
<span class="fa fa-chevron-right"></span>&nbsp;Background
 
</h2>
 
</h2>
<p>
+
<p style="padding-top:1em;">
 
The human transcriptome contains not only the four canonical nucleobases – adenine (A), uracil (U), cytosine (C), and guanine (G), but also non-canonical ones such as inosine (I), pseudouridine (Ψ), 5-methylcytosine (m<a class="small-letter ltc-black" href="index.html">5</a>C), and many others. These non-canonical nucleobases are naturally present in our cells to regulate gene expression. However, there are also modified bases that are not supposed to occur in healthy cells, which in turn could lead to certain diseases.<br>
 
The human transcriptome contains not only the four canonical nucleobases – adenine (A), uracil (U), cytosine (C), and guanine (G), but also non-canonical ones such as inosine (I), pseudouridine (Ψ), 5-methylcytosine (m<a class="small-letter ltc-black" href="index.html">5</a>C), and many others. These non-canonical nucleobases are naturally present in our cells to regulate gene expression. However, there are also modified bases that are not supposed to occur in healthy cells, which in turn could lead to certain diseases.<br>
 
</p>
 
</p>
<h4 class="mg-md text-center">
+
<img src="img/lazyload-ph.png" data-src="https://static.igem.org/mediawiki/2018/e/eb/T--NTU-Singapore--Modified_Bases.jpg" class="img-responsive center-block lazyload" width="70%" height="70%" style="padding: 1em 0em;"/>
 +
<h4 class="mg-md text-center" style="padding-bottom: 1em;">
 
Figure 1. Common Modified Bases in RNA
 
Figure 1. Common Modified Bases in RNA
</h4><img src="img/lazyload-ph.png" data-src="https://static.igem.org/mediawiki/2018/e/eb/T--NTU-Singapore--Modified_Bases.jpg" class="img-responsive center-block lazyload" width="70%" height="70%" style="padding: 2em 0em;"/>
+
</h4>
 
<p>
 
<p>
In order to cure diseases caused by modified bases in the transcriptome level, it is necessary to know the sequence of all the RNA in the cell. Since the diseases are caused by the presence of modified nucleobases, the sequencing technology has to be able to identify the modified bases. Illumina sequencing could identify the modified bases position in the transcriptome, but there it lacks the correlation information among the modified bases within the RNA strand. Mass spectrometry of RNA is another way to determine the positions of the modified bases. This technique provides information regarding their correlation, but its low throughput makes it unable to sequence the whole human transcriptome.<br>
+
In order to cure diseases caused by modified bases in the transcriptome level, it is necessary to know the sequence of all the RNA in the cell. Since the diseases are caused by the presence of modified nucleobases, the sequencing technology has to be able to identify the modified bases. Illumina sequencing could identify the modified bases position in the transcriptome, but it lacks the correlation information among the modified bases within the RNA strand. Mass spectrometry of RNA is another way to determine the positions of the modified bases. This technique provides information regarding their correlation, but its low throughput makes it unable to sequence the entire human transcriptome.<br>
 
</p>
 
</p>
<h4 class="mg-md text-center">
+
<img src="img/lazyload-ph.png" data-src="https://static.igem.org/mediawiki/2018/2/2e/T--NTU-Singapore--Nanopore_Tech.jpg" class="img-responsive center-block lazyload" width="50%" height="50%" style="padding: 1em 0em;"/>
 +
<h4 class="mg-md text-center" style="padding-bottom: 1em;">
 
Figure 2. Mechanism of Nanopore Sequencing
 
Figure 2. Mechanism of Nanopore Sequencing
</h4><img src="img/lazyload-ph.png" data-src="https://static.igem.org/mediawiki/2018/2/2e/T--NTU-Singapore--Nanopore_Tech.jpg" class="img-responsive center-block lazyload" width="50%" height="50%" style="padding: 1em 0em;"/>
+
</h4>
<p>
+
<p style="padding-top:1em;">
 
In this project, we explored the nanopore technology for identification of non-canonical nucleobases in the RNA. Nanopore sequencing is a high-throughput direct sequencing technique which could provide information on the correlation among the modified bases. In nanopore sequencing, the RNA will go through the nanopore from 3&rsquo; and it will produce electrical signals. The electrical signals are determined by each 5-mer of the RNA sequence. We will then compare the difference between the electrical signals generated by RNA with modified bases and those generated by normal RNA.<br>
 
In this project, we explored the nanopore technology for identification of non-canonical nucleobases in the RNA. Nanopore sequencing is a high-throughput direct sequencing technique which could provide information on the correlation among the modified bases. In nanopore sequencing, the RNA will go through the nanopore from 3&rsquo; and it will produce electrical signals. The electrical signals are determined by each 5-mer of the RNA sequence. We will then compare the difference between the electrical signals generated by RNA with modified bases and those generated by normal RNA.<br>
 
</p>
 
</p>
Line 59: Line 61:
 
<span class="fa fa-chevron-right"></span>&nbsp;Identification of Inosine in RNA
 
<span class="fa fa-chevron-right"></span>&nbsp;Identification of Inosine in RNA
 
</h2>
 
</h2>
<p class=" text-justify">
+
<p class="text-left" style="padding-top:1em;">
Synthetic RNA samples were produced from PCR-amplified DNA gBlocks with a predefined sequence. The forward primer was designed to contain overhang T7 promoter, while the reverse primer contained polyA tail to enable binding of adapters for nanopore sequencing. In vitro transcription (IVT) was done on the amplified DNA using inosine (I) as the modified nucleotide to replace all the canonical guanosine (G) while keeping A, U, and C remained unchanged. Another synthetic RNA containing only canonical nucleobases was also produced as the negative control.&nbsp;<br>
+
Synthetic RNA samples were produced from PCR-amplified DNA gBlocks with a predefined sequence. The forward primer was designed to contain overhang T7 promoter, while the reverse primer contained polyA tail to enable binding of adapters for nanopore sequencing. In vitro transcription (IVT) was done on the amplified DNA using inosine (I) as the modified nucleotide to replace all the canonical guanosine (G) while keeping A, U, and C unchanged. Another synthetic RNA sample containing only canonical nucleobases with the exact same sequence as the DNA template was also produced as the negative control.&nbsp;<br>
 
</p>
 
</p>
<p class=" text-justify">
+
<p class="text-left">
The DNA gBlock sequences were designed so that the guanosines were positioned every 9 to 11 nucleotides other than G, with a total length of around 1 kb. We tried different variations of G sequences in the gBlocks, such as xxGxx, xxGGxx, xxGGGxx, xxGxGxx, and xxGxxGxx where x is any canonical nucleotide other than G. We want to find out if our method could differentiate different G sequences.<br>
+
The DNA gBlock sequences were designed so that the guanosines were positioned every 10 to 11 nucleotides other than G, with a total length of around 1 kb. We tried different variations of G sequences in the gBlocks, such as xxGxx, xxGGxx, xxGGGxx, and xxGxGxx where x is any canonical nucleotide other than G. We wanted to find out if our method could differentiate different G sequences.<br>
 
</p>
 
</p>
<p class=" text-justify">
+
<p class="text-left">
We weren&rsquo;t sure if I would result in much different signal from G. So, we also produced RNA samples which were labeled with acrylonitrile. The acrylonitrile attached only to the Is and Gs in the RNA samples which we hope would produce more distinct signals compared to normal guanosines so that it would be easier to distinguish between I and G in the RNA strands.<br>
+
We weren&rsquo;t sure if I would result in a much different signal from G. So, we also produced RNA samples which were labeled with acrylonitrile. The acrylonitrile attached only to the Is and Gs in the RNA samples which we hope would produce more distinct signals compared to normal guanosines so that it would be easier to distinguish between I and G in the RNA strands.<br>
 
</p>
 
</p>
<p class=" text-justify">
+
<img src="img/lazyload-ph.png" data-src="https://static.igem.org/mediawiki/2018/4/48/T--NTU-Singapore--ACN_label.jpg" class="img-responsive center-block lazyload" width="70%" height="70%" style="padding: 1em 0em;"/>
The current signals produced by all the samples were compared to the negative control, which is the normal RNA sample containing G without acrylonitrile. The electrical signal data were analyzed by machine learning to produce data of % modification in each position in the RNA samples. The % modification is the percent of nanopore electrical signals in that position generated by modified-base-containing RNA sample that are different from the normal RNA. A peak indicates that the difference in the produced signals is much more apparent in that position than the surrounding positions.<br>
+
<h4 class="mg-md text-center" style="padding-bottom: 1em;">
 +
Figure 3. Inosine labelling by acrylonitrile
 +
</h4>
 +
<p class="text-left">
 +
The current signals produced by all the samples were compared to the negative control, which is the normal RNA sample containing G without acrylonitrile. The electrical signal data were analyzed by machine learning to produce data of % modification in each position in the RNA samples. The % modification is the percent of nanopore electrical signals in that position generated by modified-base-containing RNA sample that is different from the normal RNA. A peak indicates that the difference in the produced signals is much more apparent in that position than the surrounding positions.<br>
 
</p>
 
</p>
 +
<img src="img/lazyload-ph.png" data-src="https://static.igem.org/mediawiki/2018/6/67/T--NTU-Singapore--xxGxx_I.jpg" class="img-responsive center-block lazyload" style="padding-left:2.5em;"/>
 
<h4 class="mg-md text-center" style="padding-top:1em;">
 
<h4 class="mg-md text-center" style="padding-top:1em;">
Figure 3. Percentage of modification in xxGxx variation with inosine as the modified base<br>
+
Figure 4. Percentage of modification in xxGxx variation with inosine as the modified base<br>
</h4><img src="img/lazyload-ph.png" data-src="https://static.igem.org/mediawiki/2018/6/67/T--NTU-Singapore--xxGxx_I.jpg" class="img-responsive center-block lazyload" style="padding-left:2em;"/>
+
</h4>
<p class=" text-justify">
+
<p class="text-left" style="padding-top: 1em;">
We define the peaks that indicate the presence of inosine are those within 4 positions away from the actual inosine position. In the case of xxGxx sequences, we found that 61% of the peaks indicating the presence of inosine are located 1 or 2 positions behind the actual position. We also found that the signals from the sample with inosine labeled with acrylonitrile (I with ACN) produced a higher % modification compared to the sample with inosine without acrylonitrile (I no ACN) for most cases, which is what we expected.<br>
+
We define the peaks that indicate the presence of inosine is those within 4 positions away from the actual inosine position. In the case of xxGxx sequences, we found that 61% of the peaks indicating the presence of inosine are located 1 or 2 positions behind the actual position. We also found that the signals from the sample with inosine labeled with acrylonitrile (I with ACN) produced a higher % modification compared to the sample with inosine without acrylonitrile (I no ACN) in most of the positions, which is what we expected.<br>
 
</p>
 
</p>
 +
<img src="img/lazyload-ph.png" data-src="https://static.igem.org/mediawiki/2018/5/5d/T--NTU-Singapore--xxGGxx.jpg" class="img-responsive center-block lazyload" style="padding-left:2.5em;"/>
 
<h4 class="mg-md text-center" style="padding-top:1em;">
 
<h4 class="mg-md text-center" style="padding-top:1em;">
Figure 4. Percentage of modification in xxGGxx variation with inosine as the modified base
+
Figure 5. Percentage of modification in xxGGxx variation with inosine as the modified base
</h4><img src="img/lazyload-ph.png" data-src="https://static.igem.org/mediawiki/2018/5/5d/T--NTU-Singapore--xxGGxx.jpg" class="img-responsive center-block lazyload" style="padding-left:2em;"/>
+
</h4>
<p class=" text-justify">
+
<p class="text-left">
 
For xxGGxx variation, we found that 45% of the peaks corresponding to the inosine were located on the first inosine and 2 positions behind the first inosine. We speculate that the peak at 2 positions behind the first inosine and the one on the first inosine corresponds to the first inosine and second inosine respectively. However, there are also positions where there is only 1 peak, which means only 1 modified base was detected in that position.<br>
 
For xxGGxx variation, we found that 45% of the peaks corresponding to the inosine were located on the first inosine and 2 positions behind the first inosine. We speculate that the peak at 2 positions behind the first inosine and the one on the first inosine corresponds to the first inosine and second inosine respectively. However, there are also positions where there is only 1 peak, which means only 1 modified base was detected in that position.<br>
 
</p>
 
</p>
 +
<img src="img/lazyload-ph.png" data-src="https://static.igem.org/mediawiki/2018/5/50/T--NTU-Singapore--xxGGGxx_I.jpg" class="img-responsive center-block lazyload"style="padding-left:2.5em;"/>
 
<h4 class="mg-md text-center" style="padding-top:1em;">
 
<h4 class="mg-md text-center" style="padding-top:1em;">
Figure 5. Percentage of modification in xxGxGxx variation with inosine as the modified base
+
Figure 6. Percentage of modification in xxGGGxx variation with inosine as the modified base
</h4><img src="img/lazyload-ph.png" data-src="https://static.igem.org/mediawiki/2018/1/11/T--NTU-Singapore--xxGxGxx_I.jpg" class="img-responsive center-block lazyload"/>
+
</h4>
<p class=" text-justify">
+
<p class="text-left">
In the case of xxGxGxx variation, it is shown that 60% of xxGxGxx sequences have a peak in the between of the inosines and about half of them have another peak behind the first inosine. In this case, the second inosine is identified more easily than the first inosine.&nbsp;<br>
+
In xxGGGxx patterns, we could see that over 70% of them have a peak in the middle inosine position. However, there is no obvious pattern for the other peaks near the inosines.<br>
 
</p>
 
</p>
 +
<img src="img/lazyload-ph.png" data-src="https://static.igem.org/mediawiki/2018/1/11/T--NTU-Singapore--xxGxGxx_I.jpg" class="img-responsive center-block lazyload"  style="padding-left:2.5em;"/>
 
<h4 class="mg-md text-center" style="padding-top:1em;">
 
<h4 class="mg-md text-center" style="padding-top:1em;">
Figure 6. Percentage of modification in xxGGGxx variation with inosine as the modified base
+
Figure 7. Percentage of modification in xxGxGxx variation with inosine as the modified base
</h4><img src="img/lazyload-ph.png" data-src="https://static.igem.org/mediawiki/2018/5/50/T--NTU-Singapore--xxGGGxx_I.jpg" class="img-responsive center-block lazyload"/>
+
</h4>
<h4 class="mg-md text-center" style="padding-top:1em;">
+
<p class="text-left">
Figure 7. Percentage of modification in xxGxxGxx variation with inosine as the modified base
+
In the case of xxGxGxx variation, it is shown that 60% of xxGxGxx sequences have a peak in the between of the inosines and about half of them have another peak behind the first inosine. In this case, the second inosine is identified more easily than the first inosine.&nbsp;<br>
</h4><img src="img/lazyload-ph.png" data-src="https://static.igem.org/mediawiki/2018/1/14/T--NTU-Singapore--xxGxxGxx_I.jpg" class="img-responsive center-block lazyload"/>
+
<p class=" text-justify">
+
In xxGGGxx patterns, we could see that over 70% of them have a peak in the middle inosine position, but there is no obvious pattern for the other peaks near the inosine. As for xxGxxGxx variation, we didn&rsquo;t find any regular pattern of the peaks shown in the graph.<br>
+
 
</p>
 
</p>
 
<div class="divider-h">
 
<div class="divider-h">
Line 102: Line 109:
 
</div>
 
</div>
 
<h2 class="mg-md ">
 
<h2 class="mg-md ">
<span class="fa fa-chevron-right"></span> Identification of Other Modified Bases
+
<span class="fa fa-chevron-right"></span> Identification of Pseudouridine in RNA
 
</h2>
 
</h2>
<p class=" text-justify">
+
<p class="text-left">
We also tried to identify other non-canonical bases through nanopore sequencing. We used another predefined DNA gBlocks as a template for IVT with other modified bases such as pseudouridine (Ψ), 1-methyladenosine (m1A), 6-methyladenosine (m6A), and 5-methylcytosine (m5C).<br>
+
We also tried to identify another non-canonical nucleobase, pseudouridine (Ψ), through nanopore sequencing. The methods were exactly the same as detecting inosines, but now we used another predefined DNA gBlocks with xxTxx and xxTTxx variations.<br>
 
</p>
 
</p>
<h4 class="mg-md text-center">
+
<img src="img/lazyload-ph.png" data-src="https://static.igem.org/mediawiki/2018/b/b8/T--NTU-Singapore--xxTxx-and-xxTTxx_1.jpg" class="img-responsive center-block lazyload" style="padding-left:2.5em;"/><img src="img/lazyload-ph.png" data-src="https://static.igem.org/mediawiki/2018/a/a0/T--NTU-Singapore--xxTxx-and-xxTTxx_2.jpg" class="img-responsive center-block lazyload" style="padding-left:2.5em;"/>
 +
<h4 class="mg-md text-center" style="padding-top:1em;">
 
Figure 8. Percentage of modification in xxTxx and xxTTxx variation with pseudouridine as the modified base
 
Figure 8. Percentage of modification in xxTxx and xxTTxx variation with pseudouridine as the modified base
</h4><img src="img/lazyload-ph.png" data-src="https://static.igem.org/mediawiki/2018/b/b8/T--NTU-Singapore--xxTxx-and-xxTTxx_1.jpg" class="img-responsive center-block lazyload" style="padding: 1em 0.5em;/><img src="img/lazyload-ph.png" data-src="https://static.igem.org/mediawiki/2018/a/a0/T--NTU-Singapore--xxTxx-and-xxTTxx_2.jpg" class="img-responsive center-block lazyload" style="padding: 1em 0.5em;/>
+
</h4>
<p class=" text-justify">
+
<p class="text-left" style="padding-top:1em;">
For identification of Ψ, the DNA template sequence was designed to have a single thymidine (T) (xxTxx) or double T (xxTTxx) every 10 – 11 base pairs. Acrylonitrile labeling was also done on the samples to further modify the modified bases. We observed that in 47% of the positions of the modified bases, both single and double T variations, the peaks are located 2 positions behind the T (first T for double T variation).<br>
+
In our Ψ-containing RNA samples, we observed that in 47% of the positions of the modified bases, both single and double T variations, the peaks are located 2 positions behind the Ψ or the first Ψ for double T variation.<br>
</p>
+
<h4 class="mg-md text-center" style="padding-top:1em;">
+
Figure 9. Percentage of modification in xxCxx variation with 5-methylcytosine as the modified base
+
</h4><img src="img/lazyload-ph.png" data-src="https://static.igem.org/mediawiki/2018/b/b0/T--NTU-Singapore--xxCxx.jpg" class="img-responsive center-block lazyload"/>
+
<h4 class="mg-md text-center" style="padding-top:1em;">
+
Figure 10. Percentage of modification in xxAxx variation with 1-methyladenosine as the modified base
+
</h4><img src="img/lazyload-ph.png" data-src="https://static.igem.org/mediawiki/2018/e/e3/T--NTU-Singapore--xxAxx_m1A.jpg" class="img-responsive center-block lazyload"/>
+
<h4 class="mg-md text-center" style="padding-top:1em;">
+
Figure 11. Percentage of modification in xxAxx variation with 6-methyladenosine as the modified base
+
</h4><img src="img/lazyload-ph.png" data-src="https://static.igem.org/mediawiki/2018/6/62/T--NTU-Singapore--xxAxx_m6A.jpg" class="img-responsive center-block lazyload"/>
+
<p class=" text-justify">
+
In case of identification of m5C, m1A, and m6A, we used only xxMxx variation in the DNA template sequence, where M is corresponding non-modified base. As shown in the graphs above, there are many peaks in the graph and we could not identify which peaks correspond to the position of the modified bases. We believed that the nanopore sequencing itself has an error, and this error affects significantly to the results we obtained. A labeling strategy is still being explored to find a molecule that could specifically bind to those modified bases to better distinguish the modified bases with the normal ones.<br>
+
 
</p>
 
</p>
 
<div class="divider-h">
 
<div class="divider-h">
Line 131: Line 127:
 
<span class="fa fa-chevron-right"></span> Conclusion
 
<span class="fa fa-chevron-right"></span> Conclusion
 
</h2>
 
</h2>
<p class=" text-justify">
+
<p class="text-left">
Based on the all the results we obtained, we found that our current analysis to identify modified bases could detect the signal changes at the positions near the modified bases, indicating the structural effect towards nanopore signals. However, our current model is oversimplified, which is shown by most of the peaks in the graph not being located at the exact position of the modified bases. Further adjustment to the analysis is required to take into account other factors affecting the nanopore electrical signals.<br>
+
In conclusion, we found that our current analysis to detect modified bases could detect the signal changes at the positions near the modified bases, indicating that there is indeed a structural effect towards nanopore electrical signals. However, our current model is still oversimplified, as shown by the fact that most of the peaks in the graphs not being located at the exact position of the modified bases. Further adjustment to the analysis is required to take into account other factors affecting the nanopore electrical signals.<br>
 
</p>
 
</p>
 
<div class="divider-h">
 
<div class="divider-h">

Latest revision as of 03:57, 18 October 2018


Project Nanopore 

 Motivation

After our interview with the medical professionals, we recognized that the high cost of RNA sequencing would be a big hurdle in providing affordable gene therapy based on RNA editing. Also, the current next-generation sequencing methods do not provide unbiased direct reads of the transcriptome and ignore important modifications on the nucleobases. As such, to improve the affordability and the accuracy of RNA sequencing in the future, we proposed to develop a high-throughput, unbiased and modification-sensitive RNA sequencing method based on nanopore technologies.

 Background

The human transcriptome contains not only the four canonical nucleobases – adenine (A), uracil (U), cytosine (C), and guanine (G), but also non-canonical ones such as inosine (I), pseudouridine (Ψ), 5-methylcytosine (m5C), and many others. These non-canonical nucleobases are naturally present in our cells to regulate gene expression. However, there are also modified bases that are not supposed to occur in healthy cells, which in turn could lead to certain diseases.

Figure 1. Common Modified Bases in RNA

To cure diseases caused by modified bases at the transcriptome level, it is necessary to know the sequence of all the RNA in the cell. Since these diseases are caused by the presence of modified nucleobases, the sequencing technology has to be able to identify the modified bases. Illumina sequencing can identify the positions of modified bases in the transcriptome, but it lacks information on how the modified bases within a single RNA strand correlate with one another. Mass spectrometry of RNA is another way to determine the positions of the modified bases. This technique provides information regarding their correlation, but its low throughput makes it unable to sequence the entire human transcriptome.

Figure 2. Mechanism of Nanopore Sequencing

In this project, we explored nanopore technology for the identification of non-canonical nucleobases in RNA. Nanopore sequencing is a high-throughput direct sequencing technique that could provide information on the correlation among the modified bases. In nanopore sequencing, the RNA passes through the nanopore starting from its 3’ end and produces electrical signals as it does so. The electrical signal at each point is determined by the 5-mer of the RNA sequence currently in the pore. We then compare the electrical signals generated by RNA with modified bases against those generated by normal RNA.
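Because each signal depends on a 5-mer, a single modified base can influence up to five consecutive signal windows. The following is a minimal sketch of that idea only; the function name and the example sequence are hypothetical and are not taken from our analysis pipeline.

```python
def kmers_covering(sequence, position, k=5):
    """Return (start, kmer) pairs for every k-mer window that contains `position`."""
    # The earliest window containing `position` starts k-1 bases before it,
    # the latest one starts at `position` itself (both clipped to the sequence).
    first = max(0, position - k + 1)
    last = min(position, len(sequence) - k)
    return [(start, sequence[start:start + k]) for start in range(first, last + 1)]


# Hypothetical RNA fragment; the base at index 5 is the only G.
rna = "ACUUAGCAUCA"
windows = kmers_covering(rna, 5)
for start, kmer in windows:
    print(start, kmer)
```

Every listed window contains the base of interest, so a change at that one base can shift the signal over several neighbouring positions rather than at a single point.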

 Identification of Inosine in RNA

Synthetic RNA samples were produced from PCR-amplified DNA gBlocks with a predefined sequence. The forward primer was designed to contain a T7 promoter overhang, while the reverse primer contained a poly(A) tail to enable binding of the adapters for nanopore sequencing. In vitro transcription (IVT) was performed on the amplified DNA using inosine (I) as the modified nucleotide to replace all the canonical guanosine (G), while keeping A, U, and C unchanged. Another synthetic RNA sample containing only canonical nucleobases, with exactly the sequence encoded by the DNA template, was also produced as the negative control.

The DNA gBlock sequences were designed so that the guanosines were separated by 10 to 11 nucleotides other than G, with a total length of around 1 kb. We tried different variations of G sequences in the gBlocks, namely xxGxx, xxGGxx, xxGGGxx, and xxGxGxx, where x is any canonical nucleotide other than G. We wanted to find out whether our method could differentiate between these G patterns.
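The spacing rule can be illustrated with a short sketch that builds a toy template. It assumes that "every 10 to 11 nucleotides" means spacers of 10 or 11 non-G bases between G sites; the function name, the random spacer choice, and the seed are hypothetical, and the actual gBlock sequences were not generated this way.

```python
import random
import re


def design_template(motif, n_sites, seed=0):
    """Build a DNA template with `motif` separated by 10-11 non-G nucleotides."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    parts = []
    for _ in range(n_sites):
        spacer_len = rng.choice((10, 11))
        # Spacers use only A, C, and T so that every G belongs to a motif.
        parts.append("".join(rng.choice("ACT") for _ in range(spacer_len)))
        parts.append(motif)
    # Trailing spacer so the last motif is not at the very end of the template.
    parts.append("".join(rng.choice("ACT") for _ in range(10)))
    return "".join(parts)


template = design_template("GG", n_sites=5)   # xxGGxx variation
spacers = re.split("G+", template)            # stretches between the G sites
print(template)
print([len(s) for s in spacers])
```

Passing "G" or "GGG" as the motif gives the xxGxx and xxGGGxx variations under the same spacing assumption.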

We weren’t sure whether inosine (I) would produce a signal noticeably different from that of G. So, we also produced RNA samples that were labeled with acrylonitrile. The acrylonitrile attached only to the Is and Gs in the RNA samples, which we hoped would produce more distinct signals compared to normal guanosines, making it easier to distinguish between I and G in the RNA strands.

Figure 3. Inosine labelling by acrylonitrile

The current signals produced by all the samples were compared to the negative control, which is the normal RNA sample containing G without acrylonitrile. The electrical signal data were analyzed by machine learning to produce the % modification at each position in the RNA samples. The % modification is the percentage of nanopore electrical signals at a given position, generated by the RNA sample containing modified bases, that are different from those of the normal RNA. A peak indicates that the difference in the produced signals is much more apparent at that position than at the surrounding positions.
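To make the notion of a peak concrete, the sketch below picks local maxima from a per-position % modification profile. It is only an illustration: the height threshold, the local-maximum rule, and the example numbers are assumptions and do not reproduce the machine learning analysis described above.

```python
def find_peaks(percent_mod, min_height=20.0):
    """Return indices whose % modification is a local maximum above `min_height`."""
    peaks = []
    for i in range(1, len(percent_mod) - 1):
        value = percent_mod[i]
        # Strictly higher than the left neighbour and not lower than the right
        # one, so a flat-topped peak is reported once at its first position.
        if value >= min_height and value > percent_mod[i - 1] and value >= percent_mod[i + 1]:
            peaks.append(i)
    return peaks


# Hypothetical % modification values along a short stretch of RNA.
profile = [3, 5, 4, 30, 12, 6, 5, 8, 41, 40, 7, 2]
peak_positions = find_peaks(profile)
print(peak_positions)
```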

Figure 4. Percentage of modification in xxGxx variation with inosine as the modified base

We define the peaks that indicate the presence of inosine as those within 4 positions of the actual inosine position. In the case of xxGxx sequences, we found that 61% of the peaks indicating the presence of inosine are located 1 or 2 positions behind the actual position. We also found that the sample with inosine labeled with acrylonitrile (I with ACN) produced a higher % modification than the sample with inosine without acrylonitrile (I no ACN) at most of the positions, which is what we expected.
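The 4-position rule can be sketched as follows: each peak is matched to its nearest known inosine position, and the offset is counted only if it falls within the window. The function name and the example positions are hypothetical, and the sign of the offset is simply peak minus inosine in template coordinates; which sign corresponds to "behind" depends on the read direction and is not assumed here.

```python
from collections import Counter


def peak_offsets(peaks, modified_positions, window=4):
    """Count offsets (peak - nearest modified position) that lie within `window`."""
    offsets = Counter()
    for peak in peaks:
        # Assign the peak to the closest known modified base.
        nearest = min(modified_positions, key=lambda pos: abs(peak - pos))
        offset = peak - nearest
        if abs(offset) <= window:
            offsets[offset] += 1
    return offsets


# Hypothetical peak positions and known inosine positions.
peaks = [9, 19, 33, 50]
inosines = [10, 21, 32, 43]
offset_counts = peak_offsets(peaks, inosines)
print(dict(offset_counts))
```

Dividing each count by the number of assigned peaks gives the kind of percentage quoted above for a particular offset.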

Figure 5. Percentage of modification in xxGGxx variation with inosine as the modified base

For the xxGGxx variation, we found that 45% of the peaks corresponding to the inosines were located on the first inosine and 2 positions behind the first inosine. We speculate that the peak 2 positions behind the first inosine and the peak on the first inosine correspond to the first and second inosine, respectively. However, there are also sites with only 1 peak, which means only 1 modified base was detected at that site.

Figure 6. Percentage of modification in xxGGGxx variation with inosine as the modified base

For the xxGGGxx variation, over 70% of the sites have a peak at the middle inosine position. However, there is no obvious pattern for the other peaks near the inosines.

Figure 7. Percentage of modification in xxGxGxx variation with inosine as the modified base

For the xxGxGxx variation, 60% of the xxGxGxx sequences have a peak between the two inosines, and about half of them have another peak behind the first inosine. In this case, the second inosine is identified more easily than the first inosine.

Identification of Pseudouridine in RNA

We also tried to identify another non-canonical nucleobase, pseudouridine (Ψ), through nanopore sequencing. The methods were exactly the same as those used for detecting inosine, but this time we used another set of predefined DNA gBlocks with xxTxx and xxTTxx variations.

Figure 8. Percentage of modification in xxTxx and xxTTxx variation with pseudouridine as the modified base

In our Ψ-containing RNA samples, we observed that at 47% of the modified-base positions, across both the single and double T variations, the peak is located 2 positions behind the Ψ (or behind the first Ψ in the double T variation).

Conclusion

In conclusion, we found that our current analysis could detect signal changes at positions near the modified bases, indicating that modified bases do have a structural effect on the nanopore electrical signals. However, our current model is still oversimplified, as shown by the fact that most of the peaks in the graphs are not located at the exact positions of the modified bases. Further adjustment to the analysis is required to take into account other factors affecting the nanopore electrical signals.