Team:NCTU Formosa/Dry Lab/NGS Data Analysis


     To train our model more effectively, we use NGS (Next Generation Sequencing) 16S to analyze the microbiota in the soil. We spray biostimulators into the soil to affect the entire microbiota and perform NGS 16S at regular intervals. By analyzing multiple NGS data sets, we can determine the relationships among the bacteria and use these characteristics to decide how to add biostimulators again. By training our model through this process, we adjust our system again and again, allowing it to predict the changes of the soil microbiota accurately.

What is NGS 16S?

     Below, we explain NGS and 16S rRNA separately.

Next Generation Sequencing

   NGS (Next Generation Sequencing) is a technique for sequencing large numbers of DNA fragments, and thus whole genomes, in a very short time. There are currently three main platforms on the market: Solexa from Illumina, SOLiD from ABI, and 454 from Roche. The three sequencers use different chemistries: Solexa uses sequencing by synthesis with reversible terminators, SOLiD uses sequencing by ligation, and 454 uses pyrosequencing. Taking Solexa as an example, the procedure is as follows:

(1) Use ultrasound to break the original DNA into fragments of about 200-500 base pairs, and then attach adapters to both ends of the fragments.

(2) Place the DNA fragments on a flow cell whose surface carries complementary adapter sequences. The adapters hybridize with each other, which keeps the DNA fragments attached to the flow cell.

(3) Amplify the DNA fragments by bridge amplification.

(4) Sequencing proceeds in a way similar to Sanger sequencing: bases (dNTPs) labeled with specific removable fluorescent molecules are added together with synthesis reagents, and the cycle of detecting and removing the fluorescence is repeated. Finally, computer software analyzes the large number of DNA sequences quickly.


     FastQC is mainly used to check the quality of the NGS data. It is very important to check data quality before analyzing the data; only when the quality is high enough can we continue to the next step.
     After we input the NGS sequence data into FastQC, the program analyzes the data automatically and reports quality scores, so that we can confirm the sequences are suitable for downstream computation.
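
     As a rough illustration of what a per-read quality check involves, the sketch below computes the mean Phred score of each read in FASTQ-formatted text. It is not FastQC's implementation. It assumes Phred+33 quality encoding and a plain four-line record layout, and the cutoff of 20 and all function names are hypothetical choices made for this example.

```python
from statistics import mean

def mean_phred(quality_line, offset=33):
    """Mean Phred score of one read, assuming Phred+33 ASCII encoding."""
    return mean(ord(ch) - offset for ch in quality_line)

def read_fastq(lines):
    """Yield (read_id, sequence, quality) from an iterable of FASTQ lines.

    Assumes the simple four-lines-per-record layout.
    """
    records = [line.rstrip("\n") for line in lines]
    for i in range(0, len(records) - 3, 4):
        header, seq, _plus, qual = records[i:i + 4]
        yield header[1:], seq, qual

def passing_reads(lines, min_mean_quality=20):
    """Return IDs of reads whose mean Phred score meets the (hypothetical) cutoff."""
    return [read_id for read_id, _seq, qual in read_fastq(lines)
            if mean_phred(qual) >= min_mean_quality]

# Two invented reads: 'I' encodes Phred 40, '#' encodes Phred 2.
example = [
    "@read1", "ACGT", "+", "IIII",
    "@read2", "ACGT", "+", "####",
]
print(passing_reads(example))
```

     In this toy input only the first read passes, because its mean score of 40 is above the cutoff while the second read's mean score of 2 is not.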

16S rRNA

   16S rRNA is an important component of the small ribosomal subunit of prokaryotes. Its gene sequence contains several conserved regions and 9 hypervariable regions (V1 to V9). The hypervariable regions are genus- or species-specific, so they are considered the most suitable indicators for bacterial phylogeny and classification. NGS 16S uses the sequences of the V4 and V5 hypervariable regions to detect bacterial clusters.


     We divide the agricultural land into four large blocks: A, B, C, and D. Each block contains three strips (1, 2, and 3), and each strip is divided into three sections (T, M, and D). Thus, we have thirty-six samples in total (A1T, A2M, ...). We take 50 microliters per sample from a depth of 10 to 15 centimeters near the root of each test plant. The samples are then sent to a company for NGS analysis.
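
     The sampling layout can be enumerated with a short sketch. The label format (block, strip, section) follows the examples A1T and A2M given above; the variable names are only illustrative.

```python
from itertools import product

blocks = ["A", "B", "C", "D"]    # four large blocks
strips = ["1", "2", "3"]         # three strips per block
sections = ["T", "M", "D"]       # three sections per strip

# Every combination of block, strip, and section is one sample label.
sample_ids = ["".join(parts) for parts in product(blocks, strips, sections)]

print(len(sample_ids))   # 4 * 3 * 3 = 36
print(sample_ids[:4])
```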

     The results give the ratio of each bacterium in each sample and are reported in an OTU table (Fig. 1).

Marker Gene Amplicon Analysis

     Microbiome data are generated from the 16S ribosomal RNA (rRNA) gene. The PCR primers were designed to amplify the V4 region of the bacterial 16S ribosomal DNA. After 16S rRNA sequencing, we used QIIME to generate operational taxonomic unit (OTU) tables. We then used bioinformatics tools and statistical methods to analyze microbial diversity in the soil samples. We also used machine learning to predict how the soil microbiota changes with the addition of bio-stimulators.

Operational Taxonomic Units Table (OTUs Table)

     Figure 1 is an example of an OTU table. Each column gives the types and amounts of bacteria (OTU1, OTU2, …, OTU7) in one soil sample (A1, A2, A3, B1, B2, and B3). We generate one table per taxonomic level: Phylum, Class, Order, Family, Genus, and Species.

Figure 1: Schematic diagram of an Operational Taxonomic Unit table

Data Analysis Process

Figure 2: The process of correlation analysis

     The OTU tables generated with the open-source QIIME pipeline contain unclassified names. Thus, we have to rearrange the data to facilitate analysis, according to the following steps:

(1) Delete unclassified genomic segments.
(2) Calculate the ratios of the remaining entries.
(3) Select the most abundant bacteria within soil samples to observe their distribution using the following bar charts (Fig. 3, Fig. 4, Fig. 5).
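
     The three steps above can be sketched as follows for a single sample. This is a minimal illustration, not our actual pipeline: the function name is hypothetical, the read counts are invented, and treating any name that contains "unclassified" as an unclassified entry is an assumption about how the table labels such rows.

```python
def clean_and_rank(otu_counts, top_n=20):
    """Drop unclassified entries, convert counts to ratios, keep the top_n taxa.

    otu_counts: dict mapping taxon name -> read count for one sample.
    """
    # Step 1: delete unclassified entries (the labelling rule is an assumption).
    classified = {taxon: count for taxon, count in otu_counts.items()
                  if "unclassified" not in taxon.lower()}
    total = sum(classified.values())
    if total == 0:
        return []
    # Step 2: calculate the ratios of the remaining entries.
    ratios = {taxon: count / total for taxon, count in classified.items()}
    # Step 3: select the most abundant taxa, largest ratio first.
    ranked = sorted(ratios.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_n]

# Invented counts for illustration only.
sample = {"Sphingomonas": 30, "Devosia": 10, "Unclassified": 60, "Alcanivorax": 60}
for taxon, ratio in clean_and_rank(sample, top_n=2):
    print(f"{taxon}: {ratio:.2f}")
```

     Because the ratios are computed after the unclassified entries are removed, they describe shares of the known bacteria only, which is the quantity plotted in the stacked bar charts.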

     After making the stacked bar charts, we organized the data to facilitate the analysis (Table 2). We observed the growth and decline of bacteria in the soil and then summarized the bacteria according to their functions. Moreover, we will explain how what we do on our farm affects the microbiota. For example, Sphingomonas, Alcanivorax, and Devosia, which are pollution indicator bacteria, decreased continuously from May to July. We speculate that they declined because we did not apply herbicides, pesticides, or other pollutants during these three months. As the soil repairs itself, the pollutants decrease, and the pollution indicator bacteria decrease as well. Through this series of analyses, we hope to prove the basic hypothesis of our project: we can precisely regulate the microbiota in the soil by using bio-stimulators. We will put the details of our analysis in our demonstration.

Figure 3: Stacked bar chart of top-20 bacteria ratio in different samples (May)
Figure 4: Stacked bar chart of top-20 bacteria ratio in different samples (June)
Figure 5: Stacked bar chart of top-20 bacteria ratio in different samples (July)

Spearman's Rank Correlation

     The strength of co-occurrence of bacteria within soil samples was evaluated with Spearman’s rank correlation coefficient, which ranges from -1 to 1. The formula for the Spearman correlation coefficient is as follows:

$$\rho_s=1-\frac{6\sum d_i^2}{n(n^2-1)}$$

Table 1: Variable and Parameter in Spearman's correlation equation.

$\rho_s$ - Spearman's correlation value
$d_i$ - The difference in the ranked observations from each group
$n$ - The sample size
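
     To make the formula concrete, here is a minimal sketch that ranks two series and applies $1-6\sum d_i^2/(n(n^2-1))$. The function names and the ratios are invented for illustration. Tied values receive the average of their ranks, and the shortcut formula is exact only when there are no ties.

```python
def ranks(values):
    """Rank values from 1 to n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman's rho from rank differences; needs at least two observations."""
    n = len(x)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

# Invented ratios of two bacteria across four samples.
a = [0.10, 0.20, 0.30, 0.40]
b = [0.05, 0.15, 0.10, 0.30]
print(spearman(a, b))
```

     Here the ranks of `a` are 1, 2, 3, 4 and the ranks of `b` are 1, 3, 2, 4, so the rank differences are 0, -1, 1, 0, the sum of squares is 2, and the coefficient is $1-12/60=0.8$.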

     We used correlation heat maps to visualize the correlation strength. Figures 3, 4, and 5 show the 20 most abundant bacteria within the soil samples in different months. We selected the top 20 because we found that they accounted for over 95 percent of the amount of known bacteria. This holds when we analyze from the phylum level to the class level. At the genus level, however, although the proportion of the original top 20 abundant bacteria decreased to about 75% of known bacteria in some months, the proportion of each remaining bacterium was still lower than 1%. According to this analysis, the effect of the bacteria ranked below 20 could be ignored. A computer program can then visualize the results in a heat map. A map of the 20 most abundant bacteria in our soil is shown below:

Figure 6: Correlation heat map of top-20 bacteria in June

     For example, the figure above shows the correlation heat map for June. We could use the June heat map to select candidate bacteria for predicting the microbiota of July, because Weka used the correlations among the bacteria in June to simulate the microbiota of July. To make Spearman's correlation easier to understand, consider the heat map above: when the proportions of two kinds of bacteria increase together, the block showing the correlation between the two bacteria is red. Conversely, the block turns blue if one bacterium increases while the other decreases. We treated a correlation as meaningful only when the absolute value of its coefficient was larger than 0.7, so we selected every pair of bacteria whose correlation coefficient was greater than 0.7 or less than -0.7 as correlated samples. There are certainly exceptions to the predictions of our system, since the Spearman coefficient only reflects the ratios of the bacteria in the soil samples we collected that month and cannot capture every phenomenon in nature, which leads to differences. However, differences involving low correlation coefficients would not increase the error of our Weka training.
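
     The selection rule described above (keep pairs with a coefficient above 0.7 or below -0.7) can be sketched as a simple filter. The function name and the coefficients below are invented for illustration, and the red/blue labels only mirror the color convention of the heat map.

```python
def strongly_correlated_pairs(correlations, threshold=0.7):
    """Keep pairs whose coefficient is above threshold or below -threshold.

    correlations: dict mapping (taxon_a, taxon_b) -> Spearman coefficient.
    """
    return {pair: rho for pair, rho in correlations.items()
            if abs(rho) > threshold}

# Made-up coefficients, not measured values.
example = {
    ("Sphingomonas", "Devosia"): 0.85,
    ("Sphingomonas", "Alcanivorax"): -0.75,
    ("Devosia", "Alcanivorax"): 0.30,
}

selected = strongly_correlated_pairs(example)
for (taxon_a, taxon_b), rho in selected.items():
    color = "red" if rho > 0 else "blue"
    print(f"{taxon_a} / {taxon_b}: {rho:+.2f} ({color})")
```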

Table 2: The ratio variation of top-20 bacteria within June and July.

Taxon (Genus)

Candidatus Solibacter
Candidatus Koribacter

Alpha-Diversity Analysis

     Use of bio-stimulators to manipulate soil factors requires careful consideration of the microbiota. Certain stimulators may cause specific genera of bacteria to become overly dominant, damaging soil integrity. As a method of monitoring the balance of the microbial ecosystem, we investigate the evenness of the soil.

Evenness--Shannon Index

     Microbial diversity is measured by alpha-diversity (α-diversity). In our study, α-diversity refers to richness and the Shannon diversity index. Richness means the number of OTUs, and the evenness of the bacterial community is measured by the Shannon diversity index, as shown below:

$$H'=-\sum_{i=1}^{S}p_i\ln p_i$$


Table 3: Variable and Parameter in Shannon index equation.

$H'$ - Shannon index
$S$ - The total number of genera in the samples
$p_i$ - The ratio of the bacteria amount of the ith genus in the sample

     A higher Shannon index indicates greater evenness. The estimated degree of evenness can be derived from the exponential of the value. For example, for a soil sample with a Shannon index of 2.85, $$e^{2.85}\approx 17$$ This means that the sample is approximately equivalent to one consisting of 17 bacteria that are equal in number. Thus, the Shannon index can be used as an observational tool to determine whether bio-stimulators decrease the overall evenness, and thus the health and stability, of the soil.
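
     A small sketch of this calculation is given here, using the standard definition $H'=-\sum_i p_i\ln p_i$ with the natural logarithm (an assumption that is consistent with taking $e^{H'}$ above). The function names and the two example communities are invented for illustration.

```python
import math

def shannon_index(ratios):
    """H' = -sum(p_i * ln(p_i)) over genera with a non-zero ratio."""
    return -sum(p * math.log(p) for p in ratios if p > 0)

def effective_number(ratios):
    """exp(H'): the number of equally abundant genera that would give the same H'."""
    return math.exp(shannon_index(ratios))

# Two invented communities with four genera each.
even = [0.25, 0.25, 0.25, 0.25]
skewed = [0.85, 0.05, 0.05, 0.05]

print(shannon_index(even), effective_number(even))
print(shannon_index(skewed), effective_number(skewed))
```

     The perfectly even community has $H'=\ln 4$ and an effective number of 4, while the community dominated by one genus has a lower index, which is the kind of drop we would watch for after applying a bio-stimulator.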

Triplicate Analysis

Figure 7: Box plot of the Shannon index triplicate analysis

