Team:NCTU Formosa/Dry Lab/Microbiota Prediciton

Microbiota Prediction

     Artificial intelligence and machine learning allow us to pursue the seemingly insurmountable goal of predicting how an entire microbiota fluctuates in response to bio-stimulators. While traditional ecologists may find it too difficult to consider every unique microbial relationship in an ecosystem, machine learning programs use numerical analysis not only to determine these associations quickly but also to use them to predict overall population shifts caused by stimuli. For our modelling we chose Weka, a machine learning software package with strong classification capabilities, to establish accurate connections between every pair of bacterial genera in our soil.

Considering General Factors

     To begin modelling the relationship between bio-stimulators and microbiota, we first determined the most important factors that affect bacterial growth in soil. Three conditions immediately came to mind: temperature, pH and salinity, whose effects are modelled by the respective equations below:

$$Ratkowsky\ Equation:$$
$$Cardinal\ pH\ Equation:$$
$$Salinity\ Equation:\ R_{sal}(sal)=(f\cdot sal^2)+(g\cdot sal)+h$$
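
As a rough illustration of how such factor equations can be evaluated, the sketch below codes the salinity quadratic given above together with the temperature model in the form published by Ratkowsky et al. (1983), sqrt(r) = b(T - T_min)(1 - exp(c(T - T_max))). It is an assumption that our model uses exactly this published form, and every coefficient value in the example is a placeholder rather than a fitted value. The cardinal pH model is left out because its form is not shown here.

```python
import math


def ratkowsky_rate(temp, b, c, t_min, t_max):
    """Growth rate r from sqrt(r) = b*(T - T_min)*(1 - exp(c*(T - T_max)))."""
    if temp <= t_min or temp >= t_max:
        return 0.0  # assumption: no growth outside the biokinetic range
    root = b * (temp - t_min) * (1.0 - math.exp(c * (temp - t_max)))
    return root ** 2


def salinity_rate(sal, f, g, h):
    """Quadratic salinity response R_sal(sal) = f*sal^2 + g*sal + h."""
    return f * sal ** 2 + g * sal + h


# Placeholder coefficients, chosen only to make the sketch runnable.
print(ratkowsky_rate(25.0, b=0.03, c=0.2, t_min=5.0, t_max=45.0))
print(salinity_rate(2.0, f=-0.5, g=1.0, h=1.0))
```

In practice the coefficients would be fitted to measured growth data for each genus.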

     These factors heavily influence bacterial fluctuation in any environment and are especially important in determining soil microbiota, according to soil expert Professor Young from National Chung Hsing University. Professor Young also suggested that we consider the relationship between soil bacteria and nitrogen, phosphorus and potassium, because fertilizers containing these vital macronutrients are regularly applied to farm soil. To take these elements into account, we collected literature discussing their impact on bacterial levels and found the following functions:

$$N=A\cdot(\text{Arthrobacter citreus})+B\cdot(\text{Arthrobacter globiformis})+C\cdot(\text{Arthrobacter terregens})+D\cdot(\text{Bacillus brevis})+E\cdot(\text{Bacillus cereus})+F\cdot(\text{Bacillus cereus var. mycoides})+G\cdot(\text{Bacillus circulans})+H\cdot(\text{Bacillus firmus})+I\cdot(\text{Bacillus globisporus})+\dots$$

$$P=\text{Gammaproteobacteria (56 isolates)}+\text{Firmicutes (28 isolates)}+\text{Actinobacteria (8 isolates)}+\text{Alphaproteobacteria (2 isolates)}$$

$$K=A\cdot(\text{Bacillus sp.})+B\cdot(\text{Pseudomonas sp.})+C\cdot(\text{Sinorhizobium metallidans})$$

     These equations model the direct relationship between levels of the elements and levels of bacteria in soil – specifically, the levels of bacteria that metabolize said elements.

     Combining these general equations gives a rough estimate of how our microbiota will change as these factors fluctuate. The universal factors (temperature, pH and salinity) help model general fluctuations in the microbiota, while the more specific factors (nitrogen, phosphorus and potassium) help predict how the amount of nutrient-metabolizing bacteria changes in response to fertilizers. But how do we deduce the change in levels of bacteria that are unaffected by these nutrients? We turn to our NGS analysis to find the missing link.

     From our NGS report we calculate the Spearman correlation value of each pair of genera in our soil. This coefficient, assigned a value between -1 and +1, describes the degree of correlation between each pair, with values closer to -1 representing stronger negative correlation and values closer to +1 representing stronger positive correlation. Correlation values between the 20 most abundant bacterial genera in our soil samples are shown in the following heat map.
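
To make the coefficient concrete, here is a minimal pure-Python sketch of Spearman's rank correlation: each series is converted to ranks (tied values share their average rank) and the Pearson correlation of the ranks is returned. The function names and the abundance values are hypothetical and are not taken from our NGS report.

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical relative abundances of two genera across five samples.
genus_a = [0.10, 0.12, 0.15, 0.11, 0.20]
genus_b = [0.30, 0.28, 0.22, 0.29, 0.15]
print(spearman(genus_a, genus_b))
```

Because only ranks are compared, the coefficient captures any monotonic relationship, not just a linear one.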

Figure 1: Correlation heat map of top-20 bacteria in June

     Once we have our six general equations and our correlation values, we are ready to begin using Weka to construct a prediction model. Our Weka workflow is split into two parts: regression analysis to filter out the non-correlated pairs of bacteria, and cross validation to determine the weighting each bacterial relationship has under different conditions.

Regression Analysis

     We first take advantage of the machine learning software’s classification ability, using the built-in regression analysis module to determine which pairs of bacteria are strongly correlated. To do this we define coefficient values below -0.7 as truly negatively correlated and coefficient values above +0.7 as truly positively correlated; pairs with a value in between are ignored. Weka then separates truly correlated pairs from the rest; these are the bacteria that will change as an indirect effect of bio-stimulator application. We start with one genus of bacteria and assess the correlation coefficient it has with every other genus in the soil. Any pairs with significant correlation are collected into a fold belonging to that genus. Once all pairs are assessed, the resulting fold should contain all the genera that are correlated with our starting genus.
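
The filtering step can be sketched in plain Python (this is not Weka's API): pairs whose coefficient lies below -0.7 or above +0.7 are kept and grouped under each genus they involve. The function name and the coefficient values are hypothetical; the genus names are borrowed from this page purely as labels.

```python
THRESHOLD = 0.7  # cut-off stated in the text


def correlated_groups(correlations, threshold=THRESHOLD):
    """Group strongly correlated partners under each genus.

    `correlations` maps (genus_a, genus_b) -> Spearman coefficient.
    Pairs with |rho| <= threshold are ignored.
    """
    groups = {}
    for (genus_a, genus_b), rho in correlations.items():
        if abs(rho) > threshold:
            groups.setdefault(genus_a, []).append((genus_b, rho))
            groups.setdefault(genus_b, []).append((genus_a, rho))
    return groups


# Hypothetical coefficients, for illustration only.
example = {
    ("Mesorhizobium", "Xylanimicrobium"): 0.85,
    ("Mesorhizobium", "Pseudomonas"): -0.75,
    ("Pseudomonas", "Nitrospira"): 0.25,
}
for genus, partners in correlated_groups(example).items():
    print(genus, partners)
```

Here Nitrospira ends up in no group because its only coefficient falls inside the ignored band.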

Figure 2: Weka determines correlation between bacteria in soil.

     For every pair in any particular fold, Weka plots that pair’s data on a graph to find a curve of regression to describe their relationship. For example:

Figure 3: Spearman's correlation example between Mesorhizobium and Xylanimicrobium.
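
A regression curve of this kind can be sketched with an ordinary least-squares fit. The text does not state which curve family Weka fits, so a straight line is assumed here for simplicity; the function name and the abundance values are hypothetical.

```python
def fit_line(x, y):
    """Least-squares straight line y = slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    if sxx == 0:
        raise ValueError("x values must not all be equal")
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical abundances (%) of two correlated genera in four samples.
abundance_a = [1.0, 2.0, 3.0, 4.0]
abundance_b = [2.1, 3.9, 6.2, 7.8]
slope, intercept = fit_line(abundance_a, abundance_b)
print(f"b = {slope:.3f} * a + {intercept:.3f}")
```

The fitted line then predicts the abundance of one genus from the abundance of its partner.
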

Cross Validation

     The resulting curve is the theoretical relationship between the two bacteria; however, the wide range of soil conditions that vary between different samples may alter the relationship. To account for this, Weka assigns weights to each correlation regression curve by performing cross validation, in this case with three folds. The steps are as follows:

Figure 4: Three fold cross validation.

(1) Three folds of three different genera of bacteria are compared in pairs to determine the accuracy of each pair’s correlational relationship.

(2) If they exhibit a relationship in line with Weka’s initial assessment, nothing changes and the pair keeps its assigned weight.

(3) If they show unexpected associations, they are said to exhibit paradox. Paradox alerts Weka to the discrepancy between prediction and reality, causing it to adjust the weighting accordingly.

(4) Through this cross validation Weka calibrates the weighting of each pair and, after analyzing all folds, can predict how an entire microbiota is related. The result can be expressed in a pie chart describing predicted microbial ratios.
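
For orientation, the sketch below shows standard three-fold cross validation in plain Python: the samples are split into three folds, a model is fitted on two of them, and its error is measured on the held-out fold. It is a generic illustration and does not reproduce Weka's internals or the paradox-based weight adjustment described in the steps above. All names and data are hypothetical, and the "model" is simply the training mean.

```python
def k_fold_indices(n_samples, k=3):
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_validate(x, y, fit, predict, k=3):
    """Return the mean absolute error on each held-out fold."""
    errors = []
    for held_out in k_fold_indices(len(x), k):
        held = set(held_out)
        train = [i for i in range(len(x)) if i not in held]
        model = fit([x[i] for i in train], [y[i] for i in train])
        fold_error = sum(abs(predict(model, x[i]) - y[i]) for i in held_out)
        errors.append(fold_error / len(held_out))
    return errors


def fit_mean(xs, ys):
    """Toy model: remember the mean of the training targets."""
    return sum(ys) / len(ys)


def predict_mean(model, x_value):
    """Toy model: always predict the stored mean."""
    return model


# Hypothetical data: six samples of one genus pair.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
print(cross_validate(x, y, fit_mean, predict_mean, k=3))
```

A large held-out error for a pair signals that its fitted relationship does not generalize across samples, which is the kind of discrepancy a reweighting step would respond to.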

Figure 5: Resulting correlational values used to predict microbiota ratios.

Artificial Intelligence

     Once our initial model is complete we can begin to make rough predictions about microbiota changes based on the volume of bio-stimulator applied. The basic rules we established regarding different soil conditions point us in the right direction in terms of bacterial shifts, but to achieve truly precise control over soil we must improve our prediction accuracy through artificial intelligence. Artificial intelligence feeds actual data back into our system; more data allows for more calibration and more cross validation, adapting our predictions to the specific nature of our soil sample and improving the accuracy of subsequent predictions.

Model Learning

     We began by generating our model using NGS data from April through June. We entered the volume of bio-stimulator applied as well as NGS data from before and after application, thus generating an initial prediction model. With only one month of data the model’s accuracy was approximately 21%, while including a second month’s data increased accuracy to 51%; at this point, if we were to predict results for June, we would get about 51% of the total microbiota correct. Again, we applied bio-stimulator to our soil and waited for our data. Using June data to calibrate our model raised its prediction accuracy by a further 27 percentage points, resulting in a microbiota prediction model with 78% accuracy.
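
The text does not define how accuracy is computed. One plausible reading of "getting a share of the total microbiota correct" is the compositional overlap between predicted and actual relative abundances, sketched below. This metric is an assumption made for illustration only, and the function name and all values are hypothetical.

```python
def composition_overlap(predicted, actual):
    """Sum over genera of min(predicted share, actual share).

    Both inputs map genus -> relative abundance (fractions summing to 1).
    The result is 1.0 for identical compositions and 0.0 for disjoint ones.
    """
    genera = set(predicted) | set(actual)
    return sum(min(predicted.get(g, 0.0), actual.get(g, 0.0)) for g in genera)


# Hypothetical compositions, not measured data.
predicted = {"Kaistobacter": 0.5, "Nitrospira": 0.5}
actual = {"Kaistobacter": 0.25, "Nitrospira": 0.75}
print(composition_overlap(predicted, actual))
```

Under this reading, an accuracy of 78% would mean that 78% of the predicted community mass coincides with the measured community.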

Figure 6: Pie charts of microbiota.

This figure shows the actual microbiota in July alongside the first and second predictions.
The predictions become more accurate as more training data is added.

Figure 7: The learning curve of accuracy.

Accuracy climbs significantly as more data is fed to our model.


Table 1: Comparison of the top 20 bacterial genera between actual July data and predicted data.

Taxon (Genus): Candidatus Koribacter, Candidatus Solibacter

     We previously assumed that bacteria making up less than 1% of the microbiota do not affect soil conditions. By the same reasoning, genera whose predicted and actual proportions differ by less than 1% are within acceptable limits. The predictions differed from the actual results mainly for Candidatus Koribacter, Rhodoplanes, Perlucidibaca, Candidatus Solibacter, Cellvibrio, Kaistobacter, Nitrospira, Flavobacterium, and Pseudomonas. We analyzed these nine genera and their soil functions and found that the predicted soil nitrate and pollutant content are higher than the actual values, which may be related to soil loss that we did not consider. This error should decrease as the amount of data increases. After this genus-level analysis, we evaluate the overall composition by grouping the bacteria according to functional classification.
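
The 1% tolerance rule can be expressed as a small filter that keeps only the genera whose predicted and actual proportions differ by at least one percentage point. The function name and the values below are hypothetical and are not taken from Table 1.

```python
def flag_large_deviations(predicted, actual, tolerance=1.0):
    """Return genus -> (predicted - actual) where |difference| >= tolerance.

    Inputs map genus -> proportion in percent. Genera missing from one
    side are treated as 0%.
    """
    flagged = {}
    for genus in sorted(set(predicted) | set(actual)):
        diff = predicted.get(genus, 0.0) - actual.get(genus, 0.0)
        if abs(diff) >= tolerance:
            flagged[genus] = diff
    return flagged


# Hypothetical proportions in percent.
predicted = {"Nitrospira": 4.5, "Kaistobacter": 3.0}
actual = {"Nitrospira": 2.0, "Kaistobacter": 2.5}
print(flag_large_deviations(predicted, actual))
```

A positive difference means the model overestimated that genus, and a negative one means it underestimated it.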

Figure 8: Proportion of bacteria with different functions in each data set.
(PGPB stands for plant growth-promoting bacteria. NB stands for nitrifying bacteria.)

Table 2: Evenness of different samples.

     After sampling the NGS data of these three months, we classified the top 20 bacteria in each data set by function to obtain the breakdown in Figure 8. The proportion of plant growth-promoting bacteria (PGPB) is rising, while the proportion of plant pathogenic bacteria fluctuates only slightly, which means that the bio-stimulator lets us raise beneficial bacteria while keeping the proportion of plant pathogenic bacteria under control. Nitrifying bacteria and phosphate-solubilizing bacteria also rise. These bacteria produce nutrients that plants need, so their increase can reduce the amount of fertilizer applied. The decline in pollution indicator bacteria indicates that the proportion of pollutants in the soil has decreased: because we applied no pollutants such as pesticides or herbicides, the soil is slowly repairing itself over time. Finally, the data show that denitrifying bacteria decline. The abundance of denitrifying bacteria can serve as an indicator of nitrate content, and high nitrate content causes environmental pollution. All of this indicates that the soil became healthier each month after the bio-stimulator was applied. It is also worth noting that evenness improved over these three months. We therefore conclude that the bio-stimulator can steer bacterial changes and make the soil healthier, and that evenness can be used as one criterion for assessing soil health.
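
The text does not say which evenness index is reported in Table 2. A common choice is Pielou's evenness, J = H / ln(S), where H is the Shannon diversity and S is the number of taxa present. The sketch below assumes that index; the function name and the counts are hypothetical.

```python
import math


def pielou_evenness(abundances):
    """Pielou's J = H / ln(S), with H the Shannon index over non-zero taxa."""
    total = sum(abundances)
    proportions = [a / total for a in abundances if a > 0]
    if len(proportions) < 2:
        return 0.0  # evenness is not meaningful for fewer than two taxa
    shannon = -sum(p * math.log(p) for p in proportions)
    return shannon / math.log(len(proportions))


# Hypothetical read counts for four genera in two samples.
even_sample = [25, 25, 25, 25]
skewed_sample = [85, 5, 5, 5]
print(pielou_evenness(even_sample))
print(pielou_evenness(skewed_sample))
```

Values near 1 mean that the genera are present in similar proportions, while values near 0 mean that a few genera dominate.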


     Increasingly accurate prediction of microbial shifts due to bio-stimulators is a vital element of our smart farming system. Our goal is to regulate soil microbiota precisely, and we need accurate models to do so. Machine learning and artificial intelligence can provide just that. A general model built from established relationships between key environmental factors and bacterial growth is supported by correlation values calculated from NGS data, allowing rough initial predictions of microbial shifts. Raw data obtained after subsequent applications of bio-stimulators is reintroduced into our models through a feedback system, calibrating the weighting of each bacterial correlation to improve accuracy with each cycle. With increasingly precise regulation we can steer the soil microbiota toward a desired effect. Visit our real farm demonstration to find out how we use artificial intelligence to increase curcumin concentration in turmeric while maintaining soil health.
     If you are interested in our models, please click the icon to find out how they work on GitHub!


1. Bouckaert, R. R., et al. (2013). "WEKA Manual for Version 3-7-8, 2013." 21.
2. Witten, I. H., et al. (2011). "Data Mining: Practical machine learning tools and techniques."
3. Barabasz, W. and J. Lipiec (2002). "Biological effects of mineral nitrogen fertilization on soil microorganisms." Polish Journal of Environmental Studies 11(3): 193-198.
4. Kumar, A. and L. C. Rai (2017). "Soil Organic Carbon and Availability of Soil Phosphorus Regulate Abundance of Culturable Phosphate Solubilizing Bacteria in Paddy Fields of the Indo-Gangetic Plain." Pedosphere.
5. Lambert, R. J. (2011). "A new model for the effect of pH on microbial growth: An extension of the Gamma hypothesis." Journal of Applied Microbiology 110(1): 61-68.
6. Nihala Jabin, P. (2017). Screening of potash solubilizing bacteria for plant growth promotional activity and nutrient uptake of brinjal, Vasantrao Naik Marathwada Krishi Vidyapeeth, Parbhani.
7. Ratkowsky, D. A., et al. (1983). "Model for bacterial culture growth rate throughout the entire biokinetic temperature range." J Bacteriol 154(3): 1222-1226.
8. Rousk, J., et al. (2011). "Bacterial salt tolerance is unrelated to soil salinity across an arid agroecosystem salinity gradient." 43(9): 1881-1887.
9. Wikipedia contributors. (2018, October 12). Bacillus. In Wikipedia, The Free Encyclopedia. Retrieved 18:17, October 16, 2018, from
10. Wikipedia contributors. (2018, March 23). Geobacter. In Wikipedia, The Free Encyclopedia. Retrieved 18:34, October 16, 2018, from
11. Espenberg, M., et al. (2018). "Differences in microbial community structure and nitrogen cycling in natural and drained tropical peatland soils." Scientific Reports 8(1): 4742.
12. Hou, J., et al. (2015). "PGPR enhanced phytoremediation of petroleum contaminated soil and rhizosphere microbial community response." Chemosphere 138: 592-598.
13. Hruska, K., Vyzkumny Ustav Veterinarniho Lekarstvi, Brno (Czech Republic) and M. Kaevska, Vyzkumny Ustav Veterinarniho Lekarstvi, Brno (Czech Republic) (dec2012). "Mycobacteria in water, soil, plants and air: a review." v. 57.
14. Jiao, S., et al. (2016). "Microbial succession in response to pollutants in batch-enrichment culture." Scientific Reports 6: 21791.
15. Leys, N. M. E. J., et al. (2004). "Occurrence and Phylogenetic Diversity of Sphingomonas Strains in Soils Contaminated with Polycyclic Aromatic Hydrocarbons." Applied and Environmental Microbiology 70(4): 1944-1955.
16. Ma, M., et al. (2018). "Effect of long-term fertilization strategies on bacterial community composition in a 35-year field experiment of Chinese Mollisols." AMB Express 8(1): 20.
17. Martineau, C., et al. (2015). "Comparative analysis of denitrifying activity in Hyphomicrobium nitrativorans, Hyphomicrobium denitrificans and Hyphomicrobium zavarzinii." AEM. 00848-00815.
18. Rodgers-Vieira, E. A., et al. (2015). "Identification of Anthraquinone-Degrading Bacteria in Soil Contaminated with Polycyclic Aromatic Hydrocarbons." Applied and Environmental Microbiology.
19. Sangwan, P., et al. (2005). "Detection and cultivation of soil verrucomicrobia." Appl Environ Microbiol 71(12): 8402-8410.
20. Sorensen, J. and O. Nybroe (2004). Pseudomonas in the Soil Environment. Pseudomonas: Volume 1 Genomics, Life Style and Molecular Architecture. J.-L. Ramos. Boston, MA, Springer US: 369-401.
21. Umadevi, P., et al. (2018). "Trichoderma harzianum MTCC 5179 impacts the population and functional dynamics of microbial community in the rhizosphere of black pepper (Piper nigrum L.)." Brazilian Journal of Microbiology 49(3): 463-470.
22. van Dijl, J. M. and M. Hecker (2013). "Bacillus subtilis: from soil bacterium to super-secreting cell factory." Microb Cell Fact 12: 3.
23. Wang, R., et al. (2017). "Microbial community composition is related to soil biological and chemical properties and bacterial wilt outbreak." Scientific Reports 7(1): 343.
24. Winston, M. E., et al. (2014). "Understanding Cultivar-Specificity and Soil Determinants of the Cannabis Microbiome." PLOS ONE 9(6): e99641.
25. Yan, G., et al. (2017). "Effects of different nitrogen additions on soil microbial communities in different seasons in a boreal forest." 8(7): e01879.