AMP Forest: a regression model for AMP efficacy
Designing AMPs that are highly efficient without causing serious expression deficiency is a complex optimization problem that has not been discussed in the previous literature. Many mechanisms have been proposed to explain AMP efficiency, but they remain hard to use as quantitative guidance for design or engineering. Here we present AMP Forest, a regression model trained on experimental data from more than 0.3 million randomly synthesized short peptides, which accurately predicts antimicrobial efficiency purely from the peptide sequence, with a Pearson correlation coefficient of ~0.98.
For any machine learning problem, the limiting factor is always the dataset. The most up-to-date AMP database contains no more than 20 thousand peptide sequences [9], and these were apparently not characterized by the same experimental assay. Both the size and the quality of this dataset fall far short of what a regression model requires. We therefore sought a better dataset generated by a high-throughput experimental assay. Thanks to our characterization of StarCore, we have confirmed that an AMP can also work on the surface of a larger complex through a similar mechanism. A recently reported method, Surface Localized Antimicrobial Display (SLAY), allows massive data to be acquired from a single high-throughput experiment [10]. A large library of randomly synthesized peptides is fused to a membrane-anchor protein and displayed on the surface of the host bacteria. When a peptide is both highly produced and efficient at killing the bacteria, it reduces the fitness of its host in a growth competition. The fitness is measured by next-generation sequencing (NGS) and converted into a bacterial survival rate, which is inversely related to the antimicrobial efficiency. In the dataset we used, 319,586 intact peptides were detected and scored from the NGS data. This method fits our goal perfectly: finding AMPs that are both effective and easy to produce.
Random forest machine learning for regression model
Although electric charge and hydrophobicity are known to be important for the properties of AMPs, they cannot adequately describe efficacy.
Further, we represented the sequences as one-hot encoding matrices and tried training several advanced deep neural network (DNN) architectures, including deep convolutional networks and ResNet [11-13]. However, probably because the common features are weak and scattered, none of them converged to a good model.
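As an illustration of the one-hot representation mentioned above, here is a minimal sketch in Python. The residue ordering in `AMINO_ACIDS` and the function name `one_hot` are assumptions made for this example; the source does not state which ordering or tooling was used.

```python
# Assumed alphabet: the 20 standard amino acids in alphabetical one-letter order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot(peptide):
    """Encode a peptide as a (length x 20) matrix of 0/1 values.

    Row k has a single 1 in the column of the residue at position k.
    """
    matrix = []
    for residue in peptide.upper():
        if residue not in INDEX:
            raise ValueError(f"unknown residue: {residue!r}")
        row = [0] * len(AMINO_ACIDS)
        row[INDEX[residue]] = 1
        matrix.append(row)
    return matrix


if __name__ == "__main__":
    encoded = one_hot("ACDK")
    print(len(encoded), "rows x", len(encoded[0]), "columns")
```

A 20-residue peptide therefore becomes a 20 x 20 matrix with exactly one non-zero entry per row, which is the kind of input a convolutional network would consume.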
We therefore map the peptides from sequence space into a distance space: instead of its amino acid sequence, each peptide is encoded by its Jukes-Cantor distances to the top 100 peptides that were most effective at killing bacteria in the experiment.
Using the distances of each peptide as the predictors and an arbitrary score of bacterial survival as the output response, we trained a random forest model that bags 100 decision trees, using the TreeBagger function from the MATLAB Statistics and Machine Learning Toolbox. Notably, using more peptides as references makes the algorithm more accurate, but also leads to higher computational cost and a larger risk of overfitting. The trained model is saved and referred to as AMP Forest hereafter.
Warning: AMP Forest is trained on 20-amino-acid peptides and is therefore not well suited to engineering peptides that are much longer or shorter, even though it can process sequences of any length.
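The distance encoding can be sketched as follows. This example assumes the standard 20-state Jukes-Cantor correction, d = -(19/20) ln(1 - (20/19) p), where p is the fraction of differing positions, and it assumes equal-length peptides compared position by position without alignment. The source does not specify either detail, and the function names are hypothetical.

```python
import math


def p_distance(a, b):
    """Fraction of positions at which two equal-length peptides differ."""
    if len(a) != len(b) or not a:
        raise ValueError("this sketch compares non-empty, equal-length peptides only")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def jukes_cantor(a, b, n_states=20):
    """Jukes-Cantor distance for an alphabet of n_states symbols.

    Assumption: ungapped comparison. The correction is undefined once
    p reaches (n_states - 1) / n_states, so infinity is returned there.
    """
    p = p_distance(a, b)
    limit = (n_states - 1) / n_states
    if p == 0:
        return 0.0
    if p >= limit:
        return math.inf
    return -limit * math.log(1 - p / limit)


def distance_features(peptide, references):
    """Encode one peptide as its distances to each reference peptide."""
    return [jukes_cantor(peptide, ref) for ref in references]


if __name__ == "__main__":
    refs = ["ACDEFGHIKL", "ACDEFGHIKK"]  # made-up reference peptides
    print(distance_features("ACDEFGHIKL", refs))
```

With 100 reference peptides, each peptide becomes a vector of 100 distances, regardless of how the references themselves were chosen.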
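To make the bagging idea concrete, below is a small pure-Python sketch of a bagged ensemble of regression trees. It is not the TreeBagger implementation and is not meant to reproduce AMP Forest: the tree depth, minimum leaf size, per-split feature subset (one third of the predictors), and all function names are assumptions chosen for illustration.

```python
import random


def _mean(values):
    return sum(values) / len(values)


def _sse(values):
    m = _mean(values)
    return sum((v - m) ** 2 for v in values)


def build_tree(X, y, rng, max_depth=6, min_leaf=2, depth=0):
    """Grow one regression tree; a leaf is a float, a node is a tuple."""
    if depth >= max_depth or len(y) < 2 * min_leaf or len(set(y)) == 1:
        return _mean(y)
    n_features = len(X[0])
    # Assumed choice: try a random third of the predictors at each split.
    candidates = rng.sample(range(n_features), max(1, n_features // 3))
    best = None
    for j in candidates:
        for t in sorted({row[j] for row in X})[:-1]:
            left = [i for i, row in enumerate(X) if row[j] <= t]
            right = [i for i, row in enumerate(X) if row[j] > t]
            if len(left) < min_leaf or len(right) < min_leaf:
                continue
            score = _sse([y[i] for i in left]) + _sse([y[i] for i in right])
            if best is None or score < best[0]:
                best = (score, j, t, left, right)
    if best is None:
        return _mean(y)
    _, j, t, left, right = best
    return (
        j,
        t,
        build_tree([X[i] for i in left], [y[i] for i in left],
                   rng, max_depth, min_leaf, depth + 1),
        build_tree([X[i] for i in right], [y[i] for i in right],
                   rng, max_depth, min_leaf, depth + 1),
    )


def predict_tree(tree, x):
    """Walk from the root to a leaf and return the leaf mean."""
    while isinstance(tree, tuple):
        j, t, left, right = tree
        tree = left if x[j] <= t else right
    return tree


def train_forest(X, y, n_trees=100, seed=0):
    """Bagging: each tree is fitted to a bootstrap resample of the data."""
    rng = random.Random(seed)
    n = len(y)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        forest.append(build_tree([X[i] for i in idx], [y[i] for i in idx], rng))
    return forest


def predict_forest(forest, x):
    """Average the predictions of all trees."""
    return _mean([predict_tree(tree, x) for tree in forest])
```

In this sketch, `X` would hold one row of reference distances per peptide and `y` the corresponding survival scores. Averaging over bootstrap-trained trees is what reduces the variance of the individual trees, and the cost of both training and prediction grows with the number of reference columns.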
References
- Fan, Linlin, et al. "DRAMP: a comprehensive data repository of antimicrobial peptides." Scientific Reports 6 (2016): 24482.
- Tucker, Ashley T., et al. "Discovery of Next-Generation Antimicrobials through Bacterial Self-Screening of Surface-Displayed Peptide Libraries." Cell 172.3 (2018): 618-628.
- Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "ImageNet classification with deep convolutional neural networks." Advances in Neural Information Processing Systems. 2012.
- He, Kaiming, et al. "Deep residual learning for image recognition." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016.
- Canziani, Alfredo, Adam Paszke, and Eugenio Culurciello. "An analysis of deep neural network models for practical applications." arXiv preprint arXiv:1605.07678 (2016).