Team:Lund/Model/Hemoglobin

Design of hemoglobin mutants

Overview

In this model we rationally design human hemoglobin mutants in a semi data-driven manner by combining machine learning and protein engineering. We do this by ranking a computer generated library of mutants by using a classifier trained on a set of functionally characterized hemoglobins.

We managed to successfully train the model by using the characterized proteins and an external validation set showed that a meaningful learning was achieved. The model proposed that the way the polarity and hydrophobicity is distributed along the beta chain is an indicator of the strength of the oxygen binding affinity. Furthermore, the generated mutants were ranked by their likelihood of decreasing the oxygen binding affinity, of which it was shown that many proposed mutations were located at residues in close proximity to the heme group. By combining the ranking scheme with biological insight, four mutants were chosen as final candidates. These were successfully expressed in E. coli as indicated by SDS-PAGE and shown to be functional by measuring the crude absorbance spectra. We are still working on fully characterizing the proteins and will hopefully be able to present a true validation by measuring the actual binding affinity.

Why?

For this year, our project has been all about studying bacterial hemoglobins and using them to help upscale certain bioprocesses. Along with this, it felt natural to consider the human variant as well. We found that there exists a big blood shortage in many countries and that there is currently no other blood substitute than human donated blood. However, a lot of work has been put into designing hemoglobin mutants for recombinant expression for this purpose. With this we found a database of human hemoglobin variants which, to our knowledge, no one has explored from a machine learning perspective. The appealing idea of “upscaling hemoglobin with hemoglobin” came up and we found the idea of designing our own human hemoglobin variants by using machine learning and then upscaling them by using the bacterial hemoglobin very tempting. This whole thought resulted in this model along with its corresponding lab section.

Detailed summary

Recombinant hemoglobin based oxygen carriers (rHBOCs) may serve as potential blood substitutes. The advantages over human donated blood are numerous and include universal compatibility, longer shelf life, and reduced risk of disease transmission as well as enhanced oxygen delivery. However, before becoming practically feasible, several properties of the hemoglobin need to be tuned, such as reducing oxygen affinity and increasing stability without compromising other properties.

As a part of this year’s modeling, we have set out to investigate the issue of reducing the oxygen binding affinity by combining machine learning and protein engineering. In total, 143 functionally characterized hemoglobin mutants were obtained from the HbVar database and 2874 structural and physicochemical descriptors were generated from the PROFEAT server.

The dataset was first subjected to a validation and model selection split where the former was set aside for later validation. By using the latter split, the descriptors were subjected to a four stage feature elimination pipeline, resulting in only 22 active descriptors remaining. By investigating the Gini importance of a random forest classifier, it was seen that how the polarity and hydrophobicity are distributed along the beta chain may act as an indicator of the strength of the oxygen binding affinity. Based on this, four classifiers, a random forest, gradient boosted tree, support vector machine and logistic regression along with three voting models were cross validated under a grid search. To avoid random bias, each five-fold cross validation was repeated twenty times. The best model as evaluated by the final external validation was a soft voting classifier consisting of a gradient boosted tree and random forest classifier, yielding an accuracy of 0.73 (+/- 0.08), sensitivity of 0.89 (+/- 0.10), specificity of 0.76 (+/- 0.06) and a roc-auc of 0.69 (+/- 0.13) on the model selection set. For validation, the accuracy was 0.78, sensitivity 0.93, specificity 0.76 and roc-auc 0.74 after retraining the model on the entire model selection set. The final voting classifier was then retrained on all data and ready for the subsequent mutant ranking. The entire procedure is illustrated in fig. 1.

Figure 1: The figure illustrates the machine learning pipeline used to build the classifier. First, the dataset is obtained from the HBVar database. Then, features are generated by the PROEAT server and a model is training by a route of successive feature and model selection. A selected model is then obtained by external validation and retrained for final use.

After the model selection procedure, 1795 single point mutations were generated and additional descriptors were generated from the PROFEAT server. These mutants were then ranked by their likelihood of decreasing the oxygen binding affinity by filtering through the voting classifier. In order to evaluate the final true performance of the model, the ranking was manually analyzed and four high scoring mutants were chosen and expressed in the lab. These were successfully expressed in E. coli as indicated by SDS-PAGE and shown to be functional as indicated by the characteristic absorbance peaks at around 420 and 550 nm when measuring the crude absorbance spectra by spectrophotometry. The entire procedure is illustrated in fig 2.

Figure 2: The figure illustrates the procedure of generating new hemoglobin mutants in silico. First, a set of mutants are obtained from the HBVar database and a set of single point hemoglobin mutants are generated. The corresponding descriptors are then obtained from the PROFEAT. Then, based on the previously built model, all mutants are ranked, manually selected and sent to the wet lab.
Gant Logo LMK Logo Lund Logo Kunskappsartner Logo