Materials and methods

Overview

In this section we present the materials and methods used to generate the hemoglobin mutants. First, the entire machine learning pipeline is presented to provide a comprehensive overview of the parts building up the model. Then, several sections are presented, dedicated for describing each core part of the pipeline along with design motivations and references to further material. For reproducibility, a set of notebooks is available online. We have tried to explain the material as simply as possible and have therefore omitted equations in many places where they would otherwise be suited. Our hope is that even someone with a purely biological background will be able to pick up the content.

The machine learning pipeline

An overview of the pipeline is illustrated in fig 1. and the approach is simple. First, a library of hemoglobin mutants was obtained. By training predictive models under the framework of machine learning, a model was built. Simultaneously, a library of candidate mutants were generated, and the model was used to rank all candidates with respect to what it learned from the hemoglobin library. Once all candidates were ranked, we picked a couple of mutants and evaluated them in the lab for screening. In the ideal case, one would also want to close the cycle by feeding the lab results back to the library, retraining the model and reiterating the procedure until hopefully a satisfactory variant is obtained.

Figure 1: A high level overview of the used approach. An obtained library of hemoglobin mutants is used to train a classifier. Simultaneously, new candidate mutants are generated and subjected for selection. The selected mutants are then sent to the wet lab for true validation, and then fed back into the library for further refinement.

A detailed block diagram is illustrated in fig. 2 and we now proceed to explain what we did in more detail. Starting with the upper model building part, a library of low and high oxygen affinity human hemoglobin mutants was first obtained from the HBVar database. The wild-type sequences were obtained from GenBank and structural and physicochemical descriptors were then generated from the PROFEAT server. The formed dataset was then subjected to a model selection and validation split, of which the former was used for feature and model selection. A set of trained models were then tested on the omitted validation set to get an estimate of the true performance in classifying high and low affinity hemoglobins. A final model was then selected and retrained on the entire hemoglobin library. Proceeding with the lower part of the diagram which explains how we generated new mutants, a set candidates was obtained by generating all possible single point mutations of the hemoglobin. The descriptors for each candidate were generated by the PROFEAT server. All proteins were then ranked by the built model, of which we selected a handful of promising variants and sent them to the wet lab.

Figure 2: A block diagra of the entire modeling procedure.

Dataset

The dataset used was obtained from the HbVar database of human hemoglobins, which contains information about different hemoglobin variants and mutants which cause thalassemia [1]. The entries in the database are mutants found in humans, of which functional studies have been performed to measure certain properties such as oxygen binding affinity and stability. In total, 114 beta and 29 alpha subunits with single or double point mutations were obtained, yielding a total of 143 different protein sequences. The target oxygen binding affinities were labeled as increased or decreased compared to wild-type human hemoglobin and the total high to low affinity ratio was approximately 2 to 1, indicating a slight imbalance in the dataset. Table 1 illustrates five randomly sampled rows from the dataset.

In order to apply and evaluate the mutants, the wild-type amino acid sequence was obtained from genbank. The NCBI reference sequence numbers are NP_000549.1 and NP_000509.1 for the alpha and beta subunits respectively.

Table 1: Five randomly selected mutations. The notation is as follows: XNNY means that amino acid X at position NN was substituted for amino acid Y.
Name	Subunit	Mutation	Affinity
Hb Canebiere	Beta	P36A	Decreased
Hb Bologna	Beta	K61M	Decreased
Hb Meulan	Alpha	S84N	Increased
Hb Lyon-Bron	Alpha	V1A	Decreased
HB Alberta	Beta	G146E	Increased

Protein descriptors

We chose to only analyze one alpha-beta dimer of the hemoglobin. In addition, we restricted the physicochemical and structural properties to only include the primary structure. In total, seven groups of descriptors were used, yielding a total of 2874 descriptors for each hemoglobin variant. These are summarized briefly below and for further details we refer to the PROFEAT (Protein Feature) server from which they were obtained [2][3], as well as the references therein. Unless otherwise stated, the standard settings were used when generating the descriptors.

Group 1 - Amino acid composition

The amino acid composition is the fraction of each amino acid over the whole sequence. The descriptor representation is illustrated in fig. 3. The first two symbols, G1, represent group 1 - the amino acid composition. The second number represents a subsequence, of which we used either the alpha or beta subunit. The final number represents the composition of that amino acid index.

Figure 3: The amino acid composition descriptor.

Group 2 - Dipeptide composition

The dipeptide composition is the fraction of each dipeptide over the whole sequence. The descriptor is illustrated in fig. 4. As with the amino acid composition, the second number represents the subsequence and the third the index of the dipeptide.

Figure 4: The dipeptide composition descriptor.

Group 3 - Autocorrelation descriptors

The autocorrelation descriptors correspond to the autocorrelation of certain amino acid derived properties throughout the sequence. The indices used were hydrophobicity, alpha-hydrogen chemical shifts and membrane-buried preference parameters. All indices were scaled to zero mean and unit variance prior to calculations. Three types of descriptors were used, the normalized Moreau Broto-autocorrelation, Moran autocorrelation and Geary autocorrelation with a lag of thirty amino acids.

The descriptor is illustrated in fig. 5. The first number represents the autocorrelation type, that is the Moreau-Broto-, Moran- or Geary autocorrelation. The second index is the property used, e.g. hydrophobicity. The third is the subsequence and the fourth is the lag number.

Figure 5: The autocorrelation descriptor.

Group 4 - Composition, transition, distribution

This group of descriptors are designed to capture the amino acid composition, transition and distribution properties along the sequence. The calculations proceed by an initial three class-encoding of the amino acids based on their physicochemical properties, such as hydrophobicity, polarity and van der Waals volume. Once encoded, the composition descriptors are given by the fraction of each encoded class for a given physicochemical property. Similarly, the transition from one class to another is defined by the fraction of such transitions along the sequence. This property is symmetric, so if doing the calculations from N- to C-terminus, the fraction of class 1 followed by class 2 as well as class 2 followed by class 1 is calculated and summed, yielding one transition descriptor. Finally, the distribution is calculated by taking the global fraction of each class.

The descriptor is illustrated in fig. 6. The first number corresponds to one of the three properties, composition, transition or distribution. The second is the physicochemical property and the third is the class at which the amino acid is placed into.

Figure 6: The composition, transition and distribution descriptor.

Group 5 - Quasi-sequence order descriptors

The sequence order descriptors are derived by treating each amino acid as a point in the physicochemical space spanned by properties such as hydrophobicity, polarity and side-chain volume. The Schneider-Wrede physicochemical distance matrix was used for this purpose and a coupling number of 30 was used in all calculations. Both the sequence order coupling numbers and quasi-sequence order descriptors were calculated.

The descriptor is illustrated in fig. 7. The first number corresponds to the physicochemical distance matrix used. In our case we only used the Schneider-Wrede matrix. The second is the type of order used and the last is the lag number.

Figure 7: The quasi sequence order descriptor.

Group 6 - Amphiphilic pseudo-amino acid composition

These descriptors were designed to capture the amphiphilic (i.e. lipophilic and hydrophilic) distribution along the sequence and is calculated by the following procedure: first, the hydrophobicities and hydrophilicities are normalized to zero mean and unit variance. Then, two corresponding autocorrelation sequences are calculated using the standard and unbiased statistical definition. The autocorrelation quantities are then weighted, yielding the amphiphilic pseudo-amino acid composition. Here, a weighing factor of 0.05 and a tier correlation factor of 30 was used.

The descriptor is illustrated in fig. 8.

Figure 8: The amphiphilic pseudo-amino acid composition..

Group 8 - Total amino acid properties

The total amino acid property is the sum of a certain physicochemical property along the entire sequence. The ones used were hydrophobicity, alpha-hydrogen chemical shifts and membrane-buried preference parameters which are the standard settings provided by the PROFEAT server.

The descriptor is illustrated in fig. 9 and the only index is the property index of which the total amino acid property is calculated.

Software

All analysis was performed using python 3.5. The scipy stack was used for matrix manipulation and visualization and include the modules numpy, scipy, matplotlib, pandas and sympy. To preprocess the data, select features and build the models, we used numpy, pandas and scikit-learn. In addition, the xgboost implementation of gradient boosted classification tree was used.

The algorithms used were the random forest, gradient boosted trees, decision tree classifier, logistic regression, support vector machine and voting classifier, which were implemented in scikit learn and xgboost by the classes RandomForestClassifier, XGBClassifier, DecisionTreeClassifier, LogisticRegression, SVC and VotingClassifier respectively. All algorithms used operated under standard settings unless otherwise stated. For the case of recursive feature elimination, 20 decision tree estimators were used in the random forest.

To generate the descriptors, the PROFEAT (Protein Feature) server was used. This is a server used for generating structural and physicochemical descriptions from proteins based on the primary structure.

Data preprocessing

The functional data obtained was ordinal in the response. That is, the oxygen binding affinity was reported on relative scales (e.g. increased and slightly increased). To make the task of classification simpler and easier to interpret, the problem was reduced into a binary classification where the response variables were either increased or decreased oxygen affinity.

Prior to feature and model selection, the data was subjected to a stratified 80-20 split where the first part was devoted to model selection and the second to model validation. The validation split was set aside for a final validation and the model selection split was then subjected to further 80-20 cross validation splits each time a property was to be estimated.

In addition, all descriptors were scaled to zero mean and unit variance prior to training when the logistic regression or support vector machine was used.

Feature selection

Commonly, there are only a few descriptors that are truly informative in biological structure activity problems [4]. With this, at least three problems may arise, namely having co-linear descriptors, a too high dimensional feature space and loss of interpretability.

When descriptors are co-linear, it means that they carry roughly the same information in a linear sense. This is problematic since it causes a high redundancy within the dataset which may cause a degradation in the parameter estimation. This may also cause loss of parameter significance despite possibly being truly informative [5]. Furthermore, even though two descriptors may be correlated with a target variable from within the dataset, the relation might not be causal and choosing one over the other may affect the final model severely once extrapolated to outside the dataset. That is, high correlation may also hide the true relation between the descriptors and the target variable [4]. Sadly, there is to our knowledge no way of circumventing this issue without the use of insight and expert knowledge as well as rigorous model selection and validation.

Secondly, the use of too many descriptors may also degrade the performance of a predictor. This is generally known as the curse of dimensionality and several methods have been developed to counteract this [6]. One way is to represent the data in a low dimensional manifold by some kind of embedding technique such as principal component analysis or random projections. Another technique is to instead eliminate descriptors based the predictive power they possess, see e.g. [7] for a review. As a rule of thumb, one should generally strive for a 5-to-1 ratio between the number of data points and number of descriptors [4]. For this model, a total of 2874 descriptors were generated from 143 proteins which clearly addresses the importance of proper feature selection.

Finally, having few descriptors may greatly improve the interpretability, allowing the model to be used for more than just making predictions, but for also for obtaining insights and knowledge for further decision making. One could for instance imagine an algorithm to reveal an interesting relationship between the structure and activity of a certain protein, allowing the further development of a molecule to target the activity via the structural or physicochemical information.

We performed the feature selection in a four step procedure as illustrated in fig. 10, where only the model selection set of was used to find informative features. First, all constant and duplicated descriptors were removed. Then, all descriptors suffering from co-linearity were filtered out at a threshold of 0.7 in the correlation coefficient.

Following the correlation-based elimination, the descriptors were subjected to two model-based eliminations. First a univariate selection using a simple decision tree classifier was used where all descriptors with a value below or equal to 0.5 were discarded. Then, a five-fold cross-validated recursive feature elimination using a random forest classifier as discriminator was used to discard the final descriptors (denoted by RF-RFECV in the figure).

Figure 10: The figure illustrates the feature selection pipeline. First, all constant and duplicated features are removed. Then, features which are co-linear are removed. In the final step, a univariate filter is used to remove uninformative descriptors and the RFECV is used at the final step to remove the last non-contributing descriptors.

For the sake of clarity, we end this section by shortly describing how the univariate filter and RF-RFECV works and then motivate why we chose this pipeline.

The univariate filter proceeds by first splitting the dataset into a training and test set. For each individual descriptor, a decision tree classifier is built using the training set. The tree is then used to make predictions on the test set, of which the roc-auc score is calculated. The descriptors are then ranked with respect to their predictive power and discarded if it does not overcome a predetermined threshold. We used a threshold of 0.5.

The RF-RFECV is a more rigorous method for feature elimination. It proceeds by cross-validating a random forest classifier to determine a baseline performance. The descriptor with the smallest Gini importance, which is a measure of the importance of a descriptor, is then discarded. The classifier is then cross-validated again, but without the descriptor. If the performance does not decrease, the descriptor is deemed uninformative and is completely discarded. The descriptor corresponding to the next smallest Gini importance is then removed, and the cross-validation is reiterated. The entire procedure continues until the performance starts to drop, at which the algorithm is terminated.

While RF-RFECV is very simple and does also take variable interactions into account during the elimination, it is very expensive to run for a large number of features. Therefore, the univariate filter was first used to crudely remove all the non-informative descriptors, leaving the RF-RFECV for the final step.

Models

In total, four models were considered - random forests, (RF) gradient boosted trees (GB), support vector machines (SVC) and logistic regressions (LR). In addition, voting classifiers were built where the respective models were grouped for voting. We used the scikit learn implementations for all models except the gradient boosting tree classifier, for this the xgboost implementation was used. In addition, voting classifiers were used to combine the models. A short description of each model is given below and for further details, we refer to [6][8] as well as the documentations for each class and the references therein.

In order to understand what random forest- and gradient boosted tree classifiers are, one needs to understand what a binary decision tree is. A decision tree is simply a tree where a series of yes or no questions are asked, as illustrated in fig. 11. The zeros and ones at each arrow illustrate the classes in the considered binary classification problem and each node represents a yes or no question. The classification is then made based on value at the final node. The depth is the maximal number of nodes when going from the root to a leaf, where the root is the uppermost node and a leaf is a node which does not branch. An important metric for a decision tree is the Gini impurity which is a measure of disorder. The lower it is, the better the classes are separated at a certain node. From this, one can measure the amount of information a descriptor contains by calculating its total contribution to the decrease in Gini impurity across the decision tree.

Figure 11: An example of a decision tree. The diagram was generated by using graphviz after training a decision tree classifier on the iris dataset.

With this, a random forest is simply a set of decision trees which have been trained with a bootstrapped dataset by using a subset of randomly chosen descriptors. The final prediction is an average of all decision trees. This procedure is also known as bagging (bootstrap averaging). A metric that arises from this is the Gini importance, which is an estimate of how informative a feature is. This metric was used to eliminate descriptors in the RF-RFECV. One of the hyperparameters that needs to be chosen is the number of trees in the ensemble. However, this is generally not a difficult choice as one should opt as for many as one can afford.

A gradient-boosted tree works similarly to a random forest, but instead of averaging multiple decision trees, it proceeds by training a constant number of trees. The set of trees are then trained and updated sequentially with the aim of reducing the errors of the previous generation of trees. The process of sequentially updating the model and refining the performance by propagating the residuals is known as boosting. The hyperparameters considered is the number trees in the ensemble, the learning rate and the maximum depth of the trees.

The logistic regression is a linear model which builds a hyperplane in the feature space by linearly combining the descriptors. The method is well understood, easy to interpret and good for benchmarking against other models. It is also a good choice for high dimensional problems. An L2-regularization was used to reduce overfitting.

A support vector machine proceeds by fitting a hyperplane which maximally separates the classes. This can be accomplished in different ways and one typically proceeds by some sort of transformation. We have used the radial basis function kernel for this case, where every sample is mapped to a higher dimensional feature space where inference is made. The parameters to tune is the penalty of the error term and radial basis kernel coefficient which specifies how “wide” the kernel should be.

A voting classifier combines multiple models to make an estimate by some sort of voting scheme. The idea is that a group of individuals will perform better than each individual separately. To make a classification we used soft voting, which is an average of the probability estimated by all models within the ensemble.

To select the hyperparameters, a twenty times repeated five-fold cross-validation iterated over a grid search was performed. Initially, the grid was chosen coarse and the grid search was then manually zoomed in until a satisfactory set of parameters was obtained. The metric used during the grid search was the roc-auc.

Metrics

The metrics used in the model selection and final validation was the accuracy, sensitivity, specificity and roc-auc. The accuracy is simply the fraction of correctly classified proteins. The sensitivity and specificity are defined as $$ \text{Sensitivity = no. true positive / (no. true positive + no. false negative)} $$ $$ \text{Specificity = no. true negative / (no. true negative + no. false positive)} $$

In our case, the positive class corresponds to hemoglobins with high oxygen binding affinity and the negative to the ones with low. The terminology “number of true positives” mean the number of correctly classified proteins with high affinity. The sensitivity is therefore the fraction of true correctly classified high affinity mutants out of all classified high affinity mutants. It is therefore a measure of how good the model is at finding a positive class given that the class is truly positive. The specificity is in our case similarly interpreted as the fraction of all correctly classified low oxygen binding affinity mutants out of all classified low oxygen affinity mutants. The final metric used is the receiver operating characteristic area under the curve (roc-auc). This is defined as the area under the roc-curve, which is a curve that plots the true positive rate against the false positive rate for a certain classifier for a number of thresholds.

The reason for using multiple metrics is because they capture different aspects of the inference made by the model. For instance, while the accuracy gives the ratio of correctly classified samples, it may be misleading and hard to interpret when the dataset is imbalanced. Consider for example a problem where the objective is to detect a disease based on some property and where 99.9% of the patients are healthy and only 0.1% carry it. Obviously one could then easily get an accuracy score of 0.999 by labeling all patients as healthy, but the model would be totally useless. For this, the roc-auc is typically considered a balanced metric which in this case would give a score of 0.5.

Mutant generation and ranking

Conserved regions were heuristically defined in both the alpha and beta subunit as regions spanned by at least five amino acids which were not mutated across all data. The conserved regions were obtained at the positions 11-20, 41-56, 76-82 and 116-132 in the beta chain and 1-28, 28-40, 44-58, 58-80, 97-126 and 130-139 in the alpha chain. All possible single point mutations were then applied, yielding a total of 1795 single point mutation hemoglobin candidates. Descriptors were generated from the PROFEAT server and the mutants were then ranked by their probability of having a decreased oxygen affinity.

The reason for defining conserved regions was to not deviate too much from the dataset used since it is not known how the algorithm behaves once extrapolated. In addition, since a large number of mutants were generated anyway, it might be argued that those are of higher quality since they are, in some sense, closer to the dataset used.

References

[1] Patrinos, G.P., B. Giardine, C. Riemer, W. Miller, D.H.K. Chui, N.P. Anagnou, H. Wajcman, and R.C. Hardison (2004) Improvements in the HbVar database of human hemoglobin variants and thalassemia mutations for population and sequence variation studies. Nucl. Acids Res . 32 Database issue: D537-541.
[2] P. Zhang, L. Tao, X. Zeng, C. Qin, S.Y. Chen, F. Zhu, S.Y. Yang, Z.R. Li, W.P. Chen, Y.Z. Chen. PROFEAT Update: A Protein Features Web Server with Added Facility to Compute Network Descriptors for Studying Omics-Derived Networks. J Mol Biol. pii:S0022-2836(16)30428-4. (2016).
[3] Rao, H., Zhu, F., Yang, G., Li, Z., and Chen, Y. (2011) Update of PROFEAT: a web server for computing structural and physicochemical features of proteins and peptides from amino acid sequence. Nucleic Acids Research 39, W385-W390.
[4] Cronin, M., and Schultz, T. (2003) Pitfalls in QSAR. Journal of Molecular Structure: THEOCHEM 622, 39-51.
[5] Wikipedia contributors. (2018, September 26). Multicollinearity. In Wikipedia, The Free Encyclopedia. Retrieved October 16, 2018, from https://en.wikipedia.org/w/index.php?title=Multicollinearity&oldid=861300471
[6] Hastie, T., Friedman, J., and Tibshirani, R. (2017) The elements of statistical learning. Springer, New York.
[7] Thang, J., Alelyani, S., and Liu, H. Feature Selection for Classification: A Review. (Retrieved October 13, 2018, from https://pdfs.semanticscholar.org/310e/a531640728702fce6c743c1dd680a23d2ef4.pdf
[8] Bishop, C. (2006). Pattern Recognition and Machine Learning. 1st ed. [S.l.]: Springer-Verlag New York.

[1] [1] Patrinos, G.P., B. Giardine, C. Riemer, W. Miller, D.H.K. Chui, N.P. Anagnou, H. Wajcman, and R.C. Hardison (2004) Improvements in the HbVar database of human hemoglobin variants and thalassemia mutations for population and sequence variation studies. Nucl. Acids Res . 32 Database issue: D537-541.

[2] [2] P. Zhang, L. Tao, X. Zeng, C. Qin, S.Y. Chen, F. Zhu, S.Y. Yang, Z.R. Li, W.P. Chen, Y.Z. Chen. PROFEAT Update: A Protein Features Web Server with Added Facility to Compute Network Descriptors for Studying Omics-Derived Networks. J Mol Biol. pii:S0022-2836(16)30428-4. (2016).

[3] [3] Rao, H., Zhu, F., Yang, G., Li, Z., and Chen, Y. (2011) Update of PROFEAT: a web server for computing structural and physicochemical features of proteins and peptides from amino acid sequence. Nucleic Acids Research 39, W385-W390.

[4] [4] Cronin, M., and Schultz, T. (2003) Pitfalls in QSAR. Journal of Molecular Structure: THEOCHEM 622, 39-51.

[5] [5] Wikipedia contributors. (2018, September 26). Multicollinearity. In Wikipedia, The Free Encyclopedia. Retrieved October 16, 2018, from https://en.wikipedia.org/w/index.php?title=Multicollinearity&oldid=861300471

[6] [6] Hastie, T., Friedman, J., and Tibshirani, R. (2017) The elements of statistical learning. Springer, New York.

[7] [7] Thang, J., Alelyani, S., and Liu, H. Feature Selection for Classification: A Review. (Retrieved October 13, 2018, from https://pdfs.semanticscholar.org/310e/a531640728702fce6c743c1dd680a23d2ef4.pdf

[8] [8] Bishop, C. (2006). Pattern Recognition and Machine Learning. 1st ed. [S.l.]: Springer-Verlag New York.

Team:Lund/Model/Hemoglobin/MaterialMethod