Team:ZJU-China/Model

Model
ABSTRACT  

After constructing the hardware and the biochip, we obtain the currents produced by the electrodes, which help us judge the state of the illness. In our first model, we use machine learning to analyze these data. The k-nearest neighbors algorithm (k-NN) is a non-parametric method used for classification and regression. Our goal is to use labeled sample data, which contain three features and one decision value (the class label), to classify samples whose decision value is unknown and so determine whether a sample indicates brain damage.


One essential part of our wet lab experiments is to construct a multi-enzyme system, so we need to ensure that this system works normally during the whole testing procedure. That is the purpose of the second model. We built a mathematical model to carry out a theoretical simulation of the input and output of the enzyme system: we regard the enzyme system as logic gates and construct a kinetic model. The outputs of this theoretical model were compared with what we got from the wet lab experiments to further verify our results and conclusions.


To find the most efficient way to construct the multi-enzyme complex before the experiment, we adopted de novo protein modeling. The models were built with the Robetta server, which relies on distributed computing. After obtaining the spatial conformation of each protein, Z-DOCK was used for protein docking, which gives us a way to observe the effect of the Tag-Catcher system on the spatial conformation of the enzyme complex.




Cluster Analysis

The k-nearest neighbors algorithm (k-NN) is a non-parametric method used for classification and regression.[1] It uses a vector space model in which cases of the same category are highly similar to one another, so the likely category of an unknown case can be evaluated by calculating its similarity to cases of known category.


k-NN is a type of instance-based learning, or lazy learning, where the function is only approximated locally and all computation is deferred until classification.


The training examples are vectors in a multidimensional feature space, each with a class label. The training phase of the algorithm consists only of storing the feature vectors and class labels of the training samples.


In the classification phase, k is a user-defined constant, and an unlabeled vector (a query or test point) is classified by assigning the label which is most frequent among the k training samples nearest to that query point.[2]
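
The classification step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used in this project: the function name `knn_predict`, the Euclidean distance, and the toy data are all hypothetical.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every stored training vector
    distances = [(math.dist(x, query), label) for x, label in zip(train_X, train_y)]
    # Keep the k closest neighbours
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # The most frequent label among the neighbours is the prediction
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical, already-normalised toy data with three features (not our measurements)
train_X = [(0.10, 0.20, 0.10), (0.20, 0.10, 0.20), (0.15, 0.25, 0.10),
           (0.80, 0.70, 0.90), (0.90, 0.80, 0.70), (0.70, 0.90, 0.80)]
train_y = ["no damage", "no damage", "no damage", "damage", "damage", "damage"]

prediction = knn_predict(train_X, train_y, (0.85, 0.75, 0.80), k=3)
print(prediction)  # damage
```

Note that "training" here is nothing more than keeping `train_X` and `train_y` in memory; all the work happens when a query is classified.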

Overview

First, we used the 3-σ principle to detect anomalous data and make corrections. We then performed a box plot test and found that most of the data were within acceptable limits.
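
As a rough sketch of these two checks, the snippet below flags values that lie more than three standard deviations from the mean and computes the usual box plot whisker limits (1.5 times the interquartile range). The function names and the sample values are hypothetical, the 1.5 × IQR whisker rule is an assumption about the box plot convention, and the correction step is not shown.

```python
import statistics

def three_sigma_outliers(values):
    """Return indices of values farther than 3 standard deviations from the mean."""
    mean = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return [i for i, v in enumerate(values) if abs(v - mean) > 3 * sigma]

def boxplot_limits(values):
    """Return the (lower, upper) whisker limits Q1 - 1.5*IQR and Q3 + 1.5*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Hypothetical readings: 20 ordinary values plus one anomalous value at index 20
values = [1.0, 1.1, 0.9, 1.2, 0.8] * 4 + [5.0]

outliers = three_sigma_outliers(values)
lower, upper = boxplot_limits(values)
print(outliers)             # [20]
print(values[20] > upper)   # True
```

The 3-σ rule needs a reasonably large sample to be useful: with only a handful of points, a single extreme value inflates the standard deviation so much that it can never be flagged.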

Fig. 1 Raw data.
Data preprocessing
Fig. 2 Box plot.

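The most important step before applying the k-NN algorithm is to standardize the data, because k-NN relies on distances between feature vectors and a feature with a large numeric range would otherwise dominate the others.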


In our model, the min-max data standardization method is used.


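The input sequence x1, x2, …, xn is transformed as follows:

$$y_i = \frac{x_i - \min_{1 \le j \le n} x_j}{\max_{1 \le j \le n} x_j - \min_{1 \le j \le n} x_j}$$

The resulting new sequence y1, y2, …, yn is dimensionless and lies in the interval [0, 1].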
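
A minimal sketch of min-max standardization for a single feature is given below. The function name `min_max_normalize` and the example numbers are hypothetical, and mapping a constant sequence to zeros is an assumption made only to avoid division by zero.

```python
def min_max_normalize(xs):
    """Rescale a sequence to [0, 1] using its own minimum and maximum."""
    lo, hi = min(xs), max(xs)
    if hi == lo:
        # All values are identical; return zeros instead of dividing by zero
        return [0.0 for _ in xs]
    return [(x - lo) / (hi - lo) for x in xs]

# Hypothetical feature values in arbitrary units
print(min_max_normalize([2.0, 4.0, 6.0]))  # [0.0, 0.5, 1.0]
```

Each feature is normalized separately, so that all three features contribute on the same scale to the distances used by k-NN.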

Data analysis

Our goal is to use labeled sample data, which contain three features and one decision value, to classify samples with an unknown decision value and so determine whether a sample indicates brain damage.


The classification results are as follows:

From the three figures above, we can see that the samples without brain damage are mostly concentrated in the lower left of each plot, while the cases of brain damage are scattered on the right side. Although the separation in our example is not as visually clear as in the standard example datasets available in Python, our final accuracy reached 70%~80%. We therefore have reason to make a preliminary diagnosis based on the three indicators: lactic acid, glucose and lactate dehydrogenase.

Summary

In the k-NN algorithm, "majority vote" classification has a defect when the class distribution is skewed. Samples from the more frequent class tend to dominate the prediction for a test point, because they are more likely to appear in its neighborhood and the label of the test point is computed from the samples in that neighborhood.


In order to overcome this defect, we first ensure that the two classes in our sample are relatively balanced. In the end, the ratio of brain damage to non-brain damage samples is controlled at 3:5, and the number of samples in each class is much larger than the value of k.
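
The effect of a skewed class distribution can be illustrated with a small one-dimensional example. The data, the labels "rare" and "common", and the function name `majority_vote` are all hypothetical and are chosen only to show how a larger k lets the more frequent class outvote the closest points.

```python
from collections import Counter

def majority_vote(points, labels, query, k):
    """Return the most frequent label among the k points nearest to `query`."""
    nearest = sorted(zip(points, labels), key=lambda p: abs(p[0] - query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Two "rare" samples sit right next to the query; six "common" samples lie further away
points = [0.48, 0.52, 0.60, 0.62, 0.64, 0.66, 0.68, 0.70]
labels = ["rare", "rare"] + ["common"] * 6

print(majority_vote(points, labels, 0.5, k=1))  # rare
print(majority_vote(points, labels, 0.5, k=5))  # common
```

With k = 5 the neighborhood contains both "rare" samples and three "common" ones, so the frequent class wins even though the two closest points are "rare". Keeping the classes balanced and k well below the size of each class reduces this effect.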

References

[1]

Altman, N. S. (1992). "An introduction to kernel and nearest-neighbor nonparametric regression". The American Statistician. 46 (3): 175-185. doi:10.1080/00031305.1992.10475879.

[2]

https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm


Enzyme System - Logic-gate Model
Overview

One essential part of our wet lab experiments is to construct a multi-enzyme system. After applying the enzyme system to the electrodes, we need to ensure that it works normally during the whole testing procedure, so that the data measured by the electrodes are accurate and reliable.

We therefore built a mathematical model to carry out a theoretical simulation of the input and output of the enzyme system. First, under our specific conditions, we regard the enzyme system as logic gates. Then we construct a kinetic model, in which we transform the original parameters into values between 0 and 1 to reduce the number of parameters. Finally, the outputs of this theoretical model were compared with what we got from the wet lab experiments to further verify our results and conclusions.

Logic-gate model

According to the research of Reiss et al.[1] and Privman[2], for an enzyme system with multiple inputs, each enzymatic reaction can be treated as equivalent to a logic gate. So, to reduce the complexity of computation, an ordered combination of logic gates was used to imitate the enzyme system. The enzyme system we constructed and its corresponding logic-gate model are shown in Fig. 1 and Fig. 2 respectively.

Fig. 1 The Enzyme System.
Fig. 2 Logic-gate.
Equation deductions

After building the equivalent model, we first considered the reactions without the interference brought by LDH (Fig. 3).

Fig. 3

Based on the Michaelis-Menten equation, our enzymatic reactions can be described using the following equation, in which E denotes the enzyme, U and S represent the substrates, C is the intermediate complex, P is the product, and Ks and Ku are the Michaelis constants of the enzymes.

To reduce the number of parameters, Ks and Ku can be omitted, because the substrates are abundant at the beginning and drive the reactions.


The proportion of enzyme that forms the complex C is a constant, so we have:

Solving the differential equation, we get the following equation:

Meanwhile, according to the properties of enzymatic reactions, the beginning of the reaction is driven by abundant substrates. So, we suppose that U and S have the initial values U0 and S0 at t = 0. Using the same method, we can then get the following equation for P at t = tg.

Since our inputs and output should follow the principles of logic gates, we transform their values into values between 0 and 1.

Here S0_max is the maximum value of the concentration of the input S, reached when its logical value is 1. U0_max and P(tg)_max are defined in the same way.
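
As a small sketch of this rescaling, the snippet below assumes that each quantity is simply divided by its logic-1 reference value, which is the usual convention for enzymatic logic gates but is an assumption here. The function name `to_logic_value` and the numbers are hypothetical.

```python
def to_logic_value(concentration, logic_one_reference):
    """Scale a concentration to the logic range [0, 1] (assumed: simple division)."""
    if logic_one_reference <= 0:
        raise ValueError("reference concentration must be positive")
    return concentration / logic_one_reference

# Hypothetical numbers in arbitrary units (not measured values)
S0, S0_max = 0.5, 2.0
print(to_logic_value(S0, S0_max))      # 0.25
print(to_logic_value(S0_max, S0_max))  # 1.0
```

The same function would be applied to U0 with U0_max and to P(tg) with P(tg)_max, so that logic 0 and logic 1 correspond to 0 and 1 for both inputs and the output.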


So the above equation can be simplified into the following form:

a can be expressed by

And for y2, y3 and y4, we have:

Next, the influence of LDH is taken into account. In our example (Fig. 1), the third such process is biocatalyzed by the enzyme HRP, which competes with the enzyme LDH for its input (NADH). We assume that this process consumes a fraction, F0, of the input NADH. According to the article[2], we can get the complete equation for the output.

Also f and b can be denoted by:

Then we got the final relations between input and output in our logic-gate model.

Conclusions

After establishing our theoretical model, we wrote a Matlab program to do the calculation. The inputs of the model are the two concentration values corresponding to the 0-0 and 0-1 logic inputs. Running the program, we got the results shown in Fig. 4 and Fig. 5.

According to the two figures, when the logic values of the two input concentrations are 0 and 1, or 1 and 0, the value of the output varies from 0.95 to 1. When the logic values of the two inputs are 0 and 0, the value of the output varies from 0.9 to 0.95. So, in the two situations, the logic value of the output falls in two different ranges. Comparing the actual output values of our samples (Fig. 6) with the results of the logic-gate model, we find that in the same situations (0,1 and 0,0) the measured output fluctuates in the ranges (-1.05~-1.00) and (-1.17~-1.05) respectively. The two methods show a similar difference between the two situations, which indicates that the outputs of the 0-0 and 0-1 experiments were within the error limits. Therefore, we can further conclude that the enzyme system worked normally on our electrodes.

Fig. 4
Fig. 5
Fig. 6
References

[1]

Reiss, M.; Heibges, A.; Metzger, J.; Hartmeier, W. Determination of BOD-Values of Starch-Containing Waste Water by a BOD-Biosensor, Biosens. Bioelectron. 1998, 13, 1083-1090.

[2]

Vladimir Privman. Theoretical Modeling Expressions for Networked Enzymatic Signal Processing Steps as Logic Gates Optimized by Filtering. Department of Physics, Clarkson University, Potsdam, NY 13699,2016,11.


De Novo Modeling

Most protein structure prediction algorithms are based on homology modeling and sequence alignment (such as Swiss-Model), which is not suitable for this experiment.[1] Taking Horseradish Peroxidase (HRP) with SnoopTag (N-terminus) and SdyCatcher (C-terminus) as an example, Swiss-Model predicted only the intermediate HRP portion (because the homology between this part and the HRP sequence in the library is extremely high), so a full-length structural prediction from Swiss-Model was not available. Therefore, this experiment used the full-length de novo prediction tool Robetta.


Since the amount of calculation required by de novo modeling was extremely large, the data were transmitted to the cloud server for distributed computing.[2] First, Ginzu analysis was performed, dividing the protein into several putative domains. The algorithm then models each domain separately and finally combines them to obtain the full-length structure.


Returning to Horseradish Peroxidase with SnoopTag and SdyCatcher: Robetta separated the enzyme into two domains (because the full-length sequence was too long to predict directly), and the two structural units were predicted with confidence values of 0.718500 and 0.728900 respectively. The confidence (Conf) was calculated based on PFAM, MSA, and cutpref. The results are shown in Fig. 1B. The prediction was broadly consistent with our assumption, and the spatial structure model was obtained successfully. The remaining two enzymes were modeled on similar principles.

Fig. 1
Protein models constructed by de novo modeling with Robetta. (A) Glucose Oxidase (GOx) modified by the Tag-Catcher system. The SnoopCatcher is shown in blue; it allows GOx to be firmly linked to a protein carrying SnoopTag. (B) Horseradish Peroxidase (HRP) modified by the Tag-Catcher system. The SnoopTag is shown in blue; it can form a bond with SnoopCatcher to combine HRP with GOx. HRP also has an SdyCatcher (red in Fig. 1B) at the other end of the molecule that binds to proteins carrying SdyTag. (C) Lactate dehydrogenase (LDH) modified by the Tag-Catcher system. The protein has an SdyTag (blue) that binds to HRP. It also has a SpyCatcher (red) that gives the multi-enzyme complex an external interface, which can connect to outside molecules through the SpyTag-SpyCatcher interaction.


After the de novo modeling of the proteins, we obtained the pdb files of the three enzymes modified by the Tag/Catcher system. Based on these, we conducted a docking simulation. Protein docking simulates the structure of the multi-enzyme complex under real conditions. By specifying in advance the regions that must bind and the regions that must not bind, the docking time can be reduced significantly. The docking simulation was performed with the Z-DOCK tool, which runs the calculations on a cloud server and sends the results back to the local machine.[3] Fig. 2 below shows the results of the docking simulation.

Fig. 2
Docking results of the multi-enzyme complex by Tag/Catcher ligation. From left to right are LDH, HRP and GOx. The leftmost stretched domain is SpyCatcher, which is the external interface of the multi-enzyme complex.
References

[1]

Baker, D. and A. Sali, Protein Structure Prediction and Structural Genomics. Science, 2001. 294(5540): p. 93-96.

[2]

David E. Kim, Dylan Chivian, David Baker; Protein structure prediction and analysis using the Robetta server, Nucleic Acids Research, Volume 32, Issue suppl_2, 1 July 2004, Pages W526-W531.

[3]

Chen, R. , Li, L. and Weng, Z. (2003), ZDOCK: An initial-stage protein-docking algorithm. Proteins, 52: 80-87. doi:10.1002/prot.10389