Team:NUS Singapore-Sci/enkf

NUS Singapore Science: InterLab

Ensemble
Kalman Filter

Overview
It is easier to measure the intracellular concentrations of reporter proteins (mCherry, GFP-mCherry) than to measure the concentrations of substrate and product mRNA. A good kinetic model that produces similar reporter protein concentrations might be able to infer these hard-to-measure mRNA concentrations. However, if the parameters used in the kinetic model (e.g., mRNA doubling time) are wrong, the outcomes of the model cannot be trusted. This limits the ability of using kinetic models to infer the concentrations of substrate and product mRNA.

To measure the uncertainties in the model outcomes due to uncertainties in the parameters, it is common to run an ensemble of models, with different sets of parameters. These parameter sets are usually drawn at random from a normal distribution with a spread that represents the uncertainty in the parameters. The ensemble of model outcomes can then be used to generate concentration estimates (mean concentrations) and the uncertainties (standard deviation).

What if we could somehow use laboratory-measured reporter protein concentrations to improve the accuracy and precision of the estimate for all of the concentrations? One such method is the Ensemble Kalman Filter (EnKF). The EnKF broadcasts information from the laboratory measurements to all model concentrations in the ensemble through the statistical connections between all concentrations and the concentrations of the reporter proteins in the ensemble. In other words, there is a potential to use the EnKF to infer hard-to-measure mRNA concentrations!

In this project, we explored this potential of the EnKF using simulated laboratory measurements of mCherry and GFP-mCherry concentrations. Here, we will describe the setup of a simple model of the RESCUE system and outline the EnKF algorithm we used. Following that, the results of applying the EnKF will be described and discussed.

If the reader is interested in a more detailed explanation of the EnKF, see Appendix A.
Simple model of RESCUE
The RESCUE system is made up of 6 components:
  1. substrate mRNA (S).
  2. product mRNA (P),
  3. mRNA editing enzyme (E),
  4. enzyme-substrate mRNA (ES),
  5. GFP-mCherry protein that is generated from S (GC), and,
  6. GFP protein that is generated from P (G).

In this project, we set up a simple model that simulates the intracellular evolution of the concentrations of the RESCUE system components. These processes are:
  1. equilibrium binding of E to S,
  2. production of P from ES,
  3. production and decay of S,
  4. production and decay of P,
  5. production of GC from S and decay of GC, and,
  6. production of C from P and the decay of C.

In the model, the equations that govern these processes and their parameters can be found in Appendix B.

After some tuning of the parameters, we managed to achieve equilibrium concentrations of G and GC on the order of 1000s of protein molecules per cells (Figure 1). These values are consistent with a result from the wet laboratory: they experienced sensor oversaturation when they tried to measure fluorescence in the cells.


Figure 1. Model components after tuning. Simulated laboratory measurements will also be drawn from this model run later in the EnKF experiment.
EnKF algorithm
We used the stochastic filter form of the EnKF (Evensen, 1994). The procedure is as follows:
  1. 1. Generate an ensemble of model outcomes.


  2. 2. Compute the prior ensember average \(<\boldsymbol{\psi_{b}}>\) (aka prior estimate), prior ensemble perturbations and the prior ensemble covariance \(\boldsymbol{B}\) via:
    $$<\boldsymbol{\psi_{b}}> = \frac{1}{N_{e}}\sum_{e=1}^{N_{e}}\psi_{b,e}, $$
    $$\boldsymbol{\psi_{b,e}^{'}} = \boldsymbol{\psi_{b,e}} - <\boldsymbol{\psi_{b}}> $$
    $$\boldsymbol{B} = \frac{1}{N_{e}-1} \sum_{e=1}^{N_{e}} \boldsymbol{\psi_{b,e}^{'}} \cdot \boldsymbol{\psi_{b,e}^{'T}}$$
    where \(N_{e}\) is the number of ensemble members, \(\boldsymbol{\psi_{b}}\) is the e-th prior ensemble member concentrations and \(\boldsymbol{\psi_{b}^{'}}\) is the perturbation of the e-th prior ensemble member from the ensemble mean.


  3. 3. Compute the Kalman gain matrix:
    $$\boldsymbol{K} = \boldsymbol{BH^{T}(HBH^{T}+R)^{-1}} $$


  4. 4.Compute the posterior ensemble mean via:
    $$<\boldsymbol{\psi_{a}}> = <\boldsymbol{\psi_{b}}> + \boldsymbol{K[y-H<\psi_{b}>]}$$
    This is the posterior estimate of the EnKF.


  5. 5. Generate the posterior ensemble by first computing the posterior perturbation ensemble:
    $$\boldsymbol{\psi_{a,e}^{'}} = \boldsymbol{\psi_{b,e}^{'}} + \boldsymbol{K[y-H\psi_{b,e}^{'}]} $$
    where \(y_{e}^{'}\) is the e-th random draw from the normal distribution \(N(\boldsymbol{0},\boldsymbol{R})\). Then, add the perturbations to the posterior ensemble mean to obtain the posterior ensemble:
    $$ \boldsymbol{\psi_{a,e}} = \boldsymbol{\psi_{a,e}^{'}} + <\boldsymbol{\psi_{a}}> $$
    where \( \boldsymbol{\psi_{a,e}} \) is the e-th posterior ensemble member.

Setup of EnKF experiment
To simulate the scenario of uncertain and erroneous parameters, we will use the “erroneous” parameters that are 80% in value of the simulated truth parameters. The large uncertainty of this parameter is represented by defining standard deviations that are 30% of each parameter. 20000 sets of parameters were then randomly drawn from a multivariate normal distribution with centered on the “erroneous” parameters, with the aforementioned standard deviations, and all cross-parameter covariances set to zero. These 20000 model runs were then integrated in time until all of the concentrations have reached static equilibrium. The resulting ensemble of 20000 sets of concentrations form the prior ensemble. We note here that 20000 ensemble members were only utilized to generate a smooth histogram. In reality, as little as 20 ensemble members would result in similar results.

The simulated measurements are equilibrium concentrations of GFP and GFP-mCherry. To mimic measurement errors, perturbation concentrations of GFP and GFP-mCherry were drawn from a normal distribution with zero mean and standard deviation that is 1% of the equilibrium concentrations of the simulated truth (Figure 1). These perturbations were respectively applied to the simulated truth’s equilibrium concentrations of GFP and GFP-mCherry. The result is the simulated measurements.

The EnKF algorithm used here is outlined in the previous section. Here, the simulated measurements of equilibrium GFP and mCherry-GFP concentrations were used. R is a diagonal measurement error covariance, with the squares of the standard deviations used to generate measurement perturbations as the diagonal terms.
Results of EnKF Experiment
Table 1. Comparison of the absolute errors of the prior and posterior estimated concentrations
Species Absolute prior estimate error w.r.t. truth Absolute posterior estimate error w.r.t. truth Absolute ratio of prior/posterior errors
Enzyme-substrate 2.5% 1.0% 2.5
Free enzyme 8.4% 3.4% 2.5
Substrate mRNA 11% 1.3% 8.6
Product mRNA 16% 1.1% 14
mCherry 28% 0.61% 46
GFP-mCherry 21% 0.37% 57
EnKF estimate is at least 50% closer to the truth than prior estimate, and more precise

Figures 2 and 3 shows the 2D histograms of the prior ensemble concentrations with respect to mCherry and GFP-mCherry. From Figures 2 and 3, we notice that the center locations of the posterior ensemble clusters are closer to the truth than that of the prior ensemble. This indicates that the EnKF is able to generate estimates for all concentrations that are better than those of the prior ensemble. The errors of the estimates from the prior and posterior ensembles with respect to the magnitude of the truth are displayed in Table 1.

As can be seen from Table 1, there is a dramatic reduction of ensemble-estimated errors after applying the EnKF for all quantities. The improvements are the most dramatic for the ensemble-averaged mCherry and GFP-mCherry as those quantities were measured. While the improvements of the estimates of the mRNA and enzyme concentrations were not as dramatic, those improvements are significant. The magnitudes of the errors are clearly reduced by 50% or more with the EnKF.


Figure 2. Histograms showing the prior ensemble (left) and posterior ensemble (right) equilibrium concentrations (left) with respect to mCherry concentrations. The red dot indicates the equilibrium concentrations from the simulated truth. Note that the x-axis and y-axis are the same across panels.


Figure 3. Histograms showing the prior ensemble (left) and posterior ensemble (right) equilibrium concentrations (left) with respect to GFP-mCherry concentrations. The red dot indicates the equilibrium concentrations from the simulated truth. Note that the x-axis and y-axis are the same across panels.
It is also interesting to note that the improvements for the mRNA estimates are the most dramatic after those of mCherry and GFP-mCherry. This is because the concentrations of the substrate and product mRNA most directly impact the concentrations of the mCherry and GFP-mCherry. As such, there should be very strong statistical connections between substrate and product mRNA, and the concentrations of the mCherry and GFP-mCherry. These strong connections explain why the EnKF is very effective at updating these hidden quantities.

The estimates of enzyme-substrate and free enzyme received the least improvement. This is not surprising, these things only impact the concentrations of mCherry and GFP-mCherry indirectly through the substrate and product mRNA. As such, a weaker statistical connection is to be expected, resulting in a weaker improvement in these quantities.
Summary
In summary, we constructed a simple model of the RESCUE system that simulates the evolution of intracellular substrate mRNA, product mRNA, mRNA editing enzyme, enzyme-substrate mRNA, GFP-mCherry protein that is generated from the substrate mRNA, and, GFP protein that is generated from the product mRNA. We tested the potential of using the EnKF to infer the hard-to-measure mRNA concentrations using simulated laboratory measurements of GFP and GFP-mCherry. It was found that with the EnKF, the ensemble estimated concentrations of both mRNA species improved dramatically.
Appendix
References
Evensen, G. (1994). Sequential data assimilation with a nonlinear quasi-geostrophic model using Monte Carlo methods to forecast error statistics. Journal of Geophysical Research, 99(C5), 10143–10162. https://doi.org/10.1029/94JC00572

Kalnay, E. (2003). Introduction to least squares methods. In Atmospheric Modelling, Data Assimilation and Predictability (pp. 142–149). New York, NY: Cambridge University Press.