Team:DTU-Denmark/ModellingTheDesign

Modeling the Design

In order to create a successful brick, we needed to try out different combinations of growth medium and preparation of the bricks to gain the best compressive strength. An important factor when designing experiments is to decide what to test and which combinations are needed. Figuring this out without any help is pretty much impossible, which is why we approach it by modeling the experimental design. By doing this we can reduce the number of experiments significantly.

Two experimental designs were successfully created and carried out utilizing the optimal design approach for the compressive strength experiments. The designs were a replicated design with 17 trials, with four replicates, and an unreplicated full factorial design with 64 trials.

The data obtained in these experiments, lead to the creation of multiple statistical models. The aim of the models are to link the factors of baking time, baking temperature, substrate, incubation time, mixing ratio and density to compressive strength. The models enable us to predict the optimal processing of the fungal bricks, ensuring that the best possible compressive strength can be achieved.

To identify the best models, the following steps are carried out:

  1. Explorative and a descriptive analysis of the data. This was done at first to identify if any immediate trends can be discovered, which have to be taken into account in the modeling process.
  2. The actual modeling process, where different types of models are fitted and considered based on the explorative analysis.
  3. Modeling selection process, where the models are compared to determine which model describes the data best, with regards to parameter choice, likelihood and so on.
  4. Post hoc analysis to quantify uncertainties of the parameters, prediction uncertainty and contrast comparison.

With these four steps the optimal statistical model can be identified by linking compression strength to the previously mentioned parameters.

Explorative and descriptive analysis of the data

Summary stats

The unreplicated pilot experiment consisted of the load response variable and X factors. These can be seen from the table below:

Level Substrate Baking time Baking temperature Incubation time
1 s1: (jasmin rice) -1 (12 hours) -1 ($75^{\circ}$ ) -1 (1 week)
2 s2: (jasmin rice + saw dust) 1 (16 hours) 1,($100 ^{\circ}$ ) 1 (1.5 week)
3 s3: (coffee) - - -
5 s5 (blended rice + saw dust) - - -
6 s6 (coffee + saw dust) - - -
7 s7 blended rice - - -

Table 1: The table shows all the factor levels of the pilot experiment

The replicated experiment consisted of 5 factors, 1 random effect and 1 response variable. These are listed in the table below.

level substrate baking time baking temperature incubation time mixing ratio
1 s1 (jasmin rice) -1 (12 hours) -1 (75) -1 (1 week) mix1 1:0 (rice:saw dust)
2 s2 (blended rice) 1 (16 hours) 1 (100) 1 (1.5 week) mix2 50:50(rice:saw dust)
3
coffee
- - - mix3 75:25(rice:saw dust)

Table 2: The table shows all the factor levels of the replicated experiment.

Correlation matrix

To check the actual orthogonality of our data, provided by our design, a factor correlation matrix was calculated.

s1 s2 baking time baking temp. incubation time
s1 1.00 -1.00 0.22 0.22 0.07
s2 -1.00 1.00 0.22 0.22 -0.07
baking time -0.22 0.22 1.00 0.27 -0.03
baking temp. -0.22 0.22 0.27 1.00 -0.03
incubation time 0.07 -0.07 -0.03 0.03 1.00

Table 3: From the correlation matrix it is seen that outside each factor level, very little confounding is present. Thus the actual design fulfills the purpose of little confounding between each factor.

Boxplots

To get an idea of which different levels of the fixed effects might have an influence on the material compressive strength, a set of box plots for the main effects was created to identify if any immediate visual difference would occur. These are shown below:

Fig. 1: The boxplot data is from the replicated experiment. Seemlingly there is no difference between the two levels of baking time

Fig. 2: The boxplot data is from the replicated experiment. It seems the higher load is achived from low incubation time

Fig. 3: The boxplot data is from the replicated experiment. It seems the higher baking temperature is producing better compressive strength

Fig. 4: The boxplot data is from the replicated experiment. it seems as if there is no difference between the differing mixing ratios

Fig. 5: The boxplot data is from the replicated experiment. it seems as if the s1 substrate gives cubes with better load carrying capabilities

The overall analysis from boxplots is the following: There is seemingly little difference between the baking times nor is there any difference between any of the mixing ratios. The factors which may differ are the baking temperatures, incubation times and substrates. To see if there actually is a significant difference see the post hoc analysis (link til post hoc).

Modeling formulation, fitting of model and reduction of model

During this section multiple models are considered. The models included fixed effect, general and generalized random effect models. These are listed in the table below:

Model number Fixed effect models General random effect models Generalized random effect models
1 General gaussian linear fixed effect Compound symmetry random effect model Gamma model with gaussian second stage model
2 Generalized gamma fixed effect model - -

The best model would end up being chosen based on the BIC and AIC likelihood criterion where the lowest numbers would decide which model is the best. This is then further compared to the goodness of fit. Note that all the less successful models along with their diagnostics can be seen in the modeling appendix.

First the general model reduction and diagnostic workflow is presented. The method of forward selection was utilized. This is started by creating a simple main effect model and gradually building up with higher order terms. The reduction of the model is done one step at a time utilizing the partial partition ANOVA (type II ANOVA test) for the main effects, and afterwards utilizing a likelihood ratio test along with AIC selection for the higher order terms. These was chosen since neither of the methods relies on the design being completely balanced (1). The methods are shown in detail in the likelihood likelihood appendix.

Fixed gaussian effect model formulation, reduction and diagnostic

The fixed gaussian model can be formulated as previously stated in the DOE section, in the following general matrix nation: \begin{equation} \textbf{Y}=X\boldsymbol{\beta}+\boldsymbol{\varepsilon},\\ \boldsymbol{\varepsilon} \sim N(0,\sigma^{2}\textbf{I}) \end{equation} For the response variable $\textbf{Y}$, it is assumed that the $\textbf{Y} \sim N(\mu,\sigma^{2})$.
The design matrix of the experiments is denoted by $\textbf{X}$ and $\beta$ is a vector of regression coefficients spanning between $\beta_{0}...\beta_{n}$. The $\varepsilon$ is the random residual component which is assumed to be i.i.d and follow a normal distribution with an equally weighted $\sigma^{2}$ around a mean of 0.

The full main effect models of the replicated and unreplicated fixed gaussian linear models can be seen below:

Pilot model

\begin{equation} Y_{i}=\mu+\alpha(bake.time_{i})+\beta(bake.temp_{i})\\+\gamma(mix_{i})+\delta(incu.time_{i})+\zeta(sub_{i})\\+eta(W_{i})+ \varepsilon_{i} \ ,where \ \varepsilon_{i} \sim N(0,\sigma^{2}) \end{equation} The model is reduced via the type 2 ANOVA as shown below:

Type 2 ANOVA table
Factor $\chi^{2}$ Degrees of freedom P(>$\chi^{2}$)
$\zeta(sub_{i})$ 9.48 5 0.06
$\delta(incu.time_{i})$ 7.96 1 <0.01

The pilot model is now fully reduced and has no need of higher order terms since this only had the purpose of screening for significant levels of main effects (2). The reason as to why the substrate is kept in the model is due to the fact that this was the main factor of interest (see more in post hoc section). The replicated model of this type can be seen in the appendix.

Gamma model with gaussian second stage model

The next model is more complicated than the previous models (lnk to the model appendix) as no explicit likelihood solution exists for it (2). In general the models of this type can be formulated as: \begin{equation} \boldsymbol{y} = X \boldsymbol{\beta} + Z \boldsymbol{u} + \boldsymbol{\epsilon} \end{equation} where X is the design matrix of the fixed effects, and the Z is the design matrix of the random effects. Where \boldsymbol{y} follows a Gamma distribution and the \boldsymbol{u} follows a gaussian distribution.
The coefficients are stored in the $\boldsymbol{\beta}$ for the fixed effect and the \boldsymbol{u} contains the random effect coefficient. The explicit likelihood formulation and Laplace approximation solution to this type of the model is seen in the likelihood appendix.

Using the same link function (log link) as presented in the gamma model (link to the model appendix), the main effect model of this type can be written as: \begin{equation} \eta_{i}=log(\mu_{i})\\ where\ \mu_{i}=\beta x_{i}+\Psi_{i} \end{equation} This translates into: \begin{equation} log(Y_{i})=\mu+\alpha(bake.time_{i})+\beta(bake.temp_{i})\\+\gamma(mix_{i})+\delta(incu.time_{i})+\zeta(sub_{i})\\+eta(W_{i})+ A(id_{i})+ \varepsilon_{i} \\ where \ \varepsilon_{i} \sim G(\alpha,\beta) \ and\\ \ id_{i} \sim N(0,\Psi^{2}) \end{equation} The response variable of the model is:

  • $Y_{i}$ refers to the load measured in pascal for the i’th fungal cube

The fixed factor effects of the model are:

  • $\mu$ is the overall mean
  • $\alpha$ is the term describing the different baking times at the level of i=1..2
  • $\beta$ is the term describing the different baking temperatures at the level of i=1..2
  • $\gamma$ is the term describing the different mixing ratios at the level of i=1..3
  • $\delta$ is the term describing the different incubation times at the level of i=1..2
  • $\zeta$ is the term describing the different substrates at the level of i=1..2

Each term is to be understood as deviations from the overall mean

The numerical effect of the model is

  • $\eta$ which is the weight of each cube $W_{i}$ corresponding to the $Y_{i}$ measurement

The random effects of the model are:

  • $A(id_{i})$ which is the individual cubes tested on each tray.

From this the following reduced model is found via LRT testing, resulting in the following:

LRT testing of fixed effects
Factor Degrees of freedom AIC LRT p(>$\chi^{2}$)
$\delta(incu.time_{i})$ 1 459.24 15.93 <0.01
$\beta(baking temp_{i})$ 1 447.79 4.49 0.03
$\zeta(sub_{i})$ 1 449.82 6.52 0.01
LRT testing of random effects
Factor Degrees of freedom AIC LRT p(>$\chi^{2}$)
Random effect removed 5 459.85 - -
$A(id_{i})$ 6 456.01 7.62 <0.01

It seems the random cube to cube variation does contribute to the model as the LRT is significant and the AIC goes up if the term is removed. Thus the following model is identified: \begin{equation} Y_{i}=\mu+\beta(bake.temp_{i})+\delta(incu.time_{i})+\zeta(sub_{i})+ A(id_{i})+ \varepsilon_{i} \\ where \ \varepsilon_{i} \sim G(\alpha,\beta) \ and \ id_{i} \sim N(0,\Psi^{2}) \end{equation}

Inserting the 2 factor interactions resulted in a final model where no interaction terms were significant both for random and fixed terms. So, the final model of the gamma mixed model is the one presented above.

The residual analysis of the model is carried out the same way as the gaussian mixed model.

Fig. 6: This shows the residuals of the model vs the fitted values. no dependence is left in the model as only random patterns can be seen.

Fig. 7: This shows the variance of the predictions of the model. As can be seen the varriance is not constant and thus the assumption of constant variance is not fulfilled.

Fig. 8: The figure to the left shows the QQ-plot of the random effects. these seems to follow a normal distribution so the second stage of guassian PDF seems reasonable. As for the residuals they too seems to follow a normal distribution

Fig. 9: Seemingly no residuals are auto correlated nor partially auto correlated.So, the assumption of independence is fulfilled

Fig. 10: Seemingly no residuals are auto correlated nor partially auto correlated. So, the assumption of independence is fulfilled

From the model it does not seem as if no autocorrelation is present in residuals or random effects, thus the assumption of independence is fulfilled. The normality assumption seems to hold for the random effect, but could be better for the residuals. The residual vs fitted shows that the model reasonably models out most variation in the data, but the variance is not constant as it is seen for the higher predicted values and the higher variance.

Model comparison/selection

As mentioned in the beginning, to choose the best models several parameters comes into play. The direct comparison via the AIC, BIC and goodness of fit the table is shown here:

Model AIC BIC Goodness of fit
gaussian fixed effect 487.2746 496.1956 0 %
gamma fixed effect 446.2626 456.9677 100 %
gaussian mixed effect 479.5087 490.2138 0 %
gamma mixed effect 441.3038 452.009 100 %

From the table is it seen that the gamma mixed effect model is the better choice due to the goodness of fit of the data and the lower BIC and AIC. With the model choice in place, it is now time to do the post hoc analysis of the gamma mixed model.

Post hoc analysis

pilot model

From this model the goal was to find the optimal substrate in regards to optimizing maximum load. From the model the following least squared means was found and the contrast difference of each factor was identified via Tukey’s test. This is shown in the table below:

Coefficients Estimates Std.error 2.5% 97.5%
s1 -2.76 29.76 -61.10 55.59
s2 14.44 27.03 -38.54 67.42
s3 94.41 29.89 35.82 152.99
s5 141.31 50.93 41.49 241.14
s6 22.71 50.93 -77.11 122.53
s7 94.46 50.93 -5.36 194.29
Coefficients Difference 2.5% 97.5% p-value
s2-s1 34.48 -34.65 103.60 0.66
s3-s1 41.74 -29.90 113.39 0.50
s5-s1 11.98 5.71 218.25 0.03
s6-s1 -6.61 -112.89 99.65 0.99
s7-s1 65.13 -41.14 161.40 0.44
s3-s2 7.26 -58.04 72.56 0.99
s5-s2 77.50 -24.59 179.60 0.22
s6-s2 -41.09 -143.19 61.00 0.82
s7-s2 30.66 -71.44 132.76 0.94
s5-s3 70.24 -33.58 174.07 0.33
s6-s3 -48.36 -152.18 55.47 0.72
s7-s3 23.39 -80.44 127.22 0.98
s6-s5 -118.59 -248.75 11.55 0.09
s7-s5 -46.85 -177.00 83.30 0.88
s7-s6 71.75 -58.4 201.9 0.56

Fig. 11: From this it can be seen that there is no difference between the s1, s3 and s6 substrate. The highest load is achieved from the s5 and s7. Data revealed that the s2 contained less stable bricks than the results of s7 bricks. that is s2 caused more “0 load problems”, so s7 was chosen over s2 for the replicated design. This is also evident as s2 overlaps a larger uncertainty interval as seen from the estimated coefficients. From the contrast only the s5 and s1 seems to differ significantly from one another. However, since most of the substrates produced “0 load samples”, most of the time, most substrates were ruled out.

Replicated model

Each fixed parameter in the log domain of the gamma mixed model, is shown in the table below and their confidence interval is calculated from the Wald approximation (2). To identify which internal levels of parameters were differing, the contrast difference comparison via a least significant distance test is employed :

Substrate: Ls-means and contrasts
Coefficients estimates Std.error 2.5% 97.5%
$\zeta(sub_{1})$ 5.18 0.07 5.04 5.33
$\zeta(sub_{i})$ 4.88 0.07 4.74 5.02
Contrast Estimates Std.error 2.5% 97.5% p-value
s1-s2 0.306 0.10 0.11 0.51 <0.01
Baking temperature: Ls-means and contrasts
Coefficients estimates Std.error 2.5% 97.5%
$\beta(bake.time_{1})$ 5.18 0.07 5.04 5.33
$\beta(bake.time_{2})$ 4.88 0.07 4.74 5.02
Contrast Estimates Std.error 2.5% 97.5% p-value
b.temp1-b.temp2 -0.24 0.10 -0.44 0.04 0.02
Incubation time: Ls-means and contrasts
Coefficients estimates Std.error 2.5% 97.5%
$\delta(incu.time_{1})$ 5.34 0.06 5.22 5.47
$\delta(incu.time_{2})$ 4.72 0.08 4.56 4.88
Contrast Estimates Std.error 2.5% 97.5% p-value
I.time1-I.time2 0.62 0.10 4.17 0.82 <0.01

Table 4: Maybe I misread what you meant with above table txt.

(above table txt): From the above contrasts tables it can be seen that, the substrate factor difference showed that the s1 substrate gave a higher overall load yield. Furthermore, it was seen that the higher baking temperature gave a higher load. Finally it was seen that the short incubation time resulted in a higher load.
So, to gain the most optimal cube, which can handle the highest load, you should choose s1 (blended rice), high baking temperature $(100 ^{\circ}$ celsius) and short incubation time of 1 week .

Another interesting result is the low standard error of each parameter from the. This shows that the D-optimal design has run its course ensuring high precision of the parameters.

Now that the optimal parameters have been identified the next step is to find the prediction uncertainties of the Gamma mixed model. If a low average uncertainty is obtained, the A-optimal design has done its job. To showcase this the model vs data is presented below:

Fig. 12: The final model shows a reasonable prediction of each factor level combination. As can be seen very few points go beyond the prediction limits of the model and the model only have an average prediction error of 16.85. The factor intervals are shown in the table below.

Intervals Baking temperature Incubation time Substrate Avg. prediction Avg.prediction error
1:4 -1 -1 s1 247.08 38.11
5:8 -1 -1 s2 155.44 12.37
9:12 1 -1 s1 336.87 29.12
13:16 1 -1 s2 193.63 15.42
17:20 -1 1 s1 117.12 9.39
21:24 -1 1 s2 91.12 7.44
25:28 1 1 s1 126.76 11.30
29:32 1 1 s2 115.98 9.33
33:36 -1 -1 s1 175.36 16.30
37:37 -1 -1 s2 159.84 19.75
38:44 1 -1 s2 184.82 11.80

As can be seen there is a low prediction error of each average prediction. The average prediction error of 16.85 is much lower than each prediction. Thus, this is very low and the A-optimal criteria is fulfilled.

Discussion and conclusion

From the modeling, it was shown that the best model for handling the compressive strength data was the gamma mixed model. From this model, it was identified that the fixed parameters of substrate choice, baking temperature and incubation time were significant. It was also shown that the random component of cube to cube variation had a significant impact. For the higher order terms, no interactions had an impact. As for the internal contrast differences, it was found that the optimal conditions of each variable were: s1 (blended jasmine rice), high baking temperature $(100 ^{\circ}$celsius) and low incubation time (1 week).
The uncertainty of the estimations of each coefficient in the final model have a reasonable standard error, meaning the D-optimal criteria was effective. Likewise no extreme observations was present, so the G-optimal criteria was effective. Finally, the prediction uncertainties of the Gamma model shows low prediction errors and most data fits the model, so the mixed Gamma model is the better choice and a reasonable model to show the relation between process factors and compressive strength.

(1) (1) Montgomery DC. n.d.. Design and analysis of experiments. Retrieved from https://www.wiley.com/en-us/Design+and+Analysis+of+Experiments%2C+9th+Edition-p-9781119113478

(2) Madsen, Henric and Thyregod, P. (2012). Introduction to General and Generalized Linear Models. https://doi.org/10.1111/j.1541-0420.2011.01740.x