Determination of
Plasma device parameter
An orthogonal L18[3]7 test was designed to explore the effect of different parameter combinations of plasma-activated medium (PAM). Eighteen trials encompassing 7 factors (i.e., [T]treatment time, [A]the well size, [F]helium flow rate, [C]number of cells, [U]output voltage, [D1]distance from the tail of the plasma jet to the surface of the medium, [D2]thickness of medium) and 3 levels were conducted (Table 1 & 2). The frequency was fixed at 8.8KHz.
Linear model construction
Thre linear was constructed using R as equation(1):
The dependent variable Y is the measurement of virus amplification after Plasma-treated through orthogonal design, and the other 7 factors are independent variables in this equation.
The full model encompassing all these parameters was constructed by multivariate linear regression. The stepwise removal of each parameter was conducted followed by model feasibility assessment to identify independent parameters without collinearity.
Optimal parameter configuration of PAM identified for triple-negative breast cancer cells
T9 was selected as the optimal experimental configuration, which corresponds to ‘treatment time’ (‘T’) of 3 min, ‘liquid surface area’ (‘A’) of 4.5 cm2, ‘thickness of medium’ (‘D2’) of 0.2 cm, ‘number of cells’ (‘C’) of 1.5×105 cells/mL, ‘output voltage’ (‘U’) of 1.1 kV, ‘distance from the tail of the plasma jet to the surface of the medium’ (‘D1’) of 1 cm and ‘helium flow rate’ (‘F’) of 1.5 SLM, respectively.
Linear model assessment
Outlier test
Outliers were detected according to the student T test of studentized residuals from the outlier test and the Cook's distance from the influence analysis
T test
The Bonferroni corrected p value from the T-test of studentized residuals of the built TN linear model was used to identify the outliers of the trials. The studentized residuals [3] were computed using Equation (2).
where 'SRESI Di', 'ei', 'Syx', 'n' and 'Xi'' each represents studentized residuals, residual, standard error, sample size and ith the variable, respectively.
Influence analysis
Cook's D [3] was used to identify trials with strong influence on the results, as defined in Equation (3).
where 'hi', ‘Di’, ‘n’, ‘k’ each represents the leverage, Cook's D, sample size and the number of variables in the model. If Cook's D is greater than 4 / (n-k-1), it will be recognized as a strong influential trial with statistic significance.
Normality test
To determine whether the trials are well-modeled by a normal distribution, the quantile-quantile plot (QQ-plot) [4] of the standardized data against normal distribution was drawn. The trials were considered to follow the normal distribution if they fell close to the 45 degree line in the plot.
According to the normal QQ plot, the plot comparing residuals and fitted values, the trials fall close to the 45 degree line representing the correlation between standardized residuals and theoretical quantiles, the square root of the standardized residuals are almost randomly distributed across all fitted values. Thus, the optimal TN model satisfies the normality and homoscedasticity assumption of the fitted linear model.
Multicollinearity
To test whether there is a linear association between the variables in the model and all variables are independent, the variance inflation factor,[3] denoted as variance inflation factor (VIF) and defined using Equation (5), was used to test the multicollinearity of the model.
where 's2'' represents variance.
Multicollinearity, defined as the situation where one variable can be linearly predicted from the others with substantial degree of accuracy, was assessed using VIF that increases with collinearity. VIF is defined as VIF=1 / (1-s2), where s2 refers to the variance. It is canonically considered to have the multiple collinearity problems if VIF > 4. [5] The VIF values of ‘T’ ‘A’ ‘D2’ ‘C’ are all close to 1 (i.e., 1.000559, 1.000559, 1.000000 and 1.000000, respectively), suggesting that there is no multiple co-linearity between the four deterministic parameters
The linear model encompassing the four deterministic parameters is:
High titer
We collected a panel of genes responsible for virus multiplication through text mining, retrieved other associated gene by computing correlations using public datasets from the GEO database, and constructed the corresponding network using the fast heuristic algorithm and label propagation algorithm with GENEMANIA.
In particular, the heuristic algorithm was used for calculating a single composite functional association network from multiple data sources based on linear regression, and a label propagation algorithm was used to predict gene functionalities given the composite functional association network.
Fast heuristic algorithm
Each network data source is represented as a weighted interaction network where each pair of genes is assigned an association weight. The weight is either zero (indicating no interaction) or a positive value (reflecting the strength of interaction). The association of a pair of genes in a gene expression dataset can be assigned as the Pearson correlation coefficient of their expression levels across multiple conditions in an experiment. The more likely that the genes are co-expressed, the higher the weight is, which ranges from -1 to 1.
Both binary and continuous values can be used for building functional association networks. In the case of binary data, all zeros are replaced with log (1 - β ) and ones replaced with -log(β), where β is the proportion of samples with the given feature being 1. This allows for the emphasis of similarities between genes that share 'uncommon' features.
Similarity matrices were constructed for both types of data using the Pearson correlation coefficient to measure pair-wise similarities.
Label propagation algorithm
A variation of the Gaussian field label propagation algorithm was used here to predict the composite network.
Label propagation algorithm, like most functionality prediction algorithms, assigns a score to each node in the network, called the 'discriminant value'. This score reflects the computed degree of association that the node has to the seed list defining the given function. This value can be thresholded to enable predictions of a given gene function.
WeA positive weight reflecting its usefulness in predicting a given function of interest.is assigned to each functional association network derived from these data sources. with a positive weight reflecting its usefulness in predicting a given function of interest. Once these weights were calculated, the weighted average of the association networks was constructed into a function-specific association network.
Denote the vector of discriminant values by f, the bias vector by y, and the matrix derived from the association network by W. We can represent an association network over n genes by a symmetric matrix W whose non-zero entries indicate the associations in the network. In particular,(i,j)th the element of W, Wij, is the association between genes i and j,Wij with = 0 indicating no edge between genes i and j. To ensure that all associations are non-negative, any negative associations can be set to zero. There is l labeled genes and u unlabeled genes (n = l + u) for each binary classification task. These labels are used to specify a bias vector y, where y {+1, k, -1}, to represent that gene i is positive, unlabeled, or negative, respectively. In the label propagation algorithm:
where n+ and n- are the numbers of positive and negative genes, respectively. The discriminant values were computed by solving the following objective function:
which ensures that the discriminant values of positive and negative genes remain close to their label bias (first term in the summation) and the discriminant values of the associated genes (genes that have positive Wij) are not too different from each other (second term in the summation). Equation 1 can be written in matrix notation as:
where L = D - W is called the graph Laplacian matrix and D = diag(di) (D is a diagonal matrix with = Dij and Dij = 0 if i j) and di =. Since the association matrix is symmetric, L is symmetric and semi-definite positive, and equation 2 is a quadratic optimization problem with a global minimum. In fact, solutions to equation 2 can be obtained by solving a sparse linear system y = (I - L)f.
Suspension Culture
Frist, we sequenced the transcriptome of suspension and adherent cell lines of BHK-21 and CHO-K1, and then aligned their reads to the reference transcripts. The reference of BHK-21 is MesAur1.0 () and that of CHO-K1 is CriGri_1.0 (ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/223/135/GCF_000223135.1_CriGri_1.0/GCF_000223135.1_CriGri_1.0_rna.fna.gz).
Second, through transcriptome analysis, we eliminated the low-quality genes using R and quantified the expression using FPKM (fragments Per kb per Million), and genes with FPKM < 0.5 were removed. Also, a threshold, FDR (false discovery rate) ≤ 0.05 and FC (fold change) ≥ 1, was defined to further filter the remaining genes to find DEGs (different expressed genes), i.e. BHK_DEGs (4916) and CHO_DEGs (3597).
We analysed genes differentially expressed between suspended and adherent cells to explore genes potentially responsible for the suspension feature of cells. We eventually obtained two sets of DEGs, one from BHK and one from CHO, namely BHK_sus_muts and CHO_sus_muts.
Further, we obtained the sus_muts gene set by analyzing their SNPs and mutations where genes with potential causal genetic changes were reserved. By intersecting each of the two DEGs set with sus_muts set respectively, we get BHK_keys and CHO_keys. Through taking the intersection of these two sets, we obtained 27 genes.
After checking and reserving genes with consistent regulatory directions in suspended vs. adherent cells, we obtained 18 genes as our target in the end. The network was constructed using the same algorithm as above as Figure XX↓, where the top gene was selected for genetic modulation in the experiments.
Reference
1. Montojo, J., et al., GeneMANIA: Fast gene network construction and function prediction for Cytoscape. F1000Res, 2014. 3: p. 153.
2. Team, R.C., R: A Language and Environment for Statistical Computing. 2018