Univariate analysis indicates whether each metabolite changes significantly between experimental groups, along with the magnitude and direction of the change. Multivariate analysis, in contrast, shows whether the metabolites taken together can differentiate the samples and which of them contribute most to the classification. Using univariate and multivariate methods in combination enables a more comprehensive interpretation of non-targeted metabolomics data.
Univariate Analysis in Bioinformatics
Significance Testing
Significance testing, a form of hypothesis testing, is the most commonly used univariate analysis method. It detects differences between experimental and control groups and assesses whether those differences are statistically significant. Depending on the distributional characteristics of the data, testing methods divide into parametric and non-parametric tests. Parametric tests assume knowledge of the population (its distribution, mean, variance, etc.) and combine population and sample information to infer population parameters. Non-parametric tests require no such assumptions and draw inferences about the population distribution from the sample alone. In metabolomics data analysis, parametric tests such as the t-test and ANOVA are applied to normally distributed data, while non-parametric tests such as the Wilcoxon and Kruskal-Wallis tests are used for non-normally distributed data.
T-Test
The t-test is suitable for comparing the means of two groups when the data are normally distributed with homogeneous variances; it examines whether the difference between the two group means is significant. The independent-sample t-test applies when the two groups are unrelated (e.g., different patients receiving surgery A or surgery B), whereas the paired-sample t-test requires paired observations from the same subjects (e.g., biological samples from the same patient before and after surgery).
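As a minimal illustration (not part of the original text), both variants can be run with SciPy on simulated intensity values; all group sizes and numbers below are invented for demonstration:

```python
# Hypothetical example: compare one metabolite's intensities between two groups.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
control = rng.normal(loc=10.0, scale=1.5, size=20)  # simulated control-group intensities
treated = rng.normal(loc=11.2, scale=1.5, size=20)  # simulated treatment-group intensities

# Independent-sample t-test (assumes normality and equal variances).
t_stat, p_value = stats.ttest_ind(control, treated)
print(f"independent t = {t_stat:.3f}, p = {p_value:.4f}")

# Paired-sample t-test, treating the two arrays as before/after measurements
# on the same 20 subjects (purely for illustration).
t_stat, p_value = stats.ttest_rel(control, treated)
print(f"paired t = {t_stat:.3f}, p = {p_value:.4f}")
```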
Wilcoxon Test
The Wilcoxon rank-sum test is applicable when the assumption of normality is not met for the two sample groups; it tests whether the distributions of the two samples differ significantly. The procedure merges and sorts the combined samples, assigns a rank to each data point, and then counts, over all cross-group pairs of observations, how often a value from the first group exceeds a value from the second (U1) and vice versa (U2). If U1 and U2 are close, there is no significant difference; if they differ substantially, the difference is significant. This U-statistic formulation is also known as the Mann-Whitney U test, which is equivalent to the Wilcoxon rank-sum test.
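A minimal sketch of this test with SciPy, again on invented, deliberately skewed data:

```python
# Hypothetical sketch: rank-sum (Mann-Whitney U / Wilcoxon rank-sum) test for two groups.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
group_a = rng.lognormal(mean=2.0, sigma=0.5, size=18)  # skewed, non-normal intensities
group_b = rng.lognormal(mean=2.3, sigma=0.5, size=22)

# mannwhitneyu computes the U statistic from the merged ranks, as described above.
u_stat, p_value = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")
print(f"U = {u_stat:.1f}, p = {p_value:.4f}")
```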
Analysis of Variance (ANOVA)
ANOVA, also known as the F-test, assesses whether differences between groups arise from the experimental factors or from random error. By partitioning the total variation into the contributions of different sources, it determines the influence of controllable factors on the study outcome. ANOVA assumes normality and homogeneity of variances and is therefore a parametric test.
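For illustration (a sketch on simulated values, not from the original article), a one-way ANOVA across three hypothetical dose groups:

```python
# Hypothetical sketch: one-way ANOVA across three dose groups for one metabolite.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
low    = rng.normal(10.0, 1.0, size=15)
medium = rng.normal(10.8, 1.0, size=15)
high   = rng.normal(12.1, 1.0, size=15)

# f_oneway partitions total variation into between-group and within-group parts.
f_stat, p_value = stats.f_oneway(low, medium, high)
print(f"F = {f_stat:.3f}, p = {p_value:.4f}")
```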
Kruskal-Wallis Test
The Kruskal-Wallis test (KW test) is similar to the Wilcoxon test but extends the rank-based comparison to more than two groups. For instance, it can examine differences in hormone levels after mice are continuously fed a specific diet for 3, 6, 9, and 12 days, when the distribution of the data is unknown.
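A minimal sketch mirroring that feeding-time example with simulated, non-normal values:

```python
# Hypothetical sketch: Kruskal-Wallis test across four feeding durations (3, 6, 9, 12 days).
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
day3, day6, day9, day12 = (rng.lognormal(m, 0.4, size=10) for m in (1.0, 1.1, 1.3, 1.6))

# kruskal compares the rank distributions of all four groups at once.
h_stat, p_value = stats.kruskal(day3, day6, day9, day12)
print(f"H = {h_stat:.3f}, p = {p_value:.4f}")
```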
Multiple Testing
Non-targeted metabolomics can detect nearly a thousand metabolites in a sample, which means an inter-group comparison involves nearly a thousand hypothesis tests and a sharply increased false positive rate. For instance, if an analysis identifies 1000 metabolites and a between-group t-test is conducted at a significance level of 0.05 (a 5% false positive rate per test), roughly 50 metabolites could appear "significantly different" by chance alone even when no true differences exist; all of these would be Type I errors, or false positives. Therefore, univariate statistical analysis typically applies multiple testing correction to the raw p-values to control the occurrence of false positives. Commonly used correction methods include the Bonferroni method and the Benjamini-Hochberg method.
The Bonferroni method divides the significance level by the number of independent hypotheses tested, denoted n: the threshold for significance is adjusted from 0.05 to 0.05/n, where n is the number of statistical tests conducted, i.e., the number of metabolites. This correction is stringent, and because metabolite concentrations are often correlated (e.g., metabolites belonging to the same metabolic pathway), the effective number of independent tests is smaller than n, making the correction overly conservative and prone to false negatives. Consequently, the Bonferroni method is infrequently used in metabolomics statistics.
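A minimal sketch of the correction on 1000 stand-in p-values (simulated data; either the threshold or the p-values can be adjusted, with identical results):

```python
# Hedged sketch: Bonferroni correction for 1000 metabolite p-values.
import numpy as np

rng = np.random.default_rng(4)
p_values = rng.uniform(0.0, 1.0, size=1000)  # stand-in raw p-values from 1000 tests

n = p_values.size
# Equivalent views of the same rule: shrink the threshold, or inflate the p-values.
significant = p_values < 0.05 / n
p_bonferroni = np.minimum(p_values * n, 1.0)
print(f"{significant.sum()} metabolites pass the Bonferroni threshold of {0.05 / n:.2e}")
```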
The Benjamini-Hochberg method instead exploits the fact that p-values from tests with no true difference are distributed differently from p-values arising from genuine group differences. The correction involves two steps: (1) sort all p-values in descending order, and (2) starting with the largest p-value, apply the correction formula

Adjusted P.value = P.value × Max.Rank / Rank.of.P.value

where P.value is the raw p-value, Max.Rank is the maximum rank (the total number of tests), and Rank.of.P.value is the rank of each p-value; each adjusted value is additionally capped at the adjusted value of the next-larger p-value so that the sequence stays monotone. The Benjamini-Hochberg method is less stringent than the Bonferroni correction and is the multiple testing approach most commonly employed in metabolomics.
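A short NumPy sketch of this formula (function name and example p-values are illustrative):

```python
# Hedged sketch: Benjamini-Hochberg adjustment implementing the formula above.
import numpy as np

def bh_adjust(p_values):
    p = np.asarray(p_values, dtype=float)
    n = p.size                      # Max.Rank: the total number of tests
    order = np.argsort(p)           # positions of p-values from smallest to largest
    ranks = np.empty(n, dtype=int)
    ranks[order] = np.arange(1, n + 1)
    adjusted = p * n / ranks        # P.value * Max.Rank / Rank.of.P.value
    # Starting from the largest p-value, enforce monotonicity with a running minimum.
    adjusted_sorted = np.minimum.accumulate(adjusted[order][::-1])[::-1]
    out = np.empty(n)
    out[order] = np.minimum(adjusted_sorted, 1.0)
    return out

print(bh_adjust([0.001, 0.008, 0.039, 0.041, 0.042, 0.60]))
```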
Figure: Univariate and multivariate statistical analysis of untargeted metabolomics data. (a) Box plots. (b) PCA scores plot. (c) Volcano plot. (d) Hierarchical clustering. (Winter et al., 2019)
Multivariate Analysis in Bioinformatics
PCA Analysis
PCA, or Principal Component Analysis, is an unsupervised dimensionality-reduction algorithm. Its main idea is to sequentially find a set of mutually orthogonal coordinate axes in the original high-dimensional space, where the choice of new axes is determined by the data itself: the directions of the new axes are the eigenvectors of the covariance matrix of the original data, also called the principal component directions. With n metabolites, up to n principal component directions can in theory be computed. Projecting the data onto each direction gives the variance along that direction; the direction with the largest projected variance is the first principal component, PC1, the next largest is PC2, and so on.
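A minimal scikit-learn sketch on a simulated samples-by-metabolites matrix (all sizes and data are invented; scaling choice is an assumption, as metabolomics workflows vary):

```python
# Hedged sketch: PCA on a (samples x metabolites) intensity matrix.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(5)
X = rng.normal(size=(30, 500))                 # 30 samples, 500 metabolite features (simulated)

X_scaled = StandardScaler().fit_transform(X)   # autoscale each metabolite
pca = PCA(n_components=2)
scores = pca.fit_transform(X_scaled)           # projections onto PC1 and PC2
print("explained variance ratio:", pca.explained_variance_ratio_)
```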
PLS-DA
PLS-DA, or Partial Least Squares Discriminant Analysis, is a discriminant analysis method commonly used to classify research objects based on observed or measured variable values. Like PCA, PLS-DA can also perform dimensionality reduction. However, unlike PCA, PLS-DA decomposes both the independent variable X matrix and the dependent variable Y matrix, utilizing their covariance information during decomposition. This allows PLS-DA to more efficiently extract inter-group variation information.
The variance along each PLS component can be used to calculate the VIP score, or Variable Importance in Projection. VIP scores measure how strongly each metabolite's expression pattern contributes to, and explains, the discrimination between sample groups, aiding the selection of significant metabolites: the higher a variable's VIP, the greater its overall contribution to the model.
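As a sketch (not the original article's pipeline), PLS-DA can be run in scikit-learn by fitting PLSRegression on dummy-coded class labels; the VIP computation below follows the standard formula, and all data and thresholds are invented:

```python
# Hedged sketch: PLS-DA via PLSRegression with dummy-coded labels, plus VIP scores.
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(6)
X = rng.normal(size=(40, 200))          # 40 samples x 200 metabolites (simulated)
y = np.repeat([0, 1], 20)               # two groups, dummy-coded 0/1
X[y == 1, :10] += 1.0                   # inject a group difference into 10 features

pls = PLSRegression(n_components=2).fit(X, y)

# VIP: weight each metabolite's (normalized, squared) weight by the
# Y-variance explained per component, then rescale by the feature count.
t, w, q = pls.x_scores_, pls.x_weights_, pls.y_loadings_
p, a = w.shape
ss = np.diag(t.T @ t @ q.T @ q)         # Y-variance explained by each component
w_norm2 = (w / np.linalg.norm(w, axis=0)) ** 2
vip = np.sqrt(p * (w_norm2 @ ss) / ss.sum())
print("top metabolites by VIP:", np.argsort(vip)[::-1][:10])
```

A common (though debated) convention is to treat variables with VIP > 1 as candidate discriminating metabolites.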
Additionally, for binary dependent variable data, OPLS-DA can be employed. It is an extension of PLS-DA, standing for Orthogonal Partial Least Squares Discriminant Analysis. OPLS-DA first uses orthogonal signal correction to decompose the X matrix information into two categories: information related and unrelated to the dependent variable Y. It then filters out information unrelated to classification, with the relevant information primarily concentrated in the first predictive component. This method effectively reduces model complexity and enhances interpretability without compromising predictive capability.
Random Forest
Random Forest is an ensemble learning algorithm comprising multiple decision trees. A decision tree uses a tree-like structure to partition the data layer by layer until a final classification is reached; its basic structure has three elements: a root node (representing the entire sample set), internal nodes (representing tests on feature attributes), and leaf nodes (representing decision outcomes). The forest grows each tree on a bootstrap sample of the data, considering a random subset of features at each split, and classifies a sample by majority vote across the trees, which also yields a ranking of feature importance.
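A minimal scikit-learn sketch on the same kind of simulated samples-by-metabolites data as above (all sizes and values are invented):

```python
# Hedged sketch: random forest classification of samples from metabolite features.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(7)
X = rng.normal(size=(40, 200))          # 40 samples x 200 metabolites (simulated)
y = np.repeat([0, 1], 20)
X[y == 1, :5] += 1.5                    # make 5 features informative

# Each of the 500 trees is grown on a bootstrap sample with random feature subsets;
# class predictions are made by majority vote across trees.
forest = RandomForestClassifier(n_estimators=500, oob_score=True, random_state=0).fit(X, y)
print(f"out-of-bag accuracy: {forest.oob_score_:.2f}")
print("most important metabolites:", np.argsort(forest.feature_importances_)[::-1][:5])
```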
Reference
- Winter, Helen, et al. "Identification of circulating genomic and metabolic biomarkers in intrahepatic cholangiocarcinoma." Cancers 11.12 (2019): 1895.