Untargeted metabolomics is an unbiased approach to metabolic profiling that enables the simultaneous detection of a vast array of metabolite signals. A metabolomics dataset typically comprises both experimental samples and quality control (QC) samples. Before thorough analysis of the data, a series of preprocessing steps is essential, including outlier filtering, missing value filtering, missing value imputation, and data standardization.
Outlier filtering identifies and removes observations that deviate substantially from the norm, protecting the integrity of the dataset. Missing value filtering addresses incomplete data points in both experimental and QC samples. Subsequently, missing value imputation techniques are applied to estimate and replace missing values, minimizing their impact on downstream analyses. Finally, data standardization procedures, such as normalization and internal standard normalization, are implemented to mitigate systematic errors and variations introduced during metabolite detection.
Through these preprocessing steps, the potential influence of outliers and missing data is mitigated, enhancing the accuracy of metabolite screening and discovery. This rigorous preprocessing of untargeted metabolomics data lays the foundation for robust and reliable analyses, allowing more accurate identification and exploration of metabolites of interest.
Outlier Filtering
Outliers, also known as anomalies, are observed values that deviate markedly from the normal range. Their presence can distort the distribution of the data, producing substantial shifts in the dataset's mean and standard deviation and thereby affecting the results of statistical analyses. The relative standard deviation (RSD), also known as the coefficient of variation (CV), computed on quality control (QC) samples, is used as the metric for detecting such features.
The stability of each substance's detection is evaluated using the RSD within the QC group: substances with an RSD greater than 0.3 are considered unstable, and all detection data for those substances are removed. This enhances the reliability of the dataset by eliminating features that would introduce undue variability and distort its statistical characteristics.
Furthermore, this outlier filtering strategy not only safeguards against the impact of extreme values but also ensures the robustness of subsequent statistical analyses. By setting a threshold at an RSD of 0.3, substances exhibiting high variability in QC samples are identified and systematically excluded from further analysis, contributing to the overall integrity of the dataset. This methodological step is essential in refining data quality and reinforcing the validity of downstream interpretations and conclusions drawn from the statistical analyses conducted on the preprocessed dataset.
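As a minimal sketch, assuming a pandas DataFrame `data` with metabolites as rows and samples as columns, and `qc_cols` naming the QC sample columns (both names are illustrative, not from a specific pipeline), the RSD filter can be expressed as:

```python
import pandas as pd

def filter_by_qc_rsd(data: pd.DataFrame, qc_cols: list[str],
                     rsd_threshold: float = 0.3) -> pd.DataFrame:
    """Drop features whose RSD (CV) across QC samples exceeds the threshold."""
    qc = data[qc_cols]
    rsd = qc.std(axis=1, ddof=1) / qc.mean(axis=1)  # RSD = SD / mean, per feature
    return data.loc[rsd <= rsd_threshold]
```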
Figure: Flowchart showing the general strategy for preprocessing and analysis of LC/MS data for global untargeted analysis of metabolites and other analytes (Smith et al., 2006).
Missing Value Filtering
During analytical detection, missing values may occur for various reasons, such as low signal intensity or algorithmic limitations. In metabolomics analysis, data filtering is therefore often conducted based on the proportion of missing values within samples or groups. For instance, a common practice is to retain metabolites whose proportion of missing values does not exceed 50% in any single group or across all groups.
The occurrence of missing values in metabolomics datasets can arise from factors like technical limitations, experimental conditions, or inherent variability in biological samples. To ensure data quality and robustness in subsequent analyses, it becomes imperative to implement effective missing value filtering strategies.
By setting a threshold of 50% for allowable missing values, metabolites whose proportion of missing data exceeds the threshold are excluded, while the remainder are retained, depending on the study design and analytical objectives. This approach helps mitigate the impact of missing values on downstream statistical analyses, ensuring that the retained metabolites contribute meaningfully to the overall interpretation of the metabolomic dataset.
Moreover, the decision on the threshold for missing value retention can be tailored to the specific requirements of the study, balancing the need for data completeness with the potential biases introduced by missing values. Careful consideration of missing value patterns and their implications is essential to strike an optimal balance and derive meaningful insights from the metabolomics data.
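As one illustrative variant of such a filter (the group labels and the "at most 50% missing in at least one group" rule are assumptions for this sketch, not prescribed above), a group-wise missing value filter might look like:

```python
import pandas as pd

def filter_by_missing(data: pd.DataFrame, groups: dict[str, str],
                      max_missing: float = 0.5) -> pd.DataFrame:
    """Keep features whose missing fraction is <= max_missing in >= 1 group.

    `data` has features as rows and samples as columns; `groups` maps each
    sample column name to its group label.
    """
    group_labels = pd.Series(groups)
    keep = pd.Series(False, index=data.index)
    for g in group_labels.unique():
        cols = group_labels.index[group_labels == g]
        frac_missing = data[cols].isna().mean(axis=1)  # per-feature fraction
        keep |= frac_missing <= max_missing            # passes in this group
    return data.loc[keep]
```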
Missing Value Imputation
Even after data filtering, missing values may persist. Ignoring them can produce anomalies in subsequent analyses and compromise the accuracy of the results, so it is crucial to address missing values through imputation. A straightforward approach is direct imputation using a summary measure such as the median or half of the minimum value. Alternatively, more sophisticated methods, such as K-Nearest Neighbors (KNN) imputation and Singular Value Decomposition (SVD), can be employed.
Half of the Minimum Value Imputation: This method involves filling missing values with half of the minimum value of all peak areas detected in the experimental samples.
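A minimal sketch of this, computed per feature on a feature-by-sample DataFrame `data` (the text leaves open whether the minimum is per feature or global; the global variant is noted in a comment):

```python
import pandas as pd

def impute_half_min(data: pd.DataFrame) -> pd.DataFrame:
    """Fill each feature's NaNs with half of that feature's observed minimum.

    For the global variant, use data.min().min() / 2 for every feature instead.
    """
    half_min = data.min(axis=1) / 2   # half of each feature's minimum peak area
    return data.T.fillna(half_min).T  # transpose so fillna aligns per feature
```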
K-Nearest Neighbors Algorithm:
The KNN algorithm is a simple yet effective approach for missing value imputation. It calculates distances between samples to identify the k neighbors that are most similar, and each missing value is then imputed as the average of the corresponding values in those k nearest neighbors.
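Using scikit-learn's KNNImputer as one concrete implementation (it expects samples as rows and features as columns, so transpose a feature-by-sample matrix first; the data and neighbor count here are illustrative):

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy matrix: 4 samples (rows) x 3 features (columns), with two gaps.
X = np.array([[1.0, 2.0, np.nan],
              [3.0, 4.0, 3.0],
              [np.nan, 6.0, 5.0],
              [8.0, 8.0, 7.0]])

imputer = KNNImputer(n_neighbors=2)   # average over the 2 nearest samples
X_imputed = imputer.fit_transform(X)  # returns a fully filled ndarray
```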
Singular Value Decomposition:
SVD is a more advanced technique for imputing missing values. It decomposes the dataset into singular vectors and values, allowing for the reconstruction of the missing values based on the relationships observed in the rest of the data. This method is particularly useful for capturing underlying patterns and structures within the metabolomics dataset, contributing to more accurate imputations.
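A sketch of one common iterative SVD scheme (the rank, iteration count, and convergence tolerance are illustrative choices, not prescribed by the text): initialize the gaps with column means, then alternate between a rank-k reconstruction and refilling the missing cells until the fill values stabilize.

```python
import numpy as np

def svd_impute(X: np.ndarray, rank: int = 2, n_iter: int = 100,
               tol: float = 1e-6) -> np.ndarray:
    X = X.copy()
    missing = np.isnan(X)
    col_means = np.nanmean(X, axis=0)
    X[missing] = np.take(col_means, np.where(missing)[1])  # initial guess
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        approx = (U[:, :rank] * s[:rank]) @ Vt[:rank]      # rank-k reconstruction
        delta = np.max(np.abs(X[missing] - approx[missing]))
        X[missing] = approx[missing]                        # refill only the gaps
        if delta < tol:
            break
    return X
```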
Choosing the appropriate imputation method depends on the characteristics of the data and the desired level of complexity in the imputation process. By addressing missing values through these imputation techniques, researchers can enhance the completeness and reliability of the dataset for subsequent analyses, ensuring more robust and accurate results.
Data Standardization in Metabolomics Analysis
Metabolomic data are typically high-dimensional and noisy, and factors such as instrumentation can introduce systematic errors into the detection data. Data standardization is therefore an indispensable part of metabolomics analysis. Commonly used methods include internal standard normalization and area normalization.
Internal Standard Normalization:
Internal standard normalization typically uses the internal standard (IS) with the minimum relative standard deviation (RSD). The ratio R_i is calculated by dividing the peak area of metabolite i in the sample (Area_i) by the peak area of the internal standard (Area_IS):
R_i = Area_i / Area_IS
Area Normalization:
Area normalization sums the peak areas of all metabolites in a sample to obtain the total metabolite peak area (Area_all). The ratio R_i for each metabolite is then calculated by dividing its peak area (Area_i) by this total:
R_i = Area_i / Area_all
The presented methods aim to mitigate the impact of systematic errors and variations introduced during metabolite detection. Internal standard normalization utilizes a carefully selected internal standard to normalize individual metabolite peak areas, while area normalization considers the overall metabolite peak area in the sample. These standardized ratios facilitate meaningful comparisons and analyses across different samples.
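Both normalizations are simple column-wise divisions. In this sketch, `data` is again a feature-by-sample DataFrame of peak areas and `is_row` names the chosen internal standard feature (an illustrative label):

```python
import pandas as pd

def is_normalize(data: pd.DataFrame, is_row: str) -> pd.DataFrame:
    """R_i = Area_i / Area_IS, computed sample by sample."""
    return data.div(data.loc[is_row], axis=1)

def area_normalize(data: pd.DataFrame) -> pd.DataFrame:
    """R_i = Area_i / Area_all, where Area_all is each sample's total area."""
    return data.div(data.sum(axis=0), axis=1)
```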
Reference
- Smith, Colin A., et al. "XCMS: Processing Mass Spectrometry Data for Metabolite Profiling Using Nonlinear Peak Alignment, Matching, and Identification." Analytical Chemistry 78.3 (2006): 779-787.