With the continuous evolution and maturation of mass spectrometry technology and bioinformatics tools, proteomics and post-translational modification (PTM) omics have found widespread application in the life sciences, basic medicine, and many other research domains. However, as with other multi-omics data, proteomics and PTM omics data require rigorous quality control and preprocessing before formal analysis. Generally, the preprocessing workflow for proteomics and PTM omics data encompasses several key steps:
- Raw Data Acquisition: Initial data collection through mass spectrometry techniques.
- Database Search: Utilization of databases to match acquired spectra with known protein sequences.
- Quantification: Measurement of protein abundance levels.
- Log Transformation: Application of logarithmic transformation to stabilize variance.
- Normalization: Adjustment of data to a common scale, allowing for meaningful comparisons.
- Missing Value Handling: Addressing and imputing missing values in the dataset.
- Batch Effect Removal: Mitigation of systematic variations introduced by different experimental batches.
- Differential Expression Analysis: Identification and exploration of proteins showing significant expression differences.
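The matrix-level steps in this workflow (log transformation, normalization, missing value handling) can be sketched in a few lines of Python with pandas; the sample names and intensity values below are purely illustrative, and global-minimum imputation stands in for whichever imputation scheme a given pipeline uses:

```python
import numpy as np
import pandas as pd

# Hypothetical protein-by-sample intensity matrix (illustrative values).
raw = pd.DataFrame(
    {"sample1": [1000.0, 200.0, np.nan], "sample2": [800.0, np.nan, 50.0]},
    index=["P1", "P2", "P3"],
)

# Log transformation: log2 stabilizes variance; NaN stays NaN.
log2 = np.log2(raw)

# Normalization: subtract each sample's median on the log2 scale,
# centering all samples on zero for cross-sample comparability.
normalized = log2 - log2.median()

# Missing value handling: a simple global-minimum imputation
# (one of many options; see the workflow-specific sections below).
imputed = normalized.fillna(normalized.min().min())
```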
Different laboratories, experimental conditions, proteomics technologies, software tools, and even tissue types can differ substantially in their preprocessing methods and workflows for proteomics and PTM omics data. This variability underscores the importance of carefully selecting and validating a workflow against the specific experimental context and research objectives. Here, we introduce several commonly employed preprocessing procedures in proteomics:
DIA Proteomics Data Preprocessing Workflow
In the realm of DIA (Data-Independent Acquisition) proteomics, the preprocessing of data plays a pivotal role in ensuring the reliability and accuracy of research outcomes. Currently, two prominent data preprocessing workflows are prevalent: the Spectronaut database search strategy and the DIA-NN database search strategy.
Spectronaut Database Search Strategy
A notable study exemplifying the Spectronaut strategy is the work conducted by Matthias Mann's research group, published in "Nature Medicine" in 2022 (PMID: 35654907). This investigation focused on DIA proteomics data sourced from tissues and plasma, processed using Spectronaut v13 and v15.4. The workflow commenced with the acquisition of relative quantification information, followed by a log2 transformation. Subsequently, the data underwent treatment for missing values, with researchers opting for imputation using random values drawn from a normal distribution. The imputed data then proceeded to downstream statistical and bioinformatics analyses.
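The random-draw imputation used in that study can be sketched as follows. This implements the common "downshifted normal" approach (popularized by Perseus), where missing values are assumed to fall below the detection limit; the `shift` and `width` parameters here are illustrative defaults, not the settings of the cited paper:

```python
import numpy as np

rng = np.random.default_rng(42)

def impute_downshifted_normal(values, shift=1.8, width=0.3):
    """Replace NaNs with draws from a normal distribution centered below
    the observed values: mean(mu - shift*sd), spread(width*sd)."""
    values = np.asarray(values, dtype=float)
    observed = values[~np.isnan(values)]
    mu, sd = observed.mean(), observed.std()
    missing = np.isnan(values)
    out = values.copy()
    out[missing] = rng.normal(mu - shift * sd, width * sd, size=missing.sum())
    return out
```

Drawing from a distribution shifted below the observed one encodes the assumption that values are missing because the corresponding proteins were near or under the instrument's detection limit.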
It is noteworthy that this study did not explicitly describe its normalization method or its approach to batch effects. In other studies (refer to Figure 1), it is common practice to perform median normalization within samples after log2 transformation and before handling missing values. This step places data from different samples on a common scale, facilitating inter-sample differential protein comparisons. Batch effects are typically mitigated using the well-established ComBat method.
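To illustrate the idea behind batch correction, the sketch below performs simple per-batch centering on log2 data. This is a simplified stand-in, not the ComBat method itself, which additionally shrinks the estimated batch parameters via empirical Bayes:

```python
import pandas as pd

def center_batches(log2_df, batch_labels):
    """Remove additive batch shifts: subtract each batch's per-protein mean,
    then add back the global per-protein mean. A simplified stand-in for
    ComBat (which also applies empirical Bayes shrinkage)."""
    grand_mean = log2_df.mean(axis=1)
    out = log2_df.copy()
    for batch in set(batch_labels):
        cols = [c for c, b in zip(log2_df.columns, batch_labels) if b == batch]
        batch_mean = log2_df[cols].mean(axis=1)
        out[cols] = log2_df[cols].sub(batch_mean, axis=0).add(grand_mean, axis=0)
    return out
```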
Figure 1. Spectronaut database search strategy preprocessing workflow
DIA-NN Database Search Strategy
The DIA-NN strategy mirrors the Spectronaut approach in its data preprocessing steps. Following the DIA-NN database search, the data undergoes a sequence of processes including log2 transformation, data standardization, and missing value handling, culminating in the identification of differentially expressed proteins.
Figure 2. Data preprocessing workflow for DIA-NN database search strategy.
In summary, both Spectronaut and DIA-NN strategies incorporate comparable preprocessing steps in DIA proteomics studies. However, researchers must be cognizant of additional considerations such as data standardization and batch effect correction to ensure the robustness and accuracy of downstream analyses. Further exploration and comparative studies of these methodologies will contribute to refining and optimizing the data preprocessing workflows in DIA proteomics.
Label-Free Data Preprocessing Workflow
The preprocessing workflow for label-free proteomics data differs from DIA (Data-Independent Acquisition) data and primarily involves two database search approaches: MaxQuant and Proteome Discoverer.
MaxQuant Database Search Results
MaxQuant yields three quantitative values: Intensity, iBAQ, and LFQ intensity.
- Intensity: The sum of signal intensities for all unique and razor peptides within a protein group, serving as a raw intensity value.
- iBAQ (Intensity-Based Absolute Quantification): Derived by dividing the raw intensity by the theoretical number of peptides for a given protein, akin to length normalization.
- LFQ (Label-Free Quantification) Intensity: Corrects the raw intensity values among samples to eliminate inter-sample variances introduced by processing, sample loading, pre-fractionation, and instrument-related errors. iBAQ is suitable for intra-sample comparisons, while LFQ is employed for inter-sample comparisons.
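The iBAQ calculation above amounts to dividing raw intensity by each protein's theoretical peptide count; a toy illustration (all numbers made up):

```python
# Hypothetical raw intensities and theoretical peptide counts for two proteins.
intensity = {"P1": 1.2e9, "P2": 4.0e8}
theoretical_peptides = {"P1": 40, "P2": 10}

# iBAQ divides intensity by the number of theoretically observable peptides,
# roughly normalizing for protein length.
ibaq = {p: intensity[p] / theoretical_peptides[p] for p in intensity}
# P1: 3.0e7, P2: 4.0e7 -- despite the lower raw intensity, P2 has the
# higher within-sample abundance after length normalization.
```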
For subsequent quantitative analyses, whether using Intensity, iBAQ, or LFQ intensity, it is common to perform a log2 transformation followed by sample-wise median or quantile normalization. Missing values are then filtered and imputed, leading to the final step of differential expression analysis.
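A sketch of the filter-then-impute step on a log2-normalized matrix. The filtering rule (at most 50% missing per protein) and per-sample-minimum imputation are illustrative choices, not fixed conventions:

```python
import numpy as np
import pandas as pd

log2_df = pd.DataFrame(
    {"s1": [10.0, np.nan, np.nan],
     "s2": [11.0, 9.0, np.nan],
     "s3": [10.5, 9.5, 8.0]},
    index=["P1", "P2", "P3"],
)

# Filter: keep proteins quantified in at least half of the samples.
kept = log2_df[log2_df.isna().mean(axis=1) <= 0.5]

# Impute: fill each sample's remaining NaNs with that sample's minimum
# (a simple stand-in for more elaborate imputation schemes).
imputed = kept.fillna(kept.min())
```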
Proteome Discoverer Database Search
In Proteome Discoverer (PD), the default quantitative value is iBAQ. The commonly used normalization method is FOT (Fraction of Total), in which each protein's iBAQ value is divided by the sum of iBAQ values for all proteins within that sample. Missing values are then imputed based on the data distribution using small constants such as 10e-5 or 10e-8, resembling a form of minimum-value imputation. The processed data then proceeds to downstream analysis.
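FOT normalization divides each protein's iBAQ by the sample's total iBAQ, so within-sample values sum to 1; a sketch with toy numbers (the small imputation constant here is illustrative):

```python
import numpy as np
import pandas as pd

ibaq = pd.DataFrame(
    {"s1": [2.0e7, 8.0e7, np.nan], "s2": [5.0e7, 4.0e7, 1.0e7]},
    index=["P1", "P2", "P3"],
)

# FOT: each protein's fraction of the sample's total iBAQ.
fot = ibaq / ibaq.sum()

# Minimum-value-style imputation with a small constant.
fot = fot.fillna(1e-5)
```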
TMT Data Preprocessing Workflow
TMT (Tandem Mass Tag) proteomics technology is widely utilized, especially in large-sample studies and clinical cohorts. The TMT proteomics database search methods are diverse, including MaxQuant, Proteome Discoverer (PD), MSFragger, and MS-GF+.
Taking the MSFragger database search strategy as an example, we illustrate the preprocessing workflow for TMT proteomics data.
Typically, MSFragger initially performs a database search on raw proteomics data, producing pepXML-formatted search result files. Subsequently, the Philosopher toolkit is employed for peptide, protein, and post-translational modification (PTM) quantification and filtering. Specifically, the output from MSFragger can undergo identification and validation of peptides using PeptideProphet. For modification-rich datasets such as phosphorylation, PTMProphet is employed for identification at the modification sites based on PeptideProphet results. Protein identification is processed using ProteinProphet. Finally, Philosopher is used for filtering false discovery rates (FDR) and quantification, yielding TMT reporter ion intensity at the peptide, modification, or protein levels.
However, relative quantification values for all proteins in each sample require correction based on a reference channel sample. This involves calculating the ratio (TMT ratio) of the TMT reporter intensity for a protein in a given sample to the intensity of the same protein in the reference channel sample.
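The TMT ratio computation for one protein across channels can be sketched as follows (intensities are toy numbers):

```python
import numpy as np

# Reporter ion intensities of one protein in three sample channels,
# plus the pooled reference channel (illustrative values).
sample_intensity = np.array([1.0e6, 2.0e6, 5.0e5])
reference_intensity = 1.0e6

# TMT ratio: each sample's intensity relative to the reference channel.
tmt_ratio = sample_intensity / reference_intensity
# -> [1.0, 2.0, 0.5]
```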
Following the acquisition of TMT ratio values for each protein, a log2 transformation is applied and an intra-sample median normalization is performed. This normalization is not a simple division of TMT ratios by the median; it involves several transformation steps. First, the median TMT ratio for each sample is calculated, and the global median M0 is determined. Each sample is then median-normalized using its own median, and the median absolute deviation (MAD) for each sample is calculated, from which the global MAD0 across all samples is determined. Finally, protein relative quantification values are standardized based on M0 and MAD0.
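A sketch of this median/MAD standardization, under the assumption that M0 and MAD0 are taken as the medians of the per-sample medians and per-sample MADs (the exact definition of the global quantities varies between pipelines):

```python
import pandas as pd

def median_mad_standardize(log2_ratio_df):
    """Center each sample at its median, scale by its MAD, then rescale
    all samples to the global M0 and MAD0."""
    medians = log2_ratio_df.median()                 # per-sample medians
    mads = (log2_ratio_df - medians).abs().median()  # per-sample MADs
    m0 = medians.median()                            # global median M0
    mad0 = mads.median()                             # global MAD0
    return (log2_ratio_df - medians) / mads * mad0 + m0
```

After this transformation every sample shares the same median (M0) and spread (MAD0), making protein values directly comparable across samples.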
The final relative expression value (A) of a protein is then obtained by adding the standardized (log2-transformed) protein quantification value to the corresponding log2-transformed value of the reference-channel protein. This is followed by missing value handling, batch effect removal, and downstream differential expression analysis.