Traditional protein identification methods, such as immunoblotting, chemical sequencing of internal peptides, comigration analysis of known or unknown proteins, or overexpression analysis of genes of interest in an organism, are time-consuming and labor-intensive, which makes them unsuitable for high-throughput screening. The techniques now in common use include image analysis, microsequencing, amino acid composition analysis for further identification of peptide fragments, and methods based on mass spectrometry.
Image Analysis
Image analysis of two-dimensional gel electrophoresis (2-DE) gels cannot rely on visual inspection alone, because changes in spot intensity (upregulation or downregulation) and the appearance or disappearance of spots on each image may reflect physiological or pathological conditions. Quantitative analysis requires computer-based data processing, and it presupposes a series of high-quality 2-DE gels with low background staining and high reproducibility. Image analysis comprises spot detection, background subtraction, spot matching, and database construction.
Commonly used imaging systems include charge-coupled device (CCD) cameras, laser densitometers, and phosphor or fluorescence imagers, which digitize the gel image into a pixel-based spatial grid. The image is then filtered and warped at the grayscale level to facilitate spot detection. Meaningful regions are separated from the background with operators such as the Laplacian, the Gaussian, and the difference of Gaussians (DoG), so that the intensity, area, perimeter, and orientation of each spot can be defined precisely.
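The difference-of-Gaussians step can be illustrated with a minimal sketch. It assumes the gel image is a plain list of rows of grayscale values; the function names and the two sigma values are illustrative choices, not taken from any particular 2-DE package. Subtracting a wide blur from a narrow one cancels a flat background and leaves spot-sized features.

```python
import math

def gaussian_kernel(sigma, radius):
    """Normalized 1D Gaussian weights for offsets -radius..radius."""
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def blur_rows(image, kernel):
    """Convolve every row with the kernel, clamping indices at the borders."""
    radius = len(kernel) // 2
    width = len(image[0])
    blurred = []
    for row in image:
        new_row = []
        for x in range(width):
            acc = 0.0
            for k, w in enumerate(kernel):
                xx = min(max(x + k - radius, 0), width - 1)
                acc += w * row[xx]
            new_row.append(acc)
        blurred.append(new_row)
    return blurred

def transpose(image):
    return [list(column) for column in zip(*image)]

def gaussian_blur(image, sigma):
    """Separable 2D Gaussian blur: blur rows, then columns."""
    kernel = gaussian_kernel(sigma, max(1, int(3 * sigma)))
    return transpose(blur_rows(transpose(blur_rows(image, kernel)), kernel))

def difference_of_gaussians(image, sigma_small=1.0, sigma_large=2.0):
    """Narrow blur minus wide blur; assumed sigmas, tune to the spot size."""
    narrow = gaussian_blur(image, sigma_small)
    wide = gaussian_blur(image, sigma_large)
    return [[n - w for n, w in zip(row_n, row_w)]
            for row_n, row_w in zip(narrow, wide)]
```

On a uniform image the result is zero everywhere, and a bright spot on a uniform background gives a positive response centered on the spot, which can then be thresholded.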
Automated spot detection must agree with the spots visible to the naked eye. Systems usually locate spots by their centroid or peak intensity, while edge detection and proximity analysis describe the outline of each spot and improve precision. Basic tools such as threshold analysis, edge detection, erosion, and dilation can also recover the boundaries of co-migrating spots. PC-based software such as Phoretix-2D now competes with traditional Unix-based 2-D analysis packages.
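As a rough illustration of threshold-based spot detection, the sketch below groups above-threshold pixels into connected regions and reports the area, integrated volume, intensity-weighted centroid, and peak position of each region. The data layout (a list of rows), the 4-neighbor connectivity, and the dictionary keys are assumptions made for this example; real packages add edge detection, erosion, and dilation on top of this.

```python
def detect_spots(image, threshold):
    """Label 4-connected regions above threshold and summarize each one."""
    height, width = len(image), len(image[0])
    seen = [[False] * width for _ in range(height)]
    spots = []
    for y0 in range(height):
        for x0 in range(width):
            if seen[y0][x0] or image[y0][x0] <= threshold:
                continue
            # Flood fill from the first unvisited above-threshold pixel.
            stack = [(y0, x0)]
            seen[y0][x0] = True
            pixels = []
            while stack:
                y, x = stack.pop()
                pixels.append((y, x))
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < height and 0 <= nx < width
                            and not seen[ny][nx] and image[ny][nx] > threshold):
                        seen[ny][nx] = True
                        stack.append((ny, nx))
            # Volume is the intensity above threshold, summed over the region.
            volume = sum(image[y][x] - threshold for y, x in pixels)
            cy = sum(y * (image[y][x] - threshold) for y, x in pixels) / volume
            cx = sum(x * (image[y][x] - threshold) for y, x in pixels) / volume
            peak = max(pixels, key=lambda p: image[p[0]][p[1]])
            spots.append({"area": len(pixels), "volume": volume,
                          "centroid": (cy, cx), "peak": peak})
    return spots
```

Two spots that touch would be reported as one region here; separating co-migrating spots needs the morphological tools mentioned above.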
Once the spots on 2-DE images have been detected, the images often need to be compared, added, subtracted, or averaged. Because 100% reproducibility is difficult to achieve in 2-DE, comparing protein ratios between gels is a challenge for image analysis systems. The advent of immobilized pH gradient (IPG) technology has made spot matching easier. Software systems such as Quest, Lips, Hermes, and Gemini, together with computational methods such as similarity analysis, cluster analysis, hierarchical classification, and principal component analysis, have been adopted. Future possibilities include neural networks, wavelet transformations, and practical analysis. Matching usually starts manually: around 50 prominent spots are set as "landmarks" for cross-matching, and the match is then extended to the entire gel.
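The landmark idea can be sketched in a deliberately simplified form. The example assumes that the distortion between two gels is a pure translation, estimated as the mean offset of the landmark pairs, and then pairs the remaining spots by nearest neighbor within a distance limit. Real gels need a more flexible warp; the function names and the `max_distance` default are hypothetical.

```python
import math

def estimate_shift(landmark_pairs):
    """Mean (dx, dy) offset from reference landmarks to sample landmarks."""
    n = len(landmark_pairs)
    dx = sum(sample[0] - ref[0] for ref, sample in landmark_pairs) / n
    dy = sum(sample[1] - ref[1] for ref, sample in landmark_pairs) / n
    return dx, dy

def match_spots(reference_spots, sample_spots, landmark_pairs, max_distance=3.0):
    """Map reference spot index -> sample spot index after shifting by the landmarks."""
    dx, dy = estimate_shift(landmark_pairs)
    matches = {}
    used = set()
    for i, (rx, ry) in enumerate(reference_spots):
        predicted_x, predicted_y = rx + dx, ry + dy
        best, best_distance = None, max_distance
        for j, (sx, sy) in enumerate(sample_spots):
            if j in used:
                continue
            distance = math.hypot(sx - predicted_x, sy - predicted_y)
            if distance <= best_distance:
                best, best_distance = j, distance
        if best is not None:
            matches[i] = best
            used.add(best)
    return matches
```

Spots with no partner inside `max_distance` stay unmatched, which is how appearing or disappearing spots would surface in this scheme.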
For example, the isoelectric point (pI) and molecular weight (MW) of an unknown protein are estimated from a standard curve built from 20 or more known proteins on the reference gel. The accuracy of the estimates depends heavily on the structure of the constructed grid and the type of specimen. With large unmodified proteins as markers, the pI of a denatured, modified protein is estimated with an error of approximately ±0.25 units. Similarly, the theoretical molecular weights of known proteins can be calculated from database sequences and paired with their apparent molecular weights on the gel to form a grid for estimating the molecular weight of unknown proteins. The error for small unmodified proteins is approximately 30%, and it is larger for post-translationally modified proteins, so other techniques are needed to confirm the identification.
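A standard-curve estimate of this kind can be sketched as piecewise-linear interpolation between marker proteins. The sketch assumes that pI varies linearly with horizontal position between neighboring markers and that log10(MW) varies linearly with vertical position; both are common working assumptions rather than something the text specifies, and the marker values in the usage note are invented.

```python
import math

def interpolate(position, calibration):
    """Piecewise-linear interpolation over (position, value) marker pairs.

    Marker positions are assumed to be distinct.
    """
    points = sorted(calibration)
    if not points[0][0] <= position <= points[-1][0]:
        raise ValueError("position lies outside the calibrated range")
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 <= position <= x1:
            return y0 + (y1 - y0) * (position - x0) / (x1 - x0)

def estimate_pi(x_position, pi_markers):
    """pi_markers: (x position, known pI) pairs from the reference gel."""
    return interpolate(x_position, pi_markers)

def estimate_mw(y_position, mw_markers):
    """mw_markers: (y position, known MW) pairs; interpolates in log10(MW)."""
    log_markers = [(y, math.log10(mw)) for y, mw in mw_markers]
    return 10 ** interpolate(y_position, log_markers)
```

For instance, with hypothetical markers at positions 0, 100, and 200 carrying pI 4.0, 7.0, and 10.0, a spot at position 50 is assigned pI 5.5. Positions outside the marker range raise an error instead of being extrapolated.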
Microsequencing
Protein microsequencing has become a cornerstone of protein analysis and identification. Although amino acid composition analysis and peptide mass fingerprinting (PMF) can identify proteins separated by 2-DE, N-terminal Edman degradation remains the primary identification technique. Protein microsequencing has been automated: proteins separated by gel electrophoresis are transferred directly onto PVDF or glass fiber membranes, stained, excised, and placed directly into a sequencer, which can identify proteins at subpicomole levels. There are, however, drawbacks: Edman degradation is slow, yielding one amino acid residue every 40 minutes; it is resource-intensive compared with mass spectrometry; and the reagents are expensive, at $3-4 per amino acid. For these reasons, conventional Edman degradation is not suitable for analyzing hundreds or thousands of proteins.
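The scale of the problem follows from simple arithmetic on the figures above. The sketch below multiplies them out; the function name, the choice of how many residues to sequence per protein, and the assumption that cycles run back to back on a single instrument are illustrative.

```python
def edman_budget(n_proteins, residues_per_protein,
                 minutes_per_cycle=40.0, cost_per_residue=(3.0, 4.0)):
    """Instrument time and reagent cost for sequencing a set of proteins.

    Defaults use the figures quoted in the text: 40 minutes and $3-4 per residue.
    """
    cycles = n_proteins * residues_per_protein
    hours = cycles * minutes_per_cycle / 60.0
    cost_low, cost_high = (cycles * cost for cost in cost_per_residue)
    return {"cycles": cycles, "hours": hours,
            "cost_low": cost_low, "cost_high": cost_high}
```

Under these assumptions, sequencing 10 residues from each of 1,000 proteins takes 10,000 cycles, which is more than 6,600 instrument hours and $30,000 to $40,000 in reagents.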
More recently, automated Edman degradation has been used to generate short N-terminal sequence tags. This adapts the concept of sequence tagging from mass spectrometry to Edman degradation and has proved to be a powerful method for protein identification. With simple hardware modifications that speed up tag generation to 10-20 N-terminal tags per day, sequence tagging is suitable for identifying smaller sets of proteins. Combined with other protein attributes such as amino acid composition, peptide mass, protein molecular weight, and isoelectric point, it gives reliable identifications. Sequence tags can be matched against databases with the appropriate BLAST program. Retrieval programs such as TagIdent also enable cross-species identification, which further extends the role of sequence tags in proteomic research.
Technologies Related to Mass Spectrometry
Mass spectrometry has emerged as a pivotal technology that bridges the gap between proteins and genes and opens the door to large-scale automated protein identification. A mass spectrometer for protein or peptide analysis has two main components: 1) an ion source that introduces and ionizes the sample, and 2) a mass analyzer that measures the molecular weight of the resulting ions. Matrix-assisted laser desorption/ionization time-of-flight mass spectrometry (MALDI-TOF) is a pulsed ionization technique that generates ions from solid-phase samples and measures their molecular weights in a flight tube. Electrospray ionization mass spectrometry (ESI-MS) is a continuous ionization method that produces ions from the liquid phase, followed by mass measurement in either a quadrupole mass spectrometer or a time-of-flight detector. In recent years, mass spectrometry instruments and techniques have advanced significantly.
In MALDI-TOF, notable progress includes the implementation of ion reflectors and delayed ion extraction, enabling fairly precise molecular weight determination. In ESI-MS, the advent of nano-electrospray sources has made the analysis of microliter-level samples possible within 30-40 minutes.
The combination of reverse-phase liquid chromatography and tandem mass spectrometry (tandem MS) allows detection at levels of several tens of picomoles. Coupling capillary chromatography with tandem mass spectrometry enables detection at low picomole to high femtomole levels. When capillary electrophoresis is coupled with tandem mass spectrometry, detection can occur at levels below femtomoles, and even at attomole levels. Currently, the joint application of enzymatic digestion, liquid chromatography separation, tandem mass spectrometry, and computer algorithms is common for protein identification.
Protein Identification Analysis (Dubitzky et al., 2013)
Peptide mass fingerprinting (PMF)
Peptide mass fingerprinting (PMF), introduced by Henzel et al. in 1993, involves enzymatic cleavage of proteins separated by 2-DE, typically with trypsin, which cleaves at the C-terminal side of arginine or lysine residues. The precise molecular weights of the resulting peptides are then measured by mass spectrometry (MALDI-TOF-MS or ESI-MS), with a peptide mass accuracy of up to 0.1 mass units. All measured peptide masses are then matched against theoretical peptide masses in a database, which are generated by "cleaving" each database protein in silico with the enzyme used in the experiment. The database proteins are ranked by the number of peptide masses they share with the experimental digest, and the top-ranked protein is the candidate identity of the unknown protein. A substantial gap between the top-ranked and second-ranked proteins, together with good coverage of the protein by the experimental peptides, indicates a higher likelihood of correct identification.
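The matching and ranking procedure can be sketched in a few functions. This is a minimal illustration under stated assumptions: cleavage after every lysine or arginine with no missed cleavages and no proline exception, unmodified peptides, monoisotopic residue masses, and observed masses already converted to neutral peptide masses. The function names and the toy database in the usage note are hypothetical.

```python
# Monoisotopic residue masses in daltons.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056

def tryptic_digest(sequence):
    """Split after every K or R (simplified rule, no missed cleavages)."""
    peptides, start = [], 0
    for i, residue in enumerate(sequence):
        if residue in "KR":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])
    return peptides

def peptide_mass(peptide):
    """Neutral monoisotopic mass: residue masses plus one water."""
    return sum(MONO[residue] for residue in peptide) + WATER

def rank_candidates(observed_masses, database, tolerance=0.1):
    """Rank database proteins by how many observed masses they explain."""
    ranking = []
    for name, sequence in database.items():
        theoretical = [peptide_mass(p) for p in tryptic_digest(sequence)]
        hits = sum(1 for mass in observed_masses
                   if any(abs(mass - t) <= tolerance for t in theoretical))
        ranking.append((name, hits))
    ranking.sort(key=lambda item: item[1], reverse=True)
    return ranking
```

Calling `rank_candidates(masses, {"protA": "GASKPVTRLLNK", "protB": "AAGKWWYRFFH"})` returns the proteins ordered by hit count, and the gap between the first and second entries gives a rough sense of confidence. Production search engines use probabilistic scores instead of raw counts.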
Partial sequencing of peptide fragments
Peptide mass fingerprinting, while valuable, does not reveal the sequence of the peptide fragments or of the complete protein. To identify proteins further, a series of mass spectrometry methods have been developed to characterize peptide fragments. Enzymatic or chemical methods are used to remove amino acids sequentially from either the N- or C-terminus, generating ladder-like sets of peptide fragments. One approach uses controlled chemical degradation from the N-terminus to produce a series of ladder peptides of different lengths, whose masses are measured by MALDI-TOF-MS. Another uses carboxypeptidase, which removes varying numbers of amino acids from the C-terminus. Both the chemical and the enzymatic method can yield relatively long sequences, with masses precise enough to distinguish lysine (128.09) from glutamine (128.06). Alternatively, post-source decay (PSD) and collision-induced dissociation (CID) can be applied inside the mass spectrometer to generate a spectrum containing a series of peptide peaks that differ in mass by a single amino acid residue, from which the sequence of the peptide fragment can be inferred.
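Reading a sequence from a ladder comes down to looking up the mass differences between neighboring rungs. The sketch below assumes unmodified residues, monoisotopic masses, and a tolerance of 0.01 Da, which is tight enough to separate lysine from glutamine. Leucine and isoleucine have identical masses, so they are reported together as "L/I". The function names are illustrative.

```python
# Monoisotopic residue masses in daltons; leucine and isoleucine are isobaric.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L/I": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
    "E": 129.04259, "M": 131.04049, "H": 137.05891, "F": 147.06841,
    "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}

def residue_for(delta, tolerance=0.01):
    """Return the residue whose mass is closest to delta, or '?' if none fits."""
    best = min(RESIDUE_MASS, key=lambda r: abs(RESIDUE_MASS[r] - delta))
    if abs(RESIDUE_MASS[best] - delta) > tolerance:
        return "?"
    return best

def read_ladder(masses, tolerance=0.01):
    """Infer residues from the differences between consecutive ladder masses.

    Masses are sorted from heaviest to lightest, so the residues come out in
    the order in which they were removed from the peptide.
    """
    ordered = sorted(masses, reverse=True)
    return [residue_for(heavy - light, tolerance)
            for heavy, light in zip(ordered, ordered[1:])]
```

For an N-terminal ladder the result reads from the N-terminus inward; for a carboxypeptidase ladder it reads from the C-terminus inward and has to be reversed to give the conventional orientation.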
PSD analysis of peptide fragments can yield partial sequence information on MALDI instruments equipped with a reflector. Peptide mass fingerprinting is carried out first. A peptide fragment of interest is then selected in the mass spectrometer as the "parent ion," which decays into "daughter ions" on its way to the ion reflector. In the reflector, the voltage is lowered stepwise so that fragments of different sizes reach the detector and are measured. However, the fragment series produced is often incomplete. Peptide sequencing by CID dates from the late 1970s and can be performed on a triple quadrupole ESI-MS or on a MALDI-TOF-MS fitted with a collision cell.
In ESI-MS, peptide ions generated by the electrospray source are measured in the first quadrupole, and the peptide ions of interest are directed into the second quadrupole. There, collisions with an inert gas fragment them, and the resulting products are measured in the third quadrupole. Compared with MALDI-PSD, CID is stable, robust, and widely applicable. Fragmentation occurs predominantly at the amide bonds along the peptide backbone, producing a ladder-like series of fragment ions. The mass difference between consecutive fragments gives the mass of the amino acid at that position, from which the sequence can be inferred. Short stretches of several residues, termed "peptide sequence tags," can be read from CID spectra. A tag combined with the mass of the parent ion and the mass distances from the tag to the N- and C-termini of the peptide is sufficient for protein identification.
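The backbone ladder can be illustrated by computing the fragment masses expected for a known peptide. The sketch uses the common b-ion and y-ion nomenclature, which the text does not name explicitly, and assumes singly charged, unmodified fragments with monoisotopic masses. The function name is hypothetical.

```python
# Monoisotopic residue masses in daltons.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565
PROTON = 1.007276

def fragment_ions(peptide):
    """Singly charged b and y ion m/z values for each backbone cleavage.

    b ions keep the N-terminal part, y ions keep the C-terminal part.
    """
    masses = [MONO[residue] for residue in peptide]
    b_ions, y_ions = [], []
    for i in range(1, len(peptide)):
        b_ions.append(sum(masses[:i]) + PROTON)
        y_ions.append(sum(masses[-i:]) + WATER + PROTON)
    return b_ions, y_ions
```

Consecutive values in either list differ by exactly one residue mass, which is the property that lets a sequence tag be read off a real spectrum.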
Amino Acid Composition Analysis
Amino acid composition analysis, first introduced in 1977 as a tool for protein identification, is a distinctive "footprint" technique. It exploits the fact that proteins differ in amino acid composition, and it provides an attribute that is independent of sequence and distinct from peptide masses or sequence tags. Latter was the first to demonstrate that amino acid composition data could be used to identify proteins from 2-DE gels. The composition is determined either by labeling proteins with radiolabeled amino acids or by blotting proteins onto PVDF membranes and hydrolyzing them in acid at 155°C for 1 hour. The amino acids are then extracted, automatically derivatized, and separated by chromatography within 40 minutes, which allows routine analysis of 100 proteins per week.
Proteins in a database are ranked by a score that expresses the numerical difference between the composition of the unknown protein and that of each database protein. The "champion" protein is the one whose amino acid composition is most similar to that of the unknown protein. The larger the score gap between the champion and the runner-up, the higher the confidence in the champion. Programs available on the internet for amino acid composition analysis include AACompIdent, ASA, FINDER, AAC-PI, and PROP-SEARCH. PROP-SEARCH uses composition, sequence, and amino acid position to retrieve homologous proteins. The method has drawbacks, however, such as deviations in the measured composition caused by inadequate acid hydrolysis or partial degradation. It is therefore advisable to combine composition data with other protein attributes for a reliable identification.
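The ranking step can be sketched with a simple distance between percentage compositions. The sum of squared differences used here is an assumption chosen for illustration and is not claimed to be the exact score of any of the programs named above. The sketch also ignores that acid hydrolysis converts asparagine and glutamine to their acids and destroys tryptophan, which real tools have to account for.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition_percent(sequence):
    """Percentage of each of the 20 standard amino acids in a sequence."""
    counts = Counter(sequence)
    return {aa: 100.0 * counts[aa] / len(sequence) for aa in AMINO_ACIDS}

def composition_score(observed, candidate):
    """Sum of squared percentage differences; lower means more similar."""
    return sum((observed[aa] - candidate[aa]) ** 2 for aa in observed)

def rank_by_composition(observed, database):
    """Return (score, name) pairs sorted so the 'champion' comes first."""
    scored = [(composition_score(observed, composition_percent(sequence)), name)
              for name, sequence in database.items()]
    scored.sort()
    return scored
```

The first entry of the returned list is the champion and the second is the runner-up, so the gap between their scores can be inspected directly before accepting an identification.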
Reference
- Dubitzky, Werner, et al., eds. Encyclopedia of Systems Biology. Vol. 402. New York, NY, USA: Springer, 2013.