Uncovering missing glycans and unexpected fragments with pGlycoNovo for site-specific glycosylation analysis across species

Evolving the pGlycoNovo approach

pGlycoNovo is a new software package developed within the pGlyco3 [15] platform, characterized by its highly efficient glycan library-free search capabilities. pGlycoNovo follows a glycan-first search strategy and integrates a full-range Y-ion matching method for de novo sequencing of glycan compositions without prior knowledge of the peptide part (Fig. 1a). In contrast to existing tools that primarily rely on the core Y ions, pGlycoNovo utilizes the full-range Y ions and significantly expands the glycan search space for both N- and O-glycopeptide identification (Fig. 1b).

Fig. 1: Development of pGlycoNovo for rapid and glycan library-free identification of intact glycopeptides.
a The main strategy of full-range Y-ion dynamic searching in pGlycoNovo. b pGlycoNovo is distinguished by its utilization of the full range of Y ions, enabling the identification of N-/O-glycopeptides across a significantly expanded glycan search space. c The schema of the pGlycoNovo workflow. Detailed software algorithms and processes are provided in Online Methods. d A significant expansion of the glycan search space was achieved by pGlycoNovo in comparison to other tools. The search space of glycan library-dependent software is limited to the number of glycan compositions recorded in its library, while the glycan library-independent software StrucGP is currently restricted to mammalian species due to its reliance on prior knowledge and can handle roughly up to 10,000 glycan compositions, with the largest N-glycan being HexNAc(9)Hex(10)Fuc(5)Ac(4)Gc(4). In contrast, pGlycoNovo imposes no limitations on glycan library size and glyco units, allowing flexible customization. In this study, we employed an optimized glycan search space size of 168,000, as we found that a larger search space did not significantly enhance identification results (data not shown).
e Sensitivity and precision evaluation of pGlycoNovo in a yeast dataset with a precursor tolerance of 2 ppm under different glyco-gap and glycan score conditions. f Precision and sensitivity of pGlycoNovo across three datasets using the parameters adopted in our study (2 ppm for precursor tolerance, a glyco-gap of 2, and a glycan score threshold of 20). Three benchmark datasets, from budding yeast, mouse, and human, which possess well-established glycan libraries and in-depth knowledge of glycan compositions, were used to demonstrate pGlycoNovo’s precision and sensitivity (Supplementary Note 1). We compared pGlycoNovo’s identification results with those of pGlyco3, which relies on a known glycan library, using the formula in the figure to evaluate pGlycoNovo’s precision and sensitivity. Detailed search parameter optimization and comparisons are provided in Supplementary Note 1. g Comparison of search speed using the three benchmark datasets. The comparison includes an assessment of the pGlyco3 and pGlycoNovo algorithms (both searching the same MGF files generated by pParse [56]).

The schema of the whole pGlycoNovo process is shown in Fig. 1c. For a given glycopeptide MS2 spectrum, pGlycoNovo performs four key steps: glycan enumeration and matching of peak complementary masses (complementary mass refers to precursor mass minus peak mass); generation of a Y-ion graph based on the identified Y ions; deduction and filtration of glycan candidates from the Y-ion graph; and retrieval of peptides for the identified glycans, followed by glycopeptide scoring and quality control. The details of the pGlycoNovo algorithm are provided in the Methods. By employing the full-range Y-ion searching strategy, pGlycoNovo substantially expands the glycan search space, enabling de novo sequencing of ~168,000 glycan compositions.
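
The first of these steps can be pictured with a small sketch. The code below is not the pGlycoNovo implementation; it is a minimal illustration, assuming neutral (charge-deconvoluted) masses, a brute-force enumeration over only three glyco units, and an absolute mass tolerance. The function names, the `max_counts` bounds, and the `tol_da` parameter are hypothetical. The residue masses are standard monoisotopic values.

```python
from itertools import product

# Monoisotopic residue masses (Da) of three common glyco units.
RESIDUE_MASS = {"Hex": 162.0528, "HexNAc": 203.0794, "Fuc": 146.0579}


def enumerate_compositions(max_counts):
    """Yield every non-empty composition within the per-unit upper bounds."""
    units = sorted(max_counts)
    for counts in product(*(range(max_counts[u] + 1) for u in units)):
        if any(counts):
            yield dict(zip(units, counts))


def composition_mass(comp):
    """Sum of residue masses for a composition such as {"Hex": 3, "HexNAc": 2}."""
    return sum(RESIDUE_MASS[unit] * n for unit, n in comp.items())


def match_complementary_masses(precursor_mass, peak_masses, max_counts, tol_da=0.01):
    """For each peak, list compositions whose mass equals precursor - peak.

    Assumption: all masses are neutral masses; tol_da is a hypothetical
    absolute tolerance, not a pGlycoNovo parameter.
    """
    table = [(composition_mass(c), c) for c in enumerate_compositions(max_counts)]
    matches = {}
    for peak in peak_masses:
        complementary = precursor_mass - peak  # mass of the glycan part lost from the Y ion
        hits = [c for mass, c in table if abs(mass - complementary) <= tol_da]
        if hits:
            matches[peak] = hits
    return matches


# Toy spectrum: a hypothetical 1000.0 Da peptide carrying Hex(3)HexNAc(2).
precursor = 1000.0 + composition_mass({"Hex": 3, "HexNAc": 2})
peaks = [1000.0, 1000.0 + RESIDUE_MASS["HexNAc"]]  # bare peptide, peptide + HexNAc
for peak, comps in match_complementary_masses(
    precursor, peaks, {"Hex": 5, "HexNAc": 3, "Fuc": 2}
).items():
    print(round(peak, 4), comps)
```

In this toy case each peak maps to a single composition of the lost glycan part. A real search would need charge-state handling, a ppm-based tolerance, and a far more efficient enumeration than the brute-force product used here.
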
The expanded search space is 16 to 1,000 times larger than that of commonly used non-open-search approaches (Fig. 1d).

We utilized intact N-glycopeptide data from budding yeast, mouse liver, and HeLa cells as our benchmark datasets (Supplementary Note 1), as these species are recognized for possessing well-established glycan libraries. This allowed us to fine-tune the search parameters and demonstrate pGlycoNovo’s performance in terms of sensitivity, precision, and search speed. After systematic optimization of the search parameters (Supplementary Note 1, Fig. 1e), pGlycoNovo exhibited a precision exceeding 98.5% while maintaining a sensitivity of over 70.9% across the three datasets using the parameters adopted in our study (Fig. 1f, Supplementary Note 1), putting it on par with well-established peptide de novo techniques [33,34,35]. Moreover, the efficient dynamic programming implemented in pGlycoNovo ensured a high-speed search (Fig. 1g), even outpacing pGlyco3, which was previously shown to be the fastest among various software tools [15].

Assessing pGlycoNovo’s search performance on the SARS-CoV-2 spike protein

To assess the performance of pGlycoNovo in glycopeptide identification, especially in the context of rare glycan identification, we conducted a comparative analysis with six other software tools (pGlyco3, MSFragger-open, StrucGP, Byonic, GlycanFinder, and Glyco-Decipher) using one of the thirty RAW files from a publicly available N-glycopeptide dataset of the SARS-CoV-2 Spike protein (PXD0018506 [36]). All seven software tools employed the same protein database containing the SARS-CoV-2 Spike protein sequences for protein searches, along with their respective N-glycan databases for glycan searches. Notably, pGlycoNovo does not rely on a predefined glycan database. Instead, it employs a maximum of forty-nine monosaccharides for glycan identification. The search details are listed in Supplementary Note 2. The comparative results shown in Fig.
2a demonstrate that, in terms of overall identification results, pGlycoNovo performed comparably to GlycanFinder and Glyco-Decipher (216 vs. 219 vs. 218) and outperformed the other four tools (Supplementary Data). Even in comparison with MSFragger-Glyco open search [17], GlycanFinder [31], and Glyco-Decipher [26], which specialize in in-depth identification, pGlycoNovo stands out for its complementary identifications. It is noteworthy that the glycopeptide data used here originated from the SARS-CoV-2 Spike protein expressed in insects, potentially containing rare glycans. StrucGP, which relies on prior glycan knowledge [25], can identify only 0.85% of glycopeptides carrying rare glycans. The other five software tools, including pGlyco3, are restricted by the limitations of their glycan libraries, making them incapable of identifying glycan types beyond those present in mammalian glycan databases. In contrast, pGlycoNovo detects 12.96% of glycopeptides carrying rare glycans (Fig. 2a, Supplementary Note 2, Supplementary Data).

Fig. 2: Analysis of public SARS-CoV-2 Spike glycoproteome data.
a Site-specific N-glycan identification comparison using different software tools on a single N-glycopeptide dataset of the SARS-CoV-2 Spike protein. b Search speed comparison using the same N-glycopeptide data (PXD0018506 [36]). All searches were initiated from RAW files, using the same protein database. Notably, pGlycoNovo performed searches in a glycan search space 16–1,000 times larger than the other four tools. Detailed comparison procedures are in Supplementary Note 2. c Expanded N-glycoproteome revealed by pGlycoNovo in public SARS-CoV-2 Spike protein MS data (PXD0018506 [36], 30 RAW files). Results were compared with those published using Byonic for the same data (Supplementary Note 3). d Distribution of site-specific N-glycans on the SARS-CoV-2 Spike protein.
We performed statistical analysis of the 523 site-specific glycans identified by combining results from pGlycoNovo and the published data. Glycans were categorized into five groups based on monosaccharide composition: oligomannose type, hybrid type, complex type, truncated glycans, and unclassified type, where “unclassified type” refers to rare glycans not included in existing glycan database-dependent search engines (Supplementary Note 4). e Analysis of site-specific O-glycans using pGlycoNovo on the SARS-CoV-2 Spike protein. This public dataset does not include ETD spectra, making it impossible to differentiate between neighboring sites. f An annotated spectrum of an intact N-glycopeptide with a rare glycan attached. g An annotated spectrum of an intact O-glycopeptide with a rare glycan attached. In the peptide sequences, “J” indicates the N-glycosylation site. The glycan symbols are as follows: green circle for Hex, blue square for HexNAc, red triangle for fucose, yellow star for xylose, and color-block diamond for HexA. Here, we used a publicly available dataset (PXD0018506 [36], 30 RAW files in total). In panel a, one RAW file was used to compare the seven software tools (Supplementary Note 2). In panel b, six RAW files were used to compare the five software tools, and additional search time comparisons are provided in Supplementary Note 2. In panels c and d, all 30 RAW files were used to identify an expanded glycoproteome dataset with rare glycans (Supplementary Note 3).

Furthermore, we performed a runtime comparison of pGlycoNovo with four other software tools using the same dataset. All these tools processed the data from the RAW files in the same computer environment and under the same hardware conditions (Supplementary Note 2).
pGlycoNovo completed the search of 6 RAW files in an ultra-large glycan search space in just 6.7 minutes, significantly outperforming the other tools, which searched against the restricted glycan search space inherent to their glycan libraries (Fig. 2b, Supplementary Note 2). Building on the pGlyco3 infrastructure, pGlycoNovo also employs multi-processing to search MS data, with each RAW file assigned to a separate CPU core and processed independently of the others. Unlike multi-threading-based searches, this approach allows multiple RAW files to be processed simultaneously, reducing the I/O time required to access multiple runs (Supplementary Note 2). However, when a single file is searched, only one CPU core is utilized, which makes the search slower than multi-threading strategies.

We then used pGlycoNovo to analyze all thirty glycopeptide RAW files from the SARS-CoV-2 Spike glycoprotein (PXD0018506 [36]). When compared with previously published findings [36], which were generated through Byonic searches (detailed methods and procedures are provided in Supplementary Note 3), pGlycoNovo recovered approximately 71% of the published data (208 of the 293 initially reported site-specific N-glycans) and additionally identified 230 site-specific N-glycans (Fig. 2c). The 230 site-specific N-glycans uniquely identified by pGlycoNovo covered all 15 previously reported sites. These uniquely identified site-specific N-glycans show no bias in the distribution of the number of glycans at each site or in the presence of shorter glycans (Supplementary Note 3). This demonstrates the high matching capability of pGlycoNovo, although such matching also requires high-quality glycopeptide spectra. As a result, pGlycoNovo greatly expanded the publicly accessible dataset for the SARS-CoV-2 Spike N-glycoproteome, increasing it from 293 to 523 site-specific N-glycans.
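
The counts quoted above can be checked with a few lines of arithmetic. The snippet below only restates the numbers given in the text; the variable names are illustrative.

```python
published = 293   # site-specific N-glycans in the published Byonic results
shared = 208      # of those, also identified by pGlycoNovo
novo_only = 230   # identified by pGlycoNovo but not in the published results

coverage = shared / published      # fraction of published data recovered
combined = published + novo_only   # size of the merged dataset
novo_total = shared + novo_only    # all pGlycoNovo site-specific N-glycans

print(f"coverage of published data: {coverage:.1%}")    # 71.0%
print(f"combined site-specific N-glycans: {combined}")  # 523
print(f"pGlycoNovo identifications: {novo_total}")      # 438
```
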
With a total of 523 site-specific N-glycans, we re-depicted the glycan compositions at the 17 N-glycosylation sites on the SARS-CoV-2 Spike glycoprotein (Fig. 2d). To convey the main glycosylation features at each site, we classified the glycans into five groups: oligomannose type, hybrid type, complex type, truncated glycans, and unclassified type, as depicted in Fig. 2d. Here, the “unclassified type” refers to rare N-glycan compositions that are not included in existing glycan database-dependent search engines (Supplementary Note 4). Notably, our analysis revealed the presence of rare glycans at 12 sites, with the highest occurrence observed near the connector domain, spanning from N1074 to N1194 (Fig. 2d). The unique identifications by pGlycoNovo contributed rare N-glycans at different sites (Supplementary Note 3).

Additionally, pGlycoNovo identified 30 O-glycan compositions with indistinguishable adjacent sites in this dataset, 12 of which are rare glycans that contain xylose and hexuronic acid (Fig. 2e). Note that pGlycoNovo does not support site localization of glycans on HCD data. The public data we used do not include ETD spectra, which are necessary for differentiating between neighboring sites on a glycopeptide. Two annotated spectra illustrating an intact N-glycopeptide and an intact O-glycopeptide are presented in Fig. 2f and g, both featuring rare glycans. Other annotated spectra are provided in Supplementary Note 3 and Supplementary Data. It is evident that pGlycoNovo effectively utilizes the full-range Y-ion fragments, including those containing xylose and hexuronic acid.
This enables extensive interpretation of intact glycopeptides, especially those with rare glycans, in spectra that were previously challenging to decipher.

Extensive N-glycoproteomics across diverse model species with pGlycoNovo

To further show the potential of pGlycoNovo in identifying diverse glycan compositions, we extended its application to the analysis of site-specific N-glycans across five evolutionarily distant species, with over a billion years of divergence: plant (A. thaliana), worm (C. elegans), fly (D. melanogaster), zebrafish (D. rerio), and mouse (M. musculus). We employed the optimized LC-sceHCD-MS/MS methods in conjunction with both the glycan library-independent pGlycoNovo and the glycan library-based pGlyco3, generating an extensive N-glycoproteome dataset (Methods, Supplementary Note 5). This dataset comprises 32,549 site-specific N-glycans on 4,602 glycoproteins, with 643,045 glycopeptide-spectrum matches (GPSMs), all confidently identified at a 1% FDR at the intact glycopeptide level (Fig. 3a, Supplementary Data). We established the largest N-glycopeptide mass spectral dataset for the five species to date (Fig. 3b), and identified site-specific N-glycans in plant, worm, fly, and zebrafish for the first time on such a large scale, while also expanding the scale of site-specific N-glycans in mouse (Fig. 3c).

Fig. 3: N-Glycoproteome profiling with pGlycoNovo and pGlyco3 across five evolutionarily distant species.
a Overall workflow of intact glycopeptide profiling in five species. b Number of identified glycopeptide spectra in each species. c Number of identified glycoproteins and site-specific glycans in each species. d Contribution of glycopeptide spectra identified by pGlycoNovo and pGlyco3. e Contribution of site-specific glycans identified by pGlycoNovo and pGlyco3. f Classification of site-specific glycans in each species by glycan type.
Glycans were categorized into five groups based on their monosaccharide composition: oligomannose type, hybrid type, complex type, truncated glycans, and unclassified type, where “unclassified type” refers to rare glycans not included in existing glycan database-dependent search engines (Supplementary Note 4). g Distribution of specific monosaccharide-containing glycopeptides in each species. h An annotated spectrum of a glycopeptide with four fucoses. In the peptide sequence, “J” indicates the N-glycosylation site. The glycan symbols are as follows: green circle for Hex, blue square for HexNAc, and red triangle for fucose. i The workflows of the comprehensive 13C/15N isotopic-labeling strategy for the FDR validation. j Validation results from the isotopically labeled fission yeast and A. thaliana. The element-level error rate (incorrect number of N or C elements) of the identified glycopeptides was tested via the 15N-/13C-labeled precursor signals. Data are presented as mean values ± SD. Each bar represents the average value across biological replicates. Yeast experiments include three biological replicates, and plant experiments include two biological replicates. Each point represents an individual measurement, with error bars indicating the SD.

The level of overlap in the identification results between pGlyco3 and pGlycoNovo is illustrated in Fig. 3d and e. Across the five species, the two software tools overlapped in approximately 54.47% to 73.74% of GPSM and glycopeptide identifications (Supplementary Note 5, Supplementary Data). Remarkably, about 8.21–23.08% of the identifications were exclusively reported by pGlycoNovo across different species (Fig. 3e). The highest proportion of pGlycoNovo-only identifications was observed in worm, with 14.50% at the GPSM level (Fig. 3d) and 23.08% at the site-specific glycan level (Fig. 3e).
This suggests that within this dataset, a notable number of glycopeptides carry glycans not included in the glycan libraries, with C. elegans displaying the highest incidence of such rare glycans.

We then classified the site-specific glycans identified in the five species into five types, revealing varying degrees of unclassified glycans across the five species (Fig. 3f). Among these species, worms exhibit a dominant high-mannose glycan type comprising 67.32%, accompanied by the highest proportion of unclassified glycans, accounting for 18.73%. In contrast, mice show the lowest proportion of unclassified glycans at just 5.04%, while displaying a relatively higher prevalence of complex glycan types. The other three species exhibit approximately 5.98–7.84% unclassified glycans, with zebrafish and plants having a significant presence of high-mannose and hybrid/complex glycan types, while fly has the highest proportion of high-mannose glycans. The distribution of monosaccharides in site-specific N-glycans for each species is depicted in Fig. 3g. All species exhibit a high proportion of fucose-modified glycans, with plants reaching the highest at 61.43%. Xylose-modified glycans are prevalent in plants as well, accounting for 66.29%, which aligns with existing knowledge [37]. It is worth mentioning that we also observed small amounts of xylose-modified N-glycans in the other four species. Sialic acid was hardly detected in plant, worm, and fly, while zebrafish primarily contained NeuAc, and mouse exhibited both sialic acid types (Fig. 3g).
As demonstrated by our comparative analyses of glycopeptides identified by pGlyco3 and solely by pGlycoNovo, our understanding of glycan types in various species is enhanced by pGlycoNovo’s capability to discover rare glycans not recorded in databases (Supplementary Note 5).

A series of matched Y ions within the annotated spectrum of a multi-fucose glycopeptide illustrates the precise deciphering capabilities of pGlycoNovo for glycopeptide fragments (Fig. 3h, Supplementary Note 5). To further validate the reliability of pGlycoNovo, we conducted N-glycopeptide analysis on mixed, isotope-labeled samples and performed FDR analysis using the NaN ratio (Fig. 3i, Supplementary Note 6, Methods). The NaN ratio is calculated based on the MS intensity of an unlabeled glycopeptide and its 15N/13C-labeled counterpart. This strategy has previously been shown by us to be effective for glycopeptide FDR validation [15,28] and has also been used by other software tools [38]. In this study, we performed NaN ratio analyses on two different species: fission yeast (with an unlabeled/15N/13C sample ratio of 1:1:1) and plant (with an unlabeled/15N sample ratio of 1:1) (Supplementary Note 6). The results demonstrated that both pGlycoNovo and pGlyco3 reported NaN ratios below 1% for both species, confirming the reliability of pGlycoNovo identifications and indicating that the corresponding FDR is controlled below 1%.

Characteristics of site-specific N-glycoproteome in five evolutionarily distant species

With this extensive N-glycoproteome dataset, we performed statistical analyses to explore the diversity of glycosylation and enhance our understanding of glycan modification characteristics across the five species. Our analysis revealed that, within each species, a majority of N-glycoproteins (exceeding 50%) feature a single glycosylation site (Fig. 4a), and more than half of the glycosylation sites undergo multiple glycan modifications (Fig. 4b).
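
The two summary statistics just described can be computed from a flat list of site-specific identifications. The sketch below is illustrative only: the `(protein, site, glycan)` record layout, the function name, and the toy records are assumptions and are not taken from the study's data files.

```python
from collections import defaultdict


def heterogeneity(records):
    """Summarize macro- and microheterogeneity.

    records: iterable of (protein, site, glycan) tuples (hypothetical layout).
    Returns (fraction of proteins with a single glycosylation site,
             fraction of sites carrying more than one glycan).
    """
    sites_per_protein = defaultdict(set)
    glycans_per_site = defaultdict(set)
    for protein, site, glycan in records:
        sites_per_protein[protein].add(site)
        glycans_per_site[(protein, site)].add(glycan)

    single_site = sum(1 for sites in sites_per_protein.values() if len(sites) == 1)
    multi_glycan = sum(1 for glycans in glycans_per_site.values() if len(glycans) > 1)
    return (single_site / len(sites_per_protein),
            multi_glycan / len(glycans_per_site))


# Toy records, invented for illustration.
toy = [
    ("P1", 10, "H(5)N(2)"),
    ("P1", 10, "H(6)N(2)"),
    ("P2", 5, "H(5)N(2)"),
    ("P2", 77, "H(3)N(2)F(1)"),
    ("P3", 8, "H(9)N(2)"),
]
single_fraction, multi_fraction = heterogeneity(toy)
print(single_fraction, multi_fraction)
```

In the toy data, two of three proteins have a single site and one of four sites carries more than one glycan.
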
Such macro- and microheterogeneity is evident across all five species (Fig. 4a, b). In mice, the number of proteins containing multiple glycosylation sites and the number of sites carrying various glycans are the largest, indicating the highest level of diversity (Fig. 4a, b). In addition, the average size of glycan chains in mice is also the largest, followed by zebrafish, which displays a glycan size distribution similar to that of mouse (Fig. 4c). In contrast, worm and fly exhibited similar glycan size distributions, with evenly distributed chains composed of 6–12 monosaccharides, as well as approximately 10% smaller glycans. Arabidopsis, on the other hand, primarily featured glycans consisting of 8–9 monosaccharides (Fig. 4c).

Fig. 4: Characteristics of site-specific N-glycoproteome identified in five species.
a Distribution of singly and multiply glycosylated proteins in each species. b Distribution of the number of glycans at each site in each species. c Distribution of glycan size in each species. d Cellular localization of glycoproteins and the distribution of different glycan types on cellular localization in each species. e Recognition sequence motifs and the distribution of glycan types on different motifs in each species (NXS/NXT, where N is asparagine, X is any amino acid except proline, S is serine, and T is threonine; Others refers to N-X-Any motifs, where N is asparagine, X is any amino acid except proline, and Any represents any amino acid except serine and threonine). f Secondary structure localization and the distribution of different glycan types on secondary structures in each species. In Fig. 4d–f, the distribution of each glycan type was normalized by dividing the number of GPSMs for a glycan type within one category by the total number of GPSMs in that category (Supplementary Data). The secondary structure information in each species in Fig.
4f was obtained using a previously reported 3-state protein secondary structure prediction method [57] (Supplementary Note 7). g Correlation of overall site-specific glycosylation in five mouse tissues. h Correlation of overall site-specific glycosylation in three plant organs. In Fig. 4g, h, the correlation of tissue/organ-specific site-specific glycosylation in five mouse tissues and three plant organs is depicted using Pearson correlation analysis of the GPSMs in each species/organ. The numerical values in the heatmap represent the degree of correlation, with values closer to 1 indicating higher inter-tissue/organ correlation and values closer to 0 indicating lower inter-tissue/organ correlation.

The cellular localization analysis of the N-glycoproteome in the five species shows a degree of conservation, with glycoproteins predominantly located in the expected extracellular regions or in certain intracellular compartments such as the Golgi apparatus (Fig. 4d). This observation is in agreement with the conserved molecular machinery underlying N-glycosylation across diverse eukaryotes [39,40]. However, the characteristics of the glycan types within these predominant subcellular locations vary among the species (Fig. 4d). For instance, proteins with the complex glycan type in plants and mice are primarily localized on the membrane surface, while in worms they are mainly found in organelles such as the Golgi apparatus. Glycoproteins with fucosylated glycans in zebrafish are predominantly located within organelles, whereas in worms they are found both on the membrane and within organelles, and in plant, fly, and mouse they are distributed across different subcellular compartments. Xylose modifications discovered in flies and mice are mainly located in the endoplasmic reticulum, while in worms they are enriched on the membrane surface.
This variability may be linked to the specific functions of proteins unique to each species.

Further analyses of the relationship between site-specific glycosylation and peptide sequence reveal that N-glycosylation adheres to consistent and stringent topological constraints across these species (Fig. 4e, f, Supplementary Data). As shown in Fig. 4e, N-glycosylation occurs more frequently in motifs with threonine than with serine at the second position (Supplementary Data), and it is enriched in β-sheets while being depleted in α-helices across all organisms (Fig. 4f, Supplementary Note 7, Supplementary Data). These findings align with previous observations derived from glycosylation site data [41], confirming the presence of canonical glycosylation motifs and their structural localization characteristics. Additionally, we observe that glycan types and specific monosaccharide-containing glycans in these species also exhibit a remarkably consistent distribution pattern within these canonical motifs (Fig. 4e, Supplementary Data) and structural localizations (Fig. 4f, Supplementary Data). Our analysis not only validates the previous findings that the core N-glycosylation machinery modifies proteins in strict concordance with the sequence motifs and topological locations of the substrates [41], but also suggests that various N-glycosyltransferases, which determine glycan types, exhibit conservation in both sequence and structure throughout evolution.

Additionally, comparative analyses showed that the glycopeptides uniquely identified by pGlycoNovo exhibited similar glycosylation modification patterns across the five organisms, including the number of glycosylation sites on a protein, the number of glycans at a site, and the glycan size (Supplementary Note 5). For instance, regardless of whether identified by pGlycoNovo alone (Figure S.Note5.5–3), pGlyco3 (Figure S.Note5.5–2), or both (Fig.
4), the results indicate that mice exhibit the highest level of diversity, with the largest number of proteins containing multiple glycosylation sites and the largest number of sites carrying various glycans.

Furthermore, we investigated the correlation of site-specific glycosylation modifications within different organs of the same species. We performed intact glycopeptide profiling on five mouse tissues (brain, heart, kidney, liver, and lung) and three plant organs (bud, leaf, and seed). Pearson correlation coefficients were calculated for each pair of organs or tissues. Consistent with our previous findings [28], different mouse tissues displayed distinct glycosylation patterns. Specifically, brain tissue exhibited the most distinctive glycosylation profile, while heart and lung tissues resembled each other more closely than the other tissues (Fig. 4g). Among the plant organs, bud and leaf showed some similarity (correlation coefficient of 0.61), whereas seed displayed distinct patterns (correlation coefficients of 0.27 and 0.22 compared to bud and leaf, respectively) in site-specific glycosylation comparisons (Fig. 4h).

Unexpected glycan fragment detection in different samples

Building on the high performance and reliability of pGlycoNovo for comprehensive glycan fragment matching and de novo glycan analysis demonstrated in the previous sections, we observed a prevalent presence of unexpected glycan fragments during the analysis of intact glycopeptides across different samples and experimental sources.

To investigate this phenomenon, we utilized both pGlyco3 and pGlycoNovo to search glycopeptide data from different sample types and mass spectrometry analyses conducted in this study and in other published studies (Supplementary Note 8). Then, we annotated the matched Y ions on the co-identified set of GPSMs from both pGlycoNovo and pGlyco3, followed by a comparative analysis (Fig. 5a).
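
The comparison on co-identified GPSMs can be thought of as a set operation. The sketch below is not the analysis code used in the study: it assumes that each tool's output has been reduced to a dictionary from a GPSM key (for example, a spectrum identifier) to the set of matched Y-ion labels, and all names and toy values are hypothetical.

```python
from collections import Counter


def y_ion_summary(novo, lib):
    """Summarize Y ions on GPSMs identified by both tools.

    novo, lib: dict mapping GPSM key -> set of matched Y-ion labels
               (hypothetical layout for the de novo and library-based tool).
    Returns {ion: (fraction of co-identified GPSMs in which the de novo tool
                   matched the ion,
                   True if the library-based tool matched it in any of them)}.
    """
    shared = sorted(novo.keys() & lib.keys())  # co-identified GPSMs only
    if not shared:
        return {}
    novo_counts = Counter(ion for key in shared for ion in novo[key])
    lib_ions = set().union(*(lib[key] for key in shared))
    return {ion: (n / len(shared), ion in lib_ions) for ion, n in novo_counts.items()}


# Toy input, invented for illustration.
novo_hits = {
    "scan1": {"Y-H(1)N(2)", "Y-H(1)N(1)"},
    "scan2": {"Y-H(1)N(2)", "Y-F(1)"},
    "scan3": {"Y-H(1)"},              # not co-identified, ignored
}
lib_hits = {
    "scan1": {"Y-H(1)N(2)"},
    "scan2": {"Y-H(1)N(2)"},
    "scan4": {"Y-H(2)N(2)"},          # not co-identified, ignored
}
summary = y_ion_summary(novo_hits, lib_hits)
for ion, (fraction, also_in_lib) in sorted(summary.items()):
    print(ion, fraction, also_in_lib)
```

Ions flagged `False` in the second field correspond to those matched exclusively by the de novo search in this toy example.
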
Within the co-identified GPSMs, pGlycoNovo exclusively identified numerous Y ions (Supplementary Data). The Y-ion analysis in Fig. 5b illustrates that, as expected, Y ions containing pentasaccharide core fragments, such as Y-H(1)N(2), Y-H(2)N(2), and Y-H(3)N(2), or those containing core-fucosylated pentasaccharide core fragments, such as Y-N(1)F(1), Y-N(2)F(1), and Y-H(3)N(2)F(1), can be co-identified by pGlyco3 and pGlycoNovo with relatively high abundance. Significantly, pGlycoNovo uniquely identified a considerable portion of glycan fragments (Fig. 5b–e), many of which constituted unexpected fragment ions. Notably, the Y ions exhibiting multiple core-fucose compositions, such as Y-H(3)N(2)F(2), were exclusively identified by pGlycoNovo in the worm, aligning with the recognized multiple core-fucosylation in worms [42]. This further demonstrated the reliability of the Y ions extracted by full-range matching and graph-based filtration. Moreover, pGlycoNovo can also detect multiple core-fucose Y ions in other species (Fig. 5b, Supplementary Data). Furthermore, a noteworthy abundance of unexpected glycan fragment ions, such as Y-H(1)N(1), Y-H(1), and Y-F(1), could be identified by pGlycoNovo regardless of the sample type or mass spectrometry conditions used for analysis (Fig. 5b, Supplementary Data). Figures 5c and d illustrate the comprehensive and accurate matching of Y ions by pGlycoNovo, including the unexpected glycan fragment Y ions.

Fig. 5: Unexpected glycan fragments in the analysis of intact glycopeptides.
a The main pipeline for the analysis of Y ions in GPSMs. b Distribution of GPSMs containing specific pGlycoNovo-matched Y ions among co-identified GPSMs in each sample. c Annotated spectrum of a glycopeptide identified in mouse brain data (PXD025859 [25]) with low-energy HCD (HCD@20) fragmentation analysis. d Annotated spectrum of a glycopeptide identified in worm data with sceHCD (HCD@30 ± 10) fragmentation analysis.
e Proportion of GPSMs containing specific Y ions in a particular glycan type. f Proportion of GPSMs containing specific pGlycoNovo-matched Y ions among co-identified GPSMs in the plant (left), and the proportion of GPSMs containing specific Y ions in a particular glycan type (right). g Annotated spectrum of a glycopeptide identified in plant data. In panels b, e, and f, blue font indicates Y ions matched by both pGlycoNovo and pGlyco3, and fuchsia font indicates Y ions exclusively matched by pGlycoNovo. In panels c, d, and g, spectral peaks are magnified for better visibility; unexpected fragment ions matched by pGlycoNovo are highlighted in red dashed boxes. In the peptide sequence, “J” indicates the N-glycosylation site. The glycan symbols are as follows: green circle for Hex, blue square for HexNAc, red triangle for fucose, and yellow star for xylose.

It is worth noting that these unexpected glycan fragment ions consistently appeared in the spectra generated using both high- and low-energy HCD fragmentation (Fig. 5b–d, Supplementary Data), and the proportion of Y-H(1)N(1) is even higher at NCE = 20 (73.40%) than at NCE = 33 (61.18%) (Fig. 5b). Moreover, all types of glycans, including truncated ones, were found to produce unexpected Y ions (Fig. 5e, Supplementary Data). Interestingly, the proportion of specific unexpected Y ions varies among different glycan types. For example, Y-H(1)N(1) is more prevalent in the high-mannose glycan type, while Y-F(1) is more abundant in the complex glycan type. This pattern appears to be related to the monosaccharide composition of the glycan types and is observed across different species (Fig. 5e, Supplementary Data). This phenomenon is widespread and is also evident in xylose-containing glycans in plant, where Y-N(1)X(1) occurs at about 30% of the frequency of the non-rearranged Y-H(1)N(2)X(1) (15.53% versus 51.01%, Fig. 5f).
The occurrence frequency of xylose-containing unexpected Y ions is relatively consistent across different glycan types (Fig. 5f). This may be attributed to the fact that xylose is typically attached to the first mannose in any glycan type, resulting in a uniform probability of rearrangement. Figure 5g illustrates the matching quality of the unexpected xylose fragments deciphered by pGlycoNovo.

We suspect that these unexpected fragments result from glyco-rearrangement during MS analysis. Many noteworthy discoveries regarding glycan rearrangements during collisions have been reported, such as rearrangements of hexose [43], fucose [44,45,46], sialic acid [47], and xylose [48]. However, to the best of our knowledge, there are currently no convenient ways to extensively explore glycan rearrangements at the intact glycopeptide level. The full-range matching capability of pGlycoNovo plays a unique role in this context, detecting a wide range of rearrangements involving different monosaccharide combinations, including the rearrangement of large fragments such as Y-H(3)N(1) and Y-H(4)N(1) (Fig. 5c, Supplementary Data).

The wide range of glyco-rearrangement challenges the elucidation of glycan tree structures from fragment information, but it does not mean that structure interpretation from MS information is impossible, especially for N-glycans, for which the core and the possible branching structures are known in advance. However, it should be kept in mind that frequent glyco-rearrangement may result in false tree structure assignment of rare glycans when a glycan structure database is unavailable.
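
To make the notion of an "unexpected" Y ion concrete, the sketch below applies a simplified heuristic to Y-ion labels written as in the text. It is not part of pGlycoNovo. It assumes an N-glycan with the conserved core (HexNAc-HexNAc at the reducing end, followed by mannose), so that a Y ion formed by simple glycosidic cleavage can retain hexose only if both core HexNAc residues are present, fucose only if at least one HexNAc is present, and xylose only if the first mannose and both core HexNAc residues are present. The function names and rules are illustrative assumptions.

```python
import re


def parse_y_ion(label):
    """Parse a label such as 'Y-H(3)N(2)F(1)' into {'H': 3, 'N': 2, 'F': 1}."""
    return {unit: int(count) for unit, count in re.findall(r"([A-Z])\((\d+)\)", label)}


def is_expected_n_glycan_y_ion(label):
    """Return True if the composition is reachable by simple cleavage of an
    N-glycan with the conserved core (simplified, assumed rules)."""
    comp = parse_y_ion(label)
    n_hex, n_hexnac, n_fuc, n_xyl = (comp.get(unit, 0) for unit in "HNFX")
    if n_hex > 0 and n_hexnac < 2:      # mannose sits beyond both core HexNAc
        return False
    if n_fuc > 0 and n_hexnac < 1:      # fucose needs a HexNAc to attach to
        return False
    if n_xyl > 0 and (n_hex < 1 or n_hexnac < 2):  # xylose sits on the first mannose
        return False
    return True


labels = [
    "Y-H(1)N(2)", "Y-H(3)N(2)", "Y-N(1)F(1)", "Y-H(3)N(2)F(2)", "Y-H(1)N(2)X(1)",
    "Y-H(1)N(1)", "Y-H(1)", "Y-F(1)", "Y-N(1)X(1)", "Y-H(3)N(1)", "Y-H(4)N(1)",
]
for label in labels:
    status = "expected" if is_expected_n_glycan_y_ion(label) else "unexpected"
    print(label, status)
```

Under these assumed rules, the ions described in the text as unexpected are the ones flagged by the check. The heuristic only flags compositions that simple cleavage cannot explain; it does not by itself show that a rearrangement took place.
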
