Expanding N-glycopeptide identifications by modeling fragmentation, elution, and glycome connectivity

DatasetsWe demonstrated our method on two previously published stepped energy HCD glycopeptide datasets. The first dataset, originally published by10, was enriched glycopeptides from mouse brain (PXD005411), kidney (PXD005412), heart (PXD005413), liver (PXD005553), and lung (PXD005555) tissues, which we will refer to as the mouse tissue dataset as well as the fission yeast dataset (PXD005565) for the entrapment study, all collected using stepped collision energy of 20/30/40 nCE on an Thermo-Fisher Orbitrap Fusion Tribrid instrument. The second was originally published by ref. 31 and was enriched for sialic acid-containing glycopeptides from human serum (PXD005931) acquired using a Thermo-Fisher Scientific Q Exactive with a stepped collision energy of 15/25/35 nCE32. The process we used involved a sequential refinement, but at each stage, we used the same processed MS data, glycoproteome databases, and search parameters.LC-MS/MS PreprocessingWe downloaded raw data files for each dataset from PRIDE45 converted to mzML using ProteoWizard46, followed by peak picking, deisotoping and charge state deconvolution using GlycReSoft’s preprocessing tool14. The preprocessing tool averaged each MS1 scan with the preceding and following MS1 scan, did not apply background reduction, used a glycopeptide averagine (H15.75 C10.93 S0.02 O6.47 N1.65) for MS1 scans and a peptide averagine for MSn scans.Database constructionFor the mouse tissue dataset, we used the UniProt reference Mus musculus proteome UP00000058930 and extracted only those from SwissProt. We extracted the glycan list from pGlyco3’s large prebuilt mouse N-glycan database7,13 and simplified the entries from structures to compositions and combined it with a mammalian N-glycan biosynthetic simulation29 allowing NeuAc, NeuGc, and Gal-α-Gal terminal groups. We combined this proteome and glycome, allowing one glycosylation per peptide, generating peptides using a trypsin cleavage rule allowing up to two missed cleavages, and applied a constant carbamidomethyl modification at cysteine and variable oxidation modification at methionine. We generated decoy proteins by reversing complete protein sequences but retaining N-glycosylation sites at the disrupted sequons without introducing new sites as in ref. 7. For the entrapment study, we used the fission yeast reference Schizosaccharomyces pombe proteome UP00000248530. For the human serum datasets PXD005931 and PXD009654, we used the UniProt reference Homo sapiens proteome UP00000564030 and a human N-glycan biosynthetic simulation29 allowing only NeuAc terminal groups. For PXD005931, we included a short list of common mucin-type O-glycans as well.Search strategyWe followed the mass accuracy settings suggested in each dataset’s original publication but found that the number and type of adducts to consider were omitted or incomplete. For the datasets in the mouse tissue dataset and the yeast entrapment dataset, we allowed a 5PPM mass error tolerance for precursor ion matches and a 20PPM mass error tolerance for product ion matches, permitting up to two ammonium adducts. For the PXD005931 samples, we used 10PPM mass error tolerance for both precursor and product ion matches and also allowed one ammonium adduct14, and for PXD009654 we used a 10PPM precursor mass error tolerance. Our search strategy did not consider chimeric or co-isolating precursors, although when we compared results to pGlyco3, we included their chimeric solutions. GlycReSoft searches with identification methods were all subject to the same adduct deconvolution and retention time modeling procedures as described later for consistency.Base scoring modelWe built upon the GPSM scoring model and FDR estimation paradigm developed in pGlyco210. The scoring model used a linear mixture of peptide backbone and glycan structure evidence to score glycopeptides. The peptide score (Eq. (1)) was a mass accuracy weighted log-intensity summation, weighted by peptide sequence coverage (to exponent γ). The glycan score (Eq. (2)) followed the same pattern, save that the glycan coverage is broken into two terms, where the coverage along the entire topology is given one exponential weight α, while the coverage of the conserved N-glycan core is given an additional exponent β. The two components are combined by a linear mixing weight w. Because the peptide and glycan scores were retained, the same mixture model-based FDR estimation procedure is applicable, allowing us to do a direct comparison with the results published in ref. 7. We used α = 0.5, β = 0.4, γ = 1 and w = 0.65 for all variations of this scoring model, selectable as log_intensity_reweighted from the CLI.$${{{{{{\rm{Score}}}}}}}_{P}(\gamma,\, {{{{{\rm{tol}}}}}})=\left[\sum\limits_{i}^{{m}_{p}}\log ({{{{{{\rm{inten}}}}}}}_{i})\times \left(1-{\left| \frac{{{{{{{\rm{ppm}}}}}}}_{i}}{{{{{{\rm{tol}}}}}}}\right| }^{4}\right)\right]\times {{{{{{\rm{coverage}}}}}}}_{P}^{\gamma }$$
(1)
$${{{{{{\rm{Score}}}}}}}_{G}(\alpha,\beta,{{{{{\rm{tol}}}}}})= \left[\sum\limits_{i}^{{m}_{g}}\log ({{{{{{\rm{inten}}}}}}}_{i})\times \left(1-{\left| \frac{{{{{{{\rm{ppm}}}}}}}_{i}}{{{{{{\rm{tol}}}}}}}\right| }^{4}\right)\right]\times {{{{{{\rm{coverage}}}}}}}_{G}^{\alpha }\\ \times {{{{{{\rm{coverage}}}}}}}_{G,core}^{\beta }$$
(2)
We added a precursor mass accuracy bias (Eq. (3)) with μpre = 0 and σpre = 5 ppm to prefer solutions with better precursor mass matches, given equal fragmentation evidence, exploited in ref. 14. We also included a penalty when a signature ion is expected for a monosaccharide but not observed or when a signature ion is observed without being expected (Eq. (6)) similar to ref. 47, in this work only common sialic acids were considered, but other abundant modified monosaccharides are also available. This complete scoring function is expressed in Eq. (7).$${{{{{\rm{MassAcc}}}}}}({{{{{{\rm{ppm}}}}}}}_{pre},\, {\mu }_{pre},\, {\sigma }_{pre})=-10\,{\log }_{10}\left(1-\exp \left\{-\frac{{({{{{{{\rm{ppm}}}}}}}_{pre}-{\mu }_{pre})}^{2}}{2{\sigma }_{pre}}\right\}\right)$$
(3)
$${{{{{\rm{UnexpIon}}}}}}\, (o)=10\,{\log }_{10}\left(1-\frac{{{{{{{\rm{inten}}}}}}}_{o}}{{\,\!}_{\max} {{{{{\rm{inten}}}}}}}\right)$$
(4)
$${{{{{\rm{MissIon}}}}}}\, (o)=10\,{\log }_{10}\left(1-\min \left\{o/2,0.99\right\}\right)$$
(5)
$${{{{{\rm{SignIon}}}}}}=\sum\limits_{o\in (g[{{{{{\rm{NeuAc}}}}}}],g[{{{{{\rm{NeuGc}}}}}}])}\left\{\begin{array}{ll}{{{{{\rm{UnexpIon}}}}}}(o)\quad &o=0 \hfill \\ {{{{{\rm{MissIon}}}}}}(o) \hfill \quad &o > 0\,{{{{{\rm{and}}}}}}\,\frac{{{{{{{\rm{inten}}}}}}}_{o}}{{\,\!}_{\max} {{{{{\rm{inten}}}}}}}\le 0.01\\ 0 \hfill \quad &{{{{{\rm{otherwise}}}}}} \hfill \end{array}\right.$$
(6)
$${{{{{{\rm{Score}}}}}}}_{GP}(\alpha,\, \beta,\, \gamma,\, {{{{{\rm{tol}}}}}},\, w)= w\times {{{{{{\rm{Score}}}}}}}_{P}(\gamma,\, {{{{{\rm{tol}}}}}})+(1-w)\times {{{{{{\rm{Score}}}}}}}_{G}(\alpha,\, \beta,\, {{{{{\rm{tol}}}}}})+\\ {{{{{\rm{SignIon}}}}}}()+{{{{{\rm{MassAcc}}}}}}({{{{{{\rm{ppm}}}}}}}_{pre},0,5\times 1{0}^{-6})$$
(7)
Glycan coverage approximationOne advantage of pGlyco’s method is that it is able to compute a formal coverage ratio for the glycan component by using the peptide+Y ion ladder and an exact enumeration of the theoretical fragments for each of their glycan topologies. This comes at the cost of requiring a topology for each glycan to be searched, expanding the search space to consider, despite lacking diagnostic fragmentation to discriminate between most topological isomers. We introduce a method for approximating the total number of theoretical fragments a glycan composition could generate if its monosaccharides were arranged in a tree structure. N-glycans are branching structures, similar to binary trees. The height of a balanced binary tree with n nodes is ${\log }_{2}n$. Because peptide+Y fragment generation often involves cleavage events along multiple branches, we can assume an upper bound for fragments of a binary tree to be $n{\log }_{2}n$. N-glycans are not truly binary trees: the unfucosylated core motif’s root node has a single child node, suggesting the upper bound $\frac{n}{2}{\log }_{2}n$ for N-glycans without core fucosylation or xylosylation. Beyond the first fan-out from the core motif, N-glycans are usually linear, causing the $n{\log }_{2}n$ approximation to be too harsh, especially for large glycans. A change to the natural log $n\log n$ turns out to be a close bound for small glycans and forgiving of large glycans which an exact coverage-based method is more stringent for. A comparison of the different rates of growth and divergence is shown in Fig. 4. This allows us to generate a coverage metric for glycan compositions, letting us use a more compact glycan composition database rather than a glycan structure database.Fig. 4: Glycan composition fragment counting approximation.The number of distinct mass fragments produced by fragmenting a known topology of N-glycan of a given size, enclosed by our two approximation proposals, with and without the presence of core fucosylation. Each labeled vertical line denotes a class of complex-type N-glycan of ascending number of lactosamine units beyond the core motif, which are arranged as separate branches. The immature high mannose from Man9 to Man4 are also shown, which fall below the approximation due to the homogeneity of their Y fragments.We generated semi-structured peptide+Y ion ladders from glycan compositions by explicitly generating fragments assuming that a core motif is present, marking these fragments as core fragments, possibly with deoxyhexose or pentose side-chain, and then adding every combination of remaining non-labile monosaccharides to the core motif, including biosynthetically improbable ones. We treated NeuAc and NeuGc as labile. To calculate glycan coverage, we first approximated the “size” of a glycan composition as the number of monosaccharide residues in the glycan composition minus the number of NeuAc and NeuGc, and deduct one if two or more dHex/Fuc residues are present, calling the final number ng. We then used the approximation shown in Fig. 4, computing the normalizing factor dg (Eq. (9)).$${n}_{g}=\left\vert g\right\vert -g[NeuAc]-g[NeuGc]-(g[dHex]\, > \, 1)$$
(8)
$${d}_{g}=\max \left(\left\{\begin{array}{ll}{n}_{g}\log {n}_{g}\quad \hfill&g[dHex] \, > \, 0\\ \frac{1}{2}{n}_{g}\log {n}_{g}\quad &g[dHex]=0\end{array}\right.,\, {n}_{g}\right)$$
(9)
Where g[m] is the number of monosaccharide m units in g and $\left\vert g\right\vert$ is the cardinality of g, the number of discrete monosaccharide residues in the glycan composition. The coverageG,core is readily calculable as our algorithm explicitly enumerates the core fragments, while coverageG is the number of distinct peptide+Y fragments matched divided by dg. Both target and decoy glycans were treated the same way, save that decoy glycans’ peptide+Y fragments beyond Y1 were given a random mass shift between 1.0 and 30.0 Da drawn from a uniform random distribution. We treated O-glycans identically, noting that for ng≤ 7, $\frac{1}{2}{n}_{g}\log {n}_{g} < {n}_{g}$, so by Eq. (9)dg = ng. Many individual mucin O-glycans are less than 7 monosaccharides. No glycans considered in our O-glycome were above this threshold.MS2 FDR EstimationWe estimated FDR as described in14, except that the peptide FDR estimation procedure uses a semi-supervised linear SVM as described in48,49 using scikit-learn37. For the base scoring model, the features used are the peptide score (Eq. (1)) and peptide coverage (CoverageP), and for the fragmentation prediction scoring model, the features used are the peptide score (Eq. (12)), the peptide coverage (CoverageP), and the peptide fragment correlation $Co{r}_{P}^{{\prime} }(\theta )$ in Eq. (11). We explicitly did not calculate the glycan FDR for glycan compositions of size 3 or smaller as these produce too few peptide+Y ions to be meaningful.Following MSn FDR estimation, we localized all PTMs, including glycans, using an implementation of PTMProphet50. For each spectrum, all inferior localization solutions were removed from consideration for subsequent steps of aggregation and analysis.MS1 ScoringAfter MSn spectra were assigned and FDR-controlled, we extracted all deconvoluted MS1 peaks from the processed MS data file and constructed MS1 features as described in ref. 26, save that the charge state component is set to a constant 0.8 and no adduct scoring was performed.Adduct deconvolution and retention time modelingSee supplementary section 11.1 for details on the adduct deconvolution process and subsequent retention time model building and online glycan composition revision process.Glycopeptide fragmentation modelingInter-peak relationshipsGlycopeptide fragmentation is complex, including multiple charge states for the same theoretical fragment and the presence of both glycosylated and unglycosylated versions of peptide backbone fragments occurring interspersed. Many others20,21,51 have demonstrated that a peak-fragment ion match, which is supported by related peak-fragment ion matches is worth more than a peak matched in isolation. We used a Bayesian probability model based on UniNovo’s21. In addition to the link features described in the original method, we added a mass difference feature for HexNAc (203.0794 Da) for peptide backbone fragments as well as for HexNAc, Hexose (162.0528 Da) and dHex (146.0579) for peptide+Y ion series matches. We did not include neutral losses of NH3 or H2O and iterative refinement here, though they may be useful for future work. During training, we did not consider any peaks with an intensity rank below 1 but set no restriction on rank during inference.UniNovo models multiple partitions over theoretical precursors independently by precursor mass as a proxy for peptide length assuming that fragmentation propensities for these structures will differ. As glycopeptides have both a peptide and a glycan component and a larger range of charge states than bare peptides, we use a multi-dimensional partitioning by peptide length, glycan size, precursor charge, precursor proton mobility52, and the type and number of occupied glycosylation sites, with the ranges defined in supplementary Table 5. This produces up to 150 partitions per glycosylation type, though not all are expected to be populated.We extracted GPSMs passing a 1% FDR threshold and a total score threshold of 20 from all samples in each dataset, converted to an annotated MGF format. For the mouse tissue dataset, we chose to reserve the brain tissue subset as a test set and fit our model on the remaining tissue types to demonstrate model performance and ability to generalize. For the human serum dataset, there was no obvious distinction between samples, so we used all samples for training to demonstrate the effect of whole dataset modeling for a large study. We partitioned GPSMs according to the rules described above, though in order to smooth over small numbers of observations in some groups, we mixed adjacent charge state groups while holding all other constraints constant, and fit our model for each ion series.For each glycopeptide fragment match f we compute the posterior probability of that peak using its series and the set of unique peak-pair features, which we term the reliability of the fragment match ϕf.Peak intensity predictionA glycopeptide under collisional dissociation fragments in both the glycan and the peptide, with preference to weaker bonds breaking with greater frequency, subject to the physio-chemical properties of the molecule and the activation energy used53,54. Prediction of whether a fragmentation event should generate an abundant peak has repeatedly been used as a method for improving peptide identification23,25,51,52.The mobile proton hypothesis is a widely accepted kinetic model of fragmentation for protonated peptides53,55 which has been used to create many peptide fragmentation prediction algorithms51,52. Unsurprisingly, glycopeptide fragmentation depends on mobile protons as well, driving very different abundance patterns of fragmentation depending upon charge state6,54. We used the proton mobility classification scheme described in ref. 52, where the number of K, R, and H are compared to the precursor ion’s charge, where if the sum is greater than the charge, the precursor is immobile, if equal, the precursor is partially mobile, or less than, the precursor is mobile. We included this observation in the partitioning scheme we derived from UniNovo earlier and applied the same partitioning scheme when modeling relative intensities.We modeled the relative intensity of a fragmenting glycopeptide as a probability drawn from a multinomial distribution parameterized by a set of features listed in supplementary Table 6. The features chosen were based upon the approaches described in refs. 51,52. It has been made clear that more sophisticated modeling techniques may be applied to this type of problem23,25, but they lack interpretability and require substantial numbers of observations to train.For each partition of the training data, without allowing sharing between adjacent charge states, we estimated the parameters of the multinomial distribution from glycopeptide spectrum matches using iteratively re-weighted least squares, weighted by the total signal in each spectrum, with the individual peaks weighted by the reliability ϕ using the peak relation model for that partition, or 1 if this lead to an unstable solution. For each partition, we fit a model on all of the GPSMS for predicting peptide backbone fragment intensities. Next, we split the GPSMS into two groups based on whether the most abundant peptide+Y ion series ascends or descends in intensity as the glycan composition grows in size, and fit a model on each for use when predicting peptide+Y fragment intensities for observed matches that show the same size-intensity trend. After this step, each partition contained a peak relation model ϕ, a peptide relative intensity model, and two glycan relative intensity models θ.Integrating fragmentation modeling into scoring functionsWe incorporated peak relation-based reliability and peak intensity prediction into the glycopeptide scoring model’s two moiety-specific branches. For each theoretical GPSM, we found the appropriate partition’s models, or the nearest partition if it were missing.The prediction-enhanced scoring model extended the base scoring model with new components. For the peptide score shown in Eq. (12) the fragment-level reliability ϕ channels weight away from lower confidence peaks, diminishing their influence on the score, while the correlation term CorP(θ) gives a benefit to matches that match more fragments while still correlating with the intensity model (Eq. (10)). For Eq. (10), we added a shifted Pearson correlation ρ of the observed and predicted intensities of peptide fragments scaled by ${\log }_{10}{m}_{p}$ where mp is the number of peptide fragments matched to prefer solutions that match more peaks at the detriment of worse correlation, given the weak average correlation the peptide model has. We also found that the total reliability of all peaks matched was nearly as useful as the observed intensity itself. During FDR estimation, we explicitly include Eq. (11), which scales linearly with peptide backbone fragments normalized by coverage instead of logarithmically which we found provided better separation between targets and decoys. If the peptide correlation were improved, the correlation shift wouldn’t be needed, but the mp scaling factors might still be necessary because of the partial fragmentation of peptide backbones under SCE. Under different dissociation conditions that preserve more intermediate glycan fragmentation attached to the peptide backbone fragments, such as ETD or EThcD, it would require revisiting how reliabilities interact between peptide backbone fragments with partially fragmented glycans attached.$$Co{r}_{P}(\theta )=\left(\rho \left({{{{{{\rm{inten}}}}}}}_{P},\theta \left({{{{{{\rm{gpsm}}}}}}}_{P}\right)\right)+1\right)\times{\log }_{10}{m}_{p}$$
(10)
$$Co{r}_{P}^{{\prime} }(\theta )=\left(\rho \left({{{{{{\rm{inten}}}}}}}_{P},\theta \left({{{{{{\rm{gpsm}}}}}}}_{P}\right)\right)+1\right)\times{m}_{p}\times {{{{{{\rm{coverage}}}}}}}_{P}$$
(11)
$${{{{{{\rm{ScoreModel}}}}}}}_{P}(\gamma,\, {{{{{\rm{tol}}}}}},\, \theta )=\left(\left[\sum\limits_{i}^{{m}_{p}}{\log }_{10}({{{{{{\rm{inten}}}}}}}_{i})\times \left(1-{\left| \frac{{{{{{{\rm{ppm}}}}}}}_{i}}{{{{{{\rm{tol}}}}}}}\right| }^{4}\right)\right.\right.\\ \left.\left. \times ({\phi }_{i}+1)+{\phi }_{i}\right]+Co{r}_{P}(\theta )\right)\times {{{{{{\rm{coverage}}}}}}}_{P}^{\gamma }$$
(12)
The model glycan score, shown in Eq. (17) is more complex. For peptide+Y ions, the reliability ϕ and CorG(θ) are consistently larger but could be skewed for observations with a small number of matched fragments but strong agreement with the model. To counter this, we scaled these quantities by the number of non-trivial glycan fragments that have been observed, Eq. (13), sigfragg. We observed scenarios where the model chose different peptide backbones to favor low proton mobility when incomplete fragmentation was available, often abandoning abundant peptide backbone fragments to do so. Given that the glycan-describing fragments depend upon the peptide sequence being correct via the peptide backbone composition features, but do not themselves help define the peptide backbone beyond the crude mass aggregate, we limited the model’s contribution to the glycan score by scaling it by Eq. (14), a function of peptide coverage s(coverageP), where s was chosen from a family of exponential functions clamped to the [0, 1] range within the domain of [0, 1]. Similarly, we limited the contribution of the reliability by glycan correlation to prevent that term from selecting worse solutions that happen to match an extra random peak as shown in Eq. (15) where the total reliability is scaled down by a function of the glycan intensity correlation. Because reliabilities are often quite high for glycan fragments, we did not scale intensity by reliability, letting us factor the term out of the first sum and scale it prior to combining it with the base score. If glycan fragmentation were more sparse, it might be favorable to change this to follow the pattern used in Eq. (12).$${{{{{{\rm{sigfrag}}}}}}}_{g}=\sum\limits_{i}^{{m}_{g}}{{{\mbox{fragment glycan size}}}}_{i} \, > \, 1$$
(13)
$${{{{{\rm{pad}}}}}}(x)=0.5\times x+0.5$$$$s(x)=\min \left(\exp (3\times x-1),1\right)$$
(14)
$$Re{l}_{G}(\theta )=\sum\limits_{i}^{{m}_{g}}{{{{{\rm{pad}}}}}}({\phi }_{i})\times {{{\rm{max}}}} \, \left(\rho ({{{{{{\rm{inten}}}}}}}_{G},\, \theta ({{{{{{\rm{gpsm}}}}}}}_{G})),\, 0.25\right)$$
(15)
$$Co{r}_{G}(\theta )=\frac{\rho ({{{{{{\rm{inten}}}}}}}_{G},\, \theta ({{{{{{\rm{gpsm}}}}}}}_{G}))+1}{2}$$
(16)
$${{{{{{\rm{ScoreModel}}}}}}}_{G}(\alpha,\, \beta,\, {{{{{\rm{tol}}}}}},\, \theta )= \left(\left[\sum\limits_{i}^{{m}_{g}}{\log }_{10}({{{{{{\rm{inten}}}}}}}_{i})\times \left(1-{\left| \frac{{{{{{{\rm{ppm}}}}}}}_{i}}{{{{{{\rm{tol}}}}}}}\right| }^{4}\right)\right]+\right.\\ \left. \left[Re{l}_{G}\left(\theta\right)+Co{r}_{G}\left(\theta \right)\right]\times {{{{{{\rm{sigfrag}}}}}}}_{g}\times s\left({{{{{{\rm{coverage}}}}}}}_{P}\right)\right)\times \\ \hspace{1em}{{{{{{\rm{coverage}}}}}}}_{G}^{\alpha }\times {{{{{{\rm{coverage}}}}}}}_{G,core}^{\beta }$$
(17)
We combine Eqs. (12) and (17) in a linear mixture in Eq. (18), along with two auxiliary scores to help select a more parsimonious solution.$${{{{{{\rm{ScoreModel}}}}}}}_{GP}(\alpha,\, \beta,\, \gamma,\, {{{{{\rm{tol}}}}}},\, w,\, \theta )= w \times {{{{{{\rm{ScoreModel}}}}}}}_{P}(\gamma,\, {{{{{\rm{tol}}}}}},\, \theta )+(1-w)\times \\ {{{{{{\rm{ScoreModel}}}}}}}_{G}(\alpha,\, \beta,\, {{{{{\rm{tol}}}}}},\, \theta )+{{{{{\rm{SignIon}}}}}}()+\\ {{{{{\rm{MassAcc}}}}}}({{{{{{\rm{ppm}}}}}}}_{pre},\, 0,\, 5\times 1{0}^{-6})$$
(18)
Site-specific glycome network smoothingWe and others have shown that it is useful to exploit the relationships between biosynthetically nearby glycans when evaluating glycan confidence26,56. Briefly, the method described in ref. 26 constructs a graph over the glycome, where nodes are glycan compositions and edges are addition/removal of a monosaccharide, and divides it into biosynthetic neighborhoods. Next, it maps observed glycan compositions onto that graph and estimates coefficients τ of those biosynthetic neighborhoods and the glycans within them, both observed and unobserved, including a mechanism by which information from observed glycans ϕ, which we will refer to as u here as ϕ is already used, may be propagated to unobserved ones via a smoothing parameter λ. In effect, ug is a function of the observed score for previously seen glycan composition g, the weighted degree of g in the glycome graph, and a weighted inner product over the subset of ${A}_{g}^{t}\cdot {{{{{\boldsymbol{\tau }}}}}}$ that g overlaps, scaled by λ.We extended the method for released glycans we presented in ref. 26 to the glycoforms observed at each glycosite of a glycoprotein, using high-confidence identified glycopeptides spanning those sites, treating each site independently. For all datasets, only glycopeptides with MS1 scores > 0 and passing a joint FDR threshold of 1% were considered. For the mouse tissue dataset, we fit site-specific glycome models for each tissue type separately. For the human serum datasets, we used the full dataset to estimate site-specific glycome models. The human serum dataset used the neighborhood rules found in ref. 26, while the mouse tissue datasets used an extended set of neighborhood bounds to take into account biosynthetic pathways absent in humans and other old-world primates1. For more details on the site-level parameter estimation process, see supplementary Section 11.5.Integrating network smoothing into scoring functionsTo incorporate the u into the search process, when we generated theoretical glycopeptides, we determined their protein and spanned glycosylation site(s) of origin and looked up the appropriate ug for that glycopeptide’s glycan g from that site’s model. Decoy glycans were treated identically to their target counterparts, given the same value for ug. Decoy peptides are looked up against decoy proteins whose site models are the same as their target counterparts’ reflected along the reversed sequence to align with the same residue and also behave identically.We incorporated ug into the scoring function by adding it to the total evidence before being scaled by glycan coverage, as shown in Eqs. (20) and (19), where ug = 0 if no site model was fit for that glycosite. We applied this technique to only N-glycosylation sites no models were fit for O-glycosylation sites.$${{{{{{\rm{ScoreSmoothed}}}}}}}_{G}(\alpha,\, \beta,\, {{{{{\rm{tol}}}}}})= \left(\left[\sum\limits_{i}^{{m}_{g}}\log ({{{{{{\rm{inten}}}}}}}_{i})\times \left(1-{\left| \frac{{{{{{{\rm{ppm}}}}}}}_{i}}{{{{{{\rm{tol}}}}}}}\right| }^{4}\right)\right]+{u}_{g}\right)\\ \times \ {{{{{{\rm{coverage}}}}}}}_{G}^{\alpha }\times {{{{{{\rm{coverage}}}}}}}_{G,core}^{\beta }$$
(19)
$${{{{{{\rm{ScoreModelSmoothed}}}}}}}_{G}(\alpha,\, \beta,\, {{{{{\rm{tol}}}}}},\, \theta )= \left(\left[\sum\limits_{i}^{{m}_{g}}{\log }_{10}({{{{{{\rm{inten}}}}}}}_{i})\times \left(1-{\left| \frac{{{{{{{\rm{ppm}}}}}}}_{i}}{{{{{{\rm{tol}}}}}}}\right| }^{4}\right)\right]+\right.\\ \left[Re{l}_{G}(\theta )+Co{r}_{G}(\theta )\right]\times Signi{f}_{G}\times s({{{{{{\rm{coverage}}}}}}}_{P}) \\ + \left.{u}_{g}\right)\times {{{{{{\rm{coverage}}}}}}}_{G}^{\alpha }\times {{{{{{\rm{coverage}}}}}}}_{G,core}^{\beta }$$
(20)
Libraries usedGlycReSoft is written in Python and Cython57 and uses NumPy58, SciPy59, MatPlotLib60, and scikit-learn37. Library versions used in this publication are listed in Table 1.Table 1 Software versionsReporting summaryFurther information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Expanding N-glycopeptide identifications by modeling fragmentation, elution, and glycome connectivity

Turbocharging protein binding site prediction with geometric attention, inter-resolution transfer learning, and homology-based augmentation | BMC Bioinformatics

Zero-shot transfer of protein sequence likelihood models to thermostability prediction

Poisoning medical knowledge using large language models

Semi-supervised recognition for artificial intelligence assisted pathology image diagnosis

Chemistry wordoku #062 | Puzzle

Hot Topics

Turbocharging protein binding site prediction with geometric attention, inter-resolution transfer learning, and homology-based augmentation | BMC Bioinformatics

Zero-shot transfer of protein sequence likelihood models to thermostability prediction

Poisoning medical knowledge using large language models

Related Articles

Balancing Act: Pregnancy and Bipolar Disorder

Cohesion at the cellular level: flexible yet stable

Gut bacteria influence responses to immunotherapy in patients with asbestos related cancer

Quick Links

Must Read

Turbocharging protein binding site prediction with geometric attention, inter-resolution transfer learning, and homology-based augmentation | BMC Bioinformatics

Zero-shot transfer of protein sequence likelihood models to thermostability prediction

Poisoning medical knowledge using large language models

Semi-supervised recognition for artificial intelligence assisted pathology image diagnosis

Popular Articles

Turbocharging protein binding site prediction with geometric attention, inter-resolution transfer learning, and homology-based augmentation | BMC Bioinformatics

Zero-shot transfer of protein sequence likelihood models to thermostability prediction

Poisoning medical knowledge using large language models