Peptide clustering enhances large-scale analyses and reveals proteolytic signatures in mass spectrometry data

Ethics statementAll animal experiments are performed according to Swedish Animal Welfare Act SFS 1988:534 and were approved by the Animal Ethics Committee of Malmö/Lund, Sweden (permit number M131–16). The use of human wound materials was approved by the Swedish ethical review authority (etikprövningsmyndigheten application number 2023-05051-02). Written consent was received from all subjects prior to participation.Porcine wound samplesIn a previous study exploring the effects of the thrombin-derived antimicrobial peptide TCP-25, partial thickness wounds of Göttingen minipigs were either infected with S. aureus or P. aeruginosa26. The wounds were then dressed with a polyurethane dressing, which was changed 24 and 48 hours post infection, and the old dressings were collected for wound fluid extraction. The same dressing procedure was also performed for uninfected control wounds as well as double-infected wounds, which were first infected with S. aureus, and infected with P. aeruginosa after 24 hours, during the first change of polyurethane dressing. The dressing of the double-infected wound was also collected 72 hours post S. aureus infection26. The collected dressings were placed in syringes and soaked in 10 mM Tris (pH 7.4), before ejecting the fluids. Halt Protease Inhibitor Cocktail (Thermo Fisher Scientific, USA) was then added to half of the extracted wound fluid from each sample. All samples were then stored at −80 °C before use.Quantitative bacterial countsThe swabs and dressing fluid samples were diluted with sterile PBS to generate 7 10-fold serial dilutions from 10× to 107×. Six separate 10 μL drops of the undiluted sample and each of the dilutions were deposited on a Todd-Hewitt agar plate. The plates were incubated at 37 °C in 5% CO2 overnight. The next morning, the number of colonies was counted and recorded.Wound fluids from non-healing human leg ulcersWound fluid from patients with venous non-healing wounds was collected from Mepilex dressings applied on the wounds for 48–72 hours (trial registration number NCT05378997). The dressings were extracted as described above, and Halt Protease Inhibitor Cocktail added as above before storage at −80 °C.Identification of bacteria by MALDIColonies from wound swab samples were prepared using the extended direct transfer sample preparation procedure on stainless steel MALDI target plates as described by the manufacturer (Bruker Daltronik GmbH). A Microflex LT/SH SMART MALDI-TOF mass spectrometry (MS) instrument with flexControl v. 3.4 (Bruker Daltronik GmbH) was used to analyze the target plate and collect mass spectra in linear mode over a mass range of 2 to 20 kDa. A spectrum of 240 summed laser shots was acquired for each sample spot. The spectra were analyzed using a MALDI Biotyper (MBT) Compass v. 4.1 with the MBT Compass Library Revision L (DB-9607, 2020) (Bruker Daltronik GmbH).Sample preparation, mass spectrometry, and data processingWound fluid extracts with supplemented protease inhibitor had their protein concentrations measured using the Pierce BCA Protein Assay Kit (Thermo Fisher Scientific, USA) according to the provided instructions. A volume corresponding to 500 μg of protein for pig wound fluids and 100 μg of protein for human wound fluid was diluted in 10 mM Tris (pH 7.4) to 100 μl, and then further diluted in 300 μl 8 M urea (in 10 mM Tris (pH 7.4) to a final concentration of 6 M urea) supplemented with 0.067% RapiGest SF (Waters, USA) (to a final concentration of 0.05% RapiGest SF). The samples were then incubated for 30 minutes at room temperature. Meanwhile, Microcon–30 centrifugal filter units were rinsed with 100 μl 6 M urea (in 10 mM Tris (pH 7.4)) by centrifugation for 15 minutes at 10,000 × g at room temperature (RT). The samples were then loaded onto the filter units and centrifuged for 30 minutes at 10,000 × g at RT, followed by a final rinse with an additional 100 μl 6 M urea (in 10 mM Tris, pH 7.4) and 5 minutes of centrifugation at 10,000 × g at RT. The filtrate was then stored at −20 °C until LC-MS/MS analysis.In total, wound fluids from 47 pig wounds (17 wounds infected with S. aureus from 4 pigs, 17 wounds infected with P. aeruginosa from 4 pigs, 13 uninfected wounds from 4 pigs) at two time points (24 and 48 hours post infection) had their peptides extracted. In addition, wound fluids from 4 double-infected wounds from 1 pig at three different time points (24, 48 and 72 hours post S. aureus infection) also had their peptides extracted. 18 human wound fluids from 4 subjects had their peptides extracted.Extracted peptide samples were acidified by adding 1 μl 100% formic acid (FA) to 60 μl of peptide filtrate. Meanwhile, UltraMicro Spin Columns (The Nest Group, USA) were wet by adding 100 μl 100% acetonitrile (ACN) + 0.1% FA and centrifuging the column at 800 g for 1 minute at room temperature. These conditions were used for the remainder of the centrifugation steps of the solid phase extraction. The columns were then equilibrated by centrifuging 100 μl 2% ACN + 0.1% trifluoroacetic acid (TFA) through them twice. Samples were then spun onto the columns, followed by a washing step where 100 μl 2% ACN + 0.1% TFA was centrifuged through. The samples were then eluted by centrifuging 100 μl 70% ACN + 0.1% TFA through the columns. Once eluted, the samples were dried using an Eppendorf Concentrator plus at 45 °C and redissolved in 30 μl 2% ACN + 0.1% TFA.The redissolved peptide samples were loaded onto Evotip Pure columns according to the provided instructions, apart from that the loaded samples were dissolved in 30 μl 2% ACN + 0.1 % FA instead of 20 μl 0.1% FA. These were then analyzed by LC-MS/MS on an Evosep One LC (Evosep, Denmark) coupled to a timsTOF Pro mass spectrometer (Bruker, USA). The LC was equipped with an EV1137 Performance Column – 15 cm × 150 μm, filled with 1.5 μm ReproSil-Pur C18 beads (Evosep, Denmark), and separation was performed using the accompanying 30 samples per day program. The MS used the DDA PASEF mode, doing 10 PASEF scans every acquisition cycle. The accumulation and ramp times were both set to 100 ms. Precursors with a +1 charge were ignored and the target intensity was set to 20,000, with dynamic exclusion active, at 0.4 min. The isolation width was set to 2 at 700 Th and 3 at 800 Th.The data from the LC-MS/MS runs were searched with PEAKS X. UniProtKB reviewed (Swiss-Prot) protein list of pig proteins was used as a database when searching the pig samples, with the exchange of fibrinogen alpha chain (FIBA_PIG) and fibrinogen beta chain (FIBB_PIG) to the UniProtKB unreviewed (TrEMBL) versions F1RX36_PIG and F1RX37_PIG respectively (downloaded May 11th, 2023). When searching the human samples, UniprotKB reviewed (Swiss-Prot) protein list of human proteins was used as a database (downloaded September 29th, 2023). Data refinement was set to merge scans, correct precursor based on mass and charge states with charges between 1 and 4. It was also set to associate features with chimera scans and filter features between charges 2 and 8. During the search, the precursor tolerance was set to 20.0 ppm using monoisotopic mass and the fragment tolerance was set to 0.03 Da. Oxidation (M, +15.99) was treated as a possible modification, and a maximum of 1 modification per peptide was allowed. The search results were filtered at 1% FDR and at least 1 unique peptide for each protein. FDR was set to be estimated with decoy-function.The peptide intensities were ${\rm{lo}}{{\rm{g}}}_{2}$-transformed. Identified but unquantifiable peptides were imputed by sampling from a uniform distribution ${\rm{U}}\left(2,8\right)$ which is lower than the least abundant quantifiable peptides. The intensities were then mean-normalized so that all samples had equal intensity-means.ZymogramsZymogram gels were created with a separation gel consisting of 375 mM Tris buffer (pH 8.8), 0.1% (w/v) SDS, 0.1% (w/v) gelatine, 10% (w/v) acryl amide, 0.05% (w/v) TEMED and 0.05% (w/v) APS in Milli-Q water and a stacking gel consisting of 125 mM Tris (pH 6.8), 0.1% (w/v) SDS, 4% (w/v) acryl amide, 0.1% (w/v) TEMED and 0.05% (w/v) APS in Milli-Q water. For each sample, 5 μg of protein for porcine or 2 µg of wound fluid extract for human samples without added protease inhibitor was diluted to 5 μl with Milli-Q water, and then mixed with 5 μl sample buffer consisting of 400 mM Tris-HCl (pH 6.8), 20% (v/v) glycerol, 5% (w/v) SDS and 0.03% (w/v) bromophenol blue in Milli-Q water, which was then added to the wells. The gels were then run using an electrophoresis buffer consisting of 25 mM Tris, 200 mM glycine and 0.1% (w/v) SDS in Milli-Q water at pH 8.7 for 60 minutes at 150 V. Afterwards, the gels were washed with deionized water and incubated for 60 minutes in 2.5 % Triton X-100 at room temperature, with 160 rpm shaking, and followed by another deionized water wash. Next, the gels were incubated overnight at 37 °C in an enzyme buffer consisting of 50 mM Tris-HCl (pH 7.5), 200 mM NaCl, 5 mM CaCl2 and 1 μM ZnCl2 with 50 rpm shaking. The next day, the gels were washed in deionized water and incubated in a staining buffer consisting of 0.25% (w/v) Coomassie brilliant Blue G-250, 38.4% (v/v) ethanol and 7% (v/v) acetic acid in Milli-Q water for 60 minutes. The gels were then placed in a de-staining solution consisting of 9.6% (v/v) ethanol and 7% (v/v) acetic acid in Milli-Q and imaged using a Chemidoc MP Imaging System (Bio-Rad Laboratories, USA).SDS-PAGEFrom each wound fluid sample, 20 μg of protein with added protease inhibitor was mixed with Milli-Q water to 8 μl. 10 μl Tricine SDS Sample Buffer (2×) and 2 μl NuPAGE Reducing Agent (10×) was added to each sample. The samples were then incubated at 95 °C for 5 minutes. 10–20% Tricine gels and running buffer were prepared as described by the manufacturer’s instructions and were then run for 90 minutes at 100 V. Once the runs were finished, the gels were stained with Gelcode Blue Safe Protein Stain (Thermo Fisher Scientific, USA) according to the instructions provided by the manufacturer. Imaging was then performed using a Chemidoc MP Imaging System (Bio-Rad Laboratories, USA).Generating peptide clustersThe goal of generating peptide clusters is to group highly similar and proximate peptides into a single entity. Here, developed a fuzzy algorithm that utilizes networks and community detection to yield the most optimal grouping of peptides. This coincides with capturing the endopeptidase activity while filtering out the effect of exoproteases which generate clusters of highly similar peptides. The first step in generating peptide clusters entails generating weighted peptide networks, where peptides we consider to be part of the same cluster are strongly connected. In the second part, the peptide networks are partitioned by identifying communities within the network. These two steps work together to generate the optimal peptide clusters.The distance between each pair of peptides is calculated using a distance function, d. The distance function devised and applied here takes peptide overlap, centroid distance, and peptide length ratios into account. The rationale behind the choice of these variables is the following:

Peptides with a high degree of overlap are likely to belong to the same cluster.

Long peptides with high overlap, but a large distance between their centroids are unlikely to be from the same cluster.

Peptides with different lengths could belong to the same cluster, but should not be connected directly. If intermediate products are present, they will have an indirect connection, however, if not, they should not be connected, as they are most likely from different clusters.

The different parts of the distance function for peptides $i,j$ with starting positions ${s}_{i,j}$, end positions ${e}_{i,j}$ lengths ${l}_{i,j}$ and centroid positions ${c}_{i,j}$ are:$${d}_{{overlap}}\left({s}_{j},\,{e}_{i},\,{l}_{i},\,{l}_{j}\right)=1/\frac{{e}_{i}-{s}_{j}}{{l}_{i}+{l}_{j}-{e}_{i}+{s}_{j}}+\epsilon$$
(1)
$${d}_{{length}}\left({l}_{i},\,{l}_{j}\right)=\max (\frac{{l}_{i}}{{l}_{j}},\frac{{l}_{j}}{{l}_{i}})$$
(2)
$${d}_{{centroid}}\left({c}_{i},\,{c}_{j}\right)={\rm{|}}{c}_{i}-{c}_{j}{\rm{|}}$$
(3)
Where $\epsilon$ is a noise variable of ${10}^{-8}$. The complete distance is the sum of the factors:$$D={\lambda }_{1}{d}_{{overlap}}+{\lambda }_{2}{d}_{{length}}+{\lambda }_{3}{d}_{{centroid}}$$
(4)
where ${\lambda }_{1-3}$ are coefficients that allow for the tuning of the weighting of the individual terms. In our study, ${\lambda }_{1}={\lambda }_{2}={\lambda }_{3}=1$. The peptides are connected if:where $T$ is an arbitrary threshold that is purpose specific. In our data, an optimal threshold of 4 was identified empirically. Motivation for the choice of parameters and the implications of these on the clustering can be seen in Supplementary Notes 2 and Supplementary Fig. 2. A network is created for each protein separately. The created networks are partitioned into isolated components, with weaker connections between clusters with lowly overlapping components. To further partition the network, the Leiden community detection algorithm is applied.The Leiden community detection algorithm is highly similar to the more common Louvain detection algorithm but improves on it by guaranteeing well-connected communities31. These algorithms seek to maximize the modularity of the network, which is calculated as:$$Q=\mathop{\sum }\limits_{{ij}} \left({{A}_{{ij}}}-\gamma {k}_{i}{k}_{j}/2m \right) \delta ({\sigma }_{i},\,{\sigma }_{j})$$
(6)
where $A$ is the adjacency matrix, ${k}_{i}$ is the weighted degree of node $i$, m is the total sum of edge weights and $\delta \left({\sigma }_{i},\, {\sigma }_{j}\right)=1$ if $i$ and $j$ are in the same community. $\gamma$ is the resolution parameter that defines the expected number of communities. A greater resolution coefficient results in more communities. In our dataset, the optimal resolution factor was derived empirically and set to 0.8. This value is motivated in Supplementary Notes 2 and Supplementary Fig. 2.It was noted that if a protein contained regions of highly varying peptide densities, the algorithm tended to split clusters into high-density regions even though the peptides overlapped to a great extent. To correct this, a subsequent step was applied, where clusters were merged if the distance of each cluster termini was below the given threshold. Here, a threshold of 2 was used. Lastly, clusters with fewer than 3 peptides were removed from the dataset since these could not be quantified.Analyzing peptide clustersThe clusters can be quantified using different methods. We implemented two such methods, one in which it is quantified by the N most intense peptides, and one in which the N peptides exhibiting the largest difference metric, by DE score40, fold-change or p value, are used for quantification41. To investigate what clusters differed the most between the sample types, differential abundance analysis, and machine learning-based feature extraction were performed. For pair-wise differential abundance analysis, p values were calculated using linear regression, and Q values were computed by correcting for multiple hypothesis testing using the Benjamini-Hochberg procedure. When comparing more than two groups, p values were calculated using one-way ANOVA.To investigate if classification models could be used to distinguish between sample types, classification was performed with an XGBoost classifier. The classifier takes the z-scaled cluster intensities as input. The configuration for the XGBoost classifier was set to the default as per the implementation of the xgboost Python package. To investigate how many peptide clusters and which peptide clusters were important for classification, a modified recursive feature extraction scheme was implemented. Here, the feature matrix is iteratively reduced by removing the cluster with the lowest importance to the model. This iterative reduction of the feature matrix is conducted until a single feature remains. The feature importance is estimated using SHAP and is computed using the TreeExplainer as implemented in the SHAP Python package.It was hypothesized that the high-dimensional cluster space could be used to estimate the level of bacterial colonization in the double-infected wounds. Weighted kNN (k = 30) regression was used to compute the cluster peptidomic likeness of the double-infected samples to the samples infected by S. aureus and P. aeruginosa. The output was then correlated to the CFU fraction of S. aureus. Bootstrapping (bootstrap N = 10) was implemented by adding noise $\sim N\left(0,1\right)$ to the scaled features.The most probable cut site giving rise to each cluster is the most common terminal, i.e., the mode of the terminal positions for each cluster and sample. This has been motivated in Supplementary Notes 3 and Supplementary Fig. 3. To identify cut site specificity, windows spanning 8 amino acids surrounding these sites were considered (p4-p4’). The amino acid distribution for each sample type and position was weighted with the mean peptide intensity in each cluster. The influence of the pathogens on the cut sites was compared against the control. To quantify the dissimilarity between these distributions, the KL divergence was computed as follows:$${D}_{{KL}}={\sum }_{x\in X}p\left(x\right)\log \left(\frac{p\left(x\right)}{q\left(x\right)}\right)$$
(7)
Where $p\left(x\right)$ is the amino acid distribution for the sample type of interest and $q\left(x\right)$ for the sample type used as a comparison/background. For the single infections and double-infected samples, the control samples were used as the background distribution. The contaminated samples were compared against the S. aureus-infected samples, since this was the expected bacterial colonizer. The divergence was used to weigh the amino acid frequency at each position, before displaying them in a logoplot.The cut site amino acid frequency distribution is a matrix of dimensions $23\times 8$. Stacking this matrix into a vector of length $23\cdot 8$ and concatenating all samples results in a matrix of size ${N}_{{samples}}\times 184$. To investigate the proximity of cut sites the dimensionality of the matrix was reduced to two dimensions using UMAP. The UMAP was configured as per the default configuration implemented in the umap-learn Python package.Comparison between peptide-centric and cluster-centric workflows on the urine peptidomes of type 1 diabeticsTo investigate the generalizability of our method, we utilized a publicly available dataset from a study by Van et al.37, which we downloaded from ProteomeXchange (PXD012210). The dataset consists of mass spectrometry (MS/MS) peptidomic data from 30 urine samples, including 15 samples from patients with type 1 diabetes and 15 control samples. Parts of the original analysis by Van et al. were re-created, specifically Fig. 2e, f in their publication. P values were calculated using linear regression and corrected for multiple hypothesis testing using the Benjamini–Hochberg method.Clustering was performed with a resolution of 0.8 and a cutoff of 4, consistent with parameters used in other analyses within our study. Clusters were quantified using the top three peptides with the highest DE scores.The feature quality improvement was examined through an iterative scheme. The original peptide and cluster feature matrices were subsetted to include the top N features sorted by p value, starting with N = 3. The number of missing values for progressively larger feature matrices was identified. Additionally, we used the feature matrices to train 10 logistic regression classifiers on different stratified subsets of half the dataset and evaluated AUROC on the remaining halves. The logistic regression models were configured according to the defaults of sklearn 1.3.0. This iterative process continued until all features were included.Statistics and reproducibilityTo ensure that the findings were reproducible, a random subset of samples was re-analyzed. These samples were chosen in a stratified but randomized manner from both singly infected samples and uninfected controls collected on day 1, with four samples selected from each group. The entire sample preparation pipeline and mass spectrometry analysis were repeated, but this time, the sample annotations were blinded. The annotations were revealed after creating clusters and projecting the cluster quantities to two dimensions using UMAP. Since the analysis of the original samples, the mass spectrometry park has been upgraded from timsTOF pro to timsTOF HT.For zymogram analysis of porcine wound fluid samples from day 2; n = 2, for other wound fluid samples; n = 1. For zymogram analysis of human wound fluid samples, n = 2.Implementation and softwareThe complete bioinformatic analysis was conducted in Python 3.9. networkx and igraph were used to generate networks. The Leiden algorithm was used as implemented in the leidenalg-package. xgboost, scikit-learn, and shap were used for machine learning applications. The umap-learn package was used for dimensionality reduction. The processing package DPKS was used for quantification and differential abundance analysis42. A package for the creation of peptide clusters is available under an MIT license24. All illustrations were created in BioRender.com.Reporting summaryFurther information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Peptide clustering enhances large-scale analyses and reveals proteolytic signatures in mass spectrometry data

Poisoning medical knowledge using large language models

Semi-supervised recognition for artificial intelligence assisted pathology image diagnosis

Chemistry wordoku #062 | Puzzle

Multi-output prediction of dose–response curves enables drug repositioning and biomarker discovery

Quick chemistry crossword #055

Hot Topics

Poisoning medical knowledge using large language models

Semi-supervised recognition for artificial intelligence assisted pathology image diagnosis

Chemistry wordoku #062 | Puzzle

Related Articles

Balancing Act: Pregnancy and Bipolar Disorder

Cohesion at the cellular level: flexible yet stable

Gut bacteria influence responses to immunotherapy in patients with asbestos related cancer

Quick Links

Must Read

Poisoning medical knowledge using large language models

Semi-supervised recognition for artificial intelligence assisted pathology image diagnosis

Chemistry wordoku #062 | Puzzle

Multi-output prediction of dose–response curves enables drug repositioning and biomarker discovery

Popular Articles

Poisoning medical knowledge using large language models

Semi-supervised recognition for artificial intelligence assisted pathology image diagnosis

Chemistry wordoku #062 | Puzzle