Chemoproteogenomic stratification of the missense variant cysteinome

Variant peptide identification enabled by MSFragger two-stage database search and false discovery rate (FDR) estimation

To enable chemoproteogenomic identification of SAAV-containing peptides, we established a customized proteogenomics pipeline (Fig. 1A). Motivated by a prior report3 demonstrating that proteogenomic searches performed with sample-specific databases both improved coverage (~45% more variants) and decreased rates of SAAV peptide false discovery, we generated a cell line-specific variant peptide database from HEK293T RNA-seq data (Fig. 1A, Supplementary Fig. 1 and Supplementary Data 1). Next, to reduce the likelihood that a variant peptide will be mismatched to wild-type spectra17, we established a two-stage database search and FDR control scheme (Fig. 1A), using an MSFragger45,46 command line pipeline within the FragPipe computational platform. In this strategy, the first search of acquired MS/MS spectra is performed against a reference database of canonical protein sequences. Subsequently, spectra with peptide-spectrum matches (PSMs) identified with high confidence (e.g., passing 1% FDR) are removed, and the remaining spectra are then searched against a variant-containing, sample-specific database.

Fig. 1: Establishing an MSFragger-search pipeline for variant peptide identification. A Two-stage FDR MSFragger-enabled variant searches: variant databases are generated from non-redundant reference protein sequences that are in-silico mutated to incorporate sequencing-derived missense variants, followed by two-stage FDR MSFragger/PeptideProphet search to identify confident variant-containing peptides. First, raw spectra are searched against a normal reference protein database, confidently matched spectra (passing 1% FDR) are removed, and the remainder of spectra are searched with a variant tryptic database. B Chemoproteomics workflow to validate heavy and light biotin47. 
HEK293T cell lysates were labeled with pan-reactive iodoacetamide alkyne (IAA) followed by ‘click’ conjugation onto heavy or light biotin azide enrichment handles in known ratios. Following neutravidin enrichment, samples are digested and subjected to MS/MS analysis. C Heavy to light ratios (H:L) from triplicate datasets (n = 3) comparing identifications from reference and variant searches; mean ratio value indicated, dashed lines indicate ground-truth log2 ratio, statistical significance was calculated using a two-sided Mann-Whitney U test, **p <0.01, ns p >0.05 (1:1, p = 0.002; 10:10, p = 0.083; 1:4, p = 0.84; 4:1, p = 0.093; 1:10, p = 0.056; 10:1, p = 0.061). D Retention time difference for heavy and light identified peptides for reference and variant searches; mean value indicated, statistical significance was calculated using a two-sided Mann-Whitney U test, ns p >0.05 (1:1, p = 0.47; 10:10, p = 0.42; 1:4, p = 0.45; 4:1, p = 0.57; 1:10, p = 0.13; 10:1, p = 0.34). Box plots: center line, median; box limits, upper and lower quartiles; whiskers, 1.5x interquartile range. Proteomic data is found in Supplementary Data 1 and source data in the Source Data file.

We then subjected our chemoproteogenomics pipeline to benchmarking by generating a set of high-coverage cysteine chemoproteomics datasets (Fig. 1B) in which reference and variant proteinaceous cysteines in cell lysates labeled with iodoacetamide alkyne (IAA)20 and conjugated to isotopically labeled ‘light’ (1H6) or ‘heavy’ (2H6) biotin-azide reagents47 (+6 Da mass difference between the reagents) were combined pairwise in biological triplicate at different H/L ratios (1:1, 10:10, 1:4, 4:1, 1:10, and 10:1). By searching these datasets using our two-stage FDR search, we sought to validate the accuracy of variant identification. Peptide quantification using IonQuant48,49, following the workflow shown in Fig. 
1A, revealed MS1 intensity ratios for both canonical and variant peptide sequences that matched closely with the expected values (Fig. 1C and Supplementary Data 1). We also compared the retention times of the heavy and light peptides and observed a ~2–3 s shift for the deuterated heavy sequences for both the variant and canonical peptide sequences (Fig. 1D and Supplementary Data 1). These retention time shifts are consistent with our previous study47 and with prior reports50,51. Analogous to studies that utilize isotopically enriched synthetic peptide standards to validate peptide sequences52,53,54, the observed co-elution of both heavy and light variant peptides provides further evidence to support the low FDR of our data processing pipeline. Lastly, the high concordance between observed and expected MS1 ratios provides compelling support for the use of the heavy and light biotin azide reagents in competitive cysteine-reactive compound screens, in which elevated MS1 intensity ratios are indicative of a compound-modified cysteine.

FragPipe graphical user interface (GUI) with improved two-stage MSFragger search and FDR estimation

Motivated by the multi-faceted uses of the two-stage FDR search pipeline for general proteogenomic applications, we next simplified the search workflow by establishing a semi-automated execution of these searches in FragPipe (see Supplementary Discussion for details). To further improve the sensitivity of variant peptide identification, we added an option to run MSBooster and Percolator instead of PeptideProphet (Supplementary Fig. 2). As part of our semi-automated search pipeline, we enabled compatibility with isobaric labeling reagents, which we expect will further broaden the utility of our approach (Supplementary Fig. 3). 
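The logic of the two-stage search and FDR scheme can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the FragPipe or MSFragger implementation: the `PSM` record, the score scale, the simple target-decoy filter `filter_fdr`, and the two search callables are all hypothetical stand-ins for a real search engine and validation tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PSM:
    spectrum_id: str
    peptide: str
    score: float      # hypothetical scale: higher is better
    is_decoy: bool


def filter_fdr(psms, max_fdr=0.01):
    """Keep target PSMs from the longest score-ranked prefix whose
    decoy/target ratio does not exceed max_fdr (simple target-decoy estimate)."""
    ranked = sorted(psms, key=lambda p: p.score, reverse=True)
    targets = decoys = cutoff = 0
    for i, psm in enumerate(ranked, start=1):
        if psm.is_decoy:
            decoys += 1
        else:
            targets += 1
        if targets and decoys / targets <= max_fdr:
            cutoff = i
    return [p for p in ranked[:cutoff] if not p.is_decoy]


def two_stage_search(spectrum_ids, search_reference, search_variant, max_fdr=0.01):
    """Stage 1: search all spectra against the reference database.
    Stage 2: search only the spectra left unexplained against the variant database."""
    stage1 = filter_fdr(search_reference(spectrum_ids), max_fdr)
    explained = {p.spectrum_id for p in stage1}
    remaining = [s for s in spectrum_ids if s not in explained]
    stage2 = filter_fdr(search_variant(remaining), max_fdr)
    return stage1, stage2
```

Because spectra confidently assigned to canonical peptides never reach the second stage, a variant peptide cannot displace a good reference match, which is the mismatch the scheme is meant to limit.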
Using the GUI features, we observed comparable coverage for both the command-line and automated GUI implementations of the two-stage FDR search, with a slight increase in numbers of identifications observed for datasets processed with MSBooster and Percolator (Supplementary Fig. 4 and Supplementary Data 1). The ratio differences between variant and reference cysteine peptides are comparable (Supplementary Fig. 2). In total, we identified 50 missense variants at the protein level, including 11 acquired cysteines and 39 proximal to reference cysteines. This very low coverage of variant data prompted us to reconsider our cell line selection and to prioritize the addition of whole exome sequencing data to enhance the coverage of functionally significant variants.

High missense burden cancer cell lines are rich in acquired cysteines, including in census genes

We hypothesized that the genomes of missense-variant rich cell lines would similarly encode a high burden of acquired cysteine SAAVs and SAAVs proximal to reference cysteines and, therefore, could serve as useful model systems for establishing cysteine chemoproteogenomics. To test this hypothesis and establish a useful toolbox of cell lines and genomics data for our chemoproteogenomics platform, we analyzed the variant burden across all cell line data available in the Catalog of Somatic Mutations in Cancer Cell Lines Project database (COSMIC-CLP)55,56 (Fig. 2A). A comparatively small subset of cell lines was observed to be particularly missense rich, with only 15 out of 1020 total cell lines in COSMIC harboring 77,693 or ~18% of the ~2 million unique missense variants cataloged (Fig. 2B, Supplementary Fig. 5A and Supplementary Data 2). 
Gratifyingly and consistent with our hypothesis, cysteine was a top-gained amino acid, both across all COSMIC cell line variants (23,220 total acquired cysteines; 5.4% of total COSMIC cell line mutations) and across the top 15 high missense burden cell lines (4725 total acquired cysteines found in 3688 genes), with a strong correlation between overall missense burden and net acquired cysteines (Fig. 2B–D and Supplementary Fig. 6). These data suggested that a comparatively small fraction of cell lines could prove useful for proteogenomic analysis of somatic variants in cancer.

Fig. 2: Acquired cysteines are prevalent across cancer genomes, particularly for high missense burden cell lines. A The full scope of acquired cysteines in the COSMIC Cell Lines Project (COSMIC-CLP, cancer.sanger.ac.uk/cell_lines) (v96)55,56 was analyzed. B 1020 cell lines stratified by the number of gained cysteines and total missense mutations; color indicates cancer type for the top 15 highest missense count cell lines. C Net missense mutations (gained-lost) from COSMIC-CLP (v96). D Top 15 cell lines with highest missense burden from panel (B); linear regression and 95% confidence interval shaded in gray. E Overlap of genes with acquired cysteines in top 15 subsets from panel (B) with Census genes and targets of FDA-approved drugs. F Panel of cell lines used in this study with MMR status (dMMR = deficient mismatch repair, pMMR = proficient mismatch repair). Data is found in Supplementary Data 2 and source data is in the Source Data file.

Nearly 30% (219/738) of the Census genes (v98) identified in the top 15 missense-rich cell lines were found to harbor one or more gained cysteines (Supplementary Data 2), and <10% of these genes have been targeted by FDA-approved drugs29,57 (Fig. 
2E).

dMMR cell lines are enriched for SAAVs, including acquired cysteines

Microsatellite instability (MSI) caused by deficiencies in mismatch repair (dMMR), as opposed to functional MMR or proficient mismatch repair (pMMR), is a prominent feature of missense mutation-rich cell lines. Notably, 7 of the top 15 missense cell lines in COSMIC are known mismatch repair deficient cell lines58,59,60 (Fig. 2D and Supplementary Fig. 7), and only MeWo cells, which are derived from metastasized melanoma, were reported to be microsatellite stable (MSS)58. The majority of missense-rich cell lines, including the dMMR lines, were observed to encode between 5000 and 10,000 total SAAVs and between 200 and 500 acquired cysteine SAAVs (Supplementary Fig. 5). By causing C → T mutations primarily at CpG sites, the mutational signature of defective mismatch repair (SBS6) should favor gain-of-cysteine61. While the missense-rich nature of the dMMR cell lines provides an exciting opportunity for high variant coverage proteogenomics, the predominance of MSI across the cell line panel together with the marked overrepresentation of colorectal carcinoma (CRC) cell lines (Fig. 2B and Supplementary Fig. 7) prompted us to broaden our cell line panel to better represent genetic variation and to further assess how cell line MSI status impacted variant content.

An expanded cell line panel incorporates high-value acquired cysteines

Given the considerable interest in targeting G12C KRAS, we opted to add several KRAS mutated cell lines to our panel (MIA-PACA-2, H2122, and H358) in order to favor the detection of the G12C peptide. Notably, the smoking-associated mutational signature is C → A/G → T62, which should also favor gain-of-cysteines. 
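The link between these substitution classes and gain-of-cysteine can be checked directly against the standard genetic code. The sketch below is illustrative only (the helper name `routes_to` is ours, and it considers coding-strand single-base changes in isolation, ignoring sequence context and mutation rates): it enumerates every single-nucleotide change that converts a non-cysteine sense codon into a cysteine codon.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order at each position.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def routes_to(target_aa):
    """Return (ref_codon, ref_aa, substitution, new_codon) for every single-base
    change that turns a sense codon for another residue into target_aa."""
    routes = []
    for codon, aa in CODON_TABLE.items():
        if aa in (target_aa, "*"):
            continue  # skip synonymous changes and stop codons
        for pos, ref in enumerate(codon):
            for alt in BASES:
                if alt == ref:
                    continue
                new = codon[:pos] + alt + codon[pos + 1:]
                if CODON_TABLE[new] == target_aa:
                    routes.append((codon, aa, f"{ref}>{alt}", new))
    return routes


cys_routes = routes_to("C")
# C>T on the coding strand reaches cysteine only from the CpG-containing
# arginine codons CGT and CGC.
c_to_t = sorted((c, aa, n) for c, aa, sub, n in cys_routes if sub == "C>T")
# G>T (the coding-strand view of C>A on the opposite strand) reaches cysteine
# from glycine (GGT, GGC) and tryptophan (TGG).
g_to_t = sorted((c, aa, n) for c, aa, sub, n in cys_routes if sub == "G>T")
# Four of the six arginine codons contain a CG dinucleotide.
arg_cpg = sum("CG" in c for c, aa in CODON_TABLE.items() if aa == "R")
```

Under this simplified view, C → T changes yield Arg → Cys, and G → T changes yield Gly → Cys (as in KRAS G12C) or Trp → Cys.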
Therefore, we additionally sought to test whether smoking-associated NSCLC-derived H2122 and H1437 adenocarcinoma cell lines would be enriched for acquired cysteines when compared to other pMMR cell lines, including lung cancer cell lines (H358 NSCLC and H661 metastatic large cell undifferentiated carcinoma (LCUC) lung cancer cell lines). Lastly, we opted to include CACO-2 cells, an MSS CRC cell line, given the preponderance of missense rich dMMR CRC cell lines. Our prioritized cell line panel features 11 cell lines in total (2 female and 9 male) spanning 6 tumor types and encoding 22,559 somatic variants and 1296 somatic acquired cysteines, as annotated by COSMIC-CLP (Fig. 2F and Supplementary Data 2), with aggregate enrichment for gained cysteines observed for the entire panel (Supplementary Figs. 8, 9). Of the proteins that harbor gained cysteines, 486 are Census genes, and 5% are targeted by FDA-approved drugs (Supplementary Data 2).

Incorporating rare variants into our proteogenomic pipeline

COSMIC and related cancer databases often do not report germline variants found in the general population. Therefore, to enable chemoproteogenomic assessment of non-cancer-associated SAAVs, we sequenced exomes and RNA of our cell lines and subjected NGS reads to variant-calling (Fig. 3A and Supplementary Fig. 10). For all 11 cell lines sequenced, we identified an average of 82% of the variants reported in COSMIC-CLP, including known driver mutations, and 70% of missense mutations reported by the Cancer Cell Line Encyclopedia (CCLE)58 database (Supplementary Data 3). In total, 9485 rare variants and 22,010 common variants were identified that had not been previously reported in COSMIC-CLP. Of those variants not in COSMIC-CLP, 237 are annotated as pathogenic/likely pathogenic/VUS in ClinVar, and 1251 variants encode acquired cysteines (Supplementary Data 3). 
Analysis of DNA damage repair-associated genes revealed specific mutations (Supplementary Data 3), including DDB2 R313* in MeWo cells, which provide an explanation for the previously unreported high missense burden, as inactivating mutations in DDB2 are implicated in deficient nucleotide excision repair63. Pointing towards opportunities to improve coverage of reference-cysteine-containing peptides, 16,381 total reference cysteines were located proximal (within 10 amino acids) to missense variants, including 10,508 variants not previously identified in COSMIC-CLP (Supplementary Data 3).

Fig. 3: Incorporating variants into sample-specific search databases. A Sequencing portion of the ‘chemoproteogenomic’ workflow to identify chemoproteomic detected variants: extracted genomic DNA or RNA from cell lines undergoes sequencing followed by variant calling using Platypus (v0.8.1)118 and GATK-Haplotype Caller (v4.1.8.1)119 for RNA and exomes, respectively, and predicted missense changes were computed. B Total numbers of missense mutations identified from either RNA-seq or WE-seq; stripe vs solid denotes common and rare variants. C Net amino acid changes for all cell lines combined. D Totals of gained and lost cysteine in each cell line separated by rare and common variants; dashed line indicates dMMR cell lines. E Net missense mutations (gained-lost) from dbSNP (4-23-18)65. F Non-synonymous changes are incorporated into reference protein sequences, and combinations of variants are generated for proteins with less than 25 variant sites to make customized FASTA databases. Details are in Methods. Data is found in Supplementary Data 3 and Supplementary Data 4, and source data is in the Source Data file.

We also compared the variant landscape of each cell line with the goal of identifying ubiquitous common variants together with rare and cell line-specific variants. As with our analysis of COSMIC-CLP, we detected a high missense burden for the dMMR cell lines compared to the pMMR cell lines (Fig. 3B). 
In total, 1634 variants were shared across all cell lines, and 34,636 were unique to individual cell lines, which illustrates the added value of analyzing multiple cell lines. Notably, we found that nearly all of the dMMR cell lines, most notably HCT-15 and Molt-4, were enriched for rare variants, and particularly rare acquired cysteines, compared to the pMMR cell lines (Fig. 3B,C), irrespective of sequencing coverage (Supplementary Fig. 11). In contrast, both pMMR and dMMR genomes harbored comparable numbers of common variants, including common acquired and lost cysteines (Fig. 3D and Supplementary Figs. 12–14). This finding points towards an opportunity to use dMMR cell lines for proteogenomic analysis of rare variants and particularly rare acquired cysteines.

Looking beyond cysteine acquisition, we also considered how the broader missense amino acid signature varied across cell lines to identify other features that might impact our proteogenomic pipeline. For common variants, the amino acid gain/loss signatures were generally consistent across cell lines (Fig. 3D), including for smoking versus non-smoking-associated lung cancer cell lines (Supplementary Fig. 15), characterized by marked enrichment for acquired histidine and cysteine together with loss-of-arginine (Supplementary Figs. 12–14). For rare variants, cell-line-specific differences in SAAV content were observed, most notably when comparing the dMMR to pMMR cell lines (Fig. 3D and Supplementary Figs. 12–14). MeWo cells harbored many gains of rare phenylalanine and lysine (Supplementary Fig. 13), consistent with UV radiation-induced pyrimidine dimers (Supplementary Fig. 16 and Supplementary Data 3). 
Thus, we expect that the ubiquity of loss-of-arginine together with the MeWo gain-of-lysine signature should alter the tryptic peptide landscape, and proteogenomic analysis should enable improved detection of this class of missense variants.

Acquired cysteines are ubiquitous in both healthy and diseased genomes

Looking beyond cancer variants, we were also interested in determining whether our chemoproteogenomic platform could prove useful for the study of acquired cysteines more broadly, including ubiquitous common variants and rare variants that may have links to monogenic disorders. We hypothesized that gain-of-cysteine missense variants should also be ubiquitous in healthy genomes, due to the comparative instability of CpG; a key consequence of this instability is the frequent loss of arginine codons (4 of the 6 arginine codons contain a CG dinucleotide)64. To test this hypothesis, we aggregated and quantified the amino acid changes resulting from common missense variants reported by dbSNP65 (4-23-18), a repository of single nucleotide polymorphisms, and ClinVar66 (09-03-22), a repository of variants with reported pathogenicity. We find that cysteine acquisition is the third most common consequence of missense variants identified in dbSNP (Fig. 3E and Supplementary Data 2) for common variants. Common variants are defined by NCBI as being of germline origin and/or having a minor allele frequency (MAF) of ≥ 0.01 in at least one major population, with at least two unrelated individuals having the minor allele. Analogous stratification of variants reported by ClinVar also revealed a preponderance of gained cysteines compared with lost cysteines, albeit to a more modest degree than that observed for cancer genomes (Supplementary Fig. 17 and Supplementary Data 2). For the pathogenic variant subset of ClinVar, both gain- and loss-of-cysteine and gain-of-proline were frequently observed (Supplementary Fig. 17). 
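The net (gained minus lost) amino acid tallies referred to throughout can be computed with a simple counter. The sketch below assumes each missense variant has already been reduced to a (reference residue, variant residue) pair; the function name and the toy input are illustrative and are not taken from the datasets analyzed here.

```python
from collections import Counter


def net_amino_acid_changes(variants):
    """Tally gained minus lost counts per residue.

    variants: iterable of (ref_aa, alt_aa) one-letter pairs, one per missense variant.
    """
    net = Counter()
    for ref, alt in variants:
        if ref == alt:
            continue  # synonymous at the protein level; not a missense change
        net[alt] += 1   # residue gained
        net[ref] -= 1   # residue lost
    return net


# Toy input: two arginine losses, two cysteine gains, one cysteine loss.
toy = [("R", "C"), ("R", "H"), ("G", "C"), ("C", "Y")]
net = net_amino_acid_changes(toy)
ranked = sorted(net.items(), key=lambda kv: kv[1], reverse=True)
```

Ranking the resulting counts from most gained to most lost gives the kind of ordering in which cysteine sits near the top and arginine at the bottom.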
Comparing the variants in our cell line panel to those found in dbSNP and ClinVar, we find that 25,735 (dbSNP/common) and 3982 (ClinVar; 3409 common and 573 rare) variants are found in our cell line panel, which highlights additional opportunities for analysis of acquired cysteines relevant to other genetic contexts, including rare disease and healthy genomes (Supplementary Data 3). Notably, 3560 variants are found within 5 amino acids of additional variants. The proximity of missense variants, particularly rare and common variants, points toward a need for combinatorial databases67 for proteogenomics.

Deploying chemoproteomics with combinatorial databases improves coverage of acquired cysteines and proximal variants

To establish our proteogenomics pipeline, we were inspired by a recent report68 of combinatorial databases that improve the detection of proximal SAAVs, such as the aforementioned variants that are found within 30 amino acids. To improve the detection of such variants, we established an algorithm (Supplementary Fig. 1B) to generate all combinations of SAAVs derived from both RNA/WE-seq data within 30 amino acids flanking the variant site. These combinations were then converted into a peptide FASTA database containing two tryptic sites flanking each variant site (Fig. 3F). On average, >4500 total multi-variant peptide sequences were generated per cell line. Our approach differs from most prior custom database generators, which offer ‘Single-Each’12,52,69,70 or ‘All-in-One’ outputs71,72: for the former, all protein sequences harbor one SAAV each; for the latter, each protein harbors all SAAVs detected. While establishing our combinatorial databases, we observed that a small number of highly polymorphic genes (Supplementary Data 4) markedly increased database size. Exemplifying this increased complexity, upwards of 1 billion combinations (2^n − 1) are possible for protein sequences with 30 or more SAAVs. 
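The combinatorial growth is easy to see in code. The following is a simplified sketch rather than the algorithm of Supplementary Fig. 1B: it combines all SAAVs on one protein without the 30-residue window or the tryptic trimming, and the function names, the `max_sites` parameter, and the Single-Each fallback behavior are our assumptions for illustration.

```python
from itertools import combinations


def apply_variants(seq, variants):
    """Apply (position_1based, ref, alt) substitutions to a protein sequence."""
    residues = list(seq)
    for pos, ref, alt in variants:
        if residues[pos - 1] != ref:
            raise ValueError(f"expected {ref} at position {pos}")
        residues[pos - 1] = alt
    return "".join(residues)


def variant_sequences(seq, variants, max_sites=25):
    """Return every non-empty combination of variants applied to seq.

    Proteins with more than max_sites variants fall back to one variant per
    sequence ('Single-Each'), since 2^n - 1 combinations grows too quickly.
    """
    variants = sorted(variants)
    if len(variants) > max_sites:
        return [apply_variants(seq, [v]) for v in variants]
    out = []
    for k in range(1, len(variants) + 1):
        for combo in combinations(variants, k):
            if len({pos for pos, _, _ in combo}) < k:
                continue  # two alternative alleles at one site cannot co-occur
            out.append(apply_variants(seq, combo))
    return out


# Toy example: two SAAVs give 2^2 - 1 = 3 variant sequences.
toy = variant_sequences("MKLFPAR", [(3, "L", "P"), (4, "F", "C")])
```

With 30 independent sites the same enumeration would produce 2^30 − 1 = 1,073,741,823 sequences for a single protein, which is why a cap on the number of combined sites is needed.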
To determine the practical limit for the number of SAAVs/protein, we performed test searches where we limited the number of variants to combine (Supplementary Data 4). We find that nearly all variants are retained with databases that include combinations for proteins with up to 25 variants (Supplementary Data 4). For the small set of highly polymorphic protein sequences (e.g., HLA, MUC, and OBSCN; Supplementary Data 4), Single-Each sequences were searched (Fig. 3F).

Next, for all 11 sequenced cell lines (Supplementary Data 3), we prepared and acquired a set of high-coverage cysteine chemoproteomics datasets (Fig. 4A) with the goal of identifying acquired cysteines and variants proximal to reference cysteines. In aggregate, 32,638 total canonical cysteines were identified on 7233 total proteins (Supplementary Fig. 18 and Supplementary Data 4). Two-stage MSFragger search using our sample-specific combinatorial databases identified a total of 59 gained cysteines and 302 SAAVs located proximal to 343 reference cysteines (Fig. 4B and Supplementary Data 4).

Fig. 4: Variant peptide identification on tumor cell lines. A Cell lysates were labeled with pan-reactive iodoacetamide alkyne (IAA) followed by ‘click’ conjugation onto biotin azide enrichment handle. Samples were prepared and acquired using our established SP3-FAIMS chemoproteomic platform31,32,131: single pot solid phase sample preparation (SP3)132 sample cleanup, neutravidin enrichment, sequence-specific proteolysis, and LC-MS/MS analysis with a field asymmetric ion mobility (FAIMS) device133. Experimental spectra are searched using a custom FASTA for variant identification. The sample set includes a reanalysis of previously reported datasets from Yan et al.31 (Molt-4, Jurkat, Hec-1B, HCT-15, H661, and H2122 cell lines) together with newly acquired datasets (H1437, H358, Caco-2, Mia-PaCa-2, and MeWo cell lines). 
B Total numbers of unique missense variants identified from either RNA-seq or WE-seq or both after using two-stage MSFragger search and Philosopher validation from duplicate (n = 2) datasets; stripe vs solid denotes common and rare variants, black triangles represent replicate total counts, indicated is sequencing source and type of variant. C Overlap of identified cysteines from variant searches with cysteines in the CysDB database29. D Net amino acid changes for all cell lines combined. E Example of cysteines identified from loss-of-arginine/lysine peptides. Data is found in Supplementary Data 4, and source data is in the Source Data file.

Across all these identified SAAVs, we were particularly interested in assessing the impact of our combinatorial exome and RNA-seq SAAV databases on variant identification. We identify six multi-variant-containing peptides (Supplementary Data 4). One noteworthy example is the L86P/F92C peptide from the mitochondrial enzyme HADH, which catalyzes beta-oxidation of fatty acyl-CoAs; two variants, one from RNA-seq and one from exome-seq, were detected in this peptide. For the I105V/A114V peptide from the enzyme GSTP1, the I105V variant was flagged as a bad-quality read in the RNA-seq data but passed filters in the exome-seq data (Supplementary Data 4). Of these combination variants, two are exome-seq-only derived variants that span exon boundaries. While the coverage of these multi-variant peptides is modest, these examples illustrate the value of combinatorial databases for proteogenomic search.

We next investigated the specific features of the identified variants, with the goal of determining if we were capturing cysteines not covered in prior chemoproteomic studies, including those gained due to variant-induced changes to the tryptic peptide landscape. 
By comparing to our high coverage database of cysteine chemoproteomic data, CysDB29, we find that chemoproteogenomics identified 74 canonical sequence cysteines located proximal to variants and 60 acquired cysteines that had not been previously reported in CysDB (Fig. 4C). Notable examples of acquired cysteine variants not reported in CysDB include KRAS G12C and PRKDC R2899C. Consistent with the aforementioned genomic data findings, we observe arginine as the most frequently lost residue among detected Cys-proximal SAAVs (Fig. 4D). We detect 15 total cysteines in peptides that harbor gain/loss-of-arginine that were previously too long or too short to be identified (Fig. 4E and Supplementary Data 4). Exemplifying these peptides, for the cysteine protease cathepsin B (CTSB), we identify Cys207 in HCT-15 cells, which was not identified in CysDB; here, a K209E mutation creates a longer tryptic peptide sequence compared to the reference sequence (‘CSK’ to ‘CSEICEPGYSPTYKQDK’) (Fig. 3K). Taken together, these examples illustrate how the pronounced loss of arginine and lysine can impact the detection of both reference and variant cysteines.

Chemoproteogenomics identifies both rare and common variants including highly deleterious sites

One of our overarching goals for establishing chemoproteogenomics was to enhance the discovery of likely functional variants. Therefore, we next assessed features of chemoproteogenomic-identified SAAVs that could provide insights into the discovery of SAAVs of functional significance. 
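The effect of losing a tryptic site can be illustrated with a minimal in-silico digest. This sketch assumes fully tryptic cleavage (after K or R, not before P) with no missed cleavages, uses a made-up toy sequence rather than the CTSB sequence, and treats a 7 to 50 residue window as a hypothetical detectability range; none of these settings are taken from the search parameters used in this study.

```python
import re


def tryptic_peptides(seq):
    """Cleave after K or R unless the next residue is P (no missed cleavages)."""
    return [p for p in re.split(r"(?<=[KR])(?!P)", seq) if p]


def detectable(peptide, min_len=7, max_len=50):
    """Hypothetical length window for peptides considered by a search."""
    return min_len <= len(peptide) <= max_len


reference = "AACSKICEPGR"   # toy sequence: the cysteine sits in a short peptide
variant = "AACSEICEPGR"     # K5E removes the cleavage site

ref_peps = tryptic_peptides(reference)
var_peps = tryptic_peptides(variant)
ref_cys_ok = [p for p in ref_peps if p.startswith("AACS") and detectable(p)]
var_cys_ok = [p for p in var_peps if p.startswith("AACS") and detectable(p)]
```

In the toy reference, the first cysteine lies in a five-residue peptide that falls below the assumed length window; the lysine-to-glutamate change merges it with the downstream peptide, bringing the cysteine into range.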
Given the comparatively modest coverage of acquired cysteine SAAVs, relative to the genomic dataset, we opted to analyze both datasets in parallel to delineate specific features that could inform both the likelihood of proteomic detection and residue functionality.

With the goal of parsing features that favor SAAV detection, we next asked whether chemoproteogenomics favored the detection of rare or common variants or those identified in either or both RNA- and exome-sequencing datasets, with the hypothesis that rare variants may be less likely to be expressed at the protein level. We find that the relative proportion of SAAVs identified by chemoproteogenomics (Fig. 4B) largely parallels the trends observed in our sequencing data (Fig. 3B), with higher detection for dMMR cell lines, particularly for rare variants. These trends extend to acquired cysteines, with similar proportions of rare and common cysteine SAAVs identified by both genomic and chemoproteogenomic analysis. Notably, most chemoproteogenomic-detected SAAVs were found in both the exome and RNA-seq datasets (Fig. 4B), pointing toward the likelihood that variant calling from RNA-seq data should prove sufficient for variant detection.

Towards guiding the discovery of functional SAAVs, we also stratified the predicted deleteriousness of the identified missense variants (Fig. 5A and Supplementary Data 3). We focused on the Combined Annotation-Dependent Depletion (CADD) score due to its highly reported specificity and sensitivity73 and our prior findings that showed a strong association between cysteine functionality and a high CADD score35. Unsurprisingly, our analysis revealed higher CADD scores for rare variants compared to common variants, across the cell line panel (Fig. 5B, C and Supplementary Data 3). 
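For orientation on the scale, CADD phred-scaled scores express a variant's rank among all possible substitutions on a logarithmic scale, so that a score of 10 marks the top 10% of predicted deleteriousness, 20 the top 1%, and 30 the top 0.1%. The short sketch below shows that conversion; the function names are ours and the snippet is not part of the CADD software.

```python
import math


def phred_scaled(rank, total):
    """Scaled score for a variant ranked `rank` (1 = most deleterious) of `total`."""
    return -10 * math.log10(rank / total)


def top_fraction(score):
    """Fraction of variants ranked at or above a given scaled score."""
    return 10 ** (-score / 10)


score_top_1_percent = phred_scaled(1, 100)
fraction_at_20 = top_fraction(20)
```

A threshold of 20 on this scale therefore corresponds to the 1% of variants predicted to be most deleterious.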
More unexpectedly, we observed a marked increase in the predicted pathogenicity of the rare variants detected in dMMR cell lines compared with pMMR cell lines (the 1% of mutations predicted to be most deleterious have CADD phred-scaled scores >20) (Fig. 5B, C and Supplementary Figs. 19, 20). These trends were maintained in our proteomic datasets, with enrichment of high CADD score missense variants in the dMMR rare variant subset, including for gain-of-cysteine SAAVs (Supplementary Fig. 21). Even more striking, further stratification by specific gained or lost amino acids (Fig. 5D and Supplementary Figs. 22–25) revealed that gained cysteine missense mutations are the most significantly enriched for high predicted deleterious scores across all pMMR and dMMR cell lines (Supplementary Data 3). These findings provide evidence in support of the use of dMMR cell lines as useful model systems for proteogenomic detection of likely deleterious variants.

Fig. 5: Chemoproteogenomics identifies predicted deleterious sites. A Scheme of CADD score analysis for two dMMR and non-dMMR cell lines. B Distribution of CADD scores for indicated variant grouping; statistical significance was calculated using a two-sided Mann-Whitney U test, ****p <0.0001 (Common, p = 3.1e-16; Rare, p = 5e-46). C Empirical cumulative distributions (ECDF) were computed for CADD scores with indicated grouping; statistical significance was calculated using a two-sample Kolmogorov-Smirnov test, ****p <0.0001 (Common, p = 1.7e-12; Rare, p = 6.4e-34). D CADD score distributions for gain-of-cysteine separated by grouping; statistical significance between gained cysteine values was calculated using a two-sample Kolmogorov-Smirnov test, ****p <0.0001 (Common, p = 3.1e-6; Rare, p = 6.4e-10). 
Data is found in Supplementary Data 3, and source data is in the Source Data file.

Possibly complicating matters, nearly all of the 77 variants that were both reported in ClinVar and identified by chemoproteogenomics were annotated as benign (Supplementary Data 4). Similarly, chemoproteogenomics failed to capture several key Census gene SAAVs that we detected on the genomic level (e.g., SMAD4 (D351H) in CaCo-2, FBXW7 (R505C) in Jurkat and CDK6 (R220C) in Molt-4 cells). These examples provide additional anecdotal evidence of the challenges associated with detecting deleterious variants.

Chemoproteogenomics did, however, capture 16 mutations and 7 putative driver mutations (dN/dS p-values) that were identified in Census genes. Several high-value Census gene SAAVs were distinguished by both high CADD scores (>20) and proximity to known pathogenic mutation sites. These variants of interest include MLH1 R385C, RAD17 L557R (proximal Cys551/556), MSN R180C, HIF1A S790N (proximal Cys800) and CTCF R320C, a likely pathogenic position in this protein (Supplementary Data 4). A prevalent driver was KRAS G12C, which was identified in several of the cell lines known to harbor this variant as a driver mutation (MIA-PACA-2 and H358 but not H2122). As KRAS expression is known to vary across cell lines58, these data suggest that both H358 and MIA-PACA-2 cell lines are suitable for chemoproteogenomic target engagement analysis of G12C-directed compounds.

Chemoproteogenomics captures previously undetected variants

Exemplifying the utility of chemoproteogenomics (Fig. 6A) to uncover previously undetected variants, we find that 20 of the identified SAAVs have not been previously reported in COSMIC, CCLE, or ClinVar (Supplementary Data 4). One variant of unknown significance, not reported in ClinVar, is high mobility group box 1 (HMGB1) R110C labeled in the Molt-4 cell line (Fig. 6B) (CADD score = 24.1). 
Adjacent Cys106 is a cysteine under a highly controlled redox state that is responsible for inactivating the immunostimulatory state of HMGB174,75,76,77. We also identify Cullin-associated NEDD8-dissociated protein 1 (CAND1) G1069C (a site that, when mutated in the Arabidopsis thaliana ortholog, reduces auxin response78) and SARS R302H (proximal Cys300; CADD = 32), a mutation in the ATP binding site of serine-tRNA ligase, which is a tRNA ligase involved in negative regulation of VEGFA expression79. These three examples illustrate the capacity of chemoproteogenomics for the identification of potentially functionally relevant variants.

Fig. 6: Chemoproteogenomics identifies SAAVs proximal to likely functional sites. A Scheme for chemoproteomics data search to identify variants from duplicates (n = 2). B Crystal structure of HMGB1 indicating detected Cys110 and nearby Cys106 (yellow) (PDB: 6CIL). C Proportion of variants belonging to the indicated sites; AS/BS = in or near active site/binding site in genomics data as annotated by UniProtKB or Phosphosite; statistical significance calculated using the two-sample test of proportions, ***p <0.001, ****p <0.0001, ns p >0.05. D Chemoproteogenomic-identified variants in or near active and binding sites with CADD score, common/rare, cell line dMMR/pMMR annotations. E Amino acid changes at protein methylation sites as identified by Phosphosite from genomics data. F Re-analysis of SP3-Rox24 oxidation state data in Jurkat cells (n = 6) acquired cysteines and 54 variants proximal to acquired cysteines. G Example of cysteines identified from loss-of-arginine/lysine peptides. H Schematic of highly variable HLA binding pocket containing cysteine with bound peptide. I Coverage of HLA cysteines from this study and in CysDB; color indicates HLA type or multi-mapped cysteines. J Crystal structure of HLA-B 14:02 (PDB: 3BXN) with highlighted Cys67 and Arg P2 position of bound peptide; alignments of Cys91 regions of three HLA-B alleles. 
K Workflow to visualize HLA cysteine labeling; first cells were harvested and treated with IAA followed by lysis, FLAG immunoprecipitation, and click onto rhodamine-azide. L Cys-dependent cell surface labeling of HLA-B alleles with IAA; the band is indicated with a red arrow and the non-specific band with an asterisk (representative of two biological replicates). Data is found in Supplementary Data 3 and Supplementary Data 4, and source data is in the Source Data file.

Chemoproteogenomics identifies SAAVs proximal to likely functional sites

As CADD scores only provide a prediction of deleteriousness, we also asked whether any of the identified variants are located proximal to known functional sites and sites of post-translational modification. At the genomic level, we find that the dMMR rare variant set is enriched for known proximal active site/binding site residues (Fig. 6C and Supplementary Data 3). Within the proteomic dataset, only 3 variants were located at annotated active or binding sites, including the previously mentioned HMGB1 R110C, tRNA synthetase EPRS R1152L (proximal Cys1148; CADD = 33), a mutation known to cause complete loss of tRNA glutamate-proline ligase activity80, and SARS R302H. Thus, we broadened our analysis to include SAAVs at or proximal to UniProtKB annotated active sites (AS) and binding sites (BS) (Fig. 6D). We find that 27 SAAVs are located within the permissive range of 10 amino acids of a known functional residue, including 4 active sites and 24 binding sites.

Beyond AS/BS proximity, we also assessed proximity to other likely functional sites, known functional domains, and PTM-modified sites reported by Phosphosite81. We find generally no marked bias for variants located in specific domain types, with the ubiquitous P-loop NTPase domain as the most SAAV-rich domain (Supplementary Fig. 27 and Supplementary Data 4). We do, however, observe that variants in GPCR transmembrane domains are likely challenging to detect by proteogenomics. 
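As an illustration of the 10-amino-acid proximity window applied to annotated active and binding sites above, the following minimal Python sketch flags variants that fall within a fixed sequence distance of a functional residue. The function name, the inclusive treatment of the window boundary, and the example positions are assumptions made for illustration; they are not taken from the study's analysis code.

```python
def variants_near_sites(variant_positions, functional_sites, window=10):
    """Map each variant position to the functional-site positions that lie
    within `window` residues of it (inclusive) in primary sequence.

    Variants with no site inside the window are omitted from the result.
    """
    hits = {}
    for var_pos in variant_positions:
        nearby = [site for site in functional_sites
                  if abs(site - var_pos) <= window]
        if nearby:
            hits[var_pos] = nearby
    return hits


if __name__ == "__main__":
    # Hypothetical residue numbers on a single protein, for illustration only.
    variants = [110, 302, 925]
    sites = [106, 300, 500]
    print(variants_near_sites(variants, sites))
    # {110: [106], 302: [300]}
```

In practice the same check would be run per protein, with site positions taken from an annotation source such as UniProtKB.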
In our genomic datasets, GPCR transmembrane domains are enriched for variants. This enrichment does not extend to our proteomic analysis (Supplementary Fig. 27 and Supplementary Data 4). This difference in coverage can be rationalized by membrane proteins’ generally low abundance, hydrophobicity, and the lack of tryptic sites in transmembrane domains, which together make proteomic detection of peptides from GPCRs and related proteins particularly challenging9,82,83.

Intriguingly, analysis of known PTM-modified sites reported by Phosphosite81 revealed a significant association between arginine methylation sites and rare variants in dMMR cell lines (Fig. 6E). Examples of such variants that we detected via chemoproteogenomics include the methylation sites XRN2_p.R925C (CADD = 31) and HSPH1_p.R265C (CADD = 32), as well as phosphorylation site CNN2_p.S244Y (CADD = 27.5). These findings are consistent with loss-of-arginine as a frequent consequence of exonic CpG mutability64,84, together with the roles of MMR in protecting against CpG-associated deamination85. As 60% of the gained cysteines in our data resulted from loss-of-arginine (Supplementary Fig. 26), we expected that many of these variants would result in altered PTM status.

Because cysteines play critical roles in protein structure via disulfide bond formation together with additional cysteine oxidative modifications86, we asked whether identified loss-of-cysteine variants (10 in total) were annotated as involved in disulfides. Likely due to the comparatively small number of loss-of-cysteine variants, none were observed with disulfide annotations. To further pinpoint whether any variants are sensitive to oxidative modification, we subjected our previously reported Jurkat cell redox chemoproteomics datasets to reanalysis24. 
For nearly all of the cysteines quantified with proximal variants, both in our reference database searches and second stage searches, we observed a high concordance between variant- and reference sequence oxidation (R2 = 0.77). One notable exception was the Mitochondrial-processing peptidase enzyme (PMPCA) Cys225, where markedly different cysteine oxidation states were measured for the reference peptide cysteine (~3% oxidation) and variant peptide cysteine (~88% oxidation) (Fig. 6F). These data provide evidence that the proximal P226S (CADD = 25.1) mutation profoundly impacts Cys225 sensitivity to oxidative modifiers.

Chemoproteogenomics enables the high confidence detection of multi-mapped genes, including for highly polymorphic sequences

One challenge for chemoproteogenomics is the accurate assignment of variant-containing peptide sequences to the corresponding mutated gene. Exemplifying this challenge, and as a cautionary example in mapping peptides, we identify several SAAV-peptides that match to multiple protein sequences, including sequences in human leukocyte antigens (HLA) and POTE ankyrin domain family proteins (Fig. 6G). Most notably, the RHOT2 R425C (CADD = 23.2) mitochondrial GTPase peptides in H358 cells are identical in sequence to KRAS G12C peptides; these half-tryptic peptides are also identified in H1437 cells that do not harbor the KRAS G12C variant. Thus, without cell-line matched variant databases, chemoproteomic data for the RHOT2 cysteine could easily be misconstrued as reflective of the G12C KRAS peptide.

The HLA or Major Histocompatibility Complex (MHC) Class I molecules represent another particularly challenging class of sequences for chemoproteogenomic analysis, distinguished by the presence of multiple possible variant combinations and high sequence redundancy. HLA are highly polymorphic, with ~15,000 HLA alleles reported in the human population87. 
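The multi-mapping problem described above can be made concrete with a short Python sketch that checks whether a peptide matches one or several entries in a sequence database. This is only an illustration of exact-substring mapping; the function names are hypothetical, and the protein names and sequences are invented toy strings, not the real RHOT2, KRAS, or HLA sequences.

```python
def map_peptide(peptide, proteins):
    """Return the sorted names of all proteins whose sequence contains
    the peptide as an exact substring."""
    return sorted(name for name, seq in proteins.items() if peptide in seq)


def classify_peptide(peptide, proteins):
    """Label a peptide as 'unmapped', 'unique', or 'multi-mapped' and
    return the label together with the matching protein names."""
    hits = map_peptide(peptide, proteins)
    if not hits:
        return "unmapped", hits
    if len(hits) == 1:
        return "unique", hits
    return "multi-mapped", hits


if __name__ == "__main__":
    # Toy database: invented sequences used only to demonstrate the logic.
    toy_db = {
        "PROT_A_variant": "MKTLVVGACGVGKSAR",
        "PROT_B_variant": "QQLVVGACGVGKTT",
        "PROT_C": "MSSPEPTIDEK",
    }
    print(classify_peptide("LVVGACGVGK", toy_db))
    print(classify_peptide("PEPTIDEK", toy_db))
```

A peptide labeled multi-mapped by a check of this kind cannot be assigned to a single gene from sequence alone, which is where knowing which variants a given cell line actually harbors becomes informative.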
Exemplifying the impact of this polymorphism on proteomic sequence coverage, our panel of cell lines alone harbors >25 HLA-A, B, and C alleles (Supplementary Data 3), while most protein reference databases only contain one copy of each MHC Class I and Class II molecule. This complexity together with the important functions in innate immunity and therapeutic relevance of the HLA proteins88,89,90,91 inspired us to assess the impact of chemoproteogenomics on achieving improved coverage of highly polymorphic genes (Fig. 6H).

Demonstrating the value of our proteogenomic analysis, we achieved ~50% more coverage of HLA-A sequence in comparison to reference searches (Fig. 6I, Supplementary Fig. 28). A key finding of our analysis was detection of HLA-B Y91C (CADD = 4.9) (C67 post signal peptide cleavage), which lies in the extracellular peptide binding pocket of HLA-B and was identified as IAA-labeled in MeWo cells (Fig. 6J). The MeWo cell line HLA alleles (HLA-B*14:02 and HLA-B*38:01) both harbor this comparatively rare Cys. Notably, this cysteine is also a key feature of the pathogenic ankylosing spondylitis associated allele HLA-B*27 (Brewerton et al. 1973; Alvarez et al. 2001).

To further vet the capacity of our chemoproteogenomic platform in faithfully capturing cysteine peptides from multi-mapped genes, we established a gel-based activity-based protein profiling (ABPP)92,93,94 platform for Cys67 HLA alleles. We co-expressed C-terminal FLAG tagged HLA-B*38:01 and the related and pathogenic HLA-B*27:05 alleles with beta-2-microglobulin (β2m) and subjected cells to in situ IAA labeling followed by lysis, FLAG immunoprecipitation to enhance the detectability of the HLA cysteine, and click conjugation to rhodamine azide (Fig. 6K). Gratifyingly, we observed a Cys67-specific rhodamine signal (Fig. 6L) that was blocked by the Cys67Ser point mutation, showcasing the utility of gel-based ABPP in visualizing HLA small molecule interactions. 
Notably, IAA labeling was also observed for HLA-B*27:05, although the presence of a strong co-migrating band in the HLA-B*27:05 C67S immunoprecipitated sample complicates interpretation of the specificity of this labeling to Cys67. We were unable to observe comparable signal in lysate-based labeling studies, supporting enhanced accessibility of this cysteine to cell-based labeling (Supplementary Fig. 29).

Assessing how differential expression impacts chemoproteogenomic detection

Our comparatively modest coverage of SAAVs achieved by chemoproteogenomics (particularly when compared to our genomics datasets) is on par with the coverage reported by most prior proteogenomics studies6,8,17. A notable exception is a recent study by Coon and colleagues that implemented ultra-deep fractionation to achieve more global coverage of variants9. Inspired by this work, we next sought to ask whether chemoproteogenomics, with its built-in enrichment step, would enable sampling of variants not detectable by fractionation methods (Fig. 7A). We subjected lysates from HCT-15 and Molt-4 cells, which were chosen based on high rare missense burden, to tryptic digestion, off-line high pH fractionation and LC-MS/MS analysis. In aggregate across both cell lines, we identified 8,435 proteins and 149,006 peptides, including 1069 unique SAAVs found in 1352 total peptides using our two-stage FDR MSFragger search (Fig. 7B, Supplementary Fig. 30 and Supplementary Data 5). Illustrating the use of our combinatorial databases, 26 peptides were identified that contained multiple variants (Fig. 3F and Supplementary Data 5).

Fig. 7: Comparison of variants identified from cysteine enrichment and bulk proteomics. A Workflow for high-pH fractionation of lysates. Cell lysates are treated with DTT and iodoacetamide followed by digestion, high-pH fractionation, and LC-MS/MS analysis. Triplicate high-pH sets (n = 3) for HCT-15 and Molt-4 cells were used. 
B Total numbers of unique missense variants identified from either RNA-seq or WE-seq or both after using a two-stage MSFragger search of high-pH datasets; black triangles represent replicate total counts. C Overlap of cysteine-containing peptide variants identified from bulk fractionation and cysteine enrichment datasets. D Fold enrichment of amino acids as a ratio of the net amino acid frequency (gain minus loss) to the amino acid frequency in all missense-containing proteins detected in high-pH and cys-enriched datasets. E High-pH detected variants stratified by CADD score and ClinVar clinical significance. F Peptide lengths of reference and variant peptides identified in dataset types; statistical significance using two-sample Kolmogorov-Smirnov tests, ****p <0.0001. G DE-seq normalized transcript counts for all RNA variants ‘All’, variants detected from cys-enrichment ‘C’, and variants detected from high-pH fractionation ‘H’ in HCT-15 cells; bar indicates the mean value (All vs C, p = 7e-17; C vs H, p = 0.17; All vs H, p <2e-16). H Label-free quantitation (LFQ) intensities for proteins matched to all RNA variants ‘All’, variants detected from cys-enrichment ‘C’, and variants detected from high-pH fractionation ‘H’ in HCT-15 cells; bar indicates the mean value (All vs C, p <2e-16; C vs H, p = 0.19; All vs H, p <2e-16). I Variant allele frequencies (VAF) (total reads/total coverage per site) for RNA-seq variants called in HCT-15 and Molt-4 cells (All vs C, p = 0.74; C vs H, p = 0.053; All vs H, p = 9e-5). G–I bar indicates the median; statistical significance was calculated using two-sample Kolmogorov-Smirnov tests, ****p <0.0001, ns p >0.05. Data is found in Supplementary Data 5, and source data is in the Source Data file.

With these bulk datasets in hand, we next compared the variant content to that afforded by chemoproteogenomics for the matched HCT-15 and Molt-4 proteomes (145 total SAAVs identified by chemoproteogenomics for these two cell lines). 
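The distribution comparisons in Fig. 7F–I rely on two-sample Kolmogorov-Smirnov tests. As a rough illustration of what that test measures, the pure-Python sketch below computes only the KS statistic D, the largest vertical gap between two empirical cumulative distribution functions. It does not compute a p-value, the function name and example values are hypothetical, and it is not the authors' analysis code; a library routine such as `scipy.stats.ks_2samp` would normally be used for a full test.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic D: the maximum absolute
    difference between the empirical CDFs of the two samples."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    d = 0.0
    # The empirical CDFs only change at observed values, so it is
    # sufficient to evaluate the gap at every value seen in either sample.
    for x in set(a) | set(b):
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d


if __name__ == "__main__":
    # Hypothetical variant allele frequencies for two detection methods.
    vaf_enriched = [0.1, 0.2, 0.3, 0.4]
    vaf_bulk = [0.3, 0.4, 0.5, 0.6]
    print(ks_statistic(vaf_enriched, vaf_bulk))
```

A D near 0 indicates nearly overlapping distributions, while a D near 1 indicates that the two samples barely overlap.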
Net gained amino acid analysis (Supplementary Figs. 31, 32) revealed similar trends, with cysteine in the top three gained and arginine as the most lost amino acid for both enriched and unenriched datasets. Illustrating the added value of chemoproteogenomics, 70 SAAVs, including eight acquired cysteines, were uniquely identified compared to unenriched datasets (Fig. 7C and Supplementary Data 4, 5). Furthermore, we find that enrichment afforded a ~5-fold boost in the relative fraction of acquired cysteines captured (Fig. 7D). Alongside the benefits of chemoproteomics capture, bulk proteomic analysis revealed unique variants. Bulk analysis identified 85 notable variants belonging to Census genes, including BRD4 E451G (CADD = 31) and KRAS G13D (CADD = 23.8), and 26 rare/common variants of uncertain significance in ClinVar, including rare gain-of-cysteines ubiquitin hydrolase USP8 Y1040C (CADD = 28.5) and LMNA R298C (CADD = 27.2) (Fig. 7E and Supplementary Data 5). Most of these Census variants are found in peptides not containing cysteines and thus should not be detected by chemoproteogenomics.

Given that cysteine chemoproteomics requires peptide derivatization with a comparatively large (463 Da) biotin modification, we additionally postulated that some differences in coverage could also be ascribed to the behavior of peptides during sample acquisition. Comparing the properties of the SAAV peptides detected by chemoproteogenomics versus proteogenomics, we observed a more restricted charge state distribution for cysteine-enriched samples and no appreciable differences in the amino acid content beyond enrichment for cysteine (Supplementary Fig. 33). While we did not observe differences in the peptide lengths in our comparison between the chemoproteomic-enriched and high pH detected SAAV peptides, a marked, significant increase in SAAV peptide length (average 5 AA longer) was observed compared to reference peptides in both datasets (Fig. 7F). 
This increased peptide length is consistent with the ubiquity of loss-of-arginine SAAVs in both datasets, which are favored in the longer length peptides (Supplementary Fig. 34). Thus, we concluded that chemical properties are not the primary reason for the difference in coverage between bulk and cysteine enrichment proteomics.

Therefore, we asked whether protein or RNA abundance might rationalize the differences in SAAV coverage for each method. Comparison of normalized transcript counts for SAAV-matched genes identified either by chemoproteogenomics or in our bulk proteomic dataset for HCT-15 cells revealed no significant difference in measured transcript abundance between the sets (Fig. 7G and Supplementary Data 5). A subset of SAAVs (3262 total, including PIK3CA E545K, TP53 S241F, SMARCA4 R885C TCGA hotspot mutations all with CADD >27) with low abundance transcripts (less than 4000 normalized counts) were not detected in either the chemoproteogenomics or bulk proteogenomics datasets. This finding provides evidence that low transcript abundance correlates with a decreased likelihood of variant detection both for bulk proteomics and for chemoproteomics. These trends for relative ease of proteomic detection are not restricted to variants and also extend to reference cysteines, with a marked enrichment of undetected cysteines encoded by low abundance transcripts (Supplementary Fig. 35).

Given the likely disconnect between transcript abundance and protein abundance95,96,97 for some SAAVs analyzed, we also extended these analyses to measures of protein abundance. Using label-free quantification (LFQ) analysis, no difference was observed in protein abundance, inferred from quantified protein intensities, between the bulk fractionated samples and the chemoproteogenomic samples (Fig. 7H and Supplementary Data 5). 
Consistent with low abundance protein variants being challenging to detect, SAAVs detected via both proteomics workflows were observed to belong to more abundant proteins, in comparison to variants only detected via genomics.

Neither the transcript nor the protein abundance analyses delineate reference from variant-specific transcript/protein sequences. Therefore, to further delineate the capacity of chemoproteogenomics to detect low abundance variants, we assessed the variant allele frequencies (VAF) for detected SAAVs. We find that the VAFs of high-pH detected SAAVs were significantly higher than those of the chemoproteogenomic detected SAAVs, including the acquired cysteine subset, which were comparable to the aggregate bulk RNA-seq VAFs (Fig. 7I, Supplementary Data 5 and Supplementary Fig. 34). This enrichment for lower VAF for the chemoproteogenomic detected SAAVs hints at the utility of chemoproteogenomics for capture of rare variant-containing peptides.

Guided by these findings, we asked whether chemoproteogenomics was well suited to capture deleterious variants, with the hypothesis that proteins harboring these likely damaging variants may be lowly expressed. Consistent with this premise, the mean CADD scores for the chemoproteogenomics identified variants were significantly higher than those calculated for the variants identified via bulk proteomics (Supplementary Fig. 36). Notable high-CADD score (>29) variants identified only from enrichment include lysine demethylase KDM3B D1444Y, RNA polymerase POLRMT R805C, glycoprotein transporter LMAN2 R218C and Serine/threonine-protein phosphatase PP1-alpha catalytic subunit PPP1CA D203N (Fig. 7C). 
Taken together, these findings illustrate the added value of chemoproteogenomics in capturing functionally interesting variants.

Chemoproteogenomics enables ligandability screening

As demonstrated by our previous studies, cysteine chemoproteomics platforms are capable of pinpointing small-molecule targetable cysteine residues21,30,31,34. Therefore, we next paired our two-stage FDR search method with cysteine-reactive small molecule ligandability analysis to establish a chemoproteogenomic small molecule screening platform (Fig. 8A). We first opted to use the widely employed scout fragment KB0221 (Fig. 8B) to compare the ligandable variant proteomes for three high variant burden dMMR cell lines (HCT-15, Jurkat, and Molt-4). For KB02 treated samples, we identified 210 total variants, of which 8 were ligandable (Fig. 8C). The high concordance for ratios detected for variant peptides with multiple alleles provides evidence of the robustness of our platform and hints that most cysteine proximal variants do not substantially alter cysteine ligandability (Fig. 8D).

Fig. 8: Assessing ligandability of variant proximal cysteines and gain-of-cysteines. A Schematic of activity-based screening of cysteine reactive compounds; cell lysates are labeled with compound or DMSO followed by chase with IAA and ‘click’ conjugation to our isotopically differentiated heavy and light biotin-azide reagents, tryptic digest, LC-MS/MS acquisition, and MSFragger analysis. B Chloroacetamide compound library. C Total quantified variants and total ligandable variants (log2 Ratio >2) identified stratified by cell line (KB02 data) or compound (HCT-15 cell line). D Correlation of high-confidence variant containing and reference cysteine ratio values from KB02 data. E Correlation of high-confidence variant containing and reference cysteine ratio values from SO compound data. F Log2 heavy to light ratio values for variant containing and reference cysteine peptides. 
G Subset of gain-of-cysteine peptide variant log2 ratios. Data is found in Supplementary Data 6, and source data is in the Source Data file.

To provide a focused assessment of the structure-activity relationship (SAR) of small molecules for individual cysteines, we next subjected the HCT-15 proteome to more in-depth analysis using a small panel of custom electrophilic fragments (Fig. 8B and Supplementary Fig. 37). We observed 27 total liganded variant peptides in 27 proteins in the HCT-15 proteome, which were labeled by one or more compounds (Fig. 8C). As with the KB02 cell line comparison, nearly all multi-allelic peptides showed comparable ratios (Fig. 8E). One notable exception was EPRS P1482T (CADD = 27.2), which showed markedly different reference and variant ratios; the mutated proline nearby Cys1480 may be requisite for labeling by electrophilic fragments (Fig. 8F). As multi-allelic acquired cysteine sites cannot be captured sans cysteine, no analogous ratio comparisons could be performed for the 6 total quantified acquired cysteines (Fig. 8G).

We also asked whether any of the ligandable variants would likely alter protein activity. We chose to focus on three metrics to guide our prioritization of likely variants for functional analysis: CADD score, proximity to known functional sites, and variants that result in gained cysteines. We analyzed active site and binding sites within 10 angstrom distance of the ligandable cysteine residues and cysteine-proximal variant sites (Supplementary Data 6). We find three ligandable cysteines near or in active/binding sites including previously identified HMGB1 Cys106 (R110C, CADD = 24.1) (Fig. 8A), as well as Aldolase A ALDOA Cys178 (G196G, CADD = 26.2) and HLA-B/C Cys125 (V127L/S123Y, CADD <1). Other notable sites were the aforementioned CAND1 G1069C and Tubulin beta 6 (TUBB6) G71C, CADD = 32, which resides proximal to the GTP binding site (Fig. 8C, G).

Of these intriguing variants, we selected CAND1 and HMGB1 for follow-up analysis. For each protein, we generated both the corresponding gain-of-cysteine mutations together with tryptophan mutations. Our prior work98 and that of others99 have shown the comparatively bulky tryptophan mutation serves as a useful surrogate for small molecule binding. Therefore, as our scout fragments are modestly potent, we chose to use tryptophan point mutations in lieu of small molecule treatment to minimize the risk of non-specific compound labeling complicating the interpretation of variant functionality. Using a coimmunoprecipitation assay, we find that CAND1 G1069C but not G1069W completely blocks interactions with CUL1 (Fig. 9A). This finding is notable given the important functions of this hairpin in mediating SKP1-SKP2 dissociation from SCF, which is critical to regulating the functions and composition of E3 ligase complexes100,101.

Fig. 9: Functional studies of CAND1 and HMGB1. A WT and G1069W mutant CAND1 proteins bind Cul1 while the G1069C CAND1 mutation perturbs binding. HEK293T cells were co-transfected with FLAG-Cul1 and the given HA-tagged CAND1 protein (WT, G1069C, or G1069W) or control FLAG-GFP. Anti-FLAG resin was used to pull down FLAG-Cul1 from cell lysates along with any complexed proteins. Western Blots were incubated with the indicated primary antibodies, *indicates a non-specific HA band. B HMGB1 proteins were tested for the ability to induce TLR4-mediated immune response using HEK-Blue reporter cell lines (hTLR4 and Null control) and corresponding PRR assay. Results show mean response ratios (error bars = SD, n = 4 per condition) of hTLR4 and Null cells to increasing concentrations (μg/mL) of WT, R110C, and R110W proteins as indicated over 2 independent experiments. AT = commercially available all-thiol fully reduced HMGB1; diS = commercially available disulfide HMGB1; working concentration of 0.2 μg/mL for both. 
Reference lines on the graph indicate ECmax (solid line), EC50 (dashed line), and H2O control (dotted line) response ratios to canonical positive control ligand (LPS) specific to the hTLR4 cell line. Significance determined via unpaired two-tailed Student’s t test; ** = p <0.01, *** = p <0.001, **** = p <0.0001. TLR4 200 ng/mL (WT vs R110C, p = 0.009; WT vs R110W, p <0.0001); TLR4 600 ng/mL (WT vs R110C, p = 0.009; WT vs R110W, p <0.0001). C Response ratio curve of hTLR4 and Null cells to positive control ligand (LPS). EC values are generated using nonlinear regression (Asymmetric (five parameters), X is concentration). For (A), western blot data are representative of three independent measurements. Data is found in Supplementary Data 6, and source data is in the Source Data file.

As a second case study, we turned to HMGB1, which is known to function as a redox-active cytokine74,75,76. Therefore, we opted to assess its binding to toll-like receptor (TLR) 4, which has previously been reported as bound specifically by the disulfide (Cys23-Cys45) form of HMGB1; the fully reduced (all thiol) protein does not activate TLR4 signaling activity. Notably, the fully oxidized (including Cys106) form of HMGB1 is also inactive77. Thus, we hypothesized that the R110C mutation we identified would decrease cytokine activity. To test this hypothesis, we expressed and purified recombinant wild-type HMGB1 together with both the R110C and R110W mutant proteins. Then, using a human TLR4 HEK-Blue reporter cell line74,75,76, we compared the relative TLR4 response to treatment with each protein. Providing evidence that our HMGB1 protein is active in this assay, we observe no significant difference between commercially available (TECAN) disulfide (diS) protein and our wild-type protein (Fig. 9B, C). 
Revealing the functional impact of the R110 mutations, we find that both the acquired cysteine and bulkier tryptophan scanning mutation significantly attenuate HMGB1-induced TLR4 response, with a more substantial effect observed for the tryptophan mutation. Taken together, these two case studies illustrate the utility of chemoproteogenomics in the discovery of functionally important gain-of-cysteine variants.
