Detection of host cell microprotein impurities in antibody drug products

Transcriptome-wide analysis of CHO cell translation initiation and elongation using Ribo-seq

The reduction of cell culture temperature (i.e., temperature shift) is a method used to extend the viability of some commercial cell culture processes and improve product quality35. In this study, we simulated an industrial temperature shift and conducted a series of Ribo-seq experiments. Our laboratory has demonstrated that temperature shift induces significant differences in gene expression and alters the cellular metabolism of a mAb-producing CHO-K1 cell line (CHO-K1 mAb)36. Given our previous findings and studies from other laboratories reporting the alteration of canonical protein abundance10,37,38,39, we reasoned that this temperature shift model would also induce widespread changes in translation regulation and provide an opportunity to identify novel Chinese hamster ORFs.

To perform ribosome footprint profiling, we conducted two identical cell culture experiments for the analysis of translation initiation and elongation. For both experiments, 8 replicate shake flasks were first grown for 48 h at 37 °C before the cell culture temperature was reduced to 31 °C (temperature shifted (TS) group; n = 4) while maintaining the remainder at 37 °C (non-temperature shifted (NTS) group; n = 4). Cells from both the TS and NTS groups were harvested for Ribo-seq 24 h after the reduction of cell culture temperature (Fig. 1a), at which point there was a reduced cell density of 30% (initiation experiment) and 24% (elongation experiment) in the TS sample group (Supplementary Fig. 1; Supplementary Data 1a). We performed ribosome footprint profiling experiments using harringtonine (HARR) (n = 8), an inhibitor of translation initiation24, and cycloheximide (CHX) (n = 8), an inhibitor of translation elongation16 (Fig. 1b). For each harringtonine-treated sample, a parallel sample (n = 8) was treated with DMSO and flash frozen to arrest translation (we refer to these data as No-drug (ND)).
For the CHX samples, matched gene expression profiles were acquired using total RNA-seq (n = 8) (Fig. 1c) to enable the identification of significant differences in translational efficiency (TE) between the NTS and TS sample groups.

Fig. 1: Analysis of CHO cell translation using ribosome footprint profiling. a 8 replicate shake flasks were seeded with a mAb-producing CHO-K1 cell line cultured for 48 h; at this point, the temperature of 4 shake flasks was reduced to 31 °C. At 72 h post-seeding, cells were harvested from the non-temperature and temperature-shifted cultures. We utilised (b) Ribo-seq using different inhibitors to capture information from initiating (harringtonine) and elongating (cycloheximide and no drug) ribosomes. The chemical structures shown for harringtonine (CID = 276389) and cycloheximide (CID = 6197) were obtained from PubChem94. In addition, (c) RNA-seq was used to measure the RNA levels. Following pre-processing of the raw Ribo-seq data, we (d) retained reads within the expected size range (coloured in blue) of RPFs (28–31 nt). As expected, no peak was observed for the RNA-seq data. An optimum P-site offset of 12 nt was selected for all datasets, where (e) an average of 66% of RPFs exhibited triplet periodicity. No framing pattern was observed for the RNA-seq data. A metagene analysis was conducted for each Ribo-seq dataset, confirming (f) the expected enrichment of RPFs at the TIS of annotated protein-coding genes for harringtonine Ribo-seq when compared to elongation Ribo-seq (average cycloheximide and no-drug treated RPFs). Separate metagene profiles for CHX and ND are shown in Supplementary Fig. 3. Source data are provided as a Source Data file. Panels a, b and c created with BioRender.com, released under a Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International license.

Sequencing of the 24 resulting Ribo-seq libraries yielded an average of ~69, ~68, and ~58 million reads per sample for the CHX, HARR, and ND Ribo-seq, respectively. An average of ~56 million reads per sample were obtained for the 8 RNA-seq libraries. Low-quality reads were removed, and adapter sequences were trimmed from the raw Ribo-seq and RNA-seq data. For Ribo-seq data, an additional filtering stage was carried out to eliminate contamination from non-coding RNA. Reads were mapped to bowtie40 indices constructed from Cricetulus griseus rRNA, tRNA, and snoRNA sequences obtained from v22 of the RNA Central database41. Reads aligning to any of these indices were discarded from further analysis. This filtering stage removed an average of ~55%, ~40%, and ~39% of trimmed reads for the CHX, HARR, and ND samples, respectively (Supplementary Fig. 2; Supplementary Data 2).

Next, we examined the remaining Ribo-seq reads within the expected RPF length range (25–34 nt) (Fig. 1d) to select the P-site offset (the distance from the 5’ end of a read to the first nucleotide of the P-site codon). Each Ribo-seq dataset was mapped to the Chinese hamster PICRH-1.0 genome using STAR42. The Plastid tool43 was used to assess the P-site offset and determine the proportion of reads exhibiting triplet periodicity (Fig. 1e) for NCBI-annotated canonical protein-coding genes for each offset. The optimum P-site offset was found to be 12 nt, for which ~71%, 65%, and 64% of reads exhibited the expected triplet periodicity for the CHX, HARR, and ND Ribo-seq datasets, respectively, and we retained the reads between 28–31 nt for further analysis (Fig. 1d).
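The idea behind the P-site offset and triplet periodicity check can be illustrated with a minimal Python sketch. This is not the Plastid implementation: the function name `frame_fractions`, the transcript coordinates, and the toy read positions are hypothetical and chosen only for illustration. Each read's 5’ end is shifted by the offset (12 nt here), and the resulting P-site position is assigned to a reading frame relative to the annotated CDS start.

```python
from collections import Counter


def frame_fractions(read_5p_positions, cds_start, offset=12):
    """Return the fraction of reads whose P-site lies in frame 0, 1 and 2.

    read_5p_positions: 5' end coordinates of reads on a transcript (hypothetical input).
    cds_start: coordinate of the first nucleotide of the annotated start codon.
    offset: distance from the read 5' end to the first nucleotide of the P-site codon.
    """
    counts = Counter((pos + offset - cds_start) % 3 for pos in read_5p_positions)
    total = sum(counts.values())
    return [counts.get(frame, 0) / total for frame in range(3)]


# Toy data: 8 reads on a transcript whose CDS starts at coordinate 100.
reads = [88, 91, 94, 97, 89, 100, 92, 103]
fractions = frame_fractions(reads, cds_start=100)
print(fractions)
```

In this toy case, six of the eight P-sites fall in frame 0. With real data, the fraction of in-frame reads would be compared across candidate offsets and read lengths to choose the settings used downstream.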
Prior to ORF identification, we confirmed the expected preferential enrichment of ribosomes at the translation initiation sites (TIS) of annotated protein-coding genes for the HARR Ribo-seq data in comparison to the CHX and ND Ribo-seq data (Fig. 1f; Supplementary Fig. 3).

Ribo-seq enables the characterisation of novel ORFs in the Chinese hamster genome

The Ribo-seq data was used to refine the annotation of translated regions of the Chinese hamster PICRH-1.0 genome by conducting a transcriptome-wide analysis using ORF-RATER44. The ORF-RATER algorithm integrates initiation and elongation Ribo-seq data to enable the identification of unannotated ORFs by first finding all potential ORFs beginning at user-defined start codons that have an in-frame stop codon. The experimental Ribo-seq data is then used to confirm occupancy at each TIS and assess whether the putative ORF is undergoing active translation. To maximise the sensitivity of ORF detection, we merged the RPFs for all replicates in each type of Ribo-seq experiment yielding a total of approximately 144, 169, and 140 million RPFs for the harringtonine, cycloheximide, and no-drug treated Ribo-seq, respectively. Prior to ORF identification, transcripts from 4583 pseudogenes were removed. In addition, transcripts that had low coverage (n = 18,951), a high proportion of multimapped reads (n = 10), or where the RPFs aligned to a small number of positions within a transcript (n = 1662) were also excluded from further analysis. For the remaining transcripts, the initial ORF-RATER search step was limited to ORFs that began at an AUG or near-cognate start codon (i.e., CUG, GUG, or UUG).
To determine if a potential TIS was occupied, only the RPF data from the HARR Ribo-seq was considered, while the CHX and ND-treated Ribo-seq data was utilised to determine if putative ORFs were translated by comparing the RPF occupancy of each ORF to the typical pattern of translation elongation observed for the CDS of annotated protein-coding genes.

An initial group of 27,784 ORFs identified by ORF-RATER with an ORF-RATER score of ≥ 0.5 (refs. 45,46) and ORF length ≥ 5 aa was selected for further analysis. The proteoforms identified included those present in the current annotation of the Chinese hamster genome (i.e., Annotated) and N-terminal extensions (i.e., Extension). Two distinct classes of ORFs initiating upstream of the annotated CDS (i.e., the main ORF) were also identified. The first type, called upstream ORFs (i.e., uORFs), initiates upstream and terminates before the start codon of the main ORF. The second upstream ORF type, termed overlapping upstream open reading frames (ouORFs), also initiates in the 5’ leader of mRNAs but extends downstream beyond the main ORF’s start codon and is translated in a different reading frame. As well as ORFs in mRNAs, we also identified ORFs in transcripts classified as non-coding in the PICRH-1.0 genome that had previously unannotated start and stop codons (i.e., New ORFs).

The conditions used to inhibit translation initiation can, in some cases, lead to the identification of false positive internal ORFs due to the capture of residual elongating ribosomes45. In our case, we utilized flash freezing in combination with harringtonine, which will also result in the capture of a proportion of RPFs from elongating ribosomes, increasing the probability of erroneous identifications. To reduce false positives from internal TIS, we discarded ORFs classified as truncated (n = 9365) or internal (n = 1723), and other low-confidence isoforms (n = 872), from further analysis. For the remaining ORFs, we utilised a method developed by Lee et al.47 to perform relative quantitation of the harringtonine signal at each TIS when compared to the ND Ribo-seq data. ORFs with a \(R_{\mathrm{harr}} - R_{\mathrm{nd}}\) value < 0.01 were eliminated from further analysis (n = 4633). The validity of ORFs with non-AUG TIS was further assessed in comparison to other proteoforms that overlapped on the same transcript. Where an AUG and non-AUG ORF were predicted to start within a 7 nt window, we eliminated the non-AUG initiated ORF. In cases when a pair of AUG-initiated ORFs, or a pair of non-AUG-initiated ORFs, were found within the window, only the ORF with the maximum \(R_{\mathrm{harr}} - R_{\mathrm{nd}}\) value was retained. For overlapping ORFs that started outside of the 7 nt window, non-AUG ORFs were retained if the \(R_{\mathrm{harr}} - R_{\mathrm{nd}}\) value at the TIS was at least five times higher than that of the AUG-initiated counterpart. This process eliminated a further 465 ORFs. The final stage in the assessment of novel ORFs was the calculation of Ingolia’s fragment length organisation similarity score (FLOSS)48 for all ORFs (annotated and novel); ORFs with a FLOSS score classified as an extreme outlier were removed (n = 525).

Upon the completion of the filtering process, 10,201 high-confidence ORFs were retained (Fig. 2a, Supplementary Fig. 4, Supplementary Data 3), of which ~44% (n = 4491) were not annotated in the Chinese hamster PICRH-1.0 genome. Of these new identifications, ~56% were predicted to start at near-cognate codons (i.e., CUG, GUG, or UUG). The novel N-terminal extensions, uORFs, and ouORFs that remained after this filtering process were compared to annotations in uORFdb49. The proportion of ORFs with near-cognate start codons was comparatively lower in Chinese hamster than in the 11 species examined, including human, mouse, and rat (Supplementary Fig. 5).
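The TIS-level filtering rules described above can be sketched in Python as follows. This is an illustrative reading of the text, not the authors' code: the `Orf` class, the function names, and the example values are hypothetical. Two points are assumptions because the text does not specify them: "within a 7 nt window" is interpreted as start positions differing by at most 7 nt, and pairs of the same start-codon type lying outside the window are both kept.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Orf:
    name: str
    tis: int      # transcript coordinate of the start codon (hypothetical)
    codon: str    # start codon, e.g. "AUG" or "CUG"
    delta: float  # R_harr - R_nd at the TIS


def passes_signal(orf, min_delta=0.01):
    """ORFs with R_harr - R_nd below 0.01 are eliminated."""
    return orf.delta >= min_delta


def resolve_pair(a, b, window=7, fold=5.0):
    """Return the ORFs retained from a pair overlapping on the same transcript."""
    a_aug, b_aug = a.codon == "AUG", b.codon == "AUG"
    if abs(a.tis - b.tis) <= window:        # assumption: window is inclusive
        if a_aug != b_aug:
            return [a if a_aug else b]      # AUG wins over non-AUG
        return [max(a, b, key=lambda o: o.delta)]  # same type: strongest signal
    if a_aug == b_aug:
        return [a, b]                       # assumption: not covered by the text
    aug, non_aug = (a, b) if a_aug else (b, a)
    kept = [aug]
    if non_aug.delta >= fold * aug.delta:   # non-AUG needs at least 5x the AUG signal
        kept.append(non_aug)
    return kept


x = Orf("x", 100, "AUG", 0.05)
y = Orf("y", 104, "CUG", 0.30)
z = Orf("z", 160, "CUG", 0.30)
w = Orf("w", 160, "CUG", 0.10)
print([o.name for o in resolve_pair(x, y)])
print([o.name for o in resolve_pair(x, z)])
print([o.name for o in resolve_pair(x, w)])
```

A full pipeline would apply `passes_signal` first and then resolve every overlapping pair per transcript; the pairwise function is kept separate here so that each rule is visible on its own.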
The ability to identify initiation at non-AUG codons enabled us to discover alternative ORFs of conventional protein-coding genes that would not be possible with previous annotation approaches for the Chinese hamster genome. For instance, ~11% (n = 527) of novel ORFs identified were N-terminal extensions of annotated protein-coding transcripts (e.g., Aurora kinase A (Fig. 2b)).

Fig. 2: Ribo-seq identifies thousands of novel CHO cell ORFs. In this study, we utilised the ORF-RATER algorithm to identify ORFs initiating at near cognate (i.e., NUG) start codons from the Ribo-seq data. A total of (a) 10,201 ORFs were identified, including 4491 that were not previously annotated in the Chinese hamster genome. These ORFs included N-terminal extensions in annotated protein-coding genes. For instance, we identified (b) a CUG initiated N-terminal extension in a transcript of Aurka. The RNA-seq, CHX coverage of the transcript (full coverage [coloured grey] and P-site offset CHX coverage [coloured by reading frame relative to the annotated TIS]) along with the HARR-ND coverage (P-site offset) are shown, illustrating the initiation signal at the CUG start codon upstream of the NCBI annotated AUG start codon. Source data are provided as a Source Data file.

The Chinese hamster genome harbours thousands of short open reading frames

We also identified a considerable number of previously uncharacterized short open reading frames (sORFs) in the Chinese hamster genome (Supplementary Data 3). sORFs are defined as ORFs predicted to produce proteins < 100 aa, termed microproteins50. More than 90% of the ORFs identified in the 5’ region of mRNAs (uORFs (Fig. 3a) and ouORFs (Fig. 3b)) or in transcripts previously annotated as non-coding (Fig. 3c) were sORFs (Fig. 3d). 2276 uORFs were classified as sORFs and had an average putative microprotein length of 23 aa (Supplementary Fig. 6a). AUG (49.9%) was the most prevalent start codon, followed by CUG (29.3%), GUG (12.7%), and UUG (8.2%).
The average predicted microprotein length of the ouORFs classified as sORFs (n = 918) was 39 aa (Supplementary Fig. 6b), with CUG (37.7%) the most frequent start codon, followed by AUG (33.0%), GUG (18.5%), and UUG (10.8%). For the New ORF class, the majority (480 of 487) were sORFs and found in transcripts annotated in NCBI as non-coding. The average length of the microproteins predicted to be encoded by these sORFs was 30 aa (Supplementary Fig. 6c). AUG (70.0%) was the most common start codon, followed by CUG (18.3%), GUG (8.3%), and UUG (3.3%).

Fig. 3: Ribosome footprint profiling uncovers thousands of short open reading frames in the Chinese hamster genome. Examples are shown of (a) a uORF found in a Ddit3 transcript, (b) an ouORF in a Rab31 transcript, and (c) a sORF found in the transcript of a long non-coding RNA gene. a, b and c show the full CHX coverage [coloured grey] and P-site offset CHX coverage [coloured by reading frame relative to the annotated TIS]. Many previously uncharacterized ORFs identified in this study were (d) sORFs predicted to produce proteins < 100 aa. We focused on short open reading frames found in the 5’ leader of protein-coding transcripts (i.e., upstream ORFs and start overlapping uORFs) as well as ORFs found in non-coding RNAs where > 90% of all identified ORFs in these classes were sORFs. Comparison of the (e) amino acid frequencies of uORFs (both uORFs and ouORFs) and ncRNA sORFs to annotated proteins, as well as the expected amino acid frequency for the Chinese hamster genome, revealed differences in usage of amino acids, including arginine and glycine when compared to conventional protein-coding ORFs (≥ 100 aa). Supplementary Fig. 7 shows the frequency of all amino acids. The sORF populations were also found to have (f) a reduced codon adaptation index (CAI) compared to previously annotated canonical proteins. A two-sided Kolmogorov–Smirnov test was used to assess the CAI difference between ORF types; a p-value < 0.01 was considered significant. The (d) and (f) boxplot centre lines show the median, and the whiskers extend to 1.5× the interquartile range. Source data are provided as a Source Data file.

Upstream ORFs and sORFs in the New ORF group were found to have differences in amino acid usage when compared to annotated proteins with ≥ 100 aa (Fig. 3e and Supplementary Fig. 7). The amino acid usage was comparable to a recent analysis conducted for microproteins encoded in the human genome26. CHO cell sORFs were found to have increased usage of arginine, glycine, and tryptophan and decreased usage of asparagine, glutamate, lysine, and aspartic acid. Alanine and proline were more prevalent in uORFs than annotated proteins and sORFs found in ncRNA, while methionine usage was more frequent in sORFs in ncRNA. We also compared the codon adaptation index (CAI)51 between previously annotated protein-coding genes and novel ORF types. We found that the sORF population had a lower CAI (Fig. 3f), indicating that microproteins tend to have a lower abundance than canonical proteins.

Microproteins are a source of process-related impurities in antibody drug products

Next, we sought to determine if microproteins predicted to be encoded by novel sORFs increased the coverage of MS-based HCP detection. We performed HCP analysis in our laboratory utilising the SP3 sample preparation method followed by liquid chromatography-tandem mass spectrometry (LC-MS/MS)52,53 to analyse 5 commercial mAb drug products (adalimumab, denosumab, nivolumab, pertuzumab and vedolizumab) as well as an Fc-fusion protein drug product (etanercept). We also utilised a publicly available dataset from a previous HCP analysis of 4 mAb drug products (adalimumab, bevacizumab, nivolumab and trastuzumab) (Fig. 4a)54.
In total, we analysed 10 separate LC-MS/MS datasets spanning 8 antibody drug products and 5 different sample preparation methods (Fig. 4b), acquired on two different Orbitrap MS instruments, both operated in data-dependent acquisition (DDA) mode (Fig. 4c).

Fig. 4: Microproteins are a class of potential host cell impurity in antibody drug products. We utilised (a) data from LC-MS/MS-based HCP analyses of 8 antibody-drug products generated in our laboratory as well as a previous study by Pythoud et al. spanning (b) 5 sample preparation methods and captured on (c) 2 types of Orbitrap MS instruments. The DDA MS data was first searched against canonical proteins using MetaMorpheus with spectral mass calibration enabled, resulting in the identification of (d) canonical proteins in each product tested. For microproteins, the PepQuery2 algorithm was used to (e) detect microprotein PSMs. The false positive rate is reduced by PepQuery2 by statistical evaluation against randomly shuffled sequences (i.e. PepQuery p-value) and an unrestricted modification search against known proteins. We identified (d) 40 microproteins with (f) 28 microproteins found in more than one drug product. Note: The number of microproteins detected for each sample preparation method employed in the Pythoud et al. study is shown in Supplementary Fig. 8. Source data are provided as a Source Data file. Panels a and c created with BioRender.com, released under a Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International license.

The first stage in our MS analysis was to search the data using MetaMorpheus55 with spectral mass calibration enabled to identify canonical HCPs present in each drug product.
For each drug product, a protein sequence database was constructed comprising the Chinese hamster reference proteome (n = 23,887, downloaded from UniProt on 27/03/2024), sequences of sample contaminants from the common Repository of Adventitious Proteins (cRAP, thegpm.org/crap), the respective recombinant protein sequences and any quantitation or retention time standards that were added to the sample. A canonical protein was considered confidently identified if ≥ 2 peptides were detected and the protein-level FDR was < 0.01. We found canonical HCPs in all antibody drug products tested in both studies, including several previously identified HCPs in adalimumab (e.g., cathepsin L1, S100a11 (ref. 56)) and vedolizumab (e.g., clusterin57) (Fig. 4d, Supplementary Data 4).

To identify microproteins in the drug product HCP data, we used PepQuery2 (refs. 58,59), a peptide-centric algorithm designed specifically for detecting novel proteins and utilized previously for microprotein validation60. A PepQuery2 index was constructed from the MetaMorpheus mass-calibrated LC-MS/MS data from all drug product HCP analyses comprising 132 samples and > 5.2 million MS/MS spectra. The microproteins annotated by Ribo-seq (n = 3681) were digested with semi-tryptic specificity and searched against the PepQuery index to identify candidate peptide spectral matches (PSMs). False positive microprotein identifications are initially eliminated by comparing these PSMs against peptides from the reference proteome (i.e., annotated Chinese hamster proteins, all antibody drug products, mass spectrometry standards, and contaminant sequences). The known peptide set comprised tryptic peptides from all reference proteins. In addition, we utilized MetaMorpheus to perform a liberal search of the drug product data with semi-tryptic specificity. Semi-tryptic peptides from a protein identified in any drug product at an FDR < 10% (n = 2973) were also included in the known peptide set.
PSMs initially associated with microproteins were excluded if a known peptide matched the same spectrum with an equal or greater PepQuery score. The remaining candidate microprotein PSMs were compared to randomly shuffled microprotein peptide sequences to determine the FDR. In the final stage of PepQuery, an unrestricted modification search was performed, and if a candidate microprotein PSM was found to have a better match to a post-translationally or artefactually modified known peptide it was eliminated58,59.

The identification of microproteins using mass spectrometry is challenging because of their small size: lower abundance and fewer trypsin-amenable cleavage sites than canonical proteins reduce the number of peptides detectable by MS61. Similar to other studies26,62, we considered a single peptide sufficient for microprotein identification. Only those microprotein peptides designated as confident by PepQuery (i.e., a p-value < 0.05 if the peptide length ≤ 8 aa, or < 0.01 for peptides > 8 aa) (Fig. 4e) and found in at least 50% of the replicates of each drug product or sample preparation cohort were retained (Supplementary Data 5).

A total of 40 microprotein HCPs were identified across the eight antibody products (Fig. 4d). Of the 5 sample preparation methods, native digestion protocols resulted in the lowest number of microprotein identifications (Supplementary Fig. 8). While we detected two or more peptides from 13 of these microproteins across the dataset, most microprotein identifications resulted from the detection of a single peptide. Etanercept had the largest number of individual microproteins identified (n = 11) from the data generated in our laboratory, while the bevacizumab and trastuzumab samples analyzed by Pythoud et al. had the largest number of microproteins (n = 18). The lowest number of microproteins detected was in vedolizumab (n = 4).
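The retention criteria stated above (length-dependent p-value threshold plus detection in at least 50% of replicates) can be written as a small Python sketch. It covers only those two criteria and not PepQuery's competitive filtering or modification search; the function names, peptide sequences, and p-values are hypothetical.

```python
def is_confident(peptide, p_value):
    """Apply the length-dependent p-value threshold quoted in the text."""
    threshold = 0.05 if len(peptide) <= 8 else 0.01
    return p_value < threshold


def is_retained(peptide, replicate_p_values, min_fraction=0.5):
    """Keep a peptide if it is confident in at least half of the replicates.

    replicate_p_values: one entry per replicate; None means no PSM was found.
    """
    hits = sum(
        1 for p in replicate_p_values
        if p is not None and is_confident(peptide, p)
    )
    return hits / len(replicate_p_values) >= min_fraction


# Hypothetical peptides of 8 and 9 residues with the same p-value.
print(is_confident("ACDEFGHK", 0.03))
print(is_confident("ACDEFGHIK", 0.03))
# Confident in 2 of 4 replicates, which meets the 50% requirement.
print(is_retained("ACDEFGHIK", [0.001, None, 0.2, 0.005]))
```

The replicate requirement is evaluated per drug product or sample preparation cohort in the text, so in practice `is_retained` would be called once per cohort.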
Twenty-eight microproteins were identified in more than one of the drug products examined (Fig. 4f). A single microprotein was found in the adalimumab analyzed in this study and the Pythoud et al. study, while two microproteins were found in the nivolumab data from both studies. A 16 aa microprotein from a CUG-initiated ouORF in a Znf883 transcript (XM_027419542.2) was found in 7 of 8 antibody-drug products tested (Fig. 4f).

We utilised the data generated in our laboratory for six drug products to assess the quantities of canonical proteins and microproteins present in the drug products. The canonical and microprotein PSMs identified by MetaMorpheus and PepQuery were combined, and FlashLFQ was used to perform label-free quantitation. The Hi3 standard quantitation method53 was used to determine the presence of individual HCPs. The ten quantifiable canonical HCPs detected across the six drug products ranged in concentration from 1.52 ppm to 46.94 ppm (median = 14.28 ppm). The Hi3 method requires ≥ 3 identified peptides for confident quantification of a protein. A single quantifiable microprotein met the 3-peptide confidence level for accurate quantitation and was found to be present in etanercept at a concentration of 1.92 ppm. Given the challenges of microprotein identification, we also decided to estimate the concentration of the microproteins with fewer than three peptides identified (Supplementary Fig. 9). A microprotein with two peptides identified was found at 0.16 ppm. Quantified microproteins that were identified from a single peptide (n = 15) represent the lowest confidence estimates in terms of HCP abundance. The majority of these microproteins (n = 14) were found to be below the median concentration observed for canonical HCPs.
In contrast, the remaining microprotein abundance estimate was ~800 ppm, exceeding the canonical HCP range.

The translation efficiency of sORFs found in non-coding RNA genes is altered in response to a reduction of cell culture temperature

Ribo-seq can also be used to assess translation efficiency by normalizing RPF occupancy of each ORF to the corresponding RNA abundance. Comparing translation efficiency between conditions enables transcriptome-wide differences in translational regulation to be identified16. To understand the extent to which changes in the translatome are associated with the CHO cell response to sub-physiological temperature, we performed a count-based analysis of translation efficiency using the CHX-treated Ribo-seq data along with the parallel RNA-seq data captured for TS (n = 4) and NTS (n = 4) samples (Supplementary Fig. 10).

The Plastid43 cs generate algorithm was used to construct a gene-level annotation by merging the positions of all exons found in all transcripts of a gene. Only those RPFs/reads mapping to the CDS regions common to all transcripts for a particular gene contributed to the overall count. The RPFs/reads from the first 15 and last five codons for ORFs > 100 aa and the first and last codons for ORFs < 100 aa were excluded. This step was intended to reduce potential bias from the cycloheximide-associated accumulation of ribosomes at the beginning and end of the CDS and enrich for those RPFs most likely to be associated with elongation63. It is not possible to accurately distinguish the expression/occupancy of the uORF/ouORFs from the canonical ORFs with the gene-level CDS counting approach used here. We, therefore, focused only on the ORF cohort identified in genes previously classified as non-coding in the reference annotation. 495 of these ORFs were identified, 480 of which were predicted to encode a microprotein (Fig. 5a). The average length of potential microproteins was 30 aa (Fig. 5b), with the majority of parent transcripts harbouring 1 or 2 sORFs. However, there were instances where as many as five sORFs were found in a single non-coding RNA transcript (Fig. 5c). To ensure compatibility with the Plastid read/RPF counting algorithm43, we utilised only the longest sORF for each transcript (n = 395).

Fig. 5: Temperature shift induces alterations in translation regulation of CHO cell canonical ORFs and sORFs. To characterise the impact of reducing cell culture temperature, we analysed translation efficiency for canonical ORFs and non-coding RNA sORFs. Of the 513 ORFs classified as New by ORF-RATER, the (a) majority (n = 480) were sORFs found in non-coding RNA genes. The average length of these sORFs was (b) 30 aa with as many as (c) 5 encoded by a single transcript. Only the longest ORF per transcript (n = 395) predicted to encode a microprotein was included in the analysis. The deltaTE method was used to identify changes in RNA abundance and RPF occupancy between the TS (biological replicates n = 4) and NTS samples (biological replicates n = 4) using the RNA-seq (RNA) and Ribo-seq (RPF) data. d 2837 genes were found to be forwarded (significant RNA and RPF difference, no translation efficiency (TE) difference following DESeq2 analysis (two-sided Wald test, Benjamini-Hochberg (BH) adjusted p-value < 0.05)). 279 genes were found to be regulated exclusively at the level of translation (significant difference in RPF and TE, no RNA difference). 392 genes were buffered (a significant RNA difference anticorrelated with a difference in TE). 199 genes were found to be intensified (RNA difference correlated with TE difference). Following the application of the fold change filter (with ≥ |1.5| fold change for RNA and RPF for forwarded genes and ≥ |1.5| TE for translation exclusive, buffered and intensified), the resulting 1220 genes were used to perform an overrepresentation analysis against GO biological processes.
The proportion of translationally regulated genes contributing to the significant enrichment (BH adjusted p-value < 0.05) of the 56 biological processes was determined. The (e) 10 biological processes with the largest proportion of translationally regulated genes are shown. 15 sORFs were found (f) within the forwarded category, and 5 were (g) found to undergo changes via buffering, intensification, and regulation upon the reduction of cell culture temperature. The (g) boxplot centre line shows the median TE (normalised RPF count divided by normalised RNA count), and the whiskers extend to 1.5× the interquartile range. Source data are provided as a Source Data file.

We retained only the CDS regions (n = 10,741) which had an average count of 10 in both the RNA-seq and Ribo-seq datasets, and used the deltaTE method64 to identify and classify ORFs with differences in transcription and/or translational efficiency (ΔTE) between the TS and NTS samples. The deltaTE method introduces an interaction term to the DESeq2 generalized linear model to assess differences between the biological conditions observed from RNA-seq and Ribo-seq data separately and, importantly, between the different assays to calculate the false discovery rate (FDR) for changes in the RNA abundance, RPF occupancy and the translation efficiency for each gene. We initially classified the outputs from DESeq2 using only statistically significant differences (adjusted p-value < 0.05) and not fold change, as outlined by the deltaTE developers (Fig. 5d). Of the 3707 genes classified by deltaTE, we found that 76.5% (n = 2837) were transcriptionally forwarded, where a significant increase or decrease in RPF occupancy agreed with the change in RNA abundance observed (i.e., ΔTE adjusted p-value ≥ 0.05). 7.5% (n = 279) of differences between the TS and NTS samples were found to be translation exclusive, where both the RPF occupancy and ΔTE were significantly altered, while the RNA was unchanged (adjusted p-value ≥ 0.05).
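The four regulatory categories summarised in the Fig. 5 legend can be expressed as a simple decision rule on the adjusted p-values and fold-change directions. The Python sketch below follows the definitions given in this section rather than the deltaTE source code; the function name, argument names, and example values are hypothetical, and genes that match none of the four definitions are labelled "unclassified" as an assumption.

```python
def classify_gene(rna_padj, rpf_padj, te_padj, rna_lfc, te_lfc, alpha=0.05):
    """Assign a regulatory category from adjusted p-values and log fold changes."""
    rna_sig = rna_padj < alpha
    rpf_sig = rpf_padj < alpha
    te_sig = te_padj < alpha
    if rna_sig and te_sig:
        # Same direction: translation amplifies the RNA change; opposite: it tempers it.
        return "intensified" if rna_lfc * te_lfc > 0 else "buffered"
    if rpf_sig and te_sig and not rna_sig:
        return "translation exclusive"
    if rna_sig and rpf_sig and not te_sig:
        return "forwarded"
    return "unclassified"  # assumption: not one of the four categories in the text


# Hypothetical genes, one per category.
print(classify_gene(0.001, 0.002, 0.60, rna_lfc=1.2, te_lfc=0.1))
print(classify_gene(0.40, 0.003, 0.01, rna_lfc=0.1, te_lfc=0.9))
print(classify_gene(0.001, 0.30, 0.01, rna_lfc=1.0, te_lfc=-0.8))
print(classify_gene(0.001, 0.001, 0.01, rna_lfc=1.0, te_lfc=0.7))

# The category counts reported in the text sum to the 3707 classified genes.
counts = {"forwarded": 2837, "translation exclusive": 279, "buffered": 392, "intensified": 199}
total = sum(counts.values())
print(total)
```

Keeping the significance calls separate from the direction test makes it straightforward to add the subsequent ≥ |1.5| fold-change filter as an independent step.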
The remainder of genes that were altered at both the transcriptional and translational level (i.e., RNA and ΔTE adjusted p-value < 0.05) were further classified by taking the direction of the change into account. 10.5% (n = 392) of genes were found to be buffered where changes in transcription were tempered at the level of translation, e.g., an increase in RNA abundance was associated with a decreased ΔTE. The remaining 5% (n = 199) of genes were found to be intensified by translation regulation, e.g., an upregulation in transcription was accompanied by an increased ΔTE.To determine the extent to which translation regulation plays a role in the CHO cell response to mildly hypothermic conditions, we further filtered the deltaTE output. For transcriptionally forwarded genes, we retained only those genes (n = 863) with ≥ |1.5| fold change between the TS and NTS samples for both the RNA and RPF data. For translationally regulated categories (translation exclusive, buffered, and intensified), only genes with a ≥ |1.5| change in ΔTE were retained (n = 357) (Supplementary Data 6). Both cohorts of genes were combined, and an overrepresentation analysis against the genome ontology (GO) was performed (Supplementary Data 7). We determined the proportion of translation-exclusive, buffered, and intensified genes contributing to the 56 enriched GO categories (FDR < 0.05). For 20 significantly enriched biological processes, > 25% of genes were found to be differentially translationally regulated (Fig. 5e). We identified 15 sORFs within the forwarded cohort where a ≥ |1.5| fold increase or decrease in RNA-seq and Ribo-seq was observed without a change in ΔTE (Fig. 5f, Supplementary Fig. 11). A further five sORFs had a | 1.5| fold increase or decrease in ΔTE and were classified as buffered, intensified, or regulated exclusively through translation (Fig. 
5g).

The abundance of CHO cell microproteins changes in response to mild hypothermia and between the exponential and stationary growth phases

We utilised proteomic mass spectrometry to determine if microproteins predicted to be encoded by sORFs were present in whole-cell lysates. These data allowed us to overcome the inherent limitations of the RNA-seq and Ribo-seq analyses and enabled the identification of microproteins from the uORF and ouORF classes, as well as cases where a single non-coding RNA gene encodes multiple microproteins. We also sought to determine if microprotein abundance was altered upon reducing cell culture temperature (Fig. 6a). For this experiment, we again acquired cells from a non-temperature shifted control at 72 h post seeding (n = 3) and 24 h post-temperature shift (72 h post seeding) (n = 3) as well as a sample at 48 h post-temperature shift (96 h post seeding) (n = 3) (Supplementary Fig. 12a, Supplementary Data 1b). An additional proteomics experiment was performed for a second CHO cell line to assess if microproteins could be detected and if abundance was altered between the exponential and stationary phases of cell growth (Fig. 6b). Here, a non-mAb producing CHO-K1 GS cell line was cultured for 7 days; samples were acquired for proteomics at 96 h post-seeding when the cells were in exponential growth (n = 4) and at 168 h when the cells had entered stationary phase (n = 4) (Supplementary Fig. 12b, Supplementary Data 1c). Cell lysates from both proteomics experiments were subjected to an SP3 protein clean-up procedure and tryptic digestion before LC-MS/MS (Fig. 6c). The resulting MS data from each proteomics sample were searched in an identical manner to that of the drug product HCP data. Canonical proteins were identified using MetaMorpheus (protein-level FDR < 0.01, ≥ 2 peptides detected). For PepQuery, an index was constructed comprising > 2.9 million MS/MS spectra.
The complete set of tryptic peptides from the reference proteins, along with those semi-tryptic peptides identified from a liberal MetaMorpheus database search (FDR < 10%) of the data from both proteomics experiments (n = 16,949), was utilised for the PepQuery known peptide set. Only those microproteins designated as confident by PepQuery in at least 50% of the replicates of a sample cohort were retained.

Fig. 6: Proteomic analysis of CHO cell microproteins in response to mild hypothermia and at different cell culture growth phases. To determine if ORFs predicted from the Ribo-seq could be identified at the protein level, we conducted LC-MS/MS-based proteomics. To generate the samples for proteomics (a) the temperature shift model used for Ribo-seq and RNA-seq was repeated (biological replicates, n = 3) and (b) cells from a different non mAb-producing CHO-K1 GS cell line were captured (biological replicates, n = 4) during exponential growth (Day 4) and in the stationary phase (Day 7). Proteins were extracted from cell lysates, and a (c) SP3-based protein cleanup method followed by tryptic digestion was used to prepare samples for MS analysis. The resulting data was searched using MetaMorpheus for canonical proteins and PepQuery2 for microproteins in the same fashion as the drug product data. This analysis identified (d) 4737 and 5024 canonical proteins for the temperature shift and growth rate experiments, respectively. For microproteins, (e) 110 and 53 were identified from the temperature shift and growth rate proteomics experiments, respectively. The microproteins identified for both experiments (f and g) originated from uORF, ouORF, and RNAs previously annotated as non-coding. (h) 28 microproteins were identified in both proteomics experiments, while 2 lysate-identified microproteins were also found in antibody drug products.
We identified microproteins that were significantly differentially abundant (proDA two-sided Wald test, BH adjusted p-value < 0.05) upon a comparison of non-temperature shifted control to those samples acquired at (i) 24 h, (j) 48 h post-temperature shift, as well as between (k) the exponential and stationary phases of cell culture. Source data are provided as a Source Data file. a, b, and c created with BioRender.com, released under a Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International license.

For the temperature shift proteomics experiment, 4737 canonical proteins were identified across the nine samples analysed by mass spectrometry (Fig. 6d, Supplementary Data 8a). 110 microproteins were detected from the uORF (n = 45), ouORF (n = 47) and New (n = 18) classes (Fig. 6f, Supplementary Data 8b). For the growth phase experiment, 5024 canonical proteins were identified along with 53 microproteins (Fig. 6e, Supplementary Data 8c) originating from the uORF (n = 28), ouORF (n = 19), and New (n = 6) classes (Fig. 6g, Supplementary Data 8d). Twenty-eight microproteins were detected in both the temperature shift and growth phase experiments (Fig. 6h), with two microproteins found in the CHO lysate samples and in antibody drug products (Fig. 6h).

The PSMs for confidently identified canonical proteins and microproteins for each experiment were merged, and FlashLFQ with match-between-runs enabled was used to generate LFQ values for each experiment. Only those canonical proteins and microproteins quantified in at least 50% of samples in a replicate cohort were retained for further analysis. The proDA algorithm65 was used to identify proteins significantly altered (i.e., log2 fold change of ≥ |1.2| and BH adjusted p-value < 0.05) between conditions for the temperature shift and growth rate experiments.
Upon comparison of the 24 h and 48 h post-temperature shift samples to the non-temperature shifted control, 454 and 1117 canonical proteins were differentially expressed, respectively (Supplementary Data 9a, 9c). Of the 49 microproteins reliably quantified in this experiment, the abundance of 4 microproteins at 24 h and 9 microproteins at 48 h post-temperature shift was found to be altered (Fig. 6i, j, Supplementary Data 9b, 9d). In the second proteomics experiment, 1636 canonical proteins (Supplementary Data 9e) and 3 of the 10 quantified microproteins were found to be differentially expressed upon a comparison of the exponential and stationary phases of cell growth (Fig. 6k and Supplementary Data 9f).
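The retention and significance criteria used for the quantitative proteomics comparisons can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the analysis code used in the study: the function names and data layout are hypothetical, missing LFQ values are represented as `None`, and the sketch assumes that a protein is retained if it meets the 50% criterion in at least one replicate cohort, which is one possible reading of the text. The p-values themselves would come from proDA and are treated here as given inputs.

```python
from typing import Dict, List, Optional

MIN_FRACTION = 0.5    # "at least 50% of samples in a replicate cohort"
MIN_ABS_LOG2FC = 1.2  # fold-change threshold as stated in the text
ALPHA = 0.05          # BH adjusted p-value threshold


def quantified_fraction(lfq_values: List[Optional[float]]) -> float:
    """Fraction of samples in one cohort with an LFQ value (None = missing)."""
    if not lfq_values:
        return 0.0
    return sum(v is not None for v in lfq_values) / len(lfq_values)


def is_retained(cohorts: Dict[str, List[Optional[float]]],
                min_fraction: float = MIN_FRACTION) -> bool:
    """Assumption: passing the threshold in any single cohort is sufficient."""
    return any(quantified_fraction(v) >= min_fraction for v in cohorts.values())


def is_differential(log2_fc: float, padj: float,
                    min_abs_log2fc: float = MIN_ABS_LOG2FC,
                    alpha: float = ALPHA) -> bool:
    """Apply the fold-change and adjusted p-value thresholds together."""
    return abs(log2_fc) >= min_abs_log2fc and padj < alpha


if __name__ == "__main__":
    # Hypothetical microprotein quantified in 2 of 3 control replicates only.
    protein = {"NTS_72h": [21.4, None, 20.9], "TS_24h": [None, None, None]}
    print(is_retained(protein), is_differential(-1.6, 0.003))
```

If the stricter reading is intended (the 50% criterion must hold in every cohort), replacing `any` with `all` in `is_retained` gives that behaviour.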
