Rp3: Ribosome profiling-assisted proteogenomics improves coverage and confidence during microprotein discovery

Reference-guided transcriptome assembliesTo maintain consistency with the previous analysis, we obtained the GTF files produced by Martinez et al. (2023)26 for brown, beige, and white mouse adipose tissue. These files were generated after trimming the reads, removing the ones that mapped to rRNAs and tRNAs and aligning the remaining reads to the genome with STAR. The transcriptome was assembled using StringTie with default parameters to identify novel transcripts. For human data, we used GTF files from the cell lines HeLa-S3, and K562 from Martinez et al. (2020)2 and HEK293T from Ma et al. (2018)39. These GTF files were generated after aligning trimmed and ribosomal-clean reads to the human genome (assembly hg19) with STAR. The transcriptome was assembled using Cufflinks with default parameters while requiring multi-mapping and fragment bias correction. The reads for these three cell lines were obtained from total RNA-Seq ran on Illumina HiSeq 2500 and NextSeq 500.Generation of custom databases for proteomicsTo generate a database containing the whole coding potential of each cell line, we translated each transcriptome assembly generated to the three reading frames with the script GTFToFasta2, giving priority to ATGs. In the case of multiple ATGs for a region between two stop codons, the most upstream ATG was selected. For the mouse datasets, we generated a non-redundant GTF file containing the merged transcriptomes for the three assemblies. This was done because the mass spectrometry files in the study were generated by combining the tryptic digests of the secretomes and proteomes of all three adipose tissue phenotypes. Afterward, we appended the Swiss-Prot Mouse proteome from Uniprot (UP000000589) to the database and tagged each annotated sequence to be removed in the post-processing steps. For each resulting database, we generated a decoy database containing the reverse sequences of each protein plus the contaminants, which were appended to the initial database. We performed the same bioinformatic steps for the HLA peptidomics proteogenomics analysis but used previously assembled human transcriptomes from HEK293T39, HeLa-S3, and K562 cell lines2 to generate the custom databases, to which we appended the Swiss-Prot human reference proteome from Uniprot (UP000005640). We have implemented this step in the database mode of the Rp3 pipeline.Peptide search and post-processing of mass spectrometry dataTo identify microproteins with Rp3, we used MSFragger (v3.5)40 to map fragmentation spectra to the custom databases generated in the last step. For mouse datasets we used proteomics datasets that were previously generated for mouse brown, beige, and white adipose tissues. We used DDA files from whole-cell lysates and whole-secretome proteomes. The parameters were set as follows: precursor mass tolerance of 20 ppm, fragment mass tolerance of 50 ppm, Data-dependent acquisition as data type, fragment mass tolerance of 0.02 Da, and included carbamidomethylation (+57.021464 Da) as a fixed modification, and oxidation of methionine (+15.9949 Da) as a variable modification. For the validation step, we used Percolator (v3.06.1)41 to infer the FDR with a cutoff set at 0.01 at the protein level using the –picked-protein method. All pin files from the same datasets were processed together to better infer the FDR. After the search, we removed any peptides that matched an annotated protein but kept those that matched more than one predicted microprotein from the three-frame translated database. To make sure no annotated sequence was included in the results, we performed a Blastp (v2.12.0+) search against NCBI Refseq and excluded any microproteins that had a hit with identity and query cover 100%, i.e., a perfect match. Lastly, we performed a second round of searches to re-score the identified peptides by appending the final set of microproteins from the first search to the same reference proteome. Then, this smaller database was searched using the same mass spectrometry data and following the same bioinformatics steps. This was done to accurately assess the FDR without the effect of bloated databases coming from the three-frame translation of the transcriptome. During re-scoring, we applied MSBooster (v1.2.1)42 after the peptide search with MSFragger to predict retention times (RT) for each peptide using DIA-NN43. The delta_loess_RT values were included in the ‘pin’ file generated during the peptide search as an extra feature for Percolator, so its model would include the calibrated RT values during prediction, which correspond to the experimental RT mapped to the predicted scale of a local regression.To process HLA peptidomics data, since we are looking for peptides with no specific enzymatic cleavage, such as trypsin’s, we specified nonspecific cleavage with the parameters — search_enzyme_cutafter ARNDCQEGHILKMFPSTWYV and –search_enzyme_name nonspecific. Additionally, we specified the parameters –num_enzyme_termini 0, –precursor_true_tolerance 6, –digest_mass_range 500.0_1500.0, –max_fragment_charge 3, –digest_min_length 8, –digest_max_length 25. Since the evidence provided by HLA peptidomics is the intact peptide, and a protein does not undergo experimental digestion beforehand, we applied a cutoff for an FDR of 0.01 at the peptide level. Similarly to the mouse datasets, we performed the same re-scoring steps and applied MSBooster to predict RT retention times. After removing every peptide that matched an annotated protein, we ran mhcflurry-predict-scan from the MHCFlurry package to identify which microproteins would contain peptides with strong binding affinity to the MHC Class I using the alleles HLA-A*02:01,HLA-A*03:01,HLA-B*57:01,HLA-B*45:01,HLA-C*02:02,HLA-C*07:02,HLA-A*01:01,HLA-A*02:06,HLA-B*44:02,HLA-B*07:02,HLA-C*01:02,HLA-C*03:01. We selected peptides with an affinity ≤ 100 and an affinity percentile ≤ 0.02. Only peptides 7–12 aa long were included (–peptide-lengths 7–12).To check for proteomics coverage for the microproteins encoded by the smORFs identified with Ribo-Seq, we used the same parameters for MSFragger and Percolator and applied similar filters. The databases consisted of microproteins identified by either PRICE, Ribocode, or RibORF in each cell line. Since the database was much smaller than the proteogenomics one used for Rp3, we did not re-score the results from the first search to increase the number of identifications and keep fair comparisons. To identify which Rp3 microproteins would be annotated in Uniprot unreviewed databases, we performed a string match for each mass spectrometry peptide and protein in the Uniprot database and classified them based on their annotation level, which indicates how well-characterized the protein is. We have implemented the peptide search and post-processing of mass spectrometry data in the search mode of the Rp3 pipeline.Processing of Ribo-Seq dataWe trimmed the Ribo-Seq reads to remove the Illumina adapters (AGATCGGAAGAGCACACGTCT) using fastx_clipper with the parameters -Q 33, -l 20, -n, -v, -c, piped to fastx_trimmed with the parameters -Q 33, -f 1, both tools from the FastX Toolkit (v0.0.13) (http://hannonlab.cshl.edu/fastx_toolkit/). To remove reads mapping no tRNAs and rRNAs, we mapped the trimmed reads to a contaminant database running STAR with -outSAMstrandField intronMotif, –outReadsUnmapped Fastx. To map reads to the genomic coordinates of Rp3 smORFs, we used unmapped reads in fastq format as input for STAR with the parameters –outSAMstrandField intronMotif, -outFilterMismatchNmax 2, –outFilterMultimapNmax 9999, –chimScoreSeparation 10, –chimScoreMin 20, –chimSegmentMin 15, –outSAMattributes All. We specified a max of 9999 multi-mappings so STAR would not discard all reads mapping up to that many regions, as it does not keep reads with a number of alignments above the specified threshold. Reads for RibORF, Ribocode, and PRICE were aligned specifying a –outFilterMultimapNmax 4, as these tools use Ribo-Seq evidence alone. The Rp3 pipeline is more permissive because it processes smORFs identified with proteomics evidence, and uses the reads only to check for coverage for these microproteins. RibORF was run with default parameters, keeping only the ones with a score ≥0.7. Ribocode and PRICE both require CDS and start and stop codons in the GTF file to generate metaplots and predict 3-nt periodicity. As we are working with reference-guided assemblies from StringTie and Cufflinks, these GTF files contain only transcript and exon features. To include CDS features, we ran Transdecoder (v5.7.1) (https://github.com/TransDecoder/TransDecoder) to predict the longest ORF in each transcript and then appended the predicted CDS to the GTF files. These predicted CDS are only used as a starting point for predicting 3-nt periodicity, but ORFs are predicted from other putative regions in the transcripts later by the tools regardless. For each transcriptome assembly with CDS information, we generated a STAR (v2.7.4a) index with the custom annotations specified with the parameter—sjdbGTFfile. For PRICE, reads were aligned the same way as for RibORF. For Ribocode, since it requires reads to be aligned to the transcriptome and not the genome, we ran STAR with the parameters –quantMode TranscriptomeSAM GeneCounts –outFilterType BySJout and –alignEndsType EndToEnd, as suggested in their documentation.We chose to infer the 3-nt periodicity and generate metaplots using each software’s own method, instead of defining the same off-sets for every analysis. This was done to keep consistency with the method of each tool, since they also calculate statistics based on their own predictions. To score ORFs with PRICE, we ran it specifying an FDR of 0.01. Since it does not generate protein sequences or GTF files, we ran $ gedi -e ViewCIT to generate results files filtered with the specified FDR and generated a custom GTF file based on the coordinates, which was used to generate a fasta file containing protein sequences with Gffread44. To score the ORFs with Ribocode, we first ran $ prepare_transcripts specifying the adequate custom GTF file for each condition and the same reference genome used when preparing the indexes. Then, we ran $ metaplots and $ Ribocode with default parameters. Afterward, we filtered the results, applied a cutoff for the adjusted p-value (FDR) of 1%, and removed any ORF that was annotated by blasting them to the NCBI RefSeq, or that had a sequence longer than 150 amino acids or 450 codons. The same filtering was applied to results coming from PRICE.Analysis of ambiguous and multi-mapping readsTo check for Ribo-Seq coverage for the smORFs we could find peptide evidence for with proteogenomics, we first appended the coordinates of those smORFs to the mouse mm18 reference GTF file. Then, we performed four rounds of read counting with featureCounts (v1.6.3)31, first with default parameters and then allowing multi-mapping reads with -M, allowing ambiguous mapping reads with -O, and allowing both ambiguous and multi-mapping reads with -M and -O. We plotted the counts for each combination of these two parameters (-M: MM, -O: Amb, -M -O: MM_Amb, and Default, when using default parameters) into a heatmap using the Python package nheatmap (https://pypi.org/project/nheatmap/). We then classified the smORFs based on their detectability using the different parameters of featureCounts. A smORF is included in a specific group if its set of parameters results in at least 10 raw counts for that smORF. The mapping classification for proteogenomics smORFs can be obtained by running the Rp3 pipeline in the ribocov mode. For each group, we performed comparisons regarding a number of isoforms and repeated regions as follows: first, for the overlapping features, we ran bedtools intersect (v2.30.0) (ref. 31) using the reference GTF file to identify how many features overlapped each smORF, and plotted those into box plots. To investigate repeat regions, we used repeat files for the mm10 genome provided by RepeatMasker (http://www.repeatmasker.org/species/mm.html) and checked which smORFs overlapped the repeat coordinates. For overlapping regions, we classified the overlap with the reference GTF file for mm10 from Ensembl after running bedtools intersect on the reference and custom GTF file containing the smORFs.Circos plot generationTo generate the circos plot for read mapping visualization at the genome level, we used the Python package PyCircos (v0.3.0) (https://github.com/ponnhide/pyCircos) and added bars corresponding to the coordinates of each gene annotated in the reference mm10 GTF file to the outermost ring. Then, we did the same for the smORFs found with Ribo-Seq and for the Rp3 smORFs, which were added to the second and third ring, from edge to center, respectively. To analyze the multi-mapping landscape and generate the circos plot, we performed additional filtering in the sam files to analyze the distribution of reads with secondary alignments across the genome. For each sam file, we first kept only the reads with the flag 0x100, which indicates a secondary alignment, by running Samtools (v1.13) with the parameter -f 0x100. Afterward, we iterated the alignments and selected the reads that aligned at least once to a region in the genome that encodes an Rp3 microprotein based on the coordinates of our custom GTF file, while keeping all the other regions these selected reads mapped to. Using the filtered sam file containing only reads that mapped to at least one Rp3 smORF, we selected the reads that mapped to the representative smORF. Afterward, we added links to the center of the plot whose coordinates they point to in the genome correspond to the other regions the reads map to. To generate the multiple sequence alignment, we first identified homologs by running tblastn on the mouse genome assembly mm10. Then, we performed the multiple sequence alignment (MSA) with these sequences using MAFFT45 and generated the MSA figure using the msa4u tool from the package uORF4u46.Sequence analysis comparison among annotated, Ribo-Seq and proteogenomics microproteinsTo plot the amino acid composition, we extracted the microproteins sequences for each group of microproteins and plotted the distribution with the Python package matplotlib. To obtain the number of UTPs for each microprotein and the isoelectric point for each peptide, we used the tool RapidPeptidesGenerator (RPG)47, using trypsin as the enzyme for the in silico protein digestion, and plotted the distributions with matplotlib.Conservation analysisTo identify Rp3 smORF homologs in the genome of other eukaryotic organisms and check their overall level of conservation, we a performed a tBlastn search against the genome assemblies of Equus caballus (GCF_002863925.1), Homo sapiens (GCF_000001405.40), Canis lupus (GCF_000002285.5), Sus scrofa (GCF_000003025.6), Bos taurus (GCF_002263795.2), Pan troglodytes (GCF_028858775.1), Rattus norvegicus (GCF_015227675.2), Danio rerio (GCF_000002035.6), Balaenoptera musculus (GCF_009873245.2), Loxodonta africana (GCF_000001905.1), Drosophila melanogaster (GCF_000001215.4), Macaca mulatta (GCF_003339765.1). Then, we filtered the alignments to include only those with an E-value < 0.001 and a bitScore > 50. We then annotated a common NCBI tree in phylip format for these species using EvolView48 by adding bars to the side of the tree corresponding to the smORFs identified with Rp3 in the organism of interest and numbers of smORF homologs in each species, according to their Ribo-Seq coverage.Statistical analysisWe used the Python package SciPy (https://scipy.org/) to perform statistical analysis when adequate. To compare metrics such as the number of peptides per protein (Fig. 2f), molecular weight (Fig. 2g), pI (Fig. 2h), and number of overlapping features in the genome (Fig. 4a), we first combined the microprotein results coming from the same tool that passed the FDR thresholds. We performed comparisons based on the combined results of each tool (n = 5 biological replicates for a tryptic digest pool for mass spectrometry experiments. n = 2 biological replicates for each phenotype for Ribo-Seq experiments). To compare distribution of read assignment (Fig. 3d), we ran featureCounts on each RNA-Seq (n = 2 biological replicates for each of the three adipose tissue phenotypes for total RNA and another 2 for mRNA, all grouped together) and Ribo-Seq (n = 2 biological replicates for each of the three adipose tissue phenotypes, all grouped together) fastq files and compared the summary output containing the number of assigned reads for each group. We first checked for a Gaussian distribution for each group using the Shapiro–Wilk test. As our data followed a non-parametric distribution (Fig. 2f–h), we ran the Kruskal–Wallis test followed by Dunn’s post hoc adjusted for multiple hypothesis testing with the Benjamini-Holchberg FDR correction using SciPy. To perform comparisons between the read assignment of RNA-Seq and Ribo-Seq (Fig. 3d), since the data followed a parametric distribution, we ran T-tests using SciPy. We added statistical significance to the plots using statsannotations (https://github.com/trevismd/statannotations).Reporting summaryFurther information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Rp3: Ribosome profiling-assisted proteogenomics improves coverage and confidence during microprotein discovery

Zero-shot transfer of protein sequence likelihood models to thermostability prediction

Poisoning medical knowledge using large language models

Semi-supervised recognition for artificial intelligence assisted pathology image diagnosis

Chemistry wordoku #062 | Puzzle

Multi-output prediction of dose–response curves enables drug repositioning and biomarker discovery

Hot Topics

Zero-shot transfer of protein sequence likelihood models to thermostability prediction

Poisoning medical knowledge using large language models

Semi-supervised recognition for artificial intelligence assisted pathology image diagnosis

Related Articles

Balancing Act: Pregnancy and Bipolar Disorder

Cohesion at the cellular level: flexible yet stable

Gut bacteria influence responses to immunotherapy in patients with asbestos related cancer

Quick Links

Must Read

Zero-shot transfer of protein sequence likelihood models to thermostability prediction

Poisoning medical knowledge using large language models

Semi-supervised recognition for artificial intelligence assisted pathology image diagnosis

Chemistry wordoku #062 | Puzzle

Popular Articles

Zero-shot transfer of protein sequence likelihood models to thermostability prediction

Poisoning medical knowledge using large language models

Semi-supervised recognition for artificial intelligence assisted pathology image diagnosis