DiscovEpi: automated whole proteome MHC-I-epitope prediction and visualization | BMC Bioinformatics

The adaptive immune response is a complex and tightly regulated defence mechanism that protects the human host from a wide range of pathogens. It is triggered by the interaction of naïve T cells with antigen-presenting cells. In the case of CD8+ T cells, this requires the specific recognition of MHC-I-peptide complexes on the surface of antigen presenting cells by the clonotypic T cell receptors [1]. With the help of cytokines and co-receptors, this leads to activation and clonal expansion of the antigen specific cytotoxic CD8+ T cells and induces their differentiation into cytotoxic effector T cells (CTLs). These can migrate to the site of infection. When they re-encounter the same antigenic MHC-I-peptide complexes on the surface of infected cells, the CTLs induce apoptosis in the target cell through secretion of perforins, granzyme B and/or ligation of Fas on the surface of the target cells with Fas-ligand, which is expressed by the CTLs [2]. Most MHC-I-binding T cell epitopes (MHC-I-epitopes with the length of 8–12 amino acids) are generated through degradation of intracellular proteins by the ubiquitin–proteasome system (UPS) in the cytosolic compartment [3]. After their transport into the endoplasmic reticulum (ER), the peptides are loaded onto MHC class I molecules and the complexes reach the cell surface via the Golgi apparatus [1, 4]. To study the involvement of CD8+ T cells in the elimination of a pathogen, it is therefore necessary to analyse the generation of MHC-I-epitopes and to determine the peptide sequences. Such studies have been successfully performed for viral epitopes, e.g. from influenza A virus or SARS-CoV-2 [5, 6]. 
In addition, there is evidence that intracellular infection by bacteria such as Listeria monocytogenes or Staphylococcus aureus is accompanied by presentation of MHC-bound peptides [7, 8]. Knowledge of the specific peptide sequences of epitopes can be used to develop peptide-based vaccines, which stimulate immune cells without the need to immunize with the whole pathogen. This targeted approach makes vaccine design and production more efficient, reduces possible side effects and increases coverage if epitopes are shared between different strains of a pathogen [9]. One route to peptide-based vaccines is the manual selection and extraction of candidate proteins [10] from databases such as the UniProt Knowledgebase [11], followed by the prediction of single epitopes within each sequence by algorithms such as NetMHCpan or MHCflurry [12, 13]. More in-depth approaches that consider peptide-MHC-I-T cell receptor binding affinities, such as TLImm [14] or PRIME [15], may predict possible neoantigens more accurately. However, these tools still rely on manually selected candidate sequences. Pathogens such as bacteria have complex proteomes containing a large number of proteins with diverse functions, cellular localizations and interactions with host cells. Analysing these for immunogenic epitopes requires either labour-intensive manual selection of relevant candidate proteins or extensive computational resources and datasets, which may not be readily available. DiscovEpi focuses on the peptide presentation step and simplifies the prediction of immunogenic proteins or protein regions in whole proteomes by connecting the UniProt database with NetMHCpan to automate the otherwise manual analysis.
The automated extraction of proteins by species, strain, or even subcellular localization, in combination with ranked epitope predictions based on epitope density and average binding score, accelerates the search for promising candidates. Additionally, DiscovEpi visualizes all putative epitopes at their positions in the respective protein sequences. The colour intensity reflects their predicted immunogenicity, which makes it easy to identify immunogenic regions and to compare potential vaccine targets quickly.

Implementation

DiscovEpi is available on GitHub (https://github.com/cmahncke/DiscovEpi), with packages for use under Windows 10 and 11 and Linux. DiscovEpi is implemented in Python 3.10 [16] and uses the Qt framework through PySide6 (The Qt Company, Qt for Python project; PySide, max. version 6.6.3.1, available at https://pypi.org/project/PySide). The Seaborn visualization library is used to produce the peptide map in the form of a heat map [17]. The remaining third-party packages DiscovEpi has been built with are XlsxWriter (version 3.2.0), matplotlib (version 3.8.4), numpy (version 1.26.4), pandas (version 2.2.2) and requests (version 2.31.0). As the interface between the UniProt database [11] and NetMHCpan [12], we use the REST APIs hosted by UniProt and IEDB [18]. NetMHCpan (version 4.1) is integrated into the Python scripts and executables. NetMHCpan generates a score for the predicted strength of peptide-MHC binding, which is based on a trained neural network. This network calculates the binding strength from amino acid properties, peptide-MHC interactions and experimentally measured binding affinities [12]. The score is compared to the distribution of scores in a large, maintained reference library, and NetMHCpan reports a percentile rank as a measure of the relative strength of the predicted peptide-MHC-I binding.
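The idea of a percentile rank can be illustrated with a minimal sketch. It is not NetMHCpan's implementation: the function name `percentile_rank`, the toy reference scores and the convention that a higher raw score means stronger predicted binding are assumptions made for illustration only.

```python
from bisect import bisect_right


def percentile_rank(score, reference_scores):
    """Percentage of reference peptides with a higher predicted score.

    Hypothetical helper: a low value means the query peptide is predicted
    to bind better than most peptides in the reference set.
    """
    ordered = sorted(reference_scores)
    # Count reference scores strictly greater than the query score.
    n_higher = len(ordered) - bisect_right(ordered, score)
    return 100.0 * n_higher / len(ordered)


# Toy reference distribution (invented values, not real predictions).
reference = [0.01, 0.02, 0.05, 0.10, 0.20, 0.30, 0.40, 0.60, 0.80, 0.95]

# Only one of ten reference scores exceeds 0.90.
print(percentile_rank(0.90, reference))  # 10.0
```

Because the rank is relative to a reference distribution, it can be compared across peptides regardless of the absolute scale of the raw score.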
To date, NetMHCpan is regarded as the best predictor for MHC-I epitopes, as it is the recommended predictor in the IEDB Analysis Resource [18]. The recommendation is updated based on weekly benchmarks, which the authors of DiscovEpi will follow to ensure reliable results.

The first output file contains log data about the protein retrieval parameters, with the protein data on a second sheet, and is named “unp_ORGANISM_LOCATION.xlsx”, where ORGANISM and LOCATION are replaced by the respective input. After the epitope prediction step, the retrieved putative epitopes, with position, normalized NetMHCpan binding score and the protein scoring metrics described below, are written to the second sheet of the second output file, named “nmp_ORGANISM_LOCATION_ALLELE.xlsx”, in the previously specified directory, with metadata (UniProt-ID, Protein name, UniProt-Link) on the first sheet. The putative epitopes are ordered by their binding score, with the most promising ones first. The file also contains the number of occurrences of each epitope and the query parameters.

DiscovEpi epitope density and average binding score

DiscovEpi allows a protein-centric search for putative MHC class I binding epitopes in whole proteomes based on the epitope density and average binding score of predicted epitopes. The binding score given by DiscovEpi is based on the percentile rank of each epitope, one of the output metrics of NetMHCpan. The percentile rank represents the relative binding affinity compared to a large reference group, which includes binding affinity scores for a diverse set of peptides and MHC class I molecules from experimental measurements and known MHC class I binding datasets. DiscovEpi also uses it to discard peptides with a percentile rank above a specified threshold (default value = 3), assuming these peptides are non-binders [12].
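The filtering and ordering step described above can be sketched as follows. This is a minimal illustration rather than DiscovEpi's actual code: the function name `select_binders`, the tuple layout and the peptide strings are hypothetical, and treating a rank exactly equal to the threshold as a binder is an assumption.

```python
from collections import Counter


def select_binders(predictions, threshold=3.0):
    """Keep peptides at or below the percentile rank threshold, best first.

    `predictions` is an iterable of (peptide, percentile_rank) tuples
    (hypothetical layout). Returns the retained predictions sorted by
    ascending rank and a Counter of how often each peptide occurs.
    """
    kept = [(pep, rank) for pep, rank in predictions if rank <= threshold]
    # A lower percentile rank means stronger predicted binding.
    kept.sort(key=lambda item: item[1])
    occurrences = Counter(pep for pep, _ in kept)
    return kept, occurrences


# Invented peptides and ranks, for illustration only.
predictions = [
    ("ACDEFGHIK", 1.2),
    ("LMNPQRSTV", 2.1),
    ("WYACDEFGH", 5.0),  # above the default threshold, discarded
    ("ACDEFGHIK", 0.4),  # same peptide found in a second protein
]

binders, counts = select_binders(predictions)
print(binders)
print(counts["ACDEFGHIK"])  # 2
```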
To compare the epitopes in the resulting limited set of high-affinity binders, the DiscovEpi epitope score $s_{e}$ is computed by normalizing the percentile rank to values between 0 and 1 (Formula 1).

Formula 1: Epitope score. The percentile rank is normalized over the threshold (default value = 3): the difference between the threshold and the percentile rank is divided by the threshold.

$$s_{e} = \frac{threshold - \% rank}{threshold}$$

The epitope score allows interpretation of protein sequences based on their density of potential epitopes. The epitope scores (Formula 1) are then used to compute the overall protein score for the whole protein sequence by summing the retrieved epitope scores and averaging over every possible epitope (Formula 2).

Formula 2: Protein score. $l_{p}$ is the protein sequence length, $l_{e}$ the epitope sequence length and $s_{e}$ the epitope scores of all retrieved epitopes for the respective protein sequence.

$$s_{p} = \frac{\sum s_{e}}{l_{p} - l_{e} + 1}$$

As the protein score $s_{p}$ does not differentiate between the presence of a few well-binding or many weakly binding epitopes in a protein, the epitope density is calculated as well (Formula 3). The epitope density $d$ of a protein is the ratio of the number of predicted epitopes to the total number of possible oligopeptides of the same length in this protein. In combination with the protein score, this provides a rationale for assessing the length-independent immunogenicity of a protein, so that proteins of different lengths can easily be compared. However, very short proteins can skew the output towards high epitope densities, which will be visible in the epitope map and has to be handled with care during evaluation of the data.

Formula 3: Epitope density. $l_{s}$ is the protein sequence length, $l_{e}$ the epitope sequence length and $s_{e}$ the epitope scores for all MHC class I epitopes retrieved for the respective protein. Here, the score itself is not essential, only the number of retrieved epitopes scoring below the threshold, so that a holistic evaluation is possible in combination with the protein score (Formula 2).

$$d = \frac{\# (s_{e} < threshold)}{l_{s} - l_{e} + 1}$$

DiscovEpi visualization

The epitope map is created as a heat map with amino acid positions on the x-axis and the proteins on the y-axis. Each row of the heat map represents the amino acid sequence of one protein. By default, the length of the x-axis matches the length of the longest protein in the set. However, this value can be set individually, as DiscovEpi takes the maximum length as input on the third tab of the graphical user interface (GUI) (Fig. 1C). The length of shorter proteins is illustrated by a light grey background, which represents protein but no epitope, whereas a white background represents no protein and therefore no epitopes. Numerically, the underlying matrix contains the value 0.5 at each protein position and 0.0 where there is no protein. For each epitope of each protein, the DiscovEpi score is added to the default 0.5 in the row of the respective protein at the corresponding amino acid sequence position. If epitopes overlap, their scores are added up. Visually, the intensity of the grey epitope markings depends on the calculated scores, i.e. the darker the marking, the higher the score. A high score here can reflect a few very probable epitopes or many less probable ones. Limiting the x-axis length increases the resolution of shorter proteins, since the shorter sequences are drawn using more horizontal space. Vertically, the resolution of the epitope map can be enhanced by setting a maximum number of proteins to be visualized on the map. This value can also be set on the third GUI tab (Fig. 1C). Bacterial protein sets in particular can exceed the available resolution, since the vertical height of the figure is fixed.
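The matrix underlying the epitope map can be sketched in plain Python. This is an illustration under stated assumptions, not DiscovEpi's code: the function name `build_epitope_matrix`, the input layout, 0-based start positions and the choice to add the score at every residue covered by an epitope are all hypothetical.

```python
def build_epitope_matrix(proteins, max_length=None):
    """Build the numeric matrix behind an epitope heat map.

    `proteins` is a list of (protein_length, epitopes) pairs, where
    `epitopes` is a list of (start, length, score) tuples with 0-based
    start positions (hypothetical layout). One row is produced per protein.
    """
    if max_length is None:
        # Default: the x-axis spans the longest protein in the set.
        max_length = max(length for length, _ in proteins)
    matrix = []
    for protein_length, epitopes in proteins:
        visible = min(protein_length, max_length)
        # 0.5 marks protein without epitope, 0.0 marks no protein.
        row = [0.5] * visible + [0.0] * (max_length - visible)
        for start, length, score in epitopes:
            # Overlapping epitopes add up at shared positions.
            for pos in range(start, min(start + length, visible)):
                row[pos] += score
        matrix.append(row)
    return matrix


# Toy input: a 12-residue protein with two overlapping 9-mers,
# and a 6-residue protein without predicted epitopes.
proteins = [
    (12, [(0, 9, 0.5), (2, 9, 0.25)]),
    (6, []),
]
matrix = build_epitope_matrix(proteins)
for row in matrix:
    print(row)
```

Passing a smaller `max_length` truncates long proteins, which corresponds to limiting the x-axis so that shorter sequences occupy more horizontal space.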
The proteins shown on the map are ordered according to the DiscovEpi protein score, so that even if their number is limited, the map still shows the most promising proteins. The resulting map is saved as a PNG file named “ORGANISM_LOCATION_heatmap.png” in the location specified on GUI tab one (Fig. 1A).

Fig. 1 DiscovEpi schematic workflow and output. Workflow of DiscovEpi: A the protein sequence retrieval to generate the dataset; B the prediction parameters encompassing HLA allele, peptide length and NetMHCpan-score threshold; and C the visualization parameters for the predicted epitopes, with the maximum number of top-scoring proteins and the maximum protein length to be depicted. The resulting heat map (D) is generated with the parameters shown in A–C, where each bar marks the presence of one or multiple overlapping putative epitopes and the intensity indicates the epitope score. The length of the protein is visualized through the light grey background. The proteins are ordered by their epitope density.
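Formulas 1 to 3 and the ordering of proteins by protein score can be combined in a short sketch. The function names, the input layout and the toy values are hypothetical, and counting an epitope as retrieved when its percentile rank is at or below the threshold is an assumption.

```python
def epitope_score(percentile_rank, threshold=3.0):
    """Formula 1: normalize the percentile rank over the threshold."""
    return (threshold - percentile_rank) / threshold


def protein_metrics(percentile_ranks, protein_length, epitope_length,
                    threshold=3.0):
    """Return (protein score, epitope density) following Formulas 2 and 3."""
    # Number of possible peptides of the given length in the protein.
    n_possible = protein_length - epitope_length + 1
    scores = [epitope_score(rank, threshold)
              for rank in percentile_ranks if rank <= threshold]
    return sum(scores) / n_possible, len(scores) / n_possible


def rank_proteins(proteins, epitope_length=9, threshold=3.0,
                  max_proteins=None):
    """Order (name, length, ranks) entries by descending protein score."""
    scored = []
    for name, length, ranks in proteins:
        s_p, d = protein_metrics(ranks, length, epitope_length, threshold)
        scored.append((name, s_p, d))
    scored.sort(key=lambda item: item[1], reverse=True)
    # Slicing with None keeps all proteins.
    return scored[:max_proteins]


# Invented example: name, protein length, percentile ranks of predicted 9-mers.
proteins = [
    ("protA", 108, [0.0, 1.5, 0.75]),
    ("protB", 58, [0.3, 0.6]),
]
ranked = rank_proteins(proteins)
print([name for name, _, _ in ranked])  # ['protB', 'protA']
```

In this toy case the shorter protein ranks first although it has fewer epitopes, because both metrics are normalized by the number of possible peptides; this is also why very short proteins need to be interpreted with care.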
