KiNext: a portable and scalable workflow for the identification and classification of protein kinases | BMC Bioinformatics

The KiNext pipeline consists of 7 processes for kinome identification and protein kinase classification, followed by phylogenetic analysis of the recovered kinases (Fig. 1).

Fig. 1 Overview of the 7 processes of the KiNext pipeline

Kinome identification

The search for protein kinases in an annotated genome (i.e. a proteome) of a given species requires one or more probabilistic models called “Hidden Markov Model” (HMM) profiles. This type of search identifies homologous protein sequences using a probabilistic algorithm. Two separate models were built for protein kinase identification, one for the identification of ePKs and a second for the identification of aPKs. The catalytic domains of ePKs from Homo sapiens, Caenorhabditis elegans and Drosophila melanogaster were downloaded from Kinbase (The kinase Database. http://www.kinase.com/kinbase/FastaFiles/. Accessed 20 November 2023) and added to a fasta file used to create the HMM model, which was subsequently used as the reference in the search for ePKs. Similarly, the catalytic domains of H. sapiens, D. melanogaster and C. elegans aPKs were added to a separate fasta file to create the HMM model for identifying aPKs.

In order to run, the KiNext pipeline requires (1) the proteome (predicted amino acid sequences obtained from automatic genome annotation) in fasta format, (2) files containing the HMM profiles for ePKs and aPKs (here based on kinases from H. sapiens, C. elegans and D. melanogaster, but different HMM models could be provided) and (3) HMM models containing separate HMM profiles for each kinase group. The HMM profiles used in the test dataset presented here are available in the KiNext Gitlab repository (https://gitlab.ifremer.fr/bioinfo/workflows/kinext).

The first process searches the proteome of the given species for protein sequences homologous to the HMM profiles of ePKs and aPKs (Fig. 1, step 1). It is run with HMMSEARCH [18] v3.3.2 and the following options: “-E 0.05” to set the E-value threshold to its standard value, “--tblout” to summarize the results in a table and “--noali” to remove the alignment from the main output in order to reduce its volume. The output file from this step is a table of identifiers of the proteome sequences homologous to the HMM models of the ePKs and aPKs.

The second process is a python script which reads the table of sequence identifiers in order to check the assignment to a class (ePKs/aPKs) according to the associated e-value (Fig. 1, step 2). The script associates the identifiers with their corresponding fasta sequences extracted from the proteome of the given species. At the end of the second process, 3 fasta files are generated: the complete kinome, and the ePKs and aPKs in separate files.

Protein kinase classification

The next objective is to assign each protein kinase to a specific kinase group. This is completed during the third process, which uses the previously identified kinome as input to search for sequences homologous to the HMM profiles specific to the kinase groups identified by the kinomer project [19] (Fig. 1, step 3). The library used by KiNext contains the HMM profiles of protein kinase families from 4 model organisms: H. sapiens, D. melanogaster, C. elegans, and S. cerevisiae [20]. The distinctive feature of the kinomer libraries is that each protein kinase group is divided into families. Indeed, Altman et al. [21] found that subdividing large protein groups increased the recognition accuracy of HMMs by allowing the unique characteristics of each family to be captured more accurately. Furthermore, the results of Miranda-Saavedra and Barton [20] indicated that this library of HMM profiles was superior not only to BLAST [22] but also to a general HMM built from the catalytic domain of the kinase.

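
As a rough illustration of the second process described above, the following Python sketch parses two HMMSEARCH “--tblout” tables (one per model) and assigns each sequence to the ePK or aPK class. It is not the KiNext script itself: the function names are hypothetical, and the rule that a sequence matching both models goes to the class with the lower full-sequence e-value (ties going to ePK) is an assumption. The column positions follow the HMMER “--tblout” layout (target name first, query name third, full-sequence E-value fifth).

```python
from typing import Dict, Iterable, Tuple

# Hypothetical helper names; this is a sketch, not the KiNext implementation.
Hit = Tuple[str, float]  # (query profile name, full-sequence E-value)


def best_hits(tblout_lines: Iterable[str]) -> Dict[str, Hit]:
    """Return, for each target sequence, its hit with the lowest E-value."""
    best: Dict[str, Hit] = {}
    for line in tblout_lines:
        # Skip blank lines and the '#' comment/header lines of --tblout files.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split()
        target, query, evalue = fields[0], fields[2], float(fields[4])
        if target not in best or evalue < best[target][1]:
            best[target] = (query, evalue)
    return best


def assign_class(epk_hits: Dict[str, Hit], apk_hits: Dict[str, Hit]) -> Dict[str, str]:
    """Assign each sequence to 'ePK' or 'aPK' (assumed rule: lowest E-value wins)."""
    no_hit = ("", float("inf"))
    classes: Dict[str, str] = {}
    for seq_id in sorted(set(epk_hits) | set(apk_hits)):
        epk_e = epk_hits.get(seq_id, no_hit)[1]
        apk_e = apk_hits.get(seq_id, no_hit)[1]
        classes[seq_id] = "ePK" if epk_e <= apk_e else "aPK"
    return classes


# Made-up rows, truncated to the first six columns of a real --tblout table.
epk_table = [
    "# target name  accession  query name  accession  E-value  score",
    "prot1 - ePK - 1e-50 170.2",
    "prot2 - ePK - 1e-3 12.0",
]
apk_table = ["prot2 - aPK - 1e-20 75.3"]

for seq_id, kinase_class in assign_class(best_hits(epk_table), best_hits(apk_table)).items():
    print(seq_id, kinase_class)
```

The resulting identifier-to-class mapping could then be used to write the ePK and aPK sequences to separate fasta files.
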
The family-level search was carried out using HMMSEARCH v3.3.2 with the following options: “-E 0.05” to set the E-value threshold to its standard value, “--tblout” to summarize the results in a table and “--noali” to remove the alignment from the main output (Fig. 1, step 3). The output file from this Nextflow process is a table of identifiers of the sequences homologous to the kinase family profiles.

In the fourth process, a python script is used to assign each kinase to the family with the lowest e-value (Fig. 1, step 4). This process ensures that each kinase is attributed to only one kinase group: kinases that are identified by multiple HMM profiles are listed in the family with the lowest e-value. Once the classification has been checked and validated automatically, the script extracts the fasta sequences corresponding to the identifiers. The end result of this fourth process is the creation of 5 fasta files: the complete annotated kinome, the annotated ePKs, the annotated aPKs, and a file containing the sequences that were identified as protein kinases during the first process of the pipeline but were not assigned to a family. The last fasta file generated corresponds to the full kinome with each kinase annotated to a group (or assigned to the unknown category).

Kinase phylogeny

Once the kinome of the given species is identified, two phylogenetic trees are constructed, one for ePKs and another for aPKs, in order to group kinases according to the similarity of their amino acid sequences. The fifth and sixth processes thus perform the sequence alignment and the phylogenetic reconstruction of ePKs and aPKs, respectively.

First, the fasta file output from the fourth process containing the annotated ePKs is aligned with the HMMALIGN v3.3.2 tool (Fig. 1, step 5). This tool aligns the full kinase sequences to the ePK HMM profile (catalytic domains) and produces a multiple sequence alignment in ClustalW [23] format.
The following options were used in HMMALIGN: the “--trim” option to remove non-homologous residues (assigned to the N and C states in the optimal alignments) from the resulting multiple alignment, and the “--outformat” option to write the output alignment in “clustal” format. The same process is conducted on the annotated aPKs using the aPK HMM profile.

The sixth process consists of phylogenetic reconstruction in IQ-TREE [24] v2.1.2 based on the alignments obtained during the fifth process (Fig. 1, step 6). By default, IQ-TREE is run with -st AA (amino acid sequences). The -m MFP option allows IQ-TREE to automatically select the best evolutionary substitution model for tree construction. To estimate the robustness of the tree topology, the ultrafast bootstrap method (-bb 1000) and the SH-aLRT (-alrt 1000), a Shimodaira-Hasegawa-like approximate likelihood ratio test, were used. The phylogenetic reconstruction of the aPKs can help the KiNext user to confirm and refine the classification of atypical protein kinase families. It is important to use these trees with caution and to be aware of their limitations, particularly for aPKs. The phylogenetic trees generated by IQ-TREE were visualized with the ETE3 framework [25], which was used to color them automatically by kinase group. ETE3 simplifies the reconstruction, analysis, and visualization of phylogenetic trees. An ETE3 python script is available in the KiNext Gitlab repository (https://gitlab.ifremer.fr/bioinfo/workflows/kinext).

The seventh and final process of the KiNext pipeline is a python script that builds two summary tables (.csv) from all the files produced during the KiNext run (Fig. 1, step 7).
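
For reference, the HMMALIGN and IQ-TREE invocations described for the fifth and sixth processes could be assembled as argument lists along the following lines. This is only a sketch under stated assumptions: the function names and file names are hypothetical, the use of the HMMALIGN “-o” option to write the alignment to a file and of “-s” to pass the alignment to IQ-TREE are choices made here, and the executable name “iqtree2” is an assumption about how IQ-TREE 2 is installed. The other options are those listed in the text.

```python
from typing import List


def hmmalign_cmd(profile_hmm: str, kinases_fasta: str, out_alignment: str) -> List[str]:
    """Build an hmmalign command: trim non-homologous ends, write Clustal output."""
    return [
        "hmmalign",
        "--trim",                  # drop residues assigned to the N and C states
        "--outformat", "clustal",  # alignment format given in the text
        "-o", out_alignment,       # assumption: write to a file rather than stdout
        profile_hmm,
        kinases_fasta,
    ]


def iqtree_cmd(alignment: str, replicates: int = 1000, executable: str = "iqtree2") -> List[str]:
    """Build an IQ-TREE command with model selection and branch support tests."""
    return [
        executable,                # assumption: name of the IQ-TREE 2 binary
        "-s", alignment,           # input alignment
        "-st", "AA",               # amino acid sequences
        "-m", "MFP",               # automatic substitution model selection
        "-bb", str(replicates),    # ultrafast bootstrap replicates
        "-alrt", str(replicates),  # SH-aLRT replicates
    ]


# Hypothetical file names for the ePK branch of the workflow.
print(" ".join(hmmalign_cmd("ePK.hmm", "ePKs_annotated.fasta", "ePKs.aln")))
print(" ".join(iqtree_cmd("ePKs.aln")))
```

Each list can be passed directly to subprocess.run, which avoids shell quoting issues; the same two calls would be repeated with the aPK profile and sequences.
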
The first table is composed of 6 columns: the first contains the identifiers of the protein kinases found by the pipeline, the second indicates whether they are ePKs or aPKs, the third gives the e-value obtained for this result, the fourth and fifth give the family and the score obtained for this classification, respectively, and the last contains the description of the sequence (predicted annotation). The second table summarizes the number of sequences identified for each of the protein kinase families, the number of sequences that were identified as protein kinases but could not be assigned to a group by the pipeline, and the total number of sequences making up the kinome. These two final tables provide users with an overview of the results that can be used as a starting point for more in-depth analyses [1].

Finally, the user can conduct a validation process by manually checking the e-values of the newly predicted kinase domains against those of annotated kinase proteins to ensure consistency and reliability. AlphaFold, a state-of-the-art protein structure prediction tool [26], can be employed by the KiNext user to verify the structural integrity of the predicted proteins by modeling the 3D conformation of the protein and then superposing the structure on a reference kinase to verify the presence of the catalytic domain. Another option is to use Foldseek [27] with the AlphaFold results to search for similarities and structural overlap against reference databases such as AlphaFold/Swiss-Prot. This validation approach was used in trial runs of KiNext on the genomes of C. gigas and O. tauri to verify, based on structural similarity, the protein kinases identified by the pipeline.
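
The second summary table described above (counts per family, unassigned kinases, and total kinome size) can be sketched in Python as follows. This is a minimal illustration rather than the KiNext script: the function names, the column headers, the “Unknown” and “Total” row labels and the example family names are all assumptions, and the input is taken to be a mapping from kinase identifier to family name, with None for kinases that could not be assigned.

```python
import csv
import io
from collections import Counter
from typing import Dict, List, Optional, Tuple


def family_counts(assignments: Dict[str, Optional[str]]) -> List[Tuple[str, int]]:
    """Count kinases per family, then append the unassigned and total rows."""
    counts = Counter(family for family in assignments.values() if family is not None)
    unassigned = sum(1 for family in assignments.values() if family is None)
    rows = sorted(counts.items())          # families in alphabetical order
    rows.append(("Unknown", unassigned))   # kinases without a family
    rows.append(("Total", len(assignments)))
    return rows


def rows_to_csv(rows: List[Tuple[str, int]]) -> str:
    """Serialize the summary rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["family", "count"])   # hypothetical column names
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up assignments for illustration only.
example = {"prot1": "TK", "prot2": "TK", "prot3": "CAMK", "prot4": None}
print(rows_to_csv(family_counts(example)))
```

Writing the returned text to a .csv file gives a table that can be compared across species or across runs with different HMM libraries.
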
