Leveraging multiple data types for improved compound-kinase bioactivity prediction

DataThe bioactivity data used in our experiments was retrieved from two public databases: ChEMBL3221 and PubChem24. We collected a total of 205,545 IC50 measurements from 90,091 compounds tested against 462 wild-type human protein kinases. When available, we selected data based on the binding assay type; otherwise, we included values with missing assay type information. The IC50 values come from various experimental techniques, including fluorescence, luminescence, and radioactivity-based measurements. Negative IC50 readouts were filtered out. Only IC50 values given in nM, μM, and mM were included, all of which were converted to nM before calculating pIC50 as $-{\log }_{10}$(IC50).The compounds were standardized, and those failing our cleaning workflow were filtered out (see Section “Compound standardization and cleaning workflow”). In cases where multiple activity measurements were present for a compound-kinase pair, we summarized these into a single activity value by taking the geometric mean of the pIC50 values. After data cleaning and summarizing, we obtained a final dataset comprising 79,075 compounds measured across 462 kinases, for a total of 141,193 compound-kinase pairs.Additionally, we collected single-dose activity measurements also from the ChEMBL and PubChem databases. Specifically, we collected examples of compound-kinase pairs for which percentage activity and/or percentage inhibition was measured at at least two separate concentrations. Percentage activity values were converted to percentage inhibition by applying the formula 100 − %activity. This yielded a total of 69,669 compound-kinase pairs, across 302 kinases.Compound standardization and cleaning workflowCompound SMILES and InChIKeys were first standardized following a process similar to the ChEMBL structure curation pipeline36. Duplicate structures were then identified by matching InChIKey strings. If no exact match was found, we further ensured there were no duplicates by running a fingerprint-based similarity search using three different fingerprints: Daylight, ECFP4, and ECFP6. If a Tanimoto score given by any of the fingerprints equals 1, the compound is considered a duplicate.The standardized compounds were then filtered using a set of SMARTS filters, which included, among others, SMARTS for reactive groups, phosphates, sugars, macrocycles, etc. (see ref. 37 for a full list of SMARTS). The compounds were also filtered based on their molecular weight, selecting those with a weight between 250 and 670. Compounds reported with a fluorescence label were stripped down to only their parent compounds. Additionally, a filter was applied to exclude staurosporine- and cholesterol-like compounds, as broad-spectrum tool compounds were not of interest to this study.Prediction scenariosThree different training and validation data splits were explored, based on the difficulty of prediction tasks (Fig. 1C). First, we created training and validation sets by randomly splitting compound-kinase pairs (‘ck split’). Next, we split the compounds randomly to ensure distinct compounds in training and validation sets (‘compound split’). In the most challenging scenario, compounds were first clustered, and then some clusters were held out for validation (‘cluster split’). It should be noted that in the ’ck split’, a compound present in the training set might also appear in the validation set; however, specific compound-kinase pairs from the training set will never appear in the validation set.For the ‘cluster split’, we construct training and validation sets as follows. First, we perform k-means clustering based on ECFP4 fingerprints of all compounds in the dataset. Then, we continue to add clusters of compounds to the validation set until 10% of the molecules have been designated for validation. The remaining compounds (and all associated compound-kinase pairs) are used for training.POC data integrationAt the data integration step, we train a model to learn a mapping from individual POC measurements to a dose–response IC50 value. The inputs to the model are vectors containing percentage inhibition values at different binned concentration values. Specifically, an input is x = (x1, …, xK) where xj corresponds to a percentage measurement (a scalar between 1 and 100), at concentration bin j. The bins are defined in nanomolar (nM) units, with thresholds set at b0 = 0 nM, b1 = 100 nM, b2 = 500 nM, b3 = 1000 nM, b4 = 5000 nM, b5 = 10,000 nM, b6 = 50,000 nM, b7 = 100,000 nM, b8 = 1,000,000 nM, b9 = ∞. A concentration falls into bin j if it is between bj and bj+1. If a given input has no measurement in bin j, it is assigned a special value − 10 representing “no measurement”. During training, we restrict to compound-kinase pairs that have (1) percentage inhibition measurements in at least two separate bins, and (2) an associated IC50 value. A schematic representation of our approach can be found in Fig. 1A. Because of the presence of missing values in the data, we choose to use a random forest for the data integration step, which can naturally handle this aspect of the data27.Second-stage modelsIn total, five second-stage models were trained and evaluated on training and validation datasets derived from ’ck split’, ’compound split’ and ’cluster split’ (Section “Prediction scenarios”). Two distinct metrics were reported for each model: Spearman correlation, and root mean squared error (RMSE). We assessed the impact of integrating POC data by training the models with the inclusion of inferred IC50 values (two-stage model) and without them (baseline single-stage model).Note that in this section, the terms ‘compound’ and ‘ligand’ as well as ‘protein’ and ’kinase’ are used interchangeably.Pairwise kernel ridge regressionWe use a pairwise kernel ridge regression model25,26, that operates on an input protein-ligand pair (p, l), where the protein p is represented as an 85-residue kinase binding pocket sequence retrieved from the KLIFS database38, and the ligand l is represented as a 1024-bit ECFP4 fingerprint computed using the RDKit library. For a given input (p, l), the model’s activity predictions are computed as$$f(p,\, l)={\sum}_{i=1}^{n}{\alpha }_{i}k((p,\, l),({p}_{i},\, {l}_{i})),$$
(1)
for training protein-ligand pairs (p1, l1), …, (pn, ln). The pairwise kernel k operating on protein-ligand pairs is defined by the product of a protein kernel and a ligand kernel:$$k((p,\, l),({p}^{{\prime} },\, {l}^{{\prime} }))={k}_{P}(p,\, {p}^{{\prime} })\cdot {k}_{L}(l,\, {l}^{{\prime} }),$$
(2)
where kP is calculated based on the (normalized) Striped-Smith-Waterman sequence alignment, and kL is the Tanimoto kernel. Note that as both kP and kL are bounded between 0 and 1, the pairwise kernel k is also bounded within this range. The parameters α1, …, αn are fit by minimizing a standard kernel ridge regression objective using the conjugate gradient method.The pairwise kernel ridge regression model also admits a convenient interpretation as a Gaussian process associated with the kernel k29. This means that we can naturally compute an uncertainty estimate associated with each activity prediction. Specifically, since our pairwise kernel is 1 for identical compound-kinase pairs, the expression for the variance of a new compound-kinase pair (p, l) is given by$${\sigma }^{2}(p,\, l)=1-{{{\bf{k}}}}((p,\, l),\, S){({{{\bf{K}}}}+\lambda {{{{\bf{I}}}}}_{n})}^{-1}{{{\bf{k}}}}(S,\, (p,\, l)),$$
(3)
where K is the n × n training kernel, λ is a regularization parameter, In is the identity matrix on ${{\mathbb{R}}}^{n}$, and k(S, (p, l)) is the n-dimensional vector whose ith entry is k((pi, li), (p, l)) for the ith training example (pi, li) ∈ S.Random forestWe use a standard random forest model as implemented in scikit-learn software package39. Rather than designing a single model that takes protein-ligand pairs as inputs, we fit separate random forests for each kinase. For ligands, we use 1024-bit ECFP4 fingerprints computed using the RDKit library.DeepDTAThe DeepDTA11 architecture consists of two embedding modules: one ligand embedding module, and a second protein embedding module. Both embedding modules share the same architecture, consisting of a series of convolutional layers, followed by a pooling layer to obtain sequence-level embeddings from compound SMILES and 85-residue kinase binding pocket sequence strings, respectively. After embedding a SMILES string and an amino acid sequence, the resulting feature vectors are concatenated, and passed through a series of linear layers interspersed with dropout layers to obtain the final scalar output.BiMCASimilar to DeepDTA, BiMCA5 uses convolutional neural network layers to learn feature embeddings from both compound SMILES and protein sequence strings. Then, unlike DeepDTA, BiMCA uses context attention layers to fuse information from both modalities, allowing the ligand representation access to contextual information from the protein embedding, and vice versa. Finally, the resulting feature vectors are concatenated and passed through a fully connected module to produce a scalar output.ConPLexConPLex14 is another recent neural network-based model used to predict compound-kinase binding affinity. ConPLex featurizes ligands using ECFP4 fingerprints, and kinases using a pre-trained ProtBERT language model20. The model then uses fully connected layers with ReLU activations to project compound and kinase features into a shared embedding space. From this shared space, binding affinity between a compound and a kinase is estimated by computing a dot product between the compound and kinase embeddings. Unlike the other models considered here, ConPLex also employs contrastive learning stage, wherein the model is trained to simultaneously predict bioactivities, while maximizing the distance between real drugs and synthetically-generated decoys in embedding space. Here, we used the same features as those reported in their model for binding affinity prediction, along with the same contrastive learning procedure.Post-analysis on predictionsSignificance testingTo test the statistical significance of the improvement from data integration, we use a non-parametric permutation test. For every example in the validation set, we make predictions using both the single- and two-stage models, and compute the difference in performance metrics between the two models. Then, to generate a null distribution, we randomly permute single- and two-stage labels 10,000 times across examples in the validation set, and calculate the difference in each performance metric for each permutation. The observed differences are then compared against the null distribution to calculate p-values for each metric and model.Compound-wise analysisTo further analyze the impact of integrating inferred IC50 values on predicting the activities of structurally diverse compounds, we applied k-means clustering to the compounds in the validation set for the ‘cluster split’ scenario, using ECFP4 fingerprints. We set the number of clusters at 100 to capture potential common scaffolds within each cluster. After clustering, we calculated the differences in performance metrics (Spearman correlation and RMSE) between the single-stage and two-stage models for compound-kinase pairs associated with compounds in each cluster. Additionally, for visualization purposes, we first applied PCA to the compound ECFP4 fingerprints, and then we employed t-SNE on the 20 principal components (Fig. 3A and Supplementary Fig. 3).Experimental profilingWe experimentally profiled a total of 347 compound-kinase pairs, encompassing 13 kinases (ACVR1, BTK, CSF1R, EGFR, ERBB2, FLT3, IRAK1, IRAK4, JAK2, MERTK, MKNK1, PIK3CA, SYK) and 139 compounds (see Supplementary Data 1 for a list of compound SMILES), based on predictions from the top-performing two-stage pwkrr model (see Section “Experimental testing demonstrates practical model utility in early-stage drug discovery”). Compounds were purchased from a compound vendor (MolPort) and confirmed to be at least 95% pure. To generate new dissociation constant (Kd) and percentage inhibition values, we sent the compounds to DiscoverX (Eurofins Corporation) for KINOMEscan profiling service. The KINOMEscan screening platform utilizes an active site-directed competition binding assay to measure interactions between test compounds and selected human kinases, without the need for ATP. This technique hinges on the principle that compounds binding to the kinase active site prevent the kinase’s interaction with the immobilized active-site directed ligand, and therefore result in a diminished amount of kinase captured on the solid support40.In our experiments, Kd determination was conducted using the KdELECT method (https://www.eurofinsdiscovery.com/solution/kdelect), while percentage inhibition at a compound concentration of 1000 nM, relevant in the context of kinase inhibition, was measured using the scanELECT protocol (https://www.eurofinsdiscovery.com/solution/scanelect). Both methods are parts of the KINOMEscan platform.KINOMEscan protocol descriptionKinase-tagged T7 phage strains were grown in an Escherichia coli host derived from the BL21 strain. The E. coli were grown to log-phase, infected with T7 phage (multiplicity of infection = 0.4), and incubated with shaking at 32 °C until lysis occurred (90–150 min). The lysates were then centrifuged (6000 × g) and filtered to remove cell debris. The remaining kinases were produced in HEK-293 cells and subsequently tagged with DNA for qPCR detection.Streptavidin-coated magnetic beads were treated with biotinylated small molecule ligands for 30 min at room temperature to generate affinity resins for kinase assays. In order to remove unbound ligand and reduce non-specific binding, the ligand-bound beads were then blocked with excess biotin and washed with a blocking buffer (SeaBlock (Pierce), 1% BSA, 0.05% Tween 20, 1 mM DTT). Binding reactions were constructed by mixing kinases, ligand-bound beads, and test compounds in 1× binding buffer (20% SeaBlock, 0.17× PBS, 0.05% Tween 20, 6 mM DTT).Test compounds for percentage inhibition assays were prepared as 40× stocks in 100% DMSO, whereas for Kd assays, they were prepared as 111× stocks in 100% DMSO. Kd’s were determined using an 11-point threefold compound dilution series with three DMSO control points. Prepared compounds were directly diluted into the assays. All reactions were carried out in polypropylene 384-well plates in a final volume of 0.02 ml. Following incubation at room temperature with shaking for 1 h, the affinity beads were washed with a wash buffer (1× PBS, 0.05% Tween 20). Subsequently, the beads were resuspended in an elution buffer (1× PBS, 0.05% Tween 20, 0.5 μM non-biotinylated affinity ligand), and incubated at room temperature with shaking for 30 min. Finally, the concentration of each kinase in the eluates was measured using qPCR.
Determination of percentage inhibition and K
d

In case of percentage inhibition assays, test compounds were screened at a single concentration of 1000 nM, and the percentage inhibition of a kinase was calculated as follows:$$\,{{{\rm{Percentage}}}}\,{{{\rm{Inhibition}}}}\,=100-\left(\frac{\,{{{\rm{Test}}}}\,{{{\rm{Compound}}}}\, {{{\rm{Signal}}}}-{{{\rm{Positive}}}}\,{{{\rm{Control}}}}\, {{{\rm{Signal}}}}}{{{{\rm{Negative}}}}\,{{{\rm{Control}}}}\, {{{\rm{Signal}}}}-{{{\rm{Positive}}}}\,{{{\rm{Control}}}}\, {{{\rm{Signal}}}}\,}\right) \\ \times 100,$$
(4)
where the negative control is DMSO (0% inhibition) and the positive control is the control compound (100% inhibition).
Kd’s were calculated with a standard dose–response curve using the Hill equation:$$\,{\mbox{Response}}={\mbox{Background}}+\frac{{\mbox{Signal}}-{\mbox{Background}}\,}{1+\left({\,{K}\,}_{{{{\rm{d}}}}}^{{{{\rm{Hill}}}}\,{{{\rm{Slope}}}}}/{{\mbox{Dose}}}^{{{{\rm{Hill}}}}\,{{{\rm{Slope}}}}}\right)}.$$
(5)
The Hill Slope was set to -1, and curves were fitted using a non-linear least square fit with the Levenberg–Marquardt algorithm41,42.
Reporting summaryFurther information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Leveraging multiple data types for improved compound-kinase bioactivity prediction

Turbocharging protein binding site prediction with geometric attention, inter-resolution transfer learning, and homology-based augmentation | BMC Bioinformatics

Zero-shot transfer of protein sequence likelihood models to thermostability prediction

Poisoning medical knowledge using large language models

Semi-supervised recognition for artificial intelligence assisted pathology image diagnosis

Chemistry wordoku #062 | Puzzle

Hot Topics

Turbocharging protein binding site prediction with geometric attention, inter-resolution transfer learning, and homology-based augmentation | BMC Bioinformatics

Zero-shot transfer of protein sequence likelihood models to thermostability prediction

Poisoning medical knowledge using large language models

Related Articles

Balancing Act: Pregnancy and Bipolar Disorder

Cohesion at the cellular level: flexible yet stable

Gut bacteria influence responses to immunotherapy in patients with asbestos related cancer

Quick Links

Must Read

Turbocharging protein binding site prediction with geometric attention, inter-resolution transfer learning, and homology-based augmentation | BMC Bioinformatics

Zero-shot transfer of protein sequence likelihood models to thermostability prediction

Poisoning medical knowledge using large language models

Semi-supervised recognition for artificial intelligence assisted pathology image diagnosis

Popular Articles

Turbocharging protein binding site prediction with geometric attention, inter-resolution transfer learning, and homology-based augmentation | BMC Bioinformatics

Zero-shot transfer of protein sequence likelihood models to thermostability prediction

Poisoning medical knowledge using large language models