The application of chemical similarity measures in an unconventional modeling framework c-RASAR along with dimensionality reduction techniques to a representative hepatotoxicity dataset

Analysis of the chemical diversity of the data setThe chemical diversity plot presents an analysis to explore the structural nature and similarity of the compounds constituting our dataset. We have used the software DataWarrior (https://openmolecules.org/datawarrior/) to generate this plot for the chemical diversity analysis. The plot (Fig. 3) represents the structural similarity (based on substructure fragment dictionary-based binary fingerprint FragFp) of the dataset compounds with a well-known hepatotoxic compound—acetaminophen (paracetamol) as the reference43. The color coding of the data points reflects the similarity levels, i.e., green for highly similar compounds, yellow for moderately similar compounds, and red for highly dissimilar compounds (with respect to the reference compound selected). From this plot, it is evident that the similarity levels of most of the data points are quite low, which makes this dataset highly diverse, and this poses a challenge to developing mathematical models. A later section in this paper infers that the low levels of similarity are the reason for the incorrect prediction for acetaminophen, which is supported by the chemical diversity analysis in Fig. 3. Additionally, we have developed a scatter plot of the o/w partition coefficient (LOGP99) and molecular weight (MW) (Fig. 4) that also demonstrates chemical diversity based on physicochemical properties and observed that the active and inactive compounds appear in the same chemical space, which adds to another reason why modeling of such a dataset can pose a lot of challenges.Fig. 3Chemical Diversity analysis representing the diversity of the compounds in our dataset (the structure of a reference compound acetaminophen is shown).Fig. 4A chemical diversity plot of LOGP99 vs MW.Selection of the important QSAR descriptorsFrom the most discriminating feature selection algorithm, we identified 27 different QSAR descriptors with absolute mean difference values > 0.03. On analyzing the chemical information these features encode, we observed that all of them were structural descriptors, and there was a clear absence of physicochemical and global descriptors. However, MW and LOGP99 are some of the most common and widely used physicochemical features, but our initial model-independent feature selection process only identified structural features in this study. We have, therefore, deliberately incorporated the descriptors LOGP99 and MW that represent the o/w partition coefficient and the molecular weights of compounds. The revised number of features was now 29, which was subjected to various ML-based classification QSAR modeling. All 29 descriptors and their significance have been represented in Table S2 and Fig. S1 of the Supplementary Material SI-2.Results of the classification-based QSAR modelsThe different validation metrics computed from the models developed using a variety of ML algorithms have been shown in Table 1, and the corresponding optimized hyperparameter settings have been reported in Table S3 of the Supplementary Material SI-2. It is evident that the linear modeling framework LDA and the non-linear LR have moderate classifiability of the training and test set compounds, while other non-linear models like RF and SVM exhibit poor performance, especially when we observe the MCC and Ckappa values of the training and test sets. This poses the challenging question of whether the promising c-RASAR approach can develop models with better quality and efficient classifiability than these traditional QSAR models. The validation metrics suggest that ML models based on conventional QSAR descriptors cannot provide sufficient classifiability.Table 1 Results of the different ML-based QSAR and c-RASAR models.We have performed a SHAP analysis of the training set for the best QSAR models (LDA and LR). As evident from Figure S2 in Supplementary Materials SI-2, MW is an essential descriptor in both cases.Selection of the important c-RASAR descriptorsThe most discriminating feature selection analysis on the RASAR descriptor matrix helped us to identify the set of essential RASAR descriptors. The descriptors having an absolute difference in the mean values > 0.06 were considered important descriptors. Such descriptors were 8 in number, which included descriptors like RA function, gm, gm_class, gm*Avg.Sim, gm*SD_Similarity, and sm1, the majority of which are represented by the Banerjee-Roy concordance and similarity coefficients and their derived descriptors. It is essential to note that the QSAR models were developed using 29 descriptors, while the c-RASAR models were developed using only eight descriptors. The most discriminating RASAR descriptor, having the highest absolute mean difference value (gm_class), was also used to develop a simple univariate model. The selected RASAR descriptors have been shown pictorially in Fig. 5, and the variations in their values for the first and the last 20 compounds in the training and test sets have been represented in the form of a heat map in Fig. S3 in Supplementary Materials SI-2.Fig. 5List of RASAR descriptors used to develop the c-RASAR models. The bubble diameters represent the importance of the descriptors in terms of discriminating power between the positives and negatives.Results of the classification-based RASAR (c-RASAR) modelsUsing the same set of ML modeling algorithms, various c-RASAR models have been developed. The list of various classification-based validation metrics computed for all the different ML c-RASAR models has been reported in Table 1, and the corresponding optimized hyperparameter settings have been reported in Table S3 of Supplementary Material SI-2. As the results show, the LDA and LR models also perform better than the non-linear RF and SVM. Observing the validation metrics values, we can infer that the simplest LDA c-RASAR model provides the best predictivity. Moreover, the performance of all the different ML c-RASAR models on the test set supersedes the predictive performance of even the best ML QSAR model (i.e., LR QSAR). We have additionally shown the enhancement of the external predictive performance of the c-RASAR models using a t-test of the means of different selected metrics of different models (like Accuracy, MCC, Cohen’s kappa, and AUC-ROC for the test set predictions) (Table S4 in Supplementary Materials SI-2) and the Sum of Ranking Differences (SRD) of the competing models44,45,46,47 (Fig. 6).Fig. 6Normalized SRD values of different models compared to random ranking based on metrics Accuracy, MCC, Cohen’s kappa, and AUC for LDA, RF, SVM and LR models using the SRD approach for (a) training and test set data, (b) test set data.The sum of ranking differences (SRD) approach of Prof. Heberger44,45,46,47 is an effective method to compare different metrics, processes, models, analytical techniques, etc. in a general manner. We have used this method here to compare the performance of various QSAR and RASAR models considering selected metrics (Accuracy, MCC, Cohen’s kappa, and AUC) of the training and test sets and also separately for the test set. Here, the models to be ranked are placed in the rows, and the metrics are in the columns of an input matrix. The columns are scaled to unit length. Then, the transposed matrix is used for the SRD analysis, taking the maximum row values as the reference. Then, the results of each model are ranked in the order of increasing magnitude. The difference between the rank of the model results and the rank of the standard results (here, row maximum) is then computed. Calculating the sum of absolute values of the differences for all models follows this. A lower value of SRD (close to 0) indicates a better model. The closeness of SRD values indicates the similarity of the models, whereas large variation indicates dissimilarity. A permutation test is used to validate the SRD method, which uses a recursive algorithm to compute the discrete distribution for a small number of objects (n < 14, as in this case) or the normal distribution if the number of objects is large. The theoretical distribution is visualized for random numbers and can be used to identify SRD values for models that are far from being random. The SRD runs were made using the program available from http://aki.ttk.hu/srd/.The normalized SRD values of different models compared to random ranking based on metrics Accuracy, MCC, Cohen’s kappa, and AUC for LDA, RF, SVM and LR models based on the SRD approach are shown based on training and test set data (Fig. 6a) and based on only test set data (Fig. 6b). Based on Fig. 6a, the order of performance of the models in comparison to the reference (maximum values of the metrics) is (LDA-RASAR, SVM-RASAR, LR-RASAR), RF-RASAR, (LDA-QSAR, LR-QSAR), (RF-QSAR, SVM-QSAR) confirming the superiority of the RASAR models over QSAR models statistically. Based on Fig. 6b, the order is (LDA-RASAR, RF-RASAR), (SVM-RASAR, LR-RASAR), (LDA-QSAR, RF-QSAR, SVM-QSAR, LR-QSAR).From the above SRD plots, it is evident that the c-RASAR models are much better-performing models than the corresponding QSAR models. Additionally, it should be noted that the QSAR models were developed using 29 descriptors, while the c-RASAR models were developed with just eight descriptors. So, our RASAR models are also statistically more meaningful when enhancing the degree of freedom.On comparison among the different ML-based c-RASAR models, it can be confirmed that the LDA c-RASAR model was the best performing model for external set predictions, taking the MCC, Ckappa, and AUC as the objective functions. Additionally, a univariate LDA c-RASAR model was also developed using gm_class as the only descriptor (Table 1). This model also proved to be sufficiently predictive, especially in modeling a very diverse dataset and a considerably large number of compounds for a simple univariate model. This reiterates the potential and the predictive power of the similarity and error-based RASAR descriptors even in the case of diverse and less-modelable data points.Statistics for cross-validation—a measure of the robustness of the developed c-RASAR modelsOne of the fundamentals for developing statistically reliable models is to perform rigorous cross-validation. This ensures that the inherent quality of the model does not change significantly on the omission of a selected number of data points. In this study, we have rigorously cross-validated all our ML-based c-RASAR models to show their robustness. We have employed 20 times fivefold cross-validation and 1000 times shuffle-split cross-validation to observe the robustness of the models. From the results in Table 2, it can be concluded that all our ML c-RASAR models were statistically very robust, with the LDA c-RASAR model (the model that generates the best predictivity as per Table 1) having the highest degree of robustness as observed from the highest cross-validated accuracy values among all the reported c-RASAR models. Also, the difference in the accuracy and cross-validated accuracy values were the lowest in the case of the LDA c-RASAR model.Table 2 Results from the 20 times fivefold cross-validation and 1000 times shuffle-split cross-validation (Training set).Analysis of the feature importance in the LDA c-RASAR modelEstimating the relative importance of the modeling features in a modeling framework is essential to understanding the contribution of specific features. Although ML models have been referred to as “black box” many times, the application of explainable AI (XAI)48,49 has enhanced the interpretability of the different ML models. In this context, researchers are now using the SHapley Additive explanation (SHAP) concept to derive feature importance for complex ML models. This is based on the game theory and is an extension of the Local Interpretable Model-agnostic Explanations (LIME) approach. In this work, we have developed SHAP analysis50 plots for the training and test sets (Fig. 7) in the case of the LDA c-RASAR modeling framework to understand the importance of the various RASAR descriptors in the model that encompasses the entire dataset. It should be noted that the first four variables (in the SHAP plots) of the training and test sets are the same; only their order of importance has been altered. This is a common phenomenon where the test set may not exactly replicate the order of importance of the features from the training set. The biggest difference between Figs. 7 and 5 (Bubble plot of the most discriminating features) is that the former is specific for a particular modeling algorithm, while the latter is a feature selection process (based on the discriminatory power of different descriptors) that is model-independent33. Further details on the contributions of these descriptors to the endpoint are given below. Please note that the contributions of the modeling features have been derived as per the LDA c-RASAR model coefficients (Table S5 in Supplementary Materials SI-2).Fig. 7SHAP analysis plots demonstrating the feature importance for (a) training set and (b) test set.The descriptor RA function is a Read-Across-derived function that ensembles chemical information from the selected QSAR (structural and physicochemical) descriptors. This is similar to a latent variable encoding information of a wide array of descriptors into a single variable. The RA function acts as a resultant vector, being a function of the feature space defined by the selected structural and physicochemical QSAR descriptors. This descriptor contributes positively towards hepatotoxicity, which is obvious since this represents the collective information of all the selected QSAR descriptors. It can be observed that Oxaprozin (520) has a high RA function value, and the experimental response value suggests that it is hepatotoxic. Conversely, compounds like Alclometasone (767) have a low RA function value, and the experimental response value suggests that it is non-hepatotoxic. The descriptor MaxPos represents the similarity value of a query compound to its closest positive/hepatotoxic source compound, and this descriptor contributes positively towards hepatotoxicity. This is because when a query compound has a high level of similarity to a hepatotoxic compound, it is more likely that the query compound has the propensity to show hepatotoxicity. This can be exemplified by a compound Alitretinoin (30) that has a high MaxPos value (the closest neighbor of 30 being Isotretinoin, which is also a retinoic acid derivative and is also hepatotoxic), and the experimental data shows that it is hepatotoxic, while compounds like Mometasone (1084) possess a low MaxPos value and are non-hepatotoxic. The descriptor MaxNeg represents the similarity value of a query compound to its closest negative/non-hepatotoxic source compound. This descriptor exerts a negative contribution since the propensity of a query compound is non-hepatotoxic when it has a high similarity level to a non-hepatotoxic source compound. In the case of Oxamniquine (518), the MaxNeg value is zero (since there is the absence of any non-hepatotoxic compounds among the close source neighbors) and the experimental data infers that it is hepatotoxic, while in the case of Dextrothyroxine (879), the MaxNeg value is high and it is observed to be a non-hepatotoxic compound. The descriptor gm (a.k.a. Banerjee–Roy concordance coefficient) is a concordance measure that indicates the propensity of a query compound to be positive (hepatotoxic) or negative (non-hepatotoxic). This descriptor encodes two different information, firstly is the fraction of the positive compounds among the list of close source neighbors (PosFrac), and the second is considering the similarity levels to the closest positive (MaxPos) and closest negative (MaxNeg) compounds, whichever is higher. In this LDA c-RASAR model, it was observed that gm had a positive contribution towards hepatotoxicity since higher values of MaxPos and PosFrac reflect the propensity of a query compound to be positive. This can be observed in the case of Tetrachloroethylene (682) where the value of gm is high, and it is reported to be a hepatotoxic compound, while Isoproterenol (1010) has a low value of gm, and it is observed to be a non-hepatotoxic compound. The descriptor gm*Avg.Sim is a derived descriptor obtained from the product of the values of gm and Avg.Sim, and this descriptor contributes negatively towards hepatotoxicity. This can be observed in Halazepam (979) for which the value of gm*Avg.Sim is high and the compound is reported to be non-hepatotoxic, while Ethylene dichloride (276) has a low value of gm*Avg.Sim and it is reported that Ethylene dichloride is hepatotoxic. The descriptor gm*SD_Similarity is a derived descriptor obtained from the product of the values of gm and SD_Similarity, and this descriptor contributes positively towards hepatotoxicity. As evident from Cefonicid (122), it is observed that the value of gm*SD_Similarity is high and the compound is observed to be hepatotoxic, while Pramlintide (1153) has a low value of gm*SD_Similarity and it is observed to be non-hepatotoxic. The descriptor sm1 is the Banerjee-Roy similarity coefficient 1 which is computed based on the information of the closest positive and negative compounds. This descriptor contributes negatively towards hepatotoxicity, possibly to penalize gm_class (see below) based on its quantitative values since this descriptor ideally should have a positive contribution (Fig. S4 of Supplementary Material SI-2), and can also be used as a diagnostic measure to estimate the modelability of a dataset. The compound Levoleucovorin (1024) has a high value of sm1 and the experimental data indicate that this compound is non-hepatotoxic. Similarly, Teniposide (675) has a low value of sm1 and it is observed that this compound is hepatotoxic. The descriptor gm_class is a form of gm that excludes the uncertainty in cases where gm is close to or equal to zero, to provide a more deterministic propensity of a query compound to be positive or negative. This descriptor is binary in nature with 0 and 1 as the only possible values. Like gm, gm_class also contributes positively towards hepatotoxicity. The compound Acebutolol (14) has gm_class value of 1 and it has been observed that this compound is hepatotoxic. Similarly, Methscopolamine (1062) has gm_class value of 0 and it has been observed that this compound is non-hepatotoxic.Analysis of the c-RASAR derived predictions—a local approach using explainable AI with an example of acetaminophenThe underlying expectation of statistical modeling approaches is that they should efficiently predict the active and inactive compounds. However, often these predictions are erroneous due to some of the limitations that involve the machine learning algorithm employed, the selected descriptor space defining the applicability domain, and the data structure of the training set, using which a model is trained. It is worth noting that acetaminophen (a training set data point in the present analysis) is a well-known hepatotoxic compound, which is also evident from our dataset. However, our LDA c-RASAR model misclassifies this compound. This is an example of a false negative, where the experimental data says that it is hepatotoxic, but the model-derived prediction infers that it is non-hepatotoxic. In this section, we have analyzed this ambiguity using the concept of similarity coupled with explainable AI. It is worth noting that RASAR is a similarity-driven approach, which is dependent on the nearest neighbors of a particular query compound. Initially, as evident from the chemical diversity plot (Fig. 3), we find that acetaminophen has very low levels of structural similarity among other compounds in the entire dataset. Additionally, we have identified 10 nearest neighboring source compounds of acetaminophen using the concepts of Read-Across and observed that only five of those compounds were hepatotoxic (PosFrac = 0.5). Moreover, the analysis of the top 5 nearest neighbors suggests that four out of the five source compounds were non-hepatotoxic. This shows that although there is an equal fraction of hepatotoxic and non-hepatotoxic compounds among the close source neighbors, most of the non-hepatotoxic compounds had higher similarities with the query compound acetaminophen. This information drives the model towards predicting that acetaminophen is a non-hepatotoxic compound. Additionally, in Fig. 8a, it is observed that Primidone is the nearest neighbor for Acetaminophen (nearest neighbors are arranged in a clockwise direction according to the decreasing similarity levels), although they are structurally very dissimilar. To understand the effect of the RASAR descriptors on this particular compound, a local SHAP plot was developed (Fig. 8b), which depicts that the expected predicted value should have been positive (hepatotoxic) instead of negative (non-hepatotoxic). Like our expectations, most of the RASAR descriptors contributed negatively (except sm1 and MaxPos, which had marginal positive contributions). They led to the prediction, which wrongly states that acetaminophen is non-hepatotoxic. Although RASAR descriptors are entirely similarity-driven, some of these descriptors help the modeler to understand the reliability of the negatively predicted compound, which conventional QSAR descriptors fail to provide. Considering the descriptor gm (Banerjee-Roy coefficient), whose value ranges from − 1 to + 1, a gm value towards either of the extremities infers that the prediction is reliable. Unfortunately, the value of gm is 0 in the case of acetaminophen, which infers that the negative prediction of acetaminophen is unreliable. Additionally, the values of sm1 and sm2 are not in the expected range, where both these values should have been positive for an active compound. According to the definition of Banerjee and Roy20, a positive compound having negative values of sm1 and sm2 can be considered an activity cliff. The collective information from the above-mentioned theories and observations is the reason why acetaminophen has been misclassified as a non-hepatotoxic compound.Fig. 8(a) Nearest neighbors of acetaminophen and (b) local SHAP plot of acetaminophen demonstrating the contribution of features leading to misclassification.Variation in the values of the RASAR descriptors across the datasetThis is an additional analysis where we tried to explore the variation of the descriptor values with the positive and negative compounds. The dataset was divided into four parts according to a particular descriptor value. These four parts contain a particular descriptor’s high, medium–high, medium–low, and low values. We have now taken the number of positive and negative compounds in each part. This process was done for three important RASAR descriptors, namely RA function, gm, and sm1. On analyzing the variations of RA function (Fig. 9a), it was observed that a high value of RA function is present in most of the positive compounds, while a low value of RA function is present in most negative compounds. This is expected since the RA function acts as a composite function encoding information of the entire chemical space of the QSAR descriptors. In the medium–high value region of RA function, there is a near-equal balance of the positive and negative compounds, suggesting that the number of positive and negative compounds are nearly equal in this region. This is attributed to the reduced discriminatory power of the descriptor in this region. On analyzing gm (Fig. 9b), the high-value region of gm has a higher number of positive compounds and the low-value region of gm has a higher number of negative compounds. This is also expected since gm reflects the propensity of a query compound to become positive or negative. The descriptor sm1 is designed to identify the modelability of a given dataset. As evident from Fig. 9c, the high-value and the low-value regions of sm1 contain a remarkably low number of positive and negative compounds, and most of the compounds lie in the medium–high and medium–low regions. This reiterates the fact that this dataset has poor modelability, which further enhances the potential of RASAR descriptors since the c-RASAR models had relatively good statistical validation metrics.Fig. 9Variation in the values of selected RASAR descriptors across the dataset.Importance of RASAR descriptors – a statistical analysis using different dimensionality reduction methodsThis is a section that deals with how the RASAR descriptors and the c-RASAR models provide superior performance as compared to the traditional QSAR models developed using various structural and physicochemical descriptors. Apart from the fact that the c-RASAR models are developed using a significantly lower number of descriptors (eight) as compared to the conventional QSAR model (twenty-nine), this section shows the discriminatory power of the RASAR descriptors. To assess this, we have employed three different dimensionality reduction techniques: the t-distributed Stochastic Neighbor Embedding (t-SNE)51, Uniform Manifold Approximation and Projection (UMAP)52, and Arithmetic Residuals in k-groups Analysis (ARKA) framework15. While the former two methods consider only the descriptor space, the recently developed ARKA framework not only considers the descriptor space but also considers the observed response values of the training compounds. In this section, we will analyze the different observations using the three different dimensionality reduction techniques and derive certain key information that is useful in modeling terms.t-distributed Stochastic Neighbor Embedding (t-SNE) analysisThe t-SNE plots utilize a non-linear algorithm to reduce dimensionality and cluster similar data points. This uses an Euclidean distance-based approach to compute the similarity among data points and cluster similar data points. The basic understanding is that the set of descriptors (QSAR and c-RASAR) that better express the similarities among compounds should generate better clustering of the data points. As evident from Fig. 10, one can observe that the t-SNE plot developed using the eight different c-RASAR modeling descriptors shows better clustering as compared to the t-SNE plot developed using the 29 different QSAR modeling descriptors. Therefore, this analysis infers that the RASAR descriptors are better at representing similar data points.Fig. 10t-SNE plots of the (a) QSAR modeling descriptors and (b) RASAR modeling descriptors.Uniform manifold approximation and projection (UMAP) analysisThe UMAP plots also adhere to a non-linear dimensionality reduction technique similar to t-SNE. An Euclidean distance-based approach is used to define the similarity among data points. Again, the basic concept is that the descriptor matrix with a better clustering ability is more efficient in properly representing similar data points. Identical to the previous case, we developed UMAP plots for 29 QSAR and 8 RASAR modeling descriptors. The observation was similar since the RASAR descriptors have a better clustering ability (Fig. 11).Fig. 11UMAP plots of the (a) QSAR modeling descriptors and (b) RASAR modeling descriptors.Arithmetic Residuals in k-groups Analysis (ARKA) frameworkSo far, we have discussed the potential of the RASAR descriptors using various dimensionality reduction techniques that utilize only the descriptor space. However, these dimensionality reduction techniques cannot identify potential activity cliffs since their algorithm does not involve information from the training set response values. This is how the ARKA framework differs from the conventional dimensionality reduction methods since it utilizes information from the training set response to derive a mathematical weightage value. It would be interesting to see the clustering ability of the 29 QSAR descriptors and the 8 RASAR descriptors in this particular framework and analyze some of the activity cliffs identified by the ARKA plot generated from the RASAR descriptor. Figure 12 represents the plots for ARKA_2 vs. ARKA_1 for the training and test sets of the 29 QSAR modeling descriptors and the 8 RASAR descriptors. The ARKA descriptor plots generated using the QSAR descriptors showed that most of the data points lie in the less modelable and borderline zone, as defined by Banerjee and Roy15, and this infers that the QSAR descriptors, after the dimensionality reduction, cannot sufficiently describe the complete chemical information required for the model to generate efficient predictions. On the other hand, the ARKA descriptor plots generated using the 8 RASAR descriptors (after the dimensionality reduction) showed enhanced modelability and had a lower number of compounds in the borderline and less-modelable zone. In Fig. 12, the rectangle near the origin denotes the region of less-modelable data points, and the number of data points inside this rectangle reflects a lower reliability and modelability of the data. As evident, the ARKA plots for the QSAR descriptors have a large number of data points in this zone, while the ARKA plots for the RASAR descriptors have a lower number of such data points, again implying the importance of RASAR descriptors in enhancing the modelability of a dataset.Fig. 12ARKA_1 versus ARKA_2 plots of the training and test sets using (a) QSAR modeling descriptors and (b) RASAR modeling descriptors.In ideal cases, the positive compounds should lie in the fourth quadrant of the ARKA plot (where ARKA_1 is positive and ARKA_2 is negative), and the negative compound should lie in the second quadrant of the ARKA plot (where ARKA_1 is negative and ARKA_2 is positive). However, the occurrence of positive compounds in the second quadrant or negative compounds in the fourth quadrant is the instance where we can term such compounds as activity cliffs (provided that they do not lie in the less-modelable or borderline zone for a comprehensive evaluation). From the ARKA plots developed using the RASAR descriptors, we have identified five compounds from the training set and four compounds from the test set as activity cliffs. An analysis of two such activity cliffs has been performed and their structural and activity representation have been provided in Table 3.Table 3 Structural representation of the activity cliffs, their closest source neighbors from the ARKA plots and the similarity values.As evident from Table 3, the training set compounds Bupivacaine (94) and Mepivacaine (1048) have a high level of similarity, and Mepivacaine is the closest neighbor of Bupivacaine, as observed from the ARKA_2 vs ARKA_1 plot. Additionally, these compounds belong to the same pharmacological class, i.e.; local anesthetics. Even the main structural moiety is the same for both of these drugs, and they differ slightly in the side chain (the butyl group in Bupivacaine and the methyl group in Mepivacaine). According to the principle of similarity, two molecules having similar structural features are expected to show similar properties. In contrast, the experimental data says that Bupivacaine is an active (hepatotoxic) molecule while Mepivacaine is an inactive (non-hepatotoxic molecule). Similarly, from the test set, it can be observed that Dienestrol (885) and Phenol (561) have a high level of similarity and the same structural moiety (the phenol ring). Still, they differ in their activity values since the experimental data says that Dienestrol is an inactive (non-hepatotoxic) compound while Phenol is an active (hepatotoxic) compound. This demonstrates the power of the RASAR descriptors and the ARKA framework, where the placement of the compounds Bupivacaine and Dienestrol were not in the desired quadrant in the ARKA plot, and has proven to be activity cliffs.Analysis of activity cliffs using Read-Across-derived informationThe basic concept of Read-Across is that for a query compound, we locate its close source neighbors (having experimental response values) in the chemical similarity space (based on the selected structural and physicochemical descriptors) and derive a consensus-like prediction for the query compound. This particular information is essential in identifying whether a particular query compound is an activity cliff. One of the theories for this is that if the nearest neighbor of a particular active query compound is an inactive molecule or vice versa, then such query compounds can be termed activity cliffs since they have identified a compound of the opposite class that most closely resembles its structural characteristics. Apart from the prediction files and the RASAR descriptor files generated by the tools Read-Across-v4.2.1 and RASAR-Desc-Calc-v3.0.3, respectively, both of these tools additionally generate a file (“Sort.xlsx”) that identifies the nearest neighbors for a particular query compound, based on which Read-Across predictions and the RASAR descriptors are calculated. The target compounds that have been analyzed in this section are from the test set. Taking information on the nearest neighbors of every query compound, we have identified the closest neighboring compounds for every query compound and analyzed whether they are activity cliffs or not. In usual cases, it is expected that the nearest neighbor of a positive query compound should ideally be positive (Fig. 13a) while the nearest neighbor of a negative query compound should ideally be negative (Fig. 13b). However, a query compound can be termed activity cliff only when the nearest neighbor of a positive query compound is negative or the nearest neighbor of a negative query compound is positive (Fig. 13c). As exemplified in Fig. 13a, the query compound Cycloserine is hepatotoxic, and its nearest neighbor is Hydroxyurea, which is also hepatotoxic. Similar observations can also be made for the query compound Glimepiride, whose nearest neighbor is Glyburide, and both of these compounds are hepatotoxic. Additionally, both Glimepiride and Glyburide are antidiabetic drugs from the group of arylsulfonylureas, and it is expected that they should show similar properties since they are structurally very similar. Figure 13b suggests that the query compound Buprenorphine is a non-hepatotoxic compound, and its nearest neighbor (Oxymorphone) is also non-hepatotoxic. This is also quite obvious since both these molecules belong to the opioid analgesic class of drugs, and it is expected that they should show similar properties. Additionally, the query compound Chlorpheniramine, which is also a non-hepatotoxic compound, has another non-hepatotoxic compound—Brompheniramine as its nearest neighbor. This is also expected since both of these molecules are antihistamines and only differ in their halogen substitution (chlorine in Chlorpheniramine and bromine in Brompheniramine). Figure 13c presents typical examples of activity cliffs. For the query compound Amphotericin B, the nearest neighbor is Nystatin. However, the experimental report infers that Amphotericin B is a hepatotoxic molecule, while Nystatin is non-hepatotoxic. This is a classic example of an activity cliff since both of these molecules are macrocyclic polyene antifungal drugs and possess very high levels of structural similarity. If we observe the distribution of the ten nearest neighbors of Amphoptericin B, we find that most of the molecules are hepatotoxic just like Amphotericin B. With this information, we can safely infer that Nystatin has been identified as an activity cliff. In another case, for the query compound Esomeprazole, the nearest neighbor is Famcyclovir. It is to be noted that Esomeprazole has been experimentally labeled as a non-hepatotoxic compound but Famcyclovir has been labeled as a hepatotoxic compound. If we observe the distribution of the 10 close source neighbors for the non-hepatotoxic query compound Esomeprazole, we find that 8 of them were labeled hepatotoxic compounds. Therefore, in this case, a non-hepatotoxic compound Esomeprazole has high levels of structural similarity with such molecules that have experimentally been labeled as hepatotoxics. Therefore, this observation concludes that the molecule Esomeprazole is an activity cliff. The positive and negative query compounds, along with their 10 close source neighbors (arranged in a decreasing order of similarity and depicted clockwise for each query compound), have been presented in Fig. 13. Thus, the compounds—Nystatin from the training set and Esomeprazole from the test set have been correctly identified as activity cliffs by the values of sm1 and sm2 (positive sm1 and sm2 values for the inactive compounds Nystatin and Esomeprazole).Fig. 13An analysis of activity cliffs using the concepts from Read-Across and c-RASAR. (a) indicates an ideal case for positive query compounds, (b) indicates an ideal case for negative query compounds, and (c) indicates the activity cliffs.Results for the true external set predictionThe evaluation of the model performance on a true external data set is a benchmark for assessing the generalizability of the developed model. The main motive is to check whether the model efficiently identifies hepatotoxic compounds on true external set data. The experimental active/inactive data was collected from the works of He et al.39, who also reported predictions generated using their voting classifier developed using eight different ML models. From a general conscience, it is expected that a voting classifier model from 8 different ML models should more efficiently be able to identify the hepatotoxic compounds compared to a simple LDA model. However, the novelty/specialty of this LDA model is that it was trained using RASAR descriptors. The true external set compounds that appeared in our training set were removed from our analysis. Therefore, this final list of 184 compounds has no such data points that appeared in either our or He et al.’s training sets, leading to the foundation of an unbiased evaluation. The predictions for these compounds were made using the LDA c-RASAR model, which generated the best predictivity. At this stage, we have the experimental response values, predicted values using the previously reported ensemble model, and predictions developed using our LDA c-RASAR model. Classification-based validation metrics like Accuracy and Sensitivity (Recall) were computed for both predictions. The reason for choosing these metrics is that the best model should not only be able to identify the hepatotoxic compounds efficiently but also should be able to generate a lower number of false negatives (data points that are hepatotoxic, but the model predicts that they are non-hepatotoxic). Additional metrics like Precision and F1 score were also computed and reported. The complete comparison has been reported in Table 4, where it is observed that the LDA c-RASAR model not only has enhanced accuracy in predicting the true external set but also generates a lower number of false negatives. From a pharmacological and statistical point-of-view, a model generating a higher number of false negatives is not sufficiently reliable as it can bypass potential hepatotoxic compounds by predicting them as non-hepatotoxic.Table 4 Comparison of the performance of the true external set data using our LDA c-RASAR model and the previously reported ensemble model.Comparison with previously reported modelsThis section indicates a comprehensive comparison of the performance of our model and the previously reported models. Liew et al.1 used the same dataset to develop machine learning models. They used 617 base classifiers of k-Nearest Neighbor (k-NN), Support Vector Machine (SVM), and Naïve Bayes (NB), and used these predictions as descriptors for subsequent modeling using Naïve Bayes as the final stacking classifier. This procedure appears extremely complex, with very little room for reproducibility, especially when the authors predicted three external sets containing just 120, 47, and 20 compounds. Additionally, the authors defined that the external set containing 20 compounds is ten pairs of activity cliffs. If we analyze their prediction results, some important validation metrics like AUC, specificity, and G-mean did not even pass their threshold values, even though the authors used an array of modeling algorithms. Moreover, the prediction for true external set compounds using the presented modeling framework would be tedious, and these authors also did not evaluate the performance of the final stacking classifier model on true external sets. On the other hand, we have developed a simple, linear, reproducible, and transferable LDA c-RASAR model that can easily be used to compute true external set predictions. Additionally, we have not only employed this model to predict a test set containing 631 compounds (which is more than three times larger than the combined three different external sets of Liew et al.) but also have predicted another true external set of 184 compounds. As evident from the results of various statistical validation metrics of the predictions for the test and the true external sets, it is evident that the LDA c-RASAR model provided more robust, reliable, and accurate predictions of unseen hepatotoxic compounds. Toropova et al.5 also used this same dataset to develop a semi-correlation-based model that uses a regression modeling framework employing an optimal SMILES-based descriptor on graded data to generate classification predictions. However, the authors ignored the unique/rare structural features during the computation of their descriptor, which appears to introduce a bias in their modeling framework. Additionally, they have divided the dataset into four equal parts—active training, inactive training, calibration, and validation sets, and have only developed the model using the active training set (314 compounds). This model was evaluated for the predictive potential using a validation set of 322 compounds. On the other hand, our LDA c-RASAR model was not only trained using a higher number of data points (643 compounds) but also its predictive ability was evaluated on a large number of test set data points (631 compounds). Additionally, we have performed validation using a true external set to prove the generalizability of our model, which Toropova et al.5 did not perform. Our LDA c-RASAR model has not only been developed on a balanced training set but also provides competitive performance on the training, test, and external sets, as evident from the different classification-based validation metrics. Moreover, the present work provides a deep insight into potential activity cliffs using multiple methods, which is missing from the previous works.

Hot Topics

Related Articles