MetaCGRP is a high-precision meta-model for large-scale identification of CGRP inhibitors using multi-view information

Chemical space analysis of CGRP inhibitors

The chemical space analysis conducted herein focused on understanding the characteristics that distinguish the bioactivity classes (i.e., active and inactive) of CGRP inhibitors. Over time, guidelines for drug-like molecules have been formulated by the pharmaceutical industry and medicinal chemists, anticipating their permeability across biological membranes32,33. This exploration employed Lipinski’s Ro5 and Veber’s rule, including AlogP ≤ 5, MW ≤ 500 Da, HBDon ≤ 5, HBAc ≤ 10, TPSA ≤ 140 Å², and nRotB ≤ 10. Although most small molecules that are orally available tend to follow the Ro5 criteria, many exceed these boundaries71. This observation suggests that these criteria are based on the understanding that effective medications often comprise larger molecules with a certain degree of lipophilicity72.

MW represents the compound’s mass and is commonly used for mathematical calculations and interpretations, whereas AlogP serves as an established parameter that gauges a molecule’s hydrophobicity, or lipophilicity, offering insights into its capability to enter and traverse cell membranes. The counts of HBDon and HBAc are employed to evaluate a molecule’s ability to form hydrogen bonds (H-bonds). Additionally, TPSA is linked to the hydrogen-bonding pattern of a molecule in an aqueous environment, while nRotB is associated with molecular flexibility, which allows molecules to establish intrachain hydrogen bond interactions, thereby enhancing permeation into cell membranes73,74. Supplementary Figure S1 illustrates the chemical space using a combination of box and violin plots for all the descriptors mentioned above. Supplementary Figure S1A shows a concentrated distribution of active compounds, with a mean molecular weight of 600 Da and an approximate range of 550 to 650 Da, while the inactive compounds are clustered between 450 and 620 Da, with a mean of 550 Da.
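As an illustration of how the thresholds listed above can be applied, the following minimal sketch flags which limits a compound exceeds. The function name and the example compound are hypothetical: the MW of 600 Da is the mean reported above for active compounds, while all other descriptor values are made up for demonstration and do not describe any compound in the dataset.

```python
# Upper limits quoted in the text (Lipinski's Ro5 plus Veber's rule).
THRESHOLDS = {
    "AlogP": 5.0,   # lipophilicity
    "MW": 500.0,    # molecular weight in Da
    "HBDon": 5,     # hydrogen-bond donors
    "HBAc": 10,     # hydrogen-bond acceptors
    "TPSA": 140.0,  # topological polar surface area in square angstroms
    "nRotB": 10,    # rotatable bonds
}


def rule_violations(descriptors):
    """Return the names of the descriptors that exceed their upper limit."""
    return [name for name, limit in THRESHOLDS.items()
            if descriptors[name] > limit]


# Hypothetical compound: only the MW value is taken from the text;
# every other value is an illustrative assumption.
example = {"AlogP": 4.0, "MW": 600.0, "HBDon": 2,
           "HBAc": 8, "TPSA": 120.0, "nRotB": 7}

violations = rule_violations(example)
print(violations)
```

For this example only the MW limit is exceeded, which mirrors the situation described for many CGRP inhibitors in this dataset.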
The AlogP of active compounds ranges from 2 to 6 with a mean of 4, while that of the inactive compounds falls in the range of 1 to 7 with a mean of 3 (Supplementary Figure S1B). Additionally, our statistical analysis, using the Mann–Whitney U test, found a significant difference between the active and inactive compounds (p-value < 0.001) for the AlogP descriptor, but no statistically significant difference in MW between the groups.

The visualization further emphasizes that most compounds possess HBDon values under 5 and HBAc values below 10. A statistically significant difference (p-value < 0.001) between active and inactive compounds was noted only for the HBAc property, indicating a higher number of H-bond acceptors in the active compounds (Supplementary Figure S1C-S1D). Similarly, the distribution of active and inactive compounds in terms of TPSA shows a trend with a mean value at 120 Å² (Supplementary Figure S1E). Although the nRotB values of most compounds in both classes were within the desired range (i.e., less than 10), the difference between their distributions was statistically significant with a p-value < 0.001 (Supplementary Figure S1F).

However, it is important to note that the MW of most compounds in the dataset surpasses the threshold defined by Ro5. This deviation can be attributed to the nature of the CGRP receptor, which consists of two distinct domains, CLR and RAMP1. To effectively inhibit or bind to the CGRP receptor, a molecule needs to interact with both of these components, necessitating a larger molecular size69. An illustration of this is the potent inhibitor olcegepant, which has a MW of 869.645 Da (DrugBank ID: DB04869). Additionally, FDA-approved small-molecule drugs for CGRP such as rimegepant, ubrogepant, atogepant, and zavegepant have MWs of 534.6, 549.5, 603.5, and 638.8 Da, respectively, all of which exceed the Ro5 limit of 500 Da.
This might be a result of larger molecules containing extra hydrophobic regions or functional groups, making them more likely to partition into organic phases such as octanol. Moreover, larger molecules with higher MW have a greater potential for establishing numerous interactions with the target molecule75.

Furthermore, to assess the chemical diversity of the dataset, Tanimoto similarity was calculated and represented as a heatmap, illustrated in Supplementary Figure S2. The color of each cell in the heatmap corresponds to the Tanimoto similarity score between the molecules represented by the respective row and column. Cells with colors representing higher values (red) indicate high similarity (values closer to 1), while cells with lighter or less intense colors (blue) indicate low similarity (values closer to 0). In addition, the diagonal of the heatmap (from the top left to the bottom right) signifies the comparison of each molecule with itself, and should have a Tanimoto similarity score of 1 (or 100%), indicating perfect similarity. As can be seen from Supplementary Figure S2, the diagonal line is uniformly dark red, as each molecule is identical to itself. There are also several off-diagonal blocks that range from dark red to red, orange, and yellow, indicating clusters of molecules that are highly and moderately similar to each other. Finally, outside these blocks, the heatmap is mostly blue, suggesting that these clusters are quite different from the rest of the dataset.
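The pairwise scores behind a heatmap of this kind can be sketched as follows. This is a minimal illustration that assumes each molecule is represented by the set of "on" bit indices of a binary fingerprint; the three toy fingerprints and the function names are hypothetical, and the fingerprint type actually used for Supplementary Figure S2 is not stated in this section.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two sets of on-bit indices."""
    a, b = set(a), set(b)
    union = len(a | b)
    if union == 0:
        # Convention chosen for this sketch: two empty fingerprints are identical.
        return 1.0
    return len(a & b) / union


def similarity_matrix(fingerprints):
    """Square matrix of pairwise Tanimoto scores (the values a heatmap would color)."""
    return [[tanimoto(x, y) for y in fingerprints] for x in fingerprints]


# Toy fingerprints (hypothetical): the first two share three of five distinct bits,
# the third shares none with either.
fps = [{1, 4, 7, 9}, {1, 4, 7, 12}, {2, 3, 20}]
matrix = similarity_matrix(fps)

for row in matrix:
    print(["%.2f" % value for value in row])
```

The diagonal is 1 by construction, the first two molecules form a small "block" of moderate similarity (0.6), and the third molecule scores 0 against both, which corresponds to the blue regions described above.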
This interpretation suggests that while there are some groups of similar molecules, the dataset overall contains a good amount of diversity, which is generally favorable for creating robust and generalizable models.

Comparison of different ML methods and molecular representation methods

In this section, we evaluated and analyzed the impact of several baseline models built from 12 powerful ML methods and 12 conventional molecular descriptors on the identification of CGRP inhibitors by performing both 10-fold cross-validation and independent tests. The performance of these prediction models is recorded in Supplementary Tables S1-S2, while Supplementary Tables S3-S6 provide the average performance of each ML method and molecular descriptor. As seen in Supplementary Table S3, the five highest average cross-validation MCC values of 0.765, 0.764, 0.758, 0.752, and 0.752 were obtained from LR, XGB, ET, RF, and SVM, respectively. Among the 12 molecular descriptors, the five highest average cross-validation MCC values of 0.722, 0.735, 0.736, 0.752, and 0.777 were obtained from FP4, CKDExt, CKD, Circle, and Hybrid, respectively (Supplementary Table S5). In addition, Supplementary Figure S3 highlights the chemical space in terms of the applicability domain of the 12 molecular descriptors, as visualized using the t-distributed Stochastic Neighbor Embedding (t-SNE) approach. The compounds in the independent dataset are observed to form clusters in the same areas as compounds from the training dataset, indicating their similarity and reliability for prediction purposes. This suggests that these ML methods and molecular descriptors are beneficial in CGRP inhibitor identification.

Furthermore, we compared the performance of 144 baseline models and determined the best-performing one as judged by cross-validation MCC. From Fig.
2 and Supplementary Tables S1-S2, several observations can be summarized as follows: (i) the top-five baseline models consist of XGB-Hybrid, RF-Hybrid, SVM-Circle, LR-Hybrid, and LR-Circle, with respective cross-validation MCC of 0.824, 0.817, 0.811, 0.811, and 0.810; (ii) all of the top-five baseline models were developed using either Hybrid or Circle, which again confirms the discriminative ability of these two molecular descriptors; and (iii) XGB-Hybrid outperforms the other prediction models in terms of cross-validation MCC. On the other hand, MLP-Pubchem (MCC of 0.804) outperforms XGB-CDK (MCC of 0.649) as judged by the MCC on the CGRP-IND dataset. These observations indicate that the performance of models based on a single feature descriptor is not robust, as shown by the CGRP-IND dataset. Thus, it is desirable to develop a more robust model by using the meta-learning strategy.

Fig. 2 Performance comparison of top 20 baseline models as evaluated using training (A-B) and independent (C-D) datasets.

Comparison of different multi-view feature representations

In this study, we employed three multi-view feature representations, including CF, PF, and CPF, to address the limitation of the single feature descriptor. Furthermore, we applied the GA-SAR method to each feature representation to improve its discriminative ability. Herein, the GA-SAR method identified 10, 19, and 8 informative features for constructing the best feature sets for CF, PF, and CPF, respectively. For convenience of discussion, the best feature sets for CF, PF, and CPF are referred to as CF_FS, PF_FS, and CPF_FS, respectively. To assess the discriminative ability of the six feature representations in the identification of CGRP inhibitors, both 10-fold cross-validation and independent tests were used on the CGRP-TRN and CGRP-IND datasets, respectively. The prediction performance of the six new feature representations is recorded in Table 3.
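Because the comparisons in this section are ranked mainly by MCC, it may help to recall how the threshold-based metrics are derived from a confusion matrix. The sketch below uses the standard textbook definitions of ACC, SN, SP, F1, and MCC; the formulas are not spelled out in this section, so this is an assumption about the definitions used, and the counts in the example are hypothetical rather than taken from any table. AUC is omitted because it requires predicted scores rather than counts.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Standard binary-classification metrics from confusion-matrix counts."""
    total = tp + tn + fp + fn
    acc = (tp + tn) / total
    sn = tp / (tp + fn) if (tp + fn) else 0.0          # sensitivity (recall)
    sp = tn / (tn + fp) if (tn + fp) else 0.0          # specificity
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = (2 * precision * sn / (precision + sn)) if (precision + sn) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"ACC": acc, "SN": sn, "SP": sp, "F1": f1, "MCC": mcc}


# Hypothetical counts for a balanced test set of 100 compounds.
metrics = classification_metrics(tp=45, tn=40, fp=10, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

MCC uses all four cells of the confusion matrix, which is one reason it is often preferred over ACC as a ranking criterion when the classes are not perfectly balanced.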
As seen in Table 3, all six metrics of CF_FS, PF_FS, and CPF_FS are higher than those of their original feature vectors on the CGRP-TRN dataset. Furthermore, the MCC values of CF_FS, PF_FS, and CPF_FS were 5.65, 9.44, and 8.12% higher than those of their original feature vectors, respectively. Among the three optimal feature representations, the performance of PF_FS was slightly higher than that of CF_FS and CPF_FS over the 10-fold cross-validation test. In the case of the independent test, PF_FS secures the best performance in terms of all six measures. To be specific, the ACC, SP, MCC, AUC, and F1 of PF_FS were 2.56–3.85, 2.26–5.13, 2.56, 5.16–7.52, 0.79–1.26, and 2.70–4.26% higher than the compared feature representations. Overall, PF_FS showed a stable performance on both the CGRP-TRN and CGRP-IND datasets. Therefore, PF_FS was selected as the input feature vector to develop our proposed meta-model (named MetaCGRP).

Table 3 Cross-validation and independent test results of different feature representations.

Meta-learning strategy is capable of contributing to performance improvement

To reveal the advantages of the meta-learning strategy, we investigated the discriminative ability of our new feature representation (PF_FS) and the effectiveness of the proposed model (MetaCGRP) by conducting two separate comparative experiments. In the first comparative experiment, we compared the performance of PF_FS with that of the 12 conventional molecular descriptors. For a fair comparison, we trained and evaluated XGB classifiers with these molecular descriptors using both 10-fold cross-validation and independent tests. The performance evaluation results are provided in Table 4. Additionally, Fig. 3 highlights the performance comparison of PF_FS with the top-five feature descriptors having the highest cross-validation MCC (i.e., Pubchem, MACCS, FP4, Circle, and Hybrid).
It can clearly be seen that PF_FS attained better performance than the top-five feature descriptors on both the CGRP-TRN and CGRP-IND datasets.

Table 4 Cross-validation and independent test results of different feature descriptors.

Fig. 3 Performance comparison of our generated feature PF_FS and top-five molecular descriptors on the training (A–B) and independent test (C–D) datasets.

Furthermore, the highest MCC values of 0.792 and 0.824 were achieved using Circle and Hybrid on the CGRP-TRN dataset, respectively. This indicates that Circle and Hybrid were the most powerful molecular descriptors in the identification of CGRP inhibitors. For the convenience of our investigation, we compared PF_FS against these two molecular descriptors in terms of their predictive ability using the t-SNE approach (Fig. 4) and their overall performance (Table 4). From the comparison results summarized in Table 4, it is clear that the overall performance of PF_FS is better than that of all 12 conventional molecular descriptors on the CGRP-TRN dataset. Furthermore, in the case of the CGRP-IND dataset, the ACC, MCC, and F1 of PF_FS were 3.85–7.69, 6.12–15.03, and 5.13–8.63% higher than those of Circle and Hybrid, indicating the contribution of PF_FS to performance improvement. Similarly, in Fig. 4, the positive and negative classes (i.e., active and inactive compounds) are clearly identifiable as two distinct clusters using our generated PF_FS feature, compared to Circle and Hybrid, for both the CGRP-TRN and CGRP-IND datasets. This demonstrates that the PF_FS feature generated in this study provides more significant information for improving CGRP inhibitor identification.

Fig.
4 Visualization of t-SNE distribution for our generated feature PF_FS and two compared molecular descriptors (i.e., Hybrid and Circle) on the training (A–C) and independent test (D–F) datasets.

In the second comparative experiment, we compared the performance of MetaCGRP with the top-five baseline models having the highest cross-validation MCC (i.e., LR-Circle, LR-Hybrid, SVM-Circle, RF-Hybrid, and XGB-Hybrid). Based on the same CGRP-TRN and CGRP-IND datasets, Fig. 5 and Table 5 display the 10-fold cross-validation and independent test results of the proposed MetaCGRP and the top-five prediction models. It can clearly be seen that MetaCGRP attained better performance than the top-five baseline models as judged by four out of six metrics, including ACC, SN, F1, and MCC, on both the CGRP-TRN and CGRP-IND datasets. In the case of the CGRP-IND dataset, the ACC, SN, F1, and MCC of MetaCGRP were 2.56–7.69, 5.13–10.26, 3.08–8.11, and 4.65–15.47% higher than those of the top-five baseline models. In addition, to verify the stability of the proposed model, MetaCGRP was evaluated over 10 individual runs of cross-validation and independent testing on the corresponding training and independent test datasets. The detailed results are recorded in Supplementary Table S7. The average ± STD AUC scores of MetaCGRP were 0.952 ± 0.011 and 0.943 ± 0.016 over the training and independent test datasets, respectively. Taken together, these results reveal that MetaCGRP could efficiently attain more accurate and stable CGRP inhibitor identification in terms of both the 10-fold cross-validation and independent tests.

Fig.
5 Performance comparison of our proposed MetaCGRP and top-five baseline models on the training (A–B) and independent test (C–D) datasets.

Table 5 Performance comparison of MetaCGRP and top-five baseline models on the training and independent test datasets.

Feature importance analysis

In this section, we utilize the SHapley Additive exPlanations (SHAP) method76 to improve our understanding of MetaCGRP’s output and highlight the important features48,49,51,52. Features exhibiting a strongly positive value, indicated by the red color toward the positive x-axis, were considered highly influential in CGRP inhibition. As mentioned above, PF_FS was generated based on 19 selected baseline models (Fig. 6A-B and Supplementary Table S8). It could be observed that the top-five informative features consisted of XGB-Hybrid, XGB-Pubchem, RF-Circle, MLP-CKDExt, and DT-Pubchem. Based on their SHAP values, almost all top-five informative PFs, with the only exception of DT-Pubchem, exhibited a strongly positive value, highlighting that these features were considered highly influential in CGRP inhibition. In addition, we utilize the SHAP method to analyze the XGB-Pubchem output in order to gain a deeper insight into the specific substructural elements that may be responsible for potential inhibitory effects against CGRP. Figure 6C-D and Supplementary Table S9 show the top 20 important features. The details of these analyzed substructure fragments, along with their SMARTS patterns, are presented in Table 6.

Fig. 6 Feature importance from MetaCGRP (A-B) and XGB-PubChem (C-D) as ranked by SHAP values based on the training dataset. (A and C) Magnitude and direction of the contribution of each feature to the model prediction of CGRP inhibitors.
(B and D) Mean absolute SHAP values, where positive and negative SHAP values influence the predictions toward positive and negative samples, respectively.

Table 6 Summary of the top-twenty important features ranked by SHAP along with their corresponding substructure descriptions and SMARTS pattern.

It is noteworthy that among the top 20 important substructures, 16 exhibit positive and high feature values (indicated by red on the positive scale in Fig. 6D), thereby contributing significantly to CGRP inhibition. Particularly remarkable is that six out of these 16 features consist of nitrogen-containing substructures, including 2-(methylamino) acetaldehyde, methanediamine, hydrazine, methylamine, and 2-methylpropan-2-amine. The substantial contribution from nitrogen-containing compounds in the top features is unsurprising, given their wide utilization in medicinal chemistry77. Notably, among these, 2-(methylamino) acetaldehyde (Pubchem602), methanediamine (Pubchem375), and methylamine (Pubchem365) stand out. These amines are characterized by a nitrogen atom directly bonded to carbon atoms and serve as key precursor elements in the manufacturing of pharmaceutical compounds due to their ability to form hydrogen bonds and interact with receptor sites, which are essential properties in drug design. Moreover, studies have shown that 2-(methylamino) acetaldehyde belongs to the secondary amines and is associated with mu opioid antagonism (patent: US-7902221-B2), while methanediamine serves as a precursor for compounds exhibiting anti-cancer activities78. Additionally, nitrogen substructures play a significant role in CGRP inhibition, as evidenced by nitrogen-based heterocyclic substructures in FDA-approved (Supplementary Figure S4) and newly designed small molecule CGRP inhibitors79,80.
These drugs utilize nitrogen-containing structures to enhance their binding affinity and specificity for CGRP receptors, thereby inhibiting CGRP signaling and providing relief from migraine symptoms. The role of nitrogen in these compounds highlights the importance of such functional groups in the development of effective CGRP inhibitors.

Furthermore, other substructures that make substantial contributions include halogen-containing ones, such as 1-bromobutane, 2-chlorotoluene, and iodoethane, corresponding to Pubchem675, Pubchem759, and Pubchem350, respectively. In addition, alkanes (i.e., pentane and 2-methyl-heptane), acetaldehyde, 3-methylphenol, and p-xylene also play significant roles. Halogen motifs can influence the pharmacokinetic and pharmacodynamic properties of drugs, including enhanced binding affinity, metabolic stability, structure optimization, and bioisosteric replacement81,82,83. This influence is evident from Supplementary Figure S4, where three of the four FDA-approved drugs for CGRP contain a halogen motif that interacts with the CGRP receptor. Moreover, the most prominent feature was 1-bromobutane (Pubchem675), also known as butylbromide, which has been implicated in several studies regarding the inhibition of CGRP in irritable bowel syndrome84,85. However, no study using butylbromide has been conducted on the inhibition of CGRP in migraines. The incorporation of the aforementioned substructures into drug molecules can enhance various interactions, including hydrophobic, halogen bonding, and π-π stacking, all of which are critical for effective CGRP receptor binding and inhibition.

Furthermore, it is interesting to note that six of the 16 important features are natural products (Fig. 6C-D and Table 6). For example, Pubchem713, which pertains to p-xylene, is a natural product found in Basella alba (malabar spinach) and Helianthus tuberosus (Jerusalem artichoke), among others86.
Shanta et al.87 reported that the viscous liquid obtained from the leaves and tender stalks of Basella alba is a remedy for habitual headaches. Similarly, Sawicka et al.88 studied the beneficial effects of Helianthus tuberosus and found it can function as an anti-diabetic, anti-carcinogenic, anti-fungistatic, anti-constipation, metabolism-improving, and body mass-reducing agent. However, their effects on migraine have not been studied. Likewise, iodoethane (Pubchem350), pentane (Pubchem578 and Pubchem618), and methylamine (Pubchem365) are all derived from natural products found in Mastocarpus stellatus and Fucus vesiculosus, Calendula officinalis and Allium ampeloprasum, and Peucedanum palustre and Vitis vinifera, respectively, which show positive feature contributions towards active compounds. The products obtained from some of the above-mentioned species have been indicated as migraine relief medications in alternative treatments such as homeopathy, ayurveda and TCM, as they all exhibit anti-oxidant and anti-inflammatory properties89,90. However, relatively little research indicates the anti-migraine potential of these natural products91,92,93,94. Collectively, the important features identified by SHAP analysis can potentially serve as active substructures for CGRP inhibitors.

Discovery of new drugs to treat migraine through inhibition of CGRP using Thai Herbal Pharmacopoeia

In the preceding sections, we demonstrated that MetaCGRP outperformed several traditional ML classifiers. Therefore, in this section, we utilized MetaCGRP for virtual screening, employing data from the Thai herbal pharmacopoeia to identify potential natural compounds with activity against CGRP. In addition, we conducted molecular docking analyses to uncover how these compounds bind and to assess their binding affinities.
For this investigation, we utilized the crystal structure of CGRP in complex with an inhibitor as a reference (further details in Material and Methods). Table 7 provides a list of FDA-approved CGRP inhibitors and the top-five natural compounds, along with their probabilities, corresponding docking scores, and interaction residues. Figure 7 illustrates the protein structure of CGRP in its docked conformation, highlighting the interacting residues of the top-five compounds. The docking scores for the top-five compounds were as follows: -10.3, -10, -9.9, -9.7, and -9.7 kcal/mol, corresponding to clerosterol 3-glucoside, stigmasterol 3-glucoside, sennoside D, sennoside C, and stantalin A, respectively. These scores were comparable to those of two FDA-approved drugs, namely ubrogepant and rimegepant, with docking scores of -10.1 and -9.8 kcal/mol, respectively (Table 7).

Table 7 List of FDA-approved CGRP inhibitors and the top-five natural compounds, along with their probability, corresponding docking scores and interaction residues.

Fig. 7 Protein-ligand interactions of CGRP (PDB ID: 3N7S) with (A) the co-crystal ligand (olcegepant in yellow) where CLR and RAMP1 subunits are highlighted in cyan and magenta, respectively. Close-up views of the binding interaction of CGRP and (B) clerosterol 3-glucoside, (C) stigmasterol 3-glucoside, (D) sennoside D, (E) sennoside C, and (F) stantalin A.

The top-two compounds (i.e., clerosterol 3-glucoside and stigmasterol 3-glucoside) are natural products found in various plants, including bamboo and rice bran95. They belong to the class of phytosterols, which are sterol compounds derived from plants. Extensive studies have been conducted on these compounds for their prospective health-promoting benefits, such as their anti-cancer and anti-inflammatory properties96,97.
Hernández-Flores et al.98 studied whether phytosterol-derived ibuprofen-like substances retain analgesic effects while mitigating gastric side effects, in comparison with ibuprofen, and found that the phytosteryl ibuprofenates (including stigmasteryl (S)-ibuprofenate and cholesteryl (S)-ibuprofenate, among others) possessed comparable activity to ibuprofen at the same mg/kg doses, but without the associated gastric effects. This analgesic effect could be beneficial in migraine treatment. Moreover, they have shown promise in the treatment of cardiovascular diseases due to their cholesterol-lowering effects99,100. Of the top-two compounds, more literature is available on the benefits of stigmasterol, as it is among the most abundant sterols found in plants, with a major role in maintaining the structure and function of cell membranes101. In addition, stigmasterol was shown to exert anticonvulsant effects when assessed on various recombinant GABAA receptor subtypes102. Furthermore, a study by Parthasarathy et al. (https://doi.org/10.1016/j.fshw.2019.01.001) screened thirteen anti-migraine compounds from the leaves of Abrus precatorius (i.e., Indian liquorice), identified through GCMS analysis, against CGRP. Through molecular docking, the authors identified stigmasterol among the thirteen compounds as a potential inhibitor for migraine headache.

Upon examining the binding interactions between the top compound (i.e., clerosterol 3-glucoside (Fig. 7B)) and the ectodomain complex of the CGRP receptor, H-bonds were observed with residues ARG119CLR, THR120CLR and THR122CLR, along with π-sigma and π-alkyl interactions with residues ARG38CLR, TRP72CLR, and TRP74RAMP1. Additionally, hydrophobic interactions involved ILE41CLR, MET42CLR, ARG67RAMP1, ALA70RAMP1, ASP70CLR, ASP71RAMP1, GLY71CLR, PHE83RAMP1, TRP84RAMP1, PRO85RAMP1, LYS103CLR, TRP121CLR, and TYR124CLR.
These interactions are consistent with those observed in FDA-approved CGRP antagonists (Supplementary Figure S4). Similarly, stigmasterol 3-glucoside forms H-bonds with ARG119CLR, THR120CLR, and LYS103CLR, along with π-sigma (TRP74RAMP1) and π-alkyl interactions (LEU34CLR, ARG38CLR, ILE41CLR, and TRP72CLR), consistent with clerosterol 3-glucoside. Moreover, some studies have suggested a correlation between high cholesterol levels and migraine103; thus, the cholesterol-lowering effects of these compounds derived from natural products may have potential applications in migraine treatment.

Similarly, the other top compounds also demonstrated H-bonds and hydrophobic interactions with the residues mentioned above, albeit in different capacities (Fig. 7C-F; Table 7). Taking it a step further, we calculated the physicochemical properties of the top-five compounds and the FDA-approved drugs, presented in Supplementary Table S10. Clerosterol 3-glucoside, stigmasterol 3-glucoside and stantalin A demonstrated the closest adherence to the Ro5 and Veber’s rule for orally active drugs. However, sennoside D and sennoside C showed violations for all criteria except nRotB, indicating that they are not suitable for further development. Therefore, clerosterol 3-glucoside and stigmasterol 3-glucoside show the greatest promise and can be explored in future research endeavors.
