An artificial intelligence accelerated virtual screening platform for drug discovery

Computational methodsThe computational methods are organized into three primary sections. The first section focuses on the development of the Rosetta general forcefield for virtual screening (RosettaGenFF-VS) and the Rosetta virtual screening protocol (RosettaVS). The second section presents the benchmarks of RosettaGenFF-VS and RosettaVS. The final section provides detailed information about the AI-accelerated virtual screening protocol.Development of RosettaGenFF-VS and RosettaVSIn this section, we present detailed information about the developments of the entropy models used to augment Rosetta’s general forcefield to enable the ranking of different ligands binding to the same target and the Rosetta virtual screening protocols we developed within Rosetta GALigandDock.Entropy estimationIn a virtual screening task, the entropic contribution to binding free energy caused by “freezing” ligand torsions and rigid-body DOFs upon binding is approximated. The contribution from the receptor is not considered. The entropy change upon binding takes the following form:$$\Delta S={w}_{1}\Delta {S}_{{rb}}+{w}_{2}\Delta {S}_{{torsion}}$$
(1)
The entropy change from ligand rigid body degrees of freedom (∆Srb) is approximated as a function of molecular mass (m)52. The entropy change from ligand torsions (∆Storsion) is directly estimated from its entropy in the unbound state, assuming that absolute entropy in the bound state is negligible. To estimate the probability distribution of each rotatable torsion value at an unbound state, conformations from every 3000 steps of ligand-only Monte Carlo (MC) simulation using RosettaGenFF at room temperature (300 K) are collected. Then the probability distribution is converted into entropy as:$${\Delta S}_{{torsion}}={RT} {\sum}_{i}{\sum}_{j}(-{p}_{{ij}}\log {p}_{{ij}})$$
(2)
where R is the gas constant, T is 300 K, and pij is the probability of a given torsion angle i being in bin j (with 60° bin size). Because we assume ligand torsions are independent, the net entropy loss from ligand torsions is simply the sum of these factors of overall rotatable torsions. This scheme effectively captures the pre-organization effects of ligand torsion angles at their unbound state, therefore addressing shortcomings in simpler algorithms that treat these angles as fully free when unbound. The optimal weights for ∆Srb and ∆Storsion were obtained using a grid search to maximize the AUROC of a non-overlapping subset of DUD-E set53 (see SI, ‘Subset of DUD-E’ for more details). The weights (w1, w2) considered in this search were {(0,0), (0,1), (1,0), (1,1), (0,2), (2,0), (2,1), (1,2), (2,2)}. The optimal weights obtained (2.0 and 1.0 for rigid-body and torsion angles, respectively) did not vary much from a naive guess of uniform weights (1.0 for both). Compared to the uncorrected results, including entropy estimation improved the AUROC metric by 3% and had a negligible effect on enrichment-based metrics. This is the default entropy estimation method, thus we termed it “Default Entropy”.A second entropy estimation method was also implemented. This method, which we have named “Simple Entropy”, assumes a simpler formulation compared to the first method. Instead of utilizing MC simulation, we used the number of rotational bonds to estimate the torsional entropy of the unbound small molecule. Although this method overlooks the pre-organization effect of the ligand in its unbound state, we observed comparable benchmark results using this simpler approach. The equation for this simpler approach is as follows:$$\Delta S={w}_{1}\Delta {S}_{{rb}}+{w}_{2}\Delta {S}_{{torsion}}={w}_{1}\log (m)+{w}_{2}{n}_{{rotor}}$$
(3)
The optimal weights were determined using a grid search ranging from 0.0 to 3.0, with a step size of 0.1. This was done to maximize the correlation between the predictions and experimental binding affinities derived from a dataset curated from the PDBbind54 refine dataset. This curated dataset termed the “PDBbind-refine-no-metal” dataset, was prepared by excluding cases that had metal ions present in the complex. To prevent data leakage, any case in CASF2016 was also removed from the “PDBbind-refine-no-metal” dataset. The optimal weights were found to be w1 = 0.0 and w2 = 0.4, indicating that molecular weight does not contribute to the entropy estimation. Therefore, the final formulation of the “Simple Entropy” is $\Delta S=0.4*{n}_{{rotor}}$.Binding affinity estimationBinding affinity is estimated by adding together the enthalpic and entropic contributions. The enthalpy changes (∆H) upon binding is estimated using the following equation:$$\Delta H={E}_{{complex}}-{E}_{{protein}}-{E}_{{ligand}}$$
(4)
where Ecomplex, Eprotein, Eligand are the Rosetta energy of the complex, the protein, and the ligand, respectively, from the Rosetta general force field (RosettaGenFF). RosettaGenFF combines the Rosetta protein energy model55,56,57 and generic energy terms for non-protein molecules, i.e., Lennard-Jones, Coulomb, hydrogen-bond, implicit-solvation and generic torsion energy. A complete description of RosettaGenFF can be found in ref. 23 ∆H is essentially the interaction energy from RosettaGenFF between the protein and the ligand. The predicted binding affinity is then calculated using the equation: $\Delta G={dH}-{TdS}=\Delta H+\Delta S$, where temperature T is implicitly included in ∆S estimation described above. This estimated binding affinity is used for ranking different ligands that bind to the same target. It’s worth noting that the scale of the estimated binding affinity is arbitrary, and it doesn’t directly correspond to the experimental binding affinity. We have named this entropy-augmented force field as the Rosetta general force field for virtual screening (RosettaGenFF-VS).Rosetta virtual screening (RosettaVS) evaluation modeThe evaluation mode in RosettaVS was developed for binding affinity estimation of complex structures. It offers several options for estimating the binding affinity. It can estimate the binding affinity for the provided structure, perform local minimization of the ligand alone within the pocket, or carry out local minimization of the ligand and a set of pocket residues within a certain cutoff (default is 4.5 Å) of the ligand. The minimization can be performed with or without coordinate constraints. These options allow for a more accurate and tailored approach to estimate binding affinities without docking the ligands.Rosetta virtual screening (RosettaVS) screening modeRosettaVS leverages the “run mode” in our previously developed GALigandDock for various tailored tasks. In this work, we have refined and introduced two run modes for fast ligand docking with binding affinity estimation using RosettaGenFF-VS. These are the Virtual Screening eXpress (VSX) and Virtual Screening High Precision (VSH). The VSX mode treats protein side chains as rigid and executes five iterations with a gene pool size of 50. This mode is designed for quick and efficient screening. On the other hand, the VSH mode allows for flexible pocket residue side chains during docking and conducts a two-stage conformational search on the precomputed energy grid. The first stage up-weighs the coulombic interactions threefold and runs for five iterations with a gene pool size of 100. The second stage uses the default weight for coulombic interactions and also runs for five iterations with a gene pool size of 100. In terms of runtime, the VSX mode takes ~ 90 to 150 seconds to screen a ligand, while the VSH mode runs about six times slower. Despite the difference in speed, both modes support docking multiple ligands in a single batch. This feature significantly reduces the load on the file system by reading in a single input file for multiple ligands and outputting a single output file for multiple ligands. The choice of the batch size can be arbitrary to best utilize the local computing cluster. For instance, we used a batch size of 50 for the VSX mode and 5 for the VSH in this study. Detailed settings for running VSX and VSH are provided in the supplementary information in RosettaScripts58 XML format.Improvements to the forcefield and docking protocolsWe made several general enhancements to the Rosetta General Forcefield (RosettaGenFF) and docking protocols to improve performance and accuracy. New atom types and torsion types were incorporated into RosettaGenFF to enable better modeling of three and four-membered rings. We also fixed a few issues with the torsional potential that arose from incorrect definitions of some torsion types. In addition, we corrected an error that occurred during the optimization of the tautomer state of histidine in ligand docking. Furthermore, we optimized the small molecule preprocessing scripts to address known issues and enhance their robustness. This included accurately handling the atom typing of aromatic ring nitrogens, protonated nitrogens, and oxime oxygen atoms, as well as dealing with molecules with collinear structures and others.Score function and virtual screening benchmarksCASF2016 benchmarksProtein structures from CASF2016 were preprocessed by relaxing the entire structure with coordinate constraints using the Rosetta FastRelax protocol59,60,61. The ligands in the CASF2016 dataset were processed using AmberTools 23.0’s antechamber62 to generate Mol2 files with am1bcc partial charges. These Mol2 files were then converted into Rosetta params files using the mol2genparams.py script in Rosetta. Since there was no sampling process for this dataset, we used the RosettaVS evaluation mode to assess the score function performance of RosettaGenFF-VS. For the scoring power test, ligands and pocket side chains were minimized with coordinate constraints. The binding affinities reported in Supplementary Fig. 2 were estimated using the locally optimized structure with two different entropy models, namely “Simple entropy” and “Default entropy”. The results showed that the RosettaGenFF-VS with the “Simple entropy” model was the leading physics-based score function for binding affinity prediction. It’s worth noting that ∆VinaRF20 used machine learning descriptors. For the docking power test, small molecule decoys were minimized within the pocket using coordinate constraints. The scores from the locally optimized structure were then used to calculate docking success. We reported the performance in Fig. 1d and Supplementary Figs. 3, 7 without specifying the entropy model since the docking power of RosettaGenFF-VS doesn’t depend on the choice of the entropy model. As shown in Fig. 1d and Supplementary Fig. 3, RosettaGenFF-VS achieved leading performance in ligand docking accuracy. We then further examined binding funnels following the analysis in ref. 28. The purpose of the binding funnel analysis was to demonstrate the quality of the funnel-like shape that forms around the lowest energy minimum. Unlike docking accuracy, binding funnel analysis measures the efficiency of the energy potential in driving the conformational sampling toward the lowest energy minimum. As depicted in Supplementary Fig. 7, RosettaGenFF exhibited superior binding funnels across a broad range of ligand RMSDs, suggesting a more efficient search for the lowest energy minimum compared to other methods. For the screening power test, decoys were allowed to minimize freely within the pocket, mirroring real-world virtual screening settings. The predicted binding affinities from this process were used to calculate both the enrichment factor and success rate. The results (Fig. 1e, f and Supplementary Figs. 4, 5) showed that RosettaGenFF-VS achieved state-of-the-art performance on the screening power test. To further examine the improved performance on the screening power test, we analyzed the binding affinity prediction models with different entropy estimations on three subsets based on the excluded volume inside the binding pocket upon binding (∆VOL), buried percentage of the solvent-accessible area of the ligand upon binding (∆SAS), and hydrophobic scale of the binding pocket (H-scale). The result in Supplementary Fig. 8 indicated that both the binding affinity methods performed equally well or better on almost every subset compared to other methods. Furthermore, our method showed notable improvements on target proteins with more polar (subset H1), shallower (subset S1), and smaller (subset V1) pockets where other methods generally underperformed. In Fig. 1e and f, we reported the best performance from RosettaGenFF-VS-Simple as RosettaGenFF-VS for simplicity. Although our primary comparison is with other well-established physics-based scoring functions in this work, to provide a more comprehensive evaluation of the model’s performance, we have compared our model to several state-of-the-art deep-learning based scoring functions (see Supplementary Data 1), where RosettaGenFF-VS fares favorably.DUD benchmarksThe Directory of Useful Decoys (DUD) dataset was downloaded from https://dud.docking.org/https://dud.docking.org/. Receptor structures were prepared by replacing any non-canonical amino acid in the provided protein pdb file with its corresponding canonical form, followed by a coordinate constraint relax using the Rosetta FastRelax protocol. The input complex structures for RosettaVS were prepared by randomly placing a ligand molecule inside the binding pocket in each receptor to indicate the position of the binding pocket. Small molecule params files were converted from mol2 files provided by DUD. For Example, the Rosetta scripts XML file and Rosetta command line can be found in the supplemental information. The results (Fig. 1b,c) showed that both VSX and VSH modes in RosettaVS achieved superior performance compared to other virtual screening methods. On average, the VSH mode showed a slight improvement in performance compared to the VSX mode. We examined the cases where the VSH mode improved performance and presented two examples in Supplementary Fig. 10, highlighting the importance of modeling flexible sidechains in ligand docking.AI accelerated virtual screening protocolReceptor preprocessingFor KLHDC2 virtual screening, the crystal structure of KLHDC2 in complex with SelK degron peptide (PDB: 6DO3) was downloaded from the Protein Data Bank63. We removed all solvent molecules and retained one copy of the degron-bound monomer structure (chains A and B). This monomer complex was relaxed using the Rosetta FastRelax protocol with coordinate constraint. As a final step, we replaced the C-end degron peptide in the relaxed structure with a random small molecule as an indication of the binding site. This modified structure was then used as the input to RosettaVS. For Nav1.7 virtual screening, the cryoEM structure of human NaV1.7 – NavAb channel chimera (PDB: 5EK0)26 was downloaded from the Protein Data Bank. After removing all solvent molecules, we retained a region of the human NaV1.7 – NavAb channel chimera VSD4 (from M1493 to P1617) that directly interacts with the ligand GX-936 and used it as the receptor structure. This receptor structure was relaxed with the ligand GX-936 using the Rosetta FastRelax protocol with coordinate constraints. This resulting relaxed structure was then used as input to RosettaVS. It’s important to note that GX-936 was only used to indicate the binding site for RosettaVS.Small molecule preprocessing and library preparationRosettaVS ligand docking requires Rosetta params files as input for small molecules and small molecule mol2 format is required for the generation of the params files. The Enamine REAL 2022-q1 library, which contains ~ 5.5 billion SMILES, was downloaded from Enamine (https://enamine.net/library-synthesis/real-compounds). Using dimorphite-dl64, SMILES strings were assigned a protonation state at PH = 7.4. These properly protonated smile strings were then converted to mol2 files with molecular 3D structure and MMFF94 partial charges using OpenBabel65. The ZINC22 library, which contains approximately 4.1 billion ready-to-dock small molecule mol2 files, was downloaded from CartBlanche22 web server (https://cartblanche.docking.org). The mol2genparams.py script in Rosetta was used to convert the mol2 files into Rosetta params files for use as input in RosettaVS. To enhance the efficiency of inference on the entire library, we pre-generated fingerprints for the entire collection of the molecules, preparing them for input to the deep learning models. Utilizing RDKit66, we generated 1024-bit Morgan fingerprints67 with a radius of 2.Active learning modelWe implemented an active learning model using a simple fingerprint-based feed-forward neural network (FFN) for training target-specific classification models of binders versus non-binders. The choice of the FFN was made to reduce the cost of the surrogate model, and the model architecture and hyperparameter search are subject to future improvements. The model takes a 1024-bit fingerprint vector as input and outputs a single value representing the probability of a molecule being a binder. The FFN model consists of two densely connected hidden layers, each containing 3000 nodes. These layers are followed by batch normalization and a dropout rate of 0.5 to prevent overfitting. The final layer is linear and is followed by a sigmoid activation layer, which compresses the output values between 0 and 1, thereby representing the probability of a molecule being a binder. With precomputed fingerprints, the inference of one million molecules using this model will take, on average, around 110 s using a single CPU (Intel Xeon E5-2695 v3 @ 2.30 GHz), or around 11 s on average using an RTX2080 GPU. The model performance was monitored throughout the active learning process, and the final model for KLHDC2 has an AUC of 0.886, and the final model for the Nav1.7 target has an AUC of 0.927 on an independent test set, which shows the final ML model is indeed a good binary classifier for each target.OpenVS platformWe developed an open-source platform, OpenVS, to streamline the entire AI-accelerated virtual screening process. A key feature of this platform was its ability to enable the parallelization of the virtual screen and reduce the load on the file system. This was achieved by batching multiple ligands into a single virtual screen job through the multi-ligand docking feature in the RosettaVS protocol. By docking N ligands in a single job, the number of input and output files was effectively reduced by a factor of N, significantly decreasing the load on the file system. In addition, batching multiple ligands together also reduced the input and output (I/O) load on the system since fewer jobs were required to run in parallel. The platform used the SLURM workload manager (https://slurm.schedmd.com) and GNU parallel68 for high parallelization of virtual screens, exhibiting linear scaling of virtual screening time relative to the number of CPUs and nodes utilized. Furthermore, this platform could be easily adapted to support other job schedulers, making it a versatile tool for various computational environments.Workflow of AI accelerated virtual screeningIn this work, we utilized active learning techniques to guide the exploration of the vast chemical space. The greedy strategy was used to select new compounds for each iteration to augment the training dataset without using any explicit uncertainty information to reduce the computational cost and inference time. In principle, uncertainty estimation can be obtained using Monte Carlo (MC) dropout69 by running model inference multiple times with activated dropout layers within the current framework. The concrete workflow is described as follows. Our first step was to create a specialized subset, referred to as the druglike-centroid library, from the ZINC1570 3d druglike database containing ~ 493.6 million molecules. The creation of this subset involved clustering similar molecules from the ZINC15 3D druglike database, using a cutoff of 0.6 Tanimoto similarity. From each cluster, the smallest molecule was selected and added to the library, serving as the centroid of the cluster. This process resulted in the formation of the druglike-centroid library, which includes around 13 million molecules. The purpose of creating the druglike-centroid library was to ensure that the model was exposed to a wide range of chemical space during the initial iteration. For the first iteration, 0.5 million and 1 million molecules were randomly selected as the training and testing datasets, respectively, from the druglike-centroid library. We used the VSX mode in RosettaVS to dock these molecules to the target pocket. Morgan fingerprints using a radius of 2 and 1024 bits were generated for all the molecules as the input to the deep learning model. We selected a predicted binding affinity (dG) cutoff corresponding to the top N% of the molecules in the testing set, for each iteration. This cutoff was used to assign binders (<= dG_cutoff) or non-binders (> dG_cutoff) labels to the docked molecules. We used ten log spaces between 10% to 0.01% to set the top N% for each iteration. For example, the first iteration had N% = 10%, the second iteration had N% = 4.64%, and the tenth iteration had N% = 0.01%. The model was trained as a target-specific binary classifier to predict whether a molecule is a binder or not. We computed the cross-entropy loss on the testing set for each epoch and used early stopping to prevent the model from overfitting to the training set. We used the model to predict the entire Enamine REAL 2022-q1 library (~ 5.5 billion compounds) or ZINC22 library (~ 4.1 billion compounds). From these predictions, we selected the top-ranked 0.25 million and an additional 0.25 million randomly selected molecules for the next iteration docking. In the next iteration, the newly selected 0.5 million molecules were docked to the pocket using VSX mode. These 0.5 million molecules, combined with previous molecules in the training set, were used to train a new model. A tighter dG cutoff was selected corresponding to the current iteration’s top N% in the testing dataset. After training a new model, we used it to predict the entire library again and selected another 0.5 million molecules for the next iteration. We repeated this process until the maximum number of iterations (ten iterations) was reached or the predicted binding affinities of top-ranked molecules converged. The top-ranked 50,000 or 100,000 molecules from the virtual screening were re-docked using VSH mode to account for the flexibility of the receptor. (See Supplementary Fig. 1 for the flowchart of this protocol.)Filters for selecting promising compoundsWe computed the log octanol/water value of the partition coefficient of the compound (cLogP), the number of unsatisfied hydrogen bonds (Nunsats) at the interface between the protein and ligand, and the number of torsion angle outliers (N_unusual_torsion) from CSD torsion geometry analysis using RDKit, Rosetta71, and CSD72 python package respectively. For the KLHDC2 virtual screen, docked poses with cLogP > 3.5 Nunsats > 1 were discarded. Similarly, for the NaV1.7 virtual screen, docked poses with cLogP > 3.5, Nunsats > 1, and N_unusual_torsion > 1 were discarded. These filters helped us in reducing the false positives and refine our selection of compounds for final experimental validation.Filtering and clusteringTo reduce the redundancy of the molecules selected for experimental validation, we clustered the top-ranking molecules from both screens based on Tanimoto similarity. For the KLHDC2 virtual screen, we took the top-ranked 1000 compounds and filtered out compounds with low predicted solubility (removing 93 compounds) and unsatisfied hydrogen bonds in the bound conformation (removing 754 compounds), yielding 153 compounds. We then applied clustering with a Tanimoto similarity cutoff of 0.6 to remove similar molecules, which resulted in 54 clusters, each represented by the member with the best binding affinity. These 54 cluster representative molecules were then subjected to manual inspections. For the NaV1.7 screen, we clustered the top 100,000 molecules with a cutoff of 0.6 Tanimoto similarity. This resulted in 16820 clusters, each represented by the member with the best predicted binding affinity. The top 1000 ranked cluster representative molecules were subjected to the filtering process where we removed molecules with low predicted solubility (removing 183 compounds), unsatisfied hydrogen bonds in the bound conformation (removing 139 compounds), and torsion angle outliers (removing 520 compounds) in Cambridge Structural Database (CSD)72. A total of 160 molecules passed the filtering were examined manually. The filters for unsatisfied hydrogen bonds and torsion angle outliers are the two that removed the most molecules. Although the current energy model achieves state-of-the-art performance, our filters suggest that these conformations are ‘under penalized’ by the current energy model, indicating an opportunity for improvement in the scoring function.Rescreening against KLHDC2To further validate the reliability of RosettaVS for hit discovery, we performed a rescreening experiment against the KLHDC2 target using the Enamine Screening library. This library contains around 4.1 million compounds, which are the first samples of billions of synthesizable compounds. Our confirmed best-hit compound, C2.8, was included in this screening library. We were able to rediscover compound C2.8 from this experiment. In fact, it was among the top 1000 compounds, which is within 0.024% of the 4.1 million compounds, before any filtering. After applying the buried unsatisfied hydrogen bond filter, we removed any docked complex that had more than one buried polar atom. Following this, the hit compound C2.8 was among the top 50 compounds. The success of this rescreening experiment shows our virtual screening protocol is both effective and reliable for hit discovery.Common chemical propertiesWe have reported several chemical properties that are important in small molecule drug discovery, including the quantitative estimate of drug-likeness (QED)73, the calculated octanol-water partition coefficient (cLogP)74, and synthetic accessibility (SA)75. As shown in Supplementary Table 2, the compounds ordered for both targets have an average QED above 0.7, indicating a high degree of drug-likeness. The cLogP values of the compounds fall within a reasonable range, as it is used as a filter to remove compounds that are excessively hydrophobic (see ‘Filters for selecting promising compounds’). The average SA is low, as expected because all the compounds from Enamine REAL or ZINC22 are highly synthesizable.Experimental methods for KLHDC2Molecular biology and protein purificationFor DNA extraction, E.coli DH5α was grown for 16 hr at 37 °C. For bacmid production, E.coli DH10Bac was grown for 16 hr at 37 °C. For baculovirus production and amplification, Sf9 (LifeTechnologies, B82501) insect cells were grown for 2–3 days at 26 °C. For protein expression, both E.coli BL21(DE3) (grown for 16 hr at 18 °C) and HighFive insect cells (grown for 2–3 at 26 °C, 105 RPM) were used. LB Broth Miller (Fisher BioReagents) was used for E.coli. Sf9 insect cells were maintained in Grace’s Insect Medium (Gibco) supplemented with 7% FBS (Gibco) and 1% Penicillin-Streptomycin (HyClone) solution. Suspension HighFive(LifeTechnologies, B85502) cells were grown in EXPRESSTM FIVE SFM(Gibco) supplemented with 5% L-Glutamine 200 mM (HyClone) and 1% Penicillin-Streptomycin (HyClone) solution. Tissue culture media and supplements were from GIBCO Life Technologies (Carlsbad, CA, USA). Cells have been authenticated by the vendors. No further authentication was performed for commercially available cell lines.The kelch repeat domain of human KLHDC2 (UniProt: Q9Y2U9, amino acid 1–362) was subcloned into the pFastBac vector with an N-terminally fused glutathione-S-transferase (GST), and a TEV-cleavage site. A recombinant baculovirus was produced and amplified three times in Sf9 monolayer cells to produce P4. The P4 virus was used to infect HighFive suspension insect cell cultures to produce the recombinant GST-KLHDC2 protein. The cells were harvested 2–3 days post-infection, re-suspended, and lysed in lysis buffer (20 mM Tris, pH 8.0, 200 mM NaCl, 5 mM DTT) in the presence of protease inhibitors (1 μg/ml Leupeptin, 1 μg/ml Pepstatin and 100 μM PMSF) using a microfluidizer. The GST-KLHDC2 protein was isolated from the soluble cell lysate by PierceTM Glutathione Agarose (Thermo Scientific). For AlphaLISA competition assays, the GST-tagged KLHDC2 was further purified by Q Sepharose High-Performance resin (GE Healthcare). The NaCl eluates were subjected to a Superdex-200 size exclusion chromatography column (GE Healthcare). All samples were flash-frozen in liquid nitrogen for storage prior to use. For protein crystallization, the kelch repeat domain of human KLHDC2 (amino acids 22–362) was subcloned into the pET vector with an N-terminally fused His-elongation factor Ts (TSF) and a TEV-cleavage site. The His-TSF-KLHDC2 protein was overexpressed and purified from BL21 (DE3) E. coli cells. Bacterial cells transformed with the pET-based expression plasmid were grown in LB broth to an OD600 of 0.8–1 and induced with 0.5 mM IPTG. Cells were harvested, re-suspended, and lysed in lysis buffer (20 mM Tris, pH 8.0, 200 mM NaCl, 20 mM imidazole) in the presence of protease inhibitors (1 μg mL–1 leupeptin, 1 μg mL–1 pepstatin and 100 μM phenylmethylsulfonyl fluoride) using a microfluidizer. The His-TSF-KLHDC2 protein was isolated from the soluble cell lysate by HisPurTM Ni-NTA Superflow Agarose (Thermo Fisher Scientific, Waltham, Massachusetts). After TEV cleavage of the His-TSF, KLHDC2 was further purified by Mono Q™ 5/50 GL (GE Healthcare, Chicago, Illinois). The NaCl eluates were subjected to Superdex-200 size-exclusion chromatography (GE Healthcare). Native mass spectrometry was used to confirm KLHDC2 in its apo form. The samples were concentrated by ultrafiltration to 10–22 mg mL−1. All samples were flash-frozen in liquid nitrogen for storage prior to use.Protein crystallizationThe crystals of KLHDC2 in its apo form were grown at 25 °C by the hanging-drop vapor diffusion method with 2 parts protein sample to 1 part reservoir solution containing 0.03 M MgCl2*6H2O, 0.03 M CaCl2*2H2O, 10% (w/v) PEG 20000, 20% (v/v) PEG MME, 0.1 M Tris (base)/ bicine pH 8.5. Crystals of maximal sizes were obtained and harvested after a few days. Cryoprotection was provided by the crystallization condition.Data collection and structure determinationAfter collecting native datasets at Advanced Light Source Beamline 8.2.1 (Data collection: exposure − 0.4 sec, energy − 12397.1 keV, wavelength − 1 Å, temperature − 100 K), X-ray diffraction data was automatically processed using xia2 to run DIALS 3.8. The structures were solved by molecular replacement using the kelch domain of KLHDC2 (PDB:6DO3) with Phaser from the PHENIX suite of programs package version 1.20.1. All structural models were manually built, refined, and rebuilt with COOT76 version 0.9.8.91 and PHENIX77 version 1.20.1. PyMOL34 version 2.5.5, and LIGPLOT78 version 2.0 were used to generate figures. After refinement, the Ramachandran statistics are as follows : Ramachandran favored – 97.62 %, Ramachandran allowed – 2.38 %, Ramachandran outliers – 0.00 %, Rotamer outliers – 0.36 %.AlphaLISA luminescence proximity assayAlphaLISA assays for determining and measuring protein-protein interactions were performed using an EnSpire reader (PerkinElmer). GST-tagged KLHDC2 was attached to anti-GST AlphaLISA acceptor beads. Synthetic N-terminal biotinylated 12 aa SelK degron peptide ([Biotin]HLRGSPPPMAGG[C) (Bio-Synthesis, Inc.) was immobilized to streptavidin-coated AlphaLISA donor beads. The donor and acceptor beads were brought into proximity by the interactions between the SelK peptide and KLHDC2. Excitation of the donor beads by a laser beam of 680 nm promotes the formation of singlet oxygen. When an acceptor bead is in close proximity, the singlet oxygen reacts with thioxene derivatives in the acceptor beads and causes the emission of 615 nm photons, which are detected as the binding signal. If the beads are not in close proximity to each other, the oxygen will return to its ground state, and the acceptor beads will not emit light. Competition assays were performed in the presence of numerous compounds, which were titrated at various concentrations. The experiments were conducted with 3.83 nM of GST-KLHDC2 and 5.55 nM biotinylated 12 aa SelK peptide in the presence of 5 μg/ml donor and acceptor beads in a buffer of 25 mM HEPES, pH 7.5, 100 mM NaCl, 1 mM TCEP, 0.1% Tween-20, and 0.05 mg/ml Bovine Serum Albumin. The compound concentrations used in competition assays ranged from 15 nM to 1.5 mM. The experiments were done in triplicates. IC50 values were determined using non-linear curve fitting of the dose-response curves generated with Prism 8 (GraphPad).Octet Bio-Layer Interferometry measurementOctet BioLayer interferometry competition assay monitoring the ability of C29 to interfere with the binding between the biotinylated 12 aa SelK peptide and GST-KLHDC2 was monitored using the Octet Red 96 (ForteBio, Pall Life Sciences) following the manufacturer’s procedures. The reaction was carried out in black 96 well plates maintained at 30 °C. The reaction volume was 200 μL in each well. The Octet buffer used throughout the experiment contained 20 mM Tris-HCl, 200 mM NaCl, 5 mM DTT and 0.1% BSA, pH 8.0. The Loading buffer contained the Octet buffer and biotinylated 12 aa SelK peptide, at a final concentration of 200 nM. The Quench buffer contained the Octet buffer and biocytin, at a final concentration of 0.1 mM. The Association buffer contained the Octet buffer, GST-KLHDC2, at a final concentration of 50 nM, and the competitor small molecule – C29 at various final concentrations (30 µM, 10 µM, 3.33 µM, 1.11 µM or 0 µM). Prior to binding analysis, the streptavidin coated optical probes were incubated in Octet buffer for 60 sec, loaded for 96 sec with biotinylated 12 aa SelK peptide (Loading buffer), quenched for 60 s with biocytin (Quench buffer) and baselined in Octet buffer for 60 sec. The binding of the analyte GST-KLHDC2 in the presence or absence of the C29, to the optical probes was measured simultaneously using instrumental defaults for 225 sec as the probes were then incubated in Association buffer. The dissociation was measured for 600 sec while the probes were incubated in Octet buffer. While not loaded with ligand, the control probes were quenched. There was no binding of analyte and competitor to the unloaded probes. The data were analyzed by the Octet data analysis software version 9.0.. The association and dissociation curves were fitted with a Local Full fit, 1:1 ligand model. The data was plotted using Excel. The IC50 value was calculated using Prism 8 (GraphPad) from the nonlinear regression curve fit of the response values vs. log competitor concentration.Experimental methods for NaV1.7Cell cultures and electrophysiologyHEK-293 cells stably expressing human NaV1.5 and NaV1.7 were obtained from Dr. Chris Lossin (UC Davis, CA). HEK293 cells stably expressing HERG (hKv11.1) were a gift from Craig January (University of Wisconsin, Madison). Cells were cultured in complete DMEM supplemented with 10% FBS, 1% penicillin/streptomycin, and G418. All manual whole-cell patch-clamp recordings were performed at room temperature (22–24 °C) using an EPC-10 amplifier (HEKA Electronik, Lambrecht/Pfalz, Germany) on cells that were grown to 60–80% confluency, lifted with TrypLE, and plated onto poly-l-lysine–coated coverslips. External bath solution contains (in mM) 160 NaCl, 4.5 KCl, 2 CaCl2, 1 MgCl2, 10 HEPES (pH 7.4 and 305 mOsm) as. Patch pipettes were heat-pulled from soda lime glass (micro-hematocrit tubes, Kimble Chase, Rochester, NY) and had resistances of 2–3 MΩ when filled with an internal solution. Patch recordings of NaV1.5 and NaV1.7 channels were done using a cesium fluoride-based internal solution containing (in mM) 10 NaF, 110 CsF, 20 CsCl, 10 HEPES, 2 EGTA, (pH 7.4, 310 mOsm). Recordings of HERG channels were conducted using a potassium fluoride-based internal solution with added ATP (160 mM KF, 2 mM MgCl2, 10 mM EGTA, 10 mM HEPES, 4 mM NaATP, pH = 7.2 and 300–320 mOsm). Data acquisition and analysis were performed with Pulse-PulseFit (HEKA Electronik GmbH, Germany), IgorPro (WaveMetrics, Portland, OR), and Origin 9.0 software (OriginLab Corporation, Northampton, MA). To investigate specifically the inactivated state block of NaV1.7 and NaV1.5, holding potentials were set at the V1/2 of maximal inactivation (− 70 mV and − 80 mV, respectively) before voltage stepping to − 150 mV for 20 msec, and then 0 mV for 50 msec to elicit inward currents. Inhibition of currents in the resting state was recorded using a holding potential of − 120 mV. Control test currents were monitored for up to 5–10 min to ensure that the amplitude and kinetics of the response were stable. Series resistance was compensated to 80–90% and linear leak currents and capacitance artifacts were corrected using a P/4 subtraction method. The pulse interval was 0.1 Hz. For all experiments, stock solutions of the drugs were prepared fresh from 10 mM stocks in DMSO and test solutions prepared in external bath solutions were applied to individual cells into the recording bath. For measuring inhibition, currents were allowed to saturate with repeated pulsing before addition of subsequent doses. IC50 values were derived from measurements performed on individual cells that were tested with at least three or more concentrations of each peptide. For the HEK-239 HERG cells, a 2-step pulse (applied every 10 sec) from − 80 mV first to 40 mV for 2 sec and then to − 60 mV for 4 sec, was used to elicit HERG currents. The percent reduction of HERG tail current amplitude by the compounds was determined. The percent of a block of tail current amplitude by the drugs was determined and data are shown as mean +/− SD (n = 3–4 per data point).Compound synthesisChemical synthesis for all compounds in this work was performed by Enamine within 5 weeks with over 90% purity and used without further purification. Purity of compounds were assessed based on liquid chromatography–mass spectrometry (LC-MS). See Supplementary Data 2-3 for purity and LC-MS spectra.Reporting summaryFurther information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

An artificial intelligence accelerated virtual screening platform for drug discovery

Zero-shot transfer of protein sequence likelihood models to thermostability prediction

Poisoning medical knowledge using large language models

Semi-supervised recognition for artificial intelligence assisted pathology image diagnosis

Chemistry wordoku #062 | Puzzle

Multi-output prediction of dose–response curves enables drug repositioning and biomarker discovery

Hot Topics

Zero-shot transfer of protein sequence likelihood models to thermostability prediction

Poisoning medical knowledge using large language models

Semi-supervised recognition for artificial intelligence assisted pathology image diagnosis

Related Articles

Balancing Act: Pregnancy and Bipolar Disorder

Cohesion at the cellular level: flexible yet stable

Gut bacteria influence responses to immunotherapy in patients with asbestos related cancer

Quick Links

Must Read

Zero-shot transfer of protein sequence likelihood models to thermostability prediction

Poisoning medical knowledge using large language models

Semi-supervised recognition for artificial intelligence assisted pathology image diagnosis

Chemistry wordoku #062 | Puzzle

Popular Articles

Zero-shot transfer of protein sequence likelihood models to thermostability prediction

Poisoning medical knowledge using large language models

Semi-supervised recognition for artificial intelligence assisted pathology image diagnosis