Predicting transcriptional responses to novel chemical perturbations using deep generative model for drug discovery

PRnet: perturbation-conditioned deep generative model for transcriptional response predictionIn this paper, we formulate the transcriptional response prediction as a distribution generation problem conditioned on perturbations. Cells inherently recognize and respond to chemical stimuli, with their responses influenced by both external stimuli and their intracellular state. In the single cell or bulk HTS, transcriptional responses to chemical perturbations are affected by multiple conditions, such as compound structures, compound dosages, and covariates, such as cell types, and cell lines. Given an n-dimensional unperturbed transcriptional profile (a single cell or bulk RNA-seq observation) and a chemical perturbation imposed by a compound with a specific structure and dosage, PRnet aims to predict the distribution of the perturbed transcriptional profile. Modeling perturbation patterns enables capturing the novel chemical perturbation effect on gene-level programs and quantifying them in various perturbation scenarios (Fig. 1a).Fig. 1: Overview of the methodological approach.a Problem formulation: Given unperturbed transcriptional profiles (single cell or bulk) and applied perturbations (structures and the dosages of the compounds), predict transcriptional responses. Red arrows indicate changes in transcriptional profiles. b Model architecture: PRnet is a perturbation-conditioned deep generative model for transcriptional response prediction with three components, including the Perturb-adapter, the Perturb-encoder, and the Perturb-decoder. Crucially, the model operates as a data-driven model, allowing for effective generalization to novel perturbations. c Screening candidates with PRnet: The training compound library comprises a bulk high-throughput screening library (175,549 bioactive compounds), and a single-cell high-throughput screening library (188 active compounds). The screening library comprises four compound libraries, including 935 FDA-approved drugs, 4158 active compounds, 30,455 natural compounds, and 29,670 in-house druglike compounds, respectively. For in-silico screening, PRnet initially predicts the average transcriptional profile, fold-change in the gene, and gene rank after perturbing the specific cell line with screening compounds. Given the gene signature for a particular disease or sensitive drug, the enrichment scores of screening compounds were computed for compound ranking. In vitro validation experiments were conducted using MTT assays on colorectal cancer and small cell lung cancer cell lines to validate the activity of compound candidates. After in vitro validation, PRnet is utilized to recommend drug candidates for 233 different diseases. d The large-scale integration atlas of perturbation profiles, including the L1000 dataset, the Sci-plex3 dataset, the FDA-approved drugs dataset, the anti-cancer compounds dataset, the natural compounds dataset, the drug-like compounds dataset, and the Gtex dataset. Some icons were created in BioRender. Qi, X. (2024) biorender.com/l45e810.PRnet (as detailed in “Methods”) is a flexible and scalable perturbation-conditioned deep generative model designed to predict transcriptional responses to novel complex chemical perturbations at bulk and single-cell levels. The design of PRnet comprises three components, including the Perturb-adapter, the Perturb-encoder, and the Perturb-decoder (Fig. 1b). In preprocessing, each perturbed profile is assigned an unperturbed transcriptional profile of the same cell line. Initially, given a chemical perturbation imposed by the structure of compounds represented by Simplified Molecular Input Line Entry System30 (SMILES) strings and the dosages of the corresponding compounds, PRnet leverages RDKit31 to capture the functional topology information of the structures, generate the Functional-Class Fingerprints (FCFP) of compounds. These FCFPs were scaled by their dosages and then summed to generate rescaled Functional-Class Fingerprints (rFCFP) embedding. Without loss of generality, for the i-th compound perturbation, the Perturb-adapter encodes the fingerprint rFCFPi to an additive latent embedding \({z}_{i}^{p}\) which allows generalization to novel compounds and compound combinations. Then, the Perturb-encoder maps the chemical perturbation effect on heterogeneous unperturbed states \({x}_{i}^{u}\) into the interpretable latent space \({z}_{i}^{l}\). At last, the Perturb-decoder estimates the distribution of the transcriptional response \({{{\mathcal{N}}}}({x}_{i}| {\mu }_{i},{\sigma }_{i}^{2})\) within the context of the chemical perturbation effect on the unperturbed state \({z}_{i}^{l}\), the applied perturbation \({z}_{i}^{p}\) and a noise \({z}_{i}^{n}\). PRnet encodes the chemical effect on the unperturbed state to a learnable latent space, estimates the distribution, and performs conditioned sampling to generate a transcriptional response with biological and chemical contexts. Sampling generates a specific transcriptional profile \({\widehat{x}}_{i}\) that provides gene-level up- and down-regulation information. For bulk HTS data, the predicted transcriptional responses of 978 landmark genes \({\widehat{x}}_{i}\) are transformed into 12,328 genes by linear transformation. For single-cell HTS data, 5000 highly variable genes (HVGs) of transcriptional profile are selected. SMILES30 is widely used for representing chemical structures due to its simplicity and efficiency in encoding complex molecules as strings. By taking SMILES of the compound as the input, the Perturb-adapter has sufficient flexibility to screen large-scale compound libraries without any prior knowledge. Driven by data, PRnet automatically identifies the heterogeneity in latent space corresponding to compound, dosage contexts, and cell-type specific contexts. This allows the model to directly generalize to novel perturbation scenarios involving novel compounds, pathways, cell types, and cell lines that have not been previously perturbed.With the ability to predict transcriptional responses to novel perturbations, PRnet enables efficient screening of candidates for complex diseases (Fig. 1c). Inspired by the assumption embodied in the CMap15 concept that gene signatures are used as indicators reflecting the underlying mechanisms of diseases, PRnet predicted new therapeutic candidates by finding drugs that reverse the disease signature. There are two steps to applying PRnet to downstream tasks. In step 1, for in silico screening, PRnet predicts the transcriptional profile of specific cell lines perturbed by a user-defined compound library with multiple gradient concentrations. In step 2, we calculate the average transcriptional profile and fold-change in gene expression of each compound, and rank genes according to their fold-change values. Then, given the query gene signature for a particular disease or known sensitive compounds, gene set enrichment analysis (GSEA32) was employed to evaluate compound efficacy with their enrichment scores. Finally, compounds are ranked based on these enrichment scores. Large-scale high-throughput screening data are initially fitted to the model, facilitating its adaptability to diverse compound libraries and diseases.PRnet was trained on two compound libraries and screened four compound libraries (Fig. 1c). The training compound libraries contain a bulk high-throughput screening library consisting of over 883,269 transcriptional profiles of 175,549 bioactive compounds1, and a single-cell high-throughput screening library consisting of 290,888 transcriptional profiles of 188 active compounds2. Being well trained in HTS observations, PRnet enables in silico high-throughput screening of novel compound libraries for various cell lines. PRnet has further been applied to screen active and drug-like compounds for SCLC and natural compounds for CRC. In vitro validation experiments with MTT assays confirmed the efficacy of the candidate compounds against SCLC and CRC cell lines. Lastly, PRnet screened four compound libraries and generated a large-scale integration atlas of perturbation profiles (Fig. 1d), including (1) 82 cell lines perturbed by 935 FDA-approved drugs, (2) 88 cell lines perturbed by 4158 active compounds, (3) 14 CRC cell lines perturbed by 30,456 natural compounds, (4) 6 SCLC cell lines perturbed by 29,670 drug-like compounds and (5) 54 tissues perturbed by 935 FDA-approved drugs. Based on the large-scale integration atlas of perturbation profiles, PRnet is capable of a variety of downstream applications. PRnet has been utilized to recommend drugs for 233 different diseases. PRnet successfully predicted drug candidates for these diseases and demonstrates the potential in drug discovery.PRnet robustly predicted the response of novel perturbation and learned interpretable latent embeddingsTo evaluate the performance of PRnet for predicting responses to unseen perturbations, all datasets were strictly split by perturbation attributes (compound_split, cell_line_split, and pathway_split) into 3 subsets: train, validation, and test (Supplementary Fig. 1a). The held-out test sets were used to simulate datasets of novel perturbations. Three train-test data split strategies were employed to assess the performance of out-of-distribution perturbation scenarios, including (1) Random Split: randomly divides compounds and cell lines, (2) Unseen Compounds: testing compounds not seen perturbed during training, and (3) Unseen cell lines: testing cell lines not seen perturbed during training. Five-fold cross-validation was applied in each split strategy, and the average performance over five folds was computed as the overall metric for comparison. Two high-throughput screening data of different resolutions were used to test model performance, consisting of a bulk HTS dataset (from the L1000 project1) and a single cell HTS dataset (from the sci-Plex3 assay2). All models were trained and compared separately on two HTS datasets.We employed bulk high-throughput screening data from the L1000 project1 to fit the model, in which 978 genes (hereinafter called landmark genes) were selected to represent the diversity of biological pathways and processes in human cells. We first preprocessed these data (as detailed in “Methods”) and obtained 836,352 paired bulk RNA-seq observations (represented by the expression levels of 978 landmark genes), covering 82 cell lines and their perturbation data perturbated by 175,549 compounds. To quantitatively evaluate the compound-induced gene expression changes, we compared the Pearson correlation between the true and predicted post-perturbation of the average logarithm of fold-change in gene expression (log(FC)) for the hold-out test set with alternative approaches. The “Pearson of log(FC) in compounds” metric evaluated the Pearson correlation between the true and predicted mean log(FC) perturbed by the same compound in the test set. We demonstrated the performance of the PRnet on the bulk HTS data in Fig. 2a, where a higher value indicates a better performance. PRnet consistently demonstrated the best performance across all three split strategies, particularly well-fitted in unseen compound prediction scenarios with an average Pearson Correlation (PCC) of 0.8. PRnet significantly outperformed in predicting unseen cell line log(FC) with an increase in PCC over 0.3 compared to other approaches. Some hold-out predicted cases in predicting transcriptional responses of unseen compounds and cell lines were illustrated in Supplementary Fig. 1b–d. In more challenging scenarios, we evaluated the performances of predicting cell-line-specific compound-induced changes in genes by the “Pearson of log(FC) in cov_compounds” metric. The “Pearson of log(FC) in cov_compounds” metric is the Pearson correlation between the true and predicted mean log(FC) perturbed by the same compound within the same cell line in the test set. PRnet achieved the best performance in the “Pearson of log(FC) in cov_compounds” metric across three scenarios. In particular, PRnet exhibited more than two times better performance in unseen cell line predictions compared to other methods and demonstrated improvements of 0.16 in unseen compound predictions, demonstrating the generalization of PRnet to novel perturbations (Fig. 2b).Fig. 2: PRnet outperforms alternative approaches in predicting post-perturbation change in gene expression.a Performance of out-of-distribution perturbation scenarios in three train-test data split strategies (1) Random split, (2) Unseen compounds split, and (3) Unseen cell lines split. The “Pearson of log(FC) in compounds” metric evaluates the mean log(FC) perturbed by the same compounds in the test set, which are presented as mean values ± SD. Error bars indicate the standard error (SD) for each method, and the points (n = 5) refer to the 5-fold used for validation. b The “Pearson of log(FC) in cov_compounds” metric evaluated mean the (log(FC)) of compounds within the same cell line in the test set. Results are presented as mean values ± SD (n = 5). c T-SNE representations of latent embeddings learned by PRnet. Post-perturbation transcriptional profiles are from 82 cell lines from 20 different organs/tissues. d The “R2 in compounds” metric evaluates the R2 score of the average gene expression perturbed by the same compounds in the test set. Results are presented as mean values ± SD (n = 5). e The “R2 in cov_compounds” metric evaluates the R2 score of the average gene expression perturbed by the same compounds within the same cell line. Results are presented as mean values ± SD (n = 5). f T-SNE representation of latent embeddings from massive single-cell screens of 188 drugs across 3 cancer cell lines. g T-SNE representation of latent embedding of transcriptional profiles from MCF7 cell line perturbed with AG-14361. The pseudo-dose trajectory is displayed with a black line. Source data are provided as a Source Data file.To better characterize the heterogeneous gene-level change under certain perturbations, it is desirable to identify the set of cells or cell lines and isolate the precise variations enriched in the data from the corresponding cell lines or cells. After training, PRnet learned interpretable embeddings in the latent space in the context of the base unperturbed state and the applied perturbation. The low-dimensional (t-SNE33) representation (Fig. 2c) illustrates the latent embeddings of post-perturbation transcriptional profiles learned by PRnet. In the latent space, embeddings from the same cell line tend to form a cluster together. Each cancer cell lines form specific gene-level responses for corresponding perturbations. In a way, PRnet captures strong cell line-specific transcriptional profile variations under different conditions. Interestingly, we observed that the embedding learned by PRnet also represents cell line similarity in response to various perturbations. Figure 2c illustrates the t-SNE representation of the latent embeddings for all cell lines, where cell lines originating from the same organ exhibit a similarity preference in the latent space, resulting in close spatial locations, such as cell lines from the colon, breast, and lung. Supplementary Fig. 6a–c demonstrates in detail the low-dimensional (t-SNE) representation of the latent embeddings for cell lines from colorectal cancer, breast cancer, and lung cancer, respectively. To quantitatively evaluate the similarity of perturbations among cell types, we computed the normalized cosine similarity among the mean embeddings in the latent space for all cell lines. The resulting cosine similarity heatmap (Supplementary Fig. 6g) shows that most cell lines from the same tissue exhibit higher similarity in the latent space compared to those from different tissues.Human cancer cell lines have facilitated drug discoveries in cancer biology, but they are neither clonal nor genetically stable. This instability can generate variability in drug sensitivity, as shown by Ben-David et al.34. The genomic evolution of cancer cell lines leads to a high degree of variation across cell line strains. To explore the impact of inter- and intra-heterogeneity of cell lines on drug responses, we conducted additional experiments focusing on A549 and MCF7 cell strains, as described by Ben-David et al.34 in Supplementary Note 2. These experiments were designed to explore inter- and intra-cellular heterogeneity, similar patterns of the same MOA drug across cell strains, and screening candidate compounds for heterogeneous cells. These results indicated that drug response was highly similar to genetic or gene expression which aligned with findings in the original study. These findings underscore the utility of our model in capturing both inter- and intra-heterogeneity, enhancing its application in screening for specific cancer strains.PRnet adapted to model profiles of HTS at different resolutions. The sci-Plex assay2 screened 188 compounds in 3 cancer cell lines at single-cell resolution, measuring millions of cells. The screened cell lines A549 (lung adenocarcinoma), K562 (chronic myeloid leukemia), and MCF7 (breast adenocarcinoma) were treated to each of these 188 compounds at four dosages (10 nM, 100 nM, 1 μM, 10 μM). To quantitatively evaluate the performance of PRnet in single-cell HTS data, we followed commonly used metric4,5,6 to compare the R2 score between the true and predicted post-perturbation gene expression for the hold-out test set with alternative approaches (Fig. 2d, e). The “R2 in compound” metric evaluated the R2 score of the average post-perturbation gene expression of the same compound in the test set. We also compared the cell-type-specific performance “R2 in cov_compound” metric, which evaluated the R2 score of the average post-perturbation of the same compound within each specific cell line. PRnet outperformed alternative models in the “R2 in the compound” and “R2 in cov_compound” metrics in unseen compounds (R2 in the compound: 0.969) and unseen pathways scenarios (R2 in the compound: 0.97) than the other methods. The low-dimensional (t-SNE) representation illustrates latent embeddings learned by PRnet from massive single-cell screens of 188 compounds across 3 cancer cell lines (Fig. 2f). The latent embeddings automatically cluster the concordance of cells to their cell types. The results demonstrated that PRnet not only captures the heterogeneous responses of large-scale HTS but also resolves similar responses of homogeneous cells to various perturbations. The t-SNE embedding (Fig. 2g) in the latent space of MCF7 cells treated with AG-14361 produces a pseudo-dose trajectory, indicating that AG-14361 induced heterogeneous responses. Several other pseudo-dose trajectory examples were illustrated in Supplementary Fig. 6d–f.To validate the clinical relevance of PRnet, we incorporated scRNA-seq data from a cohort of pediatric acute myeloid leukemia (AML) patients35. We trained PRnet to predict transcriptional responses to chemotherapy treatments and validated its potential for clinical application. PRnet showed robustness in predicting transcriptional responses for pediatric AML patients post-chemotherapy, demonstrating PRnet’s potential to assist in clinical applications. For detailed experimental results, please see Supplementary Note 1 and Supplementary Fig. 8.PRnet captures gene programs in various perturbation scenariosWe reason that PRnet can be a valuable tool to analyze and capture functional transcriptional experiments that aim to uncover gene programs or effects on various conditions by introducing perturbation to the system. To investigate this in greater detail, we first collected and tested the hold-out perturbation data of compound vorinostat. Vorinostat (suberoylanilide hydroxamic acid) is the first FDA-approved HDAC inhibitor for the treatment of cutaneous manifestations of cutaneous T-cell lymphoma (CTCL) and is also currently being studied as monotherapy and in combination therapy for other types of cancers36,37. Results demonstrated a comprehensive ability of PRnet to capture gene-level fold changes in all of the 71 cell lines (Fig. 3a). By comparing the transcriptional responses of cell lines from different organs, we observed that PRnet captured cell-type-specific responses. For instance, cell lines from muscle exhibited relatively weaker responses and smaller fold changes compared to those from the lung. Figure 3a shows that PRnet correctly captures both the right trend and the magnitude of perturbation of the top up- and down-regulation genes across 71 cell lines from 16 tissues/organs. Taking gene FAM57A and TP53 as examples, PRnet made accurate predictions in cases of both up- and down-regulation in all cell lines after perturbation. In addition, PRnet even correctly predicted the fold change values in the expression of FAM57A across all cell lines. Figure 3b shows a detailed comparison between the predicted and actual distributions of fold changes in gene expression for three representative cell lines from different organs (HT29: colorectal adenocarcinoma, A549: lung adenocarcinoma and MCF7: breast adenocarcinoma,) treated with vorinostat. It can be observed that PRnet aligns consistently with the distribution of predicted and true observations and accurately predicts the up- and down-regulation trends of the top 5 genes with high log(FC) values. We employed the KEGG pathway Gene Set Enrichment Analysis (GSEA) on the average predicted post-perturbations gene rank of Vorinostat across all cell lines. The GSEA results (Fig. 3c, d and Supplementary Fig. 3a) reveal that Vorinostat is enriched in pathways related to fundamental cellular processes related to tumor suppressor mechanisms. The GSEA results indicated that Vorinostat suppresses pathways such as Cell cycle, DNA replication, and Spliceosome, and activates pathways including Autophagy – animal, Lysosome, Phagosome, and so on, which are all associated with autophagy and apoptosis in tumor cells38.Fig. 3: PRnet captures gene programs in various perturbation scenarios.a The heatmap illustrates the average log(FC) of the gene expression profiles predicted by PRnet and the ground truth for 71 cell lines under the drug Vorinostat. The horizontal axis represents genes, while the vertical axis represents cell lines, with the color orange indicating upregulation and the color green indicating downregulation. b The violin plots illustrate the distribution of log(FC) for the top 5 up-and down-regulated genes in multiple perturbations of cell lines, including A549, HT29, and MCF7, from three cancer types treated with the drug Vorinostat. c The GSEA results of average post-perturbation changes in expression of 71 cell lines perturbed by Vorinostat. GSEA was performed for KEGG pathway enrichment. Enriched terms were identified by adjusted p-values (p.adjust) < 0.05 (Benjamini-Hochberg method). d The category net plot visualizes the functional enrichment result of Vorinostat with suppressed pathways colored in blue and activated pathways colored in red. e The box plots illustrate the log(FC) of the top 20 up-and down-regulated genes in cell lines HT29 with the drug bortezomib (n = 865 observations), MG-132 (n = 892 observations), and wortmannin (n = 207 observations), respectively. The middle line in the box plot, median; box boundary, IQR; whiskers, 1.5 × IQR; minimum and maximum, not indicated in the box plot; gray dots, points beyond the minimum or maximum whisker. f The box plots illustrate the expression of the 10 different expression genes of cell lines A549 (n = 450 cells), K562 (n = 430 cells), and MCF7 (n = 988 cells) perturbed by the drug GSK-LSD1, respectively. The middle line in the box plot, median; box boundary, IQR; whiskers, 1.5 × IQR; minimum and maximum, not indicated in the box plot; gray dots, points beyond the minimum or maximum whisker. Source data are provided as a Source Data file.To demonstrate the generalization of PRnet, we also analyzed perturbation observations of other hold-out compounds on a gene-by-gene basis. We collected and tested some case perturbation data of HT29 with the most observations. Figure 3e illustrates the log(FC) of the top 20 up- and down-regulated genes in multiple perturbations of HT29 cell lines treated with the hold-out compounds bortezomib, MG-132, and wortmannin, respectively. These results suggested that PRnet is able to capture regulated gene level information that is consistent with evidence from corresponding compounds in inferring different cancer transcriptional profile conditions that can be missed by perturbations analysis. The predicted distribution of fold changes in gene expression after perturbation closely aligned with the actual observed distribution, indicating the accuracy of PRnet in capturing the perturbation effects. More predicted gene-level perturbation responses of cell lines of breast cancer exhibited similar performance (Supplementary Fig. 2a). Besides, Fig. 3f demonstrates the ability of PRnet to predict cell-type-specific gene-level perturbation transitions in single-cell HTS observations. In the case of predicting the response of perturbing A549, K562, and MCF7 cell lines with GSK-LSD1, PRnet correctly captured both the right trend in gene expression and the magnitude of response across all 10 differentially expressed genes (Fig. 3f). Similar performance was observed for several other examples across perturbation conditions (Supplementary Fig. 4a–c). The ability to capture changes in gene-level programs under different compound conditions and resolutions indicates the robustness, generalization, and precise performance of PRnet in predicting perturbation responses.PRnet identified active compounds against small-cell lung cancerHaving been trained to simulate experimental measurements of high-throughput screening, PRnet was applied to identify potential novel compound candidates for the treatment of small cell lung cancer (SCLC). SCLC is an extremely aggressive lung cancer characterized by small cells with limited cytoplasm forming clusters or spheroids39. Despite an initial positive response to conventional chemotherapy and radiation, SCLC often recurs rapidly, with less than 5% of patients surviving five years. Currently, highly effective treatment options for this disease remain unresolved, making drug development efforts for this cancer a high priority.Given several novel cell lines of SCLC, namely NCI-H69, NCI-H526, NCI-H446, NCI-H209, and NCI-H196, and DMS114, we first employed PRnet to predict the transcriptional response of SCLC cell lines to sensitive compounds. Then, we in silico screened two user-defined compound libraries to identify potential compound candidates against SCLC (Fig. 4a), namely an active compound library (4158 compounds) from Selleckchem, and an in-house druglike compound library (29,670 compounds). Through in silico screening, PRnet predicted the transcriptional responses of each compound across eight concentration gradients perturbing 6 cell lines, with each scenario repeated three times for computational robustness and calculated gene rank of changes in the average post-perturbation expression of each compound. After that, the predicted up/downregulated genes of sensitive compounds on their cell lines were used as the GSEA gene signature input to calculate the enrichment scores. We then performed GSEA to calculate the enrichment scores of compounds in libraries and ranked them according to their scores (Supplementary Datas 1, 2). Ultimately, three compounds ((+)-Fangchinoline, (+)-JQ-1, and SEL120-34A HCl) at the top rank (rank ≤ 3 for enrichment score of sensitive compounds) were chosen as the candidate set (Fig. 4b). Among them, it has been proved that small cell lung cancer (SCLC) cells are exquisitely sensitive to growth inhibition by the BET inhibitor (+)-JQ-1 (CAS No. : 1268524-70-4)40, and (+)-Fangchinoline and SEL120-34A HCl were assessed experimentally. We also explored the suitable activity concentrations of candidate compounds by calculating the enrichment scores of perturbed transcriptional profiles at concentration gradients. As shown in Fig. 4c, d, concentrations in the range of 1–10 μmol/L might be the proper inhibitory concentration for these candidate compounds.Fig. 4: The PRnet-identified candidates against small cell lung cancer and colorectal cancer.a PRnet predicted the perturbed transcriptional profiles of 6 SCLC cell lines and 14 CRC cell lines. For in silico screening, PRnet first predicted the transcriptional profiles of SCLC cell lines perturbed by a multi-concentration gradient of 4158 active compounds, as well as 29,670 in-house druglike compounds. PRnet also predicted the transcriptional profiles of 14 CRC cell lines perturbed by 30,456 natural compounds. Then, post-perturbation fold-changes and average fold-changes of compounds across cell lines are computed. The gene ranking is performed based on the fold-change values. Given the predicted gene signature of sensitive compounds, the model computes the enrichment scores for up- and down-regulated gene sets of screening compounds. Finally, compounds are ranked based on the enrichment scores. For in vitro testing, MTT assays were performed for the evaluation of cell viability. b The heatmap illustrates the enrichment score of candidates for sensitive compounds. c, d The line charts plot the enrichment score of SCLC cells exposed to (+)-Fangchinoline and SEL120-34A HCl. e, f The cell survival curve of SCLC cells exposed to (+)-Fangchinoline and SEL120-34A HCl (n = 3 replicates). IC50 (half-maximal inhibitory concentration) are presented as mean values ± SD. g, h The cell survival curve of CRC cells exposed to 7-Methoxyrosmanol and Mulberrofuran Q (n = 3 replicates). IC50 are presented as mean values ± SD. Source data are provided as a Source Data file. Some icons were created in BioRender. Qi, X. (2024) biorender.com/c85y849.We utilized MTT assays to examine the activities of compound candidates against SCLC cells. Six human SCLC cell lines (NCI-H69, NCI-H526, NCI-H446, NCI-H209, and NCI-H196, and DMS114) were chosen for experiments. Results revealed the activity of SEL120-34A HCl (CAS No. : 1609452-30-3) and (+)-Fangchinoline (CAS No. : 436-77-1) against small cell lung cancer (SCLC) cell lines. SEL120-34A HCl and (+)-Fangchinoline showed significant inhibitory effects on the proliferation of SCLC cells, and exhibited an IC50 (half-maximal inhibitory concentration) of less than 10 μmol/L, indicating their inhibitory effect on SCLC cell viability (Fig. 4e, f). (+)-Fangchinoline and SEL120-34A HCl moderately inhibited the viability of SCLC cell lines. Detailed antiviability activities of two candidates are shown in Supplementary Table 3. These findings suggested the potential therapeutic efficacy of SEL120-34A HCl and (+)-Fangchinoline in the context of SCLC treatment, highlighting their promising role as active compounds against this aggressive lung cancer subtype.PRnet found natural compounds against colorectal cancerColorectal cancer (CRC) ranks as the third most common cancer and the second leading cause of cancer death worldwide41. Advances in molecularly targeted therapy and immunotherapy over the past decades have significantly improved patient survival41. However, some CRC patients, initially responsive to these treatments, quickly develop insensitivity. The emergence of drug resistance in cancer treatment significantly reduces patient outcomes. We aim to explore more potential novel treatment options for colorectal cancer to mitigate the impact of drug resistance through in silico screening of a large-scale library of natural compounds (30,456 compounds42).We extended the application of PRnet to screening novel natural compounds for the treatment of CRC. We first leveraged PRnet to predict the post-perturbation transcriptional profiles of sensitive compounds in 14 CRC cell lines (HT29, LOVO, MDST8, RKO, HT115, SW948, SNU1040, SNUC4, SNUC5, SW480, SW620, HCT116, CL34, and NCIH508). Then, we used the predicted up/downregulated genes of sensitive compounds on their cell lines as the GSEA gene signature. After that, we also in silico screened a large-scale natural compounds library containing 30,456 natural ingredients. PRnet predicted the transcriptional responses of each compound across eight concentration gradients perturbing 14 cell lines with three repeats and calculated the post-perturbation average gene rank of compounds. Two natural compounds (7-Methoxyrosmanol and Mulberrofuran Q) at the top rank (rank ≤ 5 for enrichment score of sensitive compounds) were chosen as the candidate set (see Supplementary Tables. 1, 2).We utilized MTT assays to examine the activities of candidate compounds against CRC cells. Six human CRC cell lines, namely HT29, SW480, Caco-2, SW620, HCT116, and Colo205, were chosen for the experiments. The results revealed that 7-Methoxyrosmanol (CAS No. : 113085-62-4) and Mulberrofuran Q (CAS No. : 101383-35-1) inhibited the viability of CRC cell lines (Fig. 4e, f). 7-Methoxyrosmanol and Mulberrofuran Q moderately inhibited the viability of CRC cell lines. Detailed antiviability activities of two candidates are shown in Supplementary Table 3. These findings showed the activity of 7-Methoxyrosmanol and Mulberrofuran Q against CRC cell lines, suggesting their potential efficacy in CRC treatment.PRnet generated a large-scale integration atlas of perturbation profilesWith the ability to characterize specific gene-level perturbation responses and identify anti-cancer compounds, PRnet was applied to in silico screen novel compound libraries and cell lines and generate a large-scale integration atlas of perturbation profiles across various scenarios (Fig. 5a). PRnet was trained with two datasets: (1) the L1000 dataset, a bulk high-throughput screening library consisting of 883,269 transcriptional profiles from 82 cell lines perturbed by 175,549 biologically active compounds, and (2) the Sci-plex3 dataset, a single-cell high-throughput screening library consisting of 290,888 transcriptional profiles from 3 cell lines perturbed by 188 active compounds. The L1000 dataset1 screened cell lines derived from over 20 diverse tissues and exposed to compounds targeting multiple genes and pathways (Supplementary Fig. 5a–c). The Sci-plex3 dataset screened three cancer cell lines treated by 188 compounds targeting a diverse range of targets and molecular pathways, covering various mechanisms of action (Supplementary Fig. 5d–f). After training, PRnet was applied to screen various perturbation scenarios to generate a large-scale integration atlas of perturbation profiles. Through virtual screening, PRnet has predicted over 25 million post-perturbation expression profile atlas of perturbation profiles which consists of five parts: (1) the FDA-approved drugs dataset: a bulk virtual high-throughput screening library containing 1,891,330 transcriptional profiles from 82 cell lines perturbed by 935 FDA-approved drugs, (2) the anti-cancer compounds dataset: a bulk virtual high-throughput screening library containing 8,781,784 transcriptional profiles from 88 cell lines perturbed by 4158 active compounds, (3) the natural compounds dataset: a bulk virtual high-throughput screening library containing 10,233,230 transcriptional profiles from 14 colorectal cancer cell lines perturbed by 30,456 natural compounds, (4) the bioactive compounds dataset: a bulk virtual high-throughput screening library containing 4,272,486 transcriptional profiles from 6 small cell lung cancer cell lines perturbed by 29,670 druglike compounds, and (5) the Gtex dataset: a bulk virtual high-throughput screening library containing 1,245,510 transcriptional profiles from 54 tissues perturbed by 935 FDA-approved drugs. Details of all cell lines are provided in Supplementary Data 3. PRnet offered a broad perspective by providing insights into how perturbations impact the transcriptional landscape on a large scale and extended its utility to diverse screening contexts. The large-scale integrated atlas of perturbation profiles can be applied to a variety of downstream application scenarios. For example, the FDA-approved drug dataset can be used for drug repositioning to recommend drugs for specific diseases based on gene signatures (see PRnet provided a robust and scalable drug 568 recommendation workflow based on the profiles atlas). The anti-cancer compounds dataset, the natural compounds dataset, and the bioactive compounds dataset are valuable for screening new anti-cancer compounds (see PRnet identified active compounds against small cell 430 lung cancer and PRnet found natural compounds against colorectal 482 cancer). In addition, the Gtex dataset can be useful to analyze the toxicity of compounds in different tissues. PRnet imports gene-level functionality in perturbations of different compounds and empowers flexibility by utilizing user-defined compounds’ structures and transcription profiles to estimate the gene expression matrix. These profiles can be compared against various perturbation conditions(dosages, structures of compounds) using PRnet, to evaluate the impact of using different compounds on specific gene-expression profiles from single cell and bulk data. These diverse profiles provided potential solutions for drug discovery, disease treatment, and toxicity analysis.Fig. 5: PRnet provided a robust and scalable recommend workflow on a large integration atlas of perturbation profile.a The large-scale integration atlas of perturbation profiles. b PRnet provided a recommended workflow for specific diseases. In step 1, given the structures of the screening library, PRnet predicts the post-perturbed transcriptional profiles of all compounds across multiple concentration gradients in 82 cell lines. In step 2, the transcriptional profiles of 978 landmark genes are transformed into 12,328 genes through linear transformation. Subsequently, the perturbation fold-change and the average fold-change across cell lines for the transformed expression profile are computed. Gene ranking is then performed based on the fold-change values. In step 3, given the gene signature for a particular disease, the model computes the enrichment scores for up- and down-regulated gene sets of screening compounds. Finally, compounds are ranked based on the enrichment scores. c–e Scatter plots of the enrichment score of compounds against NASH, Crohn’s disease, and PCOS. The computationally predicted candidates are highlighted with their names in the upper right corner. The color gradient shows the density of the dots. Source data are provided as a Source Data file. Some icons were created in BioRender. Qi, X. (2024) biorender.com/z56i777.PRnet provided a robust and scalable drug recommendation workflow based on the profiles atlasPRnet provides a comprehensive drug recommendation workflow based on the large-scale integration atlas of perturbation profiles (Fig. 5b). In step 1, given the compounds’ structure of the screening library, PRnet predicts the post-perturbed transcriptional profiles of all compounds across multiple concentration gradients across cell lines. In step 2, the transcriptional profiles of 978 landmark genes are transformed into 12,328 genes through linear transformation. Subsequently, the perturbation fold-change and the average fold-change across cell lines for the transformed expression profile are computed. The gene ranking is then performed based on the fold-change values. In step 3, given the gene signature for a specific disease, PRnet computes the enrichment scores for up- and down-regulated gene sets of screening compounds. Finally, compounds in the screening library are ranked based on these enrichment scores. These downstream applications demonstrated the versatility and utility of PRnet in addressing diverse challenges in the field of drug discovery and perturbation analysis.We leveraged the drug recommendation workflow to identify drug candidates for the treatment of 233 different diseases. For the user-defined compound library, we here chose the FDA-approved drug dataset, which comprises 935 FDA-approved drugs. All gene signatures of 233 diseases versus normal were collected from the CREEDS project (CRowd Extracted Expression of Differential Signatures43). The up/downregulated genes of a specific disease were used as the GSEA gene signature input to calculate enrichment scores (see “Methods”) of drugs. Finally, we obtained 577 candidate lists from 577 studies for 233 unique diseases (Supplementary Data 4). Taking three metabolic disorder diseases, including nonalcoholic steatohepatitis (NASH), inflammatory bowel disease (IBD), and polycystic ovary syndrome patients (PCOS) as examples, we provided enrichment scores of each drug for the three diseases and the recommended drug candidates for them. The enrichment scores are plotted in Fig. 5c–e, drugs positioned at the upper right corner (rank≤ 10 for enrichment score) were chosen as the candidate set for literature verification. Supplementary Table 4 lists the Top 10 enrichment scores for compounds against NASH, Crohn’s disease, and PCOS.Nonalcoholic steatohepatitis (NASH) is liver inflammation and damage caused by a buildup of fat in the liver without standard treatment and any well-established drug targets. After calculation, we ultimately selected a set of candidate drugs from the upper right corner (rank≤ 6 for enrichment score) as candidate treatments for NASH (Fig. 5c). Literature verification revealed that Mirabegron (rank 1), Vidofludimus (rank 5), and Rifaximin (rank 6) have literature support for use in the treatment of NASH. A study on high-fat diet rats44 suggested that mirabegron may have a protective effect against NASH as it improves liver enzymes, lipids, serum HbA1c, fasting insulin and glucose, insulin resistance index, and serum adiponectin levels, as well as ameliorates hepatic histopathologic changes in NASH-induced rats. A study on obesity mice45 suggested vidofludimus reduced hepatic steatosis and inflammation in obesity mice and has been repurposed to a therapeutic potential in the treatment of NASH by targeting FXR based on the newly established relationships among drugs, targets, and diseases. Rifaximin therapy appeared to be effective and safe in modifying NASH through reduction of serum endotoxin and improvement of insulin resistance, proinflammatory cytokines, CK-18, and NAFLD-liver fat score. Patients with biopsy-proven NASH with Rifaximin therapy showed that Rifaximin appeared to be effective and safe in modifying NASH46,47. We also performed KEGG pathway enrichment analysis of predicted up/down-regulated genes of Mirabegron, Vidofludimus, and Rifaximin (Supplementary Fig. 7). The KEGG pathway enrichment analysis of downregulated genes for both Mirabegron and Rifaximin suggested they may downregulate pathways associated with NASH.Crohn’s disease and ulcerative colitis are idiopathic inflammatory bowel disorders (IBD). Crohn’s disease is a relapsing systemic inflammatory disease that primarily affects the gastrointestinal tract with extraintestinal manifestations and associated immune disorders48. After ranking the drugs, two drugs positioned at the upper right corner (rank≤ 9 for enrichment score) were chosen as the candidate set (Fig. 5d). The literature verification revealed that Escin (rank 3) and Ozanimod (rank 9) have literature support for use in the treatment of Crohn’s disease. Escin is the main bioactive ingredient of Semen aesculi, which improves intestinal barrier dysfunction of IBD via Akt/NF-κB signaling pathway49. Ozanimod is a once-daily sphingosine 1-phosphate receptor modulator for the treatment of inflammatory bowel disease and was approved by the FDA. Phase 2 clinical trial had proved that Ozanimod can be a novel oral small molecule therapy for the treatment of Crohn’s disease50,51. The KEGG pathway enrichment analysis results of predicted up/down-regulated genes of Escin and Ozanimod are illustrated in Supplementary Fig. 7. The predicted upregulated pathways (Supplementary Fig. 7) suggested Escin may upregulate the PI3K/Akt signaling pathway which regulates a broad cascade of target proteins including nuclear factor kappa B (NF-κB) and glycogen synthase kinase-3β (GSK-3β)52.Polycystic ovary syndrome (PCOS) is a heterogeneous endocrine disorder in which the major endocrine disruption is excessive androgen secretion or activity, and a large proportion of women also have abnormal insulin activity. We also chose positively scored drugs as candidates for PCOS (rank≤ 9 for enrichment score, Fig. 5e). The literature verification revealed that Enzalutamide (rank 3), Linagliptin (rank 8), and Topiramate (rank 9) have literature support for use in the treatment of PCOS. Enzalutamide has been suggested for use as an antiandrogen to treat hirsutism and hyperandrogenism in women with polycystic ovary syndrome53,54 investigated the effect of linagliptin and/or I3C on experimentally-induced PCOS in female rats, and the results suggested that linagliptin/I3C combination might represent a beneficial therapeutic modality for amelioration of PCOS. PCOS has completed phase 3 trials for Topiramate, and phentermine-topiramate extended-release (PHEN/TPM) resulted in the most loss of weight and total body fat55. The predicted regulated pathways (Supplementary Fig. 7) of Enzalutamide showed that Enzalutamide may upregulate the PI3K/Akt signaling pathway and MAPK signaling pathway, which are related to Androgen signaling56. The predicted downregulated pathways suggested Topiramate may downregulate the Lipid and atherosclerosis and the Fat digestion and absorption pathway, thereby contributing to weight loss. All three literature verification results illustrated that PRnet recommended reliable and effective drugs for different diseases, and hence is a valuable tool for drug discovery.

Predicting transcriptional responses to novel chemical perturbations using deep generative model for drug discovery

Identification of diagnostic markers pyrodeath-related genes in non-alcoholic fatty liver disease based on machine learning and experiment validation

Differential amino acid usage leads to ubiquitous edge effect in proteomes across domains of life that can be explained by amino acid secondary structure...

Virtual resection evaluation based on sEEG propagation network for drug-resistant epilepsy

Comparison of whole-brain task-modulated functional connectivity methods for fMRI task connectomics

Bispecific antibody (ABL602 2 + 1) induced bistable acute myeloid leukemia kinetics

Hot Topics

Identification of diagnostic markers pyrodeath-related genes in non-alcoholic fatty liver disease based on machine learning and experiment validation

Differential amino acid usage leads to ubiquitous edge effect in proteomes across domains of life that can be explained by amino acid secondary structure...

Virtual resection evaluation based on sEEG propagation network for drug-resistant epilepsy

Related Articles

Balancing Act: Pregnancy and Bipolar Disorder

Cohesion at the cellular level: flexible yet stable

Gut bacteria influence responses to immunotherapy in patients with asbestos related cancer

Quick Links

Must Read

Identification of diagnostic markers pyrodeath-related genes in non-alcoholic fatty liver disease based on machine learning and experiment validation

Differential amino acid usage leads to ubiquitous edge effect in proteomes across domains of life that can be explained by amino acid secondary structure...

Virtual resection evaluation based on sEEG propagation network for drug-resistant epilepsy

Comparison of whole-brain task-modulated functional connectivity methods for fMRI task connectomics

Popular Articles

Identification of diagnostic markers pyrodeath-related genes in non-alcoholic fatty liver disease based on machine learning and experiment validation

Differential amino acid usage leads to ubiquitous edge effect in proteomes across domains of life that can be explained by amino acid secondary structure...

Virtual resection evaluation based on sEEG propagation network for drug-resistant epilepsy