Computational approach for decoding malaria drug targets from single-cell transcriptomics and finding potential drug molecule

We have used three datasets of single-cell transcriptomics. First data set is a rich atlas of short and long-read single-cell transcriptomes of over 37,000 Plasmodium falciparum cells across intraerythrocytic asexual and sexual development18. Second data set is Single-cell transcriptomes collected from early (ring) and late asexual blood stages, as well as stage I and stage IV gametocytes (GC)19. Third dataset is Single-Cell transcriptomics collected from the zygote and the ookinete stages20. The first two are intra-erythrocyte datasets, and the third is non-intra-erythrocyte datasets. Each row of these datasets corresponds to a single cell, and each column corresponds to a gene.Data preprocessingThe raw data collected from single-cell experiments require preprocessing to address various challenges and enhance the quality of downstream analyses. Data preprocessing in single-cell transcriptomics has a few steps that involve cleaning, normalizing, and transforming the raw data into a format suitable for meaningful biological interpretation21. Three publicly available single-cell RNA-seq datasets were utilized. Two datasets were collected from the asexual reproduction phase18,19 and one from the sexual reproduction phase20. After collecting the raw datasets, they were preprocessed. The first dataset contains the expression of 5838 genes for 38,635 cells. The Second dataset contains mRNA count of 4722 genes for 3922 cells. Third dataset has 1024 cells and 1024 mRNAs. In the first dataset, each cell is assigned one label among the four blood cycle stages (ring, trophozoite, gametocyte, and schizont)18. In the second dataset, each cell is assigned one label among the four blood cycle stages (30hpi, 36hpi, 42hpi, and GCD)19. The time points in hours post-invasion (hpi) during the malaria parasite Plasmodium falciparum’s commitment cycle are denoted by the numbers 30, 36, and 42. GCD is the gametocyte stage. The synchronized parasites were cultured, and samples were collected at these precise intervals to examine the transcriptional modifications and alterations in gene expression induced by AP2-G during the commitment cycle19. Samples are taken 30, 36, and 42 hours after the invasion, and the time points are measured starting from the beginning of the invasion cycle19. In the third dataset, each cell is assigned one label among the four blood cycle stages (T0, T2, T4, T8, T12, T20)20. T0, T2, T4, T8, T12 and T20 are time points that represent intervals at which mosquito midguts were collected and dissected for analysis. After taking a blood meal, female Anopheles mosquitoes are tracked in time intervals to check the progression of the infection’s growth. The midguts are taken at the time intervals of 2, 4, 8, 12, and 20 hours after infection to analyze and track the development of the gametocytes or parasites within the mosquitoes20. After getting the datasets, Normalization was performed to address variability in sequencing depth and library sizes among cells, which can introduce biases in the study. We first applied total count normalization, a widely used method that scales each cell’s expression values based on its total number of reads, to resolve this challenge. Subsequently, RPKM normalization (reads per kilobase of exon per million reads mapped) was conducted to standardize mRNA counts22. This involved normalizing each gene’s expression by dividing the sum of its counts across each single cell and then further dividing by the length of the gene, measured in kilobases (KB).$$\begin{aligned} \text {RPKM} = \frac{\text {Number of reads mapped to a gene} \times 10^9}{\text {Total number of mapped reads} \times \text {Gene length in base pairs}} \end{aligned}$$
(1)
This normalization process promises robust and unbiased comparisons of gene expression levels across cells and samples.Feature selectionFigure 2Feature selection algorithm selects key proteins for the developmental cycle. Flowchart of the mutual information-based feature selection pipeline.Measurement errors, technical variations, and noise can all be present in single-cell data23. Irrelevant or noisy features in the analysis can lead to inaccuracies for further analysis. Selecting a subset of relevant features allows researchers to eliminate noisy features and focus on biologically meaningful genes or proteins24. We implemented a mutual information-based feature reduction technique with the help of supervised classification algorithms to select important features from our three datasets (a detailed pipeline for feature selection is given in Fig. 2a). In the first step of this feature selection algorithm (Fig. 2a), classification algorithms (LR, RF, XGBoost) were used to find the accuracies of each classification model on the three datasets. Next, mutual information between each column and the target column of these datasets was calculated. The target column has information about different cell types in these three datasets. Features with high mutual information with the target column were chosen, and classification algorithms were applied again to these reduced datasets. The goal was to observe whether better or the same classification accuracies were achieved compared to accuracies before feature selection. If better or the same classification accuracies were obtained, those features were selected; otherwise, the process was repeated with an increased number of features. Finally, we will select some n-number features that will give the same or better accuracies with respect to accuracies before feature selection. We can say that these selected features are crucial and contribute more to datasets. Next, Further analysis was conducted using these features. It was also demonstrated that randomly selecting the same amount of features from these datasets and applying classification algorithms resulted in decreased accuracy. We have shown the accuracies of these classification models before feature selection, after feature selection and randomly selected features in this bar diagram (Fig. 6a–c).There are other feature reduction techniques, but mutual information-based feature reduction is important because this algorithm quantifies the amount of information shared between different variables and is particularly useful for capturing non-linear relationships, which is essential in biological data where relationships may not be strictly linear. MI-based algorithms are less sensitive to noise than variance-based feature selection methods25. Single-cell transcriptomics datasets are often heterogeneous, containing a variety of cell kinds and states. This heterogeneity can be accommodated by mutual information-based feature reduction, which finds relevant features for differentiating between different cell types or circumstances.Construction and analysis of protein-protein interaction networkProtein-protein interactions (PPIs) are necessary for almost all cellular processes. Understanding PPIs is essential to understanding cell physiology in normal and pathological situations26. PPI network’s topological structure gives relevant connections and information regarding the associated biological functions27. In network theory, degree(D), Connectivity degree (k), betweenness centrality (BC), closeness centrality (CC), eigenvector centrality (EC), and eccentricity are the fundamental measures10. In our study, after getting selected features, we constructed protein-protein interaction network with these selected features from three datasets. We have used string software to construct these networks28. Then, from these networks, the degree and betweenness centrality of all the nodes of the networks were obtained. Betweenness centrality measures the extent to which a node lies on the shortest paths between other nodes, indicating its significant influence over information flow. Nodes with high BC, called bottlenecks, tend to indicate essential genes because they can be compared to heavily trafficked intersections on major highways or bridges. In a PPI network, nodes with large degrees, referred to as hub proteins, may correspond to the genes that cause the disease29,30. The hubs and bottlenecks at the heart of the PPI network were the primary focus of this investigation. We had three features set. The first two are from the asexual stage, and the third one is from the sexual stage. Proteins with high degree and betweenness centrality were identified from each feature set. The common hub and bottleneck proteins from the two asexual feature sets were selected, as these proteins exert more control in the network. Gene ontology enrichment analysis was performed with these selected features using g:Profiler31, a popular toolset for converting gene identity mappings to orthologs and discovering biological categories enriched in gene lists.It is found that all the selected hub and bottleneck proteins are members of some enriched biological functions. A close observation suggests that the apical complex and extracellular proteins are commonly found in both datasets, suggesting the critical roles of these complements during the intraerythrocyte developmental phases.Binding site prediction and drug molecule generationFigure 3Deep learning-based diffusion model predicts potential drug molecules and Binding site prediction. (a) The architecture of targetdiff diffusion model. \(M_{0}\) is the molecules in dataset. we are adding noise \(\epsilon\) to \(M_{0}\) and getting a diffused Gaussian noise \(M_{T}\). From \(M_{T}\) we are doing reverse diffusion and generating new samples. (b) A 3D representation of one strong binding site of PF3D7_1246200 protein as identified by the CAVITY software (c) A 3D representation of one strong binding site of PF3D7_0417200 protein as identified by the CAVITY software.After PPIN and protein enrichment analyses, a few proteins were selected for further analysis. based on their functional importance for plasmodium falciparum survival. These proteins are selected as drug targets for malaria.Cavity32 was used to find the strong binding sites of these proteins and after getting the binding sites and a generative deep learning model was used to generate suitable drugs that can bind with these binding sites (we have shown all selected drug molecules for asexual stage in Fig. 8a–f). We have used targetdiff, a target-based generative diffusion model, to generate drug molecules12. Many deep learning models can be used to generate drugs, but this is a target-based non-autoregressive diffusion model that can generate bonds and atoms of the molecules simultaneously33. This model has performed better than other models in target-based drug discovery. A total of 100 drug molecules were generated for each strong binding site.Architecture of TargetdiffTargetdiff is a target-based small molecule generation model based on the denoising diffusion process12. A diffusion process is a continuous-time stochastic process involving a series of diffusion steps to convert a given data distribution into a simpler distribution. In every step, the data is progressively transformed into a target distribution by adding noise. In reverse diffusion, we predict the added noise in each step to get the previous sample. Reverse diffusion allows to generation of new samples from the distribution33 (Fig. 3a). This model uses a SE(3)-equivariant network to train a joint generating process of both continuous features (atom coordinates) and categorical features (atom kinds). This SE(3)-equivariant network contains 9 equivariant layers, two graph attention layers for features, and coordinates with 16 attention heads and 128 hidden features. This is a structure-based model, so this model has been trained on 3D structure of molecules. During the generation time, small molecules have been generated based on the binding cavity of the target protein. For more details on the architecture of Targetdiff, please follow this ‘3D EQuivariant Diffusion For Target-Aware Molecule Generation And Affinity Prediction’12.Analysis of drug molecules to find lead compoundsA well-balanced combination of safety, pharmacokinetics, and biochemical behavior is what makes a drug successful. A drug candidate’s ability to achieve a desirable absorption, distribution, metabolism, excretion, and toxicity (ADMET) profile is as important to its success as having high potency and selectivity. More precisely, the perfect drugs should be ingested into the body, distributed to different tissues and organs reasonably, digested so as not to erase their action, and removed suitably instantly, and their non-toxicity should be established. These concerns, which span the entire process from administration to elimination, appear separate but are intimately related. We analyzed different properties of these drug molecules, such as physicochemical properties(molecular weight, volume, density, number of hydrogen atom acceptors(nHA), number of hydrogen atom donors(nHD), number of rotatable bonds in a molecule(nRot), Total Polar Surface Area(TPSA), Logarithm of Solubility (logS), Logarithm of the Partition Coefficient(logP), etc.), medicinal properties(Quantitative Estimate of Drug-likeness(QED), Surface Area to Volume Ratio score(SAS), Lipinski rule, Pfizer rule, GSK rule, Golden Triangle rule, etc), ADMET properties using ADEMETlab 2.034. Drug distribution, metabolism, excretion, and absorption are all impacted by molecular weight (ADME). We have added the description of all the physicochemical and medicinal properties in Github supplementary(description of physicological and medicinal properties)A few drug molecules that satisfy lead-likeness and drug-likeness properties and have no problematic structural fragments were chosen. After generating drugs we filter these drug molecules with the ADMET, druglikeliness, and leadlikeliness properties. Finally, after filtration, few drug molecules were selected for each protein (Table 3). Docking was then performed to determine each selected drug molecule’s binding energies and stable protein-ligand conformer with the respective target using AutoDock Vina35,36.

Hot Topics

Related Articles