scSemiProfiler: Advancing large-scale single-cell studies through semi-profiling with deep generative models and active learning

Method overview

The scSemiProfiler approach represents a deep neural network framework for decomposing broad-scope bulk sequencing data into detailed single-cell cohorts. It achieves this by conducting single-cell profiling on only a select few representative samples and then computationally inferring single-cell data for the remainder. This approach substantially lowers the costs associated with large-scale single-cell studies. As depicted in Fig. 1, our method is designed to deliver cost-effective, semi-profiled single-cell sequencing data, enabling deep exploration of cellular dynamics in large cohorts. In this context, “semi-profiling” refers to the generation of single-cell data for an entire cohort, achieved either through direct single-cell sequencing of selected representative samples or via in silico inference using a deep generative model. This in silico inference process combines actual single-cell sequencing data from a representative sample with bulk sequencing data, encompassing both the target and the representative sample. Thus, a “semi-profiled cohort” includes real single-cell data for representative samples and inferred data for the non-representative ones. This deep generative approach facilitates a thorough examination of individual cellular profiles within a larger dataset, seamlessly linking the extensive scope of bulk sequencing with the granularity of single-cell analysis.

Fig. 1: Overview of the scSemiProfiler method. a Initial Configuration: Bulk sequencing is first conducted on the entire cohort, followed by clustering analysis of this data. This analysis identifies representative samples, typically those closest to the cluster centroids. b Representative Profiling: The identified representatives are then subjected to single-cell sequencing. The data obtained from this sequencing is further processed to determine gene set scores and feature importance weights, enriching the subsequent analysis steps.
c Deep Generative Inference: Utilizing a VAE-GAN-based model, the process integrates comprehensive bulk data from the cohort with the single-cell data derived from the representatives, and infers the single-cell data for the non-representative (target) samples. During the model’s 3-stage training, the generator aims to optimize losses ${L}_{{G}_{Pretrain1}}$, ${L}_{{G}_{Pretrain2}}$, and ${L}_{inference}$, respectively, whereas the discriminator focuses on minimizing the discriminator loss ${L}_{D}$. In ${L}_{D}$, $G$ and $D$ are the generator and discriminator respectively. The term $D(({x}_{i}, {s}_{i}))$ represents the discriminator’s predicted probability that a given input cell is real, under the condition that it is indeed a real cell. Conversely, $D(G(({x}_{i}, {s}_{i})))$ denotes the discriminator’s predicted probability that the input cell is real, when in fact it is a cell reconstructed by the generator. d Representative Selection Decision: Decisions on further representative selection are made, taking into account budget constraints and the effectiveness of the current representatives. An active learning algorithm, which draws on insights from the bulk data and the generative models, is employed to pinpoint additional optimal representatives. These newly selected representatives then undergo further single-cell sequencing (b) and serve as new reference points for the ongoing in silico inference process (c). This active learning step is optional if the user prefers the all-in-one “global mode”. e Comprehensive Downstream Analyses: The final panel shows extensive downstream analyses enabled by the semi-profiled single-cell data. This is pivotal in demonstrating the model’s capacity to provide deep and wide-ranging insights, showcasing the full potential and applicability of the semi-profiled single-cell data.

Initially, the semi-profiling pipeline commences with bulk sequencing of each cohort member (Fig. 1a), laying the foundational data layer for all subsequent analyses.
Following this foundational step, the methodology employs a clustering analysis to form B (sample batch size) sample clusters from the initial bulk sequencing data and selects a “representative” sample for each cluster (Fig. 1b). Single-cell profiling is then conducted on the selected representative samples in preparation for the following steps. The core of scSemiProfiler is a deep generative learning model (Fig. 1c). This model is engineered to combine actual single-cell data profiles with the gathered bulk sequencing data, thereby capturing complex biological patterns and nuances. Specifically, it uses a VAE-GAN [44] architecture initially pretrained on single-cell sequencing data of selected representatives for self-reconstruction. Subsequently, the VAE-GAN is further pretrained with a representative reconstruction bulk loss, aligning pseudobulk estimations from the reconstructed single-cell data with real pseudobulk. Finally, the model undergoes fine-tuning with another target bulk loss tied to the real bulk sequencing data of the target samples, facilitating precise in silico inference of the targets’ single-cell profiles. Once the in silico single-cell inference is finished for all non-representative samples in the cohort, an active learning module can be used for selecting the next batch of potentially most informative representatives for single-cell sequencing to further improve the semi-profiling performance (Fig. 1d). When studying a smaller dataset, or when more cells per sample are required, a smaller batch size, such as 2, may be preferred. However, a batch size of 4 is set as default to maximize the usage of a 10x Genomics single-cell toolkit, which can typically capture up to 20,000 cells (4 samples if assuming 5000 cells each).
This dynamic, iterative process is continuously augmented with newly acquired single-cell data, ensuring that the most informative samples are selected for real single-cell profiling, leading to more accurate in silico single-cell inference for non-representative samples. This iterative process concludes when the budgetary constraint is met or when a sufficient number of representatives have been chosen to ensure satisfactory semi-profiling performance.

When any of the stop criteria are met, the semi-profiled single-cell data can be used for a broad spectrum of downstream single-cell analyses (Fig. 1e), such as cell feature visualization, biomarker and function enrichment analysis, tracking cell type compositions in various tissues/conditions, cell-cell interaction analysis, and pseudotime analysis. Ultimately, scSemiProfiler offers a holistic sequencing perspective, delivering nuanced single-cell insights from bulk sequencing data. The method exhibits acceptable performance in terms of runtime and memory usage (refer to Supplementary Fig. S1), making it suitable for scaling to large-scale datasets.

While active learning allows for selecting a minimal number of representatives to conserve budget, the necessity for multiple sequencing rounds can be challenging for many laboratories and can introduce batch effects. To counter this, scSemiProfiler features a “global mode” that enables all-in-one selection of representatives based on initial bulk data analysis. Furthermore, this “global mode” can be used in conjunction with active learning to enhance the selection process.

The semi-profiled COVID-19 single-cell cohort exhibits significant similarity to its real counterpart

To test the performance of scSemiProfiler in generating semi-profiled single-cell data that matches the granularity and detail of actual single-cell sequencing, we utilized a COVID-19 cohort single-cell sequencing dataset [45].
After quality controls (see the section titled “Representative single-cell profiling and processing” in the Methods for details), this dataset includes 124 samples, comprising healthy controls and infected patients of different severity levels: asymptomatic, mild, moderate, severe, or critical. Because actual bulk data for the samples in the COVID-19 cohort is absent, we produced pseudobulk data by taking the average of the normalized count single-cell data. We then tested scSemiProfiler’s ability to regenerate the single-cell cohort from the pseudobulk data and real-profiled single-cell representatives using semi-profiling. In generating our semi-profiled dataset, 28 representatives were selected in batches of 4 using our active learning algorithm for real single-cell profiling. Deep generative models then inferred the single-cell profiles for the remaining samples based on the representatives’ real-profiled single-cell data and the bulk data of all samples. The estimated total cost of this bulk and single-cell sequencing is $62,640. This is only 33.7% of the estimated price, $186,000, for actually conducting single-cell sequencing for the entire cohort. Additionally, our approach offers the advantage of generating extra bulk data for the cohort, a benefit not provided by single-cell sequencing of the entire cohort. The price of bulk sequencing is estimated based on the cost at McGill Genome Centre in the year 2023. This estimation assumes the use of one NovaSeq 6000 S2 system (capable of sequencing up to 4 billion reads per run) at an approximate cost of $7000, plus an additional $110 for library preparation per bulk sample. The cost for single-cell sequencing is based on the costpercell tool. This tool provides a cost estimate for capturing 0.8 billion total reads from 20,000 cells across four samples in one 10x lane, equating to $0.3 per cell. Consequently, the estimated cost for each sample (5000 cells) is $1500.
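
The cost figures above can be checked with a short calculation. The sketch below is our reading of the stated assumptions (one NovaSeq run shared by all bulk samples, $110 library preparation per bulk sample, and $1500 per single-cell sample); the function name `semi_profiling_cost` and its parameter names are illustrative. Under these assumptions, the breakdown reproduces the reported totals.

```python
def semi_profiling_cost(n_samples, n_representatives,
                        sc_cost_per_sample=1500,  # $0.3 per cell x 5000 cells
                        bulk_run_cost=7000,       # one NovaSeq 6000 S2 run
                        bulk_library_prep=110):   # per bulk sample
    """Return (semi-profiling cost, full single-cell cost) in dollars."""
    # Bulk sequencing covers every sample in the cohort.
    bulk_total = bulk_run_cost + bulk_library_prep * n_samples
    # Only the representatives are profiled at single-cell resolution.
    semi = bulk_total + sc_cost_per_sample * n_representatives
    # Baseline: single-cell sequencing of the whole cohort, no bulk data.
    full = sc_cost_per_sample * n_samples
    return semi, full


semi, full = semi_profiling_cost(n_samples=124, n_representatives=28)
print(semi, full)                    # 62640 186000
print(round(100 * semi / full, 1))   # 33.7
```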
In the following sections, we provide justification for scSemiProfiler’s capacity to match the analytical results of a fully profiled single-cell cohort. We compare the outcomes from the semi-profiled cohort, which includes real single-cell data for representative samples and inferred single-cell data for target samples, to those from the real-profiled cohort, the ground truth with single-cell data for all samples.

Using UMAP [27] visualizations, we show a significant alignment between the semi-profiled and the real-profiled data in Fig. 2a and b. We also annotated the semi-profiled COVID-19 cohort using an unbiased approach. We trained a Multi-layer Perceptron (MLP) classifier using the annotated representatives’ cells and used it to predict the cell types of the non-representative samples’ cells generated by the deep learning model. As shown in Fig. 2b, cell clustering remains intact in the semi-profiled cohort, from the distinctive clusters of B cells, plasmablasts, and platelets to the nuanced similarities of the CD14 and CD16 cells. The semi-profiled dataset shows high fidelity to the real-profiled version. This fidelity extends to capturing even the subtle batch effects, as observed in the twin CD4 clusters, further demonstrating scSemiProfiler’s robust in silico inference capabilities. The UMAP plots in Fig. 2c overlay the real-profiled data with the semi-profiled data. The extensive overlap between data points from the two sources indicates that scSemiProfiler combines accuracy with cost-efficiency. This comparison validates our tool’s ability to approximate a fully profiled dataset. This prompts a question: Does this alignment stem from the active learning mechanism that identifies the most informative representatives, or also from the deep generative model’s ability to accurately infer the target samples’ cells? To answer this, Fig.
2d uses distinct colors to delineate real-profiled cells from representatives and generated cells for target samples. It shows that our representative selection strategy is able to select representatives such that their cells have a relatively good coverage of the overall cell distribution. This emphasizes our method’s capacity to extend beyond the representatives, enriching the dataset. Meanwhile, the deep learning model managed to generate cells to complement the rest and make the overall semi-profiled cohort almost identical to the original real-profiled cohort. The effectiveness of the deep learning model in the semi-profiling process is further demonstrated in Fig. 2e and f and Supplementary Fig. S2. These figures illustrate the model’s capability in reconstructing data from single representative samples and its proficiency in inferring data for individual target samples. Supplementary Figs. S3–S5 provide further details regarding the cell type distribution in each sample and sample cluster, and show that scSemiProfiler is capable of generating similar sample clusters in terms of UMAP visualization and cell type distribution.

Fig. 2: Overall comparisons of the semi-profiled and real-profiled COVID-19 dataset. a UMAP visualization of the real-profiled data. Colors correspond to cell types and are consistent with (g). b UMAP visualization of the semi-profiled data. c UMAP visualization of semi-profiled data and real-profiled data together. The color differentiation signifies whether cells originate from the semi-profiled or the real-profiled dataset. Areas of overlap between the two indicate where the semi-profiled data closely resembles the real-profiled data. d UMAP visualization of the semi-profiled cohort, displaying different colors to distinguish cells produced by a deep generative model (labeled as “Generated”) from the representative cells obtained through real-profiling (labeled as “Representatives”).
e, f An illustrative case of in silico single-cell data inference for a target sample (AP1) using a representative sample (AP6) from the COVID-19 cohort is presented. e UMAP visualization compares the reconstructed single-cell data of the representative (in gray) against the real-profiled single-cell data of the representative (in blue). f UMAP illustrates the cell distribution of the representative sample (blue), the inferred target sample cells (yellow), and the real-profiled single-cell data of the target sample (red). The inferred cells exhibit a much higher resemblance to the real-profiled target sample data than to the representative’s data. g Visualization illustrates the relative activation patterns of the interferon pathway. The comparison of these values between the semi-profiled and real-profiled matrices yields a Pearson correlation coefficient of 0.808 and a two-sided p-value of $5.74 \times 10^{-22}$. h Graph depicting the normalized error in semi-profiled data with an increasing number of representatives. The terms “scSemiProfiler” and “Selection-only” represent our semi-profiling method and a method that only selects representatives using an active learning algorithm, respectively. It is important to note that actual costs may vary based on the sequencing technology and the specific number of cells sequenced. i Stacked bar plot illustrating the proportions of cell types across various disease conditions. The upper portion represents the real-profiled data, while the lower portion depicts the semi-profiled data. Pearson correlation coefficients comparing cell type proportions between the real-profiled and semi-profiled datasets are provided for different conditions: Healthy (0.987), Asymptomatic (0.970), Mild (0.996), Moderate (0.992), Severe (0.978), and Critical (0.989), indicating a high degree of similarity between the two datasets across these conditions. j–l Cell type deconvolution benchmarking.
Results for all N = 124 samples in this COVID-19 dataset are shown. The boxes represent the interquartile ranges (IQRs), and the solid lines indicate the means. The whiskers extend to points within 1.5 IQRs of the lower and upper quartiles. P-values are from one-sided Wilcoxon tests without adjustment. j Figure displaying Lin’s Concordance Correlation Coefficient (CCC) between actual (ground truth) cell type proportions and those estimated by various deconvolution methods. Except for the first two columns, all other columns’ results are based on 4 representatives’ single-cell data as reference. k Comparison of Pearson correlation across various deconvolution methods. l Comparison of Root Mean Square Error (RMSE) across various deconvolution methods. Source data are provided as a Source Data file.

To test the fidelity of single-cell gene expression semi-profiling, we examined the interferon (IFN) pathway gene set, which is crucial for the innate immune response against COVID-19 [46,47,48,49]. In our semi-profiled dataset, Fig. 2g reveals IFN activation patterns that agree with those of the real-profiled dataset. The consistency in IFN activation patterns across various key cell types and severity levels, as highlighted by similar heatmaps, confirms the effectiveness of the semi-profiling technique. This consistency indicates that the critical disease-related pathways were effectively captured and maintained in the semi-profiled data.

Further, we explored the quantitative metrics of efficacy and cost-effectiveness in semi-profiling (Fig. 2h). Our analysis centered on understanding the relationship between the number of representative samples used for single-cell sequencing and the associated semi-profiling error. While an increase in representative samples intuitively raises the costs of single-cell sequencing, scSemiProfiler effectively leverages the single-cell data of these representatives for accurate inference of target samples.
As more representatives are selected, our semi-profiling method effectively reduces the normalized error (see the Methods section for detailed error computation). We also compared our method to a selection-only method, which uses the same representatives chosen by scSemiProfiler using the active learning algorithm. For each target sample, it performs the single-cell data inference in a naive manner: merely copying the corresponding representative’s data. The dashed lines show that our semi-profiling method substantially outperforms the selection-only method. The star symbol denotes the number of representatives selected for the specific semi-profiled cohort that was utilized for subsequent analyses, along with the associated error. The vertical dashed line underscores that by using the same representatives, our method achieves substantially lower errors compared to the selection-only method. Moreover, the horizontal dashed line demonstrates that the selection-only method requires a considerably larger pool of representatives to attain the same level of semi-profiling accuracy as our method. The comparative analysis between scSemiProfiler and the selection-only method highlights the deep learning model’s efficacy in reducing costs and minimizing errors.

In further evaluating the effectiveness of semi-profiling, defined as a single-cell granularity bulk decomposition, we turned to conventional cell type proportion metrics (Fig. 2i). Existing research [45] shows that PBMC cell type proportions undergo dynamic shifts with evolving disease conditions. Consistent with this, our real-profiled dataset indicates a pronounced expansion of B cells and CD14 cells under aggravated conditions, a pattern mirrored in the semi-profiled dataset.
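
Agreement between real and estimated cell type proportions, as reported in Fig. 2i–l, is summarized with Pearson correlation, Lin’s CCC, and RMSE. The sketch below shows how these three measures could be computed for a single sample. The function names and the toy proportions are illustrative assumptions, not values from the study, and the CCC is written with population (divide-by-n) variances.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))


def lin_ccc(x, y):
    """Lin's concordance correlation coefficient (population variances)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    var_x = sum((a - mx) ** 2 for a in x) / n
    var_y = sum((b - my) ** 2 for b in y) / n
    # Unlike Pearson, CCC also penalises shifts in mean and scale.
    return 2 * cov / (var_x + var_y + (mx - my) ** 2)


def rmse(x, y):
    """Root mean square error between two equal-length sequences."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))


# Hypothetical proportions of three cell types in one sample.
true_props = [0.50, 0.30, 0.20]
est_props = [0.40, 0.35, 0.25]
print(pearson(true_props, est_props))
print(lin_ccc(true_props, est_props))
print(rmse(true_props, est_props))
```

Because the CCC penalizes systematic offsets as well as poor correlation, it never exceeds the absolute Pearson coefficient for the same pair of vectors.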
The Pearson correlation coefficient [50] of cell type composition associated with different disease conditions between the semi-profiled and the real single-cell datasets consistently surpasses 0.9.

In our analysis of cell deconvolution methods, we compared scSemiProfiler with several state-of-the-art techniques, including CIBERSORTx [12,13], Bisque [14], Scaden [18], TAPE [19], EPIC [17], NNLS, and MuSiC [16], as depicted in Fig. 2j–l. Each method was tested under identical conditions, using the same bulk data and single-cell reference data. A key challenge in this analysis was the memory constraints encountered by most methods, elaborated in Supplementary Fig. S1b, which hindered their ability to process the full set of 28 representative single-cell data. To ensure a fair comparison across all methods, we limited the single-cell reference to the same initial batch of four representative samples. Additionally, we demonstrated the performance of our method using a set of 28 representatives that cannot be fully exploited by other benchmarked tools. Our results show that scSemiProfiler outperforms all benchmarked methods in deconvolution. This advantage is apparent with the smaller reference set of 4 samples and becomes more pronounced with the larger set of 28 samples, indicating that scSemiProfiler performs well with both compact and extensive single-cell reference datasets. In Supplementary Fig.
S6, we also show the high deconvolution accuracy using a side-by-side comparison of each sample’s predicted cell type proportions (using 28 representatives) with the ground truth.

Unlike existing deconvolution methods, our method is capable of “deconvoluting” the bulk samples into single-cell resolution, so we can perform de novo cell type annotation for the generated single-cell data. To further substantiate this unique functionality, we applied de novo cell type identification to the semi-profiled COVID-19 dataset. We then compared the de novo cell type annotation results, derived from the single-cell data profiles generated by our model, with the ground truth. Briefly, de novo labeling of cell types can be achieved from our semi-profiled single-cell data using a biomarker-based strategy similar to those employed in other relevant studies [51,52]. Specifically, we identified the top biomarkers associated with each cell cluster identified from the semi-profiled single-cell data. These biomarkers were then compared with known cell type markers from databases such as CellMarker [53,54] or Panglao DB [55] to annotate the cell types de novo. We compared the de novo cell type identification results with supervised cell type annotations and demonstrated the effectiveness of scSemiProfiler in de novo cell type identification (Supplementary Fig. S7).

Results presented above are based on selecting representatives using our active learning algorithm. To examine the effectiveness of scSemiProfiler’s global mode and compare it with results obtained under active learning for selecting representative samples, we employed the global mode to select the same number of representatives (28). We then used the deep generative model to infer the single-cell data for the remaining samples, thereby creating a semi-profiled dataset. This dataset underwent the same analytical procedures as those applied to the dataset generated under active learning.
The results, which are detailed in Supplementary Fig. S8, indicate that although the global mode’s performance is slightly inferior to that of the active learning mode, it remains quite comparable. However, it is important to note that, unlike active learning, the global mode cannot terminate before the predefined number of representatives has been profiled, even if fewer representatives could achieve similar performance. Therefore, this one-round global mode does not offer the same budget-minimizing flexibility as the multiple-round active learning mode.

The semi-profiled COVID-19 single-cell cohort proves reliable for single-cell downstream analyses

We have previously demonstrated the capability of scSemiProfiler in accurately generating semi-profiled single-cell data that closely aligns with its real-profiled counterpart. Moving beyond basic cell type proportion predictions, which are the primary focus of other methods, scSemiProfiler predicts gene expression for each cell within a population, thereby more faithfully mimicking true single-cell data. This advancement is crucial for more complex downstream single-cell analysis tasks.

To illustrate the effectiveness of semi-profiled data in standard downstream single-cell analyses, we conducted a series of evaluations. A key task in these analyses is the identification of biomarkers within distinct cell clusters, highlighting genes with distinct expression patterns. Utilizing the semi-profiled data generated by scSemiProfiler, we performed various single-cell level downstream analyses. The results from these analyses, as depicted in Fig. 3, demonstrate a significant consistency with outcomes derived from real-profiled data. For instance, we identified top cell type signature genes using the real-profiled cohort. When comparing their expression patterns in both real-profiled and semi-profiled datasets, the similarities were striking. The dot plots in Fig.
3a display these patterns, showing an almost indiscernible difference between the datasets. The semi-profiling at the single-cell level provides high-resolution expression data for marker genes within each cell population (cell type). Our approach reveals the distribution of marker gene expression across all cells within a specific type. While existing bulk deconvolution methods can offer average gene expression data for cell type markers, they fall short in depicting detailed gene expression distribution and variations among individual cells within the same population. This high level of cell type biomarker concordance underscores scSemiProfiler’s robustness not only in replicating single-cell data but also in ensuring the fidelity of downstream analytical processes.

Fig. 3: Comparative analyses of single-cell level downstream analysis tasks using real-profiled and semi-profiled COVID-19 datasets. a Dot plots showing the expression proportion and intensity of the identified cell type signature genes. The top half shows the real-profiled dataset, while the bottom shows the semi-profiled version. P-values are from two-sided Pearson correlation tests. b RRHO plot emphasizing the congruence between the CD4 positive and negative markers in both datasets. One-sided hypergeometric test p-values without adjustment are presented. c Visualization of the GO term enrichment outcomes based on CD4 signature genes from both dataset versions. The plot shows the union of the top 10 enriched terms, together with the Pearson correlation coefficient between the bar lengths, which are based on the corresponding p-values and therefore represent the significance levels. Benjamini-Hochberg (BH) adjusted one-sided hypergeometric test p-values are plotted. The p-value on the top is from a two-sided Pearson correlation test.
d A juxtaposition of cell-cell interaction analyses stemming from real-profiled and semi-profiled cells from moderate COVID-19 patients, underscoring the similarity in interaction types and counts. A two-sided Pearson correlation p-value is presented to show the similarity between the two interaction number matrices. e Comparative depiction of pseudotime trajectories for CD4 cells across both datasets, highlighting their striking similarity in reconstructing dynamic cellular processes. The p-value is from a two-sided Pearson correlation test. Source data are provided as a Source Data file.

We further explored the similarities between biomarkers identified using semi-profiled data and those from real-profiled data. Biomarker discovery is possible with lower-resolution data, such as the average cell type gene expression data provided by current bulk deconvolution tools. However, such data are less reliable for this purpose than single-cell data. The limitation of average cell type gene expression data is its lack of replicates for each cell type, leading to reduced statistical power. The absence of replicates makes it challenging to estimate variance within a cell type, which is essential for standard differential expression tests. In contrast, our semi-profiled single-cell data supports robust biomarker discovery through rigorous statistical testing, as it includes multiple samples (i.e. cells) for each cell type. We demonstrate the similarity of biomarkers identified using real-profiled and semi-profiled datasets through a rank-rank hypergeometric overlap (RRHO) plot. An RRHO plot [56] visualizes the overlap between two ranked gene lists, highlighting the degree of similarity and the significance of the overlap between them (see the Methods section for more details). Leveraging the RRHO plots, we compared the top 50 positive and top 50 negative gene lists associated with different cell types (Fig. 3b for CD4 cells and Supplementary Fig. S9 for all other cell types).
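
The idea behind a rank-rank hypergeometric overlap comparison can be illustrated with a small sketch. It assumes the textbook formulation, in which the top-i genes of one ranked list are compared with the top-j genes of the other using a one-sided hypergeometric test. The helper names `overlap_pvalue` and `rrho_matrix`, the step size, and the toy gene lists are hypothetical, and the implementation used in the study may differ (see its Methods section).

```python
from fractions import Fraction
from math import comb, log10


def overlap_pvalue(population, size_a, size_b, overlap):
    """Exact one-sided hypergeometric P(X >= overlap) for two gene sets of
    sizes size_a and size_b drawn from `population` genes."""
    total = comb(population, size_b)
    tail = sum(comb(size_a, k) * comb(population - size_a, size_b - k)
               for k in range(overlap, min(size_a, size_b) + 1))
    return float(Fraction(tail, total))


def rrho_matrix(ranked_a, ranked_b, step):
    """-log10 overlap p-values for top-i vs top-j prefixes of two rankings."""
    population = len(set(ranked_a) | set(ranked_b))
    cuts = range(step, min(len(ranked_a), len(ranked_b)) + 1, step)
    matrix = []
    for i in cuts:
        row = []
        for j in cuts:
            shared = len(set(ranked_a[:i]) & set(ranked_b[:j]))
            p = overlap_pvalue(population, i, j, shared)
            row.append(-log10(p) if p > 0 else float("inf"))
        matrix.append(row)
    return matrix


# Toy rankings over 200 hypothetical genes.
ranked_a = [f"g{i}" for i in range(200)]
ranked_b = [ranked_a[i ^ 1] for i in range(200)]  # adjacent genes swapped
ranked_rev = ranked_a[::-1]                       # fully reversed ranking

similar = rrho_matrix(ranked_a, ranked_b, step=50)
opposite = rrho_matrix(ranked_a, ranked_rev, step=50)
print(similar[0][0] > opposite[0][0])  # True: similar rankings overlap far more
```

Exact integer arithmetic is used here so that very small p-values are not lost to floating-point cancellation; a statistics library could be substituted for larger gene lists.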
The plots show that the positive marker lists from the two datasets are highly similar, as are the negative marker lists. A marked dissimilarity was evident between the positive and negative marker lists, as expected. By definition, positive markers are genes that are more highly expressed in the corresponding cell types, whereas negative markers are expressed at lower levels. Therefore, the two lists should not overlap. The concordance demonstrated in Fig. 3a, b supports our claim that the semi-profiled data from scSemiProfiler can viably replace real-profiled data for the task of biomarker discovery.

Next, we used the biomarkers derived from the semi-profiled dataset and those from the real-profiled dataset for gene functional enrichment analysis, assessing whether the two versions of the analysis yield consistent results. Fig. 3c compares the Gene Ontology (GO) [57,58] enrichment [59,60] outcomes derived from real-profiled and semi-profiled datasets. The top 100 signature genes from both datasets are used for the enrichment analysis. We observed an overlap of 95 genes between the two lists, yielding a highly significant hypergeometric test p-value of $9.10 \times 10^{-196}$ (the population size of the hypergeometric test is the number of highly variable genes used for this dataset, 6030). A comparison of the top 10 overlapping terms from both versions reveals nearly identical significance (Pearson correlation coefficient of 0.998 with a p-value of $4.13 \times 10^{-12}$ for comparing the significance levels). The results for other cell types are in Supplementary Fig. S10. Reactome pathway [61] enrichment analysis results are in Supplementary Fig. S11. This further corroborates the reliability of semi-profiled data in downstream analyses.

Turning to another important single-cell level downstream analysis task, we evaluated the congruence in cell-cell interaction analyses derived from real-profiled and semi-profiled datasets.
Given the important role of cell-cell interactions in orchestrating many multicellular processes, their analysis often reveals key biological insights [62,63]. Fig. 3d juxtaposes cell-cell interaction analyses based on real-profiled and semi-profiled cells from moderate COVID-19 patients (see results for other severity levels in Supplementary Fig. S12). The evident concordance in types and counts of interactions in both versions reinforces the reliability of our semi-profiled data (R = 0.996, $P < 2.23 \times 10^{-308}$). We also show a comparison of partition-based graph abstraction (PAGA) plots generated using the real-profiled cohort and the semi-profiled cohort in Supplementary Fig. S13, which demonstrates that the semi-profiled data can accurately capture the cellular trajectories and relationships between cell types. Given that such analyses intrinsically require single-cell data, scSemiProfiler is the only one of the compared approaches capable of producing data suitable for this task from bulk sources.

To further assess the capacity of semi-profiled data for other downstream single-cell level analysis tasks, we turned our attention to pseudotime analysis [29,30]. Pseudotime is an important tool for reconstructing dynamic cellular processes, ranging from differentiation pathways to developmental timelines or disease trajectories. As depicted in Fig. 3e, the pseudotime trajectories derived from real-profiled and semi-profiled CD4 cells are strikingly similar (consistent results for the pseudotime analysis of other cell types can be found in Supplementary Fig. S14). This similarity is supported by a high Pearson correlation of 0.809 (see the Methods section for details regarding pseudotime analysis and Pearson correlation test). This evidence indicates that the semi-profiled data retains its reliability even for intricate biological explorations like cell trajectory and differentiation analyses.
The same downstream analysis has also been performed for the “global selection mode” of scSemiProfiler and is presented in Supplementary Fig. S15; its results are only slightly worse than those obtained when selecting representatives using active learning.

In silico generated single-cell data offer additional insights beyond representative samples for understanding the studied cohort

Given the high similarity between the single-cell analysis results from the real-profiled and semi-profiled COVID-19 cohorts, a critical question arises: Does this similarity stem more from the accurate in silico inference of the deep generative learning model, or is it primarily due to effective representative selection through our active learning algorithm? To address this question, we divided the semi-profiled cohort into two components: the representative cells and the inferred cells, and conducted further comparative analyses. Our findings are presented from three perspectives: analysis of the entire semi-profiled cohort is more informative than analyzing only the representative cells; the high similarity of the semi-profiled cohort with the real-profiled cohort, especially considering that most cells in the semi-profiled cohort are generated by the deep generative model, underscores the effectiveness of the in silico inference; and comparisons between the analyses based on the real-profiled cohort and those using only inferred cells demonstrate the reliability and value of the inferred cells.

At the cohort level: The inclusion of generated cells significantly improves the overall data similarity to the real cohort and enhances analysis results compared to using only representative cells. The UMAP results, as indicated in Supplementary Fig. S16a–c, show that relying solely on cells from representatives fails to encompass many areas of the original cohort’s UMAP. This leads to the omission of certain cell subtypes and a lack of intra-cluster variability. 
This observation is supported by a comparison between the semi-profiling and the “Representative-only” methods shown in Supplementary Fig. S16. These plots demonstrate that semi-profiling more accurately captures the real-profiled cohort, especially in terms of cell type proportions and overall gene expression, compared to using only representatives. While a sufficient number of representatives can achieve comparable results to the real cohort for straightforward tasks such as cell type marker identification (Supplementary Figs. S17 and S18), the semi-profiled cohort still presents advantages when conducting detailed single-cell analyses, such as pseudotime (Pearson correlation: 0.809 for semi-profiled vs. 0.545 for “representative-only”, Supplementary Fig. S18d). It is also crucial to note that the effectiveness of representative data is enhanced by our selection strategies, which include bulk clustering and active learning.

At the disease condition level: Using only representative cells often fails to cover all disease conditions, which either precludes the study of specific conditions or leads to dramatically worse results compared to using the semi-profiled cohort. In the case of the COVID-19 cohort, we utilized 28 representatives out of 124 total samples for most analyses. These 28 representatives do not cover the “Asymptomatic” condition, making it challenging to investigate this specific disease condition using single-cell data. For instance, the estimation of cell type proportions for the “Asymptomatic” condition cannot be achieved using only representatives, as no cells from this condition were profiled at the single-cell level, as shown in Supplementary Fig. S16d. Moreover, even for conditions that are covered by the representatives, the analysis results are often less reliable due to a lack of statistical power or failure to capture the internal variety within each condition. For example, Supplementary Fig. 
S16d, g show that the semi-profiled cohort is significantly more similar to the real-profiled cohort than the “representative-only” version in terms of cell type proportion and pathway activation patterns. Notably, when investigating biomarkers for disease conditions, the semi-profiled dataset demonstrates a stronger similarity to the real-profiled dataset than the “representative-only” version, as illustrated in Supplementary Fig. S19. For instance, in studying the “Critical” condition, the top 100 markers identified by the semi-profiled dataset have 68 overlaps (P = 5.43 × 10^−109) with the real-profiled ground truth, whereas the “representative-only” dataset has only 30 overlaps (P = 4.31 × 10^−31). This disparity becomes even more pronounced when the number of representatives is reduced, as often happens under budget constraints in cohort studies. To illustrate this scenario with fewer representatives, we examined the COVID-19 study using just 12 representatives and compared the findings with those from the semi-profiled results (utilizing the same 12 representatives). The results, presented in Supplementary Fig. S20, demonstrate that even with only 12 representatives, the semi-profiled dataset continues to maintain high similarity in terms of disease marker analysis, while the similarity of the “representative-only” results declines further.

At the individual sample level: The reliance on single-cell data from representatives alone for studying non-representative samples is not feasible without in silico inference, as these samples are not directly profiled. This limitation highlights two major concerns: Equity, Diversity, and Inclusion (EDI) and the potential for missed scientific discoveries. 
Excluding non-representative samples from analysis introduces biases in our understanding of the disease, as these samples are not represented, potentially leading to unfair representation and oversight of specific groups that may not be well represented by the selected representatives. Moreover, excluding non-representatives risks missing unique biological traits that representatives may not exhibit. However, scSemiProfiler enables the inclusion of these non-representative samples in single-cell analyses without additional costs. The generated single-cell data for individual samples closely resemble their corresponding target samples, as evidenced by UMAP visualizations in Supplementary Fig. S2.Furthermore, representing non-representative samples directly using single-cell data from their corresponding representatives (as described in the “selection-only” method in our manuscript) can yield data too divergent from the ground truth to be meaningful. For example, Supplementary Fig. S21 shows a non-representative sample with a “Critical” severity level represented by its corresponding representative. This figure illustrates that the inferred single-cell data has cell type proportions dramatically more similar to the target sample than to the representative sample. In Supplementary Fig. S21b, we present 40 target sample biomarkers, differentially expressed genes (DEGs) of this sample against a healthy control. The left 20 DEGs show gene expression more similar to the target sample than the representatives, and the right 20 show where the representatives have more similar patterns than our inferred sample. The visualizations clearly indicate that the inferred sample is significantly more similar to the target than the representative. Even among the least similar 20 genes on the right-hand side, more genes show similar average expressions in the three samples. 
The Pearson correlations between the inferred sample and the target sample (0.904) are significantly higher than those between the representative and the target sample (0.568). Importantly, enrichment analysis in Supplementary Fig. S21c, d reveals that these DEGs are crucially relevant to the immune response, such as “regulation of immune response”, “Immune System”, “Innate Immune System”, “Cytokine Signaling in Immune System”, and several terms relevant to MHC64, interferon pathway49,65,66, indicating that using only the representative for analysis could likely lead to incorrect conclusions about the studied disease, since many of those key disease-associated immune terms would be missed. Additionally, extensive analyses have been conducted to further validate the effectiveness of the in silico generated cells. Supplementary Fig. S22a demonstrates that the deep generative learning model successfully generates the majority of cells, specifically those inferred for non-representative samples, across all datasets examined. The generally positive outcomes highlight the accuracy of these generated cells; any significant deviation from real-profiled data could compromise the overall results of single-cell analyses. Moreover, the cell type proportions obtained using only the generated non-representative cells closely align with those from the real-profiled datasets, as shown in Supplementary Fig. S22b, further affirming the reliability of the inferred cells. Our model’s detailed generative process is designed to maintain the integrity and variety of cell types, addressing concerns that replicating representatives might not accurately represent their associated target samples. Another comparative analysis between these semi-profiled non-representative cells and the original real-profiled cells within the COVID-19 dataset serves as a robust demonstration of our model’s effectiveness. 
This comparison focuses on cell type distributions and pathway activations, revealing high concordance in cell type percentages and the preservation of key functional attributes, such as interferon activation pathways crucial for the immune response to COVID-19. The similarity in these critical biological aspects is illustrated in Supplementary Fig. S23. Comprehensive downstream analyses performed using the inferred cells (including biomarker identification, pathway enrichment, cell-cell interaction studies, and pseudotime trajectory analyses) further validate the integrity and biological relevance of our generated data. The results of these analyses, detailed in Supplementary Fig. S24, demonstrate the model’s ability to produce data that is not only quantitatively robust but also qualitatively insightful, thus enhancing the dataset with a more representative spectrum of cellular states for advanced biological exploration.

Semi-profiling maintains accuracy on a heterogeneous colorectal cancer dataset

We further tested the effectiveness of scSemiProfiler by validating it against a notably heterogeneous colorectal cancer dataset67, which encompassed 112 single-cell sequencing samples that passed quality control. This collection comprised 19 normal tissues, 86 tumor tissues (including colorectal cancer subtypes iCMS2 and iCMS3), and 7 lymph node tissues. Considering the inherent diversity of this dataset, achieving accurate semi-profiling could ostensibly test the limits of scSemiProfiler. Again, this study does not include paired bulk sequencing data. Therefore, we have utilized the pseudobulk data derived from single-cell analysis as a surrogate for actual bulk sequencing. Nevertheless, following a consistent data processing and semi-profiling protocol, and selecting 36 representatives in batches of 4, the semi-profiled dataset mirrored its real-profiled counterpart. 
This congruence manifested not only in visual similarity but also in cell type distributions and subsequent analyses outcomes. Using the same estimation method as applied to the COVID-19 cohort, the total cost for both bulk and single-cell sequencing to obtain this highly similar semi-profiled single-cell cohort is approximately $73,320. This price also includes the cost of bulk data for the cohort and represents only 43.6% of the $168,000 estimated for conducting single-cell sequencing on the entire cohort.Figure 4a, b graphically highlight the similarity between the semi-profiled and real-profiled data using UMAP visualizations. These visualizations, color-coded according to cell types, show a substantial alignment between the datasets. The semi-profiled data mirrors the real-profiled data in terms of the location and shape of each cell type cluster. Notably, both datasets effectively segregate cell types such as plasma B, enteric glial, Mast, and epithelial. Additionally, a nuanced connection between fibroblast and endothelial cells is evident in both versions. Immune-centric cells like McDC, T_NK, and B cells are also accurately positioned in close proximity in both datasets, underscoring the precision of the semi-profiled data. This similarity is further emphasized in Fig. 4c, which showcases the significant overlap between the two datasets. Moreover, Fig. 4d employs distinct color schemes to differentiate between cells generated by the deep generative learning model and those from real-profiled representative data. The cells from the representatives cover a substantial portion, indicating their well-chosen representative selection. However, numerous cells that fall outside the representatives’ distribution are accurately generated by the deep learning model, highlighting the critical role of both active learning and the deep generative model in achieving effective semi-profiling. 
The accuracy of the deep learning model’s generation can be further shown in Fig. 4e, f where we present the model’s single-cell inference for an individual sample. More examples can be found in Supplementary Fig. S25. Supplementary Figs. S26–S28 provide further details regarding the cell type distribution in each of the 36 sample clusters and individual samples and how scSemiProfiler performs in generating each of these sample clusters.Fig. 4: Detailed comparisons between the semi-profiled and real-profiled data in the heterogeneous colorectal cancer dataset.a UMAP visualization of the real-profiled data, with colors denoting distinct cell types. Colors are consistent with (g). b UMAP visualization of the semi-profiled data, with colors denoting distinct cell types. c Joint UMAP visualization highlighting the close resemblance between the semi-profiled and real-profiled data. d UMAP plot of the semi-profiled dataset, with color-coding distinguishing cells from the actual sequenced representatives and the ones generated through semi-profiling. e, f UMAP visualization of the in silico inference of the target sample (MUX9009) using representative (EXT095). e UMAP showing the similarity of the reconstructed and the original representative. f UMAP showing the inferred target sample is more similar to the target ground truth than to the representative’s cells. g “Activation of immune response” gene set relative activation pattern calculated for different tumor tissue types as compared to the “Normal” type in the real-profiled and semi-profiled datasets. Entries with fewer than 500 cells are left blank. The p-value is from a two-sided Pearson correlation test. h, Performance trajectory of the scSemiProfiler on the colorectal cancer dataset, showcasing its superiority over the selection-only approach, with costs computed similarly to Fig. 2d. i Stacked bar plots comparing cell type compositions between the semi-profiled and real-profiled datasets across different tissues. 
The Pearson correlation coefficients between the real-profiled and semi-profiled tissues are LymphNode: 0.995, Normal: 0.993, iCMS2: 0.994, iCMS3: 0.988. j–l Cell type deconvolution benchmarking. Results for all N = 112 samples in the colorectal cancer dataset are shown. The interquartile ranges (IQRs) are depicted by the boxes, and the solid lines denote the mean values. Whiskers extend to points within 1.5 times the IQR from the first and third quartiles. P-values are derived from one-sided Wilcoxon tests without adjustment. j CCC between the actual cell type proportions and those estimated by various deconvolution methodologies. k The same comparison using Pearson correlation coefficients. l Root Mean Square Error (RMSE) comparisons among different deconvolution techniques, highlighting the computational efficiency and accuracy of the scSemiProfiler, especially with an extended set of representatives. Source data are provided as a Source Data file.

To further verify that the semi-profiled gene expression values are accurate and can be used for biological analysis, we computed the gene set activation pattern of the GO term “activation of immune response” (GO:0002253) for the two tumor tissue types the same way as we did for the COVID-19 cohort (Fig. 4g). We chose this term because the immune response plays a significant role in the body’s defense against cancer, and its activation or suppression can influence cancer progression and patient outcomes68,69. The gene set activation scores are calculated and then adjusted by subtracting the score of the “Normal” tissue. The activation patterns in the real-profiled and semi-profiled datasets are highly similar, leading to a high Pearson correlation coefficient of 0.909 between them.

We also quantitatively examined the overall performance of scSemiProfiler as different numbers of representatives are selected. As shown in Fig. 
4h, while our approach outperforms the selection-only method, the gap narrows in comparison to results on the COVID-19 dataset, owing largely to the colorectal cancer dataset’s inherent heterogeneity. Despite this, the deep generative model’s efficacy remains conspicuous, ensuring cost-effective error reduction. Also, if we aim to achieve an error as low as that of the previous COVID-19 cohort, which leads to almost identical analysis results as the real data, only half of the samples need to be selected as representatives. This still reduces the cost significantly.

Diving deeper into the cell type proportions within the colorectal cancer cohort, one discerns variations across different tissue types (“Lymph Node”, “Normal”, “iCMS2”, and “iCMS3”). Fig. 4i illustrates these differences. For example, “Lymph Node” contains an expanded population of B cells compared with other tissue types, “Normal” is enriched with PlasmaB cells, and the two tumor subtypes have a pronounced epithelial presence. Remarkably, the semi-profiled dataset captures these nuances with precision, underlining its capability to replicate intricate analyses with fidelity.

Lastly, the benchmarking of deconvolution results presented in Fig. 4j–l positions scSemiProfiler at the forefront, significantly outperforming most existing methods such as Bisque, TAPE, and Scaden. While its performance with four representatives is on par with CIBERSORTx, scSemiProfiler excels in computational memory efficiency. Further extending the representatives to 36 dramatically boosts the deconvolution accuracy. In contrast, existing methods such as CIBERSORT falter when tasked with handling a large reference set like 36 representatives, mainly due to their computational inefficiencies. This underscores an advantage of scSemiProfiler that its peers do not share. To provide a more detailed perspective on our deconvolution outcomes, Supplementary Fig. S29 showcases a comparative analysis. 
It displays our predicted cell type proportions for each individual sample alongside the ground truth, enabling a side-by-side evaluation.
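
The three benchmarking metrics named in Fig. 4j–l can be illustrated with a minimal sketch. It assumes that CCC refers to Lin's concordance correlation coefficient computed with population (1/n) variances; the function names and the example proportions are hypothetical and are not taken from the scSemiProfiler code base or from the datasets.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def ccc(x, y):
    """Lin's concordance correlation coefficient (population variances)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    vx = sum((a - mx) ** 2 for a in x) / n
    vy = sum((b - my) ** 2 for b in y) / n
    return 2 * cov / (vx + vy + (mx - my) ** 2)


def rmse(x, y):
    """Root mean square error between two equal-length sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))


# Hypothetical cell type proportions for one sample (not real data).
true_props = [0.10, 0.20, 0.30, 0.40]
# An estimate that preserves the ranking but is shrunk toward uniform.
est_props = [0.175, 0.225, 0.275, 0.325]

print(pearson(true_props, est_props))
print(ccc(true_props, est_props))
print(rmse(true_props, est_props))
```

In this toy case the estimate is perfectly linearly related to the truth, so the Pearson correlation is at its maximum, while the CCC and RMSE still penalize the compressed scale. This is one reason for reporting all three metrics together.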
scSemiProfiler ensures consistent downstream analyses between semi-profiled and real single-cell data in heterogeneous colorectal cancer cohorts

In the context of a heterogeneous dataset like this colorectal cancer one, the semi-profiled dataset stands robust, offering downstream analysis results that mirror the real-profiled data. This close resemblance is consolidated in Fig. 5.

Fig. 5: Downstream analysis results comparisons for the colorectal cancer dataset.
a Dot plots visualizing the cell type signature genes. The p-values are from two-sided Pearson correlation tests. b RRHO plot visualizing the comparison between semi-profiled and real-profiled markers of epithelial cells. One-sided hypergeometric p-values are plotted. c Epithelial cell type signature genes GO enrichment analysis results comparison. Benjamini-Hochberg corrected one-sided hypergeometric p-values are plotted. The p-value on the top is from the two-sided Pearson correlation test. d Cell-cell interaction results comparison between the real-profiled tumor tissue cells and semi-profiled tumor tissue cells. A two-sided Pearson correlation p-value is plotted. e Pseudotime results comparison using the epithelial cells. A two-sided Pearson correlation p-value is plotted. Source data are provided as a Source Data file.

A notable observation is the accuracy in analyzing biomarker expression patterns and their intra-cluster variation using semi-profiled data. Fig. 5a showcases dot plots for top cell type signature genes derived from the real dataset. These plots reflect an identical pattern in both the real-profiled and semi-profiled datasets. Further affirmation comes from the strong Pearson correlation coefficients between the colors (0.994) and sizes (0.996) of the dots. Notably, these correlation coefficients even surpass those observed in the more homogeneous COVID-19 dataset. 
The semi-profiled dataset also reproduces biomarker discovery results, establishing its credibility as a suitable stand-in for the real-profiled data in such analyses. This consistency is exemplified by genes like KRT18 and KRT8, exclusive to epithelial cells, corroborated by existing literature70,71. Another illustration is the unique expression of CPA3 and TPSB2 in Mast cells across both datasets. Beyond these top cell type signature genes, a granular examination of epithelial cells, encompassing the top 50 positive and negative markers, reinforces the congruence. All markers were identified using data at single-cell resolution through thorough statistical testing, a process unachievable with decomposed cell type-level data from standard bulk decomposition methods. As depicted in Fig. 5b, the preponderance of highly significant entries, with many p-values lower than 10^−50, strongly indicates a high degree of similarity in marker lists between the two datasets. RRHO plots for additional cell types can be found in Supplementary Fig. S30.

Diving deeper into the gene functional enrichment analysis, Fig. 5c offers further validation. Analyzing the top 100 signature genes across both datasets reveals 96 common markers, yielding a hypergeometric test p-value of 2.13 × 10^−183 (population size = 4053). GO terms from both datasets, along with their respective p-values, showcase pronounced similarity. Although the heterogeneity of the dataset leads to a relatively lower Pearson correlation coefficient of 0.593, the overall patterns in the two plots remain statistically similar (P = 0.042), leading to the same scientific conclusions. Moreover, when considering the union of the top 10 Gene Ontology (GO) terms from both the semi-profiled and actual datasets (comprising a total of 12 terms), there is a significant similarity in terms of enrichment p-values. Notably, the two versions of the top 10 GO terms have 9 overlapping terms. 
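
The overlap significance quoted above can be sketched as an upper-tail hypergeometric probability. This is a minimal illustration that assumes the p-value is P(X ≥ overlap) for two gene lists drawn from the same population of genes; the function name is hypothetical, and the exact test configuration used in the study is described in its Methods rather than here.

```python
from fractions import Fraction
from math import comb


def overlap_pvalue(population, size_a, size_b, overlap):
    """Upper-tail hypergeometric probability P(X >= overlap).

    X is the number of genes shared between a fixed list of size_a genes
    and a random list of size_b genes drawn without replacement from
    `population` genes. Exact integer arithmetic avoids underflow before
    the final conversion to float.
    """
    upper = min(size_a, size_b)
    favorable = sum(
        comb(size_a, k) * comb(population - size_a, size_b - k)
        for k in range(overlap, upper + 1)
    )
    return float(Fraction(favorable, comb(population, size_b)))


# Settings quoted in the text for epithelial signature genes:
# two top-100 lists, 96 shared genes, 4053 genes in the population.
p = overlap_pvalue(4053, 100, 100, 96)
print(p)
```

Under this assumed tail definition, the same function can be applied to the other overlaps reported in the text, for example 95 shared genes with a population size of 6030 for the COVID-19 cohort.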
More comprehensive GO term and pathway enrichment analysis results for other cell types’ signature genes are also consistent for the real-profiled and semi-profiled versions (Supplementary Figs. S31 and S32).Despite the increased heterogeneity of the dataset, the analysis of cell-cell interactions with the colorectal cancer semi-profiled cohort remains promising. As illustrated in Fig. 5d, the cell-cell interaction analysis, when executed on the real-profiled tumor tissue cells and the semi-profiled counterpart, reveals substantial consistency. Navigating the intricate interaction patterns characteristic of tumor tissues, the semi-profiled data astonishingly replicates the intricate layout with a robust Pearson correlation coefficient of 0.933. Both versions highlight enteric glial and fibroblast as the primary senders, while neutrophils emerge as the predominant receivers. Significantly, the most intense interactions identified across both sets involve the enteric glial with itself, the enteric glial with neutrophils, and the fibroblast with neutrophils. The cell-cell interaction results for other tissues also exhibit a high degree of similarity between the real-profiled and semi-profiled versions, as shown in Supplementary Fig. S33. Additionally, the strikingly similar PAGA plots presented in Supplementary Fig. S34 further demonstrate the utility of semi-profiled data in studying cellular trajectories and relationships between different cell types.Shifting the focus to pseudotime analysis, we evaluated epithelial cells across all tissues as an example demonstration (Fig. 5e). Given the presence of cells from tumor tissues, this might introduce elevated heterogeneity within the cell type. Yet, the consistency between pseudotime analysis results from both versions is significant. 
Both versions discern lower pseudotime values concentrated at the base of the cluster, culminating in larger values towards the upper regions, with the pinnacle being the top-right quadrant. The statistical significance of the similarity is further validated by a Pearson correlation of 0.461 with a significant p-value of 1.87 × 10^−22 (see the Methods section for details about the Pearson correlation test). The pseudotime analyses for other cell types also demonstrate a high degree of similarity, as evidenced in Supplementary Fig. S35. Such a finding underscores the capability of the semi-profiled data to adeptly capture intra-cluster nuances in detailed analyses.

Semi-profiling with real bulk measurements yields a dataset nearly identical to the original single-cell data

To further illustrate the adaptability of scSemiProfiler in real-world applications, we directed our analyses towards the iMGL dataset72, which uniquely profiles both single-cell and real bulk RNA-seq measurements for human inducible pluripotent stem cell (iPSC)-derived microglia-like (iMGL) cells, differing from the pseudobulk datasets previously used. There are 25 samples having both single-cell data and bulk RNA sequencing data. Samples are of different conditions (grown in cell culture for 0–4 days and under various treatments). The availability of such datasets, which include both single-cell and bulk sequencing data on a large scale, remains very limited, partially because performing both types of sequencing on the same large-scale cohort is rarely necessary and is prohibitively expensive. Pseudobulk, created by averaging out the single-cell data, is often quite different from the bulk sequencing from the same sample73. 
This is because single-cell data is typically noisier, largely due to shallow sequencing and the presence of “dropouts” (genes not detected) in single-cell data74,75,76.

To navigate this complexity, we devised a method to infer the target sample pseudobulk directly from its real bulk data, together with the real bulk and pseudobulk data of the representative (refer to the Methods section “Fine-tune the deep generative learning model to infer the single-cell measurements for the target samples” for comprehensive details). This technique allows us to accurately estimate the pseudobulk data of the target samples. Supplementary Fig. S36 shows that the average Pearson correlation between the inferred pseudobulk and ground truth pseudobulk data is 0.997, significantly higher than the Pearson correlation between the real bulk and ground truth pseudobulk, which is 0.880. This approach enables us to more effectively utilize our deep learning model in the pseudobulk data space, which often aligns more closely with single-cell data as it is computed by directly averaging the single-cell data without any extra bias. Through this method, despite the intricate challenges of the iMGL dataset, the results in Fig. 6 illustrate that our semi-profiled data parallels the real-profiled data quite closely. By selecting only eight representative samples out of a total of 25 samples, we could notably reduce the overall cost without compromising on accuracy. For this smaller dataset, we employed active learning to select 8 representative samples in batches of 2. Using the same estimation method as in the other two studies, the total cost for acquiring both bulk and semi-profiled single-cell data through our method is approximately $21,750. This amount is just 58% of the estimated $37,500 required for conducting single-cell sequencing across the entire cohort.

Fig. 6: Comparative analyses between semi-profiled and real-profiled iMGL datasets.
a UMAP visualization of the real-profiled iMGL cohort. 
Different colors represent different cell types. Colors are consistent with (g). b UMAP visualization of the semi-profiled iMGL cohort. c Combined UMAP visualization showcasing the consistent cell distribution across both data versions. d UMAP visualization highlighting the representatives’ cells alongside the semi-profiled cells within the semi-profiled dataset. e, f UMAP visualization of the process of generating target sample’s (iMGL_D4_rep3) single-cell data using a representative (iMGL_D2_rep3) based on deep generative learning model. e UMAP visualization of the representative cells reconstruction. f UMAP visualization of inference. Compared to representative cells, inferred cells’ distribution shows higher similarity with the target real-profiled cells (example region highlighted in red box). g Relative pathway activation pattern of the GO term “activation of immune response” calculated for cells of different treatments as compared to cell type “C1: Homeostatic non-proliferative” in the real-profiled and semi-profiled cohorts. Entries with fewer than 100 cells are left as blank. The p-value is from a two-sided Pearson correlation test. h Performance evaluation of the scSemiProfiler on the iMGL dataset, emphasizing its efficiency in error reduction. Source data are provided as a Source Data file. i A comparative illustration of cell type proportions under varying experimental conditions, accentuating the similarity in patterns between datasets. The Pearson correlation coefficients between the real-profiled and semi-profiled versions of cell type proportions under different conditions are: iMGL_D0: 0.999, iMGL_D1: 0.871, iMGL_D2: 0.993, iMGL_D3: 0.989, iMGL_D4: 0.992, iMGL_DMSO: 0.998, iMGL_GW_30: 0.960, iMGL_GW_300: 0.999, iMGL_T_30: 0.987, iMGL_T_300: 0.999. j–l Deconvolution performance benchmarking using CCC, Pearson correlation and RMSE. The analysis covers all N = 25 samples in this iMGL dataset. 
The boxes illustrate the interquartile ranges (IQRs), with solid lines indicating the means. Whiskers extend to data points within 1.5 times the IQR from the lower and upper quartiles. P-values are obtained from one-sided Wilcoxon tests without adjustment.

The UMAP visualization, as presented in Fig. 6a, b, solidifies our findings. Here, the two versions (semi-profiled and real-profiled) show a high level of consistency. The cell distributions of various iMGL subtypes (as delineated by the dataset provider) in the UMAP follow nearly identical patterns. A detailed inspection shows that clusters like “C2: Activated, immediate-early”, “C4: Activated, non-immediate-early”, and “C5: Activated, immediate-early” are interconnected and primarily located on the right-hand side. Likewise, “C3: Homeostatic, proliferative” finds its position at the upper left, with “C1: Homeostatic, non-proliferative” and “C6: Freshly thawed” lying at the bottom left. This consistency extends to Fig. 6c, which emphasizes a consistent cell distribution across the two versions. Furthermore, Fig. 6d underlines the precision of scSemiProfiler, where a majority of cells in the semi-profiled version were accurately generated. The efficient coverage of representative cells in this figure also highlights that our active learning strategy remains robust even when navigating the challenges of real bulk measurements.

UMAP visualizations in Fig. 6e, f and Supplementary Fig. S37 show scSemiProfiler’s capability of accurately reconstructing a representative sample and inferring single-cell data of a target sample from its real bulk data. Despite the similarity, there are discernible differences in cell distributions between the target and representative samples, with the semi-profiled version more closely resembling the target. The generated cells (yellow) tend to be distributed closer to the target ground truth (red) than to the representatives’ cells (blue). 
This indicates that the semi-profiled samples are denser or sparser in the same areas as their real-profiled counterparts. Importantly, these findings underscore scSemiProfiler’s sensitivity in detecting subtle differences between cell subtypes, a task that is challenging for other deconvolution methods that rely on cell type signature genes. Supplementary Figs. S38–S40 show that the cell distribution varies across sample clusters and samples, and that the semi-profiled version of the data captures these nuanced variations.

Figure 6g provides an in-depth look at the accuracy of the semi-profiled gene expression values. Since microglia are the resident immune cells of the brain, we checked the activation pattern of the GO term “activation of immune response” in each cell type under each treatment. The semi-profiled dataset presents an activation pattern highly similar to that of the real-profiled dataset.

Figure 6h offers further evidence of scSemiProfiler’s effectiveness. Here, the performance curve of the semi-profiled approach lies significantly below that of the selection-only approach, demonstrating its capability to limit semi-profiling errors while reducing costs.

An intriguing observation from the iMGL dataset is the dynamic shift in cell type proportions under various experimental conditions (Fig. 6i). The transitions from iMGL_D0 to iMGL_D4, for instance, reveal a progressive increase in the proportions of C1 and C4 cells. In contrast, “C2: Activated, immediate-early” cells peak at iMGL_D1 and then decrease steadily. Although the effects of the drugs iMGL_GW and iMGL_T need further investigation, preliminary data suggest that elevated doses result in a surge of “C1: Homeostatic, non-proliferative” cells. These variations are mirrored in the semi-profiled dataset.

A comparison of our deconvolution method with others on the iMGL dataset is informative.
The performance of CIBERSORTx and MuSiC, which was high on the other datasets, declines significantly, possibly due to the inherent challenges of real bulk data and the nuanced similarities among cell types. It has previously been reported that MuSiC fails to generate reasonable results on real bulk datasets, leading to negative CCC values [19]. Other methods, such as Bisque, TAPE, and Scaden, are more robust to these challenges and show decent deconvolution performance. Despite these challenges, scSemiProfiler remains resilient and consistently outperforms all its peers except Bisque, as corroborated by the Wilcoxon test [77] (see p-values in Fig. 6j–l). The marginal difference between scSemiProfiler and the selection-only method reflects that both approaches are near optimal performance in this specific context. Supplementary Fig. S41 further presents our accurate deconvolution for individual samples. The superior performance is attributed to the pseudobulk inference strategy described in the Methods section, which was designed to address the systematic differences between the real bulk data and the pseudobulk (the average expression of the aggregated cells) of the target sample. As shown in Supplementary Fig. S36, the real bulk expression differs significantly from the ground-truth pseudobulk of the target single-cell data, while our inferred pseudobulk closely resembles the true pseudobulk.

To further demonstrate scSemiProfiler’s ability to deal with real bulk data, we generated pseudobulk data for the iMGL dataset and performed semi-profiling using these pseudobulk data. We used the generated single-cell data to perform the same analyses and compared them with the real bulk version of semi-profiling. The results of the pseudobulk version of the semi-profiling analysis are detailed in Supplementary Fig. S42.
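
The three metrics used in the deconvolution benchmark (Fig. 6j–l) can be illustrated with a minimal sketch. It assumes the standard definitions: Lin's concordance correlation coefficient (CCC) with population variances, the Pearson correlation coefficient, and the root-mean-square error (RMSE). The function names and the toy proportion vectors are hypothetical and are not taken from the scSemiProfiler code or data.

```python
import math


def _mean(values):
    return sum(values) / len(values)


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = _mean(x), _mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ccc(x, y):
    """Lin's concordance correlation coefficient (population variances)."""
    n = len(x)
    mx, my = _mean(x), _mean(y)
    var_x = sum((a - mx) ** 2 for a in x) / n
    var_y = sum((b - my) ** 2 for b in y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    # Unlike Pearson, CCC penalizes a shift in the mean between x and y.
    return 2 * cov / (var_x + var_y + (mx - my) ** 2)


def rmse(x, y):
    """Root-mean-square error between two equal-length sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))


# Toy cell type proportions for one sample (illustrative values only).
true_props = [0.50, 0.30, 0.15, 0.05]
est_props = [0.40, 0.35, 0.20, 0.05]

print(ccc(true_props, est_props), pearson(true_props, est_props),
      rmse(true_props, est_props))
```

CCC is at most as large as the Pearson correlation in absolute value, because it also accounts for differences in mean and scale between the estimated and true proportions. This is one reason a method can correlate with the ground truth yet still receive a low or negative CCC.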
Overall, the pseudobulk-based results are very similar to those of the real bulk version, demonstrating that scSemiProfiler can handle the challenges posed by real bulk data while maintaining the high performance it achieves on pseudobulk data.

To further assess scSemiProfiler’s reliability with real bulk data and its adaptability to other species and tissues, we conducted an extensive search of the GEO database for datasets that include both bulk and single-cell data. We identified a hamster lung dataset [78], which represents a different species and tissue from the previously analyzed datasets. Our selection criteria focused on datasets with paired single-cell and bulk RNA-seq profiling of at least 15 samples, each averaging over 1000 cells. The average number of cells per sample is 1151.4 in the hamster dataset, compared with 1914.0 in the colorectal cancer dataset, 5138.3 in the COVID-19 dataset, and 2974.9 in the iMGL dataset. Most samples in the hamster dataset would not pass our previous filtering criterion, under which samples with fewer than 1000 cells are filtered out. A low number of cells significantly decreases the number of training samples for the deep generative learning model, making it hard for the model to learn the underlying data distribution. Despite its relatively lower data quality, characterized by fewer cells per sample and a smaller overall sample size, we still selected this dataset for analysis because of its unique composition and the limited availability of comparable datasets. We applied the same analytical approach as for the previous datasets, with results detailed in Supplementary Fig. S43. The close resemblance between the semi-profiled and real-profiled datasets, combined with scSemiProfiler’s significant advantage over other deconvolution methods (as evident in Supplementary Fig. S43f, h, i), confirms its effectiveness across diverse species and tissue types. By selecting only 2 out of 16 samples as representatives (Supplementary Fig. S43d indicates that these samples contained the majority of cells), scSemiProfiler successfully semi-profiled a single-cell dataset that closely mirrors the real-profiled ground truth. Additional outputs, including UMAP visualizations (Supplementary Fig. S43a–c) and precise cell type proportion estimates (Supplementary Fig. S43f–i), highlight its accuracy.

Semi-profiling using real bulk measurement also leads to reliable downstream results

In this more realistic setting, where scSemiProfiler is tasked with semi-profiling using real bulk data for downstream analyses, the semi-profiled data consistently mirrored the results from the real-profiled version. This is particularly notable given the unique challenges presented by real bulk data (see the “Fine-tune the deep generative learning model to infer the single-cell measurements for the target samples” section in Methods for a detailed description of the challenges posed by real bulk data and our strategies for resolving them). We present these downstream analysis results in Fig. 7.

Fig. 7: Similarities in downstream single-cell analyses between real-profiled and semi-profiled data for the iMGL cohort.
a Dot plots visualizing the nearly identical expression patterns of cell type signature genes across both datasets. P-values are from two-sided Pearson correlation tests. b RRHO plots highlighting the striking similarities between the top 50 positive and negative C3 markers from both real and semi-profiled datasets. P-values are from one-sided hypergeometric tests. c Overlapping GO enrichment analysis results for the top C3 signature genes, emphasizing consistent analytical outcomes between the datasets. P-values plotted are from one-sided hypergeometric tests and are Benjamini-Hochberg corrected. The p-value on the top is from a two-sided Pearson correlation test. d PAGA plots illustrating the consistent major cell type links observed in both datasets. The p-value is from a two-sided Pearson correlation test.
e Pseudotime plots affirming the topographical alignment between the real-profiled and semi-profiled cohorts. The pseudotime distributions of the two versions show high similarity (Pearson correlation 0.762, two-sided p-value 3.76 × 10^-77). Source data are provided as a Source Data file.

The markers identified using single-cell resolution data in the real-profiled dataset had almost identical expression patterns in the semi-profiled dataset. Fig. 7a shows this visually, with “C1: Homeostatic, non-proliferative” and “C10: Homeostatic, non-proliferative” having virtually indistinguishable expression patterns across both data types. Additionally, the unique expression of genes such as GAPA2 and HPGDS in “C11: Myeloid, progenitors” was consistently observed. These similarities were quantified by Pearson correlation coefficients of 0.980 and 0.989 for dot sizes and colors, respectively.

Further validating our method, Fig. 7b presents the RRHO plot of the top 50 positive and negative C3 markers from both single-cell datasets (see RRHO plots for other cell types in Supplementary Fig. S44). The degree of similarity is substantial, with most of the overlapping test entries having p-values less than 10^-50. Such findings strongly suggest that scSemiProfiler produces reliable data for biomarker discovery.

Proceeding to a more in-depth downstream analysis using the top 100 signature genes for GO enrichment, we observed an overlap of 90 genes between the semi-profiled and the real single-cell datasets (Fig. 7c). The hypergeometric test yielded a significant p-value of 1.81 × 10^-183 (the population size of the hypergeometric test equals the total number of highly variable genes used, 6013). The enriched terms identified from the semi-profiled and real datasets matched closely.
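
The gene-overlap test described above can be sketched as a one-sided hypergeometric tail probability. The sketch assumes the usual formulation: two lists of 100 genes drawn from a population of 6013, with the p-value defined as the probability of observing an overlap of at least 90 by chance. The function name is hypothetical, and the exact value depends on conventions (tail definition, software implementation) that are not specified here, so it need not reproduce the reported figure.

```python
from math import comb


def hypergeom_upper_tail(overlap, population, n_a, n_b):
    """P(X >= overlap) when n_b genes are drawn from `population` genes,
    n_a of which belong to the first list (X = size of the overlap)."""
    total = comb(population, n_b)
    k_max = min(n_a, n_b)
    tail = sum(
        comb(n_a, k) * comb(population - n_a, n_b - k)
        for k in range(overlap, k_max + 1)
    )
    # Exact integer arithmetic up to this point; one final float division.
    return tail / total


# Numbers quoted in the text: 90 shared genes, two top-100 lists,
# 6013 highly variable genes as the population.
p_value = hypergeom_upper_tail(overlap=90, population=6013, n_a=100, n_b=100)
print(f"{p_value:.3e}")
```

Using exact integer binomial coefficients avoids the underflow that a naive floating-point product would hit for probabilities this small.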
The Pearson correlation coefficient between the significance values of the enriched terms in the two versions is 0.995, underscoring the consistency of their analytical outcomes. Extended GO and Reactome enrichment analyses in Supplementary Figs. S45 and S46 further confirm the accuracy of finding signature genes using semi-profiled data.

In the case of cell-cell interactions, neither the real nor the semi-profiled single-cell data captured any significant interactions, probably because of the similarity between cell clusters in the iMGL data. Instead, the partition-based graph abstraction (PAGA) analysis [79] shown in Fig. 7d indicated that the major cell type links were consistent across datasets. This is supported by the strong Pearson correlation coefficient of 0.865 between the adjacency matrices of the two networks. The pseudotime analysis in Fig. 7e also affirmed the alignment between the two datasets: their pseudotime topographies are nearly congruent, with a Pearson correlation of 0.762 and a p-value of 3.76 × 10^-77.

Additionally, as shown in Supplementary Fig. S47, the downstream analysis results of the pseudobulk semi-profiling are similar to those of the real bulk version, further demonstrating scSemiProfiler’s ability to handle real bulk data. Furthermore, even for the challenging hamster dataset, the downstream analyses illustrated in Supplementary Fig. S48 demonstrate that the semi-profiled dataset supports robust downstream analytical tasks such as biomarker discovery, cell-cell interaction analysis, and pseudotime analysis.

In conclusion, our findings demonstrate that scSemiProfiler adapts to real-world scenarios that use real bulk data.
The downstream analytical outcomes derived from the semi-profiled data are highly consistent with those based on real data.

Active learning demonstrates its prowess in selecting the most informative samples for enhanced single-cell profiling

The crux of scSemiProfiler’s strategy is the judicious selection of representative samples. The rationale is straightforward: the more informative the chosen representatives, the better the semi-profiling performance. This not only enhances the fidelity of the generated profiles but also lowers cost by minimizing the number of representatives required.

Initially, our methodology bore similarities to uncertainty sampling [38, 80, 81], a heuristic active learning technique. Here, the intuition is to query the samples with the most uncertainty, thereby maximizing the incremental information acquired. In our algorithm, we used bulk data to pick batches of samples that differed most from their designated representatives. Since this method does not use information from the base learner (the deep generative model), it is still a passive learning algorithm.

To improve representative selection, we then turned the algorithm into an active learning algorithm by incorporating information from the deep generative models. The algorithm also uses the clustering information in the cohort and is thus a type II active learning algorithm [82]. Combining these two ideas, the algorithm aims to reduce the total heterogeneity of each sample cluster, ensuring that each target sample has a similar representative and thus optimizing semi-profiling performance.

As depicted in Fig. 8, we compared our active learning algorithm with the passive learning approach. Each panel in Fig. 8a–c presents the comparison for a distinct dataset.
The x-axis shows the number of representatives selected for single-cell sequencing, and the y-axis shows the single-cell in silico inference difficulty, which is quantified using the average single-cell-level difference (see Equation (17) in the Methods section) between target samples and their representatives. The premise is that as the dissimilarity between a target and its representative increases, the in silico inference task becomes harder. This metric is crucial because it reflects the challenge faced by the deep generative learning model in semi-profiling, and thereby the effectiveness of the strategy used for selecting representatives.

Fig. 8: Active learning demonstrates its prowess in selectively profiling the most informative samples at the single-cell level.
The x-axis represents the number of samples selected for single-cell profiling (representatives). The y-axis shows the single-cell in silico inference difficulty of the dataset, which is quantified by the average single-cell difference from each sample to its representative, showcasing the efficiency of representative selection strategies. The marked stars signify the iterations chosen for our methodology, with the generated data underpinning the analyses detailed in previous sections. a Results from the COVID-19 dataset. Active learning shows significantly better performance, especially in the beginning when a few representatives are selected. b Observations derived from the colorectal cancer dataset. Active learning continues to show significantly better performance even when more representatives are selected. c Insights from the iMGL dataset with real bulk measurements. Active learning still manages to outperform passive learning significantly. Source data are provided as a Source Data file.

The empirical evidence is clear.
Across all datasets, the active learning algorithm consistently identified representatives that considerably reduced the total distance to the other samples. This underscores the algorithm’s capability to select better representatives for semi-profiling. In Fig. 8a, for the COVID-19 dataset, active learning performs better than passive learning, especially in the early iterations of semi-profiling, reducing the inference difficulty when the same number of representatives is selected. Consider the marked point at which 28 representatives are selected, which corresponds to the representatives used for our analyses in previous sections. To reach the same level of inference difficulty, passive learning needs to select 4 more representatives; that is, the cost of single-cell sequencing for 4 samples is saved. For the colorectal cancer dataset (Fig. 8b), active learning continues to perform better than passive learning, and at the point we selected, the cost of more than two batches (8 samples) of representatives can be saved using active learning. Fig. 8c shows the results for the iMGL dataset, in which active learning is also significantly better than passive learning. In all experiments, active learning consistently outperforms passive learning in selecting representatives: with the same budget, it achieves lower inference difficulty, and to reach a comparable level of inference difficulty, it requires a lower cost.
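
The cost comparison above can be made concrete with a small sketch. Given two curves of inference difficulty versus number of representatives, it finds how many additional representatives passive learning needs to match the difficulty that active learning reaches at a chosen point. The function name and the toy curves are hypothetical illustrations; the numbers are invented for the example and are not the values plotted in Fig. 8.

```python
def representatives_saved(active, passive, n_selected):
    """Extra representatives passive learning needs to match active learning.

    `active` and `passive` map a number of representatives to the
    inference difficulty reached with that many representatives.
    Returns None if passive learning never reaches the target difficulty.
    """
    target = active[n_selected]
    # Smallest number of representatives at which passive learning
    # is at least as good as active learning at `n_selected`.
    reached = sorted(n for n, difficulty in passive.items() if difficulty <= target)
    if not reached:
        return None
    return reached[0] - n_selected


# Toy curves with a batch size of 4 (illustrative values only).
active_curve = {4: 10.0, 8: 7.0, 12: 5.5, 16: 4.8}
passive_curve = {4: 11.0, 8: 8.5, 12: 7.2, 16: 5.4, 20: 4.9, 24: 4.1}

print(representatives_saved(active_curve, passive_curve, 12))
print(representatives_saved(active_curve, passive_curve, 16))
```

In this toy setting, a saving of 4 corresponds to one batch of single-cell sequencing experiments avoided, which mirrors how the savings for the COVID-19 and colorectal cancer datasets are read off the curves.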
