scBubbletree: computational approach for visualization of single cell RNA-seq data | BMC Bioinformatics

Study A: exploring sample of five cancer cell linesGenerating the bubbletreeWe know that dataset A contains a mixture of five cancer cell lines. Hence, we expect \(k=5\) cell clusters or communities. To verify this we computed the gap statistic for the Louvain method with resolution parameter \(r=10^R\), where R was initialized using a sequence of values from \(-4\) to 1 in increments of 0.1. At each resolution r we recorded the number \(k’\) of identified communities (Supplementary Fig. S2A). WGIs were computed based on the cluster assignments and the vector of predicted cell line labels.The gap curve had a distinct ‘elbow’ at \(k’=5\) (\(r\in [0.003, 0.1]\)) (Fig. 1B). In line with these observations, we saw a steep drop in the WGIs to a value close to 0 at \(k’=5\) (\(r=0.003\)) (Fig. 1C) These results suggested that there is little benefit in using clustering resolutions that yield more than five communities.In the next step of the scBubbletree workflow we applied clustering with the original version of Louvain [13] (function get_bubbletree_graph) and resolution parameter \(r=0.003\) (\(k’=5\)). For input, we constructed a SNN graph based on \(A^{3,918\times 15}\), where each vertex (cell) was connected to its 50 nearest neighbors. Clustering was performed with 20 random starts and up to 100 iterations in each run. The clusters were organized in a bubbletree by hierarchical clustering with average linkage (tree in Fig. 1D). We ran \(B=1000\) bootstrap iterations and drew samples with up to \(N_\textit{eff}=200\) cells from each cluster to estimate inter-cluster distances and their robustness.Fig. 1scBubbletree analysis of dataset A. A scBubbletree workflow with main and alternative inputs shown with red and green labels, respectively. B Gap statistic for clustering solutions generated by the Louvain method with varying clustering resolutions. Vertical error bars are 95% confidence intervals of the gap statistic. C WGI for clustering solutions in panel B and the predicted cell lines of dataset A. D bubbletree (tree structure) annotated with a (E) heatmap displaying as rows the within-bubble relative frequencies of different cell lines. The bubbletree has five bubbles (white points) shown as leaves. Bubble radii scale linearly with the number of cells in the bubbles. Bubble identities, absolute and relative cell frequencies are shown as labels between bubbletree and heatmap. Gray branch labels are branch bootstrap support values. Rows in the heatmap integrate to 100%. F Mean normalized expression of five marker genes (x-axis) in each bubble (y-axis). G Density distribution of five marker gene (column panels) with normalized expressions of the cells in each bubble shown as violins. Numbers in heatmap tiles are rounded to the nearest tenthExamining the bubblesBubble sizes range from the smallest (bubble 4, 437 cells, 11.2% of the sample cells) to the largest (bubble 0, 1,253 cells, 32% of the sample cells). By visualizing the cell lines as categorical attributes, we saw clear mapping between different cell lines and bubbles (heatmap in Fig. 1E), i.e. cells from a specific cell lines were enriched in specific bubbles. For instance, 100% of the cells in bubble 0 and 4 belonged to cell line A549 and H1975, respectively, and between 99.5% and 99.8% of the cells in bubbles 1, 2 and 3 belong to cell lines H838, H2228 and HC827, respectively.By visualizing in each bubble the mean (normalized) expression (Fig. 1F) and the expression distributions (Fig. 1G) of five marker genes (one per cell line), we were able to confirm that bubble 0 was not only enriched with cells from cell line A549, but also had high ALDH1A1 expression which is a known A549 marker [31]. Similarly, bubble 4 had high expression of CT45A2 which is a marker of H1975 [32]. Bubbles 1, 2, and 3 were associated with high expression of SLPI, S100A9, and PIP4K2C, respectively, and based on data from the Cancer Dependency Map Portal (https://depmap.org; accessed 23.06.2022) we were able to confirm that these genes are typically over-expressed in cell lines H838, H2228, and HCC827, respectively.Examining the treeThe distance between bubble 3 and 4 was smaller than that of any other pair of bubbles in the bubbletree (Fig. 1D). This hinted at relatively higher transcriptional similarity between the corresponding cell lines HCC827 and H1975. Furthermore, bubble 3 and 4 were in a robust sub-tree that was found in all 1,000 bootstrap dendrograms. The sub-tree of bubbles 0, 3, and 4 was less robust (932 of 1,000 bootstrap dendrograms).The robust sub-tree of bubbles 3 (cell line HCC827) and 4 (cell line H1975) with the relatively short distance between these bubbles indicates transcriptional similarity of these cell lines. We checked this aspect of the tree by comparison with a set of independently measured transcriptional profiles of 69 adenocarcinoma cell lines from the Cancer Cell Line Encyclopedia [33], including HCC827 and H1975 (Supplementary Section 4). Euclidean distances were computed between the vectors of normalized gene expressions of all pairs of cell lines. The distance between HCC827 and H1975 was the 6th lowest among 2,346 pairwise comparisons (Supplementary Fig. S3), whereas the distances between the remaining cell lines of dataset A fell well within the general distribution of distances between the different adenocarcinoma cell lines. This result is consistent with the structure of the bubbletree, especially the robust closeness of bubbles 3 and 4.Dataset A had been measured with an artificially mixture of five distinct cell line populations, which explains the simple structure of the resulting bubbletree. The recovery of the ground truth of five cell lines (Fig. 1), was a sanity check for scBubbletree. In this simple case, even UMAP and t-SNE were able recover aspects of the overall structure (Supplementary Fig. S4A,B). The advantage of bubbletrees over these standard methods was the quantitative visualization of tree structure and cluster sizes. However, scBubbletree has been developed for the analysis of more complex scRNA-seq data that are typically observed with real tissue samples, as in the following study B.Study B: exploring a sample of human PBMCsClustering resolutionTo determine the clustering resolution of dataset B we computed the gap statistic for the Louvain method with resolution parameter \(r=10^R\), where R was initialized using a sequence of values from \(-4\) to 1 in steps of 0.1 (Fig. 2A). At each resolution we recorded the number \(k’\) of identified communities (Supplementary Fig. S2B).The gap increased rapidly between \(k’=1\) and \(k’=11\), followed by a dip at \(k’=12\) and a further ascend to about 4 at \(k’=24\) (\(r \approx 0.79\)), and a much slower, approximately linear increase at higher \(k’\) (Fig. 2A). The increase in the gap between \(k’=22\) and \(k’=24\) was about two-fold larger than that in the interval between \(k’=22\) and \(k’=57\), even though only two communities were added in the first interval and 23 communities (about 10-fold more) were added in the second interval. This suggested that there is little benefit in using values of r larger than 0.794 (\(k’=24\)).The value of \(k’=24\) is close to the number of 18 canonical cell populations identified in PBMCs in the Human Protein Atlas (HPA) database (https://www.proteinatlas.org/humanproteome/immune+cell).Clustering resolution analysis based on WGITo compute WGI curves we used as inputs the PBMC subtype labels from annotation set l1 and l2 and the Louvain clustering assignments generated for the gap analysis. We saw that annotation set l1 (Fig. 2B) was associated with lower WGIs compared to l2. The relatively lower WGIs of l1 are likely due to the fact that l1 contains labels of 8 major PBMC subtypes that have distinct transcriptional profiles. Hence, in each bubble of dataset B we expect a low degree of mixing in terms of different l1 labels (Fig. 2D). In contrast to this, annotation set l2 contains labels of 31 PBMC subtypes including, for instance, 13 T cell subtypes with comparable transcriptional profiles. Hence, we expect relatively higher degree of mixing of different l2 labels in common clusters in comparison to the l1 labels.The WGI curve converged to low WGIs at \(k’\approx 20\) for both annotation sets. This is similar to the clustering resolution of \(k’=24\) identified with the gap curve.Fig. 2scBubbletree analysis of dataset B. A Gap statistic for clustering solutions generated by Louvain clustering algorithm with varying clustering resolutions. Vertical error bars are 95% confidence intervals of the gap statistic. B WGI index for clustering solutions by Louvain algorithm and the cell type annotation sets l1 (gray points and lines) and l2 (black points and lines). C bubbletree with 24 bubbles (white points) shown as leaves. Bubble radii scale linearly with the number of cells in the bubbles. Bubble identities, absolute (in thousands) and relative cell frequencies are shown as labels. Gray branch labels are branch bootstrap support values. Clades of lymphocytes and monocytes are labeled with symbols ‘a’ and ‘b’. D Within-bubble relative frequencies (%) of cell types (l1; x-axis) rounded to the nearest tenth. Rows integrate to 100% (up to rounding error)Clustering and generating the bubbletreeFor clustering of dataset B we chose the original Louvain method and a resolution parameter \(r=0.794\) (\(k’=24\)). First, we constructed a SNN graph based on \(A^{161,764\times 15}\), where each vertex (cell) in the SNN graph was connected to its 50 nearest neighbors. Clustering was performed with 20 random starts and up to 100 iterations in each run. The clusters were organized in a bubbletree using hierarchical clustering with average linkage (Fig. 2C). For this we used \(B=1000\) bootstrap iterations and drew samples with up to \(N_\textit{eff}=200\) cells from each cluster to compute inter-cluster distances.The resulting bubbletree (Fig. 2C) had 24 bubbles with sizes ranging from 19,600 cells (12%, cluster 0) down to 300 cells (0.2%, cluster 23). The bubbletree had two major clades and two small outgroup bubbles (Fig. 2C). Clade ‘a’ contained 15 bubbles that accounted for about 65% of the cells in the sample, whereas clade ‘b’ contained seven bubbles that accounted for about 32% of the cells in the sample. Nearly all branches in the bubbletree were completely robust. The branch between bubble 4 and 7 had lower branch support as it was found in only 909 out of 1000 bootstrap dendrograms. A biological interpretation of the different clades and their within-clade branching patterns are provided in the following.Bubbletree evaluation with annotation set l1 with 8 canonical PBMC subtypesTo determine which PBMC subtypes are enriched in each bubble we visualized the within-bubble relative frequencies of 8 major PBMC subtypes from annotation set l1 (heatmap in Fig. 2D). Note that Figs. 2C, D demonstrate that bubbletrees (Fig. 2C) can be easily and coherently stacked with additional quantitative information (Fig. 2D), thus facilitating biological interpretation (Supplementary Fig. S5 for more extensive stacking).Figure 2D shows that most bubbles are enriched with cells from either one PBMC subtype or a combination of related subtypes (e.g. CD4+ and CD8+ T cells). Furthermore, bubbles enriched with cells from related PBMC subtypes are close to each other in the bubbletree; for instance, bubbles 10, 11, and 23 are enriched with B cells and constitute a robust subclade of the bubbletree. Similarly, bubbles 5, 3, 1, 12, 19, 13 and 16 form clade ‘b’ and are enriched with monocytes and dendritic cells (DCs).The bubbles of clade ‘a’ are mostly enriched with lymphocytes, and robust subclades within clade ‘a’ are formed by sets of bubbles that are enriched with specific lymphocyte subtypes, including T cells, B cells, and NK cells. The small outgroup bubble 21 contains a mixture of different lymphocytes. Clade ‘b’ has seven bubbles that are enriched in monocytes and DCs. The branches in that clade form a ladder-like topology in contrast to the more complex nested topology of clade ‘a’. The simpler topology in ‘a’ may be due to the lower heterogeneity of circulating monocytes compared to lymphocytes, with classical monocytes constituting about 85% of all circulating monocyte pool in humans [34].Clearly separated from clades ‘a’ an ‘b’ is the small outgroup of bubbles 20 of DCs and 18 consisting of a mixture of cells (‘other’).Supplementary Section 5 contains a bubbletree analysis based on the more detailed annotation set l2, revealing enrichment of specific PMBC subtypes in the different bubbles.Comparison with known approachesWe compared scBubbletree with UMAP, t-SNE and other popular methods for scRNA-seq data visualization with respect to visualization clarity, conservation of local and global data structure.Conservation of local structure: qualitative analysisIn the following we compare the ability of the Louvain method to conserve local structure with those of UMAP and t-SNE. For this analysis we generated conventional UMAP and t-SNE figures for datasets A and B (Supplementary Fig. S4A–D).In each figure, the different bubbles in dataset A and B were represented by heavily overplotted groups of dots in 2D UMAP and t-SNE space so that group size cannot be discerned from the figure. To our surprise, some cell lines from dataset A were split by UMAP and t-SNE into multiple subclusters; for instance, the cell line H1975 (bubble 4) had three distinct subclusters. The apparent distances between these subclusters in the drawing plane was more pronounced in t-SNE space (Supplementary Fig. S4B). Similarly, cell lines A549 (bubble 0) and H838 (bubble 1) were split into at least two subclusters of cells (Supplementary Fig. S4A–B). We realized that the cells from the smaller subclusters of A549 and H838 had systematically lower numbers of detected RNA molecules compared to the cells from the larger subclusters (Supplementary Fig. S6C–F). In contrast to this, the three subclusters in H1975 had comparable numbers of detected RNAs (Supplementary Fig. S6A–B), i.e. the subcluster structure of cell line H1975 could not be explained by differences in the number of detected RNA molecules. By using higher clustering resolution as input of Louvain we recovered these subclusters (data not shown). However, this was associated with negligible improvement in the gap and WGI (Fig. 1B–C).Several conclusions can be drawn from this qualitative analysis. First, visualization of the local structure of scRNA-seq data with UMAP and t-SNE can be challenging even for relatively small samples. Second, UMAP and t-SNE are prone to generating clearly separated visual clusters despite few underlying biological differences [7]. Furthermore, UMAP and t-SNE use a number of hyperparameters, which can have severe impact on the 2D projection [2, 7] (Supplementary Fig. S7). Taken together, it is difficult to ascertain whether the subclusters in dataset A represent biologically distinct subpopulations of the main cell lines or technical artifacts resulting from biological/technical noise.Conservation of distance structureUsers of conventional methods such as UMAP or t-SNE are tempted to interpret distances between dots or apparent clusters as indicators of transcriptional similarity of the corresponding cells. Yet, several studies [2, 35] have warned that while both, UMAP and t-SNE, consistently conserve distances on a small-scale, accurate conservation of large-scale distances is not always guaranteed.To highlight the challenge in conserving large-scale distances with conventional methods, we analyzed UMAP and t-SNE plots of a dataset with over 1.3 million mouse brain cells (Supplementary Fig. S8A–B). The UMAP and t-SNE plots suffer from excessive overplotting, which hinders visual evaluation of the distances between cells and clusters. In addition to this, we see large spread (indicated by the wide and overlapping error bars in Supplementary Fig. S8C–D) in the distances between cells in 20D PCA space over the entire range of distances in 2D UMAP and t-SNE space, which indicates ambiguous mapping of distances in 2D UMAP and t-SNE space relative to distances in 20D PCA space. We saw similar patterns in the 2D UMAP and t-SNE plots of dataset B (Supplementary Fig. S4G–H), even though dataset B is about one order of magnitude smaller than the dataset of mouse brain cells. As currently available workflows for scRNA-seq are able to generate data at similar scale and complexity as in these examples, we conclude that inference of transcriptional relationships between cells and clusters from UMAP and t-SNE will in general not be reliable.Bubbletrees by construction should better reflect distances in PCA space in the dendrogram distances. In fact, Supplementary Figs. S4I–J and S9B demonstrate that bubbletree distances are consistent with the average distances in PCA space between cells from the corresponding bubbles. This was true for distances on all scales except for intra-bubble distances, which correspond to bubbletree distance = 0 (distances between bubbles to themselves). For bubbletree distances equal to 0 we saw consistently small average distances between cells in PCA space. It is noteworthy that the spread of distances between dots in PCA space was not uniform over all pairs of bubbles, presumably because the underlying clusters deviate more or less from homogeneous spherical shapes in PCA space, leading to the dispersion of inter-bubble cell distances.Visualization clarityWe assessed the visualization clarity of scBubbletree with respect to two aspects: (1) scalability: the ability to handle large RNA-seq datasets; (2) flexibility: the ease of incorporating diverse cell attributes as part of the main visuals.Scalability We investigated scalability in the context of the overplotting problem. A high degree of overplotting was observed in the 2D UMAP and t-SNE maps generated based on dataset A and B (Supplementary Fig. S4A–D). The overplotting was particularly excessive in the 2D UMAP and t-SNE maps of dataset B, where individual dots (cells) were barely visible and clusters of cells appeared as colored blobs. There are several consequences of such massive overplotting: 1. failure to grasp data’s local and global structure; 2. inadequate description of cluster homogeneity in terms of different cell attributes; 3. loss of information about the density of cells across the different clusters; and 4. visual biases, i.e. users can deliberately or inadvertently emphasize or de-emphasize specific cell attributes by assigning the corresponding cells to the top of the cell piles [36].Various techniques have been proposed to mitigate the overplotting problem, such as, reducing the size of the dots, adding an alpha channel for point transparency, data subsetting [37], hexagonal cell binning [36], applying density-conserving dimensionality reduction [38] or exploring the data with interactive graphical tools [39]. However, as the scale of scRNA-seq datasets continues to increase these countermeasures are of limited help, and most of them even have a negative impact on the clarity of the resulting figures.Instead of visualizing individual cells, scBubbletree uses a high-level abstraction (bubbletree) to summarize scRNA-seq datasets and thereby avoids the overplotting problem. To demonstrate the high degree of scalability of scBubbletree in comparison to UMAP and t-SNE, we visualized with each approach a dataset of over 1.3 million mouse brain cells (Supplementary Fig. S8A–B and S9A). While both visuals are complex, the bubbletree approach at least offers an opportunity to understand the dataset, whereas the UMAP and t-SNE visuals lump all cell clusters into densely packed areas, making it impossible to understand the structure of the data. scBubbletree is not the only approach that employs trees to visualize scRNA-seq data, i.e. approaches such as TooManyCells [25] and clustree [40] use similar strategies to mitigate the overplotting problem.Flexibility The flexibility of scBubbletree was compared with that of UMAP, t-SNE, TooManyCells and clustree. Specifically, we assessed the ease with which each tool is able to integrate and combine multiple cell attributes. It is important to note that while these tools are functionally related to scBubbletree, there are inherent differences in the tasks they aim to solve. Therefore, a direct comparison of their results was not possible.Cell attributes are visualized in UMAP and t-SNE plots by color-coding each point according the value of a particular attribute (e.g. expression level of a marker gene). For simultaneous visualization of two or more cell attributes, it is current practice to generate multiple and often small copies of the UMAP or t-SNE plots in which the individual attributes are visualized [41]. Although this method of integrating and presenting scRNA-seq data is widely used, it suffers from the same and even more shortcomings than individual UMAP and t-SNE plots.TooManyCells is more flexible as it uses multiple facets to visualize cell attributes alongside its dendrogram e.g. by adjusting the color-code or the widths of its branches to for visual summaries (e.g. averages) of continuous cell attributes. TooManyCells can also attach pie charts to each leaf-node to visualize the within-leaf composition of a categorical attribute. Plots that encode information in such a diverse set of visual elements make it difficult for the viewer to decode and discover relationships.The clustree algorithm visualizes cell attributes as part of its main dendrogram by adjusting the different visual elements such as colors, sizes, shapes and labels of the leaf-nodes. However, clustree is less flexible than TooManyCells as only few cell attributes can be integrated simultaneously alongside the main dendrogram. One of the key advantages of clustree is that it allows users to explore relationships between clusterings at multiple resolutions, which can be used determining the optimal number of clusters. Similar functionality is implemented in scBubbletree’s function compare_bubbletrees, which enables comparison of pairs of bubbletrees generated from the same scRNA-seq data but with different input parameters such as e.g. clustering resolutions.With scBubbletree we have therefore limited the number of types of visual elements (dendrogram, bubbles, and annotations) but made it easy to attach to the tree simple, matrix-like plots (heatmaps, violins) with one matrix row for each bubble (e.g. Figs. 1D or 2D). So the tree can be virtually read from left to right or top to bottom in a consistent way. Several of these matrix-like elements can be combined without hampering analysis by visual overloading (Supplementary Fig. S5).

Hot Topics

Related Articles