Unsupervised representation learning of chromatin images identifies changes in cell state and tissue organization in DCIS

A large-scale high-resolution chromatin imaging dataset of tissue microarrays enables the analysis of disease stages and phenotypic categories in DCIS

Cellular chromatin organization is highly informative of a cell’s functional state within the tissue microenvironment, including its gene expression profile, cell type, and health30. Chromatin staining is routinely employed in imaging experiments. While pathologists have used chromatin staining to predict disease stage, it is more commonly used as a fiducial marker for nuclear segmentation and the identification of cell centroids29,38. In this paper, we used chromatin staining to generate a large-scale dataset to compare different phenotypic categories in non-tumor, DCIS, and IDC patients. For this, we imaged 560 tissue microarray (TMA) samples from 122 patients at 3 disease stages and 11 phenotypic categories (ranging from normal breast tissue to hyperplasia, DCIS, and IDC) as annotated by pathologists (Fig. 1a and Supplementary Data 1, “Methods”). In addition to chromatin staining using Hoechst, the tissue microarrays were co-stained with one or two protein markers (Fig. 1b, “Methods”). The protein stains include cytokeratin, α-smooth muscle actin (α-SMA), type 1 collagen (collagen1), ki67, and ɣh2ax. Furthermore, we obtained tissue masks of the breast ducts using manual thresholding based on the cytokeratin expression levels, and we segmented the nuclei using StarDist39 (“Methods”, Supplementary Fig. 1). The duct and nuclear segmentations were examined by a pathologist and considered accurate, i.e., equivalent to an accurate manual segmentation (Supplementary Figs. 2–9). In the following, we demonstrate that a machine-learning-based framework can infer the cell state changes in DCIS based on simple chromatin staining without the use of highly multiplexed staining or gene expression measurements.
Importantly, the use of chromatin images allows for quantitative characterization of disease stages in terms of cell states and their relative spatial organization.

Unsupervised learning on single-cell chromatin images identifies morphologically distinct cell clusters that correlate with disease stages and phenotypic categories in DCIS

To learn a representation of cell state from chromatin images, we trained a convolutional variational autoencoder (VAE), a neural network architecture widely used for representation learning (“Methods”, Supplementary Fig. 10a,b)40. We used a VAE setup similar to that of a previous study41, which demonstrated that the resulting VAE latent features of chromatin images are informative of cell state and can be used to predict RNA expression. By clustering the VAE’s latent representations of the input chromatin images (Fig. 2a), we identified eight cell states based on their distinct nuclear morphometrics and chromatin organization. We further divided these eight major cell states into subclusters, stopping when further division would produce subclusters that were identical in terms of the distribution of pathologies and protein expression (Supplementary Fig. 11). The clusters identified by our autoencoder exhibit distinct nuclear morphology and chromatin organization features across the different cell states and substates (Fig. 2b and Supplementary Fig. 11b). Importantly, cells that are clustered together are similar to each other across the different disease stages (Fig. 2b and Supplementary Fig. 11b). We observed that all cell states exist in each of the disease stages and phenotypic categories, as annotated by pathologists, but in different proportions (Fig. 2c and Supplementary Fig. 11a). The same trend of cell state distribution was observed when examining each TMA core individually: namely, all cell states exist in almost all cores but in different proportions (Fig. 2e).
Clusters 0, 1, and 2 are enriched in phenotypic categories of the non-tumor stage, while clusters 5, 6, and 7 are enriched in DCIS or invasive stages (Fig. 2c, e).

Fig. 2: Extracting and clustering single-cell chromatin image features through the use of an autoencoder framework results in the identification of morphologically distinct cell states in DCIS.
a An example of an input and reconstructed single-cell chromatin image by our convolutional variational autoencoder (VAE) framework. The latent representation of the chromatin images was clustered into eight top-level clusters. The same number of cells were selected from each of the 11 phenotypic categories for clustering (24,224 cells per stage) so that the clustering was not dominated by the cells from one particular stage. b Randomly selected examples of nuclei in each of the eight clusters in four representative phenotypes. DCIS ductal carcinoma in situ; IDC invasive ductal carcinoma. c Heatmap showing the fraction of cells in each of the eight top-level clusters in each phenotypic category organized into the three disease stages, non-tumor, ductal carcinoma in situ (DCIS), and invasive ductal carcinoma (IDC), calculated based on the cells used for clustering in (a). Columns were normalized to sum to 1. Histograms show the total number of cells in each cluster and in each phenotypic category. All cells were included for computing the histograms except for the cells in the held-out samples (“Methods”, Supplementary Fig. 12a). d The expression of each protein marker in each of the eight clusters. Columns were normalized to sum to 1. α-SMA: α-smooth muscle actin; collagen1: type 1 collagen. e Fractions of cells in each of the eight top-level clusters within each sample. The color coding of the clusters is the same as in (a) and (c).
DCIS ductal carcinoma in situ; IDC invasive ductal carcinoma.

Our finding that clusters 0, 1, and 2 indicate healthier cell states and clusters 5, 6, and 7 indicate more malignant cell states is corroborated by the additional protein stains (Fig. 2d). For example, the average cytokeratin level is increased in cell states that are more enriched in the tumor stages and the ɣh2ax expression level is the lowest in clusters 0 and 1 (clusters enriched in the non-tumor stage), indicating fewer DNA double-strand breaks41. The protein stains were not used in training the VAE nor for clustering and thus provide an orthogonal measurement that supports the inferred cell states. The subclusters identified by our model also exhibit differences in both the distribution of phenotypic categories and protein expression levels, indicating that the subclusters also identify biologically meaningful cell states (Supplementary Fig. 11). Applying our trained autoencoder and clustering models to the held-out samples provided additional validation for the identified clusters and subclusters and their association with DCIS (Supplementary Fig. 12a). We further validated our cell state assignment by comparing the cell states inferred by our model with nuclear grades assigned by a pathologist. While the pathologist was blinded to our cell state assignment, we observed a positive correlation between the pathologist-assigned severity in nuclear grade and the malignancy of the cell states assigned by our model (Supplementary Figs. 13, 14a, c, d). Furthermore, consistent with the findings of our model, nuclei of all pathologist-assigned grades exist in each of the disease stages (Supplementary Figs. 13, 14b, e).
These observations demonstrate that an unsupervised machine learning framework applied to simple and cost-effective chromatin images is able to identify morphologically distinct and disease-relevant cell states.

Pseudo-time ordering of the cells in the autoencoder latent space orders the cell states by their enrichment in different disease stages

To further examine the validity of analyzing the different disease stages and phenotypic categories based on cell states, we confirmed that cells within the same cell state are indistinguishable from each other, even when the cells are from TMA cores from a different disease stage or phenotypic category. Toward this, we trained a neural network classifier that predicts the phenotypic category based on the latent representation of cells computed by the convolutional autoencoder (Fig. 3a, “Methods”). The classifier is unable to distinguish cells from different phenotypic categories within a particular subcluster (Fig. 3a and Supplementary Fig. 15), confirming that cells within a subcluster are indistinguishable from each other. This lends additional support to our observation that all cell states exist in all disease stages and phenotypic categories.

Fig. 3: All cell states are present in all disease stages, and the cell state ordering obtained in the autoencoder latent space is aligned with the enrichment of each state as a function of disease stage.
a A neural network classifier was trained to classify the phenotypic category of an input cell based on the variational autoencoder (VAE) latent representation of its chromatin image as input. A separate classifier was trained for each of the subclusters of the eight top-level clusters, with 5% of all cells held out for validation and 10% held out for testing. Confusion matrices were computed based on the cells in the test set and are shown for subcluster 0 of cluster 0 (left) and cluster 7 (right).
b A network indicating the similarities between all subclusters based on the VAE latent representation was computed using the PAGA method43. Each node represents a subcluster, and its size is proportional to the number of cells in the subcluster. Subclusters within the same top-level cluster are shown in the same color. Each node is labeled by top-level cluster assignment followed by subcluster assignment. For example, 3-0 means subcluster 0 of top-level cluster 3. c Diffusion pseudotime44 as a measure of cell state similarities was computed using a randomly selected cell from subcluster 2 of cluster 7 as the root cell. The visualization was obtained using Uniform Manifold Approximation and Projection (UMAP) initialized by the subcluster positions in the PAGA graph shown in (b). d Visualization of the average cytokeratin expression on the PAGA graph for all cells stained for cytokeratin in each of the subclusters. Each dot represents a subcluster, and its size is proportional to the number of cells in the subcluster.

In addition to our clustering-based analysis described in the previous section, we obtained a pseudo-ordering of the identified cell states by applying the PAGA42 and the diffusion pseudotime43 methods to the latent representation of cells learned by our autoencoder (Fig. 3b, c). The clusters enriched in the non-tumor stage (i.e., clusters 0, 1, and 2) and the clusters enriched in the tumor stages (i.e., clusters 5, 6, and 7) are at the two extreme ends in the Uniform Manifold Approximation and Projection (UMAP) visualization of the VAE latent space (Fig. 2a), further corroborating the identified clusters and their association with DCIS.
Consistent with this observation, the pseudo-ordering inferred by PAGA identifies that the cluster enriched in the non-tumor stage (cluster 0) and the cluster enriched in the DCIS or invasive stages (cluster 7) are the least similar to each other, with other clusters ordered in between the two clusters based on the proportions of healthy and diseased stages (Fig. 3b). This observation was also confirmed with an additional method, diffusion pseudotime43, for which we randomly chose a cell in cluster 7 as the root cell. Applied to the autoencoder latent representations, the diffusion pseudotime method also orders the clusters from 0 to 7 according to their enrichment in the non-tumor and tumor stages. While the disease stage annotations were not used in training the autoencoder, the learned UMAP representation, latent clustering, PAGA, and diffusion pseudotime all independently derived the same order of cell states, which reflects the change in the enrichment of the cell states in the non-tumor, DCIS, and IDC stages. This result is further corroborated by the change in cytokeratin expression along the PAGA graph (Fig. 3d). Importantly, this demonstrates that the latent representations identified by our autoencoder have automatically captured meaningful chromatin features that correspond to disease stages, without using any knowledge of the disease stage during the autoencoder training.

Unsupervised features learned by the autoencoder from chromatin images identify interpretable nuclear and chromatin morphometric features

We assessed whether the difference between the cell states computed by clustering the autoencoder latent space could be explained by interpretable morphological features. The values of a set of 201 manually curated features of nuclear morphology and chromatin organization (NMCO)36 were computed for each cell based on its chromatin image. The NMCO features include features related to the radius, curvature, and image moments (Fig. 4b).
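
As a rough illustration of the kind of measurement grouped under radius and image moments, the sketch below computes a few shape descriptors from a binary nuclear mask. It is not the NMCO feature set or its implementation: the function name `simple_nuclear_features` and the specific definitions used here (a 4-connected boundary, centroid-to-boundary radii, and an aspect ratio derived from second-order central moments) are assumptions chosen for brevity.

```python
import numpy as np


def simple_nuclear_features(mask):
    """Illustrative shape descriptors for one binary nuclear mask (hypothetical helper)."""
    mask = np.asarray(mask, dtype=bool)
    rows, cols = np.nonzero(mask)
    area = rows.size
    cy, cx = rows.mean(), cols.mean()

    # Boundary pixels: foreground pixels with at least one background 4-neighbor.
    padded = np.pad(mask, 1)  # pads with False
    interior = (
        padded[:-2, 1:-1] & padded[2:, 1:-1] & padded[1:-1, :-2] & padded[1:-1, 2:]
    )
    by, bx = np.nonzero(mask & ~interior)
    radii = np.hypot(by - cy, bx - cx)  # centroid-to-boundary distances

    # Second-order central moments: the eigenvalues of the pixel coordinate
    # covariance are the variances along the principal axes of the shape.
    cov = np.cov(np.stack([rows - cy, cols - cx]))
    minor, major = np.linalg.eigvalsh(cov)  # ascending order
    aspect_ratio = float(np.sqrt(major / max(minor, 1e-12)))

    return {
        "area": int(area),
        "mean_radius": float(radii.mean()),
        "std_radius": float(radii.std()),
        "aspect_ratio": aspect_ratio,
    }


# Usage: a disc has an aspect ratio near 1 and a small radius spread,
# whereas an elongated rectangle has a large aspect ratio.
yy, xx = np.mgrid[:41, :41]
disc = (yy - 20) ** 2 + (xx - 20) ** 2 <= 15 ** 2
rect = np.zeros((11, 31), dtype=bool)
rect[3:8, 5:26] = True
disc_features = simple_nuclear_features(disc)
rect_features = simple_nuclear_features(rect)
```

In this toy setting, a high standard deviation of the radius corresponds to an irregular nuclear periphery, and the aspect ratio separates elongated from circular shapes.
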
A neural network classifier was trained to predict the eight top-level clusters and the subclusters from the NMCO features (Fig. 4a). The confusion matrix of the classifier’s predictions for cells not used in training shows negligible test error across the eight clusters (Fig. 4a and Supplementary Fig. 16). This demonstrates that the disease-relevant morphological features learned by the autoencoder can be characterized by human-interpretable features.

Fig. 4: Cell state differences can be characterized by interpretable morphometric features, indicating morphological changes that are aligned with or orthogonal to disease progression.
a A neural network classifier was trained to predict the cluster label of each cell based on 201 hand-crafted nuclear morphology and chromatin organization (NMCO) features36. The same number of cells were randomly selected from each phenotypic category (“Methods”), and we used a training, validation, and testing split of 85%, 5%, and 10%. The resulting confusion matrix based on the cells in the test set shows that most cells were correctly classified to their true cluster assignment. b Representative examples of NMCO features in each group described in (c). The full list of NMCO features is provided in Supplementary Data 2. c NMCO features that are significantly different in at least one of the eight top-level clusters grouped by correlation: Each of the 201 NMCO features was tested for whether its mean in any of the eight clusters was different from the mean in cells outside of that cluster, which resulted in 117 significantly different NMCO features (“Methods”); highly correlated features were grouped together resulting in 9 groups; the remaining features not in the 9 groups are labeled as group 10 (“Methods”). The heatmap shows the mean of the 117 NMCO features (columns) in each of the eight top-level clusters (rows).
d Mean of the NMCO features in group 1 averaged over all cells in each of the subclusters, visualized on the PAGA graph shown in Fig. 3b. Each node represents a subcluster, and its size is proportional to the number of cells in the subcluster. Each node is labeled by top-level cluster assignment followed by subcluster assignment. For example, 3-0 means subcluster 0 of top-level cluster 3. e Mean of NMCO features in group 2 computed for each cell, visualized on the UMAP plot initialized by the subcluster positions in the PAGA graph. f Mean of NMCO features in group 3 computed for each cell, visualized on the UMAP plot initialized by the subcluster positions in the PAGA graph.

We further analyzed the NMCO features that were altered across the cell states both along and orthogonal to the different disease stages. A subset of 117 NMCO features with statistically significant differences in at least one of the eight top-level clusters was identified (FDR < 0.01, fold change with respect to all cells >1.2 or <0.8, z-score > 0.5). The selected features were divided into 9 groups by merging features with high correlations; this grouping is robust to the choice of correlation threshold (correlation >0.8 for features in the same group; “Methods”; Fig. 4b, c and Supplementary Fig. 17). As expected from the representative cell images in each of the eight top-level clusters (Fig. 2b), cell size related terms are in the first group of NMCO features that increase from the healthy to the malignant cell states (cluster 0 to cluster 7) (Fig. 4b–d and Supplementary Data 2). Interestingly, terms characterizing curvature of the cell nuclei are also strongly correlated with the size-related terms (Fig. 4b and Supplementary Data 2 Group 1). Other NMCO groups that change along the disease stages include terms related to average curvature, homogeneity, and central image moments (Fig.
4b and Supplementary Data 2 Groups 4–6 and 9).

Changes in NMCO features that are orthogonal to the disease stages from cluster 0 to 7 include changes in the nuclear aspect ratio, homogeneity, and smoothness of the nuclear periphery. For example, Group 2 NMCO features contain two aspect ratio terms that show a change from more elongated nuclei to more circular nuclei that is orthogonal to the disease stages (Fig. 4b, c, e). Also, Group 3 NMCO features change orthogonal to the disease stages and contain features that collectively describe the smoothness of the nuclear periphery, homogeneity of the nuclei, and circularity (Fig. 4b, c, f and Supplementary Data 2 Group 3). These features indicate that nuclei that are more circular tend to be less homogeneous, which suggests that more circular nuclei have more heterochromatin content, leading to a decrease in homogeneity. It is also interesting to note that the most malignant cell state, cluster 7, seems to contain cells with circular nuclei and a non-smooth nuclear periphery. This is evident from the low value of inverse circularity (shape factor) and high standard deviation of nuclear radius. In addition, we observed that many top-level clusters form subclusters along the direction orthogonal to the disease stages (Fig. 3b), which suggests that changes in these orthogonal NMCO features are associated with subcluster-level differences. These analyses demonstrate that combining an autoencoder framework with known manually curated features can provide a morphological interpretation of how cell states change along and orthogonal to the disease stages from non-tumor to DCIS and invasive stages.

The position of cells relative to breast ducts is dependent on both cell state and disease stage

The DCIS to IDC transition is characterized by the pathological proliferation of luminal cells inside the breast duct and the penetration of tumor cells through the ductal membrane to the surrounding stroma4.
We hypothesized that such reorganization could be identified using the chromatin features and location of the cell states identified by the autoencoder framework relative to the breast ducts. We first compared cells of the same cell state that were inside versus outside of the ducts to examine if there were differences in the chromatin organization of the cells that were not captured by the clusters identified based on the latent representations. Toward this, we trained a neural network classifier to distinguish between cells inside and outside of the breast ducts. The classifier was unable to distinguish cells from the same subcluster that were inside versus outside the duct, using either the autoencoder latent representations or the NMCO features as input (Fig. 5a and Supplementary Fig. 18a–c). This confirmed again that cells within a subcluster are indistinguishable and it is thus meaningful to perform a spatial analysis of cells with respect to the breast ducts at the level of cell states identified by our autoencoder framework.

Fig. 5: The position of a cell relative to the breast ducts is dependent on both cell state and disease stage.
a A neural network classifier was trained to predict whether a cell is inside or outside of breast ducts, given the duct segmentation masks derived from cytokeratin expression (“Methods”). A separate classifier was trained for each of the subclusters of the eight top-level clusters, with 5% of all cells held out for validation and 10% held out for testing. Confusion matrices were computed based on the cells in the test set and are shown for all subclusters of cluster 0. b Distance of each cell to the closest breast duct. Cells inside ducts were assigned a distance of 0. For all other cells, the distance was measured from the centroid of a cell to the nearest cell inside any duct, measured in number of pixels (#pixels), and log-transformed. A value of 1 corresponds to around 0.49 µm and 5 corresponds to around 26.71 µm.
c The average distance of cells to the closest breast duct was computed for each subcluster and visualized on the PAGA graph. Each node represents a subcluster, and its size is proportional to the number of cells in the subcluster.

Our analysis revealed that none of the cell states were exclusively inside breast ducts and almost all cell states had cells both inside and outside of ducts, regardless of disease stage or phenotypic category (Supplementary Fig. 18d). We further incorporated distances of cells to the nearest breast duct into the analysis, assigning a distance of zero to cells inside the ducts (Fig. 5b). In all disease stages and phenotypic categories, the cell states enriched in the non-tumor stage were found to be further away from breast ducts than the cell states enriched in the tumor stages (Fig. 5c). In addition to this difference in top-level clusters, subclusters also show differences in their distances to ducts, e.g. subcluster 1 of cluster 3 tends to be closer to ducts than the other subclusters of cluster 3 (Fig. 5c). Comparing samples annotated as healthy breast tissue to DCIS phenotypic categories, the healthy cell states were found to be relatively closer to the breast ducts in healthy breast tissue than in DCIS samples, while in the DCIS samples, e.g. in DCIS with early infiltration, the malignant cell states were relatively closer to breast ducts in comparison to the healthy cell states. Performing the same analysis within each individual TMA core revealed consistent findings, with some variation in the DCIS samples (Supplementary Fig.
19), which showed a range of cell state distributions including some that were similar to breast tissue samples with all cell states close to breast ducts, as well as samples that were more similar to later stages with increasingly more malignant states far from ducts.

Cell state co-localization pattern is predictive of disease stage and phenotypic category

In addition to the different proportions of cell states in different disease stages, our analysis in the previous section suggests that the spatial organization of the different cell states in the breast tissue might also be informative of tumor stage and phenotypic category. This is consistent with a previous study using highly multiplexed imaging to show that proximity between cell types accounts for 18% of features that are different between normal, DCIS, and invasive samples29. Such studies require a large amount of manual labeling or highly multiplexed imaging to obtain sufficient samples with accurate cell type annotation. In the following, we demonstrate that a predictive model of disease stage and phenotypic category can be obtained by using only the spatial neighborhood of the cell states learned from standard chromatin staining.

We first compared the proportions of cell states in the neighborhood of a target cell for each of the eight cell states to a random distribution of cell states in space (Fig. 6a, b and Supplementary Fig. 20a). We used a neighborhood diameter of ~52 µm, which can result in visually distinct clusters of image patches and also corresponds to the image patch size used to train the convolutional VAE (Supplementary Fig. 20a). The resulting cell state distribution was then averaged over cells in the same state for a given phenotypic category, and this was then compared to a random assignment of cell states in the same samples (“Methods”).
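
The neighborhood comparison described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions rather than the procedure in “Methods”: the function names are hypothetical, distances are computed by brute force, the target cell is excluded from its own neighborhood, cells with no neighbors contribute an all-zero row, and the default number of shuffles is kept small. `radius` must be given in the same units as `coords`.

```python
import numpy as np


def colocalization_matrix(coords, labels, radius, n_states):
    """Row s holds the mean neighborhood composition around cells in state s."""
    coords = np.asarray(coords, dtype=float)
    labels = np.asarray(labels, dtype=int)
    # Brute-force pairwise distances; a spatial index would be needed at scale.
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    neighbors = (dist <= radius) & ~np.eye(len(coords), dtype=bool)
    one_hot = np.eye(n_states)[labels]              # (n_cells, n_states)
    counts = neighbors.astype(float) @ one_hot      # neighbor counts per state
    totals = counts.sum(axis=1, keepdims=True)
    props = np.divide(counts, totals, out=np.zeros_like(counts), where=totals > 0)
    matrix = np.zeros((n_states, n_states))
    for state in range(n_states):
        in_state = labels == state
        if in_state.any():
            matrix[state] = props[in_state].mean(axis=0)
    return matrix


def fold_change_vs_shuffle(coords, labels, radius, n_states, n_shuffles=200, seed=0):
    """Observed co-localization divided by the mean over label shuffles."""
    rng = np.random.default_rng(seed)
    observed = colocalization_matrix(coords, labels, radius, n_states)
    random_mean = np.zeros_like(observed)
    for _ in range(n_shuffles):
        shuffled = rng.permutation(np.asarray(labels))
        random_mean += colocalization_matrix(coords, shuffled, radius, n_states)
    random_mean /= n_shuffles
    return np.divide(
        observed, random_mean, out=np.ones_like(observed), where=random_mean > 0
    )


# Usage: two spatially separated groups of cells, each in a single state.
coords = [(0, 0), (1, 0), (0, 1), (100, 0), (101, 0), (100, 1)]
labels = [0, 0, 0, 1, 1, 1]
observed = colocalization_matrix(coords, labels, radius=5, n_states=2)
fold_change = fold_change_vs_shuffle(coords, labels, radius=5, n_states=2)
```

In this toy example, each state is surrounded only by itself, so the diagonal fold changes exceed 1 and the off-diagonal entries are 0. This is the pattern that positive diagonal values indicate in a co-localization plot.
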
We observed that cells tend to cluster by cell states, regardless of the disease stage of the sample; this can be seen by the positive diagonal values in the neighborhood co-localization plots (Fig. 6b). In addition, cell states enriched in the non-tumor stage (i.e., clusters 0, 1, and 2) are more likely to co-localize, especially in DCIS or the invasive stage (Fig. 6b). Similarly, cell states enriched in DCIS or the invasive stage (i.e., clusters 4, 5, 6, and 7) also tend to co-localize, and this can be observed in all tumor stages as well as in normal breast TMAs (Fig. 6b). On the other hand, the co-localization of healthy and malignant cell states occurs less than expected by chance.

Fig. 6: Cell state co-localization pattern is predictive of disease stage and phenotypic category, and the predictiveness is dependent on the co-localization of all cell states collectively rather than a single cell state.
a Within a 25.9 µm radius around each cell, we compute a vector representing the proportions of cells in the neighborhood in each of the top-level clusters. The neighborhood size corresponds to the image patch size used to train the convolutional VAE (Supplementary Fig. 20a) that results in visually distinct clusters of image patches. b Cell state co-localization compared to a random distribution of cell states is plotted for representative phenotypic categories. The neighborhood proportion vectors of all cells within each of the eight clusters were averaged, respectively, giving rise to an 8 × 8 co-localization matrix representing for each cluster the proportion of neighboring cells within each cluster. For comparison, we randomly shuffled the cluster assignment of all cells within each sample 40,000 times and computed the resulting co-localization matrices (“Methods”). The fold-change of each entry in the observed co-localization matrix was computed with respect to the averaged random co-localization matrix. c The per-sample co-localization matrix was computed.
A neural network classifier was trained to predict the phenotypic category of a sample from its co-localization matrix and the total number of cells in the sample (“Methods”). The confusion matrix shows the result of leave-one-patient-out cross-validation. d An ablation study was performed by removing cells from one of the eight clusters in the calculation of the co-localization matrix. e A neural network classifier was trained to predict the phenotypic category of a sample from the 7 × 7 co-localization matrix (where one of the clusters was ablated) and the total number of cells in the sample. f Classification error of the ablation study is plotted using leave-one-patient-out cross-validation. None means that all clusters were used as in (c) and each number indicates the ablated cluster. The classification errors are divided into 6 types and labeled as the true phenotypic category of the sample -> predicted phenotypic category of the sample. Non-tumor consists of “P0. Breast tissue”, “P1. Cancer adjacent breast tissue”, and “P3. Hyperplasia”. DCIS consists of “P5 + P6. DCIS and breast tissue” and “P7 + P8. DCIS with early infiltration”. Invasive consists of “P9. IDC and breast tissue” and “P10. IDC”.

Next, we analyzed whether the co-localization matrix of cell states could be used as a predictor of the tumor stage and phenotypic category of a given sample. Toward this, we trained a 3-layer neural network classifier that used only the 8-by-8 co-localization matrix computed from a single sample and the cell density of the sample as the input to predict the phenotypic category of the sample (Fig. 6c, “Methods”). The confusion matrix resulting from leave-one-patient-out cross-validation shows that the phenotypic category of an unseen patient can be predicted with high accuracy (Fig. 6c).
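
Leave-one-patient-out cross-validation holds out all samples of one patient at a time, so that no patient contributes to both training and evaluation. The sketch below illustrates only the splitting and the accumulation of a confusion matrix. To stay self-contained it substitutes a nearest-centroid classifier for the 3-layer neural network, and all function names, as well as the feature layout in the toy data, are assumptions made for illustration.

```python
import numpy as np


def leave_one_patient_out(patient_ids):
    """Yield (train_idx, test_idx) pairs, holding out one patient per fold."""
    patient_ids = np.asarray(patient_ids)
    for patient in np.unique(patient_ids):
        held_out = patient_ids == patient
        yield np.flatnonzero(~held_out), np.flatnonzero(held_out)


def nearest_centroid_predict(X_train, y_train, X_test):
    """Stand-in classifier (NOT the paper's neural network)."""
    classes = np.unique(y_train)
    centroids = np.stack([X_train[y_train == c].mean(axis=0) for c in classes])
    dist = np.linalg.norm(X_test[:, None, :] - centroids[None, :, :], axis=-1)
    return classes[dist.argmin(axis=1)]


def lopo_confusion(X, y, patient_ids, n_classes):
    """Confusion matrix (rows: true class, columns: predicted class)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=int)
    confusion = np.zeros((n_classes, n_classes), dtype=int)
    for train_idx, test_idx in leave_one_patient_out(patient_ids):
        predicted = nearest_centroid_predict(X[train_idx], y[train_idx], X[test_idx])
        np.add.at(confusion, (y[test_idx], predicted), 1)
    return confusion


# Usage with toy per-sample features; in the setting described above each row
# would be a flattened co-localization matrix plus a cell-count feature.
X = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
y = [0, 0, 0, 1, 1, 1]
patients = ["a", "a", "b", "c", "c", "d"]
confusion = lopo_confusion(X, y, patients, n_classes=2)
splits = list(leave_one_patient_out(patients))
```

Grouping the split by patient rather than by sample matters because several TMA cores can come from the same patient, and a per-sample split would let the classifier see other cores of the held-out patient during training.
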
In particular, if we group the phenotypic categories into the three disease stages, non-tumor (breast tissue, cancer adjacent breast tissue, hyperplasia), DCIS, and invasive, then the classification error is below 17% with less than 5% misclassification rate of the invasive stage (Fig. 6f). We also tested other neighborhood sizes: in addition to a size of 52 µm used above, we found the co-localization pattern to be robustly predictive for neighborhood sizes of 26 µm to 120 µm (Supplementary Fig. 20b). This shows that the predictiveness of the disease stage or phenotypic category is not sensitive to the exact choice of the neighborhood size. Importantly, our analysis of cell state co-localization takes into account all cells in the TMA cores, including both stromal and epithelial cells. Compared to a classifier trained using the co-localization of only ductal cells (Supplementary Fig. 25), our classifier that also incorporates stromal cells has higher classification accuracy, indicating that the microenvironment change in the different disease stages is not limited to the ductal regions.

Finally, we analyzed whether the neighborhood of certain cell states contributes more to the prediction of the tumor stage. Toward this, we retrained our classifier model after removing cells from a particular cell state and re-calculating the resulting 7-by-7 cell state co-localization matrix (Fig. 6d, e). While removing cells in cluster 7, the cell state most enriched in the tumor stages, results in the worst classification performance, the classification errors of the cross-validation after removing each of the cell states are comparable to using all cell states for the classification (Fig. 6f).
This indicates that the different disease stages and phenotypic categories show a systemic reorganization of all eight cell states and are not limited to a particular cell state alone.

Cell state co-localization pattern is more informative than cell state abundance for accurately classifying hyperplasia, DCIS, and IDC

Next, we trained a classifier with the same architecture as in the previous analysis but also incorporated the proportion of cell states in a given sample as the input. Interestingly, we found that the addition of the cell state proportion in a sample did not significantly improve the prediction of the disease stage or phenotypic category, suggesting that the spatial co-localization of cell states is generally more important than the presence and abundance of a particular cell state (Supplementary Fig. 21).

To identify the most important classification features for each disease stage and phenotypic category, we analyzed the misclassified samples in detail. Consistent with the identified enrichment of clusters 5, 6, and 7 in IDC compared to normal samples (Fig. 2c), the non-tumor samples (“Breast tissue” and “Cancer adjacent breast tissue”) misclassified as DCIS or invasive stage have a higher proportion of cells in these three clusters than the correctly classified normal samples (Fig. 7a and Supplementary Fig. 22). Similarly, the IDC samples misclassified as normal samples (“Breast tissue” and “Cancer adjacent breast tissue”) have a lower proportion of cells in clusters 5, 6, and 7 and a higher proportion of cells in clusters 0, 1, and 2 than the correctly classified IDC samples (Fig. 7a and Supplementary Fig. 22). This suggests that cell state abundance is important for distinguishing between highly invasive and normal samples.

Fig.
7: Analysis of phenotypic classification performance provides insights into cell state abundance and co-localization differences in the phenotypic categories of DCIS.
a The co-localization patterns of the misclassified samples are compared to the correctly classified samples and the log2 fold changes are plotted. The classification was performed using leave-one-patient-out cross-validation. Classification errors were categorized as the true phenotypic category of the sample -> predicted phenotypic category of the sample, e.g. Breast tissue -> IDC records the breast tissue samples that were misclassified as IDC. The proportion of cells in each of the eight clusters in the misclassified samples compared to the correctly classified samples was also plotted in terms of their log2 fold change (denoted by %cluster). b Classifiers were trained on hyperplasia and DCIS samples to predict their phenotypic category from the co-localization matrix and the total number of cells in a sample. The confusion matrix shows the result of leave-one-patient-out cross-validation. c Co-localization matrix and proportions of cells in each of the eight clusters (denoted by %cluster) of the misclassified samples compared to the correctly classified samples are plotted in terms of their log2 fold change. The classifiers were trained on hyperplasia and DCIS samples as in (b).

On the other hand, we found the classification of DCIS (“DCIS and breast tissue” and “DCIS with early infiltration”) and hyperplasia to be highly dependent on the cell state co-localization pattern. Using cell state co-localization alone to predict phenotypic labels resulted in better performance in the classification of DCIS, hyperplasia, IDC and breast tissue, and IDC samples, as compared to classifiers that used cell state proportion alone (Supplementary Fig. 21). Analyzing the misclassified samples in these phenotypic categories further strengthened this observation (Fig. 7a and Supplementary Fig. 22).
For example, cluster 7 in DCIS with early infiltration misclassified as IDC had higher abundance in the neighborhood of cluster 0, despite the overall decrease in cluster 7 abundance. Similarly, in IDC samples misclassified as DCIS with early infiltration compared to the correctly classified samples, there were more cluster 3 cells in the neighborhood of clusters 0 and 3, although the overall proportion of cluster 3 cells decreased. IDC and DCIS samples misclassified as hyperplasia also indicated the importance of the cell state co-localization pattern in the classification of the pathologies (Fig. 7a). Samples misclassified as hyperplasia generally showed a depletion of cluster 7, especially in the neighborhood of the more malignant cell states (Fig. 7a and Supplementary Fig. 22). This depletion was observed even when the overall proportion of cluster 7 cells was high (Fig. 7a “IDC and breast tissue -> Hyperplasia” and “DCIS and breast tissue -> Hyperplasia”). Consistent with this finding, normal breast tissue misclassified as hyperplasia showed an increase of cluster 7 cells (Supplementary Fig. 22). Our observations suggest that the cell state co-localization pattern is an important indicator of disease stage and phenotypic category and is especially important for accurate classification of hyperplasia, DCIS, and IDC.

The distinction between atypical hyperplasia and low-grade DCIS is known to lack clinical consensus17,18. In fact, overlap in terms of both morphological features and genetic alterations has been reported between atypical hyperplasia and DCIS44. Given the accurate predictions of disease stages and phenotypic categories enabled by our use of cell state co-localization patterns described above, we examined the cell state composition and co-localization patterns associated with the samples labeled as DCIS or hyperplasia in more detail to test whether they could be used as features to distinguish these phenotypic categories at a more fine-grained level.
To this end, we re-trained our neural network classifier to distinguish samples labeled as hyperplasia, atypical hyperplasia, DCIS and breast tissue, and DCIS with early infiltration (Fig. 7b). Our model predictions using cell state co-localization are highly consistent with the pathology annotations of the samples. Analyzing the misclassified examples in detail to understand the features used by our classifier, we found that hyperplasia samples misclassified as atypical hyperplasia have more cells in clusters 5, 6, and 7 and fewer cells in clusters 0, 1, and 2 (Fig. 7c), which suggests the importance of cell state proportions for distinguishing hyperplasia from atypical hyperplasia (Fig. 2c). On the other hand, the classification of atypical hyperplasia and DCIS depends more on the cell state co-localization pattern (Fig. 7c and Supplementary Fig. 23). Although the current clinical diagnosis of atypical hyperplasia does not explicitly use the spatial organization of cell states defined by their chromatin organization and nuclear morphology, our results suggest that cell state abundance and co-localization patterns are highly predictive of these phenotypic categories. While further research with more patients is needed to confirm the robustness of our model and its clinical utility in a larger patient cohort, the use of cell states defined by chromatin staining and their co-localization pattern could potentially help distinguish hyperplasia and low-grade DCIS.
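
The error analysis summarized in Fig. 7a and c can be illustrated with a minimal Python sketch. It is not the code used in this study. The function names (leave_one_patient_out, mean_matrix, log2_fold_change, error_fold_changes), the pseudocount eps, and the representation of each sample's co-localization matrix as a nested list are assumptions made for illustration, and the classifier itself is omitted. The sketch shows how each patient can be held out once, and how misclassified samples of a given "true -> predicted" error category can be compared to the correctly classified samples of the same true category by an element-wise log2 fold change.

```python
import math
from collections import defaultdict


def leave_one_patient_out(patient_ids):
    """Yield (held_out_patient, train_indices, test_indices), holding out each patient once."""
    for held_out in sorted(set(patient_ids)):
        train = [i for i, p in enumerate(patient_ids) if p != held_out]
        test = [i for i, p in enumerate(patient_ids) if p == held_out]
        yield held_out, train, test


def mean_matrix(matrices):
    """Element-wise mean of equally shaped matrices given as nested lists."""
    n = len(matrices)
    n_rows, n_cols = len(matrices[0]), len(matrices[0][0])
    return [[sum(m[r][c] for m in matrices) / n for c in range(n_cols)]
            for r in range(n_rows)]


def log2_fold_change(misclassified, correct, eps=1e-9):
    """Element-wise log2(mean misclassified / mean correct).

    eps is an assumed pseudocount that avoids division by zero.
    """
    num, den = mean_matrix(misclassified), mean_matrix(correct)
    return [[math.log2((x + eps) / (y + eps)) for x, y in zip(row_n, row_d)]
            for row_n, row_d in zip(num, den)]


def error_fold_changes(features, true_labels, predicted_labels):
    """Map each 'true -> predicted' error category to its log2 fold change
    relative to the correctly classified samples of the same true category."""
    correct = defaultdict(list)
    errors = defaultdict(list)
    for feat, true, pred in zip(features, true_labels, predicted_labels):
        if true == pred:
            correct[true].append(feat)
        else:
            errors[(true, pred)].append(feat)
    return {
        f"{true} -> {pred}": log2_fold_change(mats, correct[true])
        for (true, pred), mats in errors.items()
        if correct[true]  # skip categories without a correctly classified reference
    }


# Toy example with hypothetical 2 x 2 co-localization matrices.
toy_features = [[[1.0, 2.0], [2.0, 4.0]], [[2.0, 2.0], [1.0, 4.0]]]
toy_true = ["IDC", "IDC"]
toy_pred = ["IDC", "Hyperplasia"]
for category, matrix in error_fold_changes(toy_features, toy_true, toy_pred).items():
    print(category, matrix)
```

In practice, the predicted labels passed to error_fold_changes would be collected from the held-out samples of each cross-validation split, so that every sample is predicted by a model that never saw its patient during training. The same comparison applies to the per-cluster cell proportions (%cluster) if they are passed as a single-row matrix.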
