SnapFISH-IMPUTE: an imputation method for multiplexed DNA FISH data

SnapFISH-IMPUTE

Multiplexed DNA FISH data measure the three-dimensional (3D) Euclidean coordinates of consecutive loci on chromosomes. Depending on the ploidy and the cell cycle, multiple signals of the same locus can be observed within each cell. Here we assume that the identity of each locus, that is, the allele to which it belongs, has already been resolved by clustering or by more advanced methods tailored for spatial genome alignment8,19. The goal of imputation is then to fill in the 3D coordinates of the missing loci on each chromosome.

Although chromosome conformations often demonstrate large cell-to-cell variability, common structures such as TADs and enhancer-promoter loops are conserved in subpopulations of cells from the same cell type8,10,11. We therefore reason that the position of each missing locus is related not only to the loci immediately upstream or downstream on the same chromosome but can also be inferred from the relative positioning of the same locus in other cells. To find cells with similar structures, we first convert the 3D coordinates to pairwise Euclidean distances between each locus pair. The pairwise distances are then normalized by their one-dimensional (1D) genomic distances so that the difference between two cells is not dominated by pairwise distance entries with large variations. Specifically, for locus pairs separated by the same 1D genomic distance, we model the difference in each coordinate as a centered normal distribution with variance proportional to the 1D distance. The squared pairwise distance, which is the sum of the squared differences along the three axes, thus follows a chi-squared distribution with three degrees of freedom multiplied by the variance. In imaging experiments, however, factors such as the resolution of the microscope and the imaging procedure used might lead to violations of these assumptions.
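The conversion from coordinates to pairwise distances, and the grouping of those distances by 1D genomic separation, can be illustrated with a minimal Python sketch. It assumes that missing loci are encoded as rows of NaN; the function names, the toy coordinates, and the NaN convention are illustrative assumptions and are not taken from the SnapFISH-IMPUTE software.

```python
import numpy as np

def pairwise_distances(coords):
    """Return the n x n Euclidean distance matrix of an (n, 3) coordinate array.

    Any pair involving a missing locus (a NaN row) yields a NaN distance.
    """
    diff = coords[:, None, :] - coords[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def group_by_genomic_distance(dist, positions):
    """Collect the distances of all locus pairs that share a 1D genomic separation."""
    n = len(positions)
    groups = {}
    for i in range(n):
        for j in range(i + 1, n):
            separation = int(abs(positions[j] - positions[i]))
            groups.setdefault(separation, []).append(dist[i, j])
    return {sep: np.array(values) for sep, values in groups.items()}

# Toy chromosome: four loci spaced 25 kb apart, the third one undetected.
coords = np.array([[0.0, 0.0, 0.0],
                   [3.0, 4.0, 0.0],
                   [np.nan, np.nan, np.nan],
                   [6.0, 8.0, 0.0]])
positions = np.array([0, 25_000, 50_000, 75_000])

dist = pairwise_distances(coords)
groups = group_by_genomic_distance(dist, positions)
```

Grouping by separation is what allows each group to be normalized on its own scale, which is the purpose of the normalization steps discussed in this section.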
To account for potential deviations, an additional Box-Cox transformation is performed so that the underlying distributions are approximately normal (Fig. S1). Finally, we apply z-score normalization to correct for the difference in 1D genomic distances (Fig. 1).

Fig. 1: Overview of the imputation algorithm.
The input includes the 3D coordinates of available loci in the imaging region and a 1D genomic location annotation file. The Euclidean coordinates are converted to pairwise distances and then grouped by 1D genomic distances. Our method adopts a two-step normalization procedure to systematically remove 1D genomic distance effects. The normalized distances are used to calculate dissimilarities between cells, which then determine how missing entries are filled. Finally, the 3D coordinates are recovered by minimizing the difference between the imputed pairwise distances and the distances calculated from linearly initialized coordinates.

To find cells with similar conformations, we define a dissimilarity measure by computing the root mean square deviation between the normalized distances. Because of the presence of missing loci, normalized distances that are unavailable in at least one of the two cells are skipped when calculating the dissimilarity score between two cells, and the final score is rescaled by the number of shared available entries, ensuring that all scores are within the same range and comparable with each other. We further filter the scores by removing those for which the shared entries make up <80% of the available entries in both cells. However, since the Euclidean distances are highly dynamic at the locus pair level, the dissimilarity defined in this way cannot capture larger structural features, such as TADs, a phenomenon also observed in Hi-C data20. A widely adopted solution for Hi-C data is to first smooth the contact matrix. Here, we resize the normalized distance matrices to 20 by 20 to achieve the same purpose.
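The two-step normalization and the dissimilarity score described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the Box-Cox parameter is fitted by SciPy's maximum-likelihood default, and the 80% filter is read here as requiring the shared entries to cover at least 80% of the available entries in each of the two cells. The resizing step is not shown.

```python
import numpy as np
from scipy import stats

def normalize_group(values):
    """Box-Cox transform, then z-score, the observed distances of one 1D-distance group.

    Missing entries (NaN) are left as NaN.
    """
    out = np.full(values.shape, np.nan)
    observed = ~np.isnan(values)
    transformed, _ = stats.boxcox(values[observed])  # lambda fitted by maximum likelihood
    out[observed] = (transformed - transformed.mean()) / transformed.std()
    return out

def dissimilarity(a, b, min_shared=0.8):
    """Root mean square deviation over entries available in both cells.

    Returns NaN when the shared entries cover less than `min_shared` of the
    available entries in either cell (an assumed reading of the 80% filter).
    """
    avail_a, avail_b = ~np.isnan(a), ~np.isnan(b)
    shared = avail_a & avail_b
    n_shared = shared.sum()
    if n_shared == 0 or n_shared < min_shared * max(avail_a.sum(), avail_b.sum()):
        return np.nan
    # Dividing by the number of shared entries keeps scores comparable across cell pairs.
    return np.sqrt(np.mean((a[shared] - b[shared]) ** 2))

# Toy example: one group of raw distances, then three cells of normalized distances.
group = np.array([0.5, 1.2, np.nan, 0.8, 2.5, 1.0])
z = normalize_group(group)

a = np.array([0.0, 1.0, 2.0, np.nan, 4.0])
b = np.array([0.0, 1.0, 4.0, 3.0, 4.0])
c = np.array([np.nan, np.nan, np.nan, 3.0, 4.0])
score_ab = dissimilarity(a, b)   # four shared entries, passes the filter
score_cb = dissimilarity(c, b)   # two shared entries out of five, filtered out
```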
An additional advantage is that resizing simultaneously reduces the missing ratio in the processed pairwise distances (Fig. S2), allowing more scores to pass the 80% filter.

The next step of the imputation workflow involves constructing target pairwise distances. Specifically, for each cell, all the other cells are ranked by their dissimilarity, and each missing pairwise distance of the cell is replaced by the first non-missing entry in the sorted cell list. This replacement process is repeated for all cells until no missing pairwise distances remain. To recover the underlying 3D coordinates from pairwise distances, we minimize the difference between the target pairwise distances and the pairwise distances calculated from 3D coordinates. Similarly, the difference is calculated after the pairwise distances are normalized by 1D genomic distances, which ensures that each entry contributes equally. The difference and its derivatives are then passed to a minimizer and optimized with the limited-memory Broyden-Fletcher-Goldfarb-Shanno algorithm (L-BFGS)21, a method designed for solving large-scale optimization problems.

Imputation preserves the distribution of pairwise spatial distances

We applied our imputation algorithm to a published multiplexed DNA FISH dataset from mouse embryonic stem cells (mESCs)10. The authors performed whole-genome DNA seqFISH+ on 446 cells from two biological replicates. Two groups of probes were designed: the first group consisted of 1200 probes separated by 25 kb, and the second group consisted of 2460 probes separated by about 1 Mb. We started with the 25 kb resolution subset, in which one imaging region is selected from each chromosome and 60 consecutive loci separated by 25 kb are imaged in each imaging region. Keeping only haploid chromatins with at least one locus observed, the average detection efficiency of the 25 kb subset is 67.9% (range: 58.7% [chr2] to 73.0% [chr3]).
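Before turning to the benchmarks, the two-stage workflow described earlier in this section (constructing target distances from the most similar cells, then recovering coordinates) can be sketched in Python. This is a simplified illustration, not the authors' implementation: the function names are hypothetical, a plain per-pair weight matrix stands in for the 1D-genomic-distance normalization, and SciPy's `L-BFGS-B` method is used as a readily available limited-memory BFGS variant.

```python
import numpy as np
from scipy.optimize import minimize

def fill_from_neighbors(dist, dissim):
    """Fill each missing distance with the value from the most similar cell that has it.

    dist: (cells, pairs) array with NaN for missing entries.
    dissim: (cells, cells) dissimilarity matrix; NaN scores are ranked last.
    """
    filled = dist.copy()
    for cell in range(dist.shape[0]):
        order = np.argsort(dissim[cell])
        order = order[order != cell]
        for pair in np.flatnonzero(np.isnan(dist[cell])):
            for other in order:
                if not np.isnan(dist[other, pair]):
                    filled[cell, pair] = dist[other, pair]
                    break
    return filled

def recover_coordinates(target, weights, init):
    """Find (n, 3) coordinates whose weighted pairwise distances match `target`."""
    n = init.shape[0]
    iu = np.triu_indices(n, k=1)

    def loss_and_grad(flat):
        xyz = flat.reshape(n, 3)
        diff = xyz[iu[0]] - xyz[iu[1]]
        d = np.maximum(np.sqrt((diff ** 2).sum(axis=1)), 1e-12)
        resid = weights[iu] * (d - target[iu])
        # Analytic gradient of sum(resid ** 2) with respect to every coordinate.
        coef = (2.0 * weights[iu] * resid / d)[:, None] * diff
        grad = np.zeros_like(xyz)
        np.add.at(grad, iu[0], coef)
        np.add.at(grad, iu[1], -coef)
        return np.sum(resid ** 2), grad.ravel()

    result = minimize(loss_and_grad, init.ravel(), jac=True, method="L-BFGS-B")
    return result.x.reshape(n, 3)

# Stage 1 toy example: cell 0 lacks its second distance; cell 2 is its closest neighbor.
dist = np.array([[1.0, np.nan], [2.0, 5.0], [3.0, 7.0]])
dissim = np.array([[0.0, 2.0, 1.0], [2.0, 0.0, 3.0], [1.0, 3.0, 0.0]])
filled = fill_from_neighbors(dist, dissim)

# Stage 2 toy example: start from a perturbed copy of known coordinates.
rng = np.random.default_rng(0)
true_xyz = rng.normal(size=(6, 3))
target = np.linalg.norm(true_xyz[:, None] - true_xyz[None, :], axis=-1)
init = true_xyz + 0.05 * rng.normal(size=true_xyz.shape)
recovered = recover_coordinates(target, np.ones_like(target), init)
recovered_dist = np.linalg.norm(recovered[:, None] - recovered[None, :], axis=-1)
```

Because pairwise distances are unchanged by rotation and translation, the recovered coordinates are only defined up to a rigid motion, which is why the toy example compares distance matrices rather than coordinates.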
To benchmark our method SnapFISH-IMPUTE, we included three alternative imputation strategies in the following analyses: linear imputation, cubic spline imputation, and mean imputation (see Methods). The first two methods can recover the 3D coordinates, while the mean imputation method can only fill in missing pairwise distances.

As expected, the average pairwise distances from the raw data and the mean imputed data are identical, since in mean imputation missing values are replaced by the average (Fig. 2a). Linear imputation also gives an average distance matrix similar to the original one, though the distances between locus pairs with larger 1D genomic distances are more likely to be overestimated (Fig. 2a, S3a). Interestingly, cubic spline imputation, a generalization of linear imputation that uses a cubic polynomial instead of a linear function, completely removes the original patterns in the data (Fig. 2a, S3b). This suggests that filling in missing 3D coordinates while preserving population-level features is challenging. The main reason that linear imputation works reasonably well is likely that, arithmetically, it is equivalent to averaging the two neighboring loci, so it implicitly captures the stochasticity in chromatin conformation and replaces the missing coordinates by their expectations. If we try to capture the variations explicitly by increasing the order of the fitted function, the predicted value is no longer computed from the two neighboring loci only but is also influenced by loci further apart. In contrast, by adopting a two-stage imputation workflow, SnapFISH-IMPUTE preserves the original population-level patterns faithfully and yields almost identical distance matrices (Fig. 2a, S3c). The difference between the imputed matrix and the raw matrix is also substantially smaller than the differences obtained with linear imputation and cubic spline imputation (Fig. 2b).

Fig. 2: Imputation results of the 25 kb subset of the DNA seqFISH+ mESCs dataset.
a The average pairwise distance matrices of imaging region 1. The upper triangle is the raw average distance matrix, and the lower triangle is the imputed distance matrix. b The absolute value of the difference between the upper and the lower triangle in part a (n = 1770 possible pairs). c The Pearson correlations between the average distance matrices from the raw data and the imputed data across the 20 imaging regions. d The standard deviations of each entry in the pairwise distance matrix. The first five imaging regions are shown (see Fig. S2 for the other 15 regions). Box plot shows the first quartile, the median, the third quartile, and the min and max (excluding outliers) of the data (n = 1770 possible pairs). e Single-cell examples. f Binary classification accuracies across different imaging regions. Error bars are the minimum and the maximum of classification accuracies from five-fold cross validation (n = 5 random partitions). g The first two principal components of the mean imputation result and our imputation method. The groups with the highest and the lowest missing ratios from imaging region 4 are shown.

The Pearson correlations between the mean distance matrices from the imputed data and the raw data show that SnapFISH-IMPUTE consistently outperforms linear imputation and cubic spline imputation across all imaging regions. Indeed, the correlations are around 0.99, almost the same as the ones from mean imputation (Fig. 2c). Such high correlations are achieved without gathering population-level information beforehand, which is completely different from mean imputation, where the population average is pre-computed and directly used to fill in missing pairwise distances. The results indicate that the information from other cells is referenced sufficiently in the first part of our method, and that the recovered 3D coordinates resemble the true distributions.
In addition to the mean distance matrix, we calculated the standard deviation of the distance between each locus pair across the dataset. Both linear imputation and spline imputation lead to unwanted variations, which can be more than twice as high as the observed variations (Fig. 2d). Mean imputation, on the other hand, reduces the original standard deviations because all missing distances from the same locus pair are replaced by the same value. Although our proposed imputation method also slightly decreases the variation, it keeps the overall distribution at a similar level (Figs. 2d, S3d). Taken together, our method preserves population-level features and surpasses the other benchmark methods in multiple aspects.

Imputed conformations resemble observed conformations

We next evaluated the effect of imputation on single-cell level characteristics. We selected a few cells without missing loci and found that chromatin conformations demonstrate large cell-to-cell variability, consistent with previous studies (Fig. S4a)10. Nevertheless, there are some common patterns shared by most imaged regions. For example, the loci are often arranged sinuously instead of following a smooth curve. As a result, neighboring entries in the pairwise distance matrix can have distinctive values despite being close in 1D genomic distance (Fig. S4a). Such features are not always preserved by the other imputation methods considered. When ~50% of loci are missing, both linear imputation and cubic spline imputation fail to recover the sharper changes observed in real imaging data (Fig. 2e). This pattern becomes more obvious as the detection efficiency drops below 50%, where loci start to pack together and finer structural variations are entirely lost, as reflected by the distance matrices (Fig. S4e, f).
SnapFISH-IMPUTE, in contrast, is robust under various detection efficiencies, and the imputed loci distribute randomly as in real imaging data even when two-thirds of the loci are missing (Fig. S4b–f). Remarkably, although SnapFISH-IMPUTE does not model the conformation parametrically, it captures the overall trend of non-missing loci, and the imputed pairwise distances are indistinguishable from the real pairwise distances, in contrast to mean imputation (Fig. 2e).

If the imputed loci emulate real imaging data at the single-cell level, it should be difficult to discriminate imputed cells with low detection efficiency from those with high detection efficiency. For each imaging region in the 25 kb subset, we binned the cells by their detection efficiencies into three groups and trained a soft-margin support vector machine on the two groups with the highest and the lowest detection efficiencies (see Methods). If the imputed result closely resembles true data patterns, the classifier would not be able to distinguish cells from these two groups easily. Indeed, the accuracy is ~50% for SnapFISH-IMPUTE across different imaging regions, which is about the same as a random classifier. In comparison, the classification accuracies are considerably higher for all three competing methods, with mean imputation having the highest score of around 90% (Fig. 2f). We performed a PCA using the mean imputed data and the data generated by SnapFISH-IMPUTE. The result shows that cells with low and high detection efficiencies from the mean imputed data occupy different sub-spaces in the low-dimensional space, consistent with the high classification accuracies. In contrast, cells with different detection efficiencies are mixed together in the PCA plot of SnapFISH-IMPUTE, confirming that the predicted conformations are similar to the observed ones (Fig. 2g).

SnapFISH-IMPUTE is robust under different resolutions and imaging protocols

Next, we applied SnapFISH-IMPUTE to the 1 Mb subset of the mESCs data. We imputed missing data in the output from Jie19, a spatial genome alignment method. The overall detection efficiency is only 36.3%, with a range of 30.2% to 46.9%. Notably, although nearly two-thirds of the 3D coordinates are missing, SnapFISH-IMPUTE yields average distance matrices almost identical to the original ones, and the Pearson correlations are close to one as before (Fig. S5a, b). Additionally, we analyzed a DNA FISH dataset of mESCs generated by a different imaging protocol11. A total of 41 probes are designed to image two alleles at 5 kb resolution, and the average detection efficiency is ~70% for both alleles. Similar to previous results, SnapFISH-IMPUTE recovers missing 3D coordinates effectively, as reflected by both the mean distance matrices and the distance deviations (Fig. S5c–f). The correlations are 0.98 and 0.98 for the two alleles, in contrast to the 0.88 and 0.86 achieved by linear imputation and the 0.58 and 0.57 achieved by cubic spline imputation. In summary, the performance of SnapFISH-IMPUTE is robust under different imaging resolutions and protocols.

Imputation improves downstream analysis

The enhancer-promoter loop is a critical structural feature in chromatin conformation data and plays a key role in transcription regulation22. In our recent work, we developed a loop caller, SnapFISH23, for multiplexed DNA FISH data. We applied SnapFISH to the 25 kb imputed DNA seqFISH+ data and benchmarked the loop-calling performance against both the loop set from the raw data and the HiCCUPS24 output. With the default threshold, SnapFISH identified 30 loops from the imputed data, of which 14 loops overlapped with the HiCCUPS output (Fig. 3a). The large number of false positives is not surprising, since imputation increases the effective sample size and thus allows more loops to pass the default threshold.
The t-test results from SnapFISH show that false positive loops have smaller t-statistics than true positive loops, suggesting that the relative loop strength is not affected by imputation (Fig. 3c). This motivated us to optimize the default FDR cutoff in the SnapFISH algorithm to achieve a precision similar to that of the raw data. Indeed, we found that by setting the cutoff to 0.001, SnapFISH reports only 4 false loops while all the true loops are retained (Fig. 3b). It is worth noting that not all imputation strategies enhance loop-calling. Both linear imputation and cubic spline imputation drastically lower the sensitivity, with only two loops called at the optimal threshold. We reason that the enhancer-promoter loop is a more intricate feature of 3D chromatin conformation and thus requires careful processing of the raw data. To assess whether and to what extent the performance depends on the loop caller used, we performed the same analysis using loops called by FitHiC225 as the ground truth. We observed patterns similar to those obtained when using HiCCUPS loops as the ground truth (Fig. S6).

Fig. 3: Imputation improves loop-calling and cell-type clustering.
a Loops called from the mESCs 25 kb DNA seqFISH+ dataset using the default threshold. b Loops called from the 25 kb subset using the optimal threshold in SnapFISH. c The t-statistics of loops called by SnapFISH. Box plot shows the first quartile, the median, the third quartile, and the min and max of the data (n = 2, 14 loops for no imputation; n = 16, 14 loops for SnapFISH-IMPUTE with the default cutoff; n = 4, 14 loops for SnapFISH-IMPUTE with the optimal cutoff). d Identification of the Sox2 enhancer-promoter interaction on the 129 allele (n = 100 random samples for each number of chromosomes). e Identification of the Sox2 enhancer-promoter interaction on the CAST allele (n = 100 random samples for each number of chromosomes).
For d and e, the error bar is the 95% confidence interval calculated from a 1000-time bootstrap (sampling with replacement on the 100 loop-calling results and taking the middle 95% interval). f, g The embedding of different mouse brain cell types.

Next, we applied SnapFISH to the 5 kb chromatin tracing data and tested whether it could identify the Sox2 enhancer-promoter loop with different numbers of cells. Specifically, we generated 100 random samples for each number of cells and calculated the F1 score on these 100 samples. The result shows that imputation boosts loop-calling efficiencies and leads to a higher F1 score across both alleles (Fig. 3d, e).

In addition to loop-calling, we tested whether the imputed data can be used for cell-type clustering. We re-analyzed a previously published DNA seqFISH+ dataset of mouse brain cells8. The authors also conducted mRNA seqFISH+ on each cell in the dataset and identified nine major cell types. Here we asked whether cells have distinct 3D conformations in different cell types. We applied SnapFISH-IMPUTE and linear imputation to the 1 Mb resolution subset of the data. After obtaining the complete 3D coordinates, we computed the pairwise distances of each cell and concatenated all pairwise distances from the 19 autosomes in each cell. Since not all cells have both alleles observed and all 19 autosomes recorded, we kept only cells with at least one haploid observable and randomly selected one allele for cells with more than one allele to perform clustering. The concatenated distances are normalized and then embedded with PCA followed by UMAP. Although not all cell types are distinguishable from each other in the embedding space, some of them, such as microglia (Micro) and neurons expressing vasoactive intestinal polypeptide (Vip) (Fig. 3f) or neurons expressing parvalbumin (Pvalb) and endothelial cells (Fig. 3g), occupy distinct regions.
We also noticed that SnapFISH-IMPUTE often separates these classes more clearly than linear imputation and is thus more appropriate for downstream analyses.

To quantitatively evaluate the clustering efficiency, we calculated the adjusted mutual information (AMI) scores between the imputed data and the ground truth. Specifically, we first embedded the distances with PCA and then applied hierarchical clustering to obtain the predicted cluster assignments. We then computed the AMI score using the predicted and the true cluster assignments. The AMI scores of SnapFISH-IMPUTE are almost always higher than the ones from linear imputation (Fig. S7). For example, for the pairs Pvalb versus Endo, Micro versus Vip, and Sst versus Micro, linear imputation yields an AMI close to zero, indicating that the clusters found are almost random. In contrast, SnapFISH-IMPUTE gives an AMI score close to 0.5 in these cases, which is considerably higher and much closer to 1, showing that SnapFISH-IMPUTE preserves the original chromatin conformation characteristics.
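The AMI evaluation described above can be sketched with scikit-learn. This is a minimal illustration under assumptions: the number of principal components, the default Ward linkage, and the synthetic two-group data are choices made for the example and are not specified in the text; the helper name `clustering_ami` is hypothetical.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.decomposition import PCA
from sklearn.metrics import adjusted_mutual_info_score

def clustering_ami(features, true_labels, n_components=2, n_clusters=2):
    """Embed with PCA, cluster hierarchically, and score against the true labels."""
    embedded = PCA(n_components=n_components).fit_transform(features)
    predicted = AgglomerativeClustering(n_clusters=n_clusters).fit_predict(embedded)
    return adjusted_mutual_info_score(true_labels, predicted)

# Synthetic stand-in for concatenated distance vectors of two cell types.
rng = np.random.default_rng(0)
group_a = rng.normal(loc=0.0, scale=1.0, size=(30, 10))
group_b = rng.normal(loc=10.0, scale=1.0, size=(30, 10))
features = np.vstack([group_a, group_b])
true_labels = np.array([0] * 30 + [1] * 30)

ami = clustering_ami(features, true_labels)
```

An AMI near 0 means the predicted clusters agree with the true labels no better than chance, while an AMI of 1 means the two partitions are identical.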
