Intracellular spatial transcriptomic analysis toolkit (InSTAnT)

Source dataSource data are provided with this paper.Statistics and reproducibilityMERFISH data were generated on one sample of U2OS cell line (described above), and the data included thousands of cells as samples for INSTANT analysis. Experiment for SRRM2-MALAT1 interaction (Fig. 4a) was performed once and the colocalization rate was aggregated over cells.InSTAnT user guideInSTAnT tools have tunable parameters that can be selected based on the user’s requirement. We selected the scale parameter d based on the average cell’s diameter and CPB False Positive Rate (2%) estimates. The user can also obtain region annotations of a gene pair’s colocalization if the data include masks for cell and nucleus boundaries. Similarly, they may run cell type/region specificity analysis if the data include cell type or region information. We advise caution when using InSTAnT with small distance thresholds, such as 1 µm or less, as the false positive rates in this regime can be high. This is due to the fact that colocalization with small distance is relatively rare and the estimate of null probability of a pair of transcripts being proximal, a key aspect of the PP test, is error-prone in such cases. We believe that higher number of transcripts and improved optical resolution17,e.g., through expansion microscopy, may alleviate this problem.Comparison with BentoWe used Bento34 from the official GitHub repository (version 90f4ab4). We used the function for colocation- “coloc_quotient()” at https://github.com/ckmah/bento-tools/blob/master/bento/tools/_colocation.py. The original function uses an AnnData object to load data. However, we wrote scripts to load input data in csv format. The rest of the function was used as it is. We used K=10 in our experiments (d~5.5 mu). We also explored setting distance d=4 mu but noted worse FPR in this setting.Comparison of Proximal pairs with gene pairs having similar localization featuresWe sought to compare the results of the PP test with the approach of Battich et al.44 that is based on localization features. However, since their code is not easy to use, we used a function from Bento34, that uses a very similar approach based on pre-defined localization features of transcripts. The features used were- nucleus_inner_proximity, nucleus_outer_proximity, l_half_radius, l_max, l_max_gradient, l_min_gradient, l_monotony, cell_inner_asymmetry, nucleus_inner_asymmetry, nucleus_outer_asymmetry, point_dispersion, and nucleus_dispersion. The authors of Bento34 had made these features available for a SeqFISH+ data set on a mouse fibroblast cell line, hence we performed our comparison on this data set. Note that this function and the method of Battich et al.44 do not aim to identify gene pairs whose transcripts tend to be near each other; rather, they separately characterize the location of (transcripts of) each gene and report gene pairs with similar localizations in a cell. We found that this approach reports very different gene pairs compared to Instant (PP test), with <2% or fewer of the top 500 pairs in any one cell being common. For illustration purposes, we present one example of a gene pair deemed significantly d-colocalized by Instant PP test but not identified by the localization feature-based method, and one converse example (Supplementary Fig. 7).Implementation of Chen et al. (Bin based method)We followed the method described in Chen et al.13. First, we divided each cell into 2 by 2 regions (bins). For each gene, fraction of occurrence in each bin was calculated. Enrichment of a gene in a bin was calculated as ratio of the observed fraction in a given region to average fraction of all genes in the same region. Next, Pearson correlation of region-to-region variation in enrichment of a gene pair was calculated for each cell. For each gene pair, we removed the cells that had one of the genes as constant across four bins (and therefore correlation is not defined). Finally, for each gene pair, we take median of correlation across all cells as a global measure of their colocalization. Note that (a) the spatial resolution of this approach is quite low (about a quarter of a cells area/volume), and (b) correlation coefficients are calculated from four samples at a time, leading to unreliable estimates, which are then averaged.False positive rate (FPR)We generate random baseline dataset established by permuting the gene labels of all transcripts within each cell, which recapitulates the spatial patterns of the original data but not the gene-gene relationships. The gene pairs obtained with InSTAnT on real data comprise true positives (TP) and false positives (FP). The gene pairs found under the random data are assumed to be false positives (FP). FPR is obtained by comparing the number of detected pairs obtained on randomized data with number of detected pairs on real data. Since thousands of cells are independently randomized (gene labels of transcripts in an individual cell are shuffled), this procedure makes use of an extensive level of randomization. Ten of the 140 genes probed in the U2OS MERFISH data set were “blanks”, meaning that they do not represent any particular RNA or other molecule. Any gene pair involving such blank “genes”, if found to d-colocalize, is clearly a false positive. This provided us another opportunity to assess the false positive errors in our global co-localization map. We recorded the fraction of such false positives among predicted pairs at varying levels of significance (Supplement Fig. 3).Hyperparameter selectionU2OS MERFISH dataset: d was chosen to be 4 microns, which corresponded to ~5% of average diameter of a cell. The p-value threshold was chosen to be 0.001 for PP test and 0.0001 for CPB test that resulted in CPB FPR < 2%.Hypothalamus brain MERFISH dataset: d was chosen to be 2 microns, which corresponded to ~5% of average diameter of a cell. The p-value threshold was chosen to be 0.001 for PP test. For differential colocalization, we use p-value threshold of 5e-6 (Bonferroni corrected hypergeometric p-value 0.05) for unconditional p-value obtained from Hypergeometric test. Same parameters were used for behavior analysis.NIH/3T3 SeqFISH+ dataset: d was chosen to be 2 microns, which corresponds to ~0.5% of average diameter of a cell. The p-value threshold was chosen to be 0.01 for PP test. CPB estimated FPR is ~0% for top 109 pairs (Fig. 3j).Hippocampus brain Xenium dataset: d was chosen to be 0.75 micron that corresponds to ~3.5% of average diameter. The p-value threshold was chosen to be 0.01 for PP test. For differential colocalization, we use p-value threshold of 5e-6 (Bonferroni corrected hypergeometric p-value 0.05) for unconditional p-value obtained from Hypergeometric test.Proximal pair (PP) testPP test reports proximal pairs of genes in a particular cell. A gene pair \({g}_{i},\,{g}_{j}\) is a proximal pair in a cell if their transcripts are proximally located (separated by distance \(d\) or less) significantly more often than expected by chance. The null probability \(p\) is estimated from the distances between all pairs of transcripts (regardless of gene identities) in the cell, by calculating the fraction of transcript pairs that are proximally located. Let \({t}_{i}\) and \({t}_{j}\) denote the transcript counts of genes \({g}_{i},\,{g}_{j}\) respectively in the cell, let \(T={t}_{i}{t}_{j}\) and let \(K\) be the number of proximally located transcript pairs of these genes. The PP test performs a Binomial test providing a p-value (one-sided) for \({g}_{i},\,{g}_{j}\) as$${{{\rm{p}}}}-{{{\rm{value}}}}\big({g}_{i},\,{g}_{j}\big)={Binomial}(T,p,K)$$
(1)
PP-3D testPP-3D is an extension of PP test to handle three-dimensional data in the form of 2D (x-y) locations of transcripts in each of multiple \(z\)-planes. We assume that data from different planes are independent and identically distributed. The new distribution is the sum of independent Binomial distributions (with the same parameter), which is also a Binomial distribution. The null probability of two transcripts being proximal is estimated as a weighted combination of estimated null probability for each of the z-planes,$$p\equiv \frac{{\sum }_{z}{l}_{z}{p}_{z}}{{\sum }_{z}{l}_{z}}$$
(2)
where, \({p}_{z}\) denotes the null probability for \(z\)-th plane, \({l}_{z}\) denotes the total number of transcripts in \(z\)-th slice. \(T\) and \(K\) are also aggregated across z-planes:$$T=\mathop{\sum}_{z}{T}_{z}$$$$K=\mathop{\sum}_{z}{K}_{z}$$where \({K}_{z}\) is total number of proximal transcript pairs and \({T}_{z}\) is total number of transcript pairs (of \({g}_{i},\,{g}_{j}\)) in \(z\)-th plane. PP-3D calculates a p-value for each gene pair as p-value\(({g}_{i},\,{g}_{j})={Binomial}(T,p,K)\).Conditional poisson binomial (CPB) testCPB test detects a \(d\)-colocalized gene pair, i.e., a gene pair that is a proximal pair in significantly many cells. It assigns a p-value (one-sided) to the number of cells in which a gene pair is found to be proximal pair detected using PP test. We first describe a simpler version of the test (“unconditional Poisson Binomial” or UPB) test that assumes that all gene pairs are equally likely to be proximal pair in a cell but allows for the fact that different cells may have different number of proximal pairs. Let \({X}_{{ij}}^{c}\) be a binary variable denoting if \({g}_{i},\,{g}_{j}\) are a proximal pair in \(c\)-th cell. \({X}_{{ij}}^{c}\) is assumed to follow a Bernoulli distribution with parameter \({p}_{0}^{c}\), which is estimated as the fraction of proximal gene pairs in the cell:$${p}_{0}^{c}=\equiv \frac{{\sum }_{k\le l}{X}_{k,l}^{c}}{{\sum }_{k\le l}1}=\frac{{\sum }_{k\le l}{X}_{k,l}^{c}}{\left(\begin{array}{c}n\\ 2\end{array}\right)}$$
(3)
where \(n\) denotes total number of genes. This estimate of \({p}_{0}^{c}\) assumes that all gene pairs can be a proximal pair. To incorporate the fact that a gene pair cannot be a proximal pair if either of the genes is not expressed in the cell, the above estimate is modified as,$${p}_{0}^{c}\equiv \frac{{\sum }_{k\le l}{X}_{k,l}^{c}}{\sum {I}_{k\le l}\left({g}_{k},\,{g}_{l}\right)}$$
(4)
where \(I({g}_{k},\,{g}_{l})\) is an indicator function that equals to \(1\) iff both \({g}_{k}\) and \({g}_{l}\) are expressed.CPB test is a modified version of the UPB test that accounts for the possibility that all gene pairs are not equally likely to be colocalized in a cell and sets the Bernoulli parameter (\({p}_{0}^{c}\) above) to be gene pair-dependent. Let \({z}_{i}\) denote total number of proximal pairs having gene \(i\) as one of the genes, aggregated across all cells, i.e.,$${z}_{i}=\mathop{\sum }_{j\le c}{X}_{{ij}}^{c}$$
(5)
We use these global summary statistics to model the prior probability \({\Pi }_{{ij}}\) that a proximal pair detected in a cell is the gene pair \({g}_{i},\,{g}_{j}\), as follows:$${\Pi }_{{ij}}\equiv \frac{{z}_{i}{z}_{j}}{{{\sum }_{i\le j}{z}_{i}z}_{j}}$$
(6)
This model de-emphasizes gene pairs comprising genes that are frequently found to be in proximal pairs across cells. Now, the Bernoulli parameter for variable \({X}_{{ij}}^{c}\) is estimated as$${p}_{{ij}}^{c}\equiv 1-{\left(1-{\Pi }_{{ij}}\right)}^{{\sum }_{i\le j}{X}_{{ij}}^{c}}$$
(7)
The total number of cells where \({g}_{i},\,{g}_{j}\) is a proximal pair follows a Poisson Binomial distribution$$\mathop{\sum }_{c=1}^{m}{X}_{{ij}}^{c}\sim {Poisson\; Binomial}\, ({p}_{{ij}}^{1},\ldots,\,{p}_{{ij}}^{m})$$
(8)
Subcellular annotation of a d-colocalized gene pairA d-colocalized pair is annotated by cellular region where the gene pair’s proximal pairs tend to be found. We define four categories – Nucleus (Nuc), Peri-Nucleus (PN), Cytosol (Cyto) and Cell Periphery (CP). Proximal pairs in each cell are annotated by cellular region and is aggregated across cells to yield primary and secondary category. Perinuclear (PN) region is defined as including x microns on either side of the nuclear membrane, while Cell Periphery (CP) is defined as regions in cytoplasm within y microns of the cell membrane. Remaining regions are designated as Cytosol (Cyt) or Nucleus (Nuc). We chose x = 2 micron which corresponded to ~35% of nucleus transcripts being annotated as perinuclear, and y = 4 micron which corresponds to ~35% cytosolic transcripts being annotated as cell periphery.RNA-RNA interaction (RRI)For RRI, we set distance d to be 300 nm (resolution of MERFISH data is 200 nm). The small distance was chosen to capture gene pairs whose d-colocalization may be explained due to the binding of their transcripts. To control FPR at small distance, we used stricter p-value threshold of 1e-5 that resulted in 60 d-colocalized pairs (FPR~0%). We used RNAplex47 to compute the RRI scores. For this, we retrieved the nucleotide sequences from the Ensembl database87 and got the specific transcript id to get the correct spliced form. RNAplex has been shown to be among the most accurate tools while being fast enough to compute the scores for gene pairs with their full transcripts. Finally, we performed a hypergeometric test on d-colocalized pairs and pairs with RRI score greater than a fixed threshold (RRI > 35). 8 out of 60 d-colocalized pairs had high RRI scores that led to significant overlap (p-value = 0.02).Enrichment analysisTo understand the biological mechanism or consequences of d-colocalization, we tested if the compendium of d-colocalized gene pairs has significant overlap with functionally related gene pairs. We define a gene pair to be functionally related if both genes are present in same KEGGpathway or are annotated with same GO terms more than K times. K was chosen such thatnumber of gene pairs is similar across d-colocalized and functionally related set. This partially offsets the confounding impact of set size variations when performing multiple gene set enrichment tests. In our analysis, K (MF) = 2, K (BP) =1, K (CC) = 3, K (pathway) = 1. We performed a hypergeometric test between d-colocalized pairs and functionally related set.Differential colocalization of a gene pairInSTAnT employs a series of statistical tests to categorize a pair based on its specificity to a cell type, region or phenotype. First, it tests the association between cells where a gene pair was deemed a significant proximal pair and cells of a particular type (e.g., inhibitory neurons), using a Hypergeometric test. (This process is repeated for every cell type). If such an association is found to be statistically significant, it is subjected to further tests to determine if the cell type specificity arises simply because one of the genes in the pair is expressed specifically in that cell type. For this, InSTAnT utilizes a version of the generalized Hypergeometric test that tests for an association between two sets conditional on a third set88, as described below. In this case, the third set comprises the cells with high expression of one of the genes in the pair.Let \(U\) be the set of all cells, \(M\) be the set of cells of a particular cell type, \(O\) be the set of cells where a gene pair is deemed a proximal pair and \(E\) be the set of cells with high expression of one of the genes in the pair. \(M\), \(O\) and \(E\) are subsets of \(U\). The threshold for high gene expression used in defining \(E\) is chosen such that size (\(E\)) = size (\(M\)). Let \(\left|M\cap E\right|=\gamma,\left|M\cap O\right|=\lambda,\left|E\cap O\right|=\alpha\) |. The Hypergeometric test p-value of association between \(M\) and \(O\) is given by the probability that a random set of size |\(O\)| has an overlap (intersection) of size greater than or equal to \(\lambda\) with \(M\). However, we wish to test if the overlap between \(M\) and \(O\) is significant beyond what is expected not from a random set of size |\(O\)| but a random set of this size that respects the known overlap between \(M\) and \(E\) and between \(E\) and \(O\). For this, we calculate probability of the overlap between \(M\) and a random set of |\({O|}\) being greater than or equal to \(\lambda\) conditional on the observed overlap between \(M\) and \(E\) and that between \(E\) and \(O\), as follows:$$\frac{{\sum }_{k=\lambda }^{\min \left(\left|M\right|,\left|O\right|\right)}{\sum }_{\beta=0}^{k}\left(\begin{array}{c}\gamma \\ \beta \end{array}\right)\left(\begin{array}{c}m-\gamma \\ k-\beta \end{array}\right)\left(\begin{array}{c}{n}_{1}-\gamma \\ \alpha -\beta \end{array}\right)\left(\begin{array}{c}\left|U\right|-\left|M\right|-\left|E\right|+\gamma \\ \left|O\right|-\alpha -k+\beta \end{array}\right)}{\left(\begin{array}{c}\left|E\right|\\ \alpha \end{array}\right)\left(\begin{array}{c}\left|U\right|-\left|E\right|\\ \left|O\right|-\alpha \end{array}\right)}$$
(9)
This is an example of multivariate hypergeometric distribution. We use scipy.stats.multivariate_hypergeom package for multivariate hypergeometric distribution.For each gene pair that is associated with a cell type, InSTAnT performs the above test twice, each time conditioning on a set \(E\) defined by the high expression cells for one of the genes of the pair. Significant p-values in both tests thus performed indicate that the cell type-specificity of the d-colocalized gene pair is significant beyond what is expected from the specificity of either gene’s expression. Furthermore, InSTAnT tests if either gene of the pair is a marker of the cell type, defined as any gene among the top \(10\) by association between their expression and the cell type. A marker gene is found by conducting Hypergeometric test of overlap between \(O\) and \(E\).Using the above tests, InSTAnT categorizes a gene pair vis-à-vis specificity as follows: If the gene pair is significantly associated with a cell type/region/phenotype (first test above), then it belongs to Category 2 if the association is significant by the Hypergeometric test conditional on high expression cells of both genes and neither gene is a marker of the cell type, otherwise it belongs to Category 1.Probabilistic graphical model for spatial modulationInSTAnT uses a likelihood ratio test to determine if sub-cellular colocalization of a d-colocalized gene pair is spatially modulated at the tissue level. Informally, this means that the cells in which the gene pair is deemed to be a proximal pair are non-randomly distributed in the physical space.The probabilistic model is formulated around a graph with a node for each cell and edges between neighboring cells. Two cells are neighboring cells if they are located within a configurable distance (set to \(100\) micron in our tests). Each node is associated with a binary variable \({s}_{c}\) that indicates whether the specific gene pair (say \({g}_{i},\,{g}_{j}\)) is a proximal pair in the corresponding cell c, as detected by the PP test. The variable \({s}_{c}\) is assumed to be a Bernoulli-distributed variable. The null hypothesis is that the Bernoulli parameter is a global constant \({p}^{{global}}\) shared across all cells, i.e., it does not depend on the cell \(c\) and thus on its spatial location:$${H}_{0}\!\!:{s}_{c}\sim {Ber}\left({p}^{{global}}\right)$$
(10)
\({p}_{{global}}\) is estimated as the fraction of cells where the gene pair \({g}_{i},\,{g}_{j}\) is a proximal pair, which is its maximum likelihood estimate. In the alternative hypothesis, the model assumes that the distribution of variable \({s}_{c}\) depends on the fraction of cells c’ in the neighborhood of c for which \({s}_{c}^{{\prime} }=1.\) Let \({p}^{{local}}\) be the fraction of cells \({c}^{{\prime} }\) in the neighborhood of \(c\) for which \({s}_{c}^{{\prime} }=1.\)$${H}_{1}\!\!:{s}_{c}\sim {Ber}\left({{{wp}}^{{local}}+\left(1-w\right)p}^{{global}}\right)$$
(11)
The parameters \({p}^{{global}},\,{p}^{{local}},w\) are learnt by maximizing likelihood. Weight \(w\) controls the contribution of local neighborhood. InSTAnT calculates the log likelihood ratio (LLR) for each gene pair in the d-colocalization map and pairs with LLR above a threshold are designated as spatially modulated. The threshold is obtained by random permutation of the of \({s}_{c}\) values of cells, repeating the above test and selecting the highest LLR score (over all gene pairs) seen on the randomized data. This allows us to detect spatially clustered distributions of cells supporting \({g}_{i},\,{g}_{j}\) colocalization.Module discovery: global colocalization clustering (GCC)GCC is a procedure to analyze a d-colocalization map to identify subsets of genes that exhibit a high frequency of pairwise d-colocalization relationships. To this end, it represents the d-colocalization map as an \(n\) x \(n\) matrix (\(n\) = number of genes) whose entries are the negative logarithm of p-values of gene pairs from the CPB test and performs a hierarchical clustering of rows and columns using Euclidean distance with Ward criterion. (The constant 1e-64 is added to all the p-values to handle zero p-values prior to taking logarithms).Module discovery: frequent subgraph mining (FSM)FSM seeks a network of genes that is “colocalized” in many cells, where colocalization of a network in an individual cell means that every gene pair connected by an edge in that network is a proximal pair in that cell. It constructs a colocalization graph for each cell with genes as nodes and edges representing proximal gene pairs from PP test. It then uses an efficient graph mining tool called gSPAN54 to detect subgraphs with a pre-specified minimum size (numbers of nodes and edges) that are supported by a pre-specified minimum number of cells.MERFISH imaging and analysisCell line Source and Authentication: U2OS Cell lines were purchased from ATCC, original donor white female. Cell lines were authenticated by Cancer center at Illinois using the following method: Amplified with AmpFISTR Identifiler Plus PCR Amplification Kit and analyzed on the Applied Biosystems 3730/ GeneMapper 6.General cell culture conditionsU2 OS cells were cultured in minimal essential medium (MEM) from ATCC with 1 mM sodium pyruvate, 10% fetal bovine serum (FBS), and 1% penicillin-streptomycin (Pen-Strep). The cells were obtained from ATCC and maintained using the recommended protocol.MERFISH sample preparationU2 OS MERFISH samples were prepared using a previously published method89. MERFISH encoding probe sequences were originally from the Zhuang lab(ref. 89), and can be found in the “Zhuang U2OS probes” Source Data file. In brief, U2 OS cells were plated on a salinized 40mm #1.5 coverslip (Fisher Scientific). Plated cells were transferred to a 37 °C and 5% CO2 incubator overnight to grow. Cells were then fixed with 4% paraformaldehyde (Electron Microscopy Sciences) and permeabilized with 0.5% (vol/vol) Triton X-100 (Sigma Aldrich). Samples were stained with encoding probes (10nM/probe) and anchor probes (1µM) for 36 hours in a humidified incubator at 37 °C. To stabilize the cells during clearing, the stained cells were embedded in a thin, 4% polyacrylamide (PA) gel. Fiducial beads (Spherotech, FP-0245-2) were also included in the gel to align rounds of MERFISH images.Commonly used imaging solutionsThe following solutions were used during imaging experiments described in this work. Readout wash buffer was adapted from Moffit et al.89 and contained 10% (v/v) ethylene carbonate (Sigma Aldrich), 0.1% Triton X-100 in 2x SSC. Imaging buffer adapted from Moffit et al.89 and contained 5mM 3,4-dihydroxybenzoic acid (PCA; Sigma Aldrich), 2 mM trolox (Sigma Aldrich), 50 µM trolox quinone, 1:500 of recombinant protocatechuate 3,4-dioxygenase (rPCO; OYC Americas), adjusted to a pH of 7-7.2 using 1 N NaOH (VWR International) in 2x SSC. Cleavage buffer was adapted from89 and contained 0.05 M TCEP HCl, adjusted to a pH of 7-7.2 using 1 N NaOH, in 2x SSC. Stripping buffer was adapted from Eng. et al.14 and contained 55% formamide, and 0.1% Triton X-100 in 2x SSC.MERFISH imagingAll images were acquired using a Zeiss Axiovert-200m widefield microscope (Carl Zeiss AG) located in the IGB core imaging facility. The sample was placed into a flow cell (Bioptechs, FCS2), filled with RNAse free 2x SSC, and connected to a lab built automated flow system. Briefly, computer-controlled valves (Hamilton, MVP/4, 8-5 valve) are used to select which solution was pulled across the sample by a computer controlled pump (Gilson, Minipuls 3). All systems are controlled by a custom designed Python script that can communicate with the microscope to start imaging or start flowing after an imaging round is done. In brief, a single round of imaging involves staining with fluorescently labeled readout probes (0.4 mL/min for 6 minutes, and 0.34 mL/min for 6 minutes), washing with readout wash buffer (0.23 mL/minute for 9 minutes) to remove unbound probes, and imaging buffer was flowed into the flow cell prior to imaging (0.34 mL/minutes for 6 minutes) to reduce photobleaching. MERFISH readout probe sequences were originally from the Zhuang lab (ref. 89), and can be found in the “16 bit U2OS RO probes” Source Data file. A single quad band excitation filter (Chroma, ZET402/468/555/638x) and dichroic (Chroma, ZT405/470/555/640rpc-UF1) were used to image all samples. Excitation was provided by a 7 laser system (LDI WF, 89 North). Alexa Fluor 647 (Fisher scientific) labeled probes were excited using a 647 nm laser (0.5 W) with a ET700/75m (Chroma) emission filter, and 1.5 second exposure time. Atto 565 (Atto tec) labeled probes were excited using a 555 nm laser (1 W) with a ET610/75 m (Chroma) emission filter, and a 0.75 second exposure time. Fiducial beads were imaged with a 405 nm laser (0.3 W) with a ET440/40 m emission filter, and a 1-second exposure time. Samples were imaged with a 63x oil immersion objective (Carl Zeiss AG, 420782-9900-000), and focus was maintained between imaging rounds using Definite Focus (Carl Zeiss AG). 9 z planes with 0.7 µm steps were taken for each FOV, and a total of 100 FOVs were acquired. After imaging is complete, a cleavage buffer (0.2 mL/minute for 15 minutes) was flowed across the sample to remove the fluorophores from the probes. The cleavage buffer was washed away using RNAse free 2x SSC (0.5 mL/minute for 10 minutes). This process was repeated for a total of 8 rounds of imaging. PolyA probes were stained after the final imaging round using the same method as described above.MERFISH data processingIndividual FOVs were exported from czi format into 16 bit tiff format using Zen (Carl Zeiss AG) using the image export method. Images then were reformatted into image stacks by FOV and round. A modified copy of MERLIN90 was used to decode MERFISH spots. In brief, for each FOV, images from different rounds are aligned using fiducial beads that were imaged in each round. Aligned images are then normalized, decoded, and identified spots filtered using previously published methods31. Cell segmentation was done separately from MERLIN using Cellpose91 on PolyA and DAPI images for each FOV. To improve FOV alignment to neighboring FOVs, the DAPI channel was used with the restitching function found in Zen (Edge detection: on, minimal overlap: 5%, maximal shift: 15%, comparer: best, Global optimizer: best). Using the aligned images, segmented cells that cross FOV boundaries were merged into single cells, and global positions were generated for each spot. Spots are then assigned to cells based on their spatial coordinates. Spots were then filtered to remove any spot smaller than 3 pixels in size.smFISH probe designAll smFISH probes were designed using the Stellaris probe designer (Biosearch technologies). Probes were designed using the following settings: Masking level: 5, max number of probes: 48, oligo length: 20, minimum spacing length: 2. SRRM2 exon probes were designed against SRRM2 isoform ENST00000301740 (GRCh38.p13). SRRM2 intron probes were randomly selected from probes designed for three different introns defined by ensemble (SRRM2-230 intron 1, SRRM2-230 intron 2, and SRRM2-230 intron 10) (GRCh38.p13). MALAT1 probes were designed against MALAT1 isoform ENST00000534336 (GRCh38.p13). All probes were purchased from Biosearch modified with mdC (TEG-Amino) at the 3’ terminus. All probe sequences corresponding to MALAT1, SRRM2 exon and SRRM2 intron can be found in “smFISH probes” Source Data file. The probes were dissolved in TE buffer and labeled using AF488/Cy3/Cy5 NHS esters for MALAT1, SRRM2 intron, and SRRM2 exon, respectively. The labeled probes were purified using the Bio-Rad Bio-Spin P-6 purification columns (Cat # 732-6221).smFISH sample preparationApproximately 1.5-1.8 million U2OS cells were plated on a #1.5, 40 mm coverslip (Fisher Scientific) that has been UV treated before plating. The cells were then transferred to an incubator at 37 °C and 5% CO2, overnight for 12-16 hours.Modified from Fei et al.92, the sample was rinsed with 1x PBS (Corning), followed by fixation using 4% paraformaldehyde (PFA; Electron Microscopy Sciences) in 1x PBS for 10 minutes at room temperature (RT). The sample was then washed three times with 1x PBS and permeabilized with 0.5% Triton X-100 (Sigma Aldrich), 2 mM vanadyl ribonucleoside complexes (VRC; Sigma Aldrich) in 1x PBS for 10 minutes on ice, followed by three quick washes with 1x PBS. At this point, the sample can be stored in 70% Ethanol at 4 °C if the experiment needs to be paused temporarily.To prepare for smFISH hybridization, sample was rinsed with 10% formamide (Sigma Aldrich) in 2x saline sodium citrate (SSC; Fisher Scientific). smFISH probe hybridization buffer was prepared with 0.2 mg/mL of bovine serum albumin (BSA; Fisher Scientific), 2 mM VRC, 10% dextran sulfate (Sigma Aldrich), 1 mg/mL yeast tRNA (Fisher Scientific), 10% formamide, 1% murine RNase inhibitor (New England BioLabs) in 2x SSC. Avoid light exposure from this point forward. smFISH probes were then added to the FISH hybridization buffer at a final concentration of 14 nM for each targeted RNA (MALAT1, SRRM2 intron, and SRRM2 exon).A humidified chamber was made using an empty pipette box filled halfway with nuclease-free water (Corning) at the base and a UV-treated glass slide covered with a parafilm layer on top. A 100 μl drop of the FISH probe hybridization buffer was then added on top of the parafilm layer and the sample was casted over the drop with the cell side facing down. The chamber was then placed in an incubator in dark and wrapped entirely with aluminum foil overnight at 37 °C for at least 16 hours. The sample was quickly rinsed two times with 10% formamide in 2x SSC then stained with 4’,6-diamidino-2-phenylindole (DAPI; Invitrogen by Fisher Scientific) 1:1000 of 1 mg/mL stock solution and 1:5000 of Fluoro-Max Blue Aqueous Fluorescent Particles (fluorescent beads; Fisher Scientific) in 2x SSC. The sample was incubated with the DAPI and fluorescent beads solution for 5 minutes while rocking at RT, followed by a quick wash with 2x SSC, then stored in 2x SSC at 4 °C until ready for imaging.Protein stainingAfter smFISH imaging, the sample can be stored in 1x PBS at 4 °C for up to a week before protein staining. Samples were fixed a second time with 4% PFA in 1x PBS for 5 minutes at RT, then rinsed three times with 1x PBS. This was followed by incubation with a blocking solution of 1% BSA in 1x PBS for three consecutive times with 10 minutes each time at RT.The SON primary antibody (Anti-SON, Sigma Aldrich, HPA023535) was kept at −20 °C until ready for use. The primary antibody stock solution of 1:1000 was prepared with 1x PBS and kept on ice. A 1:5000 primary antibody dilution was prepared in blocking solution and the sample was incubated with 200 µl of the primary antibody solution for approximately 1 hour at RT in the dark.The sample was washed with blocking solution three consecutive times with a 10-minute incubation each time at RT, followed by three washes with 1x PBS, for 10 minutes each time at RT.The secondary antibody was conjugated to Alexa Fluor 647 (Goat anti-rabbit, Invitrogen, A21245). The concentrated secondary antibody was kept at 4 °C until ready for use. Sample staining was accomplished by 1:1000 dilution of the secondary antibody in blocking solution and casting of the sample on a 200 µl drop of the secondary antibody solution, with the cell side facing down. The sample was then incubated for 1 hour in the dark at RT. The sample was re-stained with DAPI in 1x PBS with the same concentration and incubation time described in smFISH staining section. This was followed by a quick rinse with 1x PBS and the sample was stored in 1x PBS at 4 °C until ready for imaging.smFISH image acquisitionsmFISH and protein imaging were done on the same MERFISH imaging and fluidic system described above (MERFISH imaging). After placing the sample into the flow cell, imaging buffer was flowed through the system (0.34 mL/minute for 5 minutes). Excitation and dichroic filters were the same as used above. Table 2 contains the dyes, lasers, and emission filters were used for smFISH imaging.Table 2 Imaging parameters used for smFISH experimentSamples were imaged with the same 63x oil immersion objective as above, and focus was maintained between imaging rounds using Definite Focus. 9 z planes were imaged with a step size of 0.7 µm. After imaging, smFISH probes were removed using a stripping buffer that was flowed through the system (0.34 mL/minutes for 5 minutes) without removing the sample from the microscope. After stripping the sample was washed with 2x SSC (0.5 mL/minutes for 5 minutes). The sample was imaged a second time using the same settings as above. After imaging the sample was removed from the flow cell and placed into 1x PBS prior to protein staining (Protein staining).After protein staining was complete, the sample was placed into the flow cell and filled with imaging buffer. The same region imaged during the smFISH experiment was found and reimaged using the same objective and z-stack settings as above. Table 3 contains the dyes, lasers, and emission filters were used for protein imaging.Table 3 Imaging parameters used for SON protein labeling experimentSRRM2 image registration and alignmentIndividual FOVs were exported from czi format into 16 bit tiff format using Zen’s (Carl Zeiss AG) image export method. To align images from the same FOV across multiple rounds of imaging or experiment, blue fluorescent beads imaged in the DAPI channel were used as fiducial markers. We found that aligning images from the same experiment required a simple translation. To align protein images with mRNA images, an iterative rotation and translation process was developed. For each iterative round of alignment, the protein DAPI channel was rotated, then translated to best align with the mRNA image, this warped image was then used as the starting protein DAPI image for the next round of alignment. We found that it took between 2 and 5 rounds of alignment to align protein images to mRNA images. Chromatic aberration was corrected by aligning all channels to the Cy5 channel. Multicolor beads (Multi-speck bead slide, Carl Zeiss AG, 1783-455) that included dyes in the Alexa Fluor 488, Cy3, and Cy5 channels were used to correct Alexa Fluor 488 and Cy3 channels. The DAPI channel was corrected to the Cy5 channel using the fiducial bead cross talk between the DAPI and Alexa Fluor 488 channels. This was done by calculating the shift between non-nuclear regions of the DAPI and Alexa Fluor 488 channels, then adding the Alexa Fluor 488 to Cy5 shift to the DAPI to Alexa Fluor 488 shift.SRRM2 image preprocessingTo remove cross talk in DAPI and Alexa Fluor 488 channels caused by the fiducial beads, stripped Alexa Fluor 488 mRNA channel was subtracted from the stained Alexa Fluor 488 channel. As fiducial beads are not affected by the mRNA stripping conditions, any spots that remain in the stripped Alexa Fluor 488 channel would be from the beads, not from MALAT1 mRNA. In order to reduce background in other images, round subtraction was also done on the other channels of the mRNA FOV.SRRM2 co-localization analysisCo-localization analysis was done on a single z plane from each experiment stack. Images were then filtered using a high pass filter (5 pixel sigma) and Lucy–Richardson deconvolution (10 iterations, 9 pixel filter size, 1.4 pixel sigma). Filtered images are then converted to binary masks with manually defined thresholds. To remove false positives in the MALAT1 channel, the MALAT1 mask was multiplied with the inverse of the stripped MALAT1 mask. Cell nuclei were identified using the DAPI channel and segmented using a manually defined threshold.The co-localization rate was calculated for each nucleus defined from the DAPI channel. To calculate the co-localization rate between two channels, each channel is multiplied against the nuclei mask. For each spot in the first mask, the spot was dilated by 2 µm and then compared against the second mask. If the dilated spot overlaps any spot in the second mask, it is considered to be colocalized. The colocalization rate was then calculated to be the following:$${colocalization\; percent}=\frac{{Co}-{localized\; spots\; ct}}{{Total\; spots\; ct}} * 100\%$$
(12)
The colocalization percent was averaged across 13 cells.Software usedWe used Merlin software for our U2OS data(v0.1.6,Zenodo, https://doi.org/10.5281/zenodo.3758540).SRRM2 figure generation (Fig. 4)SRRM2 exon and intron images were filtered using a high pass filter with 2 pixel sigma, while MALAT1 was filtered using high pass filter with 5 pixel sigma. Raw SON images were used in panels Fig. 4a–d.Reporting summaryFurther information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Hot Topics

Related Articles