Integrative spatial and genomic analysis of tumor heterogeneity with Tumoroscope

Breast cancer samplesThe breast tumor samples originating from an untreated invasive ductal carcinoma patient, specifically identified as HER2 positive, were obtained from the Department of Clinical Pathology and Cancer Diagnostics at Karolinska University Hospital, Stockholm, Sweden. Experimental procedures and protocols were approved by the regional ethics review board (Etikprövningsmyndigheten) in Stockholm (2016/957-31, amendment 2017/742-32, 2021-00795, and 2022-05245-02). Before surgery, informed consent was given to the patient for signature. The extracted tissue was promptly embedded in OCT for preservation, facilitating subsequent gene expression analysis through spatial transcriptomics. The residual material from each tumor section was allocated for comprehensive whole exome sequencing and single-cell RNA sequencing analysis42.ST experiments for breast cancer samplesFor the three breast cancer sections, H&E images and ST datasets were generated as detailed by Engblom et al.43. Specifically, each section was extracted from a different region of the tumor, and for each section, two nearby samples were collected. Single samples from sections SB2 and SB3 were already analyzed by Engblom et al.43, and the corresponding gene expression measurements were published there. Data for all six samples analyzed here are referenced in the Data and Code Availability section.Whole-exome sequencing for breast cancer samplesConcurrently, bulk DNA-seq procedures, as elucidated by Jun et al.42, were conducted on closely adjacent samples for each section of the breast samples. Specifically, the collected tissues were manually homogenized and genomic DNA samples were isolated by using the QIAamp DNA mini kit (QIAGEN). The library was prepared by using Twist Bioscience Human Core Exome kit (Twist Bioscience) according to the manufacturer’s protocol. The bulk DNA samples were then sequenced at 300x coverage depth in an S4 flow cell lane by the NovaSeq 6000 platform (Illumina) at the National Genomics Infrastructure, Science for Life Laboratory, Uppsala. Data for the WES samples is referenced in the Data and Code Availability section.ScRNA-seq experiments for breast cancer samplesThe collected breast cancer samples were prepared for single-cell RNA sequencing as described by Jun et al.42. Next, for scRNA-seq libraries, the Smart-Seq3 method was used according to the published protocol (PMID: 32518404). The list of used oligonucleotides is provided as supplementary data of Jun et al.42. Data for the scRNA-seq is referenced in the Data and Code Availability section.We employed a 384-well plate for single-cell RNA sequencing. Out of the total 384 cells, 224 were collected from the ST sections utilized in this study. We filtered the data based on the following criteria: mitochondrial expression fraction < 0.1, gene expression sum > 10,000, and the number of captured genes > 3,000, resulting in a final dataset of 120 high-quality cells for subsequent analysis.Prostate cancer samplesThe prostate cancer dataset was generated and published by Berglund et al.26. This dataset consists of twelve sections, with H&E images, bulk DNA-seq, and spatial transcriptomics provided for each section. The data were generated and processed using protocols as described in26.Identifying the spots that contain tumor cellsTo select the spots that contain tumor cells, we took advantage of H&E staining images of the analyzed tissues. For both breast and prostate cancer, regions containing cancer cells were annotated by an expert pathologist, Dr. Łukasz Koperski, using QuPath27. We further selected spots whose area overlapped with the pathologist’s annotated regions, using a custom script in QuPath27.Counting cells in spotsWe developed a custom script in QuPath27 to count cells in each ST spot visible in the H&E images27. The script takes as input coordinates and diameters of spots to define target areas. Then, we employ QuPath’s built-in cell counting algorithm to detect and count nuclei. In order to adjust parameters of the algorithm, we examined random spots by manually counting cells to verify the accuracy of the results.Spatial transcriptomics data preprocessingFor prostate cancer samples, the ST data bam files were provided by Berglund et al.26. For breast cancer samples, to create the genome index, we used the STAR program44 with the GRCh38 reference genome as input. Next, we applied the ST Pipeline with default setting45, providing the genome index, FASTQ files, barcodes and array coordinates as input. This pipeline ensures data integrity through quality checks, artifact removal, and genome alignment. We obtained the gene expression matrix as counts of reads for each gene, which the ST Pipeline produces by default. In addition, we modified the default settings to obtain bam files with the mapped reads.Bulk DNA-seq and somatic mutation callingWe identified somatic mutations that appeared in at least one of the bulk DNA-seq samples, by calling the mutations using Vardict28 for each sample with a p-value threshold equal to 0.1. Then, we used their union over sections as the set of mutations called in bulk DNA-seq data. This procedure was performed in the same way for the prostate and the breast dataset.Selection of somatic mutations that are detected both in bulk DNA-seq and ST dataNext, we identified the bulk DNA-seq mutations that were also present in ST data with the same alternated nucleotide. To calculate the total and alternated reads over the mutations in ST data, we located the selected bulk DNA-seq mutations in the ST bam files and counted the corresponding mapped reads with our script. The reads with a different nucleotide as compared to the reference genome were called the alternated reads. For ST data, we use the terms spot coverage and mutation coverage to refer to the number of reads found in a spot or over a mutation, respectively.Finally, we selected the mutations for which there existed at least one alternated read in at least one section. The alternated and total read counts in bulk DNA-seq data for the selected mutations were given as input for phylogenetic inference, while the alternated and total read counts in ST data for the same mutations were given as input to Tumoroscope. The median mutation coverage for the selected variant sites in bulk DNA-seq data for breast and prostate cancer were 214.5 and 134.75, respectively.Phylogenetic tree analysisTo identify the phylogenetic tree and infer the genotype and prevalence of each clone in the tree, we used a statistical method called Canopy9. The input to Canopy are variant allele frequencies of somatic single nucleotide alterations (SNAs), along with allele-specific mutation coverage ratios between the tumor and matched normal sample for somatic copy number alterations (CNAs). We used FalconX for producing the allele-specific mutation coverage ratio between tumor and normal sample29. We used the multi-sample feature of Canopy to infer the clonal evolution across the sections for both prostate and breast datasets.Mapping fractions of cells in ST spots to cancer clones using TumoroscopeTumoroscope is a probabilistic graphical model for estimating proportions of cancer clones in ST spots given alternated and total read counts over the analysed somatic mutations, genotypes and frequencies of the clones, and estimated cell counts per each spot (Fig. 1f). Let i ∈ {1, …, M} index the selected mutation positions, identified both in bulk DNA-seq and ST data. We are given a set of K cancer clones, indexed by k ∈ {1, …, K} as input, which has been derived from bulk DNA-seq data. The genotypes of the input clones are represented as a matrix C with entries between 0 and 1 corresponding to the zygosity. Ci,k equals 0 if there is no mutation on position i in clone k, equals 1 in case all alleles of that position carry the mutation, and equals 0.5 when the half of the alleles of that position carry the mutation. Note that there can be multiple alleles for position i. In general, the zygosity is defined as the ratio of the number of mutated alleles to the total number of alleles and we estimate it by the ratio of the major allele frequency to the total read count. The prevalence of the clones in the bulk DNA-seq is represented by the vector F = (F1, . . . , FK), with values summing up to one. Let s ∈ {1, …, S} index the spots. We use a feature allocation model to account for the presence of clones in spots46. Specifically, we define Zs,k ∈ {0, 1} as an indicator of the presence of clone k in spot s. We assume a Bernoulli distribution over Zs,k and a Beta prior over its parameter Π with hyper-parameter ζs:$${\mathbb{P}}({{{{\bf{Z}}}}}_{s,k}| {{{{\boldsymbol{\Pi }}}}}_{s,k}) \sim {{{\rm{Bern}}}}({{{{\boldsymbol{\Pi }}}}}_{s,k}),$$
(1)
$${\mathbb{P}}({{{{\boldsymbol{\Pi }}}}}_{s,k}| {{{{\boldsymbol{\zeta }}}}}_{s},K) \sim {{{\rm{Beta}}}}\left(\frac{{\zeta }_{s}}{K},1\right).$$
(2)
Let 1 = {1}K denotes a K-dimensional vector with all elements equal to 1. Bearing in mind the assumption about Beta prior over Πs,k, we calculate the expected number of nonzero entries in each spot \({\mathbb{E}}[{{{{\bf{Z}}}}}_{.,k}^{T}{{{\bf{1}}}}]\) using the formula for the mean of the Beta distribution as47,48$${\mathbb{E}}[{{{{\bf{Z}}}}}_{.,k}^{T}{{{\bf{1}}}}]={\sum}_{k=1}^{K}{\mathbb{E}}[{{{{\bf{Z}}}}}_{s,k}]=K{\mathbb{E}}[{{{{\bf{Z}}}}}_{s,k}]=K\frac{\frac{{{{{\boldsymbol{\zeta }}}}}_{s}}{K}}{\frac{{{{{\boldsymbol{\zeta }}}}}_{s}}{K}+1}=\frac{K{{{{\boldsymbol{\zeta }}}}}_{s}}{{{{{\boldsymbol{\zeta }}}}}_{s}+K}.$$
(3)
Given this formula and the number of the clones K, we are able to control the expected number of clones in each spot by tuning shape parameter of the beta distribution, \(\frac{{{{{\boldsymbol{\zeta }}}}}_{s}}{K}\).Our main goal is to estimate the proportions of clones in the spots, which are represented by the variable H, a matrix with S rows and K columns. The value of an element Hs,k is the fraction of spot s coming from clone k. We consider a Dirichlet distribution over Hs,. = (Hs,1, …, Hs,K),$${\mathbb{P}}({{{{\bf{H}}}}}_{s,1},\ldots,{{{{\bf{H}}}}}_{s,K}| {{{{\bf{F}}}}}^{{\prime} },{F}_{0},{{{{\bf{Z}}}}}_{s,.}) \sim {{{\rm{Dirichlet}}}}({{{{{\bf{F}}}}}_{1}^{{\prime} }}^{{{{{\bf{Z}}}}}_{s,1}}{{F}_{0}}^{1-{{{{\bf{Z}}}}}_{s,1}},\ldots,{{{{{\bf{F}}}}}_{K}^{{\prime} }}^{{{{{\bf{Z}}}}}_{s,K}}{{F}_{0}}^{1-{{{{\bf{Z}}}}}_{s,K}}).$$
(4)
Here, F0 corresponds to a ”pseudo-frequency”, and results in non-zero proportions for all clones for each spot. We set F0 to a small number, effectively assigning small proportions to clones that are not present in the spot. The \({{{{\bf{F}}}}}^{{\prime} }=({{{{\bf{F}}}}}_{1}^{{\prime} },…,{{{{\bf{F}}}}}_{K}^{{\prime} })\) are obtained as discretized frequencies F. Specifically, we discretize the values of F by dividing the range from 0 to 1 into 20 equal-sized bins and then round up the values to the upper-bounds of the bins and scale them by multiplicative factor l$${{{{\bf{F}}}}}_{k}^{{\prime} }=l\times \frac{\left\lceil 20\times {{{{\bf{F}}}}}_{k}\right\rceil }{20},$$
(5)
where we used l = 100, but it can be specified by the user.To sample H, we take advantage of the relation between Dirichlet and Gamma distribution49 and draw K independent random samples (Gs,1, …, Gs,K) from K Gamma distributions,$${\mathbb{P}}({{{{\bf{G}}}}}_{s,k}| {{{{\bf{F}}}}}_{k}^{{\prime} },{F}_{0},{{{{\bf{Z}}}}}_{s,k}) \sim {{{\rm{Gamma}}}}({{{{{\bf{F}}}}}_{k}^{{\prime} }}^{{{{{\bf{Z}}}}}_{s,k}}{{F}_{0}}^{1-{{{{\bf{Z}}}}}_{s,k}},1),$$
(6)
and then we calculate the proportions H:$${{{{\bf{H}}}}}_{s,k}=\frac{{{{{\bf{G}}}}}_{s,k}}{{\sum }_{l=1}^{K}\,{{{{\bf{G}}}}}_{s,k}}.$$
(7)
The total read count at position i in spot s is represented by observed variable Di,s. We assume a Poisson distribution over Di,s,$${\mathbb{P}}({{{{\bf{D}}}}}_{i,s}| {{{{\bf{H}}}}}_{s,.},{{{{\mathbf{\Phi }}}}}_{i,.},{{{{\bf{N}}}}}_{s}) \sim {{{\rm{Pois}}}}\left({{{{\bf{N}}}}}_{s}{\sum}_{k}{{{{\bf{H}}}}}_{s,k}{{{{\mathbf{\Phi }}}}}_{i,k}\right),$$
(8)
where Φi,k is the average mutation coverage for the position i across the cells from clone k, and Ns is the number of cells in spot s. The variables Ns can be fixed to a priori known values.However, in most practical applications, the number of cells per spot is not known. This gives a compelling reason to estimate them as a part of model inference. We assume a Poisson distribution over Ns,$${\mathbb{P}}({{{{\bf{N}}}}}_{s}| {{{{\boldsymbol{\Lambda }}}}}_{s}) \sim {{{\rm{Pois}}}}\left({{{{\boldsymbol{\Lambda }}}}}_{s}\right),$$
(9)
where Λs is the expected number of cells in spot s. Also, we assume a Gamma distribution over Φi,k,$${\mathbb{P}}({{{{\mathbf{\Phi }}}}}_{i,k}| r,p) \sim {{{\rm{Gamma}}}}(r,p),$$
(10)
where r and p are the shape and rate hyperparameters, respectively.Ai,s represents the number of alternated reads for position i in spot s. We assume a Binomial distribution over Ai,s,$${\mathbb{P}}({{{{\bf{A}}}}}_{i,s}| {{{{\bf{D}}}}}_{i,s},{{{{\bf{H}}}}}_{s,.}{{{{\mathbf{\Phi }}}}}_{i,.}{{{{\bf{C}}}}}_{i,.}) \sim {{{\rm{Binom}}}}\left({{{{\bf{D}}}}}_{i,s},\frac{{\sum }_{k=1}^{K}{{{{\bf{H}}}}}_{s,k}{{{{\mathbf{\Phi }}}}}_{i,k}{{{{\bf{C}}}}}_{i,k}}{\mathop{\sum }_{k=1}^{K}{{{{\bf{H}}}}}_{s,k}{{{{\mathbf{\Phi }}}}}_{i,k}}\right).$$
(11)
Where the success probability of Binomial distribution is the probability of observing Ai,s alternated reads out of Di,s reads in total. Given the variables Ns, Hs,. and Φi,., we calculate the expected number of alternated reads and the total reads in spot s using \({{{{\bf{N}}}}}_{s}\mathop{\sum }_{k=1}^{K}{{{{\bf{H}}}}}_{s,k}{{{{\mathbf{\Phi }}}}}_{i,k}{{{{\bf{C}}}}}_{i,k}\) and \({{{{\bf{N}}}}}_{s}\mathop{\sum }_{k=1}^{K}{{{{\bf{H}}}}}_{s,k}{{{{\mathbf{\Phi }}}}}_{i,k}\), respectively. Therefore, to calculate the success probability, we determine the fraction of expected alternative reads to the total reads, which remains constant despite fluctuations in gene expression levels.Metropolis-Hasting inside Gibbs SamplingIn the Gibbs sampling, we iteratively generate samples from each hidden variable’s conditional distribution, given the remaining variables, in order to estimate the posterior distribution of the hidden variables. Each hidden variable given the variables in its Markov Blanket is conditionally independent of all variables outside its Markov Blanket in the graphical model50. A variable’s Markov Blanket includes its parents, children, and children’s parents. If the conditional distribution does not have a closed analytical form, we use a Metropolis-Hasting step inside the Gibbs sampler. In the following, we describe the sampling steps for each hidden variable.The variables with the closed-form sampling distributionΠs,k and Zs,k are the only variables with analytical sampling distributions.Sampling Π
s,k
For sampling Πs,k, we take advantage of the conjugacy of Beta and Bernoulli distributions:$$\begin{array}{l}{\mathbb{P}}({{{{\boldsymbol{\Pi }}}}}_{s,k}| {{{{\boldsymbol{\zeta }}}}}_{s},K,{{{{\bf{Z}}}}}_{s,k})\propto {\mathbb{P}}({{{{\boldsymbol{\Pi }}}}}_{s,k}| {{{{\boldsymbol{\zeta }}}}}_{s},K){\mathbb{P}}({{{{\bf{Z}}}}}_{s,k}| {{{{\boldsymbol{\Pi }}}}}_{s,k})\\={{{\rm{Beta}}}}\left({{{{\boldsymbol{\Pi }}}}}_{s,k}| \frac{{{{{\boldsymbol{\zeta }}}}}_{s}}{K},1\right){{{\rm{Bern}}}}({{{{\bf{Z}}}}}_{s,k}| {{{{\boldsymbol{\Pi }}}}}_{s,k})={{{\rm{Beta}}}}({{{{\boldsymbol{\Pi }}}}}_{s,k}| \frac{{{{{\boldsymbol{\zeta }}}}}_{s}}{K}+{{{{\bf{Z}}}}}_{s,k},2-{{{{\bf{Z}}}}}_{s,k}).\end{array}$$
(12)
Sampling Z
s,k
For sampling Zs,k, we utilize the fact that this variable only accepts binary values. Therefore, we sample 0 or 1, proportional to their corresponding calculated probabilities.$$\begin{array}{l}{\mathbb{P}}({{{{\bf{Z}}}}}_{s,k}| {{{{\boldsymbol{\Pi }}}}}_{k},{{{{\bf{G}}}}}_{s,k},{{{{\bf{F}}}}}_{k},{F}_{0},l)\propto {\mathbb{P}}({{{{\bf{Z}}}}}_{s,k}| {{{{\boldsymbol{\Pi }}}}}_{k}){\mathbb{P}}({{{{\bf{G}}}}}_{s,k}| {{{{\bf{F}}}}}_{k},{F}_{0},l)\\={{{\rm{Bern}}}}({{{{\bf{Z}}}}}_{s,k}| {{{{\boldsymbol{\Pi }}}}}_{k}){{{\rm{Gamma}}}}({{{{\bf{G}}}}}_{s,k}| {{{{\bf{F}}}}}_{k}^{{{{{\bf{Z}}}}}_{s,k}}{F}_{0}^{1-{{{{\bf{Z}}}}}_{s,k}},1).\end{array}$$
(13)
Metropolis-Hasting adaptive steps inside Gibbs samplerIn our model, there is no closed analytical form of conditional distribution for variables Φi,k, Gs,k and Ns. Therefore, we take advantage of Metropolis-Hasting inside Gibbs sampler. We compute the acceptance ratio A as the following$${{{\bf{A}}}}=\frac{f({x}_{c})Q({x}_{n}| {x}_{c})}{f({x}_{n})Q({x}_{c}| {x}_{n})}.$$
(14)
Where f(x) is a function that is proportional to the desired density function P(x) and Q is the proposal distribution. Bearing in mind the non-negativity of the variables of our interest, we choose a Truncated Normal distribution for Q with the mean value of the current sample xc and variances σΦ, σG and σN corresponding to each variable. The variance of the Truncated Normal distribution determines the proximity of the new sample from the current one, which is interpreted as the step size. The choice of the step size has a major impact on the acceptance rate of the Metropolis Hasting. We tune the σΦ, σG and σN every b steps starting with an arbitrary value based on the feedback from the acceptance rate. Firstly, we choose an optimal acceptance rate Ro for each variable. Secondly, we modify the variance by δ percent of the current variance, and δ is calculated by the difference between the optimal and current acceptance rate Rc. Ultimately, during the sampling steps, we learn the optimal variance value for each variable.$${\delta }_{t}={R}_{o}-{R}_{c}$$
(15)
$${\sigma }_{t+1}={\sigma }_{t}(1+\delta+t)$$
(16)
In the following, we describe the conditional distribution for each variable.Conditional distribution for Φ
i,k
$$\begin{array}{l}{\mathbb{P}}({{{{\mathbf{\Phi }}}}}_{ik}| r,p,{{{{\bf{A}}}}}_{i,.},{{{{\bf{H}}}}}_{.,k},{{{{\bf{C}}}}}_{i,.},{{{{\bf{D}}}}}_{i,.},{{{{\bf{N}}}}}_{.})\\ \propto {\mathbb{P}}({{{{\mathbf{\Phi }}}}}_{ik}| r,p) \mathop{\prod}_{s}{\mathbb{P}}({{{{\bf{A}}}}}_{i,s}| {{{{\bf{H}}}}}_{s,k},{{{{\mathbf{\Phi }}}}}_{i,k},{{{{\bf{C}}}}}_{i,k}) {\prod}_{s}{\mathbb{P}}({{{{\bf{D}}}}}_{i,s}| {{{{\mathbf{\Phi }}}}}_{i,k},{{{{\bf{H}}}}}_{s,k},{{{{\bf{N}}}}}_{s})\\={{{\rm{Gamma}}}}(r,p) {\prod} _{s}{{{\rm{Binom}}}}\left({{{{\bf{A}}}}}_{i,s}| {{{{\bf{D}}}}}_{i,s},\frac{{\sum }_{k=1}^{K}{{{{\bf{H}}}}}_{s,k}{{{{\mathbf{\Phi }}}}}_{i,k}{{{{\bf{C}}}}}_{i,k}}{{\sum }_{k=0}^{K}{{{{\bf{H}}}}}_{s,k}{{{{\mathbf{\Phi }}}}}_{i,k}}\right) \mathop{\prod}_{s}{{{\rm{Pois}}}}\left({{{{\bf{D}}}}}_{i,s}| {{{{\bf{N}}}}}_{s} {\sum}_{k}{{{{\bf{H}}}}}_{s,k}{{{{\mathbf{\Phi }}}}}_{i,k}\right).\end{array}$$
(17)
Conditional distribution for G
s,k
$$\begin{array}{l}{\mathbb{P}}({{{{\bf{G}}}}}_{s,k}| {{{{\bf{F}}}}}_{k}^{{\prime} },{F}_{0},{{{{\bf{Z}}}}}_{s,k},{{{{\bf{A}}}}}_{.,s},{{{{\bf{D}}}}}_{.,s},{{{{\mathbf{\Phi }}}}}_{.,k},{{{{\bf{C}}}}}_{.,k},{{{{\bf{N}}}}}_{s})\\ \propto {\mathbb{P}}({{{{\bf{G}}}}}_{s,k}| {{{{\bf{F}}}}}_{k},{F}_{0},{{{{\bf{Z}}}}}_{s,k}) {\prod} _{i}{\mathbb{P}}({{{{\bf{A}}}}}_{i,s}| {{{{\bf{H}}}}}_{s,k},{{{{\mathbf{\Phi }}}}}_{i,k},{{{{\bf{C}}}}}_{i,k}) {\prod} _{i}{\mathbb{P}}({{{{\bf{D}}}}}_{i,s}| {{{{\mathbf{\Phi }}}}}_{i,k},{{{{\bf{H}}}}}_{s,k},{{{{\bf{N}}}}}_{s})\\={{{\rm{Gamma}}}}({{{{{\bf{F}}}}}_{k}^{{\prime} }}^{{{{{\bf{Z}}}}}_{s,k}}{{F}_{0}}^{1-{{{{\bf{Z}}}}}_{s,k}},1) {\prod} _{i}{{{\rm{Binom}}}}\left({{{{\bf{A}}}}}_{i,s}| {{{{\bf{D}}}}}_{i,s},\frac{{\sum }_{k=1}^{K}{{{{\bf{H}}}}}_{s,k}{{{{\mathbf{\Phi }}}}}_{i,k}{{{{\bf{C}}}}}_{i,k}}{{\sum }_{k=0}^{K}{{{{\bf{H}}}}}_{s,k}{{{{\mathbf{\Phi }}}}}_{i,k}}\right) \mathop{\prod} _{i}{{{\rm{Pois}}}}\left({{{{\bf{D}}}}}_{i,s}| {{{{\bf{N}}}}}_{s} {\sum}_{k}{{{{\bf{H}}}}}_{s,k}{{{{\mathbf{\Phi }}}}}_{i,k}\right).\end{array}$$
(18)
Sampling N
s
$$\begin{array}{l}{\mathbb{P}}({{{{\bf{N}}}}}_{s}| {{{{\boldsymbol{\Lambda }}}}}_{s},{{{{\bf{D}}}}}_{.,s},{{{\mathbf{\Phi }}}},{{{{\bf{H}}}}}_{s,.})\propto {\mathbb{P}}({{{{\bf{N}}}}}_{s}| {{{{\boldsymbol{\Lambda }}}}}_{s}){\prod}_{i}{\mathbb{P}}({{{{\bf{D}}}}}_{i,s}| {{{{\mathbf{\Phi }}}}}_{i,.},{{{{\bf{H}}}}}_{s,.},{{{{\bf{N}}}}}_{s})\\={{{\rm{Pois}}}}\left({{{{\bf{N}}}}}_{s}| {{{{\boldsymbol{\Lambda }}}}}_{s}\right) {\prod} _{i}{{{\rm{Pois}}}}\left({{{{\bf{D}}}}}_{i,s}| {{{{\bf{N}}}}}_{s} {\sum}_{k}{{{{\bf{H}}}}}_{s,k}{{{{\mathbf{\Phi }}}}}_{i,k}\right).\end{array}$$
(19)
Parameter setting for different simulation setupsFirst, we calculate the parameter of the Beta distribution over variable Πs,k based on the assumed expected value of the number of clones:$$\frac{{{{{\boldsymbol{\zeta }}}}}_{s}}{k}=\frac{{\mathbb{E}}\left[{{{{\bf{Z}}}}}_{.,k}^{T}{{{\bf{1}}}}\right]}{K-{\mathbb{E}}\left[{{{{\bf{Z}}}}}_{.,k}^{T}{{{\bf{1}}}}\right]}.$$
(20)
Considering expected values of 1, 2.5, and 4.5 for the number of clones found in each spot, we obtain 0.25, 1, and 9 and use these values for the Beta distribution parameter.Second, we exploit Φi,k that represents the expected number of reads for mutation i in each cell for generating different read coverage for total and alternated read counts. We set p = 1. With this, we control the expected value of Φi,k using parameter r.$${\mathbb{P}}({{{{\mathbf{\Phi }}}}}_{i,k}| r,p) \sim {{{\rm{Gamma}}}}(r,p),$$
(21)
$${\mathbb{E}}[{{{{\mathbf{\Phi }}}}}_{i,k}]=\frac{r}{{p}^{2}}.$$
(22)
For the very low, low, medium and high number of reads, we consider r = 0.02, r = 0.07, r = 0.09 and r = 0.19, respectively, leading to the 18, 50, 80, and 110 average total reads for each spot.Last, we generate three datasets for the number of cells with different levels of noise to compare our two models, which have the number of cells as observed and hidden variables. We add the noise value ϵ to the true values.$${{{{\bf{N}}}}}_{s}={{{{\bf{N}}}}}_{s}+\epsilon .$$
(23)
We consider ϵ = 0, ϵ ~ Pois(1) and ϵ ~ Pois(10) for generating without noise, noisy and highly noisy number of cells.Parameter estimation obtained for the real dataFor the higher accuracy of the graphical model reflecting the real data, we estimate the input parameters of the model based on the characteristics of the data. The first parameter is λs, the expected number of the cells in spot s, which affects the estimation of the number of cells and, ultimately, the number of reads we are expecting, which is a crucial element for estimating the fraction of the clones. Therefore, we estimate the number of cells using the H&E images and a customized script in QuPath and use them as the mean parameter for the Poisson distribution over N(described above)27. Next parameters are r and p, the shape and rate in the Gamma distribution over variable Φ. We use mixed type log-moment estimators for calculating r and p51.$$\hat{r}=\frac{I{\sum }_{i=1}^{I}{{{{\bf{x}}}}}_{i}}{I{\sum }_{i=1}^{I}{{{{\bf{x}}}}}_{i}{{\ln }}({{{{\bf{x}}}}}_{i})-{\sum }_{i=1}^{I}{{\ln }}({{{{\bf{x}}}}}_{i}){\sum }_{i=1}^{I}{{{{\bf{x}}}}}_{i}}.$$
(24)
$$\hat{p}=\frac{{I}^{2}}{I{\sum }_{i=1}^{I}{{{{\bf{x}}}}}_{i}\ln ({{{{\bf{x}}}}}_{i})-\mathop{\sum }_{i=1}^{I}{{\ln }}({{{{\bf{x}}}}}_{i})\mathop{\sum }_{i=1}^{I}{{{{\bf{x}}}}}_{i}}.$$
(25)
Where xi with i ∈ {1, …, I} are the sample from Gamma distribution. We generate these samples using the total number of reads D. We calculate the average number of reads from every cell, dividing the reads from the spots by the number of estimated cells as input, which gives us I samples, equal to the number of mutations.$${{{{\bf{x}}}}}_{i}=\frac{1}{S}\mathop{\sum}_{s}\frac{{{{{\bf{D}}}}}_{i,s}}{{{{{\bf{n}}}}}_{s}}.$$
(26)
Clonal composition resemblance in adjacent spotsThe evolutionary process imposes the similarity of the clonal composition in the adjacent spots. Therefore, we expect to have a higher correlation between the clonal composition of the adjacent spots as compared to distant spots. To make this comparison, we randomly generate N pair of adjacent spots \(\left([({{{{\bf{X}}}}}_{1},{{{{\bf{Y}}}}}_{1}),({{{{\bf{X}}}}}_{1}^{{\prime} },{{{{\bf{Y}}}}}_{1}^{{\prime} })]\,\ldots [({{{{\bf{X}}}}}_{N},{{{{\bf{Y}}}}}_{N}),({{{{\bf{X}}}}}_{N}^{{\prime} },{{{{\bf{Y}}}}}_{N}^{{\prime} })]\right)\) with X and Y corresponding to their coordinates. These adjacent pairs satisfy two constraints of \({{{{\bf{X}}}}}_{j}-{{{{\bf{X}}}}}_{j}^{{\prime} }\le 1\) and \({{{{\bf{Y}}}}}_{j}-{{{{\bf{Y}}}}}_{j}^{{\prime} }\le 1\) indexed by j ∈ {1, …, N}. We also generate N pair of distant spots with the two constraints of \({{{{\bf{X}}}}}_{j}-{{{{\bf{X}}}}}_{j}^{{\prime} } > 1\) and \({{{{\bf{Y}}}}}_{j}-{{{{\bf{Y}}}}}_{j}^{{\prime} } > 1\). We define \([{{{{\bf{V}}}}}_{k,j},{{{{\bf{V}}}}}_{k,j}^{{\prime} }]\) as the fraction of clone k in spots corresponding to the jth pair in the adjacent spots. Then we calculate the Pearson correlation for the vector \([({{{{\bf{V}}}}}_{k,1},{{{{\bf{V}}}}}_{k,1}^{{\prime} }),\ldots,({{{{\bf{V}}}}}_{k,N},{{{{\bf{V}}}}}_{k,N}^{{\prime} })]\). The procedure is repeated for all the clones and distant spots for the sake of comparison.Clonal assignment of the spots using cardelinoCardelino14 is a statistical method originally developed for inferring the clone of origin of individual cells using scRNA-seq. It integrates information from imperfect clonal trees inferred from WES data and sparse variant alleles expressed in scRNA-seq data. However, here, we applied it to ST instead of scRNA-seq to validate the assumption of a mixture of clones in each ST spot instead of assuming homogenous spots containing only one clone. We used clone_id function with “sampling” inference mode, minimum iteration of 100000 and maximum iteration of 250000. We used 3 parallel chains for prostate cancer data and 1 chain for breast cancer data due to the high RAM demand of cardelino. The RAM demand of cardelino grows with the number of spots. With 294 spots in the prostate datasets and 11,461 spots in the breast cancer dataset, it is nearly 39 times larger and running 3 chains becomes computationally infeasible.Assigning spots to copy number clones using STARCHTo perform the ST data analysis using STARCH, we have used the default code configuration, as recommended by the authors. Specifically, we used the run command

python run_STARCH.py -i gene_expression_matrix.csv (or 10X_directory/)

––output name ––n_clusters K ––outdir output/directory/where K is the number of clusters (corresponding to the number of identified copy number clones). As suggested by the Authors, we selected the number K of clones by computing the average silhouette score for a range of K and selecting the value of K with the highest average silhouette score.Estimating gene expression of the clonesHaving the proportions of the clones in each spot inferred using Tumoroscope and gene expression data from ST, we estimate average clonal gene expression using a regression model. Let g ∈ {1, …, G} index genes and Y be a matrix with S rows and G columns, where Ys,g is the measured gene expression of gene g in spot s. We are interested in estimating Bk,g – average gene expression of gene g in one cell of clone k. We use H and N variables inferred by Tumoroscope, and we rewrite N as an S × S diagonal matrix \({{{{\bf{N}}}}}^{{\prime} }\), where \({{{{\bf{N}}}}}_{s,s}^{{\prime} }\) is the number of cells in spot s and other elements of the matrix are equal to zero. We describe the relationship between the variables with an overdetermined system of equations \({{{{\bf{N}}}}}^{{\prime} }{{{\bf{HB}}}}={{{\bf{Y}}}}\). Then we try to find the optimal solution of this equation using linear regression with a lower bound of Bk,g ≥ 0 and no intercept. For this purpose, we apply a Python function scipy. optimize. lsq_linear to the data.Validation of gene expression profiles via independent single-cell RNA-seq dataThe scRNA-seq dataset comprised 302 cells and captured the expression of 24,300 genes. We selected the top 100 genes with the highest variance across the clone-specific gene expression profiles inferred by Tumoroscope.To mitigate potential biases arising from the use of two distinct technologies for gene expression measurement, we harmonized the data by aligning the centers of both the scRNA-seq data and our inferred profiles. This alignment involved adding the distance between these two centers to the Tumoroscope-inferred profiles, followed by Z-score normalization on both cells and genes.Subsequently, we employed PCA to project the data into a space of reduced dimension. Next, we performed K-means clustering on the reduced gene expression profiles of individual cells. To determine the optimal number of clusters for K-means, we tested a range from 5 to 25 clusters and evaluated each clustering solution using the Dunn Index measure.Reporting summaryFurther information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Hot Topics

Related Articles