Inferring replication timing and proliferation dynamics from single-cell DNA sequencing data

PERT modelThe input for PERT is scWGS read depth (Z) and called CN states for all bins (N cells  ×  M loci) (Fig. S1a). Input CN states are obtained through single-cell CN callers such as HMMcopy49,84 or 10X CellRanger-DNA51,52. PERT first identifies a set of high-confidence G1/2-phase cells where the input CN states reflect accurate somatic CN. All remaining cells have their input CN states dropped as they are initially considered to have unknown CN states and cell cycle phase. Most S-phase cells should be present in the unknown initial set. High-confidence G1/2-phase cells are phylogenetically clustered into clones based on CN using methods such as sitka85 or MEDICC286. Optionally, users can provide the set of high-quality G1/2-phase cells and their mapping to CN clones as input.The initialized sets of cells are passed into a probabilistic model which infers somatic CN (X ) and replication states (Y ) through three distinct learning steps (Fig. S1c–e). In Step 1, PERT learns parameters associated with library-level GC bias (βμ, βσ) and sequencing overdispersion (λ) by training on high-confidence G1/2 cells (Fig. S1c). Step 1 conditions on CN (X ), replication (Y ), and coverage/ploidy scaling terms (μ) because input CN states are assumed to accurately reflect somatic CN states and all bins are unreplicated (Y = 0) in high-confidence G1/2 cells. Once βμ, βσ and μ have been learned in Step 1, we can condition on them in Step 2 (Fig. S1d). Step 2 learns latent parameters representing each cell’s time in S-phase (τn), each locus’s replication timing (ρm), and global replication stochasticity (α) to compute the probability that a given bin is replicated (Yn,m = 1) or unreplicated (Yn,m = 0). Only unknown cells are included in Step 2. Prior belief on each unknown cell’s CN state is encapsulated using a prior distribution (π) which has concentration parameters (η) conditioned on the input CN of the most similar high-confidence G1/2 cells. CN prior concentrations are set for each cell by using the consensus CN profile of the most similar G1/2 clone or a composite scoring of the most similar G1/2 clone and cell CN profiles (Fig. S1f) A full list of model parameters, domains, and distributions can be found in Fig. S1b. Step 3 is an optional final step which learns CN and replication states for high-confidence G1/2 cells (Fig. S1e). This step is necessary to determine if any S-phase cells are present in the initial set of high-confidence G1/2 cells. Step 3 conditions on replication timing (ρ) and stochasticity (α) values learned in Step 2 to ensure that such properties are conserved between both sets of cells.PERT is designed for scWGS data with coverage depths on the order of 0.01-0.1x and thus 500kb bin sizes are used by default in this manuscript; however, the model can be run on count data of any bin size as long as sufficient memory and runtime are allocated. We demonstrate PERT’s ability to run on 10X scWGS data at 20kb resolution in Additional File 2.Equations for step 1Given that we have accurate CN caller results for high-confidence G1/2 cells, we can solve for each cell’s coverage/ploidy scaling term μn and condition on it,$${\mu }_{n}=\frac{{\sum }_{m=0}^{M}{Z}_{n , m}}{\mathop{\sum }_{m=0}^{M}{X}_{n , m}}.$$
(1)
The latent variables are arranged together in function block f through the following equations to produce the bin-specific negative binomial event counts δn,m. The GC bias rate of each individual bin (ωn,m) depends on the GC content of the locus (γm) and the GC bias coefficients (βn,k) for the cell,$${\omega }_{n,m}={e}^{\mathop{\sum }_{k=0}^{K}{\beta }_{n,k} * {\gamma }_{m}^{k}}.$$
(2)
The expected read count per bin is computed as follows:$${\theta }_{n,m}={X}_{n,m} * {\omega }_{n,m} * {\mu }_{n}.$$
(3)
The expected read count per bin is then used in conjunction with the negative binomial event success probability term (λ) to produce a number of negative binomial event count for each bin,$${\delta }_{n,m}=f({X}_{n,m},{\gamma }_{m},\lambda,{\mu }_{n},{\beta }_{n,k})=\frac{{\theta }_{n,m} * (1-\lambda )}{\lambda },$$
(4)
where we place the constraint δn,m ≥ 1 to avoid sampling errors in bins with θn,m ≈ 0. Finally, the read count at a bin is sampled from an overdispersed negative binomial distribution Zn,m ~ NB(δn,m, λ) where the expected read count for Zn,m is θn,m and the variance is \(\frac{{\theta }_{n,m}}{(1-\lambda )}\).Equations for steps 2–3Steps 2-3 have equations which differ from Step 1 since it must account for replicated bins and cannot solve for μn analytically. The probability of each bin being replicated (ϕn,m) is a function of the cell’s time in S-phase (τn), the locus’s replication timing (ρm), and the replication stochasticity term (α). Replication stochasticity (α) controls how closely cells follow the global RT profile by adjusting the temperature of a sigmoid function. The following equation corresponds to function block g:$${\phi }_{n,m}=g(\alpha, {\tau }_{n},{\rho }_{m})=\frac{1}{1+{e}^{-\alpha ({\tau }_{n}-{\rho }_{m})}}.$$
(5)
Equations corresponding to function block f differ from those in Step 1. The total CN (χn,m) is double the somatic CN (Xn,m) when a bin is replicated (Yn,m = 1),$${\chi }_{n,m}={X}_{n,m} * (1+{Y}_{n,m}).$$
(6)
The GC rates (ωn,m) and negative binomial event counts (δn,m) are computed the same as in Step 1 (Eq (2), Eq (4)). However, the expected read count uses total instead of somatic CN,$${\theta }_{n,m}={\chi }_{n,m} * {\omega }_{n,m} * {\mu }_{n}.$$
(7)
Since CN is learned in Steps 2-3, the coverage/ploidy scaling term (μn) must also be learned. We use a normal prior \({\mu }_{n} \sim {{\rm{N}}}({\mu }_{\mu }^{n},{\mu }_{\sigma }^{n})\) where the approximate total ploidy and total read counts are used to estimate the mean hyperparameters (\({\mu }_{\mu }^{n}\)). Total ploidies for each cell are approximated using the CN prior concentrations (η) and times within S-phase (τ) to account for both somatic and replicated copies of DNA that are present. We fixed the standard deviation hyperparameters (\({\mu }_{\sigma }^{n}\)) to always be 10x smaller than the means to ensure that μn≥0 despite use of a normal distribution (used for computational expediency),$${\mu }_{\mu }^{n}=\frac{{\sum }_{m=0}^{M}{Z}_{n,m}}{(1+{\tau }_{n})\mathop{\sum }_{m=0}^{M}{{{\rm{argmax}}}}_{p}({\eta }_{n,m,p})},$$
(8)
$${\mu }_{\sigma }^{n}=\frac{{\mu }_{\mu }^{n}}{10}.$$
(9)
Constructing the CN prior concentrationsThere are two ways to construct the CN prior concentrations within PERT. The first is to use the most similar high-confidence G1/2 clone to define the concentrations for each unknown cell (clone method). We assign each unknown cell its clone (cn) via Pearson correlation between the cell read depth profile (Zn) and the clone pseudobulk read depth profile (Zc),$${c}_{n}={{{\rm{argmax}}}}_{c}({{\rm{corr}}}({Z}_{n},{Z}_{c})).$$
(10)
Clone pseudobulk CN and read depth profiles represent the median profile across all high-confidence G1/2 cells in a given clone c. Once we have clone assignments for each unknown cell, the CN concentration of all possible states P at each genomic bin (ηn,m,p) is constructed to be w times larger for the state p that matches the clone pseudobulk CN state (\({X}_{{c}_{n},m}\)) for that same bin compared to all other states. The default setting is w = 106:$${\eta }_{n,m,p}=\left\{\begin{array}{ll}w\quad &\,{{\rm{if}}}\,\,p={X}_{{c}_{n},m}\\ 1\quad &\,{\mbox{else}}\,.\hfill\end{array}\right.$$
(11)
The second way to construct the prior is to leverage additional information from the most similar high-confidence G1/2 cells when constructing ηn,m,p (composite method). The rationale for the composite method is that there might be rare CNAs within a clone which only appear in a handful of cells but do not appear in the clone pseudobulk CN profile Xc. To find the most similar high-confidence G1/2 cells, we compute the read depth correlation between the unknown cell (\({Z}_{{n}_{s}}\)) and the high-confidence G1/2 cells from the best matching clone (\({Z}_{{n}_{g}}\)),$$\psi={{\rm{corr}}}({Z}_{{n}_{s}\!},\,\, {Z}_{{n}_{g}}\!).$$
(12)
The consensus clone CN profile and top J matches for each unknown cell are then used to construct the CN prior (ηn,m,p). Each row of ψ is sorted to obtain the top J high-confidence G1/2 matches ng(0), …, ng( J−1). All entries are initialized to 1 (ηn,m,k = 1) before adding varying levels of weight (w) to states where the CN matches a G1/2-phase cell or clone pseudobulk CN profile. The default settings are w = 105 and J = 5:$${\eta }_{n,m,p}=\left\{\begin{array}{ll}+1\quad \hfill&\,\,{\mbox{everywhere}}\,\\+w * 2 * J\quad \hfill&{{\rm{if}}}\,\,p={X}_{{c}_{n},m}\\+w * (\,J-0)\quad \hfill&\,\,{{\rm{if}}}\,\,p={X}_{{n}_{g0},m}\\+w * (\,J-1)\quad \hfill&\,\,{{\rm{if}}}\,\,p={X}_{{n}_{g1},m}\\ \ldots \quad \hfill \\+w\quad \hfill&\,\,\,\,\,\,\,\,\,{{\rm{if}}}\,\,p={X}_{{n}_{g(J-1)},m}.\end{array}\right.$$
(13)
By default, the composite method is used during Step 2 and the clone method is used during Step 3; however, the user may select between both methods during Step 2. Using the clone method during Step 2 should be seen as a ‘vanilla’ version of PERT which should be used when very few cell-specific CNAs are present. The clone method is used for Step 3 since the composite method would produce many self-matching cells. A comparison of the two methods can be seen when benchmarking PERT on simulated data (Supplementary Notes 1-2).Model initialization and hyperparametersSplitting cells into initial sets of high-confidence G1/2-phase and unknown cells is performed by thresholding heuristic per-cell features known to correlate with cell cycle phase. PERT uses clone-normalized number of input CN breakpoints between neighboring genomic bins (BKnorm) and clone-normalized median absolute deviation in read depth between neighboring genomic bins (MADNnorm). These features are referred to as ‘HMMcopy breakpoints’ and ‘MADN RPM’, respectively, in the main text and figures. Note that breakpoints between the start and end of adjacently numbered chromosomes are not counted.$${{\mbox{BK}}}_{n}=\mathop{\sum }_{m=0}^{M-1}\left\{\begin{array}{ll}1\quad &\,{{\rm{if}}}\,\,{X}_{n,m}\ne {X}_{n,m+1}\\ 0\quad &\,{\mbox{else}}\,\hfill\end{array}\right.$$
(14)
$${{\mbox{BKnorm}}}_{n}={{\mbox{BK}}}_{n}-\frac{1}{C}{\sum }_{c=0}^{C}{{\mbox{BK}}}_{c}$$
(15)
$${{\mbox{MADN}}}_{n}={{\rm{Med}}}\,\left({\sum }_{m=0}^{{\rm{M}}-1}{Z}_{n,m}-{Z}_{n,m+1}\right)$$
(16)
$${{\mbox{MADNnorm}}}_{n}={{\mbox{MADN}}}_{n}-\frac{1}{C}{\sum }_{c=0}^{C}{{\mbox{MADN}}}_{c}$$
(17)
Under default settings, PERT initializes cells with MADNnorm < 0 and BKnorm < 0 as high-confidence G1/2-phase with all other cells as unknown phase. Initial cell phases can also be input by users based on experimental measurements or alternative metrics such as 10X CellRanger-DNA’s ‘dimapd’ score (used in27,28,51), the Laks et al. classifiers’ S-phase probability and quality scores50, or read depth correlation with a reference RT profile55.To improve convergence to global optima, each cell’s time in S-phase (τn) is initialized using scRT results from a clone-aware adaptation of Dileep et al.25 which thresholds the clone-normalized read depth profiles into replicated and unreplicated bins. Each unknown cell n is assigned to clone c with the highest correlation between cell and clone pseudobulk read depth profiles (Eq (10)). The read depth of each cell is then normalized by the CN state with highest probability within the CN prior (ηn,m,p),$${y}_{n,m}=\frac{{Z}_{n,m}}{{{{\rm{argmax}}}}_{p}({\eta }_{n,m,p})}.$$
(18)
The clone-normalized read depth profiles ( yn) are then binarized into replication state profiles (Yn) using a per-cell threshold (tn ∈ [0, 1]) that minimizes the Manhattan distance between the real data and its binarized counterpart.$${t}_{n}={{\mbox{argmin}}}_{t}\Bigg| \, {y}_{n,m}-\left\{\begin{array}{ll}1\quad &\,{{\rm{if}}}\,\,{y}_{n,m}\ge {t}_{n}\\ 0\quad &\,{\mbox{else}}\,\hfill\end{array}\right.\Bigg|$$
(19)
$${Y}_{n,m}=\left\{\begin{array}{ll}1\quad &\,{{\rm{if}}}\,\,{y}_{n,m}\ge {t}_{n}\\ 0\quad &\,{\mbox{else}}\,\hfill\end{array}\right.$$
(20)
The fraction of replicated bins per cell from the deterministic replication states Yn,m are then used to initialize the parameter representing each cell’s time in S-phase (τn) within PERT’s probabilistic model.$${\tau }_{n}=\frac{1}{M}{\sum }_{m=0}^{M}{Y}_{n,m}.$$
(21)
Initialization of τn is particularly important because the model might mistake an early S-phase cell (<20% replicated) for a late S-phase cell (>80% replicated), or vice versa, as both have relatively ‘flat’ read depth profiles compared to mid-S-phase cells. Thus τn will rarely traverse mid-S-phase values during inference when its initial and true values lie far apart. Additional parameter initializations include λ = 0.5 for negative binomial overdispersion and βσ,k = 10−k for the standard deviation of each GC bias polynomial coefficient k. Unlike τn, the model is unlikely to get stuck at local minima with these parameters so they are initialized to the same values globally.The latent variables βμ, ρ, and α are sampled from prior distributions with fixed hyperparameters. The mean of all GC bias polynomial coefficients (βμ) are drawn from the prior N(0, 1). Each locus’s replication timing (ρ) is drawn from the prior Beta(1, 1) to create a uniform distribution on the domain [0, 1]. The replication stochasticity parameter (α) is drawn from the prior distribution Γ(shape = 2, rate = 0.2) which has a mean of \(\frac{{{\rm{shape}}}}{{{\rm{rate}}}}=10\) and penalizes extreme values on a positive real domain.PERT phase predictionsWe used the PERT model output to predict ‘G1/2’, ‘S’, and ‘low quality’ (LQ) phases for each cell. G1/2-phase cells were defined by having  <5% or  >95% replicated bins. Of the remaining cells with 5-95% replicated bins, those with high read depth autocorrelation (>0.5), replication state autocorrelation (>0.2), or fraction of homozygous deletions (X = 0, >0.05) were deemed to be low quality. All other cells were deemed to be in S-phase. Using 500kb bins, autocorrelation scores were the average of all autocorrelations ranging from 10 to 50 bin lag size. Thresholds used for splitting S and LQ phases can be adjusted by users should the default settings produce unexpected output.Model construction and inferencePERT is written using Pyro which is a probabilistic programming language written in Python and supported by PyTorch backend58. PERT uses Pyro’s implementation of Black Box Variational Inference (BBVI) which enables the use of biologically-informed priors instead of being limited to conjugate priors87. Specifically, we use the AutoDelta function which uses a Taylor approximation around the maximum a posteriori (MAP) to approximate the posterior. Optimization is performed using the Adam optimizer. By default, we set a learning rate of 0.05 and convergence is determined when the relative change in the evidence lower bound (ELBO) is  <10−6 or the maximum number of iterations (2000 for step 2, 1000 for steps 1 and 3) is reached.Simulated datasetsTo benchmark PERT’s ability to accurately infer single-cell replication states, somatic CN states, and cell cycle phase against Kronos and the Laks et al. cell cycle classifier, we simulated datasets with varying clonal structures and cell-specific CNA rates. Somatic CN states are simulated by first drawing clone CN profiles and then drawing cell-specific CNAs that deviated from said clone CN profile. All CNAs are drawn at the chromosome-arm level. 400 S- and 400 G1/2-phase cells are simulated in each dataset.Once CN states have been simulated, we simulate the read depth using PERT as a generative model. We condition the model on the provided βμ, βσ, λ, α, ρ, γ, and X parameters when generating cell read depth profiles. All read depth values (Z) are in units of reads per million. RepliSeq data for various ENCODE cell lines are used to set ρ values for each clone23. G1/2-phase cells were conditioned to have all bins as unreplicated Y = 0. S-phase cells had their cell cycle times τ sampled from a Uniform(0, 1) distribution. A table of all the parameters used in each simulated dataset can be found in Table 1.We called CN on simulated binned read count data using HMMcopy. Given that Kronos was designed as an end-to-end pipeline that takes in raw BAM files, we forked off the Kronos repository and edited their ‘Kronos RT’ module to accept binned read count and CN states as input. Cells were split into S- and G1/2-phase Kronos input populations according to their true phase. Code to our forked repository can be found at https://github.com/adamcweiner/Kronos_scRT. Similarly, we removed features from the Laks et al. cell cycle classifier that used alignment information such as the percentage of overalapping reads per cell. The Laks classifier was retrained and benchmarked with said features removed prior to deployment on simulated data (Fig. S13).Experimental methods and participant detailsCell culture and PDXsCell lines and PDX samples were generated in Laks et al.50, Funnell et al.61, and Salehi et al.32 studies. Cell line samples included (1) an immortalized normal human female breast epithelial cell line 184-hTERT L9, (2) four sets of 184-hTERT cell lines with perturbations in TP53−/− passaged over multiple time points, (3) five 184-hTERT cell lines with a variety of genetic perturbations in the repair pathway, including TP53−/−, BRCA1+/−, BRCA1−/−, BRCA2+/− and BRCA2−/−, (4) lymphoblastoid cell line GM18507, (5) HER2+ breast cancer cell line T47D, and (6) ovarian cancer cell line OV2295. PDX samples included 6 different models of TNBC, three of which had cisplatin-treated replicates, and 15 different models of HGSOC. Serial passaging of cell lines was done by seeding approximately 1 million cells each time and profiling with scWGS (DLP+) at 4–11 different passage points with a mean of 6,070 cells at each time point. Xenograft-bearing mice were sacrificed when the tumor size approached 1000 mm3 in volume, according to the limits of the experimental protocol. For the serially passaged PDXs, the harvested tumor was minced finely with scalpels and then mechanically disaggregated for one minute using a Stomacher 80 Biomaster (Seward) in 1 ml to 2 ml cold DMEM-F12 medium with glucose, L-glutamine and HEPES. Aliquots from the resulting suspension of cells and fragments were used for xenotransplants in the next generation of mice and cryopreserved for sequencing. Serially transplanted aliquots represented approximately 0.1– –0.3% of the original tumor volume from the previous mouse.MSK SPECTRUM patient dataWe obtained matched scRNA (10x Genomics 3’-end) and scWGS (DLP+) from HGSOC patient OV-081 from the MSK SPECTRUM cohort. Single cell suspensions from surgically excised tissues were generated and flow sorted on CD45 to separate the immune component. CD45 negative fractions were then sequenced using the DLP+ platform. scRNA-seq was performed on both CD45 positive and negative fractions, with the malignant cells used in this study being the CD45 negative fraction. Detailed descriptions of data generation can be found in the two SPECTRUM studies41,78.Acquisition of samples from patientsPatient samples from University of British Columbia32,66 were acquired with informed consent, according to procedures approved by the Ethics Committees at the University of British Columbia. Patients with breast cancer undergoing diagnostic biopsy or surgery were recruited and samples were collected under protocols H06-00289 (BCCA-TTR-BREAST), H11-01887 (Neoadjuvant Xenograft Study), H18-01113 (Large-scale genomic analysis of human tumors) or H20-00170 (Linking clonal genomes to tumor evolution and therapeutics). HGSOC samples were obtained from women undergoing debulking surgery under protocols H18-01652 and H18-01090. Patient samples from Memorial Sloan Kettering Cancer Center41,61,78 were obtained following Institutional Review Board (IRB) approval and patient informed consent (protocols 15–200 and 06-107 for HGSOC; 18–376 for TNBC). HGSOC and TNBC clinical assignments were performed according to American Society of Clinical Oncology guidelines for ER, PR and HER2 positivity.Experimental cell cycle sortingGM18507 and T47D cell lines were experimentally sorted into G1, S, G2, and dead cells prior to scWGS (DLP+) sequencing50. 2 million cells fresh from culture suspended in 1 mL PBS were stained with Hoechst 33342 (Invitrogen), caspase 3/7 (Essen Biosciences), and propidium iodide (PI, Sigma Aldrich) for flow sorting separation of different cell phases. Flow sorting was carried out at the Terry Fox Laboratory, (BC Cancer Research Centre) using a BD FACSAria III cell sorter equipped with 375 nm, 405 nm, 488 nm, 561 nm and 640 nm laser. We excluded apoptotic cells on the live cell gate by gating out Caspase 3/7 high cells. On this live non-apoptotic cell gate, we gated for the cell cycle phases using DNA content of the cells measured by Hoechst 33342 staining to sort the G1, S, and G2 phases of the cell cycle individually. Cells from different cell cycle fractions were stained and dispensed into DLP+ chips for sequencing.scWGS data processingUnless otherwise noted, all scWGS data was generated via DLP+. All DLP+ data was passed through https://github.com/shahcompbio/single_cell_pipelineusing Isabl before downstream analysis88. This pipeline aligned reads to the hg19 reference genome using BWA-MEM. Each cell was then passed through HMMcopy using default arguments for single-cell sequencing. HMMcopy’s output provided read count and gc-corrected integer CN states for each 500kb genomic bin across all cells and loci. Loci with low mappability (<0.95) and cells with low read count (<500,000 reads) were removed. Cells were also filtered for contamination using the FastQ Screen which tags reads as matching human, mouse, or salmon reference genomes. If  >5% of reads in a cell are tagged as non-human the cell is flagged as contaminated and subsequently removed.Cells were only passed into phylogenetic trees if they were called as G1/2-phase and high quality by classifiers described in Laks et al.50. In certain cases, cells might be manually excluded from the phylogenetic tree if they pass the cell cycle and quality filters but have an abnormally high number of HMMcopy breakpoints. All cells included in the phylogenetic tree are initialized in PERT as the set of high-confidence G1/2 cells; all cells outside the tree are initialized as unknown cells.Phylogenetic clustering based on CN profilesWe used the clone IDs from Funnell et al. for high-confidence G1/2 cells61. These single-cell phylogenetic trees were generated using sitka85. Sitka uses CN breakpoints (also referred to as changepoints) across the genome as binary input characters to construct the evolutionary relationships between cells. Sitka was run for 3,000 chains and a consensus tree was computed for downstream analysis. The consensus tree was then cut at an optimized height to assign all cells into clones (clusters). For datasets with no sitka trees provided or select datasets, cells were clustered into clones using K-means where the number of clones was selected through Akaike information criterion. We performed a K-means reclustering for the Salehi et al. TNBC PDX data32 as sitka produced small clusters which inhibited robust tracking of S-phase clone fractions across multiple timepoints.Pseudobulk profilesMany times in the text we describe pseudobulk replication timing, copy number, or read depth profiles within a subset of interest (i.e. cells belonging to the same clone or sample). To compute pseudobulk profiles, we group all the cells of interest and then take the median values for all loci in the genome. When computing pseudobulk CN profiles, we only include the cells of the modal (most common) ploidy state before computing median values for all loci.S-phase timesWhen we refer to the time of individual S-phase cells, such a time is calculated as the fraction of replicated bins per cell. Thus, S-phase times near 1 are late S-phase cells and S-phase times near 0 are early S-phase cells.Comparison of PERT RT profiles to predicted RT profilesPredicted RT (predRT) profiles for ENCODE primary cell samples were obtained from Salvadores et al.18. The authors used the Replicon software with default settings59 to predict RT from DNase-seq chromatin accessibility data of each ENCODE sample. Human reference hg19 coordinates were used to create RT profiles with 1 MB bin size for this comparison.Ploidy-normalized CNWe sometimes opt to show ploidy-normalized CN profiles instead of their integer CN values. The ploidy pn of a given CN profile n is defined as the mode of all integer CN states Xn,m. We then compute a ploidy-normalized CN value χn,m as follows:$${\chi }_{n,m}=\frac{{X}_{n,m}-{p}_{n}}{{p}_{n}}.$$
(22)
Note that n can be a profile corresponding to a cell, clone, or sample.Identification of RT shiftsAll RT profiles were centered to have a mean of 0 and standard deviation of 1 across the entire genome prior to comparison to one another. This was essential as certain clones might have a skewed distribution of early vs late S-phase cells and thus their RT profiles would have later or earlier replication across all positions in the genome.Bespoke factor model which learns feature importance and RT profiles directly from clone RT profilesWe built a multivariate regression model which learned importance terms and RT profiles for each feature directly from the matrix of clone RT profiles. The model has the following terms and equations:RTc,m: the observed replication timing of clone c at locus m on the domain of [0,1]. This represents the fraction of replicated bins at locus m across all S-phase cells n in clone c.ρk,m: the latent replication timing of feature k at locus m.Ic,k: indicator mask representing which features k are present for clone c.βk: importance term for feature k.σ: standard deviation term when going from expected to observed replication timing; sampled from a uniform distribution on the domain (0,1).$${{\mbox{RT}}}_{c,m} \sim N(\frac{1}{1+{e}^{\mathop{\sum }_{k=0}^{K}({\beta }_{k} * {I}_{c,k} * {\rho }_{k,m})}},\sigma ).$$
(23)
All latent replication timing terms ρk are normalized to have mean of 0 and variance of 1 and there is only β value per class of features$${\beta }_{k}=\left\{\begin{array}{ll}{\beta }_{t}\quad &\,\,\,\,\, {{\rm{if}}}\,{{\rm{k}}}\,{{\rm{is}}} \, {{\rm{a}}} \, {{\rm{cell}}} \, {{\rm{type}}} \, {{\rm{feature}}}\,\\ {\beta }_{s}\quad &\,\,\,\,\,\,\,\,{{\rm{if}}}\,k\,{{\rm{is}}} \, {{\rm{a}}} \, {{\rm{signature}}} \, {{\rm{feature}}}\,\\ {\beta }_{p}\quad &{{\rm{if}}}\,{{\rm{k}}}\,{{\rm{is}}}\, {{\rm{a}}}\, {{\rm{ploidy}}} \, {{\rm{feature}}}\,\\ {\beta }_{d}\quad &\,\,\,{{\rm{if}}}\,k\,{{\rm{is}}} \, {{\rm{a}}} \, {{\rm{sample}}} \, {{\rm{feature}}}\,\\ {\beta }_{g}\quad &{{\rm{if}}}\,k\,{{\rm{is}}} \, {{\rm{a}}} \, {{\rm{global}}} \, {{\rm{feature}}}\,\end{array}\right.$$
(24)
This model is implemented in pyro and fit using BBVI58,87. We use the AutoNormal function which uses Normal distributions to approximate the posterior. Optimization is performed using the Adam optimizer with a learning rate of 0.02. Convergence is determined when the relative change in ELBO is  <10−3 of the total ELBO change between first and current iteration.Using SIGNALS to quantify allelic ratios from scDNA- and scRNA-seqIn brief, SIGNALS uses haplotype blocks genotyped in single cells and implements an hidden Markov model (HMM) based on a Beta-Binomial likelihood to infer the most probable haplotype-specific state. SHAPEIT was used to generate the haplotype blocks for SIGNALS input89. A full description of SIGNALS can be found in Funnell et al.61. Within each haplotype block for each sample, the major (most common) allele is labeled as the A-allele with the minor (less common) allele labeled as the B-allele. The B-allele frequency (BAF) is computed as the fraction of B-allele heterozygouos single nucleotide polymorphisms (SNPs) out of all heterozygous SNPs present in a given bin. SIGNALS is run on scDNA data by default but when scRNA data is also available, the haplotype blocks derived from the scDNA data can be used to extract A- and B-allele counts in the scRNA data too (albeit with much fewer counts as there are fewer SNPs sequenced in scRNA data). SIGNALS output for all scDNA samples is shown in Supplementary Data 1.Gastric cancer cell line data10X Chromium single-cell DNA (10X scWGS) data of gastric cancer cell lines NCI-N87, HGC-27, and SNU-668 were downloaded from SRA (PRJNA498809). CN calling was performed using the CellRanger-DNA pipeline using default parameters. Data was aggregated from 20kb to 500kb bins for analysis with PERT. Each cell line’s doubling time and fraction of G1-phase scRNA cells were extracted from Andor et al.51. PERT output for these sames is shown in Supplementary Data 2.Clone S-phase enrichment scoresTo test whether a clone (c) is significantly enriched or depleted for S-phase cells at a given timepoint (t), we must compare that clone’s fraction in both S- and G1/2-phases. We first define the following variables as such:Ns,c,t: Number of S-phase cells belonging to clone c at time tNg,c,t: Number of G1/2-phase cells belonging to clone c at time tNs,t: Total number of S-phase cells across all clones at time tNg,t: Total number of G1/2-phase cells across all clones at time tNt: Total number of cells in a population at time t (all clones, all phases)We can then define the fractions of S- and G1/2-phase cells assigned to clone c at time t (fs,c,t, fg,c,t):$${f}_{s,c,t}=\frac{{N}_{s,c,t}}{{N}_{s,t}},$$
(25)
$${f}_{g,c,t}=\frac{{N}_{g,c,t}}{{N}_{g,t}}.$$
(26)
Using the fraction of G1/2-cells belonging to clone c, we can compute the expected total number of cells in clone c and time t across all cell cycle phases,$$E({N}_{c,t})={f}_{g,c,t} * {N}_{t}.$$
(27)
We produce a p-value for enrichment of S-phase cells using a hypergeometric test scipy.stats.hypergeom(M = Nt, n = Ns,t, N = E(Nc,t)).sf(Ns,c,t). To produce a p-value for S-phase depletion we subtract this enrichment p-value from 1. All p-values are Bonferroni-corrected by dividing by the total number of statistical tests. p-adjusted thresholds of 10−2 are used for saying a clone is significantly enriched or depleted for S-phase cells within a given library. The enriched (p+,c,t) and depleted (p−,c,t) p-values from this hypergeometric test are then compared to one another to produce a single continuous SPE score (ξc,t). Positive values indicate enrichment for S-phase cells and negative values indicate depletion of S-phase cells,$${\xi }_{c,t}= {\log }_{10}({p}_{-,c,t})-{\log }_{10}({p}_{+,c,t}).$$
(28)
Clone expansion scoresFor time-series scWGS experiments, we computed clone expansion scores for each clone c at time t (Sc,t) by examining the fraction of G1/2-phase cells belonging to clone c at timepoint t (fg,c,t) and the subsequent timepoint (fg,c,t+1). Positive values indicate the clone expands by the next timepoint,$${S}_{c,t}={f}_{g,c,t+1}-{f}_{g,c,t}.$$
(29)
We use the same equation when computing the overall fitness of clones across multiple timepoints but instead treat the first timepoint as t and the final timepoint as t + 1.Comparing SPE to expansion in treated vs untreated dataTo test that treated clones had a significant difference in their relationship between SPE scores (ξc,t) and expansion scores (Sc,t) in treated (T) vs untreated (U) data, we first fit a linear regression curve to the untreated data,$$S_{c,t}^{U} \sim f({\xi }_{c,t}^{U},\hat{\beta}^{U})=\hat{\beta_{0}^{U}}+\hat{\beta_{1}^{U}} * \xi_{c,t}^{U}.$$
(30)
We then computed the residuals between the treated data and this line of best fit,$${S}_{c,t}^{U-T}=(\hat{{\beta }_{0}^{U}}+\hat{{\beta }_{1}^{U}} * {\xi }_{c,t}^{T})-{S}_{c,t}^{T}.$$
(31)
We then computed a second linear regression curve to the residuals \({S}_{c,t}^{U-T} \sim f({\xi }_{c,t}^{T},\hat{\beta}^{U-T})\) and computed the p-value for a hypothesis test whose null hypothesis is that the slope is zero, using Wald Test with t-distribution of the test statistic. Having a p < 0.05 indicated that the slope of the treated and untreated lines are significantly different. All clone × timepoint combinations with <10 G1/2-phase cells were excluded from such analysis. For each point, both t and t + 1 are either untreated or on-treatment, thereby excluding points where t is untreated and t + 1 is treated.Cell cycle analysis of scRNA dataWhen available, we validated PERT cell cycle distributions using the cell cycle distributions estimated through scRNA sequencing. We determined the cell cycle phase of each scRNA cell using the Seurat CellCycleScoring() function73 which uses a set of S- and G2M-phase markers derived from Tirosh et al.74.Statistical testsWhen boxplots are presented in the figures, hinges represent the 25% and 75% quantiles and whiskers represent the ±1.5x interquartile range. Statistical significance is tested using two-sided independent t-tests from scipy.stats unless otherwise noted. Bonferroni correction is implemented for all statistical tests to limit false discovery. The number of stars is a shorthand for the adjusted p-value of a given statistical test (<10−4: ****,  <10−3: ***,  <10−2: **,  <0.05: *, ≥0.05: ns). Shaded areas surrounding linear regression lines of best fit represent 95% confidence intervals obtained via boostrapping (n=1000 boostrap resamples). Unless otherwise noted, linear regressions are annotated with Pearson correlation coefficients (r) and the p-value for a hypothesis test whose null hypothesis is that the slope is zero, using the two-sided Wald Test with t-distribution of the test statistic.Reporting summaryFurther information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Hot Topics

Related Articles