Multi-omic lineage tracing predicts the transcriptional, epigenetic and genetic determinants of cancer evolution

MiceAll animal studies were conducted with the approval of the Italian Ministry of Health and in compliance with the Italian law (D.lgs. 26/2014), which enforces Dir. 2010/63/EU (Directive 2010/63/EU of the European Parliament and of the Council of 22 September 2010 on the protection of animals used for scientific purposes) and EU 86/609 directive. Proper permit and consent were granted (Protocol No. 779/2020) by the institutional organism for ethics and animal welfare on experimental procedures (OPBA, Cogentech). Animals were bred and maintained under pathogen-free conditions in a controlled environment (18–23 °C, 40–60% humidity and with 12-h dark/12-h light cycles) at the certified Cogentech mouse facility located at IEO/IFOM campus (The FIRC Institute of Molecular Oncology, Milan, Italy).In vivo xenograftImmunodeficient NOD.Cg-PrkdcscidIL2rgtm1Wjl/SzJ (also known as NOD/SCID/IL2Rγc−/−) mice were anesthetised by intraperitoneal injections of 1.25% solution of tribromoethanol (0.02 ml/g of body weight). Barcode-bearing SUM159PT cells were resuspended in a mix of 14 μl PBS and 6 μl Matrigel and implanted in the fourth inguinal mammary gland of 10-week-old animals. Mice were monitored twice a week by an investigator. For the chemotherapy treatment studies, tumours were allowed to grow to palpable lesions (~20–30 mm3), then mice were randomised into groups and each group was treated intraperitoneally with either paclitaxel (PTX) (10 mg/kg in PBS) or vehicle (PBS) every 5 days for a total of three injections. Mice were euthanised according to our experimental protocol and institutional guidelines when tumours euqalled 1.2 cm in their largest diameter. Maximal tumour burden was not exceeded. Tumour growth dynamics were monitored every 3 days by calipers measurements. For in vivo limiting dilution assay transplantation experiments, decreasing dilutions (1:10,000; 1:1000; 1:100) of SUM159PT were resuspended in a mix of 14 μl PBS and 6 μl Matrigel and transplanted in the fourth inguinal mammary gland of 10-week-old animals. Animals were monitored as before and euthanized after 1.5–3 months (depending on tumour latency).Tissue harvest and processingThe primary tumours were removed when the tumour reached an approximate diameter of 1.2 cm. The animals were anesthetised with tribromoethanol, and the tumours were resected. The solid tissue was rinsed with PBS, minced with scalpels, followed by mechanical dissociation using gentleMACS (Miltenyi) in a digestion mix (collagenase, hyaluronidase, 5 μg ml−1 insulin, Hepes, hydrocortisone). The cell suspension was incubated for 25′ at 37 °C followed by an additional step of gentleMACS dissociation. Following a wash with base medium the cells were consecutively passed through 100, 70 and 40 μm filters. Primary tumour cells were treated with ACK lysis buffer (Lonza) followed by resuspension in 1% BSA/PBS and processed using the Mouse and Dead Cell depletion kits according to the manufacturer’s directions (Miltenyi).Cell culturesWe maintained HEK293T and MDA-MB-231 TGL cells in Dulbecco’s Modified Eagle’s Medium (DMEM) with 10% of TET-FREE foetal bovine serum (FBS) and 1% penicillin–streptomycin. Cells were grown in a humidified atmosphere at 5% CO2 at 37 °C. SUM159PT cell line and derivatives (Asterand) were cultured in Ham’s F12 medium with 5% TET-FREE FBS, human insulin (5 μg/ml), hydrocortisone (1 μg/ml), and Hepes (10 mM). Barcoded SUM159PT and CRISPRi cell line medium was supplemented with 2 μg/ml puromycin for selection. Cells were grown in a humidified atmosphere at 10% CO2 at 37 °C.Perturb_SUM159PT cell line generationPerturb-seq GBC library18,75 was a gift from Jonathan Weissman (Addgene ID #85968). The library contains a random 18-nt guide barcode (GBC) close to the polyadenylation signal of the blue fluorescent protein (BFP). The estimated complexity of the library is >5 million unique GBCs. We amplified the Perturb-Seq GBC library in Escherichia coli (E. coli) ElectroMAXTM DH5α-ETM electro-competent cells (Thermo Fisher Scientific), as indicated by the authors75. DNA extraction was performed with NucleoBond Xtra Maxi (Macherey-Nagel) kit according to the manufacturer’s instructions. Viruses were produced in HEK293T at 80% of confluency with the following transfection mix: 20 μg Transfer Vector, 13 μg of psPAX2 (gag&pol) (Addgene #12260), 7 μg of pMD2G (envelope) (Addgene #12259), 94 μL of RT CaCl2 and water up to 750 μL. Then, 750 μL of 2xHBS were added dropwise and 500 μL/10 cm plate of transfection mix was added to the cells. The medium was changed after 6 h and viruses harvested after 48 h, filtered (0.22 μm) and ultracentrifuged for 2 h at 50,000 × g (rotor SW32 Ti, Beckman Coulter), at 4 °C. The viral pellet was resuspended in 300 μL of PBS 1X and stored at −80 °C. To produce the Perturb SUM159PT cell line for lineage tracing experiments, 75,000 cells were seeded and infected with an estimated MOI of 0.1 in the presence of 8 μg/ml of polybrene (Sigma-Aldrich). After selection, transduction efficiency was measured by FACS analysis, which revealed that 8.6% of cells were successfully infected.SUM159PT_KRAB cell line generation and CRISPRi experimentsFor the generation of the PB-TRE-dCas9-KRAB plasmid, the DNA sequence of KRAB repressor domain was amplified by PCR from the pHAGE TRE dCas9-KRAB (Addgene plasmid #50917) and cloned in frame into the PB-TRE-dCas9-VPR backbone (Addgene plasmid #63800) within the AscI/AgeI sites. The cloning was sequence-verified by Sanger sequencing. SUM159PT cells were transfected in MW6 plates, following Lipo3000 transfection protocol (ThermoFisher Scientific) with 500 ng of transposon DNA (PB-TRE-dCas9-KRAB) and 200 ng of SuperPiggyBac transposase helper plasmid (SystemsBioscience). After at least 72 h from transfection, cells were selected with 200 μg/ml Hygromycin B. The PB-TRE-dCas9-KRAB SUM159PT cell line is referred in the text as SUM159PT_KRAB. We expressed sgRNAs upon cloning into lentiGuide-Puro sgRNA backbone (Addgene #52963) within BsmBI (Esp3I) sites. Lentiviruses were generated in six-well plates, following the Lipofectamine 3000 (Thermo Fisher) protocol. The transfection was performed by mixing the construct of interest, psPAX2 (gag&pol)(Addgene #12260) and pCMV-VSV-G (envelope) (Addgene #8454) plasmids at a ratio of 4:3:1. Viruses were collected after 24 h, filtered and frozen. We generated stable cell lines expressing single sgRNAs by lentiviral infection of 150,000/well SUM159PT_KRAB cells. Lentiviral supernatants (1:3 dilution) were added to cells, supplemented with 1 μg/ml of polybrene. After 24 h, cells were selected with 2 μg/ml puromycin (Gibco). For the CRISPRi experiment, we plated sgRNA-expressing stable cell lines with 100 ng/ml of doxycycline and harvested cells after 3 days.Paclitaxel treatmentA stock aliquot of paclitaxel (obtained from the IEO hospital) was prepared (70 μM) in PBS and used to treat cells. The dose of paclitaxel used for in vitro experiments was established by a dose–response curve of SUM159PT treated for 72 h. IC50 concentrations were estimated by parallel fit estimation (JMP software, n = 4). For short-term, single-treatment experiments, paclitaxel was used at the final concentration of 50 nM (~IC95). After 3 days of treatment, the medium was changed every 3 days without adding the drug. For the analysis of the susceptibility of pre-treated cells (shown in Fig. 6d), SUM159PT cells were treated with paclitaxel (50 nM) and surviving clones were allowed to recover for 21 days. Subsequently, cells were infected with the GBC library, sorted by BFP expression and subjected to a second round of treatment (50 nM). For the generation of the drug resistant model (shown in Fig. 6e), SUM159PT cells were treated multiple times with increasing doses of paclitaxel (10, 20, 50, and 100 nM). Drug-adapted cells were able to survive even when treated with 100 nM paclitaxel and were collected 90 days after treatment for WES analysis.TM4SF1high flow cytometrySUM159PT_KRAB cells were stained with anti-TM4SF1-APC (clone: REA851 Miltenyi Biotec) antibody for 10′ at 4°C, in the dark and sorted using Fusion Aria Sorter. The bulk population was FAC-sorted using FSC/SSC gate, while for the TM4SF1high subpopulation different gating strategies were tested (top 5% or top 10% APC fluorescence intensity as shown in Supplementary Fig. 5d); top 5% showed the best enrichment for S1 markers (shown in Supplementary Fig. 6b) and was used in subsequent experiments. After passive propagation in vitro, 9 days after sorting, both populations were transplanted into NOD/SCID/IL2Rγc, mice and tumours were collected as described above.RNA sequencingRNA was extracted from mouse-depleted and dead-cell-depleted samples. The bulk and TM4SF1high sorted cells were either immediately processed by RNA-seq or passively propagated in vitro for 43 days (corresponding to passage 19). Cells after 9 and 43 days from sorting were also collected and finally processed by RNA-seq. RNA was harvested using Maxwell RSC miRNA Tissue Kit according to the manufacturer’s protocol. Libraries for RNA-seq were prepared from 1 μg of total RNA using Truseq Total RNA Library Prep Kit (Illumina) following the manufacturer’s protocols. Samples were sequenced paired-end 50 bp on an Illumina Novaseq6000 instrument.Whole-exome sequencingSequencing was performed with the Agilent SureSelect All Exon v5 (experiments shown in Fig. 6c, d) or v7 (experiments shown in Fig. 6a, b), as per manufacturers’ instructions. Libraries were sequenced with coverage >200X on a Novaseq 6000, with a PE 100 reads mode.Multiseq—single-cell time-course for assaying drug toleranceThe single-cell time-course experiment for drug tolerance (shown in Fig. 7) was designed in order to process all time points at the same time. To do so, we performed an en-reverse experiment (i.e., starting from the last time point, d15). The same batch of cells was used for the entire experiment. Cells were passaged every 2 days and seeded either for passive culture or paclitaxel treatment (5 × 106 cells in 150 mm dishes). After 3 days of treatment (50 nM as described above), the medium was changed every 3 days without adding the drug. At day 15, all paclitaxel-treated time points were collected together, counted and processed with the Multiseq protocol76. Briefly, 500,000 cells for each sample (d5, d7, d9, d11, d13, d15) were resuspended in 180 μL of PBS and then labelled with 20 μL of a reaction mix composed of 2 μM of a sample-specific Barcode, 2 μM of LMO-Anchor (a kind gift of the Gartner Lab) and PBS. After 5′ of incubation in ice, we added 20 μl of reaction mix composed of 2 μM of co-anchor in PBS. After 5′ of incubation we stopped the reaction adding PBS/BSA 1%. We centrifuged and washed twice the cells before resuspending each sample in 125 μl PBS/BSA and pooling. After filtering, 25,000 cells were loaded on the Chromium Controller. Multi-seq library was isolated from the amplified cDNA and sequenced at 5000 barcode reads/cell depth.RT-qPCRTotal RNA was extracted using Maxwell RSC miRNA Tissue Kit according to the manufacturer’s protocol. One microgram of total RNAs was reverse-transcribed using ImProm-IITM Reverse Transcription System (Promega) and genes were analysed with Quantifast SYBR green master mix (Qiagen). RPLP0 was used as a housekeeping gene. The complete list of primers used in this study is reported in Supplementary Data 18.GBC library preparation from gDNAThe genomic DNA was extracted from 1 to 3 million cells (typically 3 million, coverage 300×) using the NucleoSpin Tissue kit (Macherey-Nagel). To enrich for GBCs, six parallel PCR reactions were performed in a final volume of 50 μl using 200 ng of genomic DNA, 0.2 μM of dNTPs mix, 0.5 μM of the following primers: F1: GGGTTTAAACGGGCCCTCTA and R4: GCCTGGAAGGCAGAAACGAC and amplified using PhusionTM High-Fidelity DNA Polymerase at the final concentration of 0.02 U/µl (Thermo-Scientific) (coverage 50×–500×), in its 5X Phusion HF buffer according to the following PCR protocol: (1) 98° for 30 s, (2) 98° for 7 s, then 60° for 25 s and 72° for 15 s (for 30 cycles), (3) 72° for 5 min.At the end of the PCR, reactions were pooled and purified using QIaquick PCR purification kit (QIAGEN). Then, the eluted DNA samples were run on a 2% agarose gel and the 280 bp band was purified using the QIAquick Gel extraction kit (QIAGEN). Illumina libraries were generated from 10 ng of DNA, which was blunt-ended, phosphorylated, and tailed with a single 3′ A. An adapter with a single-base “T” overhang was added and the ligation products were purified and amplified to enrich for fragments that have adapters on both ends.Libraries with distinct adapter indexes were multiplexed and sequenced (50 bp paired-end mode) on a Novaseq 6000 sequencer.Single-cell library and GBC sublibrary preparation (scRNA or snRNA assays)Single-cell suspensions (500–1000 cells/μl) were mixed with reverse transcription mix using the 10x Genomics Chromium Single-cell 3′ reagent kit protocol V2 (T0 MDA-MB-231 TGL and paclitaxel time-course exp2) or V3.1 (T1, paclitaxel time-course exp1) and loaded onto 10x Genomics Single-Cell 3′ Chips (www.10xgenomics.com). Libraries were generated as per manufacturers’ instructions and sequenced on Illumina Novaseq 6000 Sequencing System (with a single- or dual-indexing format according to the manufacturer’s protocol V2 or V3.1). We aimed at a coverage of 50 K reads/cell in each sequencing run. Multiome experiments were performed with the Chromium Single-Cell Multiome ATAC + Gene Expression Reagent Kits (V1). Nuclei suspensions (2000 nuclei/μL) were transposed and loaded onto Chromium Next GEM Chip J Single-Cell. Libraries were generated as per manufacturers’ instructions and sequenced on Illumina NOVAseq 6000, aiming at 50 K RNA and 50 K ATAC reads/cell. To enrich for GBC reads, in a final volume of 50 μL, we amplified by PCR the Perturb-library cassette from 5 ng of the amplified cDNA (as in ref. 18) using 0.3 μM of dual-indices primers (forward: 5′-AATGATACGGCGACCACCGAGATCTACACCTCCAAGTTCACACTC TTTCCCTACACGACGCTCTTCCGATCT-3′; reverse: 5′-CAAGCAGAAGACGGCATACGAG ATCGAAGTATACGTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTTAGCAAACTGGGGCACAAGC-3′) and amplified using Q5 2X master mix (M0541S, NEB) according to the following protocol: (1) 98° for 10 s, (2) 98° for 2 s, then 65° for 5 s and 72° for 10 s (25 cycles), (3) 72° for 1 min. The fragment band of the expected length (350–425 bp) was purified using EGEL 2% Power Snap Electrophoresis System (Thermo-Scientific) and checked at bioanalyzer before sequencing.sgRNAs listCRISPRi sgRNAs sequences targeting the two putative enhancers of COL6A1 and COL6A2 were designed using the web tool CRISPick (Broad Institute). The design region was defined by merging the ATAC module regions with the overlapping H3K27Ac signal from Encode (as shown in Fig. 5b) and exploiting the CRISPRi Range format for unstructured targeting provided by CRISPick. Inputs were: Enh_1 NC_000021.9:+:46048857-46050195 and Enh2 NC_000021.9:+:46052113-46053945. Selected sgRNAs were named according to the position relative to the start of the design window. We selected sgRNAs to cover the entire design window, choosing the sequences with higher on-target activity and filtering out those with potential off-target complementarity. As a negative control, we included three scrambled-sequence-sgRNAs for the CRISPRi experiments selected from the previous genome-wide CRISPRi screening library designed by the Weissman lab77. The sgRNAs chosen to target the promoters of COL6A1 and COL6A2 were also chosen from the same study. The list of sgRNAs sequence is reported in Supplementary Data 18.Genetic barcode analysisGenetic barcode callingA GBC library is built for each sample separately, starting from the FASTQ files, in two steps. First, the 18nt-long sequences located in the GBC locus are extracted using seqkit amplicon command from seqkit v2.1.078, with either 23–40nt- (DNA reads) or 12–29nt-long (cDNA reads) flanking regions and allowing one mismatch. A set S of sequences of length 18 is obtained and each s \(\in\) S is assigned a weight w(s), corresponding to its frequency in the FASTQ file. Note that, since the relative GBC abundance in a sample is unknown, not accounting for sequencing errors can result in an inflated estimate of sample clonality. Only sequencing errors in the form of mismatches are considered. The underlying assumption is that, if s,s′ ∈ S are sequenced from the same GBC species and carry d and d′ mismatches, respectively, and that d < d′, then w(s) ≥ w(s′). Thus, each GBC species i is associated with a subset Si ⊆ S, where the true GBC sequence is the s ∈ Si with maximal weight. The frequency of i is the cumulative weight of the sequences in Si. We infer the set of “true” GBC sequences and simultaneously correct for sequencing errors with the following procedure. Let G = (V, E, w) be an undirected graph, where V = S, E is the set of edges connecting sequences at Hamming distance ≤D, and w(s) is the frequency of s. We iteratively detect a collection of disjoint subgraphs Gi = (Si, Ei) induced by Si, called stars, where a node s ∈ Si is the hub and all the other nodes are neighbours of s. Stars are defined according to the following conditions: (i) the hub s has maximum weight w(s) in Si, (ii) the cumulative weight across stars is maximal, and (iii) the fraction of neighbours of s in G that do not belong to Si is < f, where 0 ≤ f < 1 is a fixed parameter. We compute a heuristic solution with a greedy approach. First, nodes are ordered by non-increasing weight. At each iteration, a new star is created, whose hub is the first node in the list and the other nodes are its neighbours, and the included nodes are removed from the list. The procedure ends when the first star that violates (iii) is found. We set D = 1 and f = 0.2. The final set of true GBC sequences is defined as the collection of detected hubs and their frequency is the cumulative weight across stars. We approximate the clone content of bulk DNA-Seq sequencing samples as the set of GBCs and their associated frequencies.Definition of tumour-initiating clones and drug-tolerant clonesIn bulk DNA-seq samples, we approximate clone content with GBC content and clone abundance with count per million reads (CPM). Clones whose frequency significantly differs between conditions are determined as follows. For each clone, we test if the CPM in the condition sample (treated or untreated) is significantly greater compared to the average across control samples (parental), using a Fisher’s exact test. The universe is defined as the union of all clones across control and condition samples. P values are adjusted using the Bonferroni correction method. Clones with adjusted p value < 0.05 in at least \(\lceil \frac{1}{2}n\rceil+1\) condition samples are labelled TICs (if condition is “untreated tumour”) or DTCs (if condition is “treated sample” or “treated tumour”), where n is the number of condition samples.Single-cell RNA-seq data analysis (SUM159PT)Cell barcode callingA GBC reference is defined as the union of the GBC species obtained from cDNA reads, as described in the section “Genetic barcode calling”, across all samples in the same experiment. Read alignment, UMI counting, cell-containing barcode (CB) calling, and GBC counting are performed using cellranger count from CellRanger v6.0 on the human reference genome GRCh38 v2020-A and the GBC reference. Parameter –expect-cells is set to 7500 (T0), 20,000 (T1 and exp1), and 3000 (exp2). Feature-CB count matrices are obtained, where features denote either genes or GBC species, and entries are UMI counts. Cellular barcodes with <5000 (T1 and exp1) or 10,000 (T0, exp2) gene UMIs are filtered out. Data post-processing is done with R79.MULTI-seq sample demultiplexingSamples belonging to T1 and exp1 are demultiplexed using MULTI-seq barcodes (MBCs), as follows. MBC-containing reads are aligned to the reference barcode sequences using the R package deMULTIplex v1.0.276. MBC-CB count matrices are obtained. Each MBC univocally identifies a sample, and a sample can be labelled with multiple MBCs. The automatic quantile-based thresholding procedure implemented in deMULTIplex fails to assign a unique MBC label to most of cellular barcodes in exp1, thus a custom demultiplexing procedure is used for both experiments. First, all MBCs with UMI count <2 are removed from every CB. For every cellular barcode i, an MBC j is marked as detected if cj ≥ 15, cj/cmax ≥ 0.5, and p < 1e − 5, where cj is the UMI count of j in i, cmax is the UMI count of the top abundant MBC in i, and p is the probability of observing a value equal or greater than cj given a Poisson distribution with λ = average UMI count of j across cellular barcodes in the sample. Only cellular barcodes with exactly one detected MBC are retained and assigned to the corresponding sample.Clone detectionExpressed GBCs in CBs are identified using the same procedure applied to MBCs and described in the section “MULTI-seq sample demultiplexing”, using cj ≥ 10 and cj/cmax ≥ 0.3. A p value threshold is set only for samples d11–13–15 for exp1 (p < 1e − 5) and sample d17 for exp2 (p < 1e − 10). Note that GBC frequency at earlier time points is very low, hence the p value criterion has no effect. CBs can be assigned 0, 1, or >1 GBCs, the latter being an effect of multiple cell encapsulation (doublets) or multiple GBCs infecting the same cell (co-infection). We extract clone information from multi-GBC CBs by distinguishing co-infection and doublet events. High UMI doublets are removed using a sample-specific cutoff. Assuming that GBC expression is approximately constant across GBC species and across cells, differences in GBC UMI count are essentially due to droplet-specific mRNA capture. Moreover, the transcriptome size of a cell line model should be constant as well. We deduce that a single cell infected with k GBCs should display a higher GBC UMI fraction compared to that of multiple cells encapsulated in the same droplet that jointly account for k infection events. For simplicity, GBC species for multiple infection events within the same droplet are assumed to be pairwise distinct. The procedure works as follows. First, each CB set with k expressed GBCs is sorted by non-increasing values of ci/ui, where ci and ui are the GBC UMI count and the gene UMI count for CB i, respectively. We obtain a list of sets S2, …, Sm, where m is the maximum number of distinct GBC species expressed in a CB. Then, we iteratively classify the CBs of each set Sk separately, starting from the smallest k. A CB i is labelled as co-infection in two cases: (a) all GBCs expressed by i are also expressed in at least five CBs, including i, or (b) the “doublet probability” of i (i.e., the cumulative sample frequency of each single GBCs expressed in i) is below a sample-defined threshold. If neither (a) nor (b) hold, i is labelled as doublet. The procedure continues until the fraction of co-infection events in multi-GBC CBs is ≥1-D, where D is the expected doublet fraction in multi-GBC CBs. D is a sample-specific doublet rate estimate based on 10× guidelines and the number of called CBs. The clone pool of a single-cell sample is defined as the collection of single GBCs that occur in single infection and doublet events with k = 2 GBCs, plus all multi-GBC sets from co-infection events.Clone comparison between single-cell and bulk samplesTo assess the accuracy of clonality estimates from bulk samples, we compare the GBC species content between bulk DNA and single-cell RNA GBC sequencing samples from the same condition. We define the frequency of a GBC species in a bulk sample by CPM. Then, for a given pair (Xb, Xs) of bulk and single-cell samples, we define the value y = f(x) as the fraction of GBC species with frequency ≥x in Xb found as expressed in at least one cell of Xs. In practice, we compute f(x) in steps of 20 CPM units. Consistent clone estimates in bulk and single-cell samples in a condition result in a monotonically non-decreasing trend of f(·). A cell c is classified as tumour-initiating or drug-tolerant (see section “Definition of tumour-initiating clones and drug-tolerant clones” above) if and only if all GBCs expressed in c are tumour-initiating or drug-tolerant, respectively. Conversely, the single-cell cluster labelling is transferred to a bulk sample as follows: the GBC abundance (CPM) in the bulk sample is first normalised by its average abundance at baseline; then, the abundance of cluster C in the bulk sample is calculated as the contribution of all GBCs expressed by any cell in C.Quality filtering and normalisationAll CBs with no expressed GBCs or classified as doublets are removed from subsequent analysis. Samples from technical replicates (same vial) are concatenated and processed using Seurat v4.0.580. UMI counts are added a pseudocount = 1, divided by library size, multiplied by 10,000, and log-transformed (natural logarithm), to obtain log-normalised counts.Dimensional reduction and clusteringHighly variable genes detection is performed on log-normalised counts using two different approaches: min.var.plot and vst. We set xmin = 0.1 and xmax = 10 for min.var.plot. The input parameters for each algorithm are let vary in a pre-defined set: dispersion.cutoff in {0.5, 1, 1, 5} for min.var.plot, and nfeatures in {1000, 2000, 5000} for vst. A cell-cycle score is computed using the Seurat function CellCycleScoring with default parameters. We observe a high cell-cycle effect in parental samples (T0 and T1); hence, the cell-cycle score computed as above is regressed out before scaling and centreing. PCA is performed on z-scores of the log-normalised counts on the reduced space of highly variable genes. A total of 50 PCs is computed. The optimal number n of PCs to retain for clustering is defined as the minimum n such that the standard deviation explained by the nth PC exceeds 50% of the average across the 40–50th PCs. Multiple Louvain clustering runs are performed on the selected PCs, for each highly variable gene set, by varying the number of neighbours k in the knn graph and the resolution parameter r in a pre-defined set: k ∈ {30, 40, 50} and r from 0.1 to 0.8 in steps of 0.1. A second and third round of highly variable gene selection, PCA, and clustering are possibly performed after the removal of small clusters with very low UMI count, found in specific solutions, until the average UMI count is homogeneous across clusters. We redo the whole clustering procedure instead of just removing the small, low-quality clusters, because they are usually outliers in the expression profile of the sample and can affect the definition of highly variable genes. We obtain 9395 and 13,562 cells for T0 and T1, and 7884 and 10,698 cells for exp1 and exp2, respectively. For T0 and T1, (i) a silhouette score is computed for each clustering solution on the Euclidean distances calculated on the same PCs used for clustering and (ii) a ROGUE score81 is computed for each clustering solution on the UMI counts, using default parameters, except for span =0.6. The clustering solution with the highest silhouette score is defined as optimal. To increase the number of detected clusters, we set a higher value for the resolution parameter, while maintaining the other parameters of the optimal solution unchanged (the highly variable gene set and the number of neighbours in the knn graph). The final clustering solutions for T0 and T1 are obtained with vst method by setting nfeatures = 1000 and clustering parameters k = 40 and r = 0.5, resulting in seven clusters for both experiments. Resolution 0.5 turned out to be a good compromise between silhouette width and ROGUE score on both T0 and T1. For exp1, we obtain three clusters using the following settings: vst method, nfeatures = 5000, k = 30, and r = 0.1. UMAP reduction82 is run with RunUMAP using default parameters on the same input used for clustering.Differential expression analysisA differential expression analysis is run between conditions via FindMarkers function from Seurat with MAST method83 accounting for sample ID as a covariate. We only test for genes detected (UMI count > 0) in at least 10% cells in either condition such that |log2(FC)| ≥ 0.25. A gene is defined as differentially expressed if its adjusted p value is < 0.05 (Bonferroni correction) between the two conditions.Computation of cell–cell distance distributionsTo evaluate the relationship between clonality and gene expression at basal state, we compare sister cells (same clone) and non-sister cells (different clones) at T0 and T1. We note that a sister cell similarity measure within the same experiment would be biassed towards clones with vial-specific frequency >1, thus we opt for a comparison between experiments (1886 common clones). Cells of T0 and T1 are projected on a common latent space using canonical correlation analysis (CCA)84, using the intersection of highly variable gene sets (defined as in the section “Dimensional reduction and clustering”) between experiments. By definition, this integration approach removes dataset-specific complexity by keeping only shared correlations, hence the contribution of the experiment covariate to cell–cell similarity should be minimised compared to more conservative approaches. Inter-sample cell–cell Euclidean distances are computed on the CCA latent space on sister cell pairs and a random subset of non-sister pairs of the same size. A Gaussian kernel density estimation is computed for sister and non-sister distances separately to obtain a global distance distribution. Then, to evaluate the relationship between and within the DTC and the non-DTC pool before and after treatment, we computed the Euclidean distances between cells at T1 (baseline) and in exp1 (after treatment); finally, we evaluated the relationship between clonality and gene expression on the treated condition by computing intra-sample Euclidean distances at day ≥13 on the PC space of exp1 and exp2, as defined in the section “Dimensional reduction and clustering”. We obtained Gaussian kernel density estimates for both cell–cell distance matrices as above.Computation of clone sharedness and detection of cell subpopulationsTo assess whether sister cell similarity is uniform across clones or is rather clone-specific, we introduce a clone sharedness score across clusters (see section “Dimensional reduction and clustering”). It is defined, for each pair of clusters i and j in T0 and T1, as cs(i,j) = (nij·n′ji)/(Ni·N′j), where nij (n′ji) is the number of cells in i (in j) with at least one sister in j (in i) and Ni (N′j) is the total number of cells in i (in j). Each cluster i in T0 is matched with the cluster jmax in T1 with maximum clone sharedness with i, namely, jmax = arg maxj cs(i,j). The three clusters belonging to maximal clone sharedness pairs (i,jmax) are defined as subpopulations and labelled by non-increasing value of cs(i,jmax) as S1, S2, and S3. Transcriptionally stable clones are defined as the ones only found in one of the three subpopulations in both T0 and T1, they are 437 in total. This approach is independent of single-cell data integration, whose accuracy is difficult to assess when reliable markers are unknown85,86. We define the gene signature of S ∈ {S1, S2, S3} as the set of genes that are differentially upregulated between S and its complement in both T0 and T1. We obtain 109, 29, and 27 genes for signatures S1, S2, and S3.Definition of clone–clone pair propensityWe define a pair propensity score to measure the pairwise gene expression similarity bias between clones. For each cell, we consider its directed neighbourhood of size k, i.e., each cell has exactly k nearest neighbours and the neighbour relationship is not symmetric. Given two clones labels i and j, possibly identical, the observed (i,j) frequency fijobs is the number of cell pairs (ci,cj) such that ci and cj belong to clones i and j, respectively, and cj is a neighbour of ci. The expected (i,j) frequency fijexp is calculated based on the frequency of clones i and j and the neighbourhood size k, as follows. Given the neighbourhood of ci, the probability that a given cell c ≠ ci in the neighbourhood is labelled with j is given by (nj − I(i,j))/(N − 1), where nj is the number of cells labelled with j, N is the total number of cells in the sample, and I(i,j) is the indicator function (I(i,j) = 1 if i = j and I(i,j) = 0 if i ≠ j). We obtain fijexp = ni·k·(nj − I(i,j))/(N − 1), where ni is the number of cells labelled with clone i. The pair propensity of i and j is defined as pij = fijobs/fijexp. By iterating across all clone pairs, we obtain a non-symmetric n × n matrix P, where n is the number of distinct clone labels. The value of pij indicates the propensity of clones i and j to be closer (pij > 1) or farther (pij < 1) from each other in the sample than expected by chance, where pij = 1 denotes random association. Finally, we obtain a symmetric matrix \({P}^{{\prime} }=({P}^{T}{P})^{\frac{1}{2}}\).Detection of clone lineagesWe use the above pair propensity definition to find groups of clones with mutually similar gene expression profiles on the treated condition at day ≥13. We first define the neighbour relationship, as follows. The top 1000 highly variable genes are computed with vst method on the union of the time points in exp1 and exp2, respectively. For each sample, a knn graph with k = 40 is built on the top n PCs (see section “Dimensional reduction and clustering”). Matrix P′ (see section “Definition of clone–clone pair propensity”) is computed on the top 20 highest frequency clones, clones i such that pii < 1 are discarded, and the matrix obtained is clustered using the hclust function with ward.D2 method on Euclidean distances. We obtain two clusters A and B that serve as “anchor” to add the remaining clones. Clones not yet included in A nor in B are sorted by non-increasing frequency and we iteratively add one clone at a time, as follows. The first clone i in the ordering is considered and a pair propensity piA (piB) is computed between i and the union of clones in A (in B). If piB < 1 and piA > piB + 1, then i is added to A (likewise for B), otherwise it is discarded. The procedure continues until the last clone in the ordering is considered. Starting from the sets A and B computed on each of the four samples, we define lineage L1 (lineage L2) as the set of clones always assigned to A (to B) and detected at least once in both exp1 and exp2. We detect 11 clones for L1 and 17 clones for L2. We define the gene signature of L1 (of L2) as the set of genes that are differentially upregulated (downregulated) between cells in L1 and L2 in both exp1 and exp2. We obtain 42 and 46 genes for L1 and L2.Functional annotationWe perform pathway enrichment analysis (REACTOME87) as follows. Genes with official gene symbols are first converted to ENTREZ identifiers, using limma v3.49.588 and clusterProfiler v4.1.489 packages, respectively. Enrichment significance is calculated with a Fisher’s exact test using the enrichPathway function from clusterProfiler. Benjamini–Hochberg method is used to correct for multiple testing. We perform Gene Set Enrichment Analysis (GSEA90) as follows. Given a query case-control, the list of genes (universe) is ordered by non-increasing log2(FC) values; GSEA is run with GSEA function from clusterProfiler, with nPerm = 1000 sample permutations. MYC activity was computed with ModuleScore function from Seurat on T0 and exp1 cells using HALLMARK_MYC_TARGETS_V1 and HALLMARK_MYC_TARGETS_V2 (www.gsea-msigdb.org) as MYC target gene lists.Single-cell RNA-seq data analysis (MDA-MB-231 TGL)Read alignment, UMI counting, and CB calling are performed as for SUM159PT cells, with parameter –expect-cells set to 3000. CBs with < 200 detected genes (UMI > 0) and genes detected in < 3 cells are filtered out. Normalisation, scaling, cell-cycle regression, dimensional reduction, clustering and differential expression analysis are performed as described for SUM159PT cells. The final clustering solution is obtained with vst method by setting nfeatures = 1000 and clustering parameters k = 40 and r = 0.5, resulting in seven clusters.Single-cell Multiome data analysisCell barcode callingA GBC reference is defined as the union of the GBC sets computed from the scRNA-seq reads, as described in the section “Genetic barcode calling”. Read alignment, UMI count, peak calling, and CB calling are performed using cellranger-arc count from CellRanger v6.0 on genome reference GRCh38 v2020-A-2.0.0. GBC counts are extracted using cellranger count, setting –expect-cells to 6200 (M0_1) and 8800 (M0_2) and –include-introns. CBs with < 10,000 UMIs are filtered out.Clone detection and quality filteringExpressed GBCs in CBs are identified using the same procedure applied to scRNA-seq samples and described in the section “Clone detection”, using cj ≥ 10 and cj/cmax ≥ 0.3, and no p value threshold. All CBs with no expressed GBCs or classified as doublets are removed from subsequent analysis. We obtain 2446 and 2377 nuclei for M0_1 and M0_2, respectively.Gene expression analysisCell clusters are defined using gene expression information only, on M0_1 and M0_2 separately. UMI count normalisation is calculated as for scRNA-seq datasets with Seurat v4.0.6. Highly variable genes are selected using FindVariableFeatures function from with method = vst and nfeatures = 5000. PCA is performed on the set of highly variable genes using RunPCA function with default parameters. The first 30 PCs are used to compute a Shared Nearest Neighbours graph via the FindNeighbors function, then FindClusters function is used to cluster the cells, using the SLM algorithm with resolution ranging between 0.2 and 1.2 in steps of 0.1. A silhouette width is calculated for each clustering solution on the Euclidean distance matrix computed on the same input used for clustering. Subpopulations S1, S2, and S3 in the two replicates are detected by matching the clusters to T0 and T1 using the sharedness score (see section “Computation of clone sharedness and detection of cell subpopulations”). The selected resolution value is the one that maximises the average silhouette width while keeping the three subpopulations distinct. We select resolution 0.4 and 0.6 for M0_1 and M0_2, respectively, and obtain 7 and 6 clusters. UMAP projection is calculated using RunUMAP with default parameters on the same input used for clustering. Cluster markers are extracted with the Wilcoxon rank-sum test implemented in FindAllMarkers function using default parameters. Differentially expresseSingle-cell RNA-seq data analysis (SUM159PT)”. Subpopulation gene signatures are defined as for the scRNA-seq samples.Chromatin accessibility analysis—data preprocessingThe region-cell count matrices, where regions are ATAC peaks and entries are fragment counts, are processed as follows. Raw fragment counts are normalised using term frequency-inverse document frequency (TF-IDF), which assigns higher importance to highly cell-specific regions. The active regions (fragment count ≥ 1) in at least ten cells are selected using the FindTopFeatures function from Signac v1.7.091. Latent semantic indexing (LSI) is applied to reduce the dimensionality of the dataset. We compute 50 LSI dimensions. We keep dimensions from 2 to 50, as dimension 1 shows a positive correlation with sequencing depth. UMAP projection is calculated on the same LSI space, using default parameters.We define pairs of regions associated with the same transposition event between replicates as those lying at a distance ≤ d on the genome. findOverlap function from GenomicRanges v1.46.192 is used to determine such pairs, by varying the gap size (i.e., the value of d) from −1000 (overlap) to 1000 (padding) in steps of 10. Depending on d, each region in M0_1 can have 0, 1 or multiple matching regions with M0_2, and vice versa. The idea is that a low d would fail to recognise regions stemming from the same transposition event, whereas a high d would result in many spurious matches. Low (high) values for d result in few (many) multiple matches. The value of d is chosen such that both ud and ud/(Nd −u d) are maximised, where ud is the number of unique matches between M0_1 and M0_2 and Nd the total number of matches. We select d = 0 and obtain 120,414 and 120,377 matched regions in M0_1 and M0_2, respectively.Chromatin accessibility analysis—topic modellingWe use a topic modelling approach to group regions that are consistently open in the same sets of cells while reducing data sparsity. Given a region-cell matrix, topic modelling outputs a set of topics, where each region has a probability of being assigned to a topic and each topic has a probability of being assigned to a cell. Modelling chromatin accessibility in this way has three advantages: first, the number of topics is typically orders of magnitude smaller compared to the number of regions; second, a cell’s epigenome is expressed as the contribution of multiple topics; third, the importance of a region in a cell is interpretable, namely, as the combination of the weight of the region in a topic and the contribution of that topic in the cell’s epigenome.Specifically, we use the Latent Dirichlet allocation (LDA) model implemented in cisTopic v0.3.056 on each replicate separately, on the set of matching regions between the two replicates (both unique and multiple matches, see section “Chromatin accessibility analysis—data preprocessing”). The input to cisTopic is a raw region-cell count matrix, binarised by setting a cutoff at one fragment per region per cell. cisTopic is run using a total number of 1000 iterations and a burn-in period of 500 iterations to the Collapsed Gibbs Sampler. The procedure is repeated by varying the number of output topics n between 10 and 50 in steps of 10. We select the maximum log-likelihood solution and obtain n = 40 for both M0_1 and M0_2. Topics with Pearson correlation coefficient >0.5 between topic-cell probability and fragment count are discarded. We select 31 topics for M0_1 and 34 topics for M0_2.Chromatin accessibility analysis—detection of ATAC modulesTo select the topics that represent robust biological signals, we compute a topic reproducibility score between replicates, as follows. To this aim, an IDR93 is calculated between every pair of topics in M0_1 and M0_2, respectively, on region-topic probabilities, using idr v2.0.3 tool with parameters –peak-merge-method max –idr-threshold 0.05 –max-iterations 100. Given topics t1 and t2 in M0_1 and M0_2 and a region r, the IDR statistic expresses the probability that the region-topic probabilities of r in t1 and t2 are different. We say that r is reproducible between t1 and t2 if IDR < 0.05. For each topic pair (t1,t2) in M0_1 and M0_2, we define the reproducibility score as the weighted mean between the number of reproducible regions and the 75th percentile of min {−125 log2(IDR), 1000} across those regions. We define the sets of regions with IDR < 0.05 in topic pairs with reproducibility score > 0 as ATAC modules. For each cell and for each module, we compute a module AUC (Area Under the Receiver Operating Characteristic curve) score, where the TF-IDF value is the predictor variable and the membership to the module is the response variable, using the auc function from pROC v1.18.094 and direction “<”. The set of regions resulting from the union of all ATAC modules were annotated using the ENCODE registry of Regulatory Elements v2 (cCRE)95 with a minimum overlap of 1 bp.Relationship between epigenetic modules and subpopulationsWe detect the ATAC modules explaining specific cell subpopulations as follows: for each ATAC module, we compute an AUC score where the predictor variable is the module AUC defined above and the response variable is either the subpopulation membership (S1, S2, S3) or the phenotype (TIC,DTC). Then, we compute Spearman’s ρ between the module AUC for each of the top 40 reproducible ATAC modules and each gene among those detected in at least 20 cells in the highly variable gene pool (vst method, nfeatures = 5000) across all nuclei. We annotate each gene according to whether they belong to S1, S2, and S3 signatures (obtained from both scRNA-seq and Multiome gene expression data) or are annotated as human TF by Uniprot96 (GO:0003700, taxon = human, gene product type = protein; 1435 TF in total). To identify putative cis-regulatory regions, we extract the genes in the transcriptional signatures of S1, S2, and S3 that are proximal to reproducible regions in epigenetic modules, with a neighbourhood size of ±50 kbp around the region, using findOverlaps function from GenomicRanges package.Identification of enriched TFsFor a given module, we extract two sets of genes according to the following criteria: (i) Spearman’s ρ ≥ 0.2 between the module AUC and gene expression across nuclei (see above), or (ii) gene locus lying ≤100 kb away from any region in the module, as by GREAT 4.0.497 analysis, using the basal plus extension gene regulatory domain definition. These two sets are separately used as input to ChEA v.398 and the output ranked list of enriched TFs is obtained.Whole-exome sequencing data analysisRead alignment, variant calling, and extraction of reproducible CNVsReads were aligned with BWA (-t 16) v0.7.1799 and CNVs were called with CNVkit v0.9.8100 with default parameters, using the parental SUM159PT cells as a normal reference sample and providing the appropriate Agilent bed file (v5 or v7) as target. The chromosomal CNVs detected with CNVkit were retrieved. CNVs were intersected across replicates using bedtools intersect from bedtools v2.30.0101, requiring ≥1 bp overlap. Only regions covered by all replicates are retained. A CNV consensus is then defined as the set of regions that span >80% of a CNV in each replicate.Comparison with single-cell sequencing dataRegions from scATAC-seq assays of M0_1 and M0_2 are intersected with CNV consensus regions using subsetByOverlap function from GenomicRanges v1.45.0 with default parameters. Genes from scRNA-seq assays are intersected with CNV consensus regions using the same approach but allowing ±100 kbp around the CNV consensus regions. Gene loci coordinates are extracted from the annotation used in the single-cell Multiome analysis.Bulk RNA-seq analysis of SUM159PT and tumoursReads were trimmed, filtered and aligned using STAR v2.7.3102. Read count extraction and TPM normalisation were performed using FeatureCounts. The TM4SF1high signature, consisting of 433 DEGs, was defined using edgeR (within Galaxy v3.36), considering the sorting batch as a factor, and selecting genes with log2(FC) > 1 and p value < 0.05 (N = 3 independent sorting batches). The functional analyses of TM4SF1high DEGs were generated through the use of IPA (QIAGEN Inc.). Differential gene expression analysis in tumours vs 2D samples was performed with the Bioconductor Deseq2 package v1.34103 using default parameters.Analysis of gene signatures (S1, S2, S3) in bulk TCGA and METABRIC datasetsRNA expression data from primary bulk breast cancer patients were retrieved from Cbio Cancer Genomics Portal104, selecting as studies METABRIC and TCGA and as Genomic Profiles “mRNA expression” (METABRIC: “mRNA expression z-scores relative to all samples (log microarray)”; TCGA: “mRNA expression z-scores relative to all samples (log RNA Seq V2 RSEM)”). Only complete samples were considered for this analysis. Molecular subtyping information was retrieved from the Cbio Portal. For each signature, the sum of the z-scores of all the expressed genes (genes expressed in <50% of the samples were excluded from the analysis) was calculated. Then, patients were stratified according to score quartiles (1st quartile, S1-high; 2nd and 3rd quartile, S1-mid; 4th quartile, S1-low; likewise for S2 and S3). The significance of the association with SUM159PT-derived signatures was calculated with a Chi-square test. Individual genes of the signatures were also analysed through multivariate linear correlation and visualised by a colour map of clustered correlations (K-means) using JMP17.2 (SAS).Comparison with single-cell sequencing datasets from the literatureTriple-negative breast cancer (TNBC) scRNA-seqConcatenated scRNA-seq data from primary TNBC samples were retrieved from GSE16152950. Filtered cell-count matrices were processed as in the section “Quality filtering and normalisation”. Epithelial cells were defined by normalised EPCAM expression greater than the local minimum between the top two local maxima in the distribution. Cells with UMI count >20,000 were removed, resulting in 9063 epithelial cells for downstream analysis. Highly variable gene detection, PCA, and clustering are performed with Seurat_preprocessing function from Scissor 2.0.0105. We obtained nine clusters. Pseudo-bulk positive and negative phenotypes are generated for each subpopulation (S1, S2, S3) and its complement, respectively, on each scRNA-seq sample at baseline separately (T0_1, T0_2, T1_1, T1_2), resulting in n = 8 samples/phenotype pair (4 replicates/condition). Scissor is run on the single-cell dataset for each of the three phenotype pairs. Parameter alpha for the logit regression is let vary in {0.7, 0.8, 0.9}, until ≤15% phenotype-associated cells (both positive and negative) are detected (cutoff = 0.15). To quantify the enrichment of S1, S2, and S3 signatures across clusters, we computed the odds ratio for each signature–cluster pair. UMAP is run with default parameters on the top 10 PCs.Curated Cancer Cell Atlas (3CA)We obtained the top 50 significant genes for each of the meta-programmes (MP) associated to malignant cells51. Gene symbols were first converted to synonyms to match the gene symbols in the gene-cell expression table, using UpdateSymbolList function from Seurat. We computed a joint gene expression for each gene list, using the ModuleScore function. Finally, we computed the area under the curve AUC(C,M), where the predictor variable is the ModuleScore associated to the meta-programme M and the response variable is true when a cell belongs to C, false otherwise.Pancreatic ductal adenocarcinoma (PDAC) scRNA-seqAssociation with subclonal dissemination for 2010 genes was retrieved from Simeonov et al.20. To map S1 and S3 signature genes, murine gene symbols were converted to human gene symbols with limma v3.49.5.Lung adenocarcinoma scATAC-seqPre-metastatic gene scores for 20564 genes for Module 9 (RUNX2) were retrieved from LaFave et al.52.Statistics & reproducibilityNo statistical method was used to predetermine the sample size, which was comparable to that reported in previous studies. Regarding single-cell experiments, the required number of individual single-cell profiles was determined to capture a sizable portion of the total number of GBCs, which was first assessed by bulk DNA sequencing. Statistical significance was measured as indicated in the figure legends. No data were excluded from the analysis performed in vivo. One single-cell time-course batch (MULTI-seq) was removed due to poor library quality. All experiments were replicated at least twice. The experiments were not randomised, as the study was performed on uniform biological material (i.e., a commercial cell line). For comparative in vivo experiments (e.g., TM4SF1 high vs bulk), animals were allocated randomly into the experimental groups. The investigators were not blinded to allocation during experiments and outcome assessment for the in vivo experiments (e.g., TM4SF1 high vs bulk).Reporting summaryFurther information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Multi-omic lineage tracing predicts the transcriptional, epigenetic and genetic determinants of cancer evolution

Turbocharging protein binding site prediction with geometric attention, inter-resolution transfer learning, and homology-based augmentation | BMC Bioinformatics

Zero-shot transfer of protein sequence likelihood models to thermostability prediction

Poisoning medical knowledge using large language models

Semi-supervised recognition for artificial intelligence assisted pathology image diagnosis

Chemistry wordoku #062 | Puzzle

Hot Topics

Turbocharging protein binding site prediction with geometric attention, inter-resolution transfer learning, and homology-based augmentation | BMC Bioinformatics

Zero-shot transfer of protein sequence likelihood models to thermostability prediction

Poisoning medical knowledge using large language models

Related Articles

Balancing Act: Pregnancy and Bipolar Disorder

Cohesion at the cellular level: flexible yet stable

Gut bacteria influence responses to immunotherapy in patients with asbestos related cancer

Quick Links

Must Read

Turbocharging protein binding site prediction with geometric attention, inter-resolution transfer learning, and homology-based augmentation | BMC Bioinformatics

Zero-shot transfer of protein sequence likelihood models to thermostability prediction

Poisoning medical knowledge using large language models

Semi-supervised recognition for artificial intelligence assisted pathology image diagnosis

Popular Articles

Turbocharging protein binding site prediction with geometric attention, inter-resolution transfer learning, and homology-based augmentation | BMC Bioinformatics

Zero-shot transfer of protein sequence likelihood models to thermostability prediction

Poisoning medical knowledge using large language models