Full-length target sequences of GeoMx digital spatial profiling probes reveal that gene-promiscuity predicts probe sensitivity to EDTA tissue decalcification

DSP probes are 35 to 50 nucleotides long, and are identified in the publicly available configuration file [7] by a 35-nucleotide extract, the identifiers of the mRNA transcripts they target, and the genomic coordinates that delimit the genomic sequences giving rise to the targeted transcript sequences. Of the 126,346 transcripts specified in the configuration file, 98% could be identified in a past RefSeq [8] database version (GRCh38.p13, release 2020/05/22). However, because transcript databases are repeatedly updated, with transcripts being added or removed, potential targets of DSP probes need to be re-evaluated regularly; this is critical because a more current version of RefSeq (GRCh38.p14, accessed 2023/04/08) allows identification of only 42% of the transcripts in the configuration file. Accordingly, we developed a strategy for inferring the undisclosed target sequences by combining the specifications provided by Nanostring with publicly available data (Fig. 2A).

Fig. 2. Analysis of full-length target sequences of DSP probes. (A) The DSP configuration file is publicly available (see reference 6) and lists targeted transcripts, the genomic coordinates that delimit the genomic sequence giving rise to the targeted sequence of the target mRNA transcript, and a 35-nucleotide extract of the probe target sequence. However, the corresponding transcript coordinates first need to be identified using RefSeq and the GenomicFeatures R package. Transcript coordinates allow the retrieval of potential target sequences, which are then filtered to retain those matching the truncated sequence provided by Nanostring. Sequences are then aligned to known human transcripts using BlastN. Created with BioRender.com. (B) Overview of results of the probe identification process. A “probe-gene association” is an identified pair of probe and target gene. One probe may have multiple probe-gene associations. Each probe-gene association may involve multiple transcripts.
Percentages are rounded to two decimal places and are relative to the number of the “parent” element. (C) Probes ranked by adjusted p value of differential gene expression analysis between untreated tissues and those incubated with EDTA for one, three, or seven days (vertical axis). The horizontal axis shows the number of gene targets per probe (length of bar) and the direction of the log2FC of the differential expression analysis (color and direction of bar). Blue bars indicate a negative log2FC, whereas red bars indicate a positive log2FC. Note that in non-normalized data, probes with many target genes (long bars) are located at the bottom end of the graph after 7 days (i.e., show only a small change after EDTA treatment). After quantile normalization, these probes appear to show the highest change and are located at the top of the graph. The dotted line indicates the portion of probes with significant differential expression.

For each probe, the configuration file specifies a variable number of target genes, transcripts and genomic coordinates, but not the correspondence between these data. For this reason, we first used RefSeq to determine matching pairs of target transcripts and their genomic coordinates. This allowed us to accurately convert the genomic coordinates to coordinates within targeted transcripts, which in turn enabled us to retrieve the precise transcript sequences that the probes are specified to hybridize with (summarized in Fig. 2B).

The Nanostring human whole transcriptome atlas panel [7] comprises 18,815 probes (18,676 target probes and 139 negative control probes).
We determined the unequivocal full-length target sequences of 18,300 of the target probes (~98%), whilst 376 remained ambiguous: 213 with missing genomic coordinates (including two with erroneous coordinates); 91 matching multiple potential target sequences; 56 probes with up to three nucleotides differing from the given extract; 12 probes whose given extract lay within a specified transcript but outside of the given genomic coordinates; and 4 probes where the coordinates indicated a sequence longer than the limit of 50 nucleotides specified by Nanostring.

We further determined that the majority of unequivocally identified probes do not span introns (15,105, 83%), and that a minority of probes (26, 0.14%) span introns only in a subset of their target transcripts. The negative control probes (designed not to align against any human sequence) have no specified genomic coordinates or target transcripts, and their full-length sequences therefore could not be deduced. We provide the entire dataset containing all probes, including unequivocal and ambiguous sequences, in Supplementary Table 2, but in the following summary only report results from unequivocally identified probes (e.g., excluding negative control probes), unless explicitly stated.

We next aligned the target sequences with known human transcripts using BlastN [9]. We identified 182,161 transcripts with complete alignment identity and 11,568 transcripts with incomplete (81% to <100%) alignment identity. Unexpectedly, even among “official” target transcripts previously specified by Nanostring, we found a minority (391) with incomplete (89% to 98%) alignment. Grouping the transcripts by probe and gene, we identified a total of 21,867 probe-gene associations. Similarly, even among the probe-gene associations previously specified by Nanostring, we found a minority (321) with incomplete alignment (93% to 98%). This suggests that complete sequence identity was not an absolute requirement for Nanostring to designate target transcripts or genes.
We therefore also refrained from further filtering our BlastN hits by alignment identity.

We determined that the majority of the probe-gene associations (16,050) involved more than one transcript per gene, whereas a minority (5817) involved only a single one; overall, there were a median of 4 and a mean of 8.8 transcript isoforms per probe. To identify novel transcript isoforms, we compared the base accession numbers of transcripts (without version suffixes) between those specified by Nanostring and those identified via BlastN, yielding 90,634 novel transcripts, and 30,641 that are either not targeted in their current version, or were entirely removed (and not updated) by RefSeq. Transcripts that were merely updated (i.e., showed an increment in the version suffix) were not considered novel and were excluded from this summary.

Grouping transcripts by probe, we found that the majority of unequivocally identified probes (16,707) targeted only one gene, whereas a minority (1559) targeted more than one gene, and 34 probes appeared not to target any transcripts currently listed in RefSeq. A total of 1255 probes had at least one target gene that differed from the Nanostring specifications.

Summarizing all aligned transcripts, we identified 19,937 target genes (19,277 with complete alignment), including 19,061 of the 19,505 genes specified by Nanostring. When including not only unequivocally identified probes, but also given extracts and ambiguous sequences, the number of target genes was 20,347 (19,707 with complete identity), including 19,403 of those specified by Nanostring.
This indicates that some probes were designed to bind targets no longer thought to be expressed as mRNA, and the corresponding genes can no longer be considered to be assessed by the panel.

The given extracts of two negative control probes showed complete alignment with transcripts listed in RefSeq, with the caveat that their full-length sequences could not be identified.

In conclusion, this approach identified the full-length sequences and corresponding targets in the human whole transcriptome atlas panel, and enabled a search for properties that contributed to specific EDTA sensitivity.
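
As an illustration only, the conversion of genomic coordinates to transcript coordinates and the subsequent filtering by the 35-nucleotide extract can be sketched in Python. The analysis reported here used RefSeq and the GenomicFeatures R package, so the sketch below is not the authors' implementation. The function names, the toy exon layout, and the assumptions that exons are given as 1-based inclusive genomic intervals and that the extract occurs as a substring in the same orientation as the transcript are all hypothetical.

```python
from typing import List, Optional, Tuple

# Hypothetical representation: 1-based, inclusive genomic (start, end) per exon.
Exon = Tuple[int, int]


def genomic_to_transcript(pos: int, exons: List[Exon], strand: str) -> Optional[int]:
    """Map a 1-based genomic position to a 1-based transcript position.

    Returns None if the position lies in an intron or outside the transcript.
    Assumes non-overlapping exons and strand given as "+" or "-".
    """
    # Walk exons in transcription order: ascending on "+", descending on "-".
    ordered = sorted(exons, reverse=(strand == "-"))
    offset = 0  # transcript bases contributed by the exons already passed
    for start, end in ordered:
        if start <= pos <= end:
            within = pos - start if strand == "+" else end - pos
            return offset + within + 1
        offset += end - start + 1
    return None


def candidate_target(g_start: int, g_end: int, exons: List[Exon],
                     strand: str, transcript_seq: str) -> Optional[str]:
    """Retrieve the transcript sequence delimited by two genomic coordinates."""
    a = genomic_to_transcript(g_start, exons, strand)
    b = genomic_to_transcript(g_end, exons, strand)
    if a is None or b is None:
        return None  # a border does not fall within an exon of this transcript
    lo, hi = min(a, b), max(a, b)
    return transcript_seq[lo - 1:hi]


def matches_extract(candidate: str, extract: str,
                    min_len: int = 35, max_len: int = 50) -> bool:
    """Keep a candidate only if its length is plausible for a probe target
    (35 to 50 nt) and it contains the given extract (assumed: as a substring)."""
    return min_len <= len(candidate) <= max_len and extract in candidate


# Toy example (invented data): a plus-strand transcript with two exons.
toy_exons = [(101, 130), (201, 240)]        # 30 nt + 40 nt = 70 nt transcript
toy_transcript = ("ACGTTGCA" * 9)[:70]
toy_candidate = candidate_target(121, 230, toy_exons, "+", toy_transcript)
toy_extract = toy_candidate[:35]            # stand-in for a published extract
print(len(toy_candidate), matches_extract(toy_candidate, toy_extract))
```

In this toy layout the genomic interval crosses an intron, so the retrieved target is shorter than the genomic span; the same situation applies to real probes whose target sequences span introns. Candidates that fail the length or extract check would correspond to the ambiguous categories listed above rather than being discarded silently.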
