MultiMatch: geometry-informed colocalization in multi-color super-resolution microscopy

Optimal chain-matching

In the following we will denote the sets of two-dimensional particle coordinates in the image domain for each of the k color channels as$${X}^{(1)}:= {\left\{{{{{\boldsymbol{x}}}}}_{l}^{(1)}\right\}}_{l = 1}^{{n}_{1}},\ldots ,{X}^{(k)}:= {\left\{{{{{\boldsymbol{x}}}}}_{l}^{(k)}\right\}}_{l = 1}^{{n}_{k}}\subseteq {{\mathbb{R}}}^{2},$$
(1)
where the number of particles \({n}_{j}\in {{\mathbb{N}}}_{\ge 0}\) for j ∈ {1, …, k}. For simplicity, and in line with the data considered in this article, we will only treat the cases k = 2, 3 in the following; the generalization to larger k is straightforward. In a chain-like particle arrangement of the form \(\left({{{{\boldsymbol{x}}}}}^{(1)},\ldots ,{{{{\boldsymbol{x}}}}}^{(k)}\right)\) with x(j) ∈ X(j), all neighbors x(j), x(j+1) have to be closer than the colocalization threshold t, and we will refer to such tuples as \({{{{\bf{d}}}}}_{t}^{k}\)-chains:

Definition 1 (\({{{{\bf{d}}}}}_{t}^{k}\)-chain). Fix k ≥ 2. For sets X(1), …, X(k), a distance \({{{\bf{d}}}}:{{\mathbb{R}}}^{2}\times {{\mathbb{R}}}^{2}\to {{\mathbb{R}}}_{\ge 0}\) and a predefined maximal threshold t ≥ 0, a tuple of k points$$\left({{{{\boldsymbol{x}}}}}^{(1)},\ldots,{{{{\boldsymbol{x}}}}}^{(k)}\right)\in {{\mathbb{R}}}^{2\times k}\quad \,{{{\rm{with}}}}\,\quad {{{{\boldsymbol{x}}}}}^{(j)}\in {X}^{(j)}\ \,{{{\rm{for}}}}\,\ j\in \{1,\ldots ,k\}$$
(2)
is a \({{{{\bf{d}}}}}_{t}^{k}\)-chain if the pairwise point distances along the fixed tuple order are less than or equal to t, i.e.,$${{{\bf{d}}}}\left({{{{\boldsymbol{x}}}}}^{(j)},{{{{\boldsymbol{x}}}}}^{(j+1)}\right)\le t\quad \,{{\mbox{for}}}\,\quad j\in \{1,\ldots ,k-1\}.$$
(3)
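Definition 1 translates directly into code. A minimal Python check of the chain condition (3) for the Euclidean distance (the function name is ours, not part of the MultiMatch API) could look as follows:

```python
import numpy as np

def is_chain(points, t):
    """Check whether an ordered tuple of 2D points (x^(1), ..., x^(k))
    forms a d_t^k-chain, i.e. consecutive Euclidean distances are <= t."""
    points = np.asarray(points, dtype=float)              # shape (k, 2)
    gaps = np.linalg.norm(np.diff(points, axis=0), axis=1)
    return bool(np.all(gaps <= t))

# Example: triplets (k = 3) with threshold t = 70 (nm)
print(is_chain([(0.0, 0.0), (50.0, 0.0), (50.0, 60.0)], t=70.0))  # True
print(is_chain([(0.0, 0.0), (90.0, 0.0), (50.0, 60.0)], t=70.0))  # False
```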
In the context of our colocalization problem, d is the Euclidean distance (this can easily be generalized); a \({{{{\bf{d}}}}}_{t}^{3}\)-chain is a triplet and a \({{{{\bf{d}}}}}_{t}^{2}\)-chain a pair. For given t, we now aim to detect as many \({{{{\bf{d}}}}}_{t}^{k}\)-chains as possible:

Definition 2 (Optimal \({{{{\bf{d}}}}}_{t}^{k}\)-matching). A collection of pairwise disjoint \({{{{\bf{d}}}}}_{t}^{k}\)-chains is called a \({{{{\bf{d}}}}}_{t}^{k}\)-matching. It is called optimal if its number of chains is maximal among all matchings.

Such an optimal \({{{{\bf{d}}}}}_{t}^{k}\)-matching can be found by utilizing a multi-marginal and unbalanced formulation of OT. For example, if k = 3, for each channel i = 1, 2, 3 we interpret the coordinates of detected particles as support points with mass 1 of a respective discrete distribution. Due to this discrete structure, the resulting optimization problem is finite-dimensional. Since in our measurements the number of detected particles per channel might differ, we require an unbalanced formulation to compare distributions with different total masses. A wide variety of penalty terms for mass discrepancies has been studied in the literature, see for instance ref. 64. Our problem formulation is closely related to an ℓ1-penalty for unmatched particles, see also ref. 52. We first consider the problem of finding optimal \({{{{\bf{d}}}}}_{t}^{2}\)-matchings between two point clouds, i.e., k = 2. This can be solved via the following optimization problem:

Definition 3 (Optimal \({{{{\bf{d}}}}}_{t}^{2}\)-matchings via unbalanced optimal transport). Let \(\lambda \in {{\mathbb{R}}}_{\ge 0}\) and set the cost function$$\begin{array}{l}c:{{\mathbb{R}}}^{2}\times {{\mathbb{R}}}^{2}\to {{\mathbb{R}}}_{\ge 0}\cup \{\infty \},\\ ({x}_{1},{x}_{2})\mapsto \left\{\begin{array}{ll}{{{\bf{d}}}}({x}_{1},{x}_{2})-\lambda \qquad\,{{\mbox{if}}}\,{{{\bf{d}}}}({x}_{1},{x}_{2})\le t,\\ +\infty \qquad\qquad\qquad\,{{\mbox{otherwise}}}\,,\end{array}\right.\end{array}$$
(4)
and \({{{\boldsymbol{c}}}}\in {{\mathbb{R}}}^{{n}_{1}\times {n}_{2}}\) the matrix of pairwise costs between all points in X(1) and X(2), defined by \({c}_{{i}_{1}{i}_{2}}=c({{{{\boldsymbol{x}}}}}_{{i}_{1}}^{(1)},{{{{\boldsymbol{x}}}}}_{{i}_{2}}^{(2)})\). The unbalanced optimal transport problem of interest can now be stated as the following linear program$${{{{\rm{arg}}}}\,{{{\rm{min}}}}}_{\pi \in {{\mathbb{R}}}^{{n}_{1}\times {n}_{2}}}\ {\sum }_{{i}_{1}=1}^{{n}_{1}}{\sum }_{{i}_{2}=1}^{{n}_{2}}{c}_{{i}_{1}{i}_{2}}{\pi }_{{i}_{1}{i}_{2}}\\ \,{{\mbox{s.t.}}}\ {\sum }_{{i}_{2}=1}^{{n}_{2}}{\pi }_{{i}_{1}{i}_{2}}\le 1\quad \,{{\mbox{for all}}}\,\ {i}_{1}=1,\ldots ,{n}_{1}\\ \quad {\sum }_{{i}_{1}=1}^{{n}_{1}}{\pi }_{{i}_{1}{i}_{2}}\le 1\quad \,{{\mbox{for all}}}\,\ {i}_{2}=1,\ldots ,{n}_{2}\\ \quad {\pi }_{{i}_{1}{i}_{2}}\ge 0\quad \,{{\mbox{for all}}}\,\ ({i}_{1},{i}_{2})\in \{1,\ldots ,{n}_{1}\}\times \{1,\ldots ,{n}_{2}\}.$$
(5)
Entries of an optimal π indicate which particles have been matched. The constraints enforce that each particle can be part of at most one matching, but it may also be discarded. By the definition of the cost matrix c, the solution of Equation (5) never matches points x(1) and x(2) that are farther apart than t, while every matching below distance t is incentivized via the parameter λ. For λ sufficiently large in comparison to t one can show that the solution yields an optimal \({{{{\bf{d}}}}}_{t}^{2}\)-matching. Among all optimal matchings, the above problem prefers one with the lowest sum of pairwise particle distances among matched particles.

We now generalize this to k = 3 via a multi-marginal transport problem.

Definition 4 (Optimal \({{{{\bf{d}}}}}_{t}^{3}\)-matchings via unbalanced multi-marginal optimal transport). Let \(\lambda \in {{\mathbb{R}}}_{\ge 0}\) and set the cost function$$\begin{array}{ll}c:{{\mathbb{R}}}^{2}\times {{\mathbb{R}}}^{2}\times {{\mathbb{R}}}^{2}\to {{\mathbb{R}}}_{\ge 0}\cup \{\infty \},\\ ({x}_{1},{x}_{2},{x}_{3})\mapsto \left\{\begin{array}{ll}{{{\bf{d}}}}({x}_{1},{x}_{2})+{{{\bf{d}}}}({x}_{2},{x}_{3})-\lambda &\,{{\mbox{if}}}\,\ {{{\bf{d}}}}({x}_{1},{x}_{2})\le t\wedge {{{\bf{d}}}}({x}_{2},{x}_{3})\le t,\\ +\infty &\,{{\mbox{otherwise}}}\,,\end{array}\right.\end{array}$$
(6)
and let \({{{\boldsymbol{c}}}}\in {{\mathbb{R}}}^{{n}_{1}\times {n}_{2}\times {n}_{3}}\) be the cost tensor between all triplets in (X(1), X(2), X(3)), defined by \({c}_{{i}_{1}{i}_{2}{i}_{3}}=c({{{{\boldsymbol{x}}}}}_{{i}_{1}}^{(1)},{{{{\boldsymbol{x}}}}}_{{i}_{2}}^{(2)},{{{{\boldsymbol{x}}}}}_{{i}_{3}}^{(3)})\). Then the unbalanced multi-marginal OT problem can be stated as the following linear program:$${{{{\rm{arg}}}}\,{{{\rm{min}}}}}_{\pi \in {{\mathbb{R}}}^{{n}_{1}\times {n}_{2}\times {n}_{3}}}\ {\sum }_{{i}_{1}=1}^{{n}_{1}}{\sum }_{{i}_{2}=1}^{{n}_{2}}{\sum }_{{i}_{3}=1}^{{n}_{3}}{c}_{{i}_{1}{i}_{2}{i}_{3}}{\pi }_{{i}_{1}{i}_{2}{i}_{3}}\\ \,{{\mbox{s.t.}}}\ {\sum }_{{i}_{2}=1}^{{n}_{2}}{\sum }_{{i}_{3}=1}^{{n}_{3}}{\pi }_{{i}_{1}{i}_{2}{i}_{3}}\le 1\quad \,{{\mbox{for all}}}\,\ {i}_{1}\in [{n}_{1}]\\ \quad {\sum }_{{i}_{1}=1}^{{n}_{1}}{\sum }_{{i}_{3}=1}^{{n}_{3}}{\pi }_{{i}_{1}{i}_{2}{i}_{3}}\le 1\quad \,{{\mbox{for all}}}\,\ {i}_{2}\in [{n}_{2}]\\ \quad {\sum }_{{i}_{1}=1}^{{n}_{1}}{\sum }_{{i}_{2}=1}^{{n}_{2}}{\pi }_{{i}_{1}{i}_{2}{i}_{3}}\le 1\quad \,{{\mbox{for all}}}\,\ {i}_{3}\in [{n}_{3}]\\ \quad {\pi }_{{i}_{1}{i}_{2}{i}_{3}}\ge 0\quad \,{{\mbox{for all}}}\,\ ({i}_{1},{i}_{2},{i}_{3})\in [{n}_{1}]\times [{n}_{2}]\times [{n}_{3}],$$
(7)
where we used the notation [n] = {1, …, n}. As mentioned above, the marginal constraints ensure that each particle may be matched at most once and can also be discarded. Likewise, the definition of the cost tensor c only allows matchings between points that form valid \({{{{\bf{d}}}}}_{t}^{3}\)-chains. Analogously, there is a matching incentive via the parameter λ, and for sufficiently large values (relative to t) one can show that the above problem provides an optimal \({{{{\bf{d}}}}}_{t}^{3}\)-matching. Among all these matchings, the problem selects one with minimal sum of pairwise distances.

The generalization of Definition 4 to arbitrary k is straightforward, leading to a multi-marginal problem with k marginals. In general, multi-marginal problems quickly become numerically impractical due to the large number of variables. The cost function c in (6), however, has a chain structure, i.e., it can be written as a sum of functions depending only on (x1, x2) and (x2, x3). This chain structure allows the reformulation of the problem as a much more compact network flow problem (see the section below), and it implies the existence of optimal binary matchings. Problems whose cost exhibits a tree structure can still be solved efficiently, see ref. 51 and references therein, but they cannot be formulated as network flow problems and do not exhibit binary minimizers in general.

Network flow formulation

In this section, we show that the multi-marginal optimal unbalanced transport problem corresponds to a min cost flow problem if the cost function has a chain structure as in (6). This has two relevant consequences:

1.

It guarantees that (7) has integer solutions and thus indeed corresponds to a matching problem, which in general does not hold true for discrete OT problems;

2.

It allows us to solve the multi-marginal optimal unbalanced transport problem efficiently.

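To make consequence 1 concrete before formalizing the flow construction, here is a minimal Python sketch (our own naming, using scipy; this is not the MultiMatch API) that solves the two-marginal problem (5) directly as a linear program; infinite-cost pairs are dropped rather than represented:

```python
import numpy as np
from scipy.optimize import linprog
from scipy.spatial.distance import cdist

def pair_matching(X1, X2, t, lam):
    """Solve the two-marginal problem (5): variables with infinite cost
    (pairs farther apart than t) are simply dropped from the LP."""
    d = cdist(X1, X2)                       # pairwise Euclidean distances
    idx = np.argwhere(d <= t)               # admissible pairs (i1, i2)
    if len(idx) == 0:
        return []
    cost = d[d <= t] - lam                  # c_{i1 i2} = d(x1, x2) - lambda
    n1, n2 = len(X1), len(X2)
    # Each row of A encodes "particle i1 (or i2) is used at most once".
    A = np.zeros((n1 + n2, len(idx)))
    for v, (i1, i2) in enumerate(idx):
        A[i1, v] = 1.0
        A[n1 + i2, v] = 1.0
    res = linprog(cost, A_ub=A, b_ub=np.ones(n1 + n2), bounds=(0, 1))
    # The bipartite constraint matrix is totally unimodular, so an optimal
    # vertex solution is integral; rounding only removes float noise.
    return [tuple(p) for v, p in enumerate(idx) if round(res.x[v]) == 1]

rng = np.random.default_rng(0)
X1, X2 = rng.uniform(0, 100, (5, 2)), rng.uniform(0, 100, (6, 2))
print(pair_matching(X1, X2, t=30.0, lam=1000.0))
```

For λ large relative to t, this matches as many admissible pairs as possible, in line with the discussion after Definition 3.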
Definition 5. Let (V, E) be a directed graph with a source node S ∈ V, a target node T ∈ V, an edge capacity function \({l}_{E}:E\to {\mathbb{R}}\cup \{\infty \}\) and an edge cost function \({c}_{E}:E\to {\mathbb{R}}\cup \{\infty \}\). Then we call (V, E, cE, lE) a flow network. Given an amount of flow \(m\in {{\mathbb{R}}}_{+}\), the min cost flow problem consists of finding a function \(f:E\to {\mathbb{R}}\) that solves the following optimization problem:$${{{{\rm{min}}}}}_{f}\ {\sum }_{(u,v)\in E}f(u,v)\,{c}_{E}(u,v)\\ \,{{\mbox{s.t.}}}\ 0\le f(u,v)\le {l}_{E}(u,v)\quad \,{{\mbox{for all}}}\,\ (u,v)\in E\quad (\,{{\mbox{capacity constraints}}}\,)\\ \quad {\sum }_{\{u:(u,v)\in E\}}f(u,v)-{\sum }_{\{w:(v,w)\in E\}}f(v,w)=0\quad \,{{\mbox{for all}}}\,\ v\,\ne\, S,T\quad (\,{{\mbox{flow conservation}}}\,)\\ \quad {\sum }_{\{u:(S,u)\in E\}}f(S,u)-{\sum }_{\{v:(v,S)\in E\}}f(v,S)=m\quad (\,{{\mbox{flow source}}}\,)\\ \quad {\sum }_{\{u:(u,T)\in E\}}f(u,T)-{\sum }_{\{v:(T,v)\in E\}}f(T,v)=m\quad (\,{{\mbox{flow sink}}}\,).$$Notably, due to the total unimodularity of the constraint matrix, the min cost flow problem with integer total flow m and integer capacity function lE has an integer solution (Theorem 13.11 in ref. 65). In the following, we recast (7) as a min cost flow problem (see sketch in Supplementary Fig. 1):

Node set V: Define a source node S ∈ V and a target node T ∈ V, and add two nodes \({v}_{l}^{(j)}\) and \({\hat{v}}_{l}^{(j)}\) for each detected particle position \({{{{\bf{x}}}}}_{l}^{(j)}\) in Equation (1).

Edge set E:

Connect the two nodes referring to the same detected point and set the edge costs \({c}_{E}({v}_{l}^{(j)},{\hat{v}}_{l}^{(j)})=-\frac{\lambda }{k}\), where k is the number of point clouds as in (1).

Add all possible edges of the form \(({\hat{v}}^{(j)},{v}^{(j+1)})\in E\) for j = 1, …, k − 1. Set the edge costs$${c}_{E}({\hat{v}}^{(j)},{v}^{(j+1)})=\left\{\begin{array}{ll}\infty ,\hfill &\,{{\mbox{if}}}\,\,{{{\boldsymbol{d}}}}({{{{\boldsymbol{x}}}}}^{(j)},{{{{\boldsymbol{x}}}}}^{(j+1)}) > t,\\ {{{\boldsymbol{d}}}}({{{{\boldsymbol{x}}}}}^{(j)},{{{{\boldsymbol{x}}}}}^{(j+1)}), & \,{{\mbox{otherwise}}}.\,\hfill\end{array}\right.$$

Include the source and target nodes via edges of the form \((S,{v}^{(1)}),({\hat{v}}^{(k)},T),(S,T)\in E\), and set their costs to 0.

Define edge capacities$${l}_{E}({v}_{i},{v}_{j})=\left\{\begin{array}{ll}\infty ,\quad\qquad\,{{\mbox{if}}}\,\,{v}_{i}=S\,\,{{\mbox{and}}}\,\,{v}_{j}=T\\ 1,\quad\qquad\,{{\mbox{otherwise}}}.\,\hfill\end{array}\right.$$

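As an illustration of this construction, the following hedged networkx sketch (node naming and integer cost scaling are our choices, not the MultiMatch implementation) builds the flow network and solves the min cost flow problem:

```python
import networkx as nx
import numpy as np
from scipy.spatial.distance import cdist

def min_cost_flow_matching(clouds, t, lam, scale=1000):
    """Build the flow network (V, E, c_E, l_E) described above for k point
    clouds and solve the min cost flow problem with total flow
    m = min(n_1, ..., n_k)."""
    k = len(clouds)
    m = min(len(X) for X in clouds)
    G = nx.DiGraph()
    G.add_node("S", demand=-m)              # source supplies m units
    G.add_node("T", demand=m)               # sink absorbs m units
    lam_int = int(round(scale * lam / k))   # integer weights keep the
                                            # network simplex exact
    for j, X in enumerate(clouds):
        for l in range(len(X)):
            # internal edge (v_l^(j), v-hat_l^(j)) with cost -lambda/k
            G.add_edge(("v", j, l), ("vh", j, l), capacity=1, weight=-lam_int)
    for j in range(k - 1):
        d = cdist(clouds[j], clouds[j + 1])
        for a, b in np.argwhere(d <= t):    # infinite-cost edges are omitted
            G.add_edge(("vh", j, a), ("v", j + 1, b),
                       capacity=1, weight=int(round(scale * d[a, b])))
    for l in range(len(clouds[0])):
        G.add_edge("S", ("v", 0, l), capacity=1, weight=0)
    for l in range(len(clouds[-1])):
        G.add_edge(("vh", k - 1, l), "T", capacity=1, weight=0)
    G.add_edge("S", "T", weight=0)          # uncapacitated bypass edge (S, T)
    return nx.min_cost_flow(G)              # dict of dicts with edge flows
```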
Proposition 1. Let \(f:E\to {\mathbb{R}}\) be an integer solution of the min cost flow problem for the flow network (V, E, cE, lE) defined above with transported mass \(m=\min ({n}_{1},{n}_{2},{n}_{3})\). Then an optimal solution π* of the multi-marginal optimal unbalanced transport problem (7) is given by$${\pi }_{{i}_{1}{i}_{2}{i}_{3}}^{* }=f({\hat{v}}_{{i}_{1}}^{(1)},{v}_{{i}_{2}}^{(2)})f({\hat{v}}_{{i}_{2}}^{(2)},{v}_{{i}_{3}}^{(3)}),$$
(8)
for i1 ∈ [n1], i2 ∈ [n2] and i3 ∈ [n3], with notation [n] = {1, …, n}.

Proof. First we show that π* as defined in Eq. (8) is in fact a valid transport plan for Eq. (7). For any i3 ∈ [n3] we have, using the conservation constraints,$${\sum}_{{i}_{1}=1}^{{n}_{1}}{\sum}_{{i}_{2}=1}^{{n}_{2}}{\pi }_{{i}_{1}{i}_{2}{i}_{3}}^{* } = {\sum}_{{i}_{2}=1}^{{n}_{2}}f({\hat{v}}_{{i}_{2}}^{(2)},{v}_{{i}_{3}}^{(3)})\left({\sum}_{{i}_{1}=1}^{{n}_{1}}f({\hat{v}}_{{i}_{1}}^{(1)},{v}_{{i}_{2}}^{(2)})\right)\\ \qquad = {\sum}_{{i}_{2}=1}^{{n}_{2}}f({\hat{v}}_{{i}_{2}}^{(2)},{v}_{{i}_{3}}^{(3)})f({v}_{{i}_{2}}^{(2)},{\hat{v}}_{{i}_{2}}^{(2)})\\ \qquad \le {\sum}_{{i}_{2}=1}^{{n}_{2}}f({\hat{v}}_{{i}_{2}}^{(2)},{v}_{{i}_{3}}^{(3)})= f({v}_{{i}_{3}}^{(3)},{\hat{v}}_{{i}_{3}}^{(3)})\le 1.$$Analogously, it is easy to verify that π* satisfies$$\begin{array}{rcl}{\sum}_{{i}_{1}=1}^{{n}_{1}}{\sum }_{{i}_{3}=1}^{{n}_{3}}{\pi}_{{i}_{1}{i}_{2}{i}_{3}}^{* }\le 1&& {{\mbox{for}}} \, {{\mbox{all}}}\,{i}_{2}\in [{n}_{2}]\\ {\sum}_{{i}_{2}=1}^{{n}_{2}}{\sum }_{{i}_{3}=1}^{{n}_{3}}{\pi }_{{i}_{1}{i}_{2}{i}_{3}}^{* }\le 1&&{{\mbox{for}}} \, {{\mbox{all}}} \,{i}_{1}\in [{n}_{1}].\end{array}$$Hence, π* is a feasible solution of (7). Further, since the source node S is directly connected to the target node T by an edge of infinite capacity and finite cost, the total flow cost must be finite. This implies that for any i1 ∈ [n1], i2 ∈ [n2] and i3 ∈ [n3] we have \(f({\hat{v}}_{{i}_{1}}^{(1)},{v}_{{i}_{2}}^{(2)})=0\) if \({{{\boldsymbol{d}}}}({{{{\boldsymbol{x}}}}}_{{i}_{1}}^{(1)},{{{{\boldsymbol{x}}}}}_{{i}_{2}}^{(2)}) \, > \, t\) and \(f({\hat{v}}_{{i}_{2}}^{(2)},{v}_{{i}_{3}}^{(3)})=0\) if \({{{\boldsymbol{d}}}}({{{{\boldsymbol{x}}}}}_{{i}_{2}}^{(2)},{{{{\boldsymbol{x}}}}}_{{i}_{3}}^{(3)}) \, > \, t\). Hence, using the shorthand notation$$\langle {{{\boldsymbol{c}}}},{{\pi }}\rangle ={\sum}_{{i}_{1}=1}^{{n}_{1}}{\sum}_{{i}_{2}=1}^{{n}_{2}}{\sum}_{{i}_{3}=1}^{{n}_{3}}{c}_{{i}_{1}{i}_{2}{i}_{3}}{\pi }_{{i}_{1}{i}_{2}{i}_{3}},$$we can rewrite the total cost of the transport problem as$$\langle {{{\boldsymbol{c}}}},{{{{\boldsymbol{\pi}}}}}^{* }\rangle = {\sum}_{{i}_{1}=1}^{{n}_{1}}{\sum}_{{i}_{2}=1}^{{n}_{2}}{\sum}_{{i}_{3}=1}^{{n}_{3}}\left({{{\boldsymbol{d}}}}({{{{\boldsymbol{x}}}}}_{{i}_{1}}^{(1)},{{{{\boldsymbol{x}}}}}_{{i}_{2}}^{(2)})+{{{\boldsymbol{d}}}}({{{{\boldsymbol{x}}}}}_{{i}_{2}}^{(2)},{{{{\boldsymbol{x}}}}}_{{i}_{3}}^{(3)})-\lambda \right)\\ \cdot f({\hat{v}}_{{i}_{1}}^{(1)},{v}_{{i}_{2}}^{(2)})f({\hat{v}}_{{i}_{2}}^{(2)},{v}_{{i}_{3}}^{(3)}).$$By the flow conservation constraints and the fact that f is an integer solution, we can reformulate the sum above in terms of the network flow cost function to obtain$$\begin{array}{r}\langle {{{\boldsymbol{c}}}},{{{\pi }}}^{* }\rangle ={\sum}_{(u,v)\in E}{c}_{E}(u,v)f(u,v).\end{array}$$Let us now assume that there exists a feasible solution \(\tilde{{{{\boldsymbol{\pi }}}}}\) of (7) such that$$\langle {{{\boldsymbol{c}}}},\tilde{{{{\boldsymbol{\pi }}}}}\rangle \, < \, \langle {{{\boldsymbol{c}}}},{{{{\boldsymbol{\pi }}}}}^{* }\rangle .$$Then we can define the flow \(\tilde{f}:E\to {\mathbb{R}}\) by setting$$\tilde{f}({\hat{v}}_{{i}_{1}}^{(1)},{v}_{{i}_{2}}^{(2)})={\sum}_{{i}_{3}=1}^{{n}_{3}}{\tilde{\pi }}_{{i}_{1}{i}_{2}{i}_{3}}\quad\,{{\mbox{and}}}\,\quad\tilde{f}({\hat{v}}_{{i}_{2}}^{(2)},{v}_{{i}_{3}}^{(3)})={\sum}_{{i}_{1}=1}^{{n}_{1}}{\tilde{\pi }}_{{i}_{1}{i}_{2}{i}_{3}},$$for i1 ∈ [n1], i2 ∈ [n2] and i3 ∈ [n3].
The value of the flow on the remaining edges of E can then be determined by the conservation constraints. In particular, we have \(\tilde{f}(S,T)={{{\rm{min}}}} \{{n}_{1},{n}_{2},{n}_{3}\}-{\sum }_{{i}_{1} = 1}^{{n}_{1}}{\sum }_{{i}_{2} = 1}^{{n}_{2}}{\sum }_{{i}_{3} = 1}^{{n}_{3}}{\tilde{\pi }}_{{i}_{1}{i}_{2}{i}_{3}}\). This flow is a feasible solution of the given min cost flow problem and hence, by the definition of the edge cost function, we can derive a contradiction:$${\sum}_{(u,v)\in E}{c}_{E}(u,v)\tilde{f}(u,v)=\langle {{{\boldsymbol{c}}}},\tilde{{{\pi }}}\rangle\, < \,{\sum}_{(u,v)\in E}{c}_{E}(u,v)f(u,v). \quad\quad\quad \Box$$As a result of Proposition 1, we immediately obtain that the multi-marginal optimal unbalanced transport problem Eq. (7) has an integer solution and hence provides one-to-one point matchings.

Another significant consequence of Proposition 1 is that we can solve the unbalanced optimal transport problem in Eq. (7) efficiently. While it is often infeasible to directly compute the solution of the n1 ⋅ n2 ⋯ nk-dimensional linear programming problem in Eq. (7), the min cost flow problem can be solved by the Scaling Minimum-Cost Flow Algorithm of ref. 66 in \(O(| V{| }^{2}| E| {{{\rm{log}}}} (| V| ))\) elementary operations, where ∣V∣ is the number of nodes and ∣E∣ the number of edges. In our case the number of nodes is of the order \(O({n}_{1}+\cdots +{n}_{k})\) and the number of edges can be upper bounded by an expression of the order \(O({n}_{1}{n}_{2}+{n}_{2}{n}_{3}+\cdots +{n}_{k-1}{n}_{k})\). In practice, it is further possible to omit all edges with infinite cost, since the source S and the sink T are connected through an edge of cost 0 and infinite capacity. This implies that for small t far fewer edges are added to the network, which results in better computational performance. For an image containing around 1,000 points in each color channel, a solution of the min cost flow problem can be computed for about 10 different values of t in ~1 s on a standard laptop.

Estimating the true chain-like particle abundances

The quality of fluorescence microscopy suffers from non-optimal labeling efficiencies and point detection errors. We address this with a statistical framework to infer how many of the detected structures in the image actually concur with the ground truth biological structure and how many detections represent only incomplete parts of the underlying particle assembly. For color channels i ∈ {1, …, k} let$${\left\{{\xi }_{j}^{(i)}\right\}}_{j = 1}^{{n}_{i}}\subset {{\mathbb{R}}}^{2}$$
(9)
be the coordinates of all particles that lie within the scope of the microscope. Note that these point clouds do not necessarily equal those defined in Eq. (1), which describe the coordinates of detected particles, since we might not be able to measure all of the existing particles due to unsuccessful labeling or point detection errors.

Definition 6 (Labeling efficiency). For each color channel i ∈ {1, …, k} we assume that there is a specific probability si ∈ (0, 1] quantifying whether a particle of this channel is successfully imaged and detected. For simplicity, in the following we will call the probabilities si labeling efficiencies.

We further assume that the random event of successful detection is statistically independent for each point. Accordingly, the detection success can be described by independent Bernoulli variables$${\left\{{Z}_{j}^{(i)}\right\}}_{j = 1}^{{n}_{i}} \sim \,\,{\mbox{Ber}}\,({s}_{i}),$$
(10)
where si ∈ (0, 1] and \({\xi }_{j}^{(i)}\) is detectable if and only if \({Z}_{j}^{(i)}=1\). If there exists a true \({{{{\bf{d}}}}}_{t}^{k}\)-chain of the form \(({\xi }^{(1)},\ldots ,{\xi }^{(k)})\), then it can only be correctly identified as such if each of the included particles was detected, i.e., if and only if \({\prod }_{i = 1}^{k}{Z}^{(i)}=1\). From independence it follows that$${\prod}_{i=1}^{k}{Z}^{(i)} \sim \,{{\mbox{Ber}}}\,\left({\prod }_{i=1}^{k}{s}_{i}\right).$$
(11)
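As a quick numerical check of Eqs. (10) and (11), here is a minimal sketch with arbitrary example efficiencies:

```python
import numpy as np

rng = np.random.default_rng(1)
s = {"A": 0.8, "B": 0.9, "C": 0.7}   # example labeling efficiencies
n_trip = 100_000                     # number of true ABC triplets

# Independent Bernoulli detection indicators per channel, Eq. (10)
Z = {ch: rng.random(n_trip) < s[ch] for ch in "ABC"}

# A triplet is identified as a triplet only if all three particles are
# detected, so the indicator is the product of the Z's, Eq. (11).
detected = Z["A"] & Z["B"] & Z["C"]
print(detected.mean(), s["A"] * s["B"] * s["C"])   # both close to 0.504
```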
Detecting an ABC triplet correctly is therefore Ber(sAsBsC)-distributed. All possible substructures that can be detected conditioned on the true underlying ABC triplet, i.e.,

1.

ABC triplet, if we see all particles

2.

AB pair, if we do not see C

3.

BC pair, if we do not see A

4.

AC substructure, if we do not see B (this is detected as A and C singlets)

5.

A singlet, if we do not see B and C

6.

B singlet, if we do not see A and C

7.

C singlet, if we do not see A and B

8.

\({{\emptyset}}\), if we do not see A, B, and C, which cannot be detected at all,

can accordingly be modeled as a multinomial random variable$${{{{\boldsymbol{W}}}}}_{\cdot | ABC}=\left[\begin{array}{c}{W}_{ABC| ABC} \\ {W}_{AB| ABC} \\ {W}_{BC| ABC} \\ {W}_{AC| ABC} \\ {W}_{A| ABC} \\ {W}_{B| ABC}\\ {W}_{C| ABC} \\ {W}_{{{\emptyset}}| ABC}\end{array}\right].$$
(12)
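The corresponding multinomial probability vector is made explicit in Eqs. (17) and (18) of the proof below; as a small numpy illustration (example efficiencies are arbitrary), W⋅∣ABC can be sampled directly:

```python
import numpy as np

def p_ABC(sA, sB, sC):
    """Probabilities of the eight detectable substructures of a true ABC
    triplet, ordered as in Eq. (12): ABC, AB, BC, AC, A, B, C, empty."""
    return np.array([
        sA * sB * sC,
        sA * sB * (1 - sC),
        (1 - sA) * sB * sC,
        sA * (1 - sB) * sC,
        sA * (1 - sB) * (1 - sC),
        (1 - sA) * sB * (1 - sC),
        (1 - sA) * (1 - sB) * sC,
        (1 - sA) * (1 - sB) * (1 - sC),
    ])

rng = np.random.default_rng(2)
W_given_ABC = rng.multinomial(1000, p_ABC(0.8, 0.9, 0.7))
print(W_given_ABC, W_given_ABC.sum())   # counts over 1000 true triplets
```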
This can be done in the same manner for all other structures of interest, i.e., true AB and BC pairs and A, B, and C singlets (and their respective substructures), yielding the random variables W⋅∣AB, W⋅∣BC, W⋅∣A, W⋅∣B, W⋅∣C. The actual detectable numbers of those structures are$${W}_{ABC} = \sum {W}_{ABC| \cdot },\\ {W}_{AB} = \sum {W}_{AB| \cdot },\\ {W}_{BC} = \sum {W}_{BC| \cdot },\\ {W}_{A} = \sum {W}_{A| \cdot }+ \sum {W}_{AC| \cdot }, \\ {W}_{B} = \sum {W}_{B| \cdot },\\ {W}_{C} = \sum {W}_{C| \cdot } + \sum {W}_{AC| \cdot },$$
(13)
which define a random variable \({{{\boldsymbol{W}}}}={({W}_{ABC},{W}_{AB},{W}_{BC},{W}_{A},{W}_{B},{W}_{C})}^{T}\). This leads to a statistical framework that allows us to estimate the true underlying structure abundances from the detected numbers of structures.

Theorem 2. Let sA > 0, sB > 0 and sC > 0 be known, positive labeling efficiencies, let \({{{\boldsymbol{n}}}}={({n}_{ABC},{n}_{AB},{n}_{BC},{n}_{A},{n}_{B},{n}_{C})}^{T}\) be the unknown structure abundances, and define N = ∑i∈{ABC, …, C}ni. Assume the multinomial model described in Equation (12) and Equation (13).

Part 1: An unbiased estimator \(\hat{{{{\boldsymbol{n}}}}}\) of the true abundances n is given by$$\hat{{{{\boldsymbol{n}}}}}=\left[\begin{array}{cccccc}\frac{1}{{s}_{A}{s}_{B}{s}_{C}}&0&0&0&0&0\\ \frac{{s}_{C}-1}{{s}_{A}{s}_{B}{s}_{C}}&\frac{1}{{s}_{A}{s}_{B}}&0&0&0&0\\ \frac{{s}_{A}-1}{{s}_{A}{s}_{B}{s}_{C}}&0&\frac{1}{{s}_{B}{s}_{C}}&0&0&0\\ \frac{{s}_{B}-1}{{s}_{A}{s}_{B}}&\frac{{s}_{B}-1}{{s}_{A}{s}_{B}}&0&\frac{1}{{s}_{A}}&0&0\\ \frac{(1-{s}_{A})(1-{s}_{C})}{{s}_{A}{s}_{B}{s}_{C}}&\frac{{s}_{A}-1}{{s}_{A}{s}_{B}}&\frac{{s}_{C}-1}{{s}_{B}{s}_{C}}&0&\frac{1}{{s}_{B}}&0\\ \frac{{s}_{B}-1}{{s}_{B}{s}_{C}}&0&\frac{{s}_{B}-1}{{s}_{B}{s}_{C}}&0&0&\frac{1}{{s}_{C}}\end{array}\right]{{{\boldsymbol{W}}}}.$$
(14)
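As a numerical illustration of Part 1 (a sketch; helper names and example values are ours), \(\hat{{{{\boldsymbol{n}}}}}\) can be computed by assembling Θ and μ as in the proof below instead of hard-coding the matrix in (14):

```python
import numpy as np

def estimate_abundances(W, sA, sB, sC):
    """Evaluate the unbiased estimator (14) as (Theta mu)^{-1} W;
    W = (W_ABC, W_AB, W_BC, W_A, W_B, W_C)."""
    def p(sa, sb, sc):
        # substructure probabilities, ordered ABC, AB, BC, AC, A, B, C, empty
        return np.array([sa*sb*sc, sa*sb*(1-sc), (1-sa)*sb*sc, sa*(1-sb)*sc,
                         sa*(1-sb)*(1-sc), (1-sa)*sb*(1-sc),
                         (1-sa)*(1-sb)*sc, (1-sa)*(1-sb)*(1-sc)])
    # Columns p_ABC, p_AB, p_BC, p_A, p_B, p_C of mu, Eq. (27); setting an
    # efficiency to 0 removes the corresponding channel from the structure.
    mu = np.column_stack([p(sA, sB, sC), p(sA, sB, 0), p(0, sB, sC),
                          p(sA, 0, 0), p(0, sB, 0), p(0, 0, sC)])
    Theta = np.array([[1, 0, 0, 0, 0, 0, 0, 0],    # W_ABC
                      [0, 1, 0, 0, 0, 0, 0, 0],    # W_AB
                      [0, 0, 1, 0, 0, 0, 0, 0],    # W_BC
                      [0, 0, 0, 1, 1, 0, 0, 0],    # W_A (AC adds an A singlet)
                      [0, 0, 0, 0, 0, 1, 0, 0],    # W_B
                      [0, 0, 0, 1, 0, 0, 1, 0]],   # W_C (AC adds a C singlet)
                     dtype=float)                  # Eq. (25)
    return np.linalg.solve(Theta @ mu, np.asarray(W, dtype=float))

# Expected counts of 1000 true ABC triplets at s = (0.8, 0.9, 0.7)
print(estimate_abundances([504, 216, 126, 80, 54, 70], 0.8, 0.9, 0.7))
# -> approximately [1000, 0, 0, 0, 0, 0]
```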
Part 2: For n → ∞ entrywise with nj/N → fj, where ∞ > fj > 0 is constant for each j ∈ {ABC, …, C}, and \(\Theta \Sigma (\hat{{{{\boldsymbol{n}}}}}){\Theta }^{T}\) invertible, it holds that$$P\left({{\Xi }}\le {\chi }_{6,\alpha }^{2}\right)\le 1-\alpha ,$$
(15)
where$${{\Xi }}={(\hat{{{{\boldsymbol{n}}}}}-{{{\boldsymbol{n}}}})}^{T}{(\Theta \mu )}^{T}{\left(\Theta \Sigma (\hat{{{{\boldsymbol{n}}}}}){\Theta }^{T}\right)}^{-1}(\Theta \mu )(\hat{{{{\boldsymbol{n}}}}}-{{{\boldsymbol{n}}}})$$
(16)
and \({\chi }_{6,\alpha }^{2}\) is the α-quantile of a chi-squared distribution with 6 degrees of freedom, with Θ, μ and \(\Sigma (\hat{{{{\boldsymbol{n}}}}})\) defined as in the following proof. If \({\left(\Theta \Sigma (\hat{{{{\boldsymbol{n}}}}}){\Theta }^{T}\right)}^{-1}\) does not exist, Equation (15) holds with \({\chi }_{r,\alpha }^{2}\) when the pseudoinverse \({\left(\Theta \Sigma (\hat{{{{\boldsymbol{n}}}}}){\Theta }^{T}\right)}^{+}\) is plugged into Equation (16), where \(r={{{\rm{rank}}}}\left(\Theta \Sigma (\hat{{{{\boldsymbol{n}}}}}){\Theta }^{T}\right)\).

Proof. Part 1: Conditioned on a true ABC triplet, the number of (mis)specifications resulting from incomplete labeling efficiencies is multinomially distributed:$${{{{\boldsymbol{W}}}}}_{\cdot | ABC}=\left[\begin{array}{c}{W}_{ABC| ABC}\\ {W}_{AB| ABC}\\ {W}_{BC| ABC}\\ {W}_{AC| ABC}\\ {W}_{A| ABC}\\ {W}_{B| ABC}\\ {W}_{C| ABC}\\ {W}_{{{\emptyset}}| ABC}\end{array}\right] \sim \,{\mbox{Mnom}}\,({n}_{ABC},{{{{\boldsymbol{p}}}}}_{ABC})$$
(17)
with probability vector$${{{{\boldsymbol{p}}}}}_{ABC}=\left[\begin{array}{c}{s}_{A}{s}_{B}{s}_{C}\\ {s}_{A}{s}_{B}(1-{s}_{C})\\ (1-{s}_{A}){s}_{B}{s}_{C}\\ {s}_{A}(1-{s}_{B}){s}_{C}\\ {s}_{A}(1-{s}_{B})(1-{s}_{C})\\ (1-{s}_{A}){s}_{B}(1-{s}_{C})\\ (1-{s}_{A})(1-{s}_{B}){s}_{C}\\ (1-{s}_{A})(1-{s}_{B})(1-{s}_{C})\end{array}\right],$$
(18)
where \({\sum}_{j = 1}^{8}{{{{\boldsymbol{p}}}}}_{ABC}[j]=1\). Accordingly, the abundances of (mis)detections of a true AB pair are$$\left[\begin{array}{c}{W}_{ABC| AB}\\ {W}_{AB| AB}\\ {W}_{BC| AB}\\ {W}_{AC| AB}\\ {W}_{A| AB}\\ {W}_{B| AB}\\ {W}_{C| AB}\\ {W}_{{{\emptyset}}| AB}\end{array}\right] \sim \,{\mbox{Mnom}}\,({n}_{AB},{{{{\boldsymbol{p}}}}}_{AB})$$
(19)
with$${{{{\boldsymbol{p}}}}}_{AB}=\left[\begin{array}{c}0\\ {s}_{A}{s}_{B}\\ 0\\ 0\\ {s}_{A}(1-{s}_{B})\\ (1-{s}_{A}){s}_{B}\\ 0\\ (1-{s}_{A})(1-{s}_{B})\end{array}\right].$$
(20)
This can be done accordingly for all other structures of interest, i.e., BC pairs and A, B, and C singlets, yielding$$\begin{array}{rcl}{{{{\boldsymbol{W}}}}}_{\cdot | ABC} &\sim& \,{\mbox{Mnom}}\,({n}_{ABC},{{{{\boldsymbol{p}}}}}_{ABC})\\ {{{{\boldsymbol{W}}}}}_{\cdot | AB} &\sim& \,{\mbox{Mnom}}\,({n}_{AB},{{{{\boldsymbol{p}}}}}_{AB})\\ {{{{\boldsymbol{W}}}}}_{\cdot | BC} &\sim& \,{\mbox{Mnom}}\,({n}_{BC},{{{{\boldsymbol{p}}}}}_{BC})\\ {{{{\boldsymbol{W}}}}}_{\cdot | A} &\sim& \,{\mbox{Mnom}}\,({n}_{A},{{{{\boldsymbol{p}}}}}_{A})\\ {{{{\boldsymbol{W}}}}}_{\cdot | B} &\sim& \,{\mbox{Mnom}}\,({n}_{B},{{{{\boldsymbol{p}}}}}_{B})\\ {{{{\boldsymbol{W}}}}}_{\cdot | C} &\sim&\,{\mbox{Mnom}}\,({n}_{C},{{{{\boldsymbol{p}}}}}_{C})\end{array}$$
(21)
with$${{{\boldsymbol{p}}}}_{BC} = \left[\begin{array}{c} 0 \\ 0 \\ s_{B} s_{C} \\ 0 \\ 0 \\ s_{B} (1-s_{C}) \\ (1-s_{B}) s_{C} \\ (1-s_{B}) (1-s_{C})\end{array}\right], \, {{{\boldsymbol{p}}}}_{A} = \left[\begin{array}{c} 0 \\ 0 \\ 0 \\ 0 \\ s_A \\ 0 \\ 0 \\ (1-s_{A}) \end{array}\right], {{{\boldsymbol{p}}}}_{B} = \left[\begin{array}{c} 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ s_B \\ 0 \\ (1-s_{B}) \end{array}\right],\, {{{\boldsymbol{p}}}}_{C} = \left[\begin{array}{c} 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ s_C \\ (1-s_{C})\end{array}\right].$$
(22)
Note that \({{\emptyset}}\) cannot be detected at all and that the substructure AC is counted as separate A and C singlets. Hence, the total numbers of detected triplets, pairs and singlets are defined as the following sums$${W}_{ABC} = {W}_{ABC| ABC}\\ {W}_{AB} = {W}_{AB| ABC}+{W}_{AB| AB}\\ {W}_{BC} = {W}_{BC| ABC}+{W}_{BC| BC}\\ {W}_{A} = {W}_{A| ABC}+{W}_{AC| ABC}+{W}_{A| AB}+{W}_{A| A}\\ {W}_{B} = {W}_{B| ABC}+{W}_{B| AB}+{W}_{B| BC}+{W}_{B| B}\\ {W}_{C} = {W}_{C| ABC}+{W}_{AC| ABC}+{W}_{C| BC}+{W}_{C| C}.$$
(23)
This can be rewritten as$${{{\boldsymbol{W}}}} = \left[\begin{array}{c}{W}_{ABC} \\ {W}_{AB}\\ {W}_{BC} \\ {W}_{A}\\ {W}_{B} \\ {W}_{C}\end{array}\right]=\Theta \left({{{{\boldsymbol{W}}}}}_{\cdot | ABC}+{{{{\boldsymbol{W}}}}}_{\cdot | AB}+{{{{\boldsymbol{W}}}}}_{\cdot | BC}+{{{{\boldsymbol{W}}}}}_{\cdot | A}+{{{{\boldsymbol{W}}}}}_{\cdot | B}+{{{{\boldsymbol{W}}}}}_{\cdot | C}\right),$$
(24)
using the transformation matrix$$\Theta =\left[\begin{array}{cccccccc}1&0&0&0&0&0&0&0\\ 0&1&0&0&0&0&0&0\\ 0&0&1&0&0&0&0&0\\ 0&0&0&1&1&0&0&0\\ 0&0&0&0&0&1&0&0\\ 0&0&0&1&0&0&1&0\\ \end{array}\right]\in {{\mathbb{R}}}^{6\times 8}.$$
(25)
With this definition of Θ we delete the last entry in each multinomially distributed vector and add an AC substructure appearance to the singlet detections A and C. By Eq. (24) we get that$${\mathbb{E}}[{{{\boldsymbol{W}}}}]=\Theta \mu {{{\boldsymbol{n}}}}$$
(26)
with$$\mu =\left[\begin{array}{cccccc}{{{{\boldsymbol{p}}}}}_{ABC}&{{{{\boldsymbol{p}}}}}_{AB}&{{{{\boldsymbol{p}}}}}_{BC}&{{{{\boldsymbol{p}}}}}_{A}&{{{{\boldsymbol{p}}}}}_{B}&{{{{\boldsymbol{p}}}}}_{C}\end{array}\right]\in {{\mathbb{R}}}^{8\times 6}.$$
(27)
Hence, with positive labeling efficiencies sA > 0, sB > 0 and sC > 0, multiplying$${(\Theta \mu )}^{-1}=\left[\begin{array}{cccccc}\frac{1}{{s}_{A}{s}_{B}{s}_{C}}&0&0&0&0&0\\ \frac{{s}_{C}-1}{{s}_{A}{s}_{B}{s}_{C}}&\frac{1}{{s}_{A}{s}_{B}}&0&0&0&0\\ \frac{{s}_{A}-1}{{s}_{A}{s}_{B}{s}_{C}}&0&\frac{1}{{s}_{B}{s}_{C}}&0&0&0\\ \frac{{s}_{B}-1}{{s}_{A}{s}_{B}}&\frac{{s}_{B}-1}{{s}_{A}{s}_{B}}&0&\frac{1}{{s}_{A}}&0&0\\ \frac{(1-{s}_{A})(1-{s}_{C})}{{s}_{A}{s}_{B}{s}_{C}}&\frac{{s}_{A}-1}{{s}_{A}{s}_{B}}&\frac{{s}_{C}-1}{{s}_{B}{s}_{C}}&0&\frac{1}{{s}_{B}}&0\\ \frac{{s}_{B}-1}{{s}_{B}{s}_{C}}&0&\frac{{s}_{B}-1}{{s}_{B}{s}_{C}}&0&0&\frac{1}{{s}_{C}}\end{array}\right]$$
(28)
with W yields the unbiased estimator \(\hat{{{{\boldsymbol{n}}}}}\).

Part 2: We utilize that, by the central limit theorem, for a multinomially distributed random variable M ~ Mnom(m, p) with probability vector \({{{\boldsymbol{p}}}}={({p}_{1},{p}_{2},\ldots ,{p}_{k})}^{T}\),$$\frac{1}{\sqrt{m}}\left({{{\boldsymbol{M}}}}-m{{{\boldsymbol{p}}}}\right)\to ^{{{\mathcal{D}}}}{{{{\mathcal{N}}}}}_{k}\left({{{{\boldsymbol{0}}}}}_{k},\,{{\mbox{diag}}}\,({{{\boldsymbol{p}}}})-{{{\boldsymbol{p}}}}{{{{\boldsymbol{p}}}}}^{T}\right)\quad \,{\mbox{for}}\,\quad m\to \infty ,$$
(29)
where$$\,{\mbox{diag}}\,({{{\boldsymbol{p}}}})=\left[\begin{array}{ccc}{p}_{1}&0&\cdots \\ 0&{p}_{2}&\cdots \\ \vdots &\vdots &{p}_{k}\end{array}\right]$$
(30)
and \({{{{\boldsymbol{0}}}}}_{k}={(0,\ldots ,0)}^{T}\in {{\mathbb{R}}}^{k}\) (see, e.g., ref. 67). Hence, for n entrywise large enough, we can approximate the properly scaled independent multinomial random vectors$${{{{\boldsymbol{W}}}}}_{\cdot | ABC},\,{{{{\boldsymbol{W}}}}}_{\cdot | AB},\,{{{{\boldsymbol{W}}}}}_{\cdot | BC},\,{{{{\boldsymbol{W}}}}}_{\cdot | A},\,{{{{\boldsymbol{W}}}}}_{\cdot | B},\,{{{{\boldsymbol{W}}}}}_{\cdot | C}$$
(31)
with multi-dimensional normal distributions, respectively. In the following assume n → ∞ entrywise and nj/N → fj with ∞ > fj > 0 constant for each j ∈ {ABC, …, C}, where N = ∑i∈{ABC, …, C}ni. Then, it holds that$$\begin{array}{ll}&\mathop{\sum}_{i\in \{ABC,\ldots ,C\}}\sqrt{\frac{{n}_{i}}{N}}\frac{1}{\sqrt{{n}_{i}}}\left({{{{\boldsymbol{W}}}}}_{\cdot | i}-{n}_{i}{{{{\boldsymbol{p}}}}}_{i}\right)=\sqrt{\frac{1}{N}}{\sum}_{i\in \{ABC,\ldots ,C\}}\left({{{{\boldsymbol{W}}}}}_{\cdot | i}-{n}_{i}{{{{\boldsymbol{p}}}}}_{i}\right)\\ &{\to} ^{{{\mathcal{D}}}}{{{{\mathcal{N}}}}}_{8}\left({{{{\boldsymbol{0}}}}}_{8},{\sum}_{i\in \{ABC,\ldots ,C\}}{f}_{i}\left(\,{{\mbox{diag}}}\,({{{{\boldsymbol{p}}}}}_{i})-{{{{\boldsymbol{p}}}}}_{i}{{{{\boldsymbol{p}}}}}_{i}^{T}\right)\right).\end{array}$$
(32)
For now, suppose \(\sum {n}_{i}\left(\,{\mbox{diag}}\,({{{{\boldsymbol{p}}}}}_{i})-{{{{\boldsymbol{p}}}}}_{i}{{{{\boldsymbol{p}}}}}_{i}^{T}\right)\) is invertible. Then in the limit$$ {\left(\sum {f}_{i}\left({\mbox{diag}}({{{{\boldsymbol{p}}}}}_{i})-{{{{\boldsymbol{p}}}}}_{i}{{{{\boldsymbol{p}}}}}_{i}^{T}\right)\right)}^{-1/2}\sqrt{\frac{1}{N}}\sum \left({{{{\boldsymbol{W}}}}}_{\cdot | i}-{n}_{i}{{{{\boldsymbol{p}}}}}_{i}\right)\\ = {\left(\sum N{f}_{i}\left({\mbox{diag}}({{{{\boldsymbol{p}}}}}_{i})-{{{{\boldsymbol{p}}}}}_{i}{{{{\boldsymbol{p}}}}}_{i}^{T}\right)\right)}^{-1/2}\sum \left({{{{\boldsymbol{W}}}}}_{\cdot | i}-{n}_{i}{{{{\boldsymbol{p}}}}}_{i}\right)\\ = {\left(\sum {n}_{i}\left({\mbox{diag}}({{{{\boldsymbol{p}}}}}_{i})-{{{{\boldsymbol{p}}}}}_{i}{{{{\boldsymbol{p}}}}}_{i}^{T}\right)\right)}^{-1/2}\sum \left({{{{\boldsymbol{W}}}}}_{\cdot | i}-{n}_{i}{{{{\boldsymbol{p}}}}}_{i}\right)$$
(33)
and hence$${\left(\sum {n}_{i}\left({\mbox{diag}}({{{{\boldsymbol{p}}}}}_{i})-{{{{\boldsymbol{p}}}}}_{i}{{{{\boldsymbol{p}}}}}_{i}^{T}\right)\right)}^{-1/2}\sum \left({{{{\boldsymbol{W}}}}}_{\cdot | i}-{n}_{i}{{{{\boldsymbol{p}}}}}_{i}\right) {\to} ^{{{\mathcal{D}}}} {{{{\mathcal{N}}}}}_{8}\left({{{{\boldsymbol{0}}}}}_{8},{I}_{8\times 8}\right),$$
(34)
where I8×8 is the 8-dimensional identity matrix. In the following we denote$$\Sigma ({{{\boldsymbol{n}}}})=\left(\sum {n}_{i}\left(\,{\mbox{diag}}\,({{{{\boldsymbol{p}}}}}_{i})-{{{{\boldsymbol{p}}}}}_{i}{{{{\boldsymbol{p}}}}}_{i}^{T}\right)\right).$$
(35)
Multiplying (Θμ)−1Θ with Eq. (32) consequently yields$$\begin{array}{l}{\left({(\Theta \mu )}^{-1}\Theta \Sigma ({{{\boldsymbol{n}}}}){\Theta }^{T}{\left({(\Theta \mu )}^{-1}\right)}^{T}\right)}^{-1/2}\left({(\Theta \mu )}^{-1}\Theta \sum {{{{\boldsymbol{W}}}}}_{\cdot | i}-{(\Theta \mu )}^{-1}\Theta \sum {n}_{i}{{{{\boldsymbol{p}}}}}_{i}\right)\\ ={\left({(\Theta \mu )}^{-1}\Theta \Sigma ({{{\boldsymbol{n}}}}){\Theta }^{T}{\left({(\Theta \mu )}^{-1}\right)}^{T}\right)}^{-1/2}\left(\hat{{{{\boldsymbol{n}}}}}-{{{\boldsymbol{n}}}}\right){\to} ^{{{\mathcal{D}}}}{{{{\mathcal{N}}}}}_{6}\left({{{{\boldsymbol{0}}}}}_{6},{I}_{6\times 6}\right)\end{array}$$
(36)
with \(\hat{{{{\boldsymbol{n}}}}}={(\Theta \mu )}^{-1}\Theta \sum {{{{\boldsymbol{W}}}}}_{\cdot | i}\) and n = (Θμ)−1Θμn = (Θμ)−1Θ∑nipi. By the law of large numbers, it holds that$$\frac{1}{N}\left(\hat{{{{\boldsymbol{n}}}}}-{{{\boldsymbol{n}}}}\right)=\frac{\hat{{{{\boldsymbol{n}}}}}}{N}-\frac{{{{\boldsymbol{n}}}}}{N} {\to} ^{{{\mathcal{P}}}}{{{{\boldsymbol{0}}}}}_{6}$$
(37)
and hence for all j ∈ {ABC, …, C}$$\frac{{\hat{n}}_{j}}{N}\to^ {{{\mathcal{P}}}}{f}_{j}.$$
(38)
By Slutsky’s Lemma we can use Eq. (38) to replace n in Σ(n) with \(\hat{{{{\boldsymbol{n}}}}}\). For n → ∞ entrywise this yields$${{\Xi }}={(\hat{{{{\boldsymbol{n}}}}}-{{{\boldsymbol{n}}}})}^{T}{(\Theta \mu )}^{T}{\left(\Theta \Sigma (\hat{{{{\boldsymbol{n}}}}}){\Theta }^{T}\right)}^{-1}(\Theta \mu )(\hat{{{{\boldsymbol{n}}}}}-{{{\boldsymbol{n}}}}) {\to} ^{{{\mathcal{D}}}}{\chi }_{6}^{2}.$$
(39)
In case \(\Theta \Sigma (\hat{{{{\boldsymbol{n}}}}}){\Theta }^{T}\) is not invertible, one can use its pseudoinverse, yielding convergence to a chi-squared distribution with r degrees of freedom, i.e., \({\chi }_{r}^{2}\) in Equation (39), where \(r={{{\rm{rank}}}}\left(\Theta \Sigma (\hat{{{{\boldsymbol{n}}}}}){\Theta }^{T}\right)\). □

With Part 2 of Theorem 2 we can construct a confidence ellipsoid around \(\hat{{{{\boldsymbol{n}}}}}\) in a straightforward manner. To show that Ξ in our setting is approximately chi-squared distributed for finite sample sizes, and to compare simulated and theoretical coverages of \(\hat{{{{\boldsymbol{n}}}}}\), we performed a simulation study as described in the following section.

Simulation study setup

In the first simulation study, a predefined number of triplets, pairs, and singlets is generated as follows (a short code sketch of these steps is given after the list):

Step 1: Draw the coordinate for channel B as \(b \sim {{{\mathcal{U}}}}({[0,400\cdot r]}^{2})\), where \(\,{{{\mathcal{U}}}}\) is the continuous uniform distribution.

Step 2a: Draw an angle \(\alpha \sim {{{\mathcal{U}}}}[0,2\pi ]\) and a normally distributed distance \({d}_{A} \sim {{{\mathcal{N}}}}(t,0.5)\). Set \(a=b+{d}_{A}{\left(\cos (\alpha ),\sin (\alpha )\right)}^{T}\).

Step 2b: Draw \(\epsilon \sim {{{\mathcal{N}}}}(0,0.2)\) and set the angle β = α + π + ϵ. Draw \({d}_{C} \sim {{{\mathcal{N}}}}(t,0.5)\) and set \(c=b+{d}_{C}{\left(\cos (\beta ),\sin (\beta )\right)}^{T}\).

Step 3: Round a, b and c to match the pixel grid \({\{0,\ldots ,400\}}^{2}\subseteq {{\mathbb{N}}}_{\ge 0}^{2}\).

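A minimal numpy sketch of Steps 1-3 (assumptions on our part: r = 1, the second parameter of \({{{\mathcal{N}}}}\) read as a standard deviation, and clipping to the image border, none of which is fixed by the text):

```python
import numpy as np

rng = np.random.default_rng(3)
t = 70.0    # colocalization threshold (nm)
r = 1.0     # assumed scale factor from Step 1

def simulate_triplet():
    """One approximately linear ABC triplet following Steps 1-3."""
    b = rng.uniform(0, 400 * r, size=2)                   # Step 1
    alpha = rng.uniform(0, 2 * np.pi)                     # Step 2a
    d_A = rng.normal(t, 0.5)
    a = b + d_A * np.array([np.cos(alpha), np.sin(alpha)])
    eps = rng.normal(0, 0.2)                              # Step 2b
    beta = alpha + np.pi + eps                            # roughly opposite side of b
    d_C = rng.normal(t, 0.5)
    c = b + d_C * np.array([np.cos(beta), np.sin(beta)])
    # Step 3: round onto the pixel grid (clipping is our addition)
    return [np.clip(np.round(p), 0, 400).astype(int) for p in (a, b, c)]

print(simulate_triplet())
```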
This design favors triplets with an approximately linear structure. Pairs are simulated by skipping either Step 2a or 2b. Singlets are drawn as in Step 1. In the second simulation study, quadruples, triplets, pairs, and singlets are generated similarly, but Step 2b is replaced and a Step 2c is added:

Step 2b: Draw an angle \(\beta \sim {{{\mathcal{U}}}}[0,2\pi ]\) and \({d}_{C} \sim {{{\mathcal{N}}}}(t,0.5)\) and set \(c=b+{d}_{C}{\left(\cos (\beta ),\sin (\beta )\right)}^{T}\).

Step 2c: Draw an angle \(\gamma \sim {{{\mathcal{U}}}}[0,2\pi ]\) and \({d}_{D} \sim {{{\mathcal{N}}}}(t,0.5)\) and set \(d=c+{d}_{D}{\left(\cos (\gamma ),\sin (\gamma )\right)}^{T}\).

This simulation setup allows arbitrarily curved chain structures. The distance threshold is always fixed to t = 70 nm.

To obtain intensity images close to an experimental STED setup from the simulated point sets, we followed the simulation setup introduced in Tameling et al.17 to mimic experimental STED images of 400 × 400 pixels with a full-width at half-maximum (FWHM) value of 40 nm (approximately the resolution of the STED microscope) and a pixel size of 25 nm (1 pixel = 25 nm). In the second simulation study (including quadruples) the Poisson noise level was on average increased by a factor of 10.

Methods included in the simulation study

For the Ripley’s K based Statistical Object Distance Analysis (SODA)21 we used the triplet colocalization protocol SODA 3 Colors in ICY68 (version 2.4.0.0). For the analysis we used default input parameters and set the scale threshold per channel to 100. The plugin BlobProb15 was called in ImageJ/Fiji69 (version 2.3.0/1.53q) and the numbers of colocalized blobs were considered. We set the voxel size to 25 nm in every dimension and the threshold per channel to 100. ConditionalColoc18 from GitHub (https://github.com/kjaqaman/ConditionalColoc) was executed in MATLAB (version R2023a). Particles were detected using the “point-source detection” algorithm provided via the integrated u-track package (https://github.com/DanuserLab/u-track).

For all implementations but ConditionalColoc the detected chain-structure abundances were output as integers. Therefore, we scaled the abundances, i.e., divided them by the total number of particles detected in channel B. ConditionalColoc already aims to output probabilities that are scaled by the number of detected particles per channel, hence no further transformation of the output was performed by us. Since for all simulated Scenarios the same number of particles was generated in every channel, we ensured that both scaling procedures are comparable. The maximal colocalization threshold is set to t = 5 pixels = 125 nm throughout all considered methods.

Nanoruler samples

Custom-made DNA nanoruler samples featuring one, two, or three fluorophore spots, each consisting of 20 fluorophores (Alexa Fluor 488, Alexa Fluor 594, Star Red), with a distance of 70 nm between the spots, were purchased from Gattaquant – DNA Nanotechnologies (Gräfelfing, Germany). The biotinylated nanorulers were immobilized on a BSA-biotin-neutravidin surface according to the manufacturer’s specifications.

Stimulated emission depletion super-resolution light microscopy

Image acquisition was done using a quad scanning STED microscope (Abberior Instruments, Göttingen, Germany) equipped with a UPlanSApo 100x/1.40 Oil objective (Olympus, Tokyo, Japan). Excitation of Alexa Fluor 488, Alexa Fluor 594 and Star Red was achieved by laser beams with wavelengths of 485 nm, 561 nm, and 640 nm, respectively. For STED imaging, a laser beam with an emission wavelength of 775 nm was applied. For all experimental STED images, a pixel size of 25 nm was used. For visualization purposes, contrast stretching and an increase in image brightness were applied to exemplary STED images within the figures of this manuscript. No image processing was applied prior to the application of the MultiMatch analysis workflow.

Statistics and reproducibility

The statistical framework developed and applied in this manuscript and the settings of the simulation studies performed are presented in the Methods sections. All sample sizes and significance levels of the confidence bands are listed in the respective figure legends.
Experimental and simulated data and analysis scripts to reproduce results and figures are provided on Zenodo (https://doi.org/10.5281/zenodo.7221879)70.
