Simulation of CRISPR-Cas9 editing on evolving barcode and accuracy of lineage tracing

Comparison of alignment methods

Our modified alignment algorithm is compared to the well-established gap-penalty alignment. The gap-penalty method introduces a cost ($gapstart<0$) whenever the sequence starts a new gap in the alignment, so the reward for a gap becomes $gapstart+mismat\cdot L$, where $L$ is the gap length. This is equivalent to the affine gap penalty algorithm if one rewrites the reward as $(gapstart+mismat)+mismat\cdot (L-1)$. When $gapstart=0$, this method is identical to the regular alignment.

Figure 1 illustrates a sample of alignment results using the different algorithms. Figure 1A is obtained in a similar manner to Fig. 20A (bottom) and contains the real structure of the edited leaf barcodes. In an actual experiment with DNA sequencing, this structural information is lost because only the collapsed barcode sequences are obtained (see Fig. 20B). In simulation, however, we may use this information as a benchmark to test the efficacy of different alignment algorithms.

Using the root barcode as the pivot, we align the collapsed leaf barcodes using regular alignment, modified alignment, and gap-penalty alignment, in an effort to rebuild the structural information of the edited barcodes (as in Fig. 1A); Fig. 1B–D show a sample for each method, respectively. We set the reward of a match at 1 and the penalty of a mismatch at -2. The reward for consecutive matches is $consmat=0.2$, and the fraction of the penalty applied to consecutive mismatches is $fr=0.5$. The root barcode has a length of 50, and the tree depth is 5. In the gap-penalty alignment, the root barcode serves as the pivot, so we believe it is not viable to penalize a gap in the root barcode; therefore, we only put a penalty on the leaf side. Figure 1B is comparable to Fig. 20C, where many nucleotides are placed in positions that should belong to large deleted segments, resulting in many small gaps; Fig. 1C is comparable to Fig. 20D, where many large gaps are well recovered.
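The equivalence between the gap-penalty reward $gapstart+mismat\cdot L$ and the affine form can be checked numerically. The following is a minimal sketch, not the authors' implementation: the function names are hypothetical, $mismat=-2$ is assumed to equal the mismatch penalty stated in this section, and $gapstart=-2$ is an illustrative value.

```python
def gap_reward(length: int, gapstart: float = -2.0, mismat: float = -2.0) -> float:
    """Reward of one gap of `length` positions: gapstart + mismat * L."""
    if length < 1:
        raise ValueError("gap length must be at least 1")
    return gapstart + mismat * length


def affine_gap_reward(length: int, gap_open: float, gap_extend: float) -> float:
    """Affine form: `gap_open` for the first position, `gap_extend` for each further one."""
    if length < 1:
        raise ValueError("gap length must be at least 1")
    return gap_open + gap_extend * (length - 1)


if __name__ == "__main__":
    gapstart, mismat = -2.0, -2.0  # illustrative values (assumption)
    for L in range(1, 6):
        direct = gap_reward(L, gapstart, mismat)
        affine = affine_gap_reward(L, gap_open=gapstart + mismat, gap_extend=mismat)
        print(L, direct, affine)
```

With $gapstart=0$ the gap reward reduces to $mismat\cdot L$, that is, each gap position is scored like a mismatch, which is how the sketch reads the statement that the method then coincides with the regular alignment.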
Figure 1D shows a result of the gap-penalty method; similar to the modified alignment, large gaps are also recovered.

Figure 1. Comparison of alignment algorithms. (A) Leaf barcodes that contain structural information. This information is available in simulations but is lost in actual DNA sequencing. (B) Aligned barcodes using regular alignment. (C) Aligned barcodes using modified alignment. (D) Aligned barcodes using gap-penalty alignment. (E) Leaf-leaf comparison scores using the three alignment algorithms.

To test the efficacy of each alignment algorithm, we compare the aligned leaf barcodes (Fig. 1B–D) pairwise to the actual leaf barcodes (Fig. 1A). Because the structure of each leaf barcode is supposed to be recovered, in this leaf-leaf comparison we simply apply the regular alignment. Since Fig. 1A contains the actual structure of each leaf barcode, a mismatch to an empty spot is considered the same as a mismatch to a nucleotide and hence carries the same penalty. Let $Rew(s_1,s_2)$ be the reward score, using regular alignment, of comparing two strings $s_1,s_2$ whose leading and trailing empty entries have been removed. Each pairwise comparison then yields a contribution score

$$\begin{aligned} \frac{Rew(s_1,s_2)}{\sqrt{Rew(s_1,s_1)}\sqrt{Rew(s_2,s_2)}}, \end{aligned}$$

which is similar to the correlation formula. The contribution score is less than or equal to 1, but could be less than -1 depending on the penalty of mismatches. Since the reward of a match is set at 1, the terms $Rew(s_1,s_1)$ and $Rew(s_2,s_2)$ simplify to the lengths of $s_1$ and $s_2$, respectively. Figure 1E shows a sample of the pairwise leaf-leaf comparison, for the 32 leaves in Fig. 1A, using the different alignment algorithms. Afterwards, the average contribution score is calculated over the leaves for each algorithm.
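The contribution score can be illustrated with a short sketch. It assumes that the "regular alignment" is a plain global alignment with match reward 1, mismatch penalty -2, and a per-position gap cost equal to the mismatch penalty (the $gapstart=0$ case); the empty-spot symbol `-` and all function names are hypothetical choices for illustration.

```python
import math

EMPTY = "-"  # hypothetical symbol for an empty spot in a structured barcode


def trim(s: str) -> str:
    """Remove leading and trailing empty entries."""
    return s.strip(EMPTY)


def rew(s1: str, s2: str, match: float = 1.0, mismatch: float = -2.0) -> float:
    """Global alignment reward; a gap position costs the same as a mismatch (assumption)."""
    prev = [j * mismatch for j in range(len(s2) + 1)]
    for i, a in enumerate(s1, start=1):
        cur = [i * mismatch]
        for j, b in enumerate(s2, start=1):
            diag = prev[j - 1] + (match if a == b else mismatch)
            cur.append(max(diag, prev[j] + mismatch, cur[j - 1] + mismatch))
        prev = cur
    return prev[-1]


def contribution(s1: str, s2: str) -> float:
    """Normalized score Rew(s1,s2) / (sqrt(Rew(s1,s1)) * sqrt(Rew(s2,s2)))."""
    s1, s2 = trim(s1), trim(s2)
    if not s1 or not s2:
        raise ValueError("both strings must be non-empty after trimming")
    return rew(s1, s2) / (math.sqrt(rew(s1, s1)) * math.sqrt(rew(s2, s2)))


if __name__ == "__main__":
    print(contribution("ACGT", "ACGT"))  # identical strings
    print(contribution("ACGT", "AC-T"))  # one internal empty spot scored as a mismatch
    print(contribution("AAAA", "CCCC"))  # all mismatches, score below -1
```

In this sketch an internal empty spot is an ordinary symbol, so it is penalized like any other mismatch, and the all-mismatch case shows how the score can fall below -1 when the mismatch penalty exceeds the match reward in magnitude.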
This simulation is then run 100 times, and the average for each alignment algorithm is calculated and compared.

For the gap-penalty method, we first test the parameter $gapstart$ from -1 to -5 (when $gapstart=0$ it gives the same result as regular alignment). The algorithm performs best when $gapstart=-2$, and its performance degrades as $gapstart$ becomes more negative. The reason is that, although a large negative $gapstart$ reduces the number of gaps by putting a large penalty on opening a gap, it also tends to stack the nucleotides, which results in many mismatches. With $gapstart$ set at -2, a simulation of 100 runs yields the following average scores: regular alignment = 0.488, modified alignment = 0.698, gap-penalty alignment = 0.543. We then increase the barcode length to 100 and the generation number to 10, and test the gap-penalty method to find that $gapstart=-1$ yields the best result. With this setting, the average of 100 runs yields the following scores: regular alignment = -0.0855, modified alignment = 0.346, gap-penalty alignment = 0.0176. Both results indicate that the modified alignment outperforms the gap-penalty method, and the gap-penalty method outperforms regular alignment. The question of which parameter settings in the gap-penalty algorithm yield the best result is beyond the scope of this paper. From the simulation results, we conclude that the modified alignment is at least comparable to the gap-penalty method. Therefore, we adopt the modified alignment algorithm in an effort to recover the structure of leaf barcodes.

Evolution of barcode matching scores

To see how the matching scores, when compared to the root, evolve as cells divide in the simulation, we run the program from generation 1 to generation 15, and for each generation the program is repeated 10 times. This part of the simulation does not involve sequence alignment or lineage tree reconstruction, so the computational burden is not heavy even at generation 15.
The simulation parameters are tuned according to experimental results. For instance, from Figs. 1C–E and 3C–E in [25], Fig. 3F.I in [23], and Fig. 3A in [36] we conclude that large segment deletion is a major phenomenon in barcode editing; single-nucleotide insertion is more frequent than insertion of two or more nucleotides; single-nucleotide deletion is more frequent than deletion of two or more nucleotides; unedited target sites (perfect repairs in our program) account for a considerable proportion of barcode editing events, etc. Therefore, at a cutting site on the barcode, we set the probability of perfect repair at 0.7; the probabilities of inserting 1, 2, and 3 nucleotides at 0.1, 0.03, and 0.02, respectively; the probability of a substitution at 0.05; and the probability of a single-nucleotide deletion at 0.1. The Cas9 mutation rate, defined as the cutting probability at each nucleotide, is set at $mupb=0.1$. Given two or more cuts occurring on the barcode, the probability of a large segment deletion is 0.15.

The average matching score as a function of generation number is shown in Fig. 2A, where $n$ is the barcode length. The scoring scheme adopted to compute matching scores is the following:

$$\begin{array}{c|cccc} & A & C & G & T\\ \hline A & 1 & -2 & -2 & -2\\ C & -2 & 1 & -2 & -2\\ G & -2 & -2 & 1 & -2\\ T & -2 & -2 & -2 & 1 \end{array}$$

Figure 2. Evolution of matching scores. (A) Average matching score decays geometrically as cell division/barcode mutation continues. (B) Standard deviation of matching scores also decreases as cell division/barcode mutation continues.

It is seen from Fig. 2A that when the mutation rate (mupb) is small, the matching score decreases slowly as cells divide, and when the mutation rate is high the matching score decays fast, roughly at a geometric rate.

The standard deviation of the matching scores for the 10 runs is shown in Fig. 2B.
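The per-cut repair probabilities and the cutting rate given in this section can be sketched as a simple sampler. This is only an illustration under stated assumptions: the names are hypothetical, and the sketch merely flags a large segment deletion (with probability 0.15 when two or more cuts occur) without modeling which segment is removed or how the flag interacts with the individual repairs, since the text here does not specify that.

```python
import random

# Per-cut repair outcome probabilities from this section (they sum to 1).
REPAIR_OUTCOMES = {
    "perfect_repair": 0.70,
    "insert_1": 0.10,
    "insert_2": 0.03,
    "insert_3": 0.02,
    "substitution": 0.05,
    "delete_1": 0.10,
}


def sample_cut_sites(barcode_length, mupb, rng):
    """Each nucleotide position is cut independently with probability mupb."""
    return [i for i in range(barcode_length) if rng.random() < mupb]


def sample_repair(rng):
    """Draw one repair outcome for a single cut site."""
    names = list(REPAIR_OUTCOMES)
    weights = list(REPAIR_OUTCOMES.values())
    return rng.choices(names, weights=weights, k=1)[0]


def sample_editing_events(barcode_length, mupb=0.1, p_large_del=0.15, rng=None):
    """Return (cut sites, large-deletion flag, repair outcome per cut) for one round."""
    rng = rng or random.Random()
    cuts = sample_cut_sites(barcode_length, mupb, rng)
    # A large segment deletion is only possible when two or more cuts occur.
    large_deletion = len(cuts) >= 2 and rng.random() < p_large_del
    # Resolving the flag against the individual repairs is left to the caller.
    repairs = {site: sample_repair(rng) for site in cuts}
    return cuts, large_deletion, repairs


if __name__ == "__main__":
    cuts, large_deletion, repairs = sample_editing_events(50, rng=random.Random(0))
    print(cuts, large_deletion, repairs)
```

Passing a seeded `random.Random` keeps a run reproducible, which is convenient when averaging over repeated simulations.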
Mutations accumulate in each generation, which brings variation into the matching scores; meanwhile, the length of the remaining barcode shortens gradually, which causes the standard deviation to decrease. As cells divide, the number of descendants doubles in each generation, and this factor further reduces the standard deviation of the matching scores.

Test on RMP and NBJ methods

Before we apply the RMP and NBJ methods, with or without filtering, to reconstruct the lineage tree and test the effect of some parameter settings, we need to test the efficacy of these methods. The mechanism of barcode editing in our simulation differs from most of the existing literature in that each nucleotide, with 4 possible states (A, C, G, T), could be a potential cutting site, and the barcode length varies from generation to generation. Furthermore, Cas9 editing continues along with cell division; that is, any nucleotide in the current barcode could be cut again. Therefore, many existing public datasets do not fit this framework. For example, in the simulated in silico datasets in [33], each barcode consists of a fixed number (a few hundred) of Cas9 targets, and each target has 30 possible states with respectively assigned switching probabilities. In addition, it is assumed that this switch/jump may occur at most once, which greatly simplifies the simulation scheme.

In the in vitro dataset of Challenge 1 in [33], however, each Cas9 target has 3 states (0, 1, 2), and it is possible to test our RMP and NBJ methods on this dataset. We built a simplified version of our program that works on the 76 in vitro training sets of Challenge 1 in [33], where the barcode length is fixed, each target has 3 states, and pairwise entry-by-entry comparison is used (in place of alignment) when comparing two barcodes. The tree depth is estimated based on the number of cells/barcodes. For the RMP method we set $propm=0.7$, and for the NBJ method $propm=0.3$.
When rebuilding the parent node in our methods, some randomness is involved (see section “Reconstruct parent node”), so for each of the 76 training sets we run RMP and NBJ, with and without filtering, 50 times, and the average (Avg) and maximum (Max) accuracy of dividing/paired nodes over these 50 runs are recorded (detailed data are available at github.com/xzhanglab/CRISPR-based-Lineage-Tracing-Simulation). The overall average over the 76 training sets is then calculated, as shown in Table 1.
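As an illustration of what an accuracy of paired nodes can measure on a fully sampled tree, the sketch below counts the fraction of internal clades of the true tree (excluding the root) that reappear in the rebuilt tree. This clade-matching definition is an assumption made for illustration, and the nested-tuple tree representation and function names are hypothetical; the authors' exact matching procedure is described in their methods sections.

```python
def leaf_set_and_clades(tree):
    """Return the leaf set of `tree` and the set of clades of all its internal nodes."""
    if isinstance(tree, str):  # a leaf is represented by its label
        return frozenset([tree]), set()
    leaves = frozenset()
    found = set()
    for child in tree:
        child_leaves, child_clades = leaf_set_and_clades(child)
        leaves |= child_leaves
        found |= child_clades
    found.add(leaves)
    return leaves, found


def internal_clades(tree):
    """Clades of internal nodes, with the root clade removed."""
    leaves, found = leaf_set_and_clades(tree)
    found.discard(leaves)
    return found


def accuracy(true_tree, rebuilt_tree):
    """Fraction of true internal clades that also occur in the rebuilt tree."""
    truth = internal_clades(true_tree)
    if not truth:
        raise ValueError("true tree has no internal nodes below the root")
    return len(truth & internal_clades(rebuilt_tree)) / len(truth)


if __name__ == "__main__":
    true_tree = ((("a", "b"), ("c", "d")), (("e", "f"), ("g", "h")))
    rebuilt = ((("a", "b"), ("c", "e")), (("d", "f"), ("g", "h")))
    print(accuracy(true_tree, rebuilt))
```

Under this definition, a full binary tree of depth 5 has 2 + 4 + 8 + 16 = 30 internal nodes below the root, and one minus the accuracy is the fraction of true clades that are missed.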
Table 1. Accuracy (%) of in vitro data test.

Recall that in the case of a full binomial tree, the accuracy of paired nodes in our simulation is the complement of the RF-distance. Thus, if NBJNF Avg has an accuracy of 42.038% in the case of a full binomial tree, it corresponds to an RF-distance of 0.58.

The 76 trees are further classified as small (cell number < 10), medium (10 $\le$ cell number < 20), and large (cell number $\ge$ 20), and the accuracy of each method on each class is also computed. This result is provided in Figure S1, which is comparable to Figure 2-F in [33] (except that we use accuracy rather than RF-distance when comparing two lineage trees). As an example, Fig. 3 shows the accuracy of the NBJNF method. In view of the result in Box 1-Figure F in [33], we conclude that our RMP and NBJ methods are comparable to some benchmarked approaches for reconstruction of in vitro cell lineages, such as DCLEAR (WHD), DCLEAR (KRD), the Liu method, and the Guan method.

Figure 3. Accuracy of NBJNF method on in vitro dataset. The x ticks represent the upper bound of each sub-interval; for example, 8.3 represents the interval [0, 8.3], 16.6 represents the interval (8.3, 16.6], etc.

In the second test of our RMP and NBJ methods, we use our simulation to generate a lineage tree with a barcode length of 50 and a generation number of 5, and apply the RMP and NBJ methods to rebuild it. Other parameters are as follows: the probability of perfect repair is 0.5; the probabilities of inserting 1, 2, and 3 nucleotides are 0.15, 0.1, and 0.05, respectively; the probability of a substitution is 0.1; the probability of a single-nucleotide deletion is 0.1; the probability of large segment deletion is 0.1; the sample size is $ss=1$; the Cas9 cutting rate is $mupb=0.1$; and the pairing threshold propm varies from 0.4 to 0.9. Compared with the parameter settings in section “Evolution of barcode matching scores”, the large segment deletion probability is slightly lower, and the probabilities of insertions are slightly higher.
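To make the comparison of the two parameter settings concrete, the sketch below places the two per-cut repair distributions side by side and computes their Shannon entropy. The dictionary and function names are hypothetical, and per-cut outcome entropy is only a rough proxy for barcode diversity, not the quantity measured in the simulation.

```python
import math

# Per-cut repair probabilities used in "Evolution of barcode matching scores".
SETTING_MATCHING_SCORES = {
    "perfect_repair": 0.70, "insert_1": 0.10, "insert_2": 0.03,
    "insert_3": 0.02, "substitution": 0.05, "delete_1": 0.10,
}

# Per-cut repair probabilities used in the second test of RMP and NBJ.
SETTING_SECOND_TEST = {
    "perfect_repair": 0.50, "insert_1": 0.15, "insert_2": 0.10,
    "insert_3": 0.05, "substitution": 0.10, "delete_1": 0.10,
}


def shannon_entropy(probs):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


if __name__ == "__main__":
    for name, setting in [("matching scores", SETTING_MATCHING_SCORES),
                          ("second test", SETTING_SECOND_TEST)]:
        total = sum(setting.values())
        print(f"{name}: total probability {total:.2f}, "
              f"entropy {shannon_entropy(setting.values()):.3f} bits")
```

The second setting spreads more probability over insertions and substitutions, so its per-cut outcome entropy is higher (roughly 2.1 bits versus roughly 1.5 bits).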
It is expected that the leaf barcodes generated under this setting possess more diversity/entropy. This modification is supported by a recent experimental result [35] in which Cas9-TdT results in fewer deletions but twice the insertion events per allele than Cas9 expression, with all four nucleotides well represented in the inserted sequences.

The rebuilt tree is then compared to the actual lineage tree and the accuracy is calculated. For each propm value and each lineage rebuilding method we run the simulation 100 times, and the average (Avg) and maximum (Max) accuracy are computed, as shown in Table 2.
Table 2. Lineage accuracy (%) of 100 runs.

We see from Table 2 that our RMP and NBJ methods, with or without filtering, can rebuild the lineage well; in particular, the lineage accuracy of the RMPNF and NBJNF methods can reach 100%, which means the full binomial lineage tree is perfectly rebuilt. Figure 4 shows a scenario of barcode evolution where the NBJNF method with $propm=0.8$ rebuilds the lineage with 100% accuracy. Other parameter settings are the same as described for Table 2.

Figure 4. A barcode evolution scenario where the lineage is perfectly rebuilt using the NBJNF method. The rebuilt lineage is identical to this original tree and hence is omitted.

Further examination of the barcode evolution in Fig. 4 reveals that, with each cell division, sibling nodes have similar mutations, which build up in a way that makes barcodes belonging to different branches easily distinguishable; see the paired leaf nodes in Fig. 4.

Among the 100 runs using the NBJNF method with $propm=0.8$ (Table 2), we also show in Fig. 5 the barcode evolution scenario with the lowest accuracy of the rebuilt lineage. In this case, after the first division, the two children barcodes both experience large segment deletions, which occur, although independently, around the same sites. Further mutations do not introduce enough variety to distinguish leaf nodes that belong to different branches. As a result, the leaf barcodes all look similar, which causes a homoplasy effect [37], and that is why the NBJNF method cannot pair them accurately. This finding suggests that in an actual experiment it is better to control the mutation rate to avoid large segment deletions in early generations, for example by reducing Dox induction in the first few divisions.

Figure 5. A barcode evolution scenario where the rebuilt lineage has the lowest accuracy in 100 runs using the NBJNF method. (A) Original lineage tree. (B) Rebuilt lineage tree. Green nodes represent nodes matched to their original counterparts.
The accuracy is $7/30=23.3\%$.

The effect of barcode length and sampling proportion

Due to the large computational burden, we cannot perform exhaustive tests on all combinations of methods and parameter settings. We assume that attributes found in one method also apply to other methods unless there is a legitimate reason against it. In this section we first test the effects of barcode length and sampling size on the accuracy of the reconstructed lineage using the RMP and NBJ methods.

We pick different barcode lengths, $n=100,200,300,400$ (bp), and different sampling proportions (ss, sampling size) to test their effects on the accuracy of the reconstructed lineage trees (see sections “Binomial tree comparison” and “Fractional sampling”). We simulate cell divisions for 10 generations with barcode mutations under a certain mutation rate. The simulation is run 10 times for each setting, and the average accuracy is plotted in Fig. 6A for the RMP method with $propm=0.85$, and in Fig. 6B for the NBJ method with $propm=0.4$. The effect of propm on accuracy is investigated in section “The effect of matching proportion in pairing”.

Figure 6. Whole tree accuracy with different barcode lengths and sampling proportions. (A) Lineage accuracy using RMP method. (B) Lineage accuracy using NBJ method.

It is seen from Fig. 6 that the accuracy of both the internal nodes and the paired nodes generally increases for longer barcodes, though the increase does not appear dramatic. In a few scenarios the lineage accuracy slightly decreases with a longer barcode. This is somewhat counter-intuitive because we expected that a longer barcode would greatly increase the accuracy of the reconstructed lineage. We find that this is an inherent attribute of purely stochastic models, and explain it in detail in the next section together with the effect of mutation rate.

Nevertheless, the sampling proportion plays a much more influential role in the accuracy of the lineage.
When the sampling proportion is high, most internal nodes are paired nodes, and as a result the accuracy of internal nodes and the accuracy of paired nodes are very close. As the sampling proportion decreases, many internal nodes become singleton nodes, and these two accuracy measurements differ greatly: a large proportion of internal nodes are matched in the rebuilt lineage tree, while a much smaller proportion of the paired nodes are correctly matched.

The RMP method with high propm is stringent in pairing the nodes, while the NBJ method with low propm encourages the pairing of nodes. Therefore, when most nodes are singleton nodes, the RMP method yields higher accuracy than the NBJ method, and if most nodes are paired nodes, the NBJ method generally outperforms the RMP method.

The effect of mutation rate

The control of the mutation rate (or Cas9 cutting rate) is one of the key factors that affect the accuracy of reconstructing cell lineage. We pick barcode length $n=300$ and generation level 10, and run the simulation for a variety of mutation rates (mupb) and sampling proportions. Both RMP and NBJ methods are performed. The accuracy of the rebuilt lineage tree, in terms of the percentage of matched internal nodes and the percentage of matched paired nodes, is illustrated in Fig. 7 for the RMP and NBJ methods.

Figure 7. Whole tree accuracy with different mutation rates. (A) Lineage accuracy using RMP method with $propm=0.85$; (B) Lineage accuracy using NBJ method with $propm=0.4$.

For the result of the RMP method in Fig. 7A, when the sampling size ss is large, most sampled internal nodes are dividing/paired nodes, and the two percentages (all matched internal nodes vs matched paired nodes; see Fig. 23) are very close. In this case, as mupb increases, both percentages increase and then flatten out.
When ss is small, most internal nodes are singleton nodes, and there is a big difference between these two percentages: the percentage of all matched internal nodes is high while the percentage of matched paired nodes is low. It is also seen from Fig. 7A that for $ss=0.5$, the percentage of matched paired nodes achieves a local maximum at mupb around 0.06.

The NBJ method with low propm encourages the pairing of two barcodes. Figure 7B shows that the accuracy of paired nodes stays stable as the mutation rate increases. The situation is different for all internal nodes. When the mutation rate is low and the sampling size is small, the percentage of matched internal nodes is low; as the mutation rate increases, this accuracy increases quickly, and then flattens out when the mutation rate is very high. The reasons are as follows: a low mutation rate does not generate much variation in the barcodes, so it becomes harder to distinguish barcodes that should belong to different clades; a small sampling size results in more singleton nodes in the lineage tree, which is sparse; and the NBJ method with low propm encourages the pairing of two barcodes. As a result, many singleton nodes are paired incorrectly, which lowers the percentage of matched nodes. As the mutation rate increases, more variation is created in the barcodes, which prevents them from being incorrectly paired, and the accuracy improves greatly. The trade-off between all matched nodes and paired nodes is observed in both Fig. 7A and B.

Furthermore, when the mutation rate is high, lineage accuracy tends to flatten out or slightly decrease. To see the combined effect of barcode length and mutation rate, we run the RMP and NBJ methods 10 times each, with various barcode lengths ($n=100, 200, 300, 400$) and various mutation rates ($mupb=0.04, 0.08, 0.12, 0.16, 0.2$). The average full tree ($ss=1$) accuracy is provided in Fig. 8.
Other parameter settings remain the same as in this section.

Figure 8. Full tree accuracy ($ss=1$) with different barcode lengths and different mutation rates. (A) RMP method; (B) NBJ method.

Some interesting findings from Fig. 8 are as follows: when the mutation rate is low, a longer barcode consistently yields higher lineage accuracy; when the mutation rate is high, however, lineage accuracy tends to decrease, and a longer barcode does not show an advantage over a shorter barcode, which is a little surprising. The first reason is that a higher mutation rate brings a higher chance of large segment deletions, which wipe out lineage information, and a longer barcode tends to have longer segment deletions. The second reason, which appears to be deeper, is the following. In many widely adopted simulation models, each target in the barcode is assumed to mutate at most once. This is a strong and idealized assumption in that the lineage information, which is accurately contained in these mutated targets, will be retained throughout the whole editing process. A longer barcode, which contains more lineage information, will certainly yield higher accuracy in the rebuilt lineage. In the purely stochastic model, however, Cas9 editing activities proceed continuously and randomly, and newly mutated nucleotides may mutate again later on. As a result, lineage information that is contained in mutated nucleotides may be wiped out by further mutations. Moreover, these further mutations could alter the previously established lineage information and consequently promote mismatches in rebuilding the lineage. To the best of our knowledge, this phenomenon has not been addressed in other related work.

The effect of matching proportion in pairing

The parameter propm, as introduced in sections “Pairwise alignment and barcodes pairing – RMP method” and “Neighbor joining method (NBJ) on barcodes pairing”, controls the pairing of two barcodes.
A high value of propm sets a high bar for pairing two barcodes because they must match in a large proportion of nonzero nucleotides, while a low value of propm allows two barcodes to be paired even if only a few nonzero nucleotides are matched. The comparison of the lineage tree accuracy of high and low propm at mutation rate $mupb=0.1$ is shown in Fig. 9A for the RMP method, and in Fig. 9B for the NBJ method.

Figure 9. Whole tree accuracy with different matching proportions. A: all internal nodes, P: paired nodes, L: low $propm=0.4$, H: high $propm=0.85$. (A) RMP method; (B) NBJ method.

For the RMP method, Fig. 9A shows that the setting $propm=0.85$ yields higher accuracy than $propm=0.4$ for all internal nodes. But when the sample size is small, lower propm yields a better result on paired nodes than higher propm. Recall that lower propm means the pairing of two barcodes is encouraged.

For the NBJ method, when the sample size is large, high and low propm have similar matching accuracies. However, when ss is small, high propm yields higher accuracy of all internal nodes, but low propm yields higher accuracy of paired nodes. Therefore, there appears to be a trade-off between matched singleton nodes and paired nodes. If ss is small, the lineage tree is ‘sparse’ in that most nodes are singleton nodes, and a few nodes are paired nodes that are sparsely distributed in the tree and hence hard to pair correctly. An algorithm with a certain parameter setting may match more paired nodes but mismatch many singleton nodes, and vice versa. Therefore, it is important to properly choose the algorithm and parameter settings in order to balance this trade-off.

Comparison of RMP and NBJ methods

The results of section “The effect of matching proportion in pairing” may be combined to compare the performance of the RMP method to that of the NBJ method. As an illustration, we first choose the same $propm=0.4$ and $mupb=0.1$ for both RMP and NBJ methods; the accuracy comparison is shown in Fig. 10A, B.

Figure 10. Comparison of RMP and NBJ methods. (A) Accuracy of all internal nodes, same $propm=0.4$ for both RMP and NBJ methods. (B) Accuracy of paired nodes, same $propm=0.4$ for both RMP and NBJ methods. (C) Accuracy of all internal nodes, $propm=0.85$ for RMP and $propm=0.4$ for NBJ. (D) Accuracy of paired nodes, $propm=0.85$ for RMP and $propm=0.4$ for NBJ.

Recall that low propm indicates encouragement of barcode pairing; from Fig. 10A, B it is seen that under this condition the NBJ method outperforms the RMP method. From the result in section “The effect of matching proportion in pairing” it appears that the RMP method performs better when $propm=0.85$, so we compare the result of the RMP method with $propm=0.85$ and the result of the NBJ method with $propm=0.4$, as shown in Fig. 10C, D.

Figure 10C shows the accuracy of all internal nodes of the RMP and NBJ methods, respectively, with different sampling sizes (ss). When ss is high, there is no big difference between the accuracy of these two methods. When ss is low, the RMP method has higher accuracy than the NBJ method. However, we recall that when ss is low, most sampled barcodes are singletons. RMP tends not to pair these barcodes, so it successfully constructs the lineage for most singleton barcodes, but misses those that should be paired.

Figure 10D shows the accuracy on paired/dividing internal nodes. Again, when ss is large, RMP and NBJ do not show much difference in accuracy of paired/dividing nodes. However, when ss is small, NBJ outperforms RMP; that is, NBJ correctly pairs more barcodes than RMP.

The effect of pulse induction

In the work of Bowling et al. [25], pulse induction of doxycycline (Dox) is implemented to control barcode mutation. To test the effect of pulse induction on the accuracy of the rebuilt lineage tree, we alternate mupb between a set rate and a base value (0.005) as barcode editing continues. The ‘No-Dox’ line in Fig.
1G in [25] indicates that even without Dox induction, a small percentage of edited alleles is observed. Therefore we set mupb at a very low rate to represent the scenario of no Dox application. In the case of Dox induction, we test set rates of 0.05 and 0.1. Other parameter settings are the same as in section “The effect of barcode length and sampling proportion” so that we may compare the results. The results of the RMP method are shown in Fig. 11, and the results of the NBJ method in Fig. 12.

Figure 11. Effect of pulse induction, RMP method, $propm=0.85$. (A) Lineage accuracy with $mupb=0.05$. (B) Lineage accuracy with $mupb=0.1$.

Figure 12. Effect of pulse induction, NBJ method, $propm=0.4$. (A) Lineage accuracy with $mupb=0.05$. (B) Lineage accuracy with $mupb=0.1$.

From the simulation results of the RMP method in Fig. 11 it is seen that when the sampling size is large, pulse Dox induction yields higher accuracy than constant induction. When the sampling proportion is low and mupb is low, as in Fig. 11A, constant Dox induction yields higher accuracy on internal nodes, though in this situation most of the internal nodes in the sample are singleton nodes. However, when mupb is high, as in Fig. 11B, pulse Dox induction generally yields better lineage accuracy than constant induction, especially for paired nodes.

Similar observations are seen for the NBJ method in Fig. 12; i.e., when the sampling size is large and mupb is high, pulse induction outperforms constant induction. One difference is that when the sampling size is small, constant Dox induction yields better accuracy than pulse induction on both the internal nodes and the paired nodes.

Comparing the results in Figs. 11 and 12, it appears that the RMP method with $propm=0.85$ and pulse induction with $mupb=0.1$ yields the best lineage accuracy; see Fig. 11B.

Effect of Dox induction at specific time

For successful lineage tracing with molecular barcodes, it is important to have sufficient diversity and randomness in the barcodes. However, the cellular DSB repair system has a bias towards large sequence deletions. To prevent the loss of lineage information, it is recommended to use an intermediate dosage of Dox (25 µg/g mice) with the CARLIN model, as suggested by Bowling et al. [25]. Despite this recommendation, there is a lack of justification for this experimental setting, and optimizing the usage conditions for CARLIN in mice remains a costly and variable process.

We conducted a series of simulations to evaluate the efficacy of paused intermediate Dox inductions for the CARLIN system. The simulations were performed using a set of parameters including a barcode length of 276, propm of 0.95, a sample size of 0.1, a large deletion probability of 0.3, divp of 0.99, and clive of 0.99, which together account for the possibility of cell death and non-division. Mutations were introduced at generation numbers 4, 6, and/or 8, with varying rates. The simulation was run 10 times for each setting, and the average accuracy was calculated. To evaluate the accuracy of the rebuilt lineage tree, we calculated the percentages of paired nodes and internal nodes for three scenarios: one generation above the leaf level, two generations above the leaf level, and the entire tree. The focus of our analysis was on tracing back a limited number of generations, as real experiments often involve single-time-point sampling from animal tissues, allowing for the reconstruction of only the most recent one or a few generations. Both RMP and NBJ methods were employed, and accuracy was calculated using either all internal nodes or just paired nodes.

Figure 13. Effect of Dox induction at specific time. (A) and (B): accuracy is calculated using the RMP method under different induction patterns and mock Dox concentrations; the accuracy curves are plotted in Fig. S2.
The average accuracy for each curve is calculated and summarized here for comparison. (C) and (D): accuracy is calculated using the NBJ method.

Supplementary Figure S2 presents the results of simulations performed across multiple induction patterns, various Dox dosages (mupb), and lineage reconstruction methods. The x-axis highlights the induction time points, which are depicted in red, and the 7 Dox dosages are represented by distinct colored lines. In Fig. 13, we calculated and summarized the average accuracy for each mock Dox concentration (mupb) over the 10 cell divisions (d3-d12). By comparing the accuracy calculated on paired nodes versus all internal nodes, we observed that in the initial 3 cell divisions, the majority of the leaves are singletons. This is evidenced by the fact that the initial accuracy on all internal nodes is high, while it is extremely low on paired nodes. These results suggest that Dox induction is unnecessary for lineage tracing at very early stages: irrespective of the induction patterns and mupb values, background Cas9 activity could generate moderate barcode diversity. We also found that, when comparing CARLIN performance in tracing back different numbers of generations, the accuracy is higher when tracing back fewer generations. Furthermore, testing the induction patterns revealed that the double paused pattern has the highest accuracy under the same conditions. In addition, the NBJ method demonstrated advantages over RMP, as intermediate-dosage Dox induction achieved accuracy much closer to that of high-dosage induction than in the corresponding RMP groups.

Effect of non-filtering by root

While reconstructing the parent node from children cells, a natural idea is to use available information, such as the root barcode, as much as possible. This is how the filtering step was introduced, as illustrated in Fig. 21C. This filtering step makes internal barcodes gradually converge to the root while tracing backward.
To test whether filtering improves lineage accuracy, we compare the results of NBJ and NBJNF, with and without pulse Dox induction. Barcode length is set at $n=100$ (bp) and mutation rate is set at $mupb=0.1$. Other parameter settings are the same as in section “The effect of barcode length and sampling proportion”. propm increases from 0.4 to 0.9, and sampling size takes the values $ss=1, 0.8, 0.5, 0.1$. For each setting the program is run 10 times and the average is calculated. The result with pulse Dox induction is shown in Fig. 14, and the result with constant Dox induction is shown in Figure S3.

Figure 14. Comparison between NBJ and NBJNF with pulse Dox induction. (A–D): accuracy of all internal nodes with different sampling size; (E–H): accuracy of all paired nodes with different sampling size.

Figs. 14 and S3 both show that NBJNF outperforms NBJ in almost all the parameter settings, which, in our opinion, is counter-intuitive. We notice from the simulation results that the filtering mechanism blurs the difference among different clones; in other words, internal barcodes become more and more similar as the algorithm traces backward in the lineage tree. Consequently, many internal barcodes are paired incorrectly, which reduces lineage accuracy. We also compared the RMP and non-filtering RMP methods with and without pulse Dox induction (Figs. S4 and S5), where we see that in most cases the non-filtering RMP method outperforms the RMP method. In conclusion, it is better not to perform the filtering by root (Fig. 21C) while rebuilding the parent node; lineage accuracy is higher without it.

Effect of adding a second barcode

As introduced in section “Adding a second barcode”, we test the lineage accuracy with two barcodes for a grid of propm and propmi. Other parameter settings are similar to those of section “Effect of non-filtering by root”, except that there are two barcodes that are edited independently as cells divide.
Both the NBJ and NBJNF methods are applied, with and without pulse Dox induction. The simulation is run 10 times for each setting and the average accuracy is calculated. The results of the NBJ method are shown in Figs. S6 and S7; the result of the NBJNF method with constant Dox induction is shown in Figure S8. Figure 15 shows the result of the NBJNF method with pulse Dox induction.

Figure 15. Lineage accuracy with 2 Independent Barcodes using NBJNF Method and Pulse Dox Induction. (A–D) accuracy of all internal nodes with different sampling size; (E–H) accuracy of all paired nodes with different sampling size.

The accuracy of paired nodes reflects how well the reconstruction identifies two children cells generated from a common parent, so we focus mostly on the accuracy of paired nodes. Comparing Fig. S6 with Fig. S7, and Fig. S8 with Fig. 15, we see that pulse Dox induction results in higher accuracy of paired nodes. Comparing Fig. S6 with Fig. S8, and Fig. S7 with Fig. 15, we see that the NBJNF method generally outperforms the NBJ method, which agrees with the findings in section “Effect of non-filtering by root”.

Next, comparing Fig. S6 with Fig. S3, Fig. S8 with Fig. S3, Fig. S7 with Fig. 14, and Fig. 15 with Fig. 14, we see that adding a second barcode improves the lineage accuracy. We therefore anticipate that adding more barcodes will improve lineage accuracy, provided that this additional information is used properly and a good lineage reconstruction algorithm is applied.

In our NBJ and NBJNF algorithms on double barcodes, choosing the parameters propm and propmi appears to be tricky. The simulation results show that the propm-propmi combination at which the highest accuracy occurs also depends on other factors such as the way of Dox application (constant vs.
pulse), sampling size (ss), and the nodes of interest (all internal nodes or paired nodes only).

Effect of changing indel probabilities

As introduced in section “Test on RMP and NBJ methods”, the use of Cas9-TdT in ref. 35 results in fewer deletions but twice the insertion events per allele compared with Cas9 expression, with all four nucleotides well-represented in the inserted sequences. We therefore test the effect of changing indel probabilities on lineage tracing. These indel probabilities are chosen the same as in section “Test on RMP and NBJ methods”: the probability of perfect repair is 0.5; the probabilities of inserting 1, 2, and 3 nucleotides are 0.15, 0.1, and 0.05, respectively; the probability of a substitution is 0.1; the probability of a single nucleotide deletion is 0.1; and the probability of a large segment deletion is 0.1. We call this change of indel probabilities the simulation with Cas9-TdT. Barcode length is set at $n=100$ and the generation number is 10. Two independent barcodes are used, and the NBJ and NBJNF methods are performed with and without pulse Dox induction. The simulation is run 10 times for each parameter setting. The average lineage accuracy using the NBJ method is shown in Figures S9 (constant Dox induction) and S10 (pulse Dox induction); the result of the NBJNF method is shown in Fig. 16 (constant Dox induction) and Fig. 17 (pulse Dox induction).

Figure 16. Lineage accuracy with 2 Independent Barcodes using NBJNF Method and Cas9-TdT. (A–D) accuracy of all internal nodes with different sampling size; (E–H) accuracy of all paired nodes with different sampling size.

Figure 17. Lineage accuracy with 2 Independent Barcodes using NBJNF Method and Cas9-TdT with Pulse Dox Induction.
(A–D) accuracy of all internal nodes with different sampling size; (E–H) accuracy of all paired nodes with different sampling size.

Comparing the results under the new indel probabilities (which favor insertion) with the results under the original settings (as in section “Effect of adding a second barcode”), it is easily seen that the new indel probabilities substantially improve lineage accuracy. As a further examination, we plot the maximum accuracy of the 10 runs for the NBJNF method with pulse Dox induction in Fig. 18, where we see that the maximum accuracy can exceed 75% for some parameter settings.

Figure 18. Maximum lineage accuracy of 10 runs with 2 Independent Barcodes using NBJNF Method and Cas9-TdT with Pulse Dox Induction. (A–D) accuracy of all internal nodes with different sampling size; (E–H) accuracy of all paired nodes with different sampling size.

Furthermore, we observe once again that pulse Dox induction generally yields better accuracy than constant induction, and that the NBJNF method outperforms the NBJ method in the accuracy of lineage reconstruction. Our understanding is that the new indel probabilities, which favor insertion and inhibit large deletion, promote the diversity/differentiation of alleles as the editing process proceeds, whereas filtering by the root barcode acts against this differentiation. In conclusion, the root barcode information should be discarded while rebuilding the lineage tree.
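The repair-outcome probabilities listed for the Cas9-TdT setting in section “Effect of changing indel probabilities” can be illustrated with a short sampling sketch. This is only a minimal illustration, not the simulation code used in this study: the names `OUTCOME_WEIGHTS`, `normalized`, and `sample_outcomes` are hypothetical, and the outcome of each editing event is assumed to be drawn independently. The values as listed sum to 1.1, so the sketch treats them as relative weights and normalizes them; how the original simulation handles this is not stated.

```python
import random
from collections import Counter

# Repair-outcome values quoted in the text for the Cas9-TdT setting.
# As listed they sum to 1.1, so they are used here as relative weights
# (an assumption of this sketch, not a statement about the original code).
OUTCOME_WEIGHTS = {
    "perfect_repair": 0.50,
    "insert_1": 0.15,
    "insert_2": 0.10,
    "insert_3": 0.05,
    "substitution": 0.10,
    "delete_1": 0.10,
    "large_deletion": 0.10,
}


def normalized(weights):
    """Rescale relative weights so that they sum to 1."""
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


def sample_outcomes(n_events, weights=OUTCOME_WEIGHTS, seed=0):
    """Draw n_events independent repair outcomes and count each type."""
    rng = random.Random(seed)
    names = list(weights)
    draws = rng.choices(names, weights=[weights[k] for k in names], k=n_events)
    return Counter(draws)


probs = normalized(OUTCOME_WEIGHTS)
counts = sample_outcomes(10_000)

# Share of events that are insertions of any length under the normalized weights.
insertion_share = probs["insert_1"] + probs["insert_2"] + probs["insert_3"]
print(f"insertion share: {insertion_share:.3f}")
print(counts.most_common())
```

Under these weights, insertions of any length are three times as likely as a large segment deletion, which is the property the text refers to when it says the new probabilities favor insertion and inhibit large deletion.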
