BE-dataHIVE: a base editing database

Base editing prediction tasks

In the base editing field, there are two main computational tasks: the prediction of efficiency rates and the prediction of bystander mutations (the latter comprises the prediction of bystander edit rates or bystander outcome rates). In the following, we define these tasks mathematically using the notation:

E: Total number of reads

\(E_{{\text {edited}}}(s, e)\): Number of reads with at least one edit within the editing window starting at position s and ending at position e (e.g., bases 3 to 10)

\(E_{{\text {pos}}}(i)\): Number of edits at a specific position i

\(E_{{\text {outcome}}}(i, x, y)\): Number of edits at a specific position i that changed the underlying base x to base y

\({\text {edit}}(i, k)\): Indicator function for an edit occurring at position i in read k, where \({\text {edit}}(i, k) = 1\) if an edit occurred, and \({\text {edit}}(i, k) = 0\) otherwise

\({\text {outcome}}(i, k, x, y)\): Indicator function for an outcome change at position i in read k from base x to base y, where \({\text {outcome}}(i, k, x, y) = 1\) if such an outcome occurred, and \({\text {outcome}}(i, k, x, y) = 0\) otherwise

k: Individual read k, where k ranges from 1 to E
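To make this notation concrete, the following minimal sketch (with an invented toy reference and reads, not records from BE-dataHIVE) shows how the indicator functions \({\text {edit}}(i, k)\) and \({\text {outcome}}(i, k, x, y)\) can be computed from aligned reads and a reference sequence:

```python
# Toy illustration of the notation above. The reference and reads are invented;
# they are not records from BE-dataHIVE.
reference = "ACACGTAT"
reads = [
    "ACACGTAT",  # read 1: unedited
    "GCACGTAT",  # read 2: A -> G edit at position 0
    "GCGCGTAT",  # read 3: A -> G edits at positions 0 and 2
]
E = len(reads)  # total number of reads


def edit(i: int, k: int) -> int:
    """1 if read k carries an edit at position i (0-based here), else 0."""
    return int(reads[k][i] != reference[i])


def outcome(i: int, k: int, x: str, y: str) -> int:
    """1 if read k changed base x into base y at position i, else 0."""
    return int(reference[i] == x and reads[k][i] == y and x != y)


# E_pos(i): number of edits at each position across all reads
E_pos = [sum(edit(i, k) for k in range(E)) for i in range(len(reference))]
print(E_pos)  # [2, 0, 1, 0, 0, 0, 0, 0]

# E_outcome(0, 'A', 'G'): number of reads with an A -> G change at position 0
print(sum(outcome(0, k, "A", "G") for k in range(E)))  # 2
```

In practice, the read counts come from amplicon sequencing pipelines; the sketch only illustrates the bookkeeping behind the definitions.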

Efficiency rates (\(R_{{\text {eff}}}(s, e)\)) are defined as the proportion of reads with at least one edit within a given editing window, such as between bases 3 (s) and 10 (e) of the target, relative to the total number of reads (see, for example, [7]). One can think of the editing window as the subsection of the target sequence where the editing activity is strongest.$$\begin{aligned} R_{{\text {eff}}}(s, e) = \frac{E_{{\text {edited}}}(s, e)}{E} \end{aligned}$$
(1)
To account for the fact that a read is counted as edited if any position within the window is edited, we can define \(E_{{\text {edited}}}(s, e)\) as:$$\begin{aligned} E_{{\text {edited}}}(s, e) = \sum _{k=1}^{E} \left[ 1 - \prod _{i=s}^{e} (1 - {\text {edit}}(i, k)) \right] \end{aligned}$$
(2)
where \({\text {edit}}(i, k) = 1\) if there is an edit at position \(i\) in read \(k\), and 0 otherwise. The product term \(\prod _{i=s}^{e} (1 - {\text {edit}}(i, k))\) evaluates to 0 if there is any position \(i\) within the window \([s, e]\) where \({\text {edit}}(i, k) = 1\); if read \(k\) has no edits at any position in the window, the product is 1.

Bystander edit rates (\(R_{{\text {bystander}}}(i)\)) are defined as the number of edits at a given position \(i\) divided by the total number of reads \(E\), while bystander outcome rates (\(R_{{\text {outcome}}}(i, x, y)\)) are defined as the number of edits at a specific position \(i\) that changed the underlying base \(x\) to base \(y\), divided by \(E\). Thus, the bystander task has two possible forecasting targets, edit rates and outcome rates, and both are typically expressed as editing fractions.$$\begin{aligned} R_{{\text {bystander}}}(i) = \frac{E_{{\text {pos}}}(i)}{E} \end{aligned}$$
(3)
where \(E_{{\text {pos}}}(i) = \sum _{k=1}^{E} {\text {edit}}(i, k)\).$$\begin{aligned} R_{{\text {outcome}}}(i, x, y) = \frac{E_{{\text {outcome}}}(i, x, y)}{E} \end{aligned}$$
(4)
where \(E_{{\text {outcome}}}(i, x, y) = \sum _{k=1}^{E} {\text {outcome}}(i, k, x, y)\).

Edit prediction only indicates whether a base change occurred at a certain position (e.g., an edit at position 3), while outcome forecasts are more granular and give insight into the resulting base change (e.g., an edit at position 3 where A \(\rightarrow \) T), taking into account unexpected nucleotide alterations [3, 6,7,8,9,10,11].

Efficiency rates and bystander edit rates are closely related to each other: the efficiency rate is always less than or equal to the sum of the bystander edit rates, which can be shown mathematically. Using Eq. 3, the sum of bystander edit rates over the editing positions from \(s\) to \(e\) is:$$\begin{aligned} \sum _{i=s}^{e} R_{{\text {bystander}}}(i) = \sum _{i=s}^{e} \frac{E_{{\text {pos}}}(i)}{E} \end{aligned}$$
(5)
Substituting \(E_{{\text {pos}}}(i) = \sum _{k=1}^{E} {\text {edit}}(i, k)\):$$\begin{aligned} \sum _{i=s}^{e} R_{{\text {bystander}}}(i) = \sum _{i=s}^{e} \frac{\sum _{k=1}^{E} {\text {edit}}(i, k)}{E} \end{aligned}$$
(6)
Rearranging the summation:$$\begin{aligned} \sum _{i=s}^{e} R_{{\text {bystander}}}(i) = \frac{1}{E} \sum _{k=1}^{E} \sum _{i=s}^{e} {\text {edit}}(i, k) \end{aligned}$$
(7)
By comparing Eqs. 1, 2, and 7, one can see that \(\sum _{i=s}^{e} {\text {edit}}(i, k)\) is always greater than or equal to \(\left[ 1 - \prod _{i=s}^{e} (1 - {\text {edit}}(i, k)) \right] \), since the sum counts every edited position in read \(k\), while the bracketed term is 1 if at least one position is edited and 0 otherwise. Thus, the efficiency rate will always be less than or equal to the sum of the bystander edit rates over the same positions:$$\begin{aligned} R_{{\text {eff}}}(s, e) \le \sum _{i=s}^{e} R_{{\text {bystander}}}(i) \end{aligned}$$
(8)
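Building on the same kind of binary edit indicators, a minimal numerical sketch of Eqs. 1–3 and 8 (again with invented data rather than database records) illustrates the efficiency rate, the per-position bystander edit rates, and the inequality:

```python
import numpy as np

# Invented binary edit indicators: edits[k, i] corresponds to edit(i, k)
# for 5 reads over 12 positions (0-based). Not data from BE-dataHIVE.
rng = np.random.default_rng(0)
edits = rng.integers(0, 2, size=(5, 12))
E = edits.shape[0]

s, e = 2, 9  # editing window (0-based, inclusive)

# Eqs. 1-2: a read counts as edited if any position within [s, e] is edited
R_eff = np.mean(edits[:, s:e + 1].any(axis=1))

# Eq. 3: per-position bystander edit rates
R_bystander = edits.mean(axis=0)

# Eq. 8: the efficiency rate never exceeds the summed bystander rates
assert R_eff <= R_bystander[s:e + 1].sum() + 1e-12
print(R_eff, R_bystander[s:e + 1].sum())
```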
As a more intuitive way to understand this inequality, think of \(R_{{\text {eff}}}(s, e)\) as measuring the occurrence of at least one edit in the window, while \(\sum _{i=s}^{e} R_{{\text {bystander}}}(i)\) is the sum of the individual per-position rates, which can exceed 1 if multiple edits occur in the same read. Example calculations of efficiency and bystander rates can be found in Sect. “Machine learning use case”.

Data acquisition and processing

To ensure the inclusion of as many studies as possible and an extensive database, we analysed 723 unique publications from the base editing field. All publications with “base editing”, “base editor” or “base editors” in the title were retrieved from Google Scholar. The papers were then screened with Python for the data sources reported in the individual studies and manually reviewed afterwards. Additional data points were requested for several studies to ensure a comprehensive dataset.

Figure 1 reports the available bystander data points per study in descending order. After five publications, there is a noticeable drop in available data points. In addition, subsequent studies often lack critical data points, such as total read counts or efficiency rates, and present data in formats that are challenging to standardize and extract, as these data points are usually embedded in the underlying data for tables and charts in publications. Considering the data quality and the fact that the first five studies account for over 98% of available data points, we establish a cut-off after the fifth article.

From a machine learning perspective, the small number of data points offered by the subsequent studies would not have a meaningful impact on model training. Machine learning models rely on large, high-quality datasets to generalize well. The robust dataset from the first five articles provides a strong foundation for model development. To further grow and diversify the database, we encourage researchers to submit their data via our homepage.

The included studies, along with selected key metrics, are reported in Table 1. Table 2 stratifies selected metrics by base editor, and Table 3 details key metrics and statistics segmented by study.

Fig. 1 Available bystander data points per study in descending order

For the data processing, we follow three steps. First, the raw data files for all studies are downloaded from the corresponding journal and unpacked using Python. Second, the data files are mapped to a common format in which the bystander data is compressed to a single row, with editing outcomes at certain positions represented by individual columns. Third, additional factors, such as energy terms and melting temperatures, are incorporated into the database. All data processing was done in Python; the data processing script is available at https://github.com/Lucas749/be-datahive. More details about the database format are reported in Table 1 in the supplementary data.

Table 1 Overview of studies included in the base editing database with selected metrics

Table 2 Stratification of the database by base editors for selected metrics

Table 3 Stratification of the database by studies for selected metrics

Data enrichment

Physical energy terms

In line with Störtz and Minary [12], we add various interaction energy terms to the database to enrich the dataset. Based on Alkan et al.’s [13] approximate energy model for Cas9-gRNA-DNA binding, we compute multiple energy terms.
In total, 24 energy terms are added to the database based on different parameter combinations (see supplementary Table 2 for an overview). Furthermore, we use RNAfold [14] to compute the minimum free energy (MFE) secondary structure of the gRNA sequence (a minimal computational sketch follows the list below).

Although these energy metrics were originally developed for traditional Cas9 systems, base editing leverages a modified, inactive Cas protein, known as dead Cas (dCas), which does not cleave DNA. Despite this modification, we hypothesize that these energy terms will still enhance the predictive power of machine learning models in the field of base editing. The incorporation of these energy terms provides several advantages:

Enhanced Predictive Accuracy: Energy terms offer quantitative insights into the stability and efficiency of gRNA-DNA binding interactions, potentially allowing models to better forecast the likelihood of successful base editing events.

Robustness Across gRNA Variations: Energy terms help models generalize across different gRNA sequences and target sites by providing a consistent measure of interaction strength and stability, thereby potentially enhancing model robustness.
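The MFE computation mentioned above can be reproduced with the ViennaRNA Python bindings (the library behind RNAfold). This is a minimal sketch under the assumption that the bindings are installed; the spacer sequence is only an example:

```python
# Minimal sketch of the RNAfold-based MFE computation, using the ViennaRNA
# Python bindings (the library behind RNAfold). Assumes the bindings are
# installed, e.g. via `pip install ViennaRNA`.
import RNA

grna = "GGACCGUCGAAAAUGGGCCU"  # example 20 nt gRNA spacer written as RNA (T -> U)

# RNA.fold returns the MFE secondary structure (dot-bracket notation)
# and its free energy in kcal/mol.
structure, mfe = RNA.fold(grna)
print(structure, mfe)
```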

Future research will further investigate the importance and impact of these energy features on the performance of machine learning models for base editing.

Melting temperature

In addition, we add the melting temperatures of the 20 nt target sequence and the gRNA to the database. The melting temperature is computed via the Biopython MeltingTemp module [15] using the \(Tm\_NN\) function, which calculates the temperature based on nearest-neighbour thermodynamics and corrects, amongst others, for mismatches, dangling ends, and salt concentration. Melting temperatures are used in some base editing prediction models, such as those of Pallaseni et al. [3] and Arbab et al. [6].

Data representation

There exist many methods to encode sequence data and process information efficiently. Besides traditional methods, such as one-hot encoding or k-mers, novel approaches such as BASiNET [16] and Hilbert curve encodings [17, 18] have been developed. Our database is designed to easily integrate any type of encoding method, independent of the underlying complexity, and can be extended with more encoding techniques.

Currently, the database supports two main encoding methods: one-hot encoding and Hilbert curve encoding. We chose these two approaches to illustrate the flexibility and capability of our database and to offer users one standard encoding approach (one-hot encoding) and a more novel and complex encoding framework from the field of imaging (Hilbert curve encoding) that has produced promising results in machine learning models [17, 18].

One-hot encoding

One-hot encoding is a commonly used preprocessing method to convert categorical data into a format that can be understood by machine learning algorithms. Each category value is converted into a new binary feature that takes a value of 1 for its respective category and 0 for all others. In a DNA sequence context, the nucleotides ’A’, ’T’, ’C’, and ’G’ can be one-hot encoded as [1,0,0,0], [0,1,0,0], [0,0,1,0], and [0,0,0,1], respectively.

Hilbert curve encoding

Sequences are represented as a Hilbert curve image [19]. This imaging approach converts multi-dimensional data into one-dimensional data while preserving locality, meaning that points that are close in higher dimensions remain close when mapped to the Hilbert curve. Using Hilbert curves allows DNA sequences to be represented in a two-dimensional space while preserving the locality of the nucleotides. Each point on the Hilbert curve corresponds to a specific nucleotide in the sequence. The curve covers every point in a square grid whose side length is a power of 2. A practical example of the generation of a Hilbert curve encoding can be found in the Appendix. For a detailed explanation of Hilbert curves and the exact construction methodology, we refer to Anjum et al. [17].

Technical implementation

The database consists of four main components: a mySQL database, a Node.js server for REST API queries, a Python wrapper for the API, and our website https://be-datahive.com/. The setup is illustrated in Fig. 2. The REST server, using Node.js as its runtime framework, provides data for the website and can also be accessed directly by users to serve individual queries. The website is written from scratch using CSS, HTML, and JavaScript. API calls to the database are made via JavaScript’s asynchronous fetch method. Our setup enables highly individualized and fast data queries and offers users two interfaces – our website and the API.
Furthermore, the framework is easily expandable to accommodate and incorporate various data views, especially for machine learning applications.

Fig. 2 Illustration of the technical implementation of BE-dataHIVE

Website interface

Our website https://be-datahive.com/ provides a convenient way for practitioners to look up guides, targets, and base editor efficiency rates as well as bystander outcomes (see Fig. 3). For example, if a lab would like to investigate the bystander activity of a certain guide RNA, they can search for the specific guide as well as for similar sequences via the search feature on our web page.

The website offers the following features:

Browsing and searching by gRNA, base editor, and cell lines

Access to statistics and analytics for bystander and efficiency data

Charting of data

Direct csv download

API to interact with the mySQL database for customised data requests

Fig. 3 Web interface of BE-dataHIVE. Experiments can be filtered on the home screen (a) and bystander data can be examined (b)

Application programming interface

The database can be accessed via a REST API that enables easy access to the underlying data as well as a flexible way to interact with the database. The data can be filtered and modified directly via the API and accessed from any programming language that supports HTTP requests. The API will be particularly relevant for data scientists and machine learning researchers, as it provides the flexibility to filter and retrieve the desired data directly from the server without any intermediary steps. The API documentation can be found at https://be-datahive.com/documentation.html.

Python API wrapper

To provide a simple way to obtain data directly in Python, a widely used programming language for machine learning, we wrote the Python library be_datahive, which handles all API requests and data handling. Data can be retrieved directly via the package. Furthermore, the wrapper implements some basic machine learning data handling routines, such as the creation of a labelled dataset. be_datahive is available on GitHub and PyPI. Detailed usage examples are showcased on our GitHub.

Machine learning use case

Base editing features two types of prediction tasks, namely efficiency rate and bystander predictions. The former predicts the overall editing efficiency, while the latter aims at predicting bystander edit rates or the more informative bystander outcome rates (as defined above). BE-dataHIVE facilitates easy access to training data for both types of prediction tasks: the required machine-learning-ready data can simply be obtained, and multiple feature encodings are available out of the box. Python users can use our Python API wrapper, which returns the requested data in a few lines of code. Practical coding examples for training machine learning models with our database are available on our Python wrapper GitHub. The following section illustrates the structure of one data point based on a specific example to make it easier for readers to grasp the data format. Assume that one labelled data point has the format (X, y), where y is the prediction target one aims to predict based on X, which denotes the features. Data points for the efficiency rate, bystander edit rate, and bystander outcome rate differ in terms of y, so we give an example data point for each rate, all sharing the same X illustrated in Eq. 9. Using our database, we retrieve 33 features for X, such as melting temperatures, energy terms, and the gRNA. One-hot and Hilbert curve encodings are available for all nucleic acid sequence fields.$$\begin{aligned} X = \begin{bmatrix} \text {gRNA} \\ \text {PAM Sequence} \\ \text {Full Context Sequence Padded} \\ \text {gRNA Sequence Match} \\ \text {Cell} \\ \text {Base Editor} \\ \text {Melt Temperature gRNA} \\ \text {Melt Temperature Target} \\ \text {Energy 1} \\ \vdots \\ \text {Energy 24} \\ \text {Free Energy} \end{bmatrix} = \begin{bmatrix} \text {GGACCGTCGAAAATGGGCCT} \\ \text {GGG} \\ \text {NTCCAATATC}\ldots \\ \text {TRUE} \\ \text {K562} \\ \text {BE4-1} \\ 56.96 \\ 56.96 \\ 0 \\ \vdots \\ 27 \\ -4.6 \end{bmatrix} \end{aligned}$$
(9)
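As a rough illustration of how the sequence fields in Eq. 9 can be turned into model inputs, the following sketch one-hot encodes the example gRNA and appends a few of the numeric features. The layout is an assumption chosen for illustration; the ready-made one-hot and Hilbert curve encodings shipped with the database may organise features differently.

```python
import numpy as np

# Rough illustration of turning the Eq. 9 example into a flat feature vector.
# The layout is an assumption for illustration; the database's ready-made
# one-hot and Hilbert curve encodings may organise features differently.
NUCLEOTIDES = "ATCG"  # A=[1,0,0,0], T=[0,1,0,0], C=[0,0,1,0], G=[0,0,0,1]


def one_hot(seq: str) -> np.ndarray:
    mat = np.zeros((len(seq), 4))
    for pos, base in enumerate(seq):
        if base in NUCLEOTIDES:  # unknown bases such as 'N' stay all-zero
            mat[pos, NUCLEOTIDES.index(base)] = 1.0
    return mat.ravel()


grna = "GGACCGTCGAAAATGGGCCT"   # gRNA from the example data point
numeric = [56.96, 56.96, -4.6]  # melt temp gRNA, melt temp target, free energy

X = np.concatenate([one_hot(grna), numeric])
print(X.shape)  # (83,) = 20 positions x 4 channels + 3 numeric features
```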
Efficiency rate prediction task

The efficiency data can be obtained by calling the endpoint “efficiency” in our API or Python wrapper. In the efficiency rate prediction task, we are forecasting a single number, which for our example (X) is$$\begin{aligned} y = \begin{bmatrix} \text {Efficiency Full gRNA Reported} \\ \end{bmatrix} = \begin{bmatrix} 0.9840 \\ \end{bmatrix} \end{aligned}$$
(10)
Bystander edit rate prediction task

The endpoint “bystander” yields both bystander edit and outcome data (see below). For bystander edit rate prediction, the target matrix is of size \(1 \times m\), containing editing fractions (number of edits divided by total reads) for every position. In our data point, \(m = 42\) and positions are determined relative to the start of the gRNA, meaning that position \(-1\) indicates the base immediately before the start of the gRNA sequence (see Eq. 11). If at most one position were edited per read, we could simply sum up this vector to obtain the efficiency rate. However, multiple edits can occur at different positions within the same read, weakening this link between efficiency and bystander edit rate (see Eq. 8).
(11)
Bystander outcome rate prediction task

Bystander outcome forecasts have as their target a matrix of size \(n \times m\), containing editing fractions (number of edits divided by total reads) for every combination of position and outcome. Here, m refers to the editing positions around the target sequence, while n represents the possible editing outcomes A, T, C, or G; therefore, the matrix has four rows (\(n=4\)). Equation 12 shows the example y for editing outcomes, which is a \(4 \times 42\) matrix. Based on the example data, we can see that in 0.18% of the total reads the base at position \(-9\) is edited to an A nucleotide.
(12)
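To illustrate the shape of these two bystander targets, here is a small invented example (a much shorter sequence than the \(m = 42\) window in the database example) that builds the \(1 \times m\) edit-rate vector and the \(4 \times m\) outcome-rate matrix from toy reads:

```python
import numpy as np

# Toy illustration of the two bystander targets: a 1 x m edit-rate vector and a
# 4 x m outcome-rate matrix (rows indexed by the resulting base A, T, C, G).
# Reads and reference are invented and m is kept small for brevity.
reference = "ACACGTAT"
reads = ["ACACGTAT", "GCACGTAT", "GCGCGTAT", "ACACGTGT"]
E, m = len(reads), len(reference)
BASES = "ATCG"

edit_rates = np.zeros(m)          # R_bystander(i) for every position i
outcome_rates = np.zeros((4, m))  # R_outcome(i, x, y); x is the reference base at i

for read in reads:
    for i, (ref_base, read_base) in enumerate(zip(reference, read)):
        if read_base != ref_base:
            edit_rates[i] += 1 / E
            outcome_rates[BASES.index(read_base), i] += 1 / E

print(edit_rates)                          # e.g. 0.5 at position 0 (2 of 4 reads edited)
print(outcome_rates[BASES.index("G"), 0])  # A -> G fraction at position 0: 0.5
```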
To demonstrate the utility of our database and the significance of the newly incorporated features, such as energy terms and melting temperatures, we train a machine learning model to predict efficiency rates using various feature combinations. We employ a Gradient Boosting Regression model with a learning rate of 0.1, a maximum depth of 3, and 100 boosting stages. The model is trained on \(80\%\) of the dataset using a five-fold cross-validation approach, with Spearman correlation used to score performance. We use the same model and training approach while varying the feature combinations. We create the following feature groups and test all four combinations (a minimal training sketch follows the list below):

1. Baseline: Includes only the one-hot encoded gRNA and full context sequence.

2. Energy Terms: Includes all physical energy terms for the sequence and gRNA.

3. Melting Temperature: Includes all melting temperatures for the sequence and gRNA.
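The following sketch shows one plausible wiring of this training setup with the stated hyperparameters; the feature matrix and targets are random placeholders standing in for data retrieved from BE-dataHIVE, and details such as the exact cross-validation scheme are assumptions rather than the published pipeline:

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold, train_test_split

# Random placeholders standing in for encoded features and efficiency rates
# retrieved from BE-dataHIVE (e.g. via the be_datahive wrapper).
X = np.random.rand(1000, 120)
y = np.random.rand(1000)

# Hold out 20% of the data; run five-fold cross-validation on the remaining 80%.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

scores = []
for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X_train):
    model = GradientBoostingRegressor(learning_rate=0.1, max_depth=3, n_estimators=100)
    model.fit(X_train[train_idx], y_train[train_idx])
    rho, _ = spearmanr(y_train[val_idx], model.predict(X_train[val_idx]))
    scores.append(rho)

print("mean cross-validated Spearman correlation:", np.mean(scores))
```

Repeating this loop for each feature group combination yields the comparisons reported below.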

Figure 4 shows the improvement of different feature combinations over the baseline model, using Spearman correlation to assess performance. Models with an enhanced feature set outperform the baseline across ABEs and CBEs. The greatest performance improvements (a \(6.0\%\) higher Spearman correlation for ABEs and a \(4.2\%\) increase for CBEs compared to the baseline) are achieved using both energy terms and melting temperatures. Using energy terms alone improves performance for ABEs and CBEs by \(4.9\%\) and \(3.0\%\), respectively. Melting temperatures increase the correlation by \(0.3\%\) for ABEs and \(0.2\%\) for CBEs. Overall, incorporating energy terms and melting temperatures into the feature set results in a substantial increase in Spearman correlation across ABE and CBE base editors. This systematic case study highlights the potential of the newly added features in our database to enhance the predictive power of base editing models.

Fig. 4 Percentage improvement of the Spearman correlation over the baseline feature set with respect to various feature combinations
