TIANA: transcription factors cooperativity inference analysis with neural attention

Overview of TIANA

Figure 1A shows the overall model architecture and training strategy. TIANA consists of two primary steps. In the first step (the Training Step), it employs a deep learning framework to classify positive sequences (such as ATAC-seq and ChIP-seq peak sequences) from size-matched negative sequences (random genomic sequences). This classification step uses a neural network architecture consisting of one convolutional layer preloaded with known motif information and one self-attention layer. Attention heads allow the model to focus on different aspects of the input sequence simultaneously, enabling it to capture diverse patterns and relationships. In the second step (the Interpretation Step), TIANA infers TF cooperativities using integrated gradients, which compute gradients on the attention heads (denoted as attention attributes) of the fully trained model from the previous step. TIANA quantifies these cooperativities by first identifying active motifs in the input sequences and then linking active motifs using the attention attributes. Lastly, TIANA compares attention attributes between positive and negative sequences to identify motif pairs with significant attention levels and thereby infer TF cooperativities.

A schematic overview of the files required to run TIANA can be found in Figure S1. Briefly, a peak file, such as a bed-format file from a ChIP-seq/ATAC-seq experiment, is required from the user. TIANA provides motif files as well as the random genomic sequences used during training. The output of TIANA consists of an HTML report of the significance levels of motif pairs, a corresponding csv file, and a list containing all identified motif pairs and their corresponding attention attributes.

Fig. 1 Overview of the TIANA implementation. A TIANA consists of two main steps. In the first Training Step, TIANA learns to classify one-hot encoded positive sequences from randomly generated negative sequences. Next, in the Interpretation Step, TIANA utilizes integrated gradients to compute attention attribute matrices from the MHA block, enabling the inference of co-occupancy of TFs. B In the Interpretation Step, active motifs are identified using motif score thresholds. As shown in the cartoon example in B, for a positive sequence, the TF1 motif ends at 80 bp, while the TF4 motif ends at 150 bp, colored in red. The attention attribute connecting 80 bp and 150 bp, highlighted in blue, is used to associate the two active motifs at these respective locations (shown as TF1 and TF4 linked with a blue edge). For each motif pair, TIANA compares the attention levels between all positive sequences and negative sequences to determine whether the motif pair exhibits significantly higher attention in positive sequences than in random negative sequences.

Architecture and training

TIANA was implemented using TensorFlow v2.9.1. The first component of TIANA is a binary classification deep learning framework. TIANA takes training data consisting of positive sequences and negative sequences. Positive sequences are enhancers, open chromatin, or TF binding peak sequences. Negative sequences are noncoding random genomic sequences. Positive and negative sequences were one-hot encoded as A: [1,0,0,0]; C: [0,1,0,0]; G: [0,0,1,0]; T: [0,0,0,1]. All input sequences are 200 bp in size unless otherwise specified, and TIANA can tolerate input sizes up to 1000 bp.
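For reference, the encoding described above can be reproduced in a few lines of NumPy. The sketch below is purely illustrative; the one_hot_encode helper is not part of TIANA's code base, and the handling of ambiguous bases (all-zero rows) is an assumption.

```python
import numpy as np

# Mapping described above: A, C, G, T -> rows of the 4x4 identity matrix.
BASE_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}

def one_hot_encode(seq):
    """Convert a DNA string (e.g. a 200 bp peak sequence) to an (L, 4) array.
    Ambiguous bases such as N are left as all-zero rows (an assumption)."""
    encoded = np.zeros((len(seq), 4), dtype=np.float32)
    for i, base in enumerate(seq.upper()):
        idx = BASE_INDEX.get(base)
        if idx is not None:
            encoded[i, idx] = 1.0
    return encoded

# Example with a toy 8 bp sequence instead of a full 200 bp peak.
print(one_hot_encode("ACGTACGT").shape)   # (8, 4)
```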
For the data used in this study, chr1, chr8, and chr18 were held out as the evaluation testing set; the other chromosomes were used for training. Training performance for the datasets used in this study is listed in Supplementary Table 1.

The first layer of the TIANA model is a one-dimensional convolution layer that uses filters to identify short nucleotide patterns. The convolution layer is preloaded with known motif position-specific score matrices (PSSMs) and set to be fixed. We obtained known motifs from the JASPAR motif library [13, 14]. To reduce the redundancy of motifs, we used a hierarchical clustering function from TBA to group highly similar motifs [15]. The Biopython motif module was used to convert JASPAR-format motifs into PSSMs, which were loaded into the convolution layer, and to compute the motif score cutoffs. Unless otherwise specified, the convolution layer used in this study consists of 448 filters, comprising 224 motifs and their reverse complement motifs.

We used a vertical 1D max-pooling layer with a pooling size of 2 after the convolution layer to capture the highest activation score for each forward-reverse complement motif pair. By default, TIANA combines the highest forward and reverse complement motif scores at each position. TIANA also provides an optional approach that computes forward and reverse complement motif scores separately, enabled by raising the --compute_rc flag when calling tiana.py. The vertical max-pooling layer was followed by a batch normalization layer and a 1D horizontal max-pooling layer with a pool size of 4 to reduce the computational overhead and improve model generalizability [16]. A detailed TIANA model summary can be found in Supplementary Table 2. For the examples used in this study, TIANA was trained for 50 epochs.

Positional encoding

We used a positional encoding method based on sine and cosine functions to preserve the positional information [17]. Suppose that the input from the previous layer is \(X_{\text{input}} \in R^{m \times f}\), where \(m\) is the sequence size and \(f\) is the total number of motifs/filters. The positional encoding matrix \(P \in R^{m \times f}\) shares the same dimensions as \(X_{\text{input}}\). For each element \(P_{i,2j}\) or \(P_{i,2j+1}\), the positional encoding values are computed as:

$$P_{i,2j} = \sin\left(\frac{i}{n^{2j/f}}\right)$$

$$P_{i,2j+1} = \cos\left(\frac{i}{n^{2j/f}}\right)$$

where \(n\) is a scalar factor with a default value of 10,000. The output matrix can be written as \(X_{\text{output}} = X_{\text{input}} + P\).

Multi-head attention unit

The output of the positional encoding was then fed into the multi-head attention (MHA) block, which consists of Query (\(Q\)), Key (\(K\)), and Value (\(V\)) matrices. The QKV matrices provide the framework for the attention mechanism, with the Query matrix generating queries, the Key matrix providing contextual information, and the Value matrix holding the actual content, collectively facilitating effective attention computation and learning. TIANA uses one MHA layer with four attention heads. For each attention head \(h\), the attention score matrix \(S_h \in R^{m \times m}\) is computed as:

$$S_h = softmax\left(\frac{QK^{T}}{\sqrt{d_k}}\right)$$

where \(d_k\) is the dimension of the Key matrix \(K\). The output of the MHA unit is computed by taking the dot product of each attention score matrix \(S_h\) with the value matrix \(V\):

$$O_h = S_h V$$

The output of the MHA unit was then fed into a flatten layer, followed by two fully connected layers with 256 and 64 units, respectively. The output of the fully connected layers was then fed into a single-unit output layer with a sigmoid activation function for binary classification.
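For concreteness, the layer stack described in this section can be sketched with the tf.keras functional API as below. This is a simplified illustration, not TIANA's implementation: the filter width, the dense-layer activations, the adjacency of forward/reverse-complement filter pairs, and the optimizer settings are assumptions, and the frozen, randomly initialized convolution weights stand in for the preloaded motif PSSMs.

```python
import numpy as np
import tensorflow as tf

SEQ_LEN, N_FILTERS, MOTIF_LEN = 200, 448, 20   # MOTIF_LEN is an assumed filter width

def positional_encoding(length, depth, n=10000.0):
    """Sine/cosine encoding: P[i,2j] = sin(i / n^(2j/depth)), P[i,2j+1] = cos(i / n^(2j/depth))."""
    pos = np.arange(length)[:, None]
    col = np.arange(depth)[None, :]
    angles = pos / np.power(n, (2 * (col // 2)) / depth)
    pe = np.where(col % 2 == 0, np.sin(angles), np.cos(angles))
    return tf.constant(pe, dtype=tf.float32)

inputs = tf.keras.Input(shape=(SEQ_LEN, 4))                       # one-hot DNA sequence
# Motif-scanning convolution; in TIANA the filter weights are preloaded with
# motif PSSMs and frozen. Here the weights are only frozen as a placeholder.
x = tf.keras.layers.Conv1D(N_FILTERS, MOTIF_LEN, padding="same",
                           trainable=False, name="motif_scan")(inputs)
# "Vertical" max-pooling over each forward/reverse-complement filter pair,
# assuming paired filters sit next to each other along the channel axis.
x = tf.keras.layers.Reshape((SEQ_LEN, N_FILTERS // 2, 2))(x)
x = tf.reduce_max(x, axis=-1)
x = tf.keras.layers.BatchNormalization()(x)
x = tf.keras.layers.MaxPool1D(pool_size=4)(x)                     # horizontal pooling
x = x + positional_encoding(x.shape[1], x.shape[2])               # add positional information
x = tf.keras.layers.MultiHeadAttention(num_heads=4,
                                       key_dim=x.shape[2])(query=x, value=x, key=x)
x = tf.keras.layers.Flatten()(x)
x = tf.keras.layers.Dense(256, activation="relu")(x)              # activation is an assumption
x = tf.keras.layers.Dense(64, activation="relu")(x)
outputs = tf.keras.layers.Dense(1, activation="sigmoid")(x)       # binary classification head
model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["AUC"])
```

Keeping the convolution weights fixed to known PSSMs is what allows the downstream attention to be read at the level of interpretable motif activations rather than arbitrary learned filters.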
Integrated gradients

We used integrated gradients on the attention matrix \(S\) to measure TF cooperativities. We denote \(G \in R^{m \times m}\) as the integrated-gradients attention attribute matrix for each attention score matrix \(S\). Using the methodologies described previously [11, 18], \(G\) can be approximated as:

$$G = \frac{S}{a} \odot \sum_{k=1}^{a} \frac{\partial F\left(\frac{k}{a}S\right)}{\partial S}$$

where \(a\) is the number of steps in the Riemann approximation (default value 100), \(F\) is TIANA's classification function, \(\frac{\partial F}{\partial S}\) is the gradient of \(F\) with respect to the attention score matrix \(S\), and \(\odot\) denotes element-wise multiplication.

Quantifying transcription factor cooperativities

Figure 1B provides a schematic of the process used to quantify TF motif cooperativities after the Training Step. First, for each sequence, active motifs are identified as those whose activation values from the convolution layer exceed the motif score thresholds computed previously with the Biopython motif module at a p-value threshold of 1e−4. The positional information of activated motifs is preserved in the attention attribute matrices \(G\). Next, for each sequence, TIANA computes the max attention attributes by taking the position-wise maximum over the attention attribute matrices across all attention heads, producing a matrix \(G^{max} \in R^{m \times m}\). The max attention attribute value \(G_{i,j}^{max}\) is used to link two active motifs at positions \(i\) and \(j\), for all \(i \in \{1,2,\dots,m\}\) and \(j \in \{1,2,\dots,m\}\). In the cartoon example shown in Fig. 1B, the TF1 and TF4 motifs are active at positions 80 bp and 150 bp, so TF1 and TF4 are linked using the attention attribute value \(G_{80,150}^{max}\) (blue edge connecting TF1 and TF4 in Fig. 1B). For each sequence, if a TF motif is active at multiple locations, so that a motif pair has multiple attention attributes, the highest attribute value is used for the subsequent significance testing step. Summary statistics, including the number of peaks used in each dataset and the number of motif pairs with attention attributes, are listed in Supplementary Table 3.

Identifying significant motif pairs

For each pair of TF motifs, we compare the attention attributes across all peaks between positive and negative sequences. To eliminate noise from low-quality predictions, positive peaks with a sigmoid activation of less than 0.7 and negative peaks with an activation higher than 0.3 from the Training Step were removed from the significance testing. We used the Mann–Whitney U test to assess the significance of the difference in attention attributes of a TF pair between positive and negative sequences. All p-values were corrected using the Bonferroni correction method. The output of TIANA reports all significant (p-value < 1e−4) TF pairs from the positive sequences, sorted by p-value.
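To make the testing step concrete, the following sketch applies Bonferroni-corrected Mann–Whitney U tests to per-pair attention attributes. The pos_attrs and neg_attrs dictionaries are hypothetical containers (motif pair mapped to the list of max attention attributes per sequence), and the one-sided alternative reflects the "higher attention in positive sequences" criterion described above rather than a documented TIANA setting.

```python
from scipy.stats import mannwhitneyu

def significant_motif_pairs(pos_attrs, neg_attrs, alpha=1e-4):
    """pos_attrs / neg_attrs: dict mapping a (motifA, motifB) pair to the list
    of max attention attributes collected from positive / negative sequences."""
    n_tests = len(pos_attrs)                       # number of motif pairs tested
    results = []
    for pair, pos_values in pos_attrs.items():
        neg_values = neg_attrs.get(pair, [])
        if not pos_values or not neg_values:
            continue
        # One-sided test: attention attributes higher in positive sequences (assumed).
        _, p = mannwhitneyu(pos_values, neg_values, alternative="greater")
        p_adj = min(p * n_tests, 1.0)              # Bonferroni correction
        if p_adj < alpha:
            results.append((pair, p_adj))
    return sorted(results, key=lambda x: x[1])     # sorted by corrected p-value
```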
Sequencing data processing

ATAC-seq and H3K27ac ChIP-seq data for KLA-activated BMDMs were downloaded from GEO (accession: GSE109965) [19]. LDTF ChIP-seq data for ATF3, SPI1, and CEBPB were downloaded from GEO (accessions: GSE111856 and GSE109965) [15, 19]. ETS1 and RUNX1 co-binding data were downloaded according to Martin et al. [20] from ENCODE (ID: ENCSR588AKU for RUNX1 and ID: ENCSR000BKQ for ETS1). Oct4/Sox2 ChIP-seq data were downloaded from Malik et al. [21] (accession: GSE103980). TF co-binding ChIP-seq data were obtained from the ENCODE data portal, as described in Shen et al. [7]. Sequencing reads were mapped using Bowtie2, using either human GRCh38/hg38 or mouse GRCm38/mm10 as the reference genome [22]. We used HOMER findPeaks to identify peaks and used IDR to obtain reproducible peaks [23]. Peaks were resized to 200 bp and converted to bed format before being converted into one-hot encoding for training and testing with TIANA.

Enhancers were called from ATAC-seq and H3K27ac ChIP-seq data using a method similar to that described previously [24]. First, we identified reproducible open chromatin peaks using the ATAC-seq data. Next, we used the HOMER annotatePeaks.pl function to identify peaks with > 16 H3K27ac tag counts. Finally, we selected regions with a > 2.5-fold change in H3K27ac and marked them as "KLA-activated" or "KLA-repressed" peaks.

Comparative methods

We compared TIANA to SATORI, another MHA-based model used to infer TF cooperativities. We followed the method described in SATORI to extract the attention score matrix, which we compared with the attention attribute matrix from TIANA. Since the attention scores (SATORI) and attention attributes (TIANA) cannot be directly compared due to their different data distributions, we converted both the attention scores and the attention attributes to ranks for comparison.
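A minimal sketch of that rank conversion with SciPy is shown below; the arrays are toy values, and summarizing agreement with a Spearman correlation is an added assumption about how the ranked values might be compared.

```python
import numpy as np
from scipy.stats import rankdata, spearmanr

# Hypothetical per-motif-pair values from the two methods (illustrative only).
satori_attention_scores = np.array([0.91, 0.12, 0.55, 0.33, 0.78])
tiana_attention_attrs = np.array([0.04, 0.001, 0.02, 0.01, 0.03])

# Convert both to ranks so values on different scales become comparable.
satori_ranks = rankdata(satori_attention_scores)
tiana_ranks = rankdata(tiana_attention_attrs)

# One possible rank-based summary of agreement between the two methods.
rho, _ = spearmanr(satori_attention_scores, tiana_attention_attrs)
print(satori_ranks, tiana_ranks, rho)
```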
