An FPGA-based hardware accelerator supporting sensitive sequence homology filtering with profile hidden Markov models | BMC Bioinformatics

The HAVAC hardware accelerator was implemented on a Xilinx Alveo U50 FPGA accelerator card. The Alveo U50 is a PCI-e card that is simple to install and is modestly priced ($2,965 at time of writing). The Alveo U50 contains a custom-built Ultrascale+ FPGA with approximately 874K lookup tables (LUTs) and supports two 4GiB banks of High Bandwidth Memory (HBM). The HAVAC hardware design was implemented using Vitis High-Level Synthesis (HLS), a tool for synthesizing designs from compliant C/C++ codebases along with FPGA-specific #pragma instructions. Vitis HLS can allow for significantly faster development compared to traditional hardware description languages (HDLs), at a small cost to implemented resource utilization and performance [51]. The HAVAC host (CPU) driver code was implemented in C++ using the Xilinx Runtime Library (XRT).Profile HMM scoresAs in the nhmmer SSV filter, HAVAC computes the scores of the Dynamic Programming (DP) matrix using 8-bit integer emission scores. For a given pHMM, a threshold score t is generated representing the minimum score required for a target database sequence to pass the requested P-value target (default: \(P\le 0.02\)) using the pHMM’s pre-computed gumbel distribution parameters [31] with adjustments to account for SSV’s removed state transitions. HAVAC uses the threshold to generate a scaling factor \(\tau\) on the pHMM’s emission scores. The purpose of \(\tau\) is to reproject the emission scores such that an accumulated score passes the query-specific threshold if and only if the score reaches 256.$$\begin{aligned} \tau = 256/t \end{aligned}$$The purpose of using \(\tau\) to re-project the emission scores is twofold. The primary reason is to simplify the hardware accelerator’s cell Processing Elements (PEs). By enforcing a threshold score of 256, the cell PEs can check their score against the threshold by using an 8-bit full adder’s carry bit, eschewing the need for a full 8-bit comparator to check for hits. As a result, cell PEs require fewer resources to implement, which allows more PEs to fit into the hardware to improve parallelism. As an added bonus, using \(\tau\) to reproject the emission scores allows for better use of the 8-bit integer space, allowing for slightly more precision in the emission scores when compressed down to 8-bit integers.The HMMER3 pHMM format stores emission scores as negative log-likelihood values, and these scores must be converted to bits before being projected using \(\tau\). To convert a negative log-likelihood score s to a \(\tau\)-reprojected score \(s’\) in bits, the score is represented as a single-precision floating-point value and extracted from negative log-likelihood space, divided by the background distribution of equal probability for each of 4 nucleotides, converted to bits, and then multiplied by \(\tau\).$$\begin{aligned} s^{\prime } = log2(e^{-s} / (1/4) ) * \tau \end{aligned}$$The above equation involves performing exp() and log() function calls on each emission score, which are slow on modern hardware. We eliminate the need to perform these expensive functions by simplifying as follows.$$\begin{aligned} s^{\prime }&= log2(4e^{-s} ) * \tau \\ s^{\prime }&= (log2(4) – s*log2(e)) * \tau \\ s^{\prime }&= 2\tau – s\tau {}log2(e) \end{aligned}$$In the resulting equation, the terms \(2\tau\) and \(\tau log2(e)\) can be computed as constants over all match states of the model; this means that all match states can be computed with a single multiplication and subtraction. Each re-projected emission score s\(\prime\) is then rounded to the nearest integer, and cast to signed 8-bit integers. Each individual pHMM in the input.hmm file is projected in this way using the same P-value and the data are appended together and written to the FPGA’s HBM memory banks.SSV processor designDefine M(i, j) as the score of the maximum scoring ungapped Viterbi alignment ending at the ith letter of the sequence and the jth position of the pHMM. For a sequence of length L and a pHMM of length K, this requires computation of LK cells. The HAVAC kernel on the FPGA is comprised mainly of a systolic array of n cell PEs that each individually compute a single cell of the DP matrix each cycle. If \(L>n\), then the sequence is procedurally broken into length-n segments. Each cell PE in the systolic array uses a different letter from the sequence to find the match score for the current pHMM position and adds it to the score computed by the previous cell PE on the previous cycle. Initially, the systolic array computes the cells M(1, 1) to M(n, 1) in the DP matrix. The model position is then incremented and the systolic array computes M(1, 2) to M(n, 2), and so on until a full pass through the pHMM has been completed. Figure 3 shows how the sequence is broken down into segments of length n, and how the SSV processor computes rows of length n down the DP matrix. If \(L>n\), then a new segment of the sequence is loaded into the cell PEs and another pass through the pHMM begins. These passes through the pHMM compute columns of width n through the DP matrix until the entire matrix has been calculated. Sequences that are not a multiple of N in length are padded with random data to reach that target; SSV matches to these random sequences will be rare, and are easily filtered out by host-side driver.Fig. 3SSV dynamic programming matrix. The Dynamic Programming matrix for the SSV model. Here, the sequence is comprised of 3 segments of length \(N=4\). The light colored row shows the cells being computed this cycle. The fuchsia arrows indicate the data from the cells \(M(i-1, j-1)\). The blue arrows indicate the cells that will use the data generated this cycle to compute \(M(i+1,j+1)\)When the processor is running, n-length sequence segments are read asynchronously from memory and inserted into a hardware FIFO until the entire sequence is consumed. Similarly, the pHMM is read asynchronously from memory in full for each sequence segment, and inserted into its own FIFO. When the processor begins to compute a pass through a sequence segment, a full segment is consumed from the sequence FIFO to set each cell’s sequence symbol. Once the sequence is loaded into the cell PEs, the full pHMM is consumed from the FIFO, one model position at a time. Figure 4 shows a simplified view into the overall HAVAC hardware accelerator. Figure 5 shows the systolic array of cell PEs that compute the cells of the DP matrix.Fig. 4HAVAC hardware design overview Diagram of the overall design of the HAVAC hardware accelerator. Sequence length and model length are provided as inputs. The hardware then asynchronously loads length n segments of the sequence from HBM. For each of those segments, the model is loaded from HBM. These data are fed to the SSV processor, which calculates the cells of the DP matrix. The score queue module facilitates the transfer of scores from the last column of one sequence segment pass to the first of the next. The vector of bits representing any threshold hits are passed to the hit reporter module, which checks for any hits, determines the model and sequence positions of the hits, and writes the hits to HBMFig. 5Systolic array of N cell PEs. Each cell PE receives the score that was computed in the previous PE on the previous cycle. All cells use the same pHMM vector on a given cycle, and advance to the next vector on the subsequent cycle. Each cell uses a different symbol from inside the contiguous sequence segment to select the correct emission score for the PE’s corresponding DP matrix cellFor any given sequence segment, the leftmost cell of the systolic array of cell PEs requires input scores from the previous sequence segment column (except the first segment, which requires inputs of 0 for each cell in the first column of the matrix). Likewise, the scores generated by the rightmost cells in the segment column must be used as inputs for the next segment column of the matrix. A score queue module that implements a hardware FIFO structure is used to enqueue the outputs of the rightmost cell in a segment column, and dequeue scores as needed by the next segment column. The score queue module uses a series of 36kb block RAM elements to implement the hardware FIFO. Since hardware block RAMs have a definitive capacity, the score queue module acts as a limiting factor on the total length of phmm vectors that HAVAC can support. HAVAC supports a maximum of 1,048,576 (1024*1024) model positions for a single SSV query. This is large enough to hold the entirety of the Rfam database of RNA families [52](4108 models) twice over.Hardware cell processing elementsThe HAVAC cell PE is designed to minimize the resources required to calculate the score at a given cell. The cell’s sequence symbol c is used as the select on the current pHMM vector P(j) to obtain the cell’s signed 8-bit pHMM emission score. This value is then summed with the score computed by the previous cell PE on the previous cycle, \(M(i-1, j-1)\), to obtain an unsigned 8-bit intermediate sum value T(i, j) and a carry bit from the 8-bit full adder. This carry bit is then used along with the sign bit from the emission score to determine if the threshold has been passed and if the final cell’s result, M(i, j), should be reset back to zero. There are two situations where the result of the summation should be discarded: if a negative match score was added to \(M(i-1, j-1)\) (the carry bit was clear), or if a positive emission score was added to \(M(i-1, j-1)\) (the carry bit was set). In the first case the result underflowed the unsigned 8-bit result, and in the latter case the result overflowed the maximum representable score of 256. In this second case, the given cell has passed the score threshold and should be reported. Figure 6 shows the logic that implements a cell PE.$$\begin{aligned} T(i,j)&= M(i-1,j-1)+P(j,c)\\ M(i,j)&= {\left\{ \begin{array}{ll} 0 ,&{}\text {if }T(i,j)<0 \text { or } T(i,j)>256\\ T(i,j),&{} \text {otherwise} \end{array}\right. } \end{aligned}$$Fig. 6Logic Diagram of the Cell PE. The pHMM vector’s four 8-bit scores corresponding to each nucleotide are de-multiplexed by the sequence symbol, and then added to the score from the previous cell PE. Simple bitwise operations then determine if an overflow or underflow occurred, and reset the score or report a hit accordinglyHit reportingHAVAC reports any cell that passes the 256 threshold as an (i, j) position pair in the sequence and pHMM. HAVAC finds these (i, j) pairs by taking the threshold bit from all N cell PEs and systematically partitioning the threshold hit bits to find and report any that were set for a given cycle. These bits represent the cells \(i`< i < i`+N\) where i‘ is the first position of the N-length current segment of sequence. First, all N threshold hit bits are bitwise-or reduced; if the result is 1, the threshold hit bits contain at least 1 cell that passed the threshold. In this case, the threshold hit bits are enqueued to a hardware FIFO along with the current pHMM position and the current sequence segment index. Then, in concurrently running processes, the threshold hit bits are partitioned into 16 contiguous bit ranges. HAVAC iteratively bitwise-or reduces each of these 16 bit ranges to determine if a hit was in any of the ranges. If a bit range contained a hit, the bit range is enqueued to a subsequent hardware FIFO along with the rest of the position metadata and the index of the bit range. This process continues, further dividing the threshold hit bits until it represents a single bit. At this point, the hit’s position in the sequence can be identified as the index of that bit plus i‘. The hits are then written to the FPGA HBM memory where they can be read by the host to extract the (i, j) pairs after the HAVAC hardware accelerator finishes.Because the hit report partitioning runs concurrently to the main systolic array, and the hardware does not know a priori how many hit reports will be written for an invocation of the hardware accelerator, an extra terminator bit t is included along with the threshold hit ranges and positional data and is normally cleared. On the final cycle of the matrix computation (the final pHMM position of the final sequence segment index), the full threshold hit bit range is enqueued to the partitioning system even if it does not contain any hits, along with a set terminator bit. Each subsequent partitioning tier then passes along a threshold bit range with a set terminator on the final (16th) section of the bit range, and then deactivates. Once all hit reports are written to memory and the final partitioning tier deactivates, the number of hits is written to memory and the hardware accelerator completes its operation.Accelerator synthesis and driverSynthesis of the HAVAC hardware accelerator design was performed inside the Xilinx Vitis HLS tool. Implementation was performed by the v++ compiler tool. The Alveo U50 card contains 8GiB of HBM, split between 32 256MiB banks. Of the 8GiB of HBM available, 4GiB (16 banks) is used to store sequence data. All sequence characters are represented with 2-bit encoding (ambiguity characters are replaced by random letters, which may result in rare cases of falsely passing the filter; these are quickly filtered out by the downstream driver). HAVAC supports target sequences up to an aggregated length of 16Gbp. 512MiB (2 banks) is allocated for the profile HMM match scores. 3.5GiB (14 banks) is allocated for reporting the 8 byte (i, j) pairs for hits. Therefore, a maximum of 469,762,048 hits can be reported for any query. HAVAC was implemented with 12,288 cell PEs at a clock speed of 144.5 MHz. The number of cell PEs was chosen as a multiple of 4096 to allow the sequence data to better align to the hardware’s 4KiB memory boundary, while maximizing the number of PEs that could be successfully be synthesized and implemented (by the HLS tool and v++ compiler). The clock speed was determined automatically by the v++ compiler as the maximum clock speed that could be successfully implemented.The HAVAC driver software library allows for simple control over the hardware accelerator. Input data is provided to the driver as a fasta file and a HMMER3 model file and is then preprocessed and loaded onto the board’s HBM memory banks. The SSV computation can then be performed either synchronously or asynchronously. Once the hardware accelerator has finished, hit data can be read from the device as global sequence and model positions. Using these global positions, the HAVAC driver references the fasta and model files to locate which sequence and model the hits resides in, and the local positions in the sequence or model the hit refers to. These sequence and model positions can then be used by other algorithms in a sequence alignment pipeline as positions of potential homology. Because of the nature of HAVAC being a dedicated co-processor, SSV calculation may be performed concurrently to these other algorithms, further reducing the processing time for large queries that can be broken down into smaller batches.With enough memory to store 16Gbp of sequence and 13 million positions of pHMM data, HAVAC has far more memory than would be required for any single sequence or pHMM. The HAVAC driver concatenates all sequences in a fasta file and all models in an hmm file before transferring the data to the accelerator for processing. Thus, HAVAC searches the sequences and models in an all-to-all manner. The only limit to the number of sequences or pHMMS that HAVAC can process in a single invocation is the size of the memory allocated to each on the device. Once the hits are reported to the host, a fast SSV validity check is performed with the CPU to eliminate any hits that may have been generated by crossing the boundary between any two sequences or models.

Hot Topics

Related Articles