Crossfeat: a transformer-based cross-feature learning model for predicting drug side effect frequency | BMC Bioinformatics

Benchmark dataset

We downloaded the drug side effect frequencies and unique names of drugs and side effects from Supplementary Data 1 in Galeano et al.’s study [15]. This dataset contains 37,441 frequency-class associations for 759 drugs and 994 side effects. The occurrence of side effects was quantified into side effect frequency classes coded with integers between 1 and 5 (very rare = 1, rare = 2, infrequent = 3, frequent = 4, very frequent = 5). These drug side effect frequencies were used as the target values in this study.

Construction of input features

CrossFeat utilizes two types of drug information and two types of side effect information to generate similarity-embedding matrices and embedding vectors for drugs and side effects. Drug information comprises mol2vec [22] and fingerprint vectors. A mol2vec vector is a 100- or 300-dimensional vector representing the molecular structure of a drug that is obtained by inputting the drug SMILES into Mol2vec. Drug SMILES sequences were collected from the STITCH [23] database. Meanwhile, the fingerprint is a 2048-dimensional vector obtained by inputting the drug SMILES into RDKit [24], providing descriptors of the compound. These mol2vec and fingerprint vectors were subsequently employed to create mol2vec and fingerprint similarity vectors representing the similarity between drugs based on cosine similarity [25] and Jaccard similarity [26], respectively. For side effect information, semantic similarities and side effect word vectors were employed. We calculated semantic similarity using the Adverse Drug Reaction Classification System IDs to draw Directed Acyclic Graphs (DAGs). These DAGs represent the hierarchical relationships between side effects [16]. In addition, GloVe [27] was used to generate 300-dimensional side effect word vectors from the side effect names, and the word vector similarities between the side effects were calculated using cosine similarity.
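The two similarity measures named above can be illustrated with a minimal sketch. This is not the authors' implementation: the toy vectors, the helper names (`cosine_similarity`, `jaccard_similarity`, `similarity_matrix`), and the convention of returning 0 for all-zero inputs are assumptions made for illustration, and real mol2vec vectors and RDKit fingerprints would replace the toy data.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length real-valued vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumed convention for all-zero vectors
    return dot / (norm_u * norm_v)


def jaccard_similarity(bits_a, bits_b):
    """Jaccard similarity between two binary fingerprints of equal length."""
    both = sum(1 for a, b in zip(bits_a, bits_b) if a and b)
    either = sum(1 for a, b in zip(bits_a, bits_b) if a or b)
    return both / either if either else 0.0  # assumed convention for empty fingerprints


def similarity_matrix(vectors, sim):
    """Pairwise similarity matrix; row i is the similarity vector of item i."""
    return [[sim(u, v) for v in vectors] for u in vectors]


# Toy stand-ins (hypothetical values) for mol2vec vectors and fingerprints of 3 drugs.
mol2vec_toy = [[1.0, 0.0, 2.0], [2.0, 0.0, 4.0], [0.0, 3.0, 0.0]]
fingerprint_toy = [[1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 1]]

mol_sim = similarity_matrix(mol2vec_toy, cosine_similarity)
fin_sim = similarity_matrix(fingerprint_toy, jaccard_similarity)
```

Each row of `mol_sim` or `fin_sim` plays the role of one drug's similarity vector to all drugs, which is the kind of object that is later reduced in dimension.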
All drug and side effect information was regenerated using the methods proposed by Zhao et al. [18].

In total, 36,850 frequencies for 736 drugs were obtained after removing 23 drugs with no matching information in SDPred, to ensure consistency in our dataset. This step was necessary to maintain the integrity of our comparisons and to avoid potential ambiguities in the data. All 994 side effects matched the benchmark dataset. Let n and m be the number of drugs and side effects, respectively. The set of drugs can be represented as \(D = \left\{ d_{1},d_{2},\cdots ,d_{n}\right\}\), where \(n=736\), and the set of side effects can be represented as \(S = \left\{ s_{1},s_{2},\cdots ,s_{m}\right\}\), where \(m=994\). The set of all possible drug-side effect pairs is \(D\times S\), and the number of pairs is \(n\times m=731{,}584\). As shown in Table 1, we partitioned the drug-side effect pairs into three distinct subsets: \(\textrm{PS}_{1}\), containing the 36,850 drug-side effect pairs with frequency information; \(\textrm{PS}_{2}\), with 36,850 pairs randomly selected from the 694,734 pairs with unknown frequencies; and \(\textrm{PS}_{3}\), with the remaining 657,884 pairs with unknown frequencies. It was essential to include samples without drug side effect frequency information in the training set to calculate the probability of occurrence of drug side effects. Therefore, we randomly sampled a set of pairs equal in size to \(\textrm{PS}_{1}\) to form \(\textrm{PS}_{2}\), assuming a frequency of zero for \(\textrm{PS}_{2}\). All samples from \(\textrm{PS}_{1}\) and \(\textrm{PS}_{2}\) were used for model training and testing in a five-fold cross-validation to predict the probability of drug side effect occurrence and the frequency of side effects. \(\textrm{PS}_{3}\) was used for literature validation to assess the performance of the model.
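The partition into \(\textrm{PS}_{1}\), \(\textrm{PS}_{2}\), and \(\textrm{PS}_{3}\) and the reported subset sizes can be checked with a short sketch. The function name `partition_pairs`, the integer indexing of drugs and side effects, and the random seed are assumptions for illustration; the toy call uses a 5 by 4 grid rather than the real data.

```python
import random


def partition_pairs(n_drugs, n_side_effects, known_pairs, seed=0):
    """Split D x S into PS1 (known frequency), PS2 (sampled unknowns, treated
    as frequency 0), and PS3 (all remaining unknown pairs)."""
    ps1 = set(known_pairs)
    unknown = [
        (d, s)
        for d in range(n_drugs)
        for s in range(n_side_effects)
        if (d, s) not in ps1
    ]
    rng = random.Random(seed)
    ps2 = set(rng.sample(unknown, len(ps1)))  # same size as PS1
    ps3 = [pair for pair in unknown if pair not in ps2]
    return ps1, ps2, ps3


# Toy example: 5 drugs, 4 side effects, 4 pairs with known frequency.
known = {(0, 1), (1, 2), (2, 0), (3, 3)}
ps1, ps2, ps3 = partition_pairs(5, 4, known)

# Size arithmetic for the counts reported in the text.
n, m, n_known = 736, 994, 36_850
total_pairs = n * m                     # all drug-side effect pairs
unknown_total = total_pairs - n_known   # pairs without frequency information
ps3_size = unknown_total - n_known      # what remains after sampling PS2
```

With the reported counts, `total_pairs`, `unknown_total`, and `ps3_size` evaluate to 731,584, 694,734, and 657,884, which agrees with the figures given above.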
Table 1 Three subsets of drug-side effect pair samples

A transformer-based cross-feature learning model for drug side effect frequency prediction (CrossFeat)

This study introduces CrossFeat, a model designed to predict the occurrence probability and frequency of side effects for new drugs based on the molecular structure and compound description information of drugs, along with word embedding and semantic similarity information of side effects. The architecture of the proposed model is illustrated in Fig. 1. Two critical challenges were addressed during the development of CrossFeat. The first was to create representations that effectively characterize each drug and each side effect, thereby ensuring accurate predictions for new drugs. The second was to integrate the drug and side effect input features into a unified dimension. Samples comprised pairs of drugs and their side effects; therefore, it was crucial to train the features concurrently to enhance their interdependence as the model learned.

Fig. 1 Workflow of CrossFeat. Drug and side effect similarities are dimensionally reduced and multiplied through outer product operations to generate the drug and side effect embedding matrices (left side of the figure). Subsequently, the CNN architecture extracts features from the drug and side effect embedding matrices (center of the figure). The transformer module learns the representations from individual drug and side effect features and concurrently undergoes cross-learning to acquire the representations of each other (upper and lower right of the figure). Simultaneously, the Multi-Layer Perceptron (MLP) module projects the drug mol2vec vector and side effect word vector into embeddings of the same dimension (right middle of the figure).
All output embeddings from the MLPs and transformers are concatenated and input into two classifiers to predict the occurrence probabilities and frequencies of side effects for the drugs.

The CrossFeat model was trained using the following workflow: (i) drug and side effect similarities were dimensionally reduced to vectors of the same size. The reduced drug vectors were combined through outer product operations [28] to construct the drug embedding matrix, and the side effect vectors were similarly combined through outer product operations to generate the side effect embedding matrix; (ii) these embedding matrices were subsequently input to the CNN for feature extraction; (iii) the transformer module was utilized to emphasize crucial information from the drug and side effect features themselves and simultaneously acquire information about each other by facilitating cross-feature learning; (iv) the multi-layer perceptron (MLP) projected the drug mol2vec vector and side effect word vector into embeddings of the same size; and finally, (v) the classifiers predicted the occurrence probabilities and frequencies of side effects for the drugs from the concatenation of all the embeddings from the transformer and MLP. The detailed steps are elaborated in the subsequent subsections.

Generation of the input embedding matrix

The drug similarity vectors of drug \(d_i\), \(V_{mol}^i \in {\mathbb {R}}^{n}\) and \(V_{fin}^i \in {\mathbb {R}}^{n}\), represent the cosine similarities of drug molecular substructures with mol2vec and Jaccard scores of chemical substructures by fingerprint, respectively. The side effect similarity vectors of side effect \(s_j\), \(V_{sem}^j \in {\mathbb {R}}^{m}\) and \(V_{word}^j \in {\mathbb {R}}^{m}\), represent the side effect semantic similarity and word vector cosine similarity, respectively.
The dimensionality of all feature vectors, \(n\) for drugs and \(m\) for side effects, was reduced to \(l\) (in this study, \(l = 128\)); that is, \(V'^i_{mol}, V'^i_{fin}, V'^j_{sem}, V'^j_{word} \in {\mathbb {R}}^{l}\). The two drug vectors were then multiplied with each other, and likewise the two side effect vectors, using the outer product operation, denoted by \(\otimes\). We used the outer product expecting that combining the similarity vectors in this way would have a synergistic effect on the similarity values. Thus, the drug embedding matrix \(M_{d_i} = V'^i_{mol}\otimes V'^i_{fin} = V'^i_{mol} (V'^i_{fin})^\intercal \in {\mathbb {R}}^{l\times l}\) and the side effect embedding matrix \(M_{s_j} = V'^j_{sem}\otimes V'^j_{word} = V'^j_{sem} (V'^j_{word})^\intercal \in {\mathbb {R}}^{l\times l}\) were generated and subsequently used as the inputs to the CNNs. The similarity information for test drugs in \(V^i_{mol}, V^i_{fin}, V^j_{sem},\) and \(V^j_{word}\) was uniformly set to zero so that the drugs in the test set of each fold in the five-fold cross-validation experiment were treated as new drugs without prior information.

Feature extraction with CNN

A CNN is a type of artificial neural network commonly applied to visual image analysis [29, 30]. It consists of multiple layers, each capable of detecting different features in an image. Our study used two separate CNNs to extract features from the drug and side effect embedding matrices. The structures of both CNNs were identical and comprised four convolutional layers, each consisting of a 2D convolution, batch normalization [31], and a ReLU [32] activation function (see Fig. 2A). Each layer had a channel size of 32, a stride of 2, and a kernel size of 2. In the CNN module, the input is a tensor of shape batch \(\times 1 \times l\times l\). After passing through the four convolutional layers, the input was abstracted into a feature map with a size of batch \(\times 32 \times 8 \times 8\).
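The outer-product construction and the spatial size of the CNN feature map can be verified with a small sketch. The reduced vectors below are placeholder values, the helper names are hypothetical, and zero padding is assumed for the convolutions because the padding is not stated in the text.

```python
def outer_product(u, v):
    """Outer product u v^T of two length-l vectors, returned as an l x l nested list."""
    return [[a * b for b in v] for a in u]


def conv_output_size(size, kernel=2, stride=2, padding=0):
    """Spatial output size of one convolution layer (padding=0 is an assumption)."""
    return (size + 2 * padding - kernel) // stride + 1


l = 128  # reduced dimension used in the study

# Placeholder reduced similarity vectors (hypothetical values, not real data).
v_mol = [0.01 * (i % 7) for i in range(l)]
v_fin = [0.02 * (i % 5) for i in range(l)]

m_d = outer_product(v_mol, v_fin)  # drug embedding matrix, shape l x l

# Track the spatial size through four layers with kernel 2 and stride 2.
sizes = [l]
for _ in range(4):
    sizes.append(conv_output_size(sizes[-1]))
```

Under these assumptions the spatial size halves at every layer, going from 128 to 64, 32, 16, and finally 8, which matches the \(8 \times 8\) feature map reported above.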
Subsequently, mean pooling was applied to each feature map.

Fig. 2 Schematic of the CNN and the cross-feature learning (feature-wise cross-attention) mechanism in the CrossFeat architecture. (A) An \(l\times l\) embedding matrix is passed through four convolutional layers, each consisting of a Conv2D, batch normalization, and ReLU activation function, followed by mean pooling to extract feature matrices. These feature matrices are then input into the transformer encoder. (B) Queries (Q) from the drug encoder and keys (K) from the side effect encoder are used to form the attention scores. Specifically, the queries are derived from the previous sublayer of the drug encoder, while the keys and values (V) are obtained from the first sublayer of the side effect encoder. Attention scores are calculated as the dot product of the queries and keys, which are then passed through a softmax function to generate the attention weights. These weights are subsequently multiplied by the values to produce the output. This cross-attention process enables the effective fusion of features between the drug and side effects. It enhances the ability of the model to capture the complex relationships between drugs and their side effects.

Cross-feature learning with transformer encoder through cross-attention

The original transformer [19] is a neural machine translation model consisting of encoder and decoder architectures. The encoder extracts features from an input sentence and the decoder utilizes these features to produce an output sentence for translation. The transformer module in CrossFeat is a variation of the original transformer encoder. The encoders for cross-feature learning are composed of a stack of two identical blocks (or layers), each containing three sub-layers (whereas the original encoder has two sub-layers): two multi-head attention mechanisms and a position-wise fully connected feedforward network.
The output embeddings of a sublayer are carried forward to the subsequent layers through residual connections, and layer normalization is applied after each residual connection. The input of the attention function consists of queries and keys with dimension \(D_k\) and values with dimension \(D_v\). The queries, keys, and values are packed together into matrices Q, K, and V, and the output matrix is calculated using the following equation:
$$\begin{aligned} \mathrm{Attention}(Q,K,V)=\mathrm{softmax}\left( \frac{QK^\intercal }{\sqrt{D_{k}}} \right) V. \end{aligned}$$
(1)
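As a minimal sketch of the attention function, the following code implements the standard scaled dot-product attention of the original transformer [19], in which the softmax weights multiply V. It uses a single head and plain Python lists; the toy matrices and helper names are assumptions, and the example call mimics cross-attention by taking queries from a hypothetical drug feature matrix and keys and values from a hypothetical side effect feature matrix.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def transpose(a):
    return [list(col) for col in zip(*a)]


def matmul(a, b):
    """Matrix product of two nested lists with compatible shapes."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def attention(q, k, v):
    """softmax(Q K^T / sqrt(D_k)) V; also returns the attention weights."""
    d_k = len(k[0])
    scores = matmul(q, transpose(k))
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights


# Cross-attention toy example: queries from the drug side,
# keys and values from the side effect side (hypothetical numbers).
drug_q = [[1.0, 0.0], [0.0, 1.0]]
side_effect_k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
side_effect_v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

fused, attn_weights = attention(drug_q, side_effect_k, side_effect_v)
```

Each row of `attn_weights` sums to one, so every row of `fused` is a convex combination of the side effect value vectors, weighted by how strongly the corresponding drug query matches each key.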
Refer to Fig. 2B for a detailed illustration of the attention mechanism.

In addition to the two sublayers in the original transformer encoder, CrossFeat’s encoder includes an additional multi-head attention layer inserted as a second sublayer, which performs cross-feature learning. Cross-feature learning (feature-wise cross-attention) is a module from semantic segmentation that is used in the CrossFeat architecture. It is employed to fuse features between the drug and side effect encoders. This module guides the filtration of transformer features and eliminates ambiguities in interactions between drugs and side effects. Let us denote the drug’s encoder as \(E_{d_i}\) and the side effect’s encoder as \(E_{s_j}\). The second sublayer of \(E_{d_i}\) performs cross-attention over the outputs of the first sublayers of \(E_{d_i}\) and \(E_{s_j}\). Specifically, the queries are derived from the previous sublayer of \(E_{d_i}\), and the keys and values are obtained from the first sublayer of \(E_{s_j}\), as shown in Fig. 2B. Similarly, the second sublayer of \(E_{s_j}\) performs cross-attention over the outputs of the first sublayers of \(E_{s_j}\) and \(E_{d_i}\). Here, the queries come from the previous sublayer of \(E_{s_j}\) and the keys and values come from the first sublayer of \(E_{d_i}\). All sublayers of \(E_{d_i}\) and \(E_{s_j}\) produce outputs with dimension \(D_{E_{d_i}} = D_{E_{s_j}} = p\).

CrossFeat’s multi-layer perceptron (MLP)

In the previous steps, we trained CNNs and transformers to generate embedding vectors that described each drug, denoted by \(d_i\), and each side effect, denoted by \(s_j\), based on their similarities to other drugs and side effects. In this step, our objective was to learn latent representations for each drug and side effect by directly capturing vectors representing \(d_i\) and \(s_j\) without relying on similarity information.
The mol2vec vector of each drug is projected onto a q-dimensional representation using a two-layer MLP with batch normalization. Similarly, the word vector of each side effect is projected onto a q-dimensional space using a two-layer MLP with batch normalization to create the corresponding latent representation.

Classifiers

The classifiers consist of a binary classifier to determine whether the side effect \(s_j\) occurs owing to the drug \(d_i\) and a regression classifier to predict the frequency of \(s_j\) occurring. The outputs from steps (iii) and (iv) were concatenated to create a classifier input vector with \(2p+2q\) dimensions. The binary classifier employs a sigmoid function. The output of the binary classifier was set to one if \(s_j\) occurs and zero otherwise. The binary occurrence was determined based on the following thresholds when a predicted score x was obtained:
$$\begin{aligned} \textrm{Predicted binary occurrence}(x)= {\left\{ \begin{array}{ll} 1 & \text {if } x>0 \\ 0 & \text {if } x=0. \end{array}\right. } \end{aligned}$$
(2)
When the output of the binary classifier is one, the regression classifier outputs a continuous value, typically between zero and five, although it may exceed five.

Experimental design

We employed a five-fold cross-validation procedure on 73,700 samples (drug-side effect pairs) comprising the \(\textrm{PS}_{1}\) and \(\textrm{PS}_{2}\) datasets. We divided the folds by drug rather than by sample to ensure that the drugs in the held-out test fold did not appear in the training folds. Consequently, the folds did not contain the same number of samples. The average numbers of samples in the training and test folds were 58,960 and 14,740, representing approximately 80% and 20% of the total, respectively. The samples in the training fold were further split by sample at a 4:1 ratio. This division allocated 80% of the training fold samples (referred to as the training set) to model training and the remaining 20% (referred to as the validation set) to setting the model hyperparameters and choosing the best model per fold.

A grid search was performed to determine optimal hyperparameters for each fold of CrossFeat. The hyperparameter search space for CrossFeat is provided in Supplementary Table S1. From this grid, we randomly selected 10 hyperparameter combinations and compared their performances on the validation set. During training, an early-stopping counter was incremented whenever the performance on the validation set deteriorated compared to the previous state. Training was terminated when this counter reached 10. Subsequently, the performance on the test fold was evaluated using the best-performing hyperparameter combination determined on the validation set.

We adopted the binary cross-entropy (BCE) loss function for the binary classification of side effects and applied the \(L_2\) loss in Eq. 4 to side effect frequency prediction.
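The two losses named here can be written out as a small sketch. The function names, the toy labels and predictions, and the clipping constant `eps` (added only to avoid taking the logarithm of zero) are assumptions for illustration rather than details from the study.

```python
import math


def bce_loss(y_true, y_prob, eps=1e-12):
    """Mean binary cross-entropy over N samples (occurrence labels in {0, 1})."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # assumed clipping for numerical safety
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)


def l2_loss(freq_true, freq_pred):
    """Sum of squared errors over the samples that have a known frequency."""
    return sum((y - y_hat) ** 2 for y, y_hat in zip(freq_true, freq_pred))


# Toy occurrence labels: 1 for pairs with a known frequency, 0 for sampled unknowns.
occurrence_true = [1, 0, 1, 0]
occurrence_prob = [0.9, 0.2, 0.6, 0.1]

# Frequency targets exist only for the pairs with known frequency.
freq_true = [4.0, 2.0]
freq_pred = [3.5, 2.5]

bce = bce_loss(occurrence_true, occurrence_prob)
l2 = l2_loss(freq_true, freq_pred)
```

In this sketch the occurrence loss is averaged over all samples, whereas the frequency loss is summed only over the samples with a known frequency class, which mirrors the different sample counts used by the two objectives.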
CrossFeat utilizes two Adam optimizers [33] to learn the predicted side effect occurrence probabilities \(\hat{y_{i1}}\) and predicted side effect frequency values \(\hat{y_{i2}}\) by minimizing the following two loss functions:
$$\begin{aligned} & BCE = -\frac{1}{N}\sum _{i=1}^{N}\left[ y_{i1}\cdot \log (\hat{y_{i1}})+(1-y_{i1})\cdot \log (1-\hat{y_{i1}})\right] \end{aligned}$$
(3)
$$\begin{aligned} & \quad L_2 = \sum _{i=1}^{N'}(y_{i2}-\hat{y_{i2}})^2, \end{aligned}$$
(4)
where N is the number of training samples drawn from the \(\textrm{PS}_{1}\) and \(\textrm{PS}_{2}\) datasets, \(N'\) is the number of training samples drawn from the \(\textrm{PS}_{1}\) dataset, and \(y_{ik}\) and \(\hat{y_{ik}}\) represent the true and predicted values of sample i, respectively. Four metrics were employed to evaluate the performance of the model: the area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPRC) for binary classification, and the root mean squared error (RMSE) and mean absolute error (MAE) for regression.

Independent FAERS_SI dataset

We conducted additional experiments using the FAERS (FDA Adverse Event Reporting System) dataset, which includes reports of actual adverse events and medication errors submitted to the Food and Drug Administration (FDA). The FAERS database relies on voluntary adverse event reports submitted by healthcare professionals, consumers, and manufacturers, including negative placebo effects. In contrast, the SIDER [20] database collects its information on drug side effects from the FAERS dataset; however, it utilizes natural language processing to extract drug-side effect pairs from the drug package insert. For this case study, we collected FAERS reports from the fourth quarter of 2012 to the second quarter of 2023. The original data can be downloaded from https://fis.fda.gov/extensions/FPD-QDE-FAERS/FPD-QDE-FAERS.html. We included reports from healthcare professionals only, including physicians, pharmacists, nurses, dentists, and others. To enhance dataset reliability, we filtered the FAERS dataset to include only the drugs, side effects, and drug-side effect pairs that had frequency information available in the SIDER database.
Additionally, we excluded cases involving the simultaneous use of several drugs so that each side effect could be attributed to a specific drug.

The FAERS dataset we downloaded initially included 10,511,188 samples (drug-side effect pairs), 34,486 drugs, 17,550 side effects, and 1,341,486 distinct drug-side effect pairs. After filtering by 2,962 SIDER side effects, we reduced the sample size to 2,044,255. Further filtering by SIDER’s 932 drugs resulted in 808,783 samples. Finally, filtering by 59,333 SIDER drug-side effect pairs resulted in a dataset with 231,464 samples, encompassing 633 drugs, 1,395 side effects, and 19,319 distinct pairs. This dataset will be referred to as FAERS_SI. Galeano’s study [15] used the SIDER 4.1 database, which was released in October 2015, to create frequency classes, covering an earlier period. FAERS_SI includes data from a later period, with additional drugs and side effects absent in the Galeano dataset, making it largely independent, though not completely. Specifically, 61.3% of side effects (855/1,395), 77.4% of drugs (490/633), and 3.6% of distinct pairs (690/19,319) in FAERS_SI overlap with those in the Galeano dataset. For a detailed illustration of the process of generating FAERS_SI, see Fig. 3.

The frequency of a drug side effect was calculated by dividing the number of samples in which a specific side effect occurred with the use of a particular drug by the total number of samples using that drug. Finally, we quantified the calculated frequency of drug side effects on a scale of 1 to 5.
The frequency value was determined based on the following criteria:
$$\begin{aligned} \textrm{Frequency}(x)= {\left\{ \begin{array}{ll} 1~\text {(Very rare)} & \text {if } x< 0.0001 \\ 2~\text {(Rare)} & \text {if } 0.0001 \le x< 0.001 \\ 3~\text {(Infrequent)} & \text {if } 0.001 \le x< 0.01 \\ 4~\text {(Frequent)} & \text {if } 0.01 \le x < 0.1 \\ 5~\text {(Very frequent)} & \text {if } x \ge 0.1. \end{array}\right. } \end{aligned}$$
(5)
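The frequency calculation and the binning in Eq. 5 can be sketched as follows. The report list, the drug and side effect names, and the function names are hypothetical; each report is assumed to be a single (drug, side effect) sample after the filtering steps described above.

```python
from collections import Counter


def frequency_class(x):
    """Map a reporting frequency x (a fraction between 0 and 1) to classes 1-5, following Eq. 5."""
    if x < 0.0001:
        return 1  # very rare
    if x < 0.001:
        return 2  # rare
    if x < 0.01:
        return 3  # infrequent
    if x < 0.1:
        return 4  # frequent
    return 5      # highest class (x >= 0.1)


def pair_frequency_classes(reports):
    """reports: list of (drug, side_effect) samples, one entry per report.
    Returns {(drug, side_effect): frequency class}."""
    drug_totals = Counter(drug for drug, _ in reports)
    pair_counts = Counter(reports)
    return {
        pair: frequency_class(count / drug_totals[pair[0]])
        for pair, count in pair_counts.items()
    }


# Hypothetical toy reports for one drug (1,000 samples in total).
reports = (
    [("drugA", "nausea")] * 150
    + [("drugA", "rash")] * 5
    + [("drugA", "headache")] * 845
)
classes = pair_frequency_classes(reports)

# Pairs that never appear in the reports can be looked up with a default of 0.
unseen = classes.get(("drugA", "dizziness"), 0)
```

In this toy example, nausea is reported in 15% of the drugA samples and falls into class 5, while rash is reported in 0.5% of them and falls into class 3.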
A frequency value of 0 was assigned to cases where there was no information or where no side effects were reported.

Fig. 3 FAERS_SI dataset creation process. The original FAERS dataset contains data from Q4 2012 to Q2 2023. First, the dataset was filtered to include only reports from healthcare professionals to enhance data reliability. Subsequently, it was further filtered to include only the drugs, side effects, and drug-side effect pairs present in the SIDER database. The SIDER database collects its information on drug side effects from FAERS up to 2015 using natural language processing to extract data from drug package inserts. This resulted in the final FAERS_SI dataset.
