Deep active learning with high structural discriminability for molecular mutagenicity prediction

Datasets

TOXRIC dataset33
The raw data used in this study were C. Xu's Ames data collection45, one of the most commonly used datasets for developing mutagenicity prediction models. The database was prepared as follows46. First, inorganic molecules, that is, those without carbon atoms in their structure, were removed. Second, molecules with unspecified stereochemistry were removed. Third, the molecules were standardized using the InChI representation47. Finally, duplicates across the collection were identified and removed using the InChIKey. In total, 7485 compounds were used for model building: 4196 mutagens and 3289 non-mutagens. The dataset is available at https://toxric.bioinforai.tech/home.
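For concreteness, the curation steps above can be sketched with RDKit. This is a minimal illustration under stated assumptions, not the authors' exact pipeline: unassigned tetrahedral stereocentres stand in for "unspecified stereochemistry", and the InChIKey serves for both standardization and duplicate detection.

from rdkit import Chem

def curate(smiles_list):
    # Steps 1-4 of the preparation protocol, applied per molecule.
    seen, kept = set(), []
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        if mol is None:
            continue  # unparseable structure
        # Step 1: drop inorganic molecules (no carbon atoms).
        if not any(atom.GetSymbol() == "C" for atom in mol.GetAtoms()):
            continue
        # Step 2: drop molecules with unspecified stereochemistry
        # (here: any tetrahedral centre flagged as unassigned, "?").
        centres = Chem.FindMolChiralCenters(mol, includeUnassigned=True)
        if any(tag == "?" for _, tag in centres):
            continue
        # Steps 3-4: standardize via InChI and deduplicate via the InChIKey.
        key = Chem.MolToInchiKey(mol)
        if key in seen:
            continue
        seen.add(key)
        kept.append(smi)
    return kept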
Li's dataset21
The dataset was constructed from data sourced from three established databases: the Chemical Carcinogenesis Research Information System, the National Toxicology Program, and the Istituto Superiore di Sanità collection for Salmonella Typhimurium. It was further refined by removing samples that duplicated entries in the Ames dataset, thus establishing a new, independently verified external test set. A statistical analysis of the dataset is shown in Supplementary Fig. 2.

Molecular descriptors and fingerprint features
The fingerprint features include three sets of topological path-based features (Extended Connectivity Fingerprints with diameters of 2, 4, and 6: ECFP2, ECFP4, and ECFP6) and one set of substructure-key SMARTS-based features, MACCS. ECFP fingerprints are generated from the connectivity between atoms in a molecule, taking into account bonds, hybridization states, and functional groups. They are circular fingerprints, where the radius defines the maximum distance between atoms that can be included in a particular substructure; according to the radius, ECFP is divided into ECFP2, ECFP4, and ECFP6. MACCS fingerprints are generated from the structural features of a molecule, such as the presence of aromatic rings, functional groups, and atom types. Each structural key in the fingerprint is assigned a binary value, where 1 indicates the presence of the key and 0 its absence. RDKit2D descriptors are selected as input features to complement the fingerprint features. They capture a wide range of molecular properties, including size, shape, polarity, and flexibility; examples include the number of atoms, the molecular weight, the number of rotatable bonds, and the numbers of hydrogen bond donors and acceptors.

Deep active learning strategy
Active learning aims to select the most informative samples from a pool of unlabeled samples in the entire sample space. Defining the amount of information in a sample is the central challenge in active learning. To describe the deep active learning scenario proposed in this paper, an unlabeled sample pool consisting of \(N\) unlabeled molecules is assumed to be \({{{{\mathrm{U}}}}}_{N}=\{({x}_{1},{y}_{1}),\ldots,({x}_{N},{y}_{N})\}\), where \({x}_{i}\) is the feature vector of a molecule and \({y}_{i}\) is the corresponding toxicity label. First, we randomly select \(M\) samples from \({{{{\mathrm{U}}}}}_{N}\) and give them to the oracle for annotation, which yields an initial pool of annotated samples \({ {{\mathcal L}} }_{M}=\{({x}_{1},{y}_{1}),\ldots,({x}_{M},{y}_{M})\}\). Then the five features of all samples in the initial annotated sample pool \({ {\mathcal L} }_{M}\) are extracted and fed into the backbone module \({f}_{{{{\rm{b}}}}}\). The output of the hidden layer of the network, which can be regarded as the embedding of the input features, is fed into the uncertainty estimation module \({f}_{{{{\rm{u}}}}}\). The model's parameters are updated by jointly optimizing \({f}_{{{{\rm{b}}}}}\) and \({f}_{{{{\rm{u}}}}}\) according to the defined total loss.

Framework architecture

Feature extraction module
Molecular fingerprints and molecular descriptors are widely used in similarity searching and classification. Four molecular fingerprints and one molecular descriptor set are used in this work: ECFP2, ECFP4, and ECFP6 (2048 bits each), MACCS keys (166 bits), and RDKit2D. All the fingerprints and molecular descriptors were calculated with the RDKit Python package.
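As a usage illustration, the five feature sets can be computed as follows. This is a minimal sketch: an ECFP of diameter d corresponds to a Morgan fingerprint of radius d/2, and the example SMILES is arbitrary.

from rdkit import Chem
from rdkit.Chem import AllChem, MACCSkeys, Descriptors

mol = Chem.MolFromSmiles("c1ccccc1N")  # an arbitrary example molecule (aniline)

# ECFP2/4/6: Morgan fingerprints of radius 1, 2, and 3, folded to 2048 bits.
ecfp2 = AllChem.GetMorganFingerprintAsBitVect(mol, radius=1, nBits=2048)
ecfp4 = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=2048)
ecfp6 = AllChem.GetMorganFingerprintAsBitVect(mol, radius=3, nBits=2048)

# MACCS keys: 166 structural keys (RDKit returns 167 bits; bit 0 is unused).
maccs = MACCSkeys.GenMACCSKeys(mol)

# RDKit2D: the full set of 2D descriptors shipped with RDKit.
rdkit2d = {name: fn(mol) for name, fn in Descriptors.descList}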
Backbone module
As shown in Fig. 2, considering the higher dimensionality of the extended connectivity fingerprints compared to the other features, we first concatenate ECFP2, ECFP4, and ECFP6 along the channel dimension to form a three-channel fused feature, which is then fed into two convolution blocks. In each convolution block, a 1D convolutional layer and an average pooling layer first extract features and remove redundant information, reducing the number of network parameters. A ReLU activation follows, introducing non-linearity to enhance the representational ability of the network and mitigate vanishing and exploding gradients. Together, the two convolution blocks further extract features while reducing dimensionality, which aids the subsequent classification steps. The output of the convolution blocks is concatenated with the lower-dimensional MACCS fingerprints and RDKit2D descriptors to achieve feature fusion. The fused features are then fed into a linear block consisting of a linear layer, a ReLU layer, and a Dropout layer, where the Dropout layer is used to prevent overfitting. Finally, a linear classifier predicts the mutagenicity of the molecule.

Uncertainty estimation module
In active learning, the key issues are the criterion for measuring the informativeness of samples and the design of the query module. For the first issue, the most common measure is uncertainty-based querying, i.e., querying the samples that are hardest for the model to classify. Uncertainty-based querying has been shown to be well suited to classification problems with small samples48, so we choose uncertainty as the measure of informativeness. In deep learning, the loss measures the difference between the model's predictions and the true values, and the samples with the largest losses can usually be regarded as those hardest for the model to distinguish. Uncertainty estimation can therefore be converted into loss estimation. Since loss values cannot be computed for samples without true labels, a module is needed to estimate the loss of unlabeled samples. By training an uncertainty estimation module on labeled samples, we can predict the loss of unlabeled samples and thus estimate their uncertainty. The uncertainty estimation module designed in this paper is shown in Fig. 2. To make good use of the features extracted by the hidden layer of the backbone module, we use them as the input to the uncertainty estimation module. Inspired by Yoo et al.49, the module consists of a global average pooling layer, two linear layers, and a ReLU layer, where the global average pooling layer integrates feature information, and the combination of linear and non-linear layers enables the network to learn effectively. A final linear layer maps the features to a scalar uncertainty score for each unlabeled sample. We did not use hidden features at additional scales, as this would lead to a more complex uncertainty estimation module and decrease prediction performance.
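A compact PyTorch sketch of the two modules follows. The layer widths, kernel sizes, pooling windows, and dropout rate are illustrative assumptions (Fig. 2 fixes the actual values), and the global average pooling that the paper applies to the backbone's hidden feature maps is omitted here because this sketch taps a flat embedding.

import torch
import torch.nn as nn

class Backbone(nn.Module):
    # Hyperparameters below (channels, kernels, widths, dropout) are assumed
    # for illustration; see Fig. 2 for the actual architecture.
    def __init__(self, maccs_dim=166, rdkit2d_dim=200, hidden_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(3, 8, kernel_size=5), nn.AvgPool1d(4), nn.ReLU(),
            nn.Conv1d(8, 16, kernel_size=5), nn.AvgPool1d(4), nn.ReLU(),
        )
        conv_out = 16 * 126  # flattened size for 2048-bit inputs with the layers above
        self.linear_block = nn.Sequential(
            nn.Linear(conv_out + maccs_dim + rdkit2d_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.5),
        )
        self.classifier = nn.Linear(hidden_dim, 1)

    def forward(self, ecfp, maccs, rdkit2d):
        # ecfp: (B, 3, 2048), i.e., ECFP2/4/6 stacked along the channel dimension.
        z = self.conv(ecfp).flatten(1)
        h = self.linear_block(torch.cat([z, maccs, rdkit2d], dim=1))  # feature fusion
        return torch.sigmoid(self.classifier(h)).squeeze(1), h  # prediction, embedding

class UncertaintyHead(nn.Module):
    # Maps the backbone's hidden embedding to a scalar uncertainty score
    # via two linear layers separated by a ReLU.
    def __init__(self, hidden_dim=256, mid_dim=64):
        super().__init__()
        self.fc1 = nn.Linear(hidden_dim, mid_dim)
        self.fc2 = nn.Linear(mid_dim, 1)

    def forward(self, h):
        return self.fc2(torch.relu(self.fc1(h))).squeeze(1)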
We confirmed this in the Results.

Loss calculation module
Having defined the structure of the backbone module \({f}_{{{{\rm{b}}}}}\) and the uncertainty estimation module \({f}_{{{{\rm{u}}}}}\), we need to specify how they are jointly optimized. The total loss \({L}_{{\rm{total}}}\) consists of two components, the backbone module loss \({L}_{{{{\rm{b}}}}}\) and the uncertainty estimation module loss \({L}_{{{{\rm{u}}}}}\), described separately below. The output of a labeled sample \(x\) after the backbone module is \(\hat{y}={f}_{{{{\rm{b}}}}}(x)\). For the binary classification task we use the binary cross-entropy loss$${L}_{{{{\rm{b}}}}}(\hat{y},y)=-(y\cdot \,\log (\hat{y})+(1-y)\cdot \,\log (1-\hat{y}))$$
(1)
We want the output of the uncertainty estimation module to be as close as possible to the binary cross-entropy loss of the sample, so uncertainty estimation can be considered a regression task. In ordinary regression tasks, the most common loss is the mean squared error (MSE), \({L}_{{{{\rm{u}}}}}(\hat{y},y)=\frac{1}{n}\sum_{i=1}^{n}{({\hat{y}}_{i}-{y}_{i})}^{2}\), but the scale of the loss changes as training progresses, so using the MSE is not a sensible choice. Instead, we determine the trend of the uncertainty score estimates by comparing the losses of pairs of samples within a mini-batch. Suppose the \(k\)th pair of samples \(({x}_{i},{y}_{i})\) and \(({x}_{j},{y}_{j})\) lies in the same mini-batch; their outputs after the uncertainty estimation module are \({\hat{l}}_{i}\) and \({\hat{l}}_{j}\), and their actual cross-entropy losses are \({l}_{i}\) and \({l}_{j}\). We define the loss for this pair as$${L}_{{{{\rm{u}}}}}\left({\hat{l}}_{{{{\rm{batch}}}}}^{k},{l}_{{{{\rm{batch}}}}}^{k}\right)=\,\max (0,-{{{\rm{sign}}}}({l}_{i}-{l}_{j})\cdot ({\hat{l}}_{i}-{\hat{l}}_{j})+\xi )$$
(2)
where \({{{\rm{sign}}}}(\cdot )\) is the sign function and the margin \(\xi\) is a small positive constant. Equation (2) indicates that when \({l}_{i}-{l}_{j}\) and \({\hat{l}}_{i}-{\hat{l}}_{j}\) have the same sign, i.e., the pair of losses shows the same trend, the value of \({L}_{{{{\rm{u}}}}}\) is zero; otherwise, the parameters of the uncertainty estimation module are updated by gradient descent. Thus, given the mini-batch size \(B\), the total loss of the two modules is defined as$${L}_{{{\rm{total}}}}=\frac{1}{B}\left(\mathop{\sum}_{(x,y)\in B}{L}_{{{{\rm{b}}}}}(\hat{y},y)+2\lambda \cdot \mathop{\sum}_{({x}^{k},{y}^{k})\in B}{L}_{{{{\rm{u}}}}}({\hat{l}}_{{{{\rm{batch}}}}}^{k},{l}_{{{{\rm{batch}}}}}^{k})\right)$$
(3)
By optimizing the total loss \({L}_{{\mbox{total}}}\), we jointly optimize the parameters of the backbone module and the uncertainty estimation module during training, which allows us to estimate the uncertainty of unlabeled samples during the active learning phase. A sketch of the loss computation is given below, and Algorithm 1 summarizes the algorithmic logic and conceptual model.
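The following sketch shows how Eqs. (1)-(3) combine over one mini-batch. It assumes an even batch size, pairs the two halves of the batch, and, following Yoo et al.49, detaches the true losses so the ranking term trains only the uncertainty head; the margin and weight values here are placeholders.

import torch
import torch.nn.functional as F

def total_loss(y_hat, y, l_hat, xi=1.0, lam=1.0):
    # Eq. (1): per-sample binary cross-entropy for the backbone.
    l_true = F.binary_cross_entropy(y_hat, y, reduction="none")
    # Eq. (2): pairwise ranking loss over B/2 sample pairs; the true losses
    # are detached so no gradient flows into the backbone from this term.
    l_i, l_j = l_true.detach().chunk(2)
    lhat_i, lhat_j = l_hat.chunk(2)
    l_u = torch.clamp(-torch.sign(l_i - l_j) * (lhat_i - lhat_j) + xi, min=0)
    # Eq. (3): combine both terms, averaged over the mini-batch of size B.
    B = y.shape[0]
    return (l_true.sum() + 2 * lam * l_u.sum()) / B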
Algorithm 1
The muTOX-AL framework for molecular mutagenicity prediction
Input:
The unlabeled pool \({{{\mathrm{U}}}}\)
The testing set \({{{\mathcal{T}}}}\)
The size of the initial labeled set \(M\)
The number of active learning cycles \(C\)
The number of samples labeled in each cycle \(K\)
The backbone module \({f}_{{{{\rm{b}}}}}\) and the uncertainty estimation module \({f}_{{{{\rm{u}}}}}\)
1: Randomly select \(M\) samples from \({{{\mathrm{U}}}}\) to obtain the initial labeled set \({\mathcal L}\)
2: For c in C:
3:    Train the backbone module \({f}_{{{{\rm{b}}}}}\) and the uncertainty estimation module \({f}_{{{{\rm{u}}}}}\) using \({\mathcal L}\)
4:    Evaluate the performance of \({f}_{{{{\rm{b}}}}}\) using the testing set \({{{\mathcal{T}}}}\)
5:    Estimate the uncertainty of the unlabeled samples in \({{{\mathrm{U}}}}\) using \({f}_{{{{\rm{b}}}}}\) and \({f}_{{{{\rm{u}}}}}\)
6:    Select the top \(K\) samples with the highest uncertainty
7:     Query their labels from the oracle to obtain \({ {\mathcal L} }_{K}\)
8:     \({\mathcal L} \leftarrow {\mathcal L} \cup { {\mathcal L} }_{K}\)
9:     \(c\leftarrow c+1\)
10: End
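A minimal Python rendering of Algorithm 1 follows, with the sample counts taken from the "Experimental settings" below; train, evaluate, estimate_uncertainty, and oracle_label are placeholder names for the routines described in the text, not functions from any library.

import random

def mutox_al(unlabeled_pool, test_set, M=200, K=300, C=9):
    labeled = random.sample(unlabeled_pool, M)                  # line 1
    unlabeled = [x for x in unlabeled_pool if x not in labeled]
    for c in range(C):                                          # lines 2-10
        f_b, f_u = train(labeled)                               # joint optimization of L_total
        evaluate(f_b, test_set)                                 # the test set stays held out
        scores = estimate_uncertainty(f_b, f_u, unlabeled)
        ranked = sorted(zip(scores, unlabeled), key=lambda t: t[0], reverse=True)
        queried = [x for _, x in ranked[:K]]                    # top-K most uncertain
        labeled += oracle_label(queried)                        # query the oracle
        unlabeled = [x for x in unlabeled if x not in queried]
    return f_b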
Evaluation metrics
Two commonly used evaluation metrics for classification tasks serve as our evaluation criteria: accuracy and F1-score. First, we define four indicators: True Positive (TP) means that a positive sample is predicted as positive (a correct prediction); True Negative (TN) means that a negative sample is predicted as negative (a correct prediction); False Positive (FP) means that a negative sample is predicted as positive (an incorrect prediction); False Negative (FN) means that a positive sample is predicted as negative (an incorrect prediction). Accuracy, the most common evaluation metric in classification tasks, represents the proportion of correctly predicted samples among all samples and is defined as$${{\rm{Accuracy}}}=\frac{{{\rm{TP}}}+{{\rm{TN}}}}{{{\rm{TP}}}+{{\rm{TN}}}+{{\rm{FP}}}+{{\rm{FN}}}}$$
(4)
F1-score is defined in Eq. (5), which combines the Precision and Recall metrics.$${{\rm{F1}}}{\mbox{-}}{{\rm{score}}}=\frac{2 \times {{\rm{Precision}}} \times {{\rm{Recall}}}}{{{\rm{Precision}}}+{{\rm{Recall}}}}$$
(5)
where$${{\rm{Precision}}}=\frac{{{\rm{TP}}}}{{{\rm{TP}}}+{{\rm{FP}}}}$$
(6)
$${{\rm{Recall}}}={{\rm{Sensitivity}}}=\frac{{{\rm{TP}}}}{{{\rm{TP}}}+{{\rm{FN}}}}$$
(7)
Specificity is the proportion of actual negative samples that are correctly predicted as negative by the model.$${{\rm{Specificity}}}=\frac{{{\rm{TN}}}}{{{\rm{TN}}}+{{\rm{FP}}}}$$
(8)
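Taken together, Eqs. (4)-(8) reduce to simple arithmetic on the four confusion counts; a minimal sketch:

def metrics(tp, tn, fp, fn):
    # Eqs. (4)-(8) computed from the confusion-matrix counts.
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)              # recall = sensitivity
    f1 = 2 * precision * recall / (precision + recall)
    specificity = tn / (tn + fp)
    return accuracy, f1, precision, recall, specificity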
Experimental settings
All our experiments are implemented in the PyTorch framework. We set the batch size to 128 and used five-fold cross-validation to increase the generalizability of the experimental results: the dataset was randomly divided into five subsets, and in each fold four of them served as the total training pool for the model while the remaining subset was used to test the model's performance; this test set is never visible during training. The whole active learning process is divided into nine cycles. At the beginning of the experiment (cycle = 0), we randomly select 200 samples from the unlabeled sample pool to train the initial network, and we select 300 samples from the unlabeled pool in each subsequent active learning cycle. The backbone module is trained jointly with the uncertainty estimation module. The backbone module is trained for 300 epochs using an SGD optimizer with a learning rate of 5e-3, a momentum of 0.9, and a weight decay of 5e-4. The uncertainty estimation module is trained using an Adam optimizer with a learning rate of 8e-3. The margin in Eq. (2) is set to one. For each method, ten randomized replicate experiments are conducted using different initial labeled samples, and we report the mean of the ten experiments. Detailed information on the active learning training strategy can be found in the "Active learning training strategies in muTOX-AL" section of the Supplementary Information.

Active learning methods for comparison
We compared muTOX-AL with the following five active learning methods.

Random strategy
The random strategy is the most common active learning baseline. In each active learning cycle, \(K\) samples are selected at random from the unlabeled pool and given to the oracle for annotation.

Margin-based active learning strategy35
The margin-based active learning strategy is an uncertainty-based method. It defines uncertainty as the difference between the predicted probabilities \(P({\hat{y}}_{1}|x)\) and \(P({\hat{y}}_{2}|x)\) of the two most probable classes. The \(K\) samples with the smallest margin are added to the labeled pool. The margin criterion is defined as$${{{\rm{X}}}}={\arg \min }_{x\in {{{\mathrm{U}}}}}(P({\hat{y}}_{1}|x)-P({\hat{y}}_{2}|x))$$
(9)
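A minimal sketch of this query, assuming probs is an (N, Y) array of predicted class probabilities over the unlabeled pool:

import numpy as np

def margin_query(probs, K):
    # Margin = P(y1|x) - P(y2|x), the gap between the two most probable classes.
    top2 = np.sort(probs, axis=1)[:, -2:]
    margins = top2[:, 1] - top2[:, 0]
    return np.argsort(margins)[:K]   # the K smallest margins are queried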
Entropy-based active learning strategy36
The entropy-based active learning strategy is an uncertainty-based method. In information theory, data with higher entropy have higher uncertainty. Therefore, the entropy of the unlabeled samples is calculated and ranked, and the \(K\) samples with the highest entropy are added to the labeled pool. The entropy criterion is defined as$${{{\rm{X}}}}={\arg \max }_{x\in {{{\mathrm{U}}}}}{E}_{x}={\arg \min }_{x\in {{{\mathrm{U}}}}}\left\{\mathop{\sum }_{i=1}^{Y}{{{\mathrm{P}}}}({y}_{i}|x)\times \,\log {{{\mathrm{P}}}}({y}_{i}|x)\right\}$$
(10)
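Under the same assumed (N, Y) probability array, the entropy query can be sketched as follows; the small epsilon guards against log(0):

import numpy as np

def entropy_query(probs, K):
    # E_x = -sum_i P(y_i|x) log P(y_i|x); higher entropy = higher uncertainty.
    entropy = -np.sum(probs * np.log(probs + 1e-12), axis=1)
    return np.argsort(entropy)[-K:]  # the K highest-entropy samples are queried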
TOD active learning strategy37
The temporal output discrepancy (TOD) active learning strategy is an uncertainty-based approach. It defines uncertainty by computing the discrepancy between the model's outputs at different active learning cycles.

Core-set active learning strategy38
The core-set active learning strategy is a diversity-based approach and a common active learning baseline. It tries to find a core set such that the model's performance on the core set is as close as possible to its performance on the whole dataset.

Machine learning-based mutagenicity prediction methods for comparison

MIL
Feeney et al.19 proposed a machine learning approach based on multi-instance learning (MIL) for molecular mutagenicity prediction, particularly for metabolically activated compounds such as aromatic amines. By grouping metabolites and their parent compounds under a single mutagenicity label, MIL circumvents the need for individual labels, capturing mutagenic potential through structural considerations. MIL achieved excellent performance on the molecular mutagenicity dataset, so we used it as one of the baselines for muTOX-AL.

Enhanced_Representation_Mutagenicity
Shinada et al.18 systematically evaluated combinations of structural and molecular features that have the greatest impact on model accuracy, using various classification models (including classic machine learning and deep learning models) to assess these features. We selected the Structural Representation, Molecular Descriptors, and Genotoxicity Descriptors features, with the Random Forest classifier, as our evaluation baseline.

Statistics and reproducibility
The study employed five-fold cross-validation with ten random repetitions, reporting the mean and standard deviation across these repetitions. The p-values reported in the study were calculated using independent t-tests.

Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
