Semi-supervised meta-learning elucidates understudied molecular interactions

Data setsExperiment 1: DTI prediction
Training/validation data
We used molecular interactions in ChEMBL25 and HMDB27 as training and validation data. It contained 298,736 total pairs with 230,811 unique chemicals and 3084 unique proteins.

OOD testing data
The annotated DTIs from DrugBank28 were used as testing data. To simulate an OOD scenario, we removed all chemicals that are structurally similar to drugs in the testing data (Tanimoto coefficient > 0.5), totaling 21,760 pairs including 8917 unique chemicals and 3266 proteins.

Unlabeled data
To focus on the unexplored domain of interest, a target domain sampling strategy was developed. Specifically, we selected unlabeled pairs of drug targets and drug-like chemicals but excluded already labeled pairs. For each chemical, we sampled six proteins, resulting in 53,502 total unlabeled pairs. The detailed data statistics can be found in Fig. 1.
Experiment 2: Human MPI prediction
Training/validation data
The training data for this experiment was sourced from ChEMBL (version 29)25. It consisted of 334,668 pairs with 252,712 unique compounds and 5204 unique proteins, where each pair represented an activity with a single protein as the target.

OOD testing data
For the testing, we utilized HMDB27, which provided interactions between metabolites and human enzymes. We randomly sampled 10,000 pairs as the testing data covering 8,921 unique compounds and 2611 unique proteins.

Unlabeled data
To create the unlabeled dataset, we considered all the unlabeled metabolite-enzyme pairwise combinations. From the total pairs, we included all unique metabolites and randomly selected two enzymes to associate with each chemical, this resulted in the creation of a sizeable unlabeled dataset, consisting of 44,644 unlabeled samples. The detailed data statistics can be found in Fig. 1.
Experiment 3: Microbiome-human MPI prediction
Training/validation data
For this experiment, the training data consisted of a combination of ChEMBL, HMDB, and NJS1630 datasets. After removing duplicates and unusable data, the dataset contained a total of 1,667,708 samples including 357,213 unique compounds and 168,517 unique proteins.

OOD testing dataset
The testing dataset was manually created based on two published works. The first work31 provided information on interactions between 241 GPCRs and metabolites from simplified human microbiomes (SIHUMIs) consisting of the seven most general bacteria species. The second work32 involved the screening of gut microbiota metabolomes to identify ligands for various GPCRs. Since this study focused on small molecule metabolites, lipids were excluded, resulting in a total of 162 MPIs, including 17 positive activities.

Unlabeled data
For the protein side, we included all GPCRs from UniProt53. Besides, an equal number of proteins were randomly selected from the Pfam dataset. Chemical samples were the 240 unique metabolites from the NJS16 dataset. Overall, the unlabeled data consisted of 73,238 pairs. The detailed data statistics can be found in Fig. 1.
MMAPLE base modelsTo enable a fair comparison with the baseline models, we currently focus on binary classification problems. Four state-of-the-art base models were employed to evaluate the performance MMAPLE:

DISAE26. Distilled Sequence Alignment Embedding (DISAE) is a method developed by us that includes three major modules: protein language model, chemical structure modeling, and the combination of the above two modules. The protein sequence module uses distilled sequence alignment embedding, leveraging a transformer-based architecture trained on nearly half a million protein domain sequences for generating meaningful protein embeddings. This is crucial for predicting protein-ligand interactions in out-of-distribution (OOD) scenarios. The chemical module is a graph isomorphism network (GIN) to obtain chemical features, which are numerical representations of small molecules and capture their chemical properties. Finally, DISAE includes an attentive pooling module that combines the protein and chemical embeddings obtained from the first two modules to produce the final output for predicting DTIs or MPIs as a binary classification task (i.e., active or inactive). The attentive pooling module uses a cross-attention mechanism to weigh the importance of each protein and chemical embedding, allowing it to focus on the most relevant information when making the prediction. Lbase denotes the loss function of the base model, which is a binary cross-entropy loss in this case.

TransformerCPI18 Adapted from the transformer architecture, TransformerCPI takes protein sequence as the input to the encoder, and atom sequence as the input to the decoder, and learns the interaction at the last layers. Specifically, the amino acid sequence is embedded with a Word2vec model pre-trained on all human protein sequences in UniProt, and the self-attention layers in the encoder are replaced with a gated convolutional network and output the final presentation of proteins. The atom features of chemicals are learned through graph convolutional network (GCN) by aggregating their neighbor atom features. The interaction features are further obtained by the decoder of transfer, which consists of self-attention layers and feed-forward layers.

DeepPurpose19 DeepPurpurpose provides a library for DTI prediction incorporating seven protein encoders and eight compound encoders to learn the protein and compound representations respectively, and eventually feeds the learned embeddings into an MLP decoder to generate predictions. We implemented the best-reported architecture, convolutional neural network (CNN) for both protein and compound feature representation learning, as another base model of MMAPLE.

BACPI20 The last base model included in this study is the Bi-directional attention neural network for compound-protein interaction (BACPI). Similarly, it consists of chemical representation learning, protein representation learning, and CPI prediction components to combine them. BACPI employs a graph attention network (GAT) for compounds to learn various information of the molecule graphs. For protein, it introduces a CNN module to take the amino acid sequence as input, to learn the local contextual features of protein by using a content window to split the sequences onto overlapping subsequences of amino acids. Finally, the atom structure graphs and residue sequence features are fed into the bi-directional attention neural network to integrate the representations and capture the important regions of compounds and proteins, and the integrated features are used to predict the CPI.

Semi-supervised meta-learningWe adopted a semi-supervised meta-learning paradigm for our model training. Similar to pseudo labels, there is a pair of teacher model and student model, the teacher model takes unlabeled data as input, and uses the predicted results as pseudo labels for the student model to learn with the combination of labeled and pseudo-labeled data. However, instead of learning from the fixed teacher model, the student constantly sends feedback to the teacher in the format of performance on labeled data, and the teacher keeps updating the pseudo labels on every mini-batch. This strategy could solve the problem of confirmation bias in pseudo-labeling54. The illustration of MMAPLE training is shown in Fig. 5. Let T and S denote the teacher model and the student model, θT and θS denote the corresponding parameters (\({\theta }_{T}^{{\prime} }\) and \({\theta }_{S}^{{\prime} }\) denote the updated parameters). We use \({{\mathcal{L}}}\) to represent the loss function, and T(xu; θT) to stand for the teacher predictions on unlabeled data xu, similar notations for S(xu; θS) and \(S({x}_{l};{\theta }_{S}^{{\prime} })\). CE denotes the cross-entropy loss.Fig. 5: Illustration of MMAPLE training schema.A teacher model generates pseudo labels by predicting a batch of unlabeled data. The pseudo label is further passed to a filter to control the balance ratio of the positive and negative samples (as a hyperparameter). A student model generates the predictions from the same unlabeled data as those used in the teacher model and is updated by minimizing loss function \({{\mathcal{L}}}_{u}({\theta }_{T},{\theta }_{S})\) as in equation (3). Then, the updated student model takes a batch of labeled data and generates new predictions that compare with the ground truth labels and minimize loss \({{{\mathcal{L}}}}_{l}({\theta}^{{\prime} }_{S})\) as equation (6).Model trainingMMAPLE does not work alone but is built on top of other models. The training process is repeated until optimization converges. The number of iterations depends on the base model and training data, so it varies accordingly. To ensure a fair comparison with the base models, both MMAPLE and base models were constructed using the same architecture. The detailed training procedure is shown in Algorithm 1.The update rule of studentOn a batch of unlabeled data xu, sample T(xu; θT) from the teacher’s prediction, and optimize the student model with the objective$${\min }_{{\theta }_{S}}{{{\mathcal{L}}}}_{u}({\theta }_{T},{\theta }_{S})$$
(1)
where$${{{\mathcal{L}}}}_{u}({\theta }_{T},{\theta }_{S}):={{\mathbb{E}}}_{{x}_{u}}[CE(T({x}_{u};{\theta }_{T}),S({x}_{u};{\theta }_{S}))]$$
(2)
The optimization of each mini-batch is performed as$${\theta }_{S}^{{\prime} }={\theta }_{S}-{\eta }_{S} \nabla {\theta }_{S}{{{\mathcal{L}}}}_{u}({\theta }_{T},{\theta }_{S})$$
(3)
The update rule of teacherOn a batch of labeled data (xl, yl), and use the students’ update to optimize the teacher model with the objective$${\min }_{{\theta }_{T}}{{{\mathcal{L}}}}_{l}({\theta }_{S}-{\eta }_{S}\nabla{\theta }_{S}{{{\mathcal{L}}}}_{u}({\theta }_{T},{\theta }_{S}))$$
(4)
where$${{{\mathcal{L}}}}_{l}({\theta }_{S}^{{\prime} }):={{\mathbb{E}}}_{{x}_{l},{y}_{l}}[CE({y}_{l},S({x}_{l};{\theta }_{S}^{{\prime} }))]$$
(5)
The optimization of each mini-batch is performed as$${\theta }_{T}^{{\prime} }={\theta }_{T}-{\eta }_{T}\nabla{\theta }_{T}{{{\mathcal{L}}}}_{l}({\theta }_{S}-{\eta }_{S}\nabla{\theta }_{S}{{{\mathcal{L}}}}_{u}({\theta }_{T},{\theta }_{S}))$$
(6)
We experimented with both hard labels and soft labels. Due to the superior performance of soft labels to hard labels, the final MMAPLE was trained using the soft label. The methods are described as follows:Using soft labelsBecause we always treat θS as fixed parameters when optimizing Equation (6) and ignore its higher-order dependence on θT, the objective is fully differentiable with respect to θT when soft pseudo labels are used, i.e., T(xu; θT) is the full distribution predicted by the teacher model. This allows us to perform standard back-propagation to obtain the gradient.Additionally, we incorporated the temperature scaling to soften the teacher model’s predictions55. T(xu; θT) is the teacher’s output distribution computed by applying softmax over the logits \(z\! :\,{\mbox{softmax}}\,(z)=\frac{\exp (z/T)}{\mathop{\sum }_{j = 1}^{n}\exp ({z}_{j}/T)}\), the temperature parameter T is used to control the ”softness” of the output probabilities. In the implementation, the temperature was tuned by hyperparameter searching.For the quality control of soft labels, we employed a balance sampler to control the ratio between positive and negative hard labels transferred from soft labels. This will provide a mechanism to dynamically adjust the ratio of positive and negative during training. This ratio served as a crucial parameter to govern the training process, enabling us to strike a balance between the two label categories. Through this approach, we aimed to alleviate bias and imbalance in the dataset.Using hard labelsWhen using hard pseudo labels, we followed the derivative rule proposed in the reference54, which was a slightly modified version of REINFORCE applied to obtain the approximated gradient of \({{{\mathcal{L}}}}_{l}\) in Equation(6) with respect to θT as follows:$$h={\eta }_{S}\cdot \left({\left({\nabla}_{{\theta }_{{S}^{{\prime} }}}CE\left.\right({y}_{l},S\left({x}_{l};{\theta }_{S}^{(t+1)}\right)\right)}^{T}\cdot \nabla{\theta }_{{S}^{{\prime} }}CE\left({\hat{y}}_{u},S\left({x}_{u};{\theta }_{S}^{t}\right)\right)\right)$$
(7)
The teacher’s gradient from the student’s feedback:$${g}_{T}^{t}=h\cdot \nabla{\theta }_{T}CE({\hat{y}}_{u},T({x}_{u};{\theta }_{T})){| }_{{\theta }_{T}}={\theta }_{T}^{(t)}$$
(8)

Algorithm 1
Training procedure
Require: N, the batch size nsup, number of epochs of supervised training nfreeze, number of epochs that teacher model is frozen n, number of training epochs
Input: Xun, Xl
Stage 1:
for epoch = 1 to nsup do
for t = 1 to \(\frac{{N}_{l}}{N}\) do
sample Xl of size N from the labeled data (without rep)
Update θT with Lbase
end for
end for
save the model with early stopping
Stage 2:
Initialize the teacher model with θT
Initialize student model with random parameters θS
for epoch = 1 to nfreeze do
for t = 1 to \(\frac{\min ({N}_{un},{N}_{l})}{N}\) do
sample Xun of size N from unlabeled data (without rep)
update θS with student update rule
end for
end for
for epoch = nfreeze + 1 to n do
sample Xun of size N from unlabeled data (without rep)
update θS with student update rule
update θT with teacher update rule
end for
Model evaluationThe model performance was measured using both Receiver Operating Characteristic (ROC) and Precision-Recall (PR) and their corresponding area under the curve (AUC). While ROC is a commonly used metric, it may give an optimistic impression of the model’s performance, particularly when datasets are imbalanced56. Therefore, PR is a better metric to evaluate the performance of MMAPLE than ROC. A three-fold cross-validation approach was utilized to ensure the robustness of the model’s performance evaluation. Consistency across evaluations was maintained by using the same folds for all base models.Statistical significance of predictionIn our study, we focused on predicting GPCR genes that interact with TMAO, employing a comprehensive analytical approach to evaluate the statistical significance of each prediction. The prediction scores generated through our model were subjected to Kernel Density Estimation (KDE) from the Python package scipy57. KDE is a non-parametric way to estimate the probability density function of our prediction scores. By applying KDE, we were able to calculate the tail probability for each predicted interaction score, which we interpreted as a p-value. This p-value serves as an indicator of the rarity or significance of the predictions within the overall distribution of scores, providing a statistical basis for identifying the most significant GPCR-TMAO interactions. The detailed results of our predictions can be found in Supplemental Table 4. The results of DISAE predictions can be found in Supplemental Table 5.Statistics and ReproducibilityStatistical analyses in general were conducted using paired t-tests. We employed three-fold cross-validation to ensure the robustness of our results. For each fold, we applied early stopping and tested the model on the hold-out testing set. The final reported mean performance is the average result from these testing sets.GPCR functional assayTrimethylamine N-oxide (TMAO) (purity: 95%, molecular weight: 76.12) was purchased from Sigma-Aldrich (MO, USA).GPCR functional assay was performed using the PathHunter® β-Arrestin assay by Eurofins (CA, USA). The PathHunter® β-Arrestin assay monitors the activation of a GPCR in a homogenous, non-imaging assay format using a technology called Enzyme Fragment Complementation (EFC) with β-galactosidase (β-Gal) as the functional reporter. The enzyme is split into two inactive complementary portions (EA for Enzyme Acceptor and PK for ProLink) expressed as fusion proteins in the cell. EA is fused to β-Arrestin and PK is fused to the GPCR of interest. When the GPCR is activated and β-Arrestin is recruited to the receptor, ED and EA complementation occurs, restoring β-Gal activity which is measured using chemiluminescent PathHunter® Detection Reagents.The compound activity was analyzed using the CBIS data analysis suite (ChemInnovation, CA).For agonist mode assays, percentage activity was calculated using the following formula:%Activity = 100% x (mean RLU of test sample – mean RLU of vehicle control) / (mean MAX control ligand – mean RLU of vehicle control)Where RLU is relative luminescence unit of the measurement.For antagonist mode assays, percentage inhibition was calculated using the following formula:%Inhibition = 100% x (1 – (mean RLU of test sample – mean RLU of vehicle control) / (mean RLU of EC80 control – mean RLU of vehicle control))Where EC80 is 80% maximal effective concentration of TMAO.Protein-ligand dockingAutoDock Vina38 was applied on TMAO to find the best conformation in the CXCR4 chemokine receptor (PDB ID: 3ODU). The center of the co-crystallized ligand (ligand ID: ITD) in 3ODU was used to define the center of the searching space and 12 Angstrom of extra space was added to the edge of ITD to set up the docking space for TMAO. The binding energies between TMAO and 3ODU were attained in terms of Kcal/mol.Ablation studyAll the ablation studies were applied to the experiment of Microbiome-human MPI prediction.Vanilla TSVanilla teacher-student model, where the teacher model is pre-trained, and kept frozen while training the student model, so the student will reply on the pseudo labels to learn, without sending feedback to the teacher model. Hard labels are used for pseudo-labeling.TS softSame as Vanilla TS, except for soft labels are used.OOC-ML(out-of-cluster meta-learning)As demonstrated in the published work12, we created five clusters based on the scaffold of the molecules, and we forced the model to see data from different clusters from every meta-update, therefore the model was pushed to generalize on the unseen data.

Hot Topics

Related Articles