Contextual AI models for single-cell protein biology

The Methods describe (1) the curation of datasets, (2) the construction and representation of multiscale single-cell networks, (3) the PINNACLE multiscale graph neural network, (4) the fine-tuning of PINNACLE for target prioritization and (5) the metrics and statistical analyses used.

Datasets

Reference human physical protein interaction network
Our reference PPI network is the union of physical multivalidated interactions from BioGRID101,102, the Human Reference Interactome (HuRI)91 and Menche et al.103, with 15,461 nodes and 207,641 edges. Each source of PPIs has its own methods for curating and validating physical interactions between proteins. BioGRID, HuRI and Menche et al. are PPI networks from three well-cited publications and databases on human protein interactions. By joining the three networks, we construct a comprehensive global PPI network for our analysis.

Multiorgan, single-cell transcriptomic atlas of humans
We leverage the Tabula Sapiens20 data source as our multiorgan, single-cell transcriptomic atlas of humans. The data consist of 15 donors and 59 specimens in total. There are 483,152 cells after quality control, of which 264,824 are immune cells, 104,148 are epithelial cells, 31,691 are endothelial cells and 82,478 are stromal cells. The cells correspond to 177 unique cell ontology classes.

Construction of multiscale networks
Our multiscale networks comprise protein–protein physical interactions, cell type-to-cell type communication, cell type-to-tissue relationships and a tissue–tissue hierarchy.

Cell type-specific protein interaction networks
For each cell type, we create a cell type-specific network that represents the physical interactions between proteins (or genes) that are probably expressed in the cell type. Intuitively, our approach identifies genes significantly expressed in a given cell type relative to the rest of the cells in the dataset. Concretely, we use the processed Tabula Sapiens count matrix to calculate the average expression of each gene in the cell type of interest and the average expression of the corresponding gene in all other cells. Then, we apply the Wilcoxon rank-sum test to the two sets of average gene expression. From the resulting list of genes ranked by activation, we filter for the top K most activated genes. We repeat these two steps N times and keep genes that appear in at least 90% of iterations. Finally, we extract these genes' corresponding proteins from the global protein interaction network and take only the largest connected component. To ensure high-quality representations of cell types in our networks, we keep networks with at least 1,000 proteins. We do not perform subsampling of cells (that is, sampling the same number of cells per cell type) to minimize information loss when constructing protein interaction networks (Extended Data Fig. 2). Limitations are described in the Discussion.
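
The following Python sketch illustrates this gene-selection and network-induction procedure. The per-iteration resampling of cells, the value of k (set here near the reported average network size) and all helper names are assumptions for illustration rather than the authors' exact implementation.

```python
import numpy as np
import networkx as nx
from scipy.stats import ranksums

def cell_type_ppi_network(expr, genes, labels, target, global_ppi,
                          k=2500, n_iter=10, consensus=0.9, min_size=1000):
    """Sketch: select genes preferentially expressed in the `target` cell type
    and induce a cell type-specific subnetwork of the global PPI graph.
    expr: cells x genes array; labels: per-cell cell type annotations."""
    in_ct = labels == target
    counts = np.zeros(len(genes))
    for _ in range(n_iter):
        # resample cells with replacement (assumed; the paper repeats the
        # ranking N times and keeps genes recovered in >=90% of iterations)
        idx = np.random.choice(expr.shape[0], expr.shape[0], replace=True)
        e, m = expr[idx], in_ct[idx]
        # per-gene Wilcoxon rank-sum test: target cell type vs all other cells
        stats = np.array([ranksums(e[m, g], e[~m, g]).statistic
                          for g in range(len(genes))])
        top_k = np.argsort(stats)[::-1][:k]   # most activated genes
        counts[top_k] += 1
    keep = {genes[g] for g in np.where(counts >= consensus * n_iter)[0]}
    # induce the subgraph on the selected proteins; keep the largest component
    sub = global_ppi.subgraph(keep & set(global_ppi.nodes))
    largest = max(nx.connected_components(sub), key=len)
    return sub.subgraph(largest).copy() if len(largest) >= min_size else None
```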

Cell type and tissue relationships in the metagraph
We identify interactions between cell types based on ligand–receptor (LR) expression using the CellPhoneDB104 tool and database. An edge between a pair of cell types indicates that CellPhoneDB predicts at least one significantly expressed LR pair (with a P value less than 0.001) between them. As recommended by CellPhoneDB, cells are subsampled before running the algorithm, which uses geometric sketching105 to efficiently sample a small representative subset of cells from massive datasets while preserving biological complexity. We subsample 25% of cells and run CellPhoneDB for 100 iterations.
We determine cell type–tissue relationships and extract tissue–tissue relationships using Tabula Sapiens metadata. For relationships between cell types and tissues, we draw edges between cell types and the tissue from which the cells were taken. For tissue–tissue relationships, we select the nodes corresponding to the tissues from which samples were taken and all parent nodes up to the root of the BRENDA tissue ontology106. We perform sensitivity and ablation analyses on different components of the metagraph (Supplementary Tables 3–5).

Final dataset
We have 156 cell type-specific protein interaction networks with, on average, 2,530 ± 677 proteins per network. The number of unique proteins across all cell type-specific protein interaction networks is 13,643 of the 15,461 proteins in the global reference protein interaction network. In the metagraph, we have 62 tissues (nodes), of which 24 are directly connected to cell types. There are 3,567 cell–cell edges, 372 cell–tissue edges and 79 tissue–tissue edges.

Multiscale graph neural network

Overview
PINNACLE performs biologically informed message passing through proteins, cell types and tissues to learn cell type-specific protein representations, cell type representations and tissue representations in a unified multiscale embedding space. Specifically, PINNACLE traverses protein–protein physical interactions in each cell type-specific PPI network, cell type–cell type communication, cell type–tissue relationships and the tissue–tissue hierarchy with an attention mechanism over individual nodes and edge types. Its objective function is designed and optimized for learning the topology across biological scales, from proteins to cell types to tissues. The resulting embeddings from PINNACLE can be visualized and manipulated for hypothesis-driven interrogation and fine-tuned for diverse downstream biomedical prediction tasks.

Problem formulation
Let \({{{\mathcal{G}}}}=\{{G}_{1},\ldots ,{G}_{| {{{\mathcal{C}}}}| }\}\) be a set of cell type-specific PPI networks, where \({{{\mathcal{C}}}}\) is a set of unique cell types. Each \({G}_{{c}_{i}}=({V}_{{c}_{i}},{E}_{{c}_{i}})\) consists of a set of nodes \({V}_{{c}_{i}}\) and edges \({E}_{{c}_{i}}\) for the PPI network specific to cell type \({c}_{i}\in {{{\mathcal{C}}}}\). Its nodes \(u,v\in {V}_{{c}_{i}}\) are proteins, and its edges \({e}_{u,v}^{{{{\rm{PP}}}}}\in {E}_{{c}_{i}}\) are physical PPIs (denoted with PP in superscript). Cell types and tissues form a network, referred to as a metagraph. The metagraph's set of nodes comprises cell types \({c}_{i}\in {{{\mathcal{C}}}}\) and tissues \({t}_{i}\in {{{\mathcal{T}}}}\). The types of edges are cell type–cell type interactions (denoted with CC in superscript) \({e}_{{c}_{i},{c}_{j}}^{{{{\rm{CC}}}}}\) between any pair of cell types \({c}_{i},{c}_{j}\in {{{\mathcal{C}}}}\); cell type–tissue associations (denoted with CT in superscript) \({e}_{{c}_{i},{t}_{i}}^{{{{\rm{CT}}}}}\) between any cell type \({c}_{i}\in {{{\mathcal{C}}}}\) and tissue \({t}_{i}\in {{{\mathcal{T}}}}\); and tissue–tissue relationships (denoted with TT in superscript) \({e}_{{t}_{i},{t}_{j}}^{{{{\rm{TT}}}}}\) between any pair of tissues \({t}_{i},{t}_{j}\in {{{\mathcal{T}}}}\).
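
One possible encoding of these inputs, sketched below with PyTorch Geometric, is a dictionary of homogeneous graphs for the cell type-specific PPI networks plus a heterogeneous metagraph over cell type and tissue nodes. The node counts and toy edges are illustrative; this is not the data structure actually used by PINNACLE.

```python
import torch
from torch_geometric.data import Data, HeteroData

# One homogeneous graph per cell type-specific PPI network (PP edges).
ppi_networks = {
    "cd8_t_cell": Data(edge_index=torch.tensor([[0, 1], [1, 2]]), num_nodes=3),
    # ... one entry per cell type
}

# A heterogeneous metagraph over cell types and tissues with three edge types.
metagraph = HeteroData()
metagraph["cell_type"].num_nodes = 156
metagraph["tissue"].num_nodes = 62
metagraph["cell_type", "CC", "cell_type"].edge_index = torch.tensor([[0], [1]])
metagraph["cell_type", "CT", "tissue"].edge_index = torch.tensor([[0], [5]])
metagraph["tissue", "TT", "tissue"].edge_index = torch.tensor([[5], [6]])
```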

Protein-level attention with cell type specificity
For each cell type-specific PPI network \({{{{\mathcal{G}}}}}_{{c}_{i}}\), we leverage protein-level attention to learn cell type-specific embeddings of proteins. Intuitively, protein-level attention learns which neighboring nodes are probably most important for characterizing a particular cell type's protein. As such, each cell type-specific protein interaction network has its own cell type-specific set of learnable parameters. Concretely, at each layer-wise update of layer l, the node-level attention learns the importance αu,v of protein u to its neighboring protein v in a given cell type \({c}_{i}\in {{{\mathcal{C}}}}\):$${{{{\bf{h}}}}}_{u}^{{{{\rm{PP}}}}}\leftarrow {{{\rm{AGG}}}}\left(\sigma \left(\sum_{v\in {{{{\mathcal{N}}}}}_{u}}{\alpha }_{u,v}{{{{W}}}}^{{{{\rm{PP}}}}}{{{{\bf{h}}}}}_{v}^{{{{\rm{PP}}}}}\right)\right)$$
(1)
where AGG is an aggregation function (that is, concatenation across K attention heads), σ is the nonlinear activation function (that is, ReLU), \({{{{\mathcal{N}}}}}_{u}\) is the set of neighbors of u (including itself via self-attention), αu,v is an attention mechanism defined as \({\alpha }_{u,v}=\frac{\exp (\sigma ({{{{\bf{a}}}}}^{T}\cdot [{{{{\bf{h}}}}}_{u}\parallel {{{{\bf{h}}}}}_{v}]))}{{\sum }_{v\in {{{{\mathcal{N}}}}}_{u}}\exp (\sigma ({{{{\bf{a}}}}}^{T}\cdot [{{{{\bf{h}}}}}_{u}\parallel {{{{\bf{h}}}}}_{v}]))}\) between a pair of interacting proteins from a specific cell type, WPP is a PP-specific transformation matrix that projects the features of protein u in its cell type-specific protein interaction network, and \({{{{\bf{h}}}}}_{v}^{{{{\rm{PP}}}}}\) is the previous layer's cell type-specific embedding of protein v. Practically, we leverage the attention function of graph attention neural networks (that is, GATv2)44. Proteins of the same identity are initialized with the same random Gaussian vector to maintain their identity during training.
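
A minimal sketch of this protein-level attention step, assuming PyTorch Geometric's GATv2Conv as the attention function; keeping one GATv2 layer per cell type in a ModuleDict is an illustrative way to give each cell type-specific network its own parameters.

```python
import torch
import torch.nn as nn
from torch_geometric.nn import GATv2Conv

class ProteinLevelAttention(nn.Module):
    """Sketch of equation (1): one GATv2 attention layer per cell type-specific
    PPI network, with concatenation across K heads and a ReLU nonlinearity.
    Dimensions and the per-cell-type parameter layout are illustrative."""
    def __init__(self, cell_types, dim_in, dim_out, heads=8):
        super().__init__()
        # each cell type has its own learnable attention parameters (W^PP, a)
        self.convs = nn.ModuleDict({
            ct: GATv2Conv(dim_in, dim_out, heads=heads, concat=True)
            for ct in cell_types
        })

    def forward(self, x, edge_index, cell_type):
        # x: protein features for this cell type's PPI network
        h = self.convs[cell_type](x, edge_index)   # attention-weighted messages
        return torch.relu(h)                        # sigma = ReLU
```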

Metagraph-level attention on cellular interactions and tissue hierarchy
For the metagraph, we use node-level and edge-level attention to learn which neighboring nodes and edge types are probably most important for characterizing the target node (that is, the node of interest). Intuitively, to learn an embedding for a specific cell type or tissue, we evaluate the informativeness of each direct cell type or tissue neighbor, as well as the relationship type between the cell type or tissue and its neighbors (for example, a parent–child tissue relationship, the tissue in which a cell type is found, or a cell type with which the cell type of interest communicates).
Concretely, at each layer l of PINNACLE, the embeddings of a cell type \({c}_{i}\in {{{\mathcal{C}}}}\) are the result of aggregating (via function AGG) the embeddings (\({{{{\bf{h}}}}}_{c}^{{{{\rm{CC}}}}}\) and \({{{{\bf{h}}}}}_{t}^{{{{\rm{CT}}}}}\)) of its direct cell type neighbor c and tissue neighbor t, which are projected via edge-type-specific transformation matrices (WCC and WCT) and weighted by learned attention weights (\({\alpha }_{{c}_{i},c}\) and \({\alpha }_{{c}_{i},t}\), respectively):$${{{{\bf{h}}}}}_{{c}_{i}}^{{{{\rm{CC}}}}}\leftarrow {{{\rm{AGG}}}}\left(\sigma \left(\sum_{c\in {{{{\mathcal{N}}}}}_{{c}_{i}}}{\alpha }_{{c}_{i},c}{{{{W}}}}^{{{{\rm{CC}}}}}{{{{\bf{h}}}}}_{c}^{{{{\rm{CC}}}}}\right)\right)$$
(2)
$${{{{\bf{h}}}}}_{{c}_{i}}^{{{{\rm{CT}}}}}\leftarrow {{{\rm{AGG}}}}\left(\sigma \left(\sum_{t\in {{{{\mathcal{N}}}}}_{{c}_{i}}}{\alpha }_{{c}_{i},t}{{{{W}}}}^{{{{\rm{CT}}}}}{{{{\bf{h}}}}}_{t}^{{{{\rm{CT}}}}}\right)\right)$$
(3)
The embeddings generated from separately propagating messages through cell type–cell type edges or cell type–tissue edges are combined using learned attention weights βCC and βCT, respectively.$${{{{\bf{h}}}}}_{{c}_{i}}={\beta }^{{{{\rm{CC}}}}}{{{{\bf{h}}}}}_{{c}_{i}}^{{{{\rm{CC}}}}}+{\beta }^{{{{\rm{CT}}}}}{{{{\bf{h}}}}}_{{c}_{i}}^{{{{\rm{CT}}}}}$$
(4)
Similarly, the embeddings of a tissue \({t}_{i}\in {{{\mathcal{T}}}}\) are the result of aggregating (via function AGG) the embeddings (\({{{{\bf{h}}}}}_{t}^{{{{\rm{TT}}}}}\) and \({{{{\bf{h}}}}}_{c}^{{{{\rm{TC}}}}}\)) of its direct tissue neighbor t and cell type neighbor c that are projected via edge-type-specific transformation matrices (WTT and WTC) and weighted by learned attention weights (\({\alpha }_{{t}_{i},t}\) and \({\alpha }_{{t}_{i},c}\) respectively).$${{{{\bf{h}}}}}_{{t}_{i}}^{{{{\rm{TT}}}}}\leftarrow {{{\rm{AGG}}}}\left(\sigma \left(\sum_{t\in {{{{\mathcal{N}}}}}_{{t}_{i}}}{\alpha }_{{t}_{i},t}{{{{W}}}}^{{{{\rm{TT}}}}}{{{{\bf{h}}}}}_{t}^{{{{\rm{TT}}}}}\right)\right)$$
(5)
$${{{{\bf{h}}}}}_{{t}_{i}}^{{{{\rm{TC}}}}}\leftarrow {{{\rm{AGG}}}}\left(\sigma \left(\sum_{c\in {{{{\mathcal{N}}}}}_{{t}_{i}}}{\alpha }_{{t}_{i},c}{{{{W}}}}^{{{{\rm{TC}}}}}{{{{\bf{h}}}}}_{c}^{{{{\rm{TC}}}}}\right)\right)$$
(6)
The embeddings generated from separately propagating messages through tissue–tissue edges or tissue–cell type edges are combined using learned attention weights βTT and βTC, respectively.$${{{{\bf{h}}}}}_{{t}_{i}}={\beta }^{{{{\rm{TT}}}}}{{{{\bf{h}}}}}_{{t}_{i}}^{{{{\rm{TT}}}}}+{\beta }^{{{{\rm{TC}}}}}{{{{\bf{h}}}}}_{{t}_{i}}^{{{{\rm{TC}}}}}$$
(7)
For the node-level attention mechanisms (equations (2), (3), (5) and (6)), AGG is an aggregation function (that is, concatenation across K attention heads); σ is the nonlinear activation function (that is, ReLU); \({{{{\mathcal{N}}}}}_{{c}_{i}}\) and \({{{{\mathcal{N}}}}}_{{t}_{i}}\) are the sets of neighbors of ci and ti, respectively (each including itself via self-attention); WCC, WCT, WTC and WTT are edge-type-specific transformation matrices that project the features of a given target node; and \({{{{\bf{h}}}}}_{c}^{{{{\rm{CC}}}}}\), \({{{{\bf{h}}}}}_{t}^{{{{\rm{CT}}}}}\), \({{{{\bf{h}}}}}_{t}^{{{{\rm{TT}}}}}\) and \({{{{\bf{h}}}}}_{c}^{{{{\rm{TC}}}}}\) are the previous layer's embeddings of c given edge type CC, t given edge type CT, t given edge type TT and c given edge type TC, respectively. Practically, we leverage the attention function of graph attention neural networks (that is, GATv2)44. Finally, the node-level attention mechanism for a given source node u and edge type r is \({\alpha }_{u,v}^{r}=\frac{\exp (\sigma ({{{{\bf{a}}}}}_{r}^{T}\cdot [{{{{\bf{h}}}}}_{u}\parallel {{{{\bf{h}}}}}_{v}]))}{{\sum }_{v\in {{{{\mathcal{N}}}}}_{u}}\exp (\sigma ({{{{\bf{a}}}}}_{r}^{T}\cdot [{{{{\bf{h}}}}}_{u}\parallel {{{{\bf{h}}}}}_{v}]))}\). For the attention mechanisms over edge types (equations (4) and (7)), \({\beta }^{r}=\frac{\exp ({m}_{r})}{{\sum }_{r\in R}\exp ({m}_{r})}\) such that \({m}_{r}={\sum }_{u\in {V}_{q}}{{{{\bf{s}}}}}^{T}\cdot \tanh ({{{M}}}\cdot {\bf{h}}_{u}^{r}+{{{\bf{b}}}})\), where Vq is the set of nodes in the metagraph, s is the attention vector, M is the weight matrix and b is the bias vector. These parameters are shared across all edge types in the metagraph.
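
A minimal sketch of the edge-type attention in equations (4) and (7), with the scoring parameters s, M and b shared across edge types; the dimensions and module layout are illustrative.

```python
import torch
import torch.nn as nn

class EdgeTypeAttention(nn.Module):
    """Sketch: a shared scoring function m_r = sum_u s^T tanh(M h_u^r + b)
    produces one score per edge type, softmax-normalized into beta_r."""
    def __init__(self, dim, hidden=128):
        super().__init__()
        self.project = nn.Sequential(nn.Linear(dim, hidden), nn.Tanh())  # M, b
        self.s = nn.Linear(hidden, 1, bias=False)                        # s

    def forward(self, h_per_edge_type):
        # h_per_edge_type: list of [num_nodes, dim] embeddings, one per edge type r
        scores = torch.stack([self.s(self.project(h)).sum()
                              for h in h_per_edge_type])   # m_r
        beta = torch.softmax(scores, dim=0)                # beta_r
        return sum(b * h for b, h in zip(beta, h_per_edge_type))
```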

Bridge between protein and cell type embeddings
Using a pooling mechanism, we bridge cell type-specific protein embeddings with their corresponding cell type embeddings. We initialize cell type embeddings by taking the average of their proteins' embeddings: \({{{{\bf{h}}}}}_{{c}_{i}}=\frac{1}{| {V}_{{c}_{i}}| }{\sum }_{u\in {V}_{{c}_{i}}}{{{{\bf{h}}}}}_{u}\), where hu is the embedding of protein node \(u\in {V}_{{c}_{i}}\) in the PPI subnetwork for cell type ci. Similarly, we initialize tissue embeddings by taking the average of their neighbors: \({{{{\bf{h}}}}}_{{t}_{i}}=\frac{1}{| {{{{\mathcal{N}}}}}_{{t}_{i}}| }\left({\sum }_{t\in {{{{\mathcal{N}}}}}_{{t}_{i}}}{{{{\bf{h}}}}}_{t}+{\sum }_{c\in {{{{\mathcal{N}}}}}_{{t}_{i}}}{{{{\bf{h}}}}}_{c}\right)\), where ht and hc are the embeddings of tissue node t and cell type node c, respectively, in the immediate neighborhood of source tissue node ti. At each layer l > 0, we learn the importance \({\gamma }_{{c}_{i},u}\) of node \(u\in {V}_{{c}_{i}}\) to cell type ci such that$${{{{\bf{h}}}}}_{{c}_{i}}\leftarrow {{{{\bf{h}}}}}_{{c}_{i}}+{{{\rm{AGG}}}}\left(\sigma \left(\sum_{u\in {V}_{{c}_{i}}}{\gamma }_{{c}_{i},u}{{{{\bf{h}}}}}_{u}\right)\right).$$
(8)
After propagating cell type and tissue information in the metagraph (namely equations (2)–(6)), we apply \({\gamma }_{{c}_{i},u}\) to the cell type embedding of ci such that$${{{{\bf{h}}}}}_{u}\leftarrow {{{{\bf{h}}}}}_{u}+{\gamma }_{{c}_{i},u}{{{{\bf{h}}}}}_{{c}_{i}}.$$
(9)
Intuitively, we are imposing the structure of the metagraph onto the PPI subnetworks based on a protein's importance to its corresponding cell type's identity.
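
A minimal sketch of this bridge (equations (8) and (9)); because the parameterization of γ is not detailed in this excerpt, a simple learned scoring over concatenated protein and cell type embeddings is assumed here.

```python
import torch
import torch.nn as nn

class ProteinCellTypeBridge(nn.Module):
    """Sketch: attention-pool protein embeddings into their cell type embedding
    (eq. 8), then push the gamma-weighted cell type embedding back onto each
    protein (eq. 9)."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(2 * dim, 1)   # assumed form of gamma's scorer

    def forward(self, h_proteins, h_cell_type):
        # gamma_{c_i,u}: importance of each protein u to cell type c_i
        pair = torch.cat([h_proteins,
                          h_cell_type.expand_as(h_proteins)], dim=-1)
        gamma = torch.softmax(self.score(pair), dim=0)         # [num_proteins, 1]
        # eq. (8): update the cell type embedding with pooled protein embeddings
        h_cell_type = h_cell_type + torch.relu((gamma * h_proteins).sum(dim=0))
        # eq. (9): update each protein with its gamma-weighted cell type embedding
        h_proteins = h_proteins + gamma * h_cell_type
        return h_proteins, h_cell_type
```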

Overall objective function of PINNACLE
PINNACLE is optimized for three biological scales: the protein, cell type and tissue levels. Concretely, the loss function \({{{\mathcal{L}}}}\) has three components corresponding to each biological scale:$${{{\mathcal{L}}}}={{{{\mathcal{L}}}}}_{{\mathrm{protein}}}+(1-\theta )({{{{\mathcal{L}}}}}_{{\mathrm{celltype}}}+{{{{\mathcal{L}}}}}_{{\mathrm{tissue}}}),$$
(10)
where \({{{{\mathcal{L}}}}}_{{\mathrm{protein}}}\), \({{{{\mathcal{L}}}}}_{{\mathrm{celltype}}}\) and \({{{{\mathcal{L}}}}}_{{\mathrm{tissue}}}\) minimize the loss from protein-level predictions, cell type-level predictions and tissue-level predictions, respectively. θ is a tunable parameter between 0 and 1 that scales the contribution of the metagraph's link prediction loss relative to that of the PPIs. At the protein level, we consider two aspects: prediction of PPIs in each cell type-specific PPI network (\({{{{\mathcal{L}}}}}_{{\mathrm{ppi}}}\)) and prediction of the cell type identity of each protein (\({{{{\mathcal{L}}}}}_{{\mathrm{celltypeid}}}\)). The contribution of the latter is scaled by λ, a tunable parameter between 0 and 1.$${{{{\mathcal{L}}}}}_{{\mathrm{protein}}}=\theta {{{{\mathcal{L}}}}}_{{\mathrm{ppi}}}+\lambda\, {{{{\mathcal{L}}}}}_{{\mathrm{celltypeid}}}$$
(11)
Intuitively, we aim to simultaneously learn the topology of each cell type-specific PPI network (that is, \({{{{\mathcal{L}}}}}_{{\mathrm{ppi}}}\)) and the nuanced differences between proteins activated in different cell types. Specifically, we use binary cross-entropy to minimize the error of predicting positive and negative PPIs in each cell type-specific PPI network$${{{{\mathcal{L}}}}}_{{\mathrm{ppi}}}=\sum_{{c}_{i}\in {{{\mathcal{C}}}}}\sum_{u,v\in {V}_{{c}_{i}}}{y}_{u,v}\log (\;{\hat{y}}_{u,v})+(1-{y}_{u,v})\log (1-{\hat{y}}_{u,v})$$
(12)
and center loss107 for discriminating between protein embeddings zu from different cell types, represented by embeddings denoted as \({{{{\bf{z}}}}}_{{c}_{i}}\).$${{{{\mathcal{L}}}}}_{{\mathrm{celltypeid}}}=\sum_{{c}_{i}\in {{{\mathcal{C}}}}}\sum_{u\in {V}_{{c}_{i}}}| | {{{{\bf{z}}}}}_{u}-{{{{\bf{z}}}}}_{{c}_{i}}| {| }_{2}^{2}$$
(13)
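
A minimal sketch of the protein-level loss (equations (11)–(13)) for a single cell type-specific PPI network. Edge probabilities are scored with the bilinear decoder introduced in equation (20) below; the diagonal (DistMult-style) relation embedding, the averaging of the center loss and the θ/λ values are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def protein_level_loss(z, pos_edges, neg_edges, z_cell_type, rel_pp,
                       theta=0.5, lam=0.1):
    """z: [num_proteins, d] protein embeddings for this cell type;
    z_cell_type: [d] embedding of the cell type; rel_pp: [d] PP edge-type
    embedding used by the decoder; pos_edges/neg_edges: [2, num_edges]."""
    def score(edges):
        src, dst = edges
        return ((z[src] * rel_pp) * z[dst]).sum(dim=-1)   # z_u . r . z_v
    logits = torch.cat([score(pos_edges), score(neg_edges)])
    labels = torch.cat([torch.ones(pos_edges.size(1)),
                        torch.zeros(neg_edges.size(1))])
    l_ppi = F.binary_cross_entropy_with_logits(logits, labels)       # eq. (12)
    # center loss: pull each protein toward its cell type's embedding (eq. 13,
    # written here as a mean over proteins rather than a sum)
    l_celltypeid = ((z - z_cell_type) ** 2).sum(dim=-1).mean()
    return theta * l_ppi + lam * l_celltypeid                        # eq. (11)
```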
At the cell type level, we use binary cross-entropy to minimize the error of predicting cell type–cell type interactions and cell type–tissue relationships:$${{{{\mathcal{L}}}}}_{{\mathrm{celltype}}}={{{{\mathcal{L}}}}}_{{\mathrm{celltype}}}^{{{{\rm{CC}}}}}+{{{{\mathcal{L}}}}}_{{\mathrm{celltype}}}^{{{{\rm{CT}}}}}$$
(14)
such that$${{{{\mathcal{L}}}}}_{{\mathrm{celltype}}}^{{{{\rm{CC}}}}}=\sum_{{c}_{i},{c}_{j}\in {{{\mathcal{C}}}}}\;{y}_{{c}_{i},{c}_{\!j}}\log (\;{\hat{y}}_{{c}_{i},{c}_{\!j}})+(1-{y}_{{c}_{i},{c}_{\!j}})\log (1-{\hat{y}}_{{c}_{i},{c}_{\!j}})$$
(15)
$${{{{\mathcal{L}}}}}_{{\mathrm{celltype}}}^{{{{\rm{CT}}}}}=\sum_{{c}_{i}\in {{{\mathcal{C}}}}}\sum_{{t}_{k}\in {{{\mathcal{T}}}}}\;{y}_{{c}_{i},{t}_{k}}\log (\;{\hat{y}}_{{c}_{i},{t}_{k}})+(1-{y}_{{c}_{i},{t}_{k}})\log (1-{\hat{y}}_{{c}_{i},{t}_{k}}).$$
(16)
Similarly, at the tissue level, we use binary cross-entropy to minimize the error of predicting tissue–tissue and tissue–cell type relationships:$${{{{\mathcal{L}}}}}_{{\mathrm{tissue}}}={{{{\mathcal{L}}}}}_{{\mathrm{tissue}}}^{{{{\rm{TT}}}}}+{{{{\mathcal{L}}}}}_{{\mathrm{tissue}}}^{{{{\rm{TC}}}}}$$
(17)
such that$${{{{\mathcal{L}}}}}_{{\mathrm{tissue}}}^{{{{\rm{TT}}}}}=\mathop{\sum}\limits_{{t}_{k},{t}_{q}\in {{{\mathcal{T}}}}}{\,y}_{{t}_{k},{t}_{q}}\log (\;{\hat{y}}_{{t}_{k},{t}_{q}})+(1-{y}_{{t}_{k},{t}_{q}})\log (1-{\hat{y}}_{{t}_{k},{t}_{q}})$$
(18)
$${{{{\mathcal{L}}}}}_{{\mathrm{tissue}}}^{{{{\rm{TC}}}}}=\mathop{\sum}\limits_{{t}_{k}\in {{{\mathcal{T}}}}}\mathop{\sum}\limits_{{c}_{i}\in {{{\mathcal{C}}}}}{\,y}_{{t}_{k},{c}_{i}}\log (\;{\hat{y}}_{{t}_{k},{c}_{i}})+(1-{y}_{{t}_{k},{c}_{i}})\log (1-{\hat{y}}_{{t}_{k},{c}_{i}}).$$
(19)
The probability of an edge of type i between nodes u and v is calculated using a bilinear decoder:$${y}_{u,v}={{{{\bf{z}}}}}_{u}\cdot {{{{\bf{r}}}}}_{i}\cdot {{{{\bf{z}}}}}_{v},$$
(20)
where zu and zv are the embeddings of nodes u and v, and ri is the embedding of edge type i. Note that any decoder can be used for link prediction in PINNACLE.

Training details for PINNACLE

Overview
PINNACLE is trained using the cell type identity of the protein interaction networks and the graph connectivity of the cell type-specific protein interaction networks and metagraph. To learn cell type identity, PINNACLE predicts the cell type(s) in which the node(s) corresponding to each protein are activated. To capture graph connectivity, PINNACLE performs self-supervised link prediction: it predicts whether an edge (and its type) exists between a pair of nodes. For link prediction, a randomly selected subset of edges is masked (or hidden) from the model, and the model must predict that these edges exist (and that randomly generated false edges do not). Practically, this means that the graphs fed into PINNACLE during training, validation or testing do not contain the masked edges.

Data split
Protein–protein edges are randomly split into train (80%), validation (10%) and test (10%) sets. The metagraph edges are not split into train, validation and test sets because there are relatively few of them, and they are all critical for injecting cell type and tissue organization into the model. The proteins involved in the train edges are considered in the cell type identification term of the loss function (\({{{{\mathcal{L}}}}}_{{\mathrm{celltypeid}}}\)).

Sampling negative edges
For link prediction, false (or negative) edges have the label of 0 and are randomly generated (via structured_negative_sampling function in Pytorch Geometric). The ratio of positive to negative edges is 1:1.
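
For example, one negative edge can be drawn per positive edge as follows (the toy edge_index is illustrative):

```python
import torch
from torch_geometric.utils import structured_negative_sampling

edge_index = torch.tensor([[0, 1, 2, 3],
                           [1, 2, 3, 0]])              # positive PP edges
src, pos_dst, neg_dst = structured_negative_sampling(edge_index, num_nodes=4)
pos_edges = torch.stack([src, pos_dst])                # label 1
neg_edges = torch.stack([src, neg_dst])                # label 0, randomly generated
```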

Hyperparameter tuning
We leverage Weights and Biases108 to select optimal hyperparameters via a random search over the hyperparameter space. The best-performing hyperparameters for PINNACLE are selected by optimizing the ROC and Calinski–Harabasz score109 on the validation set. The hyperparameter space on which we perform a random search to choose the optimal set of hyperparameters is: the dimension of the nodes’ feature matrix ∈ [1,024, 2,048], dimension of the output layer ∈ [4, 8, 16, 32], lambda ∈ [0.1, 0.01, 0.001], learning rate for link prediction task ∈ [0.01, 0.001], learning rate for protein’s cell type classification task ∈ [0.1, 0.01, 0.001], number of attention heads ∈ [4, 8], weight decay rate ∈ [0.0001, 0.00001], dropout rate ∈ [0.3, 0.4, 0.5, 0.6, 0.7] and normalization layer ∈ [layernorm, batchnorm, graphnorm, none]. The best hyperparameters are as follows: the dimension of the nodes’ feature matrix = 1,024, dimension of the output layer = 16, lambda = 0.1, learning rate for link prediction task = 0.01, learning rate for protein’s cell type classification task = 0.1, number of attention heads = 8, weight decay rate = 0.00001, dropout rate = 0.6, and normalization layers are layernorm and batchnorm. Further, PINNACLE consists of two custom graph attention neural network layers (‘Protein-level attention with cell type specificity’ and ‘Metagraph-level attention on cellular interactions and tissue hierarchy’ sections in Methods) per cell type-specific PPI network and metagraph and is trained for 250 epochs.
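
A sketch of how such a random search could be configured with Weights and Biases; the sweep keys, metric name and the train_pinnacle entry point are illustrative rather than the authors' actual configuration.

```python
import wandb

sweep_config = {
    "method": "random",
    "metric": {"name": "val_calinski_harabasz", "goal": "maximize"},
    "parameters": {
        "feature_dim": {"values": [1024, 2048]},
        "output_dim": {"values": [4, 8, 16, 32]},
        "lambda": {"values": [0.1, 0.01, 0.001]},
        "lr_link_pred": {"values": [0.01, 0.001]},
        "lr_cell_type": {"values": [0.1, 0.01, 0.001]},
        "attention_heads": {"values": [4, 8]},
        "weight_decay": {"values": [1e-4, 1e-5]},
        "dropout": {"values": [0.3, 0.4, 0.5, 0.6, 0.7]},
        "norm": {"values": ["layernorm", "batchnorm", "graphnorm", "none"]},
    },
}

def train_pinnacle():
    run = wandb.init()
    config = run.config   # one sampled hyperparameter combination
    # ... build and train the model with `config`, then log validation metrics ...
    wandb.log({"val_calinski_harabasz": 0.0})

sweep_id = wandb.sweep(sweep_config, project="pinnacle")
wandb.agent(sweep_id, function=train_pinnacle)
```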

Implementation
We implement PINNACLE using Pytorch (Version 1.12.1)110 and Pytorch Geometric (Version 2.1.0)111. We leverage Weights and Biases108 for hyperparameter tuning and model training visualization, and we create interactive demos of the model using Gradio112. Models are trained on a single NVIDIA Tesla V100-SXM2-16GB GPU.

Generating contextualized 3D protein representations
After pretraining PINNACLE, we can leverage the output protein representations for diverse downstream tasks. Here, we demonstrate PINNACLE's ability to improve the prediction of PPIs by injecting context into 3D molecular structures of proteins.

Overview
Given a protein of interest, we generate both a context-free structure-based representation via MaSIF3,17 and a contextualized PPI network-based representation via PINNACLE. We calculate the binding score of a pair of proteins based on either the context-free or the contextualized representations of the proteins. To quantify the added value, if any, provided by contextualizing protein representations with cell type context, we compare the size of the gap between the average binding scores of binding and nonbinding proteins under the two approaches.

Dataset
The proteins being compared are PD-1, PD-L1, B7-1, CTLA-4, RalB, RalBP1, EPO, EPOR, C3 and CFH. The pairs of binding proteins are PD-1/PD-L1 (PDB ID: 4ZQK) and B7-1/CTLA-4 (PDB ID: 1I8L). The nonbinding pairs are any of these four proteins paired with any of the remaining six proteins (for example, PD-1/RalB, PD-1/RalBP1 and PD-L1/RalBP1). The PDB IDs for the other six proteins are 2KWI for RalB/RalBP1, 1CN4 for EPO/EPOR and 3OXU for C3/CFH.

Structure-based protein representation learning
We directly apply the pretrained MaSIF3,17 model to generate the 3D structure-based protein representations. We use the model pretrained for the MaSIF-site task, named all_feat_3l_seed_benchmark. The output of the pretrained model for a given protein is P × d, where P is the number of patches (precomputed by the authors of MaSIF3,17) and d = 4 is the dimension of the pretrained model's output layer. As proteins vary in size (that is, in the number of patches needed to cover the protein surface), we select a fixed number k of patches that are most likely to be part of the binding site (according to the pretrained MaSIF model). For each protein, we select k = 200 patches, which is the average number of patches for PD-1, PD-L1, B7-1 and CTLA-4, resulting in a matrix of size 200 × 4. Finally, we take the element-wise median over the 200 × 4 matrix to obtain a vector of length 200. This vector becomes the structure-based protein representation for the given protein.

Experimental setup
For each cell type context of a given protein, we concatenate the 3D structure-based protein representation (from MaSIF) with the cell type-specific protein representation (from PINNACLE) to generate a contextualized structure-based protein representation. To create the context-free protein representation, we concatenate the structure-based protein representation with an element-wise average of PINNACLE's protein representations. This maintains consistent dimensionality and a shared latent space between context-free and contextualized protein representations. Given a pair of proteins, we calculate a score via cosine similarity (a function provided by sklearn113) using either the context-free or the contextualized protein representations. Lastly, we quantify the gap between the scores of binding and nonbinding proteins under the context-free and contextualized protein representations to evaluate the added value (if any) of contextual AI.
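
A minimal sketch of this scoring procedure, using random placeholder vectors in place of the real MaSIF and PINNACLE representations:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def binding_score(struct_a, struct_b, ctx_a, ctx_b):
    """Concatenate a structure vector with a PINNACLE protein representation
    (cell type-specific, or the element-wise average over cell types for the
    context-free baseline) and score the pair by cosine similarity."""
    rep_a = np.concatenate([struct_a, ctx_a]).reshape(1, -1)
    rep_b = np.concatenate([struct_b, ctx_b]).reshape(1, -1)
    return float(cosine_similarity(rep_a, rep_b)[0, 0])

# Illustrative usage with random placeholders for the real representations.
rng = np.random.default_rng(0)
struct_pd1, struct_pdl1 = rng.normal(size=200), rng.normal(size=200)   # MaSIF vectors
pinnacle_pd1, pinnacle_pdl1 = rng.normal(size=16), rng.normal(size=16) # PINNACLE embeddings
score = binding_score(struct_pd1, struct_pdl1, pinnacle_pd1, pinnacle_pdl1)
```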

Fine-tuning PINNACLE for context-specific target prioritization
After pretraining PINNACLE, we can fine-tune the output protein representations for diverse biomedical downstream tasks. Here, we demonstrate PINNACLE's ability to enhance the performance of predicting a protein's therapeutic potential for a specific therapeutic area.
For each protein of interest, we feed its PINNACLE-generated embedding into an MLP. The model outputs a score between 0 and 1, where 1 indicates that the protein is a strong candidate target (that is, for a compound or drug) for treating the therapeutic area and 0 otherwise. Because a protein has multiple representations corresponding to the cell types in which it is activated, the MLP generates a score for each of the protein's cell type-specific representations (Fig. 4a). For example, Protein 1's representation from Cell type 1 is scored independently of its representation from Cell type 2. The output scores can be examined to identify the most predictive cell types and the strongest candidate therapeutic targets in any specific cell type.

Therapeutic targets dataset
We obtain labels for therapeutic targets from the Open Targets Platform61.
Therapeutic area selection
To curate target information for a therapeutic area, we examine the drugs indicated for the therapeutic area of interest and its descendants. The two therapeutic areas examined are RA and IBD. For RA, we collected therapeutic data (that is, targets of drugs indicated for the therapeutic area) from OpenTargets61 for RA (EFO_0000685), ankylosing spondylitis (EFO_0003898) and psoriatic arthritis (EFO_0003778). For IBD, we collected therapeutic data for ulcerative colitis (EFO_0000729), collagenous colitis (EFO_1001293), colitis (EFO_0003872), proctitis (EFO_0005628), Crohn’s colitis (EFO_0005622), lymphocytic colitis (EFO_1001294), Crohn’s disease (EFO_0000384), microscopic colitis (EFO_1001295), IBD (EFO_0003767), appendicitis (EFO_0007149), ulcerative proctosigmoiditis (EFO_1001223) and small bowel Crohn’s disease (EFO_0005629).

Positive training examples
We define positive examples (that is, where the label y = 1) as proteins targeted by drugs that have at least completed phase 2 of clinical trials for treating a certain therapeutic area. As such, a protein is a promising candidate if a compound that targets the protein is safe for humans and effective for treating the disease. We retain positive training examples that are activated in at least one cell type-specific protein interaction network. The final number of positive training examples for RA and IBD is 152 and 114, respectively.

Negative training examples
We define negative examples (that is, where the label y = 0) as druggable proteins that do not have any known association with the therapeutic area of interest according to OpenTargets. A protein is deemed druggable if it is targeted by at least one existing drug114. We extract drugs and their nominal targets from DrugBank79. We retain negative training examples that are activated in at least one cell type-specific protein interaction network. The final number of negative training examples for RA and IBD is 1,465 and 1,377, respectively.

Data processing workflow
For a therapeutic area of interest, we identify its descendants. With the list of disease terms for the therapeutic area, we curate its positive and negative training examples. We split the dataset such that about 60%, 20% and 20% of the proteins are in the train, validation and test sets, respectively. We additionally apply two criteria to avoid data leakage and to ensure that all cell types are represented during training and inference. First, proteins are assigned to the train (60%), validation (20%) and test (20%) sets based on their identity; this prevents data leakage in which cell type-specific representations of a single protein are observed in multiple data splits. Second, we ensure that there are sufficient numbers of positive train, validation and test samples per cell type; proteins may be reassigned to a different data split so that each cell type is represented during the training, validation and testing stages. With these criteria, the train, validation and test splits may not consist of exactly 60%, 20% and 20% of the total protein representations (Supplementary Table 6).
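
A sketch of the identity-based split using scikit-learn's GroupShuffleSplit; the example protein identifiers are illustrative, and the per-cell-type reassignment step described above is omitted.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

def split_by_protein(protein_ids, test_size=0.2, val_size=0.2, seed=0):
    """Assign every cell type-specific representation of a protein to the same
    split to avoid leakage across train/validation/test."""
    idx = np.arange(len(protein_ids))
    outer = GroupShuffleSplit(n_splits=1, test_size=test_size, random_state=seed)
    train_val, test = next(outer.split(idx, groups=protein_ids))
    inner = GroupShuffleSplit(n_splits=1,
                              test_size=val_size / (1 - test_size),
                              random_state=seed)
    train, val = next(inner.split(train_val, groups=protein_ids[train_val]))
    return train_val[train], train_val[val], test

# Example: rows are (protein, cell type) representations; groups are protein IDs.
protein_ids = np.array(["JAK1", "JAK1", "TNF", "TNF", "IL6R",
                        "IL6R", "STAT3", "PTPN22", "IL23R", "TYK2"])
train_idx, val_idx, test_idx = split_by_protein(protein_ids)
```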
Fine-tuning model details
Model architecture
Our MLP comprises an input feedforward layer, one hidden feedforward layer and an output feedforward layer, with a nonlinear activation between each pair of layers. In addition, we use dropout and normalization layers between the input and hidden layers (see 'Implementation' section for more information). Our objective function is the binary cross-entropy loss.
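
A minimal sketch of this architecture in PyTorch; the hidden sizes, dropout rate and the choice of ReLU and LayerNorm shown here are placeholders for the values selected by the hyperparameter search below.

```python
import torch.nn as nn

class TargetPrioritizationMLP(nn.Module):
    """Fine-tuning head sketch: input, hidden and output feedforward layers with
    nonlinearities, plus dropout and normalization between input and hidden layers."""
    def __init__(self, embed_dim=16, hidden1=32, hidden2=16, dropout=0.5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, hidden1), nn.ReLU(),
            nn.LayerNorm(hidden1), nn.Dropout(dropout),
            nn.Linear(hidden1, hidden2), nn.ReLU(),
            nn.Linear(hidden2, 1), nn.Sigmoid(),   # score in [0, 1]
        )

    def forward(self, protein_embedding):
        # one score per cell type-specific protein representation
        return self.net(protein_embedding).squeeze(-1)
```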

Hyperparameter tuning
We leverage Weights and Biases108 to select optimal hyperparameters via a random search over the hyperparameter space. The best-performing hyperparameters are selected by optimizing the AUPRC on the validation set. The hyperparameter space on which we perform a random search to choose the optimal set of hyperparameters is the dimension of the first hidden layer ∈ [8, 16, 32], dimension of the second hidden layer ∈ [8, 16, 32], learning rate ∈ [0.01, 0.001, 0.0001], weight decay rate ∈ [0.001, 0.0001, 0.00001, 0.000001], dropout rate ∈ [0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8], normalization layer ∈ [layernorm, batchnorm, none] and the ordering of dropout and normalization layer (that is, normalization before dropout or vice versa).

Implementation
We implement the MLP using Pytorch (Version 1.12.1)110. In addition, we use Weights and Biases108 for hyperparameter tuning and model training visualization. Models are trained on a single NVIDIA Tesla M40 GPU.

Metrics and statistical analyses
Here, we describe the metrics, visualization methods and statistical tests used in our analysis.

Visualization of embeddings
We visualize PINNACLE's embeddings using uniform manifold approximation and projection (UMAP)115 and seaborn. Using the Python package umap, we transform PINNACLE's embeddings into two-dimensional vectors with the parameters n_neighbors = 10, min_dist = 0.9, n_components = 2 and the euclidean distance metric. The plots are created using the seaborn package's scatterplot function.

Visualization of cell type embedding similarity
The pairwise similarity of PINNACLE's cell type embeddings is calculated using cosine similarity (a function provided by sklearn113). These similarity scores are visualized using the seaborn package's clustermap function. For visualization purposes, similarity scores are mapped to colors after being raised to the 20th power.

Spatial enrichment analysis of PINNACLE's protein embeddings
To quantify the spatial enrichment of PINNACLE's protein embedding regions, we apply a systematic approach, SAFE31, that identifies regions overrepresented for a feature of interest (Extended Data Figs. 3 and 4). The required inputs for SAFE are networks and annotations of each node. We first construct an unweighted similarity network on PINNACLE protein embeddings: (1) calculate pairwise cosine similarity, (2) apply a similarity threshold to the similarity matrix to generate an adjacency matrix and (3) extract the largest connected component. The protein nodes are labeled 1 if they belong to a given cell type context and 0 otherwise. We then apply SAFE to each network using the recommended settings: neighborhoods are defined using the shortpath_weighted_layout metric for node distance and a neighborhood radius of 0.15, and P values are computed using the hypergeometric test, adjusted using the Benjamini–Hochberg false discovery rate correction (significance cutoff α = 0.05).
Due to computation and memory constraints, we sample 50 protein embeddings from the cell type context of interest and 10 protein embeddings from each of the other 155 cell type contexts. We use a threshold of 0.3 in our evaluation of PINNACLE's protein embedding regions (Fig. 2 and Extended Data Fig. 3). We also evaluate the spatial enrichment analysis on networks constructed with different thresholds ([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]) to ensure that the enrichment is not sensitive to our network construction method (Extended Data Fig. 4). We use the Python implementation of SAFE (https://github.com/baryshnikova-lab/safepy).
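
A minimal sketch of how the SAFE input network can be assembled from sampled embeddings (function and variable names are illustrative):

```python
import networkx as nx
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def safe_input_network(embeddings, protein_names, threshold=0.3):
    """Build the unweighted similarity network fed into SAFE from sampled
    PINNACLE protein embeddings (threshold 0.3 as in the main analysis)."""
    sim = cosine_similarity(embeddings)              # pairwise cosine similarity
    adj = (sim >= threshold).astype(int)
    np.fill_diagonal(adj, 0)
    g = nx.from_numpy_array(adj)
    g = nx.relabel_nodes(g, dict(enumerate(protein_names)))
    # keep only the largest connected component
    largest = max(nx.connected_components(g), key=len)
    return g.subgraph(largest).copy()
```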

Statistical significance of tissue embedding distance
Tissue embedding distance between a given pair of tissue nodes is calculated using cosine distance (a function provided by sklearn113). Tissue ontology distance between a given pair of tissue nodes is calculated by taking the sum of the nodes' shortest path lengths to their lowest common ancestor (functions provided by networkx116). We use the two-sample Kolmogorov–Smirnov test (a function provided by scipy117) to compare PINNACLE embedding distances against randomly generated vectors (sampled with the numpy randn function so that an equal number of vectors is drawn from a standard normal distribution). We also use the Spearman correlation (a function provided by scipy117) to correlate PINNACLE embedding distance with tissue ontology distance. We additionally generate a null distribution of tissue ontology distance by calculating tissue ontology distance on a shuffled tissue hierarchy (repeated ten times). Concretely, we shuffle the node identities of the BRENDA tissue ontology106 and compute the pairwise tissue ontology distances.

Statistical significance of binding and nonbinding proteins' score gaps
We perform a one-sided nonparametric permutation test. First, we concatenate the scores for the N binding pairs and M nonbinding pairs. Next, for 100,000 iterations, we randomly sample N scores as the new set of binding protein scores and M scores as the new set of nonbinding protein scores, calculate the mean μN of the N binding protein scores and the mean μM of the M nonbinding protein scores, calculate the score gap as the difference of the means μN − μM, and keep track of the permuted score gaps that are greater than or equal to the true score gap calculated from the real data. Lastly, we calculate the P value, defined as the fraction of the 100,000 iterations in which the permuted score gap is greater than or equal to the true score gap.

Performance metric for therapeutic target prioritization
For our downstream therapeutic target prioritization task ('Fine-tuning PINNACLE for context-specific target prioritization' section in Methods), we use a metric called Average Precision and Recall at K (APR@K) to evaluate model performance. APR@K leverages a combination of Precision@K and Recall@K to measure the ability to rank the most relevant items (in our case, proteins) among the top K predictions. In essence, APR@K calculates Precision@k for each k ∈ [1, …, K], multiplies each Precision@k by whether the kth item is relevant and divides by the total number of relevant items r at K:$${{{\rm{APR}}}}@K=\frac{1}{r}\mathop{\sum}\limits_{k=1}^{K}{{{\rm{Precision}}}}@k\times {{{\rm{rel}}}}(k),$$where$${\mathrm{rel}}(k)=\left\{\begin{array}{ll}1,\quad &\,{{\mbox{if item at}}}\,\,k\,\,{{\mbox{is relevant}}}\,\\ 0,\quad &{{{\rm{otherwise}}}}\end{array}\right.$$Given the nature of our target prioritization task, key advantages of APR@K include robustness to (1) varied numbers of protein targets activated across cell type-specific protein interaction networks and (2) varied sizes of cell type-specific protein interaction networks.
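
A minimal sketch of APR@K given a binary relevance vector for the top-ranked predictions:

```python
import numpy as np

def apr_at_k(ranked_relevance, k):
    """APR@K: sum of Precision@k over the relevant items among the top-K
    predictions, divided by the number of relevant items r at K."""
    rel = np.asarray(ranked_relevance[:k], dtype=float)   # 1 if item is relevant
    if rel.sum() == 0:
        return 0.0
    precision_at_k = np.cumsum(rel) / (np.arange(len(rel)) + 1)
    return float((precision_at_k * rel).sum() / rel.sum())

# Example: the top-5 ranked proteins, 1 = known therapeutic target.
print(apr_at_k([1, 0, 1, 1, 0], k=5))   # (1/1 + 2/3 + 3/4) / 3 ≈ 0.81
```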

Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.