MiLoPYP: self-supervised molecular pattern mining and particle localization in situ

Self-supervised pattern mining and exploration module

The goal of the exploration module is to learn an embedding from the input set of tomograms and facilitate the identification of abundant protein species that could be used for high-resolution SPT analysis. This module is composed of three main steps: (1) 3D tomogram preprocessing, candidate coordinate generation, corresponding tilt series averaging and subtomogram slice extraction; (2) 2D CNN feature representation learning based on self-supervised contrastive learning; and (3) visualization of learned features.

Preprocessing, coordinate generation and filtering

A typical CET dataset, denoted as \({\mathcal{D}}\), consists of a collection of tomograms represented as \({T}_{i}\in {{\mathbb{R}}}^{W\times H\times D}\), where i ranges from 1 to j (j typically ranging from tens to a few hundred tomograms) and W, H and D are the tomogram dimensions. The preprocessing stage applies Gaussian denoising and histogram equalization to enhance image contrast. 3D candidate coordinates are generated for each preprocessed tomogram Ti using a DoG pyramid approach. To obtain these coordinates, the input tomogram Ti is Gaussian filtered using progressively larger standard deviations σk with k = 1, …, n. This results in n Gaussian-smoothed tomograms, denoted as \({\tilde{T}}_{k}\). A DoG representation is then obtained by calculating the difference between two Gaussian-smoothed tomograms with adjacent standard deviations: \({\rm{DoG}}_{k}={\tilde{T}}_{k}-{\tilde{T}}_{k+1}\). The final DoG representation is obtained by selecting the maximum voxel value across all levels of the pyramid, followed by 3D NMS. 3D coordinates are then extracted at voxels whose values exceed a threshold set automatically at 0.5 standard deviations above the mean voxel intensity. For each 3D coordinate (x, y, z), its corresponding 2D coordinate (xθ, yθ) in the tilted projection at angle θ is calculated as shown in equation (1):$$({x}_{\theta },{y}_{\theta })=\left(\left(x-\left\lfloor \frac{W}{2}\right\rfloor \right)\cos (\theta )+\left(z-\left\lfloor \frac{D}{2}\right\rfloor \right)\sin (\theta )+\left\lfloor \frac{W}{2}\right\rfloor ,y\right).$$
(1)
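As an illustration of these two steps, the following Python sketch generates DoG candidates and maps a 3D coordinate onto a tilt image following equation (1). It is a minimal sketch only: the (x, y, z) axis ordering, the use of SciPy filters, and the maximum-filter shortcut standing in for the greedy radius-based NMS described below are all assumptions, and the function names are hypothetical.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, maximum_filter

def dog_candidates(tomo, sigmas, nms_radius):
    """Difference-of-Gaussians candidate generation: smooth at increasing
    sigmas, take differences of adjacent levels, max-project across the
    pyramid, then keep local maxima above mean + 0.5 std."""
    smoothed = [gaussian_filter(tomo, s) for s in sigmas]
    dog = np.stack([smoothed[k] - smoothed[k + 1]
                    for k in range(len(smoothed) - 1)])
    response = dog.max(axis=0)
    threshold = response.mean() + 0.5 * response.std()
    # maximum-filter shortcut for 3D NMS within a (2r+1)^3 neighborhood
    peaks = response == maximum_filter(response, size=2 * nms_radius + 1)
    return np.argwhere(peaks & (response > threshold))

def project_to_tilt(x, z, theta_deg, W, D):
    """x coordinate of a 3D point in the tilt image at angle theta,
    following equation (1); y is unchanged (tilt axis along y)."""
    theta = np.deg2rad(theta_deg)
    return (x - W // 2) * np.cos(theta) + (z - D // 2) * np.sin(theta) + W // 2
```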
Since tilted images within a −15° to +15° range contain higher contrast information, we calculate the projected 2D coordinates (xθ, yθ) only for images in this angular range. Subsequently, patches centered at the calculated coordinates (xθ, yθ) are extracted and averaged to obtain the final 2D tilt series input, denoted as Pt. In addition, we also extract Ps, consisting of the central tomographic slice at coordinates (x, y, z). The pair Pt, Ps serves as input to the neural network (Extended Data Fig. 1a). For datasets containing low-quality tilt series or substantial overlap in the z-direction due to crowding, only Ps is used as input rather than the pair Pt, Ps.

NMS is an essential step aimed at selecting the most prominent locations based on nearby voxel values. This operation is performed in a greedy manner, starting from the coordinates with the highest voxel values. To prevent the selection of neighboring voxels, a user-defined radius is used during NMS: any voxel within this distance from an already selected coordinate is excluded from further consideration. The radius should be adjusted according to the expected particle density; for densely populated tomograms, a smaller radius is recommended to ensure accurate identification of individual particles and to minimize the risk of selecting nearby voxels.

Overall architecture and contrastive learning loss function

MiLoPYP uses 2D CNNs for feature representation learning. The overall architecture follows the Siamese neural network design, which contains two identical subnetworks with the same configuration and parameters40,51. During training, MiLoPYP takes as input two randomly augmented views Pt1, Ps1 and Pt2, Ps2 derived from the original pair Pt, Ps. The two views are processed by an encoder network f, consisting of a ResNet18-based backbone and a projection multilayer perceptron (MLP) head. A prediction MLP head h is applied to the output of one view, and a stop-gradient operation is applied on the other side. The encoder f shares weights between the two views. The ResNet18 backbone consists of 18 layers52, including one convolution layer with kernel size 7 × 7, four residual blocks with kernel size 3 × 3, one average pooling layer and one fully connected layer. The projection MLP head is composed of three fully connected layers with one-dimensional (1D) batch normalization and rectified linear unit (ReLU) activation. The prediction MLP head is composed of two fully connected layers with 1D batch normalization and ReLU activation. The 128-dimensional output vector from the encoder network is denoted as \({z}_{i}\triangleq f({P}_{{t}_{i}},{P}_{{s}_{i}})\), and the 128-dimensional output vector from the prediction head is denoted as \({y}_{i}\triangleq h(\;f({P}_{{t}_{i}},{P}_{{s}_{i}}))\). The CNN is trained by maximizing the cosine similarity between the learned representations of positive pairs. Through this training process, MiLoPYP learns to embed macromolecules with shared structural similarities close to each other while placing dissimilar macromolecules farther apart in feature space. Following ref. 40, we define the negative cosine similarity as in equation (2):$${\mathcal{D}}(\,{y}_{1},{z}_{2})=-\frac{{y}_{1}}{\parallel {y}_{1}{\parallel }_{2}}\cdot \frac{{z}_{2}}{\parallel {z}_{2}{\parallel }_{2}},$$
(2)
where ∥ ⋅ ∥2 represents the L2 norm. The final symmetrized loss function has the form shown in equation (3):$${\mathcal{L}}=\frac{1}{2}{\mathcal{D}}(\,{y}_{1},{\rm{stopgrad}}{z}_{2})+\frac{1}{2}{\mathcal{D}}(\,{y}_{2},{\rm{stopgrad}}{z}_{1}),$$
(3)
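A minimal PyTorch sketch of equations (2) and (3), with the stop-gradient implemented via .detach(); here f is assumed to be the encoder (backbone plus projection MLP) and h the prediction head, both user-supplied modules.

```python
import torch.nn.functional as F

def neg_cosine(y, z):
    """Equation (2): negative cosine similarity, with stopgrad on z."""
    return -F.cosine_similarity(y, z.detach(), dim=-1).mean()

def symmetrized_loss(f, h, view1, view2):
    """Equation (3): symmetrized loss over two augmented views."""
    z1, z2 = f(view1), f(view2)   # 128-d encoder outputs
    y1, y2 = h(z1), h(z2)         # 128-d prediction head outputs
    return 0.5 * neg_cosine(y1, z2) + 0.5 * neg_cosine(y2, z1)
```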
This loss has a minimum possible value of −1. Here, 'stopgrad' refers to the stop-gradient operation, where the model parameter-dependent variable z is treated as a constant during back-propagation. With the stop-gradient applied, the encoder f operating on the input pair Pt1, Ps1 receives gradients through y1 in the first term, but not through z1 in the second term. The same principle applies to the input pair Pt2, Ps2, for which f receives gradients through y2 but not through z2. This stop-gradient operation is crucial for preventing model degeneration when training models with only positive pairs.

Model training, inference and data augmentation

Training of the 2D CNN was performed for 300 epochs using stochastic gradient descent53. We initialize the network using PyTorch's ResNet18 weights pretrained on ImageNet. We use a batch size of 256 and a learning rate of 0.02 with a cosine decay schedule. The weight decay is 0.0001 and the stochastic gradient descent momentum is 0.9. During training, a subset of the dataset \({\mathcal{D}}\) is used, typically comprising 5 to 10 tomograms. Training is performed on a single NVIDIA V100 GPU with 32 GB of RAM and typically takes less than 2 h for 300 epochs. After training, evaluation can be performed on the same subset of tomograms or on the entire dataset.

In self-supervised contrastive learning, the goal is to train the network such that the representations of different augmented views of the same instance are maximally similar. To achieve this, careful selection of data augmentation techniques is essential to preserve the identity of each instance. In our approach, we apply several random augmentations during training: horizontal and vertical flipping, intensity jittering, random resized cropping, corner dropout and rotations (multiples of 90°). Each of these augmentations serves a specific purpose in maintaining instance identity and promoting effective learning. Intensity jittering varies the brightness and contrast of the input instances: brightness is jittered from 0 to 0.5 and contrast from 0 to 0.2. By jittering these values within set ranges, we enable the model to learn the shapes of instances irrespective of their gray-value characteristics. Random resized cropping randomly crops a portion of the input instance within a range of 0.7 to 1 times the original size, which is then resized back to the original size. This augmentation helps the model focus on learning the shapes of instances while minimizing sensitivity to size variations. The choice of a larger lower bound (0.7) in the crop range allows the model to capture important shape information without completely disregarding size differences, which can be substantial in distinguishing macromolecules. Corner dropout randomly sets voxel values to 0 within a subarea located at a corner of each input instance. This operation encourages the model to concentrate on the central region (where particles are), enhancing its ability to capture important features. Rather than applying random rotations, we restrict rotation to multiples of 90°. This decision is based on experimental findings that random rotation, particularly with interpolation, can introduce distortions that may hinder shape learning.
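The augmentation recipe just described could be sketched with torchvision as follows; the patch size (64), the corner-area fraction for dropout and the mapping of the stated jitter ranges onto ColorJitter arguments are assumptions.

```python
import random
import torch
import torchvision.transforms as T

def rot90_random(img):
    """Rotate by a random multiple of 90 degrees (no interpolation)."""
    return torch.rot90(img, k=random.randint(0, 3), dims=(-2, -1))

def corner_dropout(img, frac=0.25):
    """Zero a random corner sub-area so the model focuses on the center."""
    h, w = img.shape[-2:]
    dh, dw = int(h * frac), int(w * frac)
    top = 0 if random.random() < 0.5 else h - dh
    left = 0 if random.random() < 0.5 else w - dw
    out = img.clone()
    out[..., top:top + dh, left:left + dw] = 0
    return out

augment = T.Compose([
    T.RandomHorizontalFlip(),
    T.RandomVerticalFlip(),
    T.ColorJitter(brightness=0.5, contrast=0.2),          # intensity jitter
    T.RandomResizedCrop(64, scale=(0.7, 1.0), antialias=True),
    T.Lambda(rot90_random),
    T.Lambda(corner_dropout),
])
```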
Using rotations that are multiples of 90° avoids interpolation artifacts and simplifies the resampling of voxel values.

Dimensionality reduction, visualization and module output

For 2D grid visualization, the 128-dimensional learned representation is reduced to 2D using UMAP54. 2D embeddings were computed using version 0.5.0 of the Python implementation (https://github.com/lmcinnes/umap/) with k = 40 for the k-nearest-neighbors graph and a minimum distance parameter of 0.1. Using fewer neighbors may help reveal finer local structure, while using more neighbors will capture the overall structure with some loss of fine local detail (Extended Data Fig. 8). Grid visualization and 3D mapping tools are included with MiLoPYP, through the nextPYP and Phoenix (https://github.com/Arize-ai/phoenix/) applications.

To assign each learned embedding a class label without knowing the actual number of classes c in the dataset, we perform over-clustering and label reassignment. As the main objective is to explore protein variability within the dataset rather than to identify the correct number of classes, grouping similar proteins into the same class is more important, which makes over-clustering a reasonable approach. For a set of embeddings, we first perform vanilla k-means using FAISS55 with k substantially larger than the potential number of clusters in the dataset, that is, k ≫ c. Spectral clustering56 is then applied to the k cluster centers, reclustering them into h classes, where h is a user-defined value. Typically, h is smaller than k but still greater than the potential actual number of classes. Following the over-clustering and label reassignment step, the learned embeddings are classified into h labels. Each class of embeddings is assigned a different color, and users can select specific classes or individual patches to view in the 3D interactive session (Extended Data Figs. 4 and 6).

Through an interactive session, users can effectively identify targets of interest for further investigation. By selecting specific patches and their neighbors within a user-defined subregion, or user-specified labels, the corresponding 3D coordinates can be obtained. It is important to note that as the subregions become larger, the accuracy of selection diminishes owing to the unsupervised nature of feature representation learning. To obtain a precise set of candidate coordinates, a refinement stage is necessary. The training process for the refinement model requires a small number of annotated examples, which can be derived by selecting candidate patches from the interactive session, eliminating the need for manual annotation.

Few-shot particle localization module

Similarly to the exploration module, the particle localization module is composed of the following main steps (Fig. 1b and Extended Data Fig. 2): (1) 3D tomogram preprocessing with Gaussian denoising and voxel intensity rescaling using histogram equalization; (2) few-shot learning-based neural-network training; and (3) extraction of particle coordinates using NMS (for globular-shaped particles) or polynomial fitting (for fiber-like proteins or microtubules; Extended Data Fig. 6). The particle localization module can be trained on partially annotated tomograms containing only a limited number of labeled particles. These labels can be obtained either through the previous exploration module or by manual annotation.
This approach enables efficient training and facilitates accurate localization of particles with minimal requirements for labeled data.

Framework for particle detection

The objective of semi-supervised protein localization is to develop a model capable of detecting the locations of proteins in 3D tomograms while learning from only a few annotated examples. Each tomogram Ti contains a varying number of proteins, typically ranging from a few hundred to a few thousand. In the semi-supervised protein identification scenario, the training set \({{\mathcal{D}}}_{\rm{tr}}\) consists of a single tomogram Ttr with a few annotated proteins, while the remaining tomograms serve as the testing set \({{\mathcal{D}}}_{\rm{te}}\). Unlike standard object detection algorithms that aim to localize objects of interest using bounding box locations and sizes, in cryogenic electron microscopy applications the desired output of the particle detection task is the set of center coordinates of proteins. Hence, inspired by ref. 57, we train the particle detector using \({{\mathcal{D}}}_{\rm{tr}}\) to generate a center point heat map \(\hat{Y}\in {[0,1]}^{C\times \frac{W}{{R}_{1}}\times \frac{H}{{R}_{1}}\times \frac{D}{{R}_{2}}}\), where Ri represents the output stride and C is the number of protein species. The output stride downsamples the prediction by a factor of Ri in each dimension. In our case, we consider monodisperse samples, so we set C = 1 and use Ri = 1. In subsequent sections, we omit the C dimension for simplicity. We extend the approach used in ref. 58 to generate the ground-truth heat map \(Y\in {[-1,1]}^{\frac{W}{{R}_{1}}\times \frac{H}{{R}_{1}}\times \frac{D}{{R}_{2}}}\) from the partially annotated tomogram. For each annotated center coordinate position p = (x, y, z), we compute its downsampled equivalent as \(\tilde{p}=(\lfloor \frac{x}{{R}_{1}}\rfloor ,\lfloor \frac{y}{{R}_{1}}\rfloor ,\lfloor \frac{z}{{R}_{2}}\rfloor )\). For each \(\tilde{p}\) in Y, in the case of globular particle detection, we apply a Gaussian kernel \({K}_{xyz}=\exp \, (-\frac{{(x-{\tilde{p}}_{x})}^{2}+{(y-{\tilde{p}}_{y})}^{2}+{(z-{\tilde{p}}_{z})}^{2}}{2{\sigma }_{k}^{2}})\), where σk is determined by the particle size58. For the detection of tubular-shaped macromolecules, we instead set all voxels within a radius of 3 of \(\tilde{p}\) to 1. The remaining unlabeled coordinates in Y are assigned a value of −1.

Feature extraction and center localization modules

MiLoPYP's particle localization framework has three essential components (Extended Data Fig. 2): (1) an encoder–decoder feature extraction backbone, (2) a protein center localization module, and (3) a debiased voxel-level contrastive feature regularization module. Notably, although the network's input is a 3D tomogram, the feature extraction backbone primarily consists of 2D convolutional layers; only the final two layers use 3D convolutions. This design is inspired by the manual particle picking process, where xy slices contain the most useful information while the xz and yz views provide limited information owing to missing wedge distortions. The extracted features are directed to both the center localization module and the contrastive feature regularization module. The center localization module is trained using a novel objective function, a positive unlabeled (PU) focal loss59, which reduces overfitting when training with only positive data.
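Before turning to the regularization module, here is a minimal sketch of the ground-truth heat map construction described above, assuming an output stride of 1 (as used here) and truncating each Gaussian splat at three standard deviations (an assumption; the function name is hypothetical):

```python
import numpy as np

def ground_truth_heatmap(shape, centers, sigma_k, tubular=False):
    """Build Y: Gaussian splats at annotated centers (globular case) or
    radius-3 balls of 1s (tubular case); unlabeled voxels stay at -1."""
    Y = -np.ones(shape, dtype=np.float32)
    grids = np.meshgrid(*[np.arange(s) for s in shape], indexing="ij")
    for c in centers:  # c is the (downsampled) annotated center
        d2 = sum((g - ci) ** 2 for g, ci in zip(grids, c))
        if tubular:
            Y[d2 <= 3 ** 2] = 1.0
        else:
            near = d2 <= (3 * sigma_k) ** 2   # truncate splat at 3 sigma
            Y[near] = np.maximum(Y[near],
                                 np.exp(-d2[near] / (2 * sigma_k ** 2)))
    return Y
```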
The contrastive feature regularization module, in turn, is trained using a debiased infoNCE loss60, which maximizes feature similarity for voxels belonging to the same protein class while minimizing the similarity between voxels corresponding to protein versus non-protein regions. The combination of these two modules enables accurate training of the protein localization model with minimal annotations. Additionally, the contrastive regularization helps mitigate the overfitting commonly associated with positive unlabeled learning methods. By identifying particle centers at the voxel level, the trained model can precisely locate proteins even in crowded environments.

The feature extraction backbone consists of a 2D UNet-based per-slice feature extractor and a 3D convolutional layer that fuses the per-slice features into a single 3D representation. The numbers of encode and decode blocks are user-adjustable hyperparameters: for smaller particles, we recommend a maximum of four encode blocks; for larger particles, five encode blocks are suggested. Each encode downsampling block includes two convolutional layers with a kernel size of 3 × 3, ReLU activation and max pooling with a kernel size of 2 × 2. Each decode upsampling block consists of one transpose convolutional layer with a kernel size of 2 × 2, two convolutional layers with a kernel size of 3 × 3, and ReLU activations. The 3D fusion layer is a convolutional layer with a kernel size of 3 × 3 × 3 and ReLU activation. The overall architecture follows a Siamese network such that the input to the network is the tomogram T and its augmented pair \(\tilde{T}\). For the input tomogram \(T\in {{\mathbb{R}}}^{W\times H\times D}\) and its augmented pair \(\tilde{T}\), we denote the outputs from the feature extraction backbone as \(M\in {{\mathbb{R}}}^{Ch\times \frac{W}{R}\times \frac{H}{R}\times \frac{D}{R}}\) and \(\tilde{M}\). M and \(\tilde{M}\) are used to generate: (1) the output heat map \(\hat{Y}\) and its augmented pair \(\tilde{Y}\), and (2) the projected feature maps F and \(\tilde{F}\).

The center localization module is a 3D convolutional layer with kernel size 1 × 1 × 1, whose output is the heat map \(\hat{Y}\). For the input tomogram T and its output heat map \(\hat{Y}\), protein localization can be viewed as a per-voxel classification problem such that each voxel vi,j,k at position (i, j, k) is the input and the corresponding \({\hat{y}}_{i,\,j,k}\in [0,1]\) is the classification output. If we denote p(v) as the underlying data distribution from which vi,j,k is sampled, p(v) can be decomposed as in equation (4):$$p(v)={\pi }_{p}{p}_{p}(v|\,y=1)+{\pi }_{n}{p}_{n}(v|\,y=0),$$
(4)
where pp(v∣y = 1) is the positive class conditional probability of protein voxels, pn(v∣y = 0) is the negative class conditional probability of background voxels, and πp and πn are the class prior probabilities. Subscripts p, n and u denote positive, negative and unlabeled, respectively. Let \(g:{{\mathbb{R}}}^{d}\to {\mathbb{R}}\) denote an arbitrary classifier that can be parameterized by a neural network, with \(l(\,g(v)={\hat{y}},y)\) the loss between the model output \(\hat{y}\) and the ground truth y. When all the voxels are labeled, this is a binary classification problem that can be optimized using a standard positive-negative (PN) learning approach with the risk minimization given by equation (5):$${\tilde{R}}_{pn}={\pi }_{p}{\tilde{R}}_{p}^{+}(g)+{\pi }_{n}{\tilde{R}}_{n}^{-}(g),$$
(5)
where \({\tilde{R}}_{p}^{+}(g)\) is the mean positive loss \({{\mathbb{E}}}_{v \sim {p}_{p}(v)}[l(\,g({v}^{p}),y=1)]\) and can be estimated as \(1/{n}_{p}\mathop{\sum }\nolimits_{i = 1}^{{n}_{p}}l(\,{\hat{y}}_{p\!\,}^{i},1)\), \({\tilde{R}}_{n}^{-}(\,g)\) is the mean negative loss \({{\mathbb{E}}}_{v \sim {p}_{n}(v)}[l(g({v}^{n}),y=0)]\) and can be estimated as \(1/{n}_{n}\mathop{\sum }\nolimits_{i = 1}^{{n}_{n}}l(\,{\hat{y}}_{n\!\,}^{i},0)\), and np and nn are the number of positive and negative voxels. When only a few positive voxels are labeled and the remainder of the data are unlabeled, we reformulate the problem into the PU setting: the positively labeled voxels are sampled from pp(v∣y = 1), and the remaining unlabeled voxels are sampled from p(v). As shown in ref. 61, by rearranging equations (4) and (5), we obtain πnpn(v) = p(v) − πppp(v) and \({\pi }_{n}{\tilde{R}}_{n}^{-}(\,g)={\tilde{R}}_{u}^{-}(\,g)-{\pi }_{p}{\tilde{R}}_{p}^{-}(\,g\,)\). Therefore, we rewrite the risk minimization as in equation (6):$${R}_{pu}={\pi }_{p}{\tilde{R}}_{p}^{+}(\,g)-{\pi }_{p}{\tilde{R}}_{p}^{-}(\,g)+{\tilde{R}}_{u}^{-}(\,g),$$
(6)
with \({\tilde{R}}_{u}^{-}(\,g)={{\mathbb{E}}}_{v \sim p(v)}[l(\,g(v),y=0)]\) and \({\tilde{R}}_{p}^{-}(\,g)={{\mathbb{E}}}_{v \sim {p}_{p}(v)}[l(\,g(v),y=0)]\). To prevent overfitting in equation (6), we used non-negative risk estimation as in equation (7)59:$${\tilde{R}}_{pu}={\pi }_{p}{\tilde{R}}_{p}^{+}(g)+\max \{0,{\tilde{R}}_{u}^{-}(g)-{\pi }_{p}{\tilde{R}}_{p}^{-}(g)\}.$$
(7)
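Equation (7) translates directly into a non-negative PU risk estimator; a sketch (argument names are ours):

```python
import torch

def nonnegative_pu_risk(loss_pos, loss_unl_as_neg, loss_pos_as_neg, pi_p):
    """Equation (7). loss_pos is the mean positive loss over labeled
    voxels; loss_unl_as_neg / loss_pos_as_neg are the mean losses of
    unlabeled / labeled-positive voxels when treated as negatives."""
    neg_risk = loss_unl_as_neg - pi_p * loss_pos_as_neg
    return pi_p * loss_pos + torch.clamp(neg_risk, min=0.0)
```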
For globular-shaped particle detection, as the ground-truth heat map is splatted with Gaussian kernels, the labels are not strictly binary. Positive labels are split into two groups: true positives (tp) where yi,j,k = 1, which is the center of each Gaussian kernel (protein center), and soft positives (sp) where 0 < yi,j,k < 1 (voxels that are close to the center). Unlabeled voxels are labeled as −1. With this, the positive distribution pp(v) and positive-associated losses \({\tilde{R}}_{p}^{+}(\,g)\), \({\tilde{R}}_{p}^{-}(\,g)\) are decomposed into the following equation (8):$$\begin{array}{c}{p}_{p}(v)={\pi }_{tp}{p}_{tp}(v|\,y=1)+{\pi }_{sp}{p}_{sp}(v| 0 < y < 1),\\{\tilde{R}}_{p}^{+}(\,g)={\pi }_{tp}{\tilde{R}}_{tp}^{+}(\,g)+{\pi }_{sp}{\tilde{R}}_{sp}^{+}(\,g),{\tilde{R}}_{p}^{-}(\,g)={\pi }_{tp}{\tilde{R}}_{tp}^{-}(\,g)+{\pi }_{sp}{\tilde{R}}_{sp}^{-}(\,g).\end{array}$$
(8)
We adopt voxel-wise logistic regression with focal loss for l(g(v), y). Specifically, the individual risk terms take the form shown in equation (9):$$\begin{array}{c}{\tilde{R}}_{tp}^{+}(\,g)={(1-{\hat{y}}_{ijk})}^{\alpha }\log (\;{\hat{y}}_{ijk}),{\tilde{R}}_{sp}^{+}(\,g)={(1-{y}_{ijk})}^{\beta }{(\;{\hat{y}}_{ijk})}^{\alpha }\log (1-{\hat{y}}_{ijk}),\\ {\tilde{R}}_{tp}^{-}(\,g)={\hat{y}}_{ijk}^{\alpha }\log (1-{\hat{y}}_{ijk}),{\tilde{R}}_{sp}^{-}(g)={(\;{y}_{ijk})}^{\beta }{(1-{\hat{y}}_{ijk})}^{\alpha }\log (\;{\hat{y}}_{ijk}),\\ {\tilde{R}}_{u}^{-}(\,g)={(\;{\hat{y}}_{ijk})}^{\alpha }\log (1-{\hat{y}}_{ijk}).\end{array}$$
(9)
By combining equations (7), (8) and (9), we obtain the final minimization objective as shown in equation (10):$$\begin{array}{l}{\tilde{R}}_{pu}={\pi }_{p}({\pi }_{tp}{\tilde{R}}_{tp}^{+}(\,g)+{\pi }_{sp}{\tilde{R}}_{sp}^{+}(\,g))+ \\ \max \{0,{\tilde{R}}_{u}^{-}(\,g)-{\pi }_{p}({\pi }_{tp}{\tilde{R}}_{tp}^{-}(\,g)+{\pi }_{sp}{\tilde{R}}_{sp}^{-}(\,g))\}.\end{array}$$
(10)
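Combining equations (7)-(10) gives the PU focal loss. The sketch below flips the signs of the equation (9) terms so that each is a non-negative loss to minimize; the focal parameters α = 2 and β = 4 are assumptions borrowed from the CenterNet convention, and each training crop is assumed to contain at least one annotated positive.

```python
import torch

def pu_focal_loss(y_hat, Y, pi_p, pi_tp, pi_sp, alpha=2.0, beta=4.0):
    """Equation (10). Y marks true positives (1), soft positives (0..1)
    and unlabeled voxels (-1); y_hat is the predicted heat map."""
    y_hat = y_hat.clamp(1e-6, 1 - 1e-6)
    tp, sp, unl = Y == 1, (Y > 0) & (Y < 1), Y == -1

    r_tp_pos = -((1 - y_hat[tp]) ** alpha * torch.log(y_hat[tp])).mean()
    r_sp_pos = -((1 - Y[sp]) ** beta * y_hat[sp] ** alpha
                 * torch.log(1 - y_hat[sp])).mean()
    r_tp_neg = -(y_hat[tp] ** alpha * torch.log(1 - y_hat[tp])).mean()
    r_sp_neg = -(Y[sp] ** beta * (1 - y_hat[sp]) ** alpha
                 * torch.log(y_hat[sp])).mean()
    r_unl = -(y_hat[unl] ** alpha * torch.log(1 - y_hat[unl])).mean()

    pos = pi_p * (pi_tp * r_tp_pos + pi_sp * r_sp_pos)
    neg = r_unl - pi_p * (pi_tp * r_tp_neg + pi_sp * r_sp_neg)
    return pos + torch.clamp(neg, min=0.0)   # non-negative estimator, eq. (7)
```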
For tubular-shaped macromolecule detection, there are only positive and unlabeled voxels. Therefore, the only remaining terms are \({\tilde{R}}_{tp}^{+}(\,g),{\tilde{R}}_{tp}^{-}(\,g),{\tilde{R}}_{u}^{-}(\,g)\), and the final loss is equivalent to equation (7).

Debiased contrastive regularization and final training loss

As suggested in ref. 37, instead of using M directly, it is beneficial to map the representations to a new space through a projection head, where a contrastive loss is applied. This contrastive regularization head is a 3D convolutional layer with kernel size 1 × 1 × 1, whose outputs are the projected feature maps F and \(\tilde{F}\). Denote \({f}_{i,\,j,k}\in {{\mathbb{R}}}^{Ch}\) and \({\tilde{f}}_{i,\,j,k}\) as the feature vectors at position (i, j, k) of the feature map F and its augmented counterpart. There is a total of \(\frac{W}{R}\times \frac{H}{R}\times \frac{D}{R}=N\) such vectors. Each of these feature vectors is responsible for predicting \({\hat{y}}_{i,\,j,k}\). If yi,j,k = 1, mi,j,k and its projection fi,j,k should encode particle-related features. For a partially annotated T, the voxel-level feature vector set f can be separated into positive and unlabeled classes. Therefore, the voxel-level contrastive regularization loss is composed of a positive supervised and an unlabeled self-supervised debiased contrastive term. For positive supervised debiased contrastive regularization, denote \({{\mathcal{F}}}^{p}=\{\,{f}_{i,\,j,k}^{\;p}:{y}_{i,\,j,k}=1\}\) as the set of positive feature vectors obtained from np annotated proteins, \({{\tilde{{\mathcal{F}}}\,}^{\,p}}=\{{\tilde{f}}_{i,\,j,k\,}^{p}:{\tilde{y}}_{i,\,j,k}=1\}\) as its augmented counterpart, and \({{\mathcal{F}}}^{\;u}=\{{f}_{i,\,j,k}^{u}:{y}_{i,\,j,k} < 1\}\) as the set of unlabeled (including soft positive) feature vectors, with nn vectors in total. Because each tomogram contains up to a few hundred sub-volumes of the protein of interest, and each of these sub-volumes is the same protein in a different relative orientation and distorted in a different way, for a feature vector \({f}_{i}^{\;p}\in {{\mathcal{F}}}^{p}\), the remaining 2np − 1 feature vectors \({f}_{j}^{\;p},j=1,\ldots ,2{n}_{p}-1\) in \({\mathcal{F}}^{\;p}\) and \({\tilde{{\mathcal{F}}}}^{\,p}\) can be treated as naturally augmented pairs. Unlabeled feature vectors \({f}_{k}^{\;u}\in {{\mathcal{F}}}^{u},k=1,\ldots ,2{n}_{n}\), which include the augmented unlabeled features, are treated as negatives. However, because the unlabeled set \({{\mathcal{F}}}^{u}\) can contain positive feature vectors, the naive supervised contrastive loss proposed in ref. 62 would be biased. Therefore, we adopt a modified debiased supervised contrastive loss based on equation (11)63:$${{\mathcal{L}}}_{\rm{sup}}^{db}={\mathbb{E}}\left[-\log \left[\frac{1/(2{n}_{p}-1)\mathop{\sum }\nolimits_{j = 1}^{2{n}_{p}-1}{e}^{\;{f}_{i}^{\,pT}{f}_{j}^{\,p}}}{1/(2{n}_{p}-1)\mathop{\sum }\nolimits_{j = 1}^{2{n}_{p}-1}{e}^{\;{f}_{i}^{\,pT}{f}_{j}^{\,p}}+{g}_{\rm{sup}}(\;{f}_{i}^{\,p},{\{\;{f}_{j}^{\,p}\}}_{j = 1}^{2{n}_{p}-1},{\{\;{f}_{k}^{\,u}\}}_{k = 1}^{2{n}_{n}})}\right]\right],$$
(11)
where the second term in the denominator is given by equation (12):$${g}_{\rm{sup}}(\cdot )=\max \left\{\frac{1}{{\pi }_{n}}\left(\frac{1}{2{n}_{n}}\mathop{\sum }\limits_{k=1}^{2{n}_{n}}{e}^{\,{f}_{i}^{\,pT}{f}_{k}^{\,u}}-{\pi }_{p}\frac{1}{2{n}_{p}-1}\mathop{\sum }\limits_{j=1}^{2{n}_{p}-1}{e}^{\;{f}_{i}^{\,pT}{f}_{j}^{\,p}}\right),{e}^{-1/t}\right\},$$
(12)
with πn and πp being the class prior probabilities and t the temperature.

Regarding the unlabeled self-supervised debiased contrastive regularization, for the unlabeled feature vector \({f}_{k}^{\,u}\), the only known positive is its augmented pair \({\tilde{f}}_{k}^{\,u}\), and the remaining vectors are treated as negatives. Denote \({\{\,{f}_{l}^{\,r}\}}_{l = 1}^{2N-2}\) as the set of remaining vectors. The resulting contrastive loss for an unlabeled feature vector is given by equation (13):$${{\mathcal{L}}}_{\rm{unsup}}={\mathbb{E}}\left[-\log \left[\frac{{e}^{\,{f}_{k}^{\,uT}{\tilde{f}}_{k}^{\,u}}}{{e}^{\,{f}_{k}^{\,uT}{\tilde{f}}_{k}^{\,u}}+{g}_{\rm{unsup}}(\;{f}_{k}^{\,u},{\tilde{f}}_{k}^{\,u},{\{\;{f}_{l}^{\,r}\}}_{l = 1}^{2N-2})}\right]\right].$$
(13)
Similarly to equation (12), gunsup( ⋅ ) involves class prior probabilities. However, the actual class of the unlabeled feature vectors is unknown. Therefore, we divide the unlabeled voxels into three groups: pseudo-positive, pseudo-negative and neutral. Pseudo-positives are voxels with predicted particle probability greater than 0.9; pseudo-negatives are voxels with predicted particle probability smaller than 0.3; the remaining voxels are neutral. For the pseudo-positive loss \({{\mathcal{L}}}_{unsup}^{p}\), \({g}_{unsup}^{\,p}\) in the denominator is expressed as shown in equation (14):$${g}_{\rm{unsup}}^{\,p}(\cdot )=\max \left\{\frac{1}{{\pi }_{n}}\left(\frac{1}{2N-2}\mathop{\sum }\limits_{l=1}^{2N-2}{e}^{\,{f}_{k}^{\,uT}{f}_{l}^{\,r}}-{\pi }_{n}{e}^{\,{f}_{k}^{\,uT}{\tilde{f}}_{k}^{\,u}}\right),{e}^{-1/t}\right\},$$
(14)
which uses the negative class prior probability πn. For the pseudo-negative loss \({{\mathcal{L}}}_{unsup}^{n}\), \({g}_{unsup}^{\,n}\) in the denominator is expressed as shown in equation (15):$${g}_{\rm{unsup}}^{\,n}(\cdot )=\max \left\{\frac{1}{{\pi }_{p}}\left(\frac{1}{2N-2}\mathop{\sum }\limits_{l=1}^{2N-2}{e}^{\,{f}_{k}^{\,uT}{f}_{l}^{\,r}}-{\pi }_{p}{e}^{\,{f}_{k}^{\,uT}{\tilde{f}}_{k}^{\,u}}\right),{e}^{-1/t}\right\},$$
(15)
which uses the positive class prior probability πp. The loss for neutral voxels is calculated as a weighted average based on the predicted probability of each feature vector belonging to the positive class, as given by equation (16):$${{\mathcal{L}}}_{\rm{unsup}}^{db}=\hat{Y}{{\mathcal{L}}}_{\rm{unsup}}^{p}+(1-\hat{Y}){{\mathcal{L}}}_{\rm{unsup}}^{n}.$$
(16)
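The two debiased contrastive terms of equations (11)-(16) could be sketched as follows. Feature vectors are assumed L2-normalized, a temperature t is assumed inside the exponentials (consistent with the e^{-1/t} floor), and the pseudo-positive/pseudo-negative/neutral split is folded into the Ŷ-weighted blend of equation (16) for brevity.

```python
import math
import torch

def debiased_sup_contrastive(f_p, f_u, pi_p, pi_n, t=0.5):
    """Equations (11)-(12): f_p is (2*n_p, Ch) positive features from
    both views; f_u is (2*n_n, Ch) unlabeled features from both views."""
    m = f_p.shape[0]
    sim_pp = torch.exp(f_p @ f_p.T / t)
    sim_pu = torch.exp(f_p @ f_u.T / t)
    off_diag = ~torch.eye(m, dtype=torch.bool, device=f_p.device)
    pos_mean = sim_pp[off_diag].view(m, m - 1).mean(1)   # over 2*n_p - 1
    g_sup = torch.clamp((sim_pu.mean(1) - pi_p * pos_mean) / pi_n,
                        min=math.exp(-1.0 / t))          # equation (12)
    return -torch.log(pos_mean / (pos_mean + g_sup)).mean()

def debiased_unsup_contrastive(f_u, f_u_aug, f_rest, y_hat, pi_p, pi_n, t=0.5):
    """Equations (13)-(16): f_rest holds, per anchor, the 2N-2 remaining
    feature vectors; y_hat is the predicted particle probability."""
    pos = torch.exp((f_u * f_u_aug).sum(-1) / t)
    rest = torch.exp(torch.einsum("nc,nmc->nm", f_u, f_rest) / t).mean(-1)
    floor = math.exp(-1.0 / t)
    g_p = torch.clamp((rest - pi_n * pos) / pi_n, min=floor)  # eq. (14)
    g_n = torch.clamp((rest - pi_p * pos) / pi_p, min=floor)  # eq. (15)
    loss_p = -torch.log(pos / (pos + g_p))
    loss_n = -torch.log(pos / (pos + g_n))
    return (y_hat * loss_p + (1 - y_hat) * loss_n).mean()     # eq. (16)
```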
We added a consistency regularization loss using the mean-squared error (MSE) between the output heat map \(\hat{Y}\) and its augmented version \(\tilde{Y}\) such that the probability of a voxel containing a protein should be invariant to augmentations as shown in equation (17):$${{\mathcal{L}}}_{\rm{cons}}={\rm{MSE}}(\hat{Y},\tilde{Y}).$$
(17)
The final training objective is shown in equation (18):$${\mathcal{L}}={\tilde{R}}_{pu}+{\lambda }_{1}({{\mathcal{L}}}_{\rm{sup}}^{db}+{\lambda }_{2}{{\mathcal{L}}}_{\rm{unsup}}^{db})+{\lambda }_{3}{{\mathcal{L}}}_{\rm{cons}},$$
(18)
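Assembling the terms, equations (17) and (18) amount to the following sketch; the λ defaults are the experimentally determined values reported below.

```python
import torch.nn.functional as F

def total_loss(r_pu, l_sup_db, l_unsup_db, y_hat, y_hat_aug,
               lam1=0.1, lam2=0.5, lam3=0.1):
    """Equation (18); the consistency term of equation (17) is the MSE
    between the heat maps predicted for the two augmented inputs."""
    l_cons = F.mse_loss(y_hat, y_hat_aug)
    return r_pu + lam1 * (l_sup_db + lam2 * l_unsup_db) + lam3 * l_cons
```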
In equation (18), λ1 is the weight of the total contrastive module, λ2 is the weight of the unsupervised contrastive loss, and λ3 is the weight of the consistency regularization. The resulting loss serves two purposes: (1) the contrastive term maximizes similarities for encoded features belonging to the same group (particle or background) and minimizes such similarities if the features are from different groups; and (2) the heat map loss term forces predicted particle probabilities to be higher closer to the true center location. To remove duplicate predictions, NMS is applied to the predicted heat map using 3D maximum pooling with a user-defined kernel size. For tubular-shaped macromolecules, we fixed the kernel size to 3, as more coordinates need to be extracted for the polynomial fitting step.

Post-processing for tubular-shaped macromolecules

We apply a post-processing step specifically designed to detect tubular-shaped macromolecules. The workflow consists of three main stages: (1) grouping of extracted coordinates, (2) second-order polynomial fitting, and (3) resampling based on the fitted polynomial. In the grouping stage, we construct an undirected graph from the extracted coordinates, where each vertex represents a coordinate and an edge exists between two vertices if their Euclidean distance falls within a user-defined cutoff. Connected components in the graph correspond to groups of coordinates. For each connected component, we perform second-order polynomial fitting separately for the xy and yz dimensions as given by equation (19):$$\begin{array}{l}\hat{x}={a}_{0}+{b}_{0}y+{c}_{0}{y}^{2},\\ \hat{z}={a}_{1}+{b}_{1}y+{c}_{1}{y}^{2}.\end{array}$$
(19)
The quality of the fit is assessed by measuring the residual between the fitted points and the actual values. Additionally, we calculate the maximum curvature of the fitted polynomial using equation (20), where b and c denote the linear and quadratic coefficients of the fit:$$k=\max \left(\frac{2c}{{\left(1+{\left(2cy+b\right)}^{2}\right)}^{\frac{3}{2}}}\right).$$
(20)
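For one connected component of extracted coordinates, the fitting, quality checks and resampling of equations (19) and (20) could be sketched as follows; evaluating the curvature on the xy fit only, and the exact cutoff semantics, are assumptions.

```python
import numpy as np

def fit_check_resample(coords, res_cutoff, curv_cutoff, step=2.0):
    """Fit x(y) and z(y) with second-order polynomials (equation (19)),
    reject fits with large residual or maximum curvature (equation (20)),
    then resample uniformly along y."""
    x, y, z = coords[:, 0], coords[:, 1], coords[:, 2]
    cx = np.polyfit(y, x, 2)   # cx = [c, b, a] for x = a + b*y + c*y^2
    cz = np.polyfit(y, z, 2)
    residual = np.mean((np.polyval(cx, y) - x) ** 2
                       + (np.polyval(cz, y) - z) ** 2)
    c, b = cx[0], cx[1]
    curvature = np.max(np.abs(2 * c) / (1 + (2 * c * y + b) ** 2) ** 1.5)
    if residual > res_cutoff or curvature > curv_cutoff:
        return None            # not tubular enough: reject this component
    y_new = np.arange(y.min(), y.max(), step)
    return np.stack([np.polyval(cx, y_new), y_new,
                     np.polyval(cz, y_new)], axis=1)
```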
If the fitted residual and maximum curvature are both below user-defined cutoff values, we consider the set of coordinates a candidate tubular-shaped macromolecule. Finally, we resample the coordinates based on the fitted polynomial. This involves determining the minimum and maximum values of y among the extracted coordinates and uniformly sampling \(\hat{y}\) values with a step size of 2. The corresponding \(\hat{x}\) and \(\hat{z}\) values are then calculated using the fitted polynomials in equation (19). The resulting resampled 3D coordinates \((\hat{x},\hat{y},\hat{z})\) are used as the final coordinates identifying the position of tubular-shaped macromolecules.

Comparison with state-of-the-art approaches

To compare the performance of MiLoPYP against template matching and crYOLO-3D, we used tomograms from EMPIAR-10304 and EMPIAR-10499. For EMPIAR-10304, we used EMAN2's implementation of template matching (e2spt_tempmatch.py)18. For EMPIAR-10499, we manually picked particles and used them to generate an initial reference (pixel size = 10 Å, box size = 34). This reference was low-pass filtered to 40.0 Å and used for template matching as implemented in Warp64. Particles were selected at positions with _rlnAutopickFigureOfMerit values greater than 13. For crYOLO-3D, we retrained the model on each dataset using manually picked particles across ten consecutive slices (selecting 20–30% of particles) and using tomograms with different defocus values. Our results show that MiLoPYP has higher accuracy (F1 scores) when compared to template matching and crYOLO-3D (Extended Data Fig. 3).

We also compared the performance of MiLoPYP against three other deep-learning-based approaches: DeepFinder31, DeePiCt32 and TomoTwin34. To ensure a thorough and meaningful comparison, we used tomograms from four different datasets representing synthetic data (SHREC-21) and real data (EMPIAR-10304, EMPIAR-10453 and EMPIAR-10987; Supplementary Table 1 and Extended Data Figs. 9 and 10). When running TomoTwin, we used the latest pretrained model weights and followed the instructions from the reference-based particle picking tutorial. For DeepFinder, F1 scores for the SHREC-21 dataset were obtained from the SHREC-21 challenge results; for the EMPIAR datasets, we used the pretrained model weights (as the model was trained to detect four classes, we only report F1 scores corresponding to the ribosome class). For DeePiCt, we used the pretrained model to detect ribosomes from EMPIAR-10304 and EMPIAR-10987. As DeePiCt was only trained to detect ribosomes, membranes, microtubules and fatty acid synthase, we could not apply this method to the SHREC-21 dataset (F1 scores reported as 'not applicable'). Also, because neither DeepFinder nor DeePiCt was trained to detect native viral spikes, we did not obtain meaningful results when analyzing tomograms from EMPIAR-10453 with these methods (F1 scores reported as 'not applicable'). As expected, our experiments show that on the synthetic SHREC-21 dataset the fully supervised method DeepFinder performs best on average (as more data are available for training compared to semi-supervised methods), while MiLoPYP and TomoTwin outperform DeepFinder and DeePiCt on the three EMPIAR datasets, with MiLoPYP outperforming TomoTwin by 7% (F1 scores).

Model training, inference, evaluation and ablation studies

During training, instead of using the whole tomogram, we cropped subtomograms of size 64 × 64 × 5 as input to the network, in batches of 2.
Training time is thus independent of the input size. Inference is performed on the entire tomogram. The proposed framework is trained end to end using the Adam optimizer with default parameter values and an initial learning rate of 0.001, which is decreased by a factor of 10 every 5 epochs. Training takes around 3 to 5 minutes for 10 epochs, and inference on each full tomogram takes less than a second on an NVIDIA Tesla V100 GPU. We used experimentally determined loss weights: λ1 = 0.1, λ2 = 0.5 and λ3 = 0.1.

For datasets where manual labels were available, we evaluated the quality of detection in terms of precision and recall, following the same definitions of these metrics as in ref. 27. To account for small variations in the detected particle centers, instead of requiring an exact single-voxel match, we consider a detected particle position a true positive if it falls within a certain radius of a ground-truth particle position. Let TP(k) be the number of true positives among the top k predictions. Precision is defined as the fraction of predictions that are correct, and recall as the fraction of labeled particles retrieved in the top k predictions. With this, precision and recall are calculated as shown in equation (21):$$\begin{array}{c}{\rm{TP}}(k)=\mathop{\sum }\limits_{i}^{k}{y}_{i},\\ {\rm{Pr}}(k)=\frac{{\rm{TP}}(k)}{k},\\ {\rm{Re}}(k)=\frac{{\rm{TP}}(k)}{\mathop{\sum }\nolimits_{i}^{n}{y}_{i}}.\end{array}$$
(21)
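A sketch of this evaluation, assuming predictions are sorted by decreasing score and that each ground-truth particle can be matched at most once:

```python
import numpy as np

def precision_recall_at_k(pred, gt, radius, k):
    """Equation (21): a prediction is a true positive if it lies within
    `radius` of a not-yet-matched ground-truth particle center."""
    matched = np.zeros(len(gt), dtype=bool)
    tp = 0
    for p in pred[:k]:
        d = np.linalg.norm(gt - p, axis=1)
        d[matched] = np.inf            # each ground-truth particle used once
        j = int(np.argmin(d))
        if d[j] <= radius:
            matched[j] = True
            tp += 1
    return tp / k, tp / len(gt)        # precision, recall
```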
To assess the effectiveness of the proposed contrastive learning regularization and positive unlabeled formulation, we conducted ablation studies on the voxel-level contrastive learning module and the positive unlabeled learning component using the EMPIAR-10304 (Extended Data Fig. 3a) and EMPIAR-10499 (Extended Data Fig. 3b) datasets. In the first ablation study, we removed the voxel-level contrastive module and the corresponding term in the loss function. As a result, we observed a degradation in performance on both datasets, with a more substantial drop on the more complex in situ dataset EMPIAR-10499. This demonstrates that the inclusion of contrastive regularization enhances the feature learning process and facilitates the detection of particles even when limited training samples are available. For the second ablation study, we removed the positive unlabeled module by treating all unlabeled regions as negatives and using a standard focal loss, while keeping the contrastive module unchanged. Again, we observed a substantial deterioration in performance (Extended Data Fig. 3), highlighting the importance of debiasing in scenarios where annotated data are scarce. Together, these experiments establish the substantial contributions of the voxel-level contrastive learning and positive unlabeled learning components, emphasizing their effectiveness in enhancing feature learning, mitigating biases and enabling robust particle detection, particularly when confronted with limited annotated data.

Contribution of localization module to detection accuracy

To assess the performance improvement enabled by the protein-specific particle localization module, we compared the precision, recall and F1 scores of particles detected using the exploration module alone with those of particles detected after training of the localization module. Results on tilt series and tomograms from EMPIAR-10304, EMPIAR-10499, EMPIAR-10453 and EMPIAR-10987 show substantial gains in all accuracy metrics when adding the localization module (Supplementary Table 2).

Dataset description and SPT details

Tilt series from EMPIAR-10304 were downloaded from the EMPIAR database and used to reconstruct tomograms at a pixel size of 25.2 Å in nextPYP45. These tomograms were used as input to MiLoPYP's exploration and refinement modules. Evaluation of the refinement module on 11 tomograms produced a total of 8,965 particles that were subjected to SPT analysis in nextPYP. Particles were extracted using a binning factor of 4 and subjected to reference-based alignment using a map downloaded from EMD-23358 and a maximum resolution for refinement of 22 Å, followed by particle filtering according to the particle score distribution. A subset of 6,051 good-quality particles was kept for further processing. Particles were re-extracted without binning, and constrained particle and micrograph alignment using a shape mask was performed until no further improvements in resolution were observed. After per-particle contrast transfer function (CTF) refinement and an additional round of constrained particle and micrograph refinement, we obtained a final reconstruction of the 70S ribosome at 5.0 Å resolution (Fig. 2c and Supplementary Fig. 1a).

For EMPIAR-11694, a total of 36,345 particles corresponding to RuBisCo enzymes were identified using a pixel size of 13.68 Å. Reference-based refinement in nextPYP gave an 11.0 Å resolution map from 35,352 particles (Extended Data Fig. 5b and Supplementary Fig. 1b).
D4 symmetry was applied to the final reconstruction.

For EMPIAR-10499, a total of 23,285 particles were identified from 65 tomograms reconstructed using a pixel size of 13.6 Å. A procedure similar to the one used for EMPIAR-10304 was conducted in nextPYP using a map downloaded from EMD-11650 as a reference, resulting in a 5.4 Å resolution reconstruction of the 70S ribosome from 17,381 particles (Fig. 3c and Supplementary Fig. 1c).

For EMPIAR-10453, a total of 23,388 particles were identified from 266 tomograms reconstructed using a pixel size of 15.95 Å. All particles were used for reference-based refinement in nextPYP using a map downloaded from EMD-11347 and a maximum resolution during refinement of 20 Å. Constrained 3D classification without alignment was performed to remove particles containing projections that were either too close to each other or too close to gold beads. A total of 13,047 particles were kept and further refined using constrained refinement with a shape mask. 3D classification without alignment using three classes and a focus mask covering the receptor-binding domain (RBD) was then performed. Class 1 had only a few particles resembling gold beads and was discarded. Class 2, with 9,194 particles, showed an open spike conformation with one RBD pointing upwards, and class 3, with 3,740 particles, showed a closed spike with the three RBDs pointing down. Further constrained refinement of these two classes yielded final reconstructions of the SARS-CoV-2 spike at 5.6 Å resolution for the open state (no symmetry applied) and 9.3 Å resolution for the closed state (C3 symmetry applied; Fig. 4c and Supplementary Fig. 1d,e).

For EMPIAR-10987, a total of 6,068 particles corresponding to 80S ribosomes were identified from 21 tomograms reconstructed using a pixel size of 15.95 Å. A processing procedure similar to that used for EMPIAR-10304 was followed in nextPYP using a map downloaded from EMD-33118 as a reference, yielding an 8.6 Å resolution map from 4,570 particles (Fig. 5c and Supplementary Fig. 1f). From the same set of 21 tomograms, a total of 1,761 particles corresponding to microtubules were identified and subjected to 3D refinement in nextPYP. All particles were used for reference-based refinement using a map downloaded from EMD-10061 as a reference, resulting in a 37.0 Å map (Fig. 5d and Supplementary Fig. 1g). No constrained 3D classification was conducted because of the limited number of particles. Helical symmetry was applied to the final map.

For EMPIAR-11658, a total of 4,105 particles corresponding to ATP synthase were identified from 48 tomograms, containing mitochondrial membranes, reconstructed using a pixel size of 15.68 Å. A processing procedure similar to that used for EMPIAR-10987 was followed in nextPYP using a map downloaded from EMD-28809 as a reference, yielding a 13.0 Å resolution map from 2,577 particles (Fig. 6c and Supplementary Fig. 1h).

In all cases, map resolution was measured using the 0.143 cutoff of the Fourier shell correlation curves between half-maps. Additional data processing statistics and timing information are included in Supplementary Table 3.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
