Differentially localized protein identification for breast cancer based on deep learning in immunohistochemical images

Based on a deep learning model, the features of breast IHC images were analyzed and protein subcellular location prediction models were constructed to identify differentially localized proteins in breast cancer. Their potential mechanisms and relationship with breast cancer were then analyzed. The detailed steps are shown in the flowchart (Fig. 8).

Fig. 8: Flowchart. I. Construct and train the model based on the deep learning framework. II. Identification of differentially localized proteins in breast cancer. III. Verification of the relationship between differentially localized proteins and breast cancer and classification performance analysis of the proteins. IV. The relationship between localization changes and breast cancer was analyzed through potential mechanisms analysis.

Data

The IHC images of breast tissue were downloaded from the Human Protein Atlas (HPA) database. The XML files of breast tissue images were obtained from the HPA with the hpaXmlGet function in the R package “HPAanalyze”. Only images with a strong or moderate staining intensity and a staining quantity above 75% were retained and downloaded for subsequent analysis.

The subcellular localization information under normal conditions was downloaded from the HPA database. According to the hierarchical structure of organelles and the number of proteins in each category, all subcellular location labels in the data were grouped into the following seven categories: Cytoplasm, Endoplasmic reticulum, Golgi apparatus, Mitochondria, Plasma membrane, Nuclear, Vesicles. To ensure the reliability of the labels, only images whose label confidence level was ‘enhanced’ in the cell atlas were retained. Subsequently, the images of proteins with both normal and disease images were extracted. Because a prediction based on a single image is neither accurate nor reliable, proteins with only one image in the screening results were removed.
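The screening rules above can be illustrated with a minimal Python sketch. The record fields, the helper `record`, and the example proteins are hypothetical and do not mirror the HPA XML schema. The sketch also assumes that the single-image rule is applied per condition (`min_images=2` for both normal and cancer images), which the text does not specify.

```python
from collections import defaultdict


def record(protein, condition, intensity="strong", quantity=">75%",
           reliability="enhanced"):
    """Build one hypothetical image record (field names are illustrative)."""
    return {"protein": protein, "condition": condition, "intensity": intensity,
            "quantity": quantity, "reliability": reliability}


# Toy data: only P1 satisfies every screening rule.
IMAGES = (
    [record("P1", "normal")] * 2
    + [record("P1", "cancer", intensity="moderate")] * 2
    + [record("P2", "normal")] * 2 + [record("P2", "cancer")]
    + [record("P3", "normal", intensity="weak")] + [record("P3", "cancer")] * 2
    + [record("P4", "normal", reliability="supported"),
       record("P4", "cancer", reliability="supported")]
)


def screen_images(records, min_images=2):
    """Return {protein: [records]} for proteins that pass all screening rules."""
    # Rule 1: strong/moderate staining, quantity >75%, 'enhanced' label reliability.
    passed = [
        r for r in records
        if r["intensity"] in {"strong", "moderate"}
        and r["quantity"] == ">75%"
        and r["reliability"] == "enhanced"
    ]
    grouped = defaultdict(lambda: defaultdict(list))
    for r in passed:
        grouped[r["protein"]][r["condition"]].append(r)

    kept = {}
    for protein, by_condition in grouped.items():
        # Rule 2: both conditions present, each with at least `min_images`
        # images (the per-condition reading is an assumption of this sketch).
        counts = [len(by_condition.get(c, [])) for c in ("normal", "cancer")]
        if min(counts) >= min_images:
            kept[protein] = by_condition["normal"] + by_condition["cancer"]
    return kept


screened = screen_images(IMAGES)
print(sorted(screened), sum(len(v) for v in screened.values()))
```

Exposing `min_images` as a parameter makes it easy to test how sensitive the final protein count is to this reading of the rule.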
Finally, 3375 images of 332 proteins were obtained (Table 2).

Table 2 Label distribution in the dataset

The protein expression data and corresponding clinical information for Breast Invasive Carcinoma and paracancerous tissue were downloaded from the Clinical Proteomic Tumor Analysis Consortium (CPTAC; https://proteomic.datacommons.cancer.gov/pdc/; accessed on 7 December 2022) database for subsequent analysis. The data were preprocessed by removing proteins with more than 30% missing values and imputing the missing values of the remaining proteins with the K-Nearest Neighbor (KNN) method. Two sets of protein expression profiles were downloaded and organized: PDC000120 and PDC000173. The former contains expression data for 9124 proteins from 143 samples, and the latter for 8790 proteins from 108 samples (Supplementary Table 4).

Construction of protein localization prediction models

Based on the deep learning framework, protein subcellular localization prediction models for breast cancer were built using features extracted from IHC images10 (Supplementary Fig. 3).

The ResNet_18 model was adopted to extract features from IHC images. ResNet_18 is a classic and effective deep convolutional neural network with strong feature extraction and classification capabilities, and it is widely applied to computer vision tasks such as image classification and object detection. The ResNet_18 model has four stages, each with 2 BasicBlocks. The stages extract features of different dimensions, namely 64, 128, 256, and 512. Features of different dimensions describe different image information. According to previous studies on interpreting the feature maps of CNNs, low-level features describe detailed information, e.g., textures, colors and edges, while high-level features, which are more abstract, capture more position-independent semantic information45.
To investigate the impact of features of different dimensions, the four types of features were used to construct models separately and their performance was compared.

The features of images of normal samples were fed into the Transformer model to aggregate the image features and predict protein subcellular localization. The Transformer is a model that uses attention, a concept that helped improve the performance of neural machine translation applications, to boost the speed with which these models can be trained16. It can accept multiple input feature vectors, consider all of them jointly, and extract useful information from them. The image features obtained in the previous step were input into the model for feature aggregation, and each protein finally obtained a predicted probability vector. Each element of the vector represents the probability that the protein is located at the corresponding position.

To improve the efficiency of the model, the hyperparameters were tuned. Hyperparameters in the model include the dimension of the input features, the depth of the model, the number of heads, the number of hidden neurons in the feedforward layer, etc. Here, mainly the dimension of the input features was tuned and evaluated. For the other parameters, following the Imploc model, the depth of the model was set to 4, the number of heads was set to 6, and the number of hidden neurons in the feedforward layer was set to four times the feature dimension. To determine the most appropriate feature dimension, the four types of features with different dimensions were each used to construct models.

Ten percent of the proteins were randomly selected as the test set, and the remaining proteins were used to train the models by 10-fold cross validation. The remaining proteins were divided into 10 parts. Each time, one of the 10 parts was used as the validation set, and the rest were used as the training set.
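The hold-out and 10-fold splitting scheme described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `split_proteins`, the fixed seed, and the placeholder protein identifiers are assumptions, and the split is made at the protein level as the text describes.

```python
import random


def split_proteins(proteins, test_fraction=0.1, n_folds=10, seed=0):
    """Hold out a test set, then build (train, validation) pairs for k-fold CV."""
    rng = random.Random(seed)
    shuffled = list(proteins)
    rng.shuffle(shuffled)

    # Hold out roughly `test_fraction` of the proteins as the test set.
    n_test = round(len(shuffled) * test_fraction)
    test, rest = shuffled[:n_test], shuffled[n_test:]

    # Deal the remaining proteins into n_folds parts of near-equal size.
    folds = [rest[i::n_folds] for i in range(n_folds)]

    splits = []
    for k in range(n_folds):
        validation = folds[k]
        train = [p for j, fold in enumerate(folds) if j != k for p in fold]
        splits.append((train, validation))
    return test, splits


# Placeholder identifiers standing in for the 332 proteins in the dataset.
test_set, cv_splits = split_proteins([f"protein_{i}" for i in range(332)])
print(len(test_set), len(cv_splits), len(cv_splits[0][0]), len(cv_splits[0][1]))
```

Each of the 10 `(train, validation)` pairs yields one trained model, and the held-out test set is never seen during training or validation.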
After 10 training sessions, 10 models were obtained. To evaluate the effectiveness of the models, the localization information downloaded from HPA was used as the known subcellular localization, and the AUC value and F1 score (Eq. 1) were calculated.

$$F1=2\cdot \left({precision}\cdot {recall}\right)/\left({precision}+{recall}\right) \tag{1}$$
where \({precision}\) is the proportion of true positives among the cases that the classifier labels as positive, and \({recall}\) is the proportion of all actual positive cases that the classifier correctly identifies.

Comparing the results of the four dimensions, the feature dimension with the best predictive performance was used as the hyperparameter of the model.

At the same time, the probability of a protein being located at each position was predicted. When the probability of the protein being localized to a position was greater than the chosen threshold, the protein was assigned to that position. Multiple localizations may therefore be obtained for each protein. Six thresholds were considered (0.3, 0.4, 0.5, 0.6, 0.7, 0.8). Comparing the results of different dimensions and thresholds, the effectiveness of the models was evaluated.

Using the optimal hyperparameters, the models were retrained with all 10 parts as the training set to obtain the final models. To ensure the reliability and reproducibility of the results, two sampling methods for dividing the training set and the test set were applied to construct the protein localization prediction models. Ninety percent and seventy percent of the samples, respectively, were randomly selected as the training set, and the remaining samples served as the test set to evaluate the effectiveness of the models. Both methods were repeated 10 times, resulting in two groups of 10 models each.

To further evaluate the predictive performance of our models, two existing methods for protein subcellular localization prediction based on convolutional neural networks, AnnoFly and Imploc, were used. Based on IHC images of normal breast tissue, protein subcellular localization was predicted and the results were compared with those of our models. To ensure the accuracy of the results, the same data and sampling methods were used for the three models.
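The thresholding rule and the F1 score of Eq. (1) can be illustrated with a short Python sketch. The probability vectors and true labels below are invented for illustration, and the F1 score is micro-averaged over all (protein, label) pairs, which is an assumption: the text does not state which averaging was used.

```python
LABELS = ["Cytoplasm", "Endoplasmic reticulum", "Golgi apparatus",
          "Mitochondria", "Plasma membrane", "Nuclear", "Vesicles"]
THRESHOLDS = [0.3, 0.4, 0.5, 0.6, 0.7, 0.8]


def predict_labels(probs, threshold):
    """Assign every label whose predicted probability exceeds the threshold."""
    return {label for label, p in zip(LABELS, probs) if p > threshold}


def f1_score(true_sets, pred_sets):
    """Micro-averaged F1 (assumed averaging) over (protein, label) pairs."""
    tp = sum(len(t & p) for t, p in zip(true_sets, pred_sets))
    n_pred = sum(len(p) for p in pred_sets)
    n_true = sum(len(t) for t in true_sets)
    if tp == 0:
        return 0.0
    precision = tp / n_pred  # true positives among predicted positives
    recall = tp / n_true     # true positives among actual positives
    return 2 * precision * recall / (precision + recall)


# Hypothetical predictions for two proteins (not taken from the study).
probabilities = [
    [0.90, 0.10, 0.20, 0.05, 0.55, 0.35, 0.10],
    [0.20, 0.10, 0.10, 0.10, 0.10, 0.85, 0.45],
]
truth = [{"Cytoplasm", "Plasma membrane"}, {"Nuclear"}]

scores = {}
for threshold in THRESHOLDS:
    predicted = [predict_labels(p, threshold) for p in probabilities]
    scores[threshold] = f1_score(truth, predicted)
    print(threshold, round(scores[threshold], 3))
```

A low threshold admits extra labels and lowers precision, while a high threshold drops true labels and lowers recall, which is why several thresholds are compared.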
The AUC and F1 values were calculated and the predictive performance of our models was evaluated.

Identification of differentially localized proteins

Based on the constructed subcellular location prediction models, protein localizations were analyzed in the following three steps:

First, identification of stable differentially predictive localized proteins. The normal images and cancer images were input into the models separately to predict subcellular localization, and two prediction probability vectors were obtained for each protein (\({P}_{n},\,{P}_{c}\)). Since the subcellular localizations downloaded from the HPA database were classified into seven categories, each probability vector was a 7-dimensional vector.

$${P}_{c}={\left[{p}_{c1}\,{p}_{c2}\cdots {p}_{c7}\right]}^{T} \tag{2}$$

$${P}_{n}={\left[{p}_{n1}\,{p}_{n2}\cdots {p}_{n7}\right]}^{T} \tag{3}$$
Each element of the vector represents the probability that the protein is located at the corresponding location label. If the probability was greater than the threshold (\(p\)), the protein was predicted to be located at that location (Eq. 4).

$${L}_{P}=\left\{{L}_{i}|\,{p}_{i}\, > \,p\right\},\,i\in \left[1,7\right] \tag{4}$$
where \({L}_{P}\) is the localization of the protein predicted by the model, \({L}_{i}\) is the label of the i-th position, and \({p}_{i}\) is the probability of localization to the i-th position.

For each of the 20 models obtained above, the proteins whose predicted localization differed between tumor and normal conditions were extracted. For each group of models, the number of models in which the predicted location of each protein differed was counted, and proteins whose count reached the upper quartile were extracted. The results of the two groups were intersected to obtain proteins with stable differentially predictive localization (Eq. 5).

$${P}_{s}=\left\{P | {P}_{C70} \ge Q\, \& \,{P}_{C90} \ge Q\right\},\,Q=\left[0.75 * \left(n+1\right)\right] \tag{5}$$
where \({P}_{s}\) represents proteins with stable differentially predicted localization, \(Q\) is the upper quartile of the counts, rounded down, and \({P}_{C70}\) and \({P}_{C90}\) are the numbers of models, among those obtained with the 70% and 90% random sampling methods respectively, in which the protein's prediction differed between conditions.

Second, identification of proteins with maximum localization differences. The model with the best predictive performance among all models was selected for the analysis. For each protein, the Euclidean distance between the two probability vectors in tumor and normal conditions (\({P}_{n},\,{P}_{c}\)) was calculated, and the 5% of proteins with the largest distances were taken as the proteins with the largest localization difference.

Third, identification of proteins whose predicted results were not affected by the removal of a single image. For each protein, to exclude the influence of any single image on the prediction results, one image was deleted and the localization of the protein was predicted from its remaining images. Two probability vectors were obtained: the normal vector and the cancer vector. A T-test was performed between the prediction vector obtained after removing a single image and the prediction vector obtained from all images of the same protein, and proteins showing no significant difference under both disease and normal conditions were retained (significance threshold P < 0.05).

The intersection of the results of the above three steps was extracted as the stable differentially localized proteins.

Verification of differentially localized proteins

The relationship between differentially localized proteins and breast cancer was verified in the following aspects:

First, literature validation. Literature in the PubMed database related to the differentially localized proteins and breast cancer was reviewed to analyze the role of these proteins in the progression of breast cancer.

Second, functional analysis.
Based on the enrichr function in the R package “enrichR”, Gene Ontology (GO; http://geneontology.org/; current release June 11, 2023) and Kyoto Encyclopedia of Genes and Genomes (KEGG; https://www.kegg.jp/; version: 107.0) analyses were performed on the proteins. Biological processes and pathways with P values less than 0.05 were extracted as significantly enriched functions to analyze the association between the proteins and breast cancer.

Third, survival analysis. The survival data of the corresponding samples in the CPTAC profiles were downloaded from The Cancer Genome Atlas (TCGA; https://www.cancer.gov/ccg/research/genome-sequencing/tcga; accessed in April 2022) database, and survival analysis was performed with the R package “survival”. Overall survival was defined as the time from the date of initial surgical resection to the date of death or last contact. Survival curves were drawn using the Kaplan–Meier method and were statistically compared using the log-rank test46. Based on the results of the survival analysis (p < 0.05), the relationship between the proteins and the prognosis of breast cancer was discussed.

Classification performance analysis

To evaluate how well the identified proteins distinguish breast cancer from normal samples, the expression of differentially localized proteins in tumor and normal samples was compared and a classifier was constructed for classification analysis.

Based on the two sets of protein expression profiles downloaded from the CPTAC database, the differentially localized proteins were tested for differences in expression. T-tests and fold-change analysis were applied to each of the two data sets, and proteins with P < 0.05 and \(\left|{\log }_{2}\mathrm{foldchange}\right| > 1\) were identified as differentially expressed proteins.

A random forest model was adopted to construct a classifier based on the expression values of the proteins to classify cancer and normal samples.
The model was constructed with the R package “randomForest”. Two important parameters of the random forest, the number of decision trees in the model, ntree, and the number of variables randomly sampled as candidates at each split, mtry, were set to 500 and \({\log }_{2}N\), respectively, where N is the number of samples. The classification performance of the model was evaluated with the Leave One Out Cross Validation method. Then, an equal number of non-differentially localized proteins were randomly selected and classifiers were constructed from them in the same way. By comparing the classification performance of the two groups of proteins, the ability of the identified proteins to distinguish between normal and cancer samples was evaluated.

Analysis of breast cancer potential mechanisms

The potential mechanisms of breast cancer were investigated by extracting the co-expressed or co-located proteins and ncRNAs interacting with differentially localized proteins.

Based on the protein expression profiles of breast tissue downloaded from the CPTAC database, the Pearson correlation coefficients between differentially localized proteins and other proteins were calculated separately in normal and cancer samples. Proteins with a correlation coefficient >0.3 and P < 0.05 were extracted as co-expressed proteins. Then, the interaction data of the differentially localized proteins were downloaded from the STRING (https://cn.string-db.org/; version: 11.5) database and the interactions with scores higher than 700 were extracted. By taking the intersection of the two, the significantly co-expressed and interacting proteins of the differentially localized proteins were obtained. Finally, two groups of proteins, under normal and cancer conditions, were obtained and two groups of modules were constructed.
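The module construction described above reduces to intersecting two filtered sets, as the following Python sketch shows. It assumes that the correlation coefficients, P values, and STRING scores have already been computed. The protein names (`DLP1`, `A`, `B`, `C`), the input dictionaries, and the function name are hypothetical.

```python
def coexpressed_interactors(correlations, string_scores,
                            r_min=0.3, p_max=0.05, score_min=700):
    """Return pairs that are both significantly co-expressed and interacting.

    correlations:  {(target, partner): (pearson_r, p_value)}
    string_scores: {(target, partner): STRING score}
    """
    coexpressed = {pair for pair, (r, p) in correlations.items()
                   if r > r_min and p < p_max}
    interacting = {pair for pair, score in string_scores.items()
                   if score > score_min}
    return coexpressed & interacting


# Hypothetical inputs for one differentially localized protein, "DLP1".
string_scores = {("DLP1", "A"): 950, ("DLP1", "B"): 720, ("DLP1", "C"): 400}
correlations = {
    "normal": {("DLP1", "A"): (0.62, 0.001), ("DLP1", "B"): (0.10, 0.400),
               ("DLP1", "C"): (0.55, 0.010)},
    "cancer": {("DLP1", "A"): (0.20, 0.200), ("DLP1", "B"): (0.48, 0.003),
               ("DLP1", "C"): (0.70, 0.001)},
}

# One interaction module per condition.
modules = {condition: coexpressed_interactors(corr, string_scores)
           for condition, corr in correlations.items()}
print(modules)
```

In this toy example the partner of `DLP1` changes between conditions, which is the kind of module difference that the subsequent enrichment analysis compares.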
To study the functional changes of differentially localized proteins between tumor and normal conditions, the functions of a protein under one condition were defined as the functions of all proteins in its interaction module under that condition. GO and KEGG enrichment analyses were conducted for the proteins in the modules with the R package “clusterProfiler”. Biological processes and pathways with P values less than 0.05 were taken as the significantly enriched functions and pathways, which were used to study the functional changes caused by localization changes and their impact on breast cancer.

The interactions between RNAs and proteins were downloaded from the RNAInter database, and the non-coding RNAs (ncRNAs) that interact with the differentially localized proteins were extracted. Then, the localization information of these RNAs was retrieved from the RNALocate database. Finally, ncRNAs with the same localization as the differentially localized proteins under both cancer and normal conditions were extracted. These ncRNAs were regarded as co-located RNAs of the differentially localized proteins, which undergo localization changes together with the proteins. Their interaction might affect the localization of the proteins and thereby their functions, which is of great significance for the occurrence of cancer.

Statistics and Reproducibility

For all hypothesis testing, we used a standard threshold of p < 0.05 to assign significance. To improve the efficiency of the model, two hyperparameters were tuned and two sampling methods were adopted. Full details are provided in the Results and Methods sections.
The source data and code used in this research have been uploaded to GitHub and Zenodo47 and can be accessed online at https://github.com/wendyliwan/protein-localization-prediction-models and https://doi.org/10.5281/zenodo.12139567.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
