Defining condensate-forming proteins
To develop a model that identifies condensate-forming proteins, we assembled a ground truth dataset for H. sapiens, which has the most experimentally studied condensates of any organism to date. Since we aimed to develop a binary classifier, we considered two classes of proteins: (1) proteins involved in condensates (positive dataset) and (2) proteins not involved in condensates (negative dataset) (Fig. 1a). The positive dataset was constructed from a semi-manually curated dataset of biomolecular condensates and their respective proteins, called CD-CODE (CrowDsourcing COndensate Database and Encyclopedia), developed by our labs34. CD-CODE compiles information from primary literature and from four widely used databases of LLPS proteins25,26,27,28.
Building the negative dataset is complicated, as there is no publicly available resource that reports proteins that do not form condensates. Additionally, condensates may form only under specific conditions35. Here, we defined the negative dataset based on a protein-protein interaction network (InWeb database36 for human proteins). We excluded all proteins that have direct connections with known condensate proteins, reasoning that these proteins are potential condensate members that have not yet been studied. The remaining proteins comprised the negative dataset (Fig. 1a). This procedure does not guarantee the absence of condensate proteins in the negative dataset (false negatives). However, excluding proteins that directly interact with reported members of synthetic or biomolecular condensates lowers the probability of mixing positive and negative data. Overall, our non-redundant dataset (filtered at 50% sequence identity) contained 2142 positive and 1709 negative human proteins, which were divided in a 4:1 ratio into training and test datasets.
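The network-based exclusion described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function names, the edge-list representation of the interaction network, and the seeded shuffle used for the 4:1 split are all hypothetical, and the 50% sequence-identity filtering is not shown.

```python
import random


def build_negative_set(all_proteins, positives, interactions):
    """Return proteins that are neither known condensate members
    nor direct interaction partners of a known condensate member.

    interactions: iterable of (protein_a, protein_b) undirected edges.
    """
    positives = set(positives)
    neighbours = set()
    for a, b in interactions:
        # Any protein sharing an edge with a positive is excluded.
        if a in positives:
            neighbours.add(b)
        if b in positives:
            neighbours.add(a)
    return set(all_proteins) - positives - neighbours


def split_4_to_1(proteins, seed=0):
    """Shuffle reproducibly and split into training (4/5) and test (1/5)."""
    items = sorted(proteins)
    random.Random(seed).shuffle(items)
    cut = (len(items) * 4) // 5
    return items[:cut], items[cut:]
```

With a toy network in which only protein A is a known condensate member and A interacts with B, both A and B are removed and the remaining proteins form the negative set.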
PICNIC identifies sequence- and structure-determinants of condensate formation
We hypothesized that the ability to form condensates is encoded in the proteins’ sequence and structure, and developed a machine learning classifier called PICNIC (Proteins Involved in CoNdensates In Cell) based on sequence-distance and structure-based features derived from AlphaFold2 models (Fig. 1b), in total 65 sequence-distance-based and 21 structure-based features.
It was shown that many proteins involved in condensates harbor intrinsically disordered regions (IDRs) and low-complexity sequences. IDRs, due to their inherent flexibility, multi-valency and ability to sample multiple conformations, are adept at a wide array of binding-related functions including molecular assemblies9,37,38. Therefore, we also tested several metrics of disorder and sequence complexity as features (see Methods). Our final model contained several features related to disorder, such as IUPred scores33, that have a feature importance of 0.5–3%.
Although the presence of highly disordered residues is among the most important features (Fig. 1d, pink), a protein does not need long disordered domains to be a member of a condensate. This is supported by the observation that 21% of known condensate-forming proteins in the human proteome have no disordered regions (disordered regions <10 aa; Fig. S1), while 33% of all human proteins have no disordered regions. For example, the human protein guanine nucleotide exchange factor C9orf72 is a driver protein in stress granules, and Speckle-type POZ protein is a driver in nuclear speckles and the SPOP/DAXX body. Both proteins consist of ordered domains whose structures were experimentally determined by electron microscopy and X-ray crystallography, respectively (Fig. S1, PDB ids 6LT0 and 3HU6). Thus, both the analysis of experimentally verified condensates and the features selected by our model suggest that disorder is not a necessity for condensate-forming proteins.
Along with overall sequence complexity and disorder scores of a protein, the secondary structure of individual residue types was also found to be important. We used the confidence score of the AlphaFold2 model prediction, the pLDDT score, that was shown to correlate with sequence disorder39. We represented the occurrence of an amino acid (AA) in a given secondary structure element (SSE) with a given model confidence as a triad (AA-SSE-pLDDT).
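A minimal sketch of how such AA-SSE-pLDDT triads could be tabulated from per-residue annotations is given here. Several details are assumptions made for illustration only: the two-level binning of pLDDT into low ("l") and high ("h"), the cutoff of 70, the normalization by sequence length, and the function names.

```python
from collections import Counter


def plddt_bin(plddt, cutoff=70.0):
    """Bin a per-residue pLDDT value; the cutoff is an assumed value."""
    return "h" if plddt >= cutoff else "l"


def triad_frequencies(sequence, sse, plddt, cutoff=70.0):
    """Frequency of each (amino acid, SSE label, pLDDT bin) triad.

    sequence: string of one-letter amino acid codes
    sse:      per-residue secondary structure labels (same length)
    plddt:    per-residue pLDDT values (same length)
    """
    if not (len(sequence) == len(sse) == len(plddt)):
        raise ValueError("sequence, sse and plddt must have equal length")
    if not sequence:
        raise ValueError("sequence must not be empty")
    counts = Counter(
        (aa, s, plddt_bin(p, cutoff)) for aa, s, p in zip(sequence, sse, plddt)
    )
    n = len(sequence)
    return {triad: count / n for triad, count in counts.items()}
```

Under this encoding, an isoleucine in a helix with low model confidence is counted under the key ("I", "H", "l").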
As amino acid composition bias and patterning of charges were shown to impact the ability of proteins to form condensates30,31,32, we developed features that represent short- and long-range co-occurrences of amino acids in the protein sequence. We represented the co-occurrence of amino acids within a given distance (number of amino acids in the linear sequence) as triads (AA1, distance, AA2). After feature selection, the most important features were the long-range distance between charged amino acids, e.g. Lysine and Arginine (K,60,R) and Aspartic acid and Lysine (D,20,K), the short-range distance between Leucine and hydrophobic amino acids (e.g. L,0,W; F,2,L; L,2,L), and the distance between Cysteine and hydrophobic amino acids (Fig. 1d). Among the amino acids, Lysine and Leucine contribute the most to the model (Fig. 1c).
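The (AA1, distance, AA2) triads can be illustrated with a short counting routine. The sketch rests on two assumptions that the text does not settle: that the distance is the exact number of residues lying between the two amino acids (so that distance 0 means adjacent residues, as the example L,0,W suggests), and that the pair is ordered with AA1 preceding AA2. The function name is hypothetical.

```python
def cooccurrence_count(sequence, aa1, gap, aa2):
    """Count positions where aa1 is followed by aa2 with exactly
    `gap` residues in between (gap=0 means the residues are adjacent)."""
    if gap < 0:
        raise ValueError("gap must be non-negative")
    offset = gap + 1  # index distance between the two residues
    return sum(
        1
        for i in range(len(sequence) - offset)
        if sequence[i] == aa1 and sequence[i + offset] == aa2
    )
```

For the toy sequence "LWALL", the triad (L,0,W) occurs once (positions 1 and 2) and (L,2,L) occurs once (positions 1 and 4). Dividing the count by the sequence length would give a length-normalized feature, which is a further design choice not specified here.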
PICNIC accurately identifies proteins involved in biomolecular condensate formation
Several data-driven predictors have been developed in the last few years that aim to predict proteins involved in LLPS from protein sequence alone or from sequence and experimental data, such as microscopy images40. Here, we compared the performance of PICNIC to the sequence-based predictors PSAP23, DeePhase41 and the general model of PhaSePred29 (PdPS-8fea, based on 8 features) (Fig. 2).
We compared the performance of tools on three different datasets: (1) test dataset from the recently published PhaSePred methods29; (2) proteins forming nuclear puncta defined by the OpenCell project42; (3) test dataset generated from CD-CODE34 (see Methods, Dataset S1). Although the CD-CODE test data is not independent and was partially used by existing predictors during their training process, PICNIC has superior performance with a maximum F1-score of 0.81 (Fig. 2c).
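For readers unfamiliar with the metric, the sketch below computes a maximum F1-score under the assumption that "maximum" refers to the best F1 obtained over all decision thresholds applied to the predicted scores. The function names are hypothetical and this is not the evaluation code used for Fig. 2.

```python
def f1_at_threshold(scores, labels, threshold):
    """F1-score when proteins with score >= threshold are called positive."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def max_f1(scores, labels):
    """Best F1 over all thresholds that coincide with an observed score."""
    return max(f1_at_threshold(scores, labels, t) for t in set(scores))
```

Only observed scores need to be tried as thresholds, because the set of positive calls changes only when the threshold crosses one of them.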
To further validate our model, we used microscopy images from the Human Protein Atlas (HPA), where fluorescently labeled proteins were imaged and their cellular localization was determined43. Specifically, three types of cellular localizations were screened: nucleolus, centrosome and nuclear speckle. We removed from the HPA list the proteins that were already in our training set, which resulted in 484 proteins with known localization. Overall, PICNIC scores were higher for the proteins from HPA than for proteins without known localization (Fig. S4). 69% of proteins mapped from HPA (excluding proteins from the training dataset) have a PICNIC score greater than 0.5, meaning that PICNIC correctly identified them as members of biomolecular condensates. It should be noted that HPA does not report whether a protein does not belong to a given condensate (negative examples). Therefore, this dataset can be used only to check model sensitivity (recall: what fraction of true condensate-forming proteins were predicted correctly), but not model precision (what fraction of positive predictions are actually true positives).
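The sensitivity check on a positives-only dataset amounts to the following calculation, shown as a minimal sketch. The dictionary of scores, the function name and the toy identifiers are hypothetical; only the 0.5 threshold and the exclusion of training-set proteins come from the text.

```python
def recall_on_held_out_positives(scores_by_protein, training_set, threshold=0.5):
    """Fraction of known-positive proteins, not seen during training,
    whose score exceeds the threshold (i.e. recall on positives only)."""
    held_out = {
        protein: score
        for protein, score in scores_by_protein.items()
        if protein not in training_set
    }
    if not held_out:
        raise ValueError("no held-out proteins to evaluate")
    hits = sum(1 for score in held_out.values() if score > threshold)
    return hits / len(held_out)
```

Because every protein in such a dataset is a presumed positive, false positives cannot occur, which is why precision is undefined for this kind of evaluation.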
PICNIC is robust in identifying small sequence perturbations that impact condensate formation
A challenging task for a computational predictor is to be sensitive to small sequence perturbations that can impact condensate formation. To test whether PICNIC can distinguish similar sequences with altered condensate-forming properties, we considered the synuclein family, which comprises three paralogs in human. Although they have similar sequences (60–70% identity, Fig. 3a, c) and structures as predicted by AlphaFold2 (Fig. 3b), only α- and γ-synuclein form condensates in vivo, and only α-synuclein phase separates in vitro. Specifically, FITC-labeled β-synuclein, which lacks the characteristic NAC region of α-synuclein, does not phase separate at high concentrations (200 μM) and under crowding conditions (10% [weight/volume] PEG), whereas FITC-labeled α-synuclein forms condensates under the same conditions22,44,45. While α- and γ-synuclein can form amyloid-like fibers, β-synuclein does not45,46. Moreover, α- and γ-synuclein are part of biomolecular condensates: α-synuclein is reported to be a member of the synaptic vesicle pool condensate46 and γ-synuclein is a member of the centrosome47, but β-synuclein has not yet been found in any biomolecular condensate. PICNIC is the only method tested here that accurately predicts the in vivo condensate-forming ability of the synuclein family (Fig. 3d). Other methods give the same score for all three paralogs and/or do not predict the correct tendency of condensate formation in vivo. The features that stand out in β-synuclein and are absent from the most important features of α- and γ-synuclein (I-Alpha helix-l, F-Alpha helix-l) correspond to hydrophobic amino acids that are part of an alpha helix with a low pLDDT score (Fig. S16). Thus, the structural changes involving the alpha helix are likely to drive the signal.
We surmise that PICNIC is sensitive to structural rearrangements of proteins, and hypothesize that the bending of the alpha helix in β-synuclein potentially hinders the protein’s ability to form condensates, as highlighted on the structure (Fig. S16).
This encouraged us to further test whether PICNIC can predict the impact of mutations, i.e. substitutions and deletions. We assembled a dataset of sequence perturbations described in the literature that impact the ability of a protein to form condensates. We excluded the most commonly used phase-separating proteins such as FUS because they are used as a model for many algorithms, and performance on these proteins is heavily biased. In total, our dataset comprised 27 single mutations and deletions in 10 proteins (Table S1, Dataset S5). PICNIC scores were consistently lower for mutant sequences that have a reduced or completely abolished ability to form condensates (Fig. 3e, f). However, the scores are still higher than the threshold of 0.5. This indicates that the impact of mutations is better resolved when scores are interpreted in relation to each other, since the overall ability of the protein to form condensates is encoded in features that are shared across the different mutants. Nevertheless, PICNIC can predict the relative ability of mutants to form condensates.
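Interpreting mutant scores relative to the wild type can be sketched as follows. The helpers are hypothetical, as is the mutation notation ("V3A" for a substitution at 1-based position 3); `score_fn` stands for any callable that maps a sequence to a score and is a placeholder, not the actual PICNIC interface.

```python
def apply_substitution(sequence, mutation):
    """Apply a point substitution written as e.g. 'V3A' (1-based position)."""
    ref, pos, alt = mutation[0], int(mutation[1:-1]), mutation[-1]
    if not 1 <= pos <= len(sequence) or sequence[pos - 1] != ref:
        raise ValueError(f"reference residue mismatch for {mutation}")
    return sequence[: pos - 1] + alt + sequence[pos:]


def apply_deletion(sequence, start, end):
    """Delete residues start..end (1-based, inclusive)."""
    if not 1 <= start <= end <= len(sequence):
        raise ValueError("deletion range outside the sequence")
    return sequence[: start - 1] + sequence[end:]


def score_change(score_fn, wild_type, mutant):
    """Mutant score minus wild-type score; negative values indicate a
    predicted reduction relative to the wild type."""
    return score_fn(mutant) - score_fn(wild_type)
```

The sign and size of `score_change` are then compared across the mutants of one protein, rather than comparing each mutant score with the absolute 0.5 threshold.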
Experimental validation of predicted condensate-forming proteins
To experimentally validate our model, we predicted the condensate localization of poorly characterized human proteins and sought to validate their condensate-forming behavior inside living cells. To do so, we chose 24 proteins which: (1) cover diverse molecular functions spanning the entire central dogma of molecular biology and regulation of all major cellular biopolymers, for instance nucleic acids, proteins and chromatin (Fig. S8), (2) represent the average sequence length of human proteins (i.e., around 350 amino acids), ranging from 125 to 684 amino acids, (3) represent diverse 3D structures from ordered, alpha-helical, beta-stranded to highly disordered (Fig. 4b) and (4) are known to be involved in genetic diseases (AIMP1, CWC27, RP9, LMOD1) as well as host-pathogen interaction (IF2GL). Overall, the 24 proteins (Dataset S2) that we chose for experimentally verifying and benchmarking PICNIC represent global cellular functions and are therefore suitable to demonstrate how robust our machine learning model is in predicting condensate-forming proteins across entire proteomes.
We cloned 24 transgenes and transfected them in U2OS cells expressing fluorescently labeled proteins (see Methods). Using fluorescent imaging, we found that 21 out of the 24 tested proteins (87.5%) localized to mesoscale foci without any stressors, while 3 proteins (C1ORF52, SPAG7 and CWC27, encircled in red) localized to the nucleoplasm without forming any discernible mesoscale foci (Fig. 4a, Fig. S10). Foci were defined based on enrichment in fluorescent intensity, i.e., the intensity ratio inside relative to outside the foci is greater than one (Fig. S9a). In sum, only 3 of the tested proteins show close to no detectable foci, and 21 form foci (Fig. 4a).
To classify the observed foci as biomolecular condensates, we aimed to define quantitative characteristics and thresholds. We measured four simple characteristics from fluorescent microscopy images (Fig. S9): area and perimeter, informing on the size and the typical number of proteins in a focus; shape (roundness); and number of foci per cell. Next, we decided on thresholds for these characteristics to aid a quantitative definition of condensates. We consider foci with a diameter above 350 nm (distance between the two furthest pixels in one condensate) as condensates, which is well above the diffraction limit. This corresponds to a perimeter of at least ~1 μm assuming a near-round shape (Fig. S9c). Using back-of-the-envelope calculations, if we take an average protein to occupy (10 nm)³, then a 1 μm³ compartment can contain ca. 1 million protein molecules and a (500 nm)³ compartment can contain ca. 100,000 protein molecules. We note that super-resolution techniques are required to characterize the size of protein clusters below the diffraction limit.
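The numbers in this estimate can be checked with a few lines of arithmetic. The sketch assumes that the protein and compartment sizes denote cubes with side lengths of 10 nm, 1 μm and 500 nm, which is the reading under which the quoted molecule counts agree, and that foci are circular; the function names are illustrative.

```python
import math


def perimeter_um(diameter_nm):
    """Perimeter of a circular focus in micrometres."""
    return math.pi * diameter_nm / 1000.0


def molecules_in_cube(side_nm, protein_side_nm=10.0):
    """Number of cubic 'protein volumes' that fit in a cubic compartment."""
    return (side_nm / protein_side_nm) ** 3


print(round(perimeter_um(350), 2))   # 1.1
print(molecules_in_cube(1000))       # 1000000.0
print(molecules_in_cube(500))        # 125000.0
```

A 350 nm diameter thus gives a perimeter of about 1.1 μm, and the 500 nm compartment holds about 125,000 molecules, consistent with the order-of-magnitude figure of 100,000.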
Experiments confirm 87.5% of PICNIC predictions
By applying the above definition to our dataset, 75%, i.e. 18 proteins (encircled in blue), form high-confidence condensates; 12.5%, i.e. 3 proteins (encircled in orange), form low-confidence condensates (foci with perimeter <1 μm); and the other 3 do not form any condensates (encircled in red) and exhibit a fluorescent intensity ratio of ~1. We observe most condensates to be round (Fig. S9d). The number of condensates per cell varies from 1 to hundreds depending on the protein of interest.
Next, we wanted to characterize the localization of the condensates that the proteins formed. We observed that 9 proteins (TYW5, SPA24, AIMP1, ZC3H15, IF2GL, LMOD1, RPS4Y2, DRC4, RS10L) localized to cytoplasmic bodies and 7 (RAD51AP1, KHDC4, CWC25, POLD3, RAMAC, DRC4, RP9) localized to nuclear bodies (Fig. 5a, Fig. S10).
Using co-localization experiments with known nucleolus (Fibrillarin) and processing body (P-body) (DCP1a) markers, we found that several proteins (H2A1H, RAD51AP1, H1T, MRPL1, RS10L, RPS4Y2, POLD3) can localize, at least in some cells, to the nucleolus, a well-characterized liquid-like nuclear condensate (Fig. 5a, see also Fig. S10; FRAP experiments in Fig. 5b), and that PHP14 can localize to P-bodies, a well-characterized cytoplasmic condensate (Fig. 5a).
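The three-way classification can be expressed as a small decision rule. This is a sketch under assumptions: the function names are hypothetical, the tolerance around an intensity ratio of 1 is an illustrative parameter (the text only says "~1"), and a single representative ratio and perimeter per protein are assumed.

```python
from collections import Counter


def classify_protein(intensity_ratio, perimeter_um,
                     ratio_tolerance=0.0, min_perimeter_um=1.0):
    """Assign one of three classes from focus enrichment and size."""
    if intensity_ratio <= 1.0 + ratio_tolerance:
        return "no condensate"      # no enrichment inside vs. outside
    if perimeter_um < min_perimeter_um:
        return "low confidence"     # enriched, but below the size cutoff
    return "high confidence"


def class_fractions(labels):
    """Fraction of proteins falling in each class."""
    counts = Counter(labels)
    return {label: n / len(labels) for label, n in counts.items()}
```

Applied to 18 high-confidence, 3 low-confidence and 3 negative proteins, `class_fractions` returns 0.75, 0.125 and 0.125, matching the percentages reported for the 24 tested proteins.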
H2A1H (in purple) shows rather weak localization as a rim around the DFC (in green), indicating sub-nucleolar localization specific to the outer granular component (GC, see the cartoon representation of the nucleolar architecture). RAD51-associated protein 1 (RAD51AP1) shows multi-condensate localization that varies from exclusive nuclear bodies (Localization 3 and 4), to nuclear bodies abutting the nucleolus (Localization 3), to a rim around the DFC (GC, Localization 1), to sub-nucleolar localization (Localization 4), as well as complete but weak nucleolar localization (Localization 2) (Figs. 5a, S10). This suggests an interesting role for this protein in multiple nuclear condensates, possibly regulated by cell-cycle stage. The localization patterns of the condensate-forming proteins are consistent with the wide range of molecular functions that these proteins perform (Fig. S8). We also observed more than one type of foci for many proteins, e.g. DRC4: nuclear and cytoplasmic; MRPL1: nuclear and cytoplasmic bodies; RS10L: cytoplasmic bodies, filaments as well as nucleolar localization (Fig. S10).
In addition, FRAP recovery profiles for a subset of the condensate-forming proteins (RAD51AP1, KHDC4, CWC25, RAMAC, DRC4, RP9, RBMY1D, TYW5) revealed that 6 out of 8 tested proteins show very fast dynamics, indicated by the recovery of the bleached foci, while two (TYW5 and RBMY1D) show little to no recovery at the indicated time (Fig. 6).
In sum, we do not see any correlation between the disorder content of a protein and its ability to form condensates in our experimental dataset (Fig. 4c). Other popular tools to predict condensate proteins would fail to make correct predictions for many of these proteins, as shown by their high misclassification rates (Fig. 4d). Overall, 87.5% of PICNIC predictions were found to be correct in our experimental assays (the misclassification rate is 25% if only high-confidence condensates are counted and 12.5% if both high- and low-confidence condensates are included), validating the model.
Proteome-wide predictions detect no correlation of predicted condensate proteome size with disorder content and organismal complexity
To demonstrate the generalizability of PICNIC, which was trained on human data, we tested its performance in identifying known condensate-forming proteins of other organisms. We screened the CD-CODE database34 to evaluate the fraction of proteins that were correctly identified as members of condensates by our predictor. For example, PICNIC successfully predicted 72% of such proteins in mouse and 86% in Caenorhabditis elegans (Fig. 7a). Thus, the PICNIC model is species-independent and can be used in different organisms to assess the ability of proteins to be involved in biomolecular condensates.
To estimate the overall proteome fraction of condensate proteins, we calculated PICNIC scores for 26 different organisms across the tree of life, including bacteria, plants and fungi (Fig. 7). We selected 14 organisms with known, experimentally verified condensate protein members in CD-CODE (we excluded from further analysis organisms for which the number of known proteins is too small to compute performance statistics: Danio rerio (N = 14), Dictyostelium discoideum (N = 2), Escherichia coli (N = 6), Mycobacterium tuberculosis (N = 1), Oryza sativa (N = 1), Candida albicans (N = 6)). PICNIC correctly identified 50–100% of the known condensate proteins across organisms in CD-CODE (Fig. 7a). Although PICNIC was trained on human data, it generalized well and is likely applicable to proteomes of other organisms.
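The per-organism evaluation can be sketched as a grouped recall calculation with a minimum-count filter. The function name and record layout are hypothetical, and `min_known` is left as a required argument because the text does not state the exact cutoff used to exclude organisms with too few known proteins.

```python
from collections import defaultdict


def recall_by_organism(records, min_known, threshold=0.5):
    """Fraction of known condensate proteins scored above the threshold,
    per organism, skipping organisms with fewer than `min_known` proteins.

    records: iterable of (organism, score) pairs for known positives.
    """
    scores = defaultdict(list)
    for organism, score in records:
        scores[organism].append(score)
    return {
        organism: sum(1 for s in values if s > threshold) / len(values)
        for organism, values in scores.items()
        if len(values) >= min_known
    }
```

Organisms below the cutoff are omitted from the result rather than reported with an unreliable percentage.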
Next, we performed a proteome-wide assessment of condensate proteins across 26 organisms. We found that the proportion of the predicted condensate-forming proteome is 40–60%, and is similar across related organisms, e.g., 42% and 39% in human and mouse, respectively (Fig. 7b). Interestingly, while the fraction of disordered proteins increases with organismal complexity as shown before48,49, we found no correlation between the fraction of predicted condensate proteins in a proteome and the disordered protein content (Fig. 7c) across the 26 species tested, even when using different metrics to assess the disorder of a proteome (Fig. S14). For example, E. coli and H. sapiens both have ~40% of their proteome predicted to be involved in biomolecular condensates (Fig. 7b).
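A correlation of this kind between two per-proteome quantities can be computed as sketched here. The choice of the Pearson coefficient is an assumption made for illustration, since the text does not name the statistic used for Fig. 7c, and the function name is hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists,
    e.g. disordered fraction and predicted condensate fraction per proteome."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation undefined for constant input")
    return cov / math.sqrt(var_x * var_y)
```

A value near zero across the species would correspond to the absence of a linear relationship; a rank-based coefficient would be an equally reasonable alternative for a small number of proteomes.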