Identification and characterization of specific motifs in effector proteins of plant parasites using MOnSTER

Oomycetes

We used proteins from five oomycete species to create the input datasets for MOnSTER, namely Phytophthora infestans, Phytophthora sojae, Phytophthora ramorum, Hyaloperonospora arabidopsidis and Bremia lactucae.

The positive dataset consists of 1743 effector proteins from these oomycetes, obtained by concatenating proteins selected from the PHI-base database (v4.14)49, UniProt (release 2023_02)50, and the work of Haas et al.33, in which the protein annotations were manually curated. Since the proteins come from different sources, we used CD-HIT (v4.8.1)51, with the parameters given in the Tools configuration paragraph, to filter out identical protein sequences. A total of 1283 proteins are annotated as RxLR effectors, 377 as Crinkler effectors, and the remaining 83 sequences are proteins with no previously identified motif that are known to be involved in the host-pathogen interaction.

All proteins in the negative dataset come from UniProt (release 2023_02) and from the oomycete species listed above. We removed proteins included in the positive dataset and proteins with evident effector-related annotations. Because a large number of non-effector proteins remained after filtering, we first used CD-HIT to reduce protein sequence redundancy and then, to also reduce the imbalance of the final dataset, we refined the selection by taking only the representative sequences of the orthogroups found with OrthoFinder (v2.5.4)52. In total, 3009 non-effector proteins are included in the negative dataset.

The last input file is a list of motifs identified as enriched in the sequences of the positive dataset compared with those of the negative one. We used MERCI and STREME (v5.5.1)53, with the parameters detailed in the Tools configuration paragraph. We imposed different lengths for motif prediction to be inclusive but more stringent on the motifs in which we are interested. STREME’s output is a list of motifs.
Hence, we used the tool FIMO (v5.5.1)54 with default parameters to extract 246 degenerate motifs from the 4524 different motifs.

We obtained the following numbers of non-redundant motifs: 19 with MERCI and 246 with STREME. We then removed identical motifs and created a single non-redundant list containing all the motifs in the same format, which resulted in 265 different motifs.

Plant parasitic nematodes (PPNs)

The positive dataset contains candidate parasitism proteins that are likely secreted by PPNs into their plant host and that belong to 13 species (Meloidogyne incognita, Meloidogyne javanica, Meloidogyne arenaria, Meloidogyne hapla, Meloidogyne chitwoodi, Meloidogyne graminicola, Globodera rostochiensis, Globodera pallida, Heterodera avenae, Heterodera glycines, Heterodera schachtii, Radopholus similis, Bursaphelenchus xylophilus). We collected candidate parasitism proteins by literature mining. More precisely, we considered as candidate parasitism proteins those for which in situ hybridization experiments showed that the corresponding transcript is present in nematode secretory glands (dorsal or sub-ventral), implying that these proteins are likely secreted by the nematodes into the host plant. The literature mining led to the extraction of 163 proteins from NCBI GenBank using the NCBI Entrez API. We also manually extracted 41 sequences from the main text and supplementary information of the publications. In addition, we downloaded 41 sequences from WormBase ParaSite (www.parasite.wormbase.org, vWBPS17-WS282)55,56 and eight sequences from nematode.net57. In total, we obtained 229 candidate parasitism proteins.

We extended the positive dataset with proteins that are non-redundant homologues of these candidate parasitism proteins in PPN proteomes. We first used cd-hit-2D, with the parameters given in the “Tools configuration” section, to cluster sequences from PPN proteomes and candidate parasitism proteins58.
We then pooled all the candidate parasitism proteins from closely related Meloidogyne species (e.g., M. incognita, M. javanica and M. arenaria) and scanned each corresponding proteome with this multi-species set of sequences using cd-hit. Since the remaining species are genetically distinct, we scanned each of their proteomes with the corresponding species-specific set of candidate parasitism proteins, except for H. avenae and M. chitwoodi, for which no proteomes were available. We merged the two sets of selected candidate parasitism proteins and ran CD-HIT within and between species to reduce dataset redundancy (parameters in the “Tools configuration” section), retaining only sequences having more than 1% divergence and aligning over more than 80% of their length (the longest sequence from each cluster was kept). The final positive dataset includes 546 candidate parasitism proteins from 13 species.

The negative dataset is composed of 3849 protein sequences that we obtained by selecting genes widely conserved across the nematode tree of life and close outgroup species, including many non-parasitic species. Specifically, we filtered the results from a previous analysis46 and only retained genes from orthogroups (i) conserved in more than 90% (62/64) of the analysed species, including two tardigrade species (outgroups), and (ii) containing fewer than 10 genes per species per orthogroup, to avoid multigenic families that would lead to overrepresentation of some proteins. To remove redundancy, we used the same strategy as for the positive dataset (cd-hit-2D first and then CD-HIT).

Using the aforementioned software in the same configuration, we obtained the following numbers of non-redundant motifs: 40 with MERCI and 229 with STREME followed by FIMO.
In total, we obtained 269 different motifs.

All datasets are available at https://github.com/Plant-Net/MOnSTER_PROMOCA.git (ref. 31) and in Supplementary Data 2.1-2.2 and 3.1-3.2.

Tools configuration

cd-hit-2D is used with the following configuration: -s2 0 -c 0.90 -g 1 -aL 0.30 -aS 0.30, where the -s2, -c and -g values are the default ones, and the -aL and -aS values are set so that each sequence of a pair must cover at least 30% of the other one.

CD-HIT is used with: s = 0.8, c = 0.99, g = 1, aL = 0.80, aS = 0.80.

STREME, version 5.5.1, accessible at https://meme-suite.org/meme/doc/download.html, is used with the following parameters: -minw 3 -maxw 5; -minw 3 -maxw 7.

MERCI, accessible at http://dtai-static.cs.kuleuven.be/ml/systems/MERCI/MERCI.zip, is used with the following parameters: -l 5 -fp 20; -l 7 -fp 20; -l 10 -fp 20.

MOnSTER pipeline

The MOnSTER (MOtifs of cluSTERs) pipeline is composed of three main steps, as described in Fig. 7 and in the following sections.

Fig. 7: MOnSTER pipeline scheme. a The MOnSTER pipeline is composed of three steps. It takes two FASTA protein sequence datasets (positive and negative) and a list of predicted motifs (enriched in the positive dataset) as input. The output is a list of CLUMPs, each with an associated MOnSTER score. The MOnSTER score combines: b the CLUMP score calculation; c the two occurrence indexes.

MOnSTER pipeline: feature calculation

The first step of the pipeline is the calculation of parameters that describe protein sequences (Fig. 7a). To allow an easy calculation of the features on any dataset, we calculated the sequence length and used the ProteinAnalysis class from Bio.SeqUtils.ProtParam, a Python sub-package (part of Biopython), to compute 13 additional features based on individual amino acid (AA) properties, belonging to four categories:

secondary structure propensity: ‘helix’ (V, I, Y, F, W, L), ‘turn’ (N, P, G, S), and ‘sheet’ (E, M, A, L).

amino acid size: ‘tiny’ (A, C, G, S, T) and ‘small’ (A, C, F, G, I, L, M, P, V, W, Y).

pH: ‘basic’ (H, K, R), ‘acid’ (B, D, E), and ‘charged’ (H, K, R, B, D, E).

physicochemical properties: ‘hydropathy-score’, ‘polar’ (D, E, H, K, N, Q, R, S, T), ‘non-polar’ (A, C, F, G, I, L, M, P, V, W, Y), ‘aromatic’ (F, H, W, Y), and ‘aliphatic’ (A, I, L, V).
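As a rough illustration of how the residue classes listed above can be turned into numeric features, the sketch below computes, for a single sequence, the percentage of residues falling into each class together with a length-normalised Kyte-Doolittle hydropathy value. It is a plain-Python stand-in, not the Bio.SeqUtils.ProtParam calls used in the pipeline; the names `RESIDUE_CLASSES`, `KYTE_DOOLITTLE` and `sequence_features` are hypothetical, and letting residues without a Kyte-Doolittle value (such as B) contribute 0 to the hydropathy sum is an assumption.

```python
# Residue classes copied from the four categories listed above.
RESIDUE_CLASSES = {
    "helix": "VIYFWL", "turn": "NPGS", "sheet": "EMAL",
    "tiny": "ACGST", "small": "ACFGILMPVWY",
    "basic": "HKR", "acid": "BDE", "charged": "HKRBDE",
    "polar": "DEHKNQRST", "non-polar": "ACFGILMPVWY",
    "aromatic": "FHWY", "aliphatic": "AILV",
}

# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def sequence_features(seq):
    """Return a dict of 14 features for one protein or motif sequence."""
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        raise ValueError("empty sequence")
    features = {"length": n}
    # Cumulative percentage of residues belonging to each class.
    for name, residues in RESIDUE_CLASSES.items():
        features[name] = 100.0 * sum(seq.count(r) for r in residues) / n
    # Hydropathy: sum of per-residue values divided by sequence length.
    # Assumption: residues without a tabulated value contribute 0.
    features["hydropathy-score"] = sum(KYTE_DOOLITTLE.get(r, 0.0) for r in seq) / n
    return features


if __name__ == "__main__":
    for name, value in sequence_features("HWTQ").items():
        print(f"{name}: {value}")
```

Applying such a function to every sequence of the positive dataset, the negative dataset and the motif list would give three feature tables with one row per sequence.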

For each of the 13 sub-categories, we calculated the cumulative percentage of the associated AA in each sequence as the value of the corresponding feature. The hydropathy score is the equivalent of the GRAVY (grand average of hydropathy) value introduced by Kyte and Doolittle59. Accordingly, the score is the sum of the hydropathy values of all residues in a sequence divided by the sequence length. The length of each motif is also used as an additional feature, leading to 14 features in total.

We performed feature calculations on the positive and negative datasets and on the list of motifs. At the end of this step, we obtained three tables of features, one for each input (positive dataset, negative dataset and list of motifs).

MOnSTER pipeline: clustering

This step clusters motifs based on their properties, as described by the 13 features. To make the features comparable to each other, we normalized the data using the StandardScaler method from sklearn.preprocessing60. This normalization removes the mean and scales each feature to unit variance.

Then, we performed a hierarchical clustering of the motifs using the Euclidean distance. We divided the resulting tree into clusters of motifs of proteins (CLUMPs) by selecting the threshold distance that minimized the Davies–Bouldin score61.

For each CLUMP, we removed the redundant motifs. Briefly, we identified motifs that shared a core sequence (for example, ‘HWT’ in ‘HWTQ’ and ‘GHWTQ’), and we only retained the cores (here, ‘HWT’) in the CLUMPs.

MOnSTER pipeline: scoring

The final objective is to identify the CLUMP(s) with the highest discriminative power for the positive dataset.
Thus, we conceived a new score, called the MOnSTER score, to rank the CLUMPs by their discriminative power.

The MOnSTER score is composed of three parts: the CLUMP score and two modified versions of the Jaccard index.

MOnSTER pipeline: CLUMP score

This score compares the AA composition of the motifs belonging to each CLUMP with the preferences of the sequences of the positive dataset. The procedure that we implemented to calculate this score is shown in Fig. 7b.

Feature selection

We used the Mann–Whitney test to identify the features whose values were significantly different between the positive and negative datasets. We only retained the statistically significant features, with a p-value < 0.05. Then, we assigned each of them a score by calculating −log(p-value). We will refer to it as the ‘feature weight’ hereafter.

Average calculation

For each of the selected features (ranging from one to f), we calculated the average value for the positive dataset, the negative dataset, and each CLUMP (ranging from zero to c).
We will refer to these values with the notation: $\mu_f^{+}$, $\mu_f^{-}$ and $\mu_f^{\mathrm{CLUMP}_c}$, respectively.

CLUMPs sorting

We compared the averages of the positive and negative datasets for each feature and sorted the CLUMPs accordingly.

Thus, if $\mu_f^{+}\ge \mu_f^{-}$, the CLUMP averages are sorted in ascending order.

Otherwise ($\mu_f^{+} < \mu_f^{-}$), the CLUMP averages are sorted in descending order.

CLUMPs voting

For each feature, we divided the CLUMPs into two groups according to the following statements:

If $\mu_f^{+}\ge \mu_f^{-}$: CLUMPs with $\mu_f^{\mathrm{CLUMP}_c}\ge \mu_f^{+}$ receive a vote from 1 to the number of CLUMPs, in increments of 1; otherwise the vote is set to 0.

If $\mu_f^{+} < \mu_f^{-}$: CLUMPs with $\mu_f^{\mathrm{CLUMP}_c} < \mu_f^{+}$ receive a vote from 1 to the number of CLUMPs; otherwise the vote is 0.

CLUMPs scoring

For each CLUMP (ranging from zero to c) and each feature (ranging from one to f), we multiplied the feature vote by the ‘feature weight’ ($W_f$) and summed the products to obtain a CLUMP vote. Then we scaled each CLUMP vote to a range from 0 to 1 using the following formula:

$$\mathrm{CLUMPscore}_c=\frac{V_c-\min(V)}{\max(V)-\min(V)}$$

where V is the list of CLUMP votes and $V_c$ is calculated as

$$V_c=\sum_{\mathrm{features}[1,\,f]}\left(\mathrm{vote}_f\subset \mathrm{CLUMP}_c\right)W_f$$

MOnSTER pipeline: occurrences indexes

The two indexes respectively consider: (i) the occurrences of the motifs of each CLUMP in the positive dataset compared with the negative dataset and (ii) the number of positive sequences containing the motifs of each CLUMP compared with the number of negative sequences (Fig. 7c).

CLUMPs occurrences

We calculated the occurrences of the motifs of each CLUMP in the two datasets (positive and negative).
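The occurrence counts described above can be sketched as follows. For one CLUMP and one dataset, the function returns both quantities mentioned in the text: the total number of motif occurrences and the number of sequences containing at least one motif of the CLUMP. This is only a minimal sketch under stated assumptions: motifs are matched as exact literal substrings (degenerate motifs would need pattern matching instead), overlapping occurrences are counted, and the function names `count_occurrences` and `clump_counts` are hypothetical.

```python
def count_occurrences(motif, sequence):
    """Count exact occurrences of motif in sequence (overlaps included)."""
    if not motif:
        return 0
    count = 0
    start = sequence.find(motif)
    while start != -1:
        count += 1
        # Move one position forward so overlapping matches are found too.
        start = sequence.find(motif, start + 1)
    return count


def clump_counts(clump_motifs, sequences):
    """Return (total motif occurrences, number of sequences with >= 1 motif)."""
    total_occurrences = 0
    sequences_with_motif = 0
    for seq in sequences:
        hits = sum(count_occurrences(motif, seq) for motif in clump_motifs)
        total_occurrences += hits
        if hits > 0:
            sequences_with_motif += 1
    return total_occurrences, sequences_with_motif


if __name__ == "__main__":
    clump = ["HWT", "AA"]                        # toy CLUMP, not real data
    toy_dataset = ["GHWTQAAA", "MKK", "HWTHWT"]  # toy sequences, not real data
    print(clump_counts(clump, toy_dataset))
```

Running the same function on the positive and on the negative dataset gives, per CLUMP, the two pairs of counts that the occurrence indexes compare.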
I scores

We propose two ways to calculate the dissimilarity between two sets, called $I_1$ and $I_2$ hereafter.

To obtain $I_1$, for each CLUMP (ranging from zero to c), we divided the number of occurrences of its motifs in the negative dataset by the number of occurrences of the motifs of the same CLUMP in the positive dataset, using the following equation:

$$I_{1\,\forall\,\mathrm{CLUMP}[0,c]}=\frac{1}{2}\left(1-\frac{\sum \triangle_{-}\subset \mathrm{CLUMP}_c}{\sum \triangle_{+}\subset \mathrm{CLUMP}_c}\right)$$

where $\triangle_{-}$ and $\triangle_{+}$ are the numbers of occurrences of the motifs of the CLUMP in the negative and in the positive dataset, respectively.

To obtain $I_2$, for each CLUMP (ranging from zero to c), we divided the number of sequences of the negative dataset that contain at least one motif of the CLUMP by the number of sequences of the positive dataset that contain at least one motif of the same CLUMP, according to the following formula:

$$I_{2\,\forall\,\mathrm{CLUMP}[0,c]}=\frac{1}{2}\left(1-\frac{\sum \mathrm{seq}_{-}\subset \mathrm{CLUMP}_c}{\sum \mathrm{seq}_{+}\subset \mathrm{CLUMP}_c}\right)$$

where $\mathrm{seq}_{-}$ is the number of sequences of the negative dataset containing at least one motif of the CLUMP, and $\mathrm{seq}_{+}$ is the number of sequences of the positive dataset containing at least one motif of the CLUMP.

The ½ factor is applied so that each index takes values between 0 and 0.5 and the two indexes have equal weight in the final score, and (1–Index) is used to consider the degree of dissimilarity rather than similarity.

MOnSTER pipeline: MOnSTER score

The MOnSTER score, for each CLUMP (from zero to c), is the sum of the corresponding CLUMP score and the two I indexes:

$$\mathrm{MOnSTERscore}_c=\mathrm{CLUMPscore}_c+I_{1c}+I_{2c}$$

PRO-MOCA: a method to create motif logos of CLUMPs

To create motif logos for each CLUMP, we developed PRO-MOCA (PROtein-MOtifs Characteristics Aligner), which aligns
protein motifs based on the characteristics of the amino acids, as shown in Supplementary Fig. 8. The first step is to define the alphabets associated with each characteristic that can be used to represent the motifs (Supplementary Fig. 8a). We have defined four alphabets, namely “chemical”, “hydrophobicity”, “charge” and “secondary structure propensity”. These alphabets are easily modifiable and other alphabets can be included. Different CLUMP logos can be obtained depending on the alphabet chosen. Secondly, PRO-MOCA uses the selected alphabet to translate the AA sequence of each motif in a CLUMP into that alphabet (Supplementary Fig. 8b). The third step is the alignment (Supplementary Fig. 8c). Briefly, PRO-MOCA screens the translated motif sequences of a CLUMP, looking for a “summit position” with the highest frequency of the same “letter” of the selected alphabet. Once this position is identified, all motifs are aligned accordingly (Supplementary Fig. 8d). Since the motifs of a CLUMP can have different lengths, after the alignment PRO-MOCA calculates the number of gaps to add at the extremities so that all motifs have the same length. Importantly, gaps are not allowed inside the motif sequences. The last step of the method is to translate the aligned motifs back into the original AA sequences (Supplementary Fig. 8e). The alignment is then ready to be passed to a logo-generating programme. Here we used the tool WebLogo3 (ref. 62).

PPN candidate parasitism protein domain mining analysis

To investigate the relationship between the selected CLUMPs and functional domains in candidate parasitism proteins, we first selected proteins from the positive datasets containing at least one occurrence of a selected CLUMP (311 proteins in total). Then we predicted the protein domains with InterProScan (v5.54-87.0)30 with default parameters.
From the results, we extracted the proteins containing the most frequent predicted domains and considered only entries from MobiDB-lite, Coils, CDD, PANTHER, Pfam and ProSitePatterns. Afterwards, we also predicted the presence of signal peptides (SP) (SignalP 4.1, ref. 63) and transmembrane (TM) regions (TMHMM 2.0, ref. 64). We obtained 258 proteins having at least one CLUMP and one of the aforementioned predicted domains, SP or TM.

In situ hybridisation (ISH) and N. benthamiana agroinfiltration

M. incognita strain “Morelos” was multiplied on tomato (Solanum lycopersicum cv. “Saint Pierre”) grown in a growth chamber (25 °C, 16 h photoperiod). Freshly hatched J2s were collected and ISH was performed as previously described65,66. The M. incognita Minc3s00056g02931/MiEFF72 coding sequence (CDS) lacking the signal peptide for secretion was amplified by PCR with specific primers (EFF72_F: 5’-AAAAAGCAGGCTTCACCATGAATACTGCTGACAAGACACAG-3’ and EFF72_R: 5’-AGAAAGCTGGGTGTTAGAACAAAGCTCGCACTGC-3’) and inserted into the pDONR207 entry vector. The antisense probe was amplified from the entry vector using EFF72_R. The sense probe was amplified using EFF72_F and used as a negative control. Images were obtained with a microscope (Axioplan2, Zeiss, Germany).

The M. incognita MiEFF72 CDS lacking the signal peptide was recombined into pK7FGW2 (P35S:eGFP-MiEFF72) with Gateway technology (Invitrogen). The construct was sequenced (GATC Biotech) and transferred into Agrobacterium tumefaciens strain GV3101. Transient expression was achieved by infiltrating N. benthamiana leaves with A. tumefaciens GV3101 harbouring the GFP-fusion construct, as previously described67. Leaves were imaged 48 h after agroinfiltration with an inverted confocal microscope (LSM880, Zeiss, Germany) equipped with argon-ion and HeNe lasers as the excitation source.
GFP emission was detected selectively with a 505–530 nm band-pass emission filter.

Statistics and reproducibility

To obtain motifs enriched in the positive dataset for both applications, we used the internal calculation of STREME and varied the minimal frequency threshold of motifs in positive protein sequences by tuning the -fp parameter of MERCI.

For the feature selection needed for the CLUMP score calculation, we used the Python library scipy68 to perform the Mann–Whitney test on the positive datasets (1743 proteins for oomycetes and 546 for PPNs) and the negative datasets (3009 proteins for oomycetes and 3849 for PPNs). The threshold for statistical significance is a p-value < 0.05.

For each ISH experiment (each probe), 10,000 Meloidogyne incognita juveniles (J2s) were used, as described in Jaouannet et al.66. With this number, at least a hundred nematodes can be observed at the end of the experiment, despite losses during the various treatments. For Nicotiana benthamiana agroinfiltration, a minimum of three leaves were agroinfiltrated and observed for the construct (or water control) in each experiment, to take into account the possible variability of expression between leaves. ISH with the antisense probe was carried out three times independently and 30 pictures were taken; ISH with the sense (negative control) probe was carried out once and 12 pictures were taken; agroinfiltrations with the GFP fusion or the negative control (water) were carried out three times independently, and 25 and 13 pictures were taken, respectively. All attempts at replication were successful, i.e. signals were observed for the ISH antisense probe and for the GFP fusion in agroinfiltration, and no signal was observed for the negative controls (sense probe or water infiltration, respectively).

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
