Peptide hemolytic activity analysis using visual data mining of similarity-based complex networks

The overall workflow consists of four stages: (i) metadata network visual mining, (ii) HSPN generation and analysis, (iii) scaffold extraction and exploration, and (iv) motif discovery and enrichment (Fig. 1). The first step involves the generation of metadata networks (METNs) and the exploration of critical features related to hemolytic peptides. The second step consists of building HSPNs that represent the chemical space of hemolytic peptides retrieved from StarPepDB; the best HSPN candidates were then selected based on global network descriptors for further analysis. In the third step, representative subsets (scaffolds) were extracted from the best HSPN candidates, built with the optimal t value, and from their respective networks with cutoff t = 0.00, by using sequence alignment and centrality information from each peptide in the graph. Finally, the last step consists of proposing new putative hemolytic motifs by using an alignment-free approach and comparing them with reported hemolytic motifs using benchmark datasets (enrichment analysis) to select the most representative ones. All the steps of this section were performed using the StarPep toolbox, aided by in-house Python scripts and the SeqKit toolkit33.

Fig. 1: Workflow overview of the experimental procedure. Figure created with Inkscape74.

Metadata networks (METNs)

METNs are graphs that use metadata information (e.g., origin, target, activity) from the hemolytic peptides reported in the StarPepDB (refer to the “Materials and Methods” section for a more detailed description). Betweenness Centrality34 was employed as a measure of the relevance of the nodes in the graphs. Four types of METNs were constructed: Database, Function, Origin and Target.

Database METN

Most hemolytic peptides of the StarPepDB come from the SATPdb12, Hemolytik14, DBAASP15, UniProt35, DRAMP36 and CyBase37 databases, which are the six most central nodes in Fig. 2A.
Most peptides are shared by SATPdb, Hemolytik, DBAASP and DRAMP, whereas CyBase contains more unique sets of peptides. This might be because CyBase mainly focuses on collecting information about a specific type of protein, cyclic proteins, which have been shown to possess important advantages such as higher stability and binding affinity compared with linear peptides38.

Fig. 2: Metadata networks (METNs) of Database and Function. A Database METN describes the source databases from which the hemolytic peptides of the StarPepDB have been retrieved. Aquamarine nodes represent the databases whereas blue-green nodes represent hemolytic peptides. The six most central databases were numbered according to their betweenness centrality rank: 1. SATPdb, 2. Hemolytik, 3. DBAASP, 4. UniProtKB, 5. DRAMP_General, 6. CyBase. B Function METN describes the functions associated with hemolytic peptides. Yellow nodes represent the functions reported for these peptides (red nodes are also metadata nodes but are related to hemolytic activity: “toxic”, “toxic to mammals” and “hemolytic”). Blue-green nodes represent hemolytic peptides. The nine most central peptide functions were numbered according to their betweenness centrality rank: 1. hemolytic, 2. antimicrobial, 3. toxic, 4. anti-Gram negative, 5. toxic to mammals, 6. anti-Gram positive, 7. antibacterial, 8. antifungal, 9. anticancer. These networks were visualized in Gephi70 using Force Atlas 2 layout67 and edited with Inkscape74.

In addition, SATPdb has the highest betweenness centrality and node degree value since it is connected to 1817 hemolytic peptides. In contrast, the databases with the fewest hemolytic peptides are NeuroPep39, Defensins40 and Bagel 2 (ref. 41), which have node degrees of 4, 2, and 1, respectively.
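The node degree and the "unique peptides" readings of the Database METN can be illustrated with a minimal sketch. It assumes the METN is available as a plain list of (peptide, database) edges; the function names and the toy edge list are hypothetical and are not part of the StarPep toolbox. Betweenness centrality, which the study uses for ranking, is not computed here.

```python
from collections import defaultdict


def metadata_degrees(edges):
    """Return {metadata_node: number of distinct peptides linked to it}."""
    linked = defaultdict(set)
    for peptide, metadata in edges:
        linked[metadata].add(peptide)
    return {node: len(peptides) for node, peptides in linked.items()}


def unique_peptides(edges):
    """Return {metadata_node: peptides linked to that node and to no other}."""
    sources = defaultdict(set)
    for peptide, metadata in edges:
        sources[peptide].add(metadata)
    unique = defaultdict(set)
    for peptide, nodes in sources.items():
        if len(nodes) == 1:
            unique[next(iter(nodes))].add(peptide)
    return dict(unique)


# Hypothetical toy edge list: peptide IDs and memberships are invented.
toy_edges = [
    ("pep1", "SATPdb"), ("pep1", "Hemolytik"),
    ("pep2", "SATPdb"), ("pep2", "DBAASP"),
    ("pep3", "CyBase"),
    ("pep4", "SATPdb"),
]

degrees = metadata_degrees(toy_edges)
unique = unique_peptides(toy_edges)
print(degrees)
print(unique)
```

In this toy input, the database node with the largest degree plays the role SATPdb plays in Fig. 2A, while a node whose peptides appear nowhere else behaves like the more specialized CyBase.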
Overall, the Database METN can be helpful when searching for the most important databases regarding peptide hemolytic activity as well as the most unique and most specialized databases.

Function METN

When designing therapeutic drugs, understanding other activities associated with hemolytic peptides can be a good starting point for inferring possible mechanisms of action or chemical characteristics of peptides that might be related not only to a certain therapeutic activity but also to hemolysis. A Function METN built with the StarPep toolbox is a fast and easy way to address this question. Figure 2B shows a Function METN of the 2004 hemolytic peptides reported in the StarPepDB. Evidently, the most central activities are “hemolytic”, “toxic” and “toxic to mammals” since the peptides of study are hemolytic and the metadata nodes are hierarchically related (colored red in Fig. 2B with centrality ranks: 1, 3, and 5, respectively). However, most of these peptides are also related to antimicrobial activity and hierarchically related metadata: antibacterial, anti-Gram positive, anti-Gram negative, antifungal, etc. In fact, these metadata comprise the nine most central nodes in the Function METN.

Since the main target of AMPs is the bacterial cell membrane, which is disrupted by several reported modes of action42, it is plausible that similar modes of action can also target and disrupt human cells, specifically RBCs. Many studies have proposed that, due to the positive charge of many AMPs, they can selectively disrupt the negatively charged membranes of bacteria while not affecting the neutral membranes of mammals43,44. However, it has been demonstrated that several AMPs (some with high antimicrobial activity) can also disrupt mammalian cells, causing hemolysis in RBCs42,45.
In fact, the Function METN shows that 94.46% of the 2004 peptides that comprise the hemolytic space have both antimicrobial and hemolytic activity.

Origin METN

This type of METN helps to easily identify the origin of hemolytic peptides, whether they are synthetic or isolated from living organisms. Figure 3A shows the complete Origin METN in the dashed box. The central part of the METN was zoomed in and depicted in the center of Fig. 3A. Looking at the complete Origin METN, three distinctive regions can be observed: an outer ring, a middle ring and a central network. The outer ring represents peptides that were isolated from living organisms but have not been chemically synthesized. One example is the peptide StarPep_06954 (ref. 46), whose only origin metadata node corresponds to Caenorhabditis elegans. The middle ring represents peptides with nodes of degree zero.

Fig. 3: Metadata networks (METNs) based on Origin and Target. A Origin METN describes the origin of the hemolytic peptides (e.g., synthetic, isolated from Halocynthia aurantium, etc.). The dashed box represents the whole Origin METN whereas the bigger figure represents the central part of the Origin METN that was zoomed in for a better visualization. Blue-green nodes represent peptides while violet nodes represent the origin of the peptides. B Target METN describes the target of the hemolytic peptides (e.g., RBCs, Gram-positive bacteria, etc.), which is useful information when exploring associations between therapeutic and hemolytic activities. The dashed box represents the whole Target METN whereas the bigger figure represents the central part of the Target METN that was zoomed in for a better visualization. Blue-green nodes represent peptides whereas green nodes represent the reported target of the peptides.
These networks were visualized in Gephi70 using Force Atlas 2 layout67 and edited with Inkscape74.

On the other hand, the central network shows peptides that have only synthetic origin (the most central blue-green nodes) and peptides isolated from living organisms that have also been chemically synthesized (nodes connected to the central violet metadata node and connected to radial violet nodes). Radial violet nodes connected in a chain-like way represent hierarchical taxonomic ranks that are related to the species from which a particular peptide was obtained. For instance, successive metadata nodes are connected in the following manner: Urochordata->Ascidiacea->Pleurogona->Stolidobranchia->Pyuridae->Halocynthia->Halocynthia aurantium. The H. aurantium metadata node is then connected to 6 peptide nodes isolated from that species.

Over half of the hemolytic peptides (1060) are synthetic constructs, whereas the rest are isolated from various organisms. Of the top 20 most central origin metadata nodes (synthetic construct not included), half belong to the class Amphibia. This is expected because most of the hemolytic peptides in the StarPepDB are antimicrobial (Fig. 2B) and a significant part of them have been isolated from frogs and toads, which are known to produce broad-spectrum AMPs in the granular glands of their skin as a defense strategy47,48,49.

Target METN

An outer ring and a central network can be observed in this METN (Fig. 3B). The outer ring of peptides seen in the dashed box consists of peptides that do not have a metadata node related to a target. This metadata network works in the same fashion as the Origin METN, where chain-like nodes represent the hierarchical taxonomic ranks, but instead of representing the origin of the peptide, it displays the target of the peptide, i.e., the species/cell type in which a certain peptide activity has been evaluated. Evidently, the main target is human erythrocytes (colored red in Fig. 3B) since we are exploring the hemolytic peptide space. Other central targets include Escherichia coli, Staphylococcus aureus, Pseudomonas aeruginosa, Bacillus subtilis and Candida albicans. They are among the six most central metadata nodes in this METN. This shows that several of the hemolytic peptides have been evaluated as potential AMPs against important human pathogens such as P. aeruginosa, which has become a real concern in hospital-acquired infections due to the emergence of drug resistance50.

GraphML files of METNs and the descriptor information from each node are available in SM2.

Half-space proximal networks (HSPNs)

The HSPN is a special type of network that was employed to represent the chemical space of hemolytic peptides based on sequence-based molecular descriptors (refer to the “Materials and Methods” section). The properties of the HSPNs were studied based on their global network parameters consisting of the number of edges, modularity, density, average clustering coefficient (ACC), number of communities and singletons, among others. Such statistics can provide a good picture of the topology of the graphs and help select networks with the cutoff t that best projects the chemical space of hemolytic peptides.

Our results are consistent with another study that showed that there was little change in the global network parameters when networks are created within the cutoff t range 0.00–0.45 (ref. 25). This is because very few edges are removed within this range. In fact, on average, the number of edges removed at \(t=0.50\) corresponds to 1.9% of the initial edges at \(t=0.00\) (see SM3.6).

Moreover, it can be observed that networks generated by different metric measures capture the similarity between peptides differently (Fig. 4). Based on their behavior, the networks used in this study can be roughly grouped into three classes: Class I: Angular Separation; Class II: Bhattacharyya, Euclidean and Soergel; and Class III: Chebyshev.
The influence of the metric measure on the global parameters of the networks is described below. All global network parameters calculated for each metric are provided in SM3.

Fig. 4: Global network parameters of HSPNs created with different metrics and similarity cutoff values t. The properties of the HSPNs were analyzed based on their global network parameters, including A modularity, B number of communities, C density, D singletons (atypical sequences or outliers), E average clustering coefficient (ACC), and F diameter. These parameters provide a comprehensive overview of the graph topology, aiding in the selection of networks with the optimal cutoff t for accurately representing the chemical space of hemolytic peptides, as well as facilitating comparisons between different metric measures. ACC, average clustering coefficient. This figure was created with ggplot2 R package75 and edited with Inkscape74.

Modularity

This is a measure of network connectivity which indirectly represents how well-defined communities are in the graph and is associated with the number of communities. Graphs generated with Angular Separation (AS) initially possess higher modularity values compared to the other metrics; however, their modularity remains relatively low at higher t values (0.550 at t = 0.95), whereas the other four metrics increase their modularity to values near 1. On the other hand, Chebyshev (Ch) networks show the lowest modularity at low cutoff values, but it then increases to high values comparable with Soergel (So), Euclidean (Eu) and Bhattacharyya (Bh). So- and Eu-derived networks behave quite similarly over the whole range of t values, whereas Bh networks initially behave like Eu and So networks, but then diverge at t = 0.70 (Fig. 4A).
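As an illustration of what the modularity score measures, the sketch below evaluates the standard Newman modularity, Q = Σ_c [L_c/m − (d_c/2m)²], for a given partition of an undirected, unweighted graph without self-loops. This is a hedged stand-in: it only scores a partition that is supplied to it, whereas the study obtains communities with the Louvain method inside the StarPep toolbox and Gephi. The function name and the toy graph are assumptions made for the example.

```python
def modularity(edges, community):
    """Newman modularity of a partition of an undirected, unweighted graph.

    edges: list of (u, v) pairs without self-loops or duplicates.
    community: dict mapping every node to its community label.
    """
    m = len(edges)
    if m == 0:
        return 0.0
    internal = {}    # L_c: edges with both endpoints in community c
    degree_sum = {}  # d_c: sum of node degrees in community c
    for u, v in edges:
        cu, cv = community[u], community[v]
        degree_sum[cu] = degree_sum.get(cu, 0) + 1
        degree_sum[cv] = degree_sum.get(cv, 0) + 1
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + 1
    return sum(
        internal.get(c, 0) / m - (d / (2 * m)) ** 2
        for c, d in degree_sum.items()
    )


# Toy graph (invented): two triangles joined by a single bridge edge.
toy_edges = [
    ("a", "b"), ("b", "c"), ("a", "c"),
    ("d", "e"), ("e", "f"), ("d", "f"),
    ("c", "d"),
]
two_groups = {"a": 1, "b": 1, "c": 1, "d": 2, "e": 2, "f": 2}
one_group = {node: 1 for node in two_groups}

q_two = modularity(toy_edges, two_groups)
q_one = modularity(toy_edges, one_group)
print(q_two, q_one)
```

Splitting the toy graph into its two triangles scores higher than placing every node in one community, which is the sense in which modularity reflects how well defined the communities are.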
An adequate selection of modularity is important since highly sparse networks with an elevated number of communities would not provide useful information, as several of the resulting communities would be mere artifacts.

Density

Density is the ratio between the edges present in the network and the maximum number of possible edges. Similarity networks have been shown to have an inverse relationship between similarity threshold (t) and density26,51. The same pattern is observed for all metrics, but with some notable variations. Here, we can identify three behaviors according to the three classes of metrics. AS networks have the highest density in the entire range of t, whereas Class II metrics (i.e., Bh, Eu, and So) have the lowest density until t = 0.70. On the other hand, Ch networks not only have an intermediate initial density but also show the biggest variation of density along the whole range of t (Fig. 4C). In order to select adequate networks, we should choose graphs that are neither too dense nor too sparse, since the former would hamper retrieval of useful information whereas the latter would lose information52. Density values below 0.20 are desired as they allow us to properly understand the network while preserving high modularity. HSPNs are particularly well suited because they intrinsically show low densities. In fact, the highest density value in this study is 0.020 (0.00_AS network).

Average clustering coefficient (ACC)

This measures the connectivity of the network, and it has been previously studied on molecular similarity networks varying the similarity cutoff t. One study showed that the ACC maximum peak correlates with the best clustering outcome and is a good indicator for finding the appropriate value of t51. In our study, three behaviors related to the metric class can be observed again. AS networks have the highest ACCs in the whole range of t with their local maximum at t = 0.95.
On the other hand, Ch networks start with very low ACCs, which increase at t = 0.65 and reach their maximum peak at t = 0.90. Finally, Class II metrics have the lowest ACCs in the entire range of t with their maximum peaks at 0.70 (Eu), 0.80 (So) and 0.85 (Bh) (Fig. 4E).

Communities and singletons

The number of communities determined with the Louvain method, the number of singletons D0 (nodes of degree zero) and the number of singletons GC (nodes disconnected from the giant component) were calculated to select the networks with the most reasonable values of these parameters. When t = 0.00, HSPNs have the minimum spanning tree as a subgraph; this implies that at this t value all nodes are connected. In other words, neither singletons D0 nor singletons GC are found. Regarding the number of communities at t = 0.00, all metric networks showed similar values (on average 8 communities). At higher t values, the number of communities and singletons D0 increase dramatically for all the metric networks, except for AS networks (Fig. 4B–D). This is expected: as more edges are removed, more nodes are isolated, and singletons are then counted among the communities. Hence, an appropriate t value should be selected that strikes a balance between singletons (atypical peptides) and communities that reflect a real chemical relationship.

Other global network parameters were also calculated to characterize the networks, such as the diameter of the graph (Fig. 4F), the average path length and the average degree (see SM3.6). To find the best t value for each metric network, we should look for a compromise between the best parameter values for each descriptor, i.e., networks with low density, with neither too many clusters (<20) nor too many singletons (~15–30), that retain high ACC and high modularity.
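The descriptors defined above (density, ACC, singletons D0 and singletons GC) can be sketched for a small undirected graph as follows. This is a minimal illustration under stated assumptions, not the StarPep or Gephi implementation: the function names and the toy graph are hypothetical, node labels must be sortable, and nodes with fewer than two neighbors are given a clustering coefficient of 0 and still count in the average.

```python
from collections import deque


def build_adjacency(nodes, edges):
    """Adjacency sets for an undirected simple graph (self-loops ignored)."""
    adj = {node: set() for node in nodes}
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)
    return adj


def density(adj):
    """Edges present divided by the maximum number of possible edges."""
    n = len(adj)
    if n < 2:
        return 0.0
    m = sum(len(nbrs) for nbrs in adj.values()) / 2
    return m / (n * (n - 1) / 2)


def average_clustering(adj):
    """Mean local clustering coefficient (0 for nodes of degree < 2)."""
    if not adj:
        return 0.0
    total = 0.0
    for nbrs in adj.values():
        k = len(nbrs)
        if k < 2:
            continue
        links = sum(1 for u in nbrs for v in nbrs if u < v and v in adj[u])
        total += links / (k * (k - 1) / 2)
    return total / len(adj)


def singletons_d0(adj):
    """Number of nodes of degree zero."""
    return sum(1 for nbrs in adj.values() if not nbrs)


def singletons_gc(adj):
    """Number of nodes outside the largest connected component."""
    seen, largest = set(), 0
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        queue, size = deque([start]), 0
        while queue:
            node = queue.popleft()
            size += 1
            for nbr in adj[node]:
                if nbr not in seen:
                    seen.add(nbr)
                    queue.append(nbr)
        largest = max(largest, size)
    return len(adj) - largest


# Toy graph (invented): a triangle with a tail, a separate pair, one isolate.
toy_adj = build_adjacency(
    "abcdefg",
    [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d"), ("f", "g")],
)
print(density(toy_adj), average_clustering(toy_adj))
print(singletons_d0(toy_adj), singletons_gc(toy_adj))
```

The toy graph shows why the two singleton counts differ: the isolated node is the only D0 singleton, while the separate pair also lies outside the giant component and is therefore counted among the GC singletons.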
The global descriptors of the selected networks with their best cutoff value t and their respective networks constructed with t = 0.00 (10 networks in total) are shown in Table 1.

Table 1 Global network parameters of HSPNs with their best t values and their corresponding network at t = 0.00

Finally, we calculated the probability of k (also known as the degree distribution) for each of the selected networks (Fig. 5). Overall, all networks show a right-skewed bell-shaped distribution with high probability of intermediate node degrees. Evidently, plots on the left (t = 0.00) show a probability of zero for singletons (k = 0), whereas plots on the right (best value t) tend to have a higher probability when k = 0. In addition, networks with the best t value have smaller maximum and average degrees compared with same-metric networks at t = 0.00. Thus, when comparing networks with the same metric but varying the cutoff value (t = 0.00 vs. best cutoff t), both seem to retain a similar degree distribution. However, marked differences appear when comparing networks with different metrics. AS networks tend to have a wider distribution range and a higher average degree whereas Ch networks show intermediate values, and networks constructed with Class II metrics show a similar distribution shape among them and have the lowest distribution ranges and average degrees of all metrics. Figure 6 shows the graphical representation of the 10 selected HSPNs.

Fig. 5: Probability of k (degree distribution) of the HSPNs with cutoff t = 0.00 (left) and with the best cutoff t (right) presented in Table 1. The average degree is presented next to the name of the corresponding network. A 0.00_AS: 32.15. B 0.90_AS: 30.44. C 0.00_Bh: 12.82. D 0.75_Bh: 11.37. E 0.00_Ch: 27.24. F 0.65_Ch: 20.31. G 0.00_Eu: 12.75. H 0.70_Eu: 10.30. I 0.00_So: 14.67. J 0.70_So: 11.56. This figure was created with ggplot2 R package75 and edited with Inkscape74.

Fig. 6: Graphical representation of HSPNs with t = 0.00 (left) and networks with the best t value for each metric (right). Node colors represent communities of peptides, and the size of the node represents the HB centrality value. Layout: Fruchterman-Reingold69. Networks were created with StarPep toolbox26, visualized in Gephi70 and edited with Inkscape74.

HSPNs scaffolds

A total of 240 scaffolds were extracted from the 10 HSPNs (SM4.1). To better understand the effect of the centrality measure, type of alignment and cutoff value s when constructing the scaffolds, several pairwise similarity comparisons between scaffolds were carried out using the Jaccard similarity coefficient (JSC)53.

Metric comparison

We compared the type of metric measure used to build the parental networks of the scaffolds. For this comparison, scaffolds (t = 0.00) built with the same combinations of centrality, alignment, and cutoff s but with different metrics were evaluated (SM 4.3.1). Each pair of scaffolds is represented as a point in Fig. 7.

Fig. 7: Pairwise Jaccard similarity coefficient (JSC) between scaffolds from networks constructed with different metrics when t = 0.00. A, B HB centrality. C, D HC centrality. A, C Global alignment. B, D Local alignment. The cutoff s represents the similarity cutoff applied to extract the scaffolds whereas the percentage on the y-axis represents the percentage of the JSC, which is the number of common peptides between a pair of scaffolds with respect to the union of the peptides of these scaffolds. The higher the percentage, the higher the number of common peptides between pairs of scaffolds. This figure was created with ggplot2 R package75 and edited with Inkscape74.

In all plots of Fig. 7, when s ≥ 0.60, all scaffold pairs constructed with Class II metrics (i.e., Bh, Eu, So) show the highest similarity percentage compared with the pairs from other combinations of metrics.
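The JSC percentage plotted in Fig. 7 follows directly from its definition: common peptides divided by the union of the peptides of the two scaffolds. The sketch below assumes each scaffold is available as a collection of peptide identifiers; the function name and the identifiers are hypothetical.

```python
def jaccard_percent(scaffold_a, scaffold_b):
    """Jaccard similarity coefficient between two scaffolds, as a percentage."""
    a, b = set(scaffold_a), set(scaffold_b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty scaffolds
    return 100.0 * len(a & b) / len(union)


# Hypothetical scaffolds: the peptide identifiers are invented.
scaffold_eu = {"P1", "P2", "P3", "P4"}
scaffold_so = {"P3", "P4", "P5"}

shared = jaccard_percent(scaffold_eu, scaffold_so)
only_in_so = scaffold_so - scaffold_eu
print(shared, only_in_so)
```

The set difference gives the peptides unique to one scaffold of a pair, which is the quantity needed to judge whether one scaffold is practically a subset of the other.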
Moreover, scaffold pairs in which one scaffold is extracted with the AS metric show the smallest similarity percentage at almost any cutoff value s. In contrast, scaffolds selected with the Ch metric have an intermediate similarity percentage when compared with scaffolds extracted by other metrics.

These results agree with the earlier observation that the five metrics tend to show three types of behavior (three classes of metrics). The density (Fig. 4C) and the degree distribution (Fig. 5) of the networks with different metrics are the global descriptors most correlated with the results from the percentage similarity among scaffolds. Thus, it is possible to reduce the number of highly similar scaffolds by using only those HSPNs with the metrics that differ most in the global network parameters. In this case, the Class II metrics (Bh, Eu, and So) are the metric measures with the most similar behavior since they produce similar networks and scaffolds. Therefore, it was decided to conduct the following analyses using only one of the metrics of Class II: Euclidean. This metric was chosen since it is the default metric used in other studies24,25, and it would be advantageous to compare its performance with the other metrics not previously used in this type of study. Overall, this step allowed us to reduce the redundancy in the scaffold representativity from 240 to 144 scaffolds (SM 4.2).

Cutoff comparison

A cutoff value t is not mandatory when constructing HSPNs since at t = 0.00 these networks already have low densities under 0.20. However, the topology, characterized by global network features, tends to vary with t, as was demonstrated in the “Half-Space Proximal Networks” section. Thus, it is important to evaluate the effect of selecting a cutoff value (or not) when constructing representative scaffolds of the chemical space. The JSC was calculated between pairs of scaffolds extracted by using the same metric but at different cutoff values (t = 0.00 vs. best t value); see SM4.3.2 (Fig. 8).

Fig. 8: Pairwise Jaccard similarity coefficient (JSC) between scaffolds from networks constructed with the same metric but differing in their t values (t = 0.00 vs. best t value). A, B HB centrality. C, D HC centrality. A, C Global alignment. B, D Local alignment. This figure was created with ggplot2 R package75 and edited with Inkscape74.

A marked difference was observed when these scaffold pairs were constructed with different types of centralities. Scaffolds constructed with HB centrality (Fig. 8A, B) tend to have more unique peptides at low s values, and the number of common peptides between scaffold pairs tends to increase when s increases. A similar pattern was observed in Fig. 7. However, when the same scaffolds are constructed replacing HB centrality with HC centrality, all scaffold pairs tend to share more than 89.50% of peptides regardless of the value of s (SM4.3.2.2) (Fig. 8C, D). Furthermore, the same patterns are preserved when any alignment type is applied. Hence, when generating scaffolds using HC centrality, it is unnecessary to first find the best t value for the parental networks since similar scaffolds will be obtained using networks with t = 0.00.

Alignment comparison

A clear pattern can be observed when extracting scaffolds using either global or local alignment (Figs. 7, 8). In general, local alignment tends to discriminate more strongly at low s values than global alignment. Hence, scaffold pairs extracted with local alignment at such low s values have a lower similarity percentage than the analogous scaffold pairs extracted using global alignment.

In addition, when comparing the similarity percentage of scaffold pairs extracted using the same parameters but differing in alignment type, the same behavior was observed independently of the metric, type of centrality or t value used; see Fig. 9.
Scaffold pairs differing only in their alignment type tend to have a low percentage of similarity at low s values, which might indicate that these methods capture the similarity between peptides differently. However, when analyzing the proportion of unique peptides between these scaffold pairs, scaffolds extracted using local alignment are practically a subset of scaffolds extracted when using global alignment. In fact, the average number of unique sequences in local scaffolds when comparing them with their global counterparts at any cutoff s is 16.19 (SM4.3.3). An example is provided for the scaffold pair 0.00_AS_HB_G_0.40 and 0.00_AS_HB_L_0.40 (Fig. 10).

Fig. 9: Pairwise Jaccard similarity coefficient (JSC) between scaffolds from networks constructed with the same metric but differing alignment type. A, B HB centrality. C, D HC centrality. A, C networks with t = 0.00. B, D networks with best cutoff t: AS (0.90), Ch (0.65), Eu (0.70). This figure was created with ggplot2 R package75 and edited with Inkscape74.

Fig. 10: Size comparison of scaffold pairs generated from the network 0.00_AS. Pink area (G) represents the peptide sequences unique to the scaffold 0.00_AS_HB_G_0.40, green area (L) represents the sequences unique to the scaffold 0.00_AS_HB_L_0.40. The intersection of pink and green represents the number of common peptides between these two scaffolds. The area-proportional Venn diagram was created using DeepVenn76 and edited with Inkscape74.

Centrality comparison

Pairwise comparisons of the scaffolds constructed using the same parameters but changing the centrality measure show a trend like the pairwise comparisons presented before (SM4.3.4).
This implies that the type of centrality used to extract the scaffold will affect the sequences that are removed/retained, especially at low s values.

On the other hand, when comparing centrality measures, the JSC between scaffold pairs extracted from networks with the best t value tends to be higher than the JSC of scaffold pairs from networks with t = 0.00. This pattern is clearer at low s values (Fig. 11).

Fig. 11: Pairwise Jaccard similarity coefficient (JSC) between scaffolds from networks constructed with the same metric but differing in the centrality type. A, B Global alignment. C, D Local alignment. A, C networks with t = 0.00. B, D networks with best cutoff t: AS (0.90), Ch (0.65), Eu (0.70). This figure was created with ggplot2 R package75 and edited with Inkscape74.

All scaffolds presented in this section can be used in many applications. For instance, they can be used as training datasets for both ML-based and Multi-Query Similarity Searching (MQSS) prediction models of hemolytic peptides. In fact, a recent study demonstrated that MQSS models based on the scaffolds identified in this study outperformed state-of-the-art ML-based model classifiers54. The advantage of using these scaffolds is that they store information on central and important peptides as well as outliers or atypical hemolytic peptides while avoiding overrepresentation of certain peptide classes (sampling bias). Each scaffold bears a unique type and amount of information on the hemolytic peptide space, and one scaffold can be more suitable than another depending on its intended use. Scaffolds extracted at low cutoff s values tend to cover fewer peptides of the original space, whereas higher s values capture more information of the space, but peptide overrepresentation might be present. Figure 12 depicts an example of the scaffold coverage when varying the cutoff s.

Fig. 12: Barplot showing the coverage of the scaffolds 0.00_AS_HB_L at different s values. Scaffold representations are shown below their cutoff s values. This figure was created with ggplot2 R package75 and edited with Inkscape74.

Hemolytic motif discovery and enrichment

Motif discovery

Peptides from each community were used as input sequences to uncover new hemolytic motifs within the communities’ diversity by means of STREME, an alignment-free method55,56,57,58,59,60,61. Table 2 shows a sample of the 42 new motifs discovered using clusters of HSPNs (t = 0.00) created with different metrics. Twelve motifs were found from 6 clusters of the network 0.00_AS, 14 motifs were discovered from 4 clusters of the network 0.00_Ch and 16 motifs were discovered from 5 clusters of the network 0.00_Eu. Only four motifs were detected with all three metrics: GLP, MFTKL, ERBADE and VCTRN. It is worth mentioning that several other motifs were similar but not identical, such as GLP/GLPV or VGGTCN/GGTCN. In addition, 15 motifs were discovered without considering the community diversity by using all 1647 hemolytic peptides as input sequences. All these motifs were grouped as HSPNs motifs. After removing duplicated motifs, 50 HSPNs motifs were discovered (SM5.1.2).

Table 2 Motifs discovered by STREME using the community information from the HSPNs created using Angular Separation, Chebyshev, and Euclidean metrics with t = 0.00

Two previous reports on ML models for predicting hemolytic activity of peptides have also reported hemolytic motifs, namely HemoPI8 and HAPPENN1. HemoPI reported 21 motifs extracted using MERCI software that were enriched in positive sequences from the HemoPI-1 and HemoPI-2 datasets, whereas the HAPPENN motifs are the top 20 motifs found exclusively in the positive dataset of HAPPENN. No HSPN-derived motifs were found among the reported ones.
To generate a unique list of non-redundant hemolytic motifs, HSPNs motifs were combined with the previously reported ones, resulting in 91 putative motifs. Similar motifs were then combined into consensus motifs, resulting in 57 non-redundant motifs (SM5.1.2 and SM5.1.3).

Motif enrichment

To identify and validate the most representative hemolytic motifs and remove some artifacts from the 57 potential hemolytic motifs, we conducted enrichment analyses using the SEA method on three different datasets: HemoPI-1, StarPepDB and Big-Hemo (SM5.2). Motifs not reported as significant in at least one dataset were removed. The resulting 47 hemolytic motifs, sorted by the average enrichment ratio across all datasets, are presented below (newly discovered motifs by HSPNs are shown in red): MFTLK, ALKAIS, GTCN, WKSFJK, VCGETC, WKK, AKKAL, GETCV, CYCR, LKKL, CVCV, ISWIK, RFC, LHTA[KL], FLHSAK, CSW, LWKT, FLGTI, GAVLKV, PGC, KKILG, KITK, KHI, LGKL, KWK, VNWK, K[GT]AGK, VCT, ALW, SWP, HIF, LLKK, [VI]LDTJ, CRR, KLL, JGKL, FKK, GAIA, VLK, GLP, PKIF, GKEV, GTIS, AAAK, GCS, IAS, MAL (Table 3).

Table 3 Hemolytic motifs that have all their E-value ranks less than 37 sorted by their average enrichment ratio of the three datasets: HemoPI-1, StarPepDB and Big-Hemo

These motifs might be involved in the mechanisms of action of hemolytic peptides as well as in antimicrobial activity, but further studies are needed to corroborate this assumption. Another possible use of these motifs is as a toxic signature, where proteins containing several of these motifs could be expected to have a relatively high hemolytic activity in comparison with proteins with few or no hemolytic motifs.
Table 4 shows an example of three pairs of peptides whose hemolytic activity is related to the number of hemolytic motifs present in their sequences.

Table 4 An example of the use of hemolytic motifs as toxic signatures

We decided to further explore this hypothesis by examining the relation between the number of hemolytic motifs in a peptide and its likelihood of being hemolytic (SM5.3). To obtain the predicted hemolytic activity of a peptide, we used the consensus of two different model classifiers that were identified in a previous report to have a robust performance after a multiple comparison54. This experiment was carried out using three datasets: antibacterial, antiviral and FDA-approved. All three datasets have peptides with lengths up to 100 AAs.

For the antibacterial and antiviral datasets, a general pattern can be identified. Most peptides without any of the reported hemolytic motifs tend to be non-hemolytic. When peptides have one or more motifs, they are mostly hemolytic. The hemolytic/non-hemolytic ratio becomes more pronounced as the number of motifs increases. Interestingly, peptides with a high number of motifs were predicted to be exclusively hemolytic. Nevertheless, it is worth noting that only a few peptides actually contained a high number of these motifs (Fig. 13).

Fig. 13: Relation between the number of hemolytic motifs and the hemolytic activity predicted using two model classifiers: “SVM + Motif (HemoPI-1) based”8 and “MQSSM-I1”54. This analysis was performed on three datasets: Antibacterial, Antiviral and FDA-approved. A, B and C represent boxplots displaying the number of hemolytic motifs in a peptide and the predicted hemolytic activity obtained using the model classifiers. D, E and F show the absolute frequency of hemolytic and non-hemolytic peptides for each group (number of hemolytic motifs).
This figure was created with ggplot2 R package75 and edited with Inkscape74.

The activity of the peptide might be another aspect to consider when conducting this type of analysis. For example, antibacterial peptides without any reported motif tend to be non-hemolytic, but a high number of peptides without motifs are also predicted to be hemolytic. On the other hand, antiviral peptides without any motifs are almost exclusively non-hemolytic. Therefore, the absence of reported hemolytic motifs does not imply that the peptides are not hemolytic. The same is true when peptides contain one or more hemolytic motifs: it does not necessarily mean they are hemolytic, but the higher the number of motifs, the higher the possibility that peptides are hemolytic (Fig. 13).

When the same analysis was conducted on 49 FDA-approved peptides, a major difference was observed. Only three peptides were predicted as hemolytic54: Glatiramer acetate (Th1113/ seq_32) contained one hemolytic motif, whereas the two others, Lucinactant (Th1146/ seq_41) and Gramicidin D (Th1024/ seq_8), did not contain any hemolytic motif. More importantly, the trend was opposite to what was found in the antibacterial and antiviral datasets. The ratio between hemolytic/non-hemolytic peptides containing more than one motif was inverted, i.e., hemolytic peptides with one or two motifs were scarce compared to non-hemolytic peptides containing the same number of motifs. Another important detail is that, despite having the same maximum length (100 AAs) as the antibacterial and antiviral datasets, peptides from the FDA-approved dataset contained at most two reported motifs (Fig. 13C, F).
This result agrees with the fact that approved peptides must be safe and must not be hemolytic/toxic, unless the mode of application does not involve direct contact with the bloodstream, as is the case for Gramicidin D, a commercial antibacterial drug that is only applied on the skin because of its high hemolytic activity62.
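The motif-counting step behind the toxic-signature analysis can be sketched as follows. This is a minimal illustration, not the procedure of SM5.3: it assumes that bracketed positions such as [GT] are alternatives in the regular-expression sense and that "J" stands for Leu or Ile (the usual IUPAC ambiguity code), and it handles "J" only outside brackets. The function names and the example sequences are hypothetical, and only five of the 47 motifs are used.

```python
import re

# Assumption: "J" denotes Leu or Ile; other letters and bracket classes are
# passed through unchanged as regular-expression syntax.
AMBIGUITY = {"J": "[IL]"}


def motif_to_regex(motif):
    """Compile a motif string such as 'K[GT]AGK' or 'JGKL' into a regex."""
    return re.compile("".join(AMBIGUITY.get(ch, ch) for ch in motif))


def count_motifs(sequence, motifs):
    """Number of distinct motifs that occur at least once in the sequence."""
    return sum(1 for motif in motifs if motif_to_regex(motif).search(sequence))


# A subset of the enriched motifs listed in the text.
motif_subset = ["WKK", "LKKL", "K[GT]AGK", "JGKL", "GLP"]

# Hypothetical sequences invented for the example.
rich = "GLPKWKKLLKKLIGKL"
poor = "SSSSPEE"

print(count_motifs(rich, motif_subset), count_motifs(poor, motif_subset))
```

As the text notes, the count is only a signature: a peptide without any motif can still be hemolytic, and a peptide with motifs is not necessarily hemolytic, so such a count would complement a classifier rather than replace it.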
