Comparative genomic analysis of thermophilic fungi reveals convergent evolutionary adaptations and gene losses

Phylogeny and genomic featuresThe dataset we used for this analysis comprises genomes of 29 thermophiles, 8 thermotolerant fungi, and 42 mesophiles, totaling 79 species from Ascomycota, Basidiomycota, and Mucoromycota phyla (Fig. 1a and Supplementary Data 1). We used Mesquite (http://www.mesquiteproject.org) to reconstruct the ancestral state of the clades to find the common ancestor of Chaetomiaceae, Thermoascaceae, and Trichocomaceae families all to be thermophilic. Interestingly, only the Chaetiomiaceae revealed two or more thermophilic-mesophilic transitions, with ancestral clades within the families being thermophiles. Similar results were found by Hensen and colleagues10, where variation in genome features stems primarily from within-family evolution.Fig. 1: Phylogeny and genome features of thermophilic, mesophilic, and thermotolerant fungi.a Maximum likelihood tree of 79 fungi. The branch colors represent the ancestral ecological character states reconstructed with Mesquite using the Mk1 likelihood reconstruction model. b Assembly size in Mbp shows the distribution of repeats, gaps, and non-repeat content. c Normalized counts of CAZymes (Carbohydrate Active Enzymes) split into the classes AA (Auxiliary Activities), CBM Carbohydrate Binding Modules, CE (Carbohydrate Esterases), EXPN (Expansins), GH (Glycoside Hydrolases), GT (Glycoside Transferases, Myosin (Myosin Motor), and PL (Pectate Lyases). d Normalized counts of secondary metabolite clusters according to SMURF predictions.We observed that the genome sizes in thermophilic fungi are smaller overall (p = 3.3 × 10−5 – two-tailed Wilcoxon rank sum test [WT]) and when compared with their mesophilic counterparts (most phylogenetically close in our dataset). The pair Acremonium thermophillum (28.89 Mbp) and Coniochaeta lignaria (42.38 Mbp) is a good example, with the genome of thermophile 13.49 Mbp shorter than its counterpart. (Fig. 1b). Similar trends can be observed in Talaromyces and Thermomyces, with average thermophilic genomes reduced by 38.12%. The pattern is less clear in cases like Rasamsonia, which have genomes closer in size to Talaromyces, and the Mucoromycota phylum (Fig. 2b).Fig. 2: Pangenome of Chaetomiaceae and Eurotiomycetes.a Rarefaction curve of Chaetomiaceae pangenome split by Core/Shell/Unique and Mesophilic and Thermophilic. The bands represent the standard deviation of each average point. b Rarefaction curve of Eurotiomycetes pangenome split by Core/Shell/Unique in Mesophilic and Thermophilic. The bands represent the standard deviation of each average point in (a, b). c Enrichment analysis of GO terms lost in both thermophilic Chaetomiaceae and Eurotiomycetes core pangenomes compared to Mesophilic in the same clade. The X-axis represents the log10(corrected p-value) using the Benjamini-Hochberg procedure.Thermophilic fungi are known to secrete a wide range of thermostable enzymes, in particular CAZymes7 (Fig. 1c, and Supplementary Data 2). Among the classes of CAZymes within Chaetomiaceae, the auxiliary activities (AA) (p = 0.047 – WT), carbohydrate esterases (CE) (p = 0.042), and pectate lyases (PL) (p = 0.001 – WT) were significantly contracted in thermophiles. In Eurotiomycetes, in contrast, carbohydrate binding modules (CBM – WT) and expansins (EXPN – WT) (both with p = 0.039) were significantly reduced. When the human pathogens (Triru1, Hisca1, and Cocim1) were excluded from the analysis, glycoside hydrolases (GH) were shown to be significantly reduced in thermophiles as well (p = 0.0015 – WT). This is because human pathogen genomes encode a reduced number of GH compared to other mesophilic fungi (p = 0.023 – WT) (Fig. 1c).Thermophiles have a reduced core genome contentTo further investigate the reduced genome size and, consequently, gene content in thermophilic fungi, we performed a pan-genome analysis with species belonging to the Chaetomiaceae family and Eurotiomycetes class in parallel. Interestingly, the rarefaction curve of unique and shell genes has a similar distribution in thermophiles and mesophiles. However, the core gene set of thermophiles is strikingly reduced in both clades, revealing this conserved pattern even in distantly related groups (Fig. 2a, b). Next, we investigated the GO terms significantly reduced in thermophiles from both clades (Fig. 2c). The most significant term reduced in thermophiles was O-methyltransferase activity, followed by phospholipase activity and phospholipid catabolic process. Several terms involved in primary metabolism (i.e., hydrolase activity, acyl-CoA metabolic process, and glucose catabolic process) and defense mechanisms were also contracted (i.e., xenobiotic transmembrane transporter activity, response to UV (Ultra-Violet)).In addition, we investigated the expansion and contraction of gene families (HOGs) using the program CAFE v4.2.111. In this analysis (Supplementary Data 3A), we found that most gene losses happened more recently, in phylogenetic tree nodes closer to the leaves instead of older nodes. For example, in the nodes representing the last common ancestor of the families Chaetomiaceae, Thermoascaceae, and Thichocomaceae, the net gain of gene families is negative, but close to zero. The overall average gene gains and losses show a clear trend of fewer genes gained by thermophilic fungi (2.23 times fewer genes gained than mesophilic, p = 0.00091 – [WT]), and more genes lost (1.06 times more genes lost than mesophilic, p = 0.6361 – [WT]); and the opposite trend in mesophilic fungi. Interestingly, on average, thermotolerant fungi have a high number of gene gains and losses (Supplementary Data 3A).The effective number of codons and GC3 are altered in thermophilic fungiBerka and collaborators7 found an increase in GC3 in Myceliophthora thermophila and Thileavia terrestris transcripts, which to this day has not been confirmed to be true in other thermophilic fungi. In this study, we evaluated the coding sequences of 29 thermophiles (Fig. 3a) to show that even though GC3 was not increased for all of them, it is significant in the overall distribution (Fig. 3c). We also measured the degree of codon usage bias by computing the effective number of codons (ENC) for each species12. ENC ranges from 20 to 61, where 20 represents an extreme bias of using only one codon per amino acid, whereas 61 represents uniform synonymous codon usage, i.e., no bias13. The mean ENC values in thermophiles were 47.9, compared to 50.48 for mesophiles and 49.78 for thermotolerant. We found a higher fluctuation in ENC and GC3 along the Mucoromycota and Basidiomycota branches (R2 = 0.11, p = 0.284 versus R2 = 0.94, p < 1 × 10−5 for Ascomycota) (Fig. 3d).Fig. 3: GC content of third base of the codon (GC3) and Effective number of codons (ENC) in thermophilic fungi.a GC3 distribution across 79 genomes. b ENC distribution across 79 genomes. c Violin plot of GC3 distribution in thermophilic, mesophilic, and thermotolerant genomes showing a significant difference between thermophilic and mesophilic using Moods median test (MT) and Wilcoxon rank sum test (WT). Horizontal lines show the median extended across the plot. d Scatterplots of average GC3 and ENC (gray dashed lines represent standard deviation) show a high correlation (R2 = −0.72) between them. The black dashed lines represent the average GC3 and ENC of all genomes used in this study.On the other hand, species in Ascomycota, in general, exhibit less variation in their codon bias than within clades such as Chaetomiaceae and Eurotiomycetes, where thermophilic fungi show significant biases compared to mesophilic counterparts (p = 0.033 and p = 0.007, respectively [WT]). Variations in the GC3 and ENC are anti-correlated (R2 = 0.72, p < 1 × 10−5), with thermophilic fungi skewed toward the first quadrant (17 out of 29–58.6%) of the scatter plot (Fig. 3d), Basidiomycota split between first and fourth quadrant (2 and 3 species, respectively), and Mucoromycota all present in the fourth quadrant. In addition, early-diverging fungi exhibit, on average, less variation in ENC, but higher variation in GC3. Wint and collaborators13 made a similar observation in their study with 450 fungal species.Machine learning can classify thermophilic and mesophilic fungiWe used Support Vector Machines (SVM) to identify hidden correlations between lifestyle and gene content, as described by Haridas and collaborators14. Large data sets are critical to machine learning, and only thermophilic and mesophilic rather than thermotolerant fungi had a sufficient number of genomes in our dataset to be adequately separated using these methods. We selected a subset for training (20 thermophilic and 30 mesophilic fungi – Supplementary Data 3B) across the tree to identify genes that could discriminate between these two lifestyles. Using the HOGs dataset (ignoring unassigned genes), 100 orthogroups produced an accuracy of 70% each in their ability to distinguish between thermophilic and mesophilic lifestyles. The best discriminator among the HOGs with an average prediction accuracy of 0.81 was HOG0006878 (Supplementary Data 3C), found in 36% of the thermophiles and 83% of mesophiles and containing Cytochrome c/c1 heme lyases (PF16815). We used all 100 clusters in all possible combinations to improve prediction accuracy. Our results suggest that combining eight clusters increased accuracy to 94% (Supplementary Data 3C and Fig. 4). We did not include thermotolerant in the training set, only in the test set. We found that all thermotolerant species were classified as mesophilic, suggesting their genomic adaptations resemble mesophilic species.Fig. 4: Informative gene clusters to thermophilic lifestyle identified using support vector machine (SVM) classifier.a Biplot PCA using the eight most informative clusters showing the separation between thermophiles and mesophiles. b Heatmap of the eight most informative HOGs showing prevalent gene loss in thermophiles. c Scatterplot of probabilities of a genome being a thermophile or mesophile based on these eight gene clusters containing the following Pfam domains: PF00005 and PF12848: ABC transporters; PF00106: short chain dehydrogenase; PF00962: Adenosine deaminase; PF02182: SAD/SRA domain (involved in 5mC methylation); PF16815: Protein HRI1; PF00149: Calcineurin-like phosphoesterase.These eight HOGs comprise the following Pfam domains: Two ABC transporters (PF00005 and PF12848), a critical component in response to environmental stresses15; Two short-chain dehydrogenases (PF00106), involved in the metabolism of a wide range of compounds and stresses15,16; Protein HRI1 (PF16815), a protein of unknown function in Saccharomyces cerevisiae15,16,17; Adenosine deaminase (PF00962) – plays an important role in nitrogen metabolism and may have a critical function in the regulation of fatty acid synthesis18; SAD/SRA domain (PF02182) – involved in 5mC methylation19; and a Calcineurin-like phosphoesterase (PF00149), enzymes potentially involved in DNA repair, stress response, and other cellular functions20. It is important to note that all HOGs seem to be reduced in thermophilic genomes, except the ABC transporter (PF12848), which interacts with ribosomes and modulates translation elongation in bacteria21.Protein structures embed optima temperature in endoxylanasesEnzymes are the workhorses of biotechnological processes, and their performance is linked to their three-dimensional structure. By deciphering the structural intricacies of enzymes involved in biomass degradation (i.e., cellulases and xylanases), researchers can strategically engineer or select enzymes that are stable and highly active at elevated temperatures, a key requirement for efficient biomass conversion in industrial applications2,7. The AlphaFold2 protein structure prediction technology22 has emerged as a transformative tool with profound biotechnological significance in biomass degradation. By providing highly accurate predictions of protein structures, AlphaFold2 empowers researchers to unlock a deeper understanding of the intricate molecular machinery involved in enzymatic processes. This newfound precision is invaluable when identifying the optimal temperature, pH, and enzymatic activity conditions for biomass degradation23. We used AlphaFold2 to predict the GH10 endoxylanase protein structures. Since this family is highly sampled in the Protein Data Bank (PDB)24, the predicted structures had predicted template modeling (pTM) > 0.59 (Supplementary Data 4).We built a protein structure similarity network using 318 GH10 endoxylanases with Foldseek25 from the 79 genomes used in this study (Fig. 5). It is important to note that all GH10 proteins fell into two orthogroups (N0.HOG0000305 and N0.HOG0011586 – Supplementary Data 3B) based on the OrthoFinder26 clustering results, one large cluster with 307 proteins, and a small one with 10 proteins (cluster 9 – Fig. 5a, b). It suggests that this family is highly conserved sequence-wise, so differences might be present at the structural level. We found that the number of disulfide bonds was the same within each cluster and different between them (Fig. 5), modulating their stability and constraining their conformational dynamics27. In addition, proteins from both mesophilic and thermophilic fungi were clustered together, with clusters with three proteins scattered along the phylogenetic tree, and with two proteins surrounding cluster 8 (Figure a). Also, we found some examples of proteins with long branch lengths (high evolutionary rate) losing structural similarity with closely related proteins (i.e. 38518_Stalo1, 654_Theste1, and 6258_Chathe1 – Fig. 5a).Fig. 5: Phylogeny and structural similarity network of GH10 endoxylanases using AlphaFold2.a Phylogeny of GH10 endoxylanase built with iqtree [-m MFP -bb 10000 -safe]. The branches are colored based on the bootstrap value; the leaves’ colors (green – mesophilic, orange – thermotolerant, red – thermophilic) represent the organism of origin; the color stripe in the outer circle represents the structural clusters shown in (b) (plus duplets – two protein clusters; and triplets – three protein clusters), and the bar chart (black) shows the number of disulfide bonds. b GH10 endoxylanase network clustered based on structural similarity using Foldseek. Edge thickness represents the percentage identity between protein structures. The clusters were separated based on percent identity (>0.7) and TMscore (>0.8) on a pairwise comparison of all endoxylanases using Foldseek. Colors (green – mesophilic, orange – thermotolerant, red – thermophilic) represent the organism of origin, and the size shows the number of disulfide bonds. Edges for pairs with percent identity <0.7 and TMscore < 0.8 were removed. c One structure from each cluster was randomly selected to show the prediction quality based on the AlphaFold2 pLDDT – a per-residue measure of local confidence scaled from 0 to 100.We mapped available data on the optimal temperature of 31 proteins (9.7% – Supplementary Data 4), onto the obtained structural network (Fig. 5). Most sequences with optima between 55–60 °C (77%) fell in the same structural cluster (Fig. 5). On the other hand, there was no clear distribution pattern for enzymes with optima <55 °C. One hypothesis is that enzyme activity at higher temperatures requires a tight structural conformation, reflected by the high structural similarity. It is important to note that different labs and hosts used for heterologous expression can generate divergent enzymatic optimal temperatures28,29.

Hot Topics

Related Articles