Birth of protein folds and functions in the virome

Preparation of protein sequencesProtein sequences for eukaryotic viruses present in RefSeq54 were collected through the NCBI Viruses portal (https://www.ncbi.nlm.nih.gov/labs/virus) in July 2022. GenPept files were downloaded for viruses that were annotated by NCBI to have an eukaryotic host. Because not all viruses have a host labelled by NCBI, GenPept files of human-infecting viruses annotated by ViralZone (https://viralzone.expasy.org/678) were also downloaded. Finally, proteins from all coronaviruses present in RefSeq, regardless of NCBI-labelled host, were downloaded.Each GenPept file was processed such that polyproteins with defined ‘mature peptide’ fields produced separate protein sequences for each mature peptide. GenPept files without a mature peptide field were output as full amino acid sequences. These processing steps are present in the vpSAT github directory (https://github.com/jnoms/vpSAT) in the process_gbks.py file. Proteins larger than 1,500 residues, or in some cases 1,000 residues, were excluded. Only 1,706 proteins were excluded for this reason.Structure predictionMSAs were generated with MMseqs2 release version b0b8e85f3b8437c10a666e3ea35c78c0ad0d7ec2. To increase MSA generation speed, the RefSeq virus protein database (downloaded on 6 June 2022) was used as the target database for MSA generation. Structures were predicted with ColabFold15 (downloaded 22 June 2022). The majority of samples used three recycles, three models, stop_at_score=70, and stop_at_score_below=40. MMseqs2 and Colabfold_batch were run with a Nextflow55 pipeline, and all parameters used can be found at https://github.com/jnoms/vpSAT. Information on all viruses and structures included in this manuscript is present in Supplementary Table 1.Protein cluster generationAll proteins were initially clustered with MMseqs2, with a requirement of at least 20% sequence identity and 70% query and target coverage. MMseqs2 cluster mode 0 was used, meaning that many but not all pairs of aligned proteins are placed into the same sequence cluster. Predicted structures for each sequence cluster representative were subjected to an all-by-all alignment using Foldseek17, requiring the alignment to consist of at least 70% query and target coverage and an alignment E-value less than 0.001. The resultant structural alignment file was then filtered using SAT aln_filter to keep alignments with a TMscore of at least 0.4. Clusters were generated from this alignment file using SAT aln_cluster in a similar manner as Foldseek cluster mode 1, wherein all query-target pairs are assigned to the same cluster. Cluster information from sequence and structure clustering were merged using SAT aln_expand_clusters. Taxonomic counts information was generated using SAT aln_taxa_counts, producing a ‘tidy’ table for each cluster_ID with the number of members of each taxon at multiple taxonomy levels. Taxonomy information was also added directly to the merged cluster file using SAT aln_add_taxonomy.Cluster purity analysisTo determine the structural consistency of the clusters, all clusters with at least 100 members were selected for analysis. DALI was used to align the cluster representative with each cluster member. Clusters whose members were on average smaller than 150 residues were excluded. This led to the analysis of 49 clusters. Cluster members that failed to align to their representative were assigned a z value of 0. For each cluster, the average z-score between the representative and each member was determined and plotted. All scripts used to run DALI can be found in vpSAT’s dali_format_inputs.sh and dali.sh files. Dalilite version 5 was used. DALI output files were parsed into a tabular format using SAT’s aln_parse_dali.PhylogeneticsPhylogenetic reconstructions were conducted using all sequence cluster representatives, or in the cases of clusters 56 and 735, all members within each cluster. For the nucleoside transporter tree, all herpesvirus sequence representatives of cluster 119, as well as a F. catus gammaherpesvirus 1 protein (YP_009173937) from a singleton cluster, were used as queries. Iterative sequence similarity searches against the NCBI non-redundant database were performed using standalone PSI-BLAST v2.15.0, using the following parameters56: -num_iterations 10, -max_hsps 1, -subject_besthit, -gapopen 9, -inclusion_ethresh 1e-15, -evalue 1e-10, and -qcov_hsp_perc 70. For the LigT-like PDE tree, this search was restricted to only viral targets. Each of these protein sets were then clustered by utilizing mmseqs2 v15.6f452 with high sensitivity (command line option: -s 7.5) to compress the amount of highly similar sequences into cluster representatives. Subsequently, these sequence sets were aligned using Clustal Omega v1.2.4 with default settings57. Comprehensive taxonomic information for each aligned sequence was integrated into the unique sequence identifiers by utilizing the biopython v1.81 package58. Phylogenetic trees were reconstructed using IQTREE v2.3.359 with -m TEST -B 1000 options for model testing and bootstrapping. The best model was selected for each tree based on Bayesian Information Criterion (BIC), and were as follows: Nucleoside transporters, VT + F + G4; LigTs, VT + F + G4; cluster 28, VT + G4; cluster 55, VT + I + G4; cluster 56, VT + G4; cluster 735, VT + I + G4. Trees were visualized with the Interactive Tree of Life (iTOL)60. Code used for this analysis can be found at https://github.com/Doudna-lab/nomburg_j-LigT_phylogeny.Structural alignments against the AlphaFold databasesIn Fig. 1i, Foldseek was used to align a protein representative from every viral protein cluster against 2.3 million protein cluster representatives from the AlphaFold database3. For Fig. 3, all 67,715 viral protein structures were searched against the pre-made Foldseek databases of the original release of the AlphaFold database, consisting of proteins from 48 organisms and including members of the bacterial, eukaryote, and archaeal superkingdoms. For this search, the full AlphaFold database of over 200 M structures was not used because it contains many viral proteins misannotated as non-viral proteins (these misannotations reflect errors in Uniprot metadata). Alignments were filtered to keep only those with a minimum TMscore of 0.4 and an E-value of less than 0.001.DALI alignments of specific non-viral proteins against the viral protein databaseFollowing Foldseek alignments against the AlphaFold database, specific hits of interest (for example, ENT4) were selected. These structures were downloaded and imported to the DALI database format using vpSAT’s dali_format_inputs.sh. They were then aligned against the full viral protein structure database using vpSAT’s dali.sh, which lists all parameters. Dalilite version 5 was used. DALI output files were parsed into a tabular format using SAT’s aln_parse_dali.Identification of annotated protein sequence clustersEach protein in the database was searched against the Pfam23, CDD24, and TIGRFAM25 databases using InterProScan22. A sequence cluster was considered annotated if more than 25% of members had any InterProScan alignment, and was considered unannotated if otherwise. Note that some proteins without an InterProScan alignment have existing annotations through other methods, including manual curation. Values of RMSD in Fig. 3 were calculated using DALI.DALI alignments to identify shared domainsThis analysis used the structure representatives from clusters with at least 2 members, resulting in 5,700 cluster representatives. Structures from these representatives were imported to the DALI database format using vpSAT’s dali_format_inputs.sh. To compare eukaryotic virus protein cluster representatives, an all-by-all alignment was conducted using vpSAT’s dali.sh, which lists all parameters.Dalilite version 5 was used. DALI output files were parsed into a tabular format using SAT’s aln_parse_dali. All DALI alignments were filtered for an alignment length of at least 120, and for a z-score greater than or equal to (alignment length/10) − 4.MSA generation using the full ColabFold MMseqs2 databaseWe selected the protein cluster representatives from the top 100 protein clusters by size, as well as 100 randomly selected singleton clusters, for analysis. ColabFold was used with FASTA inputs, such that MSAs were generated using the MMseqs2 ColabFold server (which maps each sequence against UniRef, BFD and Mgnify), and this MSA was used for structure prediction.Benchmarking sequence and structure methodsFor all protein clusters with at least two sequence clusters, we conducted all-by-all alignments between members using MMseqs2 (version b0b8e85f3b8437c10a666e3ea35c78c0ad0d7ec2), DIAMOND blastp61 (version 0.9.14), or jackhmmer62 (version 3.1b2). These alignments and subsequent clustering occur separately for each protein cluster. From these alignments, we conducted connected-component clustering using sat.py aln_cluster. Here, all proteins that align will be assigned to the same resultant cluster. Thus, each original protein cluster (determined through our approach, combining sequence alignment with MMseqs2 and structure alignment with Foldseek) now has a set of clusters identified through each of the sequence-only methods. We then measured, for each original protein cluster, how many clusters created by each of the sequence-only methods and how many proteins fall into the largest cluster generated by these sequence methods.For benchmarking virus–non-virus alignments, we conducted sequence alignments (again using MMseqs2, DIAMOND blastp, and jackhmmer) analogous to the DALI structural alignments present in Extended Data Fig. 4, using the same query against all viral proteins included in the dataset. We then determined the fraction of DALI-identified targets were identified for each non-viral query and through each sequence method.For the comparison between hhPred43 and DALI, we identified 4,409 sequence clusters that contained more than 1 member and for which fewer than one-quarter of members had an InterProScan alignment. We then identified sequence cluster representatives that were well folded, with an average pLDDT of at least 70. This resulted in a final set of 1,326 proteins. We used DALI to align each of these proteins against the PDB25 database provided by the DALI authors. Alignments were considered high-confidence if they contained a z-score of at least 7. DALI alignments were conducted with vpSAT’s dali.sh. For hhPred searches, we established a local pipeline using HHsuite’s (v3.3.0) HHblits and HHsearch modules. For each query protein, we first used HHblits to align them against the Uniref30 HMM database provided by the HHsuite authors, using the flags -n 2 and -cov 20. We then used HHsearch to align each resultant MSA against the HHsuite-provided PDB database with the flag -cov 20. Alignments were considered high-confidence if they had an E-value of less than or equal to 0.001.Searching the TCDBWe used a map of PDB accession to TCDB classification (https://www.tcdb.org/cgi-bin/projectv/public/pdb.py) to download all experimental structures associated with TCDB classifications. For subsequent processing, we used a maximum of five structures per TCDB classification. One structure was excluded (PDB: 1HXI) as it is highly truncated. Nine additional structures failed to import to DALI database files, typically due to small protein size. For PDB entries that contained multiple chains, we selected the first chain for alignment. Due to the absence of experimental structures, the AlphaFold models for ENT3 (AF-Q9BZD2-F1-model_v4) and ENT4 (AF-Q7RTT9-F1-model_v4) were added to the dataset. For the 46 protein structures with multiple classifications, one classification was chosen at random. This ultimately resulted in a dataset of 1,812 structures from 485 classifications, with an average of 3.7 structures per classification. Structures were imported to the DALI database format using vpSAT’s dali_format_inputs.sh. The predicted structure of EBV BMRF2 (YP_001129455) was aligned against this structure database using dali.sh.PDE cloning and activity assaysTwo tandem STREP2 tags, following a GGS linker, were appended to the end of each putative LigT-like PDE. Sequences were codon-optimized for humans, and gBlocks encoding each product were ordered from IDT and cloned into a custom lentiviral expression vector. PDE mutants have dual H>A mutations of the catalytic histidines (or, in the case of MHV NS2a and pigeonpox PDE, one H>A and one H>R mutation).The 293T cells were seeded into 96-well plates at 20,000 cells per well. The 293T cells were kindly provided by the Ott laboratory, and were originally from ATCC. The 293T cells were screened for Mycoplasma within the last year, and were not otherwise authenticated. The day after plating, each well was transfected with 15 ng STING (pMSCV-hygro-STING R232, Addgene 102608), 20 ng firefly luciferase driven by an IFNB promoter (IFN-Beta_pGL3, Addgene 102597), 5 ng Renilla luciferase (pRL-TK, Promega E2241), and 20 ng of each putative PDE using the Mirus TransITX2 transfection reagent. After at least 4 h, cells were treated with 0.1 μM diABZI (Invivogen) or transfected with 10 μg ml−1 2′,3′-cGAMP (Invivogen) using TransITX2. The next day, firefly and Renilla luciferase were measured using the Promega Dual-Glo luciferase assay system. Three wells were transfected per condition, and experiments are representative of at least two independent experiments. The ‘no STING’ conditions were transfected with both reporters and a noncoding transgene, but no STING plasmid.PDE western blotsThe 293T cells were plated in 6-well dishes at 5 × 105 cells in 2 ml per well. The next day, each well was transfected with 200 ng of the indicated transgene using Mirus TransITX2. The following day, cells were lysed using RIPA buffer (ThermoFisher) supplemented with protease/phosphatase inhibitor (ThermoFisher), and lysate protein concentrations were determined using the Pierce BCA assay kit. All samples were then normalized to the same protein concentration. Bio-Rad Criterion 4%–20% acrylamide gels were loaded with 30 µg of protein per well, followed by transfer to a 0.2-µm nitrocellulose membrane. For visualization of the Strep-tagged PDEs, the Streptactin HRP (IBA 2-1502-001IAB) antibody was used (1:100,000 dilution, 1 h at room temperature). For visualization of GAPDH, we used Santa Cruz Biotech Mouse anti GAPDH (sc-365062) primary (1:1,000 dilution, incubation at 4 °C overnight) and ECL Anti-mouse IgG (Amersham NXA931) secondary (1:5,000 dilution, 1 h at room temperature).Recombinant protein expression and purificationExpression plasmids for pigeon poxvirus PDE (wild-type and H72 A/H167R), MHV nonstructural protein 2A (NS2A), and T4 anti-CBASS protein 1 (Acb1) were cloned into custom pET-based vectors by Gibson assembly to yield N-terminal His10-MBP-TEV constructs. Proteins were expressed from 4 l Escherichia coli Rosetta 2 (DE3) pLysS by growing to an of OD600 of 0.4–0.6 in 2× yeast extract tryptone medium at 37 °C and induced with 0.5 mM isopropyl β-d-1-thiogalactopyranoside. After induction, cells expressing each protein were grown overnight at 16 °C to an OD600 of 1.2–1.4. Cells were collected by centrifugation for 20 min at 4,000 rpm at 4 °C and resuspended in 20 mM Tris-HCl, pH 8.0, 10 mM imidazole, 2 mM MgCl2, 500 mM KCl, 10% glycerol, 0.5 mM TCEP and Roche protease inhibitor. Cells were lysed by sonication and cell lysate was clarified by centrifugation at 17,000g, 4 °C for 0.5 h. The supernatant was bound to 5 ml Nickel-NTA affinity resin for 1 h at 4 °C. Supernatant was discarded and resin was washed 5 × 30 ml wash buffer (20 mM Tris-HCl, pH 8.0, 500 mM KCl, 30 mM imidazole, 10% glycerol and 0.5 mM Tris(2-carboxyethhyl) phosphate). Protein was eluted in 10 ml elution buffer (20 mM Tris-HCl, pH 8.0, 500 mM KCl, 300 mM imidazole, 10% glycerol, and 0.5 mM Tris(2-carboxyethyl) phosphate). Each protein was concentrated to 10 mg ml−1 during buffer exchange to storage buffer (20 mM Tris-HCl, pH 8.0, 500 mM KCl, 30 mM imidazole, 10% glycerol and 0.5 mM Tris(2-chloroethyl) phosphate) using a 10 kDa MWCO centrifugal filter (Amicon). A total of 5–15 mg target protein fused to N-terminal His10–MBP–TEV was stored at −80 °C.In vitro characterization of PDEsRecombinant enzymes were assessed for PDE activity by in vitro cGAMP degradation reactions and downstream analysis by TLC. Reactions were initiated by the addition of recombinant enzyme (40 μM) in reaction buffer (50 mM Tris, pH 8.0, 10 mM MgCl2, 100 mM NaCl) to 1.25 mM 2′,3′-cGAMP or 3′,3′-cGAMP (Biolog). The reaction mixture was incubated at 37 °C for 18 h and stopped by vortexing for 20 s.Silica gel TLC plates (5 cm × 10 cm) with fluorescent indicator 254 nm were spotted with 2 μl in vitro enzymatic reaction. Separation was performed in an eluent of n-propanol/ammonium hydroxide/water (11:7:2 v/v/v). The plate was allowed to dry fully and visualized with a short-wave ultraviolet light source at 254 nm.Data analysis and plottingAll analysis, plotting, and statistical tests used R version 4.0.3. The genome type and average genome size were determined from information downloaded from the NCBI Virus portal (https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/).Reporting summaryFurther information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Birth of protein folds and functions in the virome

Turbocharging protein binding site prediction with geometric attention, inter-resolution transfer learning, and homology-based augmentation | BMC Bioinformatics

Zero-shot transfer of protein sequence likelihood models to thermostability prediction

Poisoning medical knowledge using large language models

Semi-supervised recognition for artificial intelligence assisted pathology image diagnosis

Chemistry wordoku #062 | Puzzle

Hot Topics

Turbocharging protein binding site prediction with geometric attention, inter-resolution transfer learning, and homology-based augmentation | BMC Bioinformatics

Zero-shot transfer of protein sequence likelihood models to thermostability prediction

Poisoning medical knowledge using large language models

Related Articles

Balancing Act: Pregnancy and Bipolar Disorder

Cohesion at the cellular level: flexible yet stable

Gut bacteria influence responses to immunotherapy in patients with asbestos related cancer

Quick Links

Must Read

Turbocharging protein binding site prediction with geometric attention, inter-resolution transfer learning, and homology-based augmentation | BMC Bioinformatics

Zero-shot transfer of protein sequence likelihood models to thermostability prediction

Poisoning medical knowledge using large language models

Semi-supervised recognition for artificial intelligence assisted pathology image diagnosis

Popular Articles

Turbocharging protein binding site prediction with geometric attention, inter-resolution transfer learning, and homology-based augmentation | BMC Bioinformatics

Zero-shot transfer of protein sequence likelihood models to thermostability prediction

Poisoning medical knowledge using large language models