Neural network extrapolation to distant regions of the protein fitness landscape

Neural network model trainingWe used the Olson et al. GB1-IgG binding dataset20 and the Gelman et al. model architectures12 with the same parameter initializations, train-validation-test splits, and hyperparameters. We additionally trained 100 models of each architecture with different random initializations to assess the effects of parameter initialization and for ensemble learning. We constructed two CNN ensemble predictors: a median predictor (EnsM) that inputs a sequence into 100 CNN models and outputs the median fitness prediction, and a conservative predictor (EnsC) that inputs a sequence into 100 CNN models and outputs the 5th percentile fitness prediction.For model training and library design, protein sequences were embedded as a concatenated vector containing a physicochemical and categorical (one-hot) representation of the amino acid sequence. Embeddings are a fixed length across all sequences and contain no gap characters.Evaluation of model extrapolation on 3- and 4-mutant fitness landscapesWe evaluated the models’ ability to extrapolate to 3- and 4-mutants using the four-site combinatorial GB1 dataset from Wu et al. 22 (Dataset: https://doi.org/10.7554/eLife.16965.024) This dataset consists of experimental measurements for 93% of variants from a 4-site combinatorial saturation mutagenesis GB1 library screened for binding to IgG. The models were trained on single and double mutant data from Olson et al. 20 and were used to predict the fitness values of the 3- and 4-mutants. We also tested the ability of the models to identify the 4-mutants with the highest fitness values. The model recall was calculated by giving a model a design budget n, having the model rank all 121,174 characterized 4-mutants and select the top n, and then determining what percentage of the true top 100 variants were in this top n set. An optimal model would achieve 100% recall with a design budget n = 100.Model-guided protein designWe designed GB1 variants with eight different models at 5, 10, 20, 30, 40, and 50 mutations from wild-type GB1 (Supplementary Data 1). For each model-distance combination, we designed 41 diverse GB1 variants with high predicted fitness. For a given model-distance combination, we used simulated annealing (SA) to maximize the model’s predicted fitness constraining the number of mutations allowed in the designed protein. To ensure the number of mutations in a design were fixed, we exchanged random mutations. For example, when designing a 5-mutant, we might randomly choose two current mutations, mutate those back to the wild-type amino acids, and then choose two new random mutations to keep the total mutations fixed at five. At each step of simulated annealing, one or more mutations (sampled from a ${{{\rm{\lambda }}}}=1\,$Poisson distribution) were randomly exchanged and predicted fitness of the given model was evaluated. Mutations that improved predicted fitness were automatically accepted while all other mutations were accepted with probability ${e}^{\Delta {Fitness}/T}$ where T is decreased along a logarithmic temperature gradient ranging from 103 to 10−5 over 15,000 to 50,000 steps. We performed 500 independent simulated annealing runs for each model-distance combination, which typically resulted in more than 41 unique candidate sequences. The individual SA trajectories showed strong convergence to similar fitness values indicating a sufficient number of optimization steps (Supplementary Fig. 2). For some model-distance combinations, 50,000 SA steps converged on fewer than 41 candidate sequences, in which case, we decreased the number of SA steps until more than 41 unique candidate sequences were designed in 500 simulations. The number of SA steps for these specific categories were: (LR 5-mutants) and (FCN 5-mutants) = 10,000 steps and (LR-10 mutants), (FCN-10 mutants), (EnsM-5 mutants), and (EnsC-5 mutants) = 25,000 steps. All sequence designs were run in parallel using high-throughput computing40. To select a diverse and representative set of designs, we performed K-means clustering on the 500 independent designs using 41 clusters and selected the variant from each cluster with the highest predicted fitness for experimental characterization. This resulted in 41 unique GB1 designs for 8 models at 6 distances, for a total of 1968 designs.BLOSUM analysisWe calculated the mean site-wise BLOSUM score for each design category using the BLOSUM62 substitution matrix. All BLOSUM scores were calculated with respect to WT. Since our objective was to illustrate whether mutations from WT were conservative or non-conservative, non-mutated residues were excluded from analyses. As such, reported mean BLOSUM scores only reflect the nature of mutations and not the effect of non-mutated residues. The mean site-wise BLOSUM score is the average BLOSUM score for mutated residues at a given site.UMAP visualization of GB1 variant library structure spaceWe used Alphafold24,25 (AlphaFold Colab v2.3.2) to predict the structure for each GB1 variant. AlphaFold uses an input MSA to predict the query sequence structure. We used WT GB1 to build an MSA using a Jackhmmer search on the Uniref90, Small BFD, and Mgnify sequence databases as recommended by AlphaFold Colab. This MSA was used as the input MSA for all structure predictions. We generated six AlphaFold predictions for each sequence; the structure with the highest mean pLDDT (measure of structure confidence) was chosen as the representative model.We generated a UMAP projection of the GB1 library structure space according to methods previously described41. Briefly, we calculated TMscores for all pairwise structures using TM-Align26 and used TMscore-1 as a distance metric for the UMAP projection. We performed a hyperparameter scan for n_neighbors and min_dist and selected a representative projection (n_neighbors = 15, min_dist = 0.1). General spatial patterns are conserved across hyperparameter settings.High-throughput characterization of GB1 designs via yeast surface displayWe codon optimized our GB1 designs for expression in Saccharomyces cerevisiae using the GenSmart Codon Optimization tool (GenScript) and excluded the BsaI, NheI, BamHI, MluI, and XhoI restriction enzyme sites. We also identified 25 sequences from the training data with a broad range of fitness values to correlate our fitness measurements with the original data from Olson et al. 20. Each of these variants was designed with two different synonymous codon sequences to provide internal controls to ensure reproducibility of our fitness measurements. The designed genes and control sequences were synthesized as an oligonucleotide pool by Twist Biosciences with flanking sequences to allow PCR amplification and downstream cloning.The GB1 gene library was cloned into the yeast surface display vector pCTCON2 (provided by Dane Wittrup, MIT)42. To prepare the vector for library cloning, we (1) removed the BsaI restriction site from the AmpR gene using site directed mutagenesis (for: 5′-AGCGTGGGTCGCGCGGTATCA-3′; rev: 5′-CACCGGCTCCAGATTTATCAGC-3′) and (2) added in BsaI sites for cloning library sequences (for: 5′-GAAGGGTCTCTGATCCGAACAAAAGCTTATTTCTGAAG-3′; rev: 5′-GTATTGGTCTCTCTAGCCGACCCTCCGC-3′). We amplified the oligonucleotide pool using either Phusion Hot Start Flex 2X Master Mix (New England Biolabs, #M0536L) or KAPA HiFi HotStart ReadyMix (Roche, #KK2602) (for: 5′-ACTCAAGGTCTCGCTA-3′; rev: 5′-GTCAAGGTCTCGGAT-3′), cloned the gene library into the modified pCTCON2 vector using Golden Gate assembly (37 °C 5 min → 16 °C 5 min ×30 cycles → 60 °C 10 min), and transformed into 10G supreme electrocompetent Escherichia coli (Lucigen, #60080-2). The transformed library was cultured in Luria Broth (LB) (DOT Scientific Inc., #D5L24400-2000) media at 37 °C to an optical density of ~0.5, at which point plasmid DNA was harvested using the Qiaprep Spin miniprep kit (Qiagen, #27104). We then transformed the GB1 library into yeast display S. cerevisiae strain EBY100 made competent using the Zymo Research Frozen EZ Yeast Transformation II kit (Zymogen, #T2001). We grew the transformed library in Sabouraud Dextrose Casamino Acid media (SDCAA, pH 4.5: Components per liter – 20 g dextrose (Sigma Aldrich, CAS# 50-99-7, #47249), 6.7 g yeast nitrogen base (VWR Scientific, 97064-162), 5 g casamino acids (VWR, 97063-378), 10.4 g sodium citrate (Sigma Aldrich, CAS# 6132-04-3, C8532), 7.4 g citric acid monohydrate (Sigma Aldrich, CAS# 5949-29-1, C7129)) at 30 °C and 250 rpm for two days. We plated an aliquot of the transformant pool on synthetic dropout (SD)-Trp agar (Teknova, C3060) to quantify the number of library transformants.We analyzed and sorted the GB1 library using florescence-activated cell sorting (FACS). We induced the library expression in Sabouraud Galactose Casamino Acid media (SGCAA, pH 7.4: Components per liter – 8.6 g NaH2PO*H2O (Sigma Aldrich, CAS# 10049-21-5, 71507), 5.4 g Na2HPO4 (Sigma Aldrich, CAS# 7558-79-4), 20 g galactose (Sigma Aldrich, CAS# 59-23-4, G0750-10G), 6.7 g yeast nitrogen base, 5 g Casamino Acids) overnight, harvested approximately 3 × 106 yeast cells by centrifugation, washed once in pH 7.4 Phosphate Buffered Saline (PBS) (Crystalgen, 221-133-40) containing 0.2% (w/v) bovine serum albumin (BSA) (Sigma Aldrich, 1071145400), and incubated overnight at 4 °C on a tube rotator at 18 rpm in 800 µL of PBS/0.2% BSA containing 15 nM human IgG1 (BioLegend, 403501) that had been conjugated with Alexa647 using NHS chemistry (Thermo Fisher Scientific, A37573) and 3 µg/mL anti-myc IgY (Aves Labs, ET-MY100) conjugated with Alexa488 using NHS chemistry (Thermo Fisher Scientific, A20000). Following the overnight incubation, we washed the yeast in PBS/0.2% BSA and resuspended in ice-cold PBS for FACS. We performed FACS using a FACS Aria II (Becton Dickinson) and analyzed yeast cells for display at 488 nm and IgG binding at 647 nm. Our qualitative experiments sorted cells into display-only and IgG bind populations, while our quantitative experiments sorted cells into display-only, low-bind, WT-like bind, and high-bind populations.We recovered the sorted yeast populations, as well as the initial unsorted library, and grew them in SDCAA media overnight. The following morning, we expanded the cultures into SDCAA media at an optical density of 0.1, grew them until reaching density of ~1.0, harvested and centrifuged the cultures, and extracted the plasmids using the Zymo Research Yeast Plasmid Miniprep II kit (Zymo Research, D2004). We transformed the extracted plasmid DNA into 10 G supreme electrocompetent E. coli (Lucigen, #60080-2), cultured overnight in LB + carbenicillin (GoldBio, #C1035) media shaking at 250 rpm at 30 °C, and harvested the plasmid DNA using the Qiaprep Spin miniprep kit (Qiagen, #27104). We cut out the GB1 gene insert using XhoI (NEB, R0146S) and PstI (NEB, R0140S) restriction enzymes, excised the corresponding band using agarose gel extraction, and purified using the Zymo Research gel DNA recovery kit (Zymo Research, #D4001). The UW-Madison Biotechnology Center DNA sequencing core prepared a sequencing library using the NEBNext Ultra II kit (NEB, E765S) and sequenced the samples using an Illumina NovaSeq6000 with 2 × 150 bp reads.Illumina data processing and analysisWe aligned forward and reverse Illumina reads to wild-type GB1, using a predetermined offset, and merged the two reads by selecting the base with the higher quality score in overlapping regions. A design’s count was equal to the number of sequencing reads that exactly matched the designed nucleotide sequence and we filtered out any designs if they had fewer than 10 counts in the unsorted population.The initial experiments consisted of two replicates each of unsorted (u), display only (d), and binding (b) populations. For each population, we divided by the total number of reads to obtain the proportion of each design. We refer to pu,i, pd,i, and pb,i as the proportion of design i in the unsorted, display only, and binding populations, respectively. We calculated a binding score as the enrichment of a design in the binding population relative to the display only population, where pb,wt and pd,wt are the proportion of wild-type GB1 in the binding and display only populations, respectively (Eq. 1). We also calculated a display score as the enrichment of a design in the full displaying population (b + d) relative to the unsorted population (Eq. 2). We observed a bimodal distribution for both binding and display scores that correspond to active and inactive populations. We used these distributions to classify that designs with display scores ≥ −0.26 display and designs with a binding score ≥ −0.5 bind IgG. Here, the 0.6 and 0.4 correspond the relative proportion of the binding and display only populations from the FACS experiment. The numerator estimates the full displaying population from the binding and display only populations.For the quantitative screening of designs, we obtained unsorted (u), display only (d), low-binding (l), wild-type-like binding (w), and high-binding (h) populations. For each population we calculated the enrichment relative to the unsorted population where x corresponds to any sorted population l, w, h (Eq. 3). Each sequence was categorized into display only, low-binding, wild-type-like binding, and high-binding based on thresholds determined by calibration sequences with known fitness values (Supplementary Fig. 16). All data analysis can be found in Jupyter notebooks in the supplementary material.$${e}_{{{{\rm{bind}}}},i}={{{\rm{lo}}}}{{{{\rm{g}}}}}_{10}\frac{{p}_{b,i}}{{p}_{d,i}}-{{{\rm{lo}}}}{{{{\rm{g}}}}}_{10}\frac{{p}_{b,{{{\rm{wt}}}}}}{{p}_{d,{{{\rm{wt}}}}}}$$
(1)
$${e}_{{{{\rm{disp}}}},i}={{{\rm{log}}}}_{10}\frac{0.6*{p}_{b,i}+0.4*{p}_{d,i}}{{p}_{u,i}}-{{{\rm{lo}}}}{{{{\rm{g}}}}}_{10}\frac{0.6*{p}_{b,{{{\rm{wt}}}}}+0.4*{p}_{d,{{{\rm{wt}}}}}}{{p}_{u,{{{\rm{wt}}}}}}$$
(2)
$${e}_{x,i}={{{\rm{lo}}}}{{{{\rm{g}}}}}_{10}\frac{{p}_{x,i}}{{p}_{u,i}}-{{{\rm{lo}}}}{{{{\rm{g}}}}}_{10}\frac{{p}_{x,{{{\rm{wt}}}}}}{{p}_{u,{{{\rm{wt}}}}}}$$
(3)
Characterization of individual designs for display and bindingWe identified 20 designs to test more thoroughly for binding and display in clonal yeast display assays (Supplementary Table 1). We chose five 5- and 10-mutants with high binding across all replicates, five 20 -mutants with some binding activity across all replicates, and ten 40 and 50-mutants with high levels of display across all replicates. We resynthesized their DNA sequences as individual gene fragments (Twist Biosciences) and cloned them into the same yeast display vector used for library screening. We transformed the plasmid DNA into EBY100 made competent using the Zymo Research Frozen EZ Yeast Transformation II (Zymo Research, T2001) kit and grew the transformants on SD -Trp agar plates (Teknova, C3060) for two days at 30 °C. After 2 days, we picked individual colonies into 4 mL SDCAA and grew them overnight at 30 °C and 250 rpm. We induced GB1 display in a 5 mL SGCAA culture started at an optical density as measured at 600 nm of 0.5 and shaken overnight at 250 rpm and 20 °C.For IgG binding titrations, we harvested approximately 2 × 105 induced yeast cells for each titration data point, washed once in pH 7.4 PBS containing 0.2% (w/v) BSA, and incubated for three hours at 4 °C on a tube rotator at 18 rpm in between 100 µL and 1 mL of PBS/0.2% BSA containing various concentrations of Alexa647 human IgG1 and 3 µg/mL Alexa 488 anti-myc IgY. We varied the volumes of Alexa647 IgG-containing incubation solution to prevent ligand depletion from occurring at the lowest IgG concentrations. Following incubation, we washed the yeast once in PBS/0.2% BSA and resuspended in ice-cold PBS for flow cytometric analysis. We analyzed the samples using a Fortessa analyzer (Becton Dickinson).For each design, we subtracted background display and binding activity from an unlabeled yeast negative control from raw MFU display and binding measurements. These activity measurements were normalized against WT GB1 by dividing by the WT activity with background subtracted out.Statistics and reproducibilityNo statistical method was used to predetermine the sample size.Reporting summaryFurther information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Neural network extrapolation to distant regions of the protein fitness landscape

Zero-shot transfer of protein sequence likelihood models to thermostability prediction

Poisoning medical knowledge using large language models

Semi-supervised recognition for artificial intelligence assisted pathology image diagnosis

Chemistry wordoku #062 | Puzzle

Multi-output prediction of dose–response curves enables drug repositioning and biomarker discovery

Hot Topics

Zero-shot transfer of protein sequence likelihood models to thermostability prediction

Poisoning medical knowledge using large language models

Semi-supervised recognition for artificial intelligence assisted pathology image diagnosis

Related Articles

Balancing Act: Pregnancy and Bipolar Disorder

Cohesion at the cellular level: flexible yet stable

Gut bacteria influence responses to immunotherapy in patients with asbestos related cancer

Quick Links

Must Read

Zero-shot transfer of protein sequence likelihood models to thermostability prediction

Poisoning medical knowledge using large language models

Semi-supervised recognition for artificial intelligence assisted pathology image diagnosis

Chemistry wordoku #062 | Puzzle

Popular Articles

Zero-shot transfer of protein sequence likelihood models to thermostability prediction

Poisoning medical knowledge using large language models

Semi-supervised recognition for artificial intelligence assisted pathology image diagnosis