The simplicity of protein sequence-function relationships

We have several goals in dissecting a protein’s genetic architecture. First, we would like to know how sequence determines function across the space of all possible variants, including the effects and interactions of each amino acid and any systematic nonlinearity in sequence-function map. Second, we would like to use these fine-scale causal rules for macroscopic descriptions of the genetic architecture, such as the overall importance of effects at each epistatic order or of sequence variation at each site or set of sites. Third, knowing the rules of genetic architecture inferred from a sample of genotypes could allow us to predict the function of uncharacterized variants. Finally, once the rules of genetic architecture are known, they can be interpreted in biochemical and structural terms to understand the physical mechanisms by which sequence shapes function. They also explain why protein phenotypes are distributed as they are across sequence space, which shapes the trajectory and outcome of evolution. In these ways, analyzing genetic architecture allows us to deepen our understanding of how and why a protein works as it does.To achieve these ends, an ideal method of analysis would meet three criteria: (1) the structure of the model yields a transparently interpretable description of the causal rules by which sequence determines function; (2) the model’s terms can be accurately estimated from real datasets, which usually contain experimental noise and are missing measurement for some variants; and (3) the model decomposes the genotype-phenotype relationship parsimoniously, explaining the observed data while minimizing gratuitous complexity.Reference-free analysis of genetic architectureWe designed reference-free analysis (RFA) to achieve these goals. It uses Fisher’s statistical formalism for decomposing genetic architecture42—and for analyzing interaction effects in factorial designs more generally—and applies it to protein sequence space.RFA takes a bird’s-eye view of genetic architecture. The causal factors are sequence states rather than mutations, and their effects on the phenotype are defined relative to the global average of all variants (Fig. 1a). The formalism is simple and interpretable. The zero-order term, which affects all genotypes, is the mean phenotype across sequence space. The first-order effect of a state at a site is its context-independent effect on the phenotype, calculated as the difference between the mean phenotype of all sequences containing that state and the global mean. The epistatic effect of a combination of states is the difference between the mean phenotype of all sequences containing the combination and that expected given the lower-order effects. The phenotype of any genotype is simply the sum of the effects of the genetic states in its sequence (Fig. 1b).Fig. 1: Reference-free analysis (RFA) of genetic architecture.a Illustration of RFA on a 2-site, 2-state genetic architecture. The four possible genotypes are arranged on a plane with their phenotype indicated by elevation. (First panel) The zero-order effect (e0) is the mean phenotype of all genotypes, marked by the clear plane with cyan edges. The first-order effect of state A or B at site 1 [e1(A) or e1(B), green arrows] measures how the mean phenotype of all genotypes containing the state (dashed line) differs from the global mean. The green plane predicts the phenotype based on the state at site 1. (Second panel) First-order effects at site 2 are defined similarly and shown in pink. (Third panel) The first-order model predicts phenotypes by summing the first-order effects of all genetic states plus the global mean, shown as the gray plane tilted in both dimensions; the fraction of phenotypic variance explained is shown. (Fourth panel) The pairwise interaction between states A and B at sites 1 and 2 [e1, 2(A, B)] measures how the mean phenotype of all genotypes containing the two states [here just one genotype (A, B)] differs from the first-order prediction. b We implement RFA with a nonlinear link function to incorporate nonspecific epistasis. Each variant’s genetic score (s) is the sum of the effects of its genetic states. The link function transforms s of each variant into its phenotype, y. Although the link function can take any form, here we use a simple sigmoid defined by two parameters representing the upper and lower bounds of the measurable phenotype. c (Left) A 5-site, 3-state genetic architecture was simulated by drawing reference-based effects from the standard normal distribution (but setting all fifth-order effects to zero); a small amount of simulated noise was added to the phenotypes. (Middle) Absolute error of RFA terms computed from the simulated measurements. Dashed lines, mean absolute error of individual phenotypes. Supplementary Fig. 1 shows the individual inferred terms. (Right) The fraction of phenotypic variance explained by the true, directly computed, and regression-estimated RFA terms. Supplementary Section 1.1 analyzes additional simulated genetic architectures.This way of dissecting the sequence-function relationship gives RFA several desirable properties. First, RFA offers a maximally efficient description of the global sequence-function relationship. An RFA model truncated at any epistatic order captures the maximum amount of phenotypic variance that can be captured by any linear model of the same order (Supplementary Section 2.6). Consider all zero-order models, which predict the phenotype of every sequence by a single number. The RFA zero-order term is the mean phenotype of all sequences and is therefore the best predictor in the sense of minimizing the total squared error. The first-order RFA model predicts each variant’s phenotype as the sum of the first-order effects of its constituent states and the global mean. This predictor again achieves the minimum total squared error among all possible first-order models and therefore explains the maximum possible amount of phenotypic variance. This property continues as the model order increases. To the greatest extent possible, RFA explains the sequence-function relationship by low-order causal factors, which are relatively few in number and apply most broadly, rather than by high-order factors, which at the limit explain every single data point as the result of a unique set of idiosyncratic causes.Second, RFA is robust to measurement noise, because its terms are defined using average phenotypes over sets of genotypes. To illustrate this property, we simulated a genetic architecture in which phenotypic measurements are determined by up to fourth-order effects plus a moderate amount of measurement noise (Fig. 1c). The RFA terms computed from the simulated measurements accurately estimate the true effects; errors in the estimated terms are smaller than the noise in the individual phenotypic measurements, even for the highest-order terms. The fraction of phenotypic variance explained by the computed terms is also accurate.Third, when data are partially sampled, RFA models can be accurately estimated by least-squares regression. When 50% of genotypes are missing from the simulated example, the estimated terms of the model and the variance partition are highly accurate (Fig. 1c, Supplementary Fig. 1). RFA can be accurately estimated by regression because its true terms minimize the sum of squared error across all genotypes, so least-squares estimates converge on the true values as long as noise and sampling are unbiased. Truncated models can be estimated accurately because the patterns of variation produced by the unmodeled higher-order interactions appear as noise around lower-order predictions, so they cannot be absorbed by the model (Supplementary Section 2.9).Shortcomings of reference-based analysisReference-based analysis (RBA) is less suited in both theory and practice for analyzing a protein’s global genetic architecture. The causal genetic factors in RBA are not amino acid states but mutations when introduced into a designated wild-type reference sequence (Fig. 2a). Each first-order effect is defined as the difference in phenotype between the one variant that contains that single mutation and the wild-type. Each second-order interaction effect is the difference between the phenotype of the one double mutant and that expected from the sum of the first-order effects. This structure continues for higher-order mutants, invoking interactions whenever one variant’s phenotype deviates from the sum of lower-order effects.Fig. 2: Reference-based analysis (RBA) is unsuitable for inferring global genetic architecture.a The apparent complexity of genetic architecture varies depending on the choice of wild-type genotype. A first-order RBA model is directly fitted to the genetic architecture in Fig. 1a, with (A, A), (B, A), or (B, B) as wild-type. λ0, wild-type phenotype; λ1, λ2, first-order effects of mutations at sites 1 and 2; empty circle, predicted phenotype of the double mutant; r2, fraction of phenotypic variance explained by the model. b (Left) Absolute error of RBA terms computed from the simulated measurements with a small amount of noise in Fig. 1c. Dashed lines, mean absolute error of individual phenotypes. (Right) Fraction of phenotypic variance explained by the true and computed RBA terms. c Regression yields incorrect estimates of RBA terms and overestimates the fraction of phenotypic variance they explain. The first panel fits the first-order RBA model defined with respect to (A, A) by regression; the estimated terms, predicted phenotype for each genotype, and r2 are shown. The next two panels repeat the analysis by choosing (B, A) or (B, B) as wild-type. d (Left) Fraction of phenotypic variance explained by the true and regression-estimated RBA terms for the simulated measurements in Fig. 1c. (Right) RBA terms estimated by fitting the full model, compared with their true values.RBA is useful in principle if one is interested in the effects and interactions of mutations when introduced into a particular sequence of interest43,44. Its structure is not suited, however, for understanding how sequence determines function across the space of possible variants. First, the wild-type-centric view means that the genetic architecture varies depending on the choice of wild-type genotype; in the example of Fig. 2a, first-order effects may make zero contribution to phenotypic variance or explain most of it, depending on the reference sequence chosen, and the pairwise interaction switches in both magnitude and sign. Second, the RBA formalism implies that proteins containing wild-type residues are unaffected by any of those states. The wild-type protein has no genetic determinants whatsoever because it contains no mutations. A point mutant is subject to the first-order effect of one mutation but is by definition unaffected by epistasis. A double mutant is shaped by one pairwise interaction but no higher-order interactions, and so on. In reality, these proteins have a genetic architecture just as interesting and complex as those of sequences distant from the wild-type. Finally, RBA efficiently explains phenotypic variation in the neighborhood of the reference sequence but produces a less parsimonious description of a protein’s genetic architecture over sequence space as a whole. In the absence of noise, the zero-order RBA term predicts the wild-type sequence with perfect accuracy but is less accurate across all sequences than the global mean. The first-order RBA terms perfectly predict the point mutants, and the second-order terms exactly predict the double mutants, but across the vast number of other sequences these terms are less accurate predictors and thus leave more variation to be explained by higher-order terms. RBA thus infers a genetic architecture that is more complicated and idiosyncratic than is necessary to explain the distribution of phenotype across sequence space.A second concern is that in practice, the RBA model cannot be accurately estimated from noisy and partially sampled datasets, either by exact computation or by regression. Exact RBA is hypersensitive to measurement noise: each term is calculated as a chain of sums and subtractions of phenotypic measurements, so the noise of each measurement propagates when estimating high-order terms. This phenomenon is illustrated in Fig. 2b: using the same simulated measurements in Fig. 1c, the calculated RBA terms are dramatically incorrect, with errors larger than that of the individual measurements and snowball as the order increases. When the computed terms at each order are used to predict the phenotype, high-order epistasis appears to be far more important than it actually is under the true RBA architecture (Fig. 2b). Exact computation of RBA is also incompatible with missing data: if a variant is unmeasured, it becomes impossible to compute the effect of the mutation and all the interactions that involve it.To cope with this limitation of exact estimation of RBA models, an alternative approach has been to use least-squares regression: a series of truncated RBA models are fit to the data to estimate the variance explained by the model at each order, and the complete RBA model is then used to estimate the individual effects7,19,20. This procedure yields biased estimates that oversimplify the genetic architecture under the true RBA model. Consider the simple example of Fig. 2c, setting (A, A) as the reference genotype. In the true RBA model, first-order terms explain no variance, all of which is caused by the pairwise interaction; when fit by regression, however, 67% of variance is explained at the first order, leaving only 33% attributable to the interaction. The estimated terms of the truncated first-order model are also inflated in magnitude. Another anomaly is that the fraction of first-order variance estimated by regression is the same irrespective of the reference genotype chosen, whereas under the true RBA architecture this quantity depends strongly on the reference.Measurement noise further undermines the accuracy of RBA-regression. For the simulation with mild noise (Fig. 1c), the terms estimated by RBA-regression using each truncated model deviate substantially from the true RBA terms (Supplementary Fig. 1), and the variance partition implies a genetic architecture far simpler than the true RBA architecture (Fig. 2d). When the complete model is finally fit, the estimated terms deviate wildly from the true terms, with particularly large errors for the high-order interactions. If taken at face value, these two observations would suggest the anomalous conclusion that high-order interactions are widespread and very large but somehow contribute negligibly to phenotypic variance.The bias of RBA-regression has been previously established29, and it exemplifies a general bias that arises whenever regression is used to fit uncentered interaction models in which variables are correlated across orders45. RBA-regression oversimplifies the RBA variance partition because regression finds the parameter values that minimize the sum of squared error between observed and model-predicted phenotypes across all variants. The true RBA terms are defined only by the phenotypes of mutants at that order and below; they do not minimize the squared error for higher-order variants if those variants are affected by noise or higher-order interactions. Regression therefore optimizes the low-order terms to fit the variation caused by effects that are excluded from the truncated model; the resulting estimates of low-order terms are incorrect and the fraction of variance they explain is overestimated. When the complete model is used for the final coefficient estimates, using regression is identical to exact computation, so measurement noise propagates into wildly inflated epistatic terms. Because RBA-regression produces biased and anomalous results, we do not explore the behavior of this method further.RFA is more interpretable and robust than other global formalismsLike RFA, Fourier analysis (FA) and background-averaging (BA) take a bird’s-eye view of genetic architecture, but RFA has a more interpretable structure and is more robust to missing data. In RFA, each model term directly expresses the global phenotypic effect of an amino acid or combination, and a variant’s phenotype is the sum of the effects of only the states in its sequence (Fig. 3a). In FA, each sequence state is recoded as a series of (1, –1) coordinates over (q – 1) Fourier dimensions, where q is the number of states (Supplementary Section 1.2). Each first-order Fourier term represents the effect of having a positive or negative coordinate along one of these dimensions. If more than two amino acid states are present, Fourier terms have no straightforward genetic or biochemical meaning. With 20 states, for example, the effect of each amino acid at a site is a uniquely signed sum over 19 first-order Fourier terms, each pairwise amino acid interaction is a signed sum over 361 second-order Fourier terms, and so on. The phenotype of any variant is therefore a sum over every term in the entire model (Fig. 3a). This complex mapping makes it difficult to understand how a variant’s phenotype arises from its sequence.Fig. 3: RFA is more interpretable and robust than Fourier analysis (FA) and background-averaged analysis (BA).a Mapping of model terms to phenotype in RFA, FA, and BA shown for a two-amino-acid sequence YP. fi(k) and fi, j(k), k-th first- and second-order FA terms for site i and site-pair (i, j). bi(α) and bi, j(α, β), first- and second-order BA terms for amino acids α and β; the summations shown are over all 19 mutant amino acids. b RFA is more robust to missing genotypes than FA and BA. Genetic architectures with the indicated number of states and sites were simulated by drawing up to third-order reference-based effects from the standard normal distribution. Third-order models were fit to a varying fraction of genotypes and evaluated by predicting the phenotypes of the remaining genotypes. The mean and 95% interval of out-of-sample R2 across 200 trials are shown. c RFA with nonspecific epistasis does not misinterpret high-order epistasis as clusters of low-order interactions. Phenotypes were simulated with only third-order determinants (distribution shown in the first panel) under a sigmoid link function (second panel). First-, second-, and third-order reference-free effects were inferred from all genotypes or a random 40%. The distribution of inferred effects and the variance they explain are shown in the right panels.In BA, each term is defined as the average effect of a state (or combination) relative to some arbitrarily chosen reference state (typically the first “letter” in the alphabet of sequence states), and the phenotype is a weighted sum over all terms in the entire model, including the coefficients for states not in the genotype of interest (Fig. 3a). As in FA, the effects of each amino acid or combination can be derived from the model terms only via an elaborate set of equations when more than two amino acids per site are considered (Supplementary Section 1.3).FA and BA models can be estimated by regression, but RFA is more robust to partial sampling. We simulated genetic architectures of varying shape and removed a variable fraction of genotype measurements; we then fit the three models to the remaining sequences by regression and predicted the phenotypes of the excluded genotypes using the estimated models (Fig. 3b). When there are only four states per site, all models have high predictive accuracy, which declines only when the fraction of sampled sequences drops below 1%, at which point RFA is slightly more accurate. When there are 16 states, however, RFA is much more robust than BA, the accuracy of which degrades rapidly as sample size shrinks; it is also more robust than FA, but to a smaller extent. RFA is more robust to missing genotypes because the phenotype of each unsampled variant is predicted as the sum of only the terms for its genetic states; FA and BA predict the phenotype as a weighted sum of all model terms, so the error associated with every model term propagates to all genotypes. This difference is exacerbated as more states are considered, because the total number of terms increases exponentially with the number of states.Incorporating nonspecific epistasisNonspecific epistasis can be incorporated into RFA by using a generalized linear model in which the phenotype of a variant is a nonlinear transformation of the effects of its genetic states25 (Fig. 1b). The total effect of a variant’s genetic states is its genetic score, and its phenotype is a nonlinear transformation of the score. The parameters of the link function from score to phenotype can be inferred by regression in a joint fitting procedure along with the specific RFA genetic effects.We explore using a sigmoid link function to incorporate nonspecific epistasis (Fig. 1b). We reasoned that most DMS datasets are likely to involve a limited dynamic range, and the sigmoid function can account for the diminishing effects of amino acid states in variants that are near the minimum or maximum of this range. The sigmoid also contains only two free parameters, which facilitates estimation and interpretation. Although the mechanisms and precise forms of nonlinearity are likely to be complex and vary among datasets, we explore here whether this simple and common form of nonspecific epistasis might be an important factor in protein genetic architecture.We used simulations to determine whether regression can be used to accurately estimate the RFA model coupled with sigmoid nonspecific epistasis. We were particularly interested in whether this procedure might oversimplify the genetic architecture by misinterpreting true high-order interactions as nonspecific epistasis or as clusters of low-order interactions. We first simulated phenotypes under a genetic architecture that contains only third-order effects plus nonspecific epistasis and then fitted RFA models (with the sigmoid link) truncated at various orders (Fig. 3c). The first- and second-order truncated models correctly explain no phenotypic variance and detect no first- or second-order effects. When the third-order model is used, all variance is correctly attributed to third-order effects. Similar results hold when variants are only partially sampled.We next explored whether including the link function might absorb specific epistasis when the true phenotypes are unaffected by global nonlinearity. We simulated measurements with specific epistasis derived from a real DMS dataset but imposed no nonspecific epistasis; we then fitted the RFA model with and without the sigmoid link function to these data (Supplementary Fig. 2). We found that variance partition across orders is estimated accurately, and the link function has no effect on these inferences. The minimum and maximum of the sigmoid function are estimated to be well beyond the range of phenotypic prediction, so the transformation has no effect.Taken together, these data indicate that the impact of limited dynamic range on genetic architecture can be effectively inferred by coupling RFA with a sigmoid link function. Under the realistic conditions we examined, this procedure does not artifactually absorb specific epistatic interactions or underestimate the true complexity of genetic architecture when nonspecific epistasis is weak or absent.Simplicity of protein sequence-function relationshipsTo understand the genetic architecture of real proteins, we performed RFA on 20 combinatorial mutagenesis datasets available for antibodies, enzymes, fluorescent proteins, transcription factors, viral surface proteins, and toxin-antitoxin pairs (Table 1). We considered only datasets with precise measurement (r2 > 0.9 among replicates) and sampling of at least 40% of possible variants. We focused primarily on large libraries but included three small ones in which high-order epistasis has been reported. The datasets range in size from 32 to 160,000 possible genotypes, with the number of variable sites ranging from 3 to 16 and the number of sampled states per site from 2 to 20. To assess the complexity of each dataset, we fitted a series of truncated reference-free models of increasing order, each time using the sigmoid link function to incorporate nonspecific epistasis and L1 regularization to reduce overfitting; we then used cross-validation to estimate the fraction of phenotypic variance explained at each order as the out-of-sample R2, which measures how well a model inferred from a random subset of data can predict the phenotypes of unsampled variants.Table 1 Combinatorial mutagenesis datasets analyzed in this studyAcross all proteins examined, most phenotypic variance is explained by first-order effects of amino acids and virtually all of the remainder by pairwise interactions. The first-order model achieves a median R2 of 0.91 across the 20 datasets—with a maximum of 0.97 and greater than 0.75 in all but four cases (Fig. 4a). When pairwise interactions are included, virtually all genetic variance is explained, with a median out-of-sample R2 of 0.96 and a minimum of 0.92 across the datasets. There is no relationship between the fraction of variance explained at low orders and the number of sites or states assayed (Supplementary Fig. 3).Fig. 4: Simplicity of protein sequence-function relationships.a RFA of 20 combinatorial mutagenesis datasets (Table 1). First-, second-, and third-order models with the sigmoid link function were evaluated by cross-validation—by inferring the model from a subset of data and predicting the rest of data. Each dot shows the mean out-of-sample R2 for one dataset; boxplots show the median, interquartile range, and total range across the datasets. Supplementary Fig. 3 shows the out-of-sample R2 for individual datasets. b Variants possibly affected by strong high-order epistasis were identified as outliers in the second-order model (residual > 20% of the phenotype range). (Left) Outliers in the ParD3-ParE3 (203) dataset. Each point is a variant, plotted by its observed and predicted phenotype. (Right) Proportion of outliers in each dataset. c RBA of the 20 datasets. Each truncated model was fit to the phenotype of all mutants up to the specified model order using as a reference the genotype designated as wild-type in the original publication; nonspecific epistasis was accounted for as in (a). Accuracy of prediction of higher-order mutants by each model is shown as R2, with negative values plotted as zero. Only higher-order mutants for which all the relevant lower-order effects can be computed were predicted. Supplementary Fig. 6 shows the R2 for individual datasets.Incorporating third-order terms confers only a marginal or no improvement in fit (median change in out-of-sample R2 of 0.02, maximum 0.04). The very small fraction of phenotypic variance unexplained by the third-order model represents some combination of fourth- and higher-order epistasis, measurement noise, and limitations in the sigmoid link function to accurately capture nonspecific epistasis. The inferred simplicity of the architecture is not attributable to the use of regularization (Supplementary Fig. 4). The estimated third-order effects are generally of small magnitude, and by nature each one affects fewer genotypes than the low-order effects, explaining why together they have a small impact on genetic variation (Supplementary Fig. 5).Although high-order epistasis is negligible across sequence space as a whole, there could still be a subset of genotypes shaped by strong high-order epistasis. To address this possibility, we analyzed the residuals of the second-order model, which represent the sum of all higher-order interactions and measurement noise. Genotypes with a residual greater than 20% of the phenotype range were considered candidates for strong higher-order epistasis, although erratic measurement noise cannot be excluded. The proportion of such genotypes is zero in six datasets and between 0.02 and 2% in the others (Fig. 4b). Only a tiny fraction of genotypes is therefore potentially affected by strong high-order epistasis.These analyses show that the genetic architecture of proteins is simple: knowing just the first-order effects and pairwise interactions, coupled with a simple model of nonspecific epistasis, is sufficient to accurately predict and explain phenotypes across the entire ensemble of sequences. Higher-order interactions are not completely absent, but they are weak and limited to a very small fraction of genotypes.We also examined the 20 datasets using RBA. We exactly computed the first-, second-, and third-order RBA models, using the sigmoid link function with parameters that maximize predictive accuracy for all genotypes. We then used each fitted model to predict the phenotypes of the higher-order mutants not used to compute the model. The median R2 across datasets is less than 0.2 for all three model orders; the vast majority of phenotypic variation is thus left to be explained by higher-order epistasis (Fig. 4c). The RBA formalism therefore leads to a complex and idiosyncratic description of the genetic architecture of these proteins.Phenotype bounding is the major cause of nonspecific epistasisTo understand the impact of incorporating nonspecific epistasis, we compared RFA of the empirical datasets when estimated with and without the sigmoid link function. We found that incorporating nonspecific epistasis dramatically improves phenotype prediction and reduces the variance attributed to epistasis (Fig. 5a, b). Using the sigmoid link raises the median out-of-sample R2 of first-order models from 0.59 to 0.92, reducing the variance attributable to specific epistasis by a factor of 5. For second-order models, it improves the median R2 from 0.87 to 0.96, reducing the variance explained by higher-order epistasis by a factor of 3. For third-order models, incorporating nonspecific epistasis increases the median R2 from 0.95 to 0.98.Fig. 5: The primary cause of nonspecific epistasis is phenotype bounding.a RFA of the 20 datasets without incorporating nonspecific epistasis, shown as in Fig. 4a. b Incorporating nonspecific epistasis reduces the fraction of phenotypic variance attributed to pairwise and higher-order interactions. Each dot shows the variance component attributed to each model order for one dataset when computed with or without incorporating nonspecific epistasis. c Nonspecific epistasis causes the phenotypic effect of a mutation (∆y) to vary among genetic backgrounds (magenta versus green) even when the effect on genetic score (∆s) is constant. Phenotype bounding is a particularly strong form of nonspecific epistasis that causes mutations to appear nearly neutral on backgrounds close to the bounds but not on others. d The extent to which the sigmoid link function improves the model fit (the difference between out-of-sample R2 in (a) and that in Fig. 4a) is proportional to the fraction of genotypes at the phenotype bounds. e In an experimental dataset in which only 0.1% of genotypes are within the bounds, incorporating nonspecific epistasis improves the fraction of variance explained by first-order effects from 0.01 to 0.97.The dramatic improvement in fit conferred by the simple sigmoid function suggests that phenotype bounds—biological or technical limits on the dynamic range over which genetic states have measurable effects on function—are the major cause of nonspecific epistasis in these datasets (Fig. 5c). Corroborating this conclusion, the degree to which the link function improves the R2 is tightly correlated with the fraction of genotypes at the phenotype bounds (Fig. 5d). In the CR9114-B dataset, for example, 99.9% of genotypes are at the lower bound, and incorporating nonspecific epistasis improves the first-order variance explained from 1% to 97% (Fig. 5e). Conversely, in the CH65-MA90 dataset, virtually all genotypes are within the dynamic range, and using the sigmoid link function has little effect on the variance partition.Although the causes of nonspecific epistasis are likely to be complex and vary among datasets, these results indicate that the simple sigmoid link function effectively captures its most salient features and allows the specific genetic architecture to be described economically.Sparsity of protein sequence-function relationshipsWe next asked whether protein function is determined by many genetic states and interactions of small effect or by a few determinants of large effect. For each dataset, we estimated the minimal number of reference-free terms required to predict the phenotype with 90% accuracy (T90): we ranked the terms in the fitted third-order model by their contribution to variance, constructed increasingly complex models by sequentially including each term, and estimated the accuracy of each model by cross-validation (Fig. 6a).Fig. 6: Sparsity of protein sequence-function relationships.a Measuring the sparsity of genetic architecture, illustrated using the CR9114-H1 dataset. RFA terms up to third order were estimated and ranked by the fraction of variance they explain (calculated using the simple method for computing the variance contribution of each term described in Methods). Models of increasing complexity were constructed by sequentially including each term, and each model was evaluated by cross-validation. Each dot represents a model, colored by the order of the last term added. Vertical line marks T90, the minimal number of terms required for an out-of-sample R2 of 0.9. b T90 as a function of the total number of genotypes. Dotted line, best-fit power function. Asterisk, GB1 dataset. Each T90 was estimated in two ways: as the number of terms required to reach R2 of 0.9 (upper error bar)—an overestimate because measurement noise prevents any model from attaining an out-of-sample R2 of 1—and as the number of terms required for an R2 equal to 90% of that of the full third-order model (lower error bar); circles show the average of the two estimates. c Fraction of all terms required to explain 90% of phenotypic variance shown against the total number of genotypes. Asterisk, GB1 dataset. Error bars show the possible maximum and minimum computed as in (b).The genetic architecture of proteins is very sparse (Fig. 6b). Out of up to 160,000 possible terms in each model, T90 ranges from just 6 to 44 across all datasets except for GB1, in which the mutated sites were specifically chosen to be enriched for epistatic interactions12. As the total number of possible genotypes (N) increases, T90 increases very slowly, so that the fraction of all terms required for an R2 of 0.9 declines almost linearly (Fig. 6c). These relationships hold irrespective of the number of states per variable site.Our findings suggest that even a very large genetic architecture should be describable with a compact set of terms. For example, the relationship between T90 and N predicts that a very large genetic architecture—two states at 100 variable sites, ~1030 possible genotypes and model terms—could be described with 90% accuracy by a model with just ~10,000 key terms.Inferring genetic architecture by sparse samplingAlthough a protein’s genetic architecture is defined by relatively few causal factors, identifying them could be challenging. Comprehensive experimental characterization is impractical for sequence spaces much larger than those we have analyzed, so a critical question is whether the important terms can be inferred from a small sample of genotypes by sparse learning methods15. To address this possibility, we sampled a variable number of genotypes from the datasets, fitted RFA models using regression with L1 regularization, predicted phenotypes of the unsampled genotypes, and determined N90, the minimum sample size required for R2 of 0.9 (Fig. 7a).Fig. 7: Inferring genetic architecture by random sampling.a Illustration using the CR9114-H1 dataset. Up to third-order RFA terms were inferred from a varying number of genotypes randomly sampled from the experimental results, and the estimated models were then evaluated by predicting the phenotypes of all unsampled genotypes. For each sample size, the mean and standard deviation of out-of-sample R2 across 10 trials are shown. Dashed line marks N90, the minimal sample size required for a mean out-of-sample R2 ≥0.9. b N90 as a function of the total number of genotypes. Error bars were computed as in Fig. 6b. The three datasets with 48 or fewer genotypes are not shown. c N90 as a function of T90, the minimal number of terms required to explain 90% of phenotypic variance (Fig. 6). d N90 as a function of the fraction of genotypes within phenotype bounds.We found that genetic architecture of proteins cannot be efficiently inferred from sparse random samples (Fig. 7b). Excluding the three small datasets, N90 ranges from 0.2 to 25% of the total number of genotypes, with a median of 5%. Even the lowest end of this range does not bode well for inferring the architecture of large sequence spaces with many states at many variable sites.We evaluated several factors that might determine the necessary sample size. First, we found that large sequence spaces require larger samples: N90 increases with the total number of genotypes, although there is a considerable scatter in this relationship (Fig. 7b). Second, the complexity of the genetic architecture is not a major factor: N90 depends only weakly on T90 (Fig. 7c). Finally, we found that the fraction of genotypes within the dynamic range of measurement is a critical factor: N90 increases sharply with the degree of phenotype bounding (Fig. 7d). An extreme case is the CR9114-B dataset (65,536 genotypes), where just 10 first-order effects account for 90% of phenotypic variance but approximately 16,000 genotypes are needed to identify them. This is because 99.9% of genotypes are at the lower bound, providing little quantitative information on genetic effects. By contrast, the CH65-MA90 dataset consists of the same number of genotypes, but the genetic architecture can be inferred from just 99 random genotypes because there is virtually no phenotype bounding.We conclude that despite the global simplicity of proteins’ genetic architecture, the important causal factors cannot be efficiently identified by sparse random sampling. A critical step is therefore to develop a sampling strategy that can efficiently identify the key first-order effects and pairwise interactions that define a genetic architecture.Understanding genetic architectureA benefit of coupling RFA with the sigmoid link function is that the genetic effects are expressed in a unit that is intelligible through a simple biophysical analogy, and they become comparable across datasets, even when different phenotypes are measured. The sigmoid model describes the phenotype of a variant as an equilibrium between two thermodynamic states: the functional state, whose phenotype is U, and the nonfunctional state, whose phenotype is L (Fig. 8a). A variant’s phenotype, lying between U and L, reflects the relative occupancy of the functional to nonfunctional state, which is determined by its genetic score (s) as es. The genetic score takes the role of the Gibbs free energy difference between the two states (∆G) expressed in the unit of –kT (the product of Boltzmann constant and absolute temperature). If a variant’s genetic score is 0, the two states are equally populated and its phenotype is midway between U and L. A sequence state or combination that increases the genetic score by 2.3 always causes a ten-fold increase in the relative occupancy of the functional state, corresponding to an apparent ∆∆G of –1.4 kcal/mol at 37 °C. This relationship holds across proteins, functions, and experimental systems.Fig. 8: Understanding genetic architecture.a Interpreting genetic score (s) as free energy difference (∆G). (Left) Relative occupancy of two thermodynamic states as a function of their ∆G. k, Boltzmann constant; T, absolute temperature. (Right) The sigmoid link function can be interpreted as describing an equilibrium between two states—the “functional” state, which produces a phenotype of U, and the “nonfunctional” state, which produces a phenotype of L. Their relative occupancy (pink versus blue lines) equals es, allowing s to be interpreted as ∆G in the unit of –kT. b CR9114-H3 dataset, which measures the affinity of all combinations of ancestral and derived amino acids at 16 sites in an antibody towards a hemagglutinin. (Left) First-order RFA. Each dot is a genotype, plotted by its estimated genetic score and measured phenotype. Histogram, distribution of genetic score; yellow curve, inferred sigmoid link; horizontal lines, inferred phenotype bounds; vertical line, mean genetic score; green and purple dots, ancestral and derived genotypes. (Right) First-order effects of amino acids at each site. c avGFP dataset, which measures the fluorescence of all combinations of pairs of amino acids at 13 sites. (Left) Second-order RFA. (Right) First-order effects and pairwise interactions, which account for 57 and 38% of phenotypic variance, respectively. Values are shown for one of the two of amino acids at each site. The ten pairwise interactions possible among sites 3, 5, 6, 9, and 10 are outlined. d Crystal structure of avGFP (PDB ID: 3e5w). The 13 mutated sites are shown in spheres, and the chromophore and its five surrounding sites are colored in red. e ParD3-ParE3 and ParD3-ParE2 (203) datasets, which measure the binding of all possible variants of ParD3 at three sites to ParE3, the cognate substrate, and ParE2, a noncognate substrate. (Left) First-order RFA. (Right) First-order effects at each site. Asterisk, wild-type amino acid. f Comparing the effect of each amino acid on ParE3 versus ParE2 binding. Wild-type amino acids are marked.We applied this framework to understand the genetic architecture of several example proteins. The CR9114-H3 dataset (Fig. 8b) consists of affinity measurements for binding of 216 antibody variants (all possible combinations of ancestral and derived amino acids at 16 sites that evolved during affinity maturation) to an influenza hemagglutinin. The vast majority of variants are at the lower bound of detectable binding, so the average genetic score is –7.8, corresponding to just 0.04% occupancy of the bound state, or ∆Gapp = 5.6 kcal/mol. The best variant has a score of just 2.6, corresponding to 93% occupancy and ∆Gapp = –1.9 kcal/mol. There is virtually no specific epistasis in this genetic architecture (Supplementary Fig. 3). First-order effects at three key sites mostly determine the phenotype: each favorable state increases the genetic score by 2.1 to 2.6 (∆∆Gapp < –2 kcal/mol); together, these states increase the relative occupancy by almost three orders of magnitude compared with the global average but still yield absolute occupancy of the bound state of just 36%. Five other sites make modest contributions, each changing the genetic score by ~0.5 and shifting the relative occupancy by ~1.3 fold. The remaining eight have even smaller effects. A variant must therefore have all three large-effect favorable states to achieve measurable binding, and the particular combination of states at the other sites modulates the affinity.Specific pairwise interactions are important in the avGFP dataset (Fig. 8c), accounting for 38% of variance in fluorescence measurements. There are many functional variants in this library, including a large number at the measurement maximum, so the average variant has a genetic score of –1 with the occupancy of the fluorescent state at 20%. First- and second-order effects involving just five of 13 variable sites account for 86% of variance. These sites, which tightly surround the chromophore in the crystal structure (Fig. 8d), engage in a dense epistatic network in which nine of the ten possible pairwise interactions are nonzero. Only four of these interactions alter the genetic score by more than 1, but their total impact is substantial, conferring an increase in genetic score by 7.8 and relative occupancy by 2400-fold (∆∆Gapp = –5.6 kcal/mol) when all are in the most favorable combination. Not all of these are necessary to achieve high fluorescence, however: because the global average has measurable fluorescence, one or more favorable states can be removed while leaving the other interactions intact.RFA terms can also be used to understand the determinants of functional specificity in multistate sequence space and when multiple functions are measured. The ParD3 library (all combinations of 20 states at 3 sites in the binding interface) was assayed separately for binding its cognate ligand ParE3 and the noncognate ligand ParE2. Effects on specificity can be quantified as the difference between a state’s effects on the genetic score with the two ligands. The average variant displays a weak but measurable binding to both ligands, with a preference for ParE3 over ParE2 by a genetic score of ~ 1 (difference in relative occupancy of 2.5-fold). For both ligands, first-order effects account for the vast majority of variance in binding (Fig. 8e). There are only eight amino acid states that can change the genetic score in favor of one ligand over the other by more than 1.6, each equivalent to more than 5-fold difference in occupancy (Fig. 8f). The three strongest of these each favor ParE3 by scores of 2.2 to 2.8 (~10-fold preference in occupancy, ∆∆Gapp ~ 2 kcal/mol). Two of these change specificity by increasing affinity for both ligands but more strongly enhancing ParE3 binding, and the third has opposite effects on the two ligands. The wild-type protein in this case possesses these three specificity-optimal states.

The simplicity of protein sequence-function relationships

Multi-output prediction of dose–response curves enables drug repositioning and biomarker discovery

Quick chemistry crossword #055

High-throughput identification of repurposable neuroactive drugs with potent anti-glioblastoma activity

Cryptic chemistry crossword #055

Do AI models produce more original ideas than researchers?

Hot Topics

Multi-output prediction of dose–response curves enables drug repositioning and biomarker discovery

Quick chemistry crossword #055

High-throughput identification of repurposable neuroactive drugs with potent anti-glioblastoma activity

Related Articles

Balancing Act: Pregnancy and Bipolar Disorder

Cohesion at the cellular level: flexible yet stable

Gut bacteria influence responses to immunotherapy in patients with asbestos related cancer

Quick Links

Must Read

Multi-output prediction of dose–response curves enables drug repositioning and biomarker discovery

Quick chemistry crossword #055

High-throughput identification of repurposable neuroactive drugs with potent anti-glioblastoma activity

Cryptic chemistry crossword #055

Popular Articles

Multi-output prediction of dose–response curves enables drug repositioning and biomarker discovery

Quick chemistry crossword #055

High-throughput identification of repurposable neuroactive drugs with potent anti-glioblastoma activity