Improving rigor and reproducibility in western blot experiments with the blotRig analysis

Designing reproducible western blot experiments

Determining linear range for each primary antibody

Most WB analyses semi-quantitatively assume that the relationship between qWB assay optical density data (i.e., western band signal) and protein abundance is linear2,3,11,18. Accordingly, most qWB analyses use statistical tests (t-test; ANOVA) that assume a linear effect. However, recent studies have shown that this relationship can be highly non-linear19. As Fig. 1 illustrates, the WB band signal can become non-linearly related to protein concentration at low and high values. This can produce inaccurate quantification of the relative amount of target protein and violates the assumptions of linear models, which can lead to false inferences. To address the assumption of linearity, it is important to first determine the optimal linear range for each protein of interest, so that one can be confident that a unit change in band density reflects a linear change in protein concentration. This enables an experimenter to accurately quantify the protein of interest and apply linear statistical methods appropriately for hypothesis testing.

Counterbalancing during experimental design

Counterbalancing is the practice of having each experimental condition represented on each gel and evenly distributed, to prevent overrepresentation of the same experimental group in consecutive lanes. For example, imagine an experimental design in which we are studying two experimental groups (wild-type and transgenic animals) under two treatment conditions (drug and vehicle). The best way to determine the effects of, and interactions between, our experimental and treatment groups is a balanced factorial design, in which all combinations of levels across factors are represented.
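For a concrete illustration, the full set of group combinations in a factorial design can be enumerated programmatically. This short Python sketch (group names are illustrative, not from any blotRig API) crosses the two factors of the example:

```python
from itertools import product

# Factors and levels from the example design (names are illustrative)
genotypes = ["Wild Type", "Transgenic"]
treatments = ["Drug", "Vehicle"]

# A full factorial design crosses every level of every factor
groups = [f"{t}-treated {g}" for g, t in product(genotypes, treatments)]
print(groups)  # one group per genotype x treatment combination
```

With two factors of two levels each, this yields the four groups discussed in the text; adding a factor or level simply extends the cross product.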
For the current example, a balanced factorial design produces four groups covering each possible combination (drug-treated wild type, vehicle-treated wild type, drug-treated transgenic, and vehicle-treated transgenic) (Fig. 2A). During WB gel loading, experimenters often distribute their samples unevenly, such that certain experimental conditions are missing from some gels or samples from the same experimental condition are loaded in adjacent lanes. This is problematic because polyacrylamide gel electrophoresis (PAGE) gels are not perfectly uniform, a known source of technical variability43; in the worst case, if only a single experimental group is loaded on a gel and a significant group effect is found, we cannot tell whether the effect is due to the experimental condition or to a technical problem with the gel. At minimum, experimenters should ensure that every group in a factorial design is represented on each gel, to avoid confounding technical gel effects with experimental differences. If there are too many combinations to represent on a single gel, because of the number of factors or of levels within factors, a smaller "fractional factorial" design will provide maximal counterbalancing and unbiased estimates of all factor effects and the most important interactions.

In addition, experimenters can further counter technical variability by arranging experimental groups within each gel to ensure an adequately counterbalanced design, assuming uniform protein concentration and fluid volume across all samples. This addresses variability due to physical effects within an individual gel. In our example, this means alternating tissue areas and experimental conditions as much as possible so that similar samples are not loaded next to one another (Fig. 2B).
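The within- and across-gel counterbalancing described above can be sketched in code. The following Python function is a simplified round-robin illustration (not the actual blotRig algorithm): it interleaves groups so that every gel carries one sample from each group and same-group samples never sit in adjacent lanes.

```python
from collections import deque

def counterbalanced_gels(samples_by_group, lanes_per_gel):
    """Assign samples to gels so every gel carries each group and
    same-group samples are not placed in adjacent lanes.
    A round-robin sketch, not the actual blotRig algorithm."""
    queues = {g: deque(s) for g, s in samples_by_group.items()}
    gels = []
    while any(queues.values()):
        gel = []
        while len(gel) < lanes_per_gel and any(queues.values()):
            # cycle through groups in a fixed order -> round-robin interleave
            for g in queues:
                if queues[g] and len(gel) < lanes_per_gel:
                    gel.append((g, queues[g].popleft()))
        gels.append(gel)
    return gels

# Hypothetical sample IDs for the four factorial groups
samples = {
    "WT-Drug": ["s1", "s2"], "WT-Vehicle": ["s3", "s4"],
    "TG-Drug": ["s5", "s6"], "TG-Vehicle": ["s7", "s8"],
}
gels = counterbalanced_gels(samples, lanes_per_gel=4)
# each 4-lane gel holds one sample from every group
```

With eight samples and four lanes per gel, this produces two gels, each a complete, non-adjacent representation of the factorial design.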
By counterbalancing across and within gels, we spread potential technical variability across all samples and mitigate technical effects that could bias our results. Proper counterbalancing also enables more rigorous statistical analysis that accounts for and removes additional technical variability25,26,32,33. Overall, this helps ensure that experimenters can reproduce a result in the future.

Technical replication

Technical replicates measure the precision of an assay or method by repeating the measurement of the same sample multiple times. The results of these replicates can then be used to calculate the variability and error of the assay or method13, which is important for establishing the reliability and accuracy of the results. Most experimenters acknowledge the importance of running technical replicates to avoid false positives and negatives due to technical error13. Beyond catching extreme results, technical replicates can account for differences in gel makeup, human variability in gel loading, and procedural discrepancies. In fact, most studies run at least duplicates; however, the experimental implementation of replicates (e.g., running replicates on the same gel or on separate gels) as well as the statistical analysis of replicates (e.g., dropping the "odd man out" or taking the mean or standard deviation) can differ greatly44,45. This variability ultimately impedes our ability to meaningfully compare results. For experimenters to establish accuracy and advance reproducibility in WB experiments, it is important to implement standardized, rigorous protocols for handling technical replicates11,13. In doing so, we can further reduce technical variability with statistical methods during analysis. As underscored previously, we recommend that technical replicates be counterbalanced on separate gels to mitigate any possible gel effect.
Additionally, by running triplicates, we can treat replicate as a random effect in a linear mixed model (LMM) during statistical analysis. Importantly, triplicates provide more values for estimating the distribution of technical variance, making the LMM more robust than duplicates alone would. This approach isolates and removes technical variance from biological variation, which ultimately improves our sensitivity to true experimental effects46.

In the following demonstration of statistical methods, we replicated all WB analyses in triplicate with a randomized, counterbalanced design. We then explore how the way technical replicates and loading controls are incorporated into analysis can substantially affect both the sensitivity of our results and the interpretation of the findings. An example mockup of a dataset illustrating the various ways in which western blot data are typically prepared for analysis can be found in Fig. 3.

Figure 3. Western Blot Gel and Replication Strategies. (A) Illustration of Western Blot Gel. This depiction of a typical multiplexed western blot gel highlights the antibody-labeled target protein bands of interest (green/yellow) and the housekeeping protein loading control that is always run and quantified in the same sample and lane as the target of interest. Total protein stain (fluorescent Ponceau stain) is shown in red and can be used as an alternative loading control. Specifically, quantification is typically executed on a single antibody-labeled channel for the target protein and housekeeping protein loading control (grayscale image). (B) Balanced Factorial Technical Replicate Strategy. Here we show the western blot data for the first 3 subjects from an example dataset. In a balanced factorial design, an equal number of samples from all possible experimental groups are represented on each gel.
This table shows the subject number, technical replicate, experimental group, and band quantifications for both the target protein and the loading control. The ratio of target protein to loading control is also calculated. (C) Other Common Technical Replicate Strategies. This example table shows two other ways western blot data are typically formatted. Some experimenters choose not to include technical replicates, quantifying only one sample per subject. In another replication strategy, technical replicates are averaged, which may bias or skew the data. We recommend running technical replicates on separate gels or batches and using gel/batch as a random factor when analyzing western blot data.

Statistical methodology to improve western blot analysis

Loading control as a covariate

Most qWB assay studies use loading controls (either a housekeeping protein or total protein within a lane) to ensure that there are no biases in the total protein loaded in a particular lane2,11,27. The most common way loading controls are used to account for variability between lanes is to normalize the target protein expression values by dividing them by the loading control values (Fig. 3), yielding a ratio of target protein to loading control2,47,48. However, ratios may violate the assumptions of common statistical tests used to analyze qWB data (e.g., t-test, ANOVA)49. This ultimately hinders our ability to statistically account for the variance in qWB outcomes and to obtain reliable statistical estimates. An alternative approach with better parametric properties is to include the loading control values as a covariate: a variable that is not one of our experimental factors but that may affect the outcome of interest and thus presents a source of variance we can account for50. For instance, we know the amount of protein loaded is a source of variability in WB quantification, so we can use the loading control as a covariate to adjust for that variance.
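The difference between the two approaches can be sketched with toy numbers. Dividing assumes the target scales strictly proportionally with the loading control, whereas a covariate model estimates that relationship from the data. The following Python sketch (invented values; a real ANCOVA would model group effects jointly rather than adjusting in a separate step) shows the covariate-style adjustment in its simplest form:

```python
from statistics import mean

# Toy densitometry values (invented for illustration)
target  = [10.2, 11.8, 9.5, 14.9, 16.1, 13.8]   # band intensities
loading = [1.00, 1.15, 0.92, 1.05, 1.10, 0.95]  # loading-control intensities

# Common approach: normalize by dividing (assumes strict proportionality)
ratios = [t / l for t, l in zip(target, loading)]

# Covariate approach (sketch of the ANCOVA idea): estimate how target
# scales with the loading control, then compare loading-adjusted values
mx, my = mean(loading), mean(target)
slope = sum((x - mx) * (y - my) for x, y in zip(loading, target)) / \
        sum((x - mx) ** 2 for x in loading)
adjusted = [y - slope * (x - mx) for x, y in zip(loading, target)]
```

The adjusted values preserve the original scale of the target measurements (their mean is unchanged) while removing the portion of variance the loading control can explain, instead of forcing a fixed one-to-one scaling as a ratio does.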
In doing so, we extend ANOVA to ANCOVA51. This approach accounts for the technical variability present between lanes while meeting the assumptions required for parametric statistics, which helps curb bias and avert false discoveries.

Replication and subject as a random effect

Most WB studies analyze quantitative WB data with ANOVA, a test that compares the means of three or more independent samples49. One of the assumptions of ANOVA is independence of observations49. This is problematic because we often collect multiple observations from the same analytical unit, for example different tissue samples from a single subject, or technical replicates. Those observations do not qualify as independent and should be analyzed with models that control for variability within units of observation (e.g., the animal), to mitigate the inferential errors (false positives and negatives)52 caused by what is known as pseudoreplication. Pseudoreplication arises when the number of measured values exceeds the number of actual replicates and the statistical analysis treats all data points as independent, so that each contributes fully to the final result53.

In addition, when conducting experiments, it is important to consider the randomness of the conditions being observed. Treating both subjects and conditions as fixed effects can lead to inaccurate p-values. Instead, subjects/animals should be treated as random effects, and the conditions should be considered a sample from a larger population54. This is especially important when collecting data from different replicates or gels, as separate technical replicate runs should be considered random.

In Fig. 4 we use a simple experimental design comparing a target protein between two experimental groups to demonstrate four of the most common ways researchers analyze western blot data: (1) running each sample once without replication, (2) treating each technical replicate as an independent sample, (3) taking the mean of technical replicate values, and (4) treating subject and replication as random effects (Fig. 4). We then tested how effect size, power, and p value are affected by each of these strategies to get a sense of how much these estimates vary between analyses. For each strategy, we also tested the difference between using the ratio of target protein to loading control versus using the loading control as a statistical covariate. For further exploration of the way these data are prepared and analyzed, see the data workup in Supplementary Figs. 1 and 2.

Figure 4. Effect of different replication and loading control strategies on statistical outcomes. Eight possible strategies are shown, representing the most common ways in which replication and loading controls are treated in a typical western blot analysis. Four replication strategies (no replication at all; three technical replicate gels treated as independent; the mean of three replicates; or replicate treated as a random effect in a linear mixed model) are crossed with two loading control strategies (target protein divided by loading control, or loading control treated as a covariate in a linear mixed model). (A) Effect Size: The standardized effect size coefficient is generally improved when the loading control is treated as a covariate rather than as the denominator of a ratio. (B) Power: Treating each replication as independent increases statistical power (due to the inaccurate assumption that technical replicates are unrelated, which artificially triples the n).
Conversely, by including the variability inherent in technical replicates in the statistical model, we identify and account for a major source of variability, improving power in a more appropriate way. (C) P value: As expected, when each replication is inaccurately treated as independent, the p value is low (due to artificially inflated n). We found that using the mean of replications with the loading control as a covariate also resulted in a p value below 0.05. The smallest p value was found when including replication as a random factor. Across these statistical measures, only when replication is included as a random factor and the loading control as a covariate do we see a strong effect size, high power, and a low p value.

In the first scenario, we imagined that no technical replication was run at all (by using only the first replicate). With this strategy, we found that the standardized effect size is weak, power is low, and the p value is high (Fig. 4). Second, we demonstrate how the analytical output would differ if we did run three technical replicates but treated each as independent. As discussed above, this strategy does not take into account the fact that each sample is run three times, and consequently the overall n of the experiment is artificially tripled. As one might expect, observed power is quite high and the p value is low (< 0.05). Power increases with sample size, so it is not surprising that power is much higher if we erroneously report a 3X larger sample size (i.e., pseudoreplication)53. In this case, the observed power is inflated, an artifact of inappropriate statistics, and the probability of a false positive is considerably increased relative to the expected 5%.

So, what would be a more appropriate way to handle technical replicates? One method researchers often use is to take the mean of their technical replicates.
This ensures that we are not artificially inflating our sample size, which is certainly an improvement over the previous strategy. With this strategy, we do find that our p value is less than 0.05 (when the loading control is treated as a covariate), but our power is still low. We have taken our replicates into account by collapsing across them within each sample, but this can be dangerous: if there is wide variation across replicates of a particular sample, the mean of three replicates could be an inaccurate estimate of the 'true' sample value. Ideally, instead of collapsing this variation, we want to add it to our statistical model, so that we can better understand how much variation comes randomly from within technical replicates and, in turn, how much variation is actually due to differences between our experimental groups.

To achieve this, we need to model both the fixed effects of all groups in a full factorial design and the random effect of replication across western blot gels. A model with both fixed and random effects is referred to as a linear mixed model (LMM). Using this strategy, we find that our effect size remains strong and our p value is low. Importantly, we now also have strong observed power (Fig. 4). This suggests that we can achieve greater sensitivity in a WB experiment with this approach. Specifically, if we implement careful counterbalancing while designing our experiments, we can use the variability between gels to our advantage during analysis with a linear mixed effects model55.

An LMM is recommended because it takes into account both the multiple observations within a single subject/animal in a given condition and the differences across subjects observed in multiple conditions. This reduces the chance of inaccurate p-values and improves reliability56.
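The pseudoreplication problem described above can be made concrete with a toy calculation: pooling technical replicates as if they were independent shrinks the standard error by roughly the square root of the replicate count, regardless of what the replicates actually tell us. A Python sketch with invented values:

```python
from statistics import stdev

# Toy data: 4 subjects x 3 technical replicates (values invented)
reps = {
    "s1": [10.1, 10.4, 9.9],
    "s2": [12.0, 11.7, 12.3],
    "s3": [9.5, 9.8, 9.4],
    "s4": [11.2, 11.0, 11.5],
}

# Pseudoreplication: pool all 12 values as if independent (n inflated 3x)
pooled = [v for vals in reps.values() for v in vals]
se_pooled = stdev(pooled) / len(pooled) ** 0.5

# Averaging replicates keeps the true n (4 subjects)
means = [sum(v) / len(v) for v in reps.values()]
se_means = stdev(means) / len(means) ** 0.5

# se_pooled comes out artificially smaller than se_means
```

The pooled standard error is smaller not because the measurement became more precise, but because the analysis pretends there are three times as many independent observations; an LMM instead keeps the replicate-level variance in the model without inflating n.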
Further, treating both subjects and replication as random effects generalizes the results to the population of subjects and to the population of conditions57.

Real world application of blotRig software for western blot experimental design, technical replication, and statistical analysis

We have designed a user interface that facilitates appropriate counterbalancing and technical replication for western blot experimental design. The 'blotRig' application runs through RStudio and can be found here: https://atpspin.shinyapps.io/BlotRig/. Upon starting the blotRig application, the user is prompted to upload a comma-separated values (CSV) spreadsheet. This spreadsheet should include separate columns for subject ID and experimental group. The user is then prompted to enter the total number of lanes available on their particular western blot gel apparatus. The blotRig software first runs a quality check to confirm that each subject ID (unique sample or subject) appears in only one experimental group. If duplicates are found, a warning specifies which subjects are repeated across groups. If no errors are found, a centered gel map is generated that illustrates the western blot gel lanes into which each subject should be loaded (Fig. 5A). Lane loading follows the two main principles outlined above: (1) each western blot gel should hold a representative sample of each experimental group, and (2) samples from the same experimental group are not loaded in adjacent lanes whenever possible. This ensures proper counterbalancing, limiting the chance that the inherent variability within and across western blot gels is confounded with the experimental groups we are interested in testing.

Figure 5. Example of the blotRig Gel Creator interface. (A) Illustration of the blotRig interface.
The user has entered their sample IDs, experimental groups, and the number of lanes per western blot gel. (B) The blotRig system then creates a counterbalanced gel map that ensures each gel contains a representative from each experimental group. This illustration shows the exact lane of each gel in which each sample should be run.

Once the gel map has been generated, the user can export it to a CSV spreadsheet. This sheet is designed to clearly show which gel each sample is on, which lane of that gel the sample occupies, which experimental group each sample belongs to, and, importantly, a repetition of each of these values for three technical replicates (Fig. 5B). The user will also see columns for Target Protein and Loading Control; these are the cells where the user can input their densitometry values upon completing their western blot runs. Once this spreadsheet is filled out, it is ready for blotRig analysis.

To analyze western blot data, users can upload the completed template exported in the blotRig experimental design phase, or their own CSV file, under the 'Analysis' tab (Fig. 6). The blotRig software first asks the user to identify which columns of the spreadsheet represent Subject/Sample ID, Experimental Group, Protein Target, Loading Control, and Replication. The software again runs a quality check to confirm that no subject/sample IDs are duplicated across experimental groups. If no errors are found, the data are ready to analyze. The blotRig analysis is then run using the principles discussed above. Specifically, a linear mixed model is fit using the lmer function in R, with Experimental Group as a fixed effect, Loading Control as a covariate, and Replication (nested within Subject/Sample ID) as a random factor.
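The duplicate-ID quality check described above can be sketched in a few lines of Python (the column names here are assumptions for illustration, not blotRig's actual implementation):

```python
import csv
import io
from collections import defaultdict

def check_unique_groups(rows, id_col="SubjectID", group_col="Group"):
    """Flag subject IDs that appear in more than one experimental group,
    mirroring the quality check in the text (column names assumed)."""
    seen = defaultdict(set)
    for row in rows:
        seen[row[id_col]].add(row[group_col])
    return sorted(sid for sid, groups in seen.items() if len(groups) > 1)

# Hypothetical CSV upload
data = io.StringIO(
    "SubjectID,Group\n"
    "s1,WT-Drug\n"
    "s1,WT-Drug\n"      # same group twice: fine (technical replicate)
    "s2,WT-Vehicle\n"
    "s2,TG-Drug\n"      # same subject in two groups: flagged
)
duplicates = check_unique_groups(csv.DictReader(data))
print(duplicates)  # flags s2
```

Repeated rows for the same subject within one group are expected (they are replicates); only a subject assigned to two different groups is reported as an error.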
The analytical output is then displayed, giving a variety of statistical results from the linear mixed model output table, including fixed and random effects and associated p values (Fig. 6). A bar graph of group means with 95% confidence interval error bars is also generated, along with a summary of the group means, standard errors of the mean, and upper/lower 95% confidence intervals. These outputs can be reported directly in the results sections of papers, improving the statistical rigor of published WB reports. In addition, since the entire pipeline is open source, the blotRig code itself can be reported to support transparency and reproducibility.

Figure 6. Workflow for running statistical analysis of replicate western blot data using blotRig. First, fill out the spreadsheet with subject ID, experimental group assignment, technical replicate number, and the densitometry values for your target proteins and loading controls. After saving this spreadsheet as a .csv file, it can be uploaded to blotRig. Tell blotRig the exact names of each of your variables, then click 'Run Analysis'. This produces statistical output from a linear mixed model testing for group differences, using the loading control as a covariate and replication as a random effect. A bar graph with error bars and summary statistics can then be exported.
