Adherence to non-pharmaceutical interventions following COVID-19 vaccination: a federated cohort study

Enrollment

All study activities, including enrollment and data collection, were conducted on the “Google Health Studies” mobile app, downloaded from the Google Play Store. A convenience sample was recruited via blog posts, webinars, and online advertisements. Those interested were instructed to download the app and enroll in the “Respiratory Health Study”, where a study description was provided. Eligibility was confirmed by reporting state of residence and age (the minimum age of 18–21 years varied with each state’s age of majority). If age and state requirements were met, participants were provided with an in-app consent form and a phone number to contact with additional questions. Consent was indicated by clicking a box at the bottom of the form. Participants were enrolled for a 6-month period.

The study (Clinicaltrials.gov: NCT04663776) was approved by the Boston Children’s Hospital institutional review board (IRB-P00036213), with a waiver for documentation of written informed consent.

Measures

Following consent, participants completed an initial survey collecting demographics (e.g., gender and race/ethnicity) and their work and home addresses. Individuals who did not complete the initial survey and/or did not grant mobility data permissions were excluded.

Each week while enrolled, participants were sent an app notification to complete a survey asking about their use of preventive measures, including consciously practicing social distancing and wearing face masks while in public. Participants responded using a 5-point Likert scale ranging from “never” to “always.” Beginning in March 2021, participants were asked whether they had received a COVID-19 vaccine and, if so, the date of the first dose and of the second dose (if received), or to indicate that a second dose was not required based on the specific vaccine received. Unvaccinated participants were asked if they planned to receive the vaccine when available.

Mobility data were recorded daily from device sensors (e.g., GPS)17 on the following measures: time spent at home, time spent at work (both based on the participant’s reported work and home addresses), and number of unique places visited. Each completed survey was associated with average daily mobility measurements from the 14-day period prior to the survey completion date. Data were discarded if fewer than 7 days were measured in a 14-day period.

Privacy overview

The Google Health Studies app employs three primary technologies to protect participant privacy (Fig. 3) and participants’ control of raw data: federated analytics (FA), secure aggregation (SecAgg), and differential privacy (DP). Detailed descriptions of the privacy technologies employed are available in the “Privacy details” section.

Fig. 3: Federated Cohort Study Framework. Federated analytics (FA) refers to the process of broadcasting statistical computations (“federated queries”) to client devices, executing those computations locally over each device’s raw data, and aggregating the local results without ever making any data from individual devices available to engineers or researchers. In this study, the results are securely aggregated using a cryptographic protocol (SecAgg) that prevents the central server from learning any individual device’s results18,19. On the server, noise drawn from a Laplace distribution is added to the aggregates to achieve differential privacy23 before they are written into an encrypted datastore. Only these differentially private aggregates are accessible to researchers for further analysis and publication.
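To build intuition for the pipeline in Fig. 3, the following is a minimal sketch of the pairwise-masking idea underlying secure aggregation. It is not the production SecAgg protocol, which additionally handles cryptographic key agreement and client dropouts18,19; the names, modulus, and values here are illustrative only.

```python
import random

def pairwise_masks(client_ids, modulus=2**32, seed=0):
    """Give every pair of clients a shared random mask that one adds
    and the other subtracts, so all masks cancel in the final sum."""
    rng = random.Random(seed)  # stands in for pairwise key agreement
    masks = {cid: 0 for cid in client_ids}
    for a in client_ids:
        for b in client_ids:
            if a < b:
                m = rng.randrange(modulus)
                masks[a] = (masks[a] + m) % modulus
                masks[b] = (masks[b] - m) % modulus
    return masks

# Each client holds one private value (e.g., a local histogram count).
values = {1: 5, 2: 0, 3: 2, 4: 1}
masks = pairwise_masks(list(values))

# The server sees only masked values, which individually look random...
masked = {cid: (v + masks[cid]) % 2**32 for cid, v in values.items()}

# ...yet their sum equals the true total, and reveals nothing more.
total = sum(masked.values()) % 2**32
assert total == sum(values.values())
print(total)  # 8
```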
FA is a method for analyzing raw data stored locally on participants’ devices without any centralized server or researcher accessing that raw data. Instead, local computations are conducted on individual devices, and statistics such as counts and quantiles are aggregated from them without revealing any individual-level values to engineers or researchers18. Via the SecAgg protocol, individual-level values are not revealed to any server, even transiently. Our implementation of the SecAgg protocol is secure against an honest-but-curious adversary with a joint view of the server and up to 20% of the client devices contributing data to the aggregation; i.e., a “passive” adversary who possesses such a view but does not actively interfere with the protocol cannot gain any additional information about the other contributors’ inputs. It is also robust against up to 1/3 of the clients dropping out during protocol execution and maintains input privacy regardless of the number of clients that drop out18,19,20.

In addition to collecting aggregate results via SecAgg, we applied differential privacy (DP) to the aggregate output as a secondary mechanism to promote privacy21,22. Note that informed consent communications did not discuss DP, and participants’ opt-in was not contingent on DP guarantees. DP is a rigorous, mathematical definition of privacy that characterizes the impact a single participant’s contribution can have on the result of a computation23. DP data-processing mechanisms are randomizing; they guarantee statistically similar aggregate outputs regardless of whether a single participant’s data was included, thereby ensuring that no individual’s contribution can be inferred with certainty from any single output. To achieve this, random noise drawn from a Laplace distribution is added to the aggregate data before it is made visible to researchers.

Intuitively, the injection of noise to achieve DP adds a degree of uncertainty to the collected data that is calibrated to give each individual participant plausible deniability about their own contribution while still preserving the scientific utility of the aggregate result. This privacy-accuracy tradeoff is illustrated in Supplementary Fig. 2.

Privacy details

Our FA framework imposed some restrictions on the study:

1. Participant data is only accessible to researchers via aggregation operations, summation in particular. We therefore estimate continuous variables, such as time spent at home or unique places visited, by bucketing these data and aggregating histograms across participants (see the sketch after this list).

2. To further protect participant privacy, our FA infrastructure intentionally breaks any connection between participant identifiers and the aggregate results to which they contribute. While each participant can contribute at most once to each FA query, the lack of individual identifiers prevents us from computing the number of unique contributors across multiple queries.

3. Data recording on participants’ devices was decoupled from aggregate data collection at the server, so participants who filled out surveys or recorded activity data did not contribute to the study results if they deleted the application or that data from their devices before FA aggregations could be performed.

4. Only participants whose mobile devices were reachable and could rendezvous with other participants during securely aggregated FA aggregations (arranged in repeated minutes-long windows over the course of a few weeks) could contribute their data to the analysis. To minimize the impact on participants’ mobile devices, devices were only eligible to contribute when idle, charging, and connected to an unmetered network.
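To make restriction 1 concrete, here is a minimal sketch of reducing a continuous mobility measure to a one-hot histogram that can be summed across participants. The bucket boundaries and function name are hypothetical; the study’s actual bucketing is not specified here.

```python
# Hypothetical bucket boundaries (hours of daily time at home); the
# study's actual buckets are not specified here.
BUCKETS = [0, 4, 8, 12, 16, 20, 24]

def local_histogram(hours_at_home: float) -> list[int]:
    """One-hot histogram computed on-device from a raw mobility value.

    Only this vector of 0s and 1s enters secure aggregation; the raw
    value itself never leaves the device."""
    counts = [0] * (len(BUCKETS) - 1)
    for i in range(len(counts)):
        upper_ok = hours_at_home < BUCKETS[i + 1] or (
            i == len(counts) - 1 and hours_at_home == BUCKETS[-1]
        )
        if BUCKETS[i] <= hours_at_home and upper_ok:
            counts[i] = 1
    return counts

# The server learns only the element-wise sum over participants.
participants = [21.5, 18.0, 23.0, 9.5]
aggregate = [sum(col) for col in zip(*(local_histogram(h) for h in participants))]
print(aggregate)  # [0, 0, 1, 0, 1, 2]
```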

In practice, for contemporaneous FA queries A and B, the symmetric difference between the participating populations (i.e., the number of participants who contribute only to query A or only to query B) should be small: the technical eligibility criteria enumerated in restriction #4 above are shared between queries, and the availability of data for contemporaneous queries (e.g., whether the contributor has filled out surveys during the queried time period) is generally correlated, so any device able to contribute to query A can likely also contribute to query B. Since our final analysis queries for this study were run contemporaneously, we expect their contributors to be mostly overlapping, mitigating restriction #2.

Alongside data-recording nonresponse (that is, not completing surveys, declining to grant location permissions to the study application, leaving the mobile device turned off, etc.), restrictions #3 and #4 contributed significantly to the study’s ~30% data-capture rate. In addition, restriction #4 introduces an unquantified risk of bias from uneven participation by devices with different availability. For future federated studies, these two issues could be mitigated via both technical and operational changes, including:

- improvements to the SecAgg protocol, such as SubGraph SecAgg24, which would reduce the cost of coordination among contributing devices;

- alterations to the device eligibility criteria to allow more devices to attempt aggregation more frequently;

- aggregation of data throughout the study rather than after its completion, reducing the chance of data loss due to app uninstallation.

Differential privacy provides participants plausible deniability about the data that they contributed to a computation by offering a rigorous, mathematical upper bound on the impact that a single participant’s contribution can have on the result of the computation23. A mechanism M is ε-DP if, for all pairs of adjacent datasets D and D′ that differ only by changing, adding, or removing one participant’s data, and for all possible outputs S,

Pr[M(D) = S] ≤ e^ε ⋅ Pr[M(D′) = S]23.

Intuitively, ε defines a ratio that bounds how much a single contributor’s data can affect the probability of any given result. For example, if ε = ln(3), no change to a single participant’s contribution can change the probability of any result by more than 3×. For a binary question, this is the level of protection provided by randomized response, in which the participant flips a coin and answers truthfully if heads and randomly if tails: a participant whose true answer is Yes has a 75% probability of answering Yes and a 25% probability of answering No, and vice versa if their true answer is No.

For each FA aggregation, we apply the Laplace mechanism2 to the server-side aggregates to guarantee differential privacy for the users who contributed to that aggregation. The Laplace mechanism adds noise drawn from the Laplace distribution and scaled according to the sensitivity of the output to an individual participant’s contributions and the desired privacy parameter ε. We add this noise to the server-side, securely aggregated data before making it visible to researchers.

Critically, the addition of noise to the aggregated results limited researchers’ ability to slice aggregates to control for arbitrary confounders, since more slices under the same privacy parameter ε imply a lower signal-to-noise ratio, eventually washing out the data’s utility. We therefore limit our aggregates to bivariate comparisons.

Since noise addition imposes a tradeoff between utility and privacy guarantees, we lean toward utility in our strategy selection for differential privacy. We choose a strong ε value of ln(3) to protect each participant’s contribution to each bivariate aggregation, but acknowledge that each participant can contribute to multiple bivariate comparisons and that these comparisons have some correlation.
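The following is a minimal sketch of the Laplace mechanism applied to a securely aggregated count vector, assuming add/remove adjacency so that the L1 sensitivity is 1; the function name and counts are illustrative, not the study’s implementation.

```python
import numpy as np

def laplace_release(true_counts, epsilon=np.log(3), sensitivity=1.0, seed=None):
    """Add Laplace(scale = sensitivity / epsilon) noise to each count.

    Under add/remove adjacency, one participant changes this count
    vector by at most 1 in L1 norm, so the noisy release is epsilon-DP."""
    rng = np.random.default_rng(seed)
    counts = np.asarray(true_counts, dtype=float)
    return counts + rng.laplace(loc=0.0, scale=sensitivity / epsilon, size=counts.shape)

# Securely aggregated histogram (e.g., responses per vaccination status).
print(laplace_release([120, 87, 45, 9]))
```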
Analysis

The analysis focused on the association between COVID-19 vaccination status and NPIs, including self-reported use of face masks, conscious social distancing, and mobility measures. Primary study outcomes were: daily time at work (<4 h vs. ≥4 h), daily time at home (≥20 h vs. <20 h), daily number of unique places visited (<4 places vs. ≥4 places), self-reported use of face masks (“always” vs. less), and self-reported social distancing (“always” vs. less).

The primary study exposure was vaccination status. Participants were defined as unvaccinated until 11 days after an initial dose of COVID-19 vaccine, partially vaccinated from 11 days after dose 1 until 11 days after dose 2, and fully vaccinated when they were ≥11 days past vaccine dose 2. Individuals who indicated they received a single-dose version of the COVID-19 vaccine were classified as fully vaccinated ≥11 days after that dose.

During the study period, there were multiple changes in public policy, public health recommendations, and vaccine availability. To account for these changes, study data were grouped into 6-week time periods, beginning on December 9, 2020, paralleling the release of the COVID-19 vaccine in the US, through July 6, 2021. During each 6-week period, participants could contribute data on 0-6 weekly surveys. Due to FA, it was not possible to attribute multiple weekly surveys to a single participant. For each time period, the total number of responses for each dichotomously measured NPI was determined for each of the 3 possible vaccination statuses. We adjusted for repeated responses by computing a correction factor: the total number of responses for a given exposure variable divided by the total number of participants who responded to at least one data query on that exposure variable during the time period. We divided the responses for each NPI-vaccination status pair by this correction factor and rounded to the nearest integer to approximate per-participant values (a worked sketch appears at the end of this section).

Chi-square tests were used to assess differences in behaviors related to vaccination status during each time period. Separately, logistic regression models were used to identify overall differences in behaviors by vaccination status after adjusting for time period. Finally, logistic regression models were used to compare behaviors among the sub-groups of unvaccinated participants who did and did not plan to get vaccinated, after adjusting for time period. To account for idiosyncrasies introduced by our privacy-preserving methodology, we used a non-parametric bootstrap with 10,000 iterations to construct 95% confidence intervals (Supplementary Note 1). Our use of federated analytics constrained our ability to adjust for multiple variables without compromising data utility, as adding confounders would have necessitated more data stratifications, diluting the signal under the constant noise levels enforced by DP. We therefore limited our analyses to select key variables and aggregated results across the time periods.
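As a worked illustration of the repeated-response adjustment described above, here is a minimal sketch with hypothetical counts and names, not study data.

```python
def per_participant_counts(responses_by_status, total_responses, unique_participants):
    """Approximate per-participant counts from repeated weekly responses.

    The correction factor is total responses on an exposure variable in
    a time period divided by participants with at least one response on
    it; each vaccination-status cell is divided by that factor and
    rounded to the nearest integer."""
    correction_factor = total_responses / unique_participants
    return {
        status: round(count / correction_factor)
        for status, count in responses_by_status.items()
    }

# Hypothetical example: 900 mask-use responses from 600 unique
# participants in one 6-week period gives a correction factor of 1.5.
print(per_participant_counts(
    {"unvaccinated": 300, "partially vaccinated": 240, "fully vaccinated": 360},
    total_responses=900,
    unique_participants=600,
))  # {'unvaccinated': 200, 'partially vaccinated': 160, 'fully vaccinated': 240}
```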
