A large-scale and PCR-referenced vocal audio dataset for COVID-19

“I love nothing more than an afternoon cream tea” declared the 72,999 volunteers of the UK COVID-19 Vocal Audio Dataset, as part of an effort to test claims that AI models could classify COVID-19 infection from voice recordings.
The UK COVID-19 Vocal Audio Dataset is the largest collection of SARS-CoV-2 PCR-referenced audio recordings to date, and is the subject of our recently published Data Descriptor: A large-scale and PCR-referenced vocal audio dataset for COVID-19.
In late 2020, as the second wave of COVID-19 surged in the UK, use of lateral flow tests was not yet widespread, and it was clear that a faster, more scalable way to screen for infections was needed to supplement polymerase chain reaction (PCR) tests. While PCR tests were highly sensitive and specific, they were expensive and limited by laboratory capacity.
Around this time, several studies were published claiming that COVID-19 infection could be detected from a patient’s cough, speech, or breathing sounds using AI. This vocal audio could be recorded on a smartphone, which could, in theory, provide a non-invasive, cheap and scalable option for COVID-19 screening. However, these studies relied on crowdsourced data, where participants self-reported their COVID-19 status. This meant the “positive” recordings were likely from participants with symptoms, while asymptomatic individuals (who might not know their infection status) were underrepresented and could even be labelled as “negative”.
It was in this context that the UK Health Security Agency (UKHSA) set up the “Speak up and help beat coronavirus” digital survey, which recorded the voices of 72,999 volunteers and linked the recordings to their COVID-19 PCR test results. This information could then be used to independently evaluate the claims of the AI studies. The survey responses and linked test results were wrangled, cleaned and anonymised to form the openly accessible UK COVID-19 Vocal Audio Dataset.
This project was part of a unique collaboration between The Alan Turing Institute, the Royal Statistical Society and UKHSA, known as the Turing-RSS Health Data Lab. The Lab worked together throughout the COVID-19 pandemic on seven COVID-related projects. All were interdisciplinary by nature and demanded consultation and collaboration between statisticians and machine learning researchers from Turing-RSS and the public health experts, partnership professionals and clinical contacts that UKHSA provided. The team also benefited greatly from the expertise and support of research infrastructure professionals such as data wranglers, research software engineers, project managers and a research community manager.
As an example, in this project, bioacoustics expertise helped identify relevant voice recordings: the sentence “I love nothing more than an afternoon cream tea” was chosen because it includes particular vowel and nasal sounds. Machine learning researchers determined the optimal audio sampling rate. Public health and infectious disease expertise guided us on capturing informative clinical data (e.g. SARS-CoV-2 viral load). Medical statisticians ensured sufficient and representative samples across clinical and demographic groups, while also matching samples to previous studies for comparison. Turing researchers and the research community manager drove forward open and reproducible publication of the results, code and dataset, with the help of data wranglers and partnership professionals from UKHSA.
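To make the sampling-rate point concrete, here is a minimal Python sketch of the kind of preprocessing this implies: resampling each recording to a common rate before feature extraction. The 16 kHz target, the file names and the use of the librosa library are our illustrative assumptions, not details taken from the study.

    # A minimal sketch, assuming recordings are WAV files; the 16 kHz
    # target rate and file paths are illustrative, not the study's choices.
    import librosa
    import soundfile as sf

    TARGET_SR = 16_000  # assumed common sampling rate for model input

    def resample_recording(in_path: str, out_path: str) -> None:
        """Load a vocal audio recording and resample it to TARGET_SR."""
        # sr=None keeps the file's native sampling rate
        audio, orig_sr = librosa.load(in_path, sr=None, mono=True)
        if orig_sr != TARGET_SR:
            audio = librosa.resample(audio, orig_sr=orig_sr, target_sr=TARGET_SR)
        sf.write(out_path, audio, TARGET_SR)

    resample_recording("volunteer_0001_sentence.wav", "volunteer_0001_16k.wav")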
The scale of the UK COVID-19 Vocal Audio Dataset allowed us to train and evaluate our own AI models, documented in our Nature Machine Intelligence publication, where we present negative results for symptom-agnostic detection of SARS-CoV-2 infection. Despite providing gold-standard test results, the dataset still exhibits some of the recruitment and self-selection biases seen in previous studies. We have documented these limitations in our Data Descriptor and propose statistical methods to address them in a separate preprint. We discuss this study and the negative results in this blog post: https://www.turing.ac.uk/blog/can-sound-someones-cough-be-used-detect-covid-19.
As Coppock et al. note, a lack of publicly available codebases and datasets hinders replication in AI research, particularly for COVID-19 audio detection. To maximise the impact and sustainability of our research, we wanted to ensure the UK COVID-19 Vocal Audio Dataset was available for other researchers, both to reproduce our results and to reuse in their own work. Conversations about sustainable data sharing focused on documenting the dataset and its metadata in line with the FAIR data principles. We worked with the data owner to explore how we could enable a fully open dataset, which led to adjustments to ensure that re-identification would never be possible. The result was a strict anonymisation process that safeguards participant confidentiality while retaining the essential variables for analysis. This extra effort to make our research reproducible and reusable has boosted both impact and sustainability, with new studies already reusing the dataset.
Despite challenging pandemic conditions, we were able to collect data not traditionally used in public health research. We hope that our documented methods and processes serve as an example and inspiration for further collection and open publishing of non-traditional public health data for use in innovative prediction, diagnosis and screening tasks, alongside more established public health data. The UK COVID-19 Vocal Audio Dataset also includes other participant clinical information, such as influenza PCR test results and asthma status, which we hope will enable its use in multiple clinical use cases.
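As a pointer to how such metadata might be explored, here is a minimal Python sketch that cross-tabulates PCR results against reported symptoms (the key confound discussed above) and pulls out participants with a recorded asthma status. The file name and column names are hypothetical; the Data Descriptor documents the actual schema.

    # A minimal sketch of exploring the participant metadata; the file name
    # and columns ("covid_test_result", "symptom_count", "asthma") are
    # hypothetical -- consult the Data Descriptor for the real schema.
    import pandas as pd

    meta = pd.read_csv("participant_metadata.csv")

    # Cross-tabulate PCR result against whether any symptoms were reported.
    meta["symptomatic"] = meta["symptom_count"] > 0
    print(pd.crosstab(meta["covid_test_result"], meta["symptomatic"]))

    # Restrict to participants with a recorded asthma status for a
    # respiratory-condition sub-analysis.
    asthma_subset = meta[meta["asthma"].notna()]
    print(len(asthma_subset), "participants with asthma status recorded")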
A big thank you to all the volunteers who participated and to the fantastic Turing-RSS Health Data Lab team from the Turing, RSS and UKHSA who made this open research data possible!
Special thanks to Joe Packham and Richard Payne for their help editing this blog.
Image attribution: 
The Turing Way Community. This illustration was created by Scriberia with The Turing Way community and is used under a CC-BY 4.0 licence. DOI: 10.5281/zenodo.3332807
