The UK COVID-19 Vocal Audio Dataset
Authors
Kieran Baker
Jobie Budd
Lorraine Butler
Ana Tendero Cañadas
Harry Coppock
Peter Diggle
Sabrina Egglestone
Steven Gilmour
Chris Holmes
David Hurley
Radka Jersakova
Emma Karoune
Ivan Kiskin
Vasiliki Koutra
Jonathon Mellor
George Nicholson
Josef Packham
Selina Patel
Richard Payne
Davide Pigoli
Sylvia Richardson
Stephen Roberts
Björn Schuller
The Alan Turing Institute
Tracey Thornley
Alexander Titcomb
UK Health Security Agency
Publication date
30 October 2023
Publisher
Zenodo
Abstract
<p>The UK COVID-19 Vocal Audio Dataset is designed for the training and evaluation of machine learning models that classify SARS-CoV-2 infection status or associated respiratory symptoms using vocal audio. The UK Health Security Agency recruited voluntary participants through the national Test and Trace programme and the REACT-1 survey in England from March 2021 to March 2022, during dominant transmission of the Alpha and Delta SARS-CoV-2 variants and some Omicron variant sublineages. Audio recordings of volitional coughs, exhalations, and speech (speech is not available in the open access version) were collected in the 'Speak up to help beat coronavirus' digital survey alongside demographic, self-reported symptom and respiratory condition data, and linked to SARS-CoV-2 test results. The UK COVID-19 Vocal Audio Dataset represents the largest collection of SARS-CoV-2 PCR-referenced audio recordings to date. PCR results were linked to 70,794 of 72,999 participants and 24,155 of 25,776 positive cases. Respiratory symptoms were reported by 45.62% of participants. This dataset has additional potential uses for bioacoustics research, with 11.30% of participants reporting asthma, and 27.20% having linked influenza PCR test results.</p><h3>Contents</h3><ul><li><strong>participant_metadata.csv</strong> row-wise information, indexed by participant identifier, on participant demographics and health status. Please see <a href="https://arxiv.org/pdf/2212.07738.pdf">A large-scale and PCR-referenced vocal audio dataset for COVID-19</a> for a full description of the dataset.</li><li><strong>audio_metadata.csv</strong> row-wise information, indexed by participant identifier, on the three recorded audio modalities, including audio filepaths. 
Please see <a href="https://arxiv.org/pdf/2212.07738.pdf">A large-scale and PCR-referenced vocal audio dataset for COVID-19</a> for a full description of the dataset.</li><li><strong>train_test_splits.csv</strong> row-wise information, indexed by participant identifier, on the train test splits for the following sets: 'Randomised' train and test set, 'Standard' train and test set, 'Matched' train and test sets, 'Longitudinal' test set and 'Matched Longitudinal' test set. Please see <a href="https://arxiv.org/abs/2212.08570">Audio-based AI classifiers show no evidence of improved COVID-19 screening over simple symptoms checkers</a> for a full description of the train test splits.</li><li><strong>audio/ </strong>directory containing all the recordings in .wav format<ul><li>Due to the large size of the dataset, the audio files have been zipped into <strong>covid_data.z{ip, 01-24}.</strong> This allows the dataset to be downloaded in several short sessions, reducing the chance that a dropped internet connection loses progress. To unzip, first ensure that all zip files are in the same directory. Then run the command 'unzip covid_data.zip' or right-click on 'covid_data.zip' and use a programme such as 'The Unarchiver' to open the file.</li><li>Once extracted, to check the validity of the download, please run 'python Turing-RSS-Health-Data-Lab-Biomedical-Acoustic-Markers/data-paper/unit-tests.py'. All tests should pass with no exceptions. 
The script is part of the GitHub repo detailed below, which must be cloned first.</li></ul></li><li><strong>README.md</strong> full dataset descriptor.</li><li><strong>DataDictionary_UKCOVID19VocalAudioDataset_OpenAccess.xlsx </strong>descriptor of each dataset attribute with the percentage coverage.</li></ul><h3>Code Base</h3><p>The accompanying code can be found here: https://github.com/alan-turing-institute/Turing-RSS-Health-Data-Lab-Biomedical-Acoustic-Markers</p><h3>Citations:</h3><p>Please cite the following papers.</p><p>@article{coppock2022,</p><p> author = {Coppock, Harry and Nicholson, George and Kiskin, Ivan and Koutra, Vasiliki and Baker, Kieran and Budd, Jobie and Payne, Richard and Karoune, Emma and Hurley, David and Titcomb, Alexander and Egglestone, Sabrina and Cañadas, Ana Tendero and Butler, Lorraine and Jersakova, Radka and Mellor, Jonathon and Patel, Selina and Thornley, Tracey and Diggle, Peter and Richardson, Sylvia and Packham, Josef and Schuller, Björn W. and Pigoli, Davide and Gilmour, Steven and Roberts, Stephen and Holmes, Chris},</p><p> title = {Audio-based AI classifiers show no evidence of improved COVID-19 screening over simple symptoms checkers},</p><p> journal = {arXiv},</p><p> year = {2022},</p><p> doi = {10.48550/ARXIV.2212.08570},</p><p> url = {https://arxiv.org/abs/2212.08570},</p><p>}</p><p> </p><p>@article{budd2022,</p><p> author={Jobie Budd and Kieran Baker and Emma Karoune and Harry Coppock and Selina Patel and Ana Tendero Cañadas and Alexander Titcomb and Richard Payne and David Hurley and Sabrina Egglestone and Lorraine Butler and George Nicholson and Ivan Kiskin and Vasiliki Koutra and Radka Jersakova and Peter Diggle and Sylvia Richardson and Bjoern Schuller and Steven Gilmour and Davide Pigoli and Stephen Roberts and Josef Packham and Tracey Thornley and Chris Holmes},</p><p> title={A large-scale and PCR-referenced vocal audio dataset for COVID-19},</p><p> year={2022},</p><p> journal={arXiv},</p><p> doi = {10.48550/ARXIV.2212.07738}</p><p>}</p><p>@article{Pigoli2022,</p><p> 
author={Davide Pigoli and Kieran Baker and Jobie Budd and Lorraine Butler and Harry Coppock and Sabrina Egglestone and Steven G.\ Gilmour and Chris Holmes and David Hurley and Radka Jersakova and Ivan Kiskin and Vasiliki Koutra and George Nicholson and Joe Packham and Selina Patel and Richard Payne and Stephen J.\ Roberts and Bj\"{o}rn W.\ Schuller and Ana Tendero-Ca$\tilde{n}$adas and Tracey Thornley and Alexander Titcomb},</p><p>title={Statistical Design and Analysis for Robust Machine Learning: A Case Study from Covid-19},</p><p> year={2022},</p><p> journal={arXiv},</p><p> doi = {10.48550/ARXIV.2212.08571}</p><p>}</p><p> </p><h3>The Dublin Core™ Metadata Initiative</h3><p> </p><p>- Title: The UK COVID-19 Vocal Audio Dataset, Open Access Edition.</p><p>- Creator: The UK Health Security Agency (UKHSA) in collaboration with The Turing-RSS Health Data Lab.</p><p>- Subject: COVID-19, Respiratory symptom, Other audio, Cough, Asthma, Influenza.</p><p>- Description: The UK COVID-19 Vocal Audio Dataset Open Access Edition is designed for the training and evaluation of machine learning models that classify SARS-CoV-2 infection status or associated respiratory symptoms using vocal audio. The UK Health Security Agency recruited voluntary participants through the national Test and Trace programme and the REACT-1 survey in England from March 2021 to March 2022, during dominant transmission of the Alpha and Delta SARS-CoV-2 variants and some Omicron variant sublineages. Audio recordings of volitional coughs and exhalations were collected in the 'Speak up to help beat coronavirus' digital survey alongside demographic, self-reported symptom and respiratory condition data, and linked to SARS-CoV-2 test results. The UK COVID-19 Vocal Audio Dataset Open Access Edition represents the largest collection of SARS-CoV-2 PCR-referenced audio recordings to date. PCR results were linked to 70,794 of 72,999 participants and 24,155 of 25,776 positive cases. Respiratory symptoms were reported by 45.62% of participants. 
This dataset has additional potential uses for bioacoustics research, with 11.30% of participants reporting asthma, and 27.20% having linked influenza PCR test results.</p><p>- Publisher: The UK Health Security Agency (UKHSA).</p><p>- Contributor: The UK Health Security Agency (UKHSA) and The Alan Turing Institute.</p><p>- Date: 2021-03/2022-03</p><p>- Type: Dataset</p><p>- Format: Waveform Audio File Format audio/wave, Comma-separated values text/csv</p><p>- Identifier: <strong>10.5281/zenodo.10043978</strong></p><p>- Source: The UK COVID-19 Vocal Audio Dataset Protected Edition, accessed via application to <a href="https://www.gov.uk/government/publications/accessing-ukhsa-protected-data/accessing-ukhsa-protected-data">Accessing UKHSA protected data</a>.</p><p>- Language: eng</p><p>- Relation: The UK COVID-19 Vocal Audio Dataset Protected Edition, accessed via application to <a href="https://www.gov.uk/government/publications/accessing-ukhsa-protected-data/accessing-ukhsa-protected-data">Accessing UKHSA protected data</a>.</p><p>- Coverage: United Kingdom, 2021-03/2022-03.</p><p>- Rights: Open Government Licence version 3 (OGL v.3), © Crown Copyright UKHSA 2023.</p><p>- accessRights: When you use this information under the Open Government Licence, you should include the following attribution: The UK COVID-19 Vocal Audio Dataset Open Access Edition, UK Health Security Agency, 2023, licensed under the <a href="https://www.nationalarchives.gov.uk/doc/open-government-licence/">Open Government Licence v3.0</a>, and cite the papers detailed above.</p><p> </p>
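<p>The Contents section describes a 25-part archive and three CSV files indexed by participant identifier. The following is a minimal Python sketch, not part of the dataset's own code base, of how one might check that every archive part has been downloaded before unzipping and then join the three CSV files on the participant identifier. The column name <strong>participant_identifier</strong> and the helper function names are assumptions made for illustration; the data dictionary lists the real column names.</p>

```python
import csv
from pathlib import Path

# Assumption: this identifier column name is hypothetical. Check
# DataDictionary_UKCOVID19VocalAudioDataset_OpenAccess.xlsx for the real one.
ID_COLUMN = "participant_identifier"

# Part names follow the covid_data.z{ip, 01-24} pattern given in Contents.
EXPECTED_PARTS = ["covid_data.zip"] + [f"covid_data.z{i:02d}" for i in range(1, 25)]


def missing_archive_parts(directory):
    """Return the expected archive parts that are not present in directory."""
    directory = Path(directory)
    return [name for name in EXPECTED_PARTS if not (directory / name).is_file()]


def load_indexed(path, id_column=ID_COLUMN):
    """Read a CSV into a dict mapping participant id to its row (a dict)."""
    with open(path, newline="", encoding="utf-8") as handle:
        return {row[id_column]: row for row in csv.DictReader(handle)}


def join_tables(*tables):
    """Inner-join id-indexed tables, merging the columns of matching rows."""
    if not tables:
        return {}
    # Keep only participants that appear in every table.
    shared_ids = set(tables[0]).intersection(*tables[1:])
    joined = {}
    for pid in shared_ids:
        merged = {}
        for table in tables:
            merged.update(table[pid])
        joined[pid] = merged
    return joined


def load_metadata(root, id_column=ID_COLUMN):
    """Join the three CSV files named in Contents, found under root."""
    root = Path(root)
    return join_tables(
        load_indexed(root / "participant_metadata.csv", id_column),
        load_indexed(root / "audio_metadata.csv", id_column),
        load_indexed(root / "train_test_splits.csv", id_column),
    )
```

<p>An inner join is used here so that every returned row has demographic, audio and split information; participants absent from any one file are dropped, which may not suit every analysis. This sketch does not replace the validity check provided by unit-tests.py in the accompanying repository.</p>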
oai:zenodo.org:10043978
Last time updated on 07/05/2024