ZOE and King's College data science and machine learning teams have been working around the clock to build a machine learning model that uses Symptom Tracker data to predict COVID-19 in the UK. Based on data from the COVID Symptom Tracker app and the assumptions laid out below, we estimate that 1.9m people aged 20-69 in the UK had symptomatic COVID as of 1st April 2020.
1. Learn which symptoms best predict COVID, based on app users who have been tested
2,932 users of COVID Symptom Tracker had both recorded their symptoms and been tested for COVID, with 1,130 testing positive and 1,802 testing negative. We used machine learning* on this data to learn which symptoms are most predictive of a positive test. The most predictive symptoms, most important first, were: anosmia (loss of taste & smell), fatigue, shortness of breath, fever and persistent cough.
2. Estimate total number of app users with COVID by applying those rules to all users’ logged symptoms (people who don’t report symptoms are not part of the model)
In total, 1,626,355 users of COVID Symptom Tracker aged 20-69 had recorded their symptoms, healthy or not, as of 1st April 2020. Applying the rules learnt from the tested users, we estimate that 79,405 of these users (4.9%) would test positive.
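As a rough illustration of this scoring step, the sketch below applies a logistic rule to each user's logged symptoms and counts the predicted positives. Only the intercept (-1.44) comes from the technical note; the symptom weights, the 0.5 decision threshold and all function names are assumptions made for the example, not the values used in the actual model.

```python
import math

# Intercept as reported in the technical note. The per-symptom weights are
# NOT given in this text, so the values below are hypothetical placeholders.
INTERCEPT = -1.44
HYPOTHETICAL_WEIGHTS = {
    "anosmia": 1.5,
    "fatigue": 0.5,
    "shortness_of_breath": 0.4,
    "fever": 0.4,
    "persistent_cough": 0.3,
}


def predicted_probability(symptoms):
    """Logistic model: sigmoid(intercept + sum of weights of reported symptoms)."""
    score = INTERCEPT + sum(HYPOTHETICAL_WEIGHTS[s] for s in symptoms)
    return 1.0 / (1.0 + math.exp(-score))


def predicted_positive(symptoms, threshold=0.5):
    """Assumed decision rule: call a user positive at or above the threshold."""
    return predicted_probability(symptoms) >= threshold


def positive_share(users):
    """Fraction of users (each a list of symptom names) predicted positive."""
    return sum(predicted_positive(u) for u in users) / len(users)


# Check of the headline arithmetic quoted in the text.
share = 79_405 / 1_626_355
print(f"{share:.1%}")  # 4.9%
```

A user who reports no symptoms scores only the intercept, which gives a probability of roughly 0.19 under this sketch, well below the assumed threshold.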
3. Extrapolate to the whole UK population from app users, based on region, age & gender proportions
We segmented the whole UK population by location, age-decade and gender. For each such segment, we applied the percentage predicted as positive by our rules amongst app users, and then combined back to a total UK estimate.
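The extrapolation described here can be sketched as a per-segment rate multiplied by the segment's population, summed over all segments. Every number and segment label below is invented purely to show the arithmetic, and the function name `extrapolate` is hypothetical; how the real model handles segments with few or no app users is not stated in the text.

```python
# Hypothetical segments keyed by (region, age decade, gender).
# None of these counts come from the article.
app_users = {
    ("London", "30-39", "F"): {"users": 1000, "predicted_positive": 60},
    ("London", "30-39", "M"): {"users": 800, "predicted_positive": 40},
    ("Wales", "50-59", "F"): {"users": 500, "predicted_positive": 10},
}
population = {
    ("London", "30-39", "F"): 700_000,
    ("London", "30-39", "M"): 650_000,
    ("Wales", "50-59", "F"): 200_000,
}


def extrapolate(app_users, population):
    """Sum over segments of (predicted-positive rate among app users) x (segment population)."""
    total = 0.0
    for segment, segment_population in population.items():
        stats = app_users.get(segment)
        if stats is None or stats["users"] == 0:
            # No app data for this segment; this sketch simply skips it.
            continue
        rate = stats["predicted_positive"] / stats["users"]
        total += rate * segment_population
    return total


estimate = extrapolate(app_users, population)
print(round(estimate))  # 78500
```

Weighting by segment population rather than by the number of app users is what corrects for segments that are over- or under-represented in the app.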
Assumptions and limitations of this estimate:
Does not include asymptomatic COVID infections: there may be a large additional number of these.
Does not include COVID infections in people aged under 20 or over 69, since we have too little data logged in the app to model these.
Assumes that tested app users’ symptoms are representative of the symptoms of people with positive or negative COVID status in the whole population.
Assumes that app users are representative of the whole population within each segment of location, age-decade and gender. This model does not adjust for other demographics or health information.
Assumes healthy and unhealthy people are equally likely to use our app.
Assumes that app users report the severity of their symptoms in the same way.
Assumes no interaction between symptoms in this first logistic regression model.
Does not capture the most serious hospitalised cases well, although these are few in number relative to our total estimate.
* Technical model detail: We trained a logistic regression model on app users who had been tested, to learn which symptoms are most predictive of a positive COVID test.
To assess the variability of the model, we trained ten logistic regression models on 80:20 random splits of the data. The mean and standard deviation of the weights for each symptom are given below, with a mean intercept of -1.44 (sd 0.04).
Across the ten splits, we obtained a mean classification accuracy on the held-out test sets of 0.73 (sd 0.01).
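The variability check described in this note can be sketched as follows: fit a logistic regression on ten random 80:20 splits, then report the mean and standard deviation of the intercept, of each weight, and of the test accuracy. This is a minimal stdlib-only sketch under stated assumptions: the data is synthetic, the fit uses plain batch gradient descent without regularisation, each split is an independent reshuffle, and the sd is the sample standard deviation. The text does not say which library, solver or settings were actually used.

```python
import math
import random
import statistics


def sigmoid(z):
    # Clamp the input so math.exp cannot overflow.
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def predict(intercept, weights, row):
    return sigmoid(intercept + sum(w * x for w, x in zip(weights, row)))


def fit_logistic(X, y, epochs=150, lr=0.5):
    """Batch gradient descent on the mean log-loss (assumed: no regularisation)."""
    n, d = len(X), len(X[0])
    intercept, weights = 0.0, [0.0] * d
    for _ in range(epochs):
        grad_b, grad_w = 0.0, [0.0] * d
        for row, label in zip(X, y):
            err = predict(intercept, weights, row) - label
            grad_b += err
            for j in range(d):
                grad_w[j] += err * row[j]
        intercept -= lr * grad_b / n
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
    return intercept, weights


def accuracy(intercept, weights, X, y):
    hits = sum((predict(intercept, weights, row) >= 0.5) == (label == 1)
               for row, label in zip(X, y))
    return hits / len(y)


def repeated_splits(X, y, n_repeats=10, test_fraction=0.2, seed=0):
    """Fit on n_repeats random train/test splits; return (mean, sd) summaries."""
    rng = random.Random(seed)
    indices = list(range(len(X)))
    n_test = round(len(X) * test_fraction)
    intercepts, weight_runs, accuracies = [], [], []
    for _ in range(n_repeats):
        rng.shuffle(indices)
        test_idx, train_idx = indices[:n_test], indices[n_test:]
        b, w = fit_logistic([X[i] for i in train_idx], [y[i] for i in train_idx])
        intercepts.append(b)
        weight_runs.append(w)
        accuracies.append(
            accuracy(b, w, [X[i] for i in test_idx], [y[i] for i in test_idx]))
    per_symptom = list(zip(*weight_runs))  # one tuple of ten weights per symptom
    return {
        "intercept": (statistics.mean(intercepts), statistics.stdev(intercepts)),
        "weights": [(statistics.mean(ws), statistics.stdev(ws)) for ws in per_symptom],
        "accuracy": (statistics.mean(accuracies), statistics.stdev(accuracies)),
    }


# Synthetic stand-in data: two binary "symptoms", the first strongly predictive.
data_rng = random.Random(42)
X = [[data_rng.randint(0, 1), data_rng.randint(0, 1)] for _ in range(300)]
y = [1 if data_rng.random() < sigmoid(-1.0 + 2.0 * a + 0.5 * b) else 0 for a, b in X]

summary = repeated_splits(X, y)
print(summary)
```

A small sd across splits, as reported above for the intercept and the accuracy, suggests the fitted weights do not depend much on which 20% of tested users is held out.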