Stop looking for an NPI miracle
Why underpowered trials hinder our ability to respond to future disease threats
If there’s an epidemic control measure that’s useful, we ideally want to know that it’s useful. But there’s a problem. It’s all too common to set the bar too high for usefulness, and test non-pharmaceutical interventions (NPIs) under the assumption that they’re basically the most useful thing ever. And then have lots of arguments when the resulting study is inconclusive.
Take air filters. Earlier this week, a new randomised controlled trial was published that tested high-efficiency particulate air (HEPA) filters in Australian residential aged-care facilities. The study set out to answer a clear question: does air purification reduce acute respiratory infections (ARIs)?
The paper’s conclusion, when looking at what happened to everyone who initially enrolled in the study, was as follows:
the use of air purifiers with HEPA-14 filters did not reduce ARIs compared with the control
So, not useful then?
Not quite. The study actually estimated that infections were 43% lower in the HEPA group, with a 95% confidence interval that ranged from –4% to 68%. In other words, we can’t be 95% confident that the filters led to a reduction – if we could, the lower end of the interval would sit above 0% rather than dipping below it – but most of the interval sat firmly on the ‘useful’ side of things.
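As a quick aside, those numbers can also be put on the rate-ratio scale (infections in the HEPA group relative to control) that trials commonly report, and that the sketches later in this post use: a reduction of r corresponds to a ratio of 1 - r, and ‘no effect’ corresponds to a ratio of 1. A minimal bit of Python, using only the figures quoted above:

```python
# Translating the quoted reductions into a rate ratio (HEPA group vs control).
# A reduction of r corresponds to a ratio of 1 - r; 'no effect' is a ratio of 1.
reduction, reduction_low, reduction_high = 0.43, -0.04, 0.68  # figures quoted above

ratio = 1 - reduction                               # point estimate: 0.57
ratio_ci = (1 - reduction_high, 1 - reduction_low)  # bounds swap: (0.32, 1.04)
print(f"Rate ratio {ratio:.2f}, 95% CI {ratio_ci[0]:.2f} to {ratio_ci[1]:.2f}")
print("Interval crosses 1 (no effect)?", ratio_ci[0] < 1 < ratio_ci[1])
```

Because the interval crosses 1, the trial can’t formally rule out ‘no effect’ – which is all the headline conclusion is really saying.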
Why, if HEPA filters potentially have such a useful effect, couldn’t the study be more confident in its conclusions? The reason is statistical power. When studying interventions like treatments or control measures, there are two main errors we want to avoid: we don’t want to conclude something works when it doesn’t, and we don’t want to conclude it doesn’t work when it does.
In traditional statistics, studies use confidence intervals (or p-values) to guard against the first error: if the confidence interval includes zero, we can’t be that sure that there’s genuinely a positive effect. We might have just observed a fluke difference between the control and intervention groups. But the size of the confidence interval will depend on how many events have occurred in the study. If a trial is relatively small, we could miss a more modest – but still useful – effect, simply because we haven’t observed enough events in the control and intervention groups to distinguish true effects from background noise.
In other words, the study needs enough statistical power: a decent probability (usually set at 80%) of concluding that a useful intervention is genuinely useful. Which means we must decide in advance how useful – i.e. how big an effect – we think an intervention might have. In the HEPA study, the team justified the sample size as follows:
With a sample of 94 participants in total, our study would be powered to 80% to identify a 50% reduction in ARI incidence at a 5% significance level.
So, if HEPA filters genuinely reduced ARI by 50% or more, there was at least an 80% chance that the resulting 95% confidence interval would only span positive values.
If, say, HEPA filters instead only reduced ARI by 40%, the study would be underpowered to detect this effect. In that situation, it would be much more likely to return an inconclusively wide confidence interval. Which is exactly what happened with the HEPA trial; remember that infections were 43% lower in the HEPA group on average.
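To make that concrete, here is a rough simulation sketch in Python. It is not the trial’s actual power calculation: it assumes Poisson-distributed infection counts and roughly one ARI per participant in the control arm over follow-up, an illustrative figure chosen so that 94 participants give about 80% power for a 50% reduction. It then checks how often the 95% confidence interval for the rate ratio would exclude ‘no effect’ under different true reductions:

```python
# A rough simulation sketch of the power trade-off, NOT the trial's actual
# power calculation. The baseline of ~1 ARI per participant over follow-up is
# an illustrative assumption, chosen so that a 50% reduction gives roughly
# 80% power with 94 participants (47 per arm).
import numpy as np

rng = np.random.default_rng(1)

def estimated_power(n_per_arm, baseline_rate, reduction, n_sims=10_000):
    """Fraction of simulated trials whose 95% CI for the rate ratio
    (intervention vs control) sits entirely below 1, i.e. a conclusive benefit."""
    conclusive = 0
    for _ in range(n_sims):
        control = rng.poisson(baseline_rate, n_per_arm).sum()
        treated = rng.poisson(baseline_rate * (1 - reduction), n_per_arm).sum()
        if control == 0 or treated == 0:
            continue  # too few events to form a Wald interval
        log_rr = np.log(treated / control)        # equal-sized arms
        se = np.sqrt(1 / treated + 1 / control)   # Wald standard error on the log scale
        if np.exp(log_rr + 1.96 * se) < 1:        # upper CI bound excludes 'no effect'
            conclusive += 1
    return conclusive / n_sims

for true_reduction in (0.5, 0.4):
    p = estimated_power(n_per_arm=47, baseline_rate=1.0, reduction=true_reduction)
    print(f"True reduction {true_reduction:.0%}: power ~ {p:.0%}")
```

Under these assumptions, the simulated power comes out at roughly 80% for a true 50% reduction, but only somewhere around 55–60% for a true 40% reduction. In the latter case, the trial is not far off a coin flip on whether it produces a conclusive result. The exact figures depend on the assumed baseline rate; the pattern doesn’t.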
A 50% reduction in infection from a single NPI is a massive effect, and routinely powering studies based on this assumed effect size risks ambiguity and debates. Unfortunately, it’s not the first time this issue has occurred. In 2021, the DANMASK trial compared rates of infection among people who’d been randomised to wear a mask and those who hadn’t. As with the HEPA study, it was designed with the power to detect a reduction of at least 50%. In reality, there ended up being 18% fewer infections in the mask group, and the 95% confidence interval ranged from –23% to 46%. Which led to muddled headlines about masks having ‘no significant effect’ and claims that masks were ‘proven not to work’.
A reduction in infection risk of ~20% from masks and ~40% from air filters – if genuine – would be a very useful finding to have lined up for future disease threats. But instead of consensus, we’ve ended up with uncertainty and stasis. As I’ve written about previously, ambiguity about such measures can lead to two vocal extremes: some who suggest light touch measures will solve everything, and others who suggest they do nothing.
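For a sense of scale, here is a back-of-the-envelope calculation under the same illustrative assumptions as the simulation above (two equal arms, Poisson counts, a normal approximation on the log rate ratio, roughly one infection per control participant). It is a sketch rather than a proper trial design, but it shows how quickly the required sample size grows as the assumed effect shrinks:

```python
# A back-of-the-envelope sample size sketch, not a substitute for a proper
# trial design. Assumes two equal arms, Poisson event counts, a Wald test on
# the log rate ratio, and an illustrative baseline of ~1 infection per
# control participant over follow-up.
import numpy as np
from scipy.stats import norm

def total_participants(reduction, baseline_rate=1.0, power=0.80, alpha=0.05):
    """Approximate total sample size (both arms combined) to detect a given
    relative reduction with the stated power at a two-sided significance level."""
    rate_ratio = 1 - reduction
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    # Var(log rate ratio) ~ 1/(expected intervention events) + 1/(expected control events)
    per_arm = (z / np.log(rate_ratio)) ** 2 * (
        1 / (baseline_rate * rate_ratio) + 1 / baseline_rate
    )
    return 2 * int(np.ceil(per_arm))

for reduction in (0.5, 0.4, 0.2):
    print(f"{reduction:.0%} true reduction: ~{total_participants(reduction)} participants")
```

Under these assumptions, an 80%-powered trial needs on the order of 100 participants for a 50% reduction – close to the HEPA trial’s 94 – but around 700 for a 20% reduction. The precise numbers hinge on the assumed infection rate; the scaling is the point.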
Running randomised trials is difficult, and getting good evidence for smaller effects can dramatically increase the sample size required. But there are examples of progress, such as studies using rapid antigen tests to capture more infection events, and hence obtain more confidence in their conclusions. Or, as I’ve written about before, tailoring studies to the dynamics of an outbreak to increase their power. And we need this innovation to continue, and build on the promising early signals in studies like the HEPA trial. Because when the next major epidemic threat comes along, we’ll need all the useful tools we can get our hands on.
Cover image: Jennifer Burk via Unsplash
If you’re interested in reading about this topic, here’s my post from last year on how epidemics can end before we learn what works: