Scientific results don’t come fully formed, in neatly typeset tables and polished figures. Raw data must be collected. Analysis code written. Summary results generated. But historically these steps have often been hidden from readers, with only the final outcomes shared.
There’s been increasing focus on changing this, making the invisible aspects of science more visible. The British Medical Journal is the latest to get on the transparency train, announcing that they will require papers to include data and code alongside the traditional manuscript summarising the results.
This isn’t the first time we’ve seen a much-needed shift in standards. Thanks to some thoughtful advocacy beginning in the 1980s, it’s now common to see uncertainty and confidence intervals in medical papers, rather than just solitary p-values. More recently, there has been a shift towards making research open access, rather than hiding (often publicly funded) results behind paywalls.
Just as open access papers help wider audiences see the results, sharing data and code helps others build on previous work. The involvement of journals and funders in requiring such sharing is a key step, I think, for a few reasons:
Researchers haven’t had an incentive to share good code. Historically, researchers have been judged on the papers they publish, rather than on whether data and code are made available alongside them. That made sharing a ‘nice-to-have’ for many, an add-on they were not incentivised to invest time in. However, if funders and journals require sharing, it is no longer an add-on, and researchers will increasingly have no choice but to put plans and resources in place to deliver. We’ve already seen the big increase in open access publishing after it became a requirement of funders and government research evaluations.
Code might be wrong. There can be a hesitancy among researchers to share the raw ingredients of their work, in case they contain hidden flaws. It is only human to think ‘what if someone realises there’s an error in my work?’ But ultimately the converse is the more important question: ‘what if people don’t realise there’s an error in my analysis?’ What if subsequent publications – or policies – are built on innocent mistakes that could have been spotted earlier? Making pre-prints and code available is useful here, getting visibility and feedback on work early, so it can be iterated and improved. It also allows others to build on analysis and code earlier, which can in turn benefit the original research team. As well as requirements to share at the publication stage, the increasing acceptability of pre-prints alongside submission to traditional journals is helping this process.
Researchers are not trained in good software practice. Often when I’ve reviewed papers with accompanying code, there’s been a single file (e.g. code.R
) with an analysis script, and little in the way of description or comments. Technically the work is reproducible, but does it count as useful sharing? Sharing is really a spectrum, with single scripts at one end, and fully documented, stable software libraries at the other. For big research projects, we’d ideally have less of the former and more of the latter. But this improvement will require a shift in practice, culture, and resources. So, once again, it will be important to have journals and funders – and hence the research teams they interact with – on board in driving this change.
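To make the spectrum concrete, here’s a sketch of what even a lightly documented script can look like. Everything in it is hypothetical: the file paths, variable names, and model are invented purely for illustration. But the habits on show, a header stating purpose, inputs, and dependencies, plus a record of the software versions used, are cheap to adopt and move a script a long way from the bare code.R end of the spectrum.

```r
# analysis.R -- hypothetical example of a minimally documented analysis script
# Purpose:  estimate the association between exposure and outcome, adjusting for age
# Input:    data/trial_data.csv (one row per participant: exposure, outcome, age)
# Output:   results/model_summary.txt
# Requires: base R only (written against R >= 4.0)

# Read the (hypothetical) dataset
trial_data <- read.csv("data/trial_data.csv")

# Fit a logistic regression of outcome on exposure, adjusting for age
model <- glm(outcome ~ exposure + age, data = trial_data, family = binomial)

# Write a human-readable model summary, followed by the exact R and package
# versions used, so others can reproduce the environment as well as the analysis
dir.create("results", showWarnings = FALSE)
writeLines(
  c(capture.output(summary(model)), capture.output(sessionInfo())),
  "results/model_summary.txt"
)
```

A single commented script like this is still a long way from a documented, stable library, but it is the sort of low-cost baseline that journal and funder requirements could reasonably expect from every submission.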