How safe is your data?
17 Apr 2014 by Evoluted New Media
What would you do if you lost your data? Mark Hahnel explores options for storing your research safely and in the process discovers that all data – even negative – could be useful.

I left my laptop on a plane last week and only just had it returned in time for my onward flight out of Abu Dhabi. If it hadn’t arrived in time, I doubt I would ever have seen it again. The way I managed my data during my time as an academic wasn’t much different. I had good intentions. The entire research output from my PhD is sitting safely on an external hard drive. The location of said hard drive, however, is anyone’s guess.

The common failings exhibited by academics when managing their research data have never been a more important topic. As data output grows, effective data organisation is only going to get more difficult, and if data continues to be managed poorly then science will ultimately suffer – experiments will be hard to replicate, findings will be called into question, papers will be retracted and careers will be damaged.

With an increasing amount of scientific output being generated within academic institutions, the biggest pushback cited by researchers tends to be a lack of knowledge about how to manage and store their data most effectively, as well as a fear of the consequences of sharing their data openly. This is particularly true of those who have dedicated several years of their lives to post-doctoral research and are hopeful of tenure. These researchers are often fearful that making their data easily accessible will not be beneficial to their career, and may actually harm their chances of promotion. While there is plenty of evidence to the contrary, awareness of this among academics is still low.
In order to achieve the benefits of open research, I feel that the push must come initially from governments and funders. Training is an essential part of this. Nudging researchers to be better organised and to structure the storage of their data at the time of generation makes it easier to track the provenance of the research, and should save hours of frustration when writing up papers and dissertations or demonstrating the outcomes of funding. The stats from the infographic speak for themselves when it comes to the key issues that need to be addressed.

Just as importantly, the process of research data management needs to take place as close to the point of data capture as possible, while the experiment is still fresh in the mind of the academic. It should also take up as little time as possible or, ideally, be part of the researcher’s existing workflow – a simple sketch of what this could look like appears below.

Commercial data storage facilities use the most up-to-date technology and provide the necessary legal guarantees about uptime, back-ups and disaster recovery. However, most universities do not have the budget for this type of solution, and the economies of scale that often make data storage cost-effective do not necessarily apply in academia. Commercial data storage services cost less per terabyte as capacity increases, a benefit that is lost when research outputs are stored in siloed repositories provided at a subject or institutional level.

Given this, I would argue that we do not need to store every bit of data generated, nor back it up with hundreds of copies. Instead, the decision about what to store should, in the first instance, be made by the academic. Each researcher knows intuitively which data will be of no use to anyone – the experiments with methodological or computational errors that render the data inaccurate and irrelevant. The danger comes when they assume their research will not be of use to other academics, such as experiments producing null or ‘unexciting’ results – such data is still of high value to those working in the same field, as it saves them from inadvertently reproducing work that is unlikely to bear fruit. This is an area where I feel academia could be more open-minded about serving the science community, rather than supporting only the research being produced. As it stands, the majority of negative data is never seen by anyone outside the lab it was generated in.

I believe that publishers should mandate that all the research underlying the conclusions of a peer-reviewed paper be made openly available, where ethically possible. Even better would be for all raw data to be made available to the community, as is currently requested by open access journals such as F1000 Research and PLOS.

Given the challenges I’ve suggested, I am hopeful that we can also look to other stakeholders to ensure that the integrity of research is maintained. Funding bodies, for example, expect a return for ‘investing’ in researchers – namely, the generation of reliable data that will help further human understanding. That’s why I believe that data management plans and mandates must come initially from the funders and that, without a plan in place, funding should not be awarded. Fortunately, many funding bodies now request such plans.
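By way of illustration only – nothing here is prescribed in this article – capture-time record-keeping can be as simple as writing each result file alongside a small metadata file recording what the experiment was, when the data was captured, and a checksum. The directory layout, field names and function in the Python sketch below are hypothetical assumptions rather than any standard; the point is simply that provenance is cheapest to record at the moment the data is created.

# Hypothetical sketch of capture-time record-keeping: save raw results together
# with a sidecar JSON file describing their provenance. All names and the layout
# are illustrative assumptions, not a prescribed standard.
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def save_with_metadata(data: bytes, experiment_id: str, description: str,
                       base_dir: str = "research_data") -> Path:
    """Write raw experimental data plus a JSON record of when and how it was captured."""
    out_dir = Path(base_dir) / experiment_id
    out_dir.mkdir(parents=True, exist_ok=True)

    data_path = out_dir / "results.dat"
    data_path.write_bytes(data)

    metadata = {
        "experiment_id": experiment_id,
        "description": description,
        "captured_at": datetime.now(timezone.utc).isoformat(),
        "sha256": hashlib.sha256(data).hexdigest(),  # lets anyone verify the file later
        "file": data_path.name,
    }
    (out_dir / "metadata.json").write_text(json.dumps(metadata, indent=2))
    return data_path


if __name__ == "__main__":
    save_with_metadata(b"raw readings go here", "exp-2014-04-17-01",
                       "Calibration run, instrument A (illustrative example)")

A record like this costs seconds at the bench, but it makes the data far easier to find, verify and share later – whether in an institutional repository or a service such as figshare.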
Isn’t it time for academics and institutions to do their bit? The technology is available. I would suggest that institutions need to provide better tools to empower academics, but at the same time researchers from PhD level to PI need to step up and take control of their own research outputs… before that laptop does go missing and all that precious data is lost forever.

Author: Mark Hahnel of Digital Science. While working on his PhD, Mark developed figshare – an open data tool that allows researchers to store their outputs securely in the cloud, share them privately with lab-mates and collaborators, or make them public in the name of open research.