To researchers’ credit across the globe the amount of data being shared is growing and this will only increase over time as open research becomes ubiquitous. There are significant benefits to data sharing including increased rigour, transparency, and visibility.
But this post isn’t going to get blogged down in the benefits of data sharing as it is a path well-trodden. Instead, let’s consider that as researchers have been archiving and sharing data in archives and repositories there is a rich source of material that can be accessed, reworked, reanalysed and compared to recent data collections.
This secondary data analysis is a growing area of interest to researchers and funders, with the latter having calls focusing solely on reanalysis of data (e.g. UKRI). Accessing historic data also allows for research to be undertaken where costs are prohibitive, data is impossible or difficult to collect, and, possibly, reduce the burden on over researched populations. With the continuing challenges with collecting primary data during the pandemic there might not be a better time to consider what data is already out there.
And it is not only research that can benefit but also teaching and learning. Archived data sources can be accessed to introduce students to a fantastic range of existing data and code. Using secondary data can free students of data collection allowing them to focus on developing skills of research questions and analysis.
Based on data from re3data.org as of April 2021 there are over 2600 data repositories available for researchers to archive data, up from 1000 in November 2013. This isn’t a completely exhaustive list but is close enough to give an idea of the scale. Amongst these is our own data.ncl that now houses over 1200 datasets shared by university colleagues from across all disciplines and collected using a variety of methods and techniques.
However, finding the right dataset for your latest research project or teaching idea isn’t always straightforward. To help with that I have created guidance on how to find, reuse and cite data on the RDM webpages.
I would also be very keen to hear from users of secondary data to create case studies to inspire colleagues on this approach. If you would be interested in sharing your approach and experience, then please do get in touch.
Chris Harrison, as an astronomer who is a Newcastle University Academic Track Fellow (NUAct). Here he reflects on the good and bad aspects of reproducible science in observational astronomy and describes how he is using Newcastle’s Research Repository to set a good example. We are keen to hear from colleagues across the research landscape so please do get in touch if you’d like to write a post.
I use telescopes on the ground and in space to study galaxies and the supermassive black holes that lurk at their centres. These observations result in gigabytes to terabytes of data being collected for each project. In particular, when using interferometers such the Very Large Array (VLA) or the Atacama Large Millimetre Array, (ALMA) the raw data can be 100s of gigabytes from just one night of observations. These raw data products are then processed to produce two dimensional images, one dimensional spectra or three dimensional data cubes which are used to perform the scientific analyses. Although I mostly collect my own data, every so often I have felt compelled to write a paper from which I wanted to reproduce the results from other people’s observational data and their analyses. This has been in situations where the results were quite sensational and appeared to contradict previous results or conflict with my expectations from my understanding of theoretical predictions. As I write this, I have another paper under review that directly challenges previous work. This has been after a year of struggling to reproduce the previous results! Why has this been and what can we do better?
On the one hand most astronomical observations have incredible archives where all raw data products ever taken can be accessed by anyone after the, typically 1 year long, proprietary period has expired (great archive examples are ALMA and the VLA). These always include comprehensive meta-data and is always provided in standard formats so that it can be accessed and processed by anyone with a variety of open access software. However, from painful experience, I can tell you that it is still extremely challenging to reproduce other people’s results based on astronomical observational data. This is due to the many complex steps that are taken to go from the raw data products to a scientific result. Indeed, these are so complex it is basically not possible to adequately describe all steps in a publication. The only real solution for completely reproducible science would be to publicly release processed data products and the codes that were used both to reproduce these and analyse them. Indeed, I have even requested such products and codes from authors and found that they have been destroyed forever on broken hard drives. As early-career researchers work in a competitive environment and have vulnerable careers, one cannot blame them for wanting to keep their hard work to themselves (potentially for follow-up papers) and to not expose themselves to criticism. Discussing the many disappointing reasons why early career research are so vulnerable – and how this damages scientific progress – is too much to discuss here. However, as I now in an academic track position, I feel more confident to set a good example and hopefully encourage other more senior academics to do the same.
In March 2021 I launched the “Quasar Feedback Survey”, which is a comprehensive observational survey of 42 galaxies hosting rapidly growing black holes. We will be studying these galaxies with an array of telescopes. With the launch of this survey, I uploaded 45 gigabytes of processed data products to data.ncl (Newcastle’s Research Repository), including historic data from pilot projects that lead to this wider survey. All information about data products and results can also easily be accessed via a dedicated website. I already know these galaxies, and hence data, are of interest to other astronomers and our data products are being used right now to help design new observational experiments. As the survey continues the data products will continue to be uploaded alongside the relevant publications. The next important step for me is to find a way to also share the codes, whilst protecting the career development of the early career researchers that produced the codes.
To be continued!
Image Credit: C. Harrison, A. Thomson; Bill Saxton, NRAO/AUI/NSF; NASA.
This has been the first full calendar year data.ncl has been available for our researchers to archive and share data. And in the spirit of best of 2020 articles on film, TV shows and music I have dug into data.ncl’s usage statistics to pull out the headlines.
360 data deposits (718 in total)
118 different researchers archiving data (174 in total)
47,190 data downloads
Our top three datasets based on views and downloads in 2020 were: