Guest Post: Rethinking What Data Is

This is our first guest post on the Opening Research blog. We are keen to hear from colleagues across the research landscape so please do get in touch if you’d like to write a post. But the honor of debut guest blogger goes to David Johnson, PGR in History, Classics and Archaeology.



Coming to start my PhD from a background in history and the humanities, I really didn’t give the idea of data much thought.  I knew I was expected to present evidence about my topic in order to defend my research and my ideas, but in my mind there was a fundamental difference between the kind of evidence I was going to work with and ‘data’.  Data was something big and formal, a collection of numbers and formulae that people other than me collated and manipulated using advanced software.  Evidence was the warm and fuzzy bits of people’s lives that I would be collecting in order to try and say something meaningful about them, not something to ‘crunch’, graph, or manipulate.  This was a critical misconception that I am pleased to say I have come to terms with now.

What I had to do was get away from the very numerical interpretation of the term ‘data’, and start to think in broader terms about the definition of the word. When I was asked about a data plan for my initial degree proposal, I said I didn’t have one. I simply didn’t think I was going to need one. In fact, I had already developed a basic data plan without realising what it was called. My initial degree proposal included going through a large volume of domestic literature and gathering as many examples of emotional language as I could find to create a lexicon of emotion words in use during the nineteenth century. In retrospect, it’s obvious that effort was fundamentally based in data analysis, but my notion of what ‘data’ was prevented me from seeing that at the time.

What changed my mind was some training I went to as part of my PhD programme, which demonstrates how important it is to engage with that training with an open mind. The trainings on open publishing and data storage fundamentally changed my perspective on what constitutes data. Together these two training events prompted me to reconsider the way I approached the material I was collecting for my project. The vocabulary of emotion words I was compiling from material published during the nineteenth century was not just a list of words, but a data set that should be preserved and made available. Likewise, the ever-growing pile of diary entries demonstrating the lived emotional experiences of people in the nineteenth century constitutes a data set. Neither of these is in numerical form, yet both can be qualitatively and quantitatively evaluated like other forms of data.

I suspect I am not alone in carrying this misconception as far into my academic work as I have. I think what is required for many students is a rethinking of what constitutes data. Certainly in the hard sciences, and perhaps in the social sciences, there is an expectation of working with traditional forms of data such as population numbers, or statistical variations from a given norm, but in the humanities we may not be as prepared to think in those terms. Yet whether analysing an author’s novels, assessing parish records, or collecting large amounts of diary writings as I am, the pile of text still constitutes a form of data, a body of material that can be subjected to a range of data analysis tools. If I had been able to make this mind shift earlier in my degree, I might have been better able to manage the evidence I collected, and also make a plan to preserve that data for the long term. That said, it’s still better late than never, and I am happy to say I have made considerable progress since I rethought my notions of what data was. I have put my lexicon data set out on the Newcastle Data Repository, so feel free to take a look at https://doi.org/10.25405/data.ncl.11830383.v1.

Image credit: JD Hancock Photos: http://photos.jdhancock.com/photo/2012-09-28-001422-big-data.html

Did we share more data during Open Access Week?

To celebrate Open Access Week, 19–25 October, data.ncl through Figshare ran a competition to encourage data to be uploaded and shared. We publicised Open Access Week on this blog, NUConnect, social media and in schools to help promote data.ncl and the merits of data sharing.

Anil Yildiz, Research Associate in the School of Engineering, has long embraced open data and has shared several datasets and supporting scripts from his research projects in data.ncl. The idea of a competition piqued his interest as an incentive for researchers to share data, but it also switched on his inquisitive nature as he wondered whether it would lead to an increase in uploads.

Figshare has an API that allows anyone to access a wide range of data and, after we chatted, Anil took an interest in the following four item types: figures; media; dataset; and software. He ran a query through the API for items of those four types uploaded between 06/07/2020 and 26/10/2020.
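For readers curious about what such a query might look like, here is a minimal sketch in Python. It is not Anil's script. It assumes the public Figshare search endpoint, the numeric item type codes shown in the mapping, and a published_date field on each returned record, all of which should be checked against the Figshare API documentation before use. The function names are made up for illustration.

```python
import json
import urllib.request
from datetime import date, datetime

# Assumed public search endpoint; verify against the Figshare API docs.
SEARCH_URL = "https://api.figshare.com/v2/articles/search"

# Assumed numeric codes for the four item types discussed in the post.
ITEM_TYPES = {"figure": 1, "media": 2, "dataset": 3, "software": 9}


def build_search_body(item_type, since, page=1, page_size=100):
    """Build the JSON body for one page of search results."""
    return {
        "item_type": ITEM_TYPES[item_type],
        "published_since": since.isoformat(),  # YYYY-MM-DD
        "order": "published_date",
        "order_direction": "asc",
        "page": page,
        "page_size": page_size,
    }


def fetch_page(body):
    """POST one search request and return the decoded list of records."""
    request = urllib.request.Request(
        SEARCH_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def within_window(records, since, until):
    """Keep records published between since and until, inclusive."""
    kept = []
    for record in records:
        # Assumes an ISO 8601 timestamp such as 2020-10-20T09:30:00Z.
        published = datetime.strptime(record["published_date"][:10], "%Y-%m-%d").date()
        if since <= published <= until:
            kept.append(record)
    return kept
```

A caller would loop over the four item types, request pages with fetch_page(build_search_body("dataset", date(2020, 7, 6), page=n)) until an empty list comes back, and pass the results through within_window to apply the end date of 26/10/2020 on the client side.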

Number of figures, media, datasets and software uploaded to Figshare between 06/07/2020 and 26/10/2020

The graph above shows that the variation in uploads is not significant between the weeks examined, but there were slight increases in media and software during Open Access Week. A closer look at when these items were uploaded indicated that Thursday is the most common day for researchers to archive and share data. Unsurprisingly, weekends were found to be the quietest days.
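The day-of-the-week comparison can be reproduced with a simple tally. The sketch below is an illustration rather than the analysis behind the graph: it assumes each upload comes with an ISO 8601 timestamp, and the function names and sample timestamps are invented for the example.

```python
from collections import Counter
from datetime import datetime

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def uploads_by_weekday(timestamps):
    """Count uploads per weekday from ISO 8601 timestamp strings."""
    counts = Counter()
    for stamp in timestamps:
        # Only the date part (YYYY-MM-DD) is needed to find the weekday.
        day = datetime.strptime(stamp[:10], "%Y-%m-%d")
        counts[DAY_NAMES[day.weekday()]] += 1
    return counts


def busiest_day(counts):
    """Return the weekday name with the highest upload count."""
    return counts.most_common(1)[0][0]


# Invented timestamps, for illustration only (not real upload data).
sample = [
    "2020-10-22T09:00:00Z",
    "2020-10-22T15:00:00Z",
    "2020-10-24T11:00:00Z",
]
print(uploads_by_weekday(sample))
```

Feeding in the timestamps returned by the API query, rather than the invented sample, would give the per-day totals needed to compare weekdays with weekends.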

Conclusion

Open Access Week 2020 didn’t result in an upload frenzy. However, the sharing of these four item types is consistent across the timeframe analysed and Figshare is one of many data repositories that researchers can use to openly share their data. The bigger picture is that open research data is of growing importance as we look to increase transparency, reproducibility and reuse of data produced by our researchers. Data.ncl can archive all four item types and we are keen to see an increase in these deposits across all research data repositories. When data is archived elsewhere you can create a record of it in data.ncl to help increase the impact and visibility of the data.

At Newcastle this is the first time we have promoted the competition, so it will take time for Open Access Week and data sharing to be on the radar of our researchers. It is interesting that Thursday is a particularly popular day to share data, so perhaps we need a Thor-inspired sharing initiative – data sharers assemble, anyone?

This blog was written in collaboration with Anil, and his original blog, Open Access Data: What do we Share, can be found here: https://www.anilyildiz.info/blog/2020-10-26-blog-8. A review of the data findings is available on data.ncl.

Celebrating Open Research Data with a DATA.NCL Upload Competition

For Open Access Week (October 19-25), Figshare is running a research data upload competition, offering prizes for participating institutions who upload the most items and researchers who upload during that week.

Data.ncl, Newcastle’s Research Data Repository, is powered by Figshare so all data uploaders – regardless of whether we are a winning institution – will have a chance to win one of five £100 Amazon gift vouchers, distributed virtually. Figshare will also be making a $500 donation to Resourcing Racial Justice, an organization that supports individuals and communities working towards racial justice.

Items must be uploaded to data.ncl between 12am on 19th October and 11:59pm on 25th October. Where possible we would encourage the data to be openly available, but it doesn’t necessarily have to be published if you require more time to prepare the dataset.

This is a little incentive to find some time during Open Access Week to prepare and share that dataset you’ve been sitting on or meaning to archive. Some of the key benefits of sharing data through data.ncl are:

  • The data is assigned a persistent identifier (DOI) and a citation provided, so the data can be formally attributed
  • The persistent identifier helps to make the data discoverable through Google and other search engines to maximise visibility and impact of the research
  • Data can be located and accessed by you, without having to actively manage it

Since data.ncl was launched in April 2019, Newcastle researchers and PGRs have archived and shared 486 datasets, which have been viewed nearly 270,000 times across the world. Datasets have also been downloaded over 50,000 times and cited by researchers who have gone on to reuse the data.

Data.ncl is not just for data but also for code/software and methodology, so you can archive and share the research process as well as any data outputs. There is guidance on how to archive data in data.ncl, and you can get in touch with the Research Data Service at rdm@ncl.ac.uk for support with planning, managing and sharing research data.

Happy uploading!