Analysts and researchers face major hurdles in understanding the quality of their data and the knock-on consequences that choices made at one stage of data processing have on the stages that follow. Data visualisation offers many benefits that could help analysts and researchers to overcome those hurdles. This project investigates how visualisation techniques are, and should be, exploited for key aspects of data profiling.
This project, funded by the Alan Turing Institute, aims to characterise the way in which analysts and researchers profile data and design data processing pipelines. This is important in order to understand the limitations of current profiling and pipeline design methods, the barriers that analysts and researchers face, and the ways in which visualisation techniques could be transformative. The project engages with public and private sector analysts and researchers, aiming to identify quick wins, share best practice and develop a research agenda for the adoption of visualisation techniques in data profiling and pipeline design. The primary measure of success will be organisations beginning to adopt the proposed techniques to make their profiling and pipeline design more rigorous and efficient. Such adoption would act as a catalyst for more scalable and higher quality data science.
The use of good-quality data to inform decision making is entirely dependent on robust processes to ensure it is fit for purpose. Such processes vary between organisations, and between those tasked with designing and following them. In this project, 53 data analysts from many industry sectors were surveyed, 24 of whom also participated in in-depth interviews, about computational and visual methods for characterizing data and investigating data quality. From this, a list of data profiling tasks and visualization techniques was compiled that is more comprehensive than those previously published. Furthermore, the results highlight the diversity of profiling tasks, identify unusual practice and exemplars of visualization, and provide recommendations about formalizing processes and creating rulebooks.
Publications
- Ruddle, R., Cheshire, J., & Johansson Fernstad, S. (2024). A Practical Guide to Characterising Data and Investigating Data Quality. University of Leeds. https://doi.org/10.5518/1481
- Ruddle, R. A., Cheshire, J., & Johansson Fernstad, S. (2023). Tasks and Visualizations Used for Data Profiling: A Survey and Interview Study. IEEE Transactions on Visualization and Computer Graphics. https://doi.org/10.1109/TVCG.2023.3234337
Team
- Prof Roy Ruddle (principal investigator), University of Leeds
- Dr Sara Johansson Fernstad (co-investigator), Newcastle University
- Prof James Cheshire (co-investigator), University College London