Long Summary

As the cost of allocating computing resources to data-intensive tasks continues to decrease, large-scale data analytics becomes ever more affordable, continuously providing new insights from vast amounts of data. Increasingly, predictive models that encode knowledge from data are used to drive decisions in a broad range of areas, from science and public policy to marketing and business strategy. The process of learning such actionable knowledge relies upon information assets: the data itself, the know-how encoded in the analytical processes and algorithms, and any additional background and prior knowledge. Because these assets continuously change and evolve, models may become obsolete over time and, unless they are periodically updated, lead to poor decisions.

Focus of the project

This project is concerned with the need for, and the opportunities afforded by, selective recomputation of resource-intensive analytical workloads. Deciding how to respond to changes in these information assets requires striking a balance between the estimated cost of recomputing a model and the expected benefits of doing so. In some cases, for instance when predictive models are used to diagnose a patient’s genetic disease, new medical knowledge may invalidate a large number of past cases; for other cases, the same changes in knowledge may be marginal or even irrelevant. It is therefore important to be able, firstly, to determine which past results may benefit from recomputation; secondly, to determine whether it is technically possible to reproduce an old computation; and thirdly, when it is, to assess the costs and relative benefits associated with the recomputation.

The project investigates the hypothesis that, based on these determinations, and given a budget for allocating computing resources, it should be possible to accurately identify and prioritise the analytical tasks that are worth considering for recomputation.
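
As an illustration only, the following minimal sketch (in Python) shows one way such budget-constrained prioritisation could work: candidate recomputations are ranked by an estimated benefit/cost ratio and selected greedily until the budget is exhausted. The task identifiers, the cost and benefit figures, and the greedy strategy itself are illustrative assumptions, not the project's actual method.

    # Illustrative sketch: greedy, budget-constrained selection of tasks to
    # recompute, ranked by estimated benefit/cost ratio. All names and
    # numbers are hypothetical.
    from dataclasses import dataclass

    @dataclass
    class Candidate:
        task_id: str
        est_cost: float      # estimated cost of recomputing the task
        est_benefit: float   # expected benefit of an updated result

    def prioritise(candidates: list[Candidate], budget: float) -> list[Candidate]:
        """Return the candidates selected for recomputation within the budget."""
        ranked = sorted(candidates,
                        key=lambda c: c.est_benefit / c.est_cost,
                        reverse=True)
        selected, spent = [], 0.0
        for c in ranked:
            if spent + c.est_cost <= budget:
                selected.append(c)
                spent += c.est_cost
        return selected

    if __name__ == "__main__":
        pool = [
            Candidate("diagnosis-042", est_cost=5.0, est_benefit=9.0),
            Candidate("diagnosis-117", est_cost=2.0, est_benefit=1.5),
            Candidate("city-model-03", est_cost=8.0, est_benefit=4.0),
        ]
        for c in prioritise(pool, budget=10.0):
            print(c.task_id)

A greedy ratio-based selection is only one possible strategy; the point of the sketch is that, once costs and benefits can be estimated, prioritisation under a budget becomes a well-defined optimisation problem.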

Technical approach

Our approach considers three types of meta-knowledge that are associated with analytics tasks, namely:

  1. Knowledge of the history of past results, that is, the provenance metadata that describes which assets were used in the computation, and how;
  2. Knowledge of the technical reproducibility of the tasks; and
  3. Cost/benefit estimation models.

Element (1) is required to determine which prior outcomes may be affected by, and thus potentially benefit from, changes in information assets, while reproducibility analysis (2) is required to determine whether an old analytical task is still functional and can actually be performed again, possibly with new components and on newer input data.
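
Purely as an illustration of how (1) and (2) might be combined, the short Python sketch below uses hypothetical provenance records, each linking a past outcome to the assets it used and to the result of a reproducibility check, to identify the outcomes that used a changed asset and can still be re-executed. The record structure and all identifiers are invented for the example; in practice the provenance metadata could follow a standard model such as W3C PROV.

    # Illustrative sketch: combine provenance (which assets an outcome used)
    # with a reproducibility flag to find recomputation candidates.
    from dataclasses import dataclass, field

    @dataclass
    class ProvenanceRecord:
        outcome_id: str
        used_assets: set[str] = field(default_factory=set)  # data, knowledge, code versions
        reproducible: bool = True   # result of a separate reproducibility analysis

    def affected_outcomes(records: list[ProvenanceRecord], changed_asset: str) -> list[str]:
        """Outcomes that used the changed asset and can still be re-executed."""
        return [r.outcome_id
                for r in records
                if changed_asset in r.used_assets and r.reproducible]

    history = [
        ProvenanceRecord("patient-007", {"variant-db-v1", "pipeline-v2"}),
        ProvenanceRecord("patient-019", {"variant-db-v1", "pipeline-v1"}, reproducible=False),
        ProvenanceRecord("traffic-2016", {"sensor-feed-q1", "model-v3"}),
    ]
    print(affected_outcomes(history, "variant-db-v1"))   # ['patient-007']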

A general framework

As the first two of these elements are independent of the data domain, we aim to develop a general framework that can then be instantiated with domain-specific models, in particular for cost/benefit analysis. The framework will provide decision support for prioritising, and then carrying out, resource-intensive recomputations across a broad range of analytics application domains.
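
The following sketch, again with invented names, illustrates the intended separation of concerns: the generic framework depends only on an abstract cost/benefit interface, and each application domain supplies its own estimator. It is a design illustration under assumed interfaces, not the framework's actual API.

    # Illustrative sketch: a generic ranking function that works with any
    # domain-specific cost/benefit model plugged in behind an abstract interface.
    from abc import ABC, abstractmethod

    class CostBenefitModel(ABC):
        """Domain-specific estimator plugged into the generic framework."""

        @abstractmethod
        def estimate_cost(self, outcome_id: str) -> float: ...

        @abstractmethod
        def estimate_benefit(self, outcome_id: str) -> float: ...

    class GeneticDiagnosticsModel(CostBenefitModel):
        # Toy example: fixed pipeline cost, benefit proportional to how many
        # variants in the original diagnosis were affected by new knowledge.
        def __init__(self, affected_variants: dict[str, int]):
            self.affected_variants = affected_variants

        def estimate_cost(self, outcome_id: str) -> float:
            return 5.0

        def estimate_benefit(self, outcome_id: str) -> float:
            return 2.0 * self.affected_variants.get(outcome_id, 0)

    def rank_for_recomputation(outcomes: list[str], model: CostBenefitModel) -> list[str]:
        """Generic framework code: ranks outcomes using whichever model is supplied."""
        return sorted(outcomes,
                      key=lambda o: model.estimate_benefit(o) / model.estimate_cost(o),
                      reverse=True)

    model = GeneticDiagnosticsModel({"patient-007": 3, "patient-019": 0})
    print(rank_for_recomputation(["patient-007", "patient-019"], model))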

Both (1) and (2) entail technical challenges, as systematically collecting the provenance of complex analytical tasks, and ensuring their reproducibility, requires instrumentation of the data processing environments. We plan to experiment with workflows, a form of high-level programming and middleware technology, to address both problems.
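
As a toy illustration of what such instrumentation could look like, independent of any particular workflow system, the sketch below wraps a workflow step in a Python decorator that records which inputs each invocation used. The decorator, the log format, and the step shown are all hypothetical.

    # Illustrative sketch: instrumenting a workflow step so that every
    # invocation leaves a provenance trace of the inputs it used.
    import functools
    import time

    PROVENANCE_LOG: list[dict] = []

    def traced_step(step_name: str):
        """Wrap a workflow step so that each invocation is logged with its inputs."""
        def decorator(func):
            @functools.wraps(func)
            def wrapper(**inputs):
                result = func(**inputs)
                PROVENANCE_LOG.append({
                    "step": step_name,
                    "inputs": sorted(inputs.keys()),
                    "timestamp": time.time(),
                })
                return result
            return wrapper
        return decorator

    @traced_step("variant-annotation")
    def annotate(variants: list[str], knowledge_base: str) -> list[str]:
        # Stand-in for a real analytical step.
        return [f"{v}@{knowledge_base}" for v in variants]

    annotate(variants=["v1", "v2"], knowledge_base="variant-db-v1")
    print(PROVENANCE_LOG)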

Validation of the approach

To show the flexibility and generality of our framework, we will test and validate it on two very different case studies where decision making is driven by analytical knowledge, namely genetic diagnostics and policy making for Smart Cities.