A common problem in Data Science is that the value of knowledge generated from (big) data tends to decay over time.
This is well understood, for instance, in machine learning, where models
learnt from examples at a certain point in time tend to lose their
predictive power as those examples become less representative of the
actual data distribution. In some cases, re-training can be performed
incrementally to reduce the additional investment required to “refresh”
the models.
Periodic re-training requires the ability to monitor model performance
over time, a measure of training cost, a definition of the value
associated with the model’s predictions, thresholds to trigger
re-training, and the ability to control the training process.
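To make these elements concrete, the following is a minimal Python sketch of a threshold-based re-training trigger. The names and thresholds (RetrainPolicy, should_retrain, the accuracy, cost, and value inputs) are illustrative placeholders under assumed units, not part of any ReComp interface.

```python
# Minimal sketch of a threshold-based re-training trigger.
# The performance, cost, and value inputs are hypothetical placeholders
# standing in for whatever a concrete deployment provides.

from dataclasses import dataclass

@dataclass
class RetrainPolicy:
    min_accuracy: float       # performance threshold that triggers re-training
    max_training_cost: float  # budget ceiling for a single re-training run

def should_retrain(current_accuracy: float,
                   estimated_training_cost: float,
                   expected_prediction_value: float,
                   policy: RetrainPolicy) -> bool:
    """Decide whether re-training is worthwhile.

    Re-train only if the model has degraded below the accuracy threshold,
    the re-training run fits within the cost budget, and the expected value
    of improved predictions exceeds that cost.
    """
    degraded = current_accuracy < policy.min_accuracy
    affordable = estimated_training_cost <= policy.max_training_cost
    worthwhile = expected_prediction_value > estimated_training_cost
    return degraded and affordable and worthwhile

# Example: a model that slipped to 78% accuracy, with a cheap incremental
# re-training run whose expected benefit outweighs its cost.
policy = RetrainPolicy(min_accuracy=0.85, max_training_cost=100.0)
if should_retrain(0.78, 40.0, 250.0, policy):
    print("trigger re-training")
```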
We observe that these decision elements are common to other areas of Data Science that involve expensive data analytics patterns. We hypothesise that they can be incorporated into a general decision support system able to control the periodic, selective re-computation of costly data-intensive processes.
In the ReComp project, we aim to formalise this selective re-computation problem and to build such a decision support system.
So far we have focused on three areas where we have begun to address this problem: genomics, where the quality of genetic diagnosis depends on the evolving genetics knowledge used in the diagnostic process; stream data analytics, where one can sometimes “skip” computations when the data stream exhibits some stability; and flood modelling, where decisions must be made on when an expensive simulation should be re-run in response to changes in the underlying conditions.
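As an illustration of the “skip when stable” idea mentioned for stream analytics, the sketch below approximates stability by the relative change in a summary statistic between consecutive windows; the is_stable helper, its tolerance, and the process stand-in are hypothetical choices for exposition, not the method used in the project.

```python
# Minimal sketch of skipping re-computation on a stable data stream.
# Stability is approximated by the relative change between the means of
# consecutive windows; the tolerance value is purely illustrative.

from statistics import mean
from typing import Sequence

def is_stable(previous_window: Sequence[float],
              current_window: Sequence[float],
              tolerance: float = 0.05) -> bool:
    """True when the current window's mean has moved by no more than
    `tolerance` (relative) with respect to the previous window."""
    prev, curr = mean(previous_window), mean(current_window)
    if prev == 0:
        return curr == 0
    return abs(curr - prev) / abs(prev) <= tolerance

def process(window: Sequence[float]) -> float:
    """Stand-in for an expensive analytics step over one window."""
    return sum(x * x for x in window)

last_result, last_window = None, None
for window in ([1.0, 1.1, 0.9], [1.0, 1.0, 1.05], [3.0, 2.9, 3.1]):
    if last_window is not None and is_stable(last_window, window):
        result = last_result          # stream is stable: reuse the old result
    else:
        result = process(window)      # significant change: re-compute
    last_result, last_window = result, window
```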