At ESORICS 2014, John Mace will present some joint work with Aad van Moorsel and myself on the problem of quantitative workflow resiliency.
Intuitively, a workflow is a partially ordered set of tasks, such that if a task t is less than a task t’, then t must be executed before t’. Each task is executed by a user, and different tasks can be executed by different users. For instance, consider the reviewing process for a journal: the paper is assigned to some reviewers by an editor, then each reviewer submits a review for the paper, and the editor makes a decision. Clearly, the decision can only be made after each review has been submitted, and each review can only be submitted after the assignment has been done. However, the different reviews can be submitted in any order.
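As a toy illustration (the task names are mine, mirroring the journal example), the partial order can be encoded as a set of precedence pairs, and the valid execution orders are exactly the linear extensions of that order:

```python
from itertools import permutations

tasks = ["assign", "review1", "review2", "decide"]
# (t, t') means task t must be executed before task t'
precedes = {("assign", "review1"), ("assign", "review2"),
            ("review1", "decide"), ("review2", "decide")}

def is_valid_order(order):
    """An execution order is valid if it respects every precedence pair."""
    pos = {t: i for i, t in enumerate(order)}
    return all(pos[a] < pos[b] for a, b in precedes)

valid = [o for o in permutations(tasks) if is_valid_order(o)]
# Only the two reviews are unordered with respect to each other,
# so exactly two valid execution orders exist.
```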
A security policy for a workflow consists of multiple constraints over which users can execute which tasks. For instance, some tasks can only be executed by some users (e.g., only an editor can assign a reviewer), some tasks are bound by a separation of duty (e.g., someone submitting a paper cannot be reviewing that paper), while other tasks are bound by a binding of duty (e.g., the editor making the final decision must be the same as the editor in charge of the initial reviewer assignment). Assigning each task to a user while respecting the security policy is known as the Workflow Satisfiability Problem (WSP), and we refer to “On the Parameterized Complexity and Kernelization of the Workflow Satisfiability Problem”, by Crampton, Gutin and Yeo, from Royal Holloway, for a recent paper on the WSP (Jason Crampton is giving a Colloquium at Newcastle University on 22/07 about this line of work).
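These three kinds of constraints can be sketched with a brute-force WSP check (this is illustrative only, not the algorithm from the cited paper; all names are hypothetical): enumerate every user-to-task assignment and keep those satisfying the authorization, separation-of-duty and binding-of-duty constraints.

```python
from itertools import product

tasks = ["assign", "review", "decide"]
users = ["alice", "bob", "carol"]
authorized = {"assign": {"alice", "bob"},    # only editors may assign
              "review": {"bob", "carol"},
              "decide": {"alice", "bob"}}
sod = [("assign", "review")]   # separation of duty: assigner must not review
bod = [("assign", "decide")]   # binding of duty: same editor assigns and decides

def satisfies(assignment):
    """Check a task -> user assignment against all three constraint kinds."""
    if any(assignment[t] not in authorized[t] for t in tasks):
        return False
    if any(assignment[t1] == assignment[t2] for t1, t2 in sod):
        return False
    if any(assignment[t1] != assignment[t2] for t1, t2 in bod):
        return False
    return True

solutions = [dict(zip(tasks, choice))
             for choice in product(users, repeat=len(tasks))
             if satisfies(dict(zip(tasks, choice)))]
```

Brute force is exponential in the number of tasks, which is precisely why the parameterized-complexity results in the Crampton, Gutin and Yeo paper matter.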
To some extent, the WSP is rather static: the workflow is analysed together with its policy, and either a correct assignment is found, in which case the workflow can be executed, or there is no possible assignment. However, in practice, the availability of the different users can vary from the start of the workflow until the end. For instance, a reviewer can become unavailable after being assigned a review, or the editor in charge of the paper at the beginning could develop a conflict of interest with the authors during the process. Wang and Li introduced the notion of workflow resiliency: roughly speaking, a workflow is N-resilient if the WSP can still be solved even when N users become unavailable. Resiliency can be static, when users can only become unavailable before the start of the workflow, decremental, when users can become unavailable at any point, or dynamic, when users can become unavailable at any point and then potentially become available again.
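Static resiliency, the simplest of the three notions, can be sketched as follows (my own toy formulation, checking authorization constraints only for brevity): a workflow is statically N-resilient if the WSP remains satisfiable after removing any subset of N users before execution starts.

```python
from itertools import combinations, product

def wsp_satisfiable(tasks, users, authorized):
    """True if some assignment of the available users covers every task
    (authorization constraints only, for brevity)."""
    return any(all(u in authorized[t] for t, u in zip(tasks, choice))
               for choice in product(users, repeat=len(tasks)))

def statically_resilient(tasks, users, authorized, n):
    """True if the WSP stays satisfiable for every choice of n absent users."""
    return all(wsp_satisfiable(tasks,
                               [u for u in users if u not in down],
                               authorized)
               for down in combinations(users, n))
```

Decremental and dynamic resiliency are harder to capture this way, because availability then changes between tasks rather than once up front.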
Resiliency is a binary notion: a workflow is N-resilient if it can still be completed whenever any N users become unavailable, and not N-resilient otherwise. In the ESORICS paper, we suggest that in practice, user availability is not a strictly deterministic property, but a probabilistic one. Indeed, if we know in advance that a reviewer is not available at a particular time, then we can simply make that reviewer unqualified for the corresponding task, thus reducing resiliency to solving the WSP. Instead, we might just know that a reviewer has some probability of becoming unavailable at some point (e.g., sickness, over-scheduling, travelling, etc.). In other words, a workflow has some probability of terminating, depending on whether some users become unavailable or not.
Now, consider two workflows, both satisfying the WSP (i.e., assuming full user availability, it is possible to find a user assignment ensuring that each task is executed), such that, over 100 executions with similar user availability, the first fails to terminate once, while the second fails to terminate 99 times. Although neither is resilient, we believe these two workflows should receive different measures, since the first could actually be acceptable in many situations, while the second is effectively unusable. Of course, one could argue that a workflow terminating in all cases is ideal when one exists, but in practice, if we had to choose between the two previous workflows, the first is clearly preferable.
Hence, we define the notion of Quantitative Workflow Resiliency, which measures the likelihood that a workflow terminates given a probabilistic user availability model, considering that users can be reassigned at runtime as long as the policy remains satisfied. More precisely, we encode the problem of selecting a user for a task as a Markov Decision Process, where each process state consists of a context (containing the previous task assignment and the user availability) and a task, and each action corresponds to selecting a user for the task of the state. The transition function governs which task is to be executed next and how the context evolves, in particular concerning user availability. The termination of the workflow is associated with a reward of 1, while any other transition is associated with a reward of 0. The value of the initial state then corresponds to the probability that the workflow terminates. We also showed that by associating each transition with a reward of 1 and termination with a reward of 0, the value of the initial state corresponds to the expected number of tasks executed by the workflow.
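A deliberately simplified sketch of the MDP view (my own toy model, not the exact encoding from the paper): here the state is just the index of the next task, the action picks an authorized user, and each user is independently available with a known probability. With reward 1 on termination, backward induction gives the best achievable termination probability; with reward 1 per executed task instead, it gives the expected number of tasks executed.

```python
tasks = ["assign", "review", "decide"]
authorized = {"assign": {"alice"},
              "review": {"bob", "carol"},
              "decide": {"alice"}}
# Probability that each user is available when called upon (illustrative).
availability = {"alice": 0.9, "bob": 0.5, "carol": 0.8}

def termination_value(i):
    """V(n) = 1 (terminated, reward 1); V(i) = max_u p(u) * V(i+1)."""
    if i == len(tasks):
        return 1.0
    return max(availability[u] for u in authorized[tasks[i]]) \
        * termination_value(i + 1)

def expected_tasks(i):
    """E(n) = 0; E(i) = max_u p(u) * (1 + E(i+1)): reward 1 per task."""
    if i == len(tasks):
        return 0.0
    return max(availability[u] * (1.0 + expected_tasks(i + 1))
               for u in authorized[tasks[i]])

# The optimal policy picks carol (0.8) over bob (0.5) for the review,
# so the workflow terminates with probability 0.9 * 0.8 * 0.9 = 0.648.
```

In the paper, the state also carries the context (previous assignments and current availability), which is what lets the two reward schemes disagree on the optimal assignment.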
We believe these two metrics are useful for evaluating workflow resiliency, and provide more information than a binary notion of resiliency. In particular, we provide in the paper an illustrative example where the assignment optimising termination is different from that optimising the number of tasks, and both are different from that optimising the computation time required at runtime to calculate the assignment. In other words, finding an optimal user assignment might require solving a tradeoff between termination, task execution and computation time.