Cache-aware scheduling of scientific workflows in a multisite cloud

Heidsieck, Gaëtan; De Oliveira, Daniel; Pacitti, Esther; Pradal, Christophe; Tardieu, François; Valduriez, Patrick

Many scientific experiments today are performed using scientific workflows, which are becoming increasingly data-intensive. We consider the efficient execution of such workflows in a multisite cloud, leveraging heterogeneous resources available at multiple geo-distributed data centers. Since workflow users commonly reuse code or data from previous workflows, a promising approach to efficient workflow execution is to cache intermediate data in order to avoid re-executing entire workflows. However, caching intermediate data and scheduling workflows to exploit such caching in a multisite cloud is complex. In particular, workflow scheduling must be cache-aware in order to decide whether to reuse cached data or to re-execute workflows entirely. In this paper, we propose a solution for cache-aware scheduling of scientific workflows in a multisite cloud. Our solution includes a distributed and parallel architecture and new algorithms for adaptive caching, cache site selection, and dynamic workflow scheduling. We implemented our solution in the OpenAlea workflow system, together with cache-aware distributed scheduling algorithms. Our experimental evaluation in a three-site cloud with a real application in plant phenotyping shows that our solution can yield major performance gains, reducing total time by up to 42% when 60% of the input data is the same for each new execution.

Bibliographic Details
Main Authors:	Heidsieck, Gaëtan; De Oliveira, Daniel; Pacitti, Esther; Pradal, Christophe; Tardieu, François; Valduriez, Patrick
Format:	article biblioteca
Language:	eng
Subjects:	U10 - Computer science, mathematics and statistics; computer science; processes; http://aims.fao.org/aos/agrovoc/c_27769; http://aims.fao.org/aos/agrovoc/c_13586
Online Access:	http://agritrop.cirad.fr/597996/ http://agritrop.cirad.fr/597996/1/FGCS_2021.pdf
