A methodology for constructing the calculation model of scientific spreadsheets

Spreadsheet models are frequently used by scientists to analyze research data. These models are typically described in a paper or a report, which serves as the single source of information on the underlying research project. Because the calculation workflow in these models is not made explicit, readers cannot fully understand how the research results are calculated, nor trace them back to the underlying spreadsheets. This paper proposes a methodology for semi-automatically deriving the calculation workflow underlying a set of spreadsheets. The starting point of our methodology is the cell dependency graph, which represents all spreadsheet cells and the connections between them. We automatically aggregate all cells in the graph that represent instances and duplicates of the same quantities, based on an analysis of the formula syntax. Subsequently, we use a set of heuristics, incorporating knowledge of spreadsheet design, computational procedures and the domain, to select those quantities that are relevant for understanding the calculation workflow. We explain and illustrate our methodology by applying it to three sets of spreadsheets from existing research projects in the domains of environmental and life science. Results from these case studies show that our constructed calculation models approximate the ground-truth calculation workflows, both in terms of content and size, but are not a perfect match.
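The first two steps described in the abstract (building a cell dependency graph, then aggregating cells whose formulas share the same syntax) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the dictionary-based sheet representation, and the use of R1C1-style relative signatures as the grouping key are all assumptions made for illustration. The sketch handles only simple single-sheet A1-style references (no ranges, no sheet names, and function names that end in digits would be misread as references), and it does not cover the heuristic selection step.

```python
import re
from collections import defaultdict

# Matches simple A1-style references such as B2 or $D$1.
# Assumption: no ranges, no sheet prefixes, no function names ending in digits.
REF = re.compile(r"(\$?)([A-Z]+)(\$?)(\d+)")


def col_to_index(col):
    """Convert a column label (A, B, ..., Z, AA, ...) to a 1-based index."""
    index = 0
    for ch in col:
        index = index * 26 + (ord(ch) - ord("A") + 1)
    return index


def is_formula(content):
    """A cell holds a formula if its content is a string starting with '='."""
    return isinstance(content, str) and content.startswith("=")


def dependency_graph(cells):
    """Map each formula cell to the set of cell addresses its formula refers to."""
    graph = {}
    for address, content in cells.items():
        if is_formula(content):
            graph[address] = {m.group(2) + m.group(4) for m in REF.finditer(content)}
    return graph


def relative_signature(address, formula):
    """Rewrite every reference as an offset from the formula's own cell.

    Relative references become R[dr]C[dc]; parts marked with '$' stay absolute.
    Copied-down formulas therefore end up with identical signatures.
    """
    own = REF.fullmatch(address)
    own_col, own_row = col_to_index(own.group(2)), int(own.group(4))

    def rewrite(m):
        ref_col, ref_row = col_to_index(m.group(2)), int(m.group(4))
        col_part = f"C{ref_col}" if m.group(1) else f"C[{ref_col - own_col}]"
        row_part = f"R{ref_row}" if m.group(3) else f"R[{ref_row - own_row}]"
        return row_part + col_part

    return REF.sub(rewrite, formula)


def aggregate(cells):
    """Group formula cells that share the same relative signature."""
    groups = defaultdict(list)
    for address, content in cells.items():
        if is_formula(content):
            groups[relative_signature(address, content)].append(address)
    return dict(groups)


# Hypothetical toy sheet: B1 and B2 are copies of the same formula.
cells = {
    "A1": 2.0,
    "A2": 3.0,
    "D1": 10.0,
    "B1": "=A1*$D$1",
    "B2": "=A2*$D$1",
    "C1": "=B1+B2",
}

print(dependency_graph(cells))
print(aggregate(cells))
```

In this toy sheet, B1 and B2 fall into one group because both formulas multiply the cell to their left by the fixed cell D1, whereas C1 forms a group of its own. Grouping on a relative signature is only one possible reading of "analysis of the formula syntax"; the paper itself should be consulted for the actual aggregation rules.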

Bibliographic Details
Main Authors: de Vos, M., Wielemaker, J., Schreiber, G., Wielinga, B., Top, J.L.
Format: Article in monograph or in proceedings
Language: English
Subjects: Calculation Model, Graph Aggregation, Heuristics, Spreadsheets
Online Access: https://research.wur.nl/en/publications/a-methodology-for-constructing-the-calculation-model-of-scientifi