A Flexible, Robust, High-Performance Data System for the GCAM Model
Increasingly complex human-Earth system models have increasingly complex data requirements, to the point that stand-alone software systems are required to track and assemble the inputs to the main model. A new data system known as “gcamdata” was developed for the Global Change Assessment Model (GCAM) to provide a robust, reproducible, and transparent way to track and prepare hundreds of model inputs and to let researchers easily construct alternative scenarios.
While this new data system was built specifically for GCAM, many of its components and processing approaches are broadly applicable to, and reusable by, other complex model and data systems that aim to improve transparency, reproducibility, and flexibility. As open-source software with a flexible architecture, gcamdata introduces a new way to handle and prepare the data that feed complex global models. This saves researchers time and effort, improves traceability and reproducibility, and enables exploratory “what-if” analyses using GCAM.
Modern, integrated human-Earth system models are complex and require correspondingly detailed input datasets. These models are sophisticated attempts to quantify relationships among environmental, social, and economic factors. The new data system is designed to be clear and easy to use across a variety of modeling scenarios, and it includes documentation and error checking. Every data object in gcamdata is required to carry descriptive metadata (including title, units, source, and comments), which allows researchers to track data provenance as objects flow between the parts of the system. As a result, a full system-wide data map can be constructed, and the upstream and downstream dependencies of any object can be traced and explored in detail. Many parts of the gcamdata package can be repurposed for any data system that involves multiple, potentially interacting data processing steps, improving the reproducibility and transparency of science in many modeling domains.
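The idea of mandatory metadata combined with dependency tracing can be illustrated with a minimal sketch. It is written in Python for illustration only and does not reproduce gcamdata's actual interface: the `DataObject` class, the `upstream` and `downstream` helpers, and every dataset name and metadata value below are hypothetical. The sketch assumes that each object records the names of the objects it was built from, which is enough to reconstruct a data map.

```python
from dataclasses import dataclass, field


@dataclass
class DataObject:
    """A named dataset with mandatory descriptive metadata (hypothetical structure)."""
    name: str
    title: str
    units: str
    source: str
    comments: str = ""
    inputs: list = field(default_factory=list)  # names of direct upstream objects

    def __post_init__(self):
        # Reject objects whose required metadata is missing or blank.
        for attr in ("title", "units", "source"):
            if not getattr(self, attr).strip():
                raise ValueError(f"{self.name}: missing required metadata '{attr}'")


def upstream(name, registry):
    """Names of all objects that `name` depends on, directly or indirectly."""
    seen = set()
    stack = list(registry[name].inputs)
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(registry[current].inputs)
    return seen


def downstream(name, registry):
    """Names of all objects that depend on `name`, directly or indirectly."""
    return {other for other in registry if name in upstream(other, registry)}


# Illustrative objects only; these are not real gcamdata datasets.
objects = [
    DataObject("raw_population", "Population by region", "thousand persons",
               "illustrative input file"),
    DataObject("raw_gdp", "GDP by region", "million USD",
               "illustrative input file"),
    DataObject("gdp_per_capita", "GDP per capita by region", "thousand USD per person",
               "derived", inputs=["raw_gdp", "raw_population"]),
    DataObject("energy_demand", "Energy demand by region", "EJ",
               "derived", comments="toy example", inputs=["gdp_per_capita"]),
]
registry = {obj.name: obj for obj in objects}

print(sorted(upstream("energy_demand", registry)))
print(sorted(downstream("raw_gdp", registry)))
```

Validating metadata at construction time means an undocumented object can never enter the registry, and because dependencies are stored as plain names, the same traversal answers both "what does this object rely on?" and "what would be affected if this input changed?".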