A First Look at a New Method to Detect Problems in ACME Atmosphere Code

Wednesday, May 6, 2015 - 07:00

Testing strategies are an integral part of source code development, but comprehensive testing is difficult for large and complex source codes like ACME. Interactions between code components are complex and highly nonlinear. While it is possible to write unit tests for some model components, often neither analytic solutions nor converged numerical solutions exist for them, and no mechanisms are currently in place for exercising the interactions between components. This makes it difficult to write comprehensive and rigorous 'unit checkers'. We are exploring another paradigm for examining the model solutions, one that exposes potential problems in the code by searching for solution 'discontinuities' where they are not necessarily expected. Our method is inspired by, but differs from, the 'perturbation growth test' strategy first described in Rosinski and Williamson (1997, hereafter R&W97), which has historically been used to verify the 'correct' implementation of CESM code in a new computing environment (consisting of hardware, operating system, compilers and libraries) by exploiting the accumulation of rounding errors. Like R&W97, we monitor pairs of solutions in the presence of small perturbations, introduced either deliberately or by compilers, hardware, etc. Our strategy differs from R&W97 in that we do not allow the perturbations to accumulate over many timesteps. Instead, we monitor the accumulation of perturbations within a single timestep and repeat that test over many timesteps. The test is capable of exposing discontinuities in the solution as thresholds are crossed, problems in iterative algorithms, and compiler problems. We exercise ALL the executing code over many physically meaningful parts of the code 'phase space' and appropriate meteorological regimes, and monitor the code in many millions of situations during one 'test'. The strategy is able to isolate the 'failure point' to a particular family of code (i.e., the 'Fortran module' or 'parameterization'). The required code revisions are relatively benign: new code is inserted in approximately a dozen locations in the atmospheric model, and existing code is revised to improve the reproducibility of simulations across MPI/OpenMP decompositions (the model should now be insensitive to the use of these parallelization strategies). The method is at this point mature enough to undergo more rigorous evaluation. We will describe some of our initial explorations of the methodology, which revealed some code issues in CAM. We hypothesize that this procedure can readily reveal problems in model simulations in a new environment by running an ensemble of short and inexpensive simulations.
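
The within-timestep idea described above can be illustrated with a toy sketch in Python. This is not ACME or CAM code: the three 'processes' (`relax`, `switch`, `damp`), the perturbation size `EPS`, the tolerance `GROWTH_TOL`, and the function `within_step_test` are all hypothetical names and values chosen for illustration. At every timestep a perturbed copy of the reference state is created, both copies are advanced through the same sequence of processes, and the difference is checked after each process so that an unexpected jump can be attributed to the process that caused it. Only the reference solution is carried to the next timestep, so perturbations never accumulate across timesteps.

```python
EPS = 1e-12        # assumed perturbation size (illustrative value)
GROWTH_TOL = 1e3   # assumed tolerance: flag differences above EPS * GROWTH_TOL


def relax(x):
    """Hypothetical smooth process: relax the state toward 1.0."""
    return x + 0.1 * (1.0 - x)


def switch(x):
    """Hypothetical process with a hard threshold at 0.5 (a 'discontinuity')."""
    return x + 0.2 if x > 0.5 else x


def damp(x):
    """Hypothetical smooth process: uniform damping."""
    return 0.9 * x


def within_step_test(state, processes, n_steps):
    """Count, per process, the columns where a tiny perturbation grew abruptly.

    The perturbation is re-created from the reference state at every timestep,
    so it can only grow within a timestep, never across timesteps.
    """
    failures = {name: 0 for name, _ in processes}
    state = list(state)
    for _ in range(n_steps):
        ref = list(state)
        pert = [x + EPS for x in state]       # fresh perturbation each timestep
        flagged = [False] * len(state)
        for name, proc in processes:
            ref = [proc(x) for x in ref]
            pert = [proc(x) for x in pert]
            for i, (a, b) in enumerate(zip(ref, pert)):
                # Attribute the first large jump in a column to this process.
                if not flagged[i] and abs(a - b) > EPS * GROWTH_TOL:
                    failures[name] += 1
                    flagged[i] = True
        state = ref                            # only the reference is carried forward
    return failures


columns = [0.1, 0.5, 0.9]                      # toy 'columns'; 0.5 sits on the threshold
pipeline = [("switch", switch), ("relax", relax), ("damp", damp)]
report = within_step_test(columns, pipeline, n_steps=20)
print(report)
```

With these toy inputs, only the column that starts exactly on the threshold should be flagged, and the jump is attributed to `switch`; the smooth processes leave the perturbation at roughly its original size. In a real model the same bookkeeping would have to cover many fields and many millions of columns and timesteps, since the chance of any single sample landing close enough to a threshold is small.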