
Implementation of parallel NetCDF in the ParFlow hydrological model: A code modernisation effort as part of a big data handling strategy

Abstract

State-of-the-art geoscience simulations tend towards ever-increasing model complexity, driven by the incorporation of multi-physics, fully coupled model systems, often in combination with higher spatial resolutions. In addition, simulations are run over longer time periods in order to model phenomena such as climate change and water resources. Combined, these factors lead to a big data challenge that is dominated by the TB-scale data volumes to be read and written, i.e. by I/O; data variety, velocity and complexity are smaller issues in comparison.

In this context, the NIC Scientific Big Data Analytics project “Towards a high-performance big data storage, handling and analysis framework for Earth science simulations” has been working since autumn 2015 on a code modernisation effort to prepare geoscience simulation codes, as well as data processing and analysis applications, for big data. The simulation code considered is the massively MPI-parallel hydrological model ParFlow. Thus far, work has centred on the modernisation of ParFlow's parallel I/O: a standalone C code was used to assess and test the features of the pNetCDF and the HDF5-based NetCDF4 I/O libraries and their parallel read and write performance (a minimal sketch of such a test is given after this abstract). Tuning and scaling studies on the JURECA HPC system at JSC led to optimised runtime environment settings and near-linear scaling behaviour of the API. This MPI-parallel C code can be used as a showcase implementation of parallel I/O for some of the Geoverbund ABC/J modelling groups. The NetCDF4 interface was chosen because it constitutes a quasi-standard in the geosciences and ensures consistent and efficient data flow paths as well as compression.

The I/O testing and the scaling experiments were carried out in a JUBE2-based benchmarking framework which also integrates the Score-P profiling and tracing infrastructure, the Scalasca performance optimisation tool and the Darshan HPC I/O characterisation tool. This JUBE2-based framework was then further extended to act as a portable, generic testing platform for all benchmarking, development and testing work with ParFlow, including idealised and real-data reference test cases for weak and strong scaling studies, a variety of compiler options and common profiling tools, all embedded in an easy-to-use run environment.

To further improve ParFlow's I/O functionality, we propose adding NetCDF4 interfaces that write concurrently to a shared, compressed NetCDF file with one MPI task per node (see the second sketch below). The proposed code will adjust automatically to the computational set-up, i.e. the gathering of data within each node, the number of nodes, the I/O interfaces used and the number of MPI ranks per node. Another obvious big data challenge for complex geoscience simulations is the post-processing of terabytes of data. We therefore plan to develop on-the-fly processing and visualisation for ParFlow once the I/O optimisation is finished. This will be a joint effort with the JSC Cross Sectional Team Visualisation to implement in-situ (i.e. during runtime) processing and visualisation functionality using the VisIt software. Additionally, this will help to improve scalability and performance whilst substantially reducing total processing time and model output.
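To illustrate the kind of standalone I/O test mentioned above, the following is a minimal sketch of a collective parallel write through the HDF5-based NetCDF4 C interface. It is not the project's actual test code: the file name, variable name, domain decomposition and array sizes are assumptions made purely for illustration, and it presumes a NetCDF library built with parallel (MPI-IO) support.

    /* Minimal sketch of a collective parallel NetCDF4 (HDF5-backed) write.
     * All names and sizes are illustrative; compile e.g. with
     *   mpicc pario_test.c -o pario_test -lnetcdf
     * against a parallel-enabled NetCDF4 installation.
     */
    #include <mpi.h>
    #include <netcdf.h>
    #include <netcdf_par.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define NC_CHECK(call) do { int s_ = (call); if (s_ != NC_NOERR) { \
        fprintf(stderr, "NetCDF error: %s\n", nc_strerror(s_));        \
        MPI_Abort(MPI_COMM_WORLD, 1); } } while (0)

    int main(int argc, char **argv)
    {
        int rank, nprocs;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        /* Local subdomain per rank (assumed sizes), filled with dummy data. */
        const size_t nz = 8, ny = 16, nx = 16;
        double *field = malloc(nz * ny * nx * sizeof(double));
        for (size_t i = 0; i < nz * ny * nx; i++)
            field[i] = (double)rank;

        /* Every rank opens the same file; HDF5 uses MPI-IO underneath. */
        int ncid;
        NC_CHECK(nc_create_par("pario_test.nc", NC_NETCDF4 | NC_MPIIO,
                               MPI_COMM_WORLD, MPI_INFO_NULL, &ncid));

        /* Global grid: ranks are stacked along the slowest (z) dimension. */
        int dimids[3], varid;
        NC_CHECK(nc_def_dim(ncid, "z", nz * (size_t)nprocs, &dimids[0]));
        NC_CHECK(nc_def_dim(ncid, "y", ny, &dimids[1]));
        NC_CHECK(nc_def_dim(ncid, "x", nx, &dimids[2]));
        NC_CHECK(nc_def_var(ncid, "pressure", NC_DOUBLE, 3, dimids, &varid));
        NC_CHECK(nc_var_par_access(ncid, varid, NC_COLLECTIVE));
        NC_CHECK(nc_enddef(ncid));

        /* Each rank writes its own hyperslab in a collective call. */
        size_t start[3] = { (size_t)rank * nz, 0, 0 };
        size_t count[3] = { nz, ny, nx };
        NC_CHECK(nc_put_vara_double(ncid, varid, start, count, field));

        NC_CHECK(nc_close(ncid));
        free(field);
        MPI_Finalize();
        return 0;
    }

A read test follows the same pattern with nc_open_par and nc_get_vara_double, which is how parallel read and write rates can be compared for the same decomposition.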
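The one-writer-per-node scheme proposed above could look roughly as follows. This is again only a hedged sketch, not the planned ParFlow implementation: the communicator and variable names, the flat one-dimensional data layout, the assumption of an equal number of MPI ranks per node and the deflate settings are all illustrative, and compressed parallel writes require a sufficiently recent NetCDF4/HDF5 combination. Error checking is omitted for brevity.

    /* Sketch of a "one I/O rank per node" write: local data is gathered onto
     * the first rank of every node, and only those node leaders open and
     * write the shared, deflate-compressed NetCDF4 file in parallel.
     */
    #include <mpi.h>
    #include <netcdf.h>
    #include <netcdf_par.h>
    #include <stdlib.h>

    void write_node_aggregated(const double *local, size_t nlocal)
    {
        int wrank;
        MPI_Comm_rank(MPI_COMM_WORLD, &wrank);

        /* 1. Group ranks that share a node (MPI-3 shared-memory split). */
        MPI_Comm nodecomm;
        MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, wrank,
                            MPI_INFO_NULL, &nodecomm);
        int noderank, nodesize;
        MPI_Comm_rank(nodecomm, &noderank);
        MPI_Comm_size(nodecomm, &nodesize);

        /* 2. Gather the node's data onto its leader (rank 0 in nodecomm). */
        double *nodebuf = NULL;
        if (noderank == 0)
            nodebuf = malloc(nlocal * nodesize * sizeof(double));
        MPI_Gather(local, (int)nlocal, MPI_DOUBLE,
                   nodebuf, (int)nlocal, MPI_DOUBLE, 0, nodecomm);

        /* 3. Node leaders form the I/O communicator; other ranks drop out. */
        MPI_Comm iocomm;
        MPI_Comm_split(MPI_COMM_WORLD, noderank == 0 ? 0 : MPI_UNDEFINED,
                       wrank, &iocomm);

        if (noderank == 0) {
            int iorank, iosize, ncid, dimid, varid;
            MPI_Comm_rank(iocomm, &iorank);
            MPI_Comm_size(iocomm, &iosize);

            /* Assumes every node contributes the same number of cells. */
            nc_create_par("parflow_out.nc", NC_NETCDF4 | NC_MPIIO,
                          iocomm, MPI_INFO_NULL, &ncid);
            nc_def_dim(ncid, "cells", nlocal * nodesize * (size_t)iosize, &dimid);
            nc_def_var(ncid, "pressure", NC_DOUBLE, 1, &dimid, &varid);
            nc_def_var_deflate(ncid, varid, 1, 1, 4);      /* shuffle + deflate level 4 */
            nc_var_par_access(ncid, varid, NC_COLLECTIVE); /* compressed writes must be collective */
            nc_enddef(ncid);

            size_t start = (size_t)iorank * nlocal * nodesize;
            size_t count = nlocal * nodesize;
            nc_put_vara_double(ncid, varid, &start, &count, nodebuf);

            nc_close(ncid);
            free(nodebuf);
            MPI_Comm_free(&iocomm);
        }
        MPI_Comm_free(&nodecomm);
    }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        const size_t nlocal = 1024;                 /* cells per rank (assumed) */
        double *local = malloc(nlocal * sizeof(double));
        for (size_t i = 0; i < nlocal; i++)
            local[i] = (double)i;
        write_node_aggregated(local, nlocal);
        free(local);
        MPI_Finalize();
        return 0;
    }

The design point is that only one rank per node touches the file system: the intra-node gather replaces many small write requests with one larger, contiguous request per node, which is what the proposed ParFlow interfaces aim to exploit.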
