
    NEMO-Med: Optimization and Improvement of Scalability

    The NEMO oceanic model is widely used within the climate community. It is employed, in different configurations, in more than 50 research projects for both long- and short-term simulations. The computational requirements of the model and its implementation limit the exploitation of emerging computational infrastructures at the peta- and exascale, so a deep revision and analysis of the model and its implementation were needed. The paper describes the performance evaluation of the model (v3.2), based on MPI parallelization, on the MareNostrum platform at the Barcelona Supercomputing Centre. The scalability analysis has been carried out taking into account different factors, such as the I/O system available on the platform, the domain decomposition of the model and the level of parallelism. The analysis highlighted several bottlenecks due to communication overhead. The code has been optimized by reducing the communication weight within some frequently called functions, and the parallelization has been improved by introducing a second level of parallelism based on the OpenMP shared-memory paradigm.
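    A minimal sketch of the hybrid scheme described above, assuming a generic averaging update on a horizontally decomposed subdomain (the array names, sizes and the update itself are illustrative and not taken from the NEMO code): MPI ranks own the subdomains produced by the domain decomposition, while OpenMP threads share the grid-point loops within each rank.

        #include <mpi.h>
        #include <omp.h>
        #include <stdio.h>

        #define NI 128   /* local subdomain size, illustrative only */
        #define NJ 128

        static double field[NI][NJ], tend[NI][NJ];

        int main(int argc, char **argv)
        {
            int provided, rank;

            /* first level of parallelism: MPI domain decomposition */
            MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);

            /* second level of parallelism: OpenMP threads inside each MPI task */
            #pragma omp parallel for collapse(2)
            for (int i = 1; i < NI - 1; i++)
                for (int j = 1; j < NJ - 1; j++)
                    tend[i][j] = 0.25 * (field[i-1][j] + field[i+1][j]
                                       + field[i][j-1] + field[i][j+1]);

            /* halo exchanges with the neighbouring subdomains would follow here */
            printf("rank %d: subdomain updated with %d threads\n",
                   rank, omp_get_max_threads());
            MPI_Finalize();
            return 0;
        }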

    Optimal Task Mapping for NEMO Model

    Climate numerical models require a considerable amount of computing power. Modern parallel architectures provide the computing power needed to perform scientific simulations at acceptable resolutions. However, the efficiency with which climate models exploit parallel architectures is often poor. Several factors influence the parallel efficiency, such as the parallel overhead due to the communications among concurrent tasks, the memory contention among tasks on the same computing node, the load balancing and the task synchronization. The work described here aims at addressing two of the factors influencing the efficiency: the communications and the memory contention. The approach used is based on the optimal mapping of the tasks onto the SMP nodes of a parallel cluster. The mapping can heavily influence the time spent on communications between tasks belonging to the same node or to different nodes. Moreover, considering that each parallel task allocates a different amount of memory, an optimal task mapping can balance the total amount of main memory allocated on each node and hence reduce the overall memory contention. The climate model taken into consideration is PELAGOS025, obtained by coupling the NEMO oceanic model with the BFM biogeochemical model. It has been used in a global configuration with a horizontal resolution of 0.25°. Three different mapping strategies have been implemented, analyzed and compared with the standard allocation performed by the local scheduler. The parallel architecture used for the evaluation is an IBM iDataPlex with Intel Sandy Bridge processors located at the CMCC's Supercomputing Center.
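    The three mapping strategies are not detailed in the abstract; the sketch below illustrates one plausible variant, a greedy placement that balances the estimated memory footprint of the tasks across the SMP nodes (the per-task memory estimates, the node and task counts and the data structures are assumptions made only for illustration).

        #include <stdio.h>

        #define NTASKS 16
        #define NNODES 4

        int main(void)
        {
            /* assumed per-task memory footprints in MB (illustrative values) */
            double task_mem[NTASKS] = { 900, 850, 820, 800, 780, 760, 740, 720,
                                        700, 680, 660, 640, 620, 600, 580, 560 };
            double node_mem[NNODES] = { 0.0 };

            /* greedy strategy: place each task on the node with the smallest
               memory load so far, keeping the per-node totals balanced       */
            for (int t = 0; t < NTASKS; t++) {
                int best = 0;
                for (int n = 1; n < NNODES; n++)
                    if (node_mem[n] < node_mem[best])
                        best = n;
                node_mem[best] += task_mem[t];
                printf("task %2d -> node %d\n", t, best);
            }

            for (int n = 0; n < NNODES; n++)
                printf("node %d: %.0f MB allocated\n", n, node_mem[n]);
            return 0;
        }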

    Performance and results of the high-resolution biogeochemical model PELAGOS025 v1.0 within NEMO v3.4

    The present work aims at evaluating the scalability of a high-resolution global ocean biogeochemistry model (PELAGOS025) on massively parallel architectures and the benefits in terms of time-to-solution reduction. PELAGOS025 is an on-line coupling between the Nucleus for European Modelling of the Ocean (NEMO) physical ocean model and the Biogeochemical Flux Model (BFM). Both models use a parallel domain decomposition along the horizontal dimension, and the parallelisation is based on the message-passing paradigm. The performance analysis has been carried out on two parallel architectures, an IBM BlueGene/Q at ALCF (Argonne Leadership Computing Facility) and an IBM iDataPlex with Sandy Bridge processors at CMCC (Euro-Mediterranean Center on Climate Change). The outcome of the analysis demonstrated that the lack of scalability is due to several factors, such as the I/O operations, the memory contention, the load imbalance caused by the memory structure of the BFM component and, for the BlueGene/Q, the absence of a hybrid parallelisation approach.
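    The scalability evaluation relies on standard strong-scaling measures; as a reminder (these are the usual textbook definitions, not quantities defined in the abstract), the speedup and parallel efficiency on p cores can be written, in LaTeX notation, as

        S(p) = \frac{T(p_{\mathrm{ref}})}{T(p)}, \qquad E(p) = \frac{p_{\mathrm{ref}}\, S(p)}{p}

    where T(p) is the elapsed time on p cores and p_ref is the smallest core count used as the baseline of the comparison.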

    The performance model for a parallel SOR algorithm using the red-black scheme

    Successive over-relaxation (SOR) is a variant of the iterative Gauss-Seidel method for solving a linear system of equations Ax = b. The SOR algorithm is used within the Nucleus for European Modelling of the Ocean (NEMO) model for solving the elliptic equation for the barotropic stream function. The NEMO performance analysis shows that the SOR algorithm introduces a significant communication overhead. Its parallel implementation is based on the red-black scheme and requires a communication step at each iteration. An enhanced parallel version of the algorithm has been developed by acting on the size of the overlap region in order to reduce the frequency of the communications. The overlap size must be carefully tuned to reduce the communication overhead without increasing the computing time. This work describes an analytical performance model of the SOR algorithm that can be used to establish the optimal size of the overlap region.
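    The analytical model itself is not reproduced in the abstract; the following is a minimal sketch of the kind of cost expression such a model balances, assuming an overlap region of width h that allows h red-black iterations between halo exchanges (the symbols are illustrative and not the paper's notation):

        T_{\mathrm{iter}}(h) \approx t_f\,(n_x + 2h)(n_y + 2h) + \frac{1}{h}\bigl(4\lambda + 2\tau h\,(n_x + n_y)\bigr)

    where n_x x n_y is the local subdomain, t_f the per-point update cost, \lambda the per-message latency towards each of the four neighbours and \tau the per-point transfer time. The optimal h trades the extra computation on the enlarged subdomain against the reduced exchange frequency and can be estimated by minimizing T_iter(h) over the admissible values of h.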

    The Oasis3 MPI1/2 Parallel Version

    This work describes the optimization and parallelization activities performed on the OASIS3 coupler. The test case used for evaluating and profiling the coupler consists of the CMCC-MED coupled model, developed by the ANS division of the CMCC and currently in production on the NEC SX9 cluster. The experiments highlighted that the most time-consuming transformations are the extrapolation of the fields on the masked points (performed in the extrap function) and the interpolation (performed in the scriprmp function). The optimization has mainly focused on reducing the time spent on I/O operations, which reduced the coupling time by 27%. The parallelization of OASIS3 has been a further step towards reducing the elapsed time of the whole coupled model. The proposed parallel approach is based on the distribution of the fields among the available processes: each process is in charge of applying the coupling transformations to its assigned fields. With this approach the number of coupling fields represents an upper bound on the parallelization level. However, this approach can be fully combined with the parallelization based on the geographical domain decomposition. The work concludes with a qualitative comparison of the proposed approach with the OASIS3 pseudo-parallel version developed by CERFACS.
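    A minimal sketch of the field-distribution idea, assuming a simple round-robin assignment of the coupling fields to the MPI processes (the field count, the assignment rule and the transformation placeholder are illustrative; the actual OASIS3 implementation is not reproduced here).

        #include <mpi.h>
        #include <stdio.h>

        #define NFIELDS 8   /* number of coupling fields, illustrative */

        int main(int argc, char **argv)
        {
            int rank, size;
            MPI_Init(&argc, &argv);
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);
            MPI_Comm_size(MPI_COMM_WORLD, &size);

            /* round-robin distribution: field f is handled by rank f % size,
               so at most NFIELDS processes can be kept busy, which is the
               upper bound on the parallelization level noted above          */
            for (int f = 0; f < NFIELDS; f++)
                if (f % size == rank)
                    printf("rank %d: transforming field %d (extrap + interp)\n",
                           rank, f);

            MPI_Finalize();
            return 0;
        }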

    A Performance Evaluation Method for Coupled Models

    In the high-performance computing context, the performance evaluation of a parallel algorithm is mainly made by considering the elapsed time of the parallel application with different numbers of cores or with different problem sizes (for scaled speed-up). Typically, parallel applications embed mechanisms for efficiently using the allocated resources, for example guaranteeing a good load balancing and reducing the parallel overhead. Unfortunately, this assumption does not hold for coupled models. These models are born from the coupling of stand-alone climate applications. The component models are developed independently of each other and follow different development roadmaps. Moreover, they are characterized by different levels of parallelization and different workload requirements, and each has its own scalability curve. Considering a coupled model as a single parallel application, we can note the lack of a policy for balancing the computational load on the available resources. This work addresses the issues related to the performance evaluation of a coupled model and tries to answer the following questions: given a number of processors allocated to the whole coupled model, how should the run be configured in order to balance the workload? How many processors must be assigned to each of the component models? The methodology described here has been applied to evaluate the scalability of the CMCC-MED coupled model designed by INGV and the ANS Division of the CMCC. The evaluation has been carried out on two different computational architectures: a scalar cluster based on IBM Power6 processors and a vector cluster based on NEC SX9 processors.
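    The abstract does not state how the processor split is chosen; the sketch below illustrates the underlying reasoning under simple assumptions: given the (here invented, analytic) scalability curves of two component models, an exhaustive search picks the split of a fixed core budget that minimizes the slower component, since a coupled model advances at the pace of its slowest member.

        #include <stdio.h>

        /* illustrative elapsed-time models for two components as a function of
           the cores assigned to each; in practice these would be measured
           scalability curves, not analytic formulas                          */
        static double t_ocean(int p)      { return 3600.0 / p + 2.0 * p; }
        static double t_atmosphere(int p) { return 5400.0 / p + 1.5 * p; }

        int main(void)
        {
            const int total = 128;   /* cores allocated to the whole coupled model */
            int best_po = 1;
            double best_t = 1e30;

            /* try every split and keep the one with the smallest maximum
               component time                                                */
            for (int po = 1; po < total; po++) {
                int pa = total - po;
                double t = t_ocean(po) > t_atmosphere(pa) ? t_ocean(po)
                                                          : t_atmosphere(pa);
                if (t < best_t) { best_t = t; best_po = po; }
            }
            printf("ocean: %d cores, atmosphere: %d cores, time ~ %.1f s\n",
                   best_po, total - best_po, best_t);
            return 0;
        }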

    NEMO-Med: Extra-Halo Performance Model

    The NEMO oceanic model, in the configuration used at CMCC, characterized by a resolution of 1/16° and tailored to the Mediterranean Basin, has been analyzed to discover possible bottlenecks limiting its parallel scalability. A detailed scalability analysis of all of the routines called during a NEMO time step identified the SOR solver routine as the most expensive from the communication point of view. The routine implements the red-black successive over-relaxation method, an iterative algorithm used for solving the elliptic equation for the barotropic stream function. The algorithm iterates until convergence is reached; a limit on the maximum number of iterations is also set. The high frequency of data exchange within this routine implies a high communication overhead. The NEMO code includes an enhanced version of the routine that reduces the frequency of communication by adding an extra-halo region. The use of this optimization requires the selection of the optimal extra-halo size to trade off computation against communication. A performance model, allowing the choice of the optimal extra-halo value for a pre-defined decomposition, has been designed. The model has been tested on the MareNostrum cluster at the Barcelona Supercomputing Centre.
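    A compact sketch of the extra-halo idea in the red-black solver, assuming an overlap of width H that allows H iterations between halo exchanges (the array names, sizes, relaxation factor and the stubbed exchange routine are illustrative; this is not the NEMO code).

        #include <mpi.h>

        #define NX 64     /* interior size of the local subdomain    */
        #define NY 64
        #define H   4     /* extra-halo width, the value to be tuned */

        /* solution and right-hand side with an H-point halo on each side */
        static double x[NX + 2 * H][NY + 2 * H];
        static double rhs[NX + 2 * H][NY + 2 * H];

        /* placeholder for the exchange of the H-wide boundary strips with the
           four neighbouring subdomains (e.g. via MPI_Sendrecv)              */
        static void exchange_extra_halo(void) { /* ... */ }

        int main(int argc, char **argv)
        {
            const double omega = 1.7;   /* over-relaxation factor */
            MPI_Init(&argc, &argv);

            for (int it = 0; it < 1000; it++) {
                /* one exchange serves the next H iterations: each iteration
                   consumes one ring of the H-wide overlap region            */
                if (it % H == 0)
                    exchange_extra_halo();

                int s = it % H;   /* rings of the overlap already consumed */
                for (int colour = 0; colour < 2; colour++)     /* red, black */
                    for (int i = 1 + s; i < NX + 2 * H - 1 - s; i++)
                        for (int j = 1 + s; j < NY + 2 * H - 1 - s; j++)
                            if ((i + j) % 2 == colour)
                                x[i][j] += omega * (0.25 * (x[i-1][j] + x[i+1][j]
                                         + x[i][j-1] + x[i][j+1] - rhs[i][j]) - x[i][j]);
                /* residual-based convergence test and iteration cap omitted */
            }

            MPI_Finalize();
            return 0;
        }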

    Definition of an ESM Benchmark for Evaluating Parallel Architectures

    Different approaches exist for evaluating the computational performance of a parallel system. These approaches are based on the application of benchmarking tools for evaluating either the whole system or some of its sub-components (e.g. the I/O system, memory bandwidth, node interconnection). Different kinds of benchmarks can be considered: real-program benchmarks are based on real applications; kernel benchmarks include some key codes normally abstracted from actual programs (e.g. linear algebra operations); component benchmarks focus on the evaluation of a computer's basic components; synthetic benchmarks are built by taking statistics of all types of operations from many application programs and writing a program based on a proportional invocation of such operations. This report describes the development of an ESM (Earth System Model) benchmark, based on real applications, for evaluating the performance of a parallel system and its suitability for running climate models. The development of the ESM benchmark started from the composition of an evaluation suite that includes some of the most significant ESM models adopted by the climate community. The selection of the ESM models has been made within the ENES community, involving all of the main climate centers in Europe. Finally, we have defined a metric as an index for measuring the system's performance. The benchmark will be used both for comparing different parallel architectures and for highlighting the hotspots of the target one. The benchmark's results can provide useful hints for tuning and better configuring the analyzed system.
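    The metric itself is not specified in the abstract; purely as a hypothetical illustration of the kind of composite index such a benchmark could adopt (the symbols and the normalization are assumptions, not the report's definition), one could take the geometric mean of the throughput of each model of the suite normalized to a reference system, in LaTeX notation

        M = \left( \prod_{k=1}^{K} \frac{\mathrm{SYPD}_k}{\mathrm{SYPD}_k^{\mathrm{ref}}} \right)^{1/K}

    where SYPD_k is the simulated years per day achieved by the k-th ESM of the suite on the evaluated system and SYPD_k^ref is the same quantity on the reference system.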

    ORCA025: Performance Analysis on Scalar Architecture

    This technical report describes the porting and performance evaluation activities performed on the ORCA025 code, which implements the OPA global ocean general circulation model (OGCM). The code, currently available and optimized on vector architectures, has been ported to the HP XC6000 Itanium2 scalar cluster provided by the associate partner SPACI. The activity mainly focuses on evaluating how a scalar architecture based on the Itanium2 processor behaves with an oceanographic model that traditionally runs on vector clusters. The performance analysis of the parallel code showed good results in terms of scalability.