1,459 research outputs found

    HPC compact quasi-Newton algorithm for interface problems

    Full text link
    In this work we present a robust interface coupling algorithm called Compact Interface quasi-Newton (CIQN). It is designed for computationally intensive applications using an MPI multi-code partitioned scheme. The algorithm allows to reuse information from previous time steps, feature that has been previously proposed to accelerate convergence. Through algebraic manipulation, an efficient usage of the computational resources is achieved by: avoiding construction of dense matrices and reduce every multiplication to a matrix-vector product and reusing the computationally expensive loops. This leads to a compact version of the original quasi-Newton algorithm. Altogether with an efficient communication, in this paper we show an efficient scalability up to 4800 cores. Three examples with qualitatively different dynamics are shown to prove that the algorithm can efficiently deal with added mass instability and two-field coupled problems. We also show how reusing histories and filtering does not necessarily makes a more robust scheme and, finally, we prove the necessity of this HPC version of the algorithm. The novelty of this article lies in the HPC focused implementation of the algorithm, detailing how to fuse and combine the composing blocks to obtain an scalable MPI implementation. Such an implementation is mandatory in large scale cases, for which the contact surface cannot be stored in a single computational node, or the number of contact nodes is not negligible compared with the size of the domain. \c{opyright} Elsevier. This manuscript version is made available under the CC-BY-NC-ND 4.0 license http://creativecommons.org/licenses/by-nc-nd/4.0/Comment: 33 pages: 23 manuscript, 10 appendix. 16 figures: 4 manuscript, 12 appendix. 10 Tables: 3 manuscript, 7 appendi

    Research and Education in Computational Science and Engineering

    Get PDF
    Over the past two decades the field of computational science and engineering (CSE) has penetrated both basic and applied research in academia, industry, and laboratories to advance discovery, optimize systems, support decision-makers, and educate the scientific and engineering workforce. Informed by centuries of theory and experiment, CSE performs computational experiments to answer questions that neither theory nor experiment alone is equipped to answer. CSE provides scientists and engineers of all persuasions with algorithmic inventions and software systems that transcend disciplines and scales. Carried on a wave of digital technology, CSE brings the power of parallelism to bear on troves of data. Mathematics-based advanced computing has become a prevalent means of discovery and innovation in essentially all areas of science, engineering, technology, and society; and the CSE community is at the core of this transformation. However, a combination of disruptive developments---including the architectural complexity of extreme-scale computing, the data revolution that engulfs the planet, and the specialization required to follow the applications to new frontiers---is redefining the scope and reach of the CSE endeavor. This report describes the rapid expansion of CSE and the challenges to sustaining its bold advances. The report also presents strategies and directions for CSE research and education for the next decade.Comment: Major revision, to appear in SIAM Revie

    High Performance P3M N-body code: CUBEP3M

    Full text link
    This paper presents CUBEP3M, a publicly-available high performance cosmological N-body code and describes many utilities and extensions that have been added to the standard package. These include a memory-light runtime SO halo finder, a non-Gaussian initial conditions generator, and a system of unique particle identification. CUBEP3M is fast, its accuracy is tuneable to optimize speed or memory, and has been run on more than 27,000 cores, achieving within a factor of two of ideal weak scaling even at this problem size. The code can be run in an extra-lean mode where the peak memory imprint for large runs is as low as 37 bytes per particles, which is almost two times leaner than other widely used N-body codes. However, load imbalances can increase this requirement by a factor of two, such that fast configurations with all the utilities enabled and load imbalances factored in require between 70 and 120 bytes per particles. CUBEP3M is well designed to study large scales cosmological systems, where imbalances are not too large and adaptive time-stepping not essential. It has already been used for a broad number of science applications that require either large samples of non-linear realizations or very large dark matter N-body simulations, including cosmological reionization, halo formation, baryonic acoustic oscillations, weak lensing or non-Gaussian statistics. We discuss the structure, the accuracy, known systematic effects and the scaling performance of the code and its utilities, when applicable.Comment: 20 pages, 17 figures, added halo profiles, updated to match MNRAS accepted versio

    NEPTUNE_CFD High Parallel Computing Performances for Particle-Laden Reactive Flows

    Get PDF
    This paper presents high performance computing of NEPTUNE_CFD V1.07@Tlse. NEPTUNE_CFD is an unstructured parallelized code (MPI) using unsteady Eulerian multi-fluid approach for dilute and dense particle-laden reactive flows. Three-dimensional numerical simulations of two test cases have been carried out. The first one, a uniform granular shear flow exhibits an excellent scalability of NEPTUNE_CFD up to 1024 cores, and demonstrates the good agreement between the parallel simulation results and the analytical solutions. Strong scaling and weak scaling benchmarks have been performed. The second test case, a realistic dense fluidized bed shows the code computing performances on an industrial geometry

    Adaptive control in rollforward recovery for extreme scale multigrid

    Full text link
    With the increasing number of compute components, failures in future exa-scale computer systems are expected to become more frequent. This motivates the study of novel resilience techniques. Here, we extend a recently proposed algorithm-based recovery method for multigrid iterations by introducing an adaptive control. After a fault, the healthy part of the system continues the iterative solution process, while the solution in the faulty domain is re-constructed by an asynchronous on-line recovery. The computations in both the faulty and healthy subdomains must be coordinated in a sensitive way, in particular, both under and over-solving must be avoided. Both of these waste computational resources and will therefore increase the overall time-to-solution. To control the local recovery and guarantee an optimal re-coupling, we introduce a stopping criterion based on a mathematical error estimator. It involves hierarchical weighted sums of residuals within the context of uniformly refined meshes and is well-suited in the context of parallel high-performance computing. The re-coupling process is steered by local contributions of the error estimator. We propose and compare two criteria which differ in their weights. Failure scenarios when solving up to 6.910116.9\cdot10^{11} unknowns on more than 245\,766 parallel processes will be reported on a state-of-the-art peta-scale supercomputer demonstrating the robustness of the method

    Efficient CFD code implementation for the ARM-based Mont-Blanc architecture

    Get PDF
    Since 2011, the European project Mont-Blanc has been focused on enabling ARM-based technology for HPC, developing both hardware platforms and system software. The latest Mont-Blanc prototypes use system-on-chip (SoC) devices that combine a CPU and a GPU sharing a common main memory. Specific developments of parallel computing software and well-suited implementation approaches are of crucial importance for such a heterogeneous architecture in order to efficiently exploit its potential. This paper is devoted to the optimizations carried out in the TermoFluids CFD code to efficiently run it on the Mont-Blanc system. The underlying numerical method is based on an unstructured finite-volume discretization of the Navier–Stokes equations for the numerical simulation of incompressible turbulent flows. It is implemented using a portable and modular operational approach based on a minimal set of linear algebra operations. An architecture-specific heterogeneous multilevel MPI+OpenMP+OpenCL implementation of such kernels is proposed. It includes optimizations of the storage formats, dynamic load balancing between the CPU and GPU devices and hiding of communication overheads by overlapping computations and data transfers. A detailed performance study shows time reductions of up to on the kernels’ execution with the new heterogeneous implementation, its scalability on up to 128 Mont-Blanc nodes and the energy savings (around ) achieved with the Mont-Blanc system versus the high-end hybrid supercomputer MinoTauro.The research leading to these results has received funding from the European Community’s Seventh Framework Programme [FP7/2007–2013] and Horizon 2020 under the Mont-Blanc Project (www.montblanc-project.eu), grant agreement n 288777, 610402 and 671697. The work has been financially supported by the Ministerio de Ciencia e Innovación, Spain (ENE- 2014-60577-R), the Russian Science Foundation, project 15-11-30039, CONICYT Becas Chile Doctorado 2012, the Juan de la Cierva posdoctoral grant (IJCI-2014-21034), and the Initial Training Network SEDITRANS (GA number: 607394), implemented within the 7th Framework Programme of the European Commission under call FP7-PEOPLE- 2013-ITN. Our calculations have been performed on the resources of the Barcelona Supercomputing Center. The authors thankfully acknowledge these institutions.Peer ReviewedPostprint (published version

    Model Reduction of Synchronized Lur'e Networks

    Get PDF
    In this talk, we investigate a model order reduction schemethat reduces the complexity of uncertain dynamical networks consisting of diffusively interconnected nonlinearLure subsystems. We aim to reduce the dimension ofeach subsystem and meanwhile preserve the synchronization property of the overall network. Using the upperbound of the Laplacian spectral radius, we first characterize the robust synchronization of the Lure network bya linear matrix equation (LMI), whose solutions can betreated as generalized Gramians of each subsystem, andthus the balanced truncation can be performed on the linear component of each Lure subsystem. As a result, thedimension of the each subsystem is reduced, and the dynamics of the network is simplified. It is verified that, withthe same communication topology, the resulting reducednetwork system is still robustly synchronized, and the apriori bound on the approximation error is guaranteed tocompare the behaviors of the full-order and reduced-orderLure subsyste
    corecore