64 research outputs found

    Shall Numerical Astrophysics Step Into the Era of Exascale Computing?

    High performance computing numerical simulations are today one of the most effective instruments to implement and study new theoretical models, and they are mandatory during the preparatory and operational phases of any scientific experiment. New challenges in Cosmology and Astrophysics will require a large number of new, extremely computationally intensive simulations to investigate physical processes at different scales. Moreover, the size and complexity of the new generation of observational facilities also imply a new generation of high performance data reduction and analysis tools, pushing toward the use of Exascale computing capabilities. Exascale supercomputers cannot be produced today. We discuss the major technological challenges in the design, development and use of such computing capabilities, and we report on the progress made in recent years in Europe, in particular in the framework of the ExaNeSt European funded project. We also discuss the impact of these new computing resources on the numerical codes in Astronomy and Astrophysics.

    HPC for Urgent Decision-Making

    Emerging use cases from incident response planning and broad-scope European initiatives (e.g. Destination Earth [1,2], the European Green Deal and the Digital Package [21]) are expected to require federated, distributed infrastructures combining computing and data platforms. These will provide elasticity, enabling users to build applications and integrate data for thematic specialisation and decision support within ever shortening response time windows. For prompt and, in particular, for urgent decision support, the conventional usage modes of HPC centres are not adequate: these rely on relatively long-term arrangements for time-scheduled exclusive use of HPC resources, and enforce well-established yet time-consuming policies for granting access. In urgent decision support scenarios, managers or members of incident response teams must initiate processing and control the resources required based on their real-time judgement of how a complex situation evolves over time. This circle of clients is distinct from the regular users of HPC centres, and they must interact with HPC workflows on demand and in real time, while engaging significant HPC and data processing resources in or across HPC centres. This white paper considers the technical implications of supporting urgent decisions through establishing flexible usage modes for computing, analytics and AI/ML-based applications using HPC and large, dynamic assets. The target decision support use cases will involve ensembles of jobs, data staging to support workflows, and interactions with services/facilities external to HPC systems/centres. Our analysis identifies the need for flexible and interactive access to HPC resources, particularly in the context of dynamic workflows processing large datasets. This poses several technical and organisational challenges: short-notice secure access to HPC and data resources, dynamic resource allocation and scheduling, coordination of resource managers, support for data-intensive workflows (including data staging on node-local storage), preemption of already running workloads, and interactive steering of simulations. Federation of services and resources across multiple sites will help to increase availability, provide elasticity for time-varying resource needs, and enable the exploitation of data locality.
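
    As an illustration of the on-demand usage mode discussed above, the following minimal sketch submits a data-staging job followed by an ensemble of simulation runs to a Slurm-managed system. The "urgent" QOS, the node counts, and the stage_data.sh / run_model.sh scripts and paths are hypothetical assumptions, not part of any specific centre's configuration.

        """Sketch: on-demand submission of an urgent ensemble to a Slurm cluster."""
        import subprocess

        def submit(command: str, qos: str = "urgent", nodes: int = 1,
                   depends_on: str | None = None) -> str:
            """Submit one job with sbatch and return its job id."""
            args = ["sbatch", "--parsable", f"--qos={qos}", f"--nodes={nodes}"]
            if depends_on:
                # Ensemble members start only after staging has completed successfully.
                args.append(f"--dependency=afterok:{depends_on}")
            args += ["--wrap", command]
            out = subprocess.run(args, check=True, capture_output=True, text=True)
            return out.stdout.strip().split(";")[0]   # --parsable prints "jobid[;cluster]"

        # Stage the incident data first, then fan out an ensemble of model runs.
        stage_id = submit("stage_data.sh /archive/event_42 /scratch/event_42")
        ensemble = [submit(f"run_model.sh /scratch/event_42 --member {m}",
                           nodes=4, depends_on=stage_id)
                    for m in range(8)]
        print("staging job:", stage_id, "ensemble jobs:", ensemble)

    In a federated setting the same pattern could be repeated against several centres, using whichever one grants the requested resources within the response time window.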

    RETHINK big: European roadmap for hardware and networking optimizations for big data

    This paper discusses the results of the RETHINK big project, a 2-year Collaborative Support Action funded by the European Commission in order to write the European Roadmap for Hardware and Networking Optimizations for Big Data. This industry-driven project was led by the Barcelona Supercomputing Center (BSC), and it included large industry partners, SMEs and academia. The roadmap identifies business opportunities from 89 in-depth interviews with 70 European industry stakeholders in the area of Big Data and predicts the future technologies that will disrupt the state of the art in Big Data processing in terms of hardware and networking optimizations. Moreover, it presents coordinated technology development recommendations (focused on optimizations in networking and hardware) that would be in the best interest of European Big Data companies to undertake in concert as a matter of competitive advantage. This project has received funding from the European Union’s Seventh Framework Programme for research, technological development and demonstration under grant agreement n° 619788. It has also been supported by the Spanish Government (grant SEV2015-0493 of the Severo Ochoa Program), by the Spanish Ministry of Science and Innovation (contract TIN2015-65316) and by the Generalitat de Catalunya (contracts 2014-SGR-1051 and 2014-SGR-1272). Peer reviewed. Postprint (author's final draft).

    Towards resilient EU HPC systems: A blueprint

    This document aims to spearhead a Europe-wide discussion on HPC system resilience and to help the European HPC community define best practices for resilience. We analyse a wide range of state-of-the-art resilience mechanisms and recommend the most effective approaches to employ in large-scale HPC systems. Our guidelines will be useful in the allocation of available resources, as well as guiding researchers and research funding towards the enhancement of resilience approaches with the highest priority and utility. Although our work is focused on the needs of next generation HPC systems in Europe, the principles and evaluations are applicable globally. This work has received funding from the European Union’s Horizon 2020 research and innovation programme under the projects ECOSCALE (grant agreement No 671632), EPI (grant agreement No 826647), EuroEXA (grant agreement No 754337), Eurolab4HPC (grant agreement No 800962), EVOLVE (grant agreement No 825061), EXA2PRO (grant agreement No 801015), ExaNest (grant agreement No 671553), ExaNoDe (grant agreement No 671578), EXDCI-2 (grant agreement No 800957), LEGaTO (grant agreement No 780681), MB2020 (grant agreement No 779877), RECIPE (grant agreement No 801137) and SDK4ED (grant agreement No 780572). The work was also supported by the European Commission’s Seventh Framework Programme under the project CLERECO (grant agreement No 611404), by the NCSA-Inria-ANL-BSC-JSC-Riken-UTK Joint Laboratory for Extreme Scale Computing (JLESC, https://jlesc.github.io/), by the OMPI-X project (No ECP-2.3.1.17) and by the Spanish Government through the Severo Ochoa programme (SEV-2015-0493). This work was sponsored in part by the U.S. Department of Energy's Office of Advanced Scientific Computing Research, program managers Robinson Pino and Lucy Nowell. This manuscript has been authored by UT-Battelle, LLC under Contract No DE-AC05-00OR22725 with the U.S. Department of Energy. Preprint.
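
    One of the most widely used resilience mechanisms considered in analyses of this kind is application-level checkpoint/restart. The sketch below shows the basic pattern under simple assumptions; the state layout, the checkpoint file name and the interval are illustrative, not taken from the blueprint.

        """Sketch: application-level checkpoint/restart for an iterative computation."""
        import os
        import pickle

        CHECKPOINT = "state.ckpt"   # assumed checkpoint file name
        INTERVAL = 100              # assumed: checkpoint every 100 iterations
        TOTAL_STEPS = 1000

        def load_state():
            """Resume from the last checkpoint if one exists, else start fresh."""
            if os.path.exists(CHECKPOINT):
                with open(CHECKPOINT, "rb") as f:
                    return pickle.load(f)
            return {"step": 0, "solution": 0.0}

        def save_state(state):
            """Write atomically: write a temp file, then rename over the old checkpoint."""
            tmp = CHECKPOINT + ".tmp"
            with open(tmp, "wb") as f:
                pickle.dump(state, f)
            os.replace(tmp, CHECKPOINT)

        state = load_state()
        for step in range(state["step"], TOTAL_STEPS):
            state["solution"] += 1.0   # placeholder for the real computation
            state["step"] = step + 1
            if state["step"] % INTERVAL == 0:
                save_state(state)      # a crash now loses at most INTERVAL steps of work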

    Simulation of Transaction Processing Systems and Study of Methods for Meeting Performance Goals

    The transaction concept, based on the legal concept of a contract, signifies the properties of atomicity, consistency, isolation, and durability for a sequence of actions. These properties, widely known by the acronym ACID, are essential for supporting concurrent access to shared data and for failure handling, especially in distributed environments. The object of this work is the development of an integrated environment for the simulation of transaction processing systems, with the aim of studying the operation and performance of such systems. The environment incorporates the TPsim simulator and a mechanism that supports experiment management, so as to automate the process of specifying simulation experiments, conducting the experimental steps, and collecting measurements. The simulator models in considerable detail the operation of a typical transaction processing system, and enables the estimation of a variety of variables related to resource consumption and system performance. The simulation environment was used for the experimental evaluation of a series of scheduling algorithms for complex units of work that consist of multiple transactions. Workload units of this type occur often in practice, and represent a special class of workflows. The study of this workload class has started to draw considerable research interest, and this work presents a first performance study. The proposed scheduling algorithms are oriented towards satisfying performance goals, which are specified as requirements on the average response time per workload class.
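
    As a rough illustration of goal-oriented scheduling of this kind, the sketch below keeps one queue of work units per workload class, each class with a target average response time, and serves the class that is furthest from meeting its goal. The class names, goals and selection rule are illustrative assumptions and do not reproduce the thesis's actual algorithms.

        """Sketch: pick the next unit of work from the class missing its response-time goal most."""
        from collections import defaultdict, deque

        goals = {"order_entry": 2.0, "reporting": 10.0}   # assumed target mean response times (s)
        observed = defaultdict(list)                       # completed response times per class
        queues = {cls: deque() for cls in goals}           # waiting units of work per class

        def pressure(cls: str) -> float:
            """Observed mean response time relative to the goal; > 1 means the goal is missed."""
            samples = observed[cls]
            if not samples:
                return 0.0
            return (sum(samples) / len(samples)) / goals[cls]

        def next_unit():
            """Serve the most 'pressured' class that has waiting work, if any."""
            candidates = [cls for cls in queues if queues[cls]]
            if not candidates:
                return None
            cls = max(candidates, key=pressure)
            return cls, queues[cls].popleft()

        # Example: both classes have waiting work; order_entry is missing its goal, so it goes first.
        queues["order_entry"].append("txn-group-17")
        queues["reporting"].append("report-batch-3")
        observed["order_entry"].append(3.0)   # 3.0 s observed vs 2.0 s goal -> pressure 1.5
        observed["reporting"].append(5.0)     # 5.0 s observed vs 10.0 s goal -> pressure 0.5
        print(next_unit())                    # -> ('order_entry', 'txn-group-17')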

    Exploring the impact of node failures on the resource allocation for parallel jobs

    Increasing the size and complexity of modern HPC systems also increases the probability of various types of failures. Failures may disrupt application execution and waste valuable system resources due to failed executions. In this work, we explore the effect of node failures on the completion times of MPI parallel jobs. We introduce a simulation environment that generates synthetic traces of node failures, assuming that the times between failures for each node are independently distributed, following the same distribution but with different parameters. To highlight the importance of failure-awareness for resource allocation, we compare two failure-oblivious resource allocation approaches with one that considers node failure probabilities before assigning a partition to a job: a heuristic that randomly selects the partition for a job, and Slurm's linear resource allocation policy. We present results for a case study that assumes a 4D-torus topology and a Weibull distribution for each node's time between failures, and considers several different traces of node failures, capturing different failure patterns. For the synthetic traces explored, the benefit is more prominent for longer jobs, up to 82% depending on the trace, when compared with Slurm and a failure-oblivious heuristic. For shorter jobs, benefits are noticeable for systems with more frequent failures.
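
    The two ingredients described above can be illustrated with a short sketch: drawing Weibull-distributed times between failures to build a synthetic per-node failure trace, and choosing among candidate partitions the one whose nodes are jointly most likely to survive the job. The node counts, Weibull parameters and job runtime below are arbitrary illustrative values, not the paper's settings.

        """Sketch: synthetic Weibull failure traces and failure-aware partition selection."""
        import math
        import random

        def failure_trace(shape: float, scale: float, horizon: float) -> list[float]:
            """Failure times of one node within [0, horizon], with Weibull inter-failure times."""
            t, events = 0.0, []
            while True:
                t += random.weibullvariate(scale, shape)   # alpha=scale, beta=shape
                if t > horizon:
                    return events
                events.append(t)

        def p_no_failure(shape: float, scale: float, runtime: float) -> float:
            """Probability a node survives the job, from the Weibull survival function."""
            return math.exp(-((runtime / scale) ** shape))

        def pick_partition(partitions: dict[str, list[tuple[float, float]]],
                           runtime: float) -> str:
            """Choose the partition whose nodes are jointly most likely to stay up."""
            def survival(nodes):
                p = 1.0
                for shape, scale in nodes:
                    p *= p_no_failure(shape, scale, runtime)
                return p
            return max(partitions, key=lambda name: survival(partitions[name]))

        # Two candidate partitions of 4 nodes each, with (shape, scale) per node.
        parts = {"A": [(0.7, 500.0)] * 4, "B": [(0.7, 800.0)] * 4}
        print(pick_partition(parts, runtime=24.0))   # expected: "B" (larger per-node scale)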