98 research outputs found

    A method of evaluation of high-performance computing batch schedulers

    Get PDF
    According to Sterling et al., a batch scheduler, also called a workload manager, is an application or set of services that provides a method to monitor and manage the flow of work through the system [Sterling01]. The purpose of this research was to develop a method to assess the execution speed of workloads that are submitted to a batch scheduler. While previous research exists, this research is different in that more complex jobs were devised that fully exercised the scheduler with established benchmarks. This research is important because the reduction of latency, even if minuscule, can lead to massive savings of electricity, time, and money over the long term. This is especially important in the era of green computing [Reuther18]. The methodology used to assess these schedulers involved the execution of custom automation scripts. These custom scripts were developed as part of this research to automatically submit custom jobs to the schedulers, take measurements, and record the results. There were multiple experiments conducted throughout the course of the research. These experiments were designed to apply the methodology and assess the execution speed of a small selection of batch schedulers. Due to time constraints, the research was limited to four schedulers. The measurements that were taken during the experiments were wall time, RAM usage, and CPU usage. These measurements captured the utilization of system resources of each of the schedulers. The custom scripts were executed using 1, 2, and 4 servers to determine how well a scheduler scales with network growth. The experiments were conducted on local school resources. All hardware was similar and was co-located within the same data center. While the schedulers that were investigated as part of the experiments are agnostic to whether the system is a grid, cluster, or supercomputer, the investigation was limited to a cluster architecture.
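    The abstract does not include the automation scripts themselves; the sketch below illustrates the kind of measurement loop it describes, assuming a Slurm-like scheduler with an sbatch/squeue command-line interface. The scheduler choice, commands, and script names are assumptions for illustration only, not the authors' code.

```python
# Hypothetical sketch of a scheduler measurement script (not the authors' code).
# Assumes a Slurm-like CLI (sbatch/squeue); adjust commands for other schedulers.
import subprocess
import time

def submit_job(script_path):
    """Submit a batch script and return the job ID printed by sbatch."""
    out = subprocess.run(["sbatch", script_path],
                         capture_output=True, text=True, check=True)
    return out.stdout.strip().split()[-1]   # "Submitted batch job <id>"

def wait_for_completion(job_id, poll_seconds=1.0):
    """Poll the queue until the job no longer appears (i.e., it has finished)."""
    while True:
        out = subprocess.run(["squeue", "-h", "-j", job_id],
                             capture_output=True, text=True)
        if not out.stdout.strip():
            return
        time.sleep(poll_seconds)

def measure_wall_time(script_path):
    """Record wall time from submission to completion for one job."""
    start = time.monotonic()
    job_id = submit_job(script_path)
    wait_for_completion(job_id)
    return time.monotonic() - start

if __name__ == "__main__":
    # Repeat submissions to average out noise; RAM and CPU usage of the
    # scheduler daemons would be sampled separately (e.g., with psutil).
    times = [measure_wall_time("benchmark_job.sh") for _ in range(10)]
    print(f"mean wall time: {sum(times) / len(times):.2f} s")
```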

    HIL: designing an exokernel for the data center

    Full text link
    We propose a new Exokernel-like layer to allow mutually untrusting physically deployed services to efficiently share the resources of a data center. We believe that such a layer offers not only efficiency gains, but may also enable new economic models, new applications, and new security-sensitive uses. A prototype (currently in active use) demonstrates that the proposed layer is viable, and can support a variety of existing provisioning tools and use cases. Partial support for this work was provided by the MassTech Collaborative Research Matching Grant Program, National Science Foundation awards 1347525 and 1149232, as well as the several commercial partners of the Massachusetts Open Cloud, who may be found at http://www.massopencloud.or

    Including accurate user estimates in HPC schedulers: an empirical analysis

    Get PDF
    This article focuses on the problem of dealing with the low accuracy of job runtime estimates provided by users of high performance computing systems. The main goal of the study is to evaluate the benefits to system utilization of providing accurate estimates, in order to motivate users to make an effort to provide better estimates. We propose the Penalty Scheduling Policy for including information about user estimates. The experimental evaluation is performed over realistic workloads and scenarios, and validated by the use of a job scheduler simulator. We simulated different static and dynamic scenarios, which emulate diverse user behavior regarding the estimation of job runtimes. Results demonstrate that the accuracy of users' runtime estimates influences the waiting time of jobs. Under our proposed policy, in a scenario where users improve their estimates, the waiting time of users with high accuracy can be up to 2.43 times lower than that of users with the lowest accuracy. XV Workshop de Procesamiento Distribuido y Paralelo (WPDP). Red de Universidades con Carreras en Informática (RedUNCI
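    The abstract does not spell out how the Penalty Scheduling Policy weighs estimate accuracy; the sketch below is a purely illustrative way a scheduler might fold a user's historical estimate accuracy into job priority. The names, data structures, and penalty function are assumptions, not the authors' definition.

```python
# Illustrative sketch only: a priority penalty based on a user's historical
# runtime-estimate accuracy. The actual Penalty Scheduling Policy may differ.
from collections import defaultdict

class EstimateAccuracyTracker:
    def __init__(self):
        self.history = defaultdict(list)   # user -> list of (estimated, actual) runtimes

    def record(self, user, estimated, actual):
        self.history[user].append((estimated, actual))

    def accuracy(self, user):
        """Mean of actual/estimated, capped at 1.0; 1.0 means perfect estimates."""
        runs = self.history[user]
        ratios = [min(actual / estimated, 1.0)
                  for estimated, actual in runs if estimated > 0]
        return sum(ratios) / len(ratios) if ratios else 1.0

def penalized_priority(base_priority, tracker, user, penalty_weight=0.5):
    """Lower a job's priority in proportion to the user's past estimate inaccuracy."""
    inaccuracy = 1.0 - tracker.accuracy(user)
    return base_priority * (1.0 - penalty_weight * inaccuracy)
```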

    A survey of the European Open Science Cloud services for expanding the capacity and capabilities of multidisciplinary scientific applications

    Get PDF
    Open Science is a paradigm in which scientific data, procedures, tools and results are shared transparently and reused by society. The European Open Science Cloud (EOSC) initiative is an effort in Europe to provide an open, trusted, virtual and federated computing environment to execute scientific applications and store, share and reuse research data across borders and scientific disciplines. Additionally, scientific services are becoming increasingly data-intensive, not only in terms of computationally intensive tasks but also in terms of storage resources. To meet those resource demands, computing paradigms such as High-Performance Computing (HPC) and Cloud Computing are applied to e-science applications. However, adapting applications and services to these paradigms is a challenging task, commonly requiring a deep knowledge of the underlying technologies, which often constitutes a barrier to their uptake by scientists. In this context, EOSC-Synergy, a collaborative project involving more than 20 institutions from eight European countries pooling their knowledge and experience to enhance EOSC's capabilities and capacities, aims to bring EOSC closer to the scientific communities. This article provides a summary analysis of the adaptations made in the ten thematic services of EOSC-Synergy to embrace this paradigm. These services are grouped into four categories: Earth Observation, Environment, Biomedicine, and Astrophysics. The analysis leads to the identification of commonalities, best practices and common requirements, regardless of the thematic area of the service. Experience gained from the thematic services can be transferred to new services for the adoption of the EOSC ecosystem framework. The article makes several recommendations for the integration of thematic services in the EOSC ecosystem regarding Authentication and Authorization (federated regional or thematic solutions based mainly on EduGAIN), FAIR data and metadata preservation solutions (both at cataloguing and data preservation, such as EUDAT's B2SHARE), cloud platform-agnostic resource management services (such as Infrastructure Manager) and workload management solutions. This work was supported by the European Union's Horizon 2020 research and innovation programme under grant agreement No 857647, EOSC-Synergy, European Open Science Cloud - Expanding Capacities by building Capabilities. Moreover, this work is partially funded by grant No 2015/24461-2, São Paulo Research Foundation (FAPESP). Francisco Brasileiro is a CNPq/Brazil researcher (grant 308027/2020-5). Peer Reviewed. Article signed by 20 authors: Amanda Calatrava, Hernán Asorey, Jan Astalos, Alberto Azevedo, Francesco Benincasa, Ignacio Blanquer, Martin Bobak, Francisco Brasileiro, Laia Codó, Laura del Cano, Borja Esteban, Meritxell Ferret, Josef Handl, Tobias Kerzenmacher, Valentin Kozlov, Aleš Křenek, Ricardo Martins, Manuel Pavesio, Antonio Juan Rubio-Montero, Juan Sánchez-Ferrero. Postprint (published version

    funcX: A Federated Function Serving Fabric for Science

    Full text link
    Exploding data volumes and velocities, new computational methods and platforms, and ubiquitous connectivity demand new approaches to computation in the sciences. These new approaches must enable computation to be mobile, so that, for example, it can occur near data, be triggered by events (e.g., arrival of new data), be offloaded to specialized accelerators, or run remotely where resources are available. They also require new design approaches in which monolithic applications can be decomposed into smaller components that may in turn be executed separately and on the most suitable resources. To address these needs we present funcX, a distributed function as a service (FaaS) platform that enables flexible, scalable, and high performance remote function execution. funcX's endpoint software can transform existing clouds, clusters, and supercomputers into function serving systems, while funcX's cloud-hosted service provides transparent, secure, and reliable function execution across a federated ecosystem of endpoints. We motivate the need for funcX with several scientific case studies, present our prototype design and implementation, show optimizations that deliver throughput in excess of 1 million functions per second, and demonstrate, via experiments on two supercomputers, that funcX can scale to more than 130,000 concurrent workers. Comment: Accepted to ACM Symposium on High-Performance Parallel and Distributed Computing (HPDC 2020). arXiv admin note: substantial text overlap with arXiv:1908.0490
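    As a usage illustration of the register-and-invoke pattern the abstract describes, the sketch below uses the funcX Python SDK roughly as it existed around the time of the paper; the exact module path and method names are assumptions and may differ in later releases (the project was later renamed Globus Compute). The endpoint UUID is a placeholder.

```python
# Sketch of the register/submit/retrieve pattern described in the abstract,
# based on the early funcX Python SDK; exact names may vary between versions.
from funcx.sdk.client import FuncXClient

def add(a, b):
    # An arbitrary user function to execute remotely on an endpoint.
    return a + b

fxc = FuncXClient()

# Register the function with the cloud-hosted funcX service.
func_id = fxc.register_function(add)

# Submit an invocation to a specific endpoint (placeholder UUID below).
endpoint_id = "<your-endpoint-uuid>"
task_id = fxc.run(1, 2, endpoint_id=endpoint_id, function_id=func_id)

# Retrieve the result once the endpoint has executed the function.
print(fxc.get_result(task_id))
```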