
    Evaluation of DVFS techniques on modern HPC processors and accelerators for energy-aware applications

    Energy efficiency is becoming increasingly important for computing systems, in particular for large-scale HPC facilities. In this work we evaluate, from a user perspective, the use of Dynamic Voltage and Frequency Scaling (DVFS) techniques, assisted by the power and energy monitoring capabilities of modern processors, in order to tune applications for energy efficiency. We run selected kernels and a full HPC application on two high-end processors widely used in the HPC context, namely an NVIDIA K80 GPU and an Intel Haswell CPU. We evaluate the available trade-offs between energy-to-solution and time-to-solution, attempting a function-by-function frequency tuning. We finally estimate the benefits obtainable by running the full code on an HPC multi-GPU node, with respect to default clock frequency governors. We instrument our code to accurately monitor power consumption and execution time without the need for any additional hardware, and we enable it to change CPU and GPU clock frequencies while running. We analyze our results on the different architectures using a simple energy-performance model, and derive a number of energy-saving strategies which can be easily adopted on recent high-end HPC systems for generic applications.

    Extending Capability and Implementing a Web Interface for the XALT Software Monitoring Tool

    As high performance computing centers evolve in terms of hardware, software, and user base, monitoring and managing such systems requires specialized tools. The tool discussed in this thesis is XALT, a collaborative effort between the National Institute for Computational Sciences and the Texas Advanced Computing Center. XALT is designed to track link-time and job-level information for applications that are compiled and executed on any Linux cluster, workstation, or high-end supercomputer. The key objectives of this work are to extend the existing functionality of XALT and implement a real-time web portal to easily visualize the tracked data. A prototype is developed to track function calls resolved by external libraries, which helps software management. The web portal generates reports and metrics which would improve efficiency and effectiveness for an extensive community of stakeholders including users, support organizations, and development teams. In addition, we discuss use cases of interest to center support staff and researchers on identifying users based on given counters and generating provenance reports. This work details the opportunities and challenges to further push XALT towards becoming a complete package.

    Pervasive brain monitoring and data sharing based on multi-tier distributed computing and linked data technology

    EEG-based brain-computer interfaces (BCI) are facing grand challenges in their real-world applications. The technical difficulties in developing truly wearable multi-modal BCI systems that are capable of making reliable real-time predictions of users’ cognitive states under dynamic real-life situations may appear at times almost insurmountable. Fortunately, recent advances in miniature sensors, wireless communication and distributed computing technologies have offered promising ways to bridge these chasms. In this paper, we report our attempt to develop a pervasive on-line BCI system by employing state-of-the-art technologies such as multi-tier fog and cloud computing, semantic Linked Data search and adaptive prediction/classification models. To verify our approach, we implement a pilot system using wireless dry-electrode EEG headsets and MEMS motion sensors as the front-end devices, Android mobile phones as the personal user interfaces, compact personal computers as the near-end fog servers and the computer clusters hosted by the Taiwan National Center for High-performance Computing (NCHC) as the far-end cloud servers. We succeeded in conducting synchronous multi-modal global data streaming in March and then running a multi-player on-line BCI game in September, 2013. We are currently working with the ARL Translational Neuroscience Branch and the UCSD Movement Disorder Center to use our system in real-life personal stress and in-home Parkinson’s disease patient monitoring experiments. We shall proceed to develop a necessary BCI ontology and add automatic semantic annotation and progressive model refinement capability to our system.

    LIKWID Monitoring Stack: A flexible framework enabling job specific performance monitoring for the masses

    System monitoring is an established tool to measure the utilization and health of HPC systems. Usually, system monitoring infrastructures make no connection to job information and do not utilize hardware performance monitoring (HPM) data. To increase the efficient use of HPC systems, automatic and continuous performance monitoring of jobs is an essential component. It can help to identify pathological cases, provide instant performance feedback to the users, offer initial data to judge the optimization potential of applications, and help to build a statistical foundation about application-specific system usage. The LIKWID monitoring stack is a modular framework built on top of the LIKWID tools library. It aims at enabling job-specific performance monitoring using HPM data, system metrics and application-level data for small to medium sized commodity clusters. Moreover, it is designed to integrate into existing monitoring infrastructures to speed up the change from pure system monitoring to job-aware monitoring. Comment: 4 pages, 4 figures. Accepted for HPCMASPA 2017, the Workshop on Monitoring and Analysis for High Performance Computing Systems Plus Applications, held in conjunction with IEEE Cluster 2017, Honolulu, HI, September 5, 201

    Observing the clouds : a survey and taxonomy of cloud monitoring

    This research was supported by a Royal Society Industry Fellowship and an Amazon Web Services (AWS) grant. Date of Acceptance: 10/12/2014. Monitoring is an important aspect of designing and maintaining large-scale systems. Cloud computing presents a unique set of challenges to monitoring, including on-demand infrastructure, unprecedented scalability, rapid elasticity and performance uncertainty. There is a wide range of monitoring tools originating from cluster and high-performance computing, grid computing and enterprise computing, as well as a series of newer bespoke tools which have been designed exclusively for cloud monitoring. These tools express a number of common elements and designs, which address the demands of cloud monitoring to various degrees. This paper performs an exhaustive survey of contemporary monitoring tools, from which we derive a taxonomy that examines how effectively existing tools and designs meet the challenges of cloud monitoring. We conclude by examining the socio-technical aspects of monitoring, and investigate the engineering challenges and practices behind implementing monitoring strategies for cloud computing. Publisher PDF. Peer reviewed.

    Checkpointing as a Service in Heterogeneous Cloud Environments

    A non-invasive, cloud-agnostic approach is demonstrated for extending existing cloud platforms to include checkpoint-restart capability. Most cloud platforms currently rely on each application to provide its own fault tolerance. A uniform mechanism within the cloud itself serves two purposes: (a) direct support for long-running jobs, which would otherwise require a custom fault-tolerant mechanism for each application; and (b) the administrative capability to manage an over-subscribed cloud by temporarily swapping out jobs when higher priority jobs arrive. An advantage of this uniform approach is that it also supports parallel and distributed computations, over both TCP and InfiniBand, thus allowing traditional HPC applications to take advantage of an existing cloud infrastructure. Additionally, an integrated health-monitoring mechanism detects when long-running jobs either fail or incur exceptionally low performance, perhaps due to resource starvation, and proactively suspends the job. The cloud-agnostic feature is demonstrated by applying the implementation to two very different cloud platforms: Snooze and OpenStack. The use of a cloud-agnostic architecture also enables, for the first time, migration of applications from one cloud platform to another. Comment: 20 pages, 11 figures, appears in CCGrid, 201