402 research outputs found
System noise, OS clock ticks, and fine-grained parallel applications
As parallel jobs get bigger in size and finer in granularity, "system noise" is increasingly becoming a problem. In fact, fine-grained jobs on clusters with thousands of SMP nodes run faster if a processor is intentionally left idle (per node), thus enabling a separation of "system noise" from the computation. Paying a cost in average processing speed at a node for the sake of eliminating occasional process delays is (unfortunately) beneficial, as such delays are enormously magnified when one late process holds up thousands of peers with which it synchronizes. We provide a probabilistic argument showing that, under certain conditions, the effect of such noise is linearly proportional to the size of the cluster (as is often empirically observed). We then identify a major source of noise to be the indirect overhead of periodic OS clock interrupts ("ticks"), which are used by all general-purpose OSs as a means of maintaining control. This is shown for various grain sizes, platforms, tick frequencies, and OSs. To eliminate such noise, we suggest replacing ticks with an alternative mechanism we call "smart timers". This turns out to also be in line with the needs of desktop and mobile computing, increasing the chances that the suggested change will be accepted.
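The linear-scaling claim can be illustrated with a minimal sketch. It assumes (this is an illustrative model, not necessarily the one used in the paper) that each of N nodes is independently delayed with a small probability p during a compute phase that ends in a global synchronization. The phase is late if any node is late, so the probability is 1 - (1 - p)^N, which stays close to N * p while N * p is much smaller than 1. The function names and the sample value of p are hypothetical.

```python
def phase_delay_probability(p: float, n_nodes: int) -> float:
    """Probability that at least one of n_nodes independent nodes is delayed."""
    return 1.0 - (1.0 - p) ** n_nodes


def linear_approximation(p: float, n_nodes: int) -> float:
    """First-order approximation, reasonable only while n_nodes * p << 1."""
    return n_nodes * p


if __name__ == "__main__":
    p = 1e-6  # hypothetical per-node, per-phase delay probability (assumption)
    for n in (1_000, 10_000, 100_000):
        exact = phase_delay_probability(p, n)
        approx = linear_approximation(p, n)
        print(f"N={n:>7}: exact={exact:.6f}  linear={approx:.6f}")
```

Under these assumptions the two columns agree closely for small N * p and drift apart as N * p approaches 1, which is where the linear regime ends.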
Idle Period Propagation in Message-Passing Applications
Idle periods on different processes of Message Passing applications are
unavoidable. While the origin of idle periods on a single process is well
understood as the effect of random system and architectural delays, it is
unclear how these idle periods propagate from one process to another. It is
important to understand idle period propagation in Message Passing applications
as it allows application developers to design communication patterns avoiding
idle period propagation and the consequent performance degradation in their
applications. To understand idle period propagation, we introduce a methodology
to trace idle periods when a process is waiting for data from a remote delayed
process in MPI applications. We apply this technique in an MPI application that
solves the heat equation to study idle period propagation on three different
systems. We confirm that idle periods move between processes in the form of
waves and that there are different stages in idle period propagation. Our
methodology enables us to identify a self-synchronization phenomenon that
occurs on two systems where some processes run slower than the other processes.
Comment: 18th International Conference on High Performance Computing and
Communications, IEEE, 201
Propagation and Decay of Injected One-Off Delays on Clusters: A Case Study
Analytic, first-principles performance modeling of distributed-memory
applications is difficult due to a wide spectrum of random disturbances caused
by the application and the system. These disturbances (commonly called "noise")
destroy the assumptions of regularity that one usually employs when
constructing simple analytic models. Despite numerous efforts to quantify,
categorize, and reduce such effects, a comprehensive quantitative understanding
of their performance impact is not available, especially for long delays that
have global consequences for the parallel application. In this work, we
investigate various traces collected from synthetic benchmarks that mimic real
applications on simulated and real message-passing systems in order to pinpoint
the mechanisms behind delay propagation. We analyze the dependence of the
propagation speed of idle waves emanating from injected delays with respect to
the execution and communication properties of the application, study how such
delays decay under increased noise levels, and how they interact with each
other. We also show how fine-grained noise can make a system immune against the
adverse effects of propagating idle waves. Our results contribute to a better
understanding of the collective phenomena that manifest themselves in
distributed-memory parallel applications.
Comment: 10 pages, 9 figures; title change
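How a one-off delay turns into a travelling idle wave can be sketched with a toy model. The sketch assumes a noise-free chain of ranks with open boundaries, where each rank may start the next iteration only after it and its direct neighbors have finished the current one. This is a deliberately simplified stand-in, not the benchmarks or simulator used in the paper, and all names and parameter values are hypothetical.

```python
def simulate_idle_wave(n_ranks, n_iters, work, delay, delayed_rank):
    """Return idle[k][i]: time rank i waits for its neighbors in iteration k."""
    finish = [0.0] * n_ranks  # time at which each rank completed the previous iteration
    idle = []
    for k in range(n_iters):
        # Compute phase: every rank does the same work; one rank gets a one-off delay.
        computed = [
            finish[i] + work + (delay if (k == 0 and i == delayed_rank) else 0.0)
            for i in range(n_ranks)
        ]
        # Exchange phase: a rank proceeds once it and both neighbors are done.
        row, new_finish = [], []
        for i in range(n_ranks):
            lo, hi = max(i - 1, 0), min(i + 1, n_ranks - 1)
            ready = max(computed[lo:hi + 1])
            row.append(ready - computed[i])
            new_finish.append(ready)
        idle.append(row)
        finish = new_finish
    return idle


if __name__ == "__main__":
    # Hypothetical parameters: 7 ranks, delay injected on the middle rank.
    for k, row in enumerate(simulate_idle_wave(7, 4, work=1.0, delay=5.0, delayed_rank=3)):
        print(k, "".join("#" if t > 0 else "." for t in row))
```

In this model the idle period moves outward by one rank per iteration in both directions and leaves the chain at its ends. It does not model the decay under noise that the abstract describes.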
Measuring Operating System Overhead on CMT Processors
Numerous studies have shown that Operating System (OS) noise is one of the reasons for significant performance degradation in clustered architectures. Although many studies examine the OS noise for High Performance Computing (HPC), especially in multi-processor/core systems, most of them focus on 2- or 4-core systems. In this paper, we analyze the major sources of OS noise on a massive multithreading processor, the Sun UltraSPARC T1, running Linux and Solaris. Since a real system is too complex to analyze, we compare those results with a low-overhead runtime environment: the Netra Data Plane Software Suite (Netra DPS). Our results show that the overhead introduced by the OS timer interrupt in Linux and Solaris depends on the particular core and hardware context in which the application is running. This overhead is up to 30% when the application is executed on the same hardware context as the timer interrupt handler and up to 10% when the application and the timer interrupt handler run on different contexts but on the same core. We detect no overhead when the benchmark and the timer interrupt handler run on different cores of the processor.
Peer Reviewe
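One common way to observe this kind of overhead is to time the same fixed amount of work many times and compare each run with the fastest one. The sketch below follows that general idea. It is not the benchmark used in the paper, the function names and default sizes are hypothetical, and in Python the interpreter itself adds variation on top of OS noise.

```python
import time


def fixed_work(n: int) -> int:
    """A fixed quantum of CPU-bound work."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def measure_noise(samples: int = 200, work_size: int = 20_000):
    """Time identical work repeatedly; return each run's overhead relative to the fastest run."""
    durations = []
    for _ in range(samples):
        start = time.perf_counter_ns()
        fixed_work(work_size)
        durations.append(time.perf_counter_ns() - start)
    fastest = min(durations)
    scale = max(fastest, 1)  # guard against a zero reading on coarse clocks
    return [(d - fastest) / scale for d in durations]


if __name__ == "__main__":
    overheads = measure_noise()
    print(f"max overhead: {max(overheads):.1%}")
    print(f"mean overhead: {sum(overheads) / len(overheads):.1%}")
```

To compare placements as the abstract does, the process would have to be pinned before measuring. On Linux, `os.sched_setaffinity(0, {cpu})` is one way to do that. Which CPU shares a core or hardware context with the timer interrupt handler is platform-specific and is not determined by this sketch.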
Towards an Adaptive OS Noise Mitigation Technique for Microbenchmarking on Apple iPad Devices
This study investigates levels of Operating System (OS) noise on Apple iPad mobile devices. OS noise causes variations in application performance that interfere with microbenchmark results. OS noise manifests in collected data through extreme outliers and variations in skewness. Using our collected data, we develop an iterative, semi-automated outlier removal process for Apple iPad OS noise profiles. The profiles generated by outlier removal represent the first step toward an adaptive noise mitigation technique, which presents opportunities for use in microbenchmarking across other mobile platforms.
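The abstract does not specify the removal rule, so the following is only a sketch of what an iterative outlier removal pass might look like. It assumes a standard interquartile-range fence applied repeatedly until no more points are dropped. The function name, the fence factor `k`, and the round limit are hypothetical choices, not the study's procedure.

```python
import statistics


def remove_outliers_iteratively(samples, k=1.5, max_rounds=10):
    """Repeatedly drop points outside [Q1 - k*IQR, Q3 + k*IQR] until none remain."""
    kept = list(samples)
    removed = []
    for _round in range(max_rounds):
        if len(kept) < 4:
            break  # too few points for meaningful quartiles
        q1, _median, q3 = statistics.quantiles(kept, n=4)
        iqr = q3 - q1
        low, high = q1 - k * iqr, q3 + k * iqr
        outliers = [x for x in kept if x < low or x > high]
        if not outliers:
            break  # the profile is stable under this fence
        removed.extend(outliers)
        kept = [x for x in kept if low <= x <= high]
    return kept, removed


if __name__ == "__main__":
    # Hypothetical timings with one extreme value.
    timings = [10, 11, 9, 10, 12, 10, 11, 500, 9, 10]
    kept, removed = remove_outliers_iteratively(timings)
    print("kept:", kept)
    print("removed:", removed)
```

Iterating matters because one extreme value widens the spread and can hide milder outliers that only become visible after it is gone. A semi-automated process would presumably have a person review each round before accepting it.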
Hardware as a service - enabling dynamic, user-level bare metal provisioning of pools of data center resources.
We describe a "Hardware as a Service (HaaS)" tool for isolating pools of compute, storage and networking resources. The goal of HaaS is to enable dynamic and flexible, user-level provisioning of pools of resources at the so-called "bare-metal" layer. It allows experimental or untrusted services to co-exist alongside trusted services. Because HaaS functions only as a resource isolation system, users are free to choose between different system scheduling and provisioning systems and to manage isolated resources as they see fit. We describe key HaaS use cases and features. We show how HaaS can provide a valuable, and somewhat overlooked, layer in the software architecture of modern data center management. Documentation and source code for HaaS software are available at: https://github.com/CCI-MOC/haas
Partial support for this work was provided by the MassTech Collaborative Research Matching Grant Program, National Science Foundation award #1347525 and several commercial partners of the Mass Open Cloud who may be found at http://www.massopencloud.org.
http://www.ieee-hpec.org/2014/CD/index_htm_files/FinalPapers/116.pd
A fine-grain time-sharing Time Warp system
Although Parallel Discrete Event Simulation (PDES) platforms relying on the Time Warp (optimistic) synchronization
protocol already allow for exploiting parallelism, several techniques have been proposed to
further favor performance. Among them we can mention optimized approaches for state restore, as well as
techniques for load balancing or (dynamically) controlling the speculation degree, the latter being specifically
targeted at reducing the incidence of causality errors leading to waste of computation. However, in
state-of-the-art Time Warp systems, event processing is not preemptable, which may make it impossible
to promptly react to the injection of higher priority (say lower timestamp) events. Delaying the processing
of these events may, in turn, give rise to higher incidence of incorrect speculation. In this article we present
the design and realization of a fine-grain time-sharing Time Warp system, to be run on multi-core Linux
machines, which makes systematic use of event preemption in order to dynamically reassign the CPU to
higher priority events/tasks. Our proposal is based on a truly dual mode execution, application vs platform,
which includes a timer-interrupt based support for bringing control back to platform mode for possible CPU
reassignment according to very fine grain periods. The latter facility is offered by an ad-hoc timer-interrupt
management module for Linux, which we release, together with the overall time-sharing support, within the
open source ROOT-Sim platform. An experimental assessment based on the classical PHOLD benchmark and
two real world models is presented, which shows how our proposal effectively leads to the reduction of the
incidence of causality errors, as compared to traditional Time Warp, especially when running with higher
degrees of parallelism.
Measuring operating system overhead on Sun UltraSparc T1 processor
Numerous studies have shown that Operating System (OS) noise is one of the reasons for significant performance degradation in clustered architectures. Although many studies examine the OS noise for High Performance Computing, especially in multi-processor/core systems, most of them focus on 2- or 4-core systems. In this study, we analyze sources of OS noise on a massive multithreading processor, the Sun UltraSPARC T1. We compare results, measured in Linux and Solaris, with the results provided by a low-overhead runtime environment that introduces almost no overhead in applications' execution time. Our results show that the overhead introduced by the OS timer interrupt in Linux and Solaris depends on the particular core and hardware context in which the application is running. This overhead is up to 30% when the application is executed on the same hardware context as the timer interrupt handler, and up to 10% when the application and the timer interrupt handler run on different contexts but on the same core. We detect no overhead when the benchmark and the timer interrupt handler run on different cores of the processor.
Postprint (published version