60 research outputs found
Doctor of Philosophy
dissertation
A modern software system is a composition of parts that are themselves highly complex: operating systems, middleware, libraries, servers, and so on. In principle, compositionality of interfaces means that we can understand any given module independently of the internal workings of other parts. In practice, however, abstractions are leaky, and with every generation, modern software systems grow in complexity. Traditional ways of understanding failures, explaining anomalous executions, and analyzing performance are reaching their limits in the face of emergent behavior, unrepeatability, cross-component execution, software aging, and adversarial changes to the system at run time. Deterministic systems analysis has the potential to change the way we analyze and debug software systems. Recorded once, the execution of the system becomes an independent artifact, which can be analyzed offline. The availability of the complete system state, the guaranteed behavior of re-execution, and the absence of limitations on the run-time complexity of analysis collectively enable the deep, iterative, and automatic exploration of the dynamic properties of the system. This work creates a foundation for making deterministic replay a ubiquitous system analysis tool. It defines design and engineering principles for building fast and practical replay machines capable of capturing the complete execution of the entire operating system with an overhead of several percent, on a realistic workload, and with minimal installation costs. To provide an intuitive interface for constructing replay analysis tools, this work implements a powerful virtual machine introspection layer that enables an analysis algorithm to be programmed against the state of the recorded system through familiar terms of source-level variable and type names. To support performance analysis, the replay engine provides a faithful performance model of the original execution during replay.
Scalability Engineering for Parallel Programs Using Empirical Performance Models
Performance engineering is a fundamental task in high-performance computing (HPC). By definition, HPC applications should strive for maximum performance. As HPC systems grow larger and more complex, the scalability of an application has become a primary concern. Scalability is the ability of an application to show satisfactory performance even when the number of processors or the problem size is increased. Although various analysis techniques for scalability have been suggested in the past, engineering applications for extreme-scale systems still occurs ad hoc. The challenge is to provide techniques that explicitly target scalability throughout the whole development cycle, thereby allowing developers to uncover bottlenecks earlier in the development process. In this work, we develop a number of fundamental approaches in which we use empirical performance models to gain insights into the code behavior at higher scales.
In the first contribution, we propose a new software engineering approach for extreme-scale systems. Specifically, we develop a framework that validates asymptotic scalability expectations of programs against their actual behavior. The most important applications of this method, which is especially well suited for libraries encapsulating well-studied algorithms, include initial validation, regression testing, and benchmarking to compare implementation and platform alternatives. We supply a tool-chain that automates large parts of the framework, thus allowing it to be continuously applied throughout the development cycle with very little effort. We evaluate the framework with MPI collective operations, a data-mining code, and various OpenMP constructs. In addition to revealing unexpected scalability bottlenecks, the results also show that it is a viable approach for systematic validation of performance expectations.
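The idea of checking an asymptotic expectation against measured behavior can be illustrated with a minimal sketch. It is not the thesis's tool-chain: the candidate set `CANDIDATES`, the model form `t = a + b * f(p)`, and the function names are assumptions chosen for illustration, and the measurements in the usage note are synthetic.

```python
import math

# Hypothetical candidate growth functions f(p); a real tool would search a
# much richer model space.
CANDIDATES = {
    "log p": lambda p: math.log2(p),
    "sqrt p": lambda p: math.sqrt(p),
    "p": lambda p: float(p),
    "p log p": lambda p: p * math.log2(p),
    "p^2": lambda p: float(p) ** 2,
}


def fit_residual(points, f):
    """Least-squares fit of t = a + b * f(p); returns the squared residual."""
    xs = [f(p) for p, _ in points]
    ys = [t for _, t in points]
    n = len(points)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return sum((a + b * x - y) ** 2 for x, y in zip(xs, ys))


def best_model(points):
    """Name of the candidate that fits the measurements best."""
    return min(CANDIDATES, key=lambda name: fit_residual(points, CANDIDATES[name]))


def validate(points, expected):
    """Compare the expected growth class with the empirically best one."""
    found = best_model(points)
    return found == expected, found
```

For example, with synthetic runtimes `[(p, 2 + 3 * math.log2(p)) for p in (2, 4, 8, 16, 32, 64)]`, `validate(points, "log p")` reports a match, while the same call on linearly growing runtimes flags a deviation from the expectation.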
As the second contribution, we show how the isoefficiency function of a task-based program can be determined empirically and used in practice to control the efficiency. Isoefficiency, a concept borrowed from theoretical algorithm analysis, binds efficiency, core count, and the input size in one analytical expression, thereby allowing the latter two to be adjusted according to given (realistic) efficiency objectives. Moreover, we analyze resource contention by modeling the efficiency of contention-free execution. This allows poor scaling to be attributed either to excessive resource contention overhead or structural conflicts related to task dependencies or scheduling. Our results, obtained with applications from two benchmark suites, demonstrate that our approach provides insights into fundamental scalability limitations or excessive resource overhead and can help answer critical co-design questions.
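How an isoefficiency relation ties efficiency, core count, and input size together can be sketched with a textbook model. The work `W(n) = n` and overhead `T_o(p) = c * p * log2(p)` used here are assumptions for illustration (the classic parallel-sum example), not the empirically determined models of the thesis; the function names are hypothetical.

```python
import math


def efficiency(n, p, c=1.0):
    """E = W / (W + T_o) under the assumed model W = n, T_o = c * p * log2(p)."""
    overhead = c * p * math.log2(p)
    return n / (n + overhead)


def required_size(p, target, c=1.0):
    """Input size n that keeps efficiency at `target` on p cores.

    Solving E = n / (n + T_o) for n gives n = E / (1 - E) * T_o.
    """
    if not 0.0 < target < 1.0:
        raise ValueError("target efficiency must lie strictly between 0 and 1")
    return target / (1.0 - target) * c * p * math.log2(p)
```

Under these assumptions, holding 80% efficiency on 64 cores needs an input of roughly `required_size(64, 0.8)` work units, and the input must grow as `p log p` when cores are added.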
Our contributions for better scalability engineering can be used not only in the traditional software development cycle, but also in other, related fields, such as algorithm engineering. It is a field that uses the software engineering cycle to produce algorithms that can be utilized in applications more easily. Using our contributions, algorithm engineers can make informed design decisions, get better insights, and save experimentation time
Uniparallel Execution and its Uses.
We introduce uniparallelism: a new style of execution that allows
multithreaded applications to benefit from the simplicity of
uniprocessor execution while scaling performance with increasing
processors.
A uniparallel execution consists of a thread-parallel execution, where
each thread runs on its own processor, and an epoch-parallel
execution, where multiple time intervals (epochs) of the program run
concurrently. The epoch-parallel execution runs all threads of a
given epoch on a single processor; this enables the use of techniques
that are effective on a uniprocessor. To scale performance with
increasing cores, a thread-parallel execution runs ahead of the
epoch-parallel execution and generates speculative checkpoints from
which to start future epochs. If these checkpoints match the program
state produced by the epoch-parallel execution at the end of each
epoch, the speculation is committed and output externalized; if they
mismatch, recovery can be safely initiated as no speculative state has
been externalized.
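The commit-or-recover rule for speculative checkpoints can be sketched as follows. This is a sequential toy model, not the authors' implementation: epochs are plain functions on a state value, the loop checks them one after another instead of running them concurrently, and the recovery policy after a mismatch is left to the caller. All names are hypothetical.

```python
def check_epochs(epochs, checkpoints):
    """Validate speculative checkpoints against epoch-parallel results.

    epochs:      list of functions state -> state, one per epoch, standing in
                 for the single-processor execution of that epoch.
    checkpoints: speculative states from the thread-parallel run;
                 checkpoints[i] starts epoch i, checkpoints[i + 1] is the
                 predicted state at its end.

    Returns (committed_states, first_mismatch), where first_mismatch is the
    index of the first misspeculated epoch, or None if all epochs commit.
    """
    if len(checkpoints) != len(epochs) + 1:
        raise ValueError("need one checkpoint per epoch boundary")
    committed = []
    for i, epoch in enumerate(epochs):
        end_state = epoch(checkpoints[i])
        if end_state != checkpoints[i + 1]:
            # Nothing from this epoch was externalized, so recovery is safe.
            return committed, i
        committed.append(end_state)  # speculation confirmed: commit output
    return committed, None
```

Later epochs are only committed once every earlier epoch has matched, because an epoch's starting checkpoint is trustworthy only if the preceding speculation held.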
We use uniparallelism to build two novel systems: DoublePlay and
Frost. DoublePlay benefits from the efficiency of logging the
epoch-parallel execution (as threads in an epoch are constrained to a
single processor, only infrequent thread context-switches need to be
logged to recreate the order of shared-memory accesses), allowing it
to outperform all prior systems that guarantee deterministic replay on
commodity multiprocessors.
While traditional methods detect data races by analyzing the events
executed by a program, Frost introduces a new, substantially faster
method called outcome-based race detection to detect the effects of a
data race by comparing the program state of replicas for divergences.
Unlike DoublePlay, which runs a single epoch-parallel execution of the
program, Frost runs multiple epoch-parallel replicas with
complementary schedules, which are a set of thread schedules crafted
to ensure that replicas diverge only if a data race occurs and to make
it very likely that harmful data races cause divergences. Frost
detects divergences by comparing the outputs and memory states of
replicas at the end of each epoch. Upon detecting a divergence, Frost
analyzes the replica outcomes to diagnose the data race bug and
selects an appropriate recovery strategy that masks the failure.
Ph.D., Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies
http://deepblue.lib.umich.edu/bitstream/2027.42/89677/1/kaushikv_1.pd
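Outcome-based race detection with complementary schedules can be illustrated with a deliberately simplified sketch. It assumes two replicas that run the threads of an epoch non-preemptively in opposite orders and ignores synchronization entirely; Frost's actual schedule construction and state comparison are more involved. The helper names are hypothetical.

```python
def write(var, value):
    """Return an operation that stores `value` into shared variable `var`."""
    def op(state):
        state[var] = value
    return op


def run_epoch(start_state, threads, order):
    """Run each thread to completion, in the given order, on a copy of the state."""
    state = dict(start_state)
    for tid in order:
        for op in threads[tid]:
            op(state)
    return state


def replicas_diverge(start_state, threads):
    """Run two replicas with opposite thread orders and compare end states."""
    order = list(range(len(threads)))
    replica_a = run_epoch(start_state, threads, order)
    replica_b = run_epoch(start_state, threads, order[::-1])
    return replica_a != replica_b
```

Two threads that write different values to the same variable make the replicas diverge, whereas threads that touch disjoint variables end in identical states under both orders.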
Generating Unit Tests for Concurrent Classes
Abstract: As computers become more and more powerful, programs are increasingly split up into multiple threads to leverage the power of multi-core CPUs. However, writing correct multi-threaded code is a hard problem, as the programmer has to ensure that all access to shared data is coordinated. Existing automated testing tools for multi-threaded code mainly focus on re-executing existing test cases with different schedules. In this paper, we introduce a novel coverage criterion that enforces concurrent execution of combinations of shared memory access points with different schedules, and present an approach that automatically generates test cases for this coverage criterion. Our CONSUITE prototype demonstrates that this approach can reliably reproduce known concurrency errors, and evaluation on nine complex open source classes revealed three previously unknown data races.
Keywords: concurrency coverage; search-based software engineering; unit testing
Holistic System Design for Deterministic Replay.
Deterministic replay systems record and reproduce the execution of a hardware or software system. While it is well known how to replay uniprocessor systems, it is much harder to provide deterministic replay of shared memory multithreaded programs on multiprocessors because shared memory accesses add a high-frequency source of non-determinism. This thesis proposes efficient multiprocessor replay systems: Respec, Chimera, and Rosa.
Respec is an operating-system-based replay system. Respec is based on the observation that most program executions are data-race-free and for programs with no data races it is sufficient to record program input and the happens-before order of synchronization operations for replay. Respec speculates that a program is data-race-free and supports rollback and recovery from misspeculation. For racy programs, Respec employs a cheap runtime check that compares system call outputs and memory/register states of recorded and replayed processes at a semi-regular interval.
Chimera uses a sound static data race detector to find all potential data races and instrument pairs of potentially racing instructions to transform an arbitrary program to make it data-race-free. Then, Chimera records only the non-deterministic inputs and the order of synchronization operations for replay. However, existing static data race detectors generate excessive false warnings, leading to high recording overhead. Chimera resolves this problem by employing a combination of profiling, symbolic analysis, and dynamic checks that target the sources of imprecision in the static data race detector.
Rosa is a processor-based, ultra-low-overhead (less than one percent) replay solution that requires very little hardware support, as it essentially only needs a log of cache misses to reproduce a multiprocessor execution. Unlike previous hardware-assisted systems, Rosa does not record shared memory dependencies at all. Instead, it infers them offline using a Satisfiability Modulo Theories (SMT) solver. Our offline analysis is capable of inferring interleavings that are legal under the Sequential Consistency (SC) and Total Store Order (TSO) memory models.
PhD, Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies
http://deepblue.lib.umich.edu/bitstream/2027.42/102374/1/dongyoon_1.pd
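A Trace/Replay Method Based on a Host/Target Architecture
The inherent non-determinism of real-time embedded systems makes their execution non-reproducible, so faults may fail to reappear during debugging and testing. This paper proposes a trace/replay method based on a host/target architecture to address the non-reproducibility of real-time embedded system execution. The method inserts instrumentation probes to trace task scheduling, inter-task communication and synchronization, I/O operations, and related information, and automatically saves the execution information on the host side. A task control module then forces the tasks in the system to execute in their original order, thereby correctly replaying the execution of the real-time embedded system. The method has been implemented on the ML505 development board and the uC/OS-II operating system, and has been successfully applied to an IC image capture system. Experimental analysis shows that the method can trace and replay the execution of a real-time embedded system with small time and space overhead.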
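The record-then-enforce idea behind this trace/replay method can be sketched in a few lines. This is a toy simulation, not the paper's implementation: tasks are Python generators, the "probe" is simply a log of which task the scheduler picked, and a seeded random choice stands in for the non-deterministic scheduling of a real target. All names are hypothetical.

```python
import random


def task(name, steps, shared):
    """A toy task: appends to shared state, yielding at each scheduling point."""
    for i in range(steps):
        shared.append(f"{name}{i}")
        yield


def run(task_spec, trace=None, seed=None):
    """Run tasks to completion.

    Without a trace, scheduling decisions are random and recorded (trace mode).
    With a trace, the recorded decisions are enforced (replay mode).
    Returns (shared_output, scheduling_log).
    """
    shared = []
    tasks = {name: task(name, steps, shared) for name, steps in task_spec}
    rng = random.Random(seed)
    replay = iter(trace) if trace is not None else None
    log = []
    while tasks:
        if replay is not None:
            name = next(replay)                # task control: follow the trace
        else:
            name = rng.choice(sorted(tasks))   # stand-in for non-determinism
        log.append(name)                       # probe: record the decision
        try:
            next(tasks[name])
        except StopIteration:
            del tasks[name]                    # task finished
    return shared, log
```

Running `run(spec, seed=7)` once and then `run(spec, trace=log)` with the returned log reproduces the same interleaving of task outputs, which is the property the replay phase relies on.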
Dynamic datarace detection for object-oriented programs
Thesis (M.Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2002. Includes bibliographical references (p. 63-66). This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections.
Multithreaded shared-memory programs are susceptible to dataraces, bugs that may exhibit themselves only in rare circumstances and can have detrimental effects on program behavior. Dataraces are often difficult to debug because they are difficult to reproduce and can affect program behavior in subtle ways, so tools which aid in detecting and preventing dataraces can be invaluable. Past dynamic datarace detection tools either incurred large overhead, ranging from 3x to 30x, or sacrificed precision in reducing overhead, reporting many false errors. This thesis presents a novel approach to efficient and precise datarace detection for multithreaded object-oriented programs. Our runtime datarace detector incurs an overhead ranging from 13% to 42% for our test suite, well below the overheads reported in previous work. Furthermore, our precise approach reveals dangerous dataraces in real programs with few spurious warnings.
by Manu Sridharan. M.Eng.
Finding, Measuring, and Reducing Inefficiencies in Contemporary Computer Systems
Computer systems have become increasingly diverse and specialized in recent years. This complexity supports a wide range of new computing uses and users, but is not without cost: it has become difficult to maintain the efficiency of contemporary general purpose computing systems. Computing inefficiencies, which include nonoptimal runtimes, excessive energy use, and limits to scalability, are a serious problem that can result in an inability to apply computing to solve the world's most important problems. Beyond the complexity and vast diversity of modern computing platforms and applications, a number of factors make improving general purpose efficiency challenging, including the requirement that multiple levels of the computer system stack be examined, that legacy hardware devices and software may stand in the way of achieving efficiency, and the need to balance efficiency with reusability, programmability, security, and other goals.
This dissertation presents five case studies, each demonstrating different ways in which the measurement of emerging systems can provide actionable advice to help keep general purpose computing efficient. The first of the five case studies is Parallel Block Vectors, a new profiling method for understanding parallel programs with a fine-grained, code-centric perspective that aids both in future hardware design and in optimizing software to map better to existing hardware. Second is a project that defines a new way of measuring application interference on a datacenter's worth of chip-multiprocessors, leading to improved scheduling where applications can more effectively utilize available hardware resources. Next is a project that uses the GT-Pin tool to define a method for accelerating the simulation of GPGPUs, ultimately allowing for the development of future hardware with fewer inefficiencies. The fourth project is an experimental energy survey that compares and combines the latest energy efficiency solutions at different levels of the stack to properly evaluate the state of the art and to find paths forward for future energy efficiency research. The final project presented is NRG-Loops, a language extension that allows programs to measure and intelligently adapt their own power and energy use.