Uniparallel Execution and its Uses.
We introduce uniparallelism: a new style of execution that allows
multithreaded applications to benefit from the simplicity of
uniprocessor execution while scaling performance with increasing
processors.
A uniparallel execution consists of a thread-parallel execution, where
each thread runs on its own processor, and an epoch-parallel
execution, where multiple time intervals (epochs) of the program run
concurrently. The epoch-parallel execution runs all threads of a
given epoch on a single processor; this enables the use of techniques
that are effective on a uniprocessor. To scale performance with
increasing cores, a thread-parallel execution runs ahead of the
epoch-parallel execution and generates speculative checkpoints from
which to start future epochs. If these checkpoints match the program
state produced by the epoch-parallel execution at the end of each
epoch, the speculation is committed and output externalized; if they
mismatch, recovery can be safely initiated as no speculative state has
been externalized.
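The commit-or-recover check described above can be illustrated with a minimal sketch. It assumes program state can be compared for equality and that `run_epoch` (a hypothetical callback, not part of the thesis) re-executes one epoch on a single processor. Epochs are checked one after another here for simplicity, whereas the abstract describes them running concurrently.

```python
from typing import Callable, List

State = int  # assumption: a state is any value comparable with ==


def verify_epochs(checkpoints: List[State],
                  run_epoch: Callable[[int, State], State]) -> int:
    """Return how many leading epochs can be committed.

    checkpoints[i] is the speculative state at the start of epoch i, and
    checkpoints[i + 1] is the state the thread-parallel run predicted
    for the end of that epoch.
    """
    committed = 0
    for epoch in range(len(checkpoints) - 1):
        end_state = run_epoch(epoch, checkpoints[epoch])
        if end_state != checkpoints[epoch + 1]:
            # Mismatch: stop and recover. Nothing speculative has been
            # externalized, so discarding later epochs is safe.
            break
        committed += 1
    return committed


# Hypothetical program: every epoch adds 10 to the state.
committed = verify_epochs([0, 10, 20, 31], lambda epoch, state: state + 10)
```

In this example the last checkpoint (31) disagrees with the re-executed state (30), so only the first two epochs would be committed.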
We use uniparallelism to build two novel systems: DoublePlay and
Frost. DoublePlay benefits from the efficiency of logging the
epoch-parallel execution (as threads in an epoch are constrained to a
single processor, only infrequent thread context-switches need to be
logged to recreate the order of shared-memory accesses), allowing it
to outperform all prior systems that guarantee deterministic replay on
commodity multiprocessors.
While traditional methods detect data races by analyzing the events
executed by a program, Frost introduces a new, substantially faster
method called outcome-based race detection to detect the effects of a
data race by comparing the program state of replicas for divergences.
Unlike DoublePlay, which runs a single epoch-parallel execution of the
program, Frost runs multiple epoch-parallel replicas with
complementary schedules, which are a set of thread schedules crafted
to ensure that replicas diverge only if a data race occurs and to make
it very likely that harmful data races cause divergences. Frost
detects divergences by comparing the outputs and memory states of
replicas at the end of each epoch. Upon detecting a divergence, Frost
analyzes the replica outcomes to diagnose the data race bug and
selects an appropriate recovery strategy that masks the failure.
Ph.D., Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies
http://deepblue.lib.umich.edu/bitstream/2027.42/89677/1/kaushikv_1.pd
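As a rough illustration of outcome-based race detection, the sketch below runs two replicas of one epoch under opposite thread orders and compares their final memory states. This is a deliberate simplification: each thread body runs to completion without preemption and synchronization is ignored, whereas the complementary schedules described above are crafted more carefully. All names are hypothetical.

```python
from typing import Callable, Dict, List

Memory = Dict[str, int]
ThreadBody = Callable[[Memory], None]


def run_replica(threads: List[ThreadBody], order: List[int]) -> Memory:
    """Run every thread body to completion on one 'processor', in order."""
    memory: Memory = {}
    for index in order:
        threads[index](memory)
    return memory


def epoch_diverges(threads: List[ThreadBody]) -> bool:
    """Compare the end-of-epoch states of two replicas with opposite orders."""
    forward = list(range(len(threads)))
    backward = forward[::-1]
    return run_replica(threads, forward) != run_replica(threads, backward)


def write_x_1(memory: Memory) -> None:
    memory["x"] = 1  # unsynchronized write to x


def write_x_2(memory: Memory) -> None:
    memory["x"] = 2  # conflicting unsynchronized write to x


def write_y(memory: Memory) -> None:
    memory["y"] = 7  # touches a different location


racy = epoch_diverges([write_x_1, write_x_2])
independent = epoch_diverges([write_x_1, write_y])
```

The two conflicting writes to `x` leave different final states depending on the order, so the replicas diverge; the threads that touch disjoint locations do not.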
Sandboxed, Online Debugging of Production Bugs for SOA Systems
Short time-to-bug localization is extremely important for any 24x7 service-oriented application. To this end, we introduce a new debugging paradigm called live debugging. There are two goals that any live debugging infrastructure must meet: Firstly, it must offer real-time insight for bug diagnosis and localization, which is paramount when errors happen in user-facing applications. Secondly, live debugging should not impact user-facing performance for normal events. In large distributed applications, bugs which impact only a small percentage of users are common. In such scenarios, debugging a small part of the application should not impact the entire system.
With the above-stated goals in mind, this thesis presents a framework called Parikshan, which leverages user-space containers (OpenVZ) to launch application instances for the express purpose of live debugging. Parikshan is driven by a live-cloning process, which generates a replica (called the debug container) of production services, cloned from a production container that continues to provide the real output to the user. The debug container provides a sandbox environment in which users can safely run monitoring and debugging tasks without perturbing the execution environment. As part of this framework, we have designed customized network proxies, which replicate inputs from clients to both the production and debug containers and safely discard all outputs of the debug container. Together, the network duplicator and the debug container ensure both compute and network isolation of the debugging environment. We believe that this work provides the first practical real-time debugging of large multi-tier and cloud applications, without requiring any application downtime and with minimal performance impact.
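The input-duplicating proxy can be pictured with a small in-memory sketch. It is not Parikshan's implementation: the handlers are plain functions standing in for the production and debug containers, and the call to the debug side is synchronous here for brevity. The names `duplicate_request`, `production`, and `debug` are hypothetical.

```python
from typing import Callable, List

Handler = Callable[[str], str]


def duplicate_request(request: str, production: Handler, debug: Handler) -> str:
    """Send the request to both sides; return only the production reply."""
    reply = production(request)
    try:
        debug(request)  # the debug output is intentionally discarded
    except Exception:
        # A failure in the sandbox must never reach the client.
        pass
    return reply


debug_seen: List[str] = []


def production_service(request: str) -> str:
    return "ok:" + request


def debug_service(request: str) -> str:
    debug_seen.append(request)  # e.g. instrumented, slower code path
    raise RuntimeError("crash inside the debug container")


client_reply = duplicate_request("GET /cart", production_service, debug_service)
```

The client only ever observes the production reply, even though the debug replica received the same input and crashed.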
Techniques for Detection, Root Cause Diagnosis, and Classification of In-Production Concurrency Bugs
Concurrency bugs are at the heart of some of the worst bugs that
plague software. Concurrency bugs slow down software development
because it can take weeks or even months before developers
can identify and fix them.
In-production detection, root cause diagnosis, and classification of
concurrency bugs is challenging. This is because these activities require
heavyweight analyses such as exploring program paths and determining
failing program inputs and schedules, all of which are not
suited for software running in production.
This dissertation develops practical techniques for the detection,
root cause diagnosis, and classification of concurrency bugs for in-production
software. Furthermore, we develop ways for developers
to better reason about concurrent programs. This dissertation builds
upon the following principles:
— The approach in this dissertation spans multiple layers of the
system stack, because concurrency spans many layers of the
system stack.
— It performs most of the heavyweight analyses in-house and resorts
to minimal in-production analysis in order to move the
heavy lifting to where it is least disruptive.
— It eschews custom hardware solutions that may be infeasible to
implement in the real world.
Relying on the aforementioned principles, this dissertation introduces:
1. Techniques to automatically detect concurrency bugs (data races
and atomicity violations) in-production by combining in-house
static analysis and in-production dynamic analysis.
2. A technique to automatically identify the root causes of in-production
failures, with a particular emphasis on failures caused
by concurrency bugs.
3. A technique that, given a data race, automatically classifies it
based on its potential consequence, allowing developers to answer
questions such as “can the data race cause a crash or a
hang?”, or “does the data race have any observable effect?”.
We build a toolchain that implements all the aforementioned techniques.
We show that the tools we develop in this dissertation are
effective, incur low runtime performance overhead, and have high
accuracy and precision.
Debugging and Testing of Concurrent Programs through Efficient Test Case Generation
Debugging multi-threaded concurrent programs is more difficult than debugging sequential programs because errors are not always reproducible. Re-executing or instrumenting a concurrent program for tracing might change the execution timing and cause the program to take a different execution path. In other words, the exact timing that caused the error is unknown. To reproduce the error, one needs to execute the concurrent program many times with the same input values, varying the interleaving in each test case, but it is not always feasible to test them all. This dissertation proposes a debugging/testing system that generates all possible executions as test cases based on the limited information obtained from an execution trace, and then detects potential race conditions caused by different schedules and interrupt timings in a concurrent multi-threaded program. There are a number of studies on test case reduction using partial order reduction, but redundancies remain for the purpose of checking race conditions. The objective is to efficiently reproduce concurrency errors, specifically race conditions, by means of three methods. The first is to reduce the number of interleavings to be tested. This is achieved by removing redundant test cases and eliminating infeasible ones. The originality of the proposed method is that it exploits the nature of branch coverage and uses data flows from the trace information to identify only those interleavings that affect branch outcomes, whereas existing methods try to identify all the interleavings that may affect shared variables. Since execution paths with the same branch outcomes would have equivalent sequences of lock/unlock and read/write operations on shared variables, they can be grouped together in the same “race-equivalent” group. To reduce the work of reproducing race conditions, it is sufficient to check only one member of each group.
In this way, the proposed method can significantly reduce the number of interleavings to test while remaining capable of detecting the same race conditions. Furthermore, the proposed method extends the existing model of execution traces to identify and avoid generating interleavings that are infeasible because of dependencies caused by lock/unlock and wait/notify mechanisms. Experimental results suggest that redundant interleavings can be identified and removed, which leads to a significant reduction in test cases. We evaluated the proposed method against several concurrent Java programs. The experimental results for the open source program Apache Commons Pool show that the number of test cases is reduced from 23, as generated by the existing Thread-Pair-Interleaving method (TPAIR), to only 2 by the proposed method. Moreover, for concurrent programs that contain infinite loops, the proposed method generates only a small, finite number of test cases, while many existing methods generate an infinite number of test cases. The second is to reduce the memory space required for generating test cases. The existing reachability testing method still generated redundant test cases even though there was no need to execute them. Here, we propose a new method that analyzes data dependencies to generate only those test cases that might affect sequences of lock/unlock and read/write operations on shared variables. The experimental results for Apache Commons Pool show that the size of the graph for creating the test cases is reduced from 990 nodes, with the reachability testing method used in our previous work, to only 4 nodes with our new method. The third improvement is to reduce the effort involved in checking race conditions by utilizing previous test results. Existing work requires checking race conditions in the whole execution trace for every new test case.
The proposed method identifies only those parts of the execution trace in which the sequence of lock/unlock and read/write operations on shared variables might be affected by a new test case, so race conditions need to be rechecked only for those affected parts. Together, these improvements significantly reduce the effort of exhaustively checking all possible interleavings. The proposed methods tell programmers whether program errors caused by interleavings exist, which interleaving (path) was in effect when the errors occurred, and which accesses to shared variables use inconsistent locking.
The University of Electro-Communications, 201
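The grouping idea in the first method can be sketched as follows, assuming each candidate interleaving has already been annotated with the sequence of branch outcomes it produces. The input format and the function name are hypothetical and are not taken from the dissertation.

```python
from typing import Dict, List, Sequence, Tuple

Interleaving = Tuple[str, Sequence[bool]]  # (label, branch outcomes)


def pick_representatives(interleavings: List[Interleaving]) -> List[str]:
    """Keep one interleaving per group that shares the same branch outcomes."""
    groups: Dict[Tuple[bool, ...], str] = {}
    for label, outcomes in interleavings:
        # The first interleaving seen for a signature represents its group.
        groups.setdefault(tuple(outcomes), label)
    return list(groups.values())


candidates: List[Interleaving] = [
    ("i1", [True, False]),
    ("i2", [True, False]),   # same branch outcomes as i1: redundant
    ("i3", [False, False]),
    ("i4", [True, False]),   # redundant again
]
to_test = pick_representatives(candidates)
```

Here four candidate interleavings collapse to two test cases, one per distinct branch-outcome signature.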
Productive Development of Scalable Network Functions with NFork
Despite decades of research, developing correct and scalable concurrent
programs is still challenging. Network functions (NFs) are not an exception.
This paper presents NFork, a system that helps NF domain experts to
productively develop concurrent NFs by abstracting away concurrency from
developers. The key scheme behind NFork's design is to exploit NF
characteristics to overcome the limitations of prior work on concurrency
programming. Developers write NFs as sequential programs, and during runtime,
NFork performs transparent parallelization by processing packets in different
cores. Exploiting NF characteristics, NFork leverages transactional memory and
develops efficient concurrent data structures to achieve scalability and
guarantee the absence of concurrency bugs.
Since NFork manages concurrency, it further provides (i) a profiler that
reveals the root causes of scalability bottlenecks inherent to the NF's
semantics and (ii) actionable recipes for developers to mitigate these root
causes by relaxing the NF's semantics. We show that NFs developed with NFork
achieve competitive scalability with those in Cisco VPP [16], and NFork's
profiler and recipes can effectively aid developers in optimizing NF
scalability.
Comment: 16 pages, 8 figures
Logical partitioning of parallel system simulations
Simulation has been a fundamental tool to prototype, hypothesize, and evaluate
new ideas to continue improving system performance. However, increasing levels
of processor parallelism and heterogeneity have introduced additional
constraints when evaluating new designs. The work embodied in this dissertation
explores how to leverage novel ideas in simulator partitioning to improve
simulator speed and flexibility for simulating these new types of systems.
The contribution of this work includes the introduction of optimistic
partitioned simulation to improve parallelization, and the introduction of
warped partitioned simulation for improved flexibility. These ideas are refined
and evaluated through prototypes that demonstrate their benefits
compared to state-of-the-art approaches. By leveraging partitioning in a
structured manner, it is possible to design simulators that better address the
open challenges of parallel and heterogeneous systems design.
Electrical and Computer Engineering
The exploitation of parallelism on shared memory multiprocessors
PhD Thesis
With the arrival of many general purpose shared memory multiple processor
(multiprocessor) computers into the commercial arena during the mid-1980's, a
rift has opened between the raw processing power offered by the emerging
hardware and the relative inability of its operating software to effectively deliver
this power to potential users. This rift stems from the fact that, currently, no
computational model with the capability to elegantly express parallel activity is
mature enough to be universally accepted, and used as the basis for programming
languages to exploit the parallelism that multiprocessors offer. To add to this,
there is a lack of software tools to assist programmers in the processes of designing
and debugging parallel programs.
Although much research has been done in the field of programming languages,
no undisputed candidate for the most appropriate language for programming
shared memory multiprocessors has yet been found. This thesis examines why this
state of affairs has arisen and proposes programming language constructs,
together with a programming methodology and environment, to close the ever
widening hardware to software gap.
The novel programming constructs described in this thesis are intended for use
in imperative languages even though they make use of the synchronisation
inherent in the dataflow model by using the semantics of single assignment when
operating on shared data, so giving rise to the term shared values. As there are
several distinct parallel programming paradigms, matching flavours of shared
value are developed to permit the concise expression of these paradigms.
The Science and Engineering Research Council
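The single-assignment synchronisation that the abstract attributes to shared values can be approximated in a few lines. This is only a sketch of the general idea (a write-once cell whose readers block until it is assigned), not the constructs proposed in the thesis; the class and method names are hypothetical.

```python
import threading
from typing import Any, List, Optional


class SharedValue:
    """A write-once cell: readers wait until the single assignment happens."""

    def __init__(self) -> None:
        self._ready = threading.Event()
        self._lock = threading.Lock()
        self._value: Any = None

    def assign(self, value: Any) -> None:
        with self._lock:
            if self._ready.is_set():
                raise RuntimeError("shared value already assigned")
            self._value = value
            self._ready.set()  # wake every waiting reader

    def read(self, timeout: Optional[float] = None) -> Any:
        if not self._ready.wait(timeout):
            raise TimeoutError("shared value not yet assigned")
        return self._value


cell = SharedValue()
seen: List[Any] = []
consumer = threading.Thread(target=lambda: seen.append(cell.read(timeout=5)))
consumer.start()
cell.assign(42)  # the consumer is released only after this assignment
consumer.join()
```

Because the value can be assigned only once, a reader never needs a lock to observe a consistent result; the assignment itself is the synchronisation point.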
Low-Impact Profiling of Streaming, Heterogeneous Applications
Computer engineers are continually faced with the task of translating improvements in fabrication process technology (i.e., Moore's Law) into architectures that allow computer scientists to accelerate application performance. As feature size continues to shrink, architects of commodity processors are designing increasingly more cores on a chip. While additional cores can operate independently with some tasks (e.g., the OS and user tasks), many applications see little to no improvement from adding more processor cores alone. For many applications, heterogeneous systems offer a path toward higher performance. Significant performance and power gains have been realized by combining specialized processors (e.g., Field-Programmable Gate Arrays, Graphics Processing Units) with general purpose multi-core processors. Heterogeneous applications need to be programmed differently than traditional software. One approach, stream processing, fits these systems particularly well because of the segmented memories and explicit expression of parallelism. Unfortunately, debugging and performance tools that support streaming, heterogeneous applications do not exist. This dissertation presents TimeTrial, a performance measurement system that enables performance optimization of streaming applications by profiling the application deployed on a heterogeneous system. TimeTrial performs low-impact measurements by dedicating computing resources to monitoring and by aggressively compressing performance traces into statistical summaries guided by user specification of the performance queries of interest.
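To make the idea of compressing a performance trace into a statistical summary concrete, here is a minimal sketch that folds a stream of latency samples into a constant-size record. It uses Welford's online algorithm for the mean and variance, which is a standard technique chosen for illustration and is not claimed to be what TimeTrial does. The class name and the sample values are hypothetical.

```python
import math


class RunningSummary:
    """Constant-memory summary of a stream of measurements."""

    def __init__(self) -> None:
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean
        self.minimum = math.inf
        self.maximum = -math.inf

    def add(self, sample: float) -> None:
        # Welford's update: no individual sample is stored.
        self.count += 1
        delta = sample - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (sample - self.mean)
        self.minimum = min(self.minimum, sample)
        self.maximum = max(self.maximum, sample)

    @property
    def variance(self) -> float:
        """Population variance of the samples seen so far."""
        return self._m2 / self.count if self.count else 0.0


summary = RunningSummary()
for latency in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    summary.add(latency)
```

Whatever the length of the trace, the monitor keeps only five numbers per query, which is what keeps the measurement overhead and storage small.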
Finding and Tolerating Concurrency Bugs.
Shared-memory multi-threaded programming is inherently more difficult than single-threaded programming. The main source of complexity is that the threads of an application can interleave in so many different ways. To ensure correctness, a programmer has to test all possible thread interleavings, which is impractical. Many rare thread interleavings remain untested in production systems, and they cause the majority of concurrency bugs.
Given that untested interleavings cause the majority of concurrency bugs, this dissertation explores two ways to tackle such bugs. One is to expose untested interleavings during testing to find concurrency bugs. The other is to avoid untested interleavings during production runs to tolerate concurrency bugs. The key is an efficient and effective way to encode and remember tested interleavings.
This dissertation first discusses two hypotheses about concurrency bugs: the small scope hypothesis and the value-independent hypothesis. Based on these two hypotheses, this dissertation defines a set of interleaving patterns, called interleaving idioms, which are used to encode tested interleavings. The empirical analysis shows that the idiom-based interleaving encoding scheme is able to represent most of the concurrency bugs used in the study.
Then, this dissertation discusses an open source testing tool called Maple. It memoizes tested interleavings and actively seeks to expose untested interleavings. The results show that Maple is able to expose concurrency bugs and expose interleavings faster than other conventional testing techniques.
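The memoization step can be sketched with a small data structure. The sketch assumes an interleaving is encoded as an ordered pair of conflicting access sites, which is a simplification of the interleaving idioms mentioned above; the class, the method names, and the site labels are hypothetical and are not Maple's API.

```python
from typing import Iterable, List, Set, Tuple

Idiom = Tuple[str, str]  # (earlier access site, later access site)


class InterleavingMemo:
    """Remembers which encoded interleavings have already been exercised."""

    def __init__(self) -> None:
        self._tested: Set[Idiom] = set()

    def record(self, first: str, second: str) -> None:
        self._tested.add((first, second))

    def untested(self, candidates: Iterable[Idiom]) -> List[Idiom]:
        """Return the candidates that no earlier test run has covered."""
        return [idiom for idiom in candidates if idiom not in self._tested]


memo = InterleavingMemo()
memo.record("T1:write(x)", "T2:read(x)")  # observed in an earlier run

predicted = [
    ("T1:write(x)", "T2:read(x)"),
    ("T2:read(x)", "T1:write(x)"),  # the opposite order is still untested
]
remaining = memo.untested(predicted)
```

A tester built on such a memo would spend its effort forcing only the orders left in `remaining`, rather than re-running orders it has already seen.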
Finally, this dissertation discusses two parallel runtime system designs which seek to avoid untested interleavings during production runs to tolerate concurrency bugs. Avoiding untested interleavings significantly improves correctness because most concurrency bugs are caused by untested interleavings. Also, the performance overhead of disallowing untested interleavings is low, as commonly occurring interleavings should already have been tested in a well-tested program.
Ph.D., Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies
http://deepblue.lib.umich.edu/bitstream/2027.42/99765/1/jieyu_1.pd