CheckFence: Checking Consistency of Concurrent Data Types on Relaxed Memory Models
Concurrency libraries can facilitate the development of multithreaded programs by providing concurrent implementations of familiar data types such as queues or sets. There exist many optimized algorithms that can achieve superior performance on multiprocessors by allowing concurrent data accesses without using locks. Unfortunately, such algorithms can harbor subtle concurrency bugs. Moreover, they require memory ordering fences to function correctly on relaxed memory models. To address these difficulties, we propose a verification approach that can exhaustively check all concurrent executions of a given test program on a relaxed memory model and can verify that they are observationally equivalent to a sequential execution. Our CheckFence prototype automatically translates the C implementation code and the test program into a SAT formula, hands the latter to a standard SAT solver, and constructs counterexample traces if there exist incorrect executions. Applying CheckFence to five previously published algorithms, we were able to (1) find several bugs (some not previously known), and (2) determine how to place memory ordering fences for relaxed memory models.
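As a rough illustration of what exhaustively checking every execution of a small test program under a relaxed memory model involves, the sketch below brute-forces a two-thread message-passing test. It is not CheckFence's SAT encoding. The toy memory model (operations on different addresses may be reordered within a thread unless a fence separates them) and all names (`thread_orders`, `outcomes`, `r_flag`, `r_data`) are assumptions made for this example.

```python
from itertools import permutations, product

FENCE = ("fence",)

def split_on_fences(ops):
    """Split one thread's program into fence-delimited segments."""
    segments, current = [], []
    for op in ops:
        if op[0] == "fence":
            segments.append(current)
            current = []
        else:
            current.append(op)
    segments.append(current)
    return segments

def segment_orders(seg):
    """Yield reorderings of a segment that keep same-address ops in program order."""
    for perm in permutations(range(len(seg))):
        if all(seg[perm[p]][1] != seg[perm[q]][1]
               for p in range(len(perm))
               for q in range(p + 1, len(perm))
               if perm[p] > perm[q]):          # this pair was swapped
            yield [seg[i] for i in perm]

def thread_orders(ops):
    """Yield every execution order the toy relaxed model allows for one thread."""
    segments = split_on_fences(ops)
    for combo in product(*(list(segment_orders(s)) for s in segments)):
        yield [op for seg in combo for op in seg]

def interleavings(a, b):
    """Yield every interleaving of two operation sequences."""
    if not a:
        yield list(b)
        return
    if not b:
        yield list(a)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest

def run(trace):
    """Execute one interleaved trace against a single shared memory."""
    mem = {"data": 0, "flag": 0}
    regs = {}
    for kind, addr, x in trace:
        if kind == "store":
            mem[addr] = x
        else:                                   # load into register x
            regs[x] = mem[addr]
    return regs["r_flag"], regs["r_data"]

def outcomes(producer, consumer):
    """Collect every (r_flag, r_data) result over all allowed executions."""
    return {run(t)
            for p in thread_orders(producer)
            for c in thread_orders(consumer)
            for t in interleavings(p, c)}

STORE_DATA = ("store", "data", 1)
STORE_FLAG = ("store", "flag", 1)
LOAD_FLAG = ("load", "flag", "r_flag")
LOAD_DATA = ("load", "data", "r_data")

unfenced = outcomes([STORE_DATA, STORE_FLAG], [LOAD_FLAG, LOAD_DATA])
fenced = outcomes([STORE_DATA, FENCE, STORE_FLAG], [LOAD_FLAG, FENCE, LOAD_DATA])
```

In this toy model the outcome `(1, 0)`, where the consumer sees the flag but reads stale data, is reachable without fences and disappears only when both threads carry a fence.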
Validation of Memory Accesses Through Symbolic Analyses
The C programming language does not prevent out-of-bounds memory accesses. There exist several techniques to secure C programs; however, these methods tend to slow down these programs substantially, because they populate the binary code with runtime checks. To deal with this problem, we have designed and tested two static analyses - symbolic region and range analysis - which we combine to remove the majority of these guards. In addition to the analyses themselves, we bring two other contributions. First, we describe live range splitting strategies that improve the efficiency and the precision of our analyses. Secondly, we show how to deal with integer overflows, a phenomenon that can compromise the correctness of static algorithms that validate memory accesses. We validate our claims by incorporating our findings into AddressSanitizer. We generate SPEC CINT 2006 code that is 17% faster and 9% more energy efficient than the code produced originally by this tool. Furthermore, our approach is 50% more effective than Pentagons, a state-of-the-art analysis to sanitize memory accesses.
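The idea of using range information to drop a bounds guard, and of integer overflow undermining that reasoning, can be sketched with plain interval arithmetic. This is a simplified numeric interval analysis over 32-bit integers, not the symbolic analyses of the paper. The helper names (`add`, `naive_add`, `needs_lower_check`, `needs_upper_check`) are hypothetical.

```python
INT_MIN, INT_MAX = -2**31, 2**31 - 1
FULL = (INT_MIN, INT_MAX)

def naive_add(a, b):
    """Interval addition that ignores wrap-around (unsound for machine ints)."""
    return (a[0] + b[0], a[1] + b[1])

def add(a, b):
    """Interval addition that falls back to the full range if the sum may wrap."""
    lo, hi = a[0] + b[0], a[1] + b[1]
    if lo < INT_MIN or hi > INT_MAX:
        return FULL
    return (lo, hi)

def needs_lower_check(index):
    """The 'index >= 0' guard is required unless the range proves it."""
    return index[0] < 0

def needs_upper_check(index, size):
    """The 'index < size' guard is required unless the range proves it."""
    return index[1] >= size

# Loop variable i in [0, 98], access a[i + 1] on an array of 100 elements:
# both guards can be removed.
safe_index = add((0, 98), (1, 1))

# i in [0, INT_MAX]: i + 1 may wrap to INT_MIN, so the lower guard must stay,
# even though the naive interval claims the index is at least 1.
wrapping_naive = naive_add((0, INT_MAX), (1, 1))
wrapping_sound = add((0, INT_MAX), (1, 1))
```

The contrast between `wrapping_naive` and `wrapping_sound` shows why an overflow-unaware analysis could remove a guard that is still needed.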
A metadata-enhanced framework for high performance visual effects
This thesis is devoted to reducing the interactive latency of image processing computations in
visual effects. Film and television graphic artists depend upon low-latency feedback to receive
a visual response to changes in effect parameters. We tackle latency with a domain-specific optimising
compiler which leverages high-level program metadata to guide key computational and
memory hierarchy optimisations. This metadata encodes static and dynamic information about
data dependence and patterns of memory access in the algorithms constituting a visual effect –
features that are typically difficult to extract through program analysis – and presents it to the
compiler in an explicit form. By using domain-specific information as a substitute for program
analysis, our compiler is able to target a set of complex source-level optimisations that a vendor
compiler does not attempt, before passing the optimised source to the vendor compiler for
lower-level optimisation.
Three key metadata-supported optimisations are presented. The first is an adaptation of
space and schedule optimisation – based upon well-known compositions of the loop fusion and
array contraction transformations – to the dynamic working sets and schedules of a runtime-parameterised
visual effect. This adaptation sidesteps the costly solution of runtime code generation
by specialising static parameters in an offline process and exploiting dynamic metadata to
adapt the schedule and contracted working sets at runtime to user-tunable parameters. The second
optimisation comprises a set of transformations to generate SIMD ISA-augmented source code.
Our approach differs from autovectorisation by using static metadata to identify parallelism, in
place of data dependence analysis, and runtime metadata to tune the data layout to user-tunable
parameters for optimal aligned memory access. The third optimisation comprises a related set
of transformations to generate code for SIMT architectures, such as GPUs. Static dependence
metadata is exploited to guide large-scale parallelisation for tens of thousands of in-flight threads.
Optimal use of the alignment-sensitive, explicitly managed memory hierarchy is achieved by identifying
inter-thread and intra-core data sharing opportunities in memory access metadata.
A detailed performance analysis of these optimisations is presented for two industrially developed
visual effects. In our evaluation we demonstrate up to 8.1x speed-ups on Intel and AMD
multicore CPUs and up to 6.6x speed-ups on NVIDIA GPUs over our best hand-written implementations
of these two effects. Programmability is enhanced by automating the generation of
SIMD and SIMT implementations from a single programmer-managed scalar representation.
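Loop fusion combined with array contraction, the pair of transformations the first optimisation builds on, can be sketched on a two-stage pipeline. The stages (a gain followed by a two-tap sum) and the function names are invented for illustration and are not taken from the thesis.

```python
def unfused(src, gain):
    """Two separate loops with a full-size intermediate array."""
    tmp = [x * gain for x in src]                      # stage 1: len(src) values
    return [tmp[i - 1] + tmp[i] for i in range(1, len(tmp))]  # stage 2

def fused(src, gain):
    """One fused loop; the intermediate array is contracted to a single scalar."""
    out = []
    prev = None                                        # contracted working set
    for x in src:
        cur = x * gain                                 # stage 1 for this element
        if prev is not None:
            out.append(prev + cur)                     # stage 2 consumes it at once
        prev = cur
    return out

pixels = [1, 2, 3, 4]
result_unfused = unfused(pixels, 3)
result_fused = fused(pixels, 3)
```

Both versions compute the same output; the fused one keeps only the stage-1 values that stage 2 still needs, which is the working-set reduction the transformation aims for.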
Array languages and the N-body problem
This paper is a description of the contributions to the SICSA multicore challenge on many-body
planetary simulation made by a compiler group at the University of Glasgow. Our group is part of
the Computer Vision and Graphics research group and we have for some years been developing array
compilers because we think these are a good tool both for expressing graphics algorithms and for
exploiting the parallelism that computer vision applications require.
We shall describe experiments using two languages on two different platforms and we shall compare
the performance of these with reference C implementations running on the same platforms. Finally
we shall draw conclusions both about the viability of the array language approach as compared to
other approaches used in the challenge and also about the strengths and weaknesses of the two, very
different, processor architectures we used.
Compiling Tree Transforms to Operate on Packed Representations
When written idiomatically in most programming languages, programs
that traverse and construct trees operate over pointer-based data
structures, using one heap object per-leaf and per-node. This
representation is efficient for random access and shape-changing
modifications, but for traversals, such as compiler passes, that
process most or all of a tree in bulk, it can be inefficient. In this
work we instead compile tree traversals to operate on
pointer-free pre-order serializations of trees. On modern
architectures such programs often run significantly faster than
their pointer-based counterparts, and additionally are directly suited
to storage and transmission without requiring marshaling.
We present a prototype compiler, Gibbon, that compiles a
small first-order, purely functional language sufficient for tree
traversals. The compiler transforms this language into an intermediate
representation with explicit pointers into input and output buffers
for packed data. The key compiler technologies include an effect
system for capturing traversal behavior, combined with an algorithm to
insert destination cursors. We evaluate our compiler on tree
transformations over a real-world dataset of source-code syntax trees.
For traversals touching the whole tree, such as maps and folds, packed
data allows speedups of over 2x compared to a highly-optimized
pointer-based baseline.
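A pointer-free pre-order serialization and traversals over it can be sketched as follows. This is a hand-written illustration of the representation, not output of the Gibbon compiler. The tag values, the flat-list encoding, and the function names are assumptions chosen for brevity.

```python
LEAF, NODE = 0, 1

def pack(tree, buf):
    """Serialize a tree in pre-order. A tree is an int leaf or a (left, right) pair."""
    if isinstance(tree, tuple):
        buf.append(NODE)
        pack(tree[0], buf)
        pack(tree[1], buf)
    else:
        buf.append(LEAF)
        buf.append(tree)
    return buf

def sum_packed(buf, pos=0):
    """Fold over the packed tree; returns (sum, cursor just past the subtree)."""
    if buf[pos] == LEAF:
        return buf[pos + 1], pos + 2
    left, pos = sum_packed(buf, pos + 1)
    right, pos = sum_packed(buf, pos)
    return left + right, pos

def add1_packed(buf):
    """Map over the packed tree, writing a new packed tree to an output buffer."""
    out = []
    pos = 0
    while pos < len(buf):
        if buf[pos] == LEAF:
            out += [LEAF, buf[pos + 1] + 1]     # tag plus updated payload
            pos += 2
        else:
            out.append(NODE)                    # interior nodes carry no payload
            pos += 1
    return out

packed = pack(((1, 2), 3), [])
total, end = sum_packed(packed)
bumped = add1_packed(packed)
```

Both traversals read the buffer strictly left to right through an input cursor, and the map writes through an output cursor, so no per-node heap objects or child pointers are involved.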
Dynamic race detection for C++11
The intricate rules for memory ordering and synchronisation associated with the C/C++11 memory model mean that data races can be difficult to eliminate from concurrent programs. Dynamic data race analysis can pinpoint races in large and complex applications, but the state-of-the-art ThreadSanitizer (tsan) tool for C/C++ considers only sequentially consistent program executions, and does not correctly model synchronisation between C/C++11 atomic operations. We present a scalable dynamic data race analysis for C/C++11 that correctly captures C/C++11 synchronisation, and uses instrumentation to support exploration of a class of non-sequentially-consistent executions. We concisely define the memory model fragment captured by our instrumentation via a restricted axiomatic semantics, and show that the axiomatic semantics permits exactly those executions explored by our instrumentation. We have implemented our analysis in tsan, and evaluate its effectiveness on benchmark programs, enabling a comparison with the CDSChecker tool, and on two large and highly concurrent applications: the Firefox and Chromium web browsers. Our results show that our method can detect races that are beyond the scope of the original tsan tool, and that the overhead associated with applying our enhanced instrumentation to large applications is tolerable.
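Why the memory order of atomic operations matters for race detection can be sketched with a small vector-clock detector. This is the textbook happens-before technique, not the instrumentation described in the paper or tsan's implementation. The `Detector` class is hypothetical, it assumes each load reads the most recent store in the trace, it ignores release sequences, and it only tracks the last write to each plain location.

```python
class Detector:
    def __init__(self, nthreads):
        # clock[t] is thread t's vector clock; each thread starts at epoch 1.
        self.clock = [[0] * nthreads for _ in range(nthreads)]
        for t in range(nthreads):
            self.clock[t][t] = 1
        self.sync = {}         # atomic location -> vector clock published by a release
        self.last_write = {}   # plain location -> (writer thread, writer epoch)
        self.races = []

    def store(self, t, loc, release):
        """Atomic store. Only a release store publishes the thread's clock."""
        if release:
            self.sync[loc] = list(self.clock[t])
        else:
            self.sync.pop(loc, None)   # simplification: relaxed store publishes nothing
        self.clock[t][t] += 1

    def load(self, t, loc, acquire):
        """Atomic load. Only an acquire load joins with the published clock."""
        published = self.sync.get(loc)
        if acquire and published is not None:
            self.clock[t] = [max(a, b) for a, b in zip(self.clock[t], published)]

    def write(self, t, loc):
        """Non-atomic write: check against the last write, then record this one."""
        self._check(t, loc)
        self.last_write[loc] = (t, self.clock[t][t])

    def read(self, t, loc):
        """Non-atomic read: check against the last write."""
        self._check(t, loc)

    def _check(self, t, loc):
        prev = self.last_write.get(loc)
        if prev is not None:
            writer, epoch = prev
            # Race if the earlier write is not ordered before this access.
            if writer != t and epoch > self.clock[t][writer]:
                self.races.append((loc, writer, t))

def message_passing(release):
    d = Detector(2)
    d.write(0, "data")
    d.store(0, "flag", release=release)
    d.load(1, "flag", acquire=True)
    d.read(1, "data")
    return d.races

races_release = message_passing(release=True)
races_relaxed = message_passing(release=False)
```

With a release store the read of `data` is ordered after the write and no race is reported. With a relaxed store the same trace is flagged, which is the kind of distinction a detector that treats all atomics alike cannot make.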
Multithreading Aware Hardware Prefetching for Chip Multiprocessors
To take advantage of the processing power of Chip Multiprocessor designs,
applications must be divided into semi-independent processes that can run concurrently
on multiple cores within a system. Therefore, programmers must insert thread
synchronization semantics (i.e. locks, barriers, and condition variables) to synchronize
data access between processes. Indeed, threads spend a long time waiting to
acquire the lock of a critical section. In addition, a processor has to stall execution
to wait for load data accesses to complete. Furthermore, there are often independent instructions, including load instructions, beyond the synchronization semantics that could be executed in parallel while a thread waits on the synchronization semantics. The convenience of cache memories comes at some extra cost in Chip Multiprocessors. Cache Coherence mechanisms address the Memory Consistency problem. However, Cache Coherence adds considerable overhead to memory accesses. Having an aggressive prefetcher on different cores of a Chip Multiprocessor can lead to significant system performance degradation when running multi-threaded applications. This is the result of prefetch-demand interference: when a prefetcher in one core pulls shared data from a producing core before it has been written, the cache block ends up transitioning back and forth between the cores, resulting in useless prefetches, saturating the memory bandwidth, and substantially increasing the latency to critical shared data.
We present a hardware prefetcher that enables large performance improvements
from prefetching in Chip Multiprocessors by significantly reducing prefetch-demand
interference. Furthermore, it utilizes the time that a thread spends waiting on
synchronization semantics to run ahead of the critical section, speculating on and prefetching the data of independent load instructions beyond the synchronization semantics.