1,370 research outputs found
Memory performance of and-parallel prolog on shared-memory architectures
The goal of the RAP-WAM AND-parallel Prolog abstract architecture is to provide inference speeds significantly
beyond those of sequential systems, while supporting Prolog semantics and preserving sequential performance and storage efficiency. This paper presents simulation results supporting these claims with special emphasis on memory performance on a two-level sharedmemory multiprocessor organization. Several solutions to the cache coherency problem are analyzed. It is shown that RAP-WAM offers good locality and storage efficiency and that it can effectively take advantage of broadcast caches. It is argued that speeds in excess of 2 ML IPS on real applications exhibiting medium parallelism can be attained with current technology
Avoiding core's DUE & SDC via acoustic wave detectors and tailored error containment and recovery
The trend of downsizing transistors and operating voltage scaling has made the processor chip more sensitive against radiation phenomena making soft errors an important challenge. New reliability techniques for handling soft errors in the logic and memories that allow meeting the desired failures-in-time (FIT) target are key to keep harnessing the benefits of Moore's law. The failure to scale the soft error rate caused by particle strikes, may soon limit the total number of cores that one may have running at the same time. This paper proposes a light-weight and scalable architecture to eliminate silent data corruption errors (SDC) and detected unrecoverable errors (DUE) of a core. The architecture uses acoustic wave detectors for error detection. We propose to recover by confining the errors in the cache hierarchy, allowing us to deal with the relatively long detection latencies. Our results show that the proposed mechanism protects the whole core (logic, latches and memory arrays) incurring performance overhead as low as 0.60%. © 2014 IEEE.Peer ReviewedPostprint (author's final draft
SynCron: Efficient Synchronization Support for Near-Data-Processing Architectures
Near-Data-Processing (NDP) architectures present a promising way to alleviate
data movement costs and can provide significant performance and energy benefits
to parallel applications. Typically, NDP architectures support several NDP
units, each including multiple simple cores placed close to memory. To fully
leverage the benefits of NDP and achieve high performance for parallel
workloads, efficient synchronization among the NDP cores of a system is
necessary. However, supporting synchronization in many NDP systems is
challenging because they lack shared caches and hardware cache coherence
support, which are commonly used for synchronization in multicore systems, and
communication across different NDP units can be expensive.
This paper comprehensively examines the synchronization problem in NDP
systems, and proposes SynCron, an end-to-end synchronization solution for NDP
systems. SynCron adds low-cost hardware support near memory for synchronization
acceleration, and avoids the need for hardware cache coherence support. SynCron
has three components: 1) a specialized cache memory structure to avoid memory
accesses for synchronization and minimize latency overheads, 2) a hierarchical
message-passing communication protocol to minimize expensive communication
across NDP units of the system, and 3) a hardware-only overflow management
scheme to avoid performance degradation when hardware resources for
synchronization tracking are exceeded.
We evaluate SynCron using a variety of parallel workloads, covering various
contention scenarios. SynCron improves performance by 1.27 on average
(up to 1.78) under high-contention scenarios, and by 1.35 on
average (up to 2.29) under low-contention real applications, compared
to state-of-the-art approaches. SynCron reduces system energy consumption by
2.08 on average (up to 4.25).Comment: To appear in the 27th IEEE International Symposium on
High-Performance Computer Architecture (HPCA-27
Proximity coherence for chip-multiprocessors
Many-core architectures provide an efficient way of harnessing the growing numbers of transistors available in modern fabrication processes; however, the parallel programs run on these platforms are increasingly limited by the energy and latency costs of communication. Existing designs provide a functional communication layer but do not necessarily implement the most efficient solution for chip-multiprocessors, placing limits on the performance of these complex systems. In an era of increasingly power limited silicon design, efficiency is now a primary concern that motivates designers to look again at the challenge of cache coherence.
The first step in the design process is to analyse the communication behaviour of parallel benchmark suites such as Parsec and SPLASH-2. This thesis presents work detailing the sharing patterns observed when running the full benchmarks on a simulated 32-core x86 machine. The results reveal considerable locality of shared data accesses between threads with consecutive operating system assigned thread IDs. This pattern, although of little consequence in a multi-node system, corresponds to strong physical locality of shared data between adjacent cores on a chip-multiprocessor platform.
Traditional cache coherence protocols, although often used in chip-multiprocessor designs, have been developed in the context of older multi-node systems. By redesigning coherence protocols to exploit new patterns such as the physical locality of shared data, improving the efficiency of communication, specifically in chip-multiprocessors, is possible. This thesis explores such a design â Proximity Coherence â a novel scheme in which L1 load misses are optimistically forwarded to nearby caches via new dedicated links rather than always being indirected via a directory structure.EPSRC DTA research scholarshi
A comparison of software and hardware synchronization mechanisms for distributed shared memory multiprocessors
technical reportEfficient synchronization is an essential component of parallel computing. The designers of traditional multiprocessors have included hardware support only for simple operations such as compare-and-swap and load-linked/store-conditional, while high level synchronization primitives such as locks, barriers, and condition variables have been implemented in software [9,14,15]. With the advent of directory-based distributed shared memory (DSM) multiprocessors with significant flexibility in their cache controllers [7,12,17], it is worthwhile considering whether this flexibility should be used to support higher level synchronization primitives in hardware. In particular, as part of maintaining data consistency, these architectures maintain lists of processors with a copy of a given cache line, which is most of the hardware needed to implement distributed locks. We studied two software and four hardware implementations of locks and found that hardware implementation can reduce lock acquire and release times by 25-94% compared to well tuned software locks. In terms of macrobenchmark performance, hardware locks reduce application running times by up to 75% on a synthetic benchmark with heavy lock contention and by 3%-6% on a suite of SPLASH-2 benchmarks. In addition, emerging cache coherence protocols promise to increase the time spent synchronizing relative to the time spent accessing shared data, and our study shows that hardware locks can reduce SPLASH-2 execution times by up to 10-13% if the time spent accessing shared data is small. Although the overall performance impact of hardware lock mechanisms varies tremendously depending on the application, the added hardware complexity on a flexible architecture like FLASH [12] or Avalanche [7] is negligible, and thus hardware support for high level synchronization operations should be provided
- âŠ