1,947 research outputs found
Distributed data cache designs for clustered VLIW processors
Wire delays are a major concern for current and forthcoming processors. One approach to deal with this problem is to divide the processor into semi-independent units referred to as clusters. A cluster usually consists of a local register file and a subset of the functional units, while the L1 data cache typically remains centralized in What we call partially distributed architectures. However, as technology evolves, the relative latency of such a centralized cache will increase, leading to an important impact on performance. In this paper, we propose partitioning the L1 data cache among clusters for clustered VLIW processors. We refer to this kind of design as fully distributed processors. In particular; we propose and evaluate three different configurations: a snoop-based cache coherence scheme, a word-interleaved cache, and flexible LO-buffers managed by the compiler. For each alternative, instruction scheduling techniques targeted to cyclic code are developed. Results for the Mediabench suite'show that the performance of such fully distributed architectures is always better than the performance of a partially distributed one with the same amount of resources. In addition, the key aspects of each fully distributed configuration are explored.Peer ReviewedPostprint (published version
Virtual Machine Support for Many-Core Architectures: Decoupling Abstract from Concrete Concurrency Models
The upcoming many-core architectures require software developers to exploit
concurrency to utilize available computational power. Today's high-level
language virtual machines (VMs), which are a cornerstone of software
development, do not provide sufficient abstraction for concurrency concepts. We
analyze concrete and abstract concurrency models and identify the challenges
they impose for VMs. To provide sufficient concurrency support in VMs, we
propose to integrate concurrency operations into VM instruction sets.
Since there will always be VMs optimized for special purposes, our goal is to
develop a methodology to design instruction sets with concurrency support.
Therefore, we also propose a list of trade-offs that have to be investigated to
advise the design of such instruction sets.
As a first experiment, we implemented one instruction set extension for
shared memory and one for non-shared memory concurrency. From our experimental
results, we derived a list of requirements for a full-grown experimental
environment for further research
TriCheck: Memory Model Verification at the Trisection of Software, Hardware, and ISA
Memory consistency models (MCMs) which govern inter-module interactions in a
shared memory system, are a significant, yet often under-appreciated, aspect of
system design. MCMs are defined at the various layers of the hardware-software
stack, requiring thoroughly verified specifications, compilers, and
implementations at the interfaces between layers. Current verification
techniques evaluate segments of the system stack in isolation, such as proving
compiler mappings from a high-level language (HLL) to an ISA or proving
validity of a microarchitectural implementation of an ISA.
This paper makes a case for full-stack MCM verification and provides a
toolflow, TriCheck, capable of verifying that the HLL, compiler, ISA, and
implementation collectively uphold MCM requirements. The work showcases
TriCheck's ability to evaluate a proposed ISA MCM in order to ensure that each
layer and each mapping is correct and complete. Specifically, we apply TriCheck
to the open source RISC-V ISA, seeking to verify accurate, efficient, and legal
compilations from C11. We uncover under-specifications and potential
inefficiencies in the current RISC-V ISA documentation and identify possible
solutions for each. As an example, we find that a RISC-V-compliant
microarchitecture allows 144 outcomes forbidden by C11 to be observed out of
1,701 litmus tests examined. Overall, this paper demonstrates the necessity of
full-stack verification for detecting MCM-related bugs in the hardware-software
stack.Comment: Proceedings of the Twenty-Second International Conference on
Architectural Support for Programming Languages and Operating System
Boosting Multi-Core Reachability Performance with Shared Hash Tables
This paper focuses on data structures for multi-core reachability, which is a
key component in model checking algorithms and other verification methods. A
cornerstone of an efficient solution is the storage of visited states. In
related work, static partitioning of the state space was combined with
thread-local storage and resulted in reasonable speedups, but left open whether
improvements are possible. In this paper, we present a scaling solution for
shared state storage which is based on a lockless hash table implementation.
The solution is specifically designed for the cache architecture of modern
CPUs. Because model checking algorithms impose loose requirements on the hash
table operations, their design can be streamlined substantially compared to
related work on lockless hash tables. Still, an implementation of the hash
table presented here has dozens of sensitive performance parameters (bucket
size, cache line size, data layout, probing sequence, etc.). We analyzed their
impact and compared the resulting speedups with related tools. Our
implementation outperforms two state-of-the-art multi-core model checkers (SPIN
and DiVinE) by a substantial margin, while placing fewer constraints on the
load balancing and search algorithms.Comment: preliminary repor
- …