13,445 research outputs found
Crux: Locality-Preserving Distributed Services
Distributed systems achieve scalability by distributing load across many
machines, but wide-area deployments can introduce worst-case response latencies
proportional to the network's diameter. Crux is a general framework to build
locality-preserving distributed systems, by transforming an existing scalable
distributed algorithm A into a new locality-preserving algorithm ALP, which
guarantees for any two clients u and v interacting via ALP that their
interactions exhibit worst-case response latencies proportional to the network
latency between u and v. Crux builds on compact-routing theory, but generalizes
these techniques beyond routing applications. Crux provides weak and strong
consistency flavors, and shows latency improvements for localized interactions
in both cases, specifically up to several orders of magnitude for
weakly-consistent Crux (from roughly 900ms to 1ms). We deployed on PlanetLab
locality-preserving versions of a Memcached distributed cache, a Bamboo
distributed hash table, and a Redis publish/subscribe. Our results indicate
that Crux is effective and applicable to a variety of existing distributed
algorithms.Comment: 11 figure
Transformations of High-Level Synthesis Codes for High-Performance Computing
Specialized hardware architectures promise a major step in performance and
energy efficiency over the traditional load/store devices currently employed in
large scale computing systems. The adoption of high-level synthesis (HLS) from
languages such as C/C++ and OpenCL has greatly increased programmer
productivity when designing for such platforms. While this has enabled a wider
audience to target specialized hardware, the optimization principles known from
traditional software design are no longer sufficient to implement
high-performance codes. Fast and efficient codes for reconfigurable platforms
are thus still challenging to design. To alleviate this, we present a set of
optimizing transformations for HLS, targeting scalable and efficient
architectures for high-performance computing (HPC) applications. Our work
provides a toolbox for developers, where we systematically identify classes of
transformations, the characteristics of their effect on the HLS code and the
resulting hardware (e.g., increases data reuse or resource consumption), and
the objectives that each transformation can target (e.g., resolve interface
contention, or increase parallelism). We show how these can be used to
efficiently exploit pipelining, on-chip distributed fast memory, and on-chip
streaming dataflow, allowing for massively parallel architectures. To quantify
the effect of our transformations, we use them to optimize a set of
throughput-oriented FPGA kernels, demonstrating that our enhancements are
sufficient to scale up parallelism within the hardware constraints. With the
transformations covered, we hope to establish a common framework for
performance engineers, compiler developers, and hardware developers, to tap
into the performance potential offered by specialized hardware architectures
using HLS
- …