13,445 research outputs found

    Crux: Locality-Preserving Distributed Services

    Full text link
    Distributed systems achieve scalability by distributing load across many machines, but wide-area deployments can introduce worst-case response latencies proportional to the network's diameter. Crux is a general framework to build locality-preserving distributed systems, by transforming an existing scalable distributed algorithm A into a new locality-preserving algorithm ALP, which guarantees for any two clients u and v interacting via ALP that their interactions exhibit worst-case response latencies proportional to the network latency between u and v. Crux builds on compact-routing theory, but generalizes these techniques beyond routing applications. Crux provides weak and strong consistency flavors, and shows latency improvements for localized interactions in both cases, specifically up to several orders of magnitude for weakly-consistent Crux (from roughly 900ms to 1ms). We deployed on PlanetLab locality-preserving versions of a Memcached distributed cache, a Bamboo distributed hash table, and a Redis publish/subscribe. Our results indicate that Crux is effective and applicable to a variety of existing distributed algorithms.Comment: 11 figure

    Transformations of High-Level Synthesis Codes for High-Performance Computing

    Full text link
    Specialized hardware architectures promise a major step in performance and energy efficiency over the traditional load/store devices currently employed in large scale computing systems. The adoption of high-level synthesis (HLS) from languages such as C/C++ and OpenCL has greatly increased programmer productivity when designing for such platforms. While this has enabled a wider audience to target specialized hardware, the optimization principles known from traditional software design are no longer sufficient to implement high-performance codes. Fast and efficient codes for reconfigurable platforms are thus still challenging to design. To alleviate this, we present a set of optimizing transformations for HLS, targeting scalable and efficient architectures for high-performance computing (HPC) applications. Our work provides a toolbox for developers, where we systematically identify classes of transformations, the characteristics of their effect on the HLS code and the resulting hardware (e.g., increases data reuse or resource consumption), and the objectives that each transformation can target (e.g., resolve interface contention, or increase parallelism). We show how these can be used to efficiently exploit pipelining, on-chip distributed fast memory, and on-chip streaming dataflow, allowing for massively parallel architectures. To quantify the effect of our transformations, we use them to optimize a set of throughput-oriented FPGA kernels, demonstrating that our enhancements are sufficient to scale up parallelism within the hardware constraints. With the transformations covered, we hope to establish a common framework for performance engineers, compiler developers, and hardware developers, to tap into the performance potential offered by specialized hardware architectures using HLS
    • …
    corecore