Scalable and Broad Hardware Acceleration through Practical Speculative Parallelism

Abstract

With the slowing down of Moore’s Law, silicon fabrication technology is not yielding the performance improvements it once did. Hardware accelerators, which tailor their architecture to a specific application or domain, have emerged as an attractive approach to improve performance. Unfortunately, current accelerators have been limited to domains such as deep learning, where parallelism is easy to exploit. Many applications do not have such easy-to-extract parallelism and have remained off-limits to accelerators. This thesis presents techniques to build accelerators for applications with speculative parallelism. These applications consist of atomic tasks, sometimes with order constraints, and need speculative execution to extract parallelism. In speculative execution, tasks are executed in parallel assuming they are independent. A runtime system monitors their execution to see if they are. If a task produces a conflict during execution, i.e., if it may violate a data dependence, then it is aborted and re-executed. This thesis proposes Chronos, a framework-based approach for building accelerators that use speculation to extract parallelism. Under Chronos, accelerator designers express the algorithm as a set of ordered tasks, and then design processing elements (PEs) to execute each of these tasks. The framework provides reusable components for task management and speculative execution, saving most of the developer effort in creating accelerators for new applications. Prior general-purpose architectures have leveraged already existing techniques, like cache-coherence protocols, for conflict detection, but implementing coherence would add complexity, latency and significant on-chip storage requirement, making these techniques expensive on accelerators. To tackle this challenge, we first propose a new execution model, Spatially Located Ordered Tasks (SLOT), that uses order as the only synchronization mechanism and limits task accesses to a single read-write object. We then use SLOT to implement the Chronos framework. This implementation avoids the need for cache coherence and makes speculative execution cheap and distributed. This reduces overheads and improves performance by up to 2× over prior conflict detection techniques. While SLOT achieves excellent performance on many algorithms, it is sometimes desirable to allow a single task to access multiple objects. Thus, we extend Chronos to support the more general Swarm execution model, which allows this and is also easier to program. This Chronos-Swarm implementation improves performance when Swarm’s features are needed, but it hurts performance when they are not, as the Swarm execution model requires more expensive conflict checks on each memory access. To bridge this gap, we introduce a hybrid SLOT/Swarm execution model that combines the generality and ease-of-programming of Swarm with the performance of SLOT. We develop FPGA implementations of Chronos and use them to build accelerators for several challenging applications. When run on cloud FPGA instances, these accelerators outperform state-of-the-art software versions running on a higher-priced multicore instance by 3.5× to 15.3×.Ph.D

    Similar works

    Full text

    thumbnail-image

    Available Versions