Software-Hardware Co-design for Fast and Scalable Training of Deep Learning Recommendation Models
Deep learning recommendation models (DLRMs) are used across many business-critical services at Facebook and are the single largest AI application in terms of infrastructure demand in its data centers. In this paper we discuss the SW/HW co-designed solution for high-performance distributed training of large-scale DLRMs. We introduce a high-performance, scalable software stack based on PyTorch and pair it with the new evolution of the Zion platform, namely ZionEX. We demonstrate the capability to train very large DLRMs with up to 12 trillion parameters and show that we can attain a 40X speedup in time to solution over previous systems. We achieve this by (i) designing the ZionEX platform with a dedicated scale-out network provisioned with high bandwidth, an optimal topology, and efficient transport; (ii) implementing an optimized PyTorch-based training stack supporting both model and data parallelism; (iii) developing sharding algorithms capable of hierarchical partitioning of the embedding tables along row and column dimensions and load balancing them across multiple workers; (iv) adding high-performance core operators while retaining flexibility to support optimizers with fully deterministic updates; and (v) leveraging reduced-precision communication, a multi-level memory hierarchy (HBM+DDR+SSD), and pipelining. Furthermore, we develop and briefly comment on the distributed data ingestion and other supporting services required for robust and efficient end-to-end training in production environments.
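To make the load-balancing idea in (iii) concrete, the sketch below shows one simple way embedding tables could be greedily assigned to workers by estimated cost. It is an illustrative toy under assumed per-table costs, not the paper's hierarchical row/column sharding algorithm.

```python
# Illustrative sketch (not the paper's actual sharding algorithm): a greedy
# cost-balanced placement of embedding tables onto workers, in the spirit of
# the load-balancing step described above. Table names and costs are hypothetical.
import heapq

def shard_tables(table_costs, num_workers):
    """Greedily assign each table to the currently least-loaded worker.

    table_costs: dict mapping table name -> estimated cost
                 (e.g., rows * embedding_dim, or a measured lookup cost).
    Returns: dict mapping worker id -> list of table names.
    """
    # Min-heap of (current load, worker id).
    heap = [(0.0, w) for w in range(num_workers)]
    heapq.heapify(heap)
    placement = {w: [] for w in range(num_workers)}

    # Place the most expensive tables first so the largest items are
    # balanced before the small ones fill in the gaps.
    for name, cost in sorted(table_costs.items(), key=lambda kv: -kv[1]):
        load, worker = heapq.heappop(heap)
        placement[worker].append(name)
        heapq.heappush(heap, (load + cost, worker))
    return placement

if __name__ == "__main__":
    costs = {"ads_user": 8e9, "ads_item": 5e9, "page_ctr": 2e9, "geo": 5e8}
    print(shard_tables(costs, num_workers=2))
```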
Exploiting software information for an efficient memory hierarchy
Power consumption is one of the most important factors in the design of today’s processor chips. Multicore and heterogeneous systems have emerged to address the rising power concerns. Since the memory hierarchy is becoming one of the major consumers of the on-chip power budget in these systems, designing an efficient memory hierarchy is critical to future systems. We identify three sources of inefficiencies in memory hierarchies of today’s systems: (a) coherence, (b) data communication, and (c) data storage. This thesis takes the stand that many of these inefficiencies are a result of today’s software-agnostic hardware design. There is a lot of information in the software that can be exploited to build an efficient memory hierarchy. This thesis focuses on identifying some of the inefficiencies related to each of the above three sources, and proposing various techniques to mitigate them by exploiting information from the software.
First, we focus on inefficiencies related to coherence and communication. Today's hardware-based directory coherence protocols are extremely complex and incur unnecessary overheads for sending invalidation messages and maintaining sharer lists. We propose DeNovo, a hardware-software co-designed protocol, to address these issues for the class of programs that are deterministic. DeNovo assumes a disciplined programming environment and exploits features such as structured parallel control, data-race-freedom, and software information about data access patterns to build a system that is simple, extensible, and performance-efficient compared to today's protocols. We also extend DeNovo with two optimizations that address the inefficiencies related to data communication, specifically aimed at reducing unnecessary on-chip network traffic. We show that adding these two optimizations introduced no new states (including transient states) to the protocol while also providing performance and energy gains, thus validating the extensibility of the DeNovo protocol. Together with the two communication optimizations, DeNovo reduces memory stall time by 32% and network traffic by 36% (resulting in direct energy savings) on average compared to a state-of-the-art implementation of the MESI protocol for the applications studied.
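The following toy model illustrates the core idea described above: writers register ownership at a directory and readers self-invalidate non-owned data at synchronization (phase) boundaries, so no sharer lists or invalidation messages are needed. It is a heavily simplified Python sketch with illustrative names, not the actual protocol or its hardware implementation.

```python
# Deliberately simplified model of the DeNovo idea: a write registers the line
# at the directory; no invalidations are ever sent; each core self-invalidates
# its non-owned copies at phase boundaries, relying on data-race-freedom.

REGISTERED, VALID, INVALID = "Registered", "Valid", "Invalid"

class DeNovoSketch:
    def __init__(self, num_cores):
        self.registrant = {}                             # addr -> owning core (directory)
        self.state = [dict() for _ in range(num_cores)]  # per-core line states
        self.value = {}                                  # addr -> data (backing store)

    def write(self, core, addr, data):
        # Register ownership; other cores' copies are left alone (no sharer list).
        self.registrant[addr] = core
        self.state[core][addr] = REGISTERED
        self.value[addr] = data

    def read(self, core, addr):
        if self.state[core].get(addr, INVALID) == INVALID:
            # Miss: obtain data from the registrant's copy (here, the backing store).
            self.state[core][addr] = VALID
        return self.value.get(addr)

    def phase_boundary(self, core):
        # Self-invalidate everything this core does not own; data-race-freedom
        # guarantees no stale value was read inside the phase.
        for addr, st in list(self.state[core].items()):
            if st != REGISTERED:
                self.state[core][addr] = INVALID
```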
Next, we address the inefficiencies related to data storage. Caches and scratchpads are two popular organizations for storing data in today's systems, but both have inefficiencies. Caches are power-hungry, incurring expensive tag lookups, while scratchpads incur unnecessary data movement because they are only locally visible. To address these problems, we propose a new memory organization, the stash, which combines the best of the cache and scratchpad organizations. The stash is a globally visible unit and its functionality is independent of the coherence protocol employed; in our implementation, we extend DeNovo to provide coherence for the stash. Compared to a baseline configuration that has both scratchpad and cache accesses, we show that the stash configuration (in which scratchpad and cache accesses are converted to stash accesses), even with today's applications that do not fully exploit the stash, reduces execution time by 10% and energy consumption by 14% on average.
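As a rough illustration of how a stash differs from both organizations, the sketch below models a local store that is indexed by direct address arithmetic like a scratchpad (no tag lookup) yet is filled on demand from the global address space like a cache. All names and the write-through behavior are simplifications for illustration, not the thesis's hardware design.

```python
# Illustrative sketch of the stash idea (not the hardware implementation):
# direct indexing from a software-provided global mapping, with lazy fills
# and global visibility provided by the coherence layer (DeNovo in the thesis).

class StashSketch:
    def __init__(self, global_memory, base_addr, num_slots):
        self.mem = global_memory           # dict: global addr -> value
        self.base = base_addr              # start of the mapped global range
        self.slots = [None] * num_slots    # local storage, directly indexed
        self.filled = [False] * num_slots  # lazily populated, like cache misses

    def _slot(self, addr):
        # Direct address translation instead of an associative tag lookup.
        idx = addr - self.base
        assert 0 <= idx < len(self.slots), "address outside the mapped range"
        return idx

    def load(self, addr):
        idx = self._slot(addr)
        if not self.filled[idx]:
            # Implicit, on-demand fill; a scratchpad would need an explicit copy.
            self.slots[idx] = self.mem.get(addr)
            self.filled[idx] = True
        return self.slots[idx]

    def store(self, addr, value):
        idx = self._slot(addr)
        self.slots[idx] = value
        self.filled[idx] = True
        # Global visibility would come from the coherence protocol; for
        # illustration we simply write through to the backing store.
        self.mem[addr] = value
```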
Overall, this thesis shows that a software-aware hardware design can effectively address many of the inefficiencies found in today's software-oblivious memory hierarchies.
Verification and Performance of the DeNovo Cache Coherence Protocol
With the advent of multicores, parallel programming has gained a lot of importance. For parallel programming to be viable for the predicted hundreds of cores per chip, shared-memory programming languages and environments must evolve to enforce disciplined practices like "determinism-by-default semantics" and to ban "wild shared-memory behaviors" such as arbitrary data races and potential non-determinism everywhere. This evolution can not only benefit software development but can also greatly reduce the complexity of hardware. DeNovo is a hardware architecture designed from the ground up to exploit the opportunities exposed by such disciplined software models to make the hardware both simpler and more efficient. This thesis describes an effort to formally verify and evaluate the DeNovo cache coherence protocol. Using a model checking tool, we uncovered three bugs in the protocol implementation that had not been found in either the testing phase or the simulation runs. All of these bugs were caused by errors in translating the high-level description into the implementation. Surprisingly, we also found six bugs in a state-of-the-art implementation of the widely used MESI protocol. Most of these bugs were hard to analyze and took several days to fix. We provide quantitative evidence that DeNovo is a much simpler protocol by showing that it has about 15X fewer reachable states than MESI when verified with the Murphi model checking tool, which translates to about a 20X difference in runtime.
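The sketch below illustrates, at toy scale, the kind of explicit-state exploration a model checker such as Murphi performs: enumerate every reachable protocol state by breadth-first search and check an invariant in each. The two-core protocol encoded here is purely illustrative; it is neither DeNovo nor MESI.

```python
# Minimal sketch of Murphi-style explicit-state exploration over a toy protocol.
from collections import deque

def successors(state):
    """Toy 2-core protocol: each core is 'I' (invalid), 'S' (shared), or 'M'
    (modified). Any core may read (become S, demoting an M elsewhere to S)
    or write (become M, invalidating the other core)."""
    cores = list(state)
    for i in range(len(cores)):
        # Read transition.
        read = ['S' if c == 'M' else c for c in cores]
        read[i] = 'S'
        yield tuple(read)
        # Write transition.
        write = ['I'] * len(cores)
        write[i] = 'M'
        yield tuple(write)

def explore(initial, invariant):
    """BFS over the reachable state space, checking the invariant everywhere."""
    seen, frontier = {initial}, deque([initial])
    while frontier:
        s = frontier.popleft()
        assert invariant(s), f"invariant violated in state {s}"
        for nxt in successors(s):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return seen   # the reachable-state count is one measure of protocol complexity

# Single-writer invariant: at most one M, and if a core is M the rest are I.
single_writer = lambda s: s.count('M') <= 1 and ('M' not in s or s.count('I') == len(s) - 1)
print(len(explore(('I', 'I'), single_writer)), "reachable states")
```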
Parallel SAH k-D Tree Construction for Fast Dynamic Scene Ray Tracing
The k-D tree is a well-studied acceleration data structure for ray tracing. It is used to organize primitives in a scene to allow efficient execution of intersection operations between rays and the primitives. The highest quality k-D tree can be obtained using greedy cost optimization based on the surface area heuristic (SAH). While the high quality enables very fast ray tracing times, a key drawback is that the k-D tree construction time remains prohibitively expensive. This cost is unreasonable for rendering dynamic scenes for future visual computing applications on emerging multicore systems. Much work has therefore focused on faster parallel k-D tree construction at the expense of approximating or ignoring the SAH computation, which produces k-D trees that degrade rendering time. In this paper, we present new, faster multicore algorithms for building precise SAH-optimized k-D trees. Our best algorithm trades worse cache performance for higher parallelism to provide up to 7X speedup on 16 cores, using two different kinds of parallelism models, without degrading tree quality or rendering time.
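For reference, the greedy SAH cost evaluation that such construction optimizes can be sketched as follows; the constants and the simple single-axis sweep are illustrative assumptions, and the paper's contribution, parallelizing precise SAH construction, is not shown here.

```python
# Brief sketch of greedy SAH evaluation at the heart of k-D tree construction.

C_TRAVERSAL, C_INTERSECT = 1.0, 1.5   # assumed relative traversal/intersection costs

def surface_area(lo, hi):
    """Surface area of the axis-aligned box [lo, hi]."""
    dx, dy, dz = (hi[i] - lo[i] for i in range(3))
    return 2.0 * (dx * dy + dy * dz + dz * dx)

def sah_cost(lo, hi, axis, split, n_left, n_right):
    """Expected SAH cost of splitting the box [lo, hi] at 'split' along 'axis'."""
    left_hi = list(hi); left_hi[axis] = split
    right_lo = list(lo); right_lo[axis] = split
    sa_parent = surface_area(lo, hi)
    p_left = surface_area(lo, left_hi) / sa_parent
    p_right = surface_area(right_lo, hi) / sa_parent
    return C_TRAVERSAL + C_INTERSECT * (p_left * n_left + p_right * n_right)

def best_split(lo, hi, boxes, axis):
    """Sweep candidate planes (primitive bounds) along one axis and return the
    minimal-cost split. 'boxes' is a list of (min_corner, max_corner) tuples."""
    candidates = sorted({b[0][axis] for b in boxes} | {b[1][axis] for b in boxes})
    best = (float("inf"), None)
    for split in candidates:
        n_left = sum(1 for b in boxes if b[0][axis] < split)
        n_right = sum(1 for b in boxes if b[1][axis] > split)
        best = min(best, (sah_cost(lo, hi, axis, split, n_left, n_right), split))
    return best   # (cost, split position)
```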
A Type and Effect System for Deterministic Parallelism in Object-Oriented Languages
We describe a type and effect system for ensuring deterministic semantics in a concurrent object-oriented language. Our system provides several new capabilities over previous work, including support for linear arrays (important in parallel update traversals), flexible effect specifications and subtyping (important for, e.g., tree-based algorithms), dynamic partitioning into subarrays (important for divide-and-conquer algorithms), and a novel invocation effect for handling higher-level commutative operations such as set insert. We informally describe the key type system features, formally define a core subset of our system, and explain the steps leading to the key soundness result, i.e., that the type and effect annotations allow us to reason soundly about parallel noninterference between sections of code. Finally, we describe our experience with using the system to express realistic parallel algorithms, which validates the importance of the new type system features.
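As a small illustration of the noninterference reasoning, the toy sketch below summarizes each task's effects as read/write sets over named regions and checks that two tasks may run in parallel only if no write overlaps another task's accesses. This is a dynamic Python toy, not the paper's static, Java-based type and effect system.

```python
# Toy effect summaries and the noninterference check they enable.
from dataclasses import dataclass, field

@dataclass
class Effect:
    reads: set = field(default_factory=set)    # regions read
    writes: set = field(default_factory=set)   # regions written

def noninterfering(a: Effect, b: Effect) -> bool:
    """True if the two effect summaries cannot conflict (no write/write or
    read/write overlap on any region), so the tasks may run in parallel."""
    return not (a.writes & (b.reads | b.writes)) and not (b.writes & a.reads)

# Example: two subtree updates on disjoint regions are noninterfering ...
left  = Effect(writes={"Tree.Left"})
right = Effect(writes={"Tree.Right"})
assert noninterfering(left, right)

# ... but a whole-tree read conflicts with either update.
whole = Effect(reads={"Tree.Left", "Tree.Right"})
assert not noninterfering(left, whole)
```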
DeNovo: Rethinking Hardware for Disciplined Parallelism (HotPar’10)
We believe that future large-scale multicore systems will require disciplined parallel programming practices, including data-race-freedom, deterministic-by-default semantics, and structured, explicit parallel control and side-effects. We argue that this software evolution presents far-reaching opportunities for parallel hardware design to greatly improve complexity, power-efficiency, and performance scalability. The DeNovo project is rethinking hardware design from the ground up to exploit these opportunities. This paper presents the broad research agenda of DeNovo, including a holistic rethinking of cache coherence, memory consistency, communication, and cache architecture.