    Optimizing Multi-GPU Parallelization Strategies for Deep Learning Training

    Deploying deep learning (DL) models across multiple compute devices to train large and complex models continues to grow in importance because of the demand for faster and more frequent training. Data parallelism (DP) is the most widely used parallelization strategy, but as the number of devices in data parallel training grows, so does the communication overhead between devices. Additionally, a larger aggregate batch size per step leads to statistical efficiency loss, i.e., a larger number of epochs are required to converge to a desired accuracy. These factors affect overall training time, and beyond a certain number of devices, the speedup from leveraging DP begins to scale poorly. In addition to DP, each training step can be accelerated by exploiting model parallelism (MP). This work explores hybrid parallelization, where each data parallel worker comprises more than one device, across which the model dataflow graph (DFG) is split using MP. We show that at scale, hybrid training will be more effective at minimizing end-to-end training time than exploiting DP alone. We project that for Inception-V3, GNMT, and BigLSTM, the hybrid strategy provides an end-to-end training speedup of at least 26.5%, 8%, and 22% respectively compared to what DP alone can achieve at scale.
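The device grouping the abstract describes can be sketched concretely: each data-parallel worker is an MP group of devices sharing one model replica, and gradients are exchanged only between devices holding the same model shard in different replicas. The function below is a minimal illustrative sketch of that partitioning; the device counts and group shapes are assumptions, not details from the paper.

```python
def hybrid_groups(num_devices, mp_degree):
    """Split device IDs into hybrid DP x MP groups.

    Each DP worker ("replica") is an MP group of mp_degree devices
    holding one copy of the model split via model parallelism; each DP
    group contains the devices that hold the same model shard across
    replicas and all-reduce gradients with each other.
    """
    if num_devices % mp_degree != 0:
        raise ValueError("num_devices must be divisible by mp_degree")
    dp_degree = num_devices // mp_degree
    # MP groups: consecutive devices share one model replica.
    mp_groups = [list(range(r * mp_degree, (r + 1) * mp_degree))
                 for r in range(dp_degree)]
    # DP groups: the s-th device of every replica forms one all-reduce group.
    dp_groups = [[r * mp_degree + s for r in range(dp_degree)]
                 for s in range(mp_degree)]
    return mp_groups, dp_groups

mp, dp = hybrid_groups(8, mp_degree=2)
# mp -> [[0, 1], [2, 3], [4, 5], [6, 7]]
# dp -> [[0, 2, 4, 6], [1, 3, 5, 7]]
```

With 8 devices and mp_degree=2 there are 4 replicas, so DP all-reduce traffic involves 4 participants instead of 8, which is the communication-scaling benefit the abstract argues for.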

    Supervia: Relieving Routing Congestion using Double-height Vias

    With the increase in transistor packing density and the use of uni-directional metal routing, resources on local metal layers are increasingly limited. A major contributor to routing congestion is the minimum metal area (minArea) design rule, which has been steadily increasing over the past few technology nodes. For a net which crosses multiple metal layers (e.g., M2 to M4), polygons on intermediate layers (e.g., M3), i.e., via landing pads, must satisfy the minArea rule; this creates unnecessary routing blockage, which can lead to area overhead. In this work, we investigated the benefits of introducing into the BEOL stack a new “supervia” structure, namely, a double-height via spanning two metal layers without a landing pad on an intermediate metal layer. We study the benefit of supervia using (i) routing clip-based evaluation with an optimal ILP-based router (OptRouterSV) and (ii) chip-level evaluation using a commercial routing tool in conjunction with MILP-based supervia-aware legalization. With the latter, if the legalization approach fails, the failures are localized to clips, which are then routed optimally using OptRouterSV. Our results suggest that when the P&R tool is allowed to generate via structures which optimize for minArea in stacked vias, using supervia can save ∼2% of the chip area, whereas in the absence of this option, supervia can save as much as 20% of the chip area.
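The blockage argument above can be made concrete with a small sketch: a conventional stacked via from M2 to M4 needs a minArea-compliant landing pad on M3, while a supervia spanning both layers needs none. The function and minArea values below are illustrative assumptions, not figures from the paper or any real PDK.

```python
def landing_pad_blockage(bottom_layer, top_layer, min_area_by_layer):
    """Total minArea-driven landing-pad area on intermediate layers for a
    conventional stacked via from bottom_layer to top_layer (e.g. M2 -> M4).

    A supervia spans two layers with no pad on the intermediate layer,
    so the corresponding blockage for a supervia is zero.
    """
    intermediate_layers = range(bottom_layer + 1, top_layer)
    return sum(min_area_by_layer[m] for m in intermediate_layers)

# Hypothetical minArea per metal layer (arbitrary units).
min_area = {2: 0.010, 3: 0.012, 4: 0.015}

stacked_blockage = landing_pad_blockage(2, 4, min_area)  # M3 pad required
supervia_blockage = 0.0                                  # no M3 landing pad
```

Summed over the many multi-layer nets in a congested design, this per-net difference is the routing resource the supervia frees up, which the abstract's chip-level evaluation quantifies as up to ∼20% area savings.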