    Optimizing Multi-GPU Parallelization Strategies for Deep Learning Training

    Deploying deep learning (DL) models across multiple compute devices to train large and complex models continues to grow in importance because of the demand for faster and more frequent training. Data parallelism (DP) is the most widely used parallelization strategy, but as the number of devices in data parallel training grows, so does the communication overhead between devices. Additionally, a larger aggregate batch size per step leads to statistical efficiency loss, i.e., a larger number of epochs are required to converge to a desired accuracy. These factors affect overall training time, and beyond a certain number of devices, the speedup from leveraging DP begins to scale poorly. In addition to DP, each training step can be accelerated by exploiting model parallelism (MP). This work explores hybrid parallelization, where each data parallel worker comprises more than one device, across which the model dataflow graph (DFG) is split using MP. We show that at scale, hybrid training will be more effective at minimizing end-to-end training time than exploiting DP alone. We project that for Inception-V3, GNMT, and BigLSTM, the hybrid strategy provides an end-to-end training speedup of at least 26.5%, 8%, and 22% respectively compared to what DP alone can achieve at scale.
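The device grouping the abstract describes can be sketched concretely: each data-parallel worker is an MP group of devices sharing one model replica, and gradients are exchanged only between devices holding the same model shard in different replicas. The function below is a minimal illustrative sketch of that partitioning; the device counts and group shapes are assumptions, not details from the paper.

```python
def hybrid_groups(num_devices, mp_degree):
    """Split device IDs into hybrid DP x MP groups.

    Each DP worker ("replica") is an MP group of mp_degree devices
    holding one copy of the model split via model parallelism; each DP
    group contains the devices that hold the same model shard across
    replicas and all-reduce gradients with each other.
    """
    if num_devices % mp_degree != 0:
        raise ValueError("num_devices must be divisible by mp_degree")
    dp_degree = num_devices // mp_degree
    # MP groups: consecutive devices share one model replica.
    mp_groups = [list(range(r * mp_degree, (r + 1) * mp_degree))
                 for r in range(dp_degree)]
    # DP groups: the s-th device of every replica forms one all-reduce group.
    dp_groups = [[r * mp_degree + s for r in range(dp_degree)]
                 for s in range(mp_degree)]
    return mp_groups, dp_groups

mp, dp = hybrid_groups(8, mp_degree=2)
# mp -> [[0, 1], [2, 3], [4, 5], [6, 7]]
# dp -> [[0, 2, 4, 6], [1, 3, 5, 7]]
```

With 8 devices and mp_degree=2 there are 4 replicas, so DP all-reduce traffic involves 4 participants instead of 8, which is the communication-scaling benefit the abstract argues for.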

    Supervia: Relieving Routing Congestion using Double-height Vias

    With the increase in transistor packing density and the use of uni-directional metal routing, resources on local metal layers are increasingly limited. A major contributor to routing congestion is the minimum metal area (minArea) design rule, which has been steadily increasing over the past few technology nodes. For a net which crosses multiple metal layers (e.g., M2 to M4), polygons on intermediate layers (e.g., M3), i.e., via landing pads, must satisfy the minArea rule; this creates unnecessary routing blockage, which can lead to area overhead. In this work, we investigated the benefits of introducing into the BEOL stack a new “supervia” structure, namely, a double-height via spanning two metal layers without a landing pad on an intermediate metal layer. We study the benefit of supervia using (i) routing clip-based evaluation with an optimal ILP-based router (OptRouterSV) and (ii) chip-level evaluation using a commercial routing tool in conjunction with MILP-based supervia-aware legalization. With the latter, if the legalization approach fails, the failures are localized to clips, which are then routed optimally using OptRouterSV. Our results suggest that when the P&R tool is allowed to generate via structures which optimize for minArea in stacked vias, using supervia can save ∼2% of the chip area, whereas in the absence of this option, supervia can save as much as 20% of the chip area.
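The blockage argument above can be made concrete with a small sketch: a conventional stacked via from M2 to M4 needs a minArea-compliant landing pad on M3, while a supervia spanning both layers needs none. The function and minArea values below are illustrative assumptions, not figures from the paper or any real PDK.

```python
def landing_pad_blockage(bottom_layer, top_layer, min_area_by_layer):
    """Total minArea-driven landing-pad area on intermediate layers for a
    conventional stacked via from bottom_layer to top_layer (e.g. M2 -> M4).

    A supervia spans two layers with no pad on the intermediate layer,
    so the corresponding blockage for a supervia is zero.
    """
    intermediate_layers = range(bottom_layer + 1, top_layer)
    return sum(min_area_by_layer[m] for m in intermediate_layers)

# Hypothetical minArea per metal layer (arbitrary units).
min_area = {2: 0.010, 3: 0.012, 4: 0.015}

stacked_blockage = landing_pad_blockage(2, 4, min_area)  # M3 pad required
supervia_blockage = 0.0                                  # no M3 landing pad
```

Summed over the many multi-layer nets in a congested design, this per-net difference is the routing resource the supervia frees up, which the abstract's chip-level evaluation quantifies as up to ∼20% area savings.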