    GPU NTC Process Variation Compensation with Voltage Stacking

    Near-threshold computing (NTC) has the potential to significantly improve efficiency in high-throughput architectures, such as general-purpose computing on graphics processing units (GPGPU). Nevertheless, NTC is more sensitive to process variation (PV), and it complicates power delivery. We propose GPU stacking, a novel method based on voltage stacking, to manage the effects of PV and improve power delivery simultaneously. To evaluate our methodology, we first explore the design space of GPGPUs in the NTC regime to find a suitable baseline configuration and then apply GPU stacking to mitigate the effects of PV. Compared with an equivalent NTC GPGPU without PV management, we achieve 37% more performance on average. When considering high production volume, our approach shifts all the chips closer to the nominal non-PV case, delivering on average (across chips) ~80% of the performance of a nominal NTC GPGPU, whereas without our technique, chips would have ~50% of the nominal performance. We also show that our approach can be applied on top of multifrequency domain designs, improving the overall performance.

    Selectivity filter instability dominates the low intrinsic activity of the TWIK-1 K2P K+ channel

    Two-pore domain K+ (K2P) channels have many important physiological functions. However, the functional properties of the TWIK-1 (K2P1.1/KCNK1) K2P channel remain poorly characterized because heterologous expression of this ion channel yields only very low levels of functional activity. Several underlying reasons have been proposed, including TWIK-1 retention in intracellular organelles, inhibition by posttranslational sumoylation, a hydrophobic barrier within the pore, and a low open probability of the selectivity filter (SF) gate. By evaluating these potential mechanisms, we found that the latter dominates the low intrinsic functional activity of TWIK-1. Investigating this further, we observed that the low activity of the SF gate appears to arise from the inefficiency of K+ in stabilizing an active (i.e. conductive) SF conformation. In contrast, other permeant ion species, such as Rb+, NH4+, and Cs+, strongly promoted a pH-dependent activated conformation. Furthermore, many K2P channels are activated by membrane depolarization via an SF-mediated gating mechanism, but we found here that only very strong nonphysiological depolarization produces voltage-dependent activation of heterologously expressed TWIK-1. Remarkably, we also observed that TWIK-1 Rb+ currents are potently inhibited by intracellular K+ (IC50 = 2.8 mM). We conclude that TWIK-1 displays unique SF gating properties among the family of K2P channels. In particular, the apparent instability of the conductive conformation of the TWIK-1 SF in the presence of K+ appears to dominate the low levels of intrinsic functional activity observed when the channel is expressed at the cell surface.

    MTrainS: Improving DLRM training efficiency using heterogeneous memories

    Recommendation models are very large, requiring terabytes (TB) of memory during training. In pursuit of better quality, the model size and complexity grow over time, which requires additional training data to avoid overfitting. This model growth demands a large number of resources in data centers. Hence, training efficiency is becoming considerably more important to keep the data center power demand manageable. In Deep Learning Recommendation Models (DLRM), sparse features capturing categorical inputs through embedding tables are the major contributors to model size and require high memory bandwidth. In this paper, we study the bandwidth requirement and locality of embedding tables in real-world deployed models. We observe that the bandwidth requirement is not uniform across different tables and that embedding tables show high temporal locality. We then design MTrainS, which hierarchically leverages heterogeneous memory for DLRM, including byte- and block-addressable Storage Class Memory. MTrainS allows for higher memory capacity per node and increases training efficiency by lowering the need to scale out to multiple hosts in memory-capacity-bound use cases. By optimizing the platform memory hierarchy, we reduce the number of nodes for training by 4-8X, saving power and cost of training while meeting our target training performance.
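
The abstract's observation that embedding tables show high temporal locality is what makes a small fast memory tier in front of a larger slow tier worthwhile. The sketch below is not MTrainS itself; it is a minimal illustration, assuming a simple LRU policy and a hypothetical `lru_hit_rate` helper, of how one might measure the fraction of embedding-row accesses a fast tier of a given capacity would serve.

```python
from collections import OrderedDict


def lru_hit_rate(accesses, capacity):
    """Return the fraction of accesses served by an LRU cache of `capacity` rows.

    `accesses` is a sequence of embedding-row ids (hypothetical trace format).
    """
    cache = OrderedDict()  # row id -> placeholder, ordered oldest to newest
    hits = 0
    for row in accesses:
        if row in cache:
            hits += 1
            cache.move_to_end(row)  # mark as most recently used
        else:
            cache[row] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used row
    return hits / len(accesses) if accesses else 0.0
```

A trace with strong temporal locality yields a high hit rate even for a small `capacity`, which is the property a tiered memory design relies on.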

    Software-Hardware Co-design for Fast and Scalable Training of Deep Learning Recommendation Models

    Deep learning recommendation models (DLRMs) are used across many business-critical services at Facebook and are the single largest AI application in terms of infrastructure demand in its data centers. In this paper we discuss the SW/HW co-designed solution for high-performance distributed training of large-scale DLRMs. We introduce a high-performance scalable software stack based on PyTorch and pair it with the new evolution of the Zion platform, namely ZionEX. We demonstrate the capability to train very large DLRMs with up to 12 trillion parameters and show that we can attain a 40X speedup in terms of time to solution over previous systems. We achieve this by (i) designing the ZionEX platform with a dedicated scale-out network, provisioned with high bandwidth, optimal topology and efficient transport; (ii) implementing an optimized PyTorch-based training stack supporting both model and data parallelism; (iii) developing sharding algorithms capable of hierarchical partitioning of the embedding tables along row and column dimensions and load balancing them across multiple workers; (iv) adding high-performance core operators while retaining flexibility to support optimizers with fully deterministic updates; and (v) leveraging reduced-precision communications, a multi-level memory hierarchy (HBM+DDR+SSD) and pipelining. Furthermore, we develop and briefly comment on distributed data ingestion and other supporting services that are required for robust and efficient end-to-end training in production environments.
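
Item (iii) above mentions partitioning embedding tables along rows and load balancing the shards across workers. The abstract does not describe the actual algorithm, so the following is only a sketch under stated assumptions: contiguous row-wise splitting and a greedy largest-first assignment, with hypothetical helper names `split_rows` and `balance_shards` and a caller-supplied cost per shard.

```python
import heapq


def split_rows(num_rows, num_shards):
    """Split a table's rows into contiguous, near-equal (start, end) ranges."""
    base, extra = divmod(num_rows, num_shards)
    ranges, start = [], 0
    for i in range(num_shards):
        size = base + (1 if i < extra else 0)  # first `extra` shards get one more row
        ranges.append((start, start + size))
        start += size
    return ranges


def balance_shards(shard_costs, num_workers):
    """Greedy balancing: give each shard, largest first, to the least-loaded worker.

    `shard_costs` maps a shard name to an assumed cost (e.g. bytes or lookups).
    """
    heap = [(0, w) for w in range(num_workers)]  # (current load, worker id)
    heapq.heapify(heap)
    assignment = {w: [] for w in range(num_workers)}
    for name, cost in sorted(shard_costs.items(), key=lambda kv: -kv[1]):
        load, worker = heapq.heappop(heap)
        assignment[worker].append(name)
        heapq.heappush(heap, (load + cost, worker))
    return assignment
```

The greedy heuristic is a common baseline for this kind of placement problem; a production sharder would also have to account for memory limits and communication cost, which this sketch ignores.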

    Global, regional, and national burden of colorectal cancer and its risk factors, 1990–2019: a systematic analysis for the Global Burden of Disease Study 2019

    Funding: F Carvalho and E Fernandes acknowledge support from Fundação para a Ciência e a Tecnologia, I.P. (FCT), in the scope of the project UIDP/04378/2020 and UIDB/04378/2020 of the Research Unit on Applied Molecular Biosciences UCIBIO and the project LA/P/0140/2020 of the Associate Laboratory Institute for Health and Bioeconomy i4HB; FCT/MCTES through the project UIDB/50006/2020. J Conde acknowledges the European Research Council Starting Grant (ERC-StG-2019-848325). V M Costa acknowledges the grant SFRH/BHD/110001/2015, received by Portuguese national funds through Fundação para a Ciência e a Tecnologia (FCT), IP, under the Norma Transitória DL57/2016/CP1334/CT0006.

    ESESC: A Fast Multicore Simulator Using Time-Based Sampling

    Architects rely on simulation in their exploration of the design space. However, slow simulation speed caps their productivity and limits the depth of their exploration. Sampling has been a commonly used remedy. While sampling is shown to be an effective technique for single-core processors, its application has been limited to simulation of multiprogram, throughput applications only. This work presents Time-Based Sampling (TBS), a framework that is the first to enable sampling in simulation of multicore processors with virtually no limitation in terms of application type (multiprogrammed or multithreaded), number of cores, homogeneity or heterogeneity of the simulated configuration (4.99% error averaged across all the evaluated configurations). TBS is also the first to enable integrated power and temperature evaluation in statistically sampled simulation of multicore systems (with 5.5% and 2.4% error on average, respectively). We implement an architectural simulator based on TBS, called ESESC, that provides a holistic set of tools for a fair evaluation of different architectures.