    Beyond the socket: NUMA-aware GPUs

    GPUs achieve high throughput and power efficiency by employing many small single instruction multiple thread (SIMT) cores. To minimize scheduling logic and performance variance, they utilize a uniform memory system and leverage strong data parallelism exposed via the programming model. With Moore's law slowing, GPUs are likely to embrace multi-socket designs, where transistors are more readily available, if they are to continue scaling performance (which largely depends on SIMT core count). However, when moving to such designs, maintaining the illusion of a uniform memory system is increasingly difficult. In this work we investigate multi-socket non-uniform memory access (NUMA) GPU designs and show that significant changes are needed to both the GPU interconnect and cache architectures to achieve performance scalability. We show that application phase effects can be exploited, allowing GPU sockets to dynamically optimize their individual interconnect and cache policies and minimizing the impact of NUMA effects. Our NUMA-aware GPU outperforms a single GPU by 1.5×, 2.3×, and 3.2× while achieving 89%, 84%, and 76% of theoretical application scalability in 2-, 4-, and 8-socket designs, respectively. Implementable today, NUMA-aware multi-socket GPUs may be a promising candidate for scaling GPU performance beyond a single socket. We would like to thank the anonymous reviewers and Steve Keckler for their help in improving this paper. The first author is supported by the Ministry of Economy and Competitiveness of Spain (TIN2012-34557, TIN2015-65316-P, and BES-2013-063925).
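    As a quick sanity check on these numbers, the sketch below (our arithmetic, using only the figures quoted in the abstract) derives the theoretical scaling ceiling implied by each reported efficiency; it makes clear that "theoretical application scalability" is itself sub-linear in socket count.

    ```python
    # Implied theoretical scaling per socket count, from the abstract's measured
    # speedups and efficiency percentages: ceiling = speedup / efficiency.
    sockets = [2, 4, 8]
    speedups = [1.5, 2.3, 3.2]        # measured vs. a single GPU
    efficiency = [0.89, 0.84, 0.76]   # fraction of theoretical application scalability

    for n, s, e in zip(sockets, speedups, efficiency):
        ceiling = s / e
        print(f"{n} sockets: {s:.1f}x measured, "
              f"{ceiling:.2f}x implied theoretical ceiling ({e:.0%})")
    ```

    For example, at 2 sockets the 1.5× speedup at 89% efficiency implies a ceiling of roughly 1.69×, well below the linear 2×, because the applications themselves limit achievable scaling.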

    Optimizing Multi-GPU Parallelization Strategies for Deep Learning Training

    Deploying deep learning (DL) models across multiple compute devices to train large and complex models continues to grow in importance because of the demand for faster and more frequent training. Data parallelism (DP) is the most widely used parallelization strategy, but as the number of devices in data-parallel training grows, so does the communication overhead between devices. Additionally, a larger aggregate batch size per step leads to statistical efficiency loss, i.e., a larger number of epochs is required to converge to a desired accuracy. These factors affect overall training time, and beyond a certain number of devices the speedup from leveraging DP begins to scale poorly. In addition to DP, each training step can be accelerated by exploiting model parallelism (MP). This work explores hybrid parallelization, where each data-parallel worker comprises more than one device, across which the model dataflow graph (DFG) is split using MP. We show that at scale, hybrid training will be more effective at minimizing end-to-end training time than exploiting DP alone. We project that for Inception-V3, GNMT, and BigLSTM, the hybrid strategy provides an end-to-end training speedup of at least 26.5%, 8%, and 22%, respectively, compared to what DP alone can achieve at scale.
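    To make the hybrid layout concrete, here is a minimal PyTorch sketch of one common way to realize it: each data-parallel worker owns two GPUs, the model is split across them (MP), and DistributedDataParallel replicates the two-GPU worker (DP). The toy model, names, and two-GPUs-per-worker layout are illustrative assumptions, not the paper's actual setup.

    ```python
    import torch.nn as nn
    from torch.nn.parallel import DistributedDataParallel as DDP

    class TwoStageModel(nn.Module):
        """Toy model split across two devices inside one DP worker (MP)."""
        def __init__(self, dev0, dev1):
            super().__init__()
            self.dev0, self.dev1 = dev0, dev1
            self.stage0 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to(dev0)
            self.stage1 = nn.Linear(4096, 10).to(dev1)

        def forward(self, x):
            x = self.stage0(x.to(self.dev0))
            return self.stage1(x.to(self.dev1))  # activations cross the GPU link

    def make_worker(rank):
        # Worker `rank` owns GPUs 2*rank and 2*rank+1; assumes torch.distributed
        # has already been initialized (e.g., via torchrun).
        model = TwoStageModel(f"cuda:{2*rank}", f"cuda:{2*rank+1}")
        # For a multi-device module, DDP is constructed without device_ids.
        return DDP(model)
    ```

    The trade-off the paper studies falls out of this structure: MP shortens each step but sends activations over the inter-GPU link, while DP replicas still synchronize gradients, now across fewer, larger workers.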

    Tumor markers: a proteomic approach

    This article reviews recently published data on the diagnosis of cancer with proteomics, including the major proteomics technologies and promising strategies for biomarker discovery and development. Tumor markers are substances normally present in body fluids such as serum, urine, and blood, and in tissues, whose levels increase in cancer patients. Most tumor markers are proteins that are either produced by the cancer cells themselves or produced in elevated amounts in response to the altered conditions of cancer. In most cases they are natural compounds already present in normal cells in small amounts; cancer raises their expression, and their levels in blood, body fluids, or tissues track the presence and severity of the disease.

    Structural Biology: Modeling applications and techniques at a glance

    As recent advances in biology show, molecular machines, especially proteins, RNA, and larger complexes, underlie much of what we call cell functionality. A large part of systems biology is therefore concerned with the interactions of such molecular components. Drug companies and research institutes are working hard to better understand the concepts underlying these interactions and depend heavily on insight into these molecular elements. However, the costs of such projects are high, and in many cases they are funded by governments or for-profit companies. With this in mind, techniques such as simulation are strong candidates for reducing these costs and giving scientists a preview of likely results before undertaking costly experiments. Projects that compute only an approximation to the problem are less expensive, but they are still costly, so it is of utmost importance to develop simulation techniques that both reduce project costs and predict more accurately. Since systems biology and proteomics, the study of proteins and their functions, are central to drug discovery, to understanding cell functionality, and to uncovering the causes of disease, we need advanced software and algorithms that can predict the structure of molecular components and provide researchers with computational tools to analyze such models. In this paper we review the importance of molecular modeling, its limitations, and its applications.

    Techniques for Bandwidth-Efficient Prefetching of Linked Data Structures in Hybrid Prefetching Systems

    Linked data structure (LDS) accesses are critical to the performance of many large-scale applications, and techniques have been proposed to prefetch such accesses. Unfortunately, many LDS prefetching techniques 1) generate a large number of useless prefetches, thereby degrading performance and bandwidth efficiency, 2) require significant hardware or storage cost, or 3) when employed together with stream-based prefetchers, cause significant resource contention in the memory system. As a result, existing processors do not employ LDS prefetchers even though they commonly employ stream-based prefetchers. This paper proposes a low-cost hardware/software cooperative technique that enables bandwidth-efficient prefetching of linked data structures. Our solution has two new components: 1) a compiler-guided prefetch filtering mechanism that informs the hardware about which pointer addresses to prefetch, and 2) a coordinated prefetcher throttling mechanism that uses run-time feedback to manage the interference between multiple prefetchers (LDS and stream-based) in a hybrid prefetching system. Evaluations show that the proposed solution improves average performance by 22.5% while decreasing memory bandwidth consumption by 25% over a baseline system that employs an effective stream prefetcher on a set of memory- and pointer-intensive applications. We compare our proposal to three different LDS/correlation prefetching techniques and find that it provides significantly better performance on both single-core and multi-core systems, while requiring less hardware cost.
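    The abstract's second component lends itself to a small illustration. Below is a toy Python model (our construction, with invented thresholds and counters, not the paper's hardware parameters) of feedback-directed throttling: at the end of each sampling interval, a prefetcher whose measured accuracy is high is granted a higher prefetch degree, and an inaccurate one is throttled down to limit interference.

    ```python
    from dataclasses import dataclass

    @dataclass
    class Prefetcher:
        name: str
        degree: int = 2   # current aggressiveness (prefetch degree)
        useful: int = 0   # prefetches later demanded by the program
        issued: int = 0   # all prefetches issued this interval

        def accuracy(self):
            return self.useful / self.issued if self.issued else 0.0

    def throttle(prefetchers, lo=0.40, hi=0.75, max_degree=8):
        """End-of-interval policy: reward accurate prefetchers with more
        bandwidth; throttle inaccurate ones to reduce contention."""
        for p in prefetchers:
            if p.accuracy() >= hi and p.degree < max_degree:
                p.degree += 1
            elif p.accuracy() < lo and p.degree > 0:
                p.degree -= 1
            p.useful = p.issued = 0   # reset counters for the next interval

    lds, stream = Prefetcher("LDS"), Prefetcher("stream")
    lds.useful, lds.issued = 10, 40        # 25% accurate -> throttled down
    stream.useful, stream.issued = 30, 36  # ~83% accurate -> ramped up
    throttle([lds, stream])
    print(lds.degree, stream.degree)       # 1 3
    ```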