Beyond the socket: NUMA-aware GPUs
GPUs achieve high throughput and power efficiency by employing many small single-instruction, multiple-thread (SIMT) cores. To minimize scheduling logic and performance variance, they utilize a uniform memory system and leverage strong data parallelism exposed via the programming model. With Moore's law slowing, for GPUs to continue scaling performance (which largely depends on SIMT core count), they are likely to embrace multi-socket designs where transistors are more readily available. However, when moving to such designs, maintaining the illusion of a uniform memory system is increasingly difficult. In this work we investigate multi-socket non-uniform memory access (NUMA) GPU designs and show that significant changes are needed to both the GPU interconnect and cache architectures to achieve performance scalability. We show that application phase effects can be exploited, allowing GPU sockets to dynamically optimize their individual interconnect and cache policies and minimize the impact of NUMA effects. Our NUMA-aware GPU outperforms a single GPU by 1.5×, 2.3×, and 3.2× while achieving 89%, 84%, and 76% of theoretical application scalability in 2-, 4-, and 8-socket designs, respectively. Implementable today, NUMA-aware multi-socket GPUs may be a promising candidate for scaling GPU performance beyond a single socket.
We would like to thank the anonymous reviewers and Steve Keckler for their help in improving this paper. The first author is supported by the Ministry of Economy and Competitiveness of Spain (TIN2012-34557, TIN2015-65316-P, and BES-2013-063925).
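The per-socket phase-adaptive policy selection described above can be sketched as a toy model. This is an illustrative assumption, not the paper's actual hardware mechanism: the `choose_policy` function, the 0.3 threshold, and the two policy names are all hypothetical.

```python
# Toy sketch of phase-adaptive per-socket policy selection in a NUMA GPU.
# All names and thresholds here are illustrative assumptions, not the
# paper's design: each socket samples the fraction of its memory traffic
# that targets remote sockets and re-picks a cache policy per phase.

def choose_policy(remote_fraction, threshold=0.3):
    """Pick a per-socket cache policy from the observed traffic mix."""
    # High remote traffic -> dedicate cache capacity to remote lines to
    # reduce pressure on the inter-socket links; otherwise cache local data.
    return "cache-remote" if remote_fraction > threshold else "cache-local"

def run_phases(phase_remote_fractions):
    """Re-evaluate the policy at every application phase boundary."""
    return [choose_policy(f) for f in phase_remote_fractions]

policies = run_phases([0.05, 0.6, 0.45, 0.1])
```

The point of the sketch is only that the decision is made dynamically, per socket and per phase, rather than fixed at design time.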
Optimizing Multi-GPU Parallelization Strategies for Deep Learning Training
Deploying deep learning (DL) models across multiple compute devices to train
large and complex models continues to grow in importance because of the demand
for faster and more frequent training. Data parallelism (DP) is the most widely
used parallelization strategy, but as the number of devices in data parallel
training grows, so does the communication overhead between devices.
Additionally, a larger aggregate batch size per step leads to statistical
efficiency loss, i.e., a larger number of epochs are required to converge to a
desired accuracy. These factors affect overall training time, and beyond a
certain number of devices the speedup from leveraging DP begins to scale
poorly. In addition to DP, each training step can be accelerated by exploiting
model parallelism (MP). This work explores hybrid parallelization, where each
data parallel worker comprises more than one device, across which the
model dataflow graph (DFG) is split using MP. We show that at scale, hybrid
training will be more effective at minimizing end-to-end training time than
exploiting DP alone. We project that for Inception-V3, GNMT, and BigLSTM, the
hybrid strategy provides an end-to-end training speedup of at least 26.5%, 8%,
and 22%, respectively, compared to what DP alone can achieve at scale.
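The trade-off argued above can be made concrete with a simple analytic cost model. The functions and constants below are our assumptions for illustration, not the paper's model: DP pays an all-reduce cost that grows with the number of workers, while hybrid DP×MP pays a fixed MP overhead in exchange for a smaller DP group.

```python
# Illustrative step-time model (our assumption, not the paper's) comparing
# pure data parallelism against hybrid DP x MP on n devices.

def dp_step_time(n, compute=1.0, comm_per_worker=0.02):
    # All-reduce communication cost grows with the number of DP workers.
    return compute + comm_per_worker * n

def hybrid_step_time(n, mp_degree=2, compute=1.0, comm_per_worker=0.02,
                     mp_overhead=0.6):
    # Splitting the model over mp_degree devices shrinks the DP group,
    # trading a fixed intra-worker MP overhead for less all-reduce traffic.
    workers = n // mp_degree
    return compute / mp_degree + mp_overhead + comm_per_worker * workers
```

Under these (made-up) constants, DP alone wins at 4 devices but the hybrid strategy wins at 64, matching the abstract's claim that hybrid training pays off specifically at scale.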
Tumor markers: a proteomic approach
This article reviews recently published data on the diagnosis of cancer with proteomics, including the major proteomics technologies and promising strategies for biomarker discovery and development. Tumor markers are substances normally present in body fluids and tissues, such as serum, urine, and blood, whose levels increase in cancer patients. Most tumor markers are proteins that either are produced by the cancer cells themselves or increase in response to cancer-induced changes. However, most of them are natural compounds ordinarily present in normal cells in small amounts; cancer raises their expression, and hence their levels in the blood, body fluids, or tissues.
Structural Biology: Modeling applications and techniques at a glance
As recent advances in biology show, molecular machines, especially proteins, RNA, and other complex molecules, play the central role in cell functionality. A very large part of systems biology is therefore concerned with the interactions of such molecular components. Drug companies and research institutes are trying hard to better understand these interactions, and because the costs of such projects are so high, they are often funded by governments or for-profit companies. With this in mind, techniques such as simulation are always a very good candidate to reduce these costs and to give scientists a preview of project results before undertaking costly experiments. Even projects that compute only an approximation to the problem are not cheap, so it is of utmost importance to devise simulation techniques that both reduce project costs and predict more accurately. Since systems biology and proteomics, the study of proteins and their functions, are central to drug discovery, to understanding cell functionality, and to uncovering the causes of disease, we need advanced software and algorithms that can predict the structure of molecular components and provide researchers with computational tools to analyze such models. In this paper we review the importance of molecular modeling, its limitations, and its applications.
Fair and high performance shared memory resource management
Chip multiprocessors (CMPs) commonly share a large portion of memory
system resources among different cores. Since memory requests from
different threads executing on different cores significantly interfere
with one another in these shared resources, the design of the shared
memory subsystem is crucial for achieving high performance and
fairness.
Inter-thread memory system interference has different implications
based on the type of workload running on a CMP. In multi-programmed
workloads, different applications can experience significantly
different slowdowns. If left uncontrolled, large disparities in
slowdowns result in low system performance and make system software's
priority-based thread scheduling policies ineffective. In a single
multi-threaded application, memory system interference between threads
of the same application can slow each thread down significantly. Most
importantly, the critical path of execution can also be
significantly slowed down, resulting in increased application
execution time.
This dissertation proposes three mechanisms that address different
shortcomings of current shared resource management techniques targeted
at multi-programmed workloads, and one mechanism which speeds up a
single multi-threaded application by managing main-memory related
interference between its different threads.
With multi-programmed workloads, the key idea is that both demand- and
prefetch-caused inter-application interference should be taken into
account in shared resource management techniques across the entire
shared memory system. Our evaluations demonstrate that doing so
significantly improves both system performance and fairness compared
to the state-of-the-art. When executing a single multi-threaded
application on a CMP, the key idea is to take into account the
inter-dependence of threads in memory scheduling decisions. Our
evaluation shows that doing so significantly reduces the execution
time of the multi-threaded application compared to using
state-of-the-art memory schedulers designed for multi-programmed
workloads.
This dissertation concludes that the performance and fairness of CMPs
can be significantly improved by better management of inter-thread
interference in the shared memory resources, both for multi-programmed
workloads and multi-threaded applications.
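The core idea for multi-programmed workloads, keeping slowdowns balanced across applications, can be sketched in a few lines. This is a minimal illustration of slowdown-aware scheduling, not the dissertation's exact mechanism; the function names and the slowdown estimates are hypothetical.

```python
# Minimal sketch of slowdown-aware shared-resource arbitration (an
# illustration of the principle, not the dissertation's mechanism):
# service the pending request of the application currently suffering
# the largest slowdown relative to running alone.

def slowdown(shared_time, alone_time):
    """Slowdown = execution time under sharing / execution time alone."""
    return shared_time / alone_time

def pick_next(requests, shared_times, alone_times):
    """requests: app ids with a pending memory request; dicts map app -> time."""
    return max(requests,
               key=lambda app: slowdown(shared_times[app], alone_times[app]))
```

Prioritizing the most-slowed-down application narrows the disparity in slowdowns, which is exactly what keeps priority-based OS scheduling policies meaningful on a CMP.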
Techniques for Bandwidth-Efficient Prefetching of Linked Data Structures in Hybrid Prefetching Systems
Linked data structure (LDS) accesses are critical to the performance of many large-scale applications. Techniques have been proposed to prefetch such accesses. Unfortunately, many LDS prefetching techniques 1) generate a large number of useless prefetches, thereby degrading performance and bandwidth efficiency, 2) require significant hardware or storage cost, or 3) when employed together with stream-based prefetchers, cause significant resource contention in the memory system. As a result, existing processors do not employ LDS prefetchers even though they commonly employ stream-based prefetchers. This paper proposes a low-cost hardware/software cooperative technique that enables bandwidth-efficient prefetching of linked data structures. Our solution has two new components: 1) a compiler-guided prefetch filtering mechanism that informs the hardware about which pointer addresses to prefetch, and 2) a coordinated prefetcher throttling mechanism that uses run-time feedback to manage the interference between multiple prefetchers (LDS and stream-based) in a hybrid prefetching system. Evaluations show that the proposed solution improves average performance by 22.5% while decreasing memory bandwidth consumption by 25% over a baseline system that employs an effective stream prefetcher on a set of memory- and pointer-intensive applications. We compare our proposal to three different LDS/correlation prefetching techniques and find that it provides significantly better performance on both single-core and multi-core systems, while requiring less hardware cost.
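The two cooperating mechanisms, compiler-guided filtering and feedback-driven throttling, can be sketched as follows. This is a toy software model of the idea; the function names, thresholds, and degree bounds are our assumptions, not the paper's hardware design.

```python
# Toy sketch of the two mechanisms (illustrative assumptions, not the
# paper's hardware): a compiler-supplied filter admits only pointer
# addresses marked as profitable to prefetch, and run-time feedback
# throttles a prefetcher's degree when its accuracy drops.

def filter_prefetch(addr, compiler_hints):
    """Issue a prefetch only if the compiler marked this address useful."""
    return addr in compiler_hints

def throttle(accuracy, degree, low=0.4, high=0.8, max_degree=4):
    """Adjust prefetch degree from measured accuracy (useful / issued)."""
    if accuracy < low:
        return max(1, degree - 1)   # mostly useless: back off
    if accuracy > high:
        return min(max_degree, degree + 1)  # mostly useful: ramp up
    return degree                   # in between: hold steady
```

In a hybrid system the same feedback loop would run per prefetcher (LDS and stream-based), so a misbehaving prefetcher is throttled before it steals bandwidth from the accurate one.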