Beyond the socket: NUMA-aware GPUs
GPUs achieve high throughput and power efficiency by employing many small single-instruction, multiple-thread (SIMT) cores. To minimize scheduling logic and performance variance, they utilize a uniform memory system and leverage strong data parallelism exposed via the programming model. With Moore's law slowing, for GPUs to continue scaling performance (which largely depends on SIMT core count), they are likely to embrace multi-socket designs where transistors are more readily available. However, when moving to such designs, maintaining the illusion of a uniform memory system is increasingly difficult. In this work we investigate multi-socket non-uniform memory access (NUMA) GPU designs and show that significant changes are needed to both the GPU interconnect and cache architectures to achieve performance scalability. We show that application phase effects can be exploited, allowing GPU sockets to dynamically optimize their individual interconnect and cache policies and minimize the impact of NUMA effects. Our NUMA-aware GPU outperforms a single GPU by 1.5×, 2.3×, and 3.2× while achieving 89%, 84%, and 76% of theoretical application scalability in 2-, 4-, and 8-socket designs, respectively. Implementable today, NUMA-aware multi-socket GPUs may be a promising candidate for scaling GPU performance beyond a single socket.
We would like to thank the anonymous reviewers and Steve Keckler for their help in improving this paper. The first author is supported by the Ministry of Economy and Competitiveness of Spain (TIN2012-34557, TIN2015-65316-P, and BES-2013-063925).
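The per-socket phase-adaptive policy selection described above can be sketched as a toy model. This is an illustrative assumption, not the paper's actual hardware mechanism: the `choose_policy` function, the 0.3 threshold, and the two policy names are all hypothetical.

```python
# Toy sketch of phase-adaptive per-socket policy selection in a NUMA GPU.
# All names and thresholds here are illustrative assumptions, not the
# paper's design: each socket samples the fraction of its memory traffic
# that targets remote sockets and re-picks a cache policy per phase.

def choose_policy(remote_fraction, threshold=0.3):
    """Pick a per-socket cache policy from the observed traffic mix."""
    # High remote traffic -> dedicate cache capacity to remote lines to
    # reduce pressure on the inter-socket links; otherwise cache local data.
    return "cache-remote" if remote_fraction > threshold else "cache-local"

def run_phases(phase_remote_fractions):
    """Re-evaluate the policy at every application phase boundary."""
    return [choose_policy(f) for f in phase_remote_fractions]

policies = run_phases([0.05, 0.6, 0.45, 0.1])
```

The point of the sketch is only that the decision is made dynamically, per socket and per phase, rather than fixed at design time.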
Optimizing Multi-GPU Parallelization Strategies for Deep Learning Training
Deploying deep learning (DL) models across multiple compute devices to train
large and complex models continues to grow in importance because of the demand
for faster and more frequent training. Data parallelism (DP) is the most widely
used parallelization strategy, but as the number of devices in data parallel
training grows, so does the communication overhead between devices.
Additionally, a larger aggregate batch size per step leads to statistical
efficiency loss, i.e., a larger number of epochs are required to converge to a
desired accuracy. These factors affect overall training time, and beyond a
certain number of devices the speedup from leveraging DP begins to scale
poorly. In addition to DP, each training step can be accelerated by exploiting
model parallelism (MP). This work explores hybrid parallelization, where each
data parallel worker comprises more than one device, across which the
model dataflow graph (DFG) is split using MP. We show that at scale, hybrid
training will be more effective at minimizing end-to-end training time than
exploiting DP alone. We project that for Inception-V3, GNMT, and BigLSTM, the
hybrid strategy provides an end-to-end training speedup of at least 26.5%, 8%,
and 22%, respectively, compared to what DP alone can achieve at scale.
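The trade-off argued above can be made concrete with a simple analytic cost model. The functions and constants below are our assumptions for illustration, not the paper's model: DP pays an all-reduce cost that grows with the number of workers, while hybrid DP×MP pays a fixed MP overhead in exchange for a smaller DP group.

```python
# Illustrative step-time model (our assumption, not the paper's) comparing
# pure data parallelism against hybrid DP x MP on n devices.

def dp_step_time(n, compute=1.0, comm_per_worker=0.02):
    # All-reduce communication cost grows with the number of DP workers.
    return compute + comm_per_worker * n

def hybrid_step_time(n, mp_degree=2, compute=1.0, comm_per_worker=0.02,
                     mp_overhead=0.6):
    # Splitting the model over mp_degree devices shrinks the DP group,
    # trading a fixed intra-worker MP overhead for less all-reduce traffic.
    workers = n // mp_degree
    return compute / mp_degree + mp_overhead + comm_per_worker * workers
```

Under these (made-up) constants, DP alone wins at 4 devices but the hybrid strategy wins at 64, matching the abstract's claim that hybrid training pays off specifically at scale.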
Tumor markers: a proteomic approach
This article reviews recently published data on the diagnosis of cancer with proteomics, including the major proteomics technologies and promising strategies for biomarker discovery and development. Tumor markers are substances normally present in body fluids and tissues, such as serum, urine, and blood, whose levels increase in cancer patients. Most tumor markers are proteins that either are produced by the cancer cells themselves or increase in response to cancer-induced changes. However, most of them are natural compounds ordinarily present in normal cells in small amounts; cancer raises their expression, and hence their levels in the blood, body fluids, or tissues.
Structural Biology: Modeling applications and techniques at a glance
As recent advances in biology show, molecular machines, especially proteins, RNA, and other complex molecules, play the central role in cell functionality. A very large part of systems biology is therefore concerned with the interactions of such molecular components. Drug companies and research institutes are trying hard to better understand these interactions, and because the costs of such projects are so high, they are often funded by governments or for-profit companies. With this in mind, techniques such as simulation are always a very good candidate to reduce these costs and to give scientists a preview of project results before undertaking costly experiments. Even projects that compute only an approximation to the problem are not cheap, so it is of utmost importance to devise simulation techniques that both reduce project costs and predict more accurately. Since systems biology and proteomics, the study of proteins and their functions, are central to drug discovery, to understanding cell functionality, and to uncovering the causes of disease, we need advanced software and algorithms that can predict the structure of molecular components and provide researchers with computational tools to analyze such models. In this paper we review the importance of molecular modeling, its limitations, and its applications.
Fair and high performance shared memory resource management
Chip multiprocessors (CMPs) commonly share a large portion of memory
system resources among different cores. Since memory requests from
different threads executing on different cores significantly interfere
with one another in these shared resources, the design of the shared
memory subsystem is crucial for achieving high performance and
fairness.
Inter-thread memory system interference has different implications
based on the type of workload running on a CMP. In multi-programmed
workloads, different applications can experience significantly
different slowdowns. If left uncontrolled, large disparities in
slowdowns result in low system performance and make system software's
priority-based thread scheduling policies ineffective. In a single
multi-threaded application, memory system interference between threads
of the same application can slow each thread down significantly. Most
importantly, the critical path of execution can also be
significantly slowed down, resulting in increased application
execution time.
This dissertation proposes three mechanisms that address different
shortcomings of current shared resource management techniques targeted
at multi-programmed workloads, and one mechanism which speeds up a
single multi-threaded application by managing main-memory related
interference between its different threads.
With multi-programmed workloads, the key idea is that both demand- and
prefetch-caused inter-application interference should be taken into
account in shared resource management techniques across the entire
shared memory system. Our evaluations demonstrate that doing so
significantly improves both system performance and fairness compared
to the state-of-the-art. When executing a single multi-threaded
application on a CMP, the key idea is to take into account the
inter-dependence of threads in memory scheduling decisions. Our
evaluation shows that doing so significantly reduces the execution
time of the multi-threaded application compared to using
state-of-the-art memory schedulers designed for multi-programmed
workloads.
This dissertation concludes that the performance and fairness of CMPs
can be significantly improved by better management of inter-thread
interference in the shared memory resources, both for multi-programmed
workloads and multi-threaded applications.
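The core idea for multi-programmed workloads, keeping slowdowns balanced across applications, can be sketched in a few lines. This is a minimal illustration of slowdown-aware scheduling, not the dissertation's exact mechanism; the function names and the slowdown estimates are hypothetical.

```python
# Minimal sketch of slowdown-aware shared-resource arbitration (an
# illustration of the principle, not the dissertation's mechanism):
# service the pending request of the application currently suffering
# the largest slowdown relative to running alone.

def slowdown(shared_time, alone_time):
    """Slowdown = execution time under sharing / execution time alone."""
    return shared_time / alone_time

def pick_next(requests, shared_times, alone_times):
    """requests: app ids with a pending memory request; dicts map app -> time."""
    return max(requests,
               key=lambda app: slowdown(shared_times[app], alone_times[app]))
```

Prioritizing the most-slowed-down application narrows the disparity in slowdowns, which is exactly what keeps priority-based OS scheduling policies meaningful on a CMP.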
Techniques for Bandwidth-Efficient Prefetching of Linked Data Structures in Hybrid Prefetching Systems
Linked data structure (LDS) accesses are critical to the performance of many large-scale applications. Techniques have been proposed to prefetch such accesses. Unfortunately, many LDS prefetching techniques 1) generate a large number of useless prefetches, thereby degrading performance and bandwidth efficiency, 2) require significant hardware or storage cost, or 3) when employed together with stream-based prefetchers, cause significant resource contention in the memory system. As a result, existing processors do not employ LDS prefetchers even though they commonly employ stream-based prefetchers. This paper proposes a low-cost hardware/software cooperative technique that enables bandwidth-efficient prefetching of linked data structures. Our solution has two new components: 1) a compiler-guided prefetch filtering mechanism that informs the hardware about which pointer addresses to prefetch, and 2) a coordinated prefetcher throttling mechanism that uses run-time feedback to manage the interference between multiple prefetchers (LDS and stream-based) in a hybrid prefetching system. Evaluations show that the proposed solution improves average performance by 22.5% while decreasing memory bandwidth consumption by 25% over a baseline system that employs an effective stream prefetcher on a set of memory- and pointer-intensive applications. We compare our proposal to three different LDS/correlation prefetching techniques and find that it provides significantly better performance on both single-core and multi-core systems, while requiring less hardware cost.
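The two cooperating mechanisms, compiler-guided filtering and feedback-driven throttling, can be sketched as follows. This is a toy software model of the idea; the function names, thresholds, and degree bounds are our assumptions, not the paper's hardware design.

```python
# Toy sketch of the two mechanisms (illustrative assumptions, not the
# paper's hardware): a compiler-supplied filter admits only pointer
# addresses marked as profitable to prefetch, and run-time feedback
# throttles a prefetcher's degree when its accuracy drops.

def filter_prefetch(addr, compiler_hints):
    """Issue a prefetch only if the compiler marked this address useful."""
    return addr in compiler_hints

def throttle(accuracy, degree, low=0.4, high=0.8, max_degree=4):
    """Adjust prefetch degree from measured accuracy (useful / issued)."""
    if accuracy < low:
        return max(1, degree - 1)   # mostly useless: back off
    if accuracy > high:
        return min(max_degree, degree + 1)  # mostly useful: ramp up
    return degree                   # in between: hold steady
```

In a hybrid system the same feedback loop would run per prefetcher (LDS and stream-based), so a misbehaving prefetcher is throttled before it steals bandwidth from the accurate one.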