Search CORE

961 research outputs found

Recommended from our members

Accurate modeling of core and memory locality for proxy generation targeting emerging applications and architectures

Author: Panda Reena
Publication venue
Publication date: 21/03/2018
Field of study

Designing optimal computer systems for improved performance and energy efficiency requires architects and designers to have a deep understanding of the end-user workloads. However, many end-users (e.g., large corporations, banks, defense organizations, etc.) are apprehensive to share their applications with designers due to the confidential nature of software code and data. In addition, emerging applications pose significant challenges to early design space exploration due to their long-running nature and the highly complex nature of their software stack that cannot be supported on many early performance models. The above challenges can be overcome by using a proxy benchmark. A miniaturized proxy benchmark can be used as a substitute of the original workload to perform early computer performance evaluation. The process of generating a proxy benchmark consists of extracting a set of key statistics to summarize the behavior of end-user applications through profiling and using the collected statistics to synthesize a representative proxy benchmark. Using such proxy benchmarks can help designers to understand the behavior of end-user’s workloads in a reasonable time without the users having to disclose sensitive information about their workloads. Prior proxy benchmarking schemes leverage micro-architecture independent metrics, derived from detailed simulation tools, to generate proxy benchmarks. However, many emerging workloads do not work reliably with many profiling or simulation tools, in which case it becomes impossible to apply prior proxy generation techniques to generate proxy benchmarks for such complex applications. Furthermore, these techniques model instruction pipeline-level locality in great detail, but abstract out memory locality modeling using simple stride-based models. This results in poor cloning accuracy especially for emerging applications, which have larger memory footprints and complex access patterns. A few detailed cache and memory locality modeling techniques have also been proposed in literature. However, these techniques either model limited locality metrics and suffer from poor cloning accuracy or are fairly accurate, but at the expense of significant metadata overhead. Finally, none of the prior proxy benchmarking techniques model both core and memory locality with high accuracy. As a result, they are not useful for studying system-level performance behavior. Keeping the above key limitations and shortcomings of prior work in mind, this dissertation presents several techniques that expand the frontiers of workload proxy benchmarking, thereby enabling computer designers to gain a better and faster understanding of end-user application behavior without compromising the privileged nature of software or data. This dissertation first presents a core-level proxy benchmark generation methodology that leverages performance metrics derived from hardware performance counter measurements to create miniature proxy benchmarks targeting emerging big-data applications. The presented performance counter based characterization and associated extrapolation into generic parameters for proxy generation enables faster analysis (runs almost at native hardware speeds, unlike prior workload cloning proposals) and proxy generation for emerging applications that do not work with simulators or profiling tools. The generated proxy benchmarks are representative of the performance of the real-world big-data applications, including operating system and run-time effects, and yet converge to results quickly without needing any complex software stack support. Next, to improve upon the accuracy and efficiency of prior memory proxy benchmarking techniques, this dissertation presents a novel memory locality modeling technique that leverages localized pattern detection to create miniature memory proxy benchmarks. The presented technique models memory reference locality by decomposing an application’s memory accesses into a set of independent streams (localized by using address region based localization property), tracking fine-grained patterns within the localized streams and, finally, chaining or interleaving accesses from different localized memory streams to create an ordered proxy memory access sequence. This dissertation further extends the workload cloning approach to Graphics Processing Units (GPUs) and presents a novel proxy generation methodology to model the inherent memory access locality of GPU applications, while also accounting for the GPU’s parallel execution model. The generated memory proxy benchmarks help to enable fast and efficient design space exploration of futuristic memory hierarchies. Finally, this dissertation presents a novel technique to integrate accurate core and memory locality models to create system-level proxy benchmarks targeting emerging applications. This is a new capability that can facilitate efficient overall system (core, cache and memory subsystem) design-space exploration. This dissertation further presents a novel methodology that exploits the synthetic benchmark generation framework to create hypothetical workloads with performance behavior that does not currently exist. Such proxies can be generated to cover anticipated code trends and can represent futuristic workloads before the workloads even exist.Electrical and Computer Engineerin

Texas ScholarWorks

Dynamic HW/SW Partitioning: Configuration Scheduling and Design Space Exploration

Author: Kandasamy Santheeban
Publication venue: 'University of Waterloo'
Publication date: 01/01/2007
Field of study

Hardware/software partitioning is a process that occurs frequently in embedded system design. It is the procedure of determining whether a part of a system should be implemented in software or hardware. This dissertation is a study of hardware/software partitioning and the use of scheduling algorithms to improve the performance of dynamically reconfigurable computing devices. Reconfigurable computing devices are devices that are adaptable at the logic level to solve specific problems [Tes05]. One example of a reconfigurable computing device is the field programmable gate array (FPGA). The emergence of dynamically reconfigurable FPGAs made it possible to configure FPGAs at runtime. Most current approaches use a simple on demand configuration scheduling algorithm for the FPGA configurations. The on demand configuration scheduling algorithm reconfigures the FPGA at runtime, whenever a configuration is needed and is found not to be configured. The problem with this approach of dynamic reconfiguration is the reconfiguration time overhead, which is the time it takes to reconfigure the FPGA with a new configuration at runtime. Configuration caches and partial configuration have been proposed as possible solutions to this problem, but these techniques suffer from various limitations. The emergence of dynamically reconfigurable FPGAs also made it possible to perform dynamic hardware/software partitioning (DHSP), which is the procedure of determining at runtime whether a computation should be performed using its software or hardware implementation. The drawback of performing DHSP using configurations that are generated at runtime is that the profiling and the dynamic generation of configurations require profiling tool and synthesis tool access at runtime. This study proposes that configuration scheduling algorithms, which perform DHSP using statically generated configurations, can be developed to combine the advantages and reduce the major disadvantages of current approaches. A case study is used to compare and evaluate the tradeoffs between the currently existing approach for dynamic reconfiguration and the DHSP configuration scheduling algorithm based approach proposed in the study. A simulation model is developed to examine the performance of the various configuration scheduling algorithms. First, the difference in the execution time between the different approaches is analyzed. Afterwards, other important design criteria such as power consumption, energy consumption, area requirements and unit cost are analyzed and estimated. Also, business and marketing considerations such as time to market and development cost are considered. The study illustrates how different types of DHSP configuration scheduling algorithms can be implemented and how their performance can be evaluated using a variety of software applications. It is also shown how to evaluate when which of the approaches would be more advantageous by determining the tradeoffs that exist between them. Also the underlying factors that affect when which design alternative is more advantageous are determined and analyzed. The study shows that configuration scheduling algorithms, which perform DHSP using statically generated configurations, can be developed to combine the advantages and reduce some major disadvantages of current approaches. It is shown that there are situations where DHSP configuration scheduling algorithms can be more advantageous than the other approaches

University of Waterloo's Institutional Repository

Bridging the Gap between Application and Solid-State-Drives

Author: Zhou Jian
Publication venue: 'Information Bulletin on Variable Stars (IBVS)'
Publication date: 01/01/2018
Field of study

Data storage is one of the important and often critical parts of the computing system in terms of performance, cost, reliability, and energy. Numerous new memory technologies, such as NAND flash, phase change memory (PCM), magnetic RAM (STT-RAM) and Memristor, have emerged recently. Many of them have already entered the production system. Traditional storage optimization and caching algorithms are far from optimal because storage I/Os do not show simple locality. To provide optimal storage we need accurate predictions of I/O behavior. However, the workloads are increasingly dynamic and diverse, making the long and short time I/O prediction challenge. Because of the evolution of the storage technologies and the increasing diversity of workloads, the storage software is becoming more and more complex. For example, Flash Translation Layer (FTL) is added for NAND-flash based Solid State Disks (NAND-SSDs). However, it introduces overhead such as address translation delay and garbage collection costs. There are many recent studies aim to address the overhead. Unfortunately, there is no one-size-fits-all solution due to the variety of workloads. Despite rapidly evolving in storage technologies, the increasing heterogeneity and diversity in machines and workloads coupled with the continued data explosion exacerbate the gap between computing and storage speeds. In this dissertation, we improve the data storage performance from both top-down and bottom-up approach. First, we will investigate exposing the storage level parallelism so that applications can avoid I/O contentions and workloads skew when scheduling the jobs. Second, we will study how architecture aware task scheduling can improve the performance of the application when PCM based NVRAM are equipped. Third, we will develop an I/O correlation aware flash translation layer for NAND-flash based Solid State Disks. Fourth, we will build a DRAM-based correlation aware FTL emulator and study the performance in various filesystems

University of Central Florida (UCF): STARS (Showcase of Text, Archives, Research & Scholarship)

Optimizing the Performance of Directive-based Programming Model for GPGPUs

Author: Xu Rengan 1987-
Publication venue
Publication date: 10/07/2018
Field of study

Accelerators have been deployed on most major HPC systems. They are considered to improve the performance of many applications. Accelerators such as GPUs have an immense potential in terms of high compute capacity but programming these devices is a challenge. OpenCL, CUDA and other vendor-specific models for accelerator programming definitely offer high performance, but these are low-level models that demand excellent programming skills; moreover, they are time consuming to write and debug. In order to simplify GPU programming, several directive-based programming models have been proposed, including HMPP, PGI accelerator model and OpenACC. OpenACC has now become established as the de facto standard. We evaluate and compare these models involving several scientific applications. To study the implementation challenges and the principles and techniques of directive- based models, we built an open source OpenACC compiler on top of a main stream compiler framework (OpenUH as a branch of Open64). In this dissertation, we present the required techniques to parallelize and optimize the applications ported with OpenACC programming model. We apply both user-level optimizations in the applications and compiler and runtime-driven optimizations. The compiler optimization focuses on the parallelization of reduction operations inside nested parallel loops. To fully utilize all GPU resources, we also extend the OpenACC model to support multiple GPUs in a single node. Our application porting experience also revealed the challenge of choosing good loop schedules. The default loop schedule chosen by the compiler may not produce the best performance, so the user has to manually try different loop schedules to improve the performance. To solve this issue, we developed a locality-aware auto-tuning framework which is based on the proposed memory access cost model to help the compiler choose optimal loop schedules and guide the user to choose appropriate loop schedules.Computer Science, Department o

University of Houston Institutional Repository (UHIR)

Analyzing CUDA workloads using a detailed GPU simulator

Author
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date
Field of study

Crossref

Intra-cluster coalescing and distributed-block scheduling to reduce GPU NoC pressure

Author: Eeckhout Lieven
Kaeli David R.
Wang Lu
Wang Zhiying
Zhao Xia
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2019
Field of study

GPUs continue to boost the number of streaming multiprocessors (SMs) to provide increasingly higher compute capabilities. To construct a scalable crossbar network-on-chip (NoC) that connects the SMs to the memory controllers, a cluster structure is introduced in modern GPUs in which several SMs are grouped together to share a network port. Because of network port sharing, clustered GPUs face severe NoC congestion, which creates a critical performance bottleneck. In this paper, we target redundant network traffic to mitigate GPU NoC congestion. In particular, we observe that in many GPU-compute applications, different SMs in a cluster access shared data. Sending redundant requests to access the same memory location wastes valuable NoC bandwidth-we find on average 19 percent (and up to 48 percent) of the requests to be redundant. To remove redundant NoC traffic, we propose distributed-block scheduling, intra-cluster coalescing (ICC) and the coalesced cache (CC) to coalesce L1 cache misses within and across SMs in a cluster, respectively. Our evaluation results show that distributed-block scheduling, ICC and CC are complementary and improve both performance and energy consumption. We report an average performance improvement of 15 percent (and up to 67 percent) while at the same time reducing system energy by 6 percent (and up to 19 percent) and improving the energy-delay product (EDP) by 19 percent on average (and up to 53 percent), compared to state-of-the-art distributed CTA scheduling

Ghent University Academic Bibliography

A REUSED DISTANCE BASED ANALYSIS AND OPTIMIZATION FOR GPU CACHE

Author: Wang Dongwei
Publication venue: VCU Scholars Compass
Publication date: 01/01/2016
Field of study

As a throughput-oriented device, Graphics Processing Unit(GPU) has already integrated with cache, which is similar to CPU cores. However, the applications in GPGPU computing exhibit distinct memory access patterns. Normally, the cache, in GPU cores, suffers from threads contention and resources over-utilization, whereas few detailed works excavate the root of this phenomenon. In this work, we adequately analyze the memory accesses from twenty benchmarks based on reuse distance theory and quantify their patterns. Additionally, we discuss the optimization suggestions, and implement a Bypassing Aware(BA) Cache which could intellectually bypass the thrashing-prone candidates. BA cache is a cost efficient cache design with two extra bits in each line, they are flags to make the bypassing decision and find the victim cache line. Experimental results show that BA cache can improve the system performance around 20\% and reduce the cache miss rate around 11\% compared with traditional design

VCU Scholars Compass