429 research outputs found
High-Level Synthesis Hardware Design for FPGA-Based Accelerators: Models, Methodologies, and Frameworks
Hardware accelerators based on field programmable gate array (FPGA) and system on chip (SoC) devices have gained attention in recent years. One of the main reasons is that these devices contain reconfigurable logic, which makes them feasible for boosting the performance of applications. High-level synthesis (HLS) tools facilitate the creation of FPGA code from a high level of abstraction using different directives to obtain an optimized hardware design based on performance metrics. However, the complexity of the design space depends on different factors such as the number of directives used in the source code, the available resources in the device, and the clock frequency. Design space exploration (DSE) techniques comprise the evaluation of multiple implementations with different combinations of directives to obtain a design with a good compromise between different metrics. This paper presents a survey of models, methodologies, and frameworks proposed for metric estimation, FPGA-based DSE, and power consumption estimation on FPGA/SoC. The main features, limitations, and trade-offs of these approaches are described. We also present the integration of existing models and frameworks in diverse research areas and identify the different challenges to be addressed
A Survey on Compiler Autotuning using Machine Learning
Since the mid-1990s, researchers have been trying to use machine-learning
based approaches to solve a number of different compiler optimization problems.
These techniques primarily enhance the quality of the obtained results and,
more importantly, make it feasible to tackle two main compiler optimization
problems: optimization selection (choosing which optimizations to apply) and
phase-ordering (choosing the order of applying optimizations). The compiler
optimization space continues to grow due to the advancement of applications,
increasing number of compiler optimizations, and new target architectures.
Generic optimization passes in compilers cannot fully leverage newly introduced
optimizations and, therefore, cannot keep up with the pace of increasing
options. This survey summarizes and classifies the recent advances in using
machine learning for the compiler optimization field, particularly on the two
major problems of (1) selecting the best optimizations and (2) the
phase-ordering of optimizations. The survey highlights the approaches taken so
far, the obtained results, the fine-grain classification among different
approaches and finally, the influential papers of the field.Comment: version 5.0 (updated on September 2018)- Preprint Version For our
Accepted Journal @ ACM CSUR 2018 (42 pages) - This survey will be updated
quarterly here (Send me your new published papers to be added in the
subsequent version) History: Received November 2016; Revised August 2017;
Revised February 2018; Accepted March 2018
Performance engineering of data-intensive applications
Data-intensive programs deal with big chunks of data and often contain compute-intensive characteristics. Among various HPC application domains, big data analytics, machine learning and the more recent deep-learning models are well-known data-intensive applications. An efficient design of such applications demands extensive knowledge of the target hardware and software, particularly the memory/cache hierarchy and the data communication among threads/processes. Such a requirement makes code development an arduous task, as inappropriate data structures and algorithm design may result in superfluous runtime, let alone hardware incompatibilities while porting the code to other platforms.
In this dissertation, we introduce a set of tools and methods for the performance engineering of parallel data-intensive programs. We start with performance profiling to gain insights on thread communications and relevant code optimizations. Then, by narrowing down our scope to deep-learning applications, we introduce our tools for enhancing the performance portability and scalability of convolutional neural networks (ConvNet) at inference and training phases.
Our first contribution is a novel performance-profiling method to unveil potential communication bottlenecks caused by data-access patterns and thread interactions. Our findings show that the data shared between a pair of threads should be reused with a reasonably short intervals to preserve data locality, yet existing profilers neglect them and mainly report the communication volume. We propose new hardware-independent metrics to characterize thread communication and provide suggestions for applying appropriate optimizations on a specific code region. Our experiments show that applying relevant optimizations improves the performance in Rodinia benchmarks by up to 56%.
For the next contribution, we developed a framework for automatic generation of efficient and performance-portable convolution kernels, including Winograd convolutions, for various GPU platforms. We employed a synergy of meta-programming, symbolic execution, and auto-tuning. The results demonstrate efficient kernels generated through an automated optimization pipeline with runtimes close to vendor deep-learning libraries, and the minimum required programming effort confirms the performance portability of our approach. Furthermore, our symbolic execution method exploits repetitive patterns in Winograd convolutions, enabling us to reduce the number of arithmetic operations by up to 62% without compromising the numerical stability.
Lastly, we investigate possible methods to scale the performance of ConvNets in training and inference phases. Our specialized training platform equipped with a novel topology-aware network pruning algorithm enables rapid training, neural architecture search, and network compression. Thus, an AI model training can be easily scaled to a multitude of compute nodes, leading to faster model design with less operating costs. Furthermore, the network compression component scales a ConvNet model down by removing redundant layers, preparing the model for a more pertinent deployment.
Altogether, this work demonstrates the necessity and shows the benefit of performance engineering and parallel programming methods in accelerating emerging data-intensive workloads. With the help of the proposed tools and techniques, we pinpoint data communication bottlenecks and achieve performance portability and scalability in data-intensive applications
NATSA: A Near-Data Processing Accelerator for Time Series Analysis
Time series analysis is a key technique for extracting and predicting events
in domains as diverse as epidemiology, genomics, neuroscience, environmental
sciences, economics, and more. Matrix profile, the state-of-the-art algorithm
to perform time series analysis, computes the most similar subsequence for a
given query subsequence within a sliced time series. Matrix profile has low
arithmetic intensity, but it typically operates on large amounts of time series
data. In current computing systems, this data needs to be moved between the
off-chip memory units and the on-chip computation units for performing matrix
profile. This causes a major performance bottleneck as data movement is
extremely costly in terms of both execution time and energy.
In this work, we present NATSA, the first Near-Data Processing accelerator
for time series analysis. The key idea is to exploit modern 3D-stacked High
Bandwidth Memory (HBM) to enable efficient and fast specialized matrix profile
computation near memory, where time series data resides. NATSA provides three
key benefits: 1) quickly computing the matrix profile for a wide range of
applications by building specialized energy-efficient floating-point arithmetic
processing units close to HBM, 2) improving the energy efficiency and execution
time by reducing the need for data movement over slow and energy-hungry buses
between the computation units and the memory units, and 3) analyzing time
series data at scale by exploiting low-latency, high-bandwidth, and
energy-efficient memory access provided by HBM. Our experimental evaluation
shows that NATSA improves performance by up to 14.2x (9.9x on average) and
reduces energy by up to 27.2x (19.4x on average), over the state-of-the-art
multi-core implementation. NATSA also improves performance by 6.3x and reduces
energy by 10.2x over a general-purpose NDP platform with 64 in-order cores.Comment: To appear in the 38th IEEE International Conference on Computer
Design (ICCD 2020
Simulation methodologies for mobile GPUs
GPUs critically rely on a complex system software stack comprising kernel- and user-space drivers and JIT compilers. Yet, existing GPU simulators typically abstract away details of the software stack and GPU instruction set. Partly, this is because GPU vendors rarely release sufficient information about their latest GPU products. However, this is also due to the lack of an integrated CPU-GPU simulation framework, which is complete and powerful enough to drive the complex GPU software environment. This has led to a situation where research on GPU architectures and compilers is largely based on outdated or greatly simplified architectures and software stacks, undermining the validity of the generated results. Making the situation even more dire, existing GPU simulation efforts are concentrated around desktop GPUs, making infrastructure for modelling mobile GPUs virtually non-existent, despite their surging importance in the GPU market. Still, mobile GPU designers are faced with the challenge of evaluating design alternatives involving hundreds of architectural configuration options and micro-architectural improvements under tight time-to-market constraints, to which currently employed design flows involving detailed, but slow simulations are not well suited. In this thesis we develop a full-system simulation environment for a mobile platform, which enables users to run a complete and unmodified software stack for a state-of-the-art mobile Arm CPU and Mali Bifrost GPU powered device, achieving 100\% architectural accuracy across all available toolchains. We demonstrate the capability of our GPU simulation framework through a number of case studies exploring modern, mobile GPU applications, and optimize them using functional simulation statistics, unavailable with other approaches or hardware. Furthermore, we develop a trace-based performance model, allowing architects to rapidly model GPU configurations in early design space exploration
Recommended from our members
Finding, Measuring, and Reducing Inefficiencies in Contemporary Computer Systems
Computer systems have become increasingly diverse and specialized in recent years. This complexity supports a wide range of new computing uses and users, but is not without cost: it has become difficult to maintain the efficiency of contemporary general purpose computing systems. Computing inefficiencies, which include nonoptimal runtimes, excessive energy use, and limits to scalability, are a serious problem that can result in an inability to apply computing to solve the world's most important problems. Beyond the complexity and vast diversity of modern computing platforms and applications, a number of factors make improving general purpose efficiency challenging, including the requirement that multiple levels of the computer system stack be examined, that legacy hardware devices and software may stand in the way of achieving efficiency, and the need to balance efficiency with reusability, programmability, security, and other goals.
This dissertation presents five case studies, each demonstrating different ways in which the measurement of emerging systems can provide actionable advice to help keep general purpose computing efficient. The first of the five case studies is Parallel Block Vectors, a new profiling method for understanding parallel programs with a fine-grained, code-centric perspective aids in both future hardware design and in optimizing software to map better to existing hardware. Second is a project that defines a new way of measuring application interference on a datacenter's worth of chip-multiprocessors, leading to improved scheduling where applications can more effectively utilize available hardware resources. Next is a project that uses the GT-Pin tool to define a method for accelerating the simulation of GPGPUs, ultimately allowing for the development of future hardware with fewer inefficiencies. The fourth project is an experimental energy survey that compares and combines the latest energy efficiency solutions at different levels of the stack to properly evaluate the state of the art and to find paths forward for future energy efficiency research. The final project presented is NRG-Loops, a language extension that allows programs to measure and intelligently adapt their own power and energy use
- …