Algorithm-Level Optimizations for Scalable Parallel Graph Processing
Efficiently processing large graphs is challenging, since parallel graph algorithms suffer from
poor scalability and performance due to many factors, including heavy communication and load imbalance.
Furthermore, it is difficult to express graph algorithms, as users need to understand
and effectively utilize the underlying execution of the algorithm on the distributed system. The
performance of graph algorithms depends not only on the characteristics of the system (such as
latency, available RAM, etc.), but also on the characteristics of the input graph (small-world scale-free,
mesh, long-diameter, etc.), and characteristics of the algorithm (sparse computation vs. dense
communication). The best execution strategy, therefore, often heavily depends on the combination
of input graph, system and algorithm.
Fine-grained expression exposes maximum parallelism in the algorithm and allows the user to
concentrate on a single vertex, making it easier to express parallel graph algorithms. However,
this abstraction often hides information about the machine, making it difficult to extract performance and
scalability from fine-grained algorithms.
To address these issues, we present a model for expressing parallel graph algorithms using a
fine-grained expression. Our model decouples the algorithm-writer from the underlying details
of the system, graph, and execution and tuning of the algorithm. We also present various graph
paradigms that optimize the execution of graph algorithms for various types of input graphs and
systems. We show our model is general enough to allow graph algorithms to use the various graph
paradigms for the best/fastest execution, and demonstrate good performance and scalability for
a variety of graphs, algorithms, and systems at scales of 100,000+ cores.
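As a concrete illustration of the fine-grained, per-vertex style of expression described above, the sketch below writes the inner step of single-source shortest paths as a single-edge relaxation. The Vertex and Edge types and the relax operation are hypothetical stand-ins, not the model's actual interface; scheduling, partitioning, and communication are assumed to be handled by an unspecified runtime.

```cpp
// Illustrative sketch only: a fine-grained, vertex-centric formulation of
// single-source shortest paths. The types and the relax() operation are
// hypothetical and stand in for the kind of per-vertex expression the
// abstract describes.
#include <cstdint>
#include <limits>
#include <vector>

struct Vertex {
    double dist = std::numeric_limits<double>::infinity();
};

struct Edge {
    std::uint64_t target;
    double weight;
};

// One fine-grained operation: relax a single edge. The runtime (not shown)
// decides how these per-vertex operations are scheduled, partitioned, and
// communicated across the distributed system.
inline bool relax(std::vector<Vertex>& v, std::uint64_t src, const Edge& e) {
    double candidate = v[src].dist + e.weight;
    if (candidate < v[e.target].dist) {
        v[e.target].dist = candidate;  // a real parallel run would need an atomic min here
        return true;                   // the target vertex becomes active in the next round
    }
    return false;
}
```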
Towards Efficient Hardware Acceleration of Deep Neural Networks on FPGA
Deep neural networks (DNNs) have achieved remarkable success in many applications because of their powerful data-processing capability. Their performance in computer vision has matched, and in some areas even surpassed, human capabilities. Deep neural networks can capture complex nonlinear features; however, this ability comes at the cost of high computational and memory requirements. State-of-the-art networks require billions of arithmetic operations and millions of parameters. The brute-force computing model of DNNs often requires extremely large hardware resources, raising severe concerns about scalability on traditional von Neumann architectures. The well-known memory wall, together with the latency introduced by the long-range connectivity and communication of DNNs, severely constrains their computational efficiency. Acceleration techniques for DNNs, whether software or hardware, often suffer from poor hardware execution efficiency of the simplified model (software) or from inevitable accuracy degradation and a limited set of supported algorithms (hardware). To preserve inference accuracy while making the hardware implementation more efficient, a close investigation into hardware/software co-design methodologies for DNNs is needed.
The proposed work first presents an FPGA-based implementation framework for Recurrent Neural Network (RNN) acceleration. At the architectural level, we improve the parallelism of the RNN training scheme and reduce the computing resource requirement to enhance computation efficiency. The hardware implementation primarily targets reducing the data communication load. Secondly, we propose a data-locality-aware sparse matrix-vector multiplication (SpMV) kernel. At the software level, we reorganize a large sparse matrix into many modest-sized blocks by adopting hypergraph-based partitioning and clustering. Available hardware constraints are taken into consideration for memory allocation and data access regularization. Thirdly, we present a holistic acceleration approach for sparse convolutional neural networks (CNNs). During network training, data locality is regularized to ease the hardware mapping. The distributed architecture enables high computation parallelism and data reuse. The proposed research results in a hardware/software co-design methodology for fast and accurate DNN acceleration, through innovations in algorithm optimization, hardware implementation, and the interactive design process across these two domains.
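To make the data-locality-aware SpMV idea concrete, the following minimal sketch multiplies a sparse matrix stored as modest-sized CSR blocks against a dense vector; the block decomposition is assumed to come from an offline (e.g., hypergraph-based) partitioner. The CsrBlock layout and function name are illustrative assumptions, not the thesis implementation.

```cpp
#include <cstddef>
#include <vector>

// One modest-sized partition of the reordered matrix, stored in local CSR form.
struct CsrBlock {
    std::size_t row_begin = 0;            // offset of this block's rows in the full matrix
    std::size_t col_begin = 0;            // offset of this block's columns in the full matrix
    std::vector<std::size_t> row_ptr;     // local CSR row pointers (length = local rows + 1)
    std::vector<std::size_t> col_idx;     // local column indices
    std::vector<double> val;              // nonzero values
};

// y += A * x, processing one locality block at a time so each block's slice of
// x can stay resident in fast (on-chip or cache) memory.
void spmv_blocked(const std::vector<CsrBlock>& blocks,
                  const std::vector<double>& x, std::vector<double>& y) {
    for (const auto& b : blocks) {
        for (std::size_t r = 0; r + 1 < b.row_ptr.size(); ++r) {
            double acc = 0.0;
            for (std::size_t k = b.row_ptr[r]; k < b.row_ptr[r + 1]; ++k)
                acc += b.val[k] * x[b.col_begin + b.col_idx[k]];
            y[b.row_begin + r] += acc;
        }
    }
}
```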
Approximate Computing Survey, Part I: Terminology and Software & Hardware Approximation Techniques
The rapid growth of demanding applications in domains applying multimedia
processing and machine learning has marked a new era for edge and cloud
computing. These applications involve massive data and compute-intensive tasks,
and thus, typical computing paradigms in embedded systems and data centers are
stressed to meet the worldwide demand for high performance. Concurrently, the
landscape of the semiconductor field over the last 15 years has established power
as a first-class design concern. As a result, the community of computing
systems is forced to find alternative design approaches to facilitate
high-performance and/or power-efficient computing. Among the examined
solutions, Approximate Computing has attracted an ever-increasing interest,
with research works applying approximations across the entire traditional
computing stack, i.e., at the software, hardware, and architectural levels. Over
the last decade, a plethora of approximation techniques has emerged in software
(programs, frameworks, compilers, runtimes, languages), hardware (circuits,
accelerators), and architectures (processors, memories). The current article is
Part I of our comprehensive survey on Approximate Computing, and it reviews its
motivation, terminology, and principles, and it classifies and presents the
technical details of the state-of-the-art software and hardware approximation
techniques.
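As a small example of the kind of software-level approximation such surveys cover, the sketch below applies loop perforation to a reduction: a fraction of iterations is skipped in exchange for a cheaper, approximate result. The function names and the stride value are illustrative choices, not drawn from the survey.

```cpp
#include <cstddef>
#include <vector>

// Exact reduction: visits every element.
double mean_exact(const std::vector<double>& v) {
    double s = 0.0;
    for (std::size_t i = 0; i < v.size(); ++i) s += v[i];
    return v.empty() ? 0.0 : s / v.size();
}

// Perforated reduction: skips iterations with a fixed stride, trading accuracy
// for time. The stride of 4 is an arbitrary example, not a recommended setting.
double mean_perforated(const std::vector<double>& v, std::size_t stride = 4) {
    double s = 0.0;
    std::size_t n = 0;
    for (std::size_t i = 0; i < v.size(); i += stride) { s += v[i]; ++n; }
    return n ? s / n : 0.0;
}
```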
Scalable Graph Analysis and Clustering on Commodity Hardware
The abundance of large-scale datasets both in industry and academia today has
led to a need for scalable data analysis frameworks and libraries.
This assertion is exceedingly apparent in large-scale graph datasets.
The vast majority of existing frameworks focus on distributing computation
within a cluster, neglecting to fully utilize each individual node,
leading to poor overall performance. This thesis is motivated by the prevalence of
Non-Uniform Memory Access (NUMA) architectures within multicore machines and
the advancements in the performance of external memory devices like SSDs.
This thesis focuses on the development of machine
learning frameworks, libraries, and application development principles to enable
scalable data analysis, with minimal resource consumption. We develop novel
optimizations that leverage fine-grain I/O and NUMA-awareness to advance
the state-of-the-art within the areas of scalable graph analytics and machine
learning.
We focus on minimality, scalability, and memory
parallelism when data reside in (i) memory, (ii) semi-external memory, or
(iii) distributed memory. We target two core areas:
(i) graph analytics and (ii) community detection (clustering).
The semi-external memory (SEM) paradigm is an attractive middle ground for
limited resource consumption and near-in-memory performance on a single thick
compute node. In recent years, it has steadily risen in popularity with
framework developers, despite seeing limited adoption by application developers.
We address key questions surrounding the development of state-of-the-art
applications within an SEM, vertex-centric graph framework. Our target is to
lower the barrier for entry to SEM, vertex-centric application development.
As such, we develop Graphyti, a library of highly optimized applications in
Semi-External Memory (SEM) using the FlashGraph framework. We utilize this
library to identify the core principles that underlie the development of
state-of-the-art vertex-centric graph applications in SEM.
We then address scaling the task of community detection through clustering given
arbitrary hardware budgets. We develop the clusterNOR
extensible clustering framework and library with facilities for optimized
scale-out and scale-up computation. In summary, this thesis develops
key SEM design principles for graph analytics and introduces novel algorithmic
and systems-oriented optimizations for scalable algorithms that follow a
two-step Majorize-Minimization or Minorize-Maximization (MM) objective function
optimization pattern. The optimizations we develop enable the applications and
libraries provided to attain state-of-the-art performance in varying memory
settings.
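The two-step MM pattern mentioned above can be illustrated with Lloyd's k-means, a canonical Majorize-Minimization algorithm: one step fixes the centroids and reassigns points, the next fixes the assignments and moves the centroids. The sketch below is a generic illustration of that pattern, not clusterNOR's optimized implementation.

```cpp
#include <cstddef>
#include <limits>
#include <vector>

using Point = std::vector<double>;

static double sq_dist(const Point& a, const Point& b) {
    double d = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) { double t = a[i] - b[i]; d += t * t; }
    return d;
}

// Each iteration is one MM round: a surrogate step (assignment) followed by an
// optimization step (centroid update).
void kmeans_mm(const std::vector<Point>& pts, std::vector<Point>& centroids, int iters) {
    if (pts.empty() || centroids.empty()) return;
    std::vector<std::size_t> assign(pts.size(), 0);
    for (int it = 0; it < iters; ++it) {
        // Step 1: fix centroids, assign every point to its nearest centroid.
        for (std::size_t p = 0; p < pts.size(); ++p) {
            double best = std::numeric_limits<double>::max();
            for (std::size_t c = 0; c < centroids.size(); ++c) {
                double d = sq_dist(pts[p], centroids[c]);
                if (d < best) { best = d; assign[p] = c; }
            }
        }
        // Step 2: fix assignments, move each centroid to the mean of its cluster.
        std::vector<Point> sum(centroids.size(), Point(pts[0].size(), 0.0));
        std::vector<std::size_t> count(centroids.size(), 0);
        for (std::size_t p = 0; p < pts.size(); ++p) {
            for (std::size_t i = 0; i < pts[p].size(); ++i) sum[assign[p]][i] += pts[p][i];
            ++count[assign[p]];
        }
        for (std::size_t c = 0; c < centroids.size(); ++c)
            if (count[c] > 0)
                for (std::size_t i = 0; i < sum[c].size(); ++i)
                    centroids[c][i] = sum[c][i] / count[c];
    }
}
```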
HPAC-Offload: Accelerating HPC Applications with Portable Approximate Computing on the GPU
The end of Dennard scaling and the slowdown of Moore's law led to a shift in
technology trends toward parallel architectures, particularly in HPC systems.
To continue providing performance benefits, HPC should embrace Approximate
Computing (AC), which trades application quality loss for improved performance.
However, existing AC techniques have not been extensively applied and evaluated
in state-of-the-art hardware architectures such as GPUs, the primary execution
vehicle for HPC applications today.
This paper presents HPAC-Offload, a pragma-based programming model that
extends OpenMP offload applications to support AC techniques, allowing portable
approximations across different GPU architectures. We conduct a comprehensive
performance analysis of HPAC-Offload across GPU-accelerated HPC applications,
revealing that AC techniques can significantly accelerate HPC applications
(1.64x for LULESH on AMD GPUs, 1.57x on NVIDIA GPUs) with minimal quality loss (0.1%). Our
analysis offers deep insights into the performance of GPU-based AC that guide
the future development of AC algorithms and systems for these architectures.
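To show where a pragma-based model like HPAC-Offload attaches, the sketch below offloads a simple kernel with standard OpenMP target directives; the spot where an approximation directive would annotate the region is marked with a comment, since the exact HPAC-Offload clause syntax is not reproduced here. Only the OpenMP part is standard.

```cpp
#include <cstddef>

// Standard OpenMP offload of a simple kernel; the loop runs on the GPU.
void scale_add(double* y, const double* x, double a, std::size_t n) {
    // A pragma-based AC model such as HPAC-Offload would annotate this region
    // with an approximation directive (e.g. a perforation- or memoization-style
    // clause); that directive's syntax is not shown here, so the code below is
    // plain OpenMP offload.
    #pragma omp target teams distribute parallel for map(to: x[0:n]) map(tofrom: y[0:n])
    for (std::size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```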
GUNDAM : A toolkit for fast spatial correlation functions in galaxy surveys
We describe the capabilities of a new software package to calculate two-point correlation functions (2PCFs) of large galaxy samples. The code can efficiently estimate 3D/projected/angular 2PCFs with a variety of statistical estimators and bootstrap errors, and is intended to provide a complete framework (including calculation, storage, manipulation, and plotting) to perform this type of spatial analysis with large redshift surveys. GUNDAM implements a very fast skip list/linked list algorithm that efficiently counts galaxy pairs and avoids the computation of unnecessary distances. It is several orders of magnitude faster than a naive pair counter, and matches or even surpasses other advanced algorithms. The implementation is also embarrassingly parallel, making full use of multicore processors or large computational clusters when available. The software is designed to be flexible, user-friendly, and easily extensible, integrating optimized, well-tested packages already available in the astronomy community. Out of the box, it already provides advanced features such as custom weighting schemes, fibre collision corrections, and 2D correlations. GUNDAM will ultimately provide an efficient toolkit to analyse the large-scale structure 'buried' in upcoming extremely large data sets generated by future surveys.
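For context, the sketch below shows the naive O(N^2) pair counting that GUNDAM's skip list/linked list algorithm outperforms by orders of magnitude: every pair separation is computed and histogrammed into radial bins DD(r). The Galaxy struct and function name are illustrative assumptions; GUNDAM's actual algorithm avoids most of these distance computations.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

struct Galaxy { double x, y, z; };

// Histogram all pair separations below r_max into n_bins radial bins (DD(r)).
std::vector<long long> count_pairs(const std::vector<Galaxy>& g,
                                   double r_max, std::size_t n_bins) {
    std::vector<long long> dd(n_bins, 0);
    const double bin_width = r_max / n_bins;
    for (std::size_t i = 0; i < g.size(); ++i)
        for (std::size_t j = i + 1; j < g.size(); ++j) {        // O(N^2) distance evaluations
            double dx = g[i].x - g[j].x, dy = g[i].y - g[j].y, dz = g[i].z - g[j].z;
            double r = std::sqrt(dx * dx + dy * dy + dz * dz);
            if (r < r_max) ++dd[static_cast<std::size_t>(r / bin_width)];
        }
    return dd;   // combined with RR(r) and DR(r), this feeds a 2PCF estimator
}
```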