
    Algorithm-Level Optimizations for Scalable Parallel Graph Processing

    Efficiently processing large graphs is challenging, since parallel graph algorithms suffer from poor scalability and performance due to many factors, including heavy communication and load imbalance. Furthermore, graph algorithms are difficult to express, as users need to understand and effectively utilize the underlying execution of the algorithm on the distributed system. The performance of a graph algorithm depends not only on the characteristics of the system (such as latency, available RAM, etc.), but also on the characteristics of the input graph (small-world scale-free, mesh, long-diameter, etc.) and of the algorithm itself (sparse computation vs. dense communication). The best execution strategy therefore often depends heavily on the combination of input graph, system, and algorithm. Fine-grained expression exposes maximum parallelism in the algorithm and allows the user to concentrate on a single vertex, making it easier to express parallel graph algorithms. However, it often loses information about the machine, making it difficult to extract performance and scalability from fine-grained algorithms. To address these issues, we present a model for expressing parallel graph algorithms using a fine-grained expression. Our model decouples the algorithm writer from the underlying details of the system, the graph, and the execution and tuning of the algorithm. We also present various graph paradigms that optimize the execution of graph algorithms for different types of input graphs and systems. We show that our model is general enough to allow graph algorithms to use these graph paradigms for the best/fastest execution, and we demonstrate good performance and scalability across a variety of graphs, algorithms, and systems at up to 100,000+ cores.
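
    To make the vertex-centric, fine-grained style concrete, here is a minimal sketch of a level-synchronous BFS in which the user writes only a per-edge function and a runtime loop drives it; the Graph struct, visit() function, and serial frontier loop are illustrative assumptions, not the model or API proposed above.

```cpp
// Illustrative fine-grained (vertex-centric) BFS: the user supplies only the
// per-edge relaxation logic; partitioning, communication, and scheduling would
// be handled by the graph-processing runtime in a real system.
#include <cstdint>
#include <limits>
#include <vector>

struct Graph {
    std::vector<std::vector<std::uint32_t>> adj;  // adjacency lists
};

// Per-edge "program": relax one edge; return true if the target was newly reached.
bool visit(std::vector<std::uint32_t>& level, std::uint32_t src, std::uint32_t dst) {
    if (level[dst] == std::numeric_limits<std::uint32_t>::max()) {
        level[dst] = level[src] + 1;
        return true;
    }
    return false;
}

// Runtime loop: sweep the active frontier and apply the per-edge program.
std::vector<std::uint32_t> bfs(const Graph& g, std::uint32_t root) {
    std::vector<std::uint32_t> level(g.adj.size(),
                                     std::numeric_limits<std::uint32_t>::max());
    level[root] = 0;
    std::vector<std::uint32_t> frontier{root};
    while (!frontier.empty()) {
        std::vector<std::uint32_t> next;
        for (std::uint32_t u : frontier)
            for (std::uint32_t v : g.adj[u])
                if (visit(level, u, v)) next.push_back(v);
        frontier.swap(next);  // a real runtime would parallelize/distribute this sweep
    }
    return level;
}
```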

    Towards Efficient Hardware Acceleration of Deep Neural Networks on FPGA

    Deep neural networks (DNNs) have achieved remarkable success in many applications because of their powerful capability for data processing. Their performance in computer vision has matched, and in some areas even surpassed, human capabilities. Deep neural networks can capture complex nonlinear features; however, this ability comes at the cost of high computational and memory requirements. State-of-the-art networks require billions of arithmetic operations and millions of parameters. The brute-force computing model of DNNs often requires extremely large hardware resources, raising severe concerns about its scalability on traditional von Neumann architectures. The well-known memory wall, together with the latency introduced by the long-range connectivity and communication of DNNs, severely constrains computation efficiency. Acceleration techniques for DNNs, whether software or hardware, often suffer from poor hardware execution efficiency of the simplified model (software) or from inevitable accuracy degradation and a limited set of supportable algorithms (hardware). To preserve inference accuracy and make the hardware implementation more efficient, a close investigation into hardware/software co-design methodologies for DNNs is needed. The proposed work first presents an FPGA-based implementation framework for Recurrent Neural Network (RNN) acceleration. At the architectural level, we improve the parallelism of the RNN training scheme and reduce the computing resource requirement to enhance computation efficiency. The hardware implementation primarily targets reducing the data communication load. Secondly, we propose a data locality-aware sparse matrix-vector multiplication (SpMV) kernel. At the software level, we reorganize a large sparse matrix into many modest-sized blocks by adopting hypergraph-based partitioning and clustering. Available hardware constraints are taken into consideration for memory allocation and data access regularization. Thirdly, we present a holistic acceleration of sparse convolutional neural networks (CNNs). During network training, the data locality is regularized to ease the hardware mapping. The distributed architecture enables high computation parallelism and data reuse. The proposed research results in a hardware/software co-design methodology for fast and accurate DNN acceleration, through innovations in algorithm optimization, hardware implementation, and the interactive design process across these two domains.
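
    As a rough illustration of the kind of kernel discussed above, the sketch below performs CSR sparse matrix-vector multiplication over modest-sized row blocks so that each block's working set stays in fast memory; the CSR struct, the fixed block size, and the simple row-blocking scheme are illustrative assumptions and do not reproduce the hypergraph-partitioned kernel described in the abstract.

```cpp
// Illustrative blocked SpMV over a CSR matrix. Processing rows in fixed-size
// blocks improves data locality; the work described above additionally
// reorders the matrix via hypergraph-based partitioning and clustering.
#include <algorithm>
#include <cstddef>
#include <vector>

struct CSR {
    std::vector<std::size_t> row_ptr;  // size = rows + 1
    std::vector<std::size_t> col_idx;  // size = nnz
    std::vector<double>      val;      // size = nnz
};

// y must be pre-sized to the number of rows.
void spmv_blocked(const CSR& A, const std::vector<double>& x,
                  std::vector<double>& y, std::size_t block_rows = 256) {
    const std::size_t n = A.row_ptr.size() - 1;
    for (std::size_t r0 = 0; r0 < n; r0 += block_rows) {
        const std::size_t r1 = std::min(r0 + block_rows, n);
        for (std::size_t r = r0; r < r1; ++r) {  // rows of one block
            double acc = 0.0;
            for (std::size_t k = A.row_ptr[r]; k < A.row_ptr[r + 1]; ++k)
                acc += A.val[k] * x[A.col_idx[k]];
            y[r] = acc;
        }
    }
}
```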

    Approximate Computing Survey, Part I: Terminology and Software & Hardware Approximation Techniques

    The rapid growth of demanding applications in domains applying multimedia processing and machine learning has marked a new era for edge and cloud computing. These applications involve massive data and compute-intensive tasks, and thus typical computing paradigms in embedded systems and data centers are stressed to meet the worldwide demand for high performance. Concurrently, the landscape of the semiconductor field over the last 15 years has established power as a first-class design concern. As a result, the computing systems community is forced to find alternative design approaches to facilitate high-performance and/or power-efficient computing. Among the examined solutions, Approximate Computing has attracted ever-increasing interest, with research works applying approximations across the entire traditional computing stack, i.e., at the software, hardware, and architectural levels. Over the last decade, a plethora of approximation techniques has emerged in software (programs, frameworks, compilers, runtimes, languages), hardware (circuits, accelerators), and architectures (processors, memories). The current article is Part I of our comprehensive survey on Approximate Computing: it reviews its motivation, terminology, and principles, and it classifies and presents the technical details of state-of-the-art software and hardware approximation techniques. Comment: Under review at ACM Computing Surveys.
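
    As one concrete instance of the software-level techniques this survey covers, the sketch below applies loop perforation: it skips a fraction of loop iterations and rescales the result, trading a small accuracy loss for fewer operations. The function names and the rescaling choice are illustrative, not taken from the survey.

```cpp
// Illustrative software approximation (loop perforation): visit only every
// `stride`-th element and rescale the partial sum to estimate the full sum.
#include <cstddef>
#include <vector>

double sum_exact(const std::vector<double>& v) {
    double s = 0.0;
    for (double x : v) s += x;
    return s;
}

// stride >= 1; stride == 1 reproduces the exact result.
double sum_perforated(const std::vector<double>& v, std::size_t stride) {
    double s = 0.0;
    std::size_t visited = 0;
    for (std::size_t i = 0; i < v.size(); i += stride) {  // skipped iterations save time
        s += v[i];
        ++visited;
    }
    // Rescale by the inverse sampling ratio so the estimate approximates the full sum.
    return visited ? s * (static_cast<double>(v.size()) / visited) : 0.0;
}
```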

    Scalable Graph Analysis and Clustering on Commodity Hardware

    The abundance of large-scale datasets in both industry and academia today has led to a need for scalable data analysis frameworks and libraries. This is exceedingly apparent for large-scale graph datasets. The vast majority of existing frameworks focus on distributing computation within a cluster, neglecting to fully utilize each individual node, which leads to poor overall performance. This thesis is motivated by the prevalence of Non-Uniform Memory Access (NUMA) architectures within multicore machines and by advancements in the performance of external memory devices such as SSDs. It focuses on the development of machine learning frameworks, libraries, and application development principles that enable scalable data analysis with minimal resource consumption. We develop novel optimizations that leverage fine-grained I/O and NUMA-awareness to advance the state of the art in scalable graph analytics and machine learning. We focus on minimality, scalability, and memory parallelism when data reside (i) in memory, (ii) semi-externally, or (iii) in distributed memory. We target two core areas: (i) graph analytics and (ii) community detection (clustering). The semi-external memory (SEM) paradigm is an attractive middle ground between limited resource consumption and near-in-memory performance on a single thick compute node. In recent years its adoption has steadily risen in popularity with framework developers, despite limited adoption by application developers. We address key questions surrounding the development of state-of-the-art applications within an SEM, vertex-centric graph framework; our goal is to lower the barrier to entry for SEM, vertex-centric application development. To that end, we develop Graphyti, a library of highly optimized semi-external memory applications built on the FlashGraph framework, and use it to identify the core principles that underlie the development of state-of-the-art vertex-centric graph applications in SEM. We then address scaling the task of community detection through clustering under arbitrary hardware budgets, developing the clusterNOR extensible clustering framework and library with facilities for optimized scale-out and scale-up computation. In summary, this thesis develops key SEM design principles for graph analytics and introduces novel algorithmic and systems-oriented optimizations for scalable algorithms that follow a two-step Majorize-Minimization or Minorize-Maximization (MM) objective function optimization pattern. The optimizations we develop enable the applications and libraries provided to attain state-of-the-art performance in varying memory settings.
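
    For readers unfamiliar with the two-step MM pattern mentioned above, a classic instance is Lloyd's k-means: fixing the centroids and assigning points builds a surrogate bound on the objective, and recomputing the cluster means minimizes that surrogate in closed form. The sketch below illustrates only this standard pattern; it is not clusterNOR's implementation.

```cpp
// Illustrative MM (Majorize-Minimization) pattern via Lloyd's k-means:
// step 1: fix centroids, assign each point to its nearest centroid (build surrogate);
// step 2: fix assignments, move each centroid to its cluster mean (minimize surrogate).
#include <cstddef>
#include <limits>
#include <vector>

using Point = std::vector<double>;

static double sqdist(const Point& a, const Point& b) {
    double d = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) d += (a[i] - b[i]) * (a[i] - b[i]);
    return d;
}

// pts must be non-empty; centroids holds the k initial centers and is updated in place.
void kmeans(const std::vector<Point>& pts, std::vector<Point>& centroids,
            std::size_t iters) {
    const std::size_t k = centroids.size(), dim = pts.front().size();
    std::vector<std::size_t> assign(pts.size(), 0);
    for (std::size_t it = 0; it < iters; ++it) {
        // Step 1: assignment.
        for (std::size_t p = 0; p < pts.size(); ++p) {
            double best = std::numeric_limits<double>::max();
            for (std::size_t c = 0; c < k; ++c) {
                const double d = sqdist(pts[p], centroids[c]);
                if (d < best) { best = d; assign[p] = c; }
            }
        }
        // Step 2: update.
        std::vector<Point> sum(k, Point(dim, 0.0));
        std::vector<std::size_t> cnt(k, 0);
        for (std::size_t p = 0; p < pts.size(); ++p) {
            for (std::size_t i = 0; i < dim; ++i) sum[assign[p]][i] += pts[p][i];
            ++cnt[assign[p]];
        }
        for (std::size_t c = 0; c < k; ++c)
            if (cnt[c])
                for (std::size_t i = 0; i < dim; ++i)
                    centroids[c][i] = sum[c][i] / cnt[c];
    }
}
```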

    HPAC-Offload: Accelerating HPC Applications with Portable Approximate Computing on the GPU

    The end of Dennard scaling and the slowdown of Moore's law led to a shift in technology trends toward parallel architectures, particularly in HPC systems. To continue providing performance benefits, HPC should embrace Approximate Computing (AC), which trades application quality loss for improved performance. However, existing AC techniques have not been extensively applied and evaluated on state-of-the-art hardware architectures such as GPUs, the primary execution vehicle for HPC applications today. This paper presents HPAC-Offload, a pragma-based programming model that extends OpenMP offload applications to support AC techniques, allowing portable approximations across different GPU architectures. We conduct a comprehensive performance analysis of HPAC-Offload across GPU-accelerated HPC applications, revealing that AC techniques can significantly accelerate HPC applications (1.64x for LULESH on AMD, 1.57x on NVIDIA) with minimal quality loss (0.1%). Our analysis offers deep insights into the performance of GPU-based AC that guide the future development of AC algorithms and systems for these architectures. Comment: 12 pages. Accepted at SC2
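
    For context, the sketch below shows a plain OpenMP target-offload kernel of the kind such a pragma-based model annotates; it uses only standard OpenMP directives, and a comment marks where an approximation directive would go. It does not reproduce HPAC-Offload's actual pragma syntax, and saxpy_offload is a made-up example function.

```cpp
// Illustrative OpenMP target-offload kernel (SAXPY). A pragma-based AC model
// like the one described above would add an approximation directive around a
// region such as this loop to select a technique (e.g., perforation or
// memoization); that directive's syntax is intentionally not shown here.
#include <cstddef>
#include <vector>

void saxpy_offload(float a, std::vector<float>& x, const std::vector<float>& y) {
    float* xp = x.data();
    const float* yp = y.data();
    const std::size_t n = x.size();
    // Standard OpenMP offload: map the arrays to the GPU and distribute the loop.
    #pragma omp target teams distribute parallel for \
        map(tofrom: xp[0:n]) map(to: yp[0:n])
    for (std::size_t i = 0; i < n; ++i)
        xp[i] = a * xp[i] + yp[i];
}
```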

    GUNDAM : A toolkit for fast spatial correlation functions in galaxy surveys

    We describe the capabilities of a new software package to calculate two-point correlation functions (2PCFs) of large galaxy samples. The code can efficiently estimate 3D/projected/angular 2PCFs with a variety of statistical estimators and bootstrap errors, and is intended to provide a complete framework (including calculation, storage, manipulation, and plotting) for performing this type of spatial analysis with large redshift surveys. GUNDAM implements a very fast skip list/linked list algorithm that efficiently counts galaxy pairs and avoids the computation of unnecessary distances. It is several orders of magnitude faster than a naive pair counter, and matches or even surpasses other advanced algorithms. The implementation is also embarrassingly parallel, making full use of multicore processors or large computational clusters when available. The software is designed to be flexible, user friendly, and easily extensible, integrating optimized, well-tested packages already available in the astronomy community. Out of the box, it already provides advanced features such as custom weighting schemes, fibre collision corrections, and 2D correlations. GUNDAM will ultimately provide an efficient toolkit to analyse the large-scale structure 'buried' in the extremely large data sets generated by future surveys.
    Fil: Donoso, Emilio. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - San Juan. Universidad Nacional de San Juan. Instituto de Ciencias Astronómicas, de la Tierra y del Espacio; Argentina.
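
    To make the quantity being estimated concrete, the sketch below pairs a brute-force pair counter with the standard Landy-Szalay estimator, xi(r) = (DD - 2DR + RR) / RR per separation bin, computed from normalized data-data, data-random, and random-random pair counts. This is only a reference implementation of the textbook definitions; GUNDAM's contribution is performing the counting far faster via its skip list/linked list scheme and parallelism.

```cpp
// Illustrative brute-force pair counting plus the Landy-Szalay 2PCF estimator.
#include <cmath>
#include <cstddef>
#include <vector>

struct P3 { double x, y, z; };

// Histogram pair separations between catalogues `a` and `b` into `nbins`
// linear bins over [0, rmax). Counts should be normalized by the total number
// of pairs before being fed to the estimator.
std::vector<double> count_pairs(const std::vector<P3>& a, const std::vector<P3>& b,
                                double rmax, std::size_t nbins) {
    std::vector<double> hist(nbins, 0.0);
    for (const auto& p : a)
        for (const auto& q : b) {
            const double dx = p.x - q.x, dy = p.y - q.y, dz = p.z - q.z;
            const double r = std::sqrt(dx * dx + dy * dy + dz * dz);
            if (r > 0.0 && r < rmax)
                hist[static_cast<std::size_t>(r / rmax * nbins)] += 1.0;
        }
    return hist;
}

// Landy-Szalay estimator per bin: xi = (DD - 2*DR + RR) / RR, with dd, dr, rr
// already normalized by their respective total pair counts.
std::vector<double> landy_szalay(const std::vector<double>& dd,
                                 const std::vector<double>& dr,
                                 const std::vector<double>& rr) {
    std::vector<double> xi(dd.size(), 0.0);
    for (std::size_t i = 0; i < dd.size(); ++i)
        if (rr[i] > 0.0) xi[i] = (dd[i] - 2.0 * dr[i] + rr[i]) / rr[i];
    return xi;
}
```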