31,319 research outputs found
cphVB: A System for Automated Runtime Optimization and Parallelization of Vectorized Applications
Modern processor architectures, in addition to having still more cores, also
require still more consideration to memory-layout in order to run at full
capacity. The usefulness of most languages is deprecating as their
abstractions, structures or objects are hard to map onto modern processor
architectures efficiently.
The work in this paper introduces a new abstract machine framework, cphVB,
that enables vector oriented high-level programming languages to map onto a
broad range of architectures efficiently. The idea is to close the gap between
high-level languages and hardware optimized low-level implementations. By
translating high-level vector operations into an intermediate vector bytecode,
cphVB enables specialized vector engines to efficiently execute the vector
operations.
The primary success parameters are to maintain a complete abstraction from
low-level details and to provide efficient code execution across different,
modern, processors. We evaluate the presented design through a setup that
targets multi-core CPU architectures. We evaluate the performance of the
implementation using Python implementations of well-known algorithms: a jacobi
solver, a kNN search, a shallow water simulation and a synthetic stencil
simulation. All demonstrate good performance
Comprehensive Evaluation of OpenCL-based Convolutional Neural Network Accelerators in Xilinx and Altera FPGAs
Deep learning has significantly advanced the state of the art in artificial intelligence, gaining wide popularity from both industry and academia. Special interest is around Convolutional Neural Networks (CNN), which take inspiration from the hierarchical structure of the visual cortex, to form deep layers of convolutional operations, along with fully connected classifiers. Hardware implementations of these deep CNN architectures are challenged with memory bottlenecks that require many convolution and fully-connected layers demanding large amount of communication for parallel computation. Multi-core CPU based solutions have demonstrated their inadequacy for this problem due to the memory wall and low parallelism. Many-core GPU architectures show superior performance but they consume high power and also have memory constraints due to inconsistencies between cache and main memory. FPGA design solutions are also actively being explored, which allow implementing the memory hierarchy using embedded BlockRAM. This boosts the parallel use of shared memory elements between multiple processing units, avoiding data replicability and inconsistencies. This makes FPGAs potentially powerful solutions for real-time classification of CNNs. Both Altera and Xilinx have adopted OpenCL co-design framework from GPU for FPGA designs as a pseudo-automatic development solution. In this paper, a comprehensive evaluation and comparison of Altera and Xilinx OpenCL frameworks for a 5-layer deep CNN is presented. Hardware resources, temporal performance and the OpenCL architecture for CNNs are discussed. Xilinx demonstrates faster synthesis, better FPGA resource utilization and more compact boards. Altera provides multi-platforms tools, mature design community and better execution times
Garbage collection auto-tuning for Java MapReduce on Multi-Cores
MapReduce has been widely accepted as a simple programming pattern that can form the basis for efficient, large-scale, distributed data processing. The success of the MapReduce pattern has led to a variety of implementations for different computational scenarios. In this paper we present MRJ, a MapReduce Java framework for multi-core architectures. We evaluate its scalability on a four-core, hyperthreaded Intel Core i7 processor, using a set of standard MapReduce benchmarks. We investigate the significant impact that Java runtime garbage collection has on the performance and scalability of MRJ. We propose the use of memory management auto-tuning techniques based on machine learning. With our auto-tuning approach, we are able to achieve MRJ performance within 10% of optimal on 75% of our benchmark tests
Tackling Exascale Software Challenges in Molecular Dynamics Simulations with GROMACS
GROMACS is a widely used package for biomolecular simulation, and over the
last two decades it has evolved from small-scale efficiency to advanced
heterogeneous acceleration and multi-level parallelism targeting some of the
largest supercomputers in the world. Here, we describe some of the ways we have
been able to realize this through the use of parallelization on all levels,
combined with a constant focus on absolute performance. Release 4.6 of GROMACS
uses SIMD acceleration on a wide range of architectures, GPU offloading
acceleration, and both OpenMP and MPI parallelism within and between nodes,
respectively. The recent work on acceleration made it necessary to revisit the
fundamental algorithms of molecular simulation, including the concept of
neighborsearching, and we discuss the present and future challenges we see for
exascale simulation - in particular a very fine-grained task parallelism. We
also discuss the software management, code peer review and continuous
integration testing required for a project of this complexity.Comment: EASC 2014 conference proceedin
RELEASE: A High-level Paradigm for Reliable Large-scale Server Software
Erlang is a functional language with a much-emulated model for building reliable distributed systems. This paper outlines the RELEASE project, and describes the progress in the first six months. The project aim is to scale the Erlang’s radical concurrency-oriented programming paradigm to build reliable general-purpose software, such as server-based systems, on massively parallel machines. Currently Erlang has inherently scalable computation and reliability models, but in practice scalability is constrained by aspects of the language and virtual machine. We are working at three levels to address these challenges: evolving the Erlang virtual machine so that it can work effectively on large scale multicore systems; evolving the language to Scalable Distributed (SD) Erlang; developing a scalable Erlang infrastructure to integrate multiple, heterogeneous clusters. We are also developing state of the art tools that allow programmers to understand the behaviour of massively parallel SD Erlang programs. We will demonstrate the effectiveness of the RELEASE approach using demonstrators and two large case studies on a Blue Gene
Devito: Towards a generic Finite Difference DSL using Symbolic Python
Domain specific languages (DSL) have been used in a variety of fields to
express complex scientific problems in a concise manner and provide automated
performance optimization for a range of computational architectures. As such
DSLs provide a powerful mechanism to speed up scientific Python computation
that goes beyond traditional vectorization and pre-compilation approaches,
while allowing domain scientists to build applications within the comforts of
the Python software ecosystem. In this paper we present Devito, a new finite
difference DSL that provides optimized stencil computation from high-level
problem specifications based on symbolic Python expressions. We demonstrate
Devito's symbolic API and performance advantages over traditional Python
acceleration methods before highlighting its use in the scientific context of
seismic inversion problems.Comment: pyHPC 2016 conference submissio
- …