GPUs as Storage System Accelerators
Massively multicore processors, such as Graphics Processing Units (GPUs),
provide, at a comparable price, peak performance one order of magnitude
higher than that of traditional CPUs. This drop in the cost of computation, as any
order-of-magnitude drop in the cost per unit of performance for a class of
system components, triggers the opportunity to redesign systems and to explore
new ways to engineer them to recalibrate the cost-to-performance relation. This
project explores the feasibility of harnessing GPUs' computational power to
improve the performance, reliability, or security of distributed storage
systems. In this context, we present the design of a storage system prototype
that uses GPU offloading to accelerate a number of computationally intensive
primitives based on hashing, and introduce techniques to efficiently leverage
the processing power of GPUs. We evaluate the performance of this prototype
under two configurations: as a content addressable storage system that
facilitates online similarity detection between successive versions of the same
file and as a traditional system that uses hashing to preserve data integrity.
Further, we evaluate the impact of offloading to the GPU on competing
applications' performance. Our results show that this technique can bring
tangible performance gains without negatively impacting the performance of
concurrently running applications. Comment: IEEE Transactions on Parallel and Distributed Systems, 201
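To illustrate the kind of hashing primitive the abstract refers to, the following is a minimal CPU-only sketch of hash-based similarity detection between two versions of a file. It is not the prototype's implementation: the fixed-size blocking, the tiny `BLOCK_SIZE`, the choice of SHA-1, and the function names `block_hashes` and `similarity` are all assumptions made for illustration.

```python
import hashlib

# Hypothetical, deliberately tiny block size so the example is easy to follow.
BLOCK_SIZE = 4


def block_hashes(data: bytes, block_size: int = BLOCK_SIZE) -> list:
    """Split data into fixed-size blocks and return the SHA-1 digest of each."""
    return [
        hashlib.sha1(data[i:i + block_size]).hexdigest()
        for i in range(0, len(data), block_size)
    ]


def similarity(old: bytes, new: bytes, block_size: int = BLOCK_SIZE) -> float:
    """Return the fraction of blocks in `new` whose hash already occurs in `old`."""
    known = set(block_hashes(old, block_size))
    new_hashes = block_hashes(new, block_size)
    if not new_hashes:
        return 1.0  # an empty new version has nothing left to store
    shared = sum(1 for h in new_hashes if h in known)
    return shared / len(new_hashes)


if __name__ == "__main__":
    v1 = b"aaaabbbbcccc"
    v2 = b"aaaaXXXXcccc"  # only the middle block changed
    print(similarity(v1, v2))
```

In a content addressable store, blocks whose hashes are already known need not be stored or transferred again, which is why the hash computation is a natural candidate for offloading.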
Parallel Implementations of Cellular Automata for Traffic Models
The Biham-Middleton-Levine (BML) traffic model is a simple two-dimensional,
discrete Cellular Automaton (CA) that has been used to study self-organization
and phase transitions arising in traffic flows. From the computational point of
view, the BML model exhibits the usual features of discrete CA, where the state
of the automaton is updated according to simple rules that depend on the state
of each cell and its neighbors. In this paper we study the impact of various
optimizations for speeding up CA computations by using the BML model as a case
study. In particular, we describe and analyze the impact of several parallel
implementations that rely on CPU features, such as multiple cores or SIMD
instructions, and on GPUs. Experimental evaluation provides quantitative
measures of the payoff of each technique in terms of speedup with respect to a
plain serial implementation. Our findings show that the performance gap between
CPU and GPU implementations of the BML traffic model can be reduced by clever
exploitation of all CPU features.
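For readers unfamiliar with the BML model, the following is a minimal serial sketch of its update rule, written in plain Python. It assumes the common convention that red cars move east, blue cars move south, the lattice wraps around at its edges, and the two colors move in alternating half-steps. The names `bml_half_step` and `bml_step` are hypothetical, and this is not one of the optimized implementations evaluated in the paper.

```python
EMPTY, RED, BLUE = 0, 1, 2  # assumed convention: RED moves east, BLUE moves south


def bml_half_step(grid, color):
    """Synchronously move every car of `color` one cell if its target is empty.

    `grid` is a list of rows; boundaries are periodic (wrap-around).
    """
    rows, cols = len(grid), len(grid[0])
    dr, dc = (0, 1) if color == RED else (1, 0)
    new = [row[:] for row in grid]
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != color:
                continue
            tr, tc = (r + dr) % rows, (c + dc) % cols
            # Read only the old grid so that all cars move simultaneously.
            if grid[tr][tc] == EMPTY:
                new[r][c] = EMPTY
                new[tr][tc] = color
    return new


def bml_step(grid):
    """One full step: red cars move first, then blue cars."""
    return bml_half_step(bml_half_step(grid, RED), BLUE)


if __name__ == "__main__":
    grid = [[RED, EMPTY, EMPTY],
            [BLUE, EMPTY, EMPTY],
            [EMPTY, EMPTY, EMPTY]]
    for row in bml_step(grid):
        print(row)
```

Because each cell's new state depends only on the old grid, the two nested loops are independent across cells, which is what makes the update amenable to multicore, SIMD, and GPU parallelization.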
X-MAP A Performance Prediction Tool for Porting Algorithms and Applications to Accelerators
Most modern high-performance computing systems comprise one or more accelerators with varying architectures in addition to traditional multicore Central Processing Units (CPUs). Examples of these accelerators include Graphics Processing Units (GPUs) and Intel’s Many Integrated Cores architecture, called Xeon Phi (PHI). These architectures offer massively parallel computation capabilities, which provide substantial performance benefits over traditional CPUs for a variety of scientific applications. Not all accelerators are alike, because each has its own unique architecture. This difference in the underlying architecture plays a crucial role in determining whether a given accelerator will provide a significant speedup over its competition. In addition to the architecture itself, another differentiating factor is the programming language used to program these accelerators. For example, Nvidia GPUs can be programmed using Compute Unified Device Architecture (CUDA) and OpenCL, while Intel Xeon PHIs can be programmed using OpenMP and OpenCL. The choice of programming language also plays a critical role in the speedup obtained, depending on how close the language is to the hardware and on the level of optimization. It is thus very difficult for an application developer to choose the ideal accelerator to achieve the best possible speedup. In light of this, we present X-MAP, an easy-to-use Graphical User Interface (GUI) tool for predicting performance when porting algorithms and applications to accelerators. X-MAP encompasses a machine-learning-based inference model that predicts the performance of an application on a number of well-known accelerators and, at the same time, predicts the best architecture and programming language for the application.
We do this by collecting hardware counters from a given application and predicting run time by providing this data as inputs to a Neural Network Regressor based inference model. We predict the architecture and associated programming language by providing the hardware counters as inputs to an inference model based on a Random Forest Classification Model. Finally, a mean absolute prediction error of 8.52, together with features such as syntax highlighting for multiple programming languages, a function-wise breakdown of the entire application to understand bottlenecks, and the ability for end users to submit their own prediction models to further improve the system, makes X-MAP a unique tool that has a significant edge over existing performance prediction solutions.
From Physics Model to Results: An Optimizing Framework for Cross-Architecture Code Generation
Starting from a high-level problem description in terms of partial
differential equations using abstract tensor notation, the Chemora framework
discretizes, optimizes, and generates complete high performance codes for a
wide range of compute architectures. Chemora extends the capabilities of
Cactus, facilitating the usage of large-scale CPU/GPU systems in an efficient
manner for complex applications, without low-level code tuning. Chemora
achieves parallelism through MPI and multi-threading, combining OpenMP and
CUDA. Optimizations include high-level code transformations, efficient loop
traversal strategies, dynamically selected data and instruction cache usage
strategies, and JIT compilation of GPU code tailored to the problem
characteristics. The discretization is based on higher-order finite differences
on multi-block domains. Chemora's capabilities are demonstrated by simulations
of black hole collisions. This problem provides an acid test of the framework,
as the Einstein equations contain hundreds of variables and thousands of terms. Comment: 18 pages, 4 figures, accepted for publication in Scientific
Programmin
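As a small illustration of what a higher-order finite difference looks like, the sketch below compares the standard fourth-order central stencil for a first derivative with the usual second-order one. It is a hand-written example, not code generated by Chemora; the function names and the step size `h` are assumptions chosen for illustration.

```python
import math


def d1_second_order(f, x, h=1e-2):
    """Second-order central difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)


def d1_fourth_order(f, x, h=1e-2):
    """Fourth-order central difference approximation of f'(x).

    Uses the five-point stencil with weights (1, -8, 0, 8, -1) / (12 h).
    """
    return (-f(x + 2 * h) + 8 * f(x + h) - 8 * f(x - h) + f(x - 2 * h)) / (12 * h)


if __name__ == "__main__":
    exact = math.cos(1.0)  # derivative of sin at x = 1
    print(abs(d1_second_order(math.sin, 1.0) - exact))
    print(abs(d1_fourth_order(math.sin, 1.0) - exact))
```

The wider stencil costs more arithmetic and memory traffic per grid point but reduces the truncation error from order h^2 to order h^4, which is the trade-off that loop traversal and cache strategies in such frameworks have to accommodate.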
Hybrid implementation of the fastICA algorithm for high-density EEG using the capabilities of the Intel architecture and CUDA programming
High-density electroencephalographic (EEG) systems are utilized in the study of the human brain and its underlying behaviors. However, working with EEG data requires a well-cleaned signal, which is often achieved through the use of independent component analysis (ICA) methods. The computation time of these algorithms grows with the amount of data. This article presents a hybrid implementation of the fastICA algorithm that uses parallel programming techniques (libraries and extensions of the Intel processors and CUDA programming), which results in a significant reduction of execution time on selected architectures.
Automated CNC Tool Path Planning and Machining Simulation on Highly Parallel Computing Architectures
This work has created a completely new geometry representation for the CAD/CAM area that was designed from the start for highly parallel, scalable environments. A methodology was also created for designing highly parallel and scalable algorithms that can use the developed geometry representation. The approach used in this work is to move parallel algorithm design complexity from the algorithm level to the data representation level. As a result, the developed methodology allows easy algorithm design without worrying too much about the underlying hardware. However, the developed algorithms are still highly parallel because the underlying geometry model is highly parallel. For validation purposes, the developed methodology and geometry representation were used for designing CNC machine simulation and tool path planning algorithms. These algorithms were then implemented and tested on a multi-GPU system. Performance evaluation of the developed algorithms has shown great parallelizability and scalability, and has shown that the algorithms have the main properties required for modern highly parallel environments. It was also shown that GPUs are capable of performing work an order of magnitude faster than traditional central processors. The last part of the work demonstrates how the high performance that comes with highly parallel hardware can be used to develop the next level of automated CNC tool path planning systems. As a proof of concept, a fully automated tool path planning system capable of generating valid G-code programs for 5-axis CNC milling machines was developed. For validation purposes, the developed system was used for generating tool paths for some parts, and the results were used for machining simulation and experimental machining. Experimental results have shown, on the one hand, that the developed system works and, on the other hand, that highly parallel hardware provides computational resources for algorithms that were not even considered before due to their computational requirements, but that can provide the next level of automation for modern manufacturing systems.