1,102 research outputs found

GPUs as Storage System Accelerators

Author: Al-Kiswany Samer
Gharaibeh Abdullah
Ripeanu Matei
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 16/05/2012

Massively multicore processors, such as Graphics Processing Units (GPUs), provide, at a comparable price, peak performance one order of magnitude higher than that of traditional CPUs. This drop in the cost of computation, like any order-of-magnitude drop in the cost per unit of performance for a class of system components, creates the opportunity to redesign systems and to explore new ways of engineering them to recalibrate the cost-to-performance relation. This project explores the feasibility of harnessing GPUs' computational power to improve the performance, reliability, or security of distributed storage systems. In this context, we present the design of a storage system prototype that uses GPU offloading to accelerate a number of computationally intensive primitives based on hashing, and we introduce techniques to efficiently leverage the processing power of GPUs. We evaluate the performance of this prototype under two configurations: as a content addressable storage system that facilitates online similarity detection between successive versions of the same file, and as a traditional system that uses hashing to preserve data integrity. Further, we evaluate the impact of offloading to the GPU on competing applications' performance. Our results show that this technique can bring tangible performance gains without negatively impacting the performance of concurrently running applications. Comment: IEEE Transactions on Parallel and Distributed Systems, 201
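The two hashing-based configurations named in the abstract can be illustrated with a minimal CPU-only sketch. It is not the authors' prototype: the abstract does not say which hash function or chunking scheme is used, so SHA-256, fixed-size chunks, and the function names `chunk_digests`, `changed_chunks`, and `verify` are all assumptions made for illustration, and no GPU offloading is shown.

```python
import hashlib


def chunk_digests(data, chunk_size=4096):
    """Return the SHA-256 hex digest of each fixed-size chunk of `data`.

    Fixed-size chunking and SHA-256 are assumptions, not taken from the paper.
    """
    return [
        hashlib.sha256(data[i:i + chunk_size]).hexdigest()
        for i in range(0, len(data), chunk_size)
    ]


def changed_chunks(old, new, chunk_size=4096):
    """Indices of chunks in `new` whose digest appears nowhere in `old`.

    In a content addressable store, only these chunks would need to be written.
    """
    known = set(chunk_digests(old, chunk_size))
    return [
        i for i, digest in enumerate(chunk_digests(new, chunk_size))
        if digest not in known
    ]


def verify(data, expected_digests, chunk_size=4096):
    """Integrity check: recompute the digests and compare with the stored ones."""
    return chunk_digests(data, chunk_size) == expected_digests


# Two successive versions of a file that differ only in the middle chunk.
v1 = b"A" * 8 + b"B" * 8 + b"C" * 8
v2 = b"A" * 8 + b"X" * 8 + b"C" * 8
print(changed_chunks(v1, v2, chunk_size=8))  # [1]
```

Each chunk is hashed independently of the others, which is the property that makes this kind of workload a plausible candidate for offloading to a massively parallel device.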

arXiv.org e-Print Archive

Crossref

Parallel Implementations of Cellular Automata for Traffic Models

Author: L Dagum
O Biham
RM D’Souza
S Maerivoet
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2018

The Biham-Middleton-Levine (BML) traffic model is a simple two-dimensional, discrete Cellular Automaton (CA) that has been used to study self-organization and phase transitions arising in traffic flows. From the computational point of view, the BML model exhibits the usual features of discrete CA, where the state of the automaton is updated according to simple rules that depend on the state of each cell and its neighbors. In this paper we study the impact of various optimizations for speeding up CA computations by using the BML model as a case study. In particular, we describe and analyze the impact of several parallel implementations that rely on CPU features, such as multiple cores or SIMD instructions, and on GPUs. Experimental evaluation provides quantitative measures of the payoff of each technique in terms of speedup with respect to a plain serial implementation. Our findings show that the performance gap between CPU and GPU implementations of the BML traffic model can be reduced by clever exploitation of all CPU features.
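The update rule that the abstract refers to can be sketched in a few lines. The sketch below assumes the common formulation of the BML model: a periodic lattice on which all red cars first try to move one cell east, and then all blue cars try to move one cell south, each car advancing only if its target cell is empty. The cell encoding and the names `move` and `bml_step` are hypothetical, and the whole-array NumPy operations only stand in for the data-parallel style the paper studies; this is not the authors' code.

```python
import numpy as np

EMPTY, RED, BLUE = 0, 1, 2  # hypothetical cell encoding


def move(grid, color, axis):
    """Advance every car of `color` one cell along `axis` if the cell ahead is empty.

    All cars of that color are updated synchronously; the lattice is periodic.
    """
    ahead = np.roll(grid, -1, axis=axis)        # ahead[i, j] is the next cell along `axis`
    movers = (grid == color) & (ahead == EMPTY)
    new = grid.copy()
    new[movers] = EMPTY                         # vacate the source cells
    new[np.roll(movers, 1, axis=axis)] = color  # occupy the destination cells
    return new


def bml_step(grid):
    """One full BML step: red cars move east, then blue cars move south."""
    grid = move(grid, RED, axis=1)
    grid = move(grid, BLUE, axis=0)
    return grid


grid = np.array([[RED, BLUE, EMPTY],
                 [EMPTY, EMPTY, EMPTY]])
print(bml_step(grid))
```

In this small example the red car is blocked by the blue car to its east and stays put, while the blue car moves one cell south. Because each cell's new state depends only on itself and one neighbor, every cell can be updated independently, which is what makes the model amenable to SIMD and GPU execution.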

arXiv.org e-Print Archive

Crossref

Archivio istituzionale della ricerca - Alma Mater Studiorum Università di Bologna

X-MAP A Performance Prediction Tool for Porting Algorithms and Applications to Accelerators

Author: Shetty Ashrit
Publication venue: Clemson University Libraries
Publication date: 01/08/2017

Most modern high-performance computing systems comprise one or more accelerators with varying architectures in addition to traditional multicore Central Processing Units (CPUs). Examples of these accelerators include Graphics Processing Units (GPUs) and Intel’s Many Integrated Core architecture, called Xeon Phi (PHI). These architectures provide massive parallel computation capabilities, which yield substantial performance benefits over traditional CPUs for a variety of scientific applications. Accelerators are not all alike, because each has its own unique architecture. This difference in the underlying architecture plays a crucial role in determining whether a given accelerator will provide a significant speedup over its competition. In addition to the architecture itself, another differentiating factor for these accelerators is the programming language used to program them. For example, Nvidia GPUs can be programmed using Compute Unified Device Architecture (CUDA) and OpenCL, while Intel Xeon PHIs can be programmed using OpenMP and OpenCL. The choice of programming language also plays a critical role in the speedup obtained, depending on how close the language is to the hardware and on the level of optimization. It is therefore very difficult for an application developer to choose the ideal accelerator to achieve the best possible speedup. In light of this, we present an easy-to-use Graphical User Interface (GUI) tool called X-MAP, a performance prediction tool for porting algorithms and applications to architectures. It encompasses a machine-learning-based inference model that predicts the performance of an application on a number of well-known accelerators and, at the same time, predicts the best architecture and programming language for the application.
We do this by collecting hardware counters from a given application and predicting run time by providing this data as inputs to a Neural Network Regressor based inference model. We predict the architecture and associated programming language by providing the hardware counters as inputs to an inference model based on a Random Forest Classification Model. Finally, with a mean absolute prediction error of 8.52 and features such as syntax highlighting for multiple programming languages, a function-wise breakdown of the entire application to understand bottlenecks, and the ability for end users to submit their own prediction models to further improve the system, X-MAP is a unique tool that has a significant edge over existing performance prediction solutions.

Clemson University: TigerPrints

From Physics Model to Results: An Optimizing Framework for Cross-Architecture Code Generation

Author: Blazewicz Marek
Brandt Steven R.
Ciznicki Milosz
Hinder Ian
Kierzynka Michal
Koppelman David M.
Löffler Frank
Schnetter Erik
Tao Jian
Publication venue: 'IOS Press'
Publication date: 01/01/2013

Starting from a high-level problem description in terms of partial differential equations using abstract tensor notation, the Chemora framework discretizes, optimizes, and generates complete high performance codes for a wide range of compute architectures. Chemora extends the capabilities of Cactus, facilitating the usage of large-scale CPU/GPU systems in an efficient manner for complex applications, without low-level code tuning. Chemora achieves parallelism through MPI and multi-threading, combining OpenMP and CUDA. Optimizations include high-level code transformations, efficient loop traversal strategies, dynamically selected data and instruction cache usage strategies, and JIT compilation of GPU code tailored to the problem characteristics. The discretization is based on higher-order finite differences on multi-block domains. Chemora's capabilities are demonstrated by simulations of black hole collisions. This problem provides an acid test of the framework, as the Einstein equations contain hundreds of variables and thousands of terms. Comment: 18 pages, 4 figures, accepted for publication in Scientific Programmin

arXiv.org e-Print Archive

CiteSeerX

Directory of Open Access Journals

Louisiana State University

MPG.PuRe

Hybrid implementation of the fastICA algorithm for high-density EEG using the capabilities of the Intel architecture and CUDA programming

Author: Gajos-Balińska Anna
Stpiczyński Przemysław
Wójcik Grzegorz M.
Publication venue: AGH University of Science and Technology Department of Computer Science
Publication date: 27/12/2023

High-density electroencephalographic (EEG) systems are utilized in the study of the human brain and its underlying behaviors. However, working with EEG data requires a well-cleaned signal, which is often achieved through the use of independent component analysis (ICA) methods. The computation time of such algorithms grows with the amount of data. This article presents a hybrid implementation of the fastICA algorithm that uses parallel programming techniques (libraries and extensions of the Intel processors and CUDA programming), which results in a significant acceleration of execution time on selected architectures.

AGH (Akademia Górniczo-Hutnicza) University of Science and Technology: Journals

Computer Science Journal (AGH University of Science and Technology, Krakow)

Automated CNC Tool Path Planning and Machining Simulation on Highly Parallel Computing Architectures

Author: Konobrytskyi Dmytro
Publication venue: Clemson University Libraries
Publication date: 01/05/2013

This work has created a completely new geometry representation for the CAD/CAM area that was designed from the outset for highly parallel, scalable environments. A methodology was also created for designing highly parallel and scalable algorithms that can use the developed geometry representation. The approach used in this work is to move parallel algorithm design complexity from the algorithm level to the data representation level. As a result, the developed methodology allows easy algorithm design without worrying too much about the underlying hardware. However, the developed algorithms are still highly parallel because the underlying geometry model is highly parallel. For validation purposes, the developed methodology and geometry representation were used for designing CNC machine simulation and tool path planning algorithms. These algorithms were then implemented and tested on a multi-GPU system. Performance evaluation of the developed algorithms has shown great parallelizability and scalability, the main algorithm properties required for modern highly parallel environments. It was also shown that GPUs are capable of performing work an order of magnitude faster than traditional central processors. The last part of the work demonstrates how the high performance that comes with highly parallel hardware can be used to develop the next level of automated CNC tool path planning systems. As a proof of concept, a fully automated tool path planning system capable of generating valid G-code programs for 5-axis CNC milling machines was developed. For validation purposes, the developed system was used to generate tool paths for some parts, and the results were used for machining simulation and experimental machining. Experimental results have shown, on the one hand, that the developed system works and, on the other hand, that highly parallel hardware brings computational resources for algorithms that were not even considered before due to their computational requirements, but that can provide the next level of automation for modern manufacturing systems.

Clemson University: TigerPrints