
    Modeling Algorithm Performance on Highly-threaded Many-core Architectures

    The rapid growth of data processing required across many arenas of computation over the past decades necessitates extensive use of parallel computing engines. Among these, highly-threaded many-core machines, such as GPUs, have become increasingly popular for accelerating a diverse range of data-intensive applications. They feature a large number of hardware threads with low-overhead context switches to hide memory access latencies, and therefore provide high computational throughput. However, understanding and harnessing such machines poses great challenges to algorithm designers and performance tuners because of the complex interaction of threads with the hierarchical memory subsystems of these machines. The achieved performance depends jointly on the parallelism exploited by the algorithm, the effectiveness of latency hiding, and the utilization of multiprocessors (occupancy). Contemporary work models GPU performance from various aspects, with different emphasis and granularity, but no single model considers all of these factors together. This dissertation presents an analytical framework that jointly addresses parallelism, latency hiding, and occupancy for both theoretical and empirical performance analysis of algorithms on highly-threaded many-core machines, so that it can guide both algorithm design and performance tuning. In particular, the framework not only helps to explore and reduce the runtime configuration space for tuning kernel execution on GPUs, but also reveals performance bottlenecks and predicts how runtime will trend as the problem size and other parameters scale. The framework consists of a pair of analytical models, one focusing on higher-level asymptotic algorithm performance on GPUs and the other emphasizing lower-level details of scheduling and runtime configuration. Based on the two models, we have conducted extensive analysis of a large set of algorithms; these analyses provide interesting results and explain previously unexplained data. In addition, the two models are bridged and combined into a consistent framework that offers an end-to-end methodology for algorithm design, evaluation, comparison, and implementation, and predicts real runtime on GPUs fairly accurately. To demonstrate the viability of our methods, the models are validated with data from implementations of a variety of classic algorithms, including hashing, Bloom filters, all-pairs shortest path, matrix multiplication, FFT, merge sort, list ranking, and string matching via suffix tree/array. We evaluate the models' performance across a wide spectrum of parameters, data values, and machines. The results indicate that the models can be used effectively for algorithm performance analysis and runtime prediction on highly-threaded many-core machines.
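
    As a rough illustration of the occupancy side of such an analysis, the sketch below estimates theoretical occupancy for a hypothetical kernel and prunes the block-size search space accordingly. It is not the dissertation's model; the per-SM resource limits and the kernel's register and shared-memory usage are assumed placeholder values, not figures from any specific GPU.

```python
import math

# Assumed per-SM (streaming multiprocessor) limits -- placeholders only,
# not tied to a specific GPU generation.
MAX_THREADS_PER_SM = 2048
MAX_BLOCKS_PER_SM = 32
REGISTERS_PER_SM = 64 * 1024
SHARED_MEM_PER_SM = 96 * 1024  # bytes
WARP_SIZE = 32
MAX_WARPS_PER_SM = MAX_THREADS_PER_SM // WARP_SIZE

def occupancy(block_size, regs_per_thread, smem_per_block):
    """Fraction of the SM's maximum resident warps this configuration sustains."""
    # The number of resident blocks is limited by each resource in turn.
    by_threads = MAX_THREADS_PER_SM // block_size
    by_regs = REGISTERS_PER_SM // max(1, regs_per_thread * block_size)
    by_smem = SHARED_MEM_PER_SM // smem_per_block if smem_per_block else MAX_BLOCKS_PER_SM
    blocks = min(by_threads, by_regs, by_smem, MAX_BLOCKS_PER_SM)
    warps = blocks * math.ceil(block_size / WARP_SIZE)
    return min(1.0, warps / MAX_WARPS_PER_SM)

# Prune the runtime-configuration space: keep only block sizes whose theoretical
# occupancy clears a threshold, before any empirical tuning runs.
candidates = [bs for bs in range(64, 1025, 32)
              if occupancy(bs, regs_per_thread=40, smem_per_block=4096) >= 0.5]
print(candidates)
```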

    Using Least Variance for Robust Extraction of Systolic Time Intervals

    Systolic time intervals (STIs) are used clinically as non-invasive predictors of cardiovascular disease. However, extraction accuracy generally suffers across subjects and physiological states, requiring parameter tuning for robust STI extraction. To address this challenge, an automated methodology that processes the signal with varying tuning parameters was explored. In this work, two STIs were examined: the R-wave pulse transit time to the PPG foot at the ear (rPTT) and the left ventricular ejection time (LVET). Historic feature detection algorithms were run with a range of tuning parameters over a 60-second interval, and least variance was used to select the optimal parameter for robust extraction. These least-variance algorithms were quantitatively compared to the historic, single-parameter algorithms using a positive predictive value metric. To decrease runtime, the least-variance algorithms were written to run on a GPU using CUDA. Overall, the least-variance algorithms extracted the features better than the historic algorithms, without sacrificing runtime. In addition to providing robust and reliable STI extraction, the least-variance algorithms can be adapted to extract features from any periodic data stream.
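
    The selection step described above can be sketched as follows. This is not the authors' implementation: the detector passed in is a hypothetical stand-in for any per-beat STI feature detector (for example, one returning rPTT or LVET estimates), and the parameter grid in the usage note is illustrative.

```python
import numpy as np

def least_variance_parameter(detector, signal_window, params):
    """Pick the tuning parameter whose per-beat interval estimates vary least.

    `detector(signal_window, p)` is any STI feature-detection routine that
    returns one interval estimate per beat (e.g. rPTT or LVET in ms) when run
    with tuning parameter p.
    """
    best_param, best_var = None, np.inf
    for p in params:
        intervals = np.asarray(detector(signal_window, p), dtype=float)
        if intervals.size < 2:       # this setting detected too few beats
            continue
        v = intervals.var()          # beat-to-beat spread for this parameter
        if v < best_var:
            best_param, best_var = p, v
    return best_param

# Usage with some concrete detector `my_lvet_detector` (not shown here), swept
# over one 60-second analysis window:
# best = least_variance_parameter(my_lvet_detector, window,
#                                 params=np.linspace(0.1, 0.9, 17))
```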

    PARSEC vs. SPLASH-2: A quantitative comparison of two multithreaded benchmark suites on Chip-Multiprocessors

    The PARSEC benchmark suite was recently released and has been adopted by a significant number of users within a short amount of time. This new collection of workloads is not yet fully understood by researchers. In this study we compare the SPLASH-2 and PARSEC benchmark suites with each other to gain insights into differences and similarities between the two program collections. We use standard statistical methods and machine learning to analyze the suites for redundancy and overlap on Chip-Multiprocessors (CMPs). Our analysis shows that PARSEC workloads are fundamentally different from SPLASH-2 benchmarks. The observed differences can be explained with two technology trends, the proliferation of CMPs and the accelerating growth of world data.
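
    A minimal sketch of the kind of redundancy analysis described above, using standard statistics and machine learning but not the authors' exact pipeline: workloads form the rows of a characteristic matrix (the feature set and suite labels here are placeholders), which is standardized, reduced with PCA, and compared across suites via nearest-neighbor distances.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from scipy.spatial.distance import cdist

def cross_suite_distances(features, suites, n_components=4):
    """For each workload, distance to the nearest workload of the *other* suite
    in a PCA-reduced characteristic space; small distances indicate overlap."""
    X = StandardScaler().fit_transform(features)          # remove scale effects
    Z = PCA(n_components=n_components).fit_transform(X)   # decorrelate metrics
    suites = np.asarray(suites)
    a = Z[suites == "PARSEC"]
    b = Z[suites == "SPLASH-2"]
    d = cdist(a, b)                                       # pairwise Euclidean distances
    return d.min(axis=1), d.min(axis=0)

# Large nearest-neighbor distances in both directions would suggest that the
# two suites stress fundamentally different program behaviors.
```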