
    Modeling Algorithm Performance on Highly-threaded Many-core Architectures

    The rapid growth of data processing required across many arenas of computation over the past decades necessitates extensive use of parallel computing engines. Among these, highly-threaded many-core machines, such as GPUs, have become increasingly popular for accelerating a diverse range of data-intensive applications. They feature a large number of hardware threads with low-overhead context switches to hide memory access latencies, and therefore provide high computational throughput. However, understanding and harnessing such machines poses great challenges to algorithm designers and performance tuners because of the complex interaction of threads with the hierarchical memory subsystems of these machines. The achieved performance depends jointly on the parallelism exploited by the algorithm, the effectiveness of latency hiding, and the utilization of multiprocessors (occupancy). Contemporary work models GPU performance from various aspects, with different emphasis and granularity, but no single model considers all of these factors together. This dissertation presents an analytical framework that jointly addresses parallelism, latency hiding, and occupancy for both theoretical and empirical performance analysis of algorithms on highly-threaded many-core machines, so that it can guide both algorithm design and performance tuning. In particular, the framework not only helps to explore and reduce the runtime configuration space for tuning kernel execution on GPUs, but also reveals performance bottlenecks and predicts how runtime will trend as the problem size and other parameters scale. The framework consists of a pair of analytical models, one focusing on higher-level asymptotic algorithm performance on GPUs and the other emphasizing lower-level details of scheduling and runtime configuration. Based on the two models, we have conducted extensive analysis of a large set of algorithms; these analyses provide interesting results and explain previously unexplained data. In addition, the two models are bridged and combined into a consistent framework that offers an end-to-end methodology for algorithm design, evaluation, comparison, and implementation, and predicts real runtime on GPUs fairly accurately. To demonstrate the viability of our methods, the models are validated with data from implementations of a variety of classic algorithms, including hashing, Bloom filters, all-pairs shortest path, matrix multiplication, FFT, merge sort, list ranking, and string matching via suffix tree/array. We evaluate the models' performance across a wide spectrum of parameters, data values, and machines. The results indicate that the models can be used effectively for algorithm performance analysis and runtime prediction on highly-threaded many-core machines.
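
    As a rough illustration of the occupancy side of such an analysis, the sketch below estimates theoretical occupancy for a hypothetical kernel and prunes the block-size search space accordingly. It is not the dissertation's model; the per-SM resource limits and the kernel's register and shared-memory usage are assumed placeholder values, not figures from any specific GPU.

```python
import math

# Assumed per-SM (streaming multiprocessor) limits -- placeholders only,
# not tied to a specific GPU generation.
MAX_THREADS_PER_SM = 2048
MAX_BLOCKS_PER_SM = 32
REGISTERS_PER_SM = 64 * 1024
SHARED_MEM_PER_SM = 96 * 1024  # bytes
WARP_SIZE = 32
MAX_WARPS_PER_SM = MAX_THREADS_PER_SM // WARP_SIZE

def occupancy(block_size, regs_per_thread, smem_per_block):
    """Fraction of the SM's maximum resident warps this configuration sustains."""
    # The number of resident blocks is limited by each resource in turn.
    by_threads = MAX_THREADS_PER_SM // block_size
    by_regs = REGISTERS_PER_SM // max(1, regs_per_thread * block_size)
    by_smem = SHARED_MEM_PER_SM // smem_per_block if smem_per_block else MAX_BLOCKS_PER_SM
    blocks = min(by_threads, by_regs, by_smem, MAX_BLOCKS_PER_SM)
    warps = blocks * math.ceil(block_size / WARP_SIZE)
    return min(1.0, warps / MAX_WARPS_PER_SM)

# Prune the runtime-configuration space: keep only block sizes whose theoretical
# occupancy clears a threshold, before any empirical tuning runs.
candidates = [bs for bs in range(64, 1025, 32)
              if occupancy(bs, regs_per_thread=40, smem_per_block=4096) >= 0.5]
print(candidates)
```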

    Using Least Variance for Robust Extraction of Systolic Time Intervals

    Systolic time intervals (STIs) are used clinically as non-invasive predictors of cardiovascular disease. However, extraction accuracy generally suffers across subjects and physiological states, requiring parameter tuning for robust STI extraction. To address this challenge, an automated methodology that processes the signal with varying tuning parameters was explored. In this work, two STIs were examined: the R-wave pulse transit time to the PPG foot at the ear (rPTT) and the left ventricular ejection time (LVET). Historic feature detection algorithms were run with a range of tuning parameters over a 60-second interval, and least variance was used to select the optimal parameter for robust extraction. These least-variance algorithms were quantitatively compared to the historic, single-parameter algorithms using a positive predictive value metric. To decrease runtime, the least-variance algorithms were written to run on a GPU using CUDA. Overall, the least-variance algorithms extracted the features better than the historic algorithms, without sacrificing runtime. In addition to providing robust and reliable STI extraction, the least-variance algorithms can be adapted to extract features from any periodic data stream.
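
    The selection step described above can be sketched as follows. This is not the authors' implementation: the detector passed in is a hypothetical stand-in for any per-beat STI feature detector (for example, one returning rPTT or LVET estimates), and the parameter grid in the usage note is illustrative.

```python
import numpy as np

def least_variance_parameter(detector, signal_window, params):
    """Pick the tuning parameter whose per-beat interval estimates vary least.

    `detector(signal_window, p)` is any STI feature-detection routine that
    returns one interval estimate per beat (e.g. rPTT or LVET in ms) when run
    with tuning parameter p.
    """
    best_param, best_var = None, np.inf
    for p in params:
        intervals = np.asarray(detector(signal_window, p), dtype=float)
        if intervals.size < 2:       # this setting detected too few beats
            continue
        v = intervals.var()          # beat-to-beat spread for this parameter
        if v < best_var:
            best_param, best_var = p, v
    return best_param

# Usage with some concrete detector `my_lvet_detector` (not shown here), swept
# over one 60-second analysis window:
# best = least_variance_parameter(my_lvet_detector, window,
#                                 params=np.linspace(0.1, 0.9, 17))
```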

    PARSEC vs. SPLASH-2: A quantitative comparison of two multithreaded benchmark suites on Chip-Multiprocessors

    The PARSEC benchmark suite was recently released and has been adopted by a significant number of users within a short amount of time. This new collection of workloads is not yet fully understood by researchers. In this study we compare the SPLASH-2 and PARSEC benchmark suites with each other to gain insights into differences and similarities between the two program collections. We use standard statistical methods and machine learning to analyze the suites for redundancy and overlap on Chip-Multiprocessors (CMPs). Our analysis shows that PARSEC workloads are fundamentally different from SPLASH-2 benchmarks. The observed differences can be explained with two technology trends, the proliferation of CMPs and the accelerating growth of world data.
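
    A minimal sketch of the kind of redundancy analysis described above, using standard statistics and machine learning but not the authors' exact pipeline: workloads form the rows of a characteristic matrix (the feature set and suite labels here are placeholders), which is standardized, reduced with PCA, and compared across suites via nearest-neighbor distances.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from scipy.spatial.distance import cdist

def cross_suite_distances(features, suites, n_components=4):
    """For each workload, distance to the nearest workload of the *other* suite
    in a PCA-reduced characteristic space; small distances indicate overlap."""
    X = StandardScaler().fit_transform(features)          # remove scale effects
    Z = PCA(n_components=n_components).fit_transform(X)   # decorrelate metrics
    suites = np.asarray(suites)
    a = Z[suites == "PARSEC"]
    b = Z[suites == "SPLASH-2"]
    d = cdist(a, b)                                       # pairwise Euclidean distances
    return d.min(axis=1), d.min(axis=0)

# Large nearest-neighbor distances in both directions would suggest that the
# two suites stress fundamentally different program behaviors.
```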