65 research outputs found

    Accelerating BST Methods for Model Reduction with Graphics Processors

    Model order reduction of dynamical linear time-invariant systems appears in many scientific and engineering applications. Numerically reliable SVD-based methods for this task require O(n³) floating-point arithmetic operations, with n in the range 10³–10⁵ for many practical applications. In this paper we investigate the use of graphics processors (GPUs) to accelerate model reduction of large-scale linear systems via Balanced Stochastic Truncation, off-loading the computationally intensive tasks to this device. Experiments on a hybrid platform consisting of state-of-the-art general-purpose multi-core processors and a GPU illustrate the potential of this approach.
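Balanced Stochastic Truncation itself additionally involves a Riccati equation and is not reproduced here; as a hedged sketch of the simpler square-root balanced truncation that the SVD-based family of methods builds on (illustrative only, not the paper's GPU code; the small Kronecker-product Lyapunov solver is for demonstration and would not scale to n ≈ 10⁵):

```python
import numpy as np

def lyap(A, W):
    """Solve A X + X A' + W = 0 via Kronecker vectorization (toy sizes only)."""
    n = A.shape[0]
    K = np.kron(A, np.eye(n)) + np.kron(np.eye(n), A)
    X = np.linalg.solve(K, -W.ravel()).reshape(n, n)
    return 0.5 * (X + X.T)                      # symmetrize rounding noise

def balanced_truncation(A, B, C, r):
    """Square-root balanced truncation of a stable LTI system x' = Ax + Bu,
    y = Cx, keeping the r states with largest Hankel singular values."""
    P = lyap(A, B @ B.T)                        # controllability Gramian
    Q = lyap(A.T, C.T @ C)                      # observability Gramian
    Lp = np.linalg.cholesky(P)                  # P = Lp Lp'
    Lq = np.linalg.cholesky(Q)                  # Q = Lq Lq'
    U, hsv, Vt = np.linalg.svd(Lq.T @ Lp)       # hsv = Hankel singular values
    s = np.diag(hsv[:r] ** -0.5)
    T = Lp @ Vt[:r].T @ s                       # right projection
    W = Lq @ U[:, :r] @ s                       # left projection, W' T = I
    return W.T @ A @ T, W.T @ B, C @ T, hsv

rng = np.random.default_rng(1)
A = -np.diag(np.arange(1.0, 7.0))               # stable 6-state example system
B = rng.standard_normal((6, 2))
C = rng.standard_normal((2, 6))
Ar, Br, Cr, hsv = balanced_truncation(A, B, C, r=3)
```

The O(n³) cost the abstract mentions comes from the Lyapunov/Riccati solves, Cholesky factorizations, and the SVD, which is why off-loading them to a GPU pays off.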

    A Performance Model for GPU Architectures: Analysis and Design of Fundamental Algorithms

    Ph.D. thesis, University of Hawaiʻi at Mānoa, 2018.

    Study of Fine-Grained, Irregular Parallel Applications on a Many-Core Processor

    This dissertation demonstrates that strong speedups are attainable for a variety of parallel applications relative to the best serial and parallel implementations on commodity platforms. These results were obtained on the PRAM-inspired Explicit Multi-Threading (XMT) many-core computing platform, which is designed to efficiently support execution of both serial and parallel code and switching between the two.

    Biconnectivity: For finding the biconnected components of a graph, we demonstrate speedups of 9x to 33x on XMT relative to the best serial algorithm, using a relatively modest silicon budget; further evidence suggests that speedups of 21x to 48x are possible. For graph connectivity, we demonstrate that XMT outperforms two contemporary NVIDIA GPUs of similar or greater silicon area. Prior studies of parallel biconnectivity algorithms achieved at most a 4x speedup, and we could not find published GPU biconnectivity code to compare against.

    Triconnectivity: We present a parallel solution to the problem of determining the triconnected components of an undirected graph, obtaining significant speedups on XMT over the only published optimal (linear-time) serial implementation of a triconnected-components algorithm running on a modern CPU. To our knowledge, no other parallel implementation of a triconnected-components algorithm has been published for any platform.

    Burrows-Wheeler compression: We present novel work-optimal parallel algorithms for Burrows-Wheeler compression and decompression of strings over a constant alphabet, together with their empirical evaluation. To validate these theoretical algorithms, we implement them on XMT and show speedups of up to 25x for compression and 13x for decompression versus bzip2, the de facto standard implementation of Burrows-Wheeler compression.

    Fast Fourier transform (FFT): Using FFT as an example, we examine the impact that adopting enabling technologies such as silicon photonics would have on the performance of a many-core architecture. The results show that a single-chip many-core processor could potentially outperform a large high-performance computing cluster.

    Boosted decision trees: This chapter focuses on the hybrid memory architecture of the XMT platform, a key part of which is a flexible all-to-all interconnection network that connects processors to shared memory modules. First, to understand recent advances in GPU memory architecture and how they relate to this hybrid memory architecture, we use microbenchmarks, including list ranking. Then we contrast the scalability of full applications with that of their routines: regardless of an application's overall scalability, some routines may involve smaller problem sizes and lower levels of parallelism, perhaps even serial sections. To see how a hybrid memory architecture can benefit such applications, we simulate a computer with such an architecture and demonstrate a potential speedup of 3.3x over NVIDIA's most powerful GPU to date for XGBoost, an implementation of boosted decision trees, a timely machine-learning approach.

    Boolean satisfiability (SAT): SAT is an important, performance-hungry problem with applications in many domains. Most work on parallelizing SAT solvers, however, has focused on coarse-grained, mostly embarrassing parallelism. Here we study fine-grained parallelism that can speed up existing sequential SAT solvers, showing the potential for speedups of up to 382x across a variety of problem instances. We hope these results will stimulate future research.
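The dissertation's work-optimal parallel algorithms are not reproduced in the abstract; as a point of reference, the underlying transform that bzip2-style compressors build on can be sketched serially (a naive sorted-rotations version, for illustration only):

```python
def bwt(s):
    """Burrows-Wheeler transform via sorted rotations. Naive O(n^2 log n)
    serial sketch; the dissertation's work-optimal parallel algorithms
    are far more involved."""
    s += "\0"                                 # unique, lexicographically smallest sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)   # last column of the sorted matrix

def ibwt(t):
    """Invert the transform by repeatedly prepending the transformed
    column and re-sorting, reconstructing the rotation matrix."""
    table = [""] * len(t)
    for _ in range(len(t)):
        table = sorted(c + row for c, row in zip(t, table))
    original = next(row for row in table if row.endswith("\0"))
    return original[:-1]
```

The transform groups identical characters into runs (e.g. the two `n`s of "banana" become adjacent), which is what makes the subsequent move-to-front and entropy-coding stages of the compressor effective.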

    Solving Algebraic Riccati Equations on Hybrid CPU-GPU Platforms

    The solution of algebraic Riccati equations is required in many linear optimal and robust control methods, such as LQR, LQG, and the Kalman filter, and in model order reduction techniques like the balanced stochastic truncation method. Numerically reliable algorithms for these applications rely on the sign function method and require roughly 8n³ floating-point arithmetic operations, with n in the range 10³–10⁵ for many practical applications. In this paper we investigate the use of graphics processors (GPUs) to accelerate the solution of algebraic Riccati equations by off-loading the computationally intensive kernels to this device. Experiments on a hybrid platform composed of state-of-the-art general-purpose multi-core processors and a GPU illustrate the potential of this approach.
    Sociedad Argentina de Informática e Investigación Operativa
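The sign function method mentioned above can be sketched for the continuous-time algebraic Riccati equation AᵀX + XA − XBR⁻¹BᵀX + Q = 0. The following NumPy sketch of Roberts' approach (a plain Newton iteration without the scaling a production solver would use, and not the paper's hybrid CPU-GPU code) shows where the O(n³) inversions that GPUs can accelerate come from:

```python
import numpy as np

def care_sign(A, B, Q, R, iters=60, tol=1e-12):
    """Solve A'X + XA - X B R^{-1} B' X + Q = 0 via the matrix sign
    function (Roberts' method); illustrative sketch only."""
    n = A.shape[0]
    G = B @ np.linalg.solve(R, B.T)
    H = np.block([[A, -G], [-Q, -A.T]])          # Hamiltonian matrix
    S = H.copy()
    for _ in range(iters):                       # sign iteration: S <- (S + S^{-1}) / 2
        S_next = 0.5 * (S + np.linalg.inv(S))    # the O(n^3) kernel per step
        if np.linalg.norm(S_next - S, 1) <= tol * np.linalg.norm(S, 1):
            S = S_next
            break
        S = S_next
    I = np.eye(n)
    # Stable invariant subspace satisfies (sign(H) + I) [I; X] = 0,
    # an overdetermined linear system solved for X in the least-squares sense.
    M = np.vstack([S[:n, n:], S[n:, n:] + I])
    rhs = -np.vstack([S[:n, :n] + I, S[n:, :n]])
    X, *_ = np.linalg.lstsq(M, rhs, rcond=None)
    return 0.5 * (X + X.T)                       # symmetrize

# Double integrator with unit weights; the exact stabilizing solution
# is [[sqrt(3), 1], [1, sqrt(3)]].
A = np.array([[0.0, 1.0], [0.0, 0.0]])
B = np.array([[0.0], [1.0]])
Q = np.eye(2)
R = np.eye(1)
X = care_sign(A, B, Q, R)
residual = A.T @ X + X @ A - X @ B @ np.linalg.solve(R, B.T) @ X + Q
```

Each iteration is dominated by one n-by-n (here 2n-by-2n) matrix inversion, which is exactly the kind of dense kernel the paper off-loads to the GPU.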

    Similarity search implementations for multi-core and many-core processors

    Similarity search over a large collection of objects stored in a metric database has become an increasingly important problem. Spaghettis is an efficient data structure for indexing metric spaces. However, in real applications that process large volumes of data, query response times can become prohibitively high, and mechanisms are needed to significantly reduce the average query response time. In this sense, parallelizing the processing of metric structures is an interesting field of research, and modern multi-core and many-core systems offer a very attractive cost/performance ratio. In this paper, two new parallel implementations of range queries on the Spaghettis data structure are presented: one on a many-core processor and the other on a multi-core processor. Both implementations are compared in terms of execution time and speedup.

    A GPU-based implementation for range queries on Spaghettis data structure

    Similarity search over a large collection of objects stored in a metric database has become an increasingly important problem. Spaghettis is an efficient data structure for indexing metric spaces. However, in real applications that process large volumes of data, query response times can become prohibitively high, and mechanisms are needed to significantly reduce the average query time. In this sense, the parallelization of metric structures is an interesting field of research, and the recent emergence of GPUs as general-purpose computing platforms offers powerful parallel processing capabilities. In this paper we propose a GPU-based implementation of the Spaghettis metric structure. First, we adapt the Spaghettis structure to a GPU-based platform; we then compare the sequential and GPU-based implementations to analyse performance, showing significant improvements in execution time, with speed-up values close to 10. Keywords: databases; similarity search; metric spaces; algorithms; data structures; parallel processing; GPU; CUDA
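Neither abstract details the structure itself; as a simplified sequential sketch of the pivot-based filtering idea behind Spaghettis-style indexes (the real structure keeps each pivot's distance column sorted for sub-linear filtering, and the papers' versions are parallel):

```python
import math

def euclid(a, b):
    # Any metric works; 2-D Euclidean distance keeps the example concrete.
    return math.dist(a, b)

def build_table(points, pivots, dist):
    """Precompute the distance from every pivot to every object."""
    return [[dist(p, x) for x in points] for p in pivots]

def range_query(q, r, points, pivots, table, dist):
    """Return all objects within distance r of q, using the triangle
    inequality to discard candidates before any direct distance check."""
    dq = [dist(q, p) for p in pivots]
    hits = []
    for i, x in enumerate(points):
        # x can lie inside the query ball only if |d(q,p) - d(x,p)| <= r
        # for every pivot p; otherwise the triangle inequality excludes it.
        if all(abs(dq[j] - table[j][i]) <= r for j in range(len(pivots))):
            if dist(q, x) <= r:          # verify surviving candidates
                hits.append(x)
    return hits

points = [(x, y) for x in range(5) for y in range(5)]
pivots = [(0, 0), (4, 4)]
table = build_table(points, pivots, euclid)
hits = range_query((2.2, 1.8), 1.5, points, pivots, table, euclid)
```

The per-candidate filter checks are independent of one another, which is what makes both the multi-core and the GPU parallelizations in these papers natural.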

    Acceleration of a solar energy prediction tool using massively parallel architectures

    Over the last decade, Uruguay has strongly incorporated wind and solar energy into its energy mix. Including these energy sources in the power grid poses a significant challenge for dispatch management, mainly because of their fluctuating output. Considering this situation, and in order to simplify load dispatch (which manages the energy resources in the mix efficiently), the Facultad de Ingeniería has developed a tool capable of predicting photovoltaic solar energy generation in the country over a 96-hour horizon. One of the main drawbacks of this tool is its high computational cost, which results in restrictive runtimes. This thesis studies the tool, focusing on its most resource- and time-consuming component, the Weather Research and Forecasting (WRF) numerical weather prediction model. In a first stage of this work, the runtime of the model is analysed, concluding that one of its most expensive steps is the solar radiation computation, due, among other factors, to the numerical precision these calculations require.
    Based on this analysis, the thesis proposes a new asynchronous software architecture that decouples the solar radiation computation from the remaining atmospheric properties in WRF so that they can be computed in parallel, following a pipeline parallelism pattern. Additionally, a portion of the radiation computation is off-loaded to a massively parallel co-processor, specifically a GPU (Graphics Processing Unit) and/or a Xeon Phi processor, in order to reduce the computational load on the CPU. The experimental evaluation of this proposal on a scenario of twelve photovoltaic plants in Uruguayan territory shows that the asynchronous architecture reduces the runtimes of the original model by approximately 10% on multicore machines with a large number of cores. Furthermore, extending this architecture successfully incorporates the computing capacity of a co-processor (GPU or Xeon Phi), achieving improvements of 25% to 30% in total model runtime when both strategies are combined (asynchrony and the use of secondary compute devices).
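As a toy illustration of the pipeline pattern described above (all function bodies are hypothetical stand-ins, not WRF code): the radiation for the next step is computed on an off-load worker while the rest of the physics advances the current step, so the two overlap instead of running back-to-back:

```python
from concurrent.futures import ThreadPoolExecutor

def radiation(step):
    # Hypothetical stand-in for the expensive solar-radiation kernel
    # that the thesis off-loads to a GPU or Xeon Phi.
    return step * 0.5

def dynamics(step, rad):
    # Hypothetical stand-in for the remaining atmospheric physics,
    # which consumes the radiation field for the current step.
    return step + rad

def run(n_steps):
    results = []
    # The single worker models the secondary compute device.
    with ThreadPoolExecutor(max_workers=1) as offload:
        pending = offload.submit(radiation, 0)
        for step in range(n_steps):
            rad = pending.result()                   # radiation for this step,
                                                     # computed during the previous one
            pending = offload.submit(radiation, step + 1)  # overlap the next step's radiation
            results.append(dynamics(step, rad))      # advance the rest of the model
    return results
```

The CPU never idles waiting for the radiation kernel: while `dynamics` runs for step k, `radiation` for step k+1 is already in flight on the co-processor, which is the source of the ~10% (and, combined with off-loading, 25-30%) runtime reductions reported above.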