
    Manycore high-performance computing in bioinformatics

    Mining the increasing amount of genomic data requires very efficient tools. Efficiency gains can come from better algorithms, but one can also exploit the hardware itself to reduce application runtimes. For several years now, heat-dissipation issues have prevented processors from reaching higher clock frequencies, and parallel processing has become one of the main ways to sustain Moore's Law. Grid environments provide tools for the effective implementation of coarse-grained parallelization. More recently, another kind of hardware has attracted interest: multicore processors. Graphics processing units (GPUs) are a first step towards massively multicore processors; they give everyone access to teraflops of cheap computing power in a personal computer. The CUDA library (released in 2007) and the new OpenCL standard (specified in 2008) make programming such devices very convenient, and OpenCL is likely to gain wide industrial support and become a standard of choice for parallel programming. In all cases, the best speedups are obtained by combining careful algorithmic studies with knowledge of the computing architectures. This is especially true for the memory hierarchy: algorithms have to strike a good balance between using large (but slow) global memories and fast (but small) local memories. In this chapter, we show how these manycore devices enable more efficient bioinformatics applications. We first give some insights into architectures and parallelism. We then describe recent implementations specifically designed for manycore architectures, including algorithms for sequence alignment and RNA structure prediction. We conclude with some thoughts about the dissemination of these algorithms and implementations: are they available off the shelf for everyone today?
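
    To make the chapter's global-versus-local memory balance concrete, here is a minimal CUDA sketch (our illustration, not code from the chapter; the matching task, names, and tile size are assumptions): a tile of the query sequence is staged once in fast on-chip shared memory and then reused by every thread in the block, instead of being re-read from slow global memory.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define TILE 256  // query tile cached in fast on-chip shared memory

// Count, for each subject position, character matches against a query tile.
// The tile is read from slow global memory once per block, then reused by
// every thread -- the global/local balance the chapter describes.
__global__ void tile_match(const char* query, const char* subject,
                           int n_subject, int* scores) {
    __shared__ char q[TILE];                    // small, fast, per-block
    for (int i = threadIdx.x; i < TILE; i += blockDim.x)
        q[i] = query[i];                        // one global read per element
    __syncthreads();

    int pos = blockIdx.x * blockDim.x + threadIdx.x;
    if (pos + TILE <= n_subject) {
        int s = 0;
        for (int i = 0; i < TILE; ++i)          // shared data reused TILE times
            s += (subject[pos + i] == q[i]);
        scores[pos] = s;
    }
}

int main() {
    const int n = 4096;
    char *dq, *ds; int *dsc;
    cudaMalloc(&dq, TILE); cudaMalloc(&ds, n); cudaMalloc(&dsc, n * sizeof(int));
    cudaMemset(dq, 'A', TILE); cudaMemset(ds, 'A', n);  // all-'A' toy input
    tile_match<<<(n + 255) / 256, 256>>>(dq, ds, n, dsc);
    int first;
    cudaMemcpy(&first, dsc, sizeof(int), cudaMemcpyDeviceToHost);
    printf("score[0] = %d\n", first);           // expected: 256 (full tile match)
    cudaFree(dq); cudaFree(ds); cudaFree(dsc);
    return 0;
}
```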

    Gunrock: GPU Graph Analytics

    For large-scale graph analytics on the GPU, the irregularity of data access and control flow and the complexity of programming GPUs have presented two significant challenges to developing a programmable, high-performance graph library. Gunrock, our graph-processing system designed specifically for the GPU, uses a high-level, bulk-synchronous, data-centric abstraction focused on operations on a vertex or edge frontier. Gunrock achieves a balance between performance and expressiveness by coupling high-performance GPU computing primitives and optimization strategies with a high-level programming model that allows programmers to quickly develop new graph primitives with small code size and minimal GPU programming knowledge. We characterize the performance of various optimization strategies and evaluate Gunrock's overall performance on different GPU architectures across a wide range of graph primitives, from traversal-based and ranking algorithms to triangle counting and bipartite-graph-based algorithms. The results show that on a single GPU, Gunrock has on average at least an order of magnitude speedup over Boost and PowerGraph, performance comparable to the fastest GPU hardwired primitives and to CPU shared-memory graph libraries such as Ligra and Galois, and better performance than any other GPU high-level graph library. (52 pages; invited paper to ACM Transactions on Parallel Computing (TOPC); an extended version of the PPoPP '16 paper "Gunrock: A High-Performance Graph Processing Library on the GPU".)
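
    The following minimal CUDA sketch (an illustration of the frontier-centric, bulk-synchronous pattern, not Gunrock's actual API) shows one BFS "advance" step over a CSR graph: each thread expands one frontier vertex, claims unvisited neighbors atomically, and appends them to the output frontier. Gunrock layers load balancing and the optimization strategies evaluated in the paper on top of this basic idea.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// One BFS "advance": expand the input frontier by one hop (CSR graph).
__global__ void advance(const int* row_offsets, const int* col_indices,
                        const int* in_frontier, int in_size,
                        int* labels, int depth,
                        int* out_frontier, int* out_size) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= in_size) return;
    int v = in_frontier[t];
    for (int e = row_offsets[v]; e < row_offsets[v + 1]; ++e) {
        int u = col_indices[e];
        if (atomicCAS(&labels[u], -1, depth + 1) == -1)    // claim unvisited u once
            out_frontier[atomicAdd(out_size, 1)] = u;      // append to new frontier
    }
}

int main() {
    // Tiny diamond graph: 0->{1,2}, 1->3, 2->3.
    int h_row[] = {0, 2, 3, 4, 4}, h_col[] = {1, 2, 3, 3};
    int h_labels[] = {0, -1, -1, -1}, h_front[] = {0};
    int *row, *col, *labels, *fin, *fout, *fsize;
    cudaMalloc(&row, sizeof(h_row));
    cudaMemcpy(row, h_row, sizeof(h_row), cudaMemcpyHostToDevice);
    cudaMalloc(&col, sizeof(h_col));
    cudaMemcpy(col, h_col, sizeof(h_col), cudaMemcpyHostToDevice);
    cudaMalloc(&labels, sizeof(h_labels));
    cudaMemcpy(labels, h_labels, sizeof(h_labels), cudaMemcpyHostToDevice);
    cudaMalloc(&fin, 4 * sizeof(int));
    cudaMemcpy(fin, h_front, sizeof(int), cudaMemcpyHostToDevice);
    cudaMalloc(&fout, 4 * sizeof(int));
    cudaMalloc(&fsize, sizeof(int));

    int n = 1;                                     // frontier starts as {0}
    for (int depth = 0; n > 0; ++depth) {          // bulk-synchronous iterations
        cudaMemset(fsize, 0, sizeof(int));
        advance<<<1, 32>>>(row, col, fin, n, labels, depth, fout, fsize);
        cudaMemcpy(&n, fsize, sizeof(int), cudaMemcpyDeviceToHost);
        int* tmp = fin; fin = fout; fout = tmp;    // swap frontiers
    }
    cudaMemcpy(h_labels, labels, sizeof(h_labels), cudaMemcpyDeviceToHost);
    printf("BFS depths: %d %d %d %d\n",            // expected: 0 1 1 2
           h_labels[0], h_labels[1], h_labels[2], h_labels[3]);
    return 0;
}
```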

    Generalized database index structures on massively parallel processor architectures

    Height-balanced search trees are ubiquitous in database management systems as well as in other applications that require efficient access methods to identify entries in large data volumes. They can be configured with various strategies for structuring the search space for a given data set and for pruning it when different kinds of search queries are answered. To facilitate the development of application-specific tree variants, index frameworks such as GiST exist that provide a reusable library of commonly shared tree-management functionality. By specializing internal data-organization strategies, the framework can be customized to create an index that is efficient for an application's data-access characteristics. Because the majority of the framework's code can be reused, development and testing efforts are significantly lower compared to an implementation from scratch. However, none of the existing frameworks supports the execution of index operations on massively parallel processor architectures such as GPUs. Enabling the use of such processors for generalized index frameworks is the goal of this thesis. By compiling state-of-the-art techniques from a wide range of CPU- and GPU-optimized indexes, a GiST extension is developed that abstracts the physical execution aspect of generic, tree-based search queries. Tree traversals are broken down into vectorized processing primitives that can be scheduled to one of the available (co-)processors for execution. Further, a CPU-based implementation is provided, as well as a new GPU-based algorithm that, unlike prior art in this area, does not require the index to be stored entirely inside a GPU's main memory buffer. The applicability of the extended framework is assessed for image-rendering engines and, based on microbenchmarks, the parallelized algorithm's performance is compared across different CPU and GPU generations. It is shown that cases exist where the GPU clearly outperforms the CPU, and vice versa. To leverage the strengths of each processor type, an adaptive scheduler is presented that can be calibrated to schedule index operations to the best-fitting device in a hybrid system. With the help of a tree-traversal simulation, different scheduling strategies are evaluated, showing that the adaptive scheduler makes near-optimal decisions and that, depending on the simulated load, hybrid scheduling can increase the achievable throughput for concurrently executed search operations by an order of magnitude or more.
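
    As a rough illustration of what a vectorized traversal primitive can look like, the following CUDA sketch (our assumption of a simple one-dimensional interval index; not the thesis's actual GiST extension) tests a batch of point queries against all children of one tree node and emits, per query, a bitmap of subtrees to descend into. A host-side scheduler can then decide which child pages to fetch next, so the whole index need not be resident in GPU memory.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

struct Interval { float lo, hi; };

// Batched node test: each thread handles one query and checks it against
// all children of the current node, producing a bitmap of subtrees to
// descend into. The host consumes the bitmaps and loads child pages on
// demand -- the index does not have to fit in GPU memory.
__global__ void node_test(const Interval* children, int n_children,
                          const float* query_keys, int n_queries,
                          unsigned int* descend_mask) {
    int q = blockIdx.x * blockDim.x + threadIdx.x;
    if (q >= n_queries) return;
    unsigned int mask = 0;
    float key = query_keys[q];
    for (int c = 0; c < n_children; ++c)          // node fan-out is small
        if (children[c].lo <= key && key <= children[c].hi)
            mask |= 1u << c;
    descend_mask[q] = mask;
}

int main() {
    Interval h_ch[] = {{0, 10}, {10, 20}, {20, 30}};   // child key ranges
    float h_q[] = {5.0f, 10.0f, 25.0f};                // batch of point queries
    Interval* ch; float* q; unsigned int* m;
    cudaMalloc(&ch, sizeof(h_ch));
    cudaMemcpy(ch, h_ch, sizeof(h_ch), cudaMemcpyHostToDevice);
    cudaMalloc(&q, sizeof(h_q));
    cudaMemcpy(q, h_q, sizeof(h_q), cudaMemcpyHostToDevice);
    cudaMalloc(&m, 3 * sizeof(unsigned int));
    node_test<<<1, 32>>>(ch, 3, q, 3, m);
    unsigned int h_m[3];
    cudaMemcpy(h_m, m, sizeof(h_m), cudaMemcpyDeviceToHost);
    printf("descend masks: %x %x %x\n", h_m[0], h_m[1], h_m[2]);  // expected: 1 3 4
    return 0;
}
```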

    Programming issues for video analysis on Graphics Processing Units

    Video processing is the branch of signal processing in which the input and/or output signals are video sequences. It covers a wide variety of applications that are, in general, compute-intensive due to their algorithmic complexity, and many of these applications also demand real-time operation. Meeting these requirements makes hardware accelerators such as Graphics Processing Units (GPUs) necessary. General-purpose computing on GPUs has been a successful trend in high-performance computing since the release of the NVIDIA CUDA architecture and programming model. This doctoral thesis addresses the efficient parallelization of video processing applications on GPUs, approached from two directions: on the one hand, programming the GPU appropriately for video applications; on the other hand, treating the GPU as part of a heterogeneous system. Since video sequences are composed of frames, which are regular data structures, many components of video applications are inherently parallelizable. Other components, however, are irregular in the sense that they perform workload-dependent computations, suffer from write contention, or contain inherently sequential or load-imbalanced parts. This thesis proposes strategies to address these issues through several case studies. It also describes an optimized approach to histogram computation based on a memory performance model. Video sequences are continuous streams that must be transferred from the host (CPU) to the device (GPU), and the results from the device back to the host. This thesis proposes the use of CUDA streams to implement the stream-processing paradigm on the GPU, in order to orchestrate the concurrent execution of data transfers and computation. It also proposes performance models that enable optimal execution.
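
    A minimal sketch of the stream-processing pattern described above, assuming pinned host memory and a round-robin assignment of frames to CUDA streams so that the host-to-device copy, kernel, and device-to-host copy of one frame overlap with the work of other frames (the pixel-inversion kernel is a stand-in for a real video-analysis stage, and all sizes are illustrative):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void process_frame(unsigned char* f, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) f[i] = 255 - f[i];   // stand-in for a real analysis stage
}

int main() {
    const int FRAME = 1 << 20, NSTREAMS = 4, NFRAMES = 16;
    unsigned char *h, *d;
    // Pinned host memory is required for truly asynchronous copies.
    cudaHostAlloc((void**)&h, (size_t)FRAME * NFRAMES, cudaHostAllocDefault);
    cudaMalloc(&d, (size_t)FRAME * NSTREAMS);   // one device buffer per stream
    cudaStream_t s[NSTREAMS];
    for (int i = 0; i < NSTREAMS; ++i) cudaStreamCreate(&s[i]);

    for (int f = 0; f < NFRAMES; ++f) {
        int k = f % NSTREAMS;                   // round-robin over streams
        unsigned char* hf = h + (size_t)f * FRAME;
        unsigned char* df = d + (size_t)k * FRAME;
        // Copy-in, kernel, and copy-out of this frame are serialized within
        // stream k but overlap with the other streams' transfers and kernels.
        cudaMemcpyAsync(df, hf, FRAME, cudaMemcpyHostToDevice, s[k]);
        process_frame<<<(FRAME + 255) / 256, 256, 0, s[k]>>>(df, FRAME);
        cudaMemcpyAsync(hf, df, FRAME, cudaMemcpyDeviceToHost, s[k]);
    }
    cudaDeviceSynchronize();
    printf("processed %d frames on %d streams\n", NFRAMES, NSTREAMS);
    for (int i = 0; i < NSTREAMS; ++i) cudaStreamDestroy(s[i]);
    cudaFreeHost(h); cudaFree(d);
    return 0;
}
```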

    Fast Monte Carlo Simulations for Quality Assurance in Radiation Therapy

    Monte Carlo (MC) simulation is generally considered the most accurate method for dose calculation in radiation therapy. However, it suffers from low simulation efficiency (hours to days) and complex configuration, which impede its application in clinical studies. The recent rise of MRI-guided radiation platforms (e.g., ViewRay's MRIdian system) brings an urgent need for fast MC algorithms, because the strong magnetic field they introduce may cause large errors in other algorithms. My dissertation focuses on resolving the conflict between accuracy and efficiency in MC simulations through four different approaches: (1) GPU parallel computation, (2) transport-mechanism simplification, (3) variance reduction, and (4) DVH constraints. Accordingly, we took several steps to thoroughly study the performance and accuracy impact of these methods. As a result, three Monte Carlo simulation packages, named gPENELOPE, gDPMvr, and gDVH, were developed to strike a balance between performance and accuracy in different application scenarios. For example, the most accurate, gPENELOPE, is typically used as a gold standard for radiation meter modeling, while the fastest, gDVH, is used for quick in-patient dose calculation, reducing the calculation time from 5 hours to 1.2 minutes (250 times faster) while introducing only 1% error. In addition, a cross-platform GUI integrating the simulation kernels and 3D visualization was developed to make the toolkit more user-friendly. After this fast MC infrastructure was established, we successfully applied it to four radiotherapy scenarios: (1) validating the vendor-provided Co-60 radiation head model by comparing the dose calculated by gPENELOPE to experimental data; (2) quantitatively studying the effect of the magnetic field on dose distribution and proposing a strategy to improve treatment-planning efficiency; (3) evaluating the accuracy of the built-in MC algorithm of MRIdian's treatment planning system; and (4) performing quick quality assurance (QA) for online adaptive radiation therapy, which does not allow enough time for experimental QA. Many other time-sensitive applications (e.g., motional dose accumulation) will also benefit greatly from this fast MC infrastructure.
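
    To illustrate why MC dose calculation maps well to GPUs (one independent particle history per thread), here is a deliberately simplified CUDA toy with a made-up one-dimensional geometry and physics constants; it is not gPENELOPE's transport code. Each thread samples free path lengths from the attenuation coefficient and deposits energy into shared voxels with atomic adds.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <curand_kernel.h>

#define NVOX 64

// Toy MC: each thread follows one photon through a 1-D slab, sampling free
// path lengths s = -ln(u)/mu and depositing a fixed energy fraction at each
// interaction site. Real codes model full physics; this only shows the
// embarrassingly parallel history-per-thread layout.
__global__ void transport(float* dose, int n_histories,
                          float mu /* 1/cm */, float voxel_cm, unsigned seed) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= n_histories) return;
    curandState rng;
    curand_init(seed, t, 0, &rng);              // independent stream per thread
    float x = 0.0f, energy = 1.0f;              // enter slab at x = 0
    while (energy > 0.01f) {
        x += -logf(curand_uniform(&rng)) / mu;  // sample free path length
        int v = (int)(x / voxel_cm);
        if (v >= NVOX) break;                   // photon left the slab
        float dep = 0.3f * energy;              // toy fractional deposit
        atomicAdd(&dose[v], dep);               // voxels are shared by all threads
        energy -= dep;
    }
}

int main() {
    const int N = 1 << 20;                      // one million histories
    float* d;
    cudaMalloc(&d, NVOX * sizeof(float));
    cudaMemset(d, 0, NVOX * sizeof(float));
    transport<<<(N + 255) / 256, 256>>>(d, N, 0.07f, 0.5f, 1234u);
    float h[NVOX];
    cudaMemcpy(h, d, sizeof(h), cudaMemcpyDeviceToHost);
    printf("dose[0]=%.1f dose[32]=%.1f (arbitrary units)\n", h[0], h[32]);
    cudaFree(d);
    return 0;
}
```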

    High performance FPGA and GPU complex pattern matching over spatio-temporal streams

    The wide and increasing availability of collected data in the form of trajectories has led to research advances in the behavioral aspects of the monitored subjects (e.g., wild animals, people, and vehicles). Using trajectory data harvested by devices such as GPS, RFID, and mobile devices, complex pattern queries can be posed to select trajectories based on specific events of interest. In this paper, we present a study of FPGA- and GPU-based architectures for processing complex patterns on streams of spatio-temporal data. Complex patterns are described as regular expressions over a spatial alphabet that can be implicitly or explicitly anchored to the time domain. More importantly, variables can be used to substantially enhance the flexibility and expressive power of pattern queries. We explore the challenges in handling several constructs of the assumed pattern query language, with a study of the trade-offs between expressiveness, scalability, and matching accuracy. We present an extensive performance evaluation in which FPGA and GPU setups outperform current state-of-the-art (single-threaded) CPU-based approaches: by over three orders of magnitude for FPGAs (for expressive queries), and by up to two orders of magnitude for certain datasets on GPUs (with slowdowns in some cases). Unlike software-based approaches, the performance of the proposed FPGA and GPU solutions is only minimally affected by increased pattern complexity.
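
    A minimal sketch of table-driven matching on the GPU, assuming trajectories are already encoded over a small spatial alphabet: one thread steps each trajectory's symbol stream through a DFA transition table. The variables and time anchoring of the paper's query language, and any FPGA specifics, are omitted; this shows only the core matching loop, and all names and the example automaton are illustrative.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define NSTATES 2
#define NSYMS   4   // e.g., four spatial regions A..D encoded as 0..3

// Each thread scans one trajectory's symbol stream through a DFA.
__global__ void match(const unsigned char* streams, int len, int n_traj,
                      const int* delta /* NSTATES x NSYMS */,
                      const bool* accepting, bool* hit) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= n_traj) return;
    const unsigned char* s = streams + (size_t)t * len;
    int state = 0;
    bool matched = false;
    for (int i = 0; i < len; ++i) {
        state = delta[state * NSYMS + s[i]];   // one table lookup per symbol
        matched |= accepting[state];
    }
    hit[t] = matched;
}

int main() {
    // DFA for ".*C.*": move to accepting state 1 once region C (symbol 2) is seen.
    int h_delta[NSTATES * NSYMS] = {0, 0, 1, 0,   // transitions from state 0
                                    1, 1, 1, 1};  // state 1 is absorbing
    bool h_acc[NSTATES] = {false, true};
    unsigned char h_s[2 * 4] = {0, 1, 2, 3,       // trajectory 0 visits C
                                0, 1, 1, 3};      // trajectory 1 never does
    int* delta; bool *acc, *hit; unsigned char* s;
    cudaMalloc(&delta, sizeof(h_delta));
    cudaMemcpy(delta, h_delta, sizeof(h_delta), cudaMemcpyHostToDevice);
    cudaMalloc(&acc, sizeof(h_acc));
    cudaMemcpy(acc, h_acc, sizeof(h_acc), cudaMemcpyHostToDevice);
    cudaMalloc(&s, sizeof(h_s));
    cudaMemcpy(s, h_s, sizeof(h_s), cudaMemcpyHostToDevice);
    cudaMalloc(&hit, 2 * sizeof(bool));
    match<<<1, 32>>>(s, 4, 2, delta, acc, hit);
    bool h_hit[2];
    cudaMemcpy(h_hit, hit, sizeof(h_hit), cudaMemcpyDeviceToHost);
    printf("matches: %d %d\n", h_hit[0], h_hit[1]);  // expected: 1 0
    return 0;
}
```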

    Ubiquitous supercomputing : design and development of enabling technologies for multi-robot systems rethinking supercomputing

    Supercomputing, also known as High Performance Computing (HPC), is almost everywhere (ubiquitous), from the small widget in your phone telling you that today will be a sunny day, up to the next great contribution to the understanding of the origins of the universe. However, there is a field where supercomputing has been only slightly explored: robotics. Other than attempts to optimize complex robotics tasks, the two forces lack an effective alignment and a purposeful long-term contract. With advances in miniaturization and communications, and with the appearance of powerful, energy- and weight-optimized embedded computing boards, a next logical transition corresponds to the creation of clusters of robots: sets of robotic entities that behave in much the same way a supercomputer does. Yet there is a key aspect of our current understanding of what supercomputing means, or is useful for, that this work aims to redefine. For decades, supercomputing has been intended solely as a computing-efficiency mechanism, i.e., decreasing the computing time for complex tasks. While this train of thought has led to countless findings, supercomputing is more than that: in order to provide the capacity to solve most problems quickly, a complete set of additional features must be provided, features that can also be exploited in contexts such as robotics and that ultimately transform a set of independent entities into a cohesive unit. This thesis aims to rethink what supercomputing means and to devise strategies to effectively establish its place within the robotics realm, thereby contributing to the ubiquity of supercomputing, the first main ideal of this work. With this in mind, a state of the art concerning previous attempts to mix robotics and HPC is outlined, followed by the proposal of High Performance Robotic Computing (HPRC), a new concept mapping supercomputing to the nuances of multi-robot systems. HPRC can be thought of as supercomputing at the edge, and while this approach provides all kinds of advantages, in certain applications it might not be enough, since interaction with external infrastructures will be required or desired. To facilitate such interaction, this thesis proposes the concept of ubiquitous supercomputing as the union of HPC, HPRC, and two more types of entities: computing-less devices (e.g., sensor networks) and humans. The results of this thesis include the ubiquitous supercomputing ontology and an enabling technology called The ARCHADE. The technology serves as a middleware between a mission and a supercomputing infrastructure, and as a framework to facilitate the execution of any type of mission, e.g., precision agriculture, entertainment, inspection and monitoring. Furthermore, the results of the execution of a set of missions are discussed. By integrating supercomputing and robotics, a second ideal is targeted: ubiquitous robotics, i.e., the use of robots in all kinds of applications. Correspondingly, a review of existing ubiquitous robotics frameworks is presented, and based upon its conclusions, The ARCHADE's design and development have followed the guidelines for current and future solutions. Furthermore, The ARCHADE is based on a rethought supercomputing in which performance is not the only feature to be provided by ubiquitous supercomputing systems; performance indicators are nevertheless discussed, along with those related to other supercomputing features. Supercomputing has been an excellent ally for scientific exploration, and not so long ago for commercial activities, leading to all kinds of improvements in our lives, in our society, and in our future. With the results of this thesis, the joining of two fields, two forces previously disconnected because of their philosophical approaches and divergent backgrounds, holds enormous potential to open up our imagination to all kinds of new applications and to a world where robotics and supercomputing are everywhere.

    Graph Processing on GPU

    Ph.D. thesis (Doctor of Philosophy)