
    Geometry-Oblivious FMM for Compressing Dense SPD Matrices

    We present GOFMM (geometry-oblivious FMM), a novel method that creates a hierarchical low-rank approximation, "compression," of an arbitrary dense symmetric positive definite (SPD) matrix. For many applications, GOFMM enables an approximate matrix-vector multiplication in N log N or even N time, where N is the matrix size. Compression requires N log N storage and work. In general, our scheme belongs to the family of hierarchical matrix approximation methods. In particular, it generalizes the fast multipole method (FMM) to a purely algebraic setting by only requiring the ability to sample matrix entries. Neither geometric information (i.e., point coordinates) nor knowledge of how the matrix entries have been generated is required, thus the term "geometry-oblivious." Also, we introduce a shared-memory parallel scheme for hierarchical matrix computations that reduces synchronization barriers. We present results on the Intel Knights Landing and Haswell architectures, and on the NVIDIA Pascal architecture for a variety of matrices. Comment: 13 pages, accepted by SC'17
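
    To make the structure of such a compression concrete, here is a minimal sketch, assuming numpy: diagonal blocks stay dense, each off-diagonal block is replaced by a low-rank factorization, and the scheme recurses on the diagonal blocks. For readability the low-rank step is a truncated SVD of the full block; GOFMM itself derives the factors from sampled matrix entries and never forms the full off-diagonal block, and all names below are illustrative.

```python
# Toy hierarchical compression of a dense SPD matrix (not GOFMM itself).
# By symmetry A21 = A12^T, so one factor pair per level suffices.
import numpy as np

def compress(A, rank, leaf=64):
    n = A.shape[0]
    if n <= leaf:
        return A                                   # dense leaf block
    m = n // 2
    U, s, Vt = np.linalg.svd(A[:m, m:])            # stand-in for sampling
    low = (U[:, :rank] * s[:rank], Vt[:rank, :])   # A12 ~ U @ V, rank r
    return (compress(A[:m, :m], rank, leaf),
            compress(A[m:, m:], rank, leaf), low)

def matvec(H, x):
    # Apply the compressed operator to a vector in roughly N log N work.
    if isinstance(H, np.ndarray):
        return H @ x
    H11, H22, (U, V) = H
    m = U.shape[0]
    x1, x2 = x[:m], x[m:]
    y1 = matvec(H11, x1) + U @ (V @ x2)            # low-rank A12 @ x2
    y2 = matvec(H22, x2) + V.T @ (U.T @ x1)        # A21 @ x1, by symmetry
    return np.concatenate([y1, y2])

# SPD test matrix: Gaussian kernel on sorted 1D points plus a small shift.
pts = np.sort(np.random.default_rng(1).uniform(size=256))
A = np.exp(-(pts[:, None] - pts[None, :]) ** 2 / 0.02) + 1e-6 * np.eye(256)
H = compress(A, rank=8)
x = np.ones(256)
print(np.linalg.norm(matvec(H, x) - A @ x) / np.linalg.norm(A @ x))
```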

    MaSiF: Machine learning guided auto-tuning of parallel skeletons


    Replicable parallel branch and bound search

    Combinatorial branch and bound searches are a common technique for solving global optimisation and decision problems. Their performance often depends on good search order heuristics, refined over decades of algorithms research. Parallel search necessarily deviates from the sequential search order, sometimes dramatically and unpredictably, e.g. by distributing work at random. This can disrupt effective search order heuristics and lead to unexpected and highly variable parallel performance. The variability makes it hard to reason about the parallel performance of combinatorial searches. This paper presents a generic parallel branch and bound skeleton, implemented in Haskell, with replicable parallel performance. The skeleton aims to preserve the search order heuristic by distributing work in an ordered fashion, closely following the sequential search order. We demonstrate the generality of the approach by applying the skeleton to 40 instances of three combinatorial problems: Maximum Clique, 0/1 Knapsack and Travelling Salesperson. The overheads of our Haskell skeleton are reasonable, giving slowdown factors of between 1.9 and 6.2 compared with a class-leading, dedicated, and highly optimised C++ Maximum Clique solver. We demonstrate scaling up to 200 cores of a Beowulf cluster, achieving speedups of 100x for several Maximum Clique instances. We demonstrate low variance of parallel performance across all instances of the three combinatorial problems and at all scales up to 200 cores, with median Relative Standard Deviation (RSD) below 2%. Parallel solvers that do not follow the sequential search order exhibit far higher variance, with median RSD exceeding 85% for Knapsack.
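
    As a toy illustration of the ordered-distribution idea (in Python, not the paper's Haskell skeleton), the sketch below encodes each branch-and-bound task's position in the search tree as its path of branching decisions. A priority queue ordered by path reproduces the sequential depth-first order, so workers pulling from such a queue would deviate from it only minimally. The 0/1 Knapsack model and all names are illustrative.

```python
# Ordered branch and bound for 0/1 knapsack: tasks are taken in
# lexicographic path order, which equals the sequential DFS order.
import heapq

def knapsack_bnb(values, weights, capacity):
    n, best = len(values), 0

    def bound(i, value, room):
        # Fractional relaxation: upper bound on what this subtree can reach.
        for v, w in sorted(zip(values[i:], weights[i:]),
                           key=lambda p: p[0] / p[1], reverse=True):
            if w <= room:
                value, room = value + v, room - w
            else:
                return value + v * room / w
        return value

    # Task = (path of branching decisions, next item, value so far, room).
    queue = [((), 0, 0, capacity)]
    while queue:
        path, i, value, room = heapq.heappop(queue)
        best = max(best, value)
        if i == n or bound(i, value, room) <= best:
            continue                       # prune: cannot beat incumbent
        if weights[i] <= room:             # "take item i" branch first,
            heapq.heappush(queue, (path + (0,), i + 1,   # as sequentially
                                   value + values[i], room - weights[i]))
        heapq.heappush(queue, (path + (1,), i + 1, value, room))
    return best

print(knapsack_bnb([60, 100, 120], [10, 20, 30], 50))  # -> 220
```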

    Parallel source code transformation techniques using design patterns

    In recent years, traditional approaches for improving performance, such as increasing the clock frequency, have come to a dead end. To tackle this issue, parallel architectures, such as multi-/many-core processors, have been adopted to increase performance by providing greater processing capabilities. However, programming efficiently for these architectures demands considerable effort to transform sequential applications into parallel ones and to optimize them. Compared to sequential programming, designing and implementing parallel applications for modern hardware poses a number of new challenges to developers, such as data races, deadlocks, and load imbalance. To pave the way, parallel design patterns encapsulate algorithmic aspects, allowing users to implement robust, readable and portable solutions with high-level abstractions. These patterns instantiate parallelism while hiding the complexity of concurrency mechanisms, such as thread management, synchronization or data sharing. Nonetheless, frameworks following this philosophy do not share the same interface, and users must understand different libraries and their capabilities, not only to decide which fits their purposes best but also to leverage them properly. Furthermore, parallelizing these applications requires analyzing the sequential code to detect the regions that can be parallelized, which is a time-consuming and complex task. Additionally, libraries targeting specific devices provide algorithm implementations that are already parallel and highly tuned; in these situations it is also necessary to determine which routine implementation is the most suitable for a given problem. To tackle these issues, this thesis aims at simplifying and minimizing the effort needed to transform sequential applications into parallel ones. The resulting codes improve their performance by fully exploiting the available resources, while development effort is considerably reduced. This thesis contributes the following. First, we propose a technique to detect potential parallel patterns in sequential code. Second, we provide a novel generic C++ interface for parallel patterns which acts as a switch among existing frameworks. Third, we implement a framework that transforms sequential code into parallel code using the proposed pattern discovery technique and pattern interface. Finally, we propose mechanisms that select the most suitable device and routine implementation to solve a given problem, based on previous performance information. The evaluation demonstrates that the proposed techniques can minimize refactoring and optimization time while improving the performance of the resulting applications with respect to the original code.
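
    The switch idea behind such a generic pattern interface can be shown with a toy analogue (in Python; the thesis's actual C++ interface is not reproduced here): application code is written once against a single pattern, and the executing framework is selected by a parameter. All names below are illustrative.

```python
# One 'map' pattern, several interchangeable execution back ends.
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def pattern_map(func, items, backend="sequential", workers=4):
    """Apply `func` to every item using the chosen execution back end."""
    if backend == "sequential":
        return [func(x) for x in items]
    if backend == "threads":
        with ThreadPoolExecutor(max_workers=workers) as ex:
            return list(ex.map(func, items))
    if backend == "processes":
        with ProcessPoolExecutor(max_workers=workers) as ex:
            return list(ex.map(func, items))
    raise ValueError(f"unknown backend: {backend}")

def square(x):
    return x * x

if __name__ == "__main__":
    # Same call site; only the back-end name changes.
    print(pattern_map(square, range(8), backend="sequential"))
    print(pattern_map(square, range(8), backend="processes"))
```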

    Architecture aware parallel programming in Glasgow parallel Haskell (GPH)

    General-purpose computing architectures are evolving quickly to become manycore and hierarchical: i.e. a core can communicate more quickly locally than globally. To be effective on such architectures, programming models must be aware of the communications hierarchy. This thesis investigates a programming model that aims to share the responsibility of task placement, load balance, thread creation, and synchronisation between the application developer and the runtime system. The main contribution of this thesis is the development of four new architecture-aware constructs for Glasgow parallel Haskell that exploit information about task size and aim to reduce communication for small tasks, preserve data locality, or distribute large units of work. We define a semantics for the constructs that specifies the sets of PEs that each construct identifies, and we check four properties of the semantics using QuickCheck. We report a preliminary investigation of architecture-aware programming models that abstract over the new constructs. In particular, we propose architecture-aware evaluation strategies and skeletons. We investigate three common paradigms, namely data parallelism, divide-and-conquer and nested parallelism, on hierarchical architectures with up to 224 cores. The results show that the architecture-aware programming model consistently delivers better speedup and scalability than existing constructs, together with a dramatic reduction in execution time variability. We present a comparison of functional multicore technologies that reports some of the first multicore results for the Feedback Directed Implicit Parallelism (FDIP) and the semi-explicit parallelism (GpH and Eden) languages. The comparison reflects the growing maturity of the field by systematically evaluating four parallel Haskell implementations on a common multicore architecture, contrasting the programming effort each language requires with the parallel performance delivered. We investigate the minimum thread granularity required to achieve satisfactory performance for three parallel functional language implementations on a multicore platform. The results show that GHC-GUM requires a larger thread granularity than Eden and GHC-SMP, and that the required granularity rises as the number of cores rises.
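
    A toy model (in Python, not GpH) of what such a semantics specifies: each construct identifies a set of PEs relative to the current one, determined by communication distance in a node/core hierarchy, and simple properties of those sets can be checked mechanically, in the spirit of the QuickCheck properties mentioned above. The hierarchy shape and all names are illustrative.

```python
# PEs in a two-level hierarchy: 4 nodes x 8 cores, identified as (node, core).
from itertools import product

PES = list(product(range(4), range(8)))

def distance(a, b):
    """0 = same core, 1 = same node, 2 = different node."""
    if a == b:
        return 0
    return 1 if a[0] == b[0] else 2

def pes_within(here, radius):
    """PEs no farther than `radius`: candidates for small, local tasks."""
    return [pe for pe in PES if distance(here, pe) <= radius]

def pes_beyond(here, radius):
    """PEs farther than `radius`: candidates for large units of work."""
    return [pe for pe in PES if distance(here, pe) > radius]

here = (0, 0)
assert len(pes_within(here, 1)) == 8    # the whole local node
assert len(pes_beyond(here, 1)) == 24   # the three remote nodes
# A QuickCheck-style property: the two sets partition the machine.
assert sorted(pes_within(here, 1) + pes_beyond(here, 1)) == sorted(PES)
```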

    ASKIT: Approximate Skeletonization Kernel-Independent Treecode in High Dimensions

    We present a fast algorithm for kernel summation problems in high dimensions. These problems appear in computational physics, numerical approximation, non-parametric statistics, and machine learning. In our context, the sums depend on a kernel function that is a pair potential defined on a dataset of points in a high-dimensional Euclidean space. A direct evaluation of the sum scales quadratically with the number of points. Fast kernel summation methods can reduce this cost to linear complexity, but the constants involved do not scale well with the dimensionality of the dataset. The main algorithmic components of fast kernel summation algorithms are the separation of the kernel sum between near and far field (which is the basis for pruning) and the efficient and accurate approximation of the far field. We introduce novel methods for pruning and approximating the far field. Our far field approximation requires only kernel evaluations and does not use analytic expansions. Pruning is not done using bounding boxes but rather combinatorially, using a sparsified nearest-neighbor graph of the input. The time complexity of our algorithm depends linearly on the ambient dimension. The error in the algorithm depends on the low-rank approximability of the far field, which in turn depends on the kernel function and on the intrinsic dimensionality of the distribution of the points. The error of the far field approximation does not depend on the ambient dimension. We present the new algorithm along with experimental results that demonstrate its performance. We report results for Gaussian kernel sums for 100 million points in 64 dimensions, for one million points in 1000 dimensions, and for problems in which the Gaussian kernel has a variable bandwidth. To the best of our knowledge, all of these experiments are impossible or prohibitively expensive with existing fast kernel summation methods. Comment: 22 pages, 6 figures
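
    The near/far split and an evaluation-only far-field approximation can be sketched as follows, assuming numpy. A crude random subsample stands in for ASKIT's skeletonization (which uses interpolative decompositions over neighbor-pruned sets), and the clean separation of the point clusters is contrived for clarity; all names are illustrative.

```python
# Kernel sum = exact near field + low-rank far field, using only
# kernel evaluations (no analytic expansion).
import numpy as np

def gauss(X, Y, h=10.0):
    # Gaussian kernel matrix K[i, j] = exp(-|x_i - y_j|^2 / (2 h^2)).
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * h * h))

rng = np.random.default_rng(0)
targets = rng.normal(size=(50, 8))            # targets in 8 dimensions
near    = rng.normal(size=(30, 8))            # near-field sources: exact
far     = rng.normal(size=(2000, 8)) + 10.0   # well-separated far cluster
w_near  = rng.normal(size=30)
w_far   = rng.normal(size=2000)

exact = gauss(targets, near) @ w_near + gauss(targets, far) @ w_far

# Far field at rank k: K(T, F) ~ K(T, S) K(S, S)^+ K(S, F) for a small
# skeleton set S of far sources, built from kernel evaluations only.
k = 16
S = far[rng.choice(len(far), size=k, replace=False)]
q = np.linalg.pinv(gauss(S, S)) @ (gauss(S, far) @ w_far)
approx = gauss(targets, near) @ w_near + gauss(targets, S) @ q

print("relative error:",
      np.linalg.norm(approx - exact) / np.linalg.norm(exact))
```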

    On Designing Multicore-aware Simulators for Biological Systems

    The stochastic simulation of biological systems is an increasingly popular technique in bioinformatics. It is often an enlightening technique, which may however be computationally expensive. We discuss the main opportunities to speed it up on multi-core platforms, which pose new challenges for parallelisation techniques. These opportunities are developed into two general families of solutions, involving both the single simulation and a bulk of independent simulations (either replicas or runs derived from a parameter sweep). The proposed solutions are tested on the parallelisation of the CWC simulator (Calculus of Wrapped Compartments), which is carried out according to the proposed solutions by way of the FastFlow programming framework, making fast development and efficient execution on multi-cores possible. Comment: 19 pages + cover page
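
    For the second family (a bulk of independent simulations), here is a minimal sketch assuming only Python's standard library rather than FastFlow, with a toy birth-death process rather than CWC; all names and parameters are illustrative.

```python
# Independent stochastic simulation replicas run in parallel, then aggregated.
import random
from concurrent.futures import ProcessPoolExecutor

def replica(seed, steps=10_000, birth=1.0, death=0.01, x0=10):
    # One Gillespie-style run of a toy birth-death process.
    rng = random.Random(seed)
    x, t = x0, 0.0
    for _ in range(steps):
        rates = (birth * x, death * x * (x - 1) / 2)
        total = sum(rates)
        if total == 0:
            break                              # absorbed: no reactions left
        t += rng.expovariate(total)            # time to the next reaction
        x += 1 if rng.random() < rates[0] / total else -1
    return x

if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:
        # 32 independent replicas, each seeded differently.
        finals = list(pool.map(replica, range(32)))
    print("mean final population:", sum(finals) / len(finals))
```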