104 research outputs found

    Tiled Algorithms for Matrix Computations on Multicore Architectures

    Full text link
    The current computer architecture has moved towards the multi/many-core structure. However, the algorithms in the current sequential dense numerical linear algebra libraries (e.g. LAPACK) do not parallelize well on multi/many-core architectures. A new family of algorithms, the tile algorithms, has recently been introduced to circumvent this problem. Previous research has shown that it is possible to write efficient and scalable tile algorithms for performing a Cholesky factorization, a (pseudo) LU factorization, and a QR factorization. The goal of this thesis is to study tiled algorithms in a multi/many-core setting and to provide new algorithms which exploit the current architecture to improve performance relative to current state-of-the-art libraries while maintaining the stability and robustness of these libraries.Comment: PhD Thesis, 2012 http://math.ucdenver.ed

    Integrated Development and Parallelization of Automated Dicentric Chromosome Identification Software to Expedite Biodosimetry Analysis

    Get PDF
    Manual cytogenetic biodosimetry lacks the ability to handle mass casualty events. We present an automated dicentric chromosome identification (ADCI) software utilizing parallel computing technology. A parallelization strategy combining data and task parallelism, as well as optimization of I/O operations, has been designed, implemented, and incorporated in ADCI. Experiments on an eight-core desktop show that our algorithm can expedite the process of ADCI by at least four folds. Experiments on Symmetric Computing, SHARCNET, Blue Gene/Q multi-processor computers demonstrate the capability of parallelized ADCI to process thousands of samples for cytogenetic biodosimetry in a few hours. This increase in speed underscores the effectiveness of parallelization in accelerating ADCI. Our software will be an important tool to handle the magnitude of mass casualty ionizing radiation events by expediting accurate detection of dicentric chromosomes

    Dynamic Task Execution on Shared and Distributed Memory Architectures

    Get PDF
    Multicore architectures with high core counts have come to dominate the world of high performance computing, from shared memory machines to the largest distributed memory clusters. The multicore route to increased performance has a simpler design and better power efficiency than the traditional approach of increasing processor frequencies. But, standard programming techniques are not well adapted to this change in computer architecture design. In this work, we study the use of dynamic runtime environments executing data driven applications as a solution to programming multicore architectures. The goals of our runtime environments are productivity, scalability and performance. We demonstrate productivity by defining a simple programming interface to express code. Our runtime environments are experimentally shown to be scalable and give competitive performance on large multicore and distributed memory machines. This work is driven by linear algebra algorithms, where state-of-the-art libraries (e.g., LAPACK and ScaLAPACK) using a fork-join or block-synchronous execution style do not use the available resources in the most efficient manner. Research work in linear algebra has reformulated these algorithms as tasks acting on tiles of data, with data dependency relationships between the tasks. This results in a task-based DAG for the reformulated algorithms, which can be executed via asynchronous data-driven execution paths analogous to dataflow execution. We study an API and runtime environment for shared memory architectures that efficiently executes serially presented tile based algorithms. This runtime is used to enable linear algebra applications and is shown to deliver performance competitive with state-of- the-art commercial and research libraries. We develop a runtime environment for distributed memory multicore architectures extended from our shared memory implementation. The runtime takes serially presented algorithms designed for the shared memory environment, and schedules and executes them on distributed memory architectures in a scalable and high performance manner. We design a distributed data coherency protocol and a distributed task scheduling mechanism which avoid global coordination. Experimental results with linear algebra applications show the scalability and performance of our runtime environment

    Effective data parallel computing on multicore processors

    Get PDF
    The rise of chip multiprocessing or the integration of multiple general purpose processing cores on a single chip (multicores), has impacted all computing platforms including high performance, servers, desktops, mobile, and embedded processors. Programmers can no longer expect continued increases in software performance without developing parallel, memory hierarchy friendly software that can effectively exploit the chip level multiprocessing paradigm of multicores. The goal of this dissertation is to demonstrate a design process for data parallel problems that starts with a sequential algorithm and ends with a high performance implementation on a multicore platform. Our design process combines theoretical algorithm analysis with practical optimization techniques. Our target multicores are quad-core processors from Intel and the eight-SPE IBM Cell B.E. Target applications include Matrix Multiplications (MM), Finite Difference Time Domain (FDTD), LU Decomposition (LUD), and Power Flow Solver based on Gauss-Seidel (PFS-GS) algorithms. These applications are popular computation methods in science and engineering problems and are characterized by unit-stride (MM, LUD, and PFS-GS) or 2-point stencil (FDTD) memory access pattern. The main contributions of this dissertation include a cache- and space-efficient algorithm model, integrated data pre-fetching and caching strategies, and in-core optimization techniques. Our multicore efficient implementations of the above described applications outperform nai¨ve parallel implementations by at least 2x and scales well with problem size and with the number of processing cores

    Doctor of Philosophy

    Get PDF
    dissertationDataflow pipeline models are widely used in visualization systems. Despite recent advancements in parallel architecture, most systems still support only a single CPU or a small collection of CPUs such as a SMP workstation. Even for systems that are specifically tuned towards parallel visualization, their execution models only provide support for data-parallelism while ignoring taskparallelism and pipeline-parallelism. With the recent popularization of machines equipped with multicore CPUs and multi-GPU units, these visualization systems are undoubtedly falling further behind in reaching maximum efficiency. On the other hand, there exist several libraries that can schedule program executions on multiple CPUs and/or multiple GPUs. However, due to differences in executing a task graph and a pipeline along with their APIs being considerably low-level, it still remains a challenge to integrate these run-time libraries into current visualization systems. Thus, there is a need for a redesigned dataflow architecture to fully support and exploit the power of highly parallel machines in large-scale visualization. The new design must be able to schedule executions on heterogeneous platforms while at the same time supporting arbitrarily large datasets through the use of streaming data structures. The primary goal of this dissertation work is to develop a parallel dataflow architecture for streaming large-scale visualizations. The framework includes supports for platforms ranging from multicore processors to clusters consisting of thousands CPUs and GPUs. We achieve this in our system by introducing the notion of Virtual Processing Elements and Task-Oriented Modules along with a highly customizable scheduler that controls the assignment of tasks to elements dynamically. This creates an intuitive way to maintain multiple CPU/GPU kernels yet still provide coherency and synchronization across module executions. We have implemented these techniques into HyperFlow which is made of an API with all basic dataflow constructs described in the dissertation, and a distributed run-time library that can be used to deploy those pipelines on multicore, multi-GPU and cluster-based platforms

    Contributions à l'optimisation de programmes et à la synthèse de circuits haut-niveau

    Get PDF
    Since the end of Dennard scaling, power efficiency is the limiting factor for large-scale computing. Hardware accelerators such as reconfigurable circuits (FPGA, CGRA) or Graphics Processing Units (GPUs) were introduced to improve the performance under a limited energy budget, resulting into complex heterogeneous platforms. This document presents a synthetic description of my research activities over the last decade on compilers for high-performance computing and high-level synthesis of circuits (HLS) for FPGA accelerators. Specifically, my contributions covers both theoretical and practical aspects of automatic parallelization and HLS in a general theoretical framework called the polyhedral model.A first chapter describes our contributions to loop tiling, a key program transformation for automatic parallelization which splits the computation atomic blocks called tiles.We rephrase loop tiling in the polyhedral model to enable any polyhedral tile shape whose size depends on a single parameter (monoparametric tiling), and we present a tiling transformation for programs with reductions – accumulations w.r.t. an associative/commutative operator. Our results open the way for semantic program transformations ; program transformations which does not preserve the computation but still lead to an equivalent program.A second chapter describes our contributions to algorithm recognition. A compiler optimization will never replace a good algorithm, hence the idea to recognize algorithm instances in a program and to substitute them by a call to a performance library. In our PhD thesis, we have addressed the recognition of templates – functionswith first-order variables – into programs and its application to program optimization. We propose a complementary algorithm recognition framework which leverages our monoparametric tiling and our reduction tiling transformations. This automates semantic tiling, a new semantic program transformation which increases the grain of operators (scalar → matrix).A third chapter presents our contributions to the synthesis of communications with an off-chip memory in the context of high-level circuit synthesis (HLS). We propose an execution model based on loop tiling, a pipelined architecture and a source-level compilation algorithm which, connected to the C2H HLS tool from Altera, ends up to a FPGA configuration with minimized data transfers. Our compilation algorithm is optimal – the data are loaded as late as possible and stored as soon as possible with a maximal reuse.A fourth chapter presents our contributions to design a unified polyhedral compilation model for high-level circuit synthesis.We present the Data-aware Process Networks (DPN), a dataflow intermediate representation which leverages the ideas developed in chapter 3 to explicit the data transfers with an off-chip memory. We propose an algorithm to compile a DPN from a sequential program, and we present our contribution to the synthesis of DPN to a circuit. In particular, we present our algorithms to compile the control, the channels and the synchronizations of a DPN. These results are used in the production compiler of the Xtremlogic start-up.Depuis la fin du Dennard scaling, l’efficacité énergétique est le facteur limitant pour le calcul haute performance. Les accélérateurs matériels comme les circuits reconfigurables (FPGA, CGRA) ou les accélérateurs graphiques (GPUs) ont été introduits pour améliorer les performances sous un budget énergétique limité, menant à des plateformes hétérogènes complexes.Mes travaux de recherche portent sur les compilateurs et la synthèse de circuits haut-niveau (High-Level Synthesis, HLS) pour le calcul haute-performance. Specifiquement, mes contributions couvrent les aspects théoriques etpratiques de la parallélisation automatique et la HLS dans le cadre général du modèle polyédrique.Un premier chapitre décrit mes contributions au tuilage de boucles, une transformation fondamentale pour la parallélisation automatique, qui découpe le calcul en sous-calculs atomiques appelés tuiles. Nous reformulons le tuilage de boucles dans le modèle polyédrique pour permettre n’importe tuile polytopique dont la taille dépend d’un facteur homothétique (tuilage monoparamétrique), et nous décrivons une transformation de tuilage pour des programmes avec des réductions – une accumulation selon un opérateur associative et commutatif. Nos résultats ouvrent la voie à des transformations de programme sémantiques ; qui ne préservent pas le calcul, mais produisent un programme équivalent.Un second chapitre décrit mes contributions à la reconnaissance d’algorithmes. Une optimisation de compilateur ne remplacera jamais un bon algorithme, d’où l’idée de reconnaître les instances d’un algorithme dans un programme et de les substituer par un appel vers une bibliothèque hauteperformance, chaque fois que c’est possible et utile.Dans notre thèse, nous avons traité la reconnaissance de templates – des fonctions avec des variables d’ordre 1 – dans un programme et son application à l’optimisation de programes. Nous proposons une approche complémentaire qui s’appuie sur notre tuilage monoparamétrique complété par une transformation pour tuiler les réductions. Ceci automatise le tuilage sémantique, une nouvelle transformation sémantique qui augmente le grain des opérateurs (scalaire → matrice).Un troisième chapitre présente mes contributions à la synthèse des communications avec une mémoire off-chip dans le contexte de la synthèse de circuits haut-niveau. Nous proposons un modèle d’exécution basé sur le tuilage de boucles, une architecture pipelinée et un algorithme de compilation source-à-source qui, connecté à l’outil de HLS C2H d’Altera, produit une configuration de circuit FPGA qui réalise un volume minimal de transferts de données. Notre algorithme est optimal – les données sont chargées le plus tard possible et stockées le plus tôt possible, avec une réutilisation maximale et sans redondances.Enfin, un quatrième chapitre présente mes contributions pour construire un modèle de compilation polyédrique unifié pour la synthèse de circuits haut-niveau.Nous présentons les réseaux de processus DPN (Data-aware Process Networks), une représentation intermédiaire dataflow qui s’appuie sur les idées développées au chapitre 3 pour expliciter les transferts de données entre le circuit et la mémoire off-chip. Nous proposons une suite d’algorithmes pour compiler un DPN à partir d’un programme séquentiel et nous présentons nos contributions à la synthèse d’un DPN en circuit. En particulier, nous présentons nos algorithmes pour compiler le contrôle, les canaux et les synchronisations d’un DPN. Ces résultats sont utilisés dans le compilateur de production de la start-up XtremLogic

    Running stream-like programs on heterogeneous multi-core systems

    Get PDF
    All major semiconductor companies are now shipping multi-cores. Phones, PCs, laptops, and mobile internet devices will all require software that can make effective use of these cores. Writing high-performance parallel software is difficult, time-consuming and error prone, increasing both time-to-market and cost. Software outlives hardware; it typically takes longer to develop new software than hardware, and legacy software tends to survive for a long time, during which the number of cores per system will increase. Development and maintenance productivity will be improved if parallelism and technical details are managed by the machine, while the programmer reasons about the application as a whole. Parallel software should be written using domain-specific high-level languages or extensions. These languages reveal implicit parallelism, which would be obscured by a sequential language such as C. When memory allocation and program control are managed by the compiler, the program's structure and data layout can be safely and reliably modified by high-level compiler transformations. One important application domain contains so-called stream programs, which are structured as independent kernels interacting only through one-way channels, called streams. Stream programming is not applicable to all programs, but it arises naturally in audio and video encode and decode, 3D graphics, and digital signal processing. This representation enables high-level transformations, including kernel unrolling and kernel fusion. This thesis develops new compiler and run-time techniques for stream programming. The first part of the thesis is concerned with a statically scheduled stream compiler. It introduces a new static partitioning algorithm, which determines which kernels should be fused, in order to balance the loads on the processors and interconnects. A good partitioning algorithm is crucial if the compiler is to produce efficient code. The algorithm also takes account of downstream compiler passes---specifically software pipelining and buffer allocation---and it models the compiler's ability to fuse kernels. The latter is important because the compiler may not be able to fuse arbitrary collections of kernels. This thesis also introduces a static queue sizing algorithm. This algorithm is important when memory is distributed, especially when local stores are small. The algorithm takes account of latencies and variations in computation time, and is constrained by the sizes of the local memories. The second part of this thesis is concerned with dynamic scheduling of stream programs. First, it investigates the performance of known online, non-preemptive, non-clairvoyant dynamic schedulers. Second, it proposes two dynamic schedulers for stream programs. The first is specifically for one-dimensional stream programs. The second is more general: it does not need to be told the stream graph, but it has slightly larger overhead. This thesis also introduces some support tools related to stream programming. StarssCheck is a debugging tool, based on Valgrind, for the StarSs task-parallel programming language. It generates a warning whenever the program's behaviour contradicts a pragma annotation. Such behaviour could otherwise lead to exceptions or race conditions. StreamIt to OmpSs is a tool to convert a streaming program in the StreamIt language into a dynamically scheduled task based program using StarSs.Totes les empreses de semiconductors produeixen actualment multi-cores. Mòbils,PCs, portàtils, i dispositius mòbils d’Internet necessitaran programari quefaci servir eficientment aquests cores. Escriure programari paral·lel d’altrendiment és difícil, laboriós i propens a errors, incrementant tant el tempsde llançament al mercat com el cost. El programari té una vida més llarga queel maquinari; típicament pren més temps desenvolupar nou programi que noumaquinari, i el programari ja existent pot perdurar molt temps, durant el qualel nombre de cores dels sistemes incrementarà. La productivitat dedesenvolupament i manteniment millorarà si el paral·lelisme i els detallstècnics són gestionats per la màquina, mentre el programador raona sobre elconjunt de l’aplicació.El programari paral·lel hauria de ser escrit en llenguatges específics deldomini. Aquests llenguatges extrauen paral·lelisme implícit, el qual és ocultatper un llenguatge seqüencial com C. Quan l’assignació de memòria i lesestructures de control són gestionades pel compilador, l’estructura iorganització de dades del programi poden ser modificades de manera segura ifiable per les transformacions d’alt nivell del compilador.Un dels dominis de l’aplicació importants és el que consta dels programes destream; aquest programes són estructurats com a nuclis independents queinteractuen només a través de canals d’un sol sentit, anomenats streams. Laprogramació de streams no és aplicable a tots els programes, però sorgeix deforma natural en la codificació i descodificació d’àudio i vídeo, gràfics 3D, iprocessament de senyals digitals. Aquesta representació permet transformacionsd’alt nivell, fins i tot descomposició i fusió de nucli.Aquesta tesi desenvolupa noves tècniques de compilació i sistemes en tempsd’execució per a programació de streams. La primera part d’aquesta tesi esfocalitza amb un compilador de streams de planificació estàtica. Presenta unnou algorisme de partició estàtica, que determina quins nuclis han de serfusionats, per tal d’equilibrar la càrrega en els processadors i en lesinterconnexions. Un bon algorisme de particionat és fonamental per tal de queel compilador produeixi codi eficient. L’algorisme també té en compte elspassos de compilació subseqüents---específicament software pipelining il’arranjament de buffers---i modela la capacitat del compilador per fusionarnuclis. Aquesta tesi també presenta un algorisme estàtic de redimensionament de cues.Aquest algorisme és important quan la memòria és distribuïda, especialment quanles memòries locals són petites. L’algorisme té en compte latències ivariacions en els temps de càlcul, i considera el límit imposat per la mida deles memòries locals.La segona part d’aquesta tesi es centralitza en la planificació dinàmica deprogrames de streams. En primer lloc, investiga el rendiment dels planificadorsdinàmics online, non-preemptive i non-clairvoyant. En segon lloc, proposa dosplanificadors dinàmics per programes de stream. El primer és específicament pera programes de streams unidimensionals. El segon és més general: no necessitael graf de streams, però els overheads són una mica més grans.Aquesta tesi també presenta un conjunt d’eines de suport relacionades amb laprogramació de streams. StarssCheck és una eina de depuració, que és basa enValgrind, per StarSs, un llenguatge de programació paral·lela basat en tasques.Aquesta eina genera un avís cada vegada que el comportament del programa estàen contradicció amb una anotació pragma. Aquest comportament d’una altra manerapodria causar excepcions o situacions de competició. StreamIt to OmpSs és unaeina per convertir un programa de streams codificat en el llenguatge StreamIt aun programa de tasques en StarSs planificat de forma dinàmica.Postprint (published version

    A Model-based Design Framework for Application-specific Heterogeneous Systems

    Get PDF
    The increasing heterogeneity of computing systems enables higher performance and power efficiency. However, these improvements come at the cost of increasing the overall complexity of designing such systems. These complexities include constructing implementations for various types of processors, setting up and configuring communication protocols, and efficiently scheduling the computational work. The process for developing such systems is iterative and time consuming, with no well-defined performance goal. Current performance estimation approaches use source code implementations that require experienced developers and time to produce. We present a framework to aid in the design of heterogeneous systems and the performance tuning of applications. Our framework supports system construction: integrating custom hardware accelerators with existing cores into processors, integrating processors into cohesive systems, and mapping computations to processors to achieve overall application performance and efficient hardware usage. It also facilitates effective design space exploration using processor models (for both existing and future processors) that do not require source code implementations to estimate performance. We evaluate our framework using a variety of applications and implement them in systems ranging from low power embedded systems-on-chip (SoC) to high performance systems consisting of commercial-off-the-shelf (COTS) components. We show how the design process is improved, reducing the number of design iterations and unnecessary source code development ultimately leading to higher performing efficient systems

    Accelerating advanced preconditioning methods on hybrid architectures

    Get PDF
    Un gran número de problemas, en diversas áreas de la ciencia y la ingeniería, involucran la solución de sistemas dispersos de ecuaciones lineales de gran escala. En muchos de estos escenarios, son además un cuello de botella desde el punto de vista computacional, y por esa razón, su implementación eficiente ha motivado una cantidad enorme de trabajos científicos. Por muchos años, los métodos directos basados en el proceso de la Eliminación Gaussiana han sido la herramienta de referencia para resolver dichos sistemas, pero la dimensión de los problemas abordados actualmente impone serios desafíos a la mayoría de estos algoritmos, considerando sus requerimientos de memoria, su tiempo de cómputo y la complejidad de su implementación. Propulsados por los avances en las técnicas de precondicionado, los métodos iterativos se han vuelto más confiables, y por lo tanto emergen como alternativas a los métodos directos, ofreciendo soluciones de alta calidad a un menor costo computacional. Sin embargo, estos avances muchas veces son relativos a un problema específico, o dotan a los precondicionadores de una complejidad tal, que su aplicación en diversos problemas se vuelve poco práctica en términos de tiempo de ejecución y consumo de memoria. Como respuesta a esta situación, es común la utilización de estrategias de Computación de Alto Desempeño, ya que el desarrollo sostenido de las plataformas de hardware permite la ejecución simultánea de cada vez más operaciones. Un claro ejemplo de esta evolución son las plataformas compuestas por procesadores multi-núcleo y aceleradoras de hardware como las Unidades de Procesamiento Gráfico (GPU). Particularmente, las GPU se han convertido en poderosos procesadores paralelos, capaces de integrar miles de núcleos a precios y consumo energético razonables.Por estas razones, las GPU son ahora una plataforma de hardware de gran importancia para la ciencia y la ingeniería, y su uso eficiente es crucial para alcanzar un buen desempeño en la mayoría de las aplicaciones. Esta tesis se centra en el uso de GPUs para acelerar la solución de sistemas dispersos de ecuaciones lineales usando métodos iterativos precondicionados con técnicas modernas. En particular, se trabaja sobre ILUPACK, que ofrece implementaciones de los métodos iterativos más importantes, y presenta un interesante y moderno precondicionador de tipo ILU multinivel. En este trabajo, se desarrollan versiones del precondicionador y de los métodos incluidos en el paquete, capaces de explotar el paralelismo de datos mediante el uso de GPUs sin afectar las propiedades numéricas del precondicionador. Además, se habilita y analiza el uso de las GPU en versiones paralelas existentes, basadas en paralelismo de tareas para plataformas de memoria compartida y distribuida. Los resultados obtenidos muestran una sensible mejora en el tiempo de ejecución de los métodos abordados, así como la posibilidad de resolver problemas de gran escala de forma eficiente
    • …
    corecore