    Enabling GPU Support for the COMPSs-Mobile Framework

    Using the GPUs embedded in mobile devices increases the performance of the applications running on them while reducing the energy consumption of their execution. This article presents a task-based solution for adaptive, collaborative heterogeneous computing in mobile cloud environments. To implement our proposal, we extend the COMPSs-Mobile framework – an implementation of the COMPSs programming model for building mobile applications that offload part of the computation to the Cloud – to support offloading computation to GPUs through OpenCL. To evaluate our solution, we subject the prototype to three benchmark applications representing different application patterns. This work is partially supported by the Joint-Laboratory on Extreme Scale Computing (JLESC), by the European Union through the Horizon 2020 research and innovation programme under contract 687584 (TANGO Project), by the Spanish Government (TIN2015-65316-P, BES-2013-067167, EEBB-2016-11272, SEV-2011-00067) and by the Generalitat de Catalunya (2014-SGR-1051).
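
    The abstract mentions offloading computation through OpenCL; as a hedged, framework-independent sketch of what such an offload involves on the host side (a vector addition with error handling elided; the kernel and all names are ours, not the framework's), consider:

```cpp
// Minimal OpenCL host-side sketch: offload a vector addition to the first GPU.
// Illustrative only -- a runtime such as COMPSs-Mobile generates and manages
// this kind of boilerplate on behalf of the application programmer.
#include <CL/cl.h>
#include <cstdio>
#include <vector>

static const char* kSrc =
    "__kernel void vadd(__global const float* a, __global const float* b,"
    "                   __global float* c) {"
    "  int i = get_global_id(0);"
    "  c[i] = a[i] + b[i];"
    "}";

int main() {
  const size_t n = 1024;
  std::vector<float> a(n, 1.0f), b(n, 2.0f), c(n, 0.0f);

  cl_platform_id platform; clGetPlatformIDs(1, &platform, nullptr);
  cl_device_id device;     clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, nullptr);
  cl_context ctx = clCreateContext(nullptr, 1, &device, nullptr, nullptr, nullptr);
  // Deprecated since OpenCL 2.0 (use clCreateCommandQueueWithProperties there).
  cl_command_queue q = clCreateCommandQueue(ctx, device, 0, nullptr);

  // Copy inputs to device buffers; allocate an output buffer.
  cl_mem da = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                             n * sizeof(float), a.data(), nullptr);
  cl_mem db = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                             n * sizeof(float), b.data(), nullptr);
  cl_mem dc = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, n * sizeof(float), nullptr, nullptr);

  // Build the kernel from source and bind its arguments.
  cl_program prog = clCreateProgramWithSource(ctx, 1, &kSrc, nullptr, nullptr);
  clBuildProgram(prog, 1, &device, nullptr, nullptr, nullptr);
  cl_kernel k = clCreateKernel(prog, "vadd", nullptr);
  clSetKernelArg(k, 0, sizeof(da), &da);
  clSetKernelArg(k, 1, sizeof(db), &db);
  clSetKernelArg(k, 2, sizeof(dc), &dc);

  // Launch one work-item per element, then read the result back.
  clEnqueueNDRangeKernel(q, k, 1, nullptr, &n, nullptr, 0, nullptr, nullptr);
  clEnqueueReadBuffer(q, dc, CL_TRUE, 0, n * sizeof(float), c.data(), 0, nullptr, nullptr);
  std::printf("c[0] = %f\n", c[0]);  // expect 3.0
  return 0;
}
```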

    Algorithm Libraries for Multi-Core Processors

    By providing parallelized versions of established algorithm libraries, we make it easy for programmers to exploit the multiple cores of modern processors. The Multi-Core STL provides basic algorithms for internal memory, while the parallelized STXXL enables multi-core acceleration for algorithms on large data sets stored on disk. Some parallelized geometric algorithms are introduced into CGAL. Further, we design and implement sorting algorithms for huge data volumes in distributed external memory.
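
    The interface style pioneered by the MCSTL, plain STL calls that transparently use multiple cores, survives in the C++17 standard library's parallel algorithms (GCC's implementation requires linking against TBB). A minimal example:

```cpp
// C++17 parallel algorithms in the spirit of the Multi-Core STL:
// the call site stays an ordinary STL call, and the library spreads
// the work across the available cores.
#include <algorithm>
#include <execution>
#include <random>
#include <vector>

int main() {
  std::vector<double> data(1 << 24);
  std::mt19937 gen(42);
  std::uniform_real_distribution<double> dist(0.0, 1.0);
  for (auto& x : data) x = dist(gen);

  // Sequential and parallel variants differ only in the execution policy.
  std::sort(std::execution::par, data.begin(), data.end());
  return std::is_sorted(data.begin(), data.end()) ? 0 : 1;
}
```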

    Efficient Execution of Sequential Instruction Streams by Physical Machines

    Any computational model which relies on a physical system is likely to be subject to the fact that information density and speed have intrinsic, ultimate limits. The RAM model, and in particular the underlying assumption that memory accesses can be carried out in time independent from memory size itself, is not physically implementable. This work develops in the field of limiting-technology machines, in which it is somewhat provocatively assumed that technology has reached its physical limits. The ultimate goal is to tackle the intrinsic latencies of physical systems by encouraging scalable organizations for processors and memories. An algorithmic study is presented that shows how high-concurrency programs can be implemented on SP and SPE, sequential machine models able to compute direct-flow programs in optimal time. Then, a novel pipelined, hierarchical memory organization is presented, with optimal latency and bandwidth for a physical system. In order to both take full advantage of the memory's capabilities and exploit the available instruction-level parallelism of the code to be executed, a novel processor model is developed. Particular care is taken in devising an efficient information flow within the processor itself. Both designs are extremely scalable, as they are based on fixed-capacity, fixed-size nodes connected as a multidimensional array. Performance analysis of the resulting machine design has led to the discovery that latencies internal to the processor can be the dominating source of complexity in instruction-flow execution, which adds to the effects of processor-memory interaction. A characterization of instruction flows is then developed, based on the topology induced by instruction dependences.
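
    The premise that memory access time cannot be independent of memory size is easy to observe on real hardware. The following pointer-chasing sketch (sizes and step count are arbitrary choices, not taken from the thesis) measures how the average latency of serially dependent loads grows as the working set falls out of successive cache levels:

```cpp
// Pointer-chasing microbenchmark: the average latency of serially
// dependent loads grows with the working-set size as accesses fall out
// of successive cache levels, contrary to the RAM model's O(1) assumption.
#include <chrono>
#include <cstdio>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

int main() {
  std::mt19937_64 rng(1);
  for (size_t n : {1u << 12, 1u << 16, 1u << 20, 1u << 24}) {
    // Build a single random cycle (Sattolo's algorithm) over n slots so
    // that the chase visits every slot before repeating.
    std::vector<size_t> next(n);
    std::iota(next.begin(), next.end(), size_t{0});
    for (size_t k = n - 1; k > 0; --k) {
      std::uniform_int_distribution<size_t> pick(0, k - 1);
      std::swap(next[k], next[pick(rng)]);
    }

    const size_t steps = 10'000'000;
    size_t i = 0;
    auto t0 = std::chrono::steady_clock::now();
    for (size_t s = 0; s < steps; ++s) i = next[i];  // each load depends on the last
    auto t1 = std::chrono::steady_clock::now();
    double ns = std::chrono::duration<double, std::nano>(t1 - t0).count() / steps;
    std::printf("%9zu elements: %6.2f ns/access (checksum %zu)\n", n, ns, i);
  }
  return 0;
}
```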

    Programming and parallelising applications for distributed infrastructures

    The last decade has witnessed unprecedented changes in parallel and distributed infrastructures. Due to the diminished gains in processor performance from increasing clock frequency, manufacturers have moved from uniprocessor architectures to multicores; as a result, clusters of computers have incorporated such new CPU designs. Furthermore, the ever-growing need of scientific applications for computing and storage capabilities has motivated the appearance of grids: geographically-distributed, multi-domain infrastructures based on sharing of resources to accomplish large and complex tasks. More recently, clouds have emerged by combining virtualisation technologies, service-orientation and business models to deliver IT resources on demand over the Internet. The size and complexity of these new infrastructures poses a challenge for programmers to exploit them. On the one hand, some of the difficulties are inherent to concurrent and distributed programming themselves, e.g. dealing with thread creation and synchronisation, messaging, data partitioning and transfer, etc. On the other hand, other issues are related to the singularities of each scenario, like the heterogeneity of Grid middleware and resources or the risk of vendor lock-in when writing an application for a particular Cloud provider. In the face of such a challenge, programming productivity - understood as a trade-off between programmability and performance - has become crucial for software developers. There is a strong need for high-productivity programming models and languages, which should provide simple means for writing parallel and distributed applications that can run on current infrastructures without sacrificing performance. In that sense, this thesis contributes Java StarSs, a programming model and runtime system for developing and parallelising Java applications on distributed infrastructures. The model has two key features: first, the user programs in a fully-sequential standard-Java fashion - no parallel construct, API call or pragma must be included in the application code; second, it is completely infrastructure-unaware, i.e. programs do not contain any details about deployment or resource management, so that the same application can run in different infrastructures with no changes. The only requirement for the user is to select the application tasks, which are the model's unit of parallelism. Tasks can be either regular Java methods or web service operations, and they can handle any data type supported by the Java language, namely files, objects, arrays and primitives. For the sake of simplicity of the model, Java StarSs shifts the burden of parallelisation from the programmer to the runtime system. The runtime is responsible for modifying the original application to make it create asynchronous tasks and synchronise data accesses from the main program. Moreover, the implicit inter-task concurrency is automatically discovered as the application executes, thanks to a data dependency detection mechanism that integrates all the Java data types. This thesis provides a fairly comprehensive evaluation of Java StarSs on three different distributed scenarios: Grid, Cluster and Cloud. For each of them, a runtime system was designed and implemented to exploit their particular characteristics as well as to address their issues, while keeping the infrastructure unawareness of the programming model. The evaluation compares Java StarSs against state-of-the-art solutions, both in terms of programmability and performance, and demonstrates how the model can bring remarkable productivity to programmers of parallel distributed applications.
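
    Java StarSs programs are plain sequential Java, so there is no user-visible parallel code to show. Purely to illustrate the dataflow that the runtime builds automatically, here is a hand-wired C++ analogue in which futures make the inter-task dependencies explicit; the task functions and values are invented for illustration:

```cpp
// Hand-rolled dataflow with std::async: each "task" runs asynchronously and
// later tasks wait on the futures of the data they consume. Java StarSs
// derives this dependency graph automatically from a sequential program;
// here we wire it up by hand to show the bookkeeping the runtime performs.
#include <cstdio>
#include <future>

static int produce(int seed)     { return seed * 2; }
static int combine(int a, int b) { return a + b; }

int main() {
  // Two independent tasks: may run in parallel.
  auto fa = std::async(std::launch::async, produce, 1);
  auto fb = std::async(std::launch::async, produce, 2);

  // A dependent task: blocks on its inputs, mirroring a true data dependence.
  auto fc = std::async(std::launch::async,
                       [&] { return combine(fa.get(), fb.get()); });

  std::printf("result = %d\n", fc.get());  // 2*1 + 2*2 = 6
  return 0;
}
```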

    Runtime-adaptive generalized task parallelism

    Multi-core systems are ubiquitous nowadays and their number is ever increasing. And while, limited by physical constraints, the computational power of the individual cores has been stagnating or even declining for years, a solution that effectively utilizes the computational power of the additional cores is yet to be found. Existing approaches to automatic parallelization are often highly specialized to exploit the parallelism of specific program patterns, and thus parallelize only a small subset of programs. In addition, frequently used invasive runtime systems prohibit the combination of different approaches, which impedes the practicality of automatic parallelization. In this thesis, we show that specializing to narrowly defined program patterns is not necessary to efficiently parallelize applications from different domains. We develop a generalizing approach to parallelization which, driven by an underlying mathematical optimization problem, is able to make qualified parallelization decisions that take the involved runtime overhead into account. In combination with a specializing, adaptive runtime system, the approach is able to match and even exceed the performance results achieved by specialized approaches. Part of the work presented in this thesis was performed in the context of the Software-Cluster project EMERGENT (http://www.software-cluster.org). It was funded by the German Federal Ministry of Education and Research (BMBF) under grant no. 01IC10S01. Later work has also been supported by the German Federal Ministry of Education and Research (BMBF), through funding for the Center for IT-Security, Privacy and Accountability (CISPA) under grant no. 16KIS0344.
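
    As a toy stand-in for the overhead-aware decision making described above (the cost constants and the splitting rule are invented and far simpler than the thesis's optimization problem), a runtime might weigh spawn overhead against estimated work like this:

```cpp
// Toy overhead-aware parallelization decision: split a reduction across
// threads only when the estimated saved work exceeds the assumed cost of
// spawning a thread. Constants are invented for illustration.
#include <cstdio>
#include <numeric>
#include <thread>
#include <vector>

constexpr double kSpawnOverheadNs = 50'000.0;  // assumed per-spawn cost
constexpr double kPerElementNs = 2.0;          // assumed per-element cost

double sum_range(const double* p, size_t n) {
  // Stay sequential when half the work is cheaper than one thread spawn.
  if (n * kPerElementNs / 2.0 <= kSpawnOverheadNs)
    return std::accumulate(p, p + n, 0.0);
  double hi = 0.0;
  std::thread t([&] { hi = sum_range(p + n / 2, n - n / 2); });
  double lo = sum_range(p, n / 2);
  t.join();
  return lo + hi;
}

int main() {
  std::vector<double> v(1 << 22, 1.0);
  std::printf("sum = %.0f\n", sum_range(v.data(), v.size()));
  return 0;
}
```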

    Development of a performance analysis environment for parallel pattern-based applications

    One of the challenges the scientific community faces nowadays is the parallel processing of data. Every day we produce ever more overwhelming amounts of data, to the point that data volumes grow exponentially and cannot be processed as we would like. The importance of this processing stems mainly from the scientific need to discover new advances in science, or simply new algorithms capable of solving experiments that grow ever more complex and whose resolution was unfeasible years ago with the resources then available. As a consequence, the internal architecture of computers has changed to increase their computing capacity and thus cope with the need for massive data processing. To this end, the scientific community has implemented various pattern-based parallel programming frameworks in order to compute experiments in a faster and more efficient way. Unfortunately, using these programming paradigms is not a simple task, since it requires expertise and programming skills. This is further complicated when developers are not aware of the program's internal behaviour, which leads to unexpected results in certain parts of the code. Inevitably, the need arises for tools that help this community analyze the performance and results of their experiments. Hence, this bachelor's thesis presents the development of a performance analysis environment for parallel pattern-based applications as a solution to that problem. Specifically, this environment comprises two commonly used techniques, profiling and tracing, which have been added to the GrPPI framework. In this way, users can obtain a general assessment of their applications' performance and act according to the results obtained.
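
    Without reproducing GrPPI's actual instrumentation, a minimal sketch of the kind of probe such a profiling and tracing environment can wrap around each pattern stage might look as follows; all names are illustrative, not GrPPI's:

```cpp
// Minimal RAII profiling probe: measures the wall-clock time of a labelled
// region and appends it to an in-memory trace. A pattern framework can wrap
// each pipeline/farm stage in one of these to obtain per-stage profiles
// without touching user code.
#include <chrono>
#include <cstdio>
#include <mutex>
#include <string>
#include <vector>

struct TraceEvent { std::string label; double ms; };
static std::vector<TraceEvent> g_trace;
static std::mutex g_trace_mu;

class ScopedProbe {
 public:
  explicit ScopedProbe(std::string label)
      : label_(std::move(label)), start_(std::chrono::steady_clock::now()) {}
  ~ScopedProbe() {
    auto end = std::chrono::steady_clock::now();
    double ms = std::chrono::duration<double, std::milli>(end - start_).count();
    std::lock_guard<std::mutex> lk(g_trace_mu);  // trace may be shared by threads
    g_trace.push_back({label_, ms});
  }
 private:
  std::string label_;
  std::chrono::steady_clock::time_point start_;
};

int main() {
  {
    ScopedProbe probe("stage:map");
    volatile double acc = 0;
    for (int i = 0; i < 10'000'000; ++i) acc = acc + i * 0.5;  // stand-in workload
  }
  for (const auto& e : g_trace) std::printf("%s took %.2f ms\n", e.label.c_str(), e.ms);
  return 0;
}
```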

    Models for Parallel Computation in Multi-Core, Heterogeneous, and Ultra Wide-Word Architectures

    Multi-core processors have become the dominant processor architecture with 2, 4, and 8 cores on a chip being widely available and an increasing number of cores predicted for the future. In addition, the decreasing costs and increasing programmability of Graphics Processing Units (GPUs) have made these an accessible source of parallel processing power in general purpose computing. Among the many research challenges that this scenario has raised are the fundamental problems related to theoretical modeling of computation in these architectures. In this thesis we study several aspects of computation in modern parallel architectures, from modeling of computation in multi-cores and heterogeneous platforms, to multi-core cache management strategies, through the proposal of an architecture that exploits bit-parallelism on thousands of bits. Observing that in practice multi-cores have a small number of cores, we propose a model for low-degree parallelism for these architectures. We argue that assuming a small number of processors (logarithmic in a problem's input size) simplifies the design of parallel algorithms. We show that in this model a large class of divide-and-conquer and dynamic programming algorithms can be parallelized with simple modifications to sequential programs, while achieving optimal parallel speedups. We further explore low-degree parallelism in computation, providing evidence of fundamental differences in practice and theory between systems with a sublinear and linear number of processors, and suggesting a sharp theoretical gap between the classes of problems that are efficiently parallelizable in each case. Efficient strategies to manage shared caches play a crucial role in multi-core performance. We propose a model for paging in multi-core shared caches, which extends classical paging to a setting in which several threads share the cache. We show that in this setting traditional cache management policies perform poorly, and that any effective strategy must partition the cache among threads, with a partition that adapts dynamically to the demands of each thread. Inspired by the shared cache setting, we introduce the minimum cache usage problem, an extension to classical sequential paging in which algorithms must account for the amount of cache they use. This cache-aware model seeks algorithms with good performance in terms of faults and the amount of cache used, and has applications in energy efficient caching and in shared cache scenarios. The wide availability of GPUs has added to the parallel power of multi-cores; however, most applications underutilize the available resources. We propose a model for hybrid computation in heterogeneous systems with multi-cores and GPU, and describe strategies for generic parallelization and efficient scheduling of a large class of divide-and-conquer algorithms. Lastly, we introduce the Ultra-Wide Word architecture and model, an extension of the word-RAM model, that allows for constant time operations on thousands of bits in parallel. We show that a large class of existing algorithms can be implemented in the Ultra-Wide Word model, achieving speedups comparable to those of multi-threaded computations, while avoiding the more difficult aspects of parallel programming.
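
    As a concrete instance of the low-degree-parallelism idea, that divide-and-conquer programs need only simple modifications when the processor count is small, consider a mergesort that spawns new threads only until the recursion has produced roughly one branch per core. The cutoff policy below is a plausible sketch, not the thesis's construction:

```cpp
// Mergesort with a thread budget: spawning stops once the recursion has
// produced roughly one branch per core, so only O(p) threads exist for
// p processors; below the cutoff the code is the plain sequential sort.
#include <algorithm>
#include <cstdio>
#include <thread>
#include <vector>

void msort(std::vector<int>& v, size_t lo, size_t hi, unsigned budget) {
  if (hi - lo < 2) return;
  size_t mid = lo + (hi - lo) / 2;
  if (budget > 1) {
    // Split the remaining thread budget between the two halves.
    std::thread t([&] { msort(v, lo, mid, budget / 2); });
    msort(v, mid, hi, budget - budget / 2);
    t.join();
  } else {
    msort(v, lo, mid, 1);
    msort(v, mid, hi, 1);
  }
  std::inplace_merge(v.begin() + lo, v.begin() + mid, v.begin() + hi);
}

int main() {
  std::vector<int> v(1 << 20);
  for (size_t i = 0; i < v.size(); ++i) v[i] = static_cast<int>(v.size() - i);
  unsigned p = std::max(1u, std::thread::hardware_concurrency());
  msort(v, 0, v.size(), p);
  std::printf("sorted: %s\n", std::is_sorted(v.begin(), v.end()) ? "yes" : "no");
  return 0;
}
```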

    Locality Awareness for Task Parallel Computation

    The task parallel programming model allows programmers to express concurrency at a high level of abstraction and places the burden of scheduling parallel execution on the runtime system. Efficient scheduling of tasks on multi-socket multicore shared memory systems requires careful consideration of an increasingly complex memory hierarchy, including shared caches and non-uniform memory access (NUMA) characteristics. In this dissertation, we study the performance impact of these issues and other factors that limit parallel speedup in task parallel program executions, and we propose new scheduling strategies to improve performance. Our performance model characterizes lost efficiency in terms of overhead time, idle time, and work time inflation due to increased data access costs. We introduce a hierarchical runtime scheduler that combines the benefits of work-stealing and parallel depth-first schedulers. Matching the scheduler design to the memory hierarchy of multicore NUMA systems limits costly remote data accesses while maintaining load balance and exploiting constructive data sharing among threads that share a cache. We also propose a locality-based scheduling framework built on locality domains, comprising an API for programmers to specify application locality and a scheduler that honors those specifications. Implementations of the hierarchical and locality-based schedulers in our OpenMP runtime system exhibit performance improvements on several task parallel benchmark applications over existing scheduling strategies and production OpenMP runtime systems.
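
    The schedulers described here live inside an OpenMP runtime. As a simplified language-level sketch of the work-stealing half of the design (mutex-based rather than lock-free, with invented names), per-worker deques with LIFO local pops and FIFO steals look like this:

```cpp
// Simplified work-stealing scheme: each worker pushes and pops tasks at the
// back of its own deque (the cache-hot end) and, when its deque is empty,
// steals from the front of a victim's deque (the oldest, likely-cold tasks).
// A real runtime uses lock-free deques and NUMA-aware victim selection.
#include <deque>
#include <functional>
#include <mutex>
#include <optional>

struct WorkerQueue {
  std::deque<std::function<void()>> tasks;
  std::mutex mu;

  void push(std::function<void()> t) {
    std::lock_guard<std::mutex> lk(mu);
    tasks.push_back(std::move(t));
  }
  std::optional<std::function<void()>> pop_local() {   // owner: LIFO, hot end
    std::lock_guard<std::mutex> lk(mu);
    if (tasks.empty()) return std::nullopt;
    auto t = std::move(tasks.back()); tasks.pop_back();
    return t;
  }
  std::optional<std::function<void()>> steal() {       // thief: FIFO, cold end
    std::lock_guard<std::mutex> lk(mu);
    if (tasks.empty()) return std::nullopt;
    auto t = std::move(tasks.front()); tasks.pop_front();
    return t;
  }
};

int main() {
  WorkerQueue q;
  int sum = 0;
  for (int i = 1; i <= 3; ++i) q.push([&sum, i] { sum += i; });
  if (auto t = q.steal()) (*t)();         // a thief takes the oldest task
  while (auto t = q.pop_local()) (*t)();  // the owner drains the rest LIFO
  return sum == 6 ? 0 : 1;
}
```

    Popping locally from the hot end keeps recently touched data in the owner's cache, while stealing from the cold end hands thieves work that the victim is least likely to revisit soon.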

    A component-based framework for certification of components in a cloud of HPC services

    HPC Shelf is a proposal of a cloud computing platform to provide component-oriented services for High Performance Computing (HPC) applications. This paper presents a Verification-as-a-Service (VaaS) framework for component certification on HPC Shelf. Certification is aimed at providing higher confidence that components of parallel computing systems of HPC Shelf behave as expected according to one or more requirements expressed in their contracts. To this end, new abstractions are introduced, starting with certifier components. They are designed to inspect other components and verify them for different types of functional, non-functional and behavioral requirements. The certification framework is naturally based on parallel computing techniques to speed up verification tasks. NORTE-01-0145-FEDER-000037
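
    The paper works at the architectural level; purely to fix the shape of the certifier abstraction, a hypothetical interface might look like the following sketch, whose names are ours and not those of HPC Shelf:

```cpp
// Illustrative certifier-component interface: a certifier inspects another
// component against one requirement from its contract and reports a verdict.
// This only sketches the shape of the abstraction, not the paper's API;
// a real certifier would wrap a model checker or theorem prover.
#include <cstdio>
#include <string>

struct Requirement { std::string id; std::string description; };
struct Verdict     { std::string requirement_id; bool certified; std::string evidence; };

class Certifier {
 public:
  virtual ~Certifier() = default;
  // Verify one contract requirement of the named target component.
  virtual Verdict verify(const std::string& component, const Requirement& r) = 0;
};

// Trivial certifier that "passes" everything -- a placeholder showing where
// a real verification back end would plug in behind the interface.
class TrustingCertifier : public Certifier {
 public:
  Verdict verify(const std::string& component, const Requirement& r) override {
    return {r.id, true, "assumed correct for " + component};
  }
};

int main() {
  TrustingCertifier c;
  Verdict v = c.verify("solver", {"REQ-1", "terminates on all inputs"});
  std::printf("%s: %s\n", v.requirement_id.c_str(), v.certified ? "certified" : "rejected");
  return 0;
}
```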