11 research outputs found

    The Templet parallel computing system: specification, implementation, application

    The paper describes an implemented prototype of the Templet parallel computing system for the C++ programming language. The system uses a new version of the actor execution model. The design of this actor model makes it possible to define the behaviour of the parallel program under development with mathematical precision, using formulas of temporal logic. This property matters because it gives the developer the freedom to implement actor computations on an arbitrary platform. The presented variant of the actor model can be implemented easily on diverse hardware and in diverse programming languages for computation on shared-memory systems. The paper defines the Templet actor model using TLA logic, discusses the design of the programming system, and describes several examples of practical use of the presented parallel computing system. The work was carried out with state support from the Ministry of Education and Science of the Russian Federation within the Programme for raising the competitiveness of Samara National Research University (named after Academician S.P. Korolev) among the world's leading research and educational centres for 2013-2020, and was partially supported by RFBR grant No. 15-08-05934 A.
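
    As a minimal illustration of the shared-memory actor style the paper targets (a hypothetical sketch; the class and member names are invented and are not the actual Templet API): each actor owns a mailbox and handles one message at a time, so its internal state needs no further locking.

        // Hypothetical shared-memory actor sketch (not the Templet API).
        #include <condition_variable>
        #include <functional>
        #include <mutex>
        #include <queue>
        #include <thread>

        class Actor {
        public:
            Actor() : worker_([this] { run(); }) {}
            ~Actor() { send({}); worker_.join(); }   // empty message = shutdown

            // Messages are closures executed one at a time, so the
            // actor's state is never accessed concurrently.
            void send(std::function<void()> msg) {
                { std::lock_guard<std::mutex> lk(m_); mailbox_.push(std::move(msg)); }
                cv_.notify_one();
            }

        private:
            void run() {
                for (;;) {
                    std::unique_lock<std::mutex> lk(m_);
                    cv_.wait(lk, [this] { return !mailbox_.empty(); });
                    auto msg = std::move(mailbox_.front());
                    mailbox_.pop();
                    lk.unlock();
                    if (!msg) return;   // shutdown sentinel
                    msg();              // handle one message sequentially
                }
            }
            std::mutex m_;
            std::condition_variable cv_;
            std::queue<std::function<void()>> mailbox_;
            std::thread worker_;   // declared last: starts after state is ready
        };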

    A semi-automatic parallelization tool for Java based on fork-join synchronization patterns

    Because of the increasing availability of multi-core machines, clusters, Grids, and combinations of these environments, there is now plenty of computational power available for executing compute-intensive applications. However, because of the overwhelming and rapid advances in distributed and parallel hardware and environments, today's programmers are not fully prepared to exploit distribution and parallelism. The Java language has helped in handling the heterogeneity of such environments, but there is a lack of facilities and tools for easily distributing and parallelizing applications. One solution to mitigate this problem, and a step towards producing general tools, seems to be the synthesis of semi-automatic parallelism and Parallelism as a Concern (PaaC), which allows applications to be parallelized with as few modifications to sequential code as possible. In this paper, we discuss a new approach that aims at overcoming the drawbacks of current Java-based parallel and distributed development tools by exploiting precisely these concepts.
    Fil: Hirsch, Matias. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico - CONICET - Tandil. Instituto Superior de Ingeniería del Software; Argentina.
    Fil: Zunino, Alejandro. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico - CONICET - Tandil. Instituto Superior de Ingeniería del Software.
    Fil: Mateos Diaz, Cristian Maximiliano. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico - CONICET - Tandil. Instituto Superior de Ingeniería del Software.
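
    The tool itself targets Java, but the fork-join synchronization pattern it builds on is language-independent. As an illustration (rendered in C++ for consistency with the other sketches in this listing, not taken from the paper), a divide-and-conquer sum that forks the left half into an asynchronous task and joins on its future:

        // Fork-join divide-and-conquer sum (illustrative only).
        #include <cstddef>
        #include <future>
        #include <numeric>
        #include <vector>

        long long sum(const std::vector<int>& v, std::size_t lo, std::size_t hi) {
            if (hi - lo < 4096)   // small range: solve sequentially
                return std::accumulate(v.begin() + lo, v.begin() + hi, 0LL);
            std::size_t mid = lo + (hi - lo) / 2;
            auto left = std::async(std::launch::async,      // fork left half
                                   sum, std::cref(v), lo, mid);
            long long right = sum(v, mid, hi);              // compute right locally
            return left.get() + right;                      // join point
        }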

    Efficient and Reasonable Object-Oriented Concurrency

    Making threaded programs safe and easy to reason about is one of the chief difficulties in modern programming. This work provides an efficient execution model for SCOOP, a concurrency approach that provides not only data race freedom but also pre/postcondition reasoning guarantees between threads. The extensions we propose influence the underlying semantics to increase the amount of concurrent execution that is possible, exclude certain classes of deadlocks, and enable greater performance. These extensions are used as the basis of an efficient runtime and optimization pass that improve performance 15x over a baseline implementation. This new implementation of SCOOP is also 2x faster than other well-known safe concurrent languages. The measurements are based on both coordination-intensive and data-manipulation-intensive benchmarks designed to offer a mixture of workloads.
    Comment: Proceedings of the 10th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC/FSE '15). ACM, 2015.
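
    A rough C++ analogue (hypothetical; SCOOP itself is an Eiffel mechanism, and this sketch only mimics one aspect of it) of the reasoning guarantee mentioned above: while one client holds a reservation on a separate object, its sequence of calls cannot be interleaved by other threads, so a postcondition established by one call still holds as the next call's precondition.

        // Hypothetical C++ analogue of a SCOOP "separate" object.
        #include <mutex>

        template <typename T>
        class Separate {
        public:
            // Reserve the object, run a block of calls against it
            // without interleaving, then release.
            template <typename F>
            auto apply(F&& block) {
                std::lock_guard<std::mutex> lk(m_);
                return block(obj_);
            }
        private:
            std::mutex m_;
            T obj_;
        };

        // Usage: acct.apply([](Account& a) { a.deposit(10); a.withdraw(5); });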

    Parallel Optimisations of Perceived Quality Simulations

    Processor architectures have changed significantly, with fast single-core processors replaced by a diverse range of multicore processors. New architectures require code to be executed in parallel to realise these performance gains. This is straightforward for small applications, where analysis and refactoring are simple or existing tools can parallelise automatically; however, the organic growth and large, complicated data structures common in mature industrial applications can make parallelisation more difficult. One such application is studied: a mature Windows C++ application used for the visualisation of Perceived Quality (PQ). PQ simulations enable the visualisation of how manufacturing variations affect the look of the final product. The application is in common use but suffers from performance issues, and previous parallelisation attempts have failed. The issues associated with parallelising a mature industrial application are investigated, and a methodology to investigate, analyse and evaluate the available methods and tools is produced. The shortfalls of these methods and tools are identified, and the methods used to overcome them are explained. The parallel version of the software is evaluated for performance, and case studies centring on the application's significant use cases help to understand the impact on the user. Automatically parallelising compilers provided no parallelism, while manual parallelisation using OpenMP required significant refactoring, and a number of data-dependency issues left some code serialised. Performance scaled with the number of physical cores on certain problems; however, the unresolved bottlenecks produced mixed results for users: verification use cases benefited, while those in early design stages did not. Without tools to aid the analysis of complex data structures, parallelism could remain out of reach for industrial applications. Methods used successfully here, such as code isolation and serialisation, could be applied effectively by such tools.
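
    An illustrative example (not the thesis code) of the kind of manual OpenMP parallelisation described: an independent per-pixel loop parallelises cleanly, while a loop-carried dependency is the kind of code that has to remain serialised.

        // Illustrative OpenMP parallelisation sketch.
        #include <cstddef>
        #include <vector>

        void shade(std::vector<float>& px, float gain) {
            // Independent per-pixel work: parallelises cleanly.
            #pragma omp parallel for
            for (long i = 0; i < static_cast<long>(px.size()); ++i)
                px[i] *= gain;

            // Loop-carried dependency: each pixel reads its predecessor,
            // so this loop must stay serial.
            for (std::size_t i = 1; i < px.size(); ++i)
                px[i] += 0.5f * px[i - 1];
        }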

    Exploiting Hardware Abstraction for Parallel Programming Framework: Platform and Multitasking

    With the help of the parallelism provided by their fine-grained architecture, hardware accelerators on Field Programmable Gate Arrays (FPGAs) can significantly improve the performance of many applications. However, designers are required to have excellent hardware programming skills and unique optimization techniques to fully explore the potential of FPGA resources. Intermediate frameworks above hardware circuits have been proposed to improve either performance or productivity by leveraging parallel programming models beyond the multi-core era. In this work, we propose the PolyPC (Polymorphic Parallel Computing) framework, which targets enhancing productivity without losing performance. It helps designers develop parallelized applications and implement them on FPGAs. The PolyPC framework implements a custom hardware platform on which programs written in an OpenCL-like programming model can be launched. Additionally, the PolyPC framework extends vendor-provided tools to provide a complete development environment, including an intermediate software framework and automatic system builders. Designers' programs can be either synthesized as hardware processing elements (PEs) or compiled into executable files running on software PEs. Benefiting from re-loadable PEs and independent group-level schedulers, multitasking is enabled for both software and hardware PEs to improve the efficiency of hardware resource utilization. The PolyPC framework is evaluated with respect to performance, area efficiency, and multitasking. The results show a maximum speedup of 66x over a dual-core ARM processor and of 1043x over a high-performance MicroBlaze, with 125x area efficiency. The framework delivers a significant improvement in response time to high-priority tasks under priority-aware scheduling, and the overheads of multitasking are evaluated to analyze the trade-offs. With the help of the design flow, OpenCL application programs are converted into executables through front-end source-to-source transformation and back-end synthesis/compilation to run on PEs, and the framework is generated from users' specifications.
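
    A sketch of the OpenCL-like model described above, with invented names (PolyPC's actual kernel syntax may differ): the point is that one kernel body, indexed by a global work-item id, can back either a synthesized hardware PE or a compiled software PE.

        // Invented kernel in the OpenCL style; 'gid' plays the role
        // of OpenCL's get_global_id(0).
        void vadd_kernel(const float* a, const float* b, float* out,
                         float scale, int gid) {
            out[gid] = scale * a[gid] + b[gid];
        }

        // A software PE sweeps the kernel over its slice of the index
        // space; a hardware PE would be synthesised from the same body.
        void run_on_software_pe(const float* a, const float* b, float* out,
                                float scale, int begin, int end) {
            for (int gid = begin; gid < end; ++gid)
                vadd_kernel(a, b, out, scale, gid);
        }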

    Hardware design of task superscalar architecture

    Exploiting concurrency to achieve greater performance is a difficult and important challenge for current high-performance systems. Although the theory is straightforward, the complexity of traditional parallel programming models in most cases prevents the programmer from harvesting performance. Several partitioning granularities have been proposed to better exploit concurrency at the task level. In this sense, different dynamic software task-management systems, such as task-based dataflow programming models, apply dataflow principles to improve task-level parallelism and overcome the limitations of static task-management systems. These models implicitly schedule computation and data and use tasks instead of instructions as the basic unit of work, thereby relieving the programmer of explicitly managing parallelism. While these programming models share conceptual similarities with the well-known out-of-order superscalar pipelines (e.g., dynamic data-dependency analysis and dataflow scheduling), they rely on software-based dependency analysis, which is inherently slow and limits their scalability when task granularity is fine and the number of tasks is large. This problem grows with the number of available cores: to keep all the cores busy and accelerate overall application performance, the application must be partitioned into more, smaller tasks. Task scheduling (i.e., the creation and management of task execution) in software introduces overheads and so becomes increasingly inefficient as the core count rises. In contrast, a hardware scheduling solution can achieve greater speed-ups, as a hardware task scheduler requires fewer cycles than the software version to dispatch a task. The Task Superscalar is a hybrid dataflow/von Neumann architecture that exploits the task-level parallelism of the program. It combines the effectiveness of out-of-order processors with the task abstraction, and thereby provides a unified management layer for CMPs which effectively employs processors as functional units. The Task Superscalar had previously been implemented in software, with limited parallelism and high memory consumption due to the nature of the software implementation. In this thesis, a Hardware Task Superscalar architecture is designed to be integrated into a future high-performance computer with the ability to exploit fine-grained task parallelism. The main contributions of this thesis are: (1) a design of the operational flow of the Task Superscalar architecture, adapted and improved for hardware implementation; (2) an HDL prototype for latency exploration; (3) a full cycle-accurate simulator of the Hardware Task Superscalar (based on the previously obtained latencies); (4) a full design-space exploration of the Task Superscalar component configuration (number and size) for systems with different numbers of processing elements (cores); (5) a comparison with a software implementation of a real task-based programming model runtime using real benchmarks; and (6) an exploration of the hardware resource usage of the selected configurations.
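
    A sketch of the task-based dataflow style the Task Superscalar accelerates, using standard OpenMP 4.0 task dependences as a stand-in for the OmpSs-like model the thesis builds on (produce and consume are placeholder functions; details of the actual runtime differ): the runtime, or in the thesis the hardware, derives from the in/out annotations that the second task must wait for the first, much as an out-of-order pipeline tracks register dependences.

        // Task-based dataflow sketch with OpenMP 4.0 task dependences.
        void produce(float* a, int n);            // placeholder declarations
        void consume(float* a, float* b, int n);

        void pipeline(float* a, float* b, int n) {
            #pragma omp parallel
            #pragma omp single
            {
                #pragma omp task depend(out: a[0:n])
                produce(a, n);                    // writes a

                // The runtime infers this task must wait for the one above.
                #pragma omp task depend(in: a[0:n]) depend(out: b[0:n])
                consume(a, b, n);                 // reads a, writes b
            }
        }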