
    Efficient implementation of resource-constrained cyber-physical systems using multi-core parallelism

    The quest for more performance of applications and systems has become more challenging in recent years. Especially in the cyber-physical and mobile domains, performance requirements have increased significantly. Applications previously found in the high-performance domain now emerge in resource-constrained environments. Modern heterogeneous high-performance MPSoCs provide a solid foundation to satisfy this demand; such systems combine general-purpose processors with specialized accelerators ranging from GPUs to machine-learning chips. On the other side of the performance spectrum, the demand for small, energy-efficient systems driven by modern IoT applications has increased vastly. Developing efficient software for such resource-constrained multi-core systems is an error-prone, time-consuming, and challenging task. With PA4RES, this thesis provides a holistic, semi-automatic approach to parallelize and implement applications for such platforms efficiently. Our solution supports the developer in finding good trade-offs to meet the requirements posed by modern applications and systems. With PICO, we propose a comprehensive approach to express parallelism in sequential applications. PICO detects data dependencies and implements the required synchronization automatically. Using a genetic algorithm, PICO optimizes the data synchronization: the evolutionary algorithm considers channel capacity, memory mapping, channel merging, and the flexibility offered by the channel implementation with respect to execution time, energy consumption, and memory footprint. PICO's communication optimization phase was able to generate a speedup of almost 2, or an energy improvement of 30%, for certain benchmarks. The PAMONO sensor approach enables fast detection of biological viruses using optical methods. With sophisticated virus-detection software, real-time virus detection running on stationary computers was achieved. Within this thesis, we were able to derive a soft-real-time-capable virus detection running on a high-performance embedded system, as commonly found in today's smartphones. This was accomplished with a smart design space exploration (DSE) algorithm which optimizes for execution time, energy consumption, and detection quality. Compared to a baseline implementation, our solution achieved a speedup of 4.1 and 87% energy savings while satisfying the soft real-time requirements. Accepting a degradation of the detection quality, which is still usable in a medical context, led to a speedup of 11.1. This work provides the fundamentals for a truly mobile real-time virus detection solution. The growing demand for processing power can no longer be satisfied by well-known approaches such as increasing clock frequencies. These so-called performance walls pose a serious challenge to the growing performance demand. Approximate computing is a promising approach to overcome, or at least shift, the performance walls by accepting a degradation in output quality to gain improvements in other objectives. Especially for a safe integration of approximation into existing applications, or during the development of new approximation techniques, a method to assess the impact on output quality is essential. With QCAPES, we provide a multi-metric assessment framework to analyze the impact of approximation. Furthermore, QCAPES provides useful insights into the impact of approximation on execution time and energy consumption. Finally, with ApproxPICO we propose an extension to PICO that considers approximate computing during the parallelization of sequential applications.
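    The abstract does not spell out how PICO's evolutionary optimization works internally. As a rough, hypothetical illustration of the general idea, the following Python sketch evolves per-channel FIFO capacities against a scalarized cost combining execution time, memory footprint, and energy; every name and the toy cost model are assumptions for illustration, not PICO's actual implementation.

```python
# Hypothetical sketch of an evolutionary channel-capacity optimization.
# The cost model and all names are invented, not PICO's implementation.
import random

CAPACITIES = [1, 2, 4, 8, 16, 32]   # candidate FIFO depths per channel
NUM_CHANNELS = 6

def evaluate(genome):
    """Toy cost model: deeper channels cut stall time but cost memory/energy."""
    time   = sum(1.0 / c for c in genome)   # fewer stalls with deeper channels
    memory = sum(genome)                    # footprint grows with depth
    energy = 0.1 * memory + 0.5 * time      # both contribute to energy
    return time + 0.05 * memory + energy    # scalarized multi-objective cost

def mutate(genome, rate=0.2):
    return [random.choice(CAPACITIES) if random.random() < rate else g
            for g in genome]

def crossover(a, b):
    cut = random.randrange(1, len(a))
    return a[:cut] + b[cut:]

population = [[random.choice(CAPACITIES) for _ in range(NUM_CHANNELS)]
              for _ in range(20)]
for generation in range(50):
    population.sort(key=evaluate)
    parents = population[:10]               # elitist selection
    children = [mutate(crossover(random.choice(parents),
                                 random.choice(parents)))
                for _ in range(10)]
    population = parents + children

print("best channel capacities:", min(population, key=evaluate))
```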

    Electronic System-Level Synthesis Methodologies


    Automatic pipelining and vectorization of scientific code for FPGAs

    There is a large body of legacy scientific code in use today that could benefit from execution on accelerator devices like GPUs and FPGAs. Translating such legacy code into device-specific parallel code by hand requires significant effort and is a major obstacle to wider FPGA adoption. We are developing an automated optimizing compiler, TyTra, to overcome this obstacle. The TyTra flow aims to compile legacy Fortran code automatically for FPGA-based acceleration while applying suitable optimizations. We present the flow with a focus on two key optimizations: automatic pipelining and vectorization. Our compiler frontend extracts patterns from legacy Fortran code that can be pipelined and vectorized. The backend first creates fine- and coarse-grained pipelines and then automatically vectorizes both the memory access and the datapath based on a cost model, generating an OpenCL-HDL hybrid working solution for FPGA targets on the Amazon cloud. Our results show up to a 4.2× performance improvement over baseline OpenCL code.
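    The cost model driving the vectorization decision is not detailed in the abstract. As a hedged sketch of how such a model can pick a vector width, the following Python snippet compares a memory-bandwidth bound against a datapath bound, roofline-style; all constants and names are invented for illustration and are not TyTra's actual model.

```python
# Hypothetical roofline-style cost model for choosing a vectorization width.
# All constants are invented placeholders.
def cycles_per_tile(vector_width, tile_elems,
                    mem_words_per_cycle=4, ops_per_elem=8):
    """Estimate cycles for one tile as the max of memory and datapath time."""
    mem_cycles      = tile_elems / mem_words_per_cycle        # bandwidth bound
    datapath_cycles = tile_elems * ops_per_elem / vector_width
    pipeline_fill   = 32                                      # fill latency
    return pipeline_fill + max(mem_cycles, datapath_cycles)

def best_width(tile_elems=1024, widths=(1, 2, 4, 8, 16, 32)):
    return min(widths, key=lambda w: cycles_per_tile(w, tile_elems))

if __name__ == "__main__":
    for w in (1, 2, 4, 8, 16, 32):
        print(w, cycles_per_tile(w, 1024))
    print("chosen width:", best_width())
```

    With these invented constants, widening the datapath stops paying off once the memory bound dominates, which is exactly the trade-off a backend cost model has to capture before committing FPGA resources to wider vector lanes.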

    Automatically Parallelizing Embedded Legacy Software on Soft-Core SoCs

    Nowadays, embedded systems are utilized in many areas and have become omnipresent, making people's lives more comfortable. Embedded systems have to handle more and more functionality in many products. To maintain the low energy consumption that is often required, multi-core systems provide high performance at moderate power. The development started with dual-core processors and has today reached many-core designs with dozens or hundreds of processor cores. However, existing applications can barely leverage the potential of that many cores. Legacy applications are usually written sequentially and thus typically use only one processor core, so they do not benefit from the advantages provided by modern many-core systems. Rewriting those applications to use multiple cores requires new skills from developers and is also time-consuming and highly error-prone. Dozens of languages, APIs, and compilers have been presented in the past decades to aid the user in parallelizing applications. Fully automatic parallelizing compilers are seen as the holy grail, since the user effort is kept minimal; however, automatic parallelizers often cannot extract parallelism as well as user-aided approaches. Most of these parallelization tools are designed for desktop and high-performance systems and are thus not tuned for, or applicable to, low-performance embedded systems. To improve this situation, this work presents an automatic parallelizer for embedded systems which in most cases delivers better quality than user-aided approaches and, where it does not, allows easy manual fine-tuning. Parallelization tools extract concurrently executable tasks from an application; these tasks can then be executed on different processor cores. Parallelization tools, and automatic parallelizers in particular, often struggle to efficiently map the extracted parallelism to an existing multi-core processor. This work uses soft-core processors on FPGAs, which makes it possible to realize custom multi-core designs in hardware within a few minutes. This allows the multi-core processor to be adapted to the characteristics of the extracted parallelism; in particular, core interconnects for communication can be optimized to fit the communication pattern of the parallel application. Embedded applications are often structured as follows: receive input data, (multiple) data processing steps, data output. The multiple processing steps are often realized as consecutive, loosely coupled transformations, which naturally model the structure of a processing pipeline. It is the goal of this work to extract this kind of pipeline parallelism from an application and map it to multiple cores to increase the overall throughput of the system; multiple cores forming a chain with direct communication channels ideally fit this pattern. This so-called pipeline parallelism is a barely addressed concept in most parallelization tools, and current multi-core designs often do not support the hardware flexibility provided by the soft-cores targeted in this approach. The main contribution of this work is an automatic parallelizer which is able to map different processing steps from the source code of a sequential application to different cores in a multi-core pipeline. Users only specify the required processing speed after parallelization; the developed tool then tries to extract a matching parallelized software design, along with a custom multi-core design, from sequential embedded legacy applications. The automatically created multi-core system already contains the used peripherals extracted from the source code and is ready to be used. The presented parallelizer implements multi-objective optimization to generate a minimal hardware design that just fulfills the user-defined requirement. To the best of my knowledge, the possibility to generate such a multi-core pipeline defined by the demands of the parallelized software has never been presented before. The approach is implemented for two soft-core processors, and the evaluation shows high speedups of 12x and above for both targets at a reasonable hardware overhead. Compared to other automatic parallelizers, which mainly focus on speedups through latency reduction, significantly higher speedups can be achieved depending on the given application structure.
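    The pipeline pattern described above can be sketched in a few lines of Python: consecutive, loosely coupled transformations run concurrently and exchange data through bounded channels, mirroring cores chained by direct communication links. The stage functions and channel depth below are placeholders for illustration, not output of the presented tool.

```python
# Minimal sketch of pipeline parallelism: each stage runs concurrently
# and forwards results over a bounded channel, like cores in a chain.
import threading, queue

def stage(work, inbox, outbox):
    """One pipeline stage: read, transform, forward; None shuts it down."""
    while True:
        item = inbox.get()
        if item is None:
            outbox.put(None)        # propagate end-of-stream marker
            break
        outbox.put(work(item))

# Three loosely coupled transformations (placeholder computations),
# matching the input -> processing steps -> output structure above.
steps = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]

channels = [queue.Queue(maxsize=4) for _ in range(len(steps) + 1)]
threads = [threading.Thread(target=stage, args=(f, channels[i], channels[i + 1]))
           for i, f in enumerate(steps)]
for t in threads:
    t.start()

for value in range(10):             # feed input data
    channels[0].put(value)
channels[0].put(None)               # signal end of stream

while (out := channels[-1].get()) is not None:
    print(out)
for t in threads:
    t.join()
```

    Because every stage processes a different item at the same time, throughput approaches one result per slowest-stage latency, which is the gain the described multi-core pipeline targets.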

    Navigating the Landscape for Real-time Localisation and Mapping for Robotics, Virtual and Augmented Reality

    Visual understanding of 3D environments in real-time, at low power, is a huge computational challenge. Often referred to as SLAM (Simultaneous Localisation and Mapping), it is central to applications spanning domestic and industrial robotics, autonomous vehicles, and virtual and augmented reality. This paper describes the results of a major research effort to assemble the algorithms, architectures, tools, and systems software needed to enable delivery of SLAM, by supporting application specialists in selecting and configuring the appropriate algorithm, hardware, and compilation pathway to meet their performance, accuracy, and energy consumption goals. The major contributions we present are (1) tools and methodology for systematic quantitative evaluation of SLAM algorithms, (2) automated, machine-learning-guided exploration of the algorithmic and implementation design space with respect to multiple objectives, (3) end-to-end simulation tools to enable optimisation of heterogeneous, accelerated architectures for the specific algorithmic requirements of the various SLAM algorithmic approaches, and (4) tools for delivering, where appropriate, accelerated, adaptive SLAM solutions in a managed, JIT-compiled, adaptive runtime context.
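    One building block of such a multi-objective design-space exploration is filtering candidate configurations down to the Pareto front. The following minimal Python sketch shows this step for invented (runtime, error, energy) triples; it illustrates the concept only and is not the paper's toolchain.

```python
# Sketch of Pareto-front filtering for multi-objective design-space
# exploration. All sample points are invented; objectives are minimized.
def dominates(a, b):
    """a dominates b if it is no worse in every objective, better in one."""
    return (all(x <= y for x, y in zip(a, b)) and
            any(x < y for x, y in zip(a, b)))

def pareto_front(points):
    return [p for p in points
            if not any(dominates(q, p) for q in points if q is not p)]

# (runtime_ms, trajectory_error_m, energy_J) per candidate SLAM config
candidates = [(33.0, 0.05, 2.1), (25.0, 0.09, 2.4),
              (40.0, 0.04, 1.8), (30.0, 0.20, 1.2),
              (33.0, 0.06, 2.5)]   # last point is dominated by the first
print(pareto_front(candidates))
```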

    MPSoCBench: a framework for evaluating tools and methodologies for multiprocessor systems-on-chip

    Advisor: Rodolfo Jardim de Azevedo. Doctoral thesis, Universidade Estadual de Campinas, Instituto de Computação. Recent design methodologies and tools aim at enhancing design productivity by providing a software development platform before the final Multiprocessor System-on-Chip (MPSoC) architecture details are defined. However, simulation can only be performed efficiently when using a modeling and simulation engine that supports system behavior description at a high abstraction level. The lack of MPSoC virtual platform prototyping integrating both scalable hardware and software with which to create and evaluate new methodologies and tools motivated us to develop MPSoCBench: a scalable set of MPSoCs including four different ISAs (PowerPC, MIPS, SPARC, and ARM) organized in platforms with 1, 2, 4, 8, 16, 32, and 64 cores, cross-compilers, IPs, interconnections, 17 parallel versions of programs from well-known benchmarks, and power consumption estimation for the main components (processors, routers, main memory, and caches). An important demand in MPSoC design is addressing energy consumption constraints as early as possible. Since processor performance comes at a high power cost, there is increasing interest in exploring the trade-off between power and performance, taking the target application domain into account. Dynamic Voltage and Frequency Scaling (DVFS) techniques adaptively scale the voltage and frequency levels of the CPU, allowing it to reach just enough performance to process the system workload while meeting throughput constraints, thereby reducing energy consumption. To explore this wide design space for energy efficiency and performance, for both hardware and software components, we added features to MPSoCBench to explore DVFS and validated three mechanisms based on dynamic energy estimation and CPU usage rate.
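    As a hypothetical illustration of the kind of utilization-driven DVFS mechanism evaluated on MPSoCBench, the following Python sketch picks the lowest voltage/frequency level whose capacity still covers the observed CPU usage; the levels and the 80% headroom threshold are invented for illustration and are not the framework's actual policy.

```python
# Sketch of a utilization-driven DVFS policy: choose the lowest
# voltage/frequency level that still covers the observed demand.
LEVELS = [(0.6, 200), (0.8, 400), (1.0, 600), (1.2, 800)]  # (volts, MHz), invented

def next_level(cpu_usage, current_mhz):
    """Scale so predicted utilization stays below ~80% at the new level."""
    demand_mhz = cpu_usage * current_mhz      # cycles actually needed
    for volts, mhz in LEVELS:                 # lowest sufficient level wins
        if demand_mhz <= 0.8 * mhz:
            return volts, mhz
    return LEVELS[-1]                         # saturate at the top level

print(next_level(cpu_usage=0.95, current_mhz=400))  # busy core -> scale up
print(next_level(cpu_usage=0.30, current_mhz=800))  # idle core -> scale down
```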

    3rd EGEE User Forum

    We have organized this book as a sequence of chapters, each associated with an application or technical theme and introduced by an overview of its contents and a summary of the main conclusions coming from the Forum on that topic. The first chapter gathers all the plenary-session keynote addresses; following this is a sequence of chapters covering the application-flavoured sessions, followed by chapters with the flavour of Computer Science and Grid Technology. The final chapter covers the large number of practical demonstrations and posters exhibited at the Forum. Much of the work presented has a direct link to specific areas of Science, and so we have created a Science Index, presented below. In addition, at the end of this book we provide a complete list of the institutes and countries involved in the User Forum.

    Review of System Design Frameworks

    In the last decade, the enormous development of the semiconductor industry, with the ever-increasing complexity of digital embedded systems and strong market competition demanding fast time-to-market and low design cost, has imposed serious difficulties on conventional design methods. In response, a new design flow named model-based system design has emerged, which relies on high-level abstraction models, heavy design automation, and extensive component reuse to increase productivity and satisfy market pressure. This thesis reviews ten recent high-level academic system design frameworks and tools that support the model-based design flow, namely System-on-Chip Environment (SCE), Embedded System Environment (ESE), Metropolis, Daedalus, SystemCoDesigner (SCD), xPilot, GAUT, No-Instruction-Set Computer (NISC), Formal System Design (ForSyDe), and Ptolemy II. These tools are compared to each other across various aspects, including objectives, techniques, implementation, and capabilities. Following that, three design-flow frameworks, ESE, Daedalus, and SystemCoDesigner, are evaluated experimentally for their real-world usage, performance, and practicality. The frameworks and tools implementing the model-based design flow all show promising results. Modeling tools (ForSyDe and Ptolemy II) can sufficiently capture a wide range of complicated modern systems, while high-level synthesis tools (xPilot, GAUT, and NISC) produce better design quality in terms of area, power, and cost compared to traditional approaches. Case studies of the design-flow frameworks (SCE, ESE, Metropolis, Daedalus, and SCD) show that the model-based method significantly reduces development time and facilitates the system design process. However, most of these tools and frameworks are incomplete and still at an experimental stage; much work remains before the method can be put into practice.