    Reducing Memory Requirements of Stream Programs by Graph Transformations

    Stream languages explicitly describe fork-join parallelism and pipelines, offering a powerful programming model for many-core Multi-Processor Systems on Chip (MPSoC). In an embedded, resource-constrained system, adapting stream programs to fit memory requirements is particularly important. In this paper we present a new approach to reduce the memory footprint required to run stream programs on MPSoC. Through an exploration of equivalent program variants, the method selects parallel code minimizing memory consumption. For large program instances, a heuristic that accelerates the exploration phase is proposed and evaluated. We demonstrate the effectiveness of our method on a panel of ten significant benchmarks. Using a multi-core modulo scheduling technique, our approach considerably lowers the minimum amount of memory required to run seven of these benchmarks while preserving throughput.
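
    The variant exploration described above can be pictured on a toy example. The sketch below is illustrative only: the actor sizes, the buffer model, and the idea that fusing two actors trades a FIFO for a scratch buffer are assumptions made for the example, not the paper's actual transformation or cost model.

```python
from itertools import product

# Toy pipeline: 4 actors connected by 3 FIFO edges. Fusing the two
# actors around an edge removes that FIFO but needs a scratch buffer
# instead, so fusion is not always a win -- hence the exploration.
state_bytes   = [64, 256, 128, 64]   # per-actor state
fifo_bytes    = [1024, 96, 512]      # FIFO size if the edge is kept
scratch_bytes = [256, 384, 128]      # scratch size if the edge is fused

def memory(fused):
    """Memory footprint of one program variant (a choice of fused edges)."""
    edges = sum(s if f else b
                for b, s, f in zip(fifo_bytes, scratch_bytes, fused))
    return sum(state_bytes) + edges

# Exhaustively explore all 2^3 equivalent variants and keep the smallest.
best = min(product([False, True], repeat=3), key=memory)
print("fuse edges:", best, "->", memory(best), "bytes")
```

    Since the variant space grows exponentially with the number of edges, an exhaustive search like this only works for small graphs, which is why a heuristic is proposed for large program instances.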

    Formal Specification and Runtime Verification of Parallel Systems using Interval Temporal Logic (ITL)

    Runtime Verification (RV) is the discipline of monitoring systems at runtime in order to check the satisfaction or violation of a given correctness property. Parallel systems are more complicated than sequential systems, so systems that run in parallel need a parallel runtime verification framework to monitor their behaviour and guarantee correctness properties. Parallel systems also have correctness properties different from those of sequential systems: for instance, absence of deadlock has to be guaranteed, and a mutual exclusion mechanism has to be applied when a resource is shared among systems and the parallelism takes the form of true concurrency. A sequential runtime verification framework therefore cannot handle systems that run in parallel, since such frameworks are built to handle a single system at a time, whereas for parallel systems a framework has to handle many systems at once. AnaTempura is a runtime verification tool that handles one system at a time; to overcome this limitation, I extended AnaTempura to handle parallel systems. In this thesis, I propose a Parallel Runtime Verification Framework (PRVF) that can handle systems whose designs use parallel architectures, such as multi-core processors. The proposed model can check system behaviour at runtime in order to either guarantee satisfaction or detect violations of correctness properties. My technique is based on Interval Temporal Logic (ITL) and its executable subset Tempura, verifying properties at runtime using the AnaTempura tool. As a demonstration, I use the case study of the private L2 cache memory of a multi-core processor architecture. My objectives are to i) design an MSI protocol that ensures cache coherence and ii) fulfil the main memory consistency model at runtime. I achieve this via a formal Tempura specification of the cache controller, which is then verified at runtime against my objectives for memory consistency and cache coherence using AnaTempura. The presented specifications can be extended not only to capture correctness but also to monitor the performance of a cache memory controller. The case study is then evaluated by integrating AnaTempura with MATLAB in order to check correctness properties such as memory consistency and cache coherence.
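
    As a flavour of what a parallel runtime monitor checks, the sketch below verifies one of the properties named above, mutual exclusion, over events reported by concurrently running threads. It is a minimal Python illustration of the general idea, not Tempura or the AnaTempura interface; all names in it are invented for the example.

```python
import threading, time, random

# Minimal runtime monitor for a parallel-correctness property:
# mutual exclusion on a shared resource. Instrumented threads report
# acquire/release events; the monitor records any overlapping holders.
class MutexMonitor:
    def __init__(self):
        self._lock = threading.Lock()  # protects the monitor's own state
        self._holder = None
        self.violations = []

    def on_event(self, tid, event):
        with self._lock:
            if event == "acquire":
                if self._holder is not None:   # someone is already inside
                    self.violations.append((self._holder, tid))
                self._holder = tid
            elif event == "release" and self._holder == tid:
                self._holder = None

monitor = MutexMonitor()
resource_lock = threading.Lock()

def worker(tid):
    for _ in range(100):
        with resource_lock:                    # the mechanism under test
            monitor.on_event(tid, "acquire")
            time.sleep(random.uniform(0, 1e-4))
            monitor.on_event(tid, "release")

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print("violations observed:", len(monitor.violations))  # 0 if exclusion holds
```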

    FASTCUDA: Open Source FPGA Accelerator & Hardware-Software Codesign Toolset for CUDA Kernels

    Using FPGAs as hardware accelerators that communicate with a central CPU is becoming common practice in the embedded design world, but there is still no standard methodology or toolset to facilitate this path. On the other hand, languages such as CUDA and OpenCL provide standard development environments for Graphics Processing Unit (GPU) programming. FASTCUDA is a platform that provides the necessary software toolset, hardware architecture, and design methodology to efficiently adapt the CUDA approach into a new FPGA design flow. With FASTCUDA, the CUDA kernels of a CUDA-based application are partitioned into two groups with minimal user intervention: those that are compiled and executed in parallel software, and those that are synthesized and implemented in hardware. A modern low-power FPGA can provide the processing power (via numerous embedded micro-CPUs) and the logic capacity for both the software and hardware implementations of the CUDA kernels. This paper describes the system requirements and the architectural decisions behind the FASTCUDA approach.
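
    The core codesign decision, which kernels to synthesize in hardware and which to run as software on the embedded cores, can be sketched as a small optimization problem. The kernel profile data, the area budget, and the greedy ratio heuristic below are assumptions made for illustration, not FASTCUDA's actual partitioning algorithm.

```python
# Illustrative hardware/software partitioning of CUDA kernels under an
# FPGA area budget. All numbers and the heuristic are invented.
kernels = [
    # (name, fraction of total runtime, estimated FPGA area in LUTs)
    ("matmul", 0.55, 30_000),
    ("reduce", 0.25, 12_000),
    ("scan",   0.12, 18_000),
    ("setup",  0.08,  6_000),
]
AREA_BUDGET = 40_000   # logic left over after the embedded soft cores

def partition(kernels, budget):
    """Greedily move the kernels with the best runtime-per-area ratio
    into hardware until the area budget is exhausted."""
    hw, sw, used = [], [], 0
    for name, t, area in sorted(kernels, key=lambda k: k[1] / k[2],
                                reverse=True):
        if used + area <= budget:
            hw.append(name); used += area
        else:
            sw.append(name)
    return hw, sw

hw, sw = partition(kernels, AREA_BUDGET)
print("synthesize in hardware:", hw)
print("run as parallel software:", sw)
```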

    Mixing multi-core CPUs and GPUs for scientific simulation software

    Recent technological and economic developments have led to widespread availability of multi-core CPUs and specialist accelerator processors such as graphical processing units (GPUs). The accelerated computational performance possible from these devices can be very high for some application paradigms. Software languages and systems such as NVIDIA's CUDA and the Khronos consortium's open compute language (OpenCL) support a number of individual parallel application programming paradigms. To scale up the performance of some complex systems simulations, a hybrid of multi-core CPUs for coarse-grained parallelism and very-many-core GPUs for data parallelism is necessary. We describe our use of hybrid applications using threading approaches and multi-core CPUs to control independent GPU devices. We present speed-up data, discuss multi-threading software issues for the applications-level programmer, and offer some suggested areas for language development and integration between coarse-grained and fine-grained multi-thread systems. We discuss results from three common simulation algorithmic areas: partial differential equations; graph cluster metric calculations; and random number generation. We report on programming experiences and selected performance for these algorithms on single and multiple GPUs, multi-core CPUs, a CellBE, and using OpenCL. We discuss programmer usability issues and the outlook and trends in multi-core programming for scientific applications developers.
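
    The hybrid pattern the paper describes, coarse-grained host threads each driving one data-parallel device, can be sketched as follows. This is a structural illustration only: `launch_on_device` is a placeholder defined inside the sketch, not a real CUDA or OpenCL call, and the per-device work is simulated on the CPU.

```python
import threading

# Structural sketch: one coarse-grained host thread per GPU, each thread
# driving its own device over a chunk of the data. A real implementation
# would bind the device's context and launch a data-parallel kernel.
def launch_on_device(device_id, chunk):
    # Stand-in computation simulating a kernel launch on `device_id`.
    return sum(x * x for x in chunk)

def device_worker(device_id, chunk, results):
    results[device_id] = launch_on_device(device_id, chunk)

data = list(range(1_000_000))
n_devices = 4                                  # pretend we have 4 GPUs
chunks = [data[i::n_devices] for i in range(n_devices)]

results = [None] * n_devices
threads = [threading.Thread(target=device_worker,
                            args=(i, chunks[i], results))
           for i in range(n_devices)]
for t in threads: t.start()
for t in threads: t.join()
print("combined result:", sum(results))        # reduce partial results
```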

    When parallel speedups hit the memory wall

    After Amdahl's trailblazing work, many other authors proposed analytical speedup models, but none considered the limiting effect of the memory wall. These models exploited aspects such as problem-size variation, memory size, communication overhead, and synchronization overhead, but data-access delays were assumed to be constant. Nevertheless, such delays can vary, for example, according to the number of cores used and the ratio between processor and memory frequencies. Given the large number of configurations of operating frequency and number of cores that current architectures can offer, speedup models that describe such variations across configurations are quite desirable for off-line or on-line scheduling decisions. This work proposes new parallel speedup models that account for variations of the average data-access delay in order to describe the limiting effect of the memory wall on parallel speedups. Analytical results indicate that the proposed modeling can capture the desired behavior, and experimental hardware results validate it. Additionally, we show that when accounting for parameters that reflect the intrinsic characteristics of the applications, such as degree of parallelism and susceptibility to the memory wall, our proposal has significant advantages over machine-learning-based modeling. Moreover, besides being black-box, conventional machine-learning modeling needs, as our experiments show, about one order of magnitude more measurements to reach the level of accuracy achieved by our modeling.
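
    To make the idea concrete, here is one assumed functional form of such a model (an illustration of the approach, not the paper's actual equations): start from an Amdahl-style decomposition with parallel fraction f on p cores, but let the average data-access delay d(p) grow with core count to reflect contention at the memory wall.

```latex
% Illustrative memory-wall-aware speedup model (assumed form, not the
% paper's): f = parallel fraction, p = number of cores, d(p) = average
% data-access delay per unit of work, growing with contention.
\[
  S(p) \;=\; \frac{T(1)}{T(p)}
       \;=\; \frac{1 + d(1)}{(1-f) + \dfrac{f}{p} + d(p)},
  \qquad d(p) \;=\; d_0 \bigl(1 + \beta\,(p-1)\bigr).
\]
```

    Under this form, f/p shrinks as p grows while d(p) increases, so the speedup curve peaks and then falls off, which is exactly the memory-wall-limited behavior such models aim to capture.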

    Programming MPSoC platforms: Road works ahead

    This paper summarizes a special session on multicore/multi-processor system-on-chip (MPSoC) programming challenges. The current trend towards MPSoC platforms in most computing domains does not only mean a radical change in computer architecture; even more important from a SW developer's viewpoint, the classical sequential von Neumann programming model needs to be overcome at the same time. Efficient utilization of the MPSoC HW resources demands radically new models and corresponding SW development tools, capable of exploiting the available parallelism and guaranteeing bug-free parallel SW. While several standards are established in the high-performance computing domain (e.g. OpenMP), it is clear that more innovations are required for successful deployment of heterogeneous embedded MPSoCs. On the other hand, at least for the coming years, the freedom for disruptive programming technologies is limited by the huge amount of certified sequential code that demands a more pragmatic, gradual tool and code replacement strategy.