
    Insights on Memory Controller Scaling for Multicore Embedded Systems

    In recent years, the growth in both the number and the frequency of cores across processor generations has proportionally increased bandwidth needs in both CPU and GPU systems. Recent heterogeneous mobile embedded systems with hard or firm real-time requirements share a single address space to address the communication latency between CPU and GPU memories, but this sharing adds significant levels of contention. In addition, when heterogeneous cores are simultaneously present in a single system, memory parallelism is significantly restricted by the small number of memory controllers (MCs). As a strategy to approach these significant levels of memory pressure, this paper evaluates the impact of scaling MCs up to four to eight units, a range limited by motherboard size for embedded purposes. Our findings show that performance is enhanced by a factor of 4× when employing only CPU cores, 4.6× when employing only GPU cores, and 2× when both CPU and GPU cores are considered simultaneously.

    Design-Space Exploration of Stream Programs through Semantic-Preserving Transformations

    Stream languages explicitly describe fork-join parallelism and pipelines, offering a powerful programming model for many-core Multi-Processor Systems on Chip (MPSoC). In an embedded resource-constrained system, adapting stream programs to fit memory requirements is particularly important. In this paper we present a design-space exploration technique to reduce the minimal memory required when running stream programs on MPSoC; this makes it possible to target memory-constrained systems and, in some cases, to obtain better performance. Using a set of semantics-preserving transformations, we explore a large number of equivalent program variants and select the variant that minimizes a buffer evaluation metric. To cope efficiently with large program instances, we propose and evaluate a heuristic for this method. We demonstrate the benefit of our method on a set of ten significant benchmarks. As an illustration, we measure the minimal memory required under multi-core modulo scheduling. Our approach considerably lowers the minimal memory required for seven of the ten benchmarks.

    High Performance Stencil Code Generation with LIFT

    Stencil computations are widely used in domains ranging from physical simulations to machine learning. They are embarrassingly parallel and fit modern hardware such as Graphics Processing Units well. Although stencil computations have been extensively studied, optimizing them for increasingly diverse hardware remains challenging. Domain Specific Languages (DSLs) have raised the programming abstraction and offer good performance. However, this places the burden on DSL implementers, who have to write almost full-fledged parallelizing compilers and optimizers. Lift has recently emerged as a promising approach to achieve performance portability and is based on a small set of reusable parallel primitives that DSL or library writers can build upon. Lift’s key novelty is its encoding of optimizations as a system of extensible rewrite rules, which are used to explore the optimization space. However, Lift has mostly focused on linear algebra operations, and it remains to be seen whether this approach is applicable to other domains. This paper demonstrates how complex multidimensional stencil code and optimizations such as tiling are expressible using compositions of simple 1D Lift primitives. By leveraging existing Lift primitives and optimizations, we only require the addition of two primitives and one rewrite rule to do so. Our results show that this approach outperforms existing compiler approaches and hand-tuned codes.
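To illustrate what expressing a stencil as a composition of simple 1D primitives can look like, here is a minimal Python sketch. It is not Lift code: the helper names `pad`, `slide`, and `stencil_1d`, the clamped boundary handling, and the fixed three-point window are all assumptions made for this example, and the abstract does not name the two primitives the paper adds.

```python
from typing import Callable, List, Sequence


def pad(xs: Sequence[float], left: int, right: int) -> List[float]:
    """Extend xs at both ends by repeating the boundary value (clamping)."""
    return [xs[0]] * left + list(xs) + [xs[-1]] * right


def slide(xs: Sequence[float], size: int, step: int) -> List[List[float]]:
    """Return overlapping windows of length `size`, advancing by `step`."""
    return [list(xs[i:i + size]) for i in range(0, len(xs) - size + 1, step)]


def stencil_1d(f: Callable[[List[float]], float], xs: Sequence[float]) -> List[float]:
    """Three-point stencil: pad the input, build neighbourhoods, map f over them."""
    return [f(window) for window in slide(pad(xs, 1, 1), 3, 1)]


# Example: sum each element with its left and right neighbours.
result = stencil_1d(sum, [1, 2, 3, 4])
```

In this sketch the stencil itself is only the composition `map(f) . slide . pad`, so a change to boundary handling or window shape touches a single helper rather than the whole kernel.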

    Integration of the Stream Model for Efficient Execution on Multicore Computer Architectures (Integracija tokovnog modela za učinkovito izvođenje na višejezgrenim računalnim arhitekturama)

    Streaming has emerged as an important model in present-day applications, ranging from multimedia to scientific computing. Moreover, the emergence of new multicore architectures has brought new challenges in the efficient utilization of available computational resources. The streaming model offers a high degree of abstraction together with portability and scalability of performance as the number of cores increases. In this paper we propose a tool that enables the implementation of compute-intensive stream processing kernels as portable modules in general-purpose applications. The resulting modules can be efficiently reused with a high degree of scalability with respect to an increasing number of processing cores. In this way, a speedup of the overall applications was achieved when executing on multicore processors.

    Architecture-Performance Interrelationship Analysis in Single/Multiple CPU/GPU Computing Systems: Application to Composite Process Flow Modeling

    Current developments in computing have shown the advantage of using one or more Graphics Processing Units (GPUs) to boost the performance of many computationally intensive applications, but there are still limits to these GPU-enhanced systems. The major factors that limit GPUs for High Performance Computing (HPC) can be categorized as hardware oriented or software oriented. Understanding how these factors affect performance is essential to develop efficient and robust application codes that employ one or more GPU devices as powerful co-processors for HPC computational modeling. The present work analyzes the intrinsic interrelationship of the hardware and software categories with computational performance for single and multiple GPU-enhanced systems, using a computationally intensive application that is representative of a large portion of the challenges confronting modern HPC. The representative application uses unstructured finite element computations for transient composite resin infusion process flow modeling as its computational core. Its characteristics and results reflect many other HPC applications because of the sparse matrix system used to solve the linear system of equations. This work describes these software and hardware factors and how they interact to affect the performance of computationally intensive applications, enabling more efficient development and porting of HPC applications, including current, legacy, and future large-scale computational modeling applications in various engineering and scientific disciplines.

    Scheduling and Tuning Kernels for High-performance on Heterogeneous Processor Systems

    Accelerated parallel computing techniques using devices such as GPUs and Xeon Phis (along with CPUs) offer promising ways of extending the cutting edge of high-performance computer systems. A significant performance improvement can be achieved when suitable workloads are handled by the accelerator, while traditional CPUs handle the workloads that are not well suited to accelerators. A combination of multiple types of processors in a single computer system is referred to as a heterogeneous system. This dissertation addresses tuning and scheduling issues in heterogeneous systems. The first section presents work on tuning scientific workloads on three different types of processors: multi-core CPU, Xeon Phi massively parallel processor, and NVIDIA GPU; common tuning methods and platform-specific tuning techniques are presented. Then, analysis is done to demonstrate the performance characteristics of the heterogeneous system on different input data. This section of the dissertation is part of the GeauxDock project, which prototyped a few state-of-the-art bioinformatics algorithms and delivered a fast molecular docking program. The second section studies the performance model of the GeauxDock computing kernel. Specifically, the work extracts features from the input data set and the target systems, and then uses various regression models to calculate the prospective computation time. This helps explain why a certain processor is faster for certain sets of tasks, and it provides the essential information for scheduling on heterogeneous systems. In addition, this dissertation investigates a high-level task scheduling framework for heterogeneous processor systems in which the strengths and weaknesses of the different processors complement each other, so that higher performance can be achieved on heterogeneous computing systems. A new scheduling algorithm with four innovations is presented: Ranked Opportunistic Balancing (ROB), Multi-subject Ranking (MR), Multi-subject Relative Ranking (MRR), and Automatic Small Tasks Rearranging (ASTR). The new algorithm consistently outperforms previously proposed algorithms with better scheduling results, lower computational complexity, and more consistent results over a range of performance prediction errors. Finally, this work extends the heterogeneous task scheduling algorithm to handle a power-capping feature. It demonstrates that a power-aware scheduler significantly improves power efficiency and reduces energy consumption. This suggests that, in addition to performance benefits, heterogeneous systems may have certain advantages in overall power efficiency.

    A hybrid static/dynamic approach to scheduling stream programs

    Thesis (M. Eng.), Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2009, by Ceryen Tan. Cataloged from PDF version of thesis. Includes bibliographical references (p. 95-97). Streaming languages such as StreamIt are often utilized to write stream programs that execute on multicore processors. Stream programs consist of actors that operate on streams of data. To execute on multiple cores, actors are scheduled for parallel execution while satisfying data dependencies between actors. In StreamIt, the compiler analyzes data dependencies between actors at compile time and generates a static schedule that determines where and when actors are executed on the available cores. Statically scheduling actors onto cores results in no scheduling overhead at run time and allows for sophisticated compile-time scheduling optimizations. Unfortunately, static scheduling has a number of severe limitations. The generated static schedule is inflexible and cannot be adapted to run-time conditions, such as cores that are unexpectedly unavailable. Static scheduling may also load-balance cores poorly because of inaccurate static work estimates. This thesis contributes a hybrid static/dynamic scheduling approach that attempts to address the limitations of static scheduling. Dynamic load-balancing is utilized to adjust the static schedule to run-time conditions and to correct load imbalances that might exist after static scheduling. Dynamic load-balancing is designed to add very little run-time overhead.
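The split between a compile-time plan and a run-time correction can be sketched in a few lines of Python. This is an illustrative toy, not the algorithm from the thesis: the function names `static_schedule` and `rebalance`, the greedy heaviest-first placement, and the one-actor-at-a-time migration rule are all assumptions, and data dependencies between actors are ignored entirely.

```python
from typing import Dict, List


def static_schedule(estimates: Dict[str, float], cores: int) -> List[List[str]]:
    """Compile-time step: place each actor, heaviest first, on the least-loaded core."""
    plan: List[List[str]] = [[] for _ in range(cores)]
    load = [0.0] * cores
    for actor in sorted(estimates, key=estimates.get, reverse=True):
        target = load.index(min(load))
        plan[target].append(actor)
        load[target] += estimates[actor]
    return plan


def rebalance(plan: List[List[str]], measured: Dict[str, float]) -> List[List[str]]:
    """Run-time step: migrate actors from the busiest to the idlest core
    for as long as a move lowers the busiest core's load."""
    plan = [list(core) for core in plan]  # work on a copy of the static plan
    while True:
        load = [sum(measured[a] for a in core) for core in plan]
        src, dst = load.index(max(load)), load.index(min(load))
        best = None
        for actor in plan[src]:
            # Larger of the two affected loads if this actor were moved.
            new_max = max(load[src] - measured[actor], load[dst] + measured[actor])
            if new_max < load[src] and (best is None or new_max < best[0]):
                best = (new_max, actor)
        if best is None:
            return plan  # no single move helps any more
        plan[src].remove(best[1])
        plan[dst].append(best[1])


# Static plan from (inaccurate) work estimates, then correction from measurements.
plan = static_schedule({"a": 4, "b": 3, "c": 2, "d": 1}, cores=2)
fixed = rebalance(plan, {"a": 8, "b": 1, "c": 1, "d": 2})
```

Here the static plan balances the estimated work at 5 units per core, but the measured costs put 10 units on one core and 2 on the other; the run-time step moves actor `d` and brings the busiest core down to 8.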

    Software Pipelined Execution of Stream Programs on GPUs

    The StreamIt programming model has been proposed to exploit parallelism in streaming applications on general-purpose multicore architectures. This model allows programmers to specify the structure of a program as a set of filters that act upon data, and a set of communication channels between them. StreamIt graphs describe task, data, and pipeline parallelism, which can be exploited on modern Graphics Processing Units (GPUs), which support abundant parallelism in hardware. In this paper, we describe the challenges in mapping StreamIt to GPUs and propose an efficient technique to software pipeline the execution of stream programs on GPUs. We formulate this problem (both scheduling and assignment of filters to processors) as an efficient Integer Linear Program (ILP), which is then solved using ILP solvers. We also describe a novel buffer layout technique for GPUs that facilitates exploiting the high memory bandwidth available in GPUs. The proposed scheduling exploits both the scalar units in the GPU, to exploit data parallelism, and the multiprocessors, to exploit task and pipeline parallelism. Further, it takes into consideration the synchronization and bandwidth limitations of GPUs, yielding speedups between 1.87X and 36.83X over a single-threaded CPU.