2,833 research outputs found

    Design of multimedia processor based on metric computation

    Get PDF
    Media-processing applications, such as signal processing, 2D and 3D graphics rendering, and image compression, are the dominant workloads in many embedded systems today. The real-time constraints of those media applications have taxing demands on today's processor performances with low cost, low power and reduced design delay. To satisfy those challenges, a fast and efficient strategy consists in upgrading a low cost general purpose processor core. This approach is based on the personalization of a general RISC processor core according the target multimedia application requirements. Thus, if the extra cost is justified, the general purpose processor GPP core can be enforced with instruction level coprocessors, coarse grain dedicated hardware, ad hoc memories or new GPP cores. In this way the final design solution is tailored to the application requirements. The proposed approach is based on three main steps: the first one is the analysis of the targeted application using efficient metrics. The second step is the selection of the appropriate architecture template according to the first step results and recommendations. The third step is the architecture generation. This approach is experimented using various image and video algorithms showing its feasibility

    Grain-size optimization and scheduling for distributed memory architectures

    Get PDF
    The problem of scheduling parallel programs for execution on distributed memory parallel architectures has become the subject of intense research in recent, years. Because of the high inter-processor communication overhead in existing parallel machines, a crucial step in scheduling is task clustering, the process of coalescing heavily communicating fine grain tasks into coarser ones in order to reduce the communication overhead so that the overall execution time is minimized. The thesis of this research is that the task of exposing the parallelism in a given application should be left to the algorithm designer. On the other hand, the task of limiting the parallelism in a chosen parallel algorithm is best handled by the compiler or operating system for the target parallel machine. Toward this end, we have developed CASS (for Clustering And Scheduling System), a. task management system that provides facilities for automatic granularity optimization and task scheduling of parallel programs on distributed memory parallel architectures. In CASS, a task graph generated by a profiler is used by the clustering module to find the best granularity al which to execute the program so that the overall execution time is minimized. The scheduling module maps the clusters onto a. fixed number of processors and determines the order of execution of tasks in each processor. The output of scheduling module is then used by a code generator to generate machine instructions. CASS employs two efficient heuristic algorithms for clustering static task graphs: CASS-I for clustering with task duplication, and CASS-II for clustering without task duplication. It is shown that the clustering algorithms used by CASS outperform the best known algorithms reported in the literature. For the scheduling module in CASS, a heuristic algorithm based on load balancing is used to merge clusters such that the number of clusters matches the number of available physical processors. We also investigate task clustering algorithms for dynamic task graphs and show that it is inherently more difficult than the static case

    Coarse-grained reconfigurable array architectures

    Get PDF
    Coarse-Grained Reconfigurable Array (CGRA) architectures accelerate the same inner loops that benefit from the high ILP support in VLIW architectures. By executing non-loop code on other cores, however, CGRAs can focus on such loops to execute them more efficiently. This chapter discusses the basic principles of CGRAs, and the wide range of design options available to a CGRA designer, covering a large number of existing CGRA designs. The impact of different options on flexibility, performance, and power-efficiency is discussed, as well as the need for compiler support. The ADRES CGRA design template is studied in more detail as a use case to illustrate the need for design space exploration, for compiler support and for the manual fine-tuning of source code

    Leakage-Aware Multiprocessor Scheduling

    Full text link
    corecore