955 research outputs found

    Synthesis of application specific processor architectures for ultra-low energy consumption

    No full text
    In this paper we suggest that further energy savings can be achieved by a new approach to synthesis of embedded processor cores, where the architecture is tailored to the algorithms that the core executes. In the context of embedded processor synthesis, both single-core and many-core, the types of algorithms and demands on the execution efficiency are usually known at the chip design time. This knowledge can be utilised at the design stage to synthesise architectures optimised for energy consumption. Firstly, we present an overview of both traditional energy saving techniques and new developments in architectural approaches to energy-efficient processing. Secondly, we propose a picoMIPS architecture that serves as an architectural template for energy-efficient synthesis. As a case study, we show how the picoMIPS architecture can be tailored to an energy efficient execution of the DCT algorithm

    High-Efficient Parallel CAVLC Encoders on Heterogeneous Multicore Architectures

    Get PDF
    This article presents two high-efficient parallel realizations of the context-based adaptive variable length coding (CAVLC) based on heterogeneous multicore processors. By optimizing the architecture of the CAVLC encoder, three kinds of dependences are eliminated or weaken, including the context-based data dependence, the memory accessing dependence and the control dependence. The CAVLC pipeline is divided into three stages: two scans, coding, and lag packing, and be implemented on two typical heterogeneous multicore architectures. One is a block-based SIMD parallel CAVLC encoder on multicore stream processor STORM. The other is a component-oriented SIMT parallel encoder on massively parallel architecture GPU. Both of them exploited rich data-level parallelism. Experiments results show that compared with the CPU version, more than 70 times of speedup can be obtained for STORM and over 50 times for GPU. The implementation of encoder on STORM can make a real-time processing for 1080p @30fps and GPU-based version can satisfy the requirements for 720p real-time encoding. The throughput of the presented CAVLC encoders is more than 10 times higher than that of published software encoders on DSP and multicore platforms

    SQUASH: Simple QoS-Aware High-Performance Memory Scheduler for Heterogeneous Systems with Hardware Accelerators

    Full text link
    Modern SoCs integrate multiple CPU cores and Hardware Accelerators (HWAs) that share the same main memory system, causing interference among memory requests from different agents. The result of this interference, if not controlled well, is missed deadlines for HWAs and low CPU performance. State-of-the-art mechanisms designed for CPU-GPU systems strive to meet a target frame rate for GPUs by prioritizing the GPU close to the time when it has to complete a frame. We observe two major problems when such an approach is adapted to a heterogeneous CPU-HWA system. First, HWAs miss deadlines because they are prioritized only close to their deadlines. Second, such an approach does not consider the diverse memory access characteristics of different applications running on CPUs and HWAs, leading to low performance for latency-sensitive CPU applications and deadline misses for some HWAs, including GPUs. In this paper, we propose a Simple Quality of service Aware memory Scheduler for Heterogeneous systems (SQUASH), that overcomes these problems using three key ideas, with the goal of meeting deadlines of HWAs while providing high CPU performance. First, SQUASH prioritizes a HWA when it is not on track to meet its deadline any time during a deadline period. Second, SQUASH prioritizes HWAs over memory-intensive CPU applications based on the observation that the performance of memory-intensive applications is not sensitive to memory latency. Third, SQUASH treats short-deadline HWAs differently as they are more likely to miss their deadlines and schedules their requests based on worst-case memory access time estimates. Extensive evaluations across a wide variety of different workloads and systems show that SQUASH achieves significantly better CPU performance than the best previous scheduler while always meeting the deadlines for all HWAs, including GPUs, thereby largely improving frame rates

    Evaluating Cache Coherent Shared Virtual Memory for Heterogeneous Multicore Chips

    Full text link
    The trend in industry is towards heterogeneous multicore processors (HMCs), including chips with CPUs and massively-threaded throughput-oriented processors (MTTOPs) such as GPUs. Although current homogeneous chips tightly couple the cores with cache-coherent shared virtual memory (CCSVM), this is not the communication paradigm used by any current HMC. In this paper, we present a CCSVM design for a CPU/MTTOP chip, as well as an extension of the pthreads programming model, called xthreads, for programming this HMC. Our goal is to evaluate the potential performance benefits of tightly coupling heterogeneous cores with CCSVM

    Exploring Processor and Memory Architectures for Multimedia

    Get PDF
    Multimedia has become one of the cornerstones of our 21st century society and, when combined with mobility, has enabled a tremendous evolution of our society. However, joining these two concepts introduces many technical challenges. These range from having sufficient performance for handling multimedia content to having the battery stamina for acceptable mobile usage. When taking a projection of where we are heading, we see these issues becoming ever more challenging by increased mobility as well as advancements in multimedia content, such as introduction of stereoscopic 3D and augmented reality. The increased performance needs for handling multimedia come not only from an ongoing step-up in resolution going from QVGA (320x240) to Full HD (1920x1080) a 27x increase in less than half a decade. On top of this, there is also codec evolution (MPEG-2 to H.264 AVC) that adds to the computational load increase. To meet these performance challenges there has been processing and memory architecture advances (SIMD, out-of-order superscalarity, multicore processing and heterogeneous multilevel memories) in the mobile domain, in conjunction with ever increasing operating frequencies (200MHz to 2GHz) and on-chip memory sizes (128KB to 2-3MB). At the same time there is an increase in requirements for mobility, placing higher demands on battery-powered systems despite the steady increase in battery capacity (500 to 2000mAh). This leaves negative net result in-terms of battery capacity versus performance advances. In order to make optimal use of these architectural advances and to meet the power limitations in mobile systems, there is a need for taking an overall approach on how to best utilize these systems. The right trade-off between performance and power is crucial. On top of these constraints, the flexibility aspects of the system need to be addressed. All this makes it very important to reach the right architectural balance in the system. The first goal for this thesis is to examine multimedia applications and propose a flexible solution that can meet the architectural requirements in a mobile system. Secondly, propose an automated methodology of optimally mapping multimedia data and instructions to a heterogeneous multilevel memory subsystem. The proposed methodology uses constraint programming for solving a multidimensional optimization problem. Results from this work indicate that using today’s most advanced mobile processor technology together with a multi-level heterogeneous on-chip memory subsystem can meet the performance requirements for handling multimedia. By utilizing the automated optimal memory mapping method presented in this thesis lower total power consumption can be achieved, whilst performance for multimedia applications is improved, by employing enhanced memory management. This is achieved through reduced external accesses and better reuse of memory objects. This automatic method shows high accuracy, up to 90%, for predicting multimedia memory accesses for a given architecture

    Multilayered Heterogeneous Parallelism Applied to Atmospheric Constituent Transport Simulation

    Get PDF
    Heterogeneous multicore chipsets with many levels of parallelism are becoming increasingly common in high-performance computing systems. Effective use of parallelism in these new chipsets constitutes the challenge facing a new generation of large scale scientific computing applications. This study examines methods for improving the performance of two-dimensional and three-dimensional atmospheric constituent transport simulation on the Cell Broadband Engine Architecture (CBEA). A function offloading approach is used in a 2D transport module, and a vector stream processing approach is used in a 3D transport module. Two methods for transferring incontiguous data between main memory and accelerator local storage are compared. By leveraging the heterogeneous parallelism of the CBEA, the 3D transport module achieves performance comparable to two nodes of an IBM BlueGene/P, or eight Intel Xeon cores, on a single PowerXCell 8i chip. Module performance on two CBEA systems, an IBM BlueGene/P, and an eight-core shared-memory Intel Xeon workstation are given

    A Survey of Prediction and Classification Techniques in Multicore Processor Systems

    Get PDF
    In multicore processor systems, being able to accurately predict the future provides new optimization opportunities, which otherwise could not be exploited. For example, an oracle able to predict a certain application\u27s behavior running on a smart phone could direct the power manager to switch to appropriate dynamic voltage and frequency scaling modes that would guarantee minimum levels of desired performance while saving energy consumption and thereby prolonging battery life. Using predictions enables systems to become proactive rather than continue to operate in a reactive manner. This prediction-based proactive approach has become increasingly popular in the design and optimization of integrated circuits and of multicore processor systems. Prediction transforms from simple forecasting to sophisticated machine learning based prediction and classification that learns from existing data, employs data mining, and predicts future behavior. This can be exploited by novel optimization techniques that can span across all layers of the computing stack. In this survey paper, we present a discussion of the most popular techniques on prediction and classification in the general context of computing systems with emphasis on multicore processors. The paper is far from comprehensive, but, it will help the reader interested in employing prediction in optimization of multicore processor systems
    corecore