917 research outputs found

    Asymmetric Cache Coherency: Policy Modifications to Improve Multicore Performance

    Asymmetric coherency is a new optimisation method for coherency policies to support non-uniform workloads in multicore processors. Asymmetric coherency assists in load balancing a workload, and this is applicable to SoC multicores where the applications are not evenly spread among the processors and customization of the coherency is possible. Asymmetric coherency is a policy change, and consequently our designs require little or no additional hardware over an existing system. We explore two different types of asymmetric coherency policies. Our bus-based asymmetric coherency policy generated a 60% coherency cost reduction (reduction of latencies due to coherency messages) for non-shared data. Our directory-based asymmetric coherency policy showed up to a 5.8% execution time improvement and up to a 22% improvement in average memory latency for the parallel benchmark Sha, using a statically allocated asymmetry. Dynamically allocated asymmetry was found to generate further improvements in access latency, increasing the effectiveness of asymmetric coherency by up to 73.8% when compared to the static asymmetric solution.

    Power, Performance, and Energy Management of Heterogeneous Architectures

    Modern many-core multiprocessor systems-on-chip offer tremendous power and performance optimization opportunities by tuning thousands of potential voltage, frequency and core configurations. Applications running on these architectures are becoming increasingly complex. As the basic building blocks that make up the application change during runtime, different configurations may become optimal with respect to power, performance or other metrics. Identifying the optimal configuration at runtime is a daunting task due to the large number of workloads and configurations. Therefore, there is a strong need to evaluate the metrics of interest as a function of the supported configurations. This thesis focuses on two different types of modern multiprocessor systems-on-chip (SoC): mobile heterogeneous systems and the tile-based Intel Xeon Phi architecture. For mobile heterogeneous systems, this thesis presents a novel methodology that can accurately instrument different types of applications with specific performance monitoring calls. These calls provide a rich set of performance statistics at a basic block level while the application runs on the target platform. The target architecture used for this work (Odroid XU3) is capable of running at 4940 different frequency and core combinations. With the help of the instrumented applications, a vast amount of characterization data is collected that provides details about performance, power and CPU state at every instrumented basic block across 19 different types of applications. This data has enabled two runtime schemes. The first work provides a methodology to find optimal configurations in a heterogeneous architecture using classifiers and demonstrates an average increase of 93%, 81% and 6% in performance per watt (PPW) compared to the interactive, ondemand and powersave governors, respectively.
The second work, using the same data, presents a novel imitation learning framework for dynamically controlling the type, number, and frequencies of active cores to achieve an average PPW improvement of 109% compared to the default governors. This work also presents how to accurately profile the tile-based Intel Xeon Phi architecture while training different types of neural networks using an open image dataset on a deep learning framework. The data collected allows deep exploratory analysis. It also showcases how different hardware parameters affect the performance of the Xeon Phi. Dissertation/Thesis, Masters Thesis, Engineering 201

    Power aware data and memory management for dynamic applications

    In recent years, the semiconductor industry has turned its focus towards heterogeneous multiprocessor platforms. They are an economically viable solution for coping with the growing setup and manufacturing cost of silicon systems. Furthermore, their inherent flexibility perfectly supports the emerging market of interactive, mobile data and content services. The platform’s performance and energy depend largely on how well the data-dominated services are mapped on the memory subsystem. A crucial aspect thereby is how efficiently data is transferred between the different memory layers. Several compilation techniques have been developed to optimally use the available bandwidth. Unfortunately, they do not take the interaction between multiple threads into account and do not deal with the dynamic behaviour of these novel applications. The main limitations of current techniques are outlined and an approach for dealing with them is introduced.

    ReSP: A Nonintrusive Transaction-Level Reflective MPSoC Simulation Platform for Design Space Exploration


    Achieving a better balance between productivity and performance on FPGAs through Heterogeneous Extensible Multiprocessor Systems

    Field Programmable Gate Arrays (FPGAs) were first introduced circa 1980, and they held the promise of delivering performance levels associated with customized circuits, but with productivity levels more closely associated with software development. Achieving both performance and productivity objectives has been a long-standing challenge for the reconfigurable computing community and remains unsolved today. On one hand, vendor-supplied design flows have tended towards achieving high levels of performance through gate-level customization, but at the cost of very low productivity. On the other hand, FPGA densities are following Moore's law and can now support complete multiprocessor system architectures. Thus FPGAs can be turned into an architecture with programmable processors, which brings productivity but sacrifices the peak performance advantages of custom circuits. In this thesis we explore how the two use cases can be combined to achieve the best of both. The flexibility of FPGAs to host a heterogeneous multiprocessor system with different types of programmable processors and custom accelerators allows software developers to design a platform that matches the unique performance needs of their application. However, no automated approaches are currently publicly available to create such heterogeneous architectures or the software support for these platforms. Creating base architectures, configuring multiple tool chains, and repetitive engineering design efforts can and should be automated. This thesis introduces the Heterogeneous Extensible Multiprocessor System (HEMPS) template approach, which allows an FPGA to be programmed with productivity levels close to those associated with parallel processing, and with performance levels close to those associated with customized circuits.
The work in this thesis introduces an ArchGen script to automate the generation of HEMPS systems, as well as a library of portable and self-tuning polymorphic functions. These tools abstract away the HW/SW co-design details and provide a transparent programming language to capture different levels of parallelism, without sacrificing productivity or portability.

    Techniques to Improve Energy Efficiency on Heterogeneous Multiprocessors under Timing and Quality Constraints

    Traditionally, applications are executed without the notion of a computational deadline and often use all available system resources, which leads to higher energy consumption. User specification of Quality of Service (QoS) constraints, in terms of completion time and solution quality, makes it possible to allocate just enough resources for an application to finish just in time and thereby save energy. Modern heterogeneous multiprocessor (HMP) platforms provide a set of configurable resources, including a frequency range for dynamic voltage and frequency scaling (DVFS), one among a set of processor types, and one or more processors of each type. They can be configured at run-time to open up new opportunities for resource management. This thesis presents techniques to reduce energy consumption under QoS constraints by allocating resources at run-time on heterogeneous multiprocessor platforms targeting sequential and parallel iterative and task-parallel applications. The proposed techniques rely on a progress-tracking framework that monitors and predicts how much time is left until the application finishes. Furthermore, the proposed framework enables the prediction of computation demand and performance requirements for future iterations or tasks. The first contribution of this thesis is a resource management technique, called SLOOP, targeting single-threaded applications. SLOOP allocates resources, i.e., processor type and DVFS, for each iteration to meet deadlines while using the prediction of computational demand and execution time. The second contribution of this thesis is a resource-management scheme, called SaC, for multi-threaded applications executing on HMPs, where resources also include the number of processors besides DVFS and processor type.
SaC first chooses the most energy-efficient configuration that meets the deadline. The proposed technique collects execution-time slack over subsequent iterations to select a configuration that can save energy. The third contribution of this thesis is a resource manager, called Task-RM, for task-parallel applications executing on HMPs under QoS constraints. Task-RM exploits the variance in task execution times and imbalance between sibling tasks to allocate just enough resources in terms of DVFS and processor type. It uses an innovative off-line analysis to avoid redoing scheduling analysis at run-time. Finally, the fourth contribution is a scheme, called Approx-RM, that can exploit accuracy-energy trade-offs in approximate iterative applications. Approx-RM allocates an appropriate amount of resources while guaranteeing timing and solution quality specifications. Approx-RM first predicts the iteration count required to meet the quality target and then allocates enough resources on an HMP in terms of DVFS, processor type, and processor count to save energy while meeting a performance target.
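
The selection step described in this abstract (choose the most energy-efficient configuration that still meets the deadline) can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the `Config` fields, the `pick_config` function, and the candidate numbers are all hypothetical, and the time and power predictions are assumed to come from some external progress-tracking model.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Config:
    """One hypothetical resource configuration on an HMP platform."""
    core_type: str       # illustrative labels such as "big" or "little"
    cores: int           # number of active processors of that type
    freq_mhz: int        # DVFS operating point
    est_time_s: float    # predicted time to finish the remaining work
    est_power_w: float   # predicted average power at this configuration

    @property
    def est_energy_j(self) -> float:
        # Energy estimate = predicted power * predicted time.
        return self.est_power_w * self.est_time_s


def pick_config(configs: Iterable[Config], time_left_s: float) -> Optional[Config]:
    """Return the lowest-energy configuration predicted to meet the deadline.

    Returns None when no candidate is predicted to finish in time.
    """
    feasible = [c for c in configs if c.est_time_s <= time_left_s]
    if not feasible:
        return None
    return min(feasible, key=lambda c: c.est_energy_j)


# Made-up candidates, for illustration only.
CANDIDATES = [
    Config("little", 4, 1400, est_time_s=2.0, est_power_w=1.0),  # 2.0 J
    Config("big", 2, 1000, est_time_s=1.2, est_power_w=2.0),     # 2.4 J
    Config("big", 4, 2000, est_time_s=0.5, est_power_w=6.0),     # 3.0 J
]
```

With 2.5 s of slack the sketch picks the little-core candidate; tightening the deadline to 1.5 s pushes the choice to the two-core big configuration, which costs more energy but finishes in time.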

    A Task-Graph Execution Manager for Reconfigurable Multi-tasking Systems

    Reconfigurable hardware can be used to build multi-tasking systems that dynamically adapt themselves to the requirements of the running applications. This is especially useful in embedded systems, since the available resources are very limited and the reconfigurable hardware can be reused for different applications. In these systems, computations are frequently represented as task graphs that are executed taking into account their internal dependencies and the task schedule. The management of the task graph execution is critical for system performance. In this regard, we have developed two different versions, a software module and a hardware architecture, of a generic task-graph execution manager for reconfigurable multi-tasking systems. The second version reduces the run-time management overheads by almost two orders of magnitude. Hence it is especially suitable for systems with demanding timing constraints. Both versions include specific support to optimize the reconfiguration process.
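
As a rough illustration of what executing a task graph "taking into account its internal dependencies" involves, the following Python sketch releases a task only once all of its predecessors have completed (Kahn's topological ordering). It is a generic textbook technique, not the manager described in the abstract; the function name `execution_order` and the example graph are hypothetical, and reconfiguration costs and scheduling policy are ignored.

```python
from collections import deque
from typing import Dict, List, Set


def execution_order(deps: Dict[str, Set[str]]) -> List[str]:
    """Return an order in which tasks can run without violating dependencies.

    `deps` maps each task to the set of tasks it depends on. Every task that
    appears as a dependency must also appear as a key (assumption of this sketch).
    """
    remaining = {task: len(preds) for task, preds in deps.items()}
    successors: Dict[str, List[str]] = {task: [] for task in deps}
    for task, preds in deps.items():
        for pred in preds:
            successors[pred].append(task)

    # Tasks with no pending predecessors are ready to run.
    ready = deque(sorted(t for t, n in remaining.items() if n == 0))
    order: List[str] = []
    while ready:
        task = ready.popleft()
        order.append(task)
        # Completing a task may release its successors.
        for succ in sorted(successors[task]):
            remaining[succ] -= 1
            if remaining[succ] == 0:
                ready.append(succ)

    if len(order) != len(deps):
        raise ValueError("task graph contains a cycle")
    return order


# Hypothetical four-task graph: two tasks depend on "load", "merge" joins them.
GRAPH = {
    "load": set(),
    "filter": {"load"},
    "fft": {"load"},
    "merge": {"filter", "fft"},
}
```

Calling `execution_order(GRAPH)` yields an order in which `load` comes first and `merge` last; a real manager would additionally decide when to reconfigure hardware and which ready task to dispatch first.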

    Power-efficient data management for dynamic applications

    In recent years, the semiconductor industry has turned its focus towards heterogeneous multi-processor platforms. They are an economically viable solution for coping with the growing setup and manufacturing cost of silicon systems. Furthermore, their inherent flexibility also perfectly supports the emerging market of interactive, mobile data and content services. The platform's performance and energy depend largely on how well the data-dominated services are mapped on the memory subsystem. A crucial aspect thereby is how efficiently data is transferred between the different memory layers. Several compilation techniques have been developed to optimally use the available bandwidth. Unfortunately, they do not take the interaction between multiple threads running on the different processors into account, optimize the bandwidth only locally, and do not deal with the dynamic behavior of these applications. The contributions of this chapter are to outline the main limitations of current techniques and to introduce an approach for dealing with the dynamic multi-threaded behavior of our application domain.

    On the design of multimedia architectures : proceedings of a one-day workshop, Eindhoven, December 18, 2003
