32 research outputs found

    Performance and Memory Space Optimizations for Embedded Systems

    Get PDF
    Embedded systems have three common principles: real-time performance, low power consumption, and low price (limited hardware). Embedded computers use chip multiprocessors (CMPs) to meet these expectations. However, one of the major problems is lack of efficient software support for CMPs; in particular, automated code parallelizers are needed. The aim of this study is to explore various ways to increase performance, as well as reducing resource usage and energy consumption for embedded systems. We use code restructuring, loop scheduling, data transformation, code and data placement, and scratch-pad memory (SPM) management as our tools in different embedded system scenarios. The majority of our work is focused on loop scheduling. Main contributions of our work are: We propose a memory saving strategy that exploits the value locality in array data by storing arrays in a compressed form. Based on the compressed forms of the input arrays, our approach automatically determines the compressed forms of the output arrays and also automatically restructures the code. We propose and evaluate a compiler-directed code scheduling scheme, which considers both parallelism and data locality. It analyzes the code using a locality parallelism graph representation, and assigns the nodes of this graph to processors.We also introduce an Integer Linear Programming based formulation of the scheduling problem. We propose a compiler-based SPM conscious loop scheduling strategy for array/loop based embedded applications. The method is to distribute loop iterations across parallel processors in an SPM-conscious manner. The compiler identifies potential SPM hits and misses, and distributes loop iterations such that the processors have close execution times. We present an SPM management technique using Markov chain based data access. We propose a compiler directed integrated code and data placement scheme for 2-D mesh based CMP architectures. Using a Code-Data Affinity Graph (CDAG) to represent the relationship between loop iterations and array data, it assigns the sets of loop iterations to processing cores and sets of data blocks to on-chip memories. We present a memory bank aware dynamic loop scheduling scheme for array intensive applications.The goal is to minimize the number of memory banks needed for executing the group of loop iterations

    A scratch-pad memory aware dynamic loop scheduling algorithm

    Get PDF
    Executing array based applications on a chip multiprocessor requires effective loop parallelization techniques. One of the critical issues that need to be tackled by an optimizing compiler in this context is loop scheduling, which distributes the iterations of a loop to be executed in parallel across the available processors. Most of the existing work in this area targets cache based execution platforms. In comparison, this paper proposes the first dynamic loop scheduler, to our knowledge, that targets scratch-pad memory (SPM) based chip multiprocessors, and presents an experimental evaluation of it. The main idea behind our approach is to identify the set of loop iterations that access the SPM and those that do not. This information is exploited at runtime to balance the loads of the processors involved in executing the loop nest at hand. Therefore, the proposed dynamic scheduler takes advantage of the SPM in performing the loop iteration-to-processor mapping. Our experimental evaluation with eight array/loop intensive applications reveals that the proposed scheduler is very effective in practice and brings between 13.7% and 41.7% performance savings over a static loop scheduling scheme, which is also tested in our experiments. © 2008 IEEE

    Optimisation des mémoires dans le flot de conception des systèmes multiprocesseurs sur puces pour des applications de type multimédia

    Get PDF
    RÉSUMÉ Les systèmes multiprocesseurs sur puce (MPSoC) constituent l'un des principaux moteurs de la révolution industrielle des semi-conducteurs. Les MPSoCs jouissent d’une popularité grandissante dans le domaine des systèmes embarqués. Leur grande capacité de parallélisation à un très haut niveau d'intégration, en font de bons candidats pour les systèmes et les applications telles que les applications multimédia. La consommation d’énergie, la capacité de calcul et l’espace de conception sont les éléments dont dépendent les performances de ce type d’applications. La mémoire est le facteur clé permettant d’améliorer de façon substantielle leurs performances. Avec l’arrivée des applications multimédias embarquées dans l’industrie, le problème des gains de performances est vital. La masse de données traitées par ces applications requiert une grande capacité de calcul et de mémoire. Dernièrement, de nouveaux modèles de programmation ont fait leur apparition. Ces modèles offrent une programmation de plus haut niveau pour répondre aux besoins croissants des MPSoCs, d’où la nécessité de nouvelles approches d'optimisation et de placement pour les systèmes embarqués et leurs modèles de programmation. La conception niveau système des architectures MPSoCs pour les applications de type multimédia constitue un véritable défi technique. L’objectif général de cette thèse est de relever ce défi en trouvant des solutions. Plus spécifiquement, cette thèse se propose d’introduire le concept d’optimisation mémoire dans le flot de conception niveau système et d’observer leur impact sur différents modèles de programmation utilisés lors de la conception de MPSoCs. Il s’agit, autrement dit, de réaliser l’unification du domaine de la compilation avec celui de la conception niveau système pour une meilleure conception globale. La contribution de cette thèse est de proposer de nouvelles approches pour les techniques d'optimisation mémoire pour la conception MPSoCs avec différents modèles de programmation. Nos travaux de recherche concernent l'intégration des techniques d’optimisation mémoire dans le flot de conception de MPSoCs pour différents types de modèle de programmation. Ces travaux ont été exécutés en collaboration avec STMicroelectronics.----------ABSTRACT Multiprocessor systems-on-chip (MPSoC) are defined as one of the main drivers of the industrial semiconductors revolution. MPSoCs are gaining popularity in the field of embedded systems. Pursuant to their great ability to parallelize at a very high integration level, they are good candidates for systems and applications such as multimedia. Memory is becoming a key player for significant improvements in these applications (i.e. power, performance and area). With the emergence of more embedded multimedia applications in the industry, this issue becomes increasingly vital. The large amount of data manipulated by these applications requires high-capacity calculation and memory. Lately, new programming models have been introduced. These programming models offer a higher programming level to answer the increasing needs of MPSoCs. This leads to the need of new optimization and mapping approaches suitable for embedded systems and their programming models. The overall objective of this research is to find solutions to the challenges of system level design of applications such as multimedia. This entails the development of new approaches and new optimization techniques. The specific objective of this research is to introduce the concept of memory optimization in the system level conception flow and study its impact on different programming models used for MPSoCs’ design. In other words, it is the unification of the compilation and system level design domains. The contribution of this research is to propose new approaches for memory optimization techniques for MPSoCs’ design in different programming models. This thesis relates to the integration of memory optimization to varying programming model types in the MPSoCs conception flow. Our research was done in collaboration with STMicroelectronics

    System Synthesis for Embedded Multiprocessors

    Get PDF
    Modern embedded systems must increasingly accommodate dynamically changing operating environments, high computational requirements, and tight time-to-market windows. Such trends and the ever-increasing design complexity of embedded systems have challenged designers to raise the level of abstraction and replace traditional ad-hoc approaches with more efficient synthesis techniques. Additionally, since embedded multiprocessor systems are typically designed as final implementations for dedicated functions, modifications to embedded system implementations are rare, and this allows embedded system designers to spend significantly larger amounts of time to optimize the architecture and the employed software. This dissertation presents several system-level synthesis algorithms that employ time-intensive optimization techniques that allow the designer to explore a significantly larger part of the design space. It looks at critical issues that are at the core of the synthesis process --- selecting the architecture, partitioning the functionality over the components of the architecture, and scheduling activities such that design constraints and optimization objectives are satisfied. More specifically for the scheduling step, a new solution to the two-step multiprocessor scheduling problem is proposed. For the first step of clustering a highly efficient genetic algorithm is proposed. Several techniques for the second step of merging are proposed and finally a complete two-step effective solution is presented. Also, a randomization technique is applied to existing deterministic techniques to extend these techniques so that they can utilize arbitrary increases in available optimization time. This novel framework for extending deterministic algorithms in our context allows for accurate and fair comparison of our techniques against the state of the art. To further generalize the proposed clustering-based scheduling approach, a complementary two-step multiprocessor scheduling approach for heterogeneous multiprocessor systems is presented. This work is amongst the first works that formally studies the application of clustering to heterogeneous system scheduling. Several techniques are proposed and compared and conclusive results are presented. A modular system-level synthesis framework is then proposed. It synthesizes multi-mode, multi-task embedded systems under a number of hard constraints; optimizes a comprehensive set of objectives; and provides a set of alternative trade-off points in a given multi-objective design evaluation space. An extension of the framework is proposed to better address DVS, memory optimization, and efficient mappings onto dynamically reconfigurable hardware. An integrated framework for energy-driven scheduling onto embedded multiprocessor systems is proposed. It employs a solution representation that encodes both task assignment and ordering into a single chromosome and hence significantly reduces the search space and problem complexity. It is shown that a task assignment and scheduling that result in better performance do not necessarily save power, and hence, integrating task scheduling and voltage scheduling is crucial for fully exploiting the energy-saving potential of an embedded multiprocessor implementation

    Exploring Processor and Memory Architectures for Multimedia

    Get PDF
    Multimedia has become one of the cornerstones of our 21st century society and, when combined with mobility, has enabled a tremendous evolution of our society. However, joining these two concepts introduces many technical challenges. These range from having sufficient performance for handling multimedia content to having the battery stamina for acceptable mobile usage. When taking a projection of where we are heading, we see these issues becoming ever more challenging by increased mobility as well as advancements in multimedia content, such as introduction of stereoscopic 3D and augmented reality. The increased performance needs for handling multimedia come not only from an ongoing step-up in resolution going from QVGA (320x240) to Full HD (1920x1080) a 27x increase in less than half a decade. On top of this, there is also codec evolution (MPEG-2 to H.264 AVC) that adds to the computational load increase. To meet these performance challenges there has been processing and memory architecture advances (SIMD, out-of-order superscalarity, multicore processing and heterogeneous multilevel memories) in the mobile domain, in conjunction with ever increasing operating frequencies (200MHz to 2GHz) and on-chip memory sizes (128KB to 2-3MB). At the same time there is an increase in requirements for mobility, placing higher demands on battery-powered systems despite the steady increase in battery capacity (500 to 2000mAh). This leaves negative net result in-terms of battery capacity versus performance advances. In order to make optimal use of these architectural advances and to meet the power limitations in mobile systems, there is a need for taking an overall approach on how to best utilize these systems. The right trade-off between performance and power is crucial. On top of these constraints, the flexibility aspects of the system need to be addressed. All this makes it very important to reach the right architectural balance in the system. The first goal for this thesis is to examine multimedia applications and propose a flexible solution that can meet the architectural requirements in a mobile system. Secondly, propose an automated methodology of optimally mapping multimedia data and instructions to a heterogeneous multilevel memory subsystem. The proposed methodology uses constraint programming for solving a multidimensional optimization problem. Results from this work indicate that using today’s most advanced mobile processor technology together with a multi-level heterogeneous on-chip memory subsystem can meet the performance requirements for handling multimedia. By utilizing the automated optimal memory mapping method presented in this thesis lower total power consumption can be achieved, whilst performance for multimedia applications is improved, by employing enhanced memory management. This is achieved through reduced external accesses and better reuse of memory objects. This automatic method shows high accuracy, up to 90%, for predicting multimedia memory accesses for a given architecture

    Application-Specific Memory Subsystems

    Get PDF
    The disparity in performance between processors and main memories has led computer architects to incorporate large cache hierarchies in modern computers. These cache hierarchies are designed to be general-purpose in that they strive to provide the best possible performance across a wide range of applications. However, such a memory subsystem does not necessarily provide the best possible performance for a particular application. Although general-purpose memory subsystems are desirable when the work-load is unknown and the memory subsystem must remain fixed, when this is not the case a custom memory subsystem may be beneficial. For example, in an application-specific integrated circuit (ASIC) or a field-programmable gate array (FPGA) designed to run a particular application, a custom memory subsystem optimized for that application would be desirable. In addition, when there are tunable parameters in the memory subsystem, it may make sense to change these parameters depending on the application being run. Such a situation arises today with FPGAs and, to a lesser extent, GPUs, and it is plausible that general-purpose computers will begin to support greater flexibility in the memory subsystem in the future. In this dissertation, we first show that it is possible to create application-specific memory subsystems that provide much better performance than a general-purpose memory subsystem. In addition, we show a way to discover such memory subsystems automatically using a superoptimization technique on memory address traces gathered from applications. This allows one to generate a custom memory subsystem with little effort. We next show that our memory subsystem superoptimization technique can be used to optimize for objectives other than performance. As an example, we show that it is possible to reduce the number of writes to the main memory, which can be useful for main memories with limited write durability, such as flash or Phase-Change Memory (PCM). Finally, we show how to superoptimize memory subsystems for streaming applications, which are a class of parallel applications. In particular, we show that, through the use of ScalaPipe, we can author and deploy streaming applications targeting FPGAs with superoptimized memory subsystems. ScalaPipe is a domain-specific language (DSL) embedded in the Scala programming language for generating streaming applications that can be implemented on CPUs and FPGAs. Using the ScalaPipe implementation, we are able to demonstrate actual performance improvements using the superoptimized memory subsystem with applications implemented in hardware

    Hierarchical Memory Size Estimation for Loop Transformation and Data Memory Platform Optimization

    Get PDF
    In today’s embedded systems, the memory hierarchy is rapidly becoming a major bottleneck in terms of power, performance and area, due to the very large amount of (memory related) data need to be transferred and stored (temporarily). This is especially the case for portable multi-media applications systems. These applications are characterized by deep loop nests and multi-dimensional arrays at the high level. Due to the dramatically increasing size and complexity of system-on-a-chip (SoC) designs and stringent time-to-market requirement, the methodology and tools for chip design must be raised to the system level. Early analysis tools are particularly critical in enabling SoC designers to take full advantage of the many architectural options available. For memory optimization, the early high level techniques aim either to design an optimal memory platform for a given application or to optimize the application code in order to take advantage of the memory platform features, or even both. Loop transformation is such an important high level optimization technique. It modifies the execution order of loops and statements without changing the application functionality. Existing loop transformation algorithms are all performed based either on reduction of data access lifetime and on improvement in data locality and regularity to steer selection of loop transformations. These are, however, very abstract cost functions which do not represent the exact memory size requirement of the arrays and how the data will be mapped onto the memory platform later on. Existing algorithms all result in one final loop transformation solution. As different loop transformations may result in optimal utilization for different memory platform instances, ad-hoc decisions at this stage without estimating their impact on the actual hierarchy utilization can lead to a final sub-optimal solution. An evaluation of later design stages’ effort is hence required. On the other hand, there usually exist a huge number of loop transformation possibilities, the estimation is required to be performed repeatedly and its computation time of the estimation technique also becomes critical to make it useful during the loop transformation search space exploration. This dissertation proposes a memory footprint estimation methodology. An intra-array memory footprint estimation is performed first followed by an interarray estimation. In order to achieve a fast estimate to make it useful repeatedly during the early high level search space exploration, several techniques have been introduced. A fast intra-array memory footprint estimation is performed at the iteration domain based on the maximal lifetime of data accesses, which is defined by the maximal dependency vector. Two approaches, an ILP formulation and vertexes approach, have been introduced for achieving a fast maximal dependency vector calculation. The fast inter-array estimation has been achieved based on several Hanoi tower based approaches. A hierarchical memory size estimation methodology has also been proposed in this dissertation. It estimates the influence of any given sequence of loop transformation instances on the mapping of application data onto a hierarchical memory platform. As the exact memory platform instantiation is often not yet defined at this high level design stage, a platform independent estimation is introduced with a Pareto curve output for each loop transformation instance. It can steer the designer or an automatic steering tool to select all the interesting loop transformation instances that might later lead to low power data mapping for any of the many possible memory hierarchy instances. This is useful when the memory platform is not defined yet, or for a given memory hierarchy instance. It also allows to find the most appropriate low power memory hierarchy instance by performing an early power estimation of different memory hierarchy instances. Initially the source code is used as input for estimation, resulting in an initial approach. However, performing the estimation repeatedly from the source code is too slow for the large loop transformation search space exploration. An incremental approach, based on local updating of the previous result, is thus introduced to handle sequences of different loop transformations. Several advanced techniques have also been used on these two approaches in order to perform a fast estimation, such as bounding box geometrical model based data reuse analysis, platform independent memory hierarchy layer assignment estimation, fast intra- and inter-array memory footprint estimation. The feasibility and usefulness of the methodologies are substantiated using several representative real-life application demonstrators. It shows for instance that the fast memory footprint estimation can be two order of magnitude faster than compared techniques while still achieving fairly accurate estimation result. For hierarchical memory size estimation methodology, the initial approach is two order of magnitude faster than the compared technique and the incremental approach is another two order of magnitude faster than the initial approach, which can just take a few milliseconds. The fast computation time of the incremental approach make it feasible to be used repeatedly during the loop transformation exploration over a very large number of possibilities. Furthermore, prototype CAD tools has been developed that includes mast parts of the methodologies

    Providing QoS with Reduced Energy Consumption via Real-Time Voltage Scaling on Embedded Systems

    Get PDF
    Low energy consumption has emerged as one of the most important design objectives for many modern embedded systems, particularly the battery-operated PDAs. For some soft real-time applications such as multimedia applications, occasional deadline misses can be tolerated. How to leverage this feature to save more energy while still meeting the user required quality of service (QoS) is the research topic this thesis focuses on. We have proposed a new probabilistic design methodology, a set of energy reduction techniques for single and multiple processor systems by using dynamic voltage scaling (DVS), the practical solutions to voltage set-up problem for multiple voltage DVS system, and a new QoS metric. Most present design space exploration techniques, which are based on application's worst case execution time, often lead to over-designing systems. We have proposed the probabilistic design methodology for soft real-time embedded systems by using detailed execution time information in order to reduce the system resources while delivering the user required QoS probabilistically. One important phase in the probabilistic design methodology is the offline/online resource management. As an example, we have proposed a set of energy reduction techniques by employing DVS techniques to exploit the slacks arising from the tolerance to deadline misses for single and multiple processor systems while meeting the user required completion ratio statistically. Multiple-voltage DVS system is predicted as the future low-power system by International Technology Roadmap for Semiconductors (ITRS). In order to find the best way to employ DVS, we have formulated the voltage set-up problem and provided its practical solutions that seek the most energy efficient voltage setting for the design of multiple-voltage DVS systems. We have also presented a case study in designing energy-efficient dual voltage soft real-time system with (m, k)-firm deadline guarantee. Although completion ratio is widely used as a QoS metric, it can only be applied to the applications with independent tasks. We have proposed a new QoS metric that differentiates firm and soft deadlines and considers the task dependency as well. Based on this new metric, we have developed a set of online scheduling algorithms that enhance quality of presentation (QoP) significantly, particularly for overloaded systems

    Performance, Power Modeling and Optimization for High-Performance Computing Systems

    Get PDF
    University of Minnesota Ph.D. dissertation.October 2016. Major: Electrical/Computer Engineering. Advisor: John Sartori. 1 computer file (PDF); xi, 154 pages.Heterogeneity abounds in modern high-performance computing systems. Applications are heterogeneous, containing time-varying unbalanced utilization for different resources, and system architectures have become heterogeneous in order to achieve higher levels of performance and energy efficiency. The most powerful, and also the most energy-efficient high-performance computing systems today consist of many-core CPUs and GPGPUs with a variety of specialize on-chip and off-chip memories. These heterogeneous systems provide a huge amount of computing resources, but it is becoming increasingly challenging to use them effectively and efficiently to maximize their potential. This becomes an even more pressing challenge as energy efficiency becomes the primary barrier to achieving higher levels of performance. This thesis addresses the challenges of performance modeling and optimization in heterogeneous high-performance computing systems. Effective system optimization requires understanding of how performance and power change in response to optimizations. Therefore, we begin by summarizing the impact of modern architectural advances on performance and power modeling for chip multiprocessors (CMPs). We present two models that estimate the performance and power in such systems. The first model, CAMP, is a fast and accurate cache-aware performance model that estimates the performance degradation due to cache contention of processes running on cache-sharing cores. We then propose a system-level power model for a multi-programmed CMP environment that accounts for cache contention. We explain how to integrate the two models to enable power-aware process assignment. Then, we propose an off-chip memory access-aware runtime DVFS control technique that minimizes energy consumption subject to a constraint on application execution time. The second part of the dissertation focuses on improving performance for GPGPUs. After a thorough analysis on CPI breakdown, we lay out all the key factors that govern GPU throughput. In order to improve overall performance for GPGPUs, we propose two approaches that address the key factors, without introducing extra congestion and degradation to the system. We first propose a new two-level priority scheduling policy to improve overall performance by optimizing effective degree of parallelism. Then, we propose ICMT, a full, detailed solution for intra-core multitasking for GPGPUs, including architectural support and a contention-aware workload scheduling algorithm that improves all the key factors in a balanced fashion. Furthermore, we propose a new contention-aware analytical performance model that provides fine-grained workload scheduling decisions for intra-core multitasking, including detailed resource allocation from co-scheduled workloads
    corecore