78 research outputs found

    Performance modeling of embedded applications with zero architectural knowledge

    Performance estimation is a key step in the development of an embedded system. Normally, performance evaluation is carried out using a simulator or a mathematical performance model of the target architecture. However, both approaches usually rely on knowledge of the architectural details of the target. In this paper we present a methodology for automatically building an analytical model that estimates the performance of an application on a generic processor without requiring any information about the processor architecture other than that provided by the GNU GCC intermediate representation. The proposed methodology applies linear regression to an application analysis performed on the Register Transfer Level internal representation of the GNU GCC compiler. The benefits of working with this type of model and intermediate representation are threefold: we take into account most of the compiler optimizations, we implicitly consider some architectural characteristics of the target processor, and we can easily estimate the performance of portions of the specification. We validate our approach by evaluating, with cross-validation, the accuracy and generality of the performance models built for the ARM926EJ-S and LEON3 processors.
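The regression step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's actual tooling: the instruction-class feature counts and cycle measurements are hypothetical, standing in for counts extracted from the GCC RTL representation and timings measured on the target processor.

```python
import numpy as np

# Hypothetical per-benchmark feature counts extracted from the compiler's
# intermediate representation: columns = occurrences of instruction classes
# (e.g. ALU ops, loads/stores, branches), rows = training benchmarks.
X = np.array([
    [1200, 300, 150],
    [ 800, 500, 100],
    [2000, 250, 300],
    [ 400, 900,  80],
    [1500, 600, 200],
    [ 700, 200,  90],
], dtype=float)

# Hypothetical measured cycle counts on the target for each benchmark.
y = np.array([4100, 3550, 6300, 3900, 5750, 2500], dtype=float)

# Fit per-instruction-class cost coefficients by least squares.
coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)

def predict_cycles(features):
    """Estimate the cycle count of a new application from its IR features."""
    return float(np.asarray(features, dtype=float) @ coeffs)

# Leave-one-out cross-validation to check how well the model generalizes
# to benchmarks it was not trained on.
errors = []
for i in range(len(y)):
    mask = np.arange(len(y)) != i
    c, *_ = np.linalg.lstsq(X[mask], y[mask], rcond=None)
    errors.append(abs(X[i] @ c - y[i]) / y[i])
print(f"mean LOO relative error: {np.mean(errors):.2%}")
```

Because the model operates on per-portion feature counts, summing the features of a code region and applying the same coefficients yields an estimate for that region alone, which is the "portions of the specification" benefit mentioned in the abstract.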

    Hardware-software codesign in a high-level synthesis environment

    Interfacing hardware-oriented high-level synthesis to software development is a computationally hard problem for which no general solution exists. Under special conditions, the hardware-software codesign (system-level synthesis) problem may be analyzed with traditional tools and efficient heuristics. This dissertation introduces a new alternative to the currently used heuristic methods. The new approach combines the results of top-down hardware development with existing basic hardware units (bottom-up libraries) and compiler generation tools. The optimization goal is to maximize operating frequency or minimize cost with reasonable trade-offs in other properties. The dissertation research provides a unified approach to hardware-software codesign. The improvements over previously existing design methodologies are presented in the framework of an academic CAD environment (PIPE). This CAD environment implements a sufficient subset of the functions of commercial microelectronics CAD packages. The results may be generalized to other general-purpose algorithms or environments. Reference benchmarks are used to validate the new approach. Most of the well-known benchmarks are based on discrete-time numerical simulations, digital filtering applications, and cryptography (an emerging field in benchmarking). As there is a need for high-performance applications, an additional requirement of this dissertation is to investigate pipelined hardware-software systems' performance and design methods. The results demonstrate that the quality of existing heuristics does not change in the enhanced hardware-software environment.

    Toward Lean Hardware/Software System Development: An Evaluation of Selected Complex Electronic System Development Methodologies

    The development of electronic hardware and software has become a major component of large DoD systems. This report surveys a wide set of new electronic hardware/software development methods and develops a system to evaluate them, particularly for cross-system integration. Lean Aerospace Initiative.

    System-Level Power Estimation Methodology for MPSoC based Platforms

    With the rise of new submicron silicon integration technologies, power consumption in Multiprocessor Systems on Chip (MPSoC) has become a primary factor in the design flow. Taking this key factor into account from the earliest design phases plays a crucial role, since it increases component reliability and reduces the time-to-market of the final product. Shifting the design entry point up to the system level is the most important countermeasure adopted to manage the increasing complexity of MPSoCs. The reason is that decisions taken at this level, early in the design cycle, have the greatest impact on the final design in terms of power and energy efficiency. However, taking decisions at this level is very difficult, since the design space is extremely wide and its exploration has so far been a mostly manual activity. Efficient system-level power estimation tools are therefore necessary to enable proper Design Space Exploration (DSE) based on power/energy and timing.

    Meeting U.S. defense needs in the information age : an evaluation of selected complex electronic system development methodologies

    Thesis (M.S.), Massachusetts Institute of Technology, Dept. of Aeronautics and Astronautics, 1995, by Alexander C. Hou. Includes bibliographical references (p. 159-167).

    SCALABLE TECHNIQUES FOR SCHEDULING AND MAPPING DSP APPLICATIONS ONTO EMBEDDED MULTIPROCESSOR PLATFORMS

    A variety of multiprocessor architectures have proliferated, even among off-the-shelf computing platforms. To make use of these platforms, traditional implementation frameworks focus on implementing Digital Signal Processing (DSP) applications using special platform features to achieve high performance. However, due to the fast evolution of the underlying architectures, solution redevelopment is error-prone and the reusability of existing solutions and libraries is limited. In this thesis, we facilitate an efficient migration of DSP systems to multiprocessor platforms while systematically leveraging previous investment in optimized library kernels using dataflow design frameworks. We make these library elements, which are typically tailored to specialized architectures, more amenable to extensive analysis and optimization through an efficient and systematic process. We provide techniques to enable such migration through four basic contributions: 1. We propose and develop a framework to explore efficient utilization of Single Instruction Multiple Data (SIMD) cores and accelerators available in heterogeneous multiprocessor platforms consisting of General Purpose Processors (GPPs) and Graphics Processing Units (GPUs). We also propose new scheduling techniques by applying extensive block processing in conjunction with appropriate task mapping and task ordering methods that match efficiently with the underlying architecture. The approach gives the developer the ability to prototype a GPU-accelerated application and explore its design space efficiently and effectively. 2. We introduce the concept of Partial Expansion Graphs (PEGs) as an implementation model and an associated class of scheduling strategies. PEGs are designed to help realize DSP systems in terms of forms and granularities of parallelism that are well matched to the given applications and targeted platforms.
PEGs also facilitate the derivation of both static and dynamic scheduling techniques, depending on the amount of variability in task execution times and other operating conditions. We show how to implement efficient PEG-based scheduling methods using real-time operating systems, and how to reuse pre-optimized libraries of DSP components within such implementations. 3. We develop new algorithms for scheduling and mapping systems implemented using PEGs. Collectively, these algorithms operate in three steps. First, the amount of data parallelism in the application graph is tuned systematically over many iterations to profit from the available cores in the target platform. Then a mapping algorithm that uses graph analysis distributes data- and task-parallel instances over different cores while trying to balance the load of all processing units and exploit pipeline parallelism. Finally, we use a novel technique for performance evaluation by implementing the scheduler and a customizable solution on the programmable platform. This allows accurate fitness functions to be measured and used to drive runtime adaptation of schedules. 4. In addition to providing scheduling techniques for the mentioned applications and platforms, we also show how to integrate the resulting solution into the underlying environment. This is achieved by leveraging existing libraries and applying the GPP-GPU scheduling framework to augment a popular existing Software Defined Radio (SDR) development environment -- GNU Radio -- with a dataflow foundation and a stand-alone GPU-accelerated library. We also show how to realize the PEG model on real-time operating system libraries, such as the Texas Instruments DSP/BIOS. A code generator that accepts a manual system designer's solution as well as automatically configured solutions is provided to complete the design flow from application model to running system.
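The load-balancing mapping step in contribution 3 can be sketched as follows. This is a minimal illustration under stated assumptions: the actor instances and execution-time estimates are hypothetical, and a greedy longest-processing-time-first heuristic stands in for the thesis's graph-analysis-based mapper.

```python
import heapq

# Hypothetical data-parallel actor instances from an expanded dataflow
# graph, each with an estimated execution cost (arbitrary time units).
instances = [
    ("fir.0", 40), ("fir.1", 40), ("fft.0", 90), ("fft.1", 90),
    ("src",   25), ("sink",  20), ("demod", 60),
]

def map_to_cores(instances, n_cores):
    """Greedy mapping: repeatedly place the next-heaviest instance on the
    currently least-loaded core, so per-core loads stay balanced."""
    heap = [(0, core, []) for core in range(n_cores)]  # (load, core id, tasks)
    heapq.heapify(heap)
    for name, cost in sorted(instances, key=lambda t: -t[1]):
        load, core, tasks = heapq.heappop(heap)  # least-loaded core
        tasks.append(name)
        heapq.heappush(heap, (load + cost, core, tasks))
    return sorted(heap, key=lambda t: t[1])  # order by core id

for load, core, tasks in map_to_cores(instances, 3):
    print(f"core {core}: load={load} tasks={tasks}")
```

In the thesis's flow this static assignment would then be refined at runtime, since measured fitness functions can drive adaptation when task execution times vary.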

    Extensions to the CloudSim simulator based on a first-order parasitic coupling overhead model to study CPU scaling in cloud systems

    Parallel CPU performance scaling depends on many factors, such as application behavior, available cache, and the efficiency of cache use. Computer simulators are used to predict or benchmark the performance of multi-core processors; they fall into two categories: cycle-accurate and high-level models. Cycle-accurate models use queueing theory to conduct the simulation step by step, performing the operations needed to consume a model that represents the application size. High-level models are used to collect, usually via additional hardware, otherwise inaccessible information that represents the behavior of the running system. Both cases are complex: in cycle-accurate models a very long sequence of queues controls the process, making the simulation very time-consuming, while high-level models sometimes need hardware completely different from the actual system in order to read the required states. In addition, programmable-logic systems are sometimes used to implement the cache controls, queues, and communication protocols, which lengthens the running time to accommodate the clock restrictions of the additional hardware. In the Behavioral First Order Performance Model (BFO), a simple model proposed by Kandalintsev and Lo Cigno, all the complexity of the internals of the architecture is treated as parasitic interference between CPUs/cores competing for shared resources. They introduced a coupling factor, which captures the essence of the observed performance behavior of a system. In this work, we used the BFO model to implement a fast cycle-accurate simulator capable of handling even large systems. To do so, the CloudSim simulator was extended to behave as a cycle-accurate simulator. Basically, we implemented CPU and application models that can handle the complexity of different types of scenarios (distinct CPU behavior when running different kinds of applications).
Using offline measurements, we trained the model to extract the coupling factors needed to feed the simulator. We tested the pairwise behavior to make sure that the implementation could reproduce the measured tests. Our simulations consider up to 14 cores (in the case of the Xeon E5-2880v4), and to improve result quality we introduced a correction factor to minimize the (seemingly exponential) observed error, with good results.
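The pairwise coupling-factor idea behind the BFO model can be sketched as follows. This is a minimal illustration under stated assumptions: all timings are hypothetical training measurements rather than real Xeon data, and only the two-workload case is shown (the work above adds a correction factor for larger core counts).

```python
# Solo execution times (seconds) of two hypothetical workloads.
solo_small = {"stream": 10.0, "compute": 8.0}

# Measured times when the two workloads run side by side; the entry
# (a, b) is a's time while co-running with b.
paired_small = {("stream", "compute"): 14.0, ("compute", "stream"): 9.2}

# The coupling factor is each workload's measured slowdown next to its
# co-runner; 1.0 means no interference over shared resources.
k = {pair: t / solo_small[pair[0]] for pair, t in paired_small.items()}

def predict_corun_time(app, other, solo_time):
    """First-order prediction: scale a solo time by the trained factor."""
    return solo_time * k[(app, other)]

# Predict the co-run time of a longer run from its solo time alone.
print(predict_corun_time("stream", "compute", 100.0))  # expect 140.0
```

Training such factors offline for each workload pair is what lets the extended simulator stay fast: it never models caches or queues explicitly, only the aggregate interference each co-runner induces.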

    A Survey on FPGA-Based Heterogeneous Clusters Architectures

    In recent years, the most powerful supercomputers have reached megawatt power consumption levels, an important issue that challenges sustainability and shows the impossibility of maintaining this trend. To date, the prevalent approach to supercomputing has been dominated by CPUs and GPUs. Given their fixed architectures with generic instruction sets, they have been favored with many tools and mature workflows, which has led to mass adoption and further growth. However, reconfigurable hardware such as FPGAs has repeatedly proven that it offers substantial advantages over this supercomputing approach in terms of performance and power consumption. In this survey, we review the most relevant works that have advanced the field of heterogeneous supercomputing using FPGAs, focusing on their architectural characteristics. Each work is divided into three main parts: network, hardware, and software tools. All implementations face challenges that involve all three parts, and these dependencies result in compromises that designers must take into account. The advantages and limitations of each approach are discussed and compared in detail. The classification and study of the architectures illustrate the trade-offs of the solutions and help identify open problems and research lines.