99 research outputs found

    Hardware-software codesign in a high-level synthesis environment

    Get PDF
    Interfacing hardware-oriented high-level synthesis to software development is a computationally hard problem for which no general solution exists. Under special conditions, the hardware-software codesign (system-level synthesis) problem may be analyzed with traditional tools and efficient heuristics. This dissertation introduces a new alternative to the currently used heuristic methods. The new approach combines the results of top-down hardware development with existing basic hardware units (bottom-up libraries) and compiler generation tools. The optimization goal is to maximize operating frequency or minimize cost with reasonable tradeoffs in other properties. The dissertation research provides a unified approach to hardware-software codesign. The improvements over previously existing design methodologies are presented in the frame-work of an academic CAD environment (PIPE). This CAD environment implements a sufficient subset of functions of commercial microelectronics CAD packages. The results may be generalized for other general-purpose algorithms or environments. Reference benchmarks are used to validate the new approach. Most of the well-known benchmarks are based on discrete-time numerical simulations, digital filtering applications, and cryptography (an emerging field in benchmarking). As there is a need for high-performance applications, an additional requirement for this dissertation is to investigate pipelined hardware-software systems\u27 performance and design methods. The results demonstrate that the quality of existing heuristics does not change in the enhanced, hardware-software environment

    Mr.Wolf: An Energy-Precision Scalable Parallel Ultra Low Power SoC for IoT Edge Processing

    Get PDF
    This paper presents Mr. Wolf, a parallel ultra-low power (PULP) system on chip (SoC) featuring a hierarchical architecture with a small (12 kgates) microcontroller (MCU) class RISC-V core augmented with an autonomous IO subsystem for efficient data transfer from a wide set of peripherals. The small core can offload compute-intensive kernels to an eight-core floating-point capable of processing engine available on demand. The proposed SoC, implemented in a 40-nm LP CMOS technology, features a 108-mu W fully retentive memory (512 kB). The IO subsystem is capable of transferring up to 1.6 Gbit/s from external devices to the memory in less than 2.5 mW. The eight-core compute cluster achieves a peak performance of 850 million of 32-bit integer multiply and accumulate per second (MMAC/s) and 500 million of 32-bit floating-point multiply and accumulate per second (MFMAC/s) -1 GFlop/s-with an energy efficiency up to 15 MMAC/s/mW and 9 MFMAC/s/mW. These building blocks are supported by aggressive on-chip power conversion and management, enabling energy-proportional heterogeneous computing for always-on IoT end nodes improving performance by several orders of magnitude with respect to traditional single-core MCUs within a power envelope of 153 mW. We demonstrated the capabilities of the proposed SoC on a wide set of near-sensor processing kernels showing that Mr. Wolf can deliver performance up to 16.4 GOp/s with energy efficiency up to 274 MOp/s/mW on real-life applications, paving the way for always-on data analytics on high-bandwidth sensors at the edge of the Internet of Things

    Enabling VLSI processing blocks for MIMO-OFDM Communications

    Get PDF
    Multi-input multi-output (MIMO) systems combined with orthogonal frequency-division multiplexing (OFDM) gained a wide popularity in wireless applications due to the potential of providing increased channel capacity and robustness against multipath fading channels. However these advantages come at the cost of a very high processing complexity and the efficient implementation of MIMO-OFDM receivers is today a major research topic. In this paper, efficient architectures are proposed for the hardware implementation of the main building blocks of a MIMO-OFDM receiver. A sphere decoder architecture flexible to different modulation without any loss in BER performance is presented while the proposed matrix factorization implementation allows to achieve the highest throughput specified in the IEEE 802.11n standard. Finally a novel sphere decoder approach is presented, which allows for the realization of new golden space time trellis coded modulation (GST-TCM) scheme. Implementation cost and offered throughput are provided for the proposed architectures synthesized on a 0.13  CMOS standard cell technology or on advanced FPGA devices

    System Level Performance Evaluation of Distributed Embedded Systems

    Get PDF
    In order to evaluate the feasibility of the distributed embedded systems in different application domains at an early phase, the System Level Performance Evaluation (SLPE) must provide reliable estimates of the nonfunctional properties of the system such as end-to-end delays and packet losses rate. The values of these non-functional properties depend not only on the application layer of the OSI model but also on the technologies residing at the MAC, transport and Physical layers. Therefore, the system level performance evaluation methodology must provide functionally accurate models of the protocols and technologies operating at these layers. After conducting a state of the art survey, it was found that the existing approaches for SLPE are either specialized for a particular domain of systems or apply a particular model of computation (MOC) for modeling the communication and synchronization between the different components of a distributed application. Therefore, these approaches abstract the functionalities of the data-link, Transport and MAC layers by the highly abstract message passing methods employed by the different models of computation. On the other hand, network simulators such as OMNeT++, ns-2 and Opnet do not provide the models for platform components of devices such as processors and memories and totally abstract the application processing by delays obtained via traffic generators. Therefore the system designer is not able to determine the potential impact of an application in terms of utilization of the platform used by the device. Hence, for a system level performance evaluation approach to estimate both the platform utilization and the non-functional properties which are a consequence of the lower layers of OSI models (such as end-to-end delays), it must provide the tools for automatic workload extraction of application workload models at various levels of refinement and functionally correct models of lower layers of OSI model (Transport MAC and Physical layers). Since ABSOLUT is not restricted to a particular domain and also does not depend on any MOC, therefore it was selected for the extension to a system level performance evaluation approach for distributed embedded systems. The models of data-link and Transport layer protocols and automatic workload generation of system calls was not available in ABSOLUT performance evaluation methodology. The, thesis describes the design and modelling of these OSI model layers and automatic workload generation tool for system calls. The tools and models integrated to ABSOLUT methodology were used in a number of case studies. The accuracy of the protocols was compared to network simulators and real systems. The results were 88% accurate for user space code of the application layer and provide an improvement of over 50% as compared to manual models for external libraries and system calls. The ABSOLUT physical layer models were found to be 99.8% accurate when compared to analytical models. The MAC and transport layer models were found to be 70-80% accurate when compared with the same scenarios simulated by ns-2 and OMNeT++ simulators. The bit error rates, frame error probability and packet loss rates show close correlation with the analytical methods .i.e., over 99%, 92% and 80% respectively. Therefore the results of ABSOLUT framework for application layer outperform the results of performance evaluation approaches which employ virtual systems and at the same time provide as accurate estimates of the end-to-end delays and packet loss rate as network simulators. The results of the network simulators also vary in absolute values but they follow the same trend. Therefore, the extensions made to ABSOLUT allow the system designer to identify the potential bottlenecks in the system at different OSI model layers and evaluate the non-functional properties with a high level of accuracy. Also, if the system designer wants to focus entirely on the application layer, different models of computations can be easily instantiated on top of extended ABSOLUT framework to achieve higher simulation speeds as described in the thesis

    Dynamically reconfigurable bio-inspired hardware

    Get PDF
    During the last several years, reconfigurable computing devices have experienced an impressive development in their resource availability, speed, and configurability. Currently, commercial FPGAs offer the possibility of self-reconfiguring by partially modifying their configuration bitstream, providing high architectural flexibility, while guaranteeing high performance. These configurability features have received special interest from computer architects: one can find several reconfigurable coprocessor architectures for cryptographic algorithms, image processing, automotive applications, and different general purpose functions. On the other hand we have bio-inspired hardware, a large research field taking inspiration from living beings in order to design hardware systems, which includes diverse topics: evolvable hardware, neural hardware, cellular automata, and fuzzy hardware, among others. Living beings are well known for their high adaptability to environmental changes, featuring very flexible adaptations at several levels. Bio-inspired hardware systems require such flexibility to be provided by the hardware platform on which the system is implemented. In general, bio-inspired hardware has been implemented on both custom and commercial hardware platforms. These custom platforms are specifically designed for supporting bio-inspired hardware systems, typically featuring special cellular architectures and enhanced reconfigurability capabilities; an example is their partial and dynamic reconfigurability. These aspects are very well appreciated for providing the performance and the high architectural flexibility required by bio-inspired systems. However, the availability and the very high costs of such custom devices make them only accessible to a very few research groups. Even though some commercial FPGAs provide enhanced reconfigurability features such as partial and dynamic reconfiguration, their utilization is still in its early stages and they are not well supported by FPGA vendors, thus making their use difficult to include in existing bio-inspired systems. In this thesis, I present a set of architectures, techniques, and methodologies for benefiting from the configurability advantages of current commercial FPGAs in the design of bio-inspired hardware systems. Among the presented architectures there are neural networks, spiking neuron models, fuzzy systems, cellular automata and random boolean networks. For these architectures, I propose several adaptation techniques for parametric and topological adaptation, such as hebbian learning, evolutionary and co-evolutionary algorithms, and particle swarm optimization. Finally, as case study I consider the implementation of bio-inspired hardware systems in two platforms: YaMoR (Yet another Modular Robot) and ROPES (Reconfigurable Object for Pervasive Systems); the development of both platforms having been co-supervised in the framework of this thesis

    WTEC Panel Report on International Assessment of Research and Development in Simulation-Based Engineering and Science

    Full text link

    Contribution à l’ordonnancement dynamique, tolérant aux fautes, de tâches pour les systèmes embarqués temps-réel multiprocesseurs

    Get PDF
    The thesis is concerned with online mapping and scheduling of tasks on multiprocessor embedded systems in order to improve the reliability subject to various constraints regarding e.g. time, or energy. To evaluate system performances, the number of rejected tasks, algorithm complexity and resilience assessed by injecting faults are analysed. The research was applied to: (i) the primary/backup approach technique, which is a fault tolerant one based on two task copies, and (ii) the scheduling algorithms for small satellites called CubeSats. The chief objective for the primary/backup approach is to analyse processor allocation strategies, devise novel enhancing scheduling methods and to choose one, which significantly reduces the algorithm run-time without worsening the system performances. Regarding CubeSats, the proposed idea is to gather all processors built into satellites on one board and design scheduling algorithms to make CubeSats more robust as to the faults. Two real CubeSat scenarios are analysed and it is found that it is useless to consider systems with more than six processors and that the presented algorithms perform well in a harsh environment and with energy constraints.La thèse se focalise sur le placement et l’ordonnancement dynamique des tâches sur les systèmes embarqués multiprocesseurs pour améliorer leur fiabilité tout en tenant compte des contraintes telles que le temps réel ou l’énergie. Afin d’évaluer les performances du système, le nombre de tâches rejetées, la complexité de l’algorithme et la résilience estimée en injectant des fautes sont principalement analysés. La recherche est appliquée (i) à l’approche de « primary/backup » qui est une technique de tolérance aux fautes basée sur deux copies d’une tâche et (ii) aux algorithmes de placement pour les petits satellites appelés CubeSats. Quant à l’approche de « primary/backup », l’objectif principal est d’étudier les stratégies d’allocation des processeurs, de proposer de nouvelles méthodes d’amélioration pour l’ordonnancement et d’en choisir une qui diminue considérablement la durée de l’exécution de l’algorithme sans dégrader les performances du système. En ce qui concerne les CubeSats, l’idée est de regrouper tous les processeurs à bord et de concevoir des algorithmes d’ordonnancement afin de rendre les CubeSats plus robustes. Les scénarios provenant de deux CubeSats réels sont étudiés et les résultats montrent qu’il est inutile de considérer les systèmes ayant plus de six processeurs et que les algorithmes proposés fonctionnent bien même avec des capacités énergétiques limitées et dans un environnement hostile
    • …
    corecore