6,130 research outputs found

    The Octopus switch

    Get PDF
    This chapter1 discusses the interconnection architecture of the Mobile Digital Companion. The approach to build a low-power handheld multimedia computer presented here is to have autonomous, reconfigurable modules such as network, video and audio devices, interconnected by a switch rather than by a bus, and to offload as much as work as possible from the CPU to programmable modules placed in the data streams. Thus, communication between components is not broadcast over a bus but delivered exactly where it is needed, work is carried out where the data passes through, bypassing the memory. The amount of buffering is minimised, and if it is required at all, it is placed right on the data path, where it is needed. A reconfigurable internal communication network switch called Octopus exploits locality of reference and eliminates wasteful data copies. The switch is implemented as a simplified ATM switch and provides Quality of Service guarantees and enough bandwidth for multimedia applications. We have built a testbed of the architecture, of which we will present performance and energy consumption characteristics

    EC-CENTRIC: An Energy- and Context-Centric Perspective on IoT Systems and Protocol Design

    Get PDF
    The radio transceiver of an IoT device is often where most of the energy is consumed. For this reason, most research so far has focused on low power circuit and energy efficient physical layer designs, with the goal of reducing the average energy per information bit required for communication. While these efforts are valuable per se, their actual effectiveness can be partially neutralized by ill-designed network, processing and resource management solutions, which can become a primary factor of performance degradation, in terms of throughput, responsiveness and energy efficiency. The objective of this paper is to describe an energy-centric and context-aware optimization framework that accounts for the energy impact of the fundamental functionalities of an IoT system and that proceeds along three main technical thrusts: 1) balancing signal-dependent processing techniques (compression and feature extraction) and communication tasks; 2) jointly designing channel access and routing protocols to maximize the network lifetime; 3) providing self-adaptability to different operating conditions through the adoption of suitable learning architectures and of flexible/reconfigurable algorithms and protocols. After discussing this framework, we present some preliminary results that validate the effectiveness of our proposed line of action, and show how the use of adaptive signal processing and channel access techniques allows an IoT network to dynamically tune lifetime for signal distortion, according to the requirements dictated by the application

    Design techniques for low-power systems

    Get PDF
    Portable products are being used increasingly. Because these systems are battery powered, reducing power consumption is vital. In this report we give the properties of low-power design and techniques to exploit them on the architecture of the system. We focus on: minimizing capacitance, avoiding unnecessary and wasteful activity, and reducing voltage and frequency. We review energy reduction techniques in the architecture and design of a hand-held computer and the wireless communication system including error control, system decomposition, communication and MAC protocols, and low-power short range networks

    Hardware runtime management for task-based programming models

    Get PDF
    Task-based programming models allow programmers to express applications as a collection of tasks with dependences. They are simple to use and greatly improve programmability by using software runtimes to exploit task parallelism and heterogeneity over multi-core, many-core and heterogeneous platforms. In these programming models, the runtimes guarantee correct execution order by managing tasks using task-dependence graphs (TDGs). These runtimes are powerful enough to provide high performance with coarse-grained tasks although they impose overheads on the application execution to maintain all the information they need to do their work. However, as the current trend in processor architectures keeps including more cores and heterogeneity (in fact complexity) in the systems, coarse-grained parallelism is not enough to feed all the underlying resources. Instead, fine-grained tasks are preferable as they are able to expose higher parallelism in applications but the overheads introduced by the software runtimes under these conditions prevent an efficient exploitation of fine-grained parallelism. The two most critical runtime overheads are task dependence graph management and task scheduling to heterogeneous systems. We propose a hardware architecture Picos, consisting of a hardware task dependence manager including nested task support, and a heterogeneous task scheduler, to accelerate the critical runtime functions for task-based programming models. With Picos, we aim at extending the benefit of these programming models into exploiting fine-grained task parallelism and heterogeneity. As a proof-of-concept, Three prototypes of Picos have been designed in VHDL and implemented in a System-on-chip platform consisting of regular ARM SMP cores and an integrated FPGA. They also have been analyzed with real benchmarks with OmpSs running and Linux on the platform. The first prototype is a hardware task dependence manager, which has been implemented in a Xilinx Zynq 7000 series SoCs. It is connected to a 2-core ARM Cortex A9 processor, with bare-metal OS integration. With 24 simulated workers, and running real task-dependence analysis in Picos, it scales up to 21x speedup. The second prototype Picos++ extended Picos with an exciting new feature for nested task support in hardware. To the best of our knowledge, this is the first time that such a feature has been support fully in hardware task dependence managers. This prototype is fully integrated in not only hardware, but also with a State-of-the-Art parallel programming model, and with Linux. The third prototype includes both a hardware task dependence manager and a heterogeneous task scheduler. The heterogeneous task scheduler receives ready tasks from the task-dependence manager and then schedule them to hardware execution units that have the estimated earliest finish time. It is implemented in a Xilinx Zynq Ultrascale+ MPSoC chip. In a system with 4 threads and up to 15 HW accelerators, it achieves up to 16.2x speedup for real benchmarks, and saves up to 90% of energy.Los modelos de programaciĂłn basados en tareas permiten a los programadores expresar las aplicaciones como una colecciĂłn de tareas con dependencias entre ellas. Dichos modelos son simples de usar y mejoran enormemente la programabilidad. Para ello se valen del uso de una runtime que en tiempo de ejecuciĂłn ayuda a explotar el paralelismo de las tareas cuando se ejecutan en plataformas multi-cores, many-cores y heterogĂ©neas. En estos modelos de programaciĂłn los runtimes garantizan la ejecuciĂłn de las tareas en el orden correcto mediante el uso de grĂĄficos de dependencias entre tareas (TDG). Actualmente, los runtimes son lo suficientemente potentes para proporcionar un alto rendimiento con tareas de granularidad gruesa a pesar de que para mantener toda la informaciĂłn que necesitan para hacer su trabajo, introducen un sobrecoste importante en la ejecuciĂłn de las aplicaciones. El problema viene dado por la tendencia actual en arquitectura de computadores a seguir incluyendo mĂĄs nĂșcleos y heterogeneidad (de hecho, complejidad) en los sistemas de procesado con lo que el paralelismo de granularidad gruesa no es suficiente para alimentar todos los recursos. En estos entornos complejos las tareas de granularidad fina son preferibles ya que son capaces de exponer un mayor paralelismo de las aplicaciones. Sin embargo, con tareas de granularidad fina, los sobrecostes introducidos por los runtimes software son mayores debido a la necesidad de manejar muchas mĂĄs tareas mĂĄs rĂĄpido. En general los mayores sobrecostes introducidos por los runtimes son: la administraciĂłn de los grafos de dependencias que relacionan las tareas y la gestiĂłn de las tareas en sistemas heterogĂ©neos. Proponemos una arquitectura hardware, llamada Picos, que consiste en un administrador de dependencias entre tareas incluyendo soporte para tareas anidadas y planificaciĂłn de tareas heterogĂ©neas. La funciĂłn principal de dicha arquitectura es acelerar las funciones crĂ­ticas de los runtimes para modelos de programaciĂłn basados en tareas. Con Picos, se pretende extender el beneficio de estos modelos de programaciĂłn para explotar el paralelismo y la heterogeneidad ejecutando tareas de granularidad fina. Como prueba de concepto, tres prototipos de Picos han sido diseñado en VHDL e implementado en una plataforma System-on-chip que consta de varios nĂșcleos ARM integrados junto con una FPGA, y ademas analizados con ejecuciones reales con OmpSs y con Linux. El primer prototipo es un administrador hardware de tareas con dependencias, que se ha implementado en un SoC Xilinx Zynq serie 7000. EstĂĄ conectado a un procesador ARM Cortex A9 de 2 nĂșcleos, e integrado con el SO. Con 24 nĂșcleos simulados y realizando el anĂĄlisis real de las dependencias entre tareas en Picos, obtiene hasta un 21x sobre las mismas ejecuciones usando el entorno software. El segundo prototipo, Picos++, ampliĂł Picos incorporando el soporte para la gestiĂłn de tareas anidadas en hardware. Hasta donde llega nuestro conocimiento, esta es la primera vez que dicha caracterĂ­stica ha sido propuesta y/o incorporada en un administrador hardware de dependencias entre tareas. El segundo prototipo estĂĄ completamente integrado en el sistema, no solo en hardware, sino tambiĂ©n con el modelo de programaciĂłn paralelo y con el sistema operativo. El tercer prototipo, incluye un administrador y planificador de tareas heterogĂ©neas. El planificador de tareas heterogĂ©neas recibe dichas tareas listas del administrador de dependencias entre tareas y las programa en la unidad de ejecuciĂłn de hardware adecuada que tenga el tiempo de finalizaciĂłn estimado mĂĄs corto. Este prototipo se ha implementado en un chip MPSoC Xilinx Zynq Ultrascale+. En dicho sistema con cuatro nĂșcleos ARM y hasta 15 aceleradores HW funcionales, logra una aceleraciĂłn de hasta 16.2x, y ahorra hasta el 90% de la energĂ­a con respecto al software.Postprint (published version

    Hardware runtime management for task-based programming models

    Get PDF
    Task-based programming models allow programmers to express applications as a collection of tasks with dependences. They are simple to use and greatly improve programmability by using software runtimes to exploit task parallelism and heterogeneity over multi-core, many-core and heterogeneous platforms. In these programming models, the runtimes guarantee correct execution order by managing tasks using task-dependence graphs (TDGs). These runtimes are powerful enough to provide high performance with coarse-grained tasks although they impose overheads on the application execution to maintain all the information they need to do their work. However, as the current trend in processor architectures keeps including more cores and heterogeneity (in fact complexity) in the systems, coarse-grained parallelism is not enough to feed all the underlying resources. Instead, fine-grained tasks are preferable as they are able to expose higher parallelism in applications but the overheads introduced by the software runtimes under these conditions prevent an efficient exploitation of fine-grained parallelism. The two most critical runtime overheads are task dependence graph management and task scheduling to heterogeneous systems. We propose a hardware architecture Picos, consisting of a hardware task dependence manager including nested task support, and a heterogeneous task scheduler, to accelerate the critical runtime functions for task-based programming models. With Picos, we aim at extending the benefit of these programming models into exploiting fine-grained task parallelism and heterogeneity. As a proof-of-concept, Three prototypes of Picos have been designed in VHDL and implemented in a System-on-chip platform consisting of regular ARM SMP cores and an integrated FPGA. They also have been analyzed with real benchmarks with OmpSs running and Linux on the platform. The first prototype is a hardware task dependence manager, which has been implemented in a Xilinx Zynq 7000 series SoCs. It is connected to a 2-core ARM Cortex A9 processor, with bare-metal OS integration. With 24 simulated workers, and running real task-dependence analysis in Picos, it scales up to 21x speedup. The second prototype Picos++ extended Picos with an exciting new feature for nested task support in hardware. To the best of our knowledge, this is the first time that such a feature has been support fully in hardware task dependence managers. This prototype is fully integrated in not only hardware, but also with a State-of-the-Art parallel programming model, and with Linux. The third prototype includes both a hardware task dependence manager and a heterogeneous task scheduler. The heterogeneous task scheduler receives ready tasks from the task-dependence manager and then schedule them to hardware execution units that have the estimated earliest finish time. It is implemented in a Xilinx Zynq Ultrascale+ MPSoC chip. In a system with 4 threads and up to 15 HW accelerators, it achieves up to 16.2x speedup for real benchmarks, and saves up to 90% of energy.Los modelos de programaciĂłn basados en tareas permiten a los programadores expresar las aplicaciones como una colecciĂłn de tareas con dependencias entre ellas. Dichos modelos son simples de usar y mejoran enormemente la programabilidad. Para ello se valen del uso de una runtime que en tiempo de ejecuciĂłn ayuda a explotar el paralelismo de las tareas cuando se ejecutan en plataformas multi-cores, many-cores y heterogĂ©neas. En estos modelos de programaciĂłn los runtimes garantizan la ejecuciĂłn de las tareas en el orden correcto mediante el uso de grĂĄficos de dependencias entre tareas (TDG). Actualmente, los runtimes son lo suficientemente potentes para proporcionar un alto rendimiento con tareas de granularidad gruesa a pesar de que para mantener toda la informaciĂłn que necesitan para hacer su trabajo, introducen un sobrecoste importante en la ejecuciĂłn de las aplicaciones. El problema viene dado por la tendencia actual en arquitectura de computadores a seguir incluyendo mĂĄs nĂșcleos y heterogeneidad (de hecho, complejidad) en los sistemas de procesado con lo que el paralelismo de granularidad gruesa no es suficiente para alimentar todos los recursos. En estos entornos complejos las tareas de granularidad fina son preferibles ya que son capaces de exponer un mayor paralelismo de las aplicaciones. Sin embargo, con tareas de granularidad fina, los sobrecostes introducidos por los runtimes software son mayores debido a la necesidad de manejar muchas mĂĄs tareas mĂĄs rĂĄpido. En general los mayores sobrecostes introducidos por los runtimes son: la administraciĂłn de los grafos de dependencias que relacionan las tareas y la gestiĂłn de las tareas en sistemas heterogĂ©neos. Proponemos una arquitectura hardware, llamada Picos, que consiste en un administrador de dependencias entre tareas incluyendo soporte para tareas anidadas y planificaciĂłn de tareas heterogĂ©neas. La funciĂłn principal de dicha arquitectura es acelerar las funciones crĂ­ticas de los runtimes para modelos de programaciĂłn basados en tareas. Con Picos, se pretende extender el beneficio de estos modelos de programaciĂłn para explotar el paralelismo y la heterogeneidad ejecutando tareas de granularidad fina. Como prueba de concepto, tres prototipos de Picos han sido diseñado en VHDL e implementado en una plataforma System-on-chip que consta de varios nĂșcleos ARM integrados junto con una FPGA, y ademas analizados con ejecuciones reales con OmpSs y con Linux. El primer prototipo es un administrador hardware de tareas con dependencias, que se ha implementado en un SoC Xilinx Zynq serie 7000. EstĂĄ conectado a un procesador ARM Cortex A9 de 2 nĂșcleos, e integrado con el SO. Con 24 nĂșcleos simulados y realizando el anĂĄlisis real de las dependencias entre tareas en Picos, obtiene hasta un 21x sobre las mismas ejecuciones usando el entorno software. El segundo prototipo, Picos++, ampliĂł Picos incorporando el soporte para la gestiĂłn de tareas anidadas en hardware. Hasta donde llega nuestro conocimiento, esta es la primera vez que dicha caracterĂ­stica ha sido propuesta y/o incorporada en un administrador hardware de dependencias entre tareas. El segundo prototipo estĂĄ completamente integrado en el sistema, no solo en hardware, sino tambiĂ©n con el modelo de programaciĂłn paralelo y con el sistema operativo. El tercer prototipo, incluye un administrador y planificador de tareas heterogĂ©neas. El planificador de tareas heterogĂ©neas recibe dichas tareas listas del administrador de dependencias entre tareas y las programa en la unidad de ejecuciĂłn de hardware adecuada que tenga el tiempo de finalizaciĂłn estimado mĂĄs corto. Este prototipo se ha implementado en un chip MPSoC Xilinx Zynq Ultrascale+. En dicho sistema con cuatro nĂșcleos ARM y hasta 15 aceleradores HW funcionales, logra una aceleraciĂłn de hasta 16.2x, y ahorra hasta el 90% de la energĂ­a con respecto al software

    Efficient and Reliable Task Scheduling, Network Reprogramming, and Data Storage for Wireless Sensor Networks

    Get PDF
    Wireless sensor networks (WSNs) typically consist of a large number of resource-constrained nodes. The limited computational resources afforded by these nodes present unique development challenges. In this dissertation, we consider three such challenges. The first challenge focuses on minimizing energy usage in WSNs through intelligent duty cycling. Limited energy resources dictate the design of many embedded applications, causing such systems to be composed of small, modular tasks, scheduled periodically. In this model, each embedded device wakes, executes a task-set, and returns to sleep. These systems spend most of their time in a state of deep sleep to minimize power consumption. We refer to these systems as almost-always-sleeping (AAS) systems. We describe a series of task schedulers for AAS systems designed to maximize sleep time. We consider four scheduler designs, model their performance, and present detailed performance analysis results under varying load conditions. The second challenge focuses on a fast and reliable network reprogramming solution for WSNs based on incremental code updates. We first present VSPIN, a framework for developing incremental code update mechanisms to support efficient reprogramming of WSNs. VSPIN provides a modular testing platform on the host system to plug-in and evaluate various incremental code update algorithms. The framework supports Avrdude, among the most popular Linux-based programming tools for AVR microcontrollers. Using VSPIN, we next present an incremental code update strategy to efficiently reprogram wireless sensor nodes. We adapt a linear space and quadratic time algorithm (Hirschberg\u27s Algorithm) for computing maximal common subsequences to build an edit map specifying an edit sequence required to transform the code running in a sensor network to a new code image. We then present a heuristic-based optimization strategy for efficient edit script encoding to reduce the edit map size. Finally, we present experimental results exploring the reduction in data size that it enables. The approach achieves reductions of 99.987% for simple changes, and between 86.95% and 94.58% for more complex changes, compared to full image transmissions - leading to significantly lower energy costs for wireless sensor network reprogramming. The third challenge focuses on enabling fast and reliable data storage in wireless sensor systems. A file storage system that is fast, lightweight, and reliable across device failures is important to safeguard the data that these devices record. A fast and efficient file system enables sensed data to be sampled and stored quickly and batched for later transmission. A reliable file system allows seamless operation without disruptions due to hardware, software, or other unforeseen failures. While flash technology provides persistent storage by itself, it has limitations that prevent it from being used in mission-critical deployment scenarios. Hybrid memory models which utilize newer non-volatile memory technologies, such as ferroelectric RAM (FRAM), can mitigate the physical disadvantages of flash. In this vein, we present the design and implementation of LoggerFS, a fast, lightweight, and reliable file system for wireless sensor networks, which uses a hybrid memory design consisting of RAM, FRAM, and flash. LoggerFS is engineered to provide fast data storage, have a small memory footprint, and provide data reliability across system failures. LoggerFS adapts a log-structured file system approach, augmented with data persistence and reliability guarantees. A caching mechanism allows for flash wear-leveling and fast data buffering. We present a performance evaluation of LoggerFS using a prototypical in-situ sensing platform and demonstrate between 50% and 800% improvements for various workloads using the FRAM write-back cache over the implementation without the cache

    The NASA SBIR product catalog

    Get PDF
    The purpose of this catalog is to assist small business firms in making the community aware of products emerging from their efforts in the Small Business Innovation Research (SBIR) program. It contains descriptions of some products that have advanced into Phase 3 and others that are identified as prospective products. Both lists of products in this catalog are based on information supplied by NASA SBIR contractors in responding to an invitation to be represented in this document. Generally, all products suggested by the small firms were included in order to meet the goals of information exchange for SBIR results. Of the 444 SBIR contractors NASA queried, 137 provided information on 219 products. The catalog presents the product information in the technology areas listed in the table of contents. Within each area, the products are listed in alphabetical order by product name and are given identifying numbers. Also included is an alphabetical listing of the companies that have products described. This listing cross-references the product list and provides information on the business activity of each firm. In addition, there are three indexes: one a list of firms by states, one that lists the products according to NASA Centers that managed the SBIR projects, and one that lists the products by the relevant Technical Topics utilized in NASA's annual program solicitation under which each SBIR project was selected

    Advances in parallel programming for electronic design automation

    Get PDF
    The continued miniaturization of the technology node increases not only the chip capacity but also the circuit design complexity. How does one efficiently design a chip with millions or billions transistors? This has become a challenging problem in the integrated circuit (IC) design industry, especially for the developers of electronic design automation (EDA) tools. To boost the performance of EDA tools, one promising direction is via parallel computing. In this dissertation, we explore different parallel computing approaches, from CPU to GPU to distributed computing, for EDA applications. Nowadays multi-core processors are prevalent from mobile devices to laptops to desktop, and it is natural for software developers to utilize the available cores to maximize the performance of their applications. Therefore, in this dissertation we first focus on multi-threaded programming. We begin by reviewing a C++ parallel programming library called Cpp-Taskflow. Cpp-Taskflow is designed to facilitate programming parallel applications, and has been successfully applied to an EDA timing analysis tool. We will demonstrate Cpp-Taskflow’s programming model and interface, software architecture and execution flow. Then, we improve Cpp-Taskflow in several aspects. First, we enhance Cpp-Taskflow’s usability through restructuring the software architecture. Second, we introduce task graph composition to support composability and modularity, which makes it easier for users to construct large and complex parallel patterns. Third, we add a new task type in Cpp-Taskflow to let users control the graph execution flow. This feature empowers the graph model with the ability to describe complex control flow. Aside from the above enhancements, we have designed a new scheduler to adaptively manage the threads based on available parallelism. The new scheduler uses a simple and effective strategy which can not only prevent resource from being underutilized, but also mitigate resource over-subscription. We have evaluated the new scheduler on both micro-benchmarks and a very-large-scale integration (VLSI) application, and the results show that the new scheduler can achieve good performance and is very energy-efficient. Next we study the applicability of heterogeneous computing, specifically the graphics processing unit (GPU), to EDA. We demonstrate how to use GPU to accelerate VLSI placement, and we show that GPU can bring substantial performance gain to VLSI placement. Finally, as the design size keeps increasing, a more scalable solution will be distributed computing. We introduce a distributed power grid analysis framework built on top of DtCraft. This framework allows users to flexibly partition the design and automatically deploy the computations across several machines. In addition, we propose a job scheduler that can efficiently utilize cluster resource to improve the framework’s performance

    Low Power system Design techniques for mobile computers

    Get PDF
    Portable products are being used increasingly. Because these systems are battery powered, reducing power consumption is vital. In this report we give the properties of low power design and techniques to exploit them on the architecture of the system. We focus on: min imizing capacitance, avoiding unnecessary and wasteful activity, and reducing voltage and frequency. We review energy reduction techniques in the architecture and design of a hand-held computer and the wireless communication system, including error control, sys tem decomposition, communication and MAC protocols, and low power short range net works
    • 

    corecore