18 research outputs found

    An Approach Based on Particle Swarm Optimization for Inspection of Spacecraft Hulls by a Swarm of Miniaturized Robots

    The remoteness and hazards inherent to the operating environments of space infrastructure motivate the need for automated robotic inspection. In particular, micrometeoroid and orbital debris impacts and structural fatigue are common sources of damage to spacecraft hulls. Vibration sensing has been used to detect structural damage in spacecraft hulls, as well as in structural health monitoring practice in industry, by deploying static sensors. In this paper, we propose using a swarm of miniaturized vibration-sensing mobile robots that realizes a network of mobile sensors. We present a distributed inspection algorithm, based on bio-inspired particle swarm optimization and evolutionary-algorithm niching techniques, to enumerate and localize an a priori unknown number of vibration sources on a simplified 2.5D spacecraft surface. The algorithm is deployed on a swarm of simulated cm-scale wheeled robots, which are guided in their inspection task by vibrations arising from failure points on the surface, detected by on-board accelerometers. We study three performance metrics: (1) proximity of the localized sources to the ground-truth locations, (2) time to localize each source, and (3) time to finish the inspection task given a 75% inspection coverage threshold. We find that our swarm is able to successfully localize the present sources.
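The abstract names PSO with niching only at a high level, so the following is a minimal, hypothetical sketch of that class of algorithm: a particle swarm whose neighbourhood best is restricted to a niche radius, so that sub-swarms can settle on different sources instead of collapsing onto one. The sensing model, parameters, and source positions below are assumptions for illustration, not the paper's algorithm.

```python
# Hypothetical PSO-with-niching sketch for locating several signal maxima on a 2D surface.
import random

SOURCES = [(2.0, 3.0), (7.0, 8.0), (8.5, 1.5)]   # invented ground-truth vibration sources

def signal(x, y):
    """Assumed sensing model: summed inverse-square amplitude from each source."""
    return sum(1.0 / (1e-3 + (x - sx) ** 2 + (y - sy) ** 2) for sx, sy in SOURCES)

def pso_niching(n_particles=30, iters=200, niche_radius=1.5, w=0.7, c1=1.5, c2=1.5):
    pos = [[random.uniform(0, 10), random.uniform(0, 10)] for _ in range(n_particles)]
    vel = [[0.0, 0.0] for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_val = [signal(*p) for p in pos]
    for _ in range(iters):
        for i in range(n_particles):
            # Neighbourhood best restricted to the particle's niche (within niche_radius),
            # so sub-swarms can converge on different sources instead of a single one.
            neigh = [j for j in range(n_particles)
                     if (pos[i][0] - pos[j][0]) ** 2 + (pos[i][1] - pos[j][1]) ** 2 <= niche_radius ** 2]
            lbest = max(neigh, key=lambda j: pbest_val[j])
            for d in range(2):
                vel[i][d] = (w * vel[i][d]
                             + c1 * random.random() * (pbest[i][d] - pos[i][d])
                             + c2 * random.random() * (pbest[lbest][d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            val = signal(*pos[i])
            if val > pbest_val[i]:
                pbest_val[i], pbest[i] = val, pos[i][:]
    return pbest  # clusters of personal bests approximate the source locations

if __name__ == "__main__":
    print(pso_niching()[:5])
```

In the paper's setting each "particle" would be a physical robot sensing vibration amplitude on the hull in place of the synthetic `signal` field used here.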

    Embedded System Design

    A unique feature of this open access textbook is to provide a comprehensive introduction to the fundamental knowledge in embedded systems, with applications in cyber-physical systems and the Internet of Things. It starts with an introduction to the field and a survey of specification models and languages for embedded and cyber-physical systems. It provides a brief overview of hardware devices used for such systems and presents the essentials of system software for embedded systems, including real-time operating systems. The author also discusses evaluation and validation techniques for embedded systems and provides an overview of techniques for mapping applications to execution platforms, including multi-core platforms. Embedded systems have to operate under tight constraints and, hence, the book also contains a selected set of optimization techniques, including software optimization techniques. The book closes with a brief survey on testing. This fourth edition has been updated and revised to reflect new trends and technologies, such as the importance of cyber-physical systems (CPS) and the Internet of Things (IoT), the evolution of single-core processors to multi-core processors, and the increased importance of energy efficiency and thermal issues.

    Worst-Case Execution Time Guarantees for Runtime-Reconfigurable Architectures

    Real-time systems are ubiquitous in our everyday life, e.g., in safety-critical domains such as automotive, avionics, or robotics. The correctness of a real-time system depends not only on the correctness of its calculations, but also on the non-functional requirement of adhering to deadlines. Failing to meet a deadline may lead to severe malfunctions; therefore, worst-case execution times (WCET) need to be guaranteed. Despite significant scientific advances, however, timing analysis for WCET guarantees lags years behind current high-performance microarchitectures with out-of-order scheduling pipelines, several hardware threads, and multiple (shared) cache layers. To satisfy the increasing performance demands of real-time systems, analyzable performance features are required. In order to escape this scarcity of timing-analyzable performance features, the main contribution of this thesis is the introduction of runtime reconfiguration of hardware accelerators onto a field-programmable gate array (FPGA) as a novel means to achieve performance that is amenable to WCET guarantees. Instead of designing an architecture for a specific application domain, this approach preserves the flexibility of the system. First, this thesis contributes novel co-scheduling approaches to distribute work among CPU and GPU in an extensive analysis of how (average-case) performance is achieved on fused CPU-GPU architectures, a main trend in current high-performance microarchitectures that combines a CPU and a GPU on a single chip. Being able to employ such architectures in real-time systems would be highly desirable, because they provide high performance within a limited area and power budget. As a result of this analysis, however, a cache-coherency bottleneck is uncovered in recent fused CPU-GPU architectures that share the last-level cache between CPU and GPU. This insight (i) complicates performance predictions and (ii) adds a shared last-level cache between CPU and GPU to the growing list of microarchitectural features that benefit average-case performance but render the analysis of WCET guarantees on high-performance architectures virtually infeasible, further motivating the need for novel microarchitectural features that provide predictable performance and are amenable to timing analysis. Towards this end, a runtime reconfiguration controller called the "Command-based Reconfiguration Queue" (CoRQ) is presented that provides guaranteed latencies for its operations, especially for the reconfiguration delay, i.e., the time it takes to reconfigure a hardware accelerator onto a reconfigurable fabric (e.g., an FPGA). CoRQ enables the design of timing-analyzable runtime-reconfigurable architectures that support WCET guarantees. Based on the now-feasible guaranteed reconfiguration delay of accelerators, a WCET analysis is introduced that enables tasks to reconfigure application-specific custom instructions (CIs) at runtime. CIs are executed by a processor pipeline and invoke the execution of one or more accelerators. Different measures to deal with reconfiguration delays are compared with respect to their impact on accelerated WCET guarantees and overestimation. The timing anomaly of runtime reconfiguration is identified and safely bounded: a case where executing iterations of a computational kernel faster than their WCET during reconfiguration of CIs can prolong the total execution time of a task.
Once tasks that perform runtime reconfiguration of CIs can be analyzed for WCET guarantees, the question arises of which CIs to configure onto a constrained reconfigurable area to optimize the WCET. The question is addressed for systems where multiple CIs, each with several implementations (allowing latency to be traded off against area requirements), can be selected; this is generally the case, e.g., when employing high-level synthesis. This so-called WCET-optimizing instruction set selection problem is modeled based on the Implicit Path Enumeration Technique (IPET), the path-analysis technique that state-of-the-art timing analyzers rely on. To our knowledge, this is the first approach to WCET optimization that can make use of global program-flow information (and of information about reconfiguration delay). An optimal algorithm (similar to branch and bound) and a fast greedy heuristic (which achieves the optimal solution in most cases) are presented. Finally, an approach is presented that, for the first time, combines optimized static WCET guarantees with runtime optimization of the average-case execution (while maintaining WCET guarantees) using runtime reconfiguration of hardware accelerators, by leveraging runtime slack (the amount of time by which program parts execute faster than their WCET). It comprises an analysis of runtime slack bounds that enable safe reconfiguration for average-case performance under WCET guarantees, and it presents a mechanism to monitor runtime slack using a simple performance counter that is commonly available in many microprocessors. Ultimately, this thesis shows that runtime reconfiguration of accelerators is a key feature to achieve predictable performance.
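At its core, the WCET-optimizing instruction set selection described above is a choice of one implementation per custom instruction under an area budget. The thesis formulates this on top of IPET; the toy sketch below instead brute-forces a tiny instance with an invented per-CI cost model (latency times worst-case execution count), purely to illustrate the latency/area trade-off. All CI names, latencies, areas, counts, and the budget are assumptions.

```python
# Toy, hypothetical illustration of WCET-oriented custom-instruction (CI) selection:
# pick one implementation per CI under an FPGA area budget so that a simple
# WCET estimate (per-CI latency x worst-case execution count) is minimized.
from itertools import product

# (latency_cycles, area_units) variants per CI; index 0 is a pure-software fallback (area 0).
CI_VARIANTS = {
    "ci_fir": [(400, 0), (120, 30), (60, 55)],
    "ci_fft": [(900, 0), (300, 40), (150, 80)],
    "ci_crc": [(200, 0), (40, 10)],
}
WORST_CASE_COUNTS = {"ci_fir": 1000, "ci_fft": 200, "ci_crc": 5000}  # assumed flow bounds
AREA_BUDGET = 100

def select_cis():
    best = None
    names = list(CI_VARIANTS)
    for choice in product(*(range(len(CI_VARIANTS[n])) for n in names)):
        area = sum(CI_VARIANTS[n][c][1] for n, c in zip(names, choice))
        if area > AREA_BUDGET:
            continue  # infeasible: exceeds the reconfigurable area
        wcet = sum(CI_VARIANTS[n][c][0] * WORST_CASE_COUNTS[n] for n, c in zip(names, choice))
        if best is None or wcet < best[0]:
            best = (wcet, dict(zip(names, choice)), area)
    return best

print(select_cis())  # (estimated WCET cycles, chosen variant index per CI, area used)
```

The actual formulation additionally carries IPET's flow constraints and the reconfiguration delay, which is what makes global program-flow information usable during the selection.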

    Automatically Parallelizing Embedded Legacy Software on Soft-Core SoCs

    Nowadays, embedded systems are used in many areas and have become omnipresent, making people's lives more comfortable. They have to handle more and more functionality in many products. To maintain the often-required low energy consumption, multi-core systems provide high performance at a moderate energy budget. The development started with dual-core processors and has today reached many-core designs with dozens to hundreds of processor cores. However, existing applications can barely leverage the potential of that many cores. Legacy applications are usually written sequentially and thus typically use only one processor core, so these applications do not benefit from the advantages provided by modern many-core systems. Rewriting those applications to use multiple cores requires new skills from developers and is also time-consuming and highly error-prone. Dozens of languages, APIs, and compilers have been presented in past decades to aid the user in parallelizing applications. Fully automatic parallelizing compilers are seen as the holy grail, since the user effort is kept minimal; however, automatic parallelizers often cannot extract parallelism as well as user-aided approaches. Most of these parallelization tools are designed for desktop and high-performance systems and are thus neither tuned for nor applicable to low-performance embedded systems. To improve this situation, this work presents an automatic parallelizer for embedded systems which, in most cases, delivers better quality than user-aided approaches and otherwise allows easy manual fine-tuning. Parallelization tools extract concurrently executable tasks from an application; these tasks can then be executed on different processor cores. Parallelization tools, and automatic parallelizers in particular, often struggle to efficiently map the extracted parallelism to an existing multi-core processor. This work uses soft-core processors on FPGAs, which make it possible to realize custom multi-core designs in hardware within a few minutes. This allows the multi-core processor to be adapted to the characteristics of the extracted parallelism; in particular, core interconnects for communication can be optimized to fit the communication pattern of the parallel application. Embedded applications are often structured as follows: receive input data, perform (multiple) data-processing steps, output data. The processing steps are often realized as consecutive, loosely coupled transformations and thus naturally model the structure of a processing pipeline. The goal of this work is to extract this kind of pipeline parallelism from an application and map it to multiple cores to increase the overall throughput of the system; multiple cores forming a chain with direct communication channels ideally fit this pattern. This so-called pipeline parallelism is a barely addressed concept in most parallelization tools, and current multi-core designs often do not support the hardware flexibility provided by the soft cores targeted in this approach. The main contribution of this work is an automatic parallelizer which is able to map different processing steps from the source code of a sequential application to different cores in a multi-core pipeline. Users only specify the required processing speed after parallelization. The developed tool then tries to extract a matching parallelized software design, along with a custom multi-core design, out of sequential embedded legacy applications.
The automatically created multi-core system already contains the peripherals used by the application, extracted from the source code, and is ready to use. The presented parallelizer implements multi-objective optimization to generate a minimal hardware design that just fulfills the user-defined requirement. To the best of my knowledge, the possibility of generating such a multi-core pipeline, defined by the demands of the parallelized software, has never been presented before. The approach is implemented for two soft-core processors, and the evaluation shows, for both targets, high speedups of 12x and more at a reasonable hardware overhead. Compared to other automatic parallelizers, which mainly focus on speedups through latency reduction, significantly higher speedups can be achieved, depending on the given application structure.
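To make the pipeline pattern concrete, here is a minimal, hypothetical sketch in plain Python: consecutive processing steps run as separate workers connected by queues, so throughput approaches the rate of the slowest stage. The stage functions and their mapping to threads are invented for illustration; the tool itself maps such stages onto soft cores connected by dedicated core interconnects.

```python
# Hypothetical pipeline-parallelism sketch: input -> transform -> output, one worker per stage.
import threading, queue

def stage(fn, q_in, q_out):
    while True:
        item = q_in.get()
        if item is None:            # poison pill: forward it and stop this stage
            if q_out is not None:
                q_out.put(None)
            break
        result = fn(item)
        if q_out is not None:
            q_out.put(result)

def read_input(x):   return x                 # e.g. receive a sensor sample
def transform(x):    return x * x             # e.g. filtering / feature extraction
def write_output(x): print("out:", x)         # e.g. drive an actuator / send the result

q1, q2, q3 = queue.Queue(), queue.Queue(), queue.Queue()
workers = [
    threading.Thread(target=stage, args=(read_input, q1, q2)),
    threading.Thread(target=stage, args=(transform, q2, q3)),
    threading.Thread(target=stage, args=(write_output, q3, None)),
]
for w in workers: w.start()
for sample in range(5): q1.put(sample)
q1.put(None)                                   # shut the pipeline down
for w in workers: w.join()
```

With balanced stages, steady-state throughput is roughly the inverse of the slowest stage's per-item time, which is why the tool sizes the generated pipeline against the user-specified required processing speed.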

    Adaptive Integrated Circuit Design for Variation Resilience and Security

    The past few decades have witnessed the burgeoning development of integrated circuits driven by process technology scaling. Along with the tremendous benefits of scaling, challenges arise at various stages. At design time, the complexity of developing a circuit with millions to billions of ever-smaller transistors grows further once variations are taken into account. The difficulty of analyzing these nondeterministic properties means that schemes for allocating redundant resources rarely work in a cost-efficient way. Besides fabrication variations, analog circuits suffer severe performance degradation owing to physical attributes that are vulnerable to aging effects; as such, post-silicon calibration has gained increasing attention as a way to compensate for performance mismatch. For user-end applications, additional system failures result from pirated and counterfeit devices introduced by an untrusted semiconductor supply chain, and analog circuits are again exposed to this threat because of the shortage of piracy-avoidance techniques. In this dissertation, we propose three adaptive integrated circuit designs to overcome these challenges. The first investigates variability-aware gate implementation while controlling the overhead of adaptivity assignment; this design improves variation resilience, primarily for digital circuits, while optimizing power consumption and timing yield. The second design is a self-validation system for the calibration of diverse analog circuits; the system is completely integrated on chip, so no external assistance is required. In the last design, a classic analog component is studied further to establish a configurable locking mechanism for analog circuits, and the use of Satisfiability Modulo Theories (SMT) addresses the difficulty of searching for the unique unlocking pattern over non-Boolean variables.
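The SMT-based search is only named in the abstract, so the following is a hypothetical sketch of the general idea: an SMT solver over real-valued (non-Boolean) variables searches for a bias/trim combination that satisfies the behavioural constraints an unlocked analog block must meet. The constraints and numbers are invented, and the sketch assumes the z3-solver Python package; it is not the dissertation's locking scheme.

```python
# Hypothetical sketch: SMT over non-Boolean variables to find an analog unlocking pattern.
# Requires the z3-solver package (pip install z3-solver).
from z3 import Real, Solver, And, sat

v_bias = Real("v_bias")          # continuous bias voltage applied by the key
r_trim = Real("r_trim")          # trimmable resistance setting

s = Solver()
# Assumed behavioural constraints: the block only meets spec (e.g. nominal gain and
# in-range bias current) for a narrow combination of bias and trim values.
s.add(And(v_bias >= 0, v_bias <= 1.8))
s.add(And(r_trim >= 1, r_trim <= 100))
s.add(2 * v_bias + r_trim / 50 == 2.5)      # invented "gain equals nominal" condition
s.add(v_bias * r_trim >= 20)                # invented "bias current in range" condition

if s.check() == sat:
    m = s.model()
    print("unlocking pattern:", m[v_bias], m[r_trim])
else:
    print("no unlocking pattern within the modelled ranges")
```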

    XXIII Congreso Argentino de Ciencias de la Computación - CACIC 2017: Libro de actas

    Papers presented at the XXIII Congreso Argentino de Ciencias de la Computación (CACIC), held in La Plata from 9 to 13 October 2017, organized by the Red de Universidades con Carreras en Informática (RedUNCI) and the Facultad de Informática of the Universidad Nacional de La Plata (UNLP).

    A framework for the efficient execution of applications on GPU and CPU+GPU

    Technological limitations faced by semiconductor manufacturers in the early 2000s curbed the performance growth of sequential computation units. Nowadays, the trend is to increase the number of processor cores per socket and to progressively use GPUs for highly parallel computations. The complexity of recent architectures makes it difficult to statically predict the performance of a program. We describe a reliable and accurate method for predicting the execution time of parallel loop nests on GPUs, based on three stages: static code generation, offline profiling, and online prediction. In addition, we present two techniques to fully exploit the computing resources at a system's disposal. The first technique consists in jointly using the CPU and GPU to execute a code. In order to achieve higher performance, it is mandatory to consider load balance, in particular by predicting execution time: the runtime uses the profiling results, and the scheduler computes execution times and adjusts the load distributed to the processors. The second technique puts the CPU and GPU in competition: instances of the considered code are executed simultaneously on the CPU and GPU, and the winner of the competition notifies the other instance of its completion, causing the latter to terminate.
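The second technique, racing the CPU and GPU against each other, can be illustrated with a small, hypothetical sketch: two workers run the same job, and whichever finishes first signals the other to stop. Here both sides are stand-in Python functions with invented timings; in the framework itself the competitors are native CPU code and a GPU kernel.

```python
# Hypothetical CPU-vs-GPU competition sketch: the first finisher cancels the other.
import threading, time

stop = threading.Event()
lock = threading.Lock()
winner = []

def run(name, step_time, steps):
    for _ in range(steps):
        if stop.is_set():              # the competitor already finished
            return
        time.sleep(step_time)          # one chunk of the computation (stand-in work)
    with lock:                         # first finisher wins and notifies the other
        if not winner:
            winner.append(name)
            stop.set()

cpu = threading.Thread(target=run, args=("cpu", 0.02, 50))
gpu = threading.Thread(target=run, args=("gpu", 0.01, 50))
cpu.start(); gpu.start()
cpu.join(); gpu.join()
print("winner:", winner[0])
```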

    The fast multipole method at exascale

    This thesis presents a top-to-bottom analysis of designing and implementing fast algorithms for current and future systems. We present new analysis, algorithmic techniques, and implementations of the Fast Multipole Method (FMM) for solving N-body problems. We target the FMM because it is broadly applicable to a variety of scientific particle simulations used to study electromagnetic, fluid, and gravitational phenomena, among others. Importantly, the FMM has asymptotically optimal time complexity with guaranteed approximation accuracy. As such, it is among the most attractive solutions for scalable particle simulation on future extreme-scale systems. We specifically address two key challenges. The first challenge is how to engineer fast code for today's platforms. We present the first in-depth study of multicore optimizations and tuning for FMM, along with a systematic approach for transforming a conventionally parallelized FMM into a highly tuned one. We introduce novel optimizations that significantly improve the within-node scalability of the FMM, thereby enabling high performance on multicore and manycore systems. The second challenge is how to understand scalability on future systems. We present a new algorithmic complexity analysis of the FMM that considers both intra- and inter-node communication costs. Using these models, we present results for choosing the optimal algorithmic tuning parameter. This analysis also yields the surprising prediction that, although the FMM is largely compute-bound today and therefore highly scalable on current systems, the trajectory of processor architecture designs, if there are no significant changes, could cause it to become communication-bound as early as the year 2015. This prediction suggests the utility of our analysis approach, which directly relates algorithmic and architectural characteristics, for enabling a new kind of high-level algorithm-architecture co-design. To demonstrate the scientific significance of FMM, we present two applications: the direct numerical simulation of blood, a multi-scale, multi-physics problem, and large-scale biomolecular electrostatics. MoBo (Moving Boundaries) is the infrastructure for the direct numerical simulation of blood; it comprises two key algorithmic components, of which FMM is one. We were able to simulate blood flow using Stokesian dynamics on 200,000 cores of Jaguar, a petaflop system, and achieve a sustained performance of 0.7 Petaflop/s. The second application, proposed as future work in this thesis, is biomolecular electrostatics, where we solve for the electrical potential using the boundary-integral formulation discretized with boundary element methods (BEM). The computational kernel in solving the large linear system is a dense matrix-vector multiply, which we propose to compute using our scalable FMM. We propose to begin with the two-dielectric problem, where the electrostatic field is calculated using two continuum dielectric media, the solvent and the molecule. This is only a first step towards solving biologically challenging problems that have more than two dielectric media, ion-exclusion layers, and solvent-filled cavities. Finally, given the difficulty of producing high-performance scalable code, productivity is a key concern. Recently, numerical algorithms have been redesigned to take advantage of the architectural features of emerging multicore processors.
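To make the FMM's central idea concrete, the following is a self-contained, hypothetical illustration in 2D: the potential due to a well-separated cluster of charges is approximated by a truncated multipole expansion about the cluster centre instead of summing every pairwise interaction, which is what gives the method its near-linear complexity. It uses the 2D Laplace (log) kernel; the charges, positions, and expansion order are invented, and this is not the thesis's optimized implementation.

```python
# Hypothetical 2D multipole-approximation sketch (the kernel of the FMM idea).
import cmath, random

def direct_potential(z, charges):
    """Exact potential at z from (position, charge) pairs, 2D Laplace (log) kernel."""
    return sum(q * cmath.log(z - zj).real for zj, q in charges)

def multipole_coeffs(charges, z_c, p):
    """Multipole expansion coefficients about centre z_c, truncated at order p."""
    Q = sum(q for _, q in charges)
    a = [-sum(q * (zj - z_c) ** k for zj, q in charges) / k for k in range(1, p + 1)]
    return Q, a

def multipole_potential(z, z_c, Q, a):
    """Evaluate the truncated expansion at a point z far from the cluster."""
    phi = Q * cmath.log(z - z_c)
    phi += sum(ak / (z - z_c) ** k for k, ak in enumerate(a, start=1))
    return phi.real

random.seed(0)
cluster = [(complex(random.uniform(-0.5, 0.5), random.uniform(-0.5, 0.5)),
            random.uniform(-1, 1)) for _ in range(100)]
target = complex(10.0, 4.0)                       # well separated from the cluster
Q, a = multipole_coeffs(cluster, 0j, p=8)
print("direct   :", direct_potential(target, cluster))
print("multipole:", multipole_potential(target, 0j, Q, a))   # agrees to many digits
```

The full FMM organizes such expansions hierarchically (multipole-to-local translations over a tree), so the O(N^2) pairwise sum becomes roughly O(N) with controllable error.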
Such redesigned algorithms express fine-grained asynchronous parallelism and hence reduce the cost of synchronization. We performed the first extensive performance study of a recently proposed parallel programming model called Concurrent Collections (CnC). In CnC, the programmer expresses her computation in terms of application-specific operations, partially ordered by semantic scheduling constraints. The CnC model is well suited to expressing asynchronous-parallel algorithms, so we evaluate CnC using two dense linear algebra algorithms in this style for execution on state-of-the-art multicore systems. Our implementations in CnC were able to match, and in some cases even exceed, competing vendor-tuned and domain-specific library codes. We combine these two distinct research efforts by expressing FMM in CnC; this approach tries to marry performance with the productivity that will be critical on future systems. Looking forward, we would like to extend this to distributed-memory machines, specifically to implement FMM in the new distributed CnC (distCnC) to express fine-grained parallelism that would require significant effort in alternative models.
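CnC itself is not shown in the abstract, so the sketch below only illustrates, in plain Python, the dependence-driven style it encourages: steps are partially ordered by the items they consume, and a scheduler runs whatever is ready. This is not the CnC API; the step names and the tiny tiled-factorization-flavoured workload are invented.

```python
# Hypothetical sketch of dependence-driven (partially ordered) step execution.
from concurrent.futures import ThreadPoolExecutor

def run_dag(steps, deps):
    """Execute steps wave by wave: a step runs once all of its inputs are available."""
    results, remaining = {}, dict(deps)
    with ThreadPoolExecutor() as pool:
        while remaining:
            ready = [s for s, d in remaining.items() if all(x in results for x in d)]
            if not ready:
                raise ValueError("dependence cycle")
            futures = {s: pool.submit(steps[s], *(results[x] for x in remaining[s])) for s in ready}
            for s, f in futures.items():
                results[s] = f.result()
                del remaining[s]
    return results

# Toy tiled-factorization flavour: factor a diagonal tile, then update trailing tiles.
steps = {
    "factor0":  lambda: 2.0,
    "update01": lambda f: 10.0 / f,
    "update02": lambda f: 14.0 / f,
    "factor1":  lambda u: u - 1.0,
}
deps = {"factor0": [], "update01": ["factor0"], "update02": ["factor0"], "factor1": ["update01"]}
print(run_dag(steps, deps))
```

Ordering here comes entirely from the declared data dependences rather than explicit synchronization, which is the property that makes such models attractive for asynchronous-parallel dense linear algebra and, prospectively, for FMM.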