46 research outputs found

    Recent Advances in Embedded Computing, Intelligence and Applications

    Get PDF
    The latest proliferation of Internet of Things deployments and edge computing combined with artificial intelligence has led to new exciting application scenarios, where embedded digital devices are essential enablers. Moreover, new powerful and efficient devices are appearing to cope with workloads formerly reserved for the cloud, such as deep learning. These devices allow processing close to where data are generated, avoiding bottlenecks due to communication limitations. The efficient integration of hardware, software and artificial intelligence capabilities deployed in real sensing contexts empowers the edge intelligence paradigm, which will ultimately contribute to the fostering of the offloading processing functionalities to the edge. In this Special Issue, researchers have contributed nine peer-reviewed papers covering a wide range of topics in the area of edge intelligence. Among them are hardware-accelerated implementations of deep neural networks, IoT platforms for extreme edge computing, neuro-evolvable and neuromorphic machine learning, and embedded recommender systems

    Real-time analysis of MPI programs for NoC-based many-cores using time division multiplexing

    Get PDF
    Worst-case execution time (WCET) analysis is crucial for designing hard real-time systems. While the WCET of tasks in a single core system can be upper bounded in isolation, the tasks in a many-core system are subject to shared memory interferences which impose high overestimation of the WCET bounds. However, many-core-based massively parallel applications will enter the area of real-time systems in the years ahead. Explicit message-passing and a clear separation of computation and communication facilitates WCET analysis for those programs. A standard programming model for message-based communication is the message passing interface (MPI). It provides an application independent interface for different standard communication operations (e.g. broadcast, gather, ...). Thereby, it uses efficient communication patterns with deterministic behaviour. In applying these known structures, we target to provide a WCET analysis for communication that is reusable for different applications if the communication is executed on the same underlying platform. Hence, the analysis must be performed once per hardware platform and can be reused afterwards with only adapting several parameters such as the number of nodes participating in that communication. Typically, the processing elements of many-core platforms are connected via a Network-on-Chip (NoC) and apply techniques such as time-division multiplexing (TDM) to provide guaranteed services for the network. Hence, the hardware and the applied technique for guaranteed service needs to facilitate this reusability of the analysis as well. In this work we review different general-purpose TDM schedules that enable a WCET approximation independent of the placement of tasks on processing elements of a many-core which uses a NoC with torus topology. Furthermore, we provide two new schedules that show a similar performance as the state-of-the-art schedules but additionally serve situations where the presented state-of-the-art schedules perform poorly. Based on these schedules a procedure for the WCET analysis of the communication patterns used in MPI is proposed. Finally, we show how to apply the results of the analysis to calculate the WCET upper bound for a complete MPI program. Detailed insights in the performance of the applied TDM schedules are provided by comparing the schedules to each other in terms of timing. Additionally, we discuss the exhibited timing of the general-purpose schedules compared to a state-of-the-art application specific TDM schedule to put in relation both types of schedules. We apply the proposed procedure to several standard types of communication provided in MPI and compare different patterns that are used to implement a specific communication. Our evaluation investigates the communications’ building blocks of the timing bounds and shows the tremendous impact of choosing the appropriate communication pattern. Finally, a case study demonstrates the application of the presented procedure to a complete MPI program. With the method proposed in this work it is possible to perform a reusable WCET timing analysis for the communication in a NoC that is independent of the placement of tasks on the chip. Moreover, as the applied schedules are not optimized for a specific application but can be used for all applications in the same way, there are only marginal changes in the timing of the communication when the software is adapted or updated. Thus, there is no need to perform the timing analysis from scratch in such cases

    Design and Programming Methods for Reconfigurable Multi-Core Architectures using a Network-on-Chip-Centric Approach

    Get PDF
    A current trend in the semiconductor industry is the use of Multi-Processor Systems-on-Chip (MPSoCs) for a wide variety of applications such as image processing, automotive, multimedia, and robotic systems. Most applications gain performance advantages by executing parallel tasks on multiple processors due to the inherent parallelism. Moreover, heterogeneous structures provide high performance/energy efficiency, since application-specific processing elements (PEs) can be exploited. The increasing number of heterogeneous PEs leads to challenging communication requirements. To overcome this challenge, Networks-on-Chip (NoCs) have emerged as scalable on-chip interconnect. Nevertheless, NoCs have to deal with many design parameters such as virtual channels, routing algorithms and buffering techniques to fulfill the system requirements. This thesis highly contributes to the state-of-the-art of FPGA-based MPSoCs and NoCs. In the following, the three major contributions are introduced. As a first major contribution, a novel router concept is presented that efficiently utilizes communication times by performing sequences of arithmetic operations on the data that is transferred. The internal input buffers of the routers are exchanged with processing units that are capable of executing operations. Two different architectures of such processing units are presented. The first architecture provides multiply and accumulate operations which are often used in signal processing applications. The second architecture introduced as Application-Specific Instruction Set Routers (ASIRs) contains a processing unit capable of executing any operation and hence, it is not limited to multiply and accumulate operations. An internal processing core located in ASIRs can be developed in C/C++ using high-level synthesis. The second major contribution comprises application and performance explorations of the novel router concept. Models that approximate the achievable speedup and the end-to-end latency of ASIRs are derived and discussed to show the benefits in terms of performance. Furthermore, two applications using an ASIR-based MPSoC are implemented and evaluated on a Xilinx Zynq SoC. The first application is an image processing algorithm consisting of a Sobel filter, an RGB-to-Grayscale conversion, and a threshold operation. The second application is a system that helps visually impaired people by navigating them through unknown indoor environments. A Light Detection and Ranging (LIDAR) sensor scans the environment, while Inertial Measurement Units (IMUs) measure the orientation of the user to generate an audio signal that makes the distance as well as the orientation of obstacles audible. This application consists of multiple parallel tasks that are mapped to an ASIR-based MPSoC. Both applications show the performance advantages of ASIRs compared to a conventional NoC-based MPSoC. Furthermore, dynamic partial reconfiguration in terms of relocation and security aspects are investigated. The third major contribution refers to development and programming methodologies of NoC-based MPSoCs. A software-defined approach is presented that combines the design and programming of heterogeneous MPSoCs. In addition, a Kahn-Process-Network (KPN) –based model is designed to describe parallel applications for MPSoCs using ASIRs. The KPN-based model is extended to support not only the mapping of tasks to NoC-based MPSoCs but also the mapping to ASIR-based MPSoCs. A static mapping methodology is presented that assigns tasks to ASIRs and processors for a given KPN-model. The impact of external hardware components such as sensors, actuators and accelerators connected to the processors is also discussed which makes the approach of high interest for embedded systems

    High Performance Embedded Computing

    Get PDF
    Nowadays, the prevalence of computing systems in our lives is so ubiquitous that we live in a cyber-physical world dominated by computer systems, from pacemakers to cars and airplanes. These systems demand for more computational performance to process large amounts of data from multiple data sources with guaranteed processing times. Actuating outside of the required timing bounds may cause the failure of the system, being vital for systems like planes, cars, business monitoring, e-trading, etc. High-Performance and Time-Predictable Embedded Computing presents recent advances in software architecture and tools to support such complex systems, enabling the design of embedded computing devices which are able to deliver high-performance whilst guaranteeing the application required timing bounds. Technical topics discussed in the book include: Parallel embedded platforms Programming models Mapping and scheduling of parallel computations Timing and schedulability analysis Runtimes and operating systemsThe work reflected in this book was done in the scope of the European project P SOCRATES, funded under the FP7 framework program of the European Commission. High-performance and time-predictable embedded computing is ideal for personnel in computer/communication/embedded industries as well as academic staff and master/research students in computer science, embedded systems, cyber-physical systems and internet-of-things

    Vector extensions in COTS processors to increase guaranteed performance in real-time systems

    Get PDF
    The need for increased application performance in high-integrity systems like those in avionics is on the rise as software continues to implement more complex functionalities. The prevalent computing solution for future high-integrity embedded products are multi-processors systems-on-chip (MPSoC) processors. MPSoCs include CPU multicores that enable improving performance via thread-level parallelism. MPSoCs also include generic accelerators (GPUs) and application-specific accelerators. However, the data processing approach (DPA) required to exploit each of these underlying parallel hardware blocks carries several open challenges to enable the safe deployment in high-integrity domains. The main challenges include the qualification of its associated runtime system and the difficulties in analyzing programs deploying the DPA with out-of-the-box timing analysis and code coverage tools. In this work, we perform a thorough analysis of vector extensions (VExt) in current COTS processors for high-integrity systems. We show that VExt prevent many of the challenges arising with parallel programming models and GPUs. Unlike other DPAs, VExt require no runtime support, prevent by design race conditions that might arise with parallel programming models, and have minimum impact on the software ecosystem enabling the use of existing code coverage and timing analysis tools. We develop vectorized versions of neural network kernels and show that the NVIDIA Xavier VExt provide a reasonable increase in guaranteed application performance of up to 2.7x. Our analysis contends that VExt are the DPA approach with arguably the fastest path for adoption in high-integrity systems.This work has received funding from the the European Research Council (ERC) grant agreement No. 772773 (SuPerCom) and the Spanish Ministry of Science and Innovation (AEI/10.13039/501100011033) under grants PID2019-107255GB-C21 and IJC2020-045931-I.Peer ReviewedPostprint (author's final draft

    Embedded System Design

    Get PDF
    A unique feature of this open access textbook is to provide a comprehensive introduction to the fundamental knowledge in embedded systems, with applications in cyber-physical systems and the Internet of things. It starts with an introduction to the field and a survey of specification models and languages for embedded and cyber-physical systems. It provides a brief overview of hardware devices used for such systems and presents the essentials of system software for embedded systems, including real-time operating systems. The author also discusses evaluation and validation techniques for embedded systems and provides an overview of techniques for mapping applications to execution platforms, including multi-core platforms. Embedded systems have to operate under tight constraints and, hence, the book also contains a selected set of optimization techniques, including software optimization techniques. The book closes with a brief survey on testing. This fourth edition has been updated and revised to reflect new trends and technologies, such as the importance of cyber-physical systems (CPS) and the Internet of things (IoT), the evolution of single-core processors to multi-core processors, and the increased importance of energy efficiency and thermal issues

    Scalable Applications on Heterogeneous System Architectures: A Systematic Performance Analysis Framework

    Get PDF
    The efficient parallel execution of scientific applications is a key challenge in high-performance computing (HPC). With growing parallelism and heterogeneity of compute resources as well as increasingly complex software, performance analysis has become an indispensable tool in the development and optimization of parallel programs. This thesis presents a framework for systematic performance analysis of scalable, heterogeneous applications. Based on event traces, it automatically detects the critical path and inefficiencies that result in waiting or idle time, e.g. due to load imbalances between parallel execution streams. As a prerequisite for the analysis of heterogeneous programs, this thesis specifies inefficiency patterns for computation offloading. Furthermore, an essential contribution was made to the development of tool interfaces for OpenACC and OpenMP, which enable a portable data acquisition and a subsequent analysis for programs with offload directives. At present, these interfaces are already part of the latest OpenACC and OpenMP API specification. The aforementioned work, existing preliminary work, and established analysis methods are combined into a generic analysis process, which can be applied across programming models. Based on the detection of wait or idle states, which can propagate over several levels of parallelism, the analysis identifies wasted computing resources and their root cause as well as the critical-path share for each program region. Thus, it determines the influence of program regions on the load balancing between execution streams and the program runtime. The analysis results include a summary of the detected inefficiency patterns and a program trace, enhanced with information about wait states, their cause, and the critical path. In addition, a ranking, based on the amount of waiting time a program region caused on the critical path, highlights program regions that are relevant for program optimization. The scalability of the proposed performance analysis and its implementation is demonstrated using High-Performance Linpack (HPL), while the analysis results are validated with synthetic programs. A scientific application that uses MPI, OpenMP, and CUDA simultaneously is investigated in order to show the applicability of the analysis

    Distributed and Lightweight Meta-heuristic Optimization method for Complex Problems

    Get PDF
    The world is becoming more prominent and more complex every day. The resources are limited and efficiently use them is one of the most requirement. Finding an Efficient and optimal solution in complex problems needs to practical methods. During the last decades, several optimization approaches have been presented that they can apply to different optimization problems, and they can achieve different performance on various problems. Different parameters can have a significant effect on the results, such as the type of search spaces. Between the main categories of optimization methods (deterministic and stochastic methods), stochastic optimization methods work more efficient on big complex problems than deterministic methods. But in highly complex problems, stochastic optimization methods also have some issues, such as execution time, convergence to local optimum, incompatible with distributed systems, and dependence on the type of search spaces. Therefore this thesis presents a distributed and lightweight metaheuristic optimization method (MICGA) for complex problems focusing on four main tracks. 1) The primary goal is to improve the execution time by MICGA. 2) The proposed method increases the stability and reliability of the results by using the multi-population strategy in the second track. 3) MICGA is compatible with distributed systems. 4) Finally, MICGA is applied to the different type of optimization problems with other kinds of search spaces (continuous, discrete and order based optimization problems). MICGA has been compared with other efficient optimization approaches. The results show the proposed work has been achieved enough improvement on the main issues of the stochastic methods that are mentioned before.Maailmasta on päivä päivältä tulossa yhä monimutkaisempi. Resurssit ovat rajalliset, ja siksi niiden tehokas käyttö on erittäin tärkeää. Tehokkaan ja optimaalisen ratkaisun löytäminen monimutkaisiin ongelmiin vaatii tehokkaita käytännön menetelmiä. Viime vuosikymmenien aikana on ehdotettu useita optimointimenetelmiä, joilla jokaisella on vahvuutensa ja heikkoutensa suorituskyvyn ja tarkkuuden suhteen erityyppisten ongelmien ratkaisemisessa. Parametreilla, kuten hakuavaruuden tyypillä, voi olla merkittävä vaikutus tuloksiin. Optimointimenetelmien pääryhmistä (deterministiset ja stokastiset menetelmät) stokastinen optimointi toimii suurissa monimutkaisissa ongelmissa tehokkaammin kuin deterministinen optimointi. Erittäin monimutkaisissa ongelmissa stokastisilla optimointimenetelmillä on kuitenkin myös joitain ongelmia, kuten korkeat suoritusajat, päätyminen paikallisiin optimipisteisiin, yhteensopimattomuus hajautetun toteutuksen kanssa ja riippuvuus hakuavaruuden tyypistä. Tämä opinnäytetyö esittelee hajautetun ja kevyen metaheuristisen optimointimenetelmän (MICGA) monimutkaisille ongelmille keskittyen neljään päätavoitteeseen: 1) Ensisijaisena tavoitteena on pienentää suoritusaikaa MICGA:n avulla. 2) Lisäksi ehdotettu menetelmä lisää tulosten vakautta ja luotettavuutta käyttämällä monipopulaatiostrategiaa. 3) MICGA tukee hajautettua toteutusta. 4) Lopuksi MICGA-menetelmää sovelletaan erilaisiin optimointiongelmiin, jotka edustavat erityyppisiä hakuavaruuksia (jatkuvat, diskreetit ja järjestykseen perustuvat optimointiongelmat). Työssä MICGA-menetelmää verrataan muihin tehokkaisiin optimointimenetelmiin. Tulokset osoittavat, että ehdotetulla menetelmällä saavutetaan selkeitä parannuksia yllä mainittuihin stokastisten menetelmien pääongelmiin liittyen

    Analytical cost metrics: days of future past

    Get PDF
    2019 Summer.Includes bibliographical references.Future exascale high-performance computing (HPC) systems are expected to be increasingly heterogeneous, consisting of several multi-core CPUs and a large number of accelerators, special-purpose hardware that will increase the computing power of the system in a very energy-efficient way. Specialized, energy-efficient accelerators are also an important component in many diverse systems beyond HPC: gaming machines, general purpose workstations, tablets, phones and other media devices. With Moore's law driving the evolution of hardware platforms towards exascale, the dominant performance metric (time efficiency) has now expanded to also incorporate power/energy efficiency. This work builds analytical cost models for cost metrics such as time, energy, memory access, and silicon area. These models are used to predict the performance of applications, for performance tuning, and chip design. The idea is to work with domain specific accelerators where analytical cost models can be accurately used for performance optimization. The performance optimization problems are formulated as mathematical optimization problems. This work explores the analytical cost modeling and mathematical optimization approach in a few ways. For stencil applications and GPU architectures, the analytical cost models are developed for execution time as well as energy. The models are used for performance tuning over existing architectures, and are coupled with silicon area models of GPU architectures to generate highly efficient architecture configurations. For matrix chain products, analytical closed form solutions for off-chip data movement are built and used to minimize the total data movement cost of a minimum op count tree

    High-Performance and Time-Predictable Embedded Computing

    Get PDF
    Nowadays, the prevalence of computing systems in our lives is so ubiquitous that we live in a cyber-physical world dominated by computer systems, from pacemakers to cars and airplanes. These systems demand for more computational performance to process large amounts of data from multiple data sources with guaranteed processing times. Actuating outside of the required timing bounds may cause the failure of the system, being vital for systems like planes, cars, business monitoring, e-trading, etc. High-Performance and Time-Predictable Embedded Computing presents recent advances in software architecture and tools to support such complex systems, enabling the design of embedded computing devices which are able to deliver high-performance whilst guaranteeing the application required timing bounds. Technical topics discussed in the book include: Parallel embedded platforms Programming models Mapping and scheduling of parallel computations Timing and schedulability analysis Runtimes and operating systems The work reflected in this book was done in the scope of the European project P SOCRATES, funded under the FP7 framework program of the European Commission. High-performance and time-predictable embedded computing is ideal for personnel in computer/communication/embedded industries as well as academic staff and master/research students in computer science, embedded systems, cyber-physical systems and internet-of-things.info:eu-repo/semantics/publishedVersio
    corecore