34 research outputs found

    Design and Programming Methods for Reconfigurable Multi-Core Architectures using a Network-on-Chip-Centric Approach

    Get PDF
    A current trend in the semiconductor industry is the use of Multi-Processor Systems-on-Chip (MPSoCs) for a wide variety of applications such as image processing, automotive, multimedia, and robotic systems. Most applications gain performance advantages by executing parallel tasks on multiple processors due to the inherent parallelism. Moreover, heterogeneous structures provide high performance/energy efficiency, since application-specific processing elements (PEs) can be exploited. The increasing number of heterogeneous PEs leads to challenging communication requirements. To overcome this challenge, Networks-on-Chip (NoCs) have emerged as scalable on-chip interconnect. Nevertheless, NoCs have to deal with many design parameters such as virtual channels, routing algorithms and buffering techniques to fulfill the system requirements. This thesis highly contributes to the state-of-the-art of FPGA-based MPSoCs and NoCs. In the following, the three major contributions are introduced. As a first major contribution, a novel router concept is presented that efficiently utilizes communication times by performing sequences of arithmetic operations on the data that is transferred. The internal input buffers of the routers are exchanged with processing units that are capable of executing operations. Two different architectures of such processing units are presented. The first architecture provides multiply and accumulate operations which are often used in signal processing applications. The second architecture introduced as Application-Specific Instruction Set Routers (ASIRs) contains a processing unit capable of executing any operation and hence, it is not limited to multiply and accumulate operations. An internal processing core located in ASIRs can be developed in C/C++ using high-level synthesis. The second major contribution comprises application and performance explorations of the novel router concept. Models that approximate the achievable speedup and the end-to-end latency of ASIRs are derived and discussed to show the benefits in terms of performance. Furthermore, two applications using an ASIR-based MPSoC are implemented and evaluated on a Xilinx Zynq SoC. The first application is an image processing algorithm consisting of a Sobel filter, an RGB-to-Grayscale conversion, and a threshold operation. The second application is a system that helps visually impaired people by navigating them through unknown indoor environments. A Light Detection and Ranging (LIDAR) sensor scans the environment, while Inertial Measurement Units (IMUs) measure the orientation of the user to generate an audio signal that makes the distance as well as the orientation of obstacles audible. This application consists of multiple parallel tasks that are mapped to an ASIR-based MPSoC. Both applications show the performance advantages of ASIRs compared to a conventional NoC-based MPSoC. Furthermore, dynamic partial reconfiguration in terms of relocation and security aspects are investigated. The third major contribution refers to development and programming methodologies of NoC-based MPSoCs. A software-defined approach is presented that combines the design and programming of heterogeneous MPSoCs. In addition, a Kahn-Process-Network (KPN) –based model is designed to describe parallel applications for MPSoCs using ASIRs. The KPN-based model is extended to support not only the mapping of tasks to NoC-based MPSoCs but also the mapping to ASIR-based MPSoCs. A static mapping methodology is presented that assigns tasks to ASIRs and processors for a given KPN-model. The impact of external hardware components such as sensors, actuators and accelerators connected to the processors is also discussed which makes the approach of high interest for embedded systems

    Performance Aspects of Synthesizable Computing Systems

    Get PDF

    DSM64: A Distributed Shared Memory System in User-Space

    Get PDF
    This paper presents DSM64: a lazy release consistent software distributed shared memory (SDSM) system built entirely in user-space. The DSM64 system is capable of executing threaded applications implemented with pthreads on a cluster of networked machines without any modifications to the target application. The DSM64 system features a centralized memory manager [1] built atop Hoard [2, 3]: a fast, scalable, and memory-efficient allocator for shared-memory multiprocessors. In my presentation, I present a SDSM system written in C++ for Linux operating systems. I discuss a straight-forward approach to implement SDSM systems in a Linux environment using system-provided tools and concepts avail- able entirely in user-space. I show that the SDSM system presented in this paper is capable of resolving page faults over a local area network in as little as 2 milliseconds. In my analysis, I present the following. I compare the performance characteristics of a matrix multiplication benchmark using various memory coherency models. I demonstrate that matrix multiplication benchmark using a LRC model performs orders of magnitude quicker than the same application using a stricter coherency model. I show the effect of coherency model on memory access patterns and memory contention. I compare the effects of different locking strategies on execution speed and memory access patterns. Lastly, I provide a comparison of the DSM64 system to a non-networked version using a system-provided allocator

    HW/SW Codesign for the Xilinx Zynq Platform

    Get PDF
    Tato práce se zabývá možnostmi pro HW/SW codesign na platformě Xilinx Zynq. Na základě studia rozhraní mezi částmi Processing System (ARM Cortex-A9 MPCore) a Programmable Logic (FPGA) je navržen abstraktní a univerzální přístup k vývoji aplikací, které jsou akcelerovány v programovatelném hardwaru na tomto čipu a běží nad operačním systémem Linux. V praktické části je pro tyto účely navržen framework určený pro Zynq, ale také pro jiné obdobné platformy. Žádný takový framework není v současné době k dispozici.This work describes a novel approach of HW/SW codesign on the Xilinx Zynq and similar platforms. It deals with interconnections between the Processing System (ARM Cortex-A9 MPCore) and the Programmable Logic (FPGA) to find an abstract and universal way to develop applications that are partially offloaded into the programmable hardware and that run in the Linux operating system. For that purpose a framework for HW/SW codesign on the Zynq and similar platforms is designed. No such framework is currently available.

    Terrier: an embedded operating system using advanced types for safety

    Get PDF
    Operating systems software is fundamental to modern computer systems: all other applications are dependent upon the correct and timely provision of basic system services. At the same time, advances in programming languages and type theory have lead to the creation of functional programming languages with type systems that are designed to combine theorem proving with practical systems programming. The Terrier operating system project focuses on low-level systems programming in the context of a multi-core, real-time, embedded system, while taking advantage of a dependently typed programming language named ATS to improve reliability. Terrier is a new point in the design space for an operating system, one that leans heavily on an associated programming language, ATS, to provide safety that has traditionally been in the scope of hardware protection and kernel privilege. Terrier tries to have far fewer abstractions between program and hardware. The purpose of Terrier is to put programs as much in contact with the real hardware, real memory, and real timing constraints as possible, while still retaining the ability to multiplex programs and provide for a reasonable level of safety through static analysis

    A new approach to reversible computing with applications to speculative parallel simulation

    Get PDF
    In this thesis, we propose an innovative approach to reversible computing that shifts the focus from the operations to the memory outcome of a generic program. This choice allows us to overcome some typical challenges of "plain" reversible computing. Our methodology is to instrument a generic application with the help of an instrumentation tool, namely Hijacker, which we have redesigned and developed for the purpose. Through compile-time instrumentation, we enhance the program's code to keep track of the memory trace it produces until the end. Regardless of the complexity behind the generation of each computational step of the program, we can build inverse machine instructions just by inspecting the instruction that is attempting to write some value to memory. Therefore from this information, we craft an ad-hoc instruction that conveys this old value and the knowledge of where to replace it. This instruction will become part of a more comprehensive structure, namely the reverse window. Through this structure, we have sufficient information to cancel all the updates done by the generic program during its execution. In this writing, we will discuss the structure of the reverse window, as the building block for the whole reversing framework we designed and finally realized. Albeit we settle our solution in the specific context of the parallel discrete event simulation (PDES) adopting the Time Warp synchronization protocol, this framework paves the way for further general-purpose development and employment. We also present two additional innovative contributions coming from our innovative reversibility approach, both of them still embrace traditional state saving-based rollback strategy. The first contribution aims to harness the advantages of both the possible approaches. We implement the rollback operation combining state saving together with our reversible support through a mathematical model. This model enables the system to choose in autonomicity the best rollback strategy, by the mutable runtime dynamics of programs. The second contribution explores an orthogonal direction, still related to reversible computing aspects. In particular, we will address the problem of reversing shared libraries. Indeed, leading from their nature, shared objects are visible to the whole system and so does every possible external modification of their code. As a consequence, it is not possible to instrument them without affecting other unaware applications. We propose a different method to deal with the instrumentation of shared objects. All our innovative proposals have been assessed using the last generation of the open source ROOT-Sim PDES platform, where we integrated our solutions. ROOT-Sim is a C-based package implementing a general purpose simulation environment based on the Time Warp synchronization protocol

    MURAC: A unified machine model for heterogeneous computers

    Get PDF
    Includes bibliographical referencesHeterogeneous computing enables the performance and energy advantages of multiple distinct processing architectures to be efficiently exploited within a single machine. These systems are capable of delivering large performance increases by matching the applications to architectures that are most suited to them. The Multiple Runtime-reconfigurable Architecture Computer (MURAC) model has been proposed to tackle the problems commonly found in the design and usage of these machines. This model presents a system-level approach that creates a clear separation of concerns between the system implementer and the application developer. The three key concepts that make up the MURAC model are a unified machine model, a unified instruction stream and a unified memory space. A simple programming model built upon these abstractions provides a consistent interface for interacting with the underlying machine to the user application. This programming model simplifies application partitioning between hardware and software and allows the easy integration of different execution models within the single control ow of a mixed-architecture application. The theoretical and practical trade-offs of the proposed model have been explored through the design of several systems. An instruction-accurate system simulator has been developed that supports the simulated execution of mixed-architecture applications. An embedded System-on-Chip implementation has been used to measure the overhead in hardware resources required to support the model, which was found to be minimal. An implementation of the model within an operating system on a tightly-coupled reconfigurable processor platform has been created. This implementation is used to extend the software scheduler to allow for the full support of mixed-architecture applications in a multitasking environment. Different scheduling strategies have been tested using this scheduler for mixed-architecture applications. The design and implementation of these systems has shown that a unified abstraction model for heterogeneous computers provides important usability benefits to system and application designers. These benefits are achieved through a consistent view of the multiple different architectures to the operating system and user applications. This allows them to focus on achieving their performance and efficiency goals by gaining the benefits of different execution models during runtime without the complex implementation details of the system-level synchronisation and coordination

    Scratchpad Memory Management For Multicore Real-Time Embedded Systems

    Get PDF
    Multicore systems will continue to spread in the domain of real-time embedded systems due to the increasing need for high-performance applications. This research discusses some of the challenges associated with employing multicore systems for safety-critical real-time applications. Mainly, this work is concerned with providing: 1) efficient inter-core timing isolation for independent tasks, and 2) predictable task communication for communicating tasks. Principally, we introduce a new task execution model, based on the 3-phase execution model, that exploits the Direct Memory Access (DMA) controllers available in modern embedded platforms along with ScratchPad Memories (SPMs) to enforce strong timing isolation between tasks. The DMA and the SPMs are explicitly managed to pre-load tasks from main memory into the local (private) scratchpad memories. Tasks are then executed from the local SPMs without accessing main memory. This model allows CPU execution to be overlapped with DMA loading/unloading operations from and to main memory. We show that by co-scheduling task execution on CPUs and using DMA to access memory and I/O, we can efficiently hide access latency to physical resources. In turn, this leads to significant improvements in system schedulability, compared to both the case of unregulated contention for access to physical resources and to previous cache and SPM management techniques for real-time systems. The presented SPM-centric scheduling algorithms and analyses cover single-core, partitioned, and global real-time systems. The proposed scheme is also extended to support large tasks that do not fit entirely into the local SPM. Moreover, the schedulability analysis considers the case of recovering from transient soft errors (bit flips caused by a single event upset) in several levels of memories, that cannot be automatically corrected in hardware by the ECC unit. The proposed SPM-centric scheduling is integrated at the OS level; thus it is transparent to applications. The proposed scheme is implemented and evaluated on an FPGA platform and a Commercial-Off-The-Shelf (COTS) platform. In regards to real-time task communication, two types of communication are considered. 1) Asynchronous inter-task communication, between either sequential tasks (single-threaded) or parallel tasks (multi-threaded). 2) Intra-task communication, where parallel threads of the same application exchange data. A new task scheduling model for parallel tasks (Bundled Scheduling) is proposed to facilitate intra-task communication and reduce synchronization overheads. We show that the proposed bundled scheduling model can be applied to several parallel programming models, such as fork-join and DAG-based applications, leading to improved system schedulability. Finally, intra-task communication is governed by a predictable inter-core communication platform. Specifically, we propose HopliteRT, a lean and predictable Network-on-Chip that connects the private SPMs

    Simulation Modelling of Distributed-Shared Memory Multiprocessors

    Get PDF
    Institute for Computing Systems ArchitectureDistributed shared memory (DSM) systems have been recognised as a compelling platform for parallel computing due to the programming advantages and scalability. DSM systems allow applications to access data in a logically shared address space by abstracting away the distinction of physical memory location. As the location of data is transparent, the sources of overhead caused by accessing the distant memories are difficult to analyse. This memory locality problem has been identified as crucial to DSM performance. Many researchers have investigated the problem using simulation as a tool for conducting experiments resulting in the progressive evolution of DSM systems. Nevertheless, both the diversity of architectural configurations and the rapid advance of DSM implementations impose constraints on simulation model designs in two issues: the limitation of the simulation framework on model extensibility and the lack of verification applicability during a simulation run causing the delay in verification process. This thesis studies simulation modelling techniques for memory locality analysis of various DSM systems implemented on top of a cluster of symmetric multiprocessors. The thesis presents a simulation technique to promote model extensibility and proposes a technique for verification applicability, called a Specification-based Parameter Model Interaction (SPMI). The proposed techniques have been implemented in a new interpretation-driven simulation called DSiMCLUSTER on top of a discrete event simulation (DES) engine known as HASE. Experiments have been conducted to determine which factors are most influential on the degree of locality and to determine the possibility to maximise the stability of performance. DSiMCLUSTER has been validated against a SunFire 15K server and has achieved similarity of cache miss results, an average of +-6% with the worst case less than 15% of difference. These results confirm that the techniques used in developing the DSiMCLUSTER can contribute ways to achieve both (a) a highly extensible simulation framework to keep up with the ongoing innovation of the DSM architecture, and (b) the verification applicability resulting in an efficient framework for memory analysis experiments on DSM architecture
    corecore