375 research outputs found

    Polymorphic computing abstraction for heterogeneous architectures

    The integration of multiple computing paradigms onto a system on chip (SoC) has pushed the boundaries of design space exploration for hardware architectures and the computing system software stack. The heterogeneity of computing styles within an SoC has created a new class of architectures referred to as heterogeneous architectures. Novel applications developed to exploit the different computing styles are user centric for embedded SoCs. Software and hardware designers face several challenges in harnessing the full potential of heterogeneous architectures: applications have to execute on more than one compute style to increase overall SoC resource utilization, which implies that application threads need to be polymorphic. The operating system layer is thus faced with the problem of scheduling polymorphic threads, and resource allocation is another important problem the OS must handle. The morphism evolution of application threads is constrained by the availability of heterogeneous computing resources. Traditional design optimization goals, such as computational power and lower energy per computation, are inadequate to satisfy user-centric application resource needs. Resource allocation decisions at the application layer need to permeate to the architectural layer to avoid conflicting demands that may affect the energy-delay characteristics of application threads. We propose the polymorphic computing abstraction as a unified computing model for heterogeneous architectures to address these issues. A simulation environment for polymorphic applications is developed and evaluated under various scheduling strategies to determine the effectiveness of the polymorphism abstraction on resource allocation. A user satisfaction model is also developed to complement polymorphism and is used to optimize resource utilization at the application and network layers of embedded systems.
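    To make the abstraction concrete, here is a minimal, hypothetical Python sketch (names and cost numbers are invented, not taken from the thesis) of a polymorphic thread that carries one morphism per compute style, with a greedy scheduler that picks the feasible morphism with the lowest energy-delay product:

```python
from dataclasses import dataclass

@dataclass
class Morphism:
    """One realization of a task for a particular compute style."""
    style: str          # e.g. "cpu", "gpu", "dsp", "fpga"
    exec_time: float    # estimated execution time (ms)
    energy: float       # estimated energy per execution (mJ)

@dataclass
class PolymorphicThread:
    name: str
    morphisms: list     # all compute styles this thread can morph into

def schedule(threads, free_units):
    """Greedily map each thread to the feasible morphism with the lowest
    energy-delay product; free_units counts idle units per compute style."""
    placement = {}
    for t in threads:
        feasible = [m for m in t.morphisms if free_units.get(m.style, 0) > 0]
        if not feasible:
            continue  # thread stays pending until a resource frees up
        best = min(feasible, key=lambda m: m.exec_time * m.energy)
        free_units[best.style] -= 1
        placement[t.name] = best.style
    return placement

if __name__ == "__main__":
    threads = [
        PolymorphicThread("fir_filter", [Morphism("cpu", 8.0, 4.0),
                                         Morphism("dsp", 2.0, 1.5)]),
        PolymorphicThread("ui_render", [Morphism("cpu", 5.0, 3.0),
                                        Morphism("gpu", 1.0, 2.5)]),
    ]
    print(schedule(threads, {"cpu": 1, "dsp": 1, "gpu": 0}))
    # -> {'fir_filter': 'dsp', 'ui_render': 'cpu'}
```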

    A time-predictable many-core processor design for critical real-time embedded systems

    Critical Real-Time Embedded Systems (CRTES) are in charge of controlling fundamental parts of embedded systems, e.g. energy-harvesting solar panels in satellites, steering and braking in cars, or flight management systems in airplanes. To do so, CRTES require strong evidence of correct functional and timing behavior. The former guarantees that the system operates correctly in response to its inputs; the latter ensures that its operations are performed within a predefined time budget. CRTES keep increasing the number and complexity of the functions they implement. Examples include the incorporation of "smarter" Advanced Driver Assistance System (ADAS) functionality in modern cars or advanced collision avoidance systems in Unmanned Aerial Vehicles (UAVs). All these new features, implemented in software, lead to an exponential growth in both performance requirements and software development complexity. Furthermore, there is a strong need to integrate multiple functions into the same computing platform to reduce the number of processing units, mass and space requirements, etc. Overall, there is a clear need to increase the computing power of current CRTES in order to support new sophisticated and complex functionality, and to integrate multiple systems into a single platform. The use of multi- and many-core processor architectures is increasingly seen in the CRTES industry as the solution to cope with the performance demands and cost constraints of future CRTES. Many-cores supply higher performance by exploiting the parallelism of applications while providing better performance per watt, since the cores are kept simpler than complex single-core processors. Moreover, their parallelization capabilities allow scheduling multiple functions onto the same processor, maximizing hardware utilization. However, the use of multi- and many-cores in CRTES also brings a number of challenges related to providing evidence about the correct operation of the system, especially in the timing domain. Hence, despite the advantages of many-cores and the fact that they are nowadays a reality in the embedded domain (e.g. Kalray MPPA, Freescale NXP P4080, TI Keystone II), their use in CRTES still requires finding efficient ways of providing reliable evidence about the correct operation of the system. This thesis investigates the use of many-core processors in CRTES as a means to satisfy the performance demands of future complex applications while providing the necessary timing guarantees. To do so, this thesis advances the state of the art towards the exploitation of the parallel capabilities of many-cores in CRTES, contributing in two different domains. In the hardware domain, this thesis proposes new many-core designs that enable deriving reliable and tight timing guarantees. In the software domain, we present efficient scheduling and timing analysis techniques to exploit the parallelization capabilities of many-core architectures and to derive tight and trustworthy Worst-Case Execution Time (WCET) estimates for CRTES.
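    As a simplified illustration of the kind of static timing analysis the thesis builds on (not the thesis's actual method), the following Python sketch bounds the WCET of a loop-free control-flow graph by a longest-path computation over per-block worst-case cycle counts; the graph and the costs are invented:

```python
from functools import lru_cache

# Per-basic-block worst-case cost in cycles (illustrative numbers) and the
# successor edges of a loop-free control-flow graph.
block_cost = {"entry": 10, "A": 40, "B": 25, "C": 55, "exit": 5}
successors = {"entry": ["A", "B"], "A": ["C"], "B": ["C"], "C": ["exit"], "exit": []}

@lru_cache(maxsize=None)
def wcet(block):
    """Worst-case cycles from `block` to program exit (longest path)."""
    if not successors[block]:
        return block_cost[block]
    return block_cost[block] + max(wcet(s) for s in successors[block])

print(wcet("entry"))  # 10 + max(40, 25) + 55 + 5 = 110 cycles
```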

    Run-time management for future MPSoC platforms

    In recent years, we are witnessing the dawning of the Multi-Processor System-on-Chip (MPSoC) era. In essence, this era is triggered by the need to handle more complex applications while reducing the overall cost of embedded (handheld) devices. This cost will mainly be determined by the cost of the hardware platform and the cost of designing applications for that platform. The cost of a hardware platform will partly depend on its production volume. In turn, this means that flexible, (easily) programmable multi-purpose platforms will exhibit a lower cost. A multi-purpose platform not only requires flexibility, but should also combine high performance with low power consumption. To this end, MPSoC devices integrate computer architectural properties of various computing domains. Just like large-scale parallel and distributed systems, they contain multiple heterogeneous processing elements interconnected by a scalable, network-like structure. This helps in achieving scalable high performance. As in most mobile or portable embedded systems, there is a need for low-power operation and real-time behavior. The cost of designing applications is equally important. Indeed, the actual value of future MPSoC devices is not contained within the embedded multiprocessor IC, but in their capability to provide the user of the device with a set of services or experiences. So from an application viewpoint, MPSoCs are designed to efficiently process multimedia content in applications like video players, video conferencing, 3D gaming, augmented reality, etc. Such applications typically require a lot of processing power and a significant amount of memory. To keep up with ever-evolving user needs and with new application standards appearing at a fast pace, MPSoC platforms need to be easily programmable. Application scalability, i.e. the ability to use just enough platform resources according to the user requirements and with respect to the device capabilities, is also an important factor. Hence scalability, flexibility, real-time behavior, high performance, low power consumption and, finally, programmability are key components in realizing the success of MPSoC platforms. The run-time manager is logically located between the application layer and the platform layer. It has a crucial role in realizing these MPSoC requirements. As it abstracts the platform hardware, it improves platform programmability. By deciding on resource assignment at run-time, based on the performance requirements of the user, the needs of the application and the capabilities of the platform, it contributes to flexibility, scalability and low-power operation. As it arbitrates between different applications, it enables real-time behavior. This thesis details the key components of such an MPSoC run-time manager and provides a proof-of-concept implementation. These key components include application quality management algorithms linked to MPSoC resource management mechanisms and policies, adapted to the provided MPSoC platform services. First, we describe the role, the responsibilities and the boundary conditions of an MPSoC run-time manager in a generic way. This includes a definition of the multiprocessor run-time management design space, a description of the run-time manager design trade-offs and a brief discussion of how these trade-offs affect the key MPSoC requirements.
This design space definition and the trade-offs are illustrated based on ongoing research and on existing commercial and academic multiprocessor run-time management solutions. Next, we introduce a fast and efficient resource allocation heuristic that considers FPGA fabric properties such as fragmentation. In addition, this thesis introduces a novel task assignment algorithm for handling soft IP cores, denoted hierarchical configuration. Hierarchical configuration managed by the run-time manager enables easier application design and increases the run-time spatial mapping freedom. In turn, this improves the performance of the resource assignment algorithm. Furthermore, we introduce run-time task migration components. We detail a new run-time task migration policy closely coupled to the run-time resource assignment algorithm. In addition to detailing a design-environment-supported mechanism that enables moving tasks between an ISP and fine-grained reconfigurable hardware, we also propose two novel task migration mechanisms tailored to the Network-on-Chip environment. Finally, we propose a novel mechanism for task migration initiation, based on reusing debug registers in modern embedded microprocessors. We propose a reactive on-chip communication management mechanism. We show that by exploiting an injection rate control mechanism it is possible to provide a communication management system capable of providing soft (reactive) QoS in a NoC. We introduce a novel, platform-independent run-time algorithm to perform quality management, i.e. to select an application quality operating point at run-time based on the user requirements and the available platform resources, as reported by the resource manager. This contribution also proposes a novel way to manage the interaction between the quality manager and the resource manager. In order to have a realistic, reproducible and flexible run-time manager testbench with respect to applications with multiple quality levels and implementation trade-offs, we have created an input data generation tool denoted Pareto Surfaces For Free (PSFF). The PSFF tool is, to the best of our knowledge, the first tool that generates multiple realistic application operating points either based on profiling information of a real-life application or based on a designer-controlled random generator. Finally, we provide a proof-of-concept demonstrator that combines these concepts and shows how these mechanisms and policies can operate in real-life situations. In addition, we show that the proposed solutions can be integrated into existing platform operating systems.
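    As an illustration of the quality-manager/resource-manager interaction described above, here is a hypothetical Python sketch (operating points and resource names are invented) that selects the highest-quality application operating point that fits the resource budget reported by the resource manager:

```python
def pick_operating_point(points, budget):
    """Pick the highest-quality point whose resource demands fit the budget.

    points: list of dicts like {"quality": 7, "cpu": 2, "mem_mb": 64}
    budget: dict of available resources, e.g. {"cpu": 3, "mem_mb": 128}
    """
    feasible = [p for p in points
                if all(p[r] <= budget.get(r, 0) for r in p if r != "quality")]
    return max(feasible, key=lambda p: p["quality"]) if feasible else None

video_points = [
    {"quality": 3, "cpu": 1, "mem_mb": 32},   # low resolution
    {"quality": 6, "cpu": 2, "mem_mb": 64},   # medium resolution
    {"quality": 9, "cpu": 4, "mem_mb": 128},  # high resolution
]
print(pick_operating_point(video_points, {"cpu": 2, "mem_mb": 96}))
# -> {'quality': 6, 'cpu': 2, 'mem_mb': 64}
```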

    A fast cycle-approximate simulation technique for many-core NoC architectures

    Doctoral dissertation (Ph.D.), Seoul National University Graduate School, Department of Electrical and Computer Engineering, February 2017, Soonhoi Ha. Simulation is a software technique that uses currently available architectures to prototype a future architecture. In computer architecture research, simulation is one of the most important tools: it enables us to obtain important performance indicators of new architectures and to perform design space exploration using these metrics. Furthermore, a simulator enables rapid software development and optimization on an architecture that does not yet exist. Despite various known problems, such as slow speed or coverage issues, the reliance on simulation technology in computer architecture research continues to increase. As transistor density increases and the performance improvement of the single core hits the ceiling, newly constructed architectures usually consist of multi/many cores with a network-on-chip, which enables scalable communication. In addition, the implementation of the applications themselves has become more complicated in order to effectively utilize these parallel architectures. Thus, simulators for parallel architectures and parallel applications have become extremely complex, and existing sequential simulators no longer simulate these systems in a realistic time. While many parallel simulation techniques are being developed to solve these problems, they suffer from poor simulation performance or accuracy. In this thesis, we propose and evaluate a novel many-core simulation technique that obtains the best simulation performance at the cost of a minimal simulation error. The proposed parallel many-core simulator is divided into three parts: 1) the core simulator, 2) the network-on-chip simulator, and 3) the simulation backplane. Each core is simulated by a core simulator, which communicates with the external simulation backplane via interprocess communication (IPC). Each core simulation is performed individually on a separate host processor. The simulation backplane arranges messages from each core into chronological order, passes them to destination modules, and simulates hardware components other than cores. If the simulation backplane generates a request requiring NoC communication, this request is forwarded to the network simulator and is simulated at the most accurate level. In this thesis, we propose a novel core simulation model that combines analytical and sampled simulation. The core simulator achieves 11.36 to 44.31 MIPS, while the simulation error is approximately 8 percent. The standalone core simulator is released as open source. We confirmed that NoC simulation has a great effect on the reliability of the outputs generated from many-core simulation. First, existing flit-level NoC simulators were analyzed at the source-code level. Based on these observations, various implementations were evaluated and various software optimizations were applied to improve the network simulation performance. The proposed NoC simulator delivers more than 100 KCycles/s as long as the packet injection rate does not exceed 0.00625, which is at least two times faster than state-of-the-art NoC simulators. The speed of the simulation backplane depends greatly on the IPC overhead and the SystemC scheduling overhead. To reduce the IPC overhead, a trace-driven co-simulation technique is used, a faster IPC mechanism is introduced, and a segmented L1 data cache is embedded in the core simulator.
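    As a toy illustration of the sampled core-simulation idea mentioned above (not the simulator's actual model), the following Python sketch simulates only selected instruction intervals in detail and extrapolates their average CPI to the fast-forwarded portion of the trace:

```python
def estimate_cycles(trace_len, samples, detailed_cpi):
    """Estimate total cycles of a `trace_len`-instruction run when only
    sampled intervals are simulated in detail.

    samples: list of (start, end) instruction indices simulated in detail
    detailed_cpi: measured cycles-per-instruction for each sampled interval
    """
    sampled_insts = sum(end - start for start, end in samples)
    sampled_cycles = sum(cpi * (end - start)
                         for (start, end), cpi in zip(samples, detailed_cpi))
    avg_cpi = sampled_cycles / sampled_insts          # extrapolation model
    return sampled_cycles + avg_cpi * (trace_len - sampled_insts)

# 10M-instruction trace, two detailed samples of 1M instructions each
print(estimate_cycles(10_000_000,
                      [(0, 1_000_000), (5_000_000, 6_000_000)],
                      [1.2, 1.6]))   # -> 14,000,000.0 estimated cycles
```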
In addition, to reduce the SystemC scheduling overhead, it is important to reduce the number of modules that are awakened simultaneously. To this end, slave modules are redesigned to be activated only on an event. A new scheduler parallelization technique is also studied. Although the newly developed SystemC parallel scheduler showed good performance under limited conditions, we also confirmed that it brought no performance improvement in the TLM-level many-core simulator developed in this thesis. Although the proposed many-core simulator uses a conservative synchronization technique, which is free from causality errors, and performs an accurate flit-level NoC simulation, the simulation performance is still acceptable thanks to parallelism and optimizations. Additionally, the simulator is easy to extend with other modules because the simulation backplane is compatible with the SystemC TLM 2.0 standard. Although extensive experiments on accuracy have not been conducted, they can be complemented once a detailed specification of the target architecture is given. This dissertation can serve as a reference for the development of many-core simulators, which will become even more essential in the future. Table of contents: Chapter 1 Introduction (Motivation; Contribution; Dissertation Organization). Chapter 2 Background and Existing Research (Terminologies: Simulation Host / Simulation Target, Simulated Time / Simulation Time, User-level Simulation / Full-system Simulation, Execution-driven Simulation / Trace-driven Simulation; State-of-the-art Many-core Simulators: Gem5, Marss, Sniper, Zsim, Manifold, Hornet, Summary; Host and Target Architecture). Chapter 3 Core Simulation (Overview; Related Works: Timing Models, Analytical Model: Interval Simulation; Sampling Mechanism: Sampling Configuration, Parameter Extraction; Trace Analyzer: Dependency Analysis, Life Cycle of an Instruction; Experimental Results: Time-accuracy Trade-off, Simulation Accuracy, Simulation Performance; Discussion). Chapter 4 NoC Simulation (Network-on-chip; Motivation; Related Works: Noxim, Booksim2, Garnet; Proposed Approach: Implementations, Optimizations; Experimental Results: Impact of Implementations and Optimizations, Comparison with Other State-of-the-Arts, Performance Evaluation for Various Configurations, Full-System Simulation Accuracy Impact, Accuracy; Discussion). Chapter 5 Simulation Backplane (Overview; Background: SystemC, OSCI Transaction Level Modeling Standard 2.0, Synchronization Techniques; SystemC Models for the Target Architecture; Reducing the Cost of Interprocess Communications: Trace-driven Co-simulation, Better Interprocess Communication, Virtually Embedding Modules in the Core Simulator; Reducing SystemC Scheduling Overhead: Event-based Slave Module Activation, SystemC Scheduler Parallelization; Evaluation: Scalability Test, Simulation Performance, Simulation Accuracy). Chapter 6 Simulation Backplane Parallelization (Background: OSCI SystemC Scheduler; Related Work: SystemC Parallelization Techniques (Fully-synchronous Approach, Parallel Distributed Event Scheduling (PDES) Approach, Out-of-order Execution with Dependency Analysis, Dynamic Offloading Approach); Proposed Technique: Basic Synchronization, Relaxed Synchronization, Modeling Restrictions; Experimental Results: Performance, Accuracy; Discussion and Limitation). Chapter 7 Conclusion. Bibliography. Abstract in Korean.
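    For illustration, here is a heavily simplified Python sketch of conservative synchronization between per-core simulators (event timestamps and the lookahead value are invented): in each round, a core may only process events below the global lower bound on pending timestamps plus the lookahead, so no core can ever receive a message from its simulated past:

```python
def run_conservative(event_queues, lookahead):
    """Toy conservative PDES loop: in each round, compute the global lower
    bound on unprocessed event timestamps (LBTS) and let every core process
    only the events earlier than LBTS + lookahead (no causality errors).

    event_queues: dict core_id -> sorted list of event timestamps
    """
    processed = []
    while any(event_queues.values()):
        pending = [q[0] for q in event_queues.values() if q]
        safe = min(pending) + lookahead        # safe horizon for this round
        for cid, q in event_queues.items():
            while q and q[0] < safe:
                processed.append((cid, q.pop(0)))
    return processed

queues = {"core0": [5, 120, 240], "core1": [30, 60, 300], "core2": [90, 210]}
print(run_conservative(queues, lookahead=100))
# events come out in per-core timestamp order, advancing round by round
```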

    Design, implementation and experimental evaluation of a network-slicing aware mobile protocol stack

    International Mention in the doctoral degree. With the arrival of new-generation mobile networks, we currently observe a paradigm shift, where monolithic network functions running on dedicated hardware are now implemented as software pieces that can be virtualized on general-purpose hardware platforms. This paradigm shift rests on the softwarization of network functions and the adoption of virtualization techniques. Network Function Virtualization (NFV) comprises the softwarization of network elements and the virtualization of these components. It brings multiple advantages: (i) flexibility, allowing easy management of virtual network functions (VNFs) (deploy, start, stop or update); (ii) efficiency, since resources can be consumed appropriately thanks to the increased flexibility of the network infrastructure; and (iii) reduced costs, due to the ability to share hardware resources. However, multiple challenges must be addressed to effectively leverage all these benefits. Network Function Virtualization introduced the concept of the virtual network, which led to a key enabler of 5G network flexibility: network slicing. This new paradigm represents a new way to operate mobile networks, where the underlying infrastructure is "sliced" into logically separated networks that can be customized to the specific needs of the tenant. This approach also enables VNFs to be instantiated at different locations of the infrastructure, choosing their optimal placement based on parameters such as the requirements of the service traversing the slice or the available resources. This decision process is called orchestration and involves all the VNFs within the same network slice. The orchestrator is the entity in charge of managing network slices. Hands-on experiments on network slicing are essential to understand its benefits and limits, and to validate design and deployment choices. While some network slicing prototypes have been built for Radio Access Networks (RANs), leveraging the wide availability of radio hardware and open-source software, there is currently no open-source suite for end-to-end network slicing available to the research community. Similarly, orchestration mechanisms must be evaluated as well, to properly validate theoretical solutions addressing diverse aspects such as resource assignment or service composition. This thesis contributes to the study of the evolution of mobile networks towards softwarization and cloudification. We identify software patterns for network function virtualization, including the definition of a novel mobile architecture that squeezes the virtualization architecture by splitting functionality into atomic functions. Then, we design, implement and evaluate an open-source network slicing implementation. Our results show per-slice customization without paying a price in performance, while also providing a slicing implementation to the research community. Moreover, we propose a framework to flexibly re-orchestrate a virtualized network, allowing on-the-fly re-orchestration without disrupting ongoing services. This framework can greatly improve performance under changing conditions. We evaluate the resulting performance in a realistic network slicing setup, showing the feasibility and advantages of flexible re-orchestration.
Lastly, following the re-design of network functions envisioned during the study of the evolution of mobile networks, we present a novel pipeline architecture specifically engineered for 4G/5G physical layers virtualized over clouds. The proposed design pursues two objectives: resiliency to unpredictable computing, and parallelization to increase efficiency on multi-core clouds. To this end, we employ techniques such as tight deadline control, jitter-absorbing buffers, predictive Hybrid Automatic Repeat Request, and congestion control. Our experimental results show that our cloud-native approach attains > 95% of the theoretical spectrum efficiency in hostile environments where state-of-the-art architectures collapse. This work has been supported by IMDEA Networks Institute. Doctoral Programme in Telematic Engineering, Universidad Carlos III de Madrid. Thesis committee: President: Francisco Valera Pintor; Secretary: Vincenzo Sciancalepore; Member: Xenofon Fouka
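    As an illustrative sketch of the orchestration decision discussed above (hypothetical VNF names and host capacities, not the thesis's orchestrator), the following Python snippet greedily embeds the VNF chain of a slice onto the hosts with the most spare capacity:

```python
def place_slice(vnf_chain, hosts):
    """Greedy orchestration sketch: place each VNF of a slice on the host
    with the most spare CPU that can still fit it; return the placement or
    None if the slice cannot be embedded.

    vnf_chain: list of (vnf_name, cpu_demand)
    hosts: dict host_name -> spare CPU cores (mutated on success)
    """
    placement = {}
    for vnf, demand in vnf_chain:
        candidates = [(spare, h) for h, spare in hosts.items() if spare >= demand]
        if not candidates:
            return None                  # not enough resources for this slice
        _, chosen = max(candidates)
        hosts[chosen] -= demand
        placement[vnf] = chosen
    return placement

embb_slice = [("vRAN-CU", 4), ("vEPC-UPF", 2), ("video-cache", 2)]
print(place_slice(embb_slice, {"edge-1": 6, "core-1": 8}))
# -> {'vRAN-CU': 'core-1', 'vEPC-UPF': 'edge-1', 'video-cache': 'edge-1'}
```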

    Navigating the Landscape for Real-time Localisation and Mapping for Robotics, Virtual and Augmented Reality

    Visual understanding of 3D environments in real-time, at low power, is a huge computational challenge. Often referred to as SLAM (Simultaneous Localisation and Mapping), it is central to applications spanning domestic and industrial robotics, autonomous vehicles, and virtual and augmented reality. This paper describes the results of a major research effort to assemble the algorithms, architectures, tools, and systems software needed to enable delivery of SLAM, by supporting application specialists in selecting and configuring the appropriate algorithm, hardware, and compilation pathway to meet their performance, accuracy, and energy-consumption goals. The major contributions we present are (1) tools and methodology for systematic quantitative evaluation of SLAM algorithms, (2) automated, machine-learning-guided exploration of the algorithmic and implementation design space with respect to multiple objectives, (3) end-to-end simulation tools to enable optimisation of heterogeneous, accelerated architectures for the specific algorithmic requirements of the various SLAM algorithmic approaches, and (4) tools for delivering, where appropriate, accelerated, adaptive SLAM solutions in a managed, JIT-compiled, adaptive runtime context. Comment: Proceedings of the IEEE 201
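    As an illustration of the multi-objective design-space exploration in contribution (2), here is a minimal Python sketch (candidate configurations and their numbers are invented) that keeps only the Pareto-optimal SLAM configurations over error, runtime, and energy:

```python
def pareto_front(configs, objectives=("error_m", "runtime_ms", "energy_j")):
    """Keep only configurations not dominated on all (minimized) objectives."""
    def dominates(a, b):
        return (all(a[o] <= b[o] for o in objectives) and
                any(a[o] < b[o] for o in objectives))
    return [c for c in configs
            if not any(dominates(other, c) for other in configs if other is not c)]

candidates = [
    {"name": "kfusion-fast", "error_m": 0.09, "runtime_ms": 12, "energy_j": 0.8},
    {"name": "kfusion-hq",   "error_m": 0.03, "runtime_ms": 45, "energy_j": 2.4},
    {"name": "kfusion-bad",  "error_m": 0.10, "runtime_ms": 50, "energy_j": 2.6},
    {"name": "orb-balanced", "error_m": 0.05, "runtime_ms": 20, "energy_j": 1.1},
]
for c in pareto_front(candidates):
    print(c["name"])     # kfusion-fast, kfusion-hq, orb-balanced
```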

    Quarc: an architecture for efficient on-chip communication

    The exponential downscaling of the feature size has forced a paradigm shift from computation-based design to communication-based design in system on chip development. Buses, the traditional communication architecture in systems on chip, are incapable of addressing the increasing bandwidth requirements of future large systems. Networks on chip have emerged as an interconnection architecture offering unique solutions to the technological and design issues related to communication in future systems on chip. The transition from buses as a shared medium to networks on chip as a segmented medium has given rise to new challenges in the system on chip realm. By leveraging the shared nature of the communication medium, buses have been highly efficient in delivering multicast communication. The segmented nature of networks, however, prevents multicast messages from being delivered as efficiently by networks on chip. Drawing on extensive research on multicast communication in parallel computers, several network on chip architectures offer mechanisms to perform this operation while conforming to the resource constraints of the network on chip paradigm. In the majority of these networks on chip, multicast communication is implemented by establishing a connection between the source and all multicast destinations before message transmission commences. Establishing these connections incurs an overhead and is therefore undesirable, particularly for latency-sensitive services such as cache coherence. To provide high-performance multicast communication, this research presents Quarc, a novel network on chip architecture. The Quarc architecture targets an area-efficient, low-power, high-performance implementation. The thesis covers in detail the building blocks of the architecture, including the topology, router and network interface. A cost and performance comparison of the Quarc architecture against other network on chip architectures reveals that Quarc is a highly efficient architecture. Moreover, the thesis introduces novel performance models of complex traffic patterns, including multicast and quality-of-service-aware communication.
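    As a toy illustration of why connection-less, tree-based multicast can be attractive on a NoC (a generic sketch, not the Quarc mechanism), the following Python code counts the links used to reach a set of destinations on a 2D mesh with XY routing, comparing a shared multicast tree against repeated unicasts:

```python
def xy_path(src, dst):
    """Dimension-ordered (XY) route on a 2D mesh: list of hops src -> dst."""
    x, y = src
    path = [src]
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path

def link_count(src, dests, tree=True):
    """Links used to reach all destinations: a shared tree re-uses common
    prefixes of the XY paths, while unicast repeats them per destination."""
    links = set() if tree else []
    for d in dests:
        hops = xy_path(src, d)
        for a, b in zip(hops, hops[1:]):
            (links.add if tree else links.append)((a, b))
    return len(links)

src, dests = (0, 0), [(2, 0), (2, 1), (2, 3)]
print(link_count(src, dests, tree=True))    # 5 links in the shared tree
print(link_count(src, dests, tree=False))   # 10 link traversals as unicasts
```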