54 research outputs found

    Improving time predictability of shared hardware resources in real-time multicore systems : emphasis on the space domain

    Get PDF
    Critical Real-Time Embedded Systems (CRTES) follow a verification and validation process on the timing and functional correctness. This process includes the timing analysis that provides Worst-Case Execution Time (WCET) estimates to provide evidence that the execution time of the system, or parts of it, remain within the deadlines. A key design principle for CRTES is the incremental qualification, whereby each software component can be subject to verification and validation independently of any other component, with obvious benefits for cost. At timing level, this requires time composability, such that the timing behavior of a function is not affected by other functions. CRTES are experiencing an unprecedented growth with rising performance demands that have motivated the use of multicore architectures. Multicores can provide the performance required and bring the potential of integrating several software functions onto the same hardware. However, multicore contention in the access to shared hardware resources creates a dependence of the execution time of a task with the rest of the tasks running simultaneously. This dependence threatens time predictability and jeopardizes time composability. In this thesis we analyze and propose hardware solutions to be applied on current multicore designs for CRTES to improve time predictability and time composability, focusing on the on-chip bus and the memory controller. At hardware level, we propose new bus and memory controller designs that control and mitigate contention between different cores and allow to have time composability by design, also in the context of mixed-criticality systems. At analysis level, we propose contention prediction models that factor the impact of contenders and don¿t need modifications to the hardware. We also propose a set of Performance Monitoring Counters (PMC) that provide evidence about the contention. We give an special emphasis on the Space domain focusing on the Cobham Gaisler NGMP multicore processor, which is currently assessed by the European Space Agency for its future missions.Los Sistemas Críticos Empotrados de Tiempo Real (CRTES) siguen un proceso de verificación y validación para su correctitud funcional y temporal. Este proceso incluye el análisis temporal que proporciona estimaciones de el peor caso del tiempo de ejecución (WCET) para dar evidencia de que el tiempo de ejecución del sistema, o partes de él, permanecen dentro de los límites temporales. Un principio de diseño clave para los CRTES es la cualificación incremental, por la que cada componente de software puede ser verificado y validado independientemente del resto de componentes, con beneficios obvios para el coste. A nivel temporal, esto requiere composabilidad temporal, por la que el comportamiento temporal de una función no se ve afectado por otras funciones. CRTES están experimentando un crecimiento sin precedentes con crecientes demandas de rendimiento que han motivado el uso the arquitecturas multi-núcleo (multicore). Los procesadores multi-núcleo pueden proporcionar el rendimiento requerido y tienen el potencial de integrar varias funcionalidades software en el mismo hardware. A pesar de ello, la interferencia entre los diferentes núcleos que aparece en los recursos compartidos de os procesadores multi núcleo crea una dependencia del tiempo de ejecución de una tarea con el resto de tareas ejecutándose simultáneamente en el procesador. Esta dependencia amenaza la predictabilidad temporal y compromete la composabilidad temporal. En esta tésis analizamos y proponemos soluciones hardware para ser aplicadas en los diseños multi núcleo actuales para CRTES que mejoran la predictabilidad y composabilidad temporal, centrándose en el bus y el controlador de memoria internos al chip. A nivel de hardware, proponemos nuevos diseños de buses y controladores de memoria que controlan y mitigan la interferencia entre los diferentes núcleos y permiten tener composabilidad temporal por diseño, también en el contexto de sistemas de criticalidad mixta. A nivel de análisis, proponemos modelos de predicción de la interferencia que factorizan el impacto de los núcleos y no necesitan modificaciones hardware. También proponemos un conjunto de Contadores de Control del Rendimiento (PMC) que proporcionoan evidencia de la interferencia. En esta tésis, damós especial importancia al dominio espacial, centrándonos en el procesador mutli núcleo Cobham Gaisler NGMP, que está siendo actualmente evaluado por la Agencia Espacial Europea para sus futuras misiones

    Seismic Wave Propagation Simulations on Low-power and Performance-centric Manycores

    Get PDF
    International audienceThe large processing requirements of seismic wave propagation simulations make High Performance Computing (HPC) architectures a natural choice for their execution. However, to keep both the current pace of performance improvements and the power consumption under a strict power budget, HPC systems must be more energy e than ever. As a response to this need, energy-e and low-power processors began to make their way into the market. In this paper we employ a novel low-power processor, the MPPA-256 manycore, to perform seismic wave propagation simulations. It has 256 cores connected by a NoC, no cache-coherence and only a limited amount of on-chip memory. We describe how its particular architectural characteristics influenced our solution for an energy-e implementation. As a counterpoint to the low-power MPPA-256 architecture, we employ Xeon Phi, a performance-centric manycore. Although both processors share some architectural similarities, the challenges to implement an e seismic wave propagation kernel on these platforms are very di↵erent. In this work we compare the performance and energy e of our implementations for these processors to proven and optimized solutions for other hardware platforms such as general-purpose processors and a GPU. Our experimental results show that MPPA-256 has the best energy e consuming at least 77 % less energy than the other evaluated platforms, whereas the performance of our solution for the Xeon Phi is on par with a state-of-the-art solution for GPUs

    Paving the way towards high-level parallel pattern interfaces for data stream processing

    Get PDF
    The emergence of the Internet of Things (IoT) data stream applications has posed a number of new challenges to existing infrastructures, processing engines, and programming models. In this sense, high-level interfaces, encapsulating algorithmic aspects in pattern-based constructions, have considerably reduced the development and parallelization efforts of this type of applications. An example of parallel pattern interface is GrPPI, a C++ generic high-level library that acts as a layer between developers and existing parallel programming frameworks, such as C++ threads, OpenMP and Intel TBB. In this paper, we complement the basic patterns supported by GrPPI with the new stream operators Split-Join and Window, and the advanced parallel patterns Stream-Pool, Windowed-Farm and Stream-Iterator for the aforementioned back ends. Thanks to these new stream operators, complex compositions among streaming patterns can be expressed. On the other hand, the collection of advanced patterns allows users to tackle some domain-specific applications, ranging from the evolutionary to the real-time computing areas, where compositions of basic patterns are not capable of fully mimicking the algorithmic behavior of their original sequential codes. The experimental evaluation of the new advanced patterns and the stream operators on a set of domain-specific use-cases, using different back ends and pattern-specific parameters, reports considerable performance gains with respect to the sequential versions. Additionally, we demonstrate the benefits of the GrPPI pattern interface from the usability, flexibility and readability points of view.This work was partially supported by the EU project ICT 644235 “RePhrase: REfactoring Parallel Heterogeneous Resource-Aware Applications” and the project TIN2013-41350-P “Scalable Data Management Techniques for High-End Computing Systems” from the Ministerio de Economía y Competitividad, Spai

    Compiler techniques for scalable performance of stream programs on multicore architectures

    Get PDF
    Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2010.Cataloged from PDF version of thesis.Includes bibliographical references (p. 211-222).Given the ubiquity of multicore processors, there is an acute need to enable the development of scalable parallel applications without unduly burdening programmers. Currently, programmers are asked not only to explicitly expose parallelism but also concern themselves with issues of granularity, load-balancing, synchronization, and communication. This thesis demonstrates that when algorithmic parallelism is expressed in the form of a stream program, a compiler can effectively and automatically manage the parallelism. Our compiler assumes responsibility for low-level architectural details, transforming implicit algorithmic parallelism into a mapping that achieves scalable parallel performance for a given multicore target. Stream programming is characterized by regular processing of sequences of data, and it is a natural expression of algorithms in the areas of audio, video, digital signal processing, networking, and encryption. Streaming computation is represented as a graph of independent computation nodes that communicate explicitly over data channels. Our techniques operate on contiguous regions of the stream graph where the input and output rates of the nodes are statically determinable. Within a static region, the compiler first automatically adjusts the granularity and then exploits data, task, and pipeline parallelism in a holistic fashion. We introduce techniques that data-parallelize nodes that operate on overlapping sliding windows of their input, translating serializing state into minimal and parametrized inter-core communication. Finally, for nodes that cannot be data-parallelized due to state, we are the first to automatically apply software-pipelining techniques at a coarse granularity to exploit pipeline parallelism between stateful nodes. Our framework is evaluated in the context of the StreamIt programming language. StreamIt is a high-level stream programming language that has been shown to improve programmer productivity in implementing streaming algorithms. We employ the StreamIt Core benchmark suite of 12 real-world applications to demonstrate the effectiveness of our techniques for varying multicore architectures. For a 16-core distributed memory multicore, we achieve a 14.9x mean speedup. For benchmarks that include sliding-window computation, our sliding-window data-parallelization techniques are required to enable scalable performance for a 16-core SMP multicore (14x mean speedup) and a 64-core distributed shared memory multicore (52x mean speedup).by Michael I. Gordon.Ph.D

    Towards Scalable Synchronization on Multi-Cores

    Get PDF
    The shift of commodity hardware from single- to multi-core processors in the early 2000s compelled software developers to take advantage of the available parallelism of multi-cores. Unfortunately, only few---so-called embarrassingly parallel---applications can leverage this available parallelism in a straightforward manner. The remaining---non-embarrassingly parallel---applications require that their processes coordinate their possibly interleaved executions to ensure overall correctness---they require synchronization. Synchronization is achieved by constraining or even prohibiting parallel execution. Thus, per Amdahl's law, synchronization limits software scalability. In this dissertation, we explore how to minimize the effects of synchronization on software scalability. We show that scalability of synchronization is mainly a property of the underlying hardware. This means that synchronization directly hampers the cross-platform performance portability of concurrent software. Nevertheless, we can achieve portability without sacrificing performance, by creating design patterns and abstractions, which implicitly leverage hardware details without exposing them to software developers. We first perform an exhaustive analysis of the performance behavior of synchronization on several modern platforms. This analysis clearly shows that the performance and scalability of synchronization are highly dependent on the characteristics of the underlying platform. We then focus on lock-based synchronization and analyze the energy/performance trade-offs of various waiting techniques. We show that the performance and the energy efficiency of locks go hand in hand on modern x86 multi-cores. This correlation is again due to the characteristics of the hardware that does not provide practical tools for reducing the power consumption of locks without sacrificing throughput. We then propose two approaches for developing portable and scalable concurrent software, hence hiding the limitations that the underlying multi-cores impose. First, we introduce OPTIK, a new practical design pattern for designing and implementing fast and scalable concurrent data structures. We illustrate the power of our OPTIK pattern by devising five new algorithms and by optimizing four state-of-the-art algorithms for linked lists, skip lists, hash tables, and queues. Second, we introduce MCTOP, a multi-core topology abstraction which includes low-level information, such as memory bandwidths. MCTOP enables developers to accurately and portably define high-level optimization policies. We illustrate several such policies through four examples, including automated backoff schemes for locks, and illustrate the performance and portability of these policies on five platforms

    MetaFork: A Compilation Framework for Concurrency Models Targeting Hardware Accelerators

    Get PDF
    Parallel programming is gaining ground in various domains due to the tremendous computational power that it brings; however, it also requires a substantial code crafting effort to achieve performance improvement. Unfortunately, in most cases, performance tuning has to be accomplished manually by programmers. We argue that automated tuning is necessary due to the combination of the following factors. First, code optimization is machine-dependent. That is, optimization preferred on one machine may be not suitable for another machine. Second, as the possible optimization search space increases, manually finding an optimized configuration is hard. Therefore, developing new compiler techniques for optimizing applications is of considerable interest. This thesis aims at generating new techniques that will help programmers develop efficient algorithms and code targeting hardware acceleration technologies, in a more effective manner. Our work is organized around a compilation framework, called MetaFork, for concurrency platforms and its application to automatic parallelization. MetaFork is a high-level programming language extending C/C++, which combines several models of concurrency including fork-join, SIMD and pipelining parallelism. MetaFork is also a compilation framework which aims at facilitating the design and implementation of concurrent programs through four key features which make MetaFork unique and novel: (1) Perform automatic code translation between concurrency platforms targeting multi-core architectures. (2) Provide a high-level language for expressing concurrency as in the fork-join model, the SIMD paradigm and the pipelining parallelism. (3) Generate parallel code from serial code with an emphasis on code depending on machine or program parameters (e.g. cache size, number of processors, number of threads per thread block). (4) Optimize code depending on parameters that are unknown at compile-time

    Network simulation for professional audio networks

    Get PDF
    Audio Engineers are required to design and deploy large multi-channel sound systems which meet a set of requirements and use networking technologies such as Firewire and Ethernet AVB. Bandwidth utilisation and parameter groupings are among the factors which need to be considered in these designs. An implementation of an extensible, generic simulation framework would allow audio engineers to easily compare protocols and networking technologies and get near real time responses with regards to bandwidth utilisation. Our hypothesis is that an application-level capability can be developed which uses a network simulation framework to enable this process and enhances the audio engineer’s experience of designing and configuring a network. This thesis presents a new, extensible simulation framework which can be utilised to simulate professional audio networks. This framework is utilised to develop an application - AudioNetSim - based on the requirements of an audio engineer. The thesis describes the AudioNetSim models and implementations for Ethernet AVB, Firewire and the AES- 64 control protocol. AudioNetSim enables bandwidth usage determination for any network configuration and connection scenario and is used to compare Firewire and Ethernet AVB bandwidth utilisation. It also applies graph theory to the circular join problem and provides a solution to detect circular joins

    Bridging the gap between OpenMP and task-based runtime systems for the fast multipole method

    Get PDF
    International audienceWith the advent of complex modern architectures, the low-level paradigms long considered sufficient to build High Performance Computing (HPC) numerical codes have met their limits. Achieving efficiency, ensuring portability, while preserving programming tractability on such hardware prompted the HPC community to design new, higher level paradigms while relying on runtime systems to maintain performance. However, the common weakness of these projects is to deeply tie applications to specific expert-only runtime system APIs. The OpenMP specification, which aims at providing common parallel programming means for shared-memory platforms, appears as a good candidate to address this issue thanks to the latest task-based constructs introduced in its revision 4.0. The goal of this paper is to assess the effectiveness and limits of this support for designing a high-performance numerical library, ScalFMM, implementing the fast multipole method (FMM) that we have deeply redesigned with respect to the most advanced features provided by OpenMP 4. We show that OpenMP 4 allows for significant performance improvements over previous OpenMP revisions on recent multicore processors and that extensions to the 4.0 standard allow for strongly improving the performance, bridging the gap with the very high performance that was so far reserved to expert-only runtime system APIs
    corecore