565 research outputs found

    Termination detection for fine-grained message-passing architectures

    Get PDF
    Barrier primitives provided by standard parallel programming APIs are the primary means by which applications implement global synchronisation. Typically these primitives are fully-committed to synchronisation in the sense that, once a barrier is entered, synchronisation is the only way out. For message-passing applications, this raises the question of what happens when a message arrives at a thread that already resides in a barrier. Without a satisfactory answer, barriers do not interact with message-passing in any useful way. In this paper, we propose a new refutable barrier primitive that combines with message-passing to form a simple, expressive, efficient, well-defined API. It has a clear semantics based on termination detection, and supports the development of both globally-synchronous and asynchronous parallel applications. To evaluate the new primitive, we implement it in a prototype large-scale message-passing machine with 49,152 RISC-V threads distributed over 48 FPGAs. We show that hardware support for the primitive leads to a highly-efficient implementation, capable of synchronisation rates that are an order-of-magnitude higher than what is achievable in software. Using the primitive, we implement synchronous and asynchronous versions of a range of applications, observing that each version can have significant advantages over the other, depending on the application. Therefore, a barrier primitive supporting both styles can greatly assist the development of parallel programs.Funded by EPSRC grant EP/N031768/1 (POETS project

    Operating System Kernels on Multi-core Architectures

    Get PDF
    Operating System (OS) kernels have been under research and development for decades, mainly assuming single processor and distributed hardware systems. With the recent rise of multi-core chips that may incorporate a network on chip (NoC), new challenges have appeared that were not considered before. Given that a complete multi-core system that works on a single system on chip (SoC) is now the normal case, different cores on a single SoC may share other physical resources and data. This new sharing scheme on a SoC affects crucial aspects of an overall system like correctness, performance, predictability, scalability and security. Both hardware and OSs to flexibly cooperate in order to provide solutions for such challenges. SoC mimics the internet somehow now, with different cores acting as computer nodes, and the network medium is given in an advanced digital fabrics like buses or NoCs, that are a current research area. However, OSs are still assuming some (hardware) features like single physical memory and memory sharing for inter-process communication, page-based protection, cache operations, even when evolving from uniprocessor to multi-core processors. Such features not only may degrade performance and other system aspects, but also some of them make no sense for a multi-core SoC, and introduce some barriers and limitations. While new OS research is considering different kernel designs to cope up with multi-core systems, they are still limited by the current commercial hardware architectures. The objective of this thesis is to assess different kernel designs and implementations on multi-core hardware architectures. Part of the contributions of the thesis is porting RTEMS (RTOS) and seL4 microkernel to Epiphany and RISC-V hardware architectures respectively, trading-off the design and implementation decisions. This hands-on experience gave a better understanding of the real-world challenges regarding kernel designs and implementations

    An FPGA implementation of an investigative many-core processor, Fynbos : in support of a Fortran autoparallelising software pipeline

    Get PDF
    Includes bibliographical references.In light of the power, memory, ILP, and utilisation walls facing the computing industry, this work examines the hypothetical many-core approach to finding greater compute performance and efficiency. In order to achieve greater efficiency in an environment in which Moore’s law continues but TDP has been capped, a means of deriving performance from dark and dim silicon is needed. The many-core hypothesis is one approach to exploiting these available transistors efficiently. As understood in this work, it involves trading in hardware control complexity for hundreds to thousands of parallel simple processing elements, and operating at a clock speed sufficiently low as to allow the efficiency gains of near threshold voltage operation. Performance is there- fore dependant on exploiting a new degree of fine-grained parallelism such as is currently only found in GPGPUs, but in a manner that is not as restrictive in application domain range. While removing the complex control hardware of traditional CPUs provides space for more arithmetic hardware, a basic level of control is still required. For a number of reasons this work chooses to replace this control largely with static scheduling. This pushes the burden of control primarily to the software and specifically the compiler, rather not to the programmer or to an application specific means of control simplification. An existing legacy tool chain capable of autoparallelising sequential Fortran code to the degree of parallelism necessary for many-core exists. This work implements a many-core architecture to match it. Prototyping the design on an FPGA, it is possible to examine the real world performance of the compiler-architecture system to a greater degree than simulation only would allow. Comparing theoretical peak performance and real performance in a case study application, the system is found to be more efficient than any other reviewed, but to also significantly under perform relative to current competing architectures. This failing is apportioned to taking the need for simple hardware too far, and an inability to implement static scheduling mitigating tactics due to lack of support for such in the compiler

    Programming models for many-core architectures: a co-design approach

    Get PDF
    Common many-core processors contain tens of cores and distributed memory. Compared to a multicore system, which only has a few tightly coupled cores sharing a single bus and memory, several complex problems arise. Notably, many cores require many parallel tasks to fully utilize the cores, and communication happens in a distributed and decentralized way. Therefore, programming such a processor requires the application to exhibit concurrency. In contrast to a single-core application, a concurrent application has to deal with memory state changes with an observable (non-deterministic) intermediate state. The complexity introduced by these problems makes programming a many-core system with a single-core-based programming approach notoriously hard.\ud \ud The central concept of this thesis is that abstractions, which are related to (many-core) programming, are structured in a single platform model. A platform is a layered view of the hardware, a memory model, a concurrency model, a model of computation, and compile-time and run-time tooling. Then, a programming model is a specific view on this platform, which is used by a programmer. In this view, some details can be hidden from the programmer's perspective, some details cannot. For example, an operating system presents an infinite number of parallel virtual execution units to the application whilst it hides details regarding scheduling. On the other hand, a programmer usually has balance workload among threads by hand.\ud \ud This thesis presents modifications to different abstraction layers of a many-core architecture, in order to make the system as a whole more efficient, and to reduce the programming complexity. These modifications influence other abstractions in the platform, and especially the programming model. Therefore, this thesis applies co-design on all models. Notably, co-design of the memory model, concurrency model, and model of computation is required for a scalable implementation of lambda-calculus. Moreover, only the combination of requirements of the many-core hardware from one side and the concurrency model from the other leads to a memory model abstraction. Hence, this thesis shows that to cope with the current trends in many-core architectures from a programming perspective, it is essential and feasible to inspect and adapt all abstractions collectively

    Preserving and promoting the historical cultural heritage of rural communities in West Bengal and Bangladesh: Report on a workshop held in Kolkata Saturday 29 April 2017

    Get PDF
    This report is the product of a workshop held in Kolkata, West Bengal, in April 2017, in which academic experts met to discuss the range of issues relating to the promotion and preservation of historical cultural heritage in the cross-border region that is West Bengal and Bangladesh. The objective of the discussions was to provide the evidence-base for future calls under the Global Challenges Research Fund (GCRF). The workshop also enabled the strengthening of existing relationships with the University of Glasgow’s partners in India, and the foundation of new relationships with others in West Bengal and Bangladesh, in order to establish a consortium to underpin a larger GCRF project proposal

    Productive Programming Systems for Heterogeneous Supercomputers

    Get PDF
    The majority of today's scientific and data analytics workloads are still run on relatively energy inefficient, heavyweight, general-purpose processing cores, often referred to in the literature as latency-oriented architectures. The flexibility of these architectures and the programmer aids included (e.g. large and deep cache hierarchies, branch prediction logic, pre-fetch logic) makes them flexible enough to run a wide range of applications fast. However, we have started to see growth in the use of lightweight, simpler, energy-efficient, and functionally constrained cores. These architectures are commonly referred to as throughput-oriented. Within each shared memory node, the computational backbone of future throughput-oriented HPC machines will consist of large pools of lightweight cores. The first wave of throughput-oriented computing came in the mid 2000's with the use of GPUs for general-purpose and scientific computing. Today we are entering the second wave of throughput-oriented computing, with the introduction of NVIDIA Pascal GPUs, Intel Knights Landing Xeon Phi processors, the Epiphany Co-Processor, the Sunway MPP, and other throughput-oriented architectures that enable pre-exascale computing. However, while the majority of the FLOPS in designs for future HPC systems come from throughput-oriented architectures, they are still commonly paired with latency-oriented cores which handle management functions and lightweight/un-parallelizable computational kernels. Hence, most future HPC machines will be heterogeneous in their processing cores. However, the heterogeneity of future machines will not be limited to the processing elements. Indeed, heterogeneity will also exist in the storage, networking, memory, and software stacks of future supercomputers. As a result, it will be necessary to combine many different programming models and libraries in a single application. How to do so in a programmable and well-performing manner is an open research question. This thesis addresses this question using two approaches. First, we explore using managed runtimes on HPC platforms. As a result of their high-level programming models, these managed runtimes have a long history of supporting data analytics workloads on commodity hardware, but often come with overheads which make them less common in the HPC domain. Managed runtimes are also not supported natively on throughput-oriented architectures. Second, we explore the use of a modular programming model and work-stealing runtime to compose the programming and scheduling of multiple third-party HPC libraries. This approach leverages existing investment in HPC libraries, unifies the scheduling of work on a platform, and is designed to quickly support new programming model and runtime extensions. In support of these two approaches, this thesis also makes novel contributions in tooling for future supercomputers. We demonstrate the value of checkpoints as a software development tool on current and future HPC machines, and present novel techniques in performance prediction across heterogeneous cores

    Piattaforme multicore e integrazione tri-dimensionale: analisi architetturale e ottimizzazione

    Get PDF
    Modern embedded systems embrace many-core shared-memory designs. Due to constrained power and area budgets, most of them feature software-managed scratchpad memories instead of data caches to increase the data locality. It is therefore programmers’ responsibility to explicitly manage the memory transfers, and this make programming these platform cumbersome. Moreover, complex modern applications must be adequately parallelized before they can the parallel potential of the platform into actual performance. To support this, programming languages were proposed, which work at a high level of abstraction, and rely on a runtime whose cost hinders performance, especially in embedded systems, where resources and power budget are constrained. This dissertation explores the applicability of the shared-memory paradigm on modern many-core systems, focusing on the ease-of-programming. It focuses on OpenMP, the de-facto standard for shared memory programming. In a first part, the cost of algorithms for synchronization and data partitioning are analyzed, and they are adapted to modern embedded many-cores. Then, the original design of an OpenMP runtime library is presented, which supports complex forms of parallelism such as multi-level and irregular parallelism. In the second part of the thesis, the focus is on heterogeneous systems, where hardware accelerators are coupled to (many-)cores to implement key functional kernels with orders-of-magnitude of speedup and energy efficiency compared to the “pure software” version. However, three main issues rise, namely i) platform design complexity, ii) architectural scalability and iii) programmability. To tackle them, a template for a generic hardware processing unit (HWPU) is proposed, which share the memory banks with cores, and the template for a scalable architecture is shown, which integrates them through the shared-memory system. Then, a full software stack and toolchain are developed to support platform design and to let programmers exploiting the accelerators of the platform. The OpenMP frontend is extended to interact with it.I sistemi integrati moderni sono architetture many-core, in cui spesso lo spazio di memoria è condiviso fra i processori. Per ridurre i consumi, molte di queste architetture sostituiscono le cache dati con memorie scratchpad gestite in software, per massimizzarne la località alle CPU e aumentare le performance. Questo significa che i dati devono essere spostati manualmente da parte del programmatore. Inoltre, tradurre in perfomance l’enorme parallelismo potenziale delle piattaforme many-core non è semplice. Per supportare la programmazione, diversi programming model sono stati proposti, e siccome lavorano ad un alto livello di astrazione, sfruttano delle librerie di runtime che forniscono servizi di base quali sincronizzazione, allocazione della memoria, threading. Queste librerie hanno un costo, che nei sistemi integrati è troppo elevato e ostacola il raggiungimento delle piene performance. Questa tesi analizza come un programming model ad alto livello di astrazione – OpenMP – possa essere efficientemente supportato, se il suo stack software viene adattato per sfruttare al meglio la piattaforma sottostante. In una prima parte, studio diversi meccanismi di sincronizzazione e comunicazione fra thread paralleli, portati sulle piattaforme many-core. In seguito, li utilizzo per scrivere un runtime di supporto a OpenMP che sia il più possibile efficente e “leggero” e che supporti paradigmi di parallelismo multi-livello e irregolare, spesso presenti nelle applicazioni moderne. Una seconda parte della tesi esplora le architetture eterogenee, ossia con acceleratori hardware. Queste architetture soffrono di problematiche sia i) per il processo di design della piattaforma, che ii) di scalabilità della piattaforma stessa (aumento del numero degli acceleratori e dei processori), che iii) di programmabilità. La tesi propone delle soluzioni a tutti e tre i problemi. Il linguaggio di programmazione usato è OpenMP, sia per la sua grande espressività a livello semantico, sia perché è lo standard de-facto per programmare sistemi a memoria condivisa

    RA-LPEL: A Resource-Aware Light-Weight Parallel Execution Layer for Reactive Stream Processing Networks on The SCC Many-core Tiled Architecture

    Get PDF
    In computing the available computing power has continuously fallen short of the demanded computing performance. As a consequence, performance improvement has been the main focus of processor design. However, due to the phenomenon called “Power Wall” it has become infeasible to build faster processors by just increasing the processor’s clock speed. One of the resulting trends in hardware design is to integrate several simple and power-efficient cores on the same chip. This design shift poses challenges of its own. In the past, with increasing clock frequency the programs became automatically faster as well without modifications. This is no longer true with many-core architectures. To achieve maximum performance the programs have to run concurrently on more than one core, which forces the general computing paradigm to become increasingly parallel to leverage maximum processing power. In this thesis, we will focus on the Reactive Stream Program (RSP). In stream processing, the system consists of computing nodes, which are connected via communication streams. These streams simplify the concurrency management on modern many-core architectures due to their implicit synchronisation. RSP is a stream processing system that implements the reactive system. The RSPs work in tandem with their environment and the load imposed by the environment may vary over time. This provides a unique opportunity to increase performance per watt. In this thesis the research contribution focuses on the design of the execution layer to run RSPs on tiled many-core architectures, using the Intel’s Single-chip Cloud Computer (SCC) processor as a concrete experimentation platform. Further, we have developed a Dynamic Voltage and Frequency Scaling (DVFS) technique for RSP deployed on many-core architectures. In contrast to many other approaches, our DVFS technique does not require the capability of controlling the power settings of individual computing elements, thus making it applicable for modern many-core architectures, with which power can be changed only for power islands. The experimental results confirm that the proposed DVFS technique can effectively improve the energy efficiency, i.e. increase the performance per watt, for RSPs
    • …
    corecore