990 research outputs found

    Exploiting heterogeneity in Chip-Multiprocessor Design

    Get PDF
    In the past decade, semiconductor manufacturers are persistent in building faster and smaller transistors in order to boost the processor performance as projected by Moore’s Law. Recently, as we enter the deep submicron regime, continuing the same processor development pace becomes an increasingly difficult issue due to constraints on power, temperature, and the scalability of transistors. To overcome these challenges, researchers propose several innovations at both architecture and device levels that are able to partially solve the problems. These diversities in processor architecture and manufacturing materials provide solutions to continuing Moore’s Law by effectively exploiting the heterogeneity, however, they also introduce a set of unprecedented challenges that have been rarely addressed in prior works. In this dissertation, we present a series of in-depth studies to comprehensively investigate the design and optimization of future multi-core and many-core platforms through exploiting heteroge-neities. First, we explore a large design space of heterogeneous chip multiprocessors by exploiting the architectural- and device-level heterogeneities, aiming to identify the optimal design patterns leading to attractive energy- and cost-efficiencies in the pre-silicon stage. After this high-level study, we pay specific attention to the architectural asymmetry, aiming at developing a heterogeneity-aware task scheduler to optimize the energy-efficiency on a given single-ISA heterogeneous multi-processor. An advanced statistical tool is employed to facilitate the algorithm development. In the third study, we shift our concentration to the device-level heterogeneity and propose to effectively leverage the advantages provided by different materials to solve the increasingly important reliability issue for future processors

    An automated OpenCL FPGA compilation framework targeting a configurable, VLIW chip multiprocessor

    Get PDF
    Modern system-on-chips augment their baseline CPU with coprocessors and accelerators to increase overall computational capacity and power efficiency, and thus have evolved into heterogeneous systems. Several languages have been developed to enable this paradigm shift, including CUDA and OpenCL. This thesis discusses a unified compilation environment to enable heterogeneous system design through the use of OpenCL and a customised VLIW chip multiprocessor (CMP) architecture, known as the LE1. An LLVM compilation framework was researched and a prototype developed to enable the execution of OpenCL applications on the LE1 CPU. The framework fully automates the compilation flow and supports work-item coalescing to better utilise the CPU cores and alleviate the effects of thread divergence. This thesis discusses in detail both the software stack and target hardware architecture and evaluates the scalability of the proposed framework on a highly precise cycle-accurate simulator. This is achieved through the execution of 12 benchmarks across 240 different machine configurations, as well as further results utilising an incomplete development branch of the compiler. It is shown that the problems generally scale well with the LE1 architecture, up to eight cores, when the memory system becomes a serious bottleneck. Results demonstrate superlinear performance on certain benchmarks (x9 for the bitonic sort benchmark with 8 dual-issue cores) with further improvements from compiler optimisations (x14 for bitonic with the same configuration

    A Hierarchical Dataflow-Driven Heterogeneous Architecture for Wireless Baseband Processing

    Full text link
    Wireless baseband processing (WBP) is a key element of wireless communications, with a series of signal processing modules to improve data throughput and counter channel fading. Conventional hardware solutions, such as digital signal processors (DSPs) and more recently, graphic processing units (GPUs), provide various degrees of parallelism, yet they both fail to take into account the cyclical and consecutive character of WBP. Furthermore, the large amount of data in WBPs cannot be processed quickly in symmetric multiprocessors (SMPs) due to the unpredictability of memory latency. To address this issue, we propose a hierarchical dataflow-driven architecture to accelerate WBP. A pack-and-ship approach is presented under a non-uniform memory access (NUMA) architecture to allow the subordinate tiles to operate in a bundled access and execute manner. We also propose a multi-level dataflow model and the related scheduling scheme to manage and allocate the heterogeneous hardware resources. Experiment results demonstrate that our prototype achieves 2×2\times and 2.3×2.3\times speedup in terms of normalized throughput and single-tile clock cycles compared with GPU and DSP counterparts in several critical WBP benchmarks. Additionally, a link-level throughput of 288288 Mbps can be achieved with a 4545-core configuration.Comment: 7 pages, 7 figures, conferenc

    Collaborative Heterogeneity-Aware OS Scheduler for Asymmetric Multicore Processors

    Get PDF
    Funding: This work is supported in part by the China Postdoctoral Science Foundation (Grant No. 2020TQ0169), the ShuiMu Tsinghua Scholar fellowship (2019SM131), National Key R&D Program of China (2020AAA0105200), National Natural Science Foundation of China (U20A20226), Beijing Natural Science Foundation (4202031), Beijing Academy of Artificial Intelligence BAAI), the UK EPSRC grants Discovery: Pattern Discovery and Program Shaping for Manycore Systems (EP/P020631/1). This work is also supported by the Royal Academy of Engineering under the Research Fellowship scheme.Asymmetric multicore processors (AMP) offer multiple types of cores under the same programming interface. Extracting the full potential of AMPs requires intelligent scheduling decisions, matching each thread with the right kind of core, the core that will maximize performance or minimize wasted energy for this thread. Existing OS schedulers are not up to this task. While they may handle certain aspects of asymmetry in the system, none can handle all runtime factors affecting AMPs for the general case of multi-threaded multi-programmed workloads. We address this problem by introducing COLAB, a general purpose asymmetry-aware scheduler targeting multi-threaded multi-programmed workloads. It estimates the performance and power of each thread on each type of core and identifies communication patterns and bottleneck threads. With this information, the scheduler makes coordinated core assignment and thread selection decisions that still provide each application its fair share of the processor’s time. We evaluate our approach using both the GEM5 simulator on four distinct big.LITTLE configurations and a development board with ARM Cortex-A73/A53 processors and mixed workloads composed of PARSEC and SPLASH2 benchmarks. Compared to the state-of-the art Linux CFS and AMP-aware schedulers, we demonstrate performance gains of up to 25% and 5% to 15% on average,together with an average 5% energy saving depending on the hardware setup.PostprintPeer reviewe

    Energy Aware Runtime Systems for Elastic Stream Processing Platforms

    Get PDF
    Following an invariant growth in the required computational performance of processors, the multicore revolution started around 20 years ago. This revolution was mainly an answer to power dissipation constraints restricting the increase of clock frequency in single-core processors. The multicore revolution not only brought in the challenge of parallel programming, i.e. being able to develop software exploiting the entire capabilities of manycore architectures, but also the challenge of programming heterogeneous platforms. The question of “on which processing element to map a specific computational unit?”, is well known in the embedded community. With the introduction of general-purpose graphics processing units (GPGPUs), digital signal processors (DSPs) along with many-core processors on different system-on-chip platforms, heterogeneous parallel platforms are nowadays widespread over several domains, from consumer devices to media processing platforms for telecom operators. Finding mapping together with a suitable hardware architecture is a process called design-space exploration. This process is very challenging in heterogeneous many-core architectures, which promise to offer benefits in terms of energy efficiency. The main problem is the exponential explosion of space exploration. With the recent trend of increasing levels of heterogeneity in the chip, selecting the parameters to take into account when mapping software to hardware is still an open research topic in the embedded area. For example, the current Linux scheduler has poor performance when mapping tasks to computing elements available in hardware. The only metric considered is CPU workload, which as was shown in recent work does not match true performance demands from the applications. Doing so may produce an incorrect allocation of resources, resulting in a waste of energy. The origin of this research work comes from the observation that these approaches do not provide full support for the dynamic behavior of stream processing applications, especially if these behaviors are established only at runtime. This research will contribute to the general goal of developing energy-efficient solutions to design streaming applications on heterogeneous and parallel hardware platforms. Streaming applications are nowadays widely spread in the software domain. Their distinctive characiteristic is the retrieving of multiple streams of data and the need to process them in real time. The proposed work will develop new approaches to address the challenging problem of efficient runtime coordination of dynamic applications, focusing on energy and performance management.Efter en oföränderlig tillväxt i prestandakrav hos processorer, började den flerkärniga processor-revolutionen för ungefär 20 år sedan. Denna revolution skedde till största del som en lösning till begränsningar i energieffekten allt eftersom klockfrekvensen kontinuerligt höjdes i en-kärniga processorer. Den flerkärniga processor-revolutionen medförde inte enbart utmaningen gällande parallellprogrammering, m.a.o. förmågan att utveckla mjukvara som använder sig av alla delelement i de flerkärniga processorerna, men också utmaningen med programmering av heterogena plattformar. Frågeställningen ”på vilken processorelement skall en viss beräkning utföras?” är väl känt inom ramen för inbyggda datorsystem. Efter introduktionen av grafikprocessorer för allmänna beräkningar (GPGPU), signalprocesserings-processorer (DSP) samt flerkärniga processorer på olika system-on-chip plattformar, är heterogena parallella plattformar idag omfattande inom många domäner, från konsumtionsartiklar till mediaprocesseringsplattformar för telekommunikationsoperatörer. Processen att placera beräkningarna på en passande hårdvaruplattform kallas för utforskning av en designrymd (design-space exploration). Denna process är mycket utmanande för heterogena flerkärniga arkitekturer, och kan medföra fördelar när det gäller energieffektivitet. Det största problemet är att de olika valmöjligheterna i designrymden kan växa exponentiellt. Enligt den nuvarande trenden som förespår ökad heterogeniska aspekter i processorerna är utmaningen att hitta den mest passande placeringen av beräkningarna på hårdvaran ännu en forskningsfråga inom ramen för inbyggda datorsystem. Till exempel, den nuvarande schemaläggaren i Linux operativsystemet är inkapabel att hitta en effektiv placering av beräkningarna på den underliggande hårdvaran. Det enda mätsättet som används är processorns belastning vilket, som visats i tidigare forskning, inte motsvarar den verkliga prestandan i applikationen. Användning av detta mätsätt vid resursallokering resulterar i slöseri med energi. Denna forskning härstammar från observationerna att dessa tillvägagångssätt inte stöder det dynamiska beteendet hos ström-processeringsapplikationer (stream processing applications), speciellt om beteendena bara etableras vid körtid. Denna forskning kontribuerar till det allmänna målet att utveckla energieffektiva lösningar för ström-applikationer (streaming applications) på heterogena flerkärniga hårdvaruplattformar. Ström-applikationer är numera mycket vanliga i mjukvarudomän. Deras distinkta karaktär är inläsning av flertalet dataströmmar, och behov av att processera dem i realtid. Arbetet i denna forskning understöder utvecklingen av nya sätt för att lösa det utmanade problemet att effektivt koordinera dynamiska applikationer i realtid och fokus på energi- och prestandahantering

    A RECONFIGURABLE AND EXTENSIBLE EXPLORATION PLATFORM FOR FUTURE HETEROGENEOUS SYSTEMS

    Get PDF
    Accelerator-based -or heterogeneous- computing has become increasingly important in a variety of scenarios, ranging from High-Performance Computing (HPC) to embedded systems. While most solutions use sometimes custom-made components, most of today’s systems rely on commodity highend CPUs and/or GPU devices, which deliver adequate performance while ensuring programmability, productivity, and application portability. Unfortunately, pure general-purpose hardware is affected by inherently limited power-efficiency, that is, low GFLOPS-per-Watt, now considered as a primary metric. The many-core model and architectural customization can play here a key role, as they enable unprecedented levels of power-efficiency compared to CPUs/GPUs. However, such paradigms are still immature and deeper exploration is indispensable. This dissertation investigates customizability and proposes novel solutions for heterogeneous architectures, focusing on mechanisms related to coherence and network-on-chip (NoC). First, the work presents a non-coherent scratchpad memory with a configurable bank remapping system to reduce bank conflicts. The experimental results show the benefits of both using a customizable hardware bank remapping function and non-coherent memories for some types of algorithms. Next, we demonstrate how a distributed synchronization master better suits many-cores than standard centralized solutions. This solution, inspired by the directory-based coherence mechanism, supports concurrent synchronizations without relying on memory transactions. The results collected for different NoC sizes provided indications about the area overheads incurred by our solution and demonstrated the benefits of using a dedicated hardware synchronization support. Finally, this dissertation proposes an advanced coherence subsystem, based on the sparse directory approach, with a selective coherence maintenance system which allows coherence to be deactivated for blocks that do not require it. Experimental results show that the use of a hybrid coherent and non-coherent architectural mechanism along with an extended coherence protocol can enhance performance. The above results were all collected by means of a modular and customizable heterogeneous many-core system developed to support the exploration of power-efficient high-performance computing architectures. The system is based on a NoC and a customizable GPU-like accelerator core, as well as a reconfigurable coherence subsystem, ensuring application-specific configuration capabilities. All the explored solutions were evaluated on this real heterogeneous system, which comes along with the above methodological results as part of the contribution in this dissertation. In fact, as a key benefit, the experimental platform enables users to integrate novel hardware/software solutions on a full-system scale, whereas existing platforms do not always support a comprehensive heterogeneous architecture exploration

    Heterogeneity-aware scheduling and data partitioning for system performance acceleration

    Get PDF
    Over the past decade, heterogeneous processors and accelerators have become increasingly prevalent in modern computing systems. Compared with previous homogeneous parallel machines, the hardware heterogeneity in modern systems provides new opportunities and challenges for performance acceleration. Classic operating systems optimisation problems such as task scheduling, and application-specific optimisation techniques such as the adaptive data partitioning of parallel algorithms, are both required to work together to address hardware heterogeneity. Significant effort has been invested in this problem, but either focuses on a specific type of heterogeneous systems or algorithm, or a high-level framework without insight into the difference in heterogeneity between different types of system. A general software framework is required, which can not only be adapted to multiple types of systems and workloads, but is also equipped with the techniques to address a variety of hardware heterogeneity. This thesis presents approaches to design general heterogeneity-aware software frameworks for system performance acceleration. It covers a wide variety of systems, including an OS scheduler targeting on-chip asymmetric multi-core processors (AMPs) on mobile devices, a hierarchical many-core supercomputer and multi-FPGA systems for high performance computing (HPC) centers. Considering heterogeneity from on-chip AMPs, such as thread criticality, core sensitivity, and relative fairness, it suggests a collaborative based approach to co-design the task selector and core allocator on OS scheduler. Considering the typical sources of heterogeneity in HPC systems, such as the memory hierarchy, bandwidth limitations and asymmetric physical connection, it proposes an application-specific automatic data partitioning method for a modern supercomputer, and a topological-ranking heuristic based schedule for a multi-FPGA based reconfigurable cluster. Experiments on both a full system simulator (GEM5) and real systems (Sunway Taihulight Supercomputer and Xilinx Multi-FPGA based clusters) demonstrate the significant advantages of the suggested approaches compared against the state-of-the-art on variety of workloads."This work is supported by St Leonards 7th Century Scholarship and Computer Science PhD funding from University of St Andrews; by UK EPSRC grant Discovery: Pattern Discovery and Program Shaping for Manycore Systems (EP/P020631/1)." -- Acknowledgement

    Parallel architectures and runtime systems co-design for task-based programming models

    Get PDF
    The increasing parallelism levels in modern computing systems has extolled the need for a holistic vision when designing multiprocessor architectures taking in account the needs of the programming models and applications. Nowadays, system design consists of several layers on top of each other from the architecture up to the application software. Although this design allows to do a separation of concerns where it is possible to independently change layers due to a well-known interface between them, it is hampering future systems design as the Law of Moore reaches to an end. Current performance improvements on computer architecture are driven by the shrinkage of the transistor channel width, allowing faster and more power efficient chips to be made. However, technology is reaching physical limitations were the transistor size will not be able to be reduced furthermore and requires a change of paradigm in systems design. This thesis proposes to break this layered design, and advocates for a system where the architecture and the programming model runtime system are able to exchange information towards a common goal, improve performance and reduce power consumption. By making the architecture aware of runtime information such as a Task Dependency Graph (TDG) in the case of dataflow task-based programming models, it is possible to improve power consumption by exploiting the critical path of the graph. Moreover, the architecture can provide hardware support to create such a graph in order to reduce the runtime overheads and making possible the execution of fine-grained tasks to increase the available parallelism. Finally, the current status of inter-node communication primitives can be exposed to the runtime system in order to perform a more efficient communication scheduling, and also creates new opportunities of computation and communication overlap that were not possible before. An evaluation of the proposals introduced in this thesis is provided and a methodology to simulate and characterize the application behavior is also presented.El aumento del paralelismo proporcionado por los sistemas de cómputo modernos ha provocado la necesidad de una visión holística en el diseño de arquitecturas multiprocesador que tome en cuenta las necesidades de los modelos de programación y las aplicaciones. Hoy en día el diseño de los computadores consiste en diferentes capas de abstracción con una interfaz bien definida entre ellas. Las limitaciones de esta aproximación junto con el fin de la ley de Moore limitan el potencial de los futuros computadores. La mayoría de las mejoras actuales en el diseño de los computadores provienen fundamentalmente de la reducción del tamaño del canal del transistor, lo cual permite chips más rápidos y con un consumo eficiente sin apenas cambios fundamentales en el diseño de la arquitectura. Sin embargo, la tecnología actual está alcanzando limitaciones físicas donde no será posible reducir el tamaño de los transistores motivando así un cambio de paradigma en la construcción de los computadores. Esta tesis propone romper este diseño en capas y abogar por un sistema donde la arquitectura y el sistema de tiempo de ejecución del modelo de programación sean capaces de intercambiar información para alcanzar una meta común: La mejora del rendimiento y la reducción del consumo energético. Haciendo que la arquitectura sea consciente de la información disponible en el modelo de programación, como puede ser el grafo de dependencias entre tareas en los modelos de programación dataflow, es posible reducir el consumo energético explotando el camino critico del grafo. Además, la arquitectura puede proveer de soporte hardware para crear este grafo con el objetivo de reducir el overhead de construir este grado cuando la granularidad de las tareas es demasiado fina. Finalmente, el estado de las comunicaciones entre nodos puede ser expuesto al sistema de tiempo de ejecución para realizar una mejor planificación de las comunicaciones y creando nuevas oportunidades de solapamiento entre cómputo y comunicación que no eran posibles anteriormente. Esta tesis aporta una evaluación de todas estas propuestas, así como una metodología para simular y caracterizar el comportamiento de las aplicacionesPostprint (published version

    Multi-tasking scheduling for heterogeneous systems

    Get PDF
    Heterogeneous platforms play an increasingly important role in modern computer systems. They combine high performance with low power consumption. From mobiles to supercomputers, we see an increasing number of computer systems that are heterogeneous. The most well-known heterogeneous system, CPU+GPU platforms have been widely used in recent years. As they become more mainstream, serving multiple tasks from multiple users is an emerging challenge. A good scheduler can greatly improve performance. However, indiscriminately allocating tasks based on availability leads to poor performance. As modern GPUs have a large number of hardware resources, most tasks cannot efficiently utilize all of them. Concurrent task execution on GPU is a promising solution, however, indiscriminately running tasks in parallel causes a slowdown. This thesis focuses on scheduling OpenCL kernels. A runtime framework is developed to determine where to schedule OpenCL kernels. It predicts the best-fit device by using a machine learning-based classifier, then schedules the kernels accordingly to either CPU or GPU. To improve GPU utilization, a kernel merging approach is proposed. Kernels are merged if their predicted co-execution can provide better performance than sequential execution. A machine learning based classifier is developed to find the best kernel pairs for co-execution on GPU. Finally, a runtime framework is developed to schedule kernels separately on either CPU or GPU, and run kernels in pairs if their co-execution can improve performance. The approaches developed in this thesis significantly improve system performance and outperform all existing techniques
    corecore