    Query processing on low-energy many-core processors

    Aside from performance, energy efficiency is an increasing challenge in database systems. To tackle both aspects in an integrated fashion, we pursue a hardware/software co-design approach. To meet the energy requirement from the hardware perspective, we use a low-energy processor design that allows us to place hundreds to millions of chips on a single board without any thermal restrictions. We address the performance requirement by developing several database-specific instruction set extensions to customize each core, although not every core provides all extensions. Our hardware foundation is therefore a low-energy processor consisting of a large number of heterogeneous cores. In this paper, we introduce our hardware setup at the system level and present several challenges for query processing. Based on these challenges, we describe two implementation concepts and compare them. Finally, we conclude the paper with some lessons learned and an outlook on our upcoming research directions.

    Many-Core Architectures: Hardware-Software Optimization and Modeling Techniques

    During the last few decades, unprecedented technological growth has been at the center of embedded systems design, with Moore’s Law being the leading factor of this trend. Today an ever-increasing number of cores can be integrated on the same die, marking the transition from state-of-the-art multi-core chips to the new many-core design paradigm. Despite their extraordinarily high computing power, the complexity of many-core chips opens the door to several challenges. As a result of the increased silicon density of modern Systems-on-a-Chip (SoC), the design space exploration needed to find the best design has exploded, and hardware designers face a huge design space. Virtual Platforms have always been used to enable hardware-software co-design, but today they must cope with the huge complexity of both hardware and software systems. In this thesis, two research works on Virtual Platforms are presented. The first is intended for the hardware developer and makes complex cycle-accurate simulations of many-core SoCs easy to set up. The second exploits the parallel computing power of off-the-shelf General Purpose Graphics Processing Units (GPGPUs) to increase simulation speed. In the context of many-core systems, the term virtualization refers not only to the aforementioned hardware emulation tools (Virtual Platforms) but also serves two other main purposes: 1) helping the programmer achieve the maximum possible performance of an application by hiding the complexity of the underlying hardware, and 2) efficiently exploiting the highly parallel hardware of many-core chips in environments with multiple active Virtual Machines. This thesis focuses on virtualization techniques with the goal of mitigating, and overcoming where possible, some of the challenges introduced by the many-core design paradigm.

    Visual Analysis Algorithms for Embedded Systems

    Visual search systems are very popular applications, but on-line versions in 3G wireless environments suffer from network constraints, such as unstable or limited bandwidth, that add latency to query delivery and significantly degrade the user’s experience. An alternative is to exploit the ability of the newest mobile devices to perform heterogeneous activities, that is, not only to create images but also to process them. Visual feature extraction and compression can be performed on on-board Graphics Processing Units (GPUs), making smartphones capable of detecting a generic object exactly (matching) or of classifying it. The latest trends in visual search have resulted in dedicated efforts in MPEG standardization, namely the MPEG CDVS (Compact Descriptor for Visual Search) standard. CDVS is an ISO/IEC standard used to extract a compressed descriptor. As regards classification, in recent years neural networks have gained considerable importance and have been applied to several domains. This thesis focuses on the use of deep neural networks to classify images. Implementing visual search algorithms and deep learning-based classification in embedded environments is not a mere code-porting activity. Recent embedded devices, such as development boards with GPGPUs, offer powerful but limited resources. GPU architectures fit particularly well because they execute many operations in parallel, following the SIMD (Single Instruction Multiple Data) paradigm. Nonetheless, good design choices are necessary to make the best use of the available hardware and memory. For visual search, following the MPEG CDVS standard, the contribution of this thesis is an efficient feature computation phase: a parallel CDVS detector implemented entirely on embedded devices supporting the OpenCL framework. Algorithmic choices and implementation details that target the intrinsic characteristics of the selected embedded platforms are presented and discussed. Experimental results on several GPUs show that the GPU-based solution is up to 7× faster than the CPU-based one. This speed-up opens new visual search scenarios that rely entirely on real-time on-board computation with no data transfer. As regards the use of deep convolutional neural networks for off-line image classification, their computational and memory requirements are huge, which is an issue on embedded devices. Most of the complexity derives from the convolutional layers and in particular from the matrix multiplications they entail. The contribution of this thesis is a self-contained implementation for image classification that provides the common layers used in neural networks. The approach relies on a heterogeneous CPU-GPU scheme for performing convolutions in the transform domain. Experimental results show that the heterogeneous scheme described in this thesis achieves a 50× speedup over the CPU-only reference and outperforms a GPU-based reference by 2×, while reducing power consumption by nearly 30%.
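
    Performing convolutions in the transform domain rests on the convolution theorem: a convolution in the signal domain becomes an element-wise product after a suitable transform. The abstract does not say which transform the thesis uses, so the sketch below assumes a plain discrete Fourier transform on 1-D data and is only meant to show the principle. The names `dft`, `conv_direct`, and `conv_transform` are illustrative and not taken from the thesis.

```python
import cmath


def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform (illustration only)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = 0j
        for t in range(n):
            acc += x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n)
        # The inverse transform carries the 1/n normalization.
        out.append(acc / n if inverse else acc)
    return out


def conv_direct(signal, kernel):
    """Full linear convolution computed directly in the signal domain."""
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for i, s in enumerate(signal):
        for j, k in enumerate(kernel):
            out[i + j] += s * k
    return out


def conv_transform(signal, kernel):
    """Same convolution via transform, element-wise product, inverse transform."""
    n = len(signal) + len(kernel) - 1
    # Zero-pad both inputs so the circular convolution equals the linear one.
    a = list(signal) + [0.0] * (n - len(signal))
    b = list(kernel) + [0.0] * (n - len(kernel))
    product = [p * q for p, q in zip(dft(a), dft(b))]
    return [v.real for v in dft(product, inverse=True)]
```

    Both functions return the same values up to floating-point rounding. A practical implementation would use a fast transform and 2-D data; the point here is only that the expensive sliding-window sum is replaced by a per-element multiplication.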

    An OpenCL Framework for Heterogeneous Clusters

    Doctoral thesis, Seoul National University Graduate School, Department of Electrical and Computer Engineering, August 2013. 이재진 (Jaejin Lee).

    OpenCL is a unified programming model for different types of computational units in a single heterogeneous computing system. OpenCL provides a common hardware abstraction layer across different computational units. Programmers can write OpenCL applications once and run them on any OpenCL-compliant hardware. However, one limitation of current OpenCL is that it is restricted to a programming model on a single operating system image. It does not work for a cluster of multiple nodes unless the programmer explicitly uses communication libraries, such as MPI. A heterogeneous cluster contains multiple general-purpose multicore CPUs and multiple accelerators to solve bigger problems within an acceptable time frame. As such clusters widen their user base, application developers for the clusters are being forced to turn to an unattractive mix of programming models, such as MPI-OpenCL. This makes the application more complex, hard to maintain, and less portable. In this thesis, we propose SnuCL, an OpenCL framework for heterogeneous clusters. We show that the original OpenCL semantics fit the heterogeneous cluster programming environment naturally, and that the framework achieves both high performance and ease of programming. SnuCL presents the user with a single system image of the heterogeneous cluster, as if the cluster ran a single operating system instance. It allows the application to use the compute devices in every compute node as if they were in the host node. No communication API, such as the MPI library, is required in the application source. With SnuCL, an OpenCL application becomes portable not only between heterogeneous devices in a single node, but also between compute devices in the cluster environment. We implement SnuCL and evaluate its performance using eleven OpenCL benchmark applications.

    Contents:
    Abstract
    I. Introduction: I.1 Heterogeneous Computing; I.2 Motivation; I.3 Related Work; I.4 Contributions; I.5 Organization of this Thesis
    II. The OpenCL Architecture: II.1 Platform Model; II.2 Execution Model; II.3 Memory Model; II.4 OpenCL Applications
    III. The SnuCL Framework: III.1 The SnuCL Runtime (III.1.1 Mapping Components; III.1.2 Organization of the SnuCL Runtime; III.1.3 Processing Kernel-execution Commands; III.1.4 Processing Synchronization Commands); III.2 Memory Management (III.2.1 The OpenCL Memory Model; III.2.2 Space Allocation to Buffers; III.2.3 Minimizing Memory Copying Overhead; III.2.4 Processing Memory Commands; III.2.5 Consistency Management; III.2.6 Ease of Programming); III.3 Extensions to OpenCL; III.4 Code Transformations (III.4.1 Detecting Buffers Written by a Kernel; III.4.2 Emulating PEs for CPU Devices; III.4.3 Distributing the Kernel Code)
    IV. Distributed Execution Model for SnuCL: IV.1 Two Problems in SnuCL; IV.2 Remote Device Virtualization (IV.2.1 Exclusive Execution on the Host); IV.3 OpenCL Framework Integration (IV.3.1 OpenCL Installable Client Driver (ICD); IV.3.2 Event Synchronization; IV.3.3 Memory Sharing)
    V. Experimental Results: V.1 SnuCL Evaluation (V.1.1 Methodology; V.1.2 Results); V.2 SnuCL-D Evaluation (V.2.1 Methodology; V.2.2 Results)
    VI. Conclusions and Future Directions: VI.1 Conclusions; VI.2 Future Directions
    Bibliography
    Korean Abstract

    An Efficient Programming Model for OpenMP on a Cluster-Based Many-Core System

    As the complexity of systems-on-chip (SoCs) continues to increase, it is no longer possible to ignore the challenges caused by the convergence of software and hardware development. This includes dealing with hierarchical designs, in which several cores are grouped in clusters or tiles, to ensure low-latency, high-bandwidth local communication by relying on fast local memories. From a programmer’s perspective, it is desirable to make use of these peculiarities of the hardware, which must be clearly and carefully taken into account when designing the support for high-level parallel programming models. This dissertation overcomes many scalability bottlenecks in cluster-based many-core systems and introduces the OpenMP programming model as a means of simplifying application development. OpenMP abstracts the programmer’s view by providing abundant directives that decompose loops in sequential programs and turn them into parallel programs. In this work, the full OpenMP model is implemented on a specific instance of a cluster-based many-core system: the Intel Single-chip Cloud Computer (SCC). The thesis presents a lightweight and highly optimized runtime layer for OpenMP execution, together with a memory model. On top of this runtime layer, the parallel code is generated automatically by a native back-end compiler (GCC 4.6) and linked with the runtime library. The dissertation addresses an efficient design approach to the OpenMP programming model, using the Intel SCC as an example of cluster-based systems. The SCC OpenMP runtime library is designed to cope with three main challenges in a non-cache-coherent system: 1. Executing unmodified legacy OpenMP programs on such a system. 2. Porting the OpenMP memory model to the SCC. 3. Synchronizing parallel threads, which accounts for a sizeable fraction of an application’s execution time. Furthermore, the effectiveness of OpenMP is demonstrated on a set of widely used kernels and real-world applications. An extensive set of experiments shows that this model achieves significant parallel speedups of up to 48x in several applications.
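
    The loop decomposition that OpenMP work-sharing directives perform can be sketched as follows. This is a minimal illustration in Python, not the runtime described in the thesis: OpenMP itself targets C, C++, and Fortran, where the same effect would be requested with a directive such as `#pragma omp parallel for reduction(+:sum)`. The helper names `static_chunks` and `parallel_sum_of_squares` are hypothetical, and the exact partition a real OpenMP runtime chooses for a static schedule without a chunk size is implementation-defined; the even split below is one common choice.

```python
from concurrent.futures import ThreadPoolExecutor


def static_chunks(n_iters, n_threads):
    """Split range(n_iters) into n_threads contiguous, near-equal chunks."""
    base, extra = divmod(n_iters, n_threads)
    chunks = []
    start = 0
    for t in range(n_threads):
        # The first `extra` threads take one additional iteration.
        size = base + (1 if t < extra else 0)
        chunks.append(range(start, start + size))
        start += size
    return chunks


def parallel_sum_of_squares(data, n_threads):
    """Each worker reduces its own chunk; partial results are then combined."""
    def work(chunk):
        return sum(data[i] * data[i] for i in chunk)

    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        partials = list(pool.map(work, static_chunks(len(data), n_threads)))
    return sum(partials)
```

    The sequential loop body is unchanged; only the iteration space is divided and the partial results are reduced at the end, which is why unmodified loops can be parallelized by annotation alone. The final combination step is also where the synchronization cost mentioned in the abstract arises.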

    Investigation into scalable energy and performance models for many-core systems

    PhD Thesis. It is likely that many-core processor systems will continue to penetrate emerging embedded and high-performance applications. Scalable energy and performance models are critical for understanding the conflicting trade-offs between energy and performance as hardware and software complexity grows. Traditional performance models, such as Amdahl’s Law, Gustafson’s Law, and Sun-Ni’s Law, have helped the research community and industry to better understand the bounds on system performance gain with given processing resources, otherwise known as speedup. However, these models and their existing extensions have limited applicability for energy and/or performance-driven system optimization in practical systems. For instance, they are typically based on software characteristics and assume ideal, homogeneous hardware platforms or limited forms of processor heterogeneity. In addition, measuring the speedup and parallelization factors of an application running on a specific hardware platform requires instrumenting the original software code. Practical speedup and parallelizability models of application workloads running on modern heterogeneous hardware are critical for energy and performance models, as they can be used to inform design and control decisions that aim to improve system throughput and energy efficiency. This thesis addresses these limitations by first developing novel and scalable speedup and energy consumption models based on a more general representation of heterogeneity, referred to as normal form heterogeneity. A method is developed whereby standard performance counters found in modern many-core platforms can be used to derive speedup, and therefore the parallelizability of the software, without instrumenting applications. This extends the usability of the new models to scenarios where the parallelizability of software is unknown, potentially enabling Run-Time Management (RTM) of speedup and/or energy efficiency optimization. The models and optimization methods presented in this thesis are validated through extensive experimentation, by running a number of different applications in wide-ranging concurrency scenarios on a number of different homogeneous and heterogeneous Multi/Many Core Processor (M/MCP) systems. These include homogeneous and heterogeneous architectures and range from existing off-the-shelf platforms to potential future system extensions. The practical use of these models and methods is demonstrated through real examples such as studying the effectiveness of the system load balancer. The models and methodologies proposed in this thesis provide guidance on new opportunities for improving the energy efficiency of M/MCP systems. Higher Committee of Education Development (HCED) in Ira
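
    For reference, the three classical speedup models named in the abstract can be written in their textbook homogeneous forms as follows. The notation is an assumption made here for illustration: p is the parallelizable fraction of the workload, n the number of cores, and g(n) the factor by which the parallel workload grows when memory scales with n. These are not the normal form heterogeneity models developed in the thesis.

```latex
\begin{align}
  S_{\text{Amdahl}}(n)    &= \frac{1}{(1-p) + \dfrac{p}{n}} \\
  S_{\text{Gustafson}}(n) &= (1-p) + p\,n \\
  S_{\text{Sun-Ni}}(n)    &= \frac{(1-p) + p\,g(n)}{(1-p) + \dfrac{p\,g(n)}{n}}
\end{align}
% Sun-Ni reduces to Amdahl for g(n) = 1 and to Gustafson for g(n) = n.
% Inverting Amdahl's Law gives the parallel fraction from a measured speedup S on n > 1 cores:
\begin{equation}
  p = \frac{1 - 1/S}{1 - 1/n}
\end{equation}
```

    The last relation shows why a measured speedup is enough to estimate parallelizability in the homogeneous case: once S is known for a given n, p follows directly without inspecting the application code.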

    Efficient and portable multi-tasking for heterogeneous systems

    Modern computing systems comprise heterogeneous designs that combine multiple and diverse architectures in a single system. These designs offer the potential for high performance under reduced power requirements but require advanced resource management and workload scheduling across the available processors. Programmability frameworks, such as OpenCL and CUDA, enable resource management and workload scheduling on heterogeneous systems. These frameworks assign full control of resource allocation and scheduling to the application. This design serves the needs of dedicated application systems well but introduces significant challenges for multi-tasking environments where multiple users and applications compete for access to system resources. This thesis considers these challenges and presents three major contributions that enable efficient multi-tasking on heterogeneous systems. The presented contributions are compatible with existing systems, remain portable across vendors, and do not require application changes or recompilation. The first contribution of this thesis is an optimization technique that reduces host-device communication overhead for OpenCL applications. It does this without modification or recompilation of the application source code and is portable across platforms. This work enables efficiency and performance improvements for the diverse application workloads found on multi-tasking systems. The second contribution is the design and implementation of a secure, user-space virtualization layer that integrates the accelerator resources of a system with the standard multi-tasking and user-space virtualization facilities of the commodity Linux OS. It enables fine-grained sharing of mixed-vendor accelerator resources, targets heterogeneous systems found in data center nodes, and requires no modification to the OS, OpenCL, or the application. Lastly, the third contribution is a technique and software infrastructure that enable resource sharing control on accelerators while supporting software-managed scheduling on accelerators. The infrastructure remains transparent to existing systems and applications and requires no modifications or recompilation. It enforces fair accelerator sharing, which is required for multi-tasking purposes.

    Resource Allocation for Software Pipelines in Many-core Systems

    Many-core systems integrate a growing number of cores on a single chip and are expected to integrate hundreds and even thousands of cores soon. Despite their massive processing power, it is crucial to employ their resources efficiently to benefit from parallel processing. This dissertation tackles a major challenge, resource allocation, for complex, memory-intensive applications. The proposed methods significantly improve performance over the state of the art in many scenarios.

    Enhancing productivity and performance portability of OpenCL applications on heterogeneous systems using runtime optimizations

    Initially driven by a strong need for increased computational performance in science and engineering, heterogeneous systems have become ubiquitous and are getting increasingly complex. The single-processor era has given way to multi-core processors, which have quickly been surrounded by satellite devices aiming to increase the throughput of the entire system. These auxiliary devices, such as Graphics Processing Units, Field Programmable Gate Arrays, or other specialized processors, have very different architectures. This puts an enormous strain on programming models and on software developers who want to take full advantage of the computing power at hand. Because of this diversity, and because the flexibility and portability needed to optimize for each target individually are unachievable, heterogeneous systems typically remain vastly under-utilized. In this thesis, we explore two distinct ways to tackle this problem. Providing automated, non-intrusive methods in the form of compiler tools and implementing efficient abstractions that automatically tune parameters for a restricted domain are two complementary approaches investigated to better utilize compute resources in heterogeneous systems. First, we explore a fully automated compiler-based approach, where a runtime system analyzes the computation flow of an OpenCL application and optimizes it across multiple compute kernels. This method can be deployed on any existing application transparently and replaces the significant software engineering effort spent tuning an application for a particular system. We show that this technique achieves speedups of up to 3x over unoptimized code and an average of 1.4x over manually optimized code for highly dynamic applications. Second, a library-based approach is designed to provide a high-level abstraction for complex problems in a specific domain, stencil computation. Using domain-specific techniques, the underlying framework optimizes the code aggressively. We show that even in a restricted domain, automatic tuning mechanisms and robust architectural abstraction are necessary to improve performance. Using the abstraction layer, we demonstrate strong scaling of various applications to multiple GPUs, with a speedup of up to 1.9x on two GPUs and 3.6x on four.
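
    The strong-scaling figures quoted above can be put in perspective by converting them to parallel efficiency, that is, speedup divided by the number of devices. The short sketch below uses only the two data points reported in the abstract; the names `REPORTED_SPEEDUP` and `parallel_efficiency` are illustrative.

```python
# Best-case speedups reported in the abstract, keyed by number of GPUs.
REPORTED_SPEEDUP = {2: 1.9, 4: 3.6}


def parallel_efficiency(speedup, n_devices):
    """Fraction of ideal linear scaling that is actually achieved."""
    return speedup / n_devices


for n_gpus, speedup in REPORTED_SPEEDUP.items():
    eff = parallel_efficiency(speedup, n_gpus)
    print(f"{n_gpus} GPUs: speedup {speedup}x, efficiency {eff:.0%}")
```

    An efficiency close to 1 means the added devices are almost fully used. With these numbers the efficiency is about 95% on two GPUs and about 90% on four.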

    Reducing adaptive optics latency using many-core processors

    Atmospheric turbulence reduces the achievable resolution of ground-based optical telescopes. Adaptive optics systems attempt to mitigate the impact of this turbulence and are required to update their corrections quickly and deterministically (i.e. in real time). The technological challenges faced by the future extremely large telescopes (ELTs) and their associated instruments are considerable. A simple extrapolation of current systems to the ELT scale is not sufficient. My thesis work consisted of identifying and examining new many-core technologies for accelerating the adaptive optics real-time control loop. I investigated the Mellanox TILE-Gx36 and the Intel Xeon Phi (5110p). The TILE-Gx36, with 4x10 GbE ports and 36 processing cores, is a good candidate for fast computation of the wavefront sensor images. The Intel Xeon Phi, with 60 processing cores and high memory bandwidth, is particularly well suited to accelerating the wavefront reconstruction. Through extensive testing I have shown that the TILE-Gx can provide the performance required for the wavefront processing units of the ELT first-light instruments. The Intel Xeon Phi (Knights Corner), while providing good overall performance, does not have the required determinism. We believe that the next generation of Xeon Phi (Knights Landing) will provide the necessary determinism and increased performance. In this thesis, we show that by using currently available novel many-core processors it is possible to reach the performance required for ELT instruments.