96 research outputs found

    Design Space Exploration and Resource Management of Multi/Many-Core Systems

    Get PDF
    The increasing demand of processing a higher number of applications and related data on computing platforms has resulted in reliance on multi-/many-core chips as they facilitate parallel processing. However, there is a desire for these platforms to be energy-efficient and reliable, and they need to perform secure computations for the interest of the whole community. This book provides perspectives on the aforementioned aspects from leading researchers in terms of state-of-the-art contributions and upcoming trends

    Hierarchical and Adaptive Filter and Refinement Algorithms for Geometric Intersection Computations on GPU

    Get PDF
    Geometric intersection algorithms are fundamental in spatial analysis in Geographic Information System (GIS). This dissertation explores high performance computing solution for geometric intersection on a huge amount of spatial data using Graphics Processing Unit (GPU). We have developed a hierarchical filter and refinement system for parallel geometric intersection operations involving large polygons and polylines by extending the classical filter and refine algorithm using efficient filters that leverage GPU computing. The inputs are two layers of large polygonal datasets and the computations are spatial intersection on pairs of cross-layer polygons. These intersections are the compute-intensive spatial data analytic kernels in spatial join and map overlay operations in spatial databases and GIS. Efficient filters, such as PolySketch, PolySketch++ and Point-in-polygon filters have been developed to reduce refinement workload on GPUs. We also showed the application of such filters in speeding-up line segment intersections and point-in-polygon tests. Programming models like CUDA and OpenACC have been used to implement the different versions of the Hierarchical Filter and Refine (HiFiRe) system. Experimental results show good performance of our filter and refinement algorithms. Compared to standard R-tree filter, on average, our filter technique can still discard 76% of polygon pairs which do not have segment intersection points. PolySketch filter reduces on average 99.77% of the workload of finding line segment intersections. Compared to existing Common Minimum Bounding Rectangle (CMBR) filter that is applied on each cross-layer candidate pair, the workload after using PolySketch-based CMBR filter is on average 98% smaller. The execution time of our HiFiRe system on two shapefiles, namely USA Water Bodies (contains 464K polygons) and USA Block Group Boundaries (contains 220K polygons), is about 3.38 seconds using NVidia Titan V GPU

    Modelli e strumenti di programmazione parallela per piattaforme many-core

    Get PDF
    The negotiation between power consumption, performance, programmability, and portability drives all computing industry designs, in particular the mobile and embedded systems domains. Two design paradigms have proven particularly promising in this context: architectural heterogeneity and many-core processors. Parallel programming models are key to effectively harness the computational power of heterogeneous many-core SoC. This thesis presents a set of techniques and HW/SW extensions that enable performance improvements and that simplify programmability for heterogeneous many-core platforms. The thesis contributions cover vertically the entire software stack for many-core platforms, from hardware abstraction layers running on top of bare-metal, to programming models; from hardware extensions for efficient parallelism support to middleware that enables optimized resource management within many-core platforms. First, we present mechanisms to decrease parallelism overheads on parallel programming runtimes for many-core platforms, targeting fine-grain parallelism. Second, we present programming model support that enables the offload of computational kernels within heterogeneous many-core systems. Third, we present a novel approach to dynamically sharing and managing many-core platforms when multiple applications coded with different programming models execute concurrently. All these contributions were validated using STMicroelectronics STHORM, a real embodiment of a state-of-the-art many-core system. Hardware extensions and architectural explorations were explored using VirtualSoC, a SystemC based cycle-accurate simulator of many-core platforms

    Efficient Synchronization for GPGPU

    Get PDF
    High-performance General Purpose Graphics processing units (GPGPUs) have exposed bottlenecks in synchronizations of threads and cores. The massively parallel computing cores and complex hierarchies of threads present new challenges for synchronizations at different granularities. Performance of GPU is hindered by inefficient global and local synchronizations. I propose hardware-software cooperative frameworks for efficient synchronization of GPGPU to address the following issues. To provide efficient global synchronization (Gsync), an API with direct hardware support is proposed. The GPU cores are synchronized by an on-chip Gsync controller. Partial context switch is employed to guarantee deadlock-free execution. The proposed Gsync avoids expensive API calls and alleviates data thrashing. Prioritized warp scheduling is used to increase the overlap of context switch with kernel execution. To efficiently exploit the inherent parallelism of producer-consumer problems, a flexible wait-signal scheme is proposed at thread-block level. I propose dedicated APIs to express fine-grained static and dynamic dependencies with hardware support. The proposed scheme can accelerate wavefront, graph and machine learning applications. The architectural design of on-chip wait-signal controller eliminates busy wait loop and long-latency memory operations. I also propose thread block dispatch scheduling to address the problem of load imbalance and large context switch overhead. To reduce stall due to synchronizations, a synchronization-aware warp scheduling is proposed to coordinate multiple warp schedulers upon synchronization events. Both performance and hardware utilization are improved by resolving the barrier sooner

    NFV Platforms: Taxonomy, Design Choices and Future Challenges

    Get PDF
    Due to the intrinsically inefficient service provisioning in traditional networks, Network Function Virtualization (NFV) keeps gaining attention from both industry and academia. By replacing the purpose-built, expensive, proprietary network equipment with software network functions consolidated on commodity hardware, NFV envisions a shift towards a more agile and open service provisioning paradigm. During the last few years, a large number of NFV platforms have been implemented in production environments that typically face critical challenges, including the development, deployment, and management of Virtual Network Functions (VNFs). Nonetheless, just like any complex system, such platforms commonly consist of abounding software and hardware components and usually incorporate disparate design choices based on distinct motivations or use cases. This broad collection of convoluted alternatives makes it extremely arduous for network operators to make proper choices. Although numerous efforts have been devoted to investigating different aspects of NFV, none of them specifically focused on NFV platforms or attempted to explore their design space. In this paper, we present a comprehensive survey on the NFV platform design. Our study solely targets existing NFV platform implementations. We begin with a top-down architectural view of the standard reference NFV platform and present our taxonomy of existing NFV platforms based on what features they provide in terms of a typical network function life cycle. Then we thoroughly explore the design space and elaborate on the implementation choices each platform opts for. We also envision future challenges for NFV platform design in the incoming 5G era. We believe that our study gives a detailed guideline for network operators or service providers to choose the most appropriate NFV platform based on their respective requirements. Our work also provides guidelines for implementing new NFV platforms

    Doctor of Philosophy

    Get PDF
    dissertationAs the base of the software stack, system-level software is expected to provide ecient and scalable storage, communication, security and resource management functionalities. However, there are many computationally expensive functionalities at the system level, such as encryption, packet inspection, and error correction. All of these require substantial computing power. What's more, today's application workloads have entered gigabyte and terabyte scales, which demand even more computing power. To solve the rapidly increased computing power demand at the system level, this dissertation proposes using parallel graphics pro- cessing units (GPUs) in system software. GPUs excel at parallel computing, and also have a much faster development trend in parallel performance than central processing units (CPUs). However, system-level software has been originally designed to be latency-oriented. GPUs are designed for long-running computation and large-scale data processing, which are throughput-oriented. Such mismatch makes it dicult to t the system-level software with the GPUs. This dissertation presents generic principles of system-level GPU computing developed during the process of creating our two general frameworks for integrating GPU computing in storage and network packet processing. The principles are generic design techniques and abstractions to deal with common system-level GPU computing challenges. Those principles have been evaluated in concrete cases including storage and network packet processing applications that have been augmented with GPU computing. The signicant performance improvement found in the evaluation shows the eectiveness and eciency of the proposed techniques and abstractions. This dissertation also presents a literature survey of the relatively young system-level GPU computing area, to introduce the state of the art in both applications and techniques, and also their future potentials

    Distributed Shared Memory based Live VM Migration

    Get PDF
    Cloud computing is the new trend in computing services and IT industry, this computing paradigm has numerous benefits to utilize IT infrastructure resources and reduce services cost. The key feature of cloud computing depends on mobility and scalability of the computing resources, by managing virtual machines. The virtualization decouples the software from the hardware and manages the software and hardware resources in an easy way without interruption of services. Live virtual machine migration is an essential tool for dynamic resource management in current data centers. Live virtual machine is defined as the process of moving a running virtual machine or application between different physical machines without disconnecting the client or application. Many techniques have been developed to achieve this goal based on several metrics (total migration time, downtime, size of data sent and application performance) that are used to measure the performance of live migration. These metrics measure the quality of the VM services that clients care about, because the main goal of clients is keeping the applications performance with minimum service interruption. The pre-copy live VM migration is done in four phases: preparation, iterative migration, stop and copy, and resume and commitment. During the preparation phase, the source and destination physical servers are selected, the resources in destination physical server are reserved, and the critical VM is selected to be migrated. The cloud manager responsibility is to make all of these decisions. VM state migration takes place and memory state is transferred to the target node during iterative migration phase. Meanwhile, the migrated VM continues to execute and dirties its memory. In the stop and copy phase, VM virtual CPU is stopped and then the processor and network states are transferred to the destination host. Service downtime results from stopping VM execution and moving the VM CPU and network states. Finally in the resume and commitment phase, the migrated VM is resumed running in the destination physical host, the remaining memory pages are pulled by destination machine from the source machine. The source machine resources are released and eliminated. In this thesis, pre-copy live VM migration using Distributed Shared Memory (DSM) computing model is proposed. The setup is built using two identical computation nodes to construct all the proposed environment services architecture namely the virtualization infrastructure (Xenserver6.2 hypervisor), the shared storage server (the network file system), and the DSM and High Performance Computing (HPC) cluster. The custom DSM framework is based on a low latency memory update named Grappa. Moreover, HPC cluster is used to parallelize the work load by using CPUs computation nodes. HPC cluster employs OPENMPI and MPI libraries to support parallelization and auto-parallelization. The DSM allows the cluster CPUs to access the same memory space pages resulting in less memory data updates, which reduces the amount of data transferred through the network. The thesis proposed model achieves a good enhancement of the live VM migration metrics. Downtime is reduced by 50 % in the idle workload of Windows VM and 66.6% in case of Ubuntu Linux idle workload. In general, the proposed model not only reduces the downtime and the total amount of data sent, but also does not degrade other metrics like the total migration time and the applications performance
    corecore