29 research outputs found

    TurboMGNN : improving concurrent GNN training tasks on GPU with fine-grained kernel fusion

    Graph Neural Networks (GNNs) have evolved into powerful models for graph representation learning. Many works have been proposed to support efficient GNN training on GPUs. However, these works focus only on a single GNN training task, addressing aspects such as operator optimization, task scheduling, and the programming model. Concurrent GNN training, which is needed in applications such as neural network structure search, has not been explored yet. This work aims to improve the training efficiency of concurrent GNN training tasks on GPU by developing fine-grained methods to fuse the kernels from different tasks. Specifically, we propose a fine-grained Sparse Matrix Multiplication (SpMM) based kernel fusion method to eliminate redundant accesses to graph data. To increase the fusion opportunity and reduce the synchronization cost, we further propose a novel technique to enable the fusion of the kernels in forward and backward propagation. Finally, to reduce the resource contention caused by the increased number of concurrent, heterogeneous GNN training tasks, we propose an adaptive strategy to group the tasks and match their operators according to resource contention. We have conducted extensive experiments, including kernel- and model-level benchmarks. The results show that the proposed methods can achieve up to 2.6X performance speedup.
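The idea of fusing SpMM work from concurrent tasks that share one graph can be illustrated on the CPU. The sketch below is not TurboMGNN's GPU kernel; it assumes a CSR graph layout and two hypothetical tasks (`feats_a`, `feats_b`), and only shows why stacking their feature matrices lets a single pass over the edges serve both tasks.

```python
# Hypothetical sketch: names and data layout are illustrative, not TurboMGNN's API.

def spmm_csr(indptr, indices, values, dense):
    """Multiply a CSR sparse matrix by a dense matrix (list of rows)."""
    n_cols = len(dense[0])
    out = [[0.0] * n_cols for _ in range(len(indptr) - 1)]
    for row in range(len(indptr) - 1):
        for k in range(indptr[row], indptr[row + 1]):
            col, val = indices[k], values[k]  # one read of the graph edge
            for j in range(n_cols):
                out[row][j] += val * dense[col][j]
    return out


def fused_spmm(indptr, indices, values, feats_a, feats_b):
    """Serve two tasks with one pass over the shared graph structure."""
    width_a = len(feats_a[0])
    # Concatenate the two tasks' feature matrices column-wise.
    stacked = [ra + rb for ra, rb in zip(feats_a, feats_b)]
    fused = spmm_csr(indptr, indices, values, stacked)
    # Split the fused output back into per-task results.
    return [r[:width_a] for r in fused], [r[width_a:] for r in fused]


# 3-node graph with adjacency rows [[0,1,1],[1,0,0],[1,0,0]] in CSR form.
indptr, indices, values = [0, 2, 3, 4], [1, 2, 0, 0], [1.0, 1.0, 1.0, 1.0]
feats_a = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # task A: 2 features per node
feats_b = [[1.0], [10.0], [100.0]]              # task B: 1 feature per node

out_a, out_b = fused_spmm(indptr, indices, values, feats_a, feats_b)
```

Running the two tasks separately would read `indptr`, `indices`, and `values` twice; the fused call reads them once and produces the same per-task outputs.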

    Mammoth : gearing Hadoop towards memory-intensive MapReduce applications

    The MapReduce platform has recently been widely used for large-scale data processing and analysis. It works well if the hardware of a cluster is well configured. However, our survey has indicated that common hardware configurations in small and medium-size enterprises may not be suitable for such tasks. This situation is more challenging for memory-constrained systems, in which memory is a bottleneck resource compared with CPU power and thus does not meet the needs of large-scale data processing. According to our survey, the traditional high performance computing (HPC) system is an example of a memory-constrained system. In this paper, we have developed Mammoth, a new MapReduce system, which aims to improve MapReduce performance using global memory management. In Mammoth, we design a novel rule-based heuristic to prioritize memory allocation and revocation among execution units (mapper, shuffler, reducer, etc.), to maximize the holistic benefits of the Map/Reduce job when scheduling each memory unit. We have also developed a multi-threaded execution engine, which is based on Hadoop but runs in a single JVM on a node. In the execution engine, we have implemented the memory scheduling algorithm to realize global memory management, based on which we further developed techniques such as sequential disk accessing, multi-cache and shuffling from memory, and solved the problem of full garbage collection in the JVM. We have conducted extensive experiments with comparison against the native Hadoop platform. The results show that the Mammoth system can reduce the job execution time by more than 40% in typical cases, without requiring any modifications of the Hadoop programs. When a system is short of memory, Mammoth can improve the performance by up to 5.19 times, as observed for I/O intensive applications, such as PageRank. Given the growing importance of supporting large-scale data processing and analysis and the proven success of the MapReduce platform, the Mammoth system has promising potential and impact.

    Dynamic Processor Resource Configuration in Virtualized Environments

    Virtualization can provide significant benefits in data centers, such as dynamic resource configuration and live virtual machine migration. Services are deployed in virtual machines (VMs) and resource utilization can be greatly improved. In this paper, we present VScheduler, a system that dynamically adjusts the processor resource configuration of virtual machines, including the amount of virtual resource and a new mapping of virtual machines to physical nodes. VScheduler implements a two-level resource configuration scheme: local resource configuration (LRC) for an individual virtual machine and global resource configuration (GRC) for a whole cluster or data center. In particular, GRC takes the variation tendency of workloads into account when remapping virtual machines to physical nodes. We implement our techniques in Xen and conduct a detailed evaluation using RUBiS and dbench. The experimental results show that VScheduler not only satisfies the resource demands of services, but also reduces the number of virtual machine migrations, which can provide a stable VM distribution on physical nodes in data centers.

    A Real-Time Scheduling Framework Based on Multi-core Dynamic Partitioning in Virtualized Environment

    Part 2: Parallel and Multi-Core Technologies. With the prevalence of virtualization and cloud computing, many real-time applications are running in virtualized cloud environments. However, their performance cannot be guaranteed because current hypervisors’ CPU schedulers aim to share CPU resources fairly and improve system throughput. They do not consider the real-time constraints of these applications, which results in frequent deadline misses. In this paper, we present a real-time scheduling framework for virtualized environments. In the framework, we propose a mechanism called multi-core dynamic partitioning to divide physical CPUs (PCPUs) into two pools dynamically according to the scheduling parameters of real-time virtual machines (RT-VMs). We apply different schedulers to these pools to schedule RT-VMs and non-RT-VMs respectively. In addition, we design a global earliest deadline first (vGEDF) scheduler to schedule RT-VMs. We implement a prototype in the Xen hypervisor and conduct experiments to verify its effectiveness.
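A rough illustration of the two mechanisms named in this abstract, dynamic partitioning and global EDF, is given below. The abstract does not state the partitioning rule or the RT-VM parameters, so the `budget_ms`/`period_ms` fields, the utilization-based pool sizing, and all names are assumptions made only for this sketch. Sizing the pool by the ceiling of total utilization is a simple heuristic, not a schedulability test.

```python
import math
from dataclasses import dataclass


@dataclass
class RtVm:
    name: str
    budget_ms: int      # CPU time needed per period (assumed parameter)
    period_ms: int      # period, used here as the relative deadline
    next_deadline: int  # absolute deadline of the current job, in ms


def rt_pool_size(vms, total_pcpus):
    """Size the real-time pool from total RT utilization (assumed rule)."""
    utilization = sum(vm.budget_ms / vm.period_ms for vm in vms)
    return min(total_pcpus, math.ceil(utilization))


def gedf_pick(vms, pool_size):
    """Global EDF: run the pool_size VMs with the earliest deadlines."""
    return sorted(vms, key=lambda vm: vm.next_deadline)[:pool_size]


vms = [
    RtVm("vm-a", budget_ms=5, period_ms=10, next_deadline=10),
    RtVm("vm-b", budget_ms=10, period_ms=20, next_deadline=20),
    RtVm("vm-c", budget_ms=6, period_ms=8, next_deadline=8),
]
pool = rt_pool_size(vms, total_pcpus=4)  # utilization 0.5 + 0.5 + 0.75
running = [vm.name for vm in gedf_pick(vms, pool)]
```

In this example two of the four PCPUs go to the real-time pool, and the remaining two would be left to a throughput-oriented scheduler for non-RT-VMs.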

    Page Classifier and Placer: A Scheme of Managing Hybrid Caches

    Part 1: Systems, Networks and Architectures. Hybrid cache architecture (HCA), which uses two or more cache hierarchy designs in a processor, may outperform traditional cache architectures because no single memory technology can deliver the optimal power, performance and density at the same time. The general HCA scheme has also been proposed to manage cache regions that have different usage patterns. However, previous HCA management schemes control data placement at the cache-set level and are oblivious to software’s different power and performance characteristics in different hardware cache regions. This hardware-only approach may lead to performance loss and may fail to guarantee quality of service. We propose a new HCA approach that enables the OS to be aware of the underlying hybrid cache architecture and to control data placement, at the OS page level, onto different cache regions. Our approach employs a lightweight hardware profiler to monitor cache behavior at the OS page level and to capture the hot pages. With this knowledge, the OS is able to dynamically select different cache placement policies to optimize data placement and achieve higher performance, lower power consumption and better quality of service. Our simulation experiments demonstrate that the proposed HCA achieves a 7.8% performance improvement on a dual-core system compared to a traditional SRAM-only cache architecture and at the same time reduces area cost.
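The page-level classification step can be pictured with a small software model. The abstract does not describe the profiler's counters, the threshold, or the cache regions, so the 4 KiB page size, the access-count threshold, and the region names `fast_region` and `dense_region` are all hypothetical choices for this sketch.

```python
from collections import Counter

PAGE_SIZE = 4096  # assumed page size in bytes


def classify_pages(addresses, hot_threshold):
    """Count accesses per page and return the set of hot page numbers."""
    counts = Counter(addr // PAGE_SIZE for addr in addresses)
    return {page for page, n in counts.items() if n >= hot_threshold}


def place_page(page, hot_pages):
    """Choose a cache region for a page (region names are hypothetical)."""
    return "fast_region" if page in hot_pages else "dense_region"


# Four of the six accesses fall in page 1; pages 2 and 5 are touched once.
trace = [0x1000, 0x1008, 0x1FF0, 0x2000, 0x5000, 0x1010]
hot = classify_pages(trace, hot_threshold=3)
placement = {page: place_page(page, hot) for page in (1, 2, 5)}
```

A real profiler would sample in hardware and age its counters over time; the sketch only shows the classify-then-place flow at page granularity.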

    Underthrusting and duplexing beneath the northern Tibetan Plateau and the evolution of the Himalayan-Tibetan orogen

    The Cenozoic Qilian Shan thrust belt is the northern margin of the Tibetan Plateau, which developed in part due to progressive India-Asia convergence during Himalayan-Tibetan orogeny. Available geologic observations suggest that this thrust belt started deforming shortly after initial India-Asia collision at 60-55 Ma, and thus its kinematic development is intrinsically related to the construction and evolution of the Tibetan Plateau. Here, we present new field observations from a geologic traverse across the Qilian Shan to elucidate the style of deformation across the active thrust belt. In particular, we infer protracted out-of-sequence deformation here that is consistent with this thrust system remaining a stationary northern boundary to the Tibetan Plateau since the early Cenozoic. We present a lithosphere-scale model for this region that highlights the following: (1) coupled distributed crustal shortening and underthrusting of the North China craton beneath Tibet, which explains the spatial and temporal distribution of observed crustal shortening and thickness, (2) this underthrusting exploited the south-dipping early Paleozoic Qilian suture paleo-subduction melange channel, and (3) development of a lower-crustal duplex at the lithospheric underthrusting ramp. This last inference can explain the relatively high elevation, low relief, and thickened crust of the central Qilian Shan, as well as the comparative aseismicity of the region, which experiences fewer earthquakes due to less upper-crustal faulting. Both the northern and southern margins of the Himalayan-Tibetan orogen appear to have developed similarly, with continental underthrusting and crustal-scale imbrication and duplexing, despite vastly different climatic and plate-velocity boundary conditions, which suggests that the orogen-scale architecture of the thrust belt is controlled by neither of these forcing mechanisms. Instead, strength anisotropies of the crust probably control the kinematics and style of deformation, including the development of northern Tibet, where thrust systems are concentrated along pre-Cenozoic suture zones.

    The Photosynthetic Characteristics of Different Purple Peppers

    The yield of pepper with purple leaves (PF) is low, while pepper with green leaves (GM) is not resistant to strong light and high temperature. In this study, we analyzed the photosynthetic characteristics and genetic stability of their hybrid progenies using PF (CS3) and GM (SJ11-3) as controls. Based on decreasing purple color and increasing green color, the hybrid peppers were divided into five groups: Z1, Z2, Z3, Z4 and Z5. Results showed that as the purple color increased, the anthocyanin content in leaves increased. We also found that PF exhibited higher resistance to strong light and high temperature. Thus, the purple hybrid progenies with a higher photosynthetic rate were recommended, as they showed higher yield and better resistance to strong light and high temperature.