804 research outputs found

    Porting Decision Tree Algorithms to Multicore using FastFlow

    Full text link
    The whole computer hardware industry embraced multicores. For these machines, the extreme optimisation of sequential algorithms is no longer sufficient to squeeze the real machine power, which can be only exploited via thread-level parallelism. Decision tree algorithms exhibit natural concurrency that makes them suitable to be parallelised. This paper presents an approach for easy-yet-efficient porting of an implementation of the C4.5 algorithm on multicores. The parallel porting requires minimal changes to the original sequential code, and it is able to exploit up to 7X speedup on an Intel dual-quad core machine.Comment: 18 pages + cove

    Power Bounded Computing on Current & Emerging HPC Systems

    Get PDF
    Power has become a critical constraint for the evolution of large scale High Performance Computing (HPC) systems and commercial data centers. This constraint spans almost every level of computing technologies, from IC chips all the way up to data centers due to physical, technical, and economic reasons. To cope with this reality, it is necessary to understand how available or permissible power impacts the design and performance of emergent computer systems. For this reason, we propose power bounded computing and corresponding technologies to optimize performance on HPC systems with limited power budgets. We have multiple research objectives in this dissertation. They center on the understanding of the interaction between performance, power bounds, and a hierarchical power management strategy. First, we develop heuristics and application aware power allocation methods to improve application performance on a single node. Second, we develop algorithms to coordinate power across nodes and components based on application characteristic and power budget on a cluster. Third, we investigate performance interference induced by hardware and power contentions, and propose a contention aware job scheduling to maximize system throughput under given power budgets for node sharing system. Fourth, we extend to GPU-accelerated systems and workloads and develop an online dynamic performance & power approach to meet both performance requirement and power efficiency. Power bounded computing improves performance scalability and power efficiency and decreases operation costs of HPC systems and data centers. This dissertation opens up several new ways for research in power bounded computing to address the power challenges in HPC systems. The proposed power and resource management techniques provide new directions and guidelines to green exscale computing and other computing systems

    Learning-based run-time power and energy management of multi/many-core systems: current and future trends

    Get PDF
    Multi/Many-core systems are prevalent in several application domains targeting different scales of computing such as embedded and cloud computing. These systems are able to fulfil the everincreasing performance requirements by exploiting their parallel processing capabilities. However, effective power/energy management is required during system operations due to several reasons such as to increase the operational time of battery operated systems, reduce the energy cost of datacenters, and improve thermal efficiency and reliability. This article provides an extensive survey of learning-based run-time power/energy management approaches. The survey includes a taxonomy of the learning-based approaches. These approaches perform design-time and/or run-time power/energy management by employing some learning principles such as reinforcement learning. The survey also highlights the trends followed by the learning-based run-time power management approaches, their upcoming trends and open research challenges

    Mixing multi-core CPUs and GPUs for scientific simulation software

    Get PDF
    Recent technological and economic developments have led to widespread availability of multi-core CPUs and specialist accelerator processors such as graphical processing units (GPUs). The accelerated computational performance possible from these devices can be very high for some applications paradigms. Software languages and systems such as NVIDIA's CUDA and Khronos consortium's open compute language (OpenCL) support a number of individual parallel application programming paradigms. To scale up the performance of some complex systems simulations, a hybrid of multi-core CPUs for coarse-grained parallelism and very many core GPUs for data parallelism is necessary. We describe our use of hybrid applica- tions using threading approaches and multi-core CPUs to control independent GPU devices. We present speed-up data and discuss multi-threading software issues for the applications level programmer and o er some suggested areas for language development and integration between coarse-grained and ne-grained multi-thread systems. We discuss results from three common simulation algorithmic areas including: partial di erential equations; graph cluster metric calculations and random number generation. We report on programming experiences and selected performance for these algorithms on: single and multiple GPUs; multi-core CPUs; a CellBE; and using OpenCL. We discuss programmer usability issues and the outlook and trends in multi-core programming for scienti c applications developers

    Exploiting heterogeneity in Chip-Multiprocessor Design

    Get PDF
    In the past decade, semiconductor manufacturers are persistent in building faster and smaller transistors in order to boost the processor performance as projected by Moore’s Law. Recently, as we enter the deep submicron regime, continuing the same processor development pace becomes an increasingly difficult issue due to constraints on power, temperature, and the scalability of transistors. To overcome these challenges, researchers propose several innovations at both architecture and device levels that are able to partially solve the problems. These diversities in processor architecture and manufacturing materials provide solutions to continuing Moore’s Law by effectively exploiting the heterogeneity, however, they also introduce a set of unprecedented challenges that have been rarely addressed in prior works. In this dissertation, we present a series of in-depth studies to comprehensively investigate the design and optimization of future multi-core and many-core platforms through exploiting heteroge-neities. First, we explore a large design space of heterogeneous chip multiprocessors by exploiting the architectural- and device-level heterogeneities, aiming to identify the optimal design patterns leading to attractive energy- and cost-efficiencies in the pre-silicon stage. After this high-level study, we pay specific attention to the architectural asymmetry, aiming at developing a heterogeneity-aware task scheduler to optimize the energy-efficiency on a given single-ISA heterogeneous multi-processor. An advanced statistical tool is employed to facilitate the algorithm development. In the third study, we shift our concentration to the device-level heterogeneity and propose to effectively leverage the advantages provided by different materials to solve the increasingly important reliability issue for future processors

    Locality Enhancement and Dynamic Optimizations on Multi-Core and GPU

    Get PDF
    Enhancing the match between software executions and hardware features is key to computing efficiency. The match is a continuously evolving and challenging problem. This dissertation focuses on the development of programming system support for exploiting two key features of modern hardware development: the massive parallelism of emerging computational accelerators such as Graphic Processing Units (GPU), and the non-uniformity of cache sharing in modern multicore processors. They are respectively driven by the important role of accelerators in today\u27s general-purpose computing and the ultimate importance of memory performance. This dissertation particularly concentrates on optimizing control flows and memory references, at both compilation and execution time, to tap into the full potential of pure software solutions in taking advantage of the two key hardware features.;Conditional branches cause divergences in program control flows, which may result in serious performance degradation on massively data-parallel GPU architectures with Single Instruction Multiple Data (SIMD) parallelism. On such an architecture, control divergence may force computing units to stay idle for a substantial time, throttling system throughput by orders of magnitude. This dissertation provides an extensive exploration of the solution to this problem and presents program level transformations based upon two fundamental techniques --- thread relocation and data relocation. These two optimizations provide fundamental support for swapping jobs among threads so that the control flow paths of threads converge within every SIMD thread group.;In memory performance, this dissertation concentrates on two aspects: the influence of nonuniform sharing on multithreading applications, and the optimization of irregular memory references on GPUs. In shared cache multicore chips, interactions among threads are complicated due to the interplay of cache contention and synergistic prefetching. This dissertation presents the first systematic study on the influence of non-uniform shared cache on contemporary parallel programs, reveals the mismatch between the software development and underlying cache sharing hierarchies, and further demonstrates it by proposing and applying cache-sharing-aware data transformations that bring significant performance improvement. For the second aspect, the efficiency of GPU accelerators is sensitive to irregular memory references, which refer to the memory references whose access patterns remain unknown until execution time (e.g., A[P[i]]). The root causes of the irregular memory reference problem are similar to that of the control flow problem, while in a more general and complex form. I developed a framework, named G-Streamline, as a unified software solution to dynamic irregularities in GPU computing. It treats both types of irregularities at the same time in a holistic fashion, maximizing the whole-program performance by resolving conflicts among optimizations
    • …
    corecore