208 research outputs found

    Understanding the behavior of Pthread applications on non-uniform cache architectures

    Poster. Why is it important? As the number of cores in a processor scales up, caches become banked, which keeps individual look-up time small and allows parallel accesses by different cores. The present shared-memory programming model assumes a flat memory, so an application that is unaware of the banked layout can perform sub-optimally. Conclusion: the programming model needs to change for any heterogeneous memory hierarchy. Architecture, OS, compiler, and application developers should work together. Significant performance gains can be achieved without increasing system complexity. As the complexity of the memory hierarchy grows, optimizations like these will be critical.

    Accelerating Pattern Matching in Neuromorphic Text Recognition System Using Intel Xeon Phi Coprocessor

    Neuromorphic computing systems are computing architectures inspired by the working mechanism of the human brain. The rapidly falling cost and rising performance of state-of-the-art computing hardware allow large-scale implementation of machine intelligence models with neuromorphic architectures and open the opportunity for new applications. One such platform is the Intel Xeon Phi coprocessor, which delivers over a TeraFLOP of computing power with 61 integrated processing cores. How to efficiently harness such computing power to achieve real-time decision and cognition is one of the key design considerations. This work presents an optimized implementation of the Brain-State-in-a-Box (BSB) neural network model on the Xeon Phi coprocessor for pattern matching in the context of intelligent text recognition of noisy document images. From a scalability standpoint on a High Performance Computing (HPC) platform, we show that efficient workload partitioning and resource management can double the performance of this many-core architecture for neuromorphic applications.

    Hybrid Caching for Chip Multiprocessors Using Compiler-Based Data Classification

    The high performance delivered by modern computer systems keeps scaling with an increasing number of processors connected by a distributed on-chip network. As a result, memory access latency, largely dominated by remote data cache access and inter-processor communication, is becoming a critical performance bottleneck. To relieve this problem, it is necessary to localize data access as much as possible while keeping on-chip cache utilization efficient. Achieving this, however, is application dependent and requires keen insight into the memory access characteristics of the applications. This thesis demonstrates how, using fairly simple and thus inexpensive compiler analysis, memory accesses can be classified into private data accesses and shared data accesses. In addition, we introduce a third classification named probably private access and demonstrate the impact of this category compared to the traditional private and shared memory classification. The memory access classification information from the compiler analysis is then provided to the runtime system through a modified memory allocator and page table to facilitate a hybrid private-shared caching technique. The hybrid cache mechanism is aware of the different data access classifications and adopts appropriate placement and search policies accordingly to improve performance. Our analysis demonstrates that many applications have a significant amount of both private and shared data and that compiler analysis can identify the private data effectively for many applications. Experimental results show that the implemented hybrid caching scheme achieves a 4.03% performance improvement over state-of-the-art NUCA-based caching.

    Towards a high performance parallel library to compute fluid flexible structures interactions

    Indiana University-Purdue University Indianapolis (IUPUI). The LBM-IB method is a useful and popular simulation technique that is widely adopted to solve fluid-structure interaction problems in computational fluid dynamics. These problems are known for using computing resources intensively while solving the mathematical equations involved in the simulations. Problems involving such interactions are omnipresent; it is therefore imperative to have a faster and accurate algorithm for solving these equations, so that a real-life model of such complex analytical problems can be reproduced in a shorter time. Being inherently parallel, LBM-IB is an ideal candidate for parallel software. This research focuses on developing a parallel software library, LBM-IB, based on the algorithm proposed by [1], which is the first of its kind to use the high performance computing abilities of the supercomputers available today. An initial sequential version of LBM-IB is developed and used as a benchmark for correctness and performance evaluation of the shared memory parallel versions. Two shared memory parallel versions of LBM-IB have been developed, using OpenMP and the Pthread library respectively. The OpenMP version scales well, reaching as much as 83% speedup on multicore machines for <=8 cores. Based on the profiling and instrumentation done on this version, a Pthread-based data-centric version is developed to improve data locality and increase the degree of parallelism; it outperforms the OpenMP version by 53% on manycore machines. A distributed version using the MPI interfaces on top of the cube-based Pthread version has also been designed for extreme-scale distributed memory manycore systems.

    McSimA+: A Manycore Simulator with Application-level+ Simulation and Detailed Microarchitecture Modeling

    With their significant performance and energy advantages, emerging manycore processors have also brought new challenges to the architecture research community. Manycore processors are highly integrated complex system-on-chips with complicated core and uncore subsystems. The core subsystems can consist of a large number of traditional and asymmetric cores. The uncore subsystems have also become unprecedentedly powerful and complex with deeper cache hierarchies, advanced on-chip interconnects, and high-performance memory controllers. In order to conduct research for emerging manycore processor systems, a microarchitecture-level and cycle-level manycore simulation infrastructure is needed. This paper introduces McSimA+, a new timing simulation infrastructure, to meet these needs. McSimA+ models x86-based asymmetric manycore microarchitectures in detail for both core and uncore subsystems, including a full spectrum of asymmetric cores from single-threaded to multithreaded and from in-order to out-of-order, sophisticated cache hierarchies, coherence hardware, on-chip interconnects, memory controllers, and main memory. McSimA+ is an application-level+ simulator, offering a middle ground between a full-system simulator and an application-level simulator. Therefore, it enjoys the light weight of an application-level simulator and the full control of threads and processes as in a full-system simulator. This paper also explores an asymmetric clustered manycore architecture that can reduce the thread migration cost to achieve a noticeable performance improvement compared to a state-of-the-art asymmetric manycore architecture.

    Exploring coordinated software and hardware support for hardware resource allocation

    Multithreaded processors are now common in the industry as they offer high performance at a low cost. Traditionally, in such processors, the assignment of hardware resources among the multiple threads is done implicitly, by hardware policies. However, a new class of multithreaded hardware allows the explicit allocation of resources to be controlled or biased by the software. Currently, there is little or no coordination between the allocation of resources done by the hardware and the prioritization of tasks done by the software. This thesis aims to narrow the gap between the software and the hardware, with respect to hardware resource allocation, by proposing a new explicit resource allocation hardware mechanism and novel schedulers that use the currently available hardware resource allocation mechanisms. It approaches the problem in two different types of computing systems. In the high performance computing domain, we characterize the first processor to present a mechanism that allows the software to bias the allocation of hardware resources, the IBM POWER5. In addition, we propose the use of hardware resource allocation as a way to balance high performance computing applications. Finally, we propose two new scheduling mechanisms that are able to transparently and successfully balance applications in real systems using hardware resource allocation. In the soft real-time domain, we propose a hardware extension to the existing explicit resource allocation hardware and, in addition, two software schedulers that use the explicit allocation hardware to improve the schedulability of tasks in a soft real-time system. In this thesis, we demonstrate that system performance improves when the software is made aware of the mechanisms that control the amount of resources given to each running thread. In particular, for the high performance computing domain, we show that it is possible to decrease the execution time of MPI applications by biasing the hardware resource assignment between threads. In addition, we show that it is possible to decrease the number of missed deadlines when scheduling tasks in a soft real-time SMT system. Postprint (published version)

    Boosting Multi-Core Reachability Performance with Shared Hash Tables

    This paper focuses on data structures for multi-core reachability, which is a key component in model checking algorithms and other verification methods. A cornerstone of an efficient solution is the storage of visited states. In related work, static partitioning of the state space was combined with thread-local storage and resulted in reasonable speedups, but left open whether improvements are possible. In this paper, we present a scaling solution for shared state storage which is based on a lockless hash table implementation. The solution is specifically designed for the cache architecture of modern CPUs. Because model checking algorithms impose loose requirements on the hash table operations, their design can be streamlined substantially compared to related work on lockless hash tables. Still, an implementation of the hash table presented here has dozens of sensitive performance parameters (bucket size, cache line size, data layout, probing sequence, etc.). We analyzed their impact and compared the resulting speedups with related tools. Our implementation outperforms two state-of-the-art multi-core model checkers (SPIN and DiVinE) by a substantial margin, while placing fewer constraints on the load balancing and search algorithms. Comment: preliminary report
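The idea of a shared, lockless visited-state store can be sketched roughly as follows. This is a minimal illustration and not the paper's implementation: the class name `VisitedSet`, the `find_or_put` operation, the use of linear probing, the multiplicative hash, and the convention that 0 marks an empty slot are all assumptions made for the example. It does reflect one point from the abstract: because reachability only ever adds states and asks whether a state has been seen, a single combined operation built on compare-and-swap is enough, with no deletion or resizing.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative sketch only: names and layout are assumptions, not the paper's code.
class VisitedSet {
public:
    enum class Result { Inserted, Found, Full };

    // Capacity must be a power of two so that "& mask_" wraps the probe index.
    explicit VisitedSet(std::size_t capacity)
        : mask_(capacity - 1), slots_(capacity) {
        assert(capacity != 0 && (capacity & (capacity - 1)) == 0);
        for (auto& slot : slots_) slot.store(kEmpty, std::memory_order_relaxed);
    }

    // Atomically records `state` if it is new, otherwise reports it as seen.
    // States must be nonzero because 0 is reserved as the empty marker.
    Result find_or_put(std::uint64_t state) {
        assert(state != kEmpty);
        std::size_t idx = hash(state) & mask_;
        for (std::size_t probes = 0; probes <= mask_; ++probes) {
            std::uint64_t seen = slots_[idx].load(std::memory_order_acquire);
            if (seen == kEmpty) {
                // Try to claim the empty slot. On failure, `seen` is updated
                // to the value another thread stored there first.
                if (slots_[idx].compare_exchange_strong(
                        seen, state,
                        std::memory_order_acq_rel,
                        std::memory_order_acquire)) {
                    return Result::Inserted;
                }
            }
            if (seen == state) return Result::Found;
            idx = (idx + 1) & mask_;  // linear probing to the next slot
        }
        return Result::Full;  // every slot was probed and none matched
    }

private:
    static constexpr std::uint64_t kEmpty = 0;

    // Simple multiplicative mixing; any reasonable hash would do here.
    static std::size_t hash(std::uint64_t x) {
        return static_cast<std::size_t>((x * 0x9E3779B97F4A7C15ULL) >> 32);
    }

    std::size_t mask_;
    std::vector<std::atomic<std::uint64_t>> slots_;
};
```

In a reachability search, each worker thread would call `find_or_put` on every successor state and only explore those that come back as `Inserted`. The tuning parameters the abstract lists (bucket size, cache line alignment, probing sequence) are deliberately left out of this sketch.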

    pocl: A Performance-Portable OpenCL Implementation

    OpenCL is a standard for parallel programming of heterogeneous systems. The benefits of a common programming standard are clear; multiple vendors can provide support for application descriptions written according to the standard, thus reducing the program porting effort. While the standard brings the obvious benefits of platform portability, the performance portability aspects are largely left to the programmer. The situation is made worse due to multiple proprietary vendor implementations with different characteristics, and, thus, required optimization strategies. In this paper, we propose an OpenCL implementation that is both portable and performance portable. At its core is a kernel compiler that can be used to exploit the data parallelism of OpenCL programs on multiple platforms with different parallel hardware styles. The kernel compiler is modularized to perform target-independent parallel region formation separately from the target-specific parallel mapping of the regions to enable support for various styles of fine-grained parallel resources such as subword SIMD extensions, SIMD datapaths and static multi-issue. Unlike previous similar techniques that work on the source level, the parallel region formation retains the information of the data parallelism using the LLVM IR and its metadata infrastructure. This data can be exploited by the later generic compiler passes for efficient parallelization. The proposed open source implementation of OpenCL is also platform portable, enabling OpenCL on a wide range of architectures, both already commercialized and on those that are still under research. The paper describes how the portability of the implementation is achieved. Our results show that most of the benchmarked applications when compiled using pocl were faster or close to as fast as the best proprietary OpenCL implementation for the platform at hand. Comment: This article was published in 2015; it is now openly accessible via arXiv