13 research outputs found

    A key-based adaptive transactional memory executor

    Get PDF
    Software transactional memory systems enable a programmer to easily write concurrent data structures such as lists, trees, hashtables, and graphs, where nonconflicting operations proceed in parallel. Many of these structures take the abstract form of a dictionary, in which each transaction is associated with a search key. By regrouping transactions based on their keys, one may improve locality and reduce conflicts among parallel transactions. In this paper, we present an executor that partitions transactions among available processors. Our keybased adaptive partitioning monitors incoming transactions, estimates the probability distribution of their keys, and adaptively determines the (usually nonuniform) partitions. By comparing the adaptive partitioning with uniform partitioning and round-robin keyless partitioning on a 16-processor SunFire 6800 machine, we demonstrate that key-based adaptive partitioning significantly improves the throughput of finegrained parallel operations on concurrent data structures

    High performance cloud computing on multicore computers

    Get PDF
    The cloud has become a major computing platform, with virtualization being a key to allow applications to run and share the resources in the cloud. A wide spectrum of applications need to process large amounts of data at high speeds in the cloud, e.g., analyzing customer data to find out purchase behavior, processing location data to determine geographical trends, or mining social media data to assess brand sentiment. To achieve high performance, these applications create and use multiple threads running on multicore processors. However, existing virtualization technology cannot support the efficient execution of such applications on virtual machines, making them suffer poor and unstable performance in the cloud. Targeting multi-threaded applications, the dissertation analyzes and diagnoses their performance issues on virtual machines, and designs practical solutions to improve their performance. The dissertation makes the following contributions. First, the dissertation conducts extensive experiments with standard multicore applications, in order to evaluate the performance overhead on virtualization systems and diagnose the causing factors. Second, focusing on one main source of the performance overhead, excessive spinning, the dissertation designs and evaluates a holistic solution to make effective utilization of the hardware virtualization support in processors to reduce excessive spinning with low cost. Third, focusing on application scalability, which is the most important performance feature for multi-threaded applications, the dissertation models application scalability in virtual machines and analyzes how application scalability changes with virtualization and resource sharing. Based on the modeling and analysis, the dissertation identifies key application features and system factors that have impacts on application scalability, and reveals possible approaches for improving scalability. Forth, the dissertation explores one approach to improving application scalability by making fully utilization of virtual resources of each virtual machine. The general idea is to match the workload distribution among the virtual CPUs in a virtual machine and the virtual CPU resource of the virtual machine manager

    The fast multipole method at exascale

    Get PDF
    This thesis presents a top to bottom analysis on designing and implementing fast algorithms for current and future systems. We present new analysis, algorithmic techniques, and implementations of the Fast Multipole Method (FMM) for solving N- body problems. We target the FMM because it is broadly applicable to a variety of scientific particle simulations used to study electromagnetic, fluid, and gravitational phenomena, among others. Importantly, the FMM has asymptotically optimal time complexity with guaranteed approximation accuracy. As such, it is among the most attractive solutions for scalable particle simulation on future extreme scale systems. We specifically address two key challenges. The first challenge is how to engineer fast code for today’s platforms. We present the first in-depth study of multicore op- timizations and tuning for FMM, along with a systematic approach for transforming a conventionally-parallelized FMM into a highly-tuned one. We introduce novel opti- mizations that significantly improve the within-node scalability of the FMM, thereby enabling high-performance in the face of multicore and manycore systems. The second challenge is how to understand scalability on future systems. We present a new algorithmic complexity analysis of the FMM that considers both intra- and inter- node communication costs. Using these models, we present results for choosing the optimal algorithmic tuning parameter. This analysis also yields the surprising prediction that although the FMM is largely compute-bound today, and therefore highly scalable on current systems, the trajectory of processor architecture designs, if there are no significant changes could cause it to become communication-bound as early as the year 2015. This prediction suggests the utility of our analysis approach, which directly relates algorithmic and architectural characteristics, for enabling a new kind of highlevel algorithm-architecture co-design. To demonstrate the scientific significance of FMM, we present two applications namely, direct simulation of blood which is a multi-scale multi-physics problem and large-scale biomolecular electrostatics. MoBo (Moving Boundaries) is the infrastruc- ture for the direct numerical simulation of blood. It comprises of two key algorithmic components of which FMM is one. We were able to simulate blood flow using Stoke- sian dynamics on 200,000 cores of Jaguar, a peta-flop system and achieve a sustained performance of 0.7 Petaflop/s. The second application we propose as future work in this thesis is biomolecular electrostatics where we solve for the electrical potential using the boundary-integral formulation discretized with boundary element methods (BEM). The computational kernel in solving the large linear system is dense matrix vector multiply which we propose can be calculated using our scalable FMM. We propose to begin with the two dielectric problem where the electrostatic field is cal- culated using two continuum dielectric medium, the solvent and the molecule. This is only a first step to solving biologically challenging problems which have more than two dielectric medium, ion-exclusion layers, and solvent filled cavities. Finally, given the difficulty in producing high-performance scalable code, productivity is a key concern. Recently, numerical algorithms are being redesigned to take advantage of the architectural features of emerging multicore processors. These new classes of algorithms express fine-grained asynchronous parallelism and hence reduce the cost of synchronization. We performed the first extensive performance study of a recently proposed parallel programming model, called Concurrent Collections (CnC). In CnC, the programmer expresses her computation in terms of application-specific operations, partially-ordered by semantic scheduling constraints. The CnC model is well-suited to expressing asynchronous-parallel algorithms, so we evaluate CnC using two dense linear algebra algorithms in this style for execution on state-of-the-art mul- ticore systems. Our implementations in CnC was able to match and in some cases even exceed competing vendor-tuned and domain specific library codes. We combine these two distinct research efforts by expressing FMM in CnC, our approach tries to marry performance with productivity that will be critical on future systems. Looking forward, we would like to extend this to distributed memory machines, specifically implement FMM in the new distributed CnC, distCnC to express fine-grained paral- lelism which would require significant effort in alternative models.Ph.D

    Locality Enhancement and Dynamic Optimizations on Multi-Core and GPU

    Get PDF
    Enhancing the match between software executions and hardware features is key to computing efficiency. The match is a continuously evolving and challenging problem. This dissertation focuses on the development of programming system support for exploiting two key features of modern hardware development: the massive parallelism of emerging computational accelerators such as Graphic Processing Units (GPU), and the non-uniformity of cache sharing in modern multicore processors. They are respectively driven by the important role of accelerators in today\u27s general-purpose computing and the ultimate importance of memory performance. This dissertation particularly concentrates on optimizing control flows and memory references, at both compilation and execution time, to tap into the full potential of pure software solutions in taking advantage of the two key hardware features.;Conditional branches cause divergences in program control flows, which may result in serious performance degradation on massively data-parallel GPU architectures with Single Instruction Multiple Data (SIMD) parallelism. On such an architecture, control divergence may force computing units to stay idle for a substantial time, throttling system throughput by orders of magnitude. This dissertation provides an extensive exploration of the solution to this problem and presents program level transformations based upon two fundamental techniques --- thread relocation and data relocation. These two optimizations provide fundamental support for swapping jobs among threads so that the control flow paths of threads converge within every SIMD thread group.;In memory performance, this dissertation concentrates on two aspects: the influence of nonuniform sharing on multithreading applications, and the optimization of irregular memory references on GPUs. In shared cache multicore chips, interactions among threads are complicated due to the interplay of cache contention and synergistic prefetching. This dissertation presents the first systematic study on the influence of non-uniform shared cache on contemporary parallel programs, reveals the mismatch between the software development and underlying cache sharing hierarchies, and further demonstrates it by proposing and applying cache-sharing-aware data transformations that bring significant performance improvement. For the second aspect, the efficiency of GPU accelerators is sensitive to irregular memory references, which refer to the memory references whose access patterns remain unknown until execution time (e.g., A[P[i]]). The root causes of the irregular memory reference problem are similar to that of the control flow problem, while in a more general and complex form. I developed a framework, named G-Streamline, as a unified software solution to dynamic irregularities in GPU computing. It treats both types of irregularities at the same time in a holistic fashion, maximizing the whole-program performance by resolving conflicts among optimizations

    Matching non-uniformity for program optimizations on heterogeneous many-core systems

    Get PDF
    As computing enters an era of heterogeneity and massive parallelism, it exhibits a distinct feature: the deepening non-uniform relations among the computing elements in both hardware and software. Besides traditional non-uniform memory accesses, much deeper non-uniformity shows in a processor, runtime, and application, exemplified by the asymmetric cache sharing, memory coalescing, and thread divergences on multicore and many-core processors. Being oblivious to the non-uniformity, current applications fail to tap into the full potential of modern computing devices.;My research presents a systematic exploration into the emerging property. It examines the existence of such a property in modern computing, its influence on computing efficiency, and the challenges for establishing a non-uniformity--aware paradigm. I propose several techniques to translate the property into efficiency, including data reorganization to eliminate non-coalesced accesses, asynchronous data transformations for locality enhancement and a controllable scheduling for exploiting non-uniformity among thread blocks. The experiments show much promise of these techniques in maximizing computing throughput, especially for programs with complex data access patterns

    Predicting Application Performance for Chip Multiprocessors

    Get PDF
    Today's computers have processors with multiple cores that allow several applications to execute simultaneously. The way resources are allocated to an application affects whether performance objectives, such as quality of service (QoS), are satisfied. To ensure objectives are met, resources must be carefully but quickly allocated in response to changing runtime conditions. Traditional approaches to resource allocation take place either purely online or offline. Online methods do not scale to large, multiple core systems because there are too many allocations to evaluate at runtime. Offline methods cannot handle unanticipated workloads or changes. A hybrid approach could combine the lower runtime overhead of offline approaches with the flexibility of online approaches. This thesis introduces AUTO, a hybrid solution to perform resource allocation. AUTO dynamically adjusts thread count, core count, and core type. It does so in accordance with a user-provided policy to meet performance objectives. AUTO's capabilities come from four prediction techniques. The first technique builds and uses models that consider CPU contention and application scalability in order to select co-running applications' thread counts. The second technique predicts applications' preferred thread-to-core mappings. The predictions are thread count independent and are translated into concrete thread-to-core mappings based on resource availability. The third technique predicts application performance under thread-to-core mappings. The final technique selects thread count and core count for applications on a system with cores of different capabilities. AUTO was tested in several scenarios. In each scenario, it was shown to be an effective, efficient solution to resource allocation. First, it was used to select the thread count of one or more co-running applications. Second, it was used to select application thread-to-core mappings. Third, it was used to make predictions about application performance under thread-to-core mappings. Finally, it was used to select both thread count and core type for applications on a computer with cores of different capabilities. AUTO's resource allocation and models allow for more effective and more efficient policies. By using hybrid online and offline techniques, AUTO solves the problem of allocating threads and cores to meet performance objectives

    Locality Transformations and Prediction Techniques for Optimizing Multicore Memory Performance

    Get PDF
    Chip Multiprocessors (CMPs) are here to stay for the foreseeable future. In terms of programmability of these processors what is different from legacy multiprocessors is that sharing among the different cores (processors) is less expensive than it was in the past. Previous research suggested that sharing is a desirable feature to be incorporated in new codes. For some programs, more cache leads to more beneficial sharing since the sharing starts to kick in for large on chip caches. This work tries to answer the question of whether or not we can (should) write code differently when the underlying chip microarchitecture is powered by a Chip Multiprocessor. We use a set three graph benchmarks each with three different input problems varying in size and connectivity to characterize the importance of how we partition the problem space among cores and how that partitioning can happen at multiple levels of the cache leading to better performance because of good utilization of the caches at the lowest level and because of the increased sharing of data items that can be boosted at the shared cache level (L2 in our case) which can effectively be a prefetching effect among different compute cores. The thesis has two thrusts. The first is exploring the design space represented by different parallelization schemes (we devise some tweaks on top of existing techniques) and different graph partitionings (a locality optimization techniques suited for graph problems). The combination of the parallelization strategy and graph partitioning provides a large and complex space that we characterize using detailed simulation results to see how much gain we can obtain over a baseline legacy parallelization technique with a partition sized to fit in the L1 cache. We show that the legacy parallelization is not the best alternative in most of the cases and other parallelization techniques perform better. Also, we show that there is a search problem to determine the partitioning size and in most of the cases the best partitioning size is smaller than the baseline partition size. The second thrust of the thesis is exploring how we can predict the best combination of parallelization and partitioning that performs the best for any given benchmark under any given input data set. We use a PIN based reuse distance profile computation tool to build an execution time prediction model that can rank order the different combinations of parallelization strategies and partitioning sizes. We report the amount of gain that we can capture using the PIN prediction relative to what detailed simulation results deem the best under a given benchmark and input size. In some cases the prediction is 100% accurate and in some other cases the prediction projects worse performance than the baseline case. We report the difference between the simulation best performing combination and the PIN predicted ones as well as other statistics to evaluate how good the predictions are. We show that the PIN prediction method performs very well in predicting the partition size compared to predicting the parallelization strategy. In this case, the accuracy of the overall scheme can be highly improved if we only use the partitioning size predicted by the PIN prediction scheme and then we use a search strategy to find the best parallelization strategy for that partition size. In this thesis, we use a detailed performance model to scan a large solution space for the best parameters for locality optimization of a set of graph problems. Using the M5 performance simulation we show gains of up to 20% vs. a naively picked baseline case. Our prediction scheme can achieve up to 100% of the best performance gains obtained using a search method on real hardware or performance simulation without running at all on the target hardware and up to 48% on average across all of our benchmarks and input sizes. There are several interesting aspects to this work. We are the first to devise and verify a performance model against detailed simulation results. We suggest and quantify that locality optimization and problem partitioning can increase sharing synergistically to achieve better performance overall. We have shown a new utilization for coherent reuse distance profiles as a helping tool for program developers and compilers to a optimize program's performance

    Detecting and quantifying resource contention in concurrent programs

    Get PDF
    Thesis (MSc)--Stellenbosch University, 2016ENGLISH ABSTRACT : Parallel programs, both shared-memory and message-passing programs, typically require the sharing of resources. For example, software resources, such as shared mutual exclusion locks and hardware resources, such as caches and memory. Shared resources can only be used by one thread or process at a time. The competition for limited resources is called resource contention. The result of resource contention is delays while waiting for access to a resource and/or extra computational overhead to resolve the request for a resource. Thus, the performance of the program can be improved by identifying and reducing contention for shared resources. This study investigates the effect of individual types of contention for hardware and software resources in detail and discusses the three tools that were developed to identify and quantify the sources of contention in concurrent programs.AFRIKAANSE OPSOMMING : Geen Afrikaanse opsomming geskikbaar ni

    Lightweight Massively Parallel Suffix Array Construction

    Get PDF
    The suffix array is an array of sorted suffixes in lexicographic order, where each sorted suffix is represented by its starting position in the input string. It is a fundamental data structure that finds various applications in areas such as string processing, text indexing, data compression, computational biology, and many more. Over the last three decades, researchers have proposed a broad spectrum of suffix array construction algorithms (SACAs). However, the majority of SACAs were implemented using sequential and parallel programming models. The maturity of GPU programming opened doors to the development of massively parallel GPU SACAs that outperform the fastest versions of suffix sorting algorithms optimized for the CPU parallel computing. Over the last five years, several GPU SACA approaches were proposed and implemented. They prioritized the running time over lightweight design. In this thesis, we design and implement a lightweight massively parallel SACA on the GPU using the prefix-doubling technique. Our prefix-doubling implementation is memory-efficient and can successfully construct the suffix array for input strings as large as 640 megabytes (MB) on Tesla P100 GPU. On large datasets, our implementation achieves a speedup of 7-16x over the fastest, highly optimized, OpenMP-accelerated suffix array constructor, libdivsufsort, that leverages the CPU shared memory parallelism. The performance of our algorithm relies on several high-performance parallel primitives such as radix sort, conditional filtering, inclusive prefix sum, random memory scattering, and segmented sort. We evaluate the performance of our implementation over a variety of real-world datasets with respect to its runtime, throughput, memory usage, and scalability. We compare our results against libdivsufsort that we run on a Haswell compute node equipped with 24 cores. Our GPU SACA is simple and compact, consisting of less than 300 lines of readable and effective source code. Additionally, we design and implement a fast and lightweight algorithm for checking the correctness of the suffix array
    corecore