144,645 research outputs found

    HeTM: Transactional Memory for Heterogeneous Systems

    Full text link
    Modern heterogeneous computing architectures, which couple multi-core CPUs with discrete many-core GPUs (or other specialized hardware accelerators), enable unprecedented peak performance and energy efficiency levels. Unfortunately, though, developing applications that can take full advantage of the potential of heterogeneous systems is a notoriously hard task. This work takes a step towards reducing the complexity of programming heterogeneous systems by introducing the abstraction of Heterogeneous Transactional Memory (HeTM). HeTM provides programmers with the illusion of a single memory region, shared among the CPUs and the (discrete) GPU(s) of a heterogeneous system, with support for atomic transactions. Besides introducing the abstract semantics and programming model of HeTM, we present the design and evaluation of a concrete implementation of the proposed abstraction, which we named Speculative HeTM (SHeTM). SHeTM makes use of a novel design that leverages on speculative techniques and aims at hiding the inherently large communication latency between CPUs and discrete GPUs and at minimizing inter-device synchronization overhead. SHeTM is based on a modular and extensible design that allows for easily integrating alternative TM implementations on the CPU's and GPU's sides, which allows the flexibility to adopt, on either side, the TM implementation (e.g., in hardware or software) that best fits the applications' workload and the architectural characteristics of the processing unit. We demonstrate the efficiency of the SHeTM via an extensive quantitative study based both on synthetic benchmarks and on a porting of a popular object caching system.Comment: The current work was accepted in the 28th International Conference on Parallel Architectures and Compilation Techniques (PACT'19

    Object-oriented Neural Programming (OONP) for Document Understanding

    Full text link
    We propose Object-oriented Neural Programming (OONP), a framework for semantically parsing documents in specific domains. Basically, OONP reads a document and parses it into a predesigned object-oriented data structure (referred to as ontology in this paper) that reflects the domain-specific semantics of the document. An OONP parser models semantic parsing as a decision process: a neural net-based Reader sequentially goes through the document, and during the process it builds and updates an intermediate ontology to summarize its partial understanding of the text it covers. OONP supports a rich family of operations (both symbolic and differentiable) for composing the ontology, and a big variety of forms (both symbolic and differentiable) for representing the state and the document. An OONP parser can be trained with supervision of different forms and strength, including supervised learning (SL) , reinforcement learning (RL) and hybrid of the two. Our experiments on both synthetic and real-world document parsing tasks have shown that OONP can learn to handle fairly complicated ontology with training data of modest sizes.Comment: accepted by ACL 201

    Memory transfer optimization for a lattice Boltzmann solver on Kepler architecture nVidia GPUs

    Full text link
    The Lattice Boltzmann method (LBM) for solving fluid flow is naturally well suited to an efficient implementation for massively parallel computing, due to the prevalence of local operations in the algorithm. This paper presents and analyses the performance of a 3D lattice Boltzmann solver, optimized for third generation nVidia GPU hardware, also known as `Kepler'. We provide a review of previous optimisation strategies and analyse data read/write times for different memory types. In LBM, the time propagation step (known as streaming), involves shifting data to adjacent locations and is central to parallel performance; here we examine three approaches which make use of different hardware options. Two of which make use of `performance enhancing' features of the GPU; shared memory and the new shuffle instruction found in Kepler based GPUs. These are compared to a standard transfer of data which relies instead on optimised storage to increase coalesced access. It is shown that the more simple approach is most efficient; since the need for large numbers of registers per thread in LBM limits the block size and thus the efficiency of these special features is reduced. Detailed results are obtained for a D3Q19 LBM solver, which is benchmarked on nVidia K5000M and K20C GPUs. In the latter case the use of a read-only data cache is explored, and peak performance of over 1036 Million Lattice Updates Per Second (MLUPS) is achieved. The appearance of a periodic bottleneck in the solver performance is also reported, believed to be hardware related; spikes in iteration-time occur with a frequency of around 11Hz for both GPUs, independent of the size of the problem.Comment: 12 page

    Transparent and efficient shared-state management for optimistic simulations on multi-core machines

    Get PDF
    Traditionally, Logical Processes (LPs) forming a simulation model store their execution information into disjoint simulations states, forcing events exchange to communicate data between each other. In this work we propose the design and implementation of an extension to the traditional Time Warp (optimistic) synchronization protocol for parallel/distributed simulation, targeted at shared-memory/multicore machines, allowing LPs to share parts of their simulation states by using global variables. In order to preserve optimism's intrinsic properties, global variables are transparently mapped to multi-version ones, so to avoid any form of safety predicate verification upon updates. Execution's consistency is ensured via the introduction of a new rollback scheme which is triggered upon the detection of an incorrect global variable's read. At the same time, efficiency in the execution is guaranteed by the exploitation of non-blocking algorithms in order to manage the multi-version variables' lists. Furthermore, our proposal is integrated with the simulation model's code through software instrumentation, in order to allow the application-level programmer to avoid using any specific API to mark or to inform the simulation kernel of updates to global variables. Thus we support full transparency. An assessment of our proposal, comparing it with a traditional message-passing implementation of variables' multi-version is provided as well. © 2012 IEEE
    • …
    corecore