HeTM: Transactional Memory for Heterogeneous Systems
Modern heterogeneous computing architectures, which couple multi-core CPUs
with discrete many-core GPUs (or other specialized hardware accelerators),
enable unprecedented peak performance and energy efficiency levels.
Unfortunately, though, developing applications that can take full advantage of
the potential of heterogeneous systems is a notoriously hard task. This work
takes a step towards reducing the complexity of programming heterogeneous
systems by introducing the abstraction of Heterogeneous Transactional Memory
(HeTM). HeTM provides programmers with the illusion of a single memory region,
shared among the CPUs and the (discrete) GPU(s) of a heterogeneous system, with
support for atomic transactions. Besides introducing the abstract semantics and
programming model of HeTM, we present the design and evaluation of a concrete
implementation of the proposed abstraction, which we named Speculative HeTM
(SHeTM). SHeTM makes use of a novel design that leverages speculative
techniques and aims at hiding the inherently large communication latency
between CPUs and discrete GPUs and at minimizing inter-device synchronization
overhead. SHeTM is based on a modular and extensible design that makes it easy
to integrate alternative TM implementations on the CPU and GPU sides,
giving the flexibility to adopt, on either side, the TM implementation
(e.g., in hardware or software) that best fits the applications' workload and
the architectural characteristics of the processing unit. We demonstrate the
efficiency of SHeTM via an extensive quantitative study based on both
synthetic benchmarks and a port of a popular object caching system.
Comment: The current work was accepted in the 28th International Conference on
Parallel Architectures and Compilation Techniques (PACT'19
Object-oriented Neural Programming (OONP) for Document Understanding
We propose Object-oriented Neural Programming (OONP), a framework for
semantically parsing documents in specific domains. Basically, OONP reads a
document and parses it into a predesigned object-oriented data structure
(referred to as ontology in this paper) that reflects the domain-specific
semantics of the document. An OONP parser models semantic parsing as a decision
process: a neural net-based Reader sequentially goes through the document, and
during the process it builds and updates an intermediate ontology to summarize
its partial understanding of the text it covers. OONP supports a rich family of
operations (both symbolic and differentiable) for composing the ontology, and a
wide variety of forms (both symbolic and differentiable) for representing the
state and the document. An OONP parser can be trained with supervision of
different forms and strengths, including supervised learning (SL),
reinforcement learning (RL), and a hybrid of the two. Our experiments on both
synthetic and real-world document parsing tasks have shown that OONP can learn
to handle fairly complicated ontologies with training data of modest size.
Comment: accepted by ACL 201
Memory transfer optimization for a lattice Boltzmann solver on Kepler architecture nVidia GPUs
The Lattice Boltzmann method (LBM) for solving fluid flow is naturally well
suited to an efficient implementation for massively parallel computing, due to
the prevalence of local operations in the algorithm. This paper presents and
analyses the performance of a 3D lattice Boltzmann solver, optimized for third
generation nVidia GPU hardware, also known as `Kepler'. We provide a review of
previous optimisation strategies and analyse data read/write times for
different memory types. In LBM, the time propagation step (known as streaming)
involves shifting data to adjacent locations and is central to parallel
performance; here we examine three approaches which make use of different
hardware options. Two of these make use of `performance enhancing' features of
the GPU: shared memory and the new shuffle instruction found in Kepler-based
GPUs. These are compared to a standard transfer of data which relies instead on
optimised storage to increase coalesced access. It is shown that the simpler
approach is the most efficient, since the need for large numbers of
registers per thread in LBM limits the block size and thus reduces the
efficiency of these special features. Detailed results are obtained for a D3Q19
LBM solver, which is benchmarked on nVidia K5000M and K20C GPUs. In the latter
case the use of a read-only data cache is explored, and peak performance of
over 1036 Million Lattice Updates Per Second (MLUPS) is achieved. The
appearance of a periodic bottleneck in the solver performance is also reported,
believed to be hardware related; spikes in iteration-time occur with a
frequency of around 11 Hz for both GPUs, independent of the size of the problem.
Comment: 12 page
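The streaming step mentioned in the abstract above (shifting each population to the adjacent lattice node along its velocity) can be illustrated with a minimal CPU sketch. This is not the paper's GPU implementation: it assumes a smaller D2Q9 lattice instead of D3Q19, periodic boundaries, and a pull-style update, and the names `VELOCITIES` and `stream` are hypothetical.

```python
# D2Q9 lattice velocities (cx, cy). Assumption: the paper uses D3Q19 on a GPU;
# a 2D lattice is used here only to keep the illustration short.
VELOCITIES = [(0, 0), (1, 0), (0, 1), (-1, 0), (0, -1),
              (1, 1), (-1, 1), (-1, -1), (1, -1)]


def stream(f, nx, ny):
    """Pull-style streaming with periodic boundaries.

    f[i][x][y] is the population travelling along VELOCITIES[i] at node (x, y).
    Each node pulls population i from its upstream neighbour (x - cx, y - cy).
    """
    out = []
    for i, (cx, cy) in enumerate(VELOCITIES):
        out.append([[f[i][(x - cx) % nx][(y - cy) % ny] for y in range(ny)]
                    for x in range(nx)])
    return out


def empty_lattice(nx, ny):
    """Helper for the example: a lattice with all populations set to zero."""
    return [[[0.0] * ny for _ in range(nx)] for _ in VELOCITIES]
```

For example, a unit population placed at node (0, 0) in direction 1, whose velocity is (1, 0), ends up at node (1, 0) after one call to `stream`. On a GPU, the memory layout of `f` determines whether neighbouring threads read neighbouring addresses, which is the coalescing concern the abstract refers to.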
Transparent and efficient shared-state management for optimistic simulations on multi-core machines
Traditionally, Logical Processes (LPs) forming a simulation model store their execution information in disjoint simulation states, forcing them to exchange events in order to communicate data to each other. In this work we propose the design and implementation of an extension to the traditional Time Warp (optimistic) synchronization protocol for parallel/distributed simulation, targeted at shared-memory/multicore machines, allowing LPs to share parts of their simulation states by using global variables. In order to preserve the intrinsic properties of optimism, global variables are transparently mapped to multi-version ones, so as to avoid any form of safety predicate verification upon updates. Execution consistency is ensured via the introduction of a new rollback scheme which is triggered upon the detection of an incorrect read of a global variable. At the same time, efficiency in the execution is guaranteed by the exploitation of non-blocking algorithms to manage the lists of versions of each variable. Furthermore, our proposal is integrated with the simulation model's code through software instrumentation, so that the application-level programmer does not need to use any specific API to mark updates to global variables or to inform the simulation kernel of them. Thus we support full transparency. An assessment of our proposal, comparing it with a traditional message-passing implementation of multi-version variables, is provided as well. © 2012 IEEE
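The idea of mapping a global variable to a multi-version one, and triggering a rollback when a read turns out to be incorrect, can be sketched as follows. This is only an illustration under stated assumptions: it is single-threaded (the paper relies on non-blocking algorithms, which are not shown), it assumes distinct write timestamps, and the class `MultiVersionVar` and its methods are hypothetical names, not the paper's API.

```python
import bisect


class MultiVersionVar:
    """Hypothetical multi-version global variable keyed by simulation time."""

    def __init__(self, initial):
        self._times = [float("-inf")]  # sorted version timestamps
        self._values = [initial]       # value installed at each timestamp
        self._reads = []               # log of (read_time, lp_id, version_time)

    def read(self, lp_id, now):
        # Return the latest version written at or before simulation time `now`,
        # and remember which version this LP observed.
        i = bisect.bisect_right(self._times, now) - 1
        self._reads.append((now, lp_id, self._times[i]))
        return self._values[i]

    def write(self, now, value):
        # Install a new version at time `now` (assumed distinct from existing
        # write timestamps). A logged read is stale if it happened at or after
        # `now` but observed a version older than `now`.
        i = bisect.bisect_right(self._times, now)
        self._times.insert(i, now)
        self._values.insert(i, value)
        stale = sorted({lp for (t, lp, seen) in self._reads if seen < now <= t})
        # Forget the invalidated reads; those LPs re-read after rolling back.
        self._reads = [r for r in self._reads if not (r[2] < now <= r[0])]
        return stale
```

In this sketch, `write` returns the identifiers of the LPs that would have to roll back. If an LP reads at time 15 and sees the version written at time 10, a later straggler write at time 12 reports that LP as stale, while an LP that read at time 5 is unaffected.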