Store-Ordered Streaming of Shared Memory
Coherence misses in shared-memory multiprocessors account for a substantial fraction of execution time in many important scientific and commercial workloads. Memory streaming provides a promising solution to the coherence miss bottleneck because it improves memory level parallelism and lookahead while using on-chip resources efficiently. We observe that the order in which shared data are consumed by one processor is correlated to the order in which they were produced by another. We investigate this phenomenon and demonstrate that it can be exploited to send Store-ORDered Streams (SORDS) of shared data from producers to consumers, thereby eliminating coherent read misses. Using a trace-driven analysis of all user and OS memory references in a cache-coherent distributed shared-memory multiprocessor, we show that SORDS-based memory streaming can eliminate between 36% and 100% of all coherent read misses in scientific workloads and between 23% and 48% in online transaction processing workloads.
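The correlation between store order and consumption order can be illustrated with a toy trace replay. The sketch below is only an illustration under stated assumptions, not the mechanism or the analysis from the paper: the function name `count_streamed_hits`, the `window` parameter, and the re-synchronisation rule on a miss are all hypothetical, and each address is assumed to appear once in the producer's store order.

```python
def count_streamed_hits(store_order, read_order, window=4):
    """Count consumer reads that a store-ordered stream would cover.

    Toy model (an assumption, not the paper's design): the producer's
    stores are forwarded in store order, and the consumer can hold the
    next `window` entries of that stream. A read that finds its address
    among those entries counts as an eliminated miss.
    """
    pos = 0   # index of the next stream entry not yet consumed
    hits = 0
    for addr in read_order:
        ahead = store_order[pos:pos + window]
        if addr in ahead:
            hits += 1
            # Advance the stream past the entry that was just consumed.
            pos += ahead.index(addr) + 1
        elif addr in store_order:
            # Miss: restart the stream just after the missed address.
            pos = store_order.index(addr) + 1
    return hits


stores = [10, 11, 12, 13, 14, 15]
print(count_streamed_hits(stores, stores, window=1))
print(count_streamed_hits(stores, stores[::-1], window=1))
```

When the consumer reads in exactly the producer's store order, every read is covered even with a window of one entry; when it reads in reverse order, none is.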
S-Store: Streaming Meets Transaction Processing
Stream processing addresses the needs of real-time applications. Transaction
processing addresses the coordination and safety of short atomic computations.
Heretofore, these two modes of operation existed in separate, stove-piped
systems. In this work, we attempt to fuse the two computational paradigms in a
single system called S-Store. In this way, S-Store can simultaneously
accommodate OLTP and streaming applications. We present a simple transaction
model for streams that integrates seamlessly with a traditional OLTP system. We
chose to build S-Store as an extension of H-Store, an open-source, in-memory,
distributed OLTP database system. By implementing S-Store in this way, we can
make use of the transaction processing facilities that H-Store already
supports, and we can concentrate on the additional implementation features that
are needed to support streaming. Similar implementations could be done using
other main-memory OLTP platforms. We show that we can actually achieve higher
throughput for streaming workloads in S-Store than an equivalent deployment in
H-Store alone. We also show how this can be achieved within H-Store with the
addition of a modest amount of new functionality. Furthermore, we compare
S-Store to two state-of-the-art streaming systems, Spark Streaming and Storm,
and show how S-Store matches and sometimes exceeds their performance while
providing stronger transactional guarantees.
Single-Producer/Single-Consumer Queues on Shared Cache Multi-Core Systems
Using efficient point-to-point communication channels is critical for
implementing fine-grained parallel programs on modern shared-cache multi-core
architectures.
This report discusses in detail several implementations of wait-free
Single-Producer/Single-Consumer (SPSC) queues, and presents a novel and
efficient algorithm for the implementation of an unbounded wait-free SPSC queue
(uSPSC). The correctness proof of the new algorithm and several performance
measurements based on simple synthetic benchmarks and microbenchmarks are also
discussed.
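For readers unfamiliar with SPSC queues, a minimal bounded ring buffer in the Lamport style is sketched below. It is not the unbounded uSPSC algorithm of the report, and the names `SPSCQueue`, `try_push`, `try_pop` and `transfer` are hypothetical. The sketch also assumes CPython, whose global interpreter lock supplies the ordering between filling a slot and publishing the index; a C or C++ version would need explicit release/acquire ordering on the index updates.

```python
import threading
import time


class SPSCQueue:
    """Bounded single-producer/single-consumer ring buffer (Lamport style)."""

    def __init__(self, capacity):
        # One slot is always left empty so that full and empty differ.
        self._buf = [None] * (capacity + 1)
        self._head = 0  # next slot to read; written only by the consumer
        self._tail = 0  # next slot to write; written only by the producer

    def try_push(self, item):
        nxt = (self._tail + 1) % len(self._buf)
        if nxt == self._head:
            return False  # full: never blocks, the caller decides what to do
        self._buf[self._tail] = item
        self._tail = nxt  # publish only after the slot has been filled
        return True

    def try_pop(self):
        if self._head == self._tail:
            return False, None  # empty
        item = self._buf[self._head]
        self._buf[self._head] = None
        self._head = (self._head + 1) % len(self._buf)
        return True, item


def transfer(items, capacity=4):
    """Move items through the queue using one producer and one consumer thread."""
    q = SPSCQueue(capacity)
    received = []

    def producer():
        for x in items:
            while not q.try_push(x):
                time.sleep(0)  # yield while the queue is full

    def consumer():
        while len(received) < len(items):
            ok, x = q.try_pop()
            if ok:
                received.append(x)
            else:
                time.sleep(0)  # yield while the queue is empty

    threads = [threading.Thread(target=producer),
               threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return received
```

Because each index is written by exactly one thread, no lock is needed: the producer only advances the tail and the consumer only advances the head.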
Memory transfer optimization for a lattice Boltzmann solver on Kepler architecture nVidia GPUs
The Lattice Boltzmann method (LBM) for solving fluid flow is naturally well
suited to an efficient implementation for massively parallel computing, due to
the prevalence of local operations in the algorithm. This paper presents and
analyses the performance of a 3D lattice Boltzmann solver, optimized for third
generation nVidia GPU hardware, also known as `Kepler'. We provide a review of
previous optimisation strategies and analyse data read/write times for
different memory types. In LBM, the time propagation step (known as streaming),
involves shifting data to adjacent locations and is central to parallel
performance; here we examine three approaches which make use of different
hardware options. Two of these make use of `performance enhancing' features of
the GPU: shared memory and the new shuffle instruction found in Kepler-based
GPUs. These are compared to a standard transfer of data which relies instead on
optimised storage to increase coalesced access. It is shown that the simpler
approach is the most efficient, since the need for large numbers of
registers per thread in LBM limits the block size and thus reduces the
efficiency of these special features. Detailed results are obtained for a D3Q19
LBM solver, which is benchmarked on nVidia K5000M and K20C GPUs. In the latter
case the use of a read-only data cache is explored, and peak performance of
over 1036 Million Lattice Updates Per Second (MLUPS) is achieved. The
appearance of a periodic bottleneck in the solver performance is also reported,
believed to be hardware related; spikes in iteration-time occur with a
frequency of around 11 Hz for both GPUs, independent of the size of the problem.
Comment: 12 pages
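The streaming step described in this abstract can be sketched on the CPU in a few lines. The example below is an illustration only, not the GPU implementation from the paper: the dictionary-based storage, the periodic boundaries, and the names `VELOCITIES` and `stream` are assumptions made for brevity. It builds the D3Q19 velocity set and shifts each population one cell along its own velocity.

```python
from itertools import product

# D3Q19: the rest vector, 6 axis neighbours and 12 face diagonals.
# The 8 cube corners (three non-zero components) are excluded.
VELOCITIES = [c for c in product((-1, 0, 1), repeat=3)
              if sum(map(abs, c)) <= 2]


def stream(f, shape):
    """Push every population one cell along its velocity (periodic wrap).

    f[i] is a dict mapping a cell (x, y, z) to population i at that cell;
    this layout is an assumption chosen for readability, not speed.
    """
    nx, ny, nz = shape
    out = []
    for (cx, cy, cz), fi in zip(VELOCITIES, f):
        out.append({((x + cx) % nx, (y + cy) % ny, (z + cz) % nz): value
                    for (x, y, z), value in fi.items()})
    return out


shape = (3, 3, 3)
cells = list(product(range(3), repeat=3))
f = [{cell: 0.0 for cell in cells} for _ in VELOCITIES]
for fi in f:
    fi[(0, 0, 0)] = 1.0  # place one unit of every population at the origin
f = stream(f, shape)
print(len(VELOCITIES))
```

After one step, population i has moved from the origin to the cell given by its velocity, wrapped around the grid edges. On a GPU the same shift becomes a pattern of memory reads and writes, which is why the choice of transfer strategy matters for performance.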
Security, Performance and Energy Trade-offs of Hardware-assisted Memory Protection Mechanisms
The deployment of large-scale distributed systems, e.g., publish-subscribe
platforms, that operate over sensitive data using the infrastructure of public
cloud providers, is nowadays heavily hindered by the surging lack of trust
toward the cloud operators. Although purely software-based solutions exist to
protect the confidentiality of data and the processing itself, such as
homomorphic encryption schemes, their performance is far from being practical
under real-world workloads.
This practical experience report describes the performance trade-offs of two
novel hardware-assisted memory protection mechanisms currently available on the
market to tackle this problem, namely AMD SEV and Intel SGX.
Specifically, we implement a publish/subscribe use-case and
evaluate the impact of the memory protection mechanisms on the resulting
performance. This paper reports on the experience gained while building this
system, in particular when having to cope with the technical limitations
imposed by SEV and SGX.
Several trade-offs that provide valuable insights in terms of latency,
throughput, processing time and energy requirements are exhibited by means of
micro- and macro-benchmarks.
Comment: European Commission Project: LEGaTO - Low Energy Toolset for
Heterogeneous Computing (EC-H2020-780681)