954 research outputs found
Near-Memory Address Translation
Memory and logic integration on the same chip is becoming increasingly cost
effective, creating the opportunity to offload data-intensive functionality to
processing units placed inside memory chips. The introduction of memory-side
processing units (MPUs) into conventional systems faces virtual memory as the
first big showstopper: without efficient hardware support for address
translation MPUs have highly limited applicability. Unfortunately, conventional
translation mechanisms fall short of providing fast translations as
contemporary memories exceed the reach of TLBs, making expensive page walks
common.
In this paper, we are the first to show that the historically important
flexibility to map any virtual page to any page frame is unnecessary in today's
servers. We find that while limiting the associativity of the
virtual-to-physical mapping incurs no penalty, it can break the
translate-then-fetch serialization if combined with careful data placement in
the MPU's memory, allowing for translation and data fetch to proceed
independently and in parallel. We propose the Distributed Inverted Page Table
(DIPTA), a near-memory structure in which the smallest memory partition keeps
the translation information for its data share, ensuring that the translation
completes together with the data fetch. DIPTA completely eliminates the
performance overhead of translation, achieving speedups of up to 3.81x and
2.13x over conventional translation using 4KB and 1GB pages respectively.Comment: 15 pages, 9 figure
A Study on Performance and Power Efficiency of Dense Non-Volatile Caches in Multi-Core Systems
In this paper, we present a novel cache design based on Multi-Level Cell
Spin-Transfer Torque RAM (MLC STTRAM) that can dynamically adapt the set
capacity and associativity to use efficiently the full potential of MLC STTRAM.
We exploit the asymmetric nature of the MLC storage scheme to build cache lines
featuring heterogeneous performances, that is, half of the cache lines are
read-friendly, while the other is write-friendly. Furthermore, we propose to
opportunistically deactivate ways in underutilized sets to convert MLC to
Single-Level Cell (SLC) mode, which features overall better performance and
lifetime. Our ultimate goal is to build a cache architecture that combines the
capacity advantages of MLC and performance/energy advantages of SLC. Our
experiments show an improvement of 43% in total numbers of conflict misses, 27%
in memory access latency, 12% in system performance, and 26% in LLC access
energy, with a slight degradation in cache lifetime (about 7%) compared to an
SLC cache
Locality-Adaptive Parallel Hash Joins Using Hardware Transactional Memory
Previous work [1] has claimed that the best performing implementation of in-memory hash joins is based on (radix-)partitioning of the build-side input. Indeed, despite the overhead of partitioning, the benefits from increased cache-locality and synchronization free parallelism in the build-phase outweigh the costs when the input data is randomly ordered. However, many datasets already exhibit significant spatial locality (i.e., non-randomness) due to the way data items enter the database: through periodic ETL or trickle loaded in the form of transactions. In such cases, the first benefit of partitioning — increased locality — is largely irrelevant. In this paper, we demonstrate how hardware transactional memory (HTM) can render the other benefit, freedom from synchronization, irrelevant as well. Specifically, using careful analysis and engineering, we develop an adaptive hash join implementation that outperforms parallel radix-partitioned hash joins as well as sort-merge joins on data with high spatial locality. In addition, we show how, through lightweight (less than 1% overhead) runtime monitoring of the transaction abort rate, our implementation can detect inputs with low spatial locality and dynamically fall back to radix-partitioning of the build-side input. The result is a hash join implementation that is more than 3 times faster than the state-of-the-art on high-locality data and never more than 1% slower
Communication-optimal Parallel and Sequential Cholesky Decomposition
Numerical algorithms have two kinds of costs: arithmetic and communication,
by which we mean either moving data between levels of a memory hierarchy (in
the sequential case) or over a network connecting processors (in the parallel
case). Communication costs often dominate arithmetic costs, so it is of
interest to design algorithms minimizing communication. In this paper we first
extend known lower bounds on the communication cost (both for bandwidth and for
latency) of conventional (O(n^3)) matrix multiplication to Cholesky
factorization, which is used for solving dense symmetric positive definite
linear systems. Second, we compare the costs of various Cholesky decomposition
implementations to these lower bounds and identify the algorithms and data
structures that attain them. In the sequential case, we consider both the
two-level and hierarchical memory models. Combined with prior results in [13,
14, 15], this gives a set of communication-optimal algorithms for O(n^3)
implementations of the three basic factorizations of dense linear algebra: LU
with pivoting, QR and Cholesky. But it goes beyond this prior work on
sequential LU by optimizing communication for any number of levels of memory
hierarchy.Comment: 29 pages, 2 tables, 6 figure
A Matrix--Matrix Multiplication methodology for single/multi-core architectures using SIMD
In this paper, a new methodology for speeding up Matrix–Matrix Multiplication
using Single Instruction Multiple Data unit, at one and more cores having a
shared cache, is presented. This methodology achieves higher execution speed than
ATLAS state of the art library (speedup from 1.08 up to 3.5), by decreasing the number
of instructions (load/store and arithmetic) and the data cache accesses and misses in
thememory hierarchy. This is achieved by fully exploiting the software characteristics
(e.g. data reuse) and hardware parameters (e.g. data caches sizes and associativities)
as one problem and not separately, giving high quality solutions and a smaller search
space
0.5 Petabyte Simulation of a 45-Qubit Quantum Circuit
Near-term quantum computers will soon reach sizes that are challenging to
directly simulate, even when employing the most powerful supercomputers. Yet,
the ability to simulate these early devices using classical computers is
crucial for calibration, validation, and benchmarking. In order to make use of
the full potential of systems featuring multi- and many-core processors, we use
automatic code generation and optimization of compute kernels, which also
enables performance portability. We apply a scheduling algorithm to quantum
supremacy circuits in order to reduce the required communication and simulate a
45-qubit circuit on the Cori II supercomputer using 8,192 nodes and 0.5
petabytes of memory. To our knowledge, this constitutes the largest quantum
circuit simulation to this date. Our highly-tuned kernels in combination with
the reduced communication requirements allow an improvement in time-to-solution
over state-of-the-art simulations by more than an order of magnitude at every
scale
A high-performance matrix-matrix multiplication methodology for CPU and GPU architectures
Current compilers cannot generate code that can compete with hand-tuned code in efficiency, even for a simple kernel like matrix–matrix multiplication (MMM). A key step in program optimization is the estimation of optimal values for parameters such as tile sizes and number of levels of tiling. The scheduling parameter values selection is a very difficult and time-consuming task, since parameter values depend on each other; this is why they are found by using searching methods and empirical techniques. To overcome this problem, the scheduling sub-problems must be optimized together, as one problem and not separately. In this paper, an MMM methodology is presented where the optimum scheduling parameters are found by decreasing the search space theoretically, while the major scheduling sub-problems are addressed together as one problem and not separately according to the hardware architecture parameters and input size; for different hardware architecture parameters and/or input sizes, a different implementation is produced. This is achieved by fully exploiting the software characteristics (e.g., data reuse) and hardware architecture parameters (e.g., data caches sizes and associativities), giving high-quality solutions and a smaller search space. This methodology refers to a wide range of CPU and GPU architectures
- …