Software and hardware methods for memory access latency reduction on ILP processors
While microprocessors have doubled their speed every 18 months, performance improvement of memory systems has continued to lag behind. To address the speed gap between CPU and memory, a standard multi-level caching organization has been built for fast data accesses before the data have to be accessed in the DRAM core. The existence of these caches in a computer system, such as L1, L2, L3, and DRAM row buffers, does not mean that data locality will be automatically exploited. The effective use of the memory hierarchy mainly depends on how data are allocated and how memory accesses are scheduled. In this dissertation, we propose several novel software and hardware techniques to effectively exploit data locality and to significantly reduce memory access latency. We first present a case study at the application level that restructures memory-intensive programs by utilizing program-specific knowledge. This study identifies the problem of bit-reversals, a set of data reordering operations used extensively in scientific computing programs such as FFT, whose special data access pattern can cause severe cache conflicts. We propose several software methods, including padding and blocking, to restructure the program and reduce those conflicts. Our methods outperform existing ones on both uniprocessor and multiprocessor systems. The access latency to the DRAM core has become increasingly long relative to CPU speed, making memory accesses an execution bottleneck. To reduce the frequency of DRAM core accesses and so shorten the overall memory access latency, we have conducted three studies at this level of the memory hierarchy. First, motivated by our evaluation of the DRAM row buffer's performance role and our findings on the causes of its access conflicts, we propose a simple and effective memory interleaving scheme to reduce or even eliminate row buffer conflicts.
Second, we propose a fine-grain priority scheduling scheme to reorder the sequence of data accesses on multi-channel memory systems, effectively exploiting the available bus bandwidth and access concurrency. In the final part of the dissertation, we first evaluate the design of cached DRAM and its organization alternatives associated with ILP processors. We then propose a new memory hierarchy integration that uses cached DRAM to construct a very large off-chip cache. We show that this structure outperforms a standard memory system with an off-level L3 cache for memory-intensive applications. Memory access latency has become a major performance bottleneck for memory-intensive applications. As long as DRAM technology retains its position as the most cost-effective choice for main memory, the memory performance problem will continue to exist. The studies conducted in this dissertation attempt to address this important issue. Our proposed software and hardware schemes are effective and applicable, and can be used directly in real-world memory system designs and implementations. Our studies also provide guidance for application programmers to understand memory performance implications, and for system architects to optimize memory hierarchies.
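The bit-reversal reordering named in this abstract can be illustrated with a minimal Python sketch. It shows only the plain permutation, not the padded or blocked methods the dissertation proposes; the function names `bit_reverse` and `bit_reversal_permute` are hypothetical and chosen for illustration.

```python
def bit_reverse(i: int, bits: int) -> int:
    """Return i with its lowest `bits` bits written in reverse order."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)  # move the lowest bit of i onto the end of r
        i >>= 1
    return r


def bit_reversal_permute(data):
    """Return a copy of data where position i holds data[bit_reverse(i)].

    The length must be a power of two, as in radix-2 FFT inputs.
    """
    n = len(data)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    bits = n.bit_length() - 1
    return [data[bit_reverse(i, bits)] for i in range(n)]


print(bit_reversal_permute(list(range(8))))
```

Consecutive output positions read source indices that are large power-of-two strides apart, which is the kind of access pattern that tends to map many elements onto the same cache sets.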
Simple, compact and robust approximate string dictionary
This paper is concerned with practical implementations of approximate string
dictionaries that allow edit errors. In this problem, the input is a dictionary
of strings over a finite alphabet. Given an error bound and a pattern, a query
has to return all the strings of the dictionary whose edit distance from the
pattern is at most the bound, where the edit distance between two strings is
defined as the minimum cost of a sequence of edit operations that transforms
one string into the other. The cost of a sequence of operations is defined as
the sum of the costs of the operations involved in the sequence. In this paper,
we assume that each of these operations has unit cost and consider only three
operations: deletion of one character, insertion of one character and
substitution of a character by another. We present a practical implementation
of the data structure we recently proposed, which works only for one error. We
extend the scheme to more than one error. Our implementation has many desirable
properties: it has a very fast and space-efficient building algorithm. The
dictionary data structure is compact and has fast and robust query time.
Finally, our data structure is simple to implement as it only uses basic
techniques from the literature, mainly hashing (linear probing and hash
signatures) and succinct data structures (bitvectors supporting rank queries).
Comment: Accepted to a journal (19 pages, 2 figures
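The unit-cost edit distance defined in this abstract can be sketched in Python with the standard dynamic program. The `query` function below is a brute-force baseline that scans every dictionary string; it is not the paper's data structure, and the names `edit_distance`, `query`, `pattern` and `k` are assumptions made for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Unit-cost edit distance: deletion, insertion, substitution."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, 1):
        cur = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at cost 0
            ))
        prev = cur
    return prev[-1]


def query(dictionary, pattern: str, k: int):
    """Brute-force baseline: every string within edit distance k of pattern."""
    return [s for s in dictionary if edit_distance(s, pattern) <= k]


print(query(["cat", "cart", "dog"], "cat", 1))
```

A linear scan like this costs one dynamic program per dictionary string, which is the overhead a dedicated approximate dictionary aims to avoid.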
Bit-Flip Aware Data Structures for Phase Change Memory
Big, byte-addressable, low-cost, and fast non-volatile memories like Phase Change Memory are appearing in the marketplace. They have the capability to unify memory and storage and allow us to rethink the present memory hierarchy. An important drawback of Phase Change Memory is its limited write endurance. In addition, Phase Change Memory shares with other Non-Volatile Random Access Memories an asymmetry in the energy costs of writes and reads. The best use of Non-Volatile Random Access Memories limits the number of times a memory cell changes contents, an event called a bit-flip. While the future of main memory is still unknown, we should already start to create data structures for these memories in order to shape the future era. This thesis investigates the creation of bit-flip aware data structures. The thesis first considers general ways in which a data structure can save bit-flips by smart overwrites and by using the exclusive-or of pointers. It then shows how a simple content-dependent encoding can reduce bit-flips for web corpora. It then shows how to build hash-based dictionary structures for Linear Hashing and Spiral Storage. Finally, the thesis presents Gray counters, close to bit-flip optimal counters that even enable age-based wear leveling with counters managed by the Non-Volatile Random Access Memories themselves instead of by the operating system.
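The bit-flip cost of a counter can be made concrete with a small Python sketch. It uses the standard binary-reflected Gray code as an assumption; the thesis's Gray counters may be constructed differently. The helper names `gray`, `bit_flips` and `total_flips` are hypothetical.

```python
def gray(n: int) -> int:
    """Binary-reflected Gray code of n; successive codes differ in one bit."""
    return n ^ (n >> 1)


def bit_flips(old: int, new: int) -> int:
    """Number of bit positions that change when `old` is overwritten by `new`."""
    return bin(old ^ new).count("1")


def total_flips(values) -> int:
    """Total bit-flips incurred by writing the values one after another."""
    return sum(bit_flips(a, b) for a, b in zip(values, values[1:]))


binary_counts = list(range(16))
gray_counts = [gray(i) for i in binary_counts]
print(total_flips(binary_counts), total_flips(gray_counts))
```

Counting from 0 to 15 in plain binary flips 26 bits in total, while the Gray-coded sequence flips exactly one bit per increment, 15 in total.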
Sixteen space-filling curves and traversals for d-dimensional cubes and simplices
This article describes sixteen different ways to traverse d-dimensional space
recursively in a way that is well-defined for any number of dimensions. Each of
these traversals has distinct properties that may be beneficial for certain
applications. Some of the traversals are novel, some have been known in
principle but had not been described adequately for any number of dimensions,
some of the traversals have been known. This article is the first to present
them all in a consistent notation system. Furthermore, with this article, tools
are provided to enumerate points in a regular grid in the order in which they
are visited by each traversal. In particular, we cover: five discontinuous
traversals based on subdividing cubes into 2^d subcubes: Z-traversal (Morton
indexing), U-traversal, Gray-code traversal, Double-Gray-code traversal, and
Inside-out traversal; two discontinuous traversals based on subdividing
simplices into 2^d subsimplices: the Hill-Z traversal and the Maehara-reflected
traversal; five continuous traversals based on subdividing cubes into 2^d
subcubes: the Base-camp Hilbert curve, the Harmonious Hilbert curve, the Alfa
Hilbert curve, the Beta Hilbert curve, and the Butz-Hilbert curve; four
continuous traversals based on subdividing cubes into 3^d subcubes: the Peano
curve, the Coil curve, the Half-coil curve, and the Meurthe curve. All of these
traversals are self-similar in the sense that the traversal in each of the
subcubes or subsimplices of a cube or simplex, on any level of recursive
subdivision, can be obtained by scaling, translating, rotating, reflecting
and/or reversing the traversal of the complete unit cube or simplex.
Comment: 28 pages, 12 figures. v2: fixed a confusing typo on page 12, line
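As an illustration of the simplest traversal listed, the Z-traversal (Morton indexing), the following Python sketch enumerates the points of a regular d-dimensional grid by interleaving coordinate bits. It is not the article's tooling; the function names are hypothetical, and placing axis 0 in the least significant position is an assumed convention.

```python
from itertools import product


def morton_index(coords, bits: int) -> int:
    """Interleave the lowest `bits` bits of each coordinate into one index.

    Assumed convention: axis 0 contributes the least significant bit of
    each group of d bits.
    """
    d = len(coords)
    index = 0
    for b in range(bits):
        for axis, c in enumerate(coords):
            index |= ((c >> b) & 1) << (b * d + axis)
    return index


def z_traversal(d: int, bits: int):
    """All points of a (2**bits)**d grid, in Z-traversal order."""
    side = 1 << bits
    return sorted(product(range(side), repeat=d),
                  key=lambda p: morton_index(p, bits))


print(z_traversal(2, 1))
```

Sorting by the interleaved index visits each of the 2^d subcubes completely before moving to the next, which is what makes the traversal recursive, and the jumps between subcubes are why it is discontinuous.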
Coding for Racetrack Memories
Racetrack memory is a new technology which utilizes magnetic domains along a
nanoscopic wire in order to obtain extremely high storage density. In racetrack
memory, each magnetic domain can store a single bit of information, which can
be sensed by a reading port (head). The memory has a tape-like structure which
supports a shift operation that moves the domains to be read sequentially by
the head. In order to increase the memory's speed, prior work studied how to
minimize the latency of the shift operation, while the no less important
reliability of this operation has received only a little attention.
In this work we design codes which combat shift errors in racetrack memory,
called position errors. Namely, shifting the domains is not an error-free
operation and the domains may be over-shifted or not shifted at all, which can be
modeled as deletions and sticky insertions. While it is possible to use
conventional deletion and insertion-correcting codes, we tackle this problem
with the special structure of racetrack memory, where the domains can be read
by multiple heads. Each head outputs a noisy version of the stored data and the
multiple outputs are combined in order to reconstruct the data. Under this
paradigm, we will show that it is possible to correct, with at most a single
bit of redundancy, deletions with heads if the heads are
well-separated. Similar results are provided for bursts of deletions, sticky
insertions and combinations of both deletions and sticky insertions
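The error model described in this abstract can be sketched for a single head in Python. This is a simplified illustration, not the paper's code construction: an over-shift is assumed to skip exactly one domain (a deletion), and a failed shift is assumed to make the head read the same domain twice (a sticky insertion). The function and parameter names are hypothetical.

```python
def read_with_position_errors(bits, over_shift_at=(), no_shift_at=()):
    """Simulate one head reading `bits` under position errors.

    over_shift_at: indices of domains skipped by an over-shift (deletions).
    no_shift_at:   indices of domains read twice because a shift did not
                   happen (sticky insertions).
    """
    out = []
    for i, b in enumerate(bits):
        if i in over_shift_at:
            continue       # the domain passes the head unread
        out.append(b)
        if i in no_shift_at:
            out.append(b)  # the same domain is sensed a second time
    return out


stored = [1, 0, 1, 1]
print(read_with_position_errors(stored, over_shift_at={1}))
print(read_with_position_errors(stored, no_shift_at={0}))
```

A sticky insertion can only repeat the bit that was just read, which makes it a more restricted error than a general insertion.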
Numerics of High Performance Computers and Benchmark Evaluation of Distributed Memory Computers
The internal representation of numerical data and the speed of their manipulation to generate the desired result, through efficient utilisation of the central processing unit, memory, and communication links, are essential to all high performance scientific computations. Machine parameters, in particular, reveal the accuracy and error bounds of computation, which are required for performance tuning of codes. This paper reports diagnosis of machine parameters, measurement of the computing power of several workstations, serial and parallel computers, and a component-wise test procedure for distributed memory computers. Hierarchical memory structure is illustrated by block copying and unrolling techniques. Locality of reference for cache reuse of data is demonstrated by fast Fourier transform codes. Cache- and register-blocking techniques result in their optimum utilisation, with a consequent gain in throughput during vector-matrix operations. Implementation of these memory management techniques reduces cache inefficiency loss, which is known to be proportional to the number of processors. Of the two Linux clusters-ANUP16, HPC22 and HPC64, it has been found from the measurement of intrinsic parameters and from an application benchmark of a multi-block Euler code test run that ANUP16 is suitable for problems that exhibit fine-grained parallelism. The delivered performance of ANUP16 is of immense utility for developing high-end PC clusters like HPC64 and customised parallel computers with the added advantage of speed and a high degree of parallelism.