137,478 research outputs found
Static locality analysis for cache management
Most memory references in numerical codes correspond to array references whose indices are affine functions of surrounding loop indices. These array references follow a regular predictable memory pattern that can be analysed at compile time. This analysis can provide valuable information like the locality exhibited by the program, which can be used to implement more intelligent caching strategy. In this paper we propose a static locality analysis oriented to the management of data caches. We show that previous proposals on locality analysis are not appropriate when the proposals have a high conflict miss ratio. This paper examines those proposals by introducing a compile-time interference analysis that significantly improve the performance of them. We first show how this analysis can be used to characterize the dynamic locality properties of numerical codes. This evaluation show for instance that a large percentage of references exhibit any type of locality. This motivates the use of a dual data cache, which has a module specialized to exploit temporal locality, and a selective cache respectively. Then, the performance provided by these two cache organizations is evaluated. In both organizations, the static locality analysis is responsible for tagging each memory instruction accordingly to the particular type(s) of locality that it exhibits.Peer ReviewedPostprint (published version
Beyond Reuse Distance Analysis: Dynamic Analysis for Characterization of Data Locality Potential
Emerging computer architectures will feature drastically decreased flops/byte
(ratio of peak processing rate to memory bandwidth) as highlighted by recent
studies on Exascale architectural trends. Further, flops are getting cheaper
while the energy cost of data movement is increasingly dominant. The
understanding and characterization of data locality properties of computations
is critical in order to guide efforts to enhance data locality. Reuse distance
analysis of memory address traces is a valuable tool to perform data locality
characterization of programs. A single reuse distance analysis can be used to
estimate the number of cache misses in a fully associative LRU cache of any
size, thereby providing estimates on the minimum bandwidth requirements at
different levels of the memory hierarchy to avoid being bandwidth bound.
However, such an analysis only holds for the particular execution order that
produced the trace. It cannot estimate potential improvement in data locality
through dependence preserving transformations that change the execution
schedule of the operations in the computation. In this article, we develop a
novel dynamic analysis approach to characterize the inherent locality
properties of a computation and thereby assess the potential for data locality
enhancement via dependence preserving transformations. The execution trace of a
code is analyzed to extract a computational directed acyclic graph (CDAG) of
the data dependences. The CDAG is then partitioned into convex subsets, and the
convex partitioning is used to reorder the operations in the execution trace to
enhance data locality. The approach enables us to go beyond reuse distance
analysis of a single specific order of execution of the operations of a
computation in characterization of its data locality properties. It can serve a
valuable role in identifying promising code regions for manual transformation,
as well as assessing the effectiveness of compiler transformations for data
locality enhancement. We demonstrate the effectiveness of the approach using a
number of benchmarks, including case studies where the potential shown by the
analysis is exploited to achieve lower data movement costs and better
performance.Comment: Transaction on Architecture and Code Optimization (2014
Locality-Adaptive Parallel Hash Joins Using Hardware Transactional Memory
Previous work [1] has claimed that the best performing implementation of in-memory hash joins is based on (radix-)partitioning of the build-side input. Indeed, despite the overhead of partitioning, the benefits from increased cache-locality and synchronization free parallelism in the build-phase outweigh the costs when the input data is randomly ordered. However, many datasets already exhibit significant spatial locality (i.e., non-randomness) due to the way data items enter the database: through periodic ETL or trickle loaded in the form of transactions. In such cases, the first benefit of partitioning — increased locality — is largely irrelevant. In this paper, we demonstrate how hardware transactional memory (HTM) can render the other benefit, freedom from synchronization, irrelevant as well. Specifically, using careful analysis and engineering, we develop an adaptive hash join implementation that outperforms parallel radix-partitioned hash joins as well as sort-merge joins on data with high spatial locality. In addition, we show how, through lightweight (less than 1% overhead) runtime monitoring of the transaction abort rate, our implementation can detect inputs with low spatial locality and dynamically fall back to radix-partitioning of the build-side input. The result is a hash join implementation that is more than 3 times faster than the state-of-the-art on high-locality data and never more than 1% slower
Truly Online Paging with Locality of Reference
The competitive analysis fails to model locality of reference in the online
paging problem. To deal with it, Borodin et. al. introduced the access graph
model, which attempts to capture the locality of reference. However, the access
graph model has a number of troubling aspects. The access graph has to be known
in advance to the paging algorithm and the memory required to represent the
access graph itself may be very large.
In this paper we present truly online strongly competitive paging algorithms
in the access graph model that do not have any prior information on the access
sequence. We present both deterministic and randomized algorithms. The
algorithms need only O(k log n) bits of memory, where k is the number of page
slots available and n is the size of the virtual address space. I.e.,
asymptotically no more memory than needed to store the virtual address
translation table.
We also observe that our algorithms adapt themselves to temporal changes in
the locality of reference. We model temporal changes in the locality of
reference by extending the access graph model to the so called extended access
graph model, in which many vertices of the graph can correspond to the same
virtual page. We define a measure for the rate of change in the locality of
reference in G denoted by Delta(G). We then show our algorithms remain strongly
competitive as long as Delta(G) >= (1+ epsilon)k, and no truly online algorithm
can be strongly competitive on a class of extended access graphs that includes
all graphs G with Delta(G) >= k- o(k).Comment: 37 pages. Preliminary version appeared in FOCS '9
Locality and Singularity for Store-Atomic Memory Models
Robustness is a correctness notion for concurrent programs running under
relaxed consistency models. The task is to check that the relaxed behavior
coincides (up to traces) with sequential consistency (SC). Although
computationally simple on paper (robustness has been shown to be
PSPACE-complete for TSO, PGAS, and Power), building a practical robustness
checker remains a challenge. The problem is that the various relaxations lead
to a dramatic number of computations, only few of which violate robustness.
In the present paper, we set out to reduce the search space for robustness
checkers. We focus on store-atomic consistency models and establish two
completeness results. The first result, called locality, states that a
non-robust program always contains a violating computation where only one
thread delays commands. The second result, called singularity, is even stronger
but restricted to programs without lightweight fences. It states that there is
a violating computation where a single store is delayed.
As an application of the results, we derive a linear-size source-to-source
translation of robustness to SC-reachability. It applies to general programs,
regardless of the data domain and potentially with an unbounded number of
threads and with unbounded buffers. We have implemented the translation and
verified, for the first time, PGAS algorithms in a fully automated fashion. For
TSO, our analysis outperforms existing tools
- …