Search CORE

383 research outputs found

Cache-Oblivious Selection in Sorted X+Y Matrices

Author: de Berg Mark
Thite Shripad
Publication venue
Publication date: 01/01/2008
Field of study

Let X[0..n-1] and Y[0..m-1] be two sorted arrays, and define the mxn matrix A by A[j][i]=X[i]+Y[j]. Frederickson and Johnson gave an efficient algorithm for selecting the k-th smallest element from A. We show how to make this algorithm IO-efficient. Our cache-oblivious algorithm performs O((m+n)/B) IOs, where B is the block size of memory transfers

arXiv.org e-Print Archive

CiteSeerX

Pure OAI Repository

Caltech Authors

Towards a theory of cache-efficient algorithms

Author: Chatterjee Siddhartha
Sen Sandeep
Publication venue: Society for Industrial and Applied Mathematics Philadelphia
Publication date: 01/01/2000
Field of study

We describe a model that enables us to analyze the running time of an algorithm in a computer with a memory hierarchy with limited associativity, in terms of various cache parameters. Our model, an extension of Aggarwal and Vitter's I/O model, enables us to establish useful relationships between the cache complexity and the I/O complexity of computations. As a corollary, we obtain cache-optimal algorithms for some fundamental problems like sorting, FFT, and an important subclass of permutations in the single-level cache model. We also show that ignoring associativity concerns could lead to inferior performance, by analyzing the average-case cache behavior of mergesort. We further extend our model to multiple levels of cache with limited associativity and present optimal algorithms for matrix transpose and sorting. Our techniques may be used for systematic exploitation of the memory hierarchy starting from the algorithm design stage, and dealing with the hitherto unresolved problem of limited associativity

Beyond Reuse Distance Analysis: Dynamic Analysis for Characterization of Data Locality Potential

Author: Elango Venmugil
Fauzia Naznin
Pouchet Louis-Noël
Ramanujam J.
Rastello Fabrice
Ravishankar Mahesh
Rountev Atanas
Sadayappan P.
Publication venue
Publication date: 01/12/2013
Field of study

Emerging computer architectures will feature drastically decreased flops/byte (ratio of peak processing rate to memory bandwidth) as highlighted by recent studies on Exascale architectural trends. Further, flops are getting cheaper while the energy cost of data movement is increasingly dominant. The understanding and characterization of data locality properties of computations is critical in order to guide efforts to enhance data locality. Reuse distance analysis of memory address traces is a valuable tool to perform data locality characterization of programs. A single reuse distance analysis can be used to estimate the number of cache misses in a fully associative LRU cache of any size, thereby providing estimates on the minimum bandwidth requirements at different levels of the memory hierarchy to avoid being bandwidth bound. However, such an analysis only holds for the particular execution order that produced the trace. It cannot estimate potential improvement in data locality through dependence preserving transformations that change the execution schedule of the operations in the computation. In this article, we develop a novel dynamic analysis approach to characterize the inherent locality properties of a computation and thereby assess the potential for data locality enhancement via dependence preserving transformations. The execution trace of a code is analyzed to extract a computational directed acyclic graph (CDAG) of the data dependences. The CDAG is then partitioned into convex subsets, and the convex partitioning is used to reorder the operations in the execution trace to enhance data locality. The approach enables us to go beyond reuse distance analysis of a single specific order of execution of the operations of a computation in characterization of its data locality properties. It can serve a valuable role in identifying promising code regions for manual transformation, as well as assessing the effectiveness of compiler transformations for data locality enhancement. We demonstrate the effectiveness of the approach using a number of benchmarks, including case studies where the potential shown by the analysis is exploited to achieve lower data movement costs and better performance.Comment: Transaction on Architecture and Code Optimization (2014

arXiv.org e-Print Archive

HAL-ENS-LYON

INRIA a CCSD electronic archive server

Hal-Diderot

LEMP: Fast Retrieval of Large Entries in a Matrix Product

Author: Gemulla Rainer
Mykytiuk Olga
Teflioudi Christina
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2015
Field of study

MAnnheim DOCument Server

MPG.PuRe

White-box methodologies, programming abstractions and libraries

Author: Atalar Aras
Gidenstam Anders
Ha Hoai Phuong
Renaud-Goud Paul
Tran Ngoc Nha Vi
Tsigas Philippas
Umar Ibrahim
Publication venue: The EXCESS Consortium
Publication date: 01/01/2015
Field of study

EXCESS deliverable D2.2. More information at http://www.excess-project.eu/This deliverable reports the results of white-box methodologies and early results ofthe first prototype of libraries and programming abstractions as available by projectmonth 18 by Work Package 2 (WP2). It reports i) the latest results of Task 2.2on white-box methodologies, programming abstractions and libraries for developingenergy-efficient data structures and algorithms and ii) the improved results of Task2.1 on investigating and modeling the trade-off between energy and performance ofconcurrent data structures and algorithms. The work has been conducted on two mainEXCESS platforms: Intel platforms with recent Intel multicore CPUs and MovidiusMyriad1 platform

Munin - Open Research Archive

Fast Sorting on a Distributed-Memory Architecture

Author: Cheng David R.
Edelman Alan
Gilbert John R.
Shah Viral
Publication venue
Publication date: 01/01/2005
Field of study

We consider the often-studied problem of sorting, for a parallel computer. Given an input array distributed evenly over p processors, the task is to compute the sorted output array, also distributed over the p processors. Many existing algorithms take the approach of approximately load-balancing the output, leaving each processor with Θ(n/p) elements. However, in many cases, approximate load-balancing leads to inefficiencies in both the sorting itself and in further uses of the data after sorting. We provide a deterministic parallel sorting algorithm that uses parallel selection to produce any output distribution exactly, particularly one that is perfectly load-balanced. Furthermore, when using a comparison sort, this algorithm is 1-optimal in both computation and communication. We provide an empirical study that illustrates the efficiency of exact data splitting, and shows an improvement over two sample sort algorithms.Singapore-MIT Alliance (SMA

DSpace@MIT

Cache-efficient algorithms for finite element methods on arbitrary discretizations

Author: Snel H.D.
Publication venue
Publication date: 01/01/2015
Field of study

Repository TU/e

Pure OAI Repository