167 research outputs found
Space-Round Tradeoffs for MapReduce Computations
This work explores fundamental modeling and algorithmic issues arising in the
well-established MapReduce framework. First, we formally specify a
computational model for MapReduce which captures the functional flavor of the
paradigm by allowing for a flexible use of parallelism. Indeed, the model
diverges from a traditional processor-centric view by featuring parameters
which embody only global and local memory constraints, thus favoring a more
data-centric view. Second, we apply the model to the fundamental computation
task of matrix multiplication presenting upper and lower bounds for both dense
and sparse matrix multiplication, which highlight interesting tradeoffs between
space and round complexity. Finally, building on the matrix multiplication
results, we derive further space-round tradeoffs on matrix inversion and
matching
On the Computational Complexity of MapReduce
In this paper we study MapReduce computations from a complexity-theoretic
perspective. First, we formulate a uniform version of the MRC model of Karloff
et al. (2010). We then show that the class of regular languages, and moreover
all of sublogarithmic space, lies in constant round MRC. This result also
applies to the MPC model of Andoni et al. (2014). In addition, we prove that,
conditioned on a variant of the Exponential Time Hypothesis, there are strict
hierarchies within MRC so that increasing the number of rounds or the amount of
time per processor increases the power of MRC. To the best of our knowledge we
are the first to approach the MapReduce model with complexity-theoretic
techniques, and our work lays the foundation for further analysis relating
MapReduce to established complexity classes
Energy Saving Techniques for Phase Change Memory (PCM)
In recent years, the energy consumption of computing systems has increased
and a large fraction of this energy is consumed in main memory. Towards this,
researchers have proposed use of non-volatile memory, such as phase change
memory (PCM), which has low read latency and power; and nearly zero leakage
power. However, the write latency and power of PCM are very high and this,
along with limited write endurance of PCM present significant challenges in
enabling wide-spread adoption of PCM. To address this, several
architecture-level techniques have been proposed. In this report, we review
several techniques to manage power consumption of PCM. We also classify these
techniques based on their characteristics to provide insights into them. The
aim of this work is encourage researchers to propose even better techniques for
improving energy efficiency of PCM based main memory.Comment: Survey, phase change RAM (PCRAM
Modeling Algorithm Performance on Highly-threaded Many-core Architectures
The rapid growth of data processing required in various arenas of computation over the past decades necessitates extensive use of parallel computing engines. Among those, highly-threaded many-core machines, such as GPUs have become increasingly popular for accelerating a diverse range of data-intensive applications. They feature a large number of hardware threads with low-overhead context switches to hide the memory access latencies and therefore provide high computational throughput. However, understanding and harnessing such machines places great challenges on algorithm designers and performance tuners due to the complex interaction of threads and hierarchical memory subsystems of these machines. The achieved performance jointly depends on the parallelism exploited by the algorithm, the effectiveness of latency hiding, and the utilization of multiprocessors (occupancy). Contemporary work tries to model the performance of GPUs from various aspects with different emphasis and granularity. However, no model considers all of these factors together at the same time.
This dissertation presents an analytical framework that jointly addresses parallelism, latency-hiding, and occupancy for both theoretical and empirical performance analysis of algorithms on highly-threaded many-core machines so that it can guide both algorithm design and performance tuning. In particular, this framework not only helps to explore and reduce the runtime configuration space for tuning kernel execution on GPUs, but also reflects performance bottlenecks and predicts how the runtime will trend as the problem and other parameters scale. The framework consists of a pair of analytical models with one focusing on higher-level asymptotic algorithm performance on GPUs and the other one emphasizing lower-level details about scheduling and runtime configuration. Based on the two models, we have conducted extensive analysis of a large set of algorithms. Two analysis provides interesting results and explains previously unexplained data. In addition, the two models are further bridged and combined as a consistent framework. The framework is able to provide an end-to-end methodology for algorithm design, evaluation, comparison, implementation, and prediction of real runtime on GPUs fairly accurately.
To demonstrate the viability of our methods, the models are validated through data from implementations of a variety of classic algorithms, including hashing, Bloom filters, all-pairs shortest path, matrix multiplication, FFT, merge sort, list ranking, string matching via suffix tree/array, etc. We evaluate the models\u27 performance across a wide spectrum of parameters, data values, and machines. The results indicate that the models can be effectively used for algorithm performance analysis and runtime prediction on highly-threaded many-core machines
Space and Time Efficient Parallel Graph Decomposition, Clustering, and Diameter Approximation
We develop a novel parallel decomposition strategy for unweighted, undirected
graphs, based on growing disjoint connected clusters from batches of centers
progressively selected from yet uncovered nodes. With respect to similar
previous decompositions, our strategy exercises a tighter control on both the
number of clusters and their maximum radius.
We present two important applications of our parallel graph decomposition:
(1) -center clustering approximation; and (2) diameter approximation. In
both cases, we obtain algorithms which feature a polylogarithmic approximation
factor and are amenable to a distributed implementation that is geared for
massive (long-diameter) graphs. The total space needed for the computation is
linear in the problem size, and the parallel depth is substantially sublinear
in the diameter for graphs with low doubling dimension. To the best of our
knowledge, ours are the first parallel approximations for these problems which
achieve sub-diameter parallel time, for a relevant class of graphs, using only
linear space. Besides the theoretical guarantees, our algorithms allow for a
very simple implementation on clustered architectures: we report on extensive
experiments which demonstrate their effectiveness and efficiency on large
graphs as compared to alternative known approaches.Comment: 14 page
High-performance computing for vision
Vision is a challenging application for high-performance computing (HPC). Many vision tasks have stringent latency and throughput requirements. Further, the vision process has a heterogeneous computational profile. Low-level vision consists of structured computations, with regular data dependencies. The subsequent, higher level operations consist of symbolic computations with irregular data dependencies. Over the years, many approaches to high-speed vision have been pursued. VLSI hardware solutions such as ASIC's and digital signal processors (DSP's) have provided good processing speeds on structured low-level vision tasks. Special purpose systems for vision have also been designed. Currently, there is growing interest in using general purpose parallel systems for vision problems. These systems offer advantages of higher performance, sofavare programmability, generality, and architectural flexibility over the earlier approaches. The choice of low-cost commercial-off-theshelf (COTS) components as building blocks for these systems leads to easy upgradability and increased system life. The main focus of the paper is on effectively using the COTSbased general purpose parallel computing platforms to realize high-speed implementations of vision tasks. Due to the successful use of the COTS-based systems in a variety of high performance applications, it is attractive to consider their use for vision applications as well. However, the irregular data dependencies in vision tasks lead to large communication overheads in the HPC systems. At the University of Southern California, our research efforts have been directed toward designing scalable parallel algorithms for vision tasks on the HPC systems. In our approach, we use the message passing programming model to develop portable code. Our algorithms are specified using C and MPI. In this paper, we summarize our efforts, and illustrate our approach using several example vision tasks. To facilitate the analysis and development of scalable algorithms, a realistic computational model of the parallel system must be used. Several such models have been proposed in the literature. We use the General-purpose Distributed Memory (GDM) model which is a simple but realistic model of state-of-theart parallel machines. Using the GDM model, generic algorithmic techniques such as data remapping, overlapping of communication with computation, message packing, asynchronous execution, and communication scheduling are developed. Using these techniques, we have developed scalable algorithms for many vision tasks. For instance, a scalable algorithm for linear approximation has been developed using the asynchronous execution technique. Using this algorithm, linear feature extraction can be performed in 0.065 s on a 64 node SP-2 for a 512 × 512 image. A serial implementation takes 3.45 s for the same task. Similarly, the communication scheduling and decomposition techniques lead to a scalable algorithm for the line grouping task. We believe that such an algorithmic approach can result in the development of scalable and portable solutions for vision tasks. © 1996 IEEE Publisher Item Identifier S 0018-9219(96)04992-4.published_or_final_versio
- …