Communication Cost for Updating Linear Functions when Message Updates are Sparse: Connections to Maximally Recoverable Codes
We consider a communication problem in which an update of the source message
needs to be conveyed to one or more distant receivers that are interested in
maintaining specific linear functions of the source message. The setting is one
in which the updates are sparse in nature, and where neither the source nor the
receiver(s) is aware of the exact difference vector; they know only the
amount of sparsity present in the difference vector. Under this
setting, we are interested in devising linear encoding and decoding schemes
that minimize the communication cost involved. We show that the optimal
solution to this problem is closely related to the notion of maximally
recoverable codes (MRCs), which were originally introduced in the context of
coding for storage systems. In the context of storage, MRCs guarantee optimal
erasure protection when the system is partially constrained to have local
parity relations among the storage nodes. In our problem, we show that optimal
solutions exist if and only if MRCs of a certain kind (identified by the desired
linear functions) exist. We consider point-to-point and broadcast versions of
the problem, and identify connections to MRCs under both these settings. For
the point-to-point setting, we show that our linear-encoder based achievable
scheme is optimal even when non-linear encoding is permitted. The theory is
illustrated in the context of updating erasure coded storage nodes. We present
examples based on modern storage codes such as the minimum bandwidth
regenerating codes.
Comment: To appear in IEEE Transactions on Information Theory
Runtime Optimizations for Prediction with Tree-Based Models
Tree-based models have proven to be an effective solution for web ranking as
well as other problems in diverse domains. This paper focuses on optimizing the
runtime performance of applying such models to make predictions, given an
already-trained model. Although exceedingly simple conceptually, most
implementations of tree-based models do not efficiently utilize modern
superscalar processor architectures. By laying out data structures in memory in
a more cache-conscious fashion, removing branches from the execution flow using
a technique called predication, and micro-batching predictions using a
technique called vectorization, we are able to better exploit modern processor
architectures and significantly improve the speed of tree-based models over
hard-coded if-else blocks. Our work contributes to the exploration of
architecture-conscious runtime implementations of machine learning algorithms.
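The predication and micro-batching ideas named in this abstract can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: it assumes a complete binary tree stored in flat arrays, and the names `FEATURES`, `THRESHOLDS`, `LEAVES`, `predict_predicated`, and `predict_batch` are hypothetical. Python cannot show the hardware-level speedup; the sketch only shows how the comparison result becomes part of an index computation instead of an if-else branch.

```python
# Hypothetical depth-2 complete binary tree in flat arrays (assumed layout).
# Internal node i has children 2*i + 1 (left) and 2*i + 2 (right).
FEATURES = [0, 1, 1]           # feature index tested at each internal node
THRESHOLDS = [0.5, 0.3, 0.7]   # threshold tested at each internal node
LEAVES = [10, 20, 30, 40]      # leaf values, left to right
DEPTH = 2


def predict_predicated(features, thresholds, leaves, depth, x):
    """Walk the tree without branching on the comparison outcome."""
    node = 0
    for _ in range(depth):
        # The comparison yields 0 or 1 and is added to the child index.
        go_right = int(x[features[node]] > thresholds[node])
        node = 2 * node + 1 + go_right
    first_leaf = 2 ** depth - 1
    return leaves[node - first_leaf]


def predict_batch(features, thresholds, leaves, depth, batch):
    """Advance a small batch of instances through the tree level by level."""
    nodes = [0] * len(batch)
    for _ in range(depth):
        nodes = [
            2 * n + 1 + int(x[features[n]] > thresholds[n])
            for n, x in zip(nodes, batch)
        ]
    first_leaf = 2 ** depth - 1
    return [leaves[n - first_leaf] for n in nodes]
```

Storing the nodes in contiguous arrays is also one simple way to get a more cache-friendly layout than pointer-linked nodes, which is the first of the three optimizations the abstract lists.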
Using Localised 'Gossip' to Structure Distributed Learning
The idea of a 'memetic' spread of solutions through a human culture in parallel to their development is applied as a distributed approach to learning. Local parts of a problem are associated with a set of overlapping localities in a space, and solutions are then evolved in those localities. Good solutions are not only crossed with others to search for better solutions, but they also propagate across the areas of the problem space where they are relatively successful. Thus the whole population co-evolves solutions with the domains in which they are found to work. This approach is compared to the equivalent global evolutionary computation approach with respect to predicting the occurrence of heart disease in the Cleveland data set. It greatly outperforms the global approach, but the space of attributes within which this evolutionary process occurs can affect its efficiency.
Using the High Productivity Language Chapel to Target GPGPU Architectures
It has been widely shown that GPGPU architectures offer large performance gains compared to their traditional CPU counterparts for many applications. The downside to these architectures is that the current programming models present numerous challenges to the programmer: lower-level languages, explicit data movement, loss of portability, and challenges in performance optimization. In this paper, we present novel methods and compiler transformations that increase productivity by enabling users to easily program GPGPU architectures using the high productivity programming language Chapel. Rather than resorting to different parallel libraries or annotations for a given parallel platform, we leverage a language that has been designed from first principles to address the challenge of programming for parallelism and locality. This also has the advantage of being portable across distinct classes of parallel architectures, including desktop multicores, distributed memory clusters, large-scale shared memory, and now CPU-GPU hybrids. We present experimental results from the Parboil benchmark suite which demonstrate that codes written in Chapel achieve performance comparable to the original versions implemented in CUDA.
NSF CCF 0702260; Cray Inc. Cray-SRA-2010-01696; 2010-2011 Nvidia Research Fellowship. Unpublished; not peer reviewed.
Parallel sparse matrix-vector multiplication as a test case for hybrid MPI+OpenMP programming
We evaluate optimized parallel sparse matrix-vector operations for two
representative application areas on widespread multicore-based cluster
configurations. First the single-socket baseline performance is analyzed and
modeled with respect to basic architectural properties of standard multicore
chips. Going beyond the single node, parallel sparse matrix-vector operations
often suffer from an unfavorable communication to computation ratio. Starting
from the observation that nonblocking MPI is not able to hide communication
cost using standard MPI implementations, we demonstrate that explicit overlap
of communication and computation can be achieved by using a dedicated
communication thread, which may run on a virtual core. We compare our approach
to pure MPI and the widely used "vector-like" hybrid programming strategy.
Comment: 12 pages, 6 figures
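The overlap strategy described in this abstract can be sketched structurally as follows. This is a hedged illustration under several assumptions, not the authors' code: the matrix is split into a local part (columns owned by this process) and a remote part (columns needing halo values), both in CSR form, and a hypothetical callback `fetch_halo` stands in for the MPI exchange. The names `csr_matvec` and `overlapped_spmv` are illustrative. In CPython the global interpreter lock prevents true compute parallelism, so the sketch shows only the control flow of a dedicated communication thread.

```python
import threading


def csr_matvec(indptr, indices, data, x, y):
    """Accumulate y += A @ x for a matrix A given in CSR form."""
    for row in range(len(indptr) - 1):
        acc = 0.0
        for k in range(indptr[row], indptr[row + 1]):
            acc += data[k] * x[indices[k]]
        y[row] += acc


def overlapped_spmv(local, remote, x_local, fetch_halo):
    """Compute the local part while a separate thread fetches halo values.

    local and remote are (indptr, indices, data) triples; fetch_halo is a
    hypothetical stand-in for the communication step.
    """
    y = [0.0] * (len(local[0]) - 1)
    halo = {}

    def communicate():
        halo["x"] = fetch_halo()

    comm = threading.Thread(target=communicate)
    comm.start()
    csr_matvec(*local, x_local, y)      # runs while communication is in flight
    comm.join()                         # halo values are needed from here on
    csr_matvec(*remote, halo["x"], y)   # contribution of the remote columns
    return y


# Tiny example: A = [[2, 0 | 1], [0, 3 | 0]], where the last column is remote.
LOCAL = ([0, 1, 2], [0, 1], [2.0, 3.0])
REMOTE = ([0, 1, 1], [0], [1.0])
RESULT = overlapped_spmv(LOCAL, REMOTE, [1.0, 2.0], lambda: [4.0])
```

Splitting the matrix into local and remote parts is what allows the local multiplication to proceed before any halo data has arrived.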