How to Optimally Allocate Resources for Coded Distributed Computing?
Today's data centers have an abundance of computing resources, hosting server
clusters consisting of as many as tens or hundreds of thousands of machines. To
execute a complex computing task over a data center, it is natural to
distribute computations across many nodes to take advantage of parallel
processing. However, as we allocate more and more computing resources to a
computation task and further distribute the computations, large amounts of
(partially) computed data must be moved between nodes across consecutive
stages of the computation, so the communication load can become the
bottleneck. In this paper, we study the optimal allocation of computing
resources in distributed computing, in order to minimize the total execution
time, accounting for the durations of both the computation and communication
phases. In particular, we consider a general MapReduce-type
distributed computing framework, in which the computation is decomposed into
three stages: \emph{Map}, \emph{Shuffle}, and \emph{Reduce}. We focus on a
recently proposed \emph{Coded Distributed Computing} approach for MapReduce and
study the optimal allocation of computing resources in this framework. For all
values of problem parameters, we characterize the optimal number of servers
that should be used for distributed processing, provide the optimal placements
of the Map and Reduce tasks, and propose an optimal coded data shuffling
scheme, in order to minimize the total execution time. To prove the optimality
of the proposed scheme, we first derive a matching information-theoretic
converse on the execution time, then we prove that among all possible resource
allocation schemes that achieve the minimum execution time, our proposed scheme
uses exactly the minimum possible number of servers.
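As a rough back-of-the-envelope illustration of the trade-off (a minimal sketch, not the paper's scheme or its converse), one can brute-force a toy cost model in which replicating each Map task r times shrinks the coded shuffle load to (1/r)(1 - r/K) of the dataset, the load achieved by coded distributed computing. The cost coefficients and the per-server overhead term below are invented for illustration:

```python
# Minimal sketch, not the paper's actual model: brute-force the number of
# servers K and computation load r (each Map task replicated r times) that
# minimize a toy "map + coded shuffle + per-server overhead" cost. The
# coefficients (c_map, c_shuffle, c_server) and N are illustrative
# assumptions; only the coded-shuffle load (1/r)(1 - r/K) follows the
# coded distributed computing literature.

def total_time(K, r, N=1000, c_map=1.0, c_shuffle=1000.0, c_server=1.0):
    map_time = c_map * N * r / K                 # per-server Map work
    shuffle_time = c_shuffle * (1 - r / K) / r   # coded shuffle load
    overhead = c_server * K                      # startup/coordination cost
    return map_time + shuffle_time + overhead

def best_allocation(K_max=200):
    candidates = [(K, r) for K in range(1, K_max + 1) for r in range(1, K + 1)]
    return min(candidates, key=lambda kr: total_time(*kr))

K_opt, r_opt = best_allocation()
print(f"K={K_opt} servers, load r={r_opt}, "
      f"modeled time {total_time(K_opt, r_opt):.1f}")
```

Under this toy model the optimum is interior: adding servers first helps (more parallel Map work) and then hurts (overhead grows while the coded shuffle saving saturates), which is the qualitative behavior the paper characterizes exactly.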
Spark deployment and performance evaluation on the MareNostrum supercomputer
In this paper we present a framework to enable data-intensive Spark workloads on MareNostrum, a petascale supercomputer designed mainly for compute-intensive applications. As far as we know, this is the first attempt to investigate optimized deployment configurations of Spark on a petascale HPC setup. We detail the design of the framework and present benchmark data that provide insights into the scalability of the system. We examine the impact of different configurations, including parallelism, storage, and networking alternatives, and we discuss several aspects of executing Big Data workloads on a computing system based on the compute-centric paradigm. Further, we derive conclusions aiming to pave the way towards systematic and optimized methodologies for fine-tuning data-intensive applications on large clusters, with an emphasis on parallelism configurations.
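As a hedged illustration of the kind of knobs such an evaluation sweeps (the property keys are standard Spark configuration options, but the values, the scratch path, and the probe workload are assumptions, not the paper's setup):

```python
# Sketch of a parallelism/storage configuration sweep for Spark on an HPC
# node. Property names are real Spark configuration keys; the values and
# the node-local scratch directory are illustrative assumptions.
import time
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hpc-spark-benchmark")
    .config("spark.executor.cores", "4")          # cores per executor
    .config("spark.default.parallelism", "256")   # tasks per stage
    .config("spark.local.dir", "/scratch/tmp")    # local disk vs. shared FS
    .config("spark.serializer",
            "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

# Simple scalability probe: time a shuffle-heavy aggregation at a given
# parallelism level, then repeat across configurations.
t0 = time.time()
n = (spark.sparkContext.parallelize(range(10**7), numSlices=256)
     .map(lambda x: (x % 1000, x))
     .reduceByKey(lambda a, b: a + b)
     .count())
print(f"{n} groups in {time.time() - t0:.1f}s")
```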
An Information-Theoretic Test for Dependence with an Application to the Temporal Structure of Stock Returns
Information theory provides ideas for conceptualising information and
measuring relationships between objects. It has found wide application in the
sciences, but economics and finance have made surprisingly little use of it. We
show that time series data can usefully be studied as information -- by noting
the relationship between statistical redundancy and dependence, we are able to
use the results of information theory to construct a test for joint dependence
of random variables. The test is in the same spirit as those developed by
Ryabko and Astola (2005, 2006a, 2006b), but differs from them in that we add
extra randomness to the original stochastic process. It uses data compression to
estimate the entropy rate of a stochastic process, which allows it to measure
dependence among sets of random variables, unlike the existing econometric
literature, which uses entropy and is restricted to pairwise tests of
dependence. We show how serial dependence may be detected in
S&P500 and PSI20 stock returns over different sample periods and frequencies.
We apply the test to synthetic data to judge its ability to recover known
temporal dependence structures.
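A minimal sketch of the compression idea, not the authors' exact statistic: if two series are jointly dependent, their interleaved joint sequence is more redundant, and hence more compressible, than the two series compressed separately. The sign-based symbolization and the use of zlib are illustrative choices:

```python
# Hedged sketch of a compression-based dependence score. Symbolizing
# returns by sign and compressing with zlib are assumptions for
# illustration, not the paper's estimator.
import zlib
import numpy as np

def compressed_size(symbols):
    return len(zlib.compress(bytes(symbols), 9))

def dependence_score(x, y):
    """Positive score suggests joint dependence between series x and y."""
    sx = (np.asarray(x) > 0).astype(np.uint8)   # crude binary symbolization
    sy = (np.asarray(y) > 0).astype(np.uint8)
    joint = np.empty(2 * len(sx), dtype=np.uint8)
    joint[0::2], joint[1::2] = sx, sy           # interleave the two series
    separate = compressed_size(sx) + compressed_size(sy)
    return separate - compressed_size(joint)

rng = np.random.default_rng(0)
x = rng.standard_normal(50_000)
print("independent:", dependence_score(x, rng.standard_normal(50_000)))
print("dependent:  ", dependence_score(x, x + 0.1 * rng.standard_normal(50_000)))
```

In practice a significance threshold would come from a permutation or shuffling baseline, which is also where the extra randomness mentioned in the abstract enters.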
Towards the quantification of the semantic information encoded in written language
Written language is a complex communication signal capable of conveying
information encoded in the form of ordered sequences of words. Beyond the local
order ruled by grammar, semantic and thematic structures affect long-range
patterns in word usage. Here, we show that a direct application of information
theory quantifies the relationship between the statistical distribution of
words and the semantic content of the text. We show that there is a
characteristic scale, roughly a few thousand words, which establishes
the typical size of the most informative segments in written language.
Moreover, we find that the words whose contributions to the overall
information are largest are the ones most closely associated with the main
subjects and topics of the text. This scenario can be explained by a model of word usage
that assumes that words are distributed along the text in domains of a
characteristic size where their frequency is higher than elsewhere. Our
conclusions are based on the analysis of a large database of written language,
diverse in subjects and styles, and thus are likely to be applicable to general
language sequences encoding complex information.
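A minimal sketch of the scale analysis, under assumptions not taken from the paper: estimate the mutual information between word identity and segment index for several segment sizes; the segment size at which this information peaks plays the role of the characteristic scale, and the per-word terms identify the topic words. The whitespace tokenizer and the placeholder corpus file are assumptions:

```python
# Hedged sketch: mutual information between word identity and segment
# index, swept over segment sizes. The paper's estimator additionally
# corrects against a shuffled-text baseline; that step is omitted here.
from collections import Counter
import math

def word_segment_information(words, seg_size):
    n_segs = max(1, len(words) // seg_size)
    total = n_segs * seg_size
    global_counts = Counter(words[:total])
    info = 0.0
    for s in range(n_segs):
        seg = Counter(words[s * seg_size:(s + 1) * seg_size])
        for w, c in seg.items():
            p_joint = c / total                  # P(word w, segment s)
            p_w = global_counts[w] / total       # P(word w)
            # P(s) = 1 / n_segs, so the log ratio simplifies as below.
            info += p_joint * math.log2(p_joint * n_segs / p_w)
    return info  # bits; peaks near the text's characteristic scale

words = open("book.txt").read().lower().split()  # placeholder corpus
for s in (200, 1000, 5000):
    print(s, round(word_segment_information(words, s), 3))
```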