10,196 research outputs found

    How to Optimally Allocate Resources for Coded Distributed Computing?

    Today's data centers have an abundance of computing resources, hosting server clusters with as many as tens or hundreds of thousands of machines. To execute a complex computing task over a data center, it is natural to distribute the computation across many nodes to take advantage of parallel processing. However, as more computing resources are allocated to a task and the computation is distributed further, large amounts of (partially) computed data must be moved between consecutive stages of the computation among the nodes, so the communication load can become the bottleneck. In this paper, we study the optimal allocation of computing resources in distributed computing, in order to minimize the total execution time, accounting for the durations of both the computation and communication phases. In particular, we consider a general MapReduce-type distributed computing framework, in which the computation is decomposed into three stages: \emph{Map}, \emph{Shuffle}, and \emph{Reduce}. We focus on a recently proposed \emph{Coded Distributed Computing} approach for MapReduce and study the optimal allocation of computing resources in this framework. For all values of the problem parameters, we characterize the optimal number of servers that should be used for distributed processing, provide the optimal placements of the Map and Reduce tasks, and propose an optimal coded data shuffling scheme, in order to minimize the total execution time. To prove the optimality of the proposed scheme, we first derive a matching information-theoretic converse on the execution time, and then we prove that, among all possible resource allocation schemes that achieve the minimum execution time, our proposed scheme uses exactly the minimum possible number of servers.
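
    As a rough illustration of the Map/Shuffle/Reduce decomposition described in this abstract (and not of the paper's coded shuffling scheme, which is not reproduced here), the Python sketch below runs a word count over a few simulated server inputs and makes explicit where the Shuffle-phase data movement, i.e. the communication load, occurs. All names and values are illustrative.

    ```python
    # Minimal sketch of the three-stage MapReduce decomposition (Map,
    # Shuffle, Reduce), using a word count over a handful of simulated
    # "servers". The paper's coded shuffling scheme is not shown; this
    # only highlights where the inter-node communication arises.
    from collections import defaultdict

    def map_phase(chunk):
        # Each server emits (key, value) pairs from its local input split.
        return [(word, 1) for word in chunk.split()]

    def shuffle_phase(mapped_outputs, num_reducers):
        # Intermediate values are regrouped by key and moved to reducers.
        # In a real cluster this data movement is the communication load.
        buckets = [defaultdict(list) for _ in range(num_reducers)]
        for output in mapped_outputs:
            for key, value in output:
                buckets[hash(key) % num_reducers][key].append(value)
        return buckets

    def reduce_phase(bucket):
        # Each reducer aggregates the values it received for its keys.
        return {key: sum(values) for key, values in bucket.items()}

    chunks = ["coded distributed computing", "distributed computing at scale"]
    mapped = [map_phase(c) for c in chunks]          # Map: parallel, local
    buckets = shuffle_phase(mapped, num_reducers=2)  # Shuffle: data exchange
    results = [reduce_phase(b) for b in buckets]     # Reduce: final outputs
    print(results)
    ```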

    Spark deployment and performance evaluation on the MareNostrum supercomputer

    In this paper, we present a framework to enable data-intensive Spark workloads on MareNostrum, a petascale supercomputer designed mainly for compute-intensive applications. As far as we know, this is the first attempt to investigate optimized deployment configurations of Spark on a petascale HPC setup. We detail the design of the framework and present benchmark data that provide insights into the scalability of the system. We examine the impact of different configurations, including parallelism, storage, and networking alternatives, and we discuss several aspects of executing Big Data workloads on a computing system based on the compute-centric paradigm. Further, we derive conclusions aimed at paving the way towards systematic and optimized methodologies for fine-tuning data-intensive applications on large clusters, with an emphasis on parallelism configurations.
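
    As a hedged illustration of the kind of deployment knobs this evaluation discusses, the snippet below builds a PySpark session with explicit parallelism and local-storage settings. The values (cores, memory, partition count, scratch directory) are placeholders for this sketch, not the configurations evaluated on MareNostrum.

    ```python
    # Illustrative PySpark setup showing parallelism and storage settings
    # of the kind a deployment study would sweep. Values are placeholders.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("parallelism-sweep")
        .config("spark.executor.cores", "4")         # cores per executor
        .config("spark.executor.memory", "8g")       # memory per executor
        .config("spark.default.parallelism", "256")  # default task count
        .config("spark.local.dir", "/scratch/tmp")   # shuffle spill location
        .getOrCreate()
    )

    # A simple data-parallel job whose runtime depends on the partitioning.
    rdd = spark.sparkContext.parallelize(range(10**7), numSlices=256)
    print(rdd.map(lambda x: x * x).sum())
    spark.stop()
    ```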

    An Information-Theoretic Test for Dependence with an Application to the Temporal Structure of Stock Returns

    Information theory provides ideas for conceptualising information and measuring relationships between objects. It has found wide application in the sciences, but economics and finance have made surprisingly little use of it. We show that time series data can usefully be studied as information -- by noting the relationship between statistical redundancy and dependence, we are able to use the results of information theory to construct a test for joint dependence of random variables. The test is in the same spirit as those developed by Ryabko and Astola (2005, 2006b,a), but differs from them in that we add extra randomness to the original stochastic process. It uses data compression to estimate the entropy rate of a stochastic process, which allows it to measure dependence among sets of random variables, in contrast to the existing econometric literature, which uses entropy and is restricted to pairwise tests of dependence. We show how serial dependence may be detected in S&P500 and PSI20 stock returns over different sample periods and frequencies. We apply the test to synthetic data to judge its ability to recover known temporal dependence structures.
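
    The core intuition, that a serially dependent (redundant) sequence is more compressible than a shuffled surrogate with the same marginal distribution, can be sketched in a few lines of Python. This is only the underlying idea, not the authors' actual test, which as noted adds extra randomness to the original process before compressing.

    ```python
    # Sketch of the compression-based intuition: a serially dependent
    # symbol sequence compresses better than a shuffled surrogate with
    # the same marginal distribution. Not the authors' exact procedure.
    import random
    import zlib

    def compressed_bits_per_symbol(symbols):
        data = bytes(symbols)
        return 8 * len(zlib.compress(data, 9)) / len(symbols)

    # Dependent sequence: each symbol tends to repeat the previous one.
    random.seed(0)
    dependent = [0]
    for _ in range(9999):
        prev = dependent[-1]
        dependent.append(prev if random.random() < 0.9 else random.randint(0, 3))

    shuffled = dependent[:]
    random.shuffle(shuffled)  # destroys temporal dependence, keeps marginals

    print("dependent:", compressed_bits_per_symbol(dependent))
    print("shuffled: ", compressed_bits_per_symbol(shuffled))
    ```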

    Towards the quantification of the semantic information encoded in written language

    Written language is a complex communication signal capable of conveying information encoded in the form of ordered sequences of words. Beyond the local order ruled by grammar, semantic and thematic structures affect long-range patterns in word usage. Here, we show that a direct application of information theory quantifies the relationship between the statistical distribution of words and the semantic content of the text. We show that there is a characteristic scale, roughly around a few thousand words, which establishes the typical size of the most informative segments in written language. Moreover, we find that the words that contribute most to the overall information are the ones most closely associated with the main subjects and topics of the text. This scenario can be explained by a model of word usage that assumes that words are distributed along the text in domains of a characteristic size, within which their frequency is higher than elsewhere. Our conclusions are based on the analysis of a large database of written language, diverse in subjects and styles, and thus are likely to be applicable to general language sequences encoding complex information.
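
    A toy version of this "domains of characteristic size" picture can be sketched by splitting a text into fixed-size segments and scoring each word by how unevenly it is spread across them. The sketch below is only a rough stand-in for the measure used in the paper; the 2000-word segment size and the input file name are assumptions.

    ```python
    # Toy sketch: topic words cluster in segments, so a word whose
    # occurrences are concentrated in few segments (low entropy across
    # segments) is scored as more informative. Rough stand-in only.
    import math
    from collections import Counter, defaultdict

    def segment_scores(words, segment_size=2000):  # assumed segment size
        segments = [words[i:i + segment_size]
                    for i in range(0, len(words), segment_size)]
        counts = defaultdict(Counter)
        totals = Counter()
        for idx, seg in enumerate(segments):
            for w in seg:
                counts[w][idx] += 1
                totals[w] += 1
        max_entropy = math.log2(len(segments)) if len(segments) > 1 else 1.0
        scores = {}
        for w, per_seg in counts.items():
            probs = [c / totals[w] for c in per_seg.values()]
            entropy = -sum(p * math.log2(p) for p in probs)
            # Uneven usage across segments -> lower entropy -> higher score.
            scores[w] = max_entropy - entropy
        return scores

    words = open("book.txt").read().lower().split()  # assumed input file
    top = sorted(segment_scores(words).items(), key=lambda kv: -kv[1])[:20]
    print(top)
    ```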