4 research outputs found

    Geo-distributed big data processing

    Big data processing undoubtedly represents a major challenge of this era. Big data arises for many reasons, including applications retaining more information to improve operation, monitoring, or auditing. Many systems have been proposed for efficiently handling big data. MapReduce, popularized by Google, is a widely used model where data is processed in two essential phases, mapping and reducing. Many workflow systems have also been introduced for efficiently handling multiple big datasets. These include Google's FlumeJava and Apache Pig. One major limitation of current systems for processing big data is that they assume a single homogeneously addressable cluster of nodes. Most of these systems are not designed to operate across multiple data centers and operate poorly in such environments. Many analysis tasks involve several datasets which are not necessarily stored in the same data center, and some datasets themselves may consist of several sub-datasets that may be partitioned across several data centers. In other words, in contrast to the illusion of omnipresent uniform storage and computation resources promoted by cloud vendors, clouds are implemented by concrete data centers with specific locations, and big data is often geographically distributed. Current tools perform poorly in such environments, if they support them at all. In this dissertation, we present solutions for efficiently handling big data that is geographically distributed. First, we investigate ways of efficiently processing a single geographically distributed dataset, and present G-MR, a tool for executing a sequence of tasks on such a dataset in an optimized manner. Second, we identify ways of efficiently handling multiple geographically distributed datasets using big data workflow systems. We present our languages Rout and DuctWork and corresponding systems that extend the big data workflow languages Apache Pig and Google's FlumeJava, respectively, for defining and executing geographically distributed big data workflows. Third, we present Atmosphere, a distributed middleware system for efficiently communicating data across multiple cloud environments.

    Atmosphere: A Universal Cross-Cloud Communication Infrastructure

    As demonstrated by the emergence of paradigms like fog computing [1] or cloud-of-clouds [2], the landscape of third-party computation is moving beyond straightforward single datacenter-based cloud computing. However, building applications that execute efficiently across datacenters and clouds is tedious due to the variety of communication abstractions provided, and variations in latencies within and between datacenters. The publish/subscribe paradigm seems like an adequate abstraction for supporting “cross-cloud” communication, as it abstracts low-level communication and addressing and supports many-to-many communication between publishers and subscribers, of which one-to-one or one-to-many addressing can be viewed as special cases. In particular, content-based publish/subscribe (CPS) provides an expressive abstraction that matches well with the key-value pair model of many established cloud storage and computing systems, and decentralized overlay-based CPS implementations scale up well. On the flip side, such CPS systems perform poorly at small scale. This holds especially for multi-send scenarios, which we refer to as entourages, that range from a channel between a publisher and a single subscriber to a broadcast between a publisher and a handful of subscribers. These scenarios are common in datacenter computing, where cheap hardware is exploited for parallelism (efficiency) and redundancy (fault-tolerance). In this paper, we present Atmosphere, a CPS system for cross-cloud communication that can dynamically identify entourages of publishers and corresponding subscribers, taking geographical constraints into account. Atmosphere connects publishers with their entourages through überlays, enabling low-latency communication. We describe three case studies of systems that employ Atmosphere as a communication framework, illustrating that Atmosphere can be utilized to considerably improve cross-cloud communication efficiency.

    Optimal communication structures for big data aggregation
