37 research outputs found

    Parallel programming paradigms and frameworks in big data era

    With Cloud Computing emerging as a promising new approach for ad-hoc parallel data processing, major companies have started to integrate frameworks for parallel data processing into their product portfolios, making it easy for customers to access these services and to deploy their programs. We have entered the Era of Big Data. The explosion and profusion of available data in a wide range of application domains raise new challenges and opportunities in a plethora of disciplines, ranging from science and engineering to biology and business. One major challenge is how to take advantage of the unprecedented scale of data, typically of heterogeneous nature, in order to acquire further insights and knowledge for improving the quality of the offered services. To exploit this new resource, we need to scale up and scale out both our infrastructures and standard techniques. Our society is already data-rich, but the question remains whether or not we have the conceptual tools to handle it. In this paper we discuss and analyze opportunities and challenges for efficient parallel data processing. Big Data is the next frontier for innovation, competition, and productivity, and many solutions continue to appear, partly supported by the considerable enthusiasm around the MapReduce paradigm for large-scale data analysis. We review various parallel and distributed programming paradigms, analyzing how they fit into the Big Data era, and present modern emerging paradigms and frameworks. To better support practitioners interested in this domain, we end with an analysis of ongoing research challenges towards truly fourth-generation data-intensive science. Peer reviewed. Postprint (author's final draft).

    Optimizing MapReduce for Multicore Architectures

    MapReduce is a programming model for data-parallel programs originally intended for data centers. MapReduce simplifies parallel programming, hiding synchronization and task management. These properties make it a promising programming model for future processors with many cores, and existing MapReduce libraries such as Phoenix have demonstrated that applications written with MapReduce perform competitively with those written with Pthreads. This paper explores the design of the MapReduce data structures for grouping intermediate key/value pairs, which is often a performance bottleneck on multicore processors. The paper finds the best choice depends on workload characteristics, such as the number of keys used by the application and the degree of repetition of keys. This paper also introduces a new MapReduce library, Metis, with a compromise data structure designed to perform well for most workloads. Experiments with the Phoenix benchmarks on a 16-core AMD-based server show that Metis's data structure performs better than simpler alternatives, including Phoenix.
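
    The grouping step this abstract refers to can be illustrated with a minimal single-process sketch. It assumes a plain hash table (a Python dict of lists) as the intermediate data structure; this is a generic illustration, not the Metis or Phoenix design, and the names map_words, group_by_key, reduce_sum, and run_mapreduce are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_words(document: str) -> Iterator[Tuple[str, int]]:
    """Map phase: emit one (word, 1) pair per word in the document."""
    for word in document.split():
        yield word.lower(), 1


def group_by_key(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group intermediate pairs by key using a hash table.

    Assumption: a dict of lists stands in for the intermediate data
    structure; real libraries may choose other structures.
    """
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_sum(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce phase: collapse all values for one key into their sum."""
    return key, sum(values)


def run_mapreduce(documents: Iterable[str]) -> Dict[str, int]:
    """Run map, group, and reduce sequentially over the input documents."""
    intermediate = (pair for doc in documents for pair in map_words(doc))
    groups = group_by_key(intermediate)
    return dict(reduce_sum(key, values) for key, values in groups.items())


counts = run_mapreduce(["a b a", "b c"])
print(counts)
```

    In this sketch the cost of group_by_key grows with both the number of distinct keys and how often each key repeats, which are the workload characteristics the abstract names.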

    Liquid stream processing on the web: a JavaScript framework

    The Web is rapidly becoming a mature platform to host distributed applications. Pervasive computing applications running on the Web are now common in the era of the Web of Things, which has made it increasingly simple to integrate sensors and microcontrollers in our everyday life. Such devices are of great interest to Makers with basic Web development skills. With them, Makers are able to build small smart stream processing applications with sensors and actuators without spending a fortune and without knowing much about the technologies they use. Thanks to ongoing Web technology trends enabling real-time peer-to-peer communication between Web-enabled devices, Web browsers and server-side JavaScript runtimes, developers are able to implement pervasive Web applications using a single programming language. These can take advantage of direct and continuous communication channels, going beyond what was possible in the early stages of the Web, to push data in real time. Despite these recent advances, building stream processing applications on the Web of Things remains a challenging task. On the one hand, Web-enabled devices of different nature still have to communicate with different protocols. On the other hand, dealing with a dynamic, heterogeneous, and volatile environment like the Web requires developers to face issues like disconnections, unpredictable workload fluctuations, and device overload. To help developers deal with such issues, in this dissertation we present the Web Liquid Streams (WLS) framework, a novel streaming framework for JavaScript. Developers implement streaming operators written in JavaScript and may interactively and dynamically define a streaming topology. The framework takes care of deploying the user-defined operators on the available devices and connecting them using the appropriate data channel, removing the burden of dealing with different deployment environments from the developers.
    Changes in the semantics of the application and in its execution environment may be applied at runtime without stopping the stream flow. Just as a liquid adapts its shape to that of its container, the Web Liquid Streams framework makes streaming topologies flow across multiple heterogeneous devices, enabling dynamic operator migration without disrupting the data flow. By constantly monitoring the execution of the topology with a hierarchical controller infrastructure, WLS takes care of parallelising the operator execution across multiple devices in case of bottlenecks and of recovering the execution of the streaming topology in case one or more devices disconnect, by restarting lost operators on other available devices.

    An Empirical Analysis of Scheduling Techniques for Real-Time Cloud-Based Data Processing

    In this paper, we explore the challenges and needs of current cloud infrastructures, to better support cloud-based data-intensive applications that are not only latency-sensitive but also require strong timing guarantees. These applications have strict deadlines (e.g., to perform time-dependent mission-critical tasks or to complete real-time control decisions using a human-in-the-loop), and deadline misses are undesirable. To highlight the challenges in this space, we provide a case study of the online scheduling of MapReduce jobs executed by Hadoop. Our evaluations on Amazon EC2 show that the existing Hadoop scheduler is ill-equipped to handle jobs with deadlines. However, by adapting existing multiprocessor scheduling techniques for the cloud environment, we observe significant performance improvements in minimizing missed deadlines and tardiness. Based on our case study, we discuss a range of challenges in this domain posed by virtualization and scale, and propose our research agenda centered around the application of advanced real-time scheduling techniques in the cloud environment.
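
    The two metrics named in this abstract, missed deadlines and tardiness, can be made concrete with a toy comparison. The sketch below assumes a single worker, jobs that are all released at time zero, and earliest-deadline-first (EDF) as a stand-in for a deadline-aware policy; the abstract does not say which scheduling techniques the authors adapted, and the Job class, job names, and numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Job:
    name: str
    duration: float  # processing time on the single worker
    deadline: float  # absolute deadline, measured from time zero


def simulate(jobs: Sequence[Job]) -> Tuple[int, float]:
    """Run jobs back to back in the given order.

    Returns (number of missed deadlines, total tardiness), where a job's
    tardiness is max(0, completion time - deadline).
    """
    clock = 0.0
    missed = 0
    total_tardiness = 0.0
    for job in jobs:
        clock += job.duration
        late_by = max(0.0, clock - job.deadline)
        if late_by > 0:
            missed += 1
        total_tardiness += late_by
    return missed, total_tardiness


def fifo_order(jobs: Sequence[Job]) -> List[Job]:
    """Deadline-oblivious baseline: keep submission order."""
    return list(jobs)


def edf_order(jobs: Sequence[Job]) -> List[Job]:
    """Assumed deadline-aware policy: earliest deadline first."""
    return sorted(jobs, key=lambda job: job.deadline)


jobs = [
    Job("report", duration=4.0, deadline=10.0),
    Job("alert", duration=1.0, deadline=2.0),
    Job("control", duration=2.0, deadline=4.0),
]
fifo_result = simulate(fifo_order(jobs))
edf_result = simulate(edf_order(jobs))
print("FIFO:", fifo_result)  # (2, 6.0)
print("EDF: ", edf_result)   # (0, 0.0)
```

    On this hypothetical workload the submission-order baseline misses two deadlines while the deadline-aware ordering misses none. Real cloud schedulers must also handle multiple workers, staggered arrivals, and virtualization overheads, which this sketch ignores.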

    Scalable elastic systems architecture

    Cloud computing has spurred the exploration and exploitation of elastic access to large scales of computing. To date, the predominant building blocks by which elasticity has been exploited are applications and operating systems that are built around traditional computing infrastructure and programming models that are inelastic or at best coarsely elastic. What would happen if applications themselves could express and exploit elasticity in a fine-grained fashion and this elasticity could be efficiently mapped to the scale and elasticity offered by modern cloud hardware systems? Would economic and market models that exploit elasticity pervade even the lowest levels? And would this enable greater efficiency both globally and individually? Would novel approaches to traditional problems such as quality of service arise? Would new applications be enabled both technically and economically? How to construct scalable and elastic software is an open challenge. Our work explores a systematic method for constructing and deploying such software. Building on several years of prior research, we will develop and evaluate a new cloud computing systems software architecture that addresses both scalability and elasticity. We explore a combination of a novel programming model and an alternative operating system structure. The goal of the architecture is to enable applications that inherently can scale up or down to react to changes in demand. We hypothesize that enabling such fine-grained elastic applications will open up new avenues for exploring both supply and demand elasticity across a broad range of research areas such as economic models, optimization, mechanism design, software engineering, networking and others. Department of Energy Office of Science (DE-SC0005365), National Science Foundation (1012798).

    Towards Low-Latency Batched Stream Processing by Pre-Scheduling
