
    The Family of MapReduce and Large Scale Data Processing Systems

    In the last two decades, the continuous increase of computational power has produced an overwhelming flow of data, which has called for a paradigm shift in computing architectures and large-scale data processing mechanisms. MapReduce is a simple and powerful programming model that enables easy development of scalable parallel applications to process vast amounts of data on large clusters of commodity machines. It isolates the application from the details of running a distributed program, such as data distribution, scheduling, and fault tolerance. However, the original implementation of the MapReduce framework had some limitations that have been tackled by many research efforts in follow-up works since its introduction. This article provides a comprehensive survey of a family of approaches and mechanisms for large-scale data processing that build on the original idea of the MapReduce framework and are currently gaining momentum in both the research and industrial communities. We also cover systems that provide declarative programming interfaces on top of the MapReduce framework. In addition, we review several large-scale data processing systems that resemble some ideas of the MapReduce framework for different purposes and application scenarios. Finally, we discuss some future research directions for implementing the next generation of MapReduce-like solutions.
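    To make the programming model concrete, here is a minimal single-machine sketch of the map, shuffle, and reduce phases in Python; the word-count task and the tiny input are illustrative assumptions, not examples taken from the survey.

```python
# A minimal sketch of the MapReduce programming model, assuming a toy
# word-count task; a real framework distributes these phases over a cluster.
from collections import defaultdict

def map_phase(document):
    # map: emit (word, 1) for every word in the input split
    for word in document.split():
        yield word.lower(), 1

def reduce_phase(word, counts):
    # reduce: aggregate all values that share a key
    return word, sum(counts)

def mapreduce(documents):
    # shuffle: group intermediate pairs by key, as the framework would
    groups = defaultdict(list)
    for doc in documents:
        for key, value in map_phase(doc):
            groups[key].append(value)
    return dict(reduce_phase(k, v) for k, v in groups.items())

print(mapreduce(["the quick brown fox", "the lazy dog"]))
```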

    Implementation of composite semijoins using a variation of Bloom filters.

    Unlike processing in a centralized database system, distributed query processing involves data transmission among different sites, and this communication cost dominates the local processing cost. The objective of distributed query optimization is therefore to find strategies that minimize the amount of data transmitted over the network. Since optimal query processing in distributed database systems has been shown to be an NP-hard problem, heuristics are applied to find a near-optimal processing strategy. Previous research has mainly focused on the use of joins, semijoins, and hash semijoins (Bloom filters). The semijoin is a well-established operator that reduces relations efficiently before transmission. As a variation of the semijoin, the composite semijoin performs several single-column semijoins as one composite operation, which can be beneficial. The hash semijoin (which uses a Bloom filter) minimizes the cost of a semijoin operation. This thesis summarizes each category of query processing techniques and optimization algorithms. We also propose a new algorithm, called Composite Semijoin Filter, that combines the ideas of composite semijoins, Bloom filters, and PERF joins; one advantage of this algorithm is that it avoids hash collisions. The algorithm is evaluated and compared with an initial feasible solution (IFS) and another filter-based algorithm, and it is shown to give a substantial reduction of the relations and of the total cost. Thesis (M.Sc.), Dept. of Computer Science, University of Windsor (Canada), 2004. Adviser: Joan Morrissey.
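    As an illustration of the hash-semijoin building block discussed above, the following Python sketch shows a Bloom filter reducing a remote relation before transmission; the toy relations, filter size, and hash scheme are assumptions for illustration, and the thesis's Composite Semijoin Filter itself is not reproduced here.

```python
# A minimal sketch of a hash (Bloom-filter) semijoin between two sites,
# assuming toy relations R and S keyed on an integer join column.
import hashlib

def bloom_add(bits, key, k=3):
    for i in range(k):
        h = int(hashlib.sha256(f"{i}:{key}".encode()).hexdigest(), 16)
        bits[h % len(bits)] = 1

def bloom_maybe_contains(bits, key, k=3):
    for i in range(k):
        h = int(hashlib.sha256(f"{i}:{key}".encode()).hexdigest(), 16)
        if not bits[h % len(bits)]:
            return False   # definitely absent
    return True            # present, or a false positive (collision)

# Site 1 builds a filter over R's join column and ships only the bits.
R = [(1, "a"), (2, "b"), (5, "c")]
bits = [0] * 64
for key, _ in R:
    bloom_add(bits, key)

# Site 2 keeps only S-tuples whose join key may match, cutting transmission.
S = [(1, "x"), (3, "y"), (5, "z")]
S_reduced = [t for t in S if bloom_maybe_contains(bits, t[0])]
print(S_reduced)  # false positives are possible, false negatives are not
```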

    Recurring Query Processing on Big Data

    Advances in hardware, software, and networks have enabled applications, from business enterprises and scientific and engineering disciplines to social networks, to generate data at unprecedented volume, variety, velocity, and veracity. Innovation in these domains is now limited by the ability to analyze and discover knowledge from the collected data in a timely and scalable fashion. To facilitate such large-scale big data analytics, the MapReduce computing paradigm and its open-source implementation Hadoop are among the most popular and widely used technologies. Hadoop's success as a competitor to traditional parallel database systems lies in its simplicity, ease of use, flexibility, automatic fault tolerance, superior scalability, and cost effectiveness due to its use of inexpensive commodity hardware that can scale to petabytes of data over thousands of machines. Recurring queries, executed repeatedly over long periods of time on rapidly evolving high-volume data, have become a bedrock component of most of these analytic applications. Efficient execution and optimization techniques must be designed to assure the responsiveness and scalability of these recurring queries. In this dissertation, we thoroughly investigate topics in the area of recurring query processing on big data. We first propose a novel scalable infrastructure called Redoop that treats recurring queries over big evolving data as first-class citizens during query processing. This is in contrast to state-of-the-art MapReduce/Hadoop systems, which face significant challenges with recurring queries, including redundant computations, significant latencies, and huge application development efforts. Redoop offers innovative window-aware optimization techniques for recurring query execution, including adaptive window-aware data partitioning, window-aware task scheduling, and inter-window caching mechanisms. Redoop also retains the fault tolerance of MapReduce via automatic cache recovery and task re-execution support. Second, we address the crucial need to accommodate hundreds or even thousands of recurring analytics queries that periodically execute over frequently updated data sets, e.g., the latest stock transactions, new log files, or recent news feeds. For many applications, such recurring queries come with user-specified service-level agreements (SLAs), commonly expressed as the maximum allowed latency for producing results before their merits decay. On top of Redoop, we built a scalable multi-query sharing engine tailored for recurring workloads in the MapReduce infrastructure, called Helix. Helix deploys new sliced window-alignment techniques to create sharing opportunities among recurring queries without introducing additional I/O overheads or unnecessary data scans. Furthermore, Helix introduces a cost/benefit model for creating a sharing plan among the recurring queries, and a scheduling strategy for executing them so as to maximize SLA satisfaction. Third, recurring analytics queries tend to be expensive, especially when query processing consumes data sets in the hundreds of terabytes or more. Time-sensitive recurring queries, such as fraud detection, often come with tight response-time constraints in the form of query deadlines. Data sampling is a popular technique for computing approximate results with an acceptable error bound while reducing high-demand resource consumption and thus improving query turnaround times. We therefore propose the first fast approximate query engine for recurring workloads in the MapReduce infrastructure, called Faro. Faro introduces two key innovations: (1) a deadline-aware sampling strategy that builds samples from the original data with reduced sample sizes compared to uniform sampling, and (2) adaptive resource allocation strategies that maximally improve the approximate results while still meeting the response-time requirements specified in the recurring queries. In a comprehensive experimental study of each part of this dissertation, we demonstrate the superiority of the proposed strategies over state-of-the-art techniques in scalability, effectiveness, and robustness.
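    As a rough illustration of the inter-window caching idea behind Redoop, the sketch below caches per-pane partial aggregates so that each recurrence only processes newly arrived data; the pane/window model and class names are simplifying assumptions, not Redoop's actual interfaces.

```python
# An illustrative sketch of inter-window caching for a recurring windowed
# count: each recurrence aggregates only the new pane and reuses cached
# partial results for the panes it shares with the previous window.
from collections import Counter

class RecurringCount:
    def __init__(self, window=3):
        self.window = window    # window length, in panes
        self.pane_cache = []    # cached per-pane partial counts

    def on_new_pane(self, records):
        # compute the partial aggregate for the new pane only
        self.pane_cache.append(Counter(records))
        if len(self.pane_cache) > self.window:
            self.pane_cache.pop(0)   # expire the oldest pane
        # the window result reuses cached panes instead of rescanning them
        result = Counter()
        for pane in self.pane_cache:
            result += pane
        return result

q = RecurringCount(window=2)
print(q.on_new_pane(["a", "b", "a"]))  # Counter({'a': 2, 'b': 1})
print(q.on_new_pane(["b"]))            # panes 1 and 2 combined
print(q.on_new_pane(["c"]))            # pane 1 has expired
```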

    Scalable RDF compression with MapReduce and HDT

    The use of RDF to publish semantic data has increased notably in recent years. Today's datasets are so large and so interconnected that processing them raises scalability problems. HDT is a compact representation of RDF that aims to minimize space consumption while providing query capabilities. However, generating HDT from textual RDF formats is costly in time and resources. This work studies the use of MapReduce, a framework for the distributed processing of large amounts of data, for the task of creating HDT structures from RDF, and analyzes the improvements obtained, in both resources and time, over creating those structures in a single-node process. Departamento de Informática (Arquitectura y Tecnología de Computadores, Ciencias de la Computación e Inteligencia Artificial, Lenguajes y Sistemas Informáticos). Máster en Investigación en Tecnologías de la Información y las Comunicaciones.
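    The dictionary-encoding step at the heart of HDT-style compression can be sketched as two passes, shown below in plain Python; the in-memory passes stand in for the distributed MapReduce jobs the work describes, and the example triples are invented.

```python
# A minimal sketch of dictionary encoding, the core of HDT-style RDF
# compression: every term is replaced by an integer ID, so triples shrink
# to compact ID tuples. A real deployment runs these passes as MapReduce
# jobs over a cluster rather than in memory.
triples = [
    ("ex:alice", "foaf:knows", "ex:bob"),
    ("ex:bob",   "foaf:knows", "ex:carol"),
]

# pass 1 (map + reduce): collect the distinct terms and assign sorted IDs
terms = sorted({t for triple in triples for t in triple})
term_id = {term: i + 1 for i, term in enumerate(terms)}

# pass 2 (map): rewrite each triple as an ID triple
encoded = [tuple(term_id[t] for t in triple) for triple in triples]
print(term_id)
print(encoded)   # [(1, 4, 2), (2, 4, 3)]
```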

    ART: A Large Scale Microblogging Data Management System

    Ph.D. (Doctor of Philosophy)

    Scalable and Declarative Information Extraction in a Parallel Data Analytics System

    Information extraction (IE) on very large data sets requires highly complex, scalable, and adaptive systems. Although numerous IE algorithms exist, their seamless and extensible combination in a scalable system is still a major challenge. This work presents a query-based IE system for a parallel data analysis platform, which is configurable for specific application domains and scales to terabyte-sized text collections. First, configurable operators are defined for basic IE and Web analytics tasks, which can be used to express complex IE tasks in the form of declarative queries. All operators are characterized in terms of their properties to highlight the potential and importance of optimizing non-relational, user-defined operators (UDFs) in dataflows. Subsequently, we survey the state of the art in optimizing non-relational dataflows and highlight that a comprehensive optimization of UDFs is still a challenge. Based on this observation, an extensible, logical optimizer (SOFA) is introduced, which incorporates the semantics of UDFs into the optimization process. SOFA analyzes a compact set of operator properties and combines automated analysis with manual UDF annotations to enable a comprehensive optimization of dataflows. SOFA is able to logically optimize arbitrary dataflows from different application areas, resulting in significant runtime improvements compared to other techniques. Finally, the applicability of the presented system to terabyte-sized corpora is investigated; we systematically evaluate the scalability and robustness of the employed methods and tools in order to pinpoint the most critical challenges in building an IE system for very large data sets.
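    To illustrate the kind of UDF annotation such an optimizer can build on, the sketch below lets operators declare the record fields they read and write, and uses those sets for a conservative reordering test; the decorator API and the commutation rule are simplifying assumptions, not SOFA's actual interface.

```python
# An illustrative sketch of semantics-aware UDF optimization: operators
# declare read/write field sets, and the optimizer reorders two UDFs only
# when neither touches what the other writes.
def udf(reads, writes):
    def wrap(fn):
        fn.reads, fn.writes = set(reads), set(writes)
        return fn
    return wrap

@udf(reads={"text"}, writes={"tokens"})
def tokenize(rec): ...

@udf(reads={"url"}, writes={"domain"})
def extract_domain(rec): ...

def may_swap(f, g):
    # conservative commutation test over the annotated field sets
    return (f.writes.isdisjoint(g.reads | g.writes)
            and g.writes.isdisjoint(f.reads | f.writes))

print(may_swap(tokenize, extract_domain))  # True: independent fields
```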

    An evaluation of PERF joins for a two-way semijoin based algorithm.

    Distributed database systems are increasingly replacing centralized database systems in the business world, driven by business expansion and advances in network technology. Query optimization provides a strategy for executing each query over the network in the most cost-effective way, aiming to minimize the transmission cost over the network. Many techniques and algorithms have been proposed to optimize queries, such as the semijoin [BC81][BGW+81], the 2-way semijoin [KR87], the composite semijoin [PC90], the hash semijoin [TC92], and the PERF join [LR95]. In distributed query processing, the semijoin has been used as an effective operator to reduce the total amount of data transmission. The 2-way semijoin is an extended version of the semijoin for more cost-effective distributed query processing. PERF joins are 2-way semijoins that use a bit vector during the backward phase. PERF [LR95] is designed to minimize the cost of the backward reduction; it is based on tuple-scan order instead of hashing, and thus does not suffer any loss of join information from hash collisions. We propose Algorithm UPSJ and Algorithm CPSJ, which apply two variants of PERF joins to a 2-way semijoin algorithm: UPSJ combines uncompressed PERF joins with 2-way semijoin techniques, while CPSJ applies compressed PERF joins during backward processing. Programs were designed to implement both the original and the enhanced algorithms, and several experiments showed a considerable enhancement obtained by applying the PERF join concept. Thesis (M.Sc.), Dept. of Computer Science, University of Windsor (Canada), 2005.
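    The core PERF idea, a backward bit vector aligned with the forward tuple-scan order, can be sketched in a few lines of Python; relations R and S and the two-site setup are toy assumptions for illustration.

```python
# A minimal sketch of a PERF join's backward phase: instead of hashed
# values, site 2 returns one bit per position of site 1's tuple-scan
# order, so membership is exact and hash collisions cannot occur.
R = [(1, "a"), (2, "b"), (5, "c"), (7, "d")]   # at site 1
S = [(1, "x"), (5, "y"), (6, "z")]             # at site 2

# forward phase: site 1 ships its join column, in tuple-scan order
r_keys = [key for key, _ in R]

# site 2 answers with one bit per position of r_keys (the PERF vector)
s_keys = {key for key, _ in S}
perf = [1 if key in s_keys else 0 for key in r_keys]

# backward phase: site 1 keeps exactly the tuples whose bit is set
R_reduced = [t for t, bit in zip(R, perf) if bit]
print(perf)        # [1, 0, 1, 0]
print(R_reduced)   # [(1, 'a'), (5, 'c')]
```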

    Searching to Translate and Translating to Search: When Information Retrieval Meets Machine Translation

    With the adoption of web services in daily life, people have access to tremendous amounts of information, beyond any human's reading and comprehension capabilities. As a result, search technologies have become a fundamental tool for accessing information. Furthermore, the web contains information in multiple languages, introducing another barrier between people and information. Therefore, search technologies need to handle content written in multiple languages, which requires techniques to account for the linguistic differences. Information Retrieval (IR) is the study of search techniques, in which the task is to find material relevant to a given information need. Cross-Language Information Retrieval (CLIR) is the special case of IR in which the search takes place over a multi-lingual collection. Of course, it is not helpful to retrieve content in languages the user cannot understand. Machine Translation (MT) studies the translation of text from one language into another efficiently (within a reasonable amount of time) and effectively (producing fluent output that retains the original meaning), which helps people understand what is written regardless of the source language. Putting these together, we observe that search and translation technologies are parts of an important class of user applications, calling for a better integration of search (IR) and translation (MT), since the two technologies need to work together to produce high-quality output. The main goal of this dissertation is to build better connections between IR and MT, for which we present solutions to two problems. Searching to translate explores approximate search techniques for extracting bilingual data from multilingual Wikipedia collections to train better translation models. Translating to search explores the integration of a modern statistical MT system into the cross-language search process. In both cases, our best-performing approach yielded improvements over strong baselines for a variety of language pairs. Finally, we propose a general architecture in which various components of IR and MT systems can be connected in a feedback loop, with potential improvements to both search and translation tasks. We hope that the ideas presented in this dissertation will spur more interest in the integration of search and translation technologies.
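    As a small example of how translation can be folded into cross-language search, the sketch below expands each query term into a weighted disjunction of translations drawn from a bilingual table, a standard CLIR technique; the toy translation table and probabilities are invented, and the dissertation integrates a full statistical MT system rather than a flat dictionary.

```python
# An illustrative sketch of probabilistic query translation for CLIR:
# each source-language query term becomes a weighted set of target-language
# terms, which a retrieval engine can score as a structured query.
translation_table = {   # source term -> {target term: probability}
    "casa": {"house": 0.7, "home": 0.3},
    "grande": {"big": 0.6, "large": 0.4},
}

def translate_query(terms):
    # unknown terms pass through unchanged with full weight
    return [translation_table.get(t, {t: 1.0}) for t in terms]

structured_query = translate_query(["casa", "grande"])
print(structured_query)
# [{'house': 0.7, 'home': 0.3}, {'big': 0.6, 'large': 0.4}]
```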