
    Towards Predicting the Runtime of Iterative Analytics with PREDIcT

    Machine learning algorithms are widely used today for analytical tasks such as data cleaning, data categorization, or data filtering. At the same time, the rise of social media has motivated the recent uptake of large-scale graph processing. Both categories of algorithms are dominated by iterative subtasks, i.e., processing steps that are executed repeatedly until a convergence condition is met. Optimizing cluster resource allocations among multiple workloads of iterative algorithms requires estimating their resource requirements and runtime, which in turn requires: i) predicting the number of iterations, and ii) predicting the processing time of each iteration. As both parameters depend on the characteristics of the dataset and on the convergence function, estimating their values before execution is difficult. This paper proposes PREDIcT, an experimental methodology for predicting the runtime of iterative algorithms. PREDIcT uses sample runs to capture the algorithm's convergence trend and per-iteration key input features that are well correlated with the actual processing requirements of the complete input dataset. Using this combination of characteristics, we predict the runtime of iterative algorithms, including algorithms with very different runtime patterns among subsequent iterations. Our experimental evaluation of multiple algorithms on scale-free graphs shows a relative prediction error of 10%-30% for runtime, including algorithms with up to 100x runtime variability among consecutive iterations.
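    The general idea lends itself to a compact illustration. Below is a minimal Python sketch of sample-run extrapolation under two simplifying assumptions that the paper itself does not make: the convergence metric decays geometrically, and per-iteration cost scales linearly from sample to full input (PREDIcT's key-input-feature matching is more sophisticated). All names are illustrative.

    ```python
    # Illustrative sketch of sample-run-based runtime prediction (not PREDIcT's
    # actual implementation). Assumptions: the convergence metric shrinks
    # geometrically across iterations, and per-iteration cost scales linearly
    # from the sample to the full input.
    import numpy as np

    def predict_runtime(sample_deltas, sample_times, scale_factor, epsilon):
        """Extrapolate runtime from a sample run.

        sample_deltas: convergence metric observed per sample iteration
        sample_times:  wall-clock seconds per sample iteration
        scale_factor:  full input size / sample size
        epsilon:       convergence threshold of the full run
        """
        # Fit a geometric trend delta_k ~ delta_0 * r**k in log space.
        k = np.arange(len(sample_deltas))
        r = np.exp(np.polyfit(k, np.log(sample_deltas), 1)[0])
        # Iterations until the extrapolated delta falls below epsilon.
        n_iters = int(np.ceil(np.log(epsilon / sample_deltas[0]) / np.log(r)))
        # Scale the mean observed per-iteration cost to the full input. The
        # paper also handles non-uniform per-iteration costs; a mean does not.
        per_iteration = np.mean(sample_times) * scale_factor
        return n_iters * per_iteration

    # Toy example: deltas roughly halve each iteration, ~2 s per sample
    # iteration, full input 10x larger than the sample, stop at delta < 1e-6.
    print(predict_runtime([1.0, 0.5, 0.26, 0.13], [2.0, 2.1, 1.9, 2.0], 10, 1e-6))
    ```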

    Same Queries, Different Data: Can we Predict Query Performance?

    We consider MapReduce workloads produced by analytics applications. In contrast to ad hoc query workloads, analytics applications consist of fixed data flows that are run over newly arriving data sets or over different portions of an existing data set. Examples of such workloads include document analysis/indexing, social media analytics, and ETL (Extract, Transform, Load). Motivated by these workloads, we propose a technique that predicts the runtime performance of a fixed set of queries running over varying input data sets. Our prediction technique splits each query into several segments, estimates each segment's performance using machine learning models, and plugs these per-segment estimates into a global analytical model to predict the overall query runtime. Our approach uses minimal statistics about the input data sets (e.g., tuple size, cardinality), complemented with historical information about prior query executions (e.g., execution time). We analyze the accuracy of predictions for several segment granularities on standard analytical benchmarks such as TPC-DS [17] and on several real workloads. We obtain prediction errors below 25% for 90% of predictions.
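    A minimal sketch of the segment-wise idea follows, assuming scikit-learn, linear per-segment models, and a purely additive global model (i.e., segments execute sequentially); the paper's actual features, per-segment models, and analytical combination may differ.

    ```python
    # Hedged sketch of segment-wise runtime prediction: one learned model per
    # query segment, combined by a simple global model (here a sum, which
    # assumes segments execute sequentially). Features and model choice are
    # illustrative, not the paper's exact setup.
    import numpy as np
    from sklearn.linear_model import LinearRegression

    class SegmentedPredictor:
        def __init__(self, n_segments):
            self.models = [LinearRegression() for _ in range(n_segments)]

        def fit(self, history):
            # history[i] is (X, y) for segment i: feature vectors and runtimes
            # gathered from prior executions of the same fixed query.
            for model, (X, y) in zip(self.models, history):
                model.fit(np.asarray(X), np.asarray(y))

        def predict(self, segment_features):
            # segment_features[i] = [input cardinality, avg tuple size, ...]
            return sum(float(m.predict([f])[0])
                       for m, f in zip(self.models, segment_features))

    # Two prior executions per segment; features: [cardinality, tuple size].
    history = [([[1e6, 120], [2e6, 120]], [30.0, 58.0]),   # segment 0 (map)
               ([[1e6, 120], [2e6, 120]], [12.0, 25.0])]   # segment 1 (reduce)
    p = SegmentedPredictor(n_segments=2)
    p.fit(history)
    print(p.predict([[1.5e6, 120], [1.5e6, 120]]))  # ~62.5 s on these toy numbers
    ```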

    Structured, unstructured, and semistructured search in semistructured databases

    A single framework for storing and querying XML data, using denormalized schema decompositions, can support both structured queries and unstructured searches, and can serve as a foundation for combining the two forms of information access. The XML data format is increasingly popular in applications that mix structured data and unstructured text, and these applications require the integration of structured query and text search mechanisms to access XML data. First, we introduce a framework for storing and querying XML data using denormalized schema decompositions. This framework was initially implemented in the XCacheDB XML database system, which uses XML schemas to shred XML data into relational storage. XCacheDB supports a subset of the XQuery language and emphasizes query optimization to reduce latency and output the first results quickly. XCacheDB relies on XML schemas, which poses a novel challenge for validating XML updates. We investigate the incremental validation of XML documents with respect to DTDs and XML Schemas, and exhibit an O(m log n) algorithm using an auxiliary structure of size O(n), where n is the size of the document and m is the number of updates. We also exhibit a restricted class of DTDs, called "local," that arises commonly in practice and for which incremental validation can be done in practically constant time by maintaining only a list of counters. We present implementations and experimental evaluations of both general incremental validation and local validation in the XCacheDB system. We then present the XKeyword system, which uses a variation of XCacheDB's schema decompositions to support keyword proximity searches in XML databases. XKeyword decompositions include "ID relations," which store the IDs of target objects and pre-compute common joins. Finally, we present the architecture of the Semi-Structured Search System (S4), designed to bridge the gap between traditional database and information retrieval systems. Its S4QL query language combines features of structured queries and text search to facilitate information discovery without knowledge of the schema. S4 is based on the same schema decomposition framework as XCacheDB and XKeyword; however, the combination of structured and unstructured query features poses novel challenges to efficient query processing. We outline these issues and possible ways of addressing them.
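    The counter-based validation for "local" DTDs admits a short sketch. The version below assumes that validity of a content model depends only on per-child-tag counts with (min, max) bounds, which is a simplification of the dissertation's formal definition; names and the DTD encoding are illustrative.

    ```python
    # Hedged sketch of constant-time incremental validation for "local" DTDs,
    # assuming validity depends only on per-child-tag counts with (min, max)
    # bounds; the dissertation's formal definition of "local" is more precise.
    from collections import Counter

    # Assumed encoding: parent tag -> {child tag: (min_count, max_count)}.
    DTD = {"order": {"item": (1, float("inf")), "note": (0, 1)}}

    class Node:
        def __init__(self, tag):
            self.tag = tag
            self.child_counts = Counter()  # the auxiliary list of counters

        def insert_child(self, child_tag):
            _, hi = DTD[self.tag].get(child_tag, (0, 0))
            if self.child_counts[child_tag] + 1 > hi:
                raise ValueError(f"too many <{child_tag}> under <{self.tag}>")
            self.child_counts[child_tag] += 1  # O(1) per update

        def delete_child(self, child_tag):
            lo, _ = DTD[self.tag].get(child_tag, (0, 0))
            if self.child_counts[child_tag] - 1 < lo:
                raise ValueError(f"too few <{child_tag}> under <{self.tag}>")
            self.child_counts[child_tag] -= 1  # O(1) per update

    order = Node("order")
    order.insert_child("item")    # ok: any number of <item> allowed
    order.insert_child("note")    # ok: first <note>
    # order.insert_child("note")  # would raise: at most one <note> allowed
    ```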

    On the Path to Efficient XML Queries

    XQuery and SQL/XML are powerful new languages for querying XML data. However, they contain a number of stumbling blocks that users need to be aware of to get the expected results and performance. For example, certain language features make it hard, if not impossible, to exploit XML indexes. The major database vendors provide XQuery and SQL/XML support in their current or upcoming product releases. In this paper, we identify common pitfalls gleaned from the experiences of early adopters of this functionality. We illustrate these pitfalls through concrete examples, explain the unexpected query behavior, and show alternative formulations of the queries that behave and perform as anticipated. As a result, we provide guidelines for XQuery and SQL/XML users, feedback on the language standards, and food for thought for emerging languages and APIs.

    WIKIANALYTICS: Ad-hoc Querying of Highly Heterogeneous Structured Data

    Searching and extracting meaningful information from highly heterogeneous datasets is a hot topic that has received a lot of attention. However, existing solutions sit at two extremes. On one side are rigid, complex query languages (e.g., SQL, XQuery/XPath), which are hard to use without full schema knowledge and an expert user, and which require up-front data integration. At the other extreme are keyword search queries over relational databases [3], [1], [10], [9], [2], [11] as well as over semistructured data [6], [12], [17], [15], which are too imprecise to specify the user's intent exactly [16]. To address these limitations, we propose an alternative search paradigm that derives tables of precise and complete results from a very sparse set of heterogeneous records. Our approach allows users to disambiguate search results by navigating along conceptual dimensions that describe the records. To this end, we cluster documents based on the fields and values that contain the query keywords, and we build a universal navigational lattice (UNL) over all such discovered clusters. Conceptually, the UNL encodes all possible ways to group the documents in the data corpus based on where the keywords hit. We describe WIKIANALYTICS, a system that facilitates data extraction from the Wikipedia infobox collection. WIKIANALYTICS provides a dynamic and intuitive interface that lets the average user explore the search results and construct homogeneous structured tables, which can be further queried and mashed up (e.g., filtered and aggregated) using conventional tools.
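    The clustering step behind the UNL can be sketched compactly: group records by the exact set of fields in which the query keywords hit, then connect clusters whose hit-sets are ordered by containment. The sketch below is a simplification with illustrative names and toy data, not WIKIANALYTICS' actual code.

    ```python
    # Hedged sketch of the clustering step behind a universal navigational
    # lattice (UNL): records are grouped by the exact set of (field, keyword)
    # hits, and lattice edges follow containment between those hit-sets.
    from collections import defaultdict
    from itertools import combinations

    def cluster_by_hits(records, keywords):
        """records: sparse dicts of field -> value; returns hit-set clusters."""
        clusters = defaultdict(list)
        for rec in records:
            hits = frozenset((field, kw) for field, value in rec.items()
                             for kw in keywords if kw in str(value).lower())
            if hits:
                clusters[hits].append(rec)
        return clusters

    def lattice_edges(clusters):
        """Navigation edges: proper containment between cluster hit-sets."""
        return [(a, b) for a, b in combinations(clusters, 2) if a < b or b < a]

    infoboxes = [{"name": "Ada Lovelace", "field": "mathematics"},
                 {"name": "Alan Turing", "field": "mathematics, logic"},
                 {"name": "Grace Hopper", "known_for": "COBOL"}]
    clusters = cluster_by_hits(infoboxes, ["mathematics", "logic"])
    print({tuple(sorted(k)): len(v) for k, v in clusters.items()})
    print(len(lattice_edges(clusters)))  # 1 containment edge
    ```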