1,220 research outputs found

    Efficient Evaluation of Sparse Data Cubes

    Get PDF
    Computing data cubes requires the aggregation of measures over arbitrary combinations of dimensions in a data set. Efficient data cube evaluation remains challenging because of the potentially very large sizes of input datasets (e.g., in the data warehousing context), the well-known curse of dimensionality, and the complexity of queries that need to be supported. This paper proposes a new dynamic data structure called SST (Sparse Statistics Trees) and a novel, in-teractive, and fast cube evaluation algorithm called CUPS (Cubing by Pruning SST), which is especially well suitable for computing aggregates in cubes whose data sets are sparse. SST only stores the aggregations of non-empty cube cells instead of the detailed records. Furthermore, it retains in memory the dense cubes (a.k.a. iceberg cubes) whose aggregate values are above a threshold. Sparse cubes are stored on disks. This allows a fast, accurate approximation for queries. If users desire more refined answers, related sparse cubes are aggregated. SST is incrementally maintainable, which makes CUPS suitable for data warehousing and analysis of streaming data. Experiment results demonstrate the excellent performance and good scalability of our approach

    Efficient Evaluation of Sparse Data Cubes

    Full text link
    available at www.springerlink.com ***Note: Figures may be missing from this format of the document Computing data cubes requires the aggregation of measures over arbitrary combinations of dimensions in a data set. Efficient data cube evaluation remains challenging because of the potentially very large sizes of input datasets (e.g., in the data warehousing context), the well-known curse of dimensionality, and the complexity of queries that need to be supported. This paper proposes a new dynamic data structure called SST (Sparse Statistics Trees) and a novel, in-teractive, and fast cube evaluation algorithm called CUPS (Cubing by Pruning SST), which is especially well suitable for computing aggregates in cubes whose data sets are sparse. SST only stores the aggregations of non-empty cube cells instead of the detailed records. Furthermore, it retains in memory the dense cubes (a.k.a. iceberg cubes) whose aggregate values are above a threshold. Sparse cubes are stored on disks. This allows a fast, accurate approximation for queries. If users desire more refined answers, related sparse cubes are aggregated. SST is incrementally maintainable, which makes CUPS suitable for data warehousing and analysis of streaming data. Experiment results demonstrate the excellent performance and good scalability of our approach. Article

    Enabling Graph Analysis Over Relational Databases

    Get PDF
    Complex interactions and systems can be modeled by analyzing the connections between underlying entities or objects described by a dataset. These relationships form networks (graphs), the analysis of which has been shown to provide tremendous value in areas ranging from retail to many scientific domains. This value is obtained by using various methodologies from network science-- a field which focuses on studying network representations in the real world. In particular "graph algorithms", which iteratively traverse a graph's connections, are often leveraged to gain insights. To take advantage of the opportunity presented by graph algorithms, there have been a variety of specialized graph data management systems, and analysis frameworks, proposed in recent years, which have made significant advances in efficiently storing and analyzing graph-structured data. Most datasets however currently do not reside in these specialized systems but rather in general-purpose relational database management systems (RDBMS). A relational or similarly structured system is typically governed by a schema of varying strictness that implements constraints and is meticulously designed for the specific enterprise. Such structured datasets contain many relationships between the entities therein, that can be seen as latent or "hidden" graphs that exist inherently inside the datasets. However, these relationships can only typically be traversed via conducting expensive JOINs using SQL or similar languages. Thus, in order for users to efficiently traverse these latent graphs to conduct analysis, data needs to be transformed and migrated to specialized systems. This creates barriers that hinder and discourage graph analysis; our vision is to break these barriers. In this dissertation we investigate the opportunities and challenges involved in efficiently leveraging relationships within data stored in structured databases. First, we present GraphGen, a lightweight software layer that is independent from the underlying database, and provides interfaces for graph analysis of data in RDBMSs. GraphGen is the first such system that introduces an intuitive high-level language for specifying graphs of interest, and utilizes in-memory graph representations to tackle the problems associated with analyzing graphs that are hidden inside structured datasets. We show GraphGen can analyze such graphs in orders of magnitude less memory, and often computation time, while eliminating manual Extract-Transform-Load (ETL) effort. Second, we examine how in-memory graph representations of RDBMS data can be used to enhance relational query processing. We present a novel, general framework for executing GROUP BY aggregation over conjunctive queries which avoids materialization of intermediate JOIN results, and wrap this framework inside a multi-way relational operator called Join-Agg. We show that Join-Agg can compute aggregates over a class of relational and graph queries using orders of magnitude less memory and computation time

    Searching and mining in enriched geo-spatial data

    Get PDF
    The emergence of new data collection mechanisms in geo-spatial applications paired with a heightened tendency of users to volunteer information provides an ever-increasing flow of data of high volume, complex nature, and often associated with inherent uncertainty. Such mechanisms include crowdsourcing, automated knowledge inference, tracking, and social media data repositories. Such data bearing additional information from multiple sources like probability distributions, text or numerical attributes, social context, or multimedia content can be called multi-enriched. Searching and mining this abundance of information holds many challenges, if all of the data's potential is to be released. This thesis addresses several major issues arising in that field, namely path queries using multi-enriched data, trend mining in social media data, and handling uncertainty in geo-spatial data. In all cases, the developed methods have made significant contributions and have appeared in or were accepted into various renowned international peer-reviewed venues. A common use of geo-spatial data is path queries in road networks where traditional methods optimise results based on absolute and ofttimes singular metrics, i.e., finding the shortest paths based on distance or the best trade-off between distance and travel time. Integrating additional aspects like qualitative or social data by enriching the data model with knowledge derived from sources as mentioned above allows for queries that can be issued to fit a broader scope of needs or preferences. This thesis presents two implementations of incorporating multi-enriched data into road networks. In one case, a range of qualitative data sources is evaluated to gain knowledge about user preferences which is subsequently matched with locations represented in a road network and integrated into its components. Several methods are presented for highly customisable path queries that incorporate a wide spectrum of data. In a second case, a framework is described for resource distribution with reappearance in road networks to serve one or more clients, resulting in paths that provide maximum gain based on a probabilistic evaluation of available resources. Applications for this include finding parking spots. Social media trends are an emerging research area giving insight in user sentiment and important topics. Such trends consist of bursts of messages concerning a certain topic within a time frame, significantly deviating from the average appearance frequency of the same topic. By investigating the dissemination of such trends in space and time, this thesis presents methods to classify trend archetypes to predict future dissemination of a trend. Processing and querying uncertain data is particularly demanding given the additional knowledge required to yield results with probabilistic guarantees. Since such knowledge is not always available and queries are not easily scaled to larger datasets due to the #P-complete nature of the problem, many existing approaches reduce the data to a deterministic representation of its underlying model to eliminate uncertainty. However, data uncertainty can also provide valuable insight into the nature of the data that cannot be represented in a deterministic manner. This thesis presents techniques for clustering uncertain data as well as query processing, that take the additional information from uncertainty models into account while preserving scalability using a sampling-based approach, while previous approaches could only provide one of the two. The given solutions enable the application of various existing clustering techniques or query types to a framework that manages the uncertainty.Das Erscheinen neuer Methoden zur Datenerhebung in räumlichen Applikationen gepaart mit einer erhöhten Bereitschaft der Nutzer, Daten über sich preiszugeben, generiert einen stetig steigenden Fluss von Daten in großer Menge, komplexer Natur, und oft gepaart mit inhärenter Unsicherheit. Beispiele für solche Mechanismen sind Crowdsourcing, automatisierte Wissensinferenz, Tracking, und Daten aus sozialen Medien. Derartige Daten, angereichert mit mit zusätzlichen Informationen aus verschiedenen Quellen wie Wahrscheinlichkeitsverteilungen, Text- oder numerische Attribute, sozialem Kontext, oder Multimediainhalten, werden als multi-enriched bezeichnet. Suche und Datamining in dieser weiten Datenmenge hält viele Herausforderungen bereit, wenn das gesamte Potenzial der Daten genutzt werden soll. Diese Arbeit geht auf mehrere große Fragestellungen in diesem Feld ein, insbesondere Pfadanfragen in multi-enriched Daten, Trend-mining in Daten aus sozialen Netzwerken, und die Beherrschung von Unsicherheit in räumlichen Daten. In all diesen Fällen haben die entwickelten Methoden signifikante Forschungsbeiträge geleistet und wurden veröffentlicht oder angenommen zu diversen renommierten internationalen, von Experten begutachteten Konferenzen und Journals. Ein gängiges Anwendungsgebiet räumlicher Daten sind Pfadanfragen in Straßennetzwerken, wo traditionelle Methoden die Resultate anhand absoluter und oft auch singulärer Maße optimieren, d.h., der kürzeste Pfad in Bezug auf die Distanz oder der beste Kompromiss zwischen Distanz und Reisezeit. Durch die Integration zusätzlicher Aspekte wie qualitativer Daten oder Daten aus sozialen Netzwerken als Anreicherung des Datenmodells mit aus diesen Quellen abgeleitetem Wissen werden Anfragen möglich, die ein breiteres Spektrum an Anforderungen oder Präferenzen erfüllen. Diese Arbeit präsentiert zwei Ansätze, solche multi-enriched Daten in Straßennetze einzufügen. Zum einen wird eine Reihe qualitativer Datenquellen ausgewertet, um Wissen über Nutzerpräferenzen zu generieren, welches darauf mit Örtlichkeiten im Straßennetz abgeglichen und in das Netz integriert wird. Diverse Methoden werden präsentiert, die stark personalisierbare Pfadanfragen ermöglichen, die ein weites Spektrum an Daten mit einbeziehen. Im zweiten Fall wird ein Framework präsentiert, das eine Ressourcenverteilung im Straßennetzwerk modelliert, bei der einmal verbrauchte Ressourcen erneut auftauchen können. Resultierende Pfade ergeben einen maximalen Ertrag basieren auf einer probabilistischen Evaluation der verfügbaren Ressourcen. Eine Anwendung ist die Suche nach Parkplätzen. Trends in sozialen Medien sind ein entstehendes Forscchungsgebiet, das Einblicke in Benutzerverhalten und wichtige Themen zulässt. Solche Trends bestehen aus großen Mengen an Nachrichten zu einem bestimmten Thema innerhalb eines Zeitfensters, so dass die Auftrittsfrequenz signifikant über den durchschnittlichen Level liegt. Durch die Untersuchung der Fortpflanzung solcher Trends in Raum und Zeit präsentiert diese Arbeit Methoden, um Trends nach Archetypen zu klassifizieren und ihren zukünftigen Weg vorherzusagen. Die Anfragebearbeitung und Datamining in unsicheren Daten ist besonders herausfordernd, insbesondere im Hinblick auf das notwendige Zusatzwissen, um Resultate mit probabilistischen Garantien zu erzielen. Solches Wissen ist nicht immer verfügbar und Anfragen lassen sich aufgrund der \P-Vollständigkeit des Problems nicht ohne Weiteres auf größere Datensätze skalieren. Dennoch kann Datenunsicherheit wertvollen Einblick in die Struktur der Daten liefern, der mit deterministischen Methoden nicht erreichbar wäre. Diese Arbeit präsentiert Techniken zum Clustering unsicherer Daten sowie zur Anfragebearbeitung, die die Zusatzinformation aus dem Unsicherheitsmodell in Betracht ziehen, jedoch gleichzeitig die Skalierbarkeit des Ansatzes auf große Datenmengen sicherstellen

    Multi-Dimensional Partitioning in BUC for Data Cubes

    Get PDF
    Bottom-Up Computation (BUC) is one of the most studied algorithms for data cube generation in on-line analytical processing. Its computation in the bottom-up style allows the algorithm to efficiently generate a data cube for memory-sized input data. When the entire input data cannot fit into memory, many literatures suggest partitioning the data by a dimension and run the algorithm on each of the single-dimensional partitioned data. For very large sized input data, the partitioned data might still not be able to fit into the memory and partitioning by additional dimensions is required; however, this multi- dimensional partitioning is more complicated than single-dimensional partitioning and it has not been fully discussed before. Our goal is to provide a heuristic implementation on multi-dimensional partitioning in BUC. To confirm our design, we compare it with our implemented PipeSort, which is a top-down data cubing algorithm; meanwhile, we confirm the advantages and disadvantages between the top-down data cubing algorithm and the bottom-up data cubing algorithm
    corecore