8,683 research outputs found

    SODA: Generating SQL for Business Users

    Full text link
    The purpose of data warehouses is to enable business analysts to make better decisions. Over the years the technology has matured and data warehouses have become extremely successful. As a consequence, more and more data has been added to the data warehouses and their schemas have become increasingly complex. These systems still work great in order to generate pre-canned reports. However, with their current complexity, they tend to be a poor match for non tech-savvy business analysts who need answers to ad-hoc queries that were not anticipated. This paper describes the design, implementation, and experience of the SODA system (Search over DAta Warehouse). SODA bridges the gap between the business needs of analysts and the technical complexity of current data warehouses. SODA enables a Google-like search experience for data warehouses by taking keyword queries of business users and automatically generating executable SQL. The key idea is to use a graph pattern matching algorithm that uses the metadata model of the data warehouse. Our results with real data from a global player in the financial services industry show that SODA produces queries with high precision and recall, and makes it much easier for business users to interactively explore highly-complex data warehouses.Comment: VLDB201

    Storage Solutions for Big Data Systems: A Qualitative Study and Comparison

    Full text link
    Big data systems development is full of challenges in view of the variety of application areas and domains that this technology promises to serve. Typically, fundamental design decisions involved in big data systems design include choosing appropriate storage and computing infrastructures. In this age of heterogeneous systems that integrate different technologies for optimized solution to a specific real world problem, big data system are not an exception to any such rule. As far as the storage aspect of any big data system is concerned, the primary facet in this regard is a storage infrastructure and NoSQL seems to be the right technology that fulfills its requirements. However, every big data application has variable data characteristics and thus, the corresponding data fits into a different data model. This paper presents feature and use case analysis and comparison of the four main data models namely document oriented, key value, graph and wide column. Moreover, a feature analysis of 80 NoSQL solutions has been provided, elaborating on the criteria and points that a developer must consider while making a possible choice. Typically, big data storage needs to communicate with the execution engine and other processing and visualization technologies to create a comprehensive solution. This brings forth second facet of big data storage, big data file formats, into picture. The second half of the research paper compares the advantages, shortcomings and possible use cases of available big data file formats for Hadoop, which is the foundation for most big data computing technologies. Decentralized storage and blockchain are seen as the next generation of big data storage and its challenges and future prospects have also been discussed

    NOSQL design for analytical workloads: Variability matters

    Get PDF
    Big Data has recently gained popularity and has strongly questioned relational databases as universal storage systems, especially in the presence of analytical workloads. As result, co-relational alternatives, commonly known as NOSQL (Not Only SQL) databases, are extensively used for Big Data. As the primary focus of NOSQL is on performance, NOSQL databases are directly designed at the physical level, and consequently the resulting schema is tailored to the dataset and access patterns of the problem in hand. However, we believe that NOSQL design can also benefit from traditional design approaches. In this paper we present a method to design databases for analytical workloads. Starting from the conceptual model and adopting the classical 3-phase design used for relational databases, we propose a novel design method considering the new features brought by NOSQL and encompassing relational and co-relational design altogether.Peer ReviewedPostprint (author's final draft

    A succinct data structure for self-indexing ternary relations

    Get PDF
    The final publication is available via http://dx.doi.org/10.1016/j.jda.2016.10.002[Abstract] The representation of binary relations has been intensively studied and many different theoretical and practical representations have been proposed to answer the usual queries in multiple domains. However, ternary relations have not received as much attention, even though many real-world applications require the processing of ternary relations. In this paper we present a new compressed and self-indexed data structure that we call Interleaved K2-tree (IK2-tree), designed to compactly represent and efficiently query general ternary relations. The IK2-tree is an evolution of an existing data structure, the K2-tree [6], initially designed to represent Web graphs and later applied to other domains. The IK2-tree is able to extend the K2-tree to represent a ternary relation, based on the idea of decomposing it into a collection of binary relations but providing indexing capabilities in all the three dimensions. We present different ways to use IK2-tree to model different types of ternary relations using as reference two typical domains: RDF and Temporal Graphs. We also experimentally evaluate our representations comparing them in space usage and performance with other solutions of the state of the art.Ministerio de Economía y Competitividad; TIN2013-46238-C4-3-RXunta de Galicia; GRC2013/053Chile. Núcleo Milenio Información y Coordinación en Redes; ICM/FIC RC130003Chile.Fondo Nacional de Desarrollo Científico y Tecnológico; 1-14079

    Geospatial Narratives and their Spatio-Temporal Dynamics: Commonsense Reasoning for High-level Analyses in Geographic Information Systems

    Full text link
    The modelling, analysis, and visualisation of dynamic geospatial phenomena has been identified as a key developmental challenge for next-generation Geographic Information Systems (GIS). In this context, the envisaged paradigmatic extensions to contemporary foundational GIS technology raises fundamental questions concerning the ontological, formal representational, and (analytical) computational methods that would underlie their spatial information theoretic underpinnings. We present the conceptual overview and architecture for the development of high-level semantic and qualitative analytical capabilities for dynamic geospatial domains. Building on formal methods in the areas of commonsense reasoning, qualitative reasoning, spatial and temporal representation and reasoning, reasoning about actions and change, and computational models of narrative, we identify concrete theoretical and practical challenges that accrue in the context of formal reasoning about `space, events, actions, and change'. With this as a basis, and within the backdrop of an illustrated scenario involving the spatio-temporal dynamics of urban narratives, we address specific problems and solutions techniques chiefly involving `qualitative abstraction', `data integration and spatial consistency', and `practical geospatial abduction'. From a broad topical viewpoint, we propose that next-generation dynamic GIS technology demands a transdisciplinary scientific perspective that brings together Geography, Artificial Intelligence, and Cognitive Science. Keywords: artificial intelligence; cognitive systems; human-computer interaction; geographic information systems; spatio-temporal dynamics; computational models of narrative; geospatial analysis; geospatial modelling; ontology; qualitative spatial modelling and reasoning; spatial assistance systemsComment: ISPRS International Journal of Geo-Information (ISSN 2220-9964); Special Issue on: Geospatial Monitoring and Modelling of Environmental Change}. IJGI. Editor: Duccio Rocchini. (pre-print of article in press

    Multiscale Markov Decision Problems: Compression, Solution, and Transfer Learning

    Full text link
    Many problems in sequential decision making and stochastic control often have natural multiscale structure: sub-tasks are assembled together to accomplish complex goals. Systematically inferring and leveraging hierarchical structure, particularly beyond a single level of abstraction, has remained a longstanding challenge. We describe a fast multiscale procedure for repeatedly compressing, or homogenizing, Markov decision processes (MDPs), wherein a hierarchy of sub-problems at different scales is automatically determined. Coarsened MDPs are themselves independent, deterministic MDPs, and may be solved using existing algorithms. The multiscale representation delivered by this procedure decouples sub-tasks from each other and can lead to substantial improvements in convergence rates both locally within sub-problems and globally across sub-problems, yielding significant computational savings. A second fundamental aspect of this work is that these multiscale decompositions yield new transfer opportunities across different problems, where solutions of sub-tasks at different levels of the hierarchy may be amenable to transfer to new problems. Localized transfer of policies and potential operators at arbitrary scales is emphasized. Finally, we demonstrate compression and transfer in a collection of illustrative domains, including examples involving discrete and continuous statespaces.Comment: 86 pages, 15 figure

    A Nine Month Progress Report on an Investigation into Mechanisms for Improving Triple Store Performance

    No full text
    This report considers the requirement for fast, efficient, and scalable triple stores as part of the effort to produce the Semantic Web. It summarises relevant information in the major background field of Database Management Systems (DBMS), and provides an overview of the techniques currently in use amongst the triple store community. The report concludes that for individuals and organisations to be willing to provide large amounts of information as openly-accessible nodes on the Semantic Web, storage and querying of the data must be cheaper and faster than it is currently. Experiences from the DBMS field can be used to maximise triple store performance, and suggestions are provided for lines of investigation in areas of storage, indexing, and query optimisation. Finally, work packages are provided describing expected timetables for further study of these topics
    corecore