1,848 research outputs found

    Towards Analytics Aware Ontology Based Access to Static and Streaming Data (Extended Version)

    Full text link
    Real-time analytics that requires integration and aggregation of heterogeneous and distributed streaming and static data is a typical task in many industrial scenarios such as diagnostics of turbines in Siemens. OBDA approach has a great potential to facilitate such tasks; however, it has a number of limitations in dealing with analytics that restrict its use in important industrial applications. Based on our experience with Siemens, we argue that in order to overcome those limitations OBDA should be extended and become analytics, source, and cost aware. In this work we propose such an extension. In particular, we propose an ontology, mapping, and query language for OBDA, where aggregate and other analytical functions are first class citizens. Moreover, we develop query optimisation techniques that allow to efficiently process analytical tasks over static and streaming data. We implement our approach in a system and evaluate our system with Siemens turbine data

    Storage Solutions for Big Data Systems: A Qualitative Study and Comparison

    Full text link
    Big data systems development is full of challenges in view of the variety of application areas and domains that this technology promises to serve. Typically, fundamental design decisions involved in big data systems design include choosing appropriate storage and computing infrastructures. In this age of heterogeneous systems that integrate different technologies for optimized solution to a specific real world problem, big data system are not an exception to any such rule. As far as the storage aspect of any big data system is concerned, the primary facet in this regard is a storage infrastructure and NoSQL seems to be the right technology that fulfills its requirements. However, every big data application has variable data characteristics and thus, the corresponding data fits into a different data model. This paper presents feature and use case analysis and comparison of the four main data models namely document oriented, key value, graph and wide column. Moreover, a feature analysis of 80 NoSQL solutions has been provided, elaborating on the criteria and points that a developer must consider while making a possible choice. Typically, big data storage needs to communicate with the execution engine and other processing and visualization technologies to create a comprehensive solution. This brings forth second facet of big data storage, big data file formats, into picture. The second half of the research paper compares the advantages, shortcomings and possible use cases of available big data file formats for Hadoop, which is the foundation for most big data computing technologies. Decentralized storage and blockchain are seen as the next generation of big data storage and its challenges and future prospects have also been discussed

    The Family of MapReduce and Large Scale Data Processing Systems

    Full text link
    In the last two decades, the continuous increase of computational power has produced an overwhelming flow of data which has called for a paradigm shift in the computing architecture and large scale data processing mechanisms. MapReduce is a simple and powerful programming model that enables easy development of scalable parallel applications to process vast amounts of data on large clusters of commodity machines. It isolates the application from the details of running a distributed program such as issues on data distribution, scheduling and fault tolerance. However, the original implementation of the MapReduce framework had some limitations that have been tackled by many research efforts in several followup works after its introduction. This article provides a comprehensive survey for a family of approaches and mechanisms of large scale data processing mechanisms that have been implemented based on the original idea of the MapReduce framework and are currently gaining a lot of momentum in both research and industrial communities. We also cover a set of introduced systems that have been implemented to provide declarative programming interfaces on top of the MapReduce framework. In addition, we review several large scale data processing systems that resemble some of the ideas of the MapReduce framework for different purposes and application scenarios. Finally, we discuss some of the future research directions for implementing the next generation of MapReduce-like solutions.Comment: arXiv admin note: text overlap with arXiv:1105.4252 by other author

    Big Data Analysis

    Get PDF
    The value of big data is predicated on the ability to detect trends and patterns and more generally to make sense of the large volumes of data that is often comprised of a heterogeneous mix of format, structure, and semantics. Big data analysis is the component of the big data value chain that focuses on transforming raw acquired data into a coherent usable resource suitable for analysis. Using a range of interviews with key stakeholders in small and large companies and academia, this chapter outlines key insights, state of the art, emerging trends, future requirements, and sectorial case studies for data analysis

    Bench-Ranking: ettekirjutav analüüsimeetod suurte teadmiste graafide päringutele

    Get PDF
    Relatsiooniliste suurandmete (BD) töötlemisraamistike kasutamine suurte teadmiste graafide töötlemiseks kätkeb endas võimalust päringu jõudlust optimeerimida. Kaasaegsed BD-süsteemid on samas keerulised andmesüsteemid, mille konfiguratsioonid omavad olulist mõju jõudlusele. Erinevate raamistike ja konfiguratsioonide võrdlusuuringud pakuvad kogukonnale parimaid tavasid parema jõudluse saavutamiseks. Enamik neist võrdlusuuringutest saab liigitada siiski vaid kirjeldavaks ja diagnostiliseks analüütikaks. Lisaks puudub ühtne standard nende uuringute võrdlemiseks kvantitatiivselt järjestatud kujul. Veelgi enam, suurte graafide töötlemiseks vajalike konveierite kavandamine eeldab täiendavaid disainiotsuseid mis tulenevad mitteloomulikust (relatsioonilisest) graafi töötlemise paradigmast. Taolisi disainiotsuseid ei saa automaatselt langetada, nt relatsiooniskeemi, partitsioonitehnika ja salvestusvormingute valikut. Käesolevas töös käsitleme kuidas me antud uurimuslünga täidame. Esmalt näitame disainiotsuste kompromisside mõju BD-süsteemide jõudluse korratavusele suurte teadmiste graafide päringute tegemisel. Lisaks näitame BD-raamistike jõudluse kirjeldavate ja diagnostiliste analüüside piiranguid suurte graafide päringute tegemisel. Seejärel uurime, kuidas lubada ettekirjutavat analüütikat järjestamisfunktsioonide ja mitmemõõtmeliste optimeerimistehnikate (nn "Bench-Ranking") kaudu. See lähenemine peidab kirjeldava tulemusanalüüsi keerukuse, suunates praktiku otse teostatavate teadlike otsusteni.Leveraging relational Big Data (BD) processing frameworks to process large knowledge graphs yields a great interest in optimizing query performance. Modern BD systems are yet complicated data systems, where the configurations notably affect the performance. Benchmarking different frameworks and configurations provides the community with best practices for better performance. However, most of these benchmarking efforts are classified as descriptive and diagnostic analytics. Moreover, there is no standard for comparing these benchmarks based on quantitative ranking techniques. Moreover, designing mature pipelines for processing big graphs entails considering additional design decisions that emerge with the non-native (relational) graph processing paradigm. Those design decisions cannot be decided automatically, e.g., the choice of the relational schema, partitioning technique, and storage formats. Thus, in this thesis, we discuss how our work fills this timely research gap. Particularly, we first show the impact of those design decisions’ trade-offs on the BD systems’ performance replicability when querying large knowledge graphs. Moreover, we showed the limitations of the descriptive and diagnostic analyses of BD frameworks’ performance for querying large graphs. Thus, we investigate how to enable prescriptive analytics via ranking functions and Multi-Dimensional optimization techniques (called ”Bench-Ranking”). This approach abstracts out from the complexity of descriptive performance analysis, guiding the practitioner directly to actionable informed decisions.https://www.ester.ee/record=b553332
    corecore