28 research outputs found

    Towards information profiling: data lake content metadata management

    Get PDF
    There is currently a burst of Big Data (BD) processed and stored in huge raw data repositories, commonly called Data Lakes (DL). These BD require new techniques of data integration and schema alignment in order to make the data usable by its consumers and to discover the relationships linking their content. This can be provided by metadata services which discover and describe their content. However, there is currently a lack of a systematic approach for such kind of metadata discovery and management. Thus, we propose a framework for the profiling of informational content stored in the DL, which we call information profiling. The profiles are stored as metadata to support data analysis. We formally define a metadata management process which identifies the key activities required to effectively handle this.We demonstrate the alternative techniques and performance of our process using a prototype implementation handling a real-life case-study from the OpenML DL, which showcases the value and feasibility of our approach.Peer ReviewedPostprint (author's final draft

    Big data in SATA Airline: finding new solutions for old problems

    Get PDF
    With the rapid growth of operational data needed in airlines and the value that can be attributed to knowledge extracted from these data, airlines have already realized the importance of technologies and methodologies associated with the concept of big data. In this article we present the case study of SATA Airlines. The operational and the decision support systems are described as well as the perspectives of using these new technologies to support knowledge creation and aid the solution of problems in this specific company. The proposed system provides a new operational environment.info:eu-repo/semantics/publishedVersio

    Modeling Data Lake Metadata with a Data Vault

    Get PDF
    International audienceWith the rise of big data, business intelligence had to find solutions for managing even greater data volumes and variety than in data warehouses, which proved ill-adapted. Data lakes answer these needs from a storage point of view, but require managing adequate metadata to guarantee an efficient access to data. Starting from a multidimensional metadata model designed for an industrial heritage data lake presenting a lack of schema evolutivity, we propose in this paper to use ensemble modeling, and more precisely a data vault, to address this issue. To illustrate the feasibility of this approach, we instantiate our metadata conceptual model into relational and document-oriented logical and physical models, respectively. We also compare the physical models in terms of metadata storage and query response time

    Metadata Systems for Data Lakes: Models and Features

    Get PDF
    International audienceOver the past decade, the data lake concept has emerged as an alternative to data warehouses for storing and analyzing big data. A data lake allows storing data without any predefined schema. Therefore, data querying and analysis depend on a metadata system that must be efficient and comprehensive. However, metadata management in data lakes remains a current issue and the criteria for evaluating its effectiveness are more or less nonexistent.In this paper, we introduce MEDAL, a generic, graph-based model for metadata management in data lakes. We also propose evaluation criteria for data lake metadata systems through a list of expected features. Eventually, we show that our approach is more comprehensive than existing metadata systems

    A Big Data Lake for Multilevel Streaming Analytics

    Get PDF
    Large organizations are seeking to create new architectures and scalable platforms to effectively handle data management challenges due to the explosive nature of data rarely seen in the past. These data management challenges are largely posed by the availability of streaming data at high velocity from various sources in multiple formats. The changes in data paradigm have led to the emergence of new data analytics and management architecture. This paper focuses on storing high volume, velocity and variety data in the raw formats in a data storage architecture called a data lake. First, we present our study on the limitations of traditional data warehouses in handling recent changes in data paradigms. We discuss and compare different open source and commercial platforms that can be used to develop a data lake. We then describe our end-to-end data lake design and implementation approach using the Hadoop Distributed File System (HDFS) on the Hadoop Data Platform (HDP). Finally, we present a real-world data lake development use case for data stream ingestion, staging, and multilevel streaming analytics which combines structured and unstructured data. This study can serve as a guide for individuals or organizations planning to implement a data lake solution for their use cases.Comment: 6 page

    SemLinker: automating big data integration for casual users

    Get PDF
    A data integration approach combines data from different sources and builds a unified view for the users. Big data integration inherently is a complex task, and the existing approaches are either potentially limited or invariably rely on manual inputs and interposition from experts or skilled users. SemLinker, an ontology-based data integration system, is part of a metadata management framework for personal data lake (PDL), a personal store-everything architecture. PDL is for casual and unskilled users, therefore SemLinker adopts an automated data integration workflow to minimize manual input requirements. To support the flat architecture of a lake, SemLinker builds and maintains a schema metadata level without involving any physical transformation of data during integration, preserving the data in their native formats while, at the same time, allowing them to be queried and analyzed. Scalability, heterogeneity, and schema evolution are big data integration challenges that are addressed by SemLinker. Large and real-world datasets of substantial heterogeneities are used in evaluating SemLinker. The results demonstrate and confirm the integration efficiency and robustness of SemLinker, especially regarding its capability in the automatic handling of data heterogeneities and schema evolutions

    Realising the right to data portability for the domestic Internet of Things

    Get PDF
    There is an increasing role for the IT design community to play in regulation of emerging IT. Article 25 of the EU General Data Protection Regulation (GDPR) 2016 puts this on a strict legal basis by establishing the need for information privacy by design and default (PbD) for personal data-driven technologies. Against this backdrop, we examine legal, commercial and technical perspectives around the newly created legal right to data portability (RTDP) in GDPR. We are motivated by a pressing need to address regulatory challenges stemming from the Internet of Things (IoT). We need to find channels to support the protection of these new legal rights for users in practice. In Part I we introduce the internet of things and information PbD in more detail. We briefly consider regulatory challenges posed by the IoT and the nature and practical challenges surrounding the regulatory response of information privacy by design. In Part II, we look in depth at the legal nature of the RTDP, determining what it requires from IT designers in practice but also limitations on the right and how it relates to IoT. In Part III we focus on technical approaches that can support the realisation of the right. We consider the state of the art in data management architectures, tools and platforms that can provide portability, increased transparency and user control over the data flows. In Part IV, we bring our perspectives together to reflect on the technical, legal and business barriers and opportunities that will shape the implementation of the RTDP in practice, and how the relationships may shape emerging IoT innovation and business models. We finish with brief conclusions about the future for the RTDP and PbD in the IoT
    corecore