Towards information profiling: data lake content metadata management
There is currently a surge of Big Data (BD) processed and stored in huge raw data repositories, commonly called Data Lakes (DL). These BD require new techniques of data integration and schema alignment so that the data become usable by their consumers and the relationships linking their content can be discovered. This can be provided by metadata services which discover and describe the lake's content. However, a systematic approach for this kind of metadata discovery and management is currently lacking. Thus, we propose a framework for profiling the informational content stored in the DL, which we call information profiling. The profiles are stored as metadata to support data analysis. We formally define a metadata management process which identifies the key activities required to handle this effectively. We demonstrate alternative techniques and the performance of our process using a prototype implementation handling a real-life case study from the OpenML DL, which showcases the value and feasibility of our approach.
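The profiling idea described above can be illustrated with a small sketch. This is not the paper's implementation; the function names and the choice of statistics (null ratio, distinct count, top values) are assumptions made here purely for illustration of what a stored content profile might contain.

```python
# Illustrative sketch of "information profiling": summarize a raw
# dataset's columns and keep the summary as metadata in the lake.
from collections import Counter

def profile_column(values):
    """Summarize one column: counts, nulls, distinct values, top values."""
    non_null = [v for v in values if v is not None]
    return {
        "count": len(values),
        "nulls": len(values) - len(non_null),
        "distinct": len(set(non_null)),
        "top": Counter(non_null).most_common(3),
    }

def profile_dataset(rows, columns):
    """Profile every column; the result would be stored as lake metadata."""
    return {c: profile_column([r.get(c) for r in rows]) for c in columns}

rows = [{"age": 34, "city": "Porto"}, {"age": None, "city": "Porto"},
        {"age": 41, "city": "Lisbon"}]
profile = profile_dataset(rows, ["age", "city"])
print(profile["city"]["distinct"])  # 2
```

Such profiles let consumers judge a dataset's content and quality without scanning the raw data itself.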
Big data in SATA Airline: finding new solutions for old problems
With the rapid growth of operational data needed by airlines and the value of the knowledge that can be extracted from these data, airlines have realized the importance of the technologies and methodologies associated with the concept of big data. In this article we present the case study of SATA Airlines. We describe the operational and decision support systems, as well as the prospects of using these new technologies to support knowledge creation and help solve problems in this specific company. The proposed system provides a new operational environment.
Modeling Data Lake Metadata with a Data Vault
With the rise of big data, business intelligence has had to find solutions for managing even greater data volume and variety than data warehouses, which proved ill-adapted, could handle. Data lakes answer these needs from a storage point of view, but require adequate metadata management to guarantee efficient access to the data. Starting from a multidimensional metadata model designed for an industrial heritage data lake that lacks schema evolvability, we propose in this paper to use ensemble modeling, and more precisely a data vault, to address this issue. To illustrate the feasibility of this approach, we instantiate our metadata conceptual model into relational and document-oriented logical and physical models, respectively. We also compare the physical models in terms of metadata storage and query response time.
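The appeal of a data vault for evolving metadata can be sketched minimally. This is an illustration of the general hub/link/satellite pattern, not the paper's actual model; all keys and attribute names are hypothetical.

```python
# Minimal data vault sketch for lake metadata (illustrative only).
# Hubs hold business keys, links connect hubs, and satellites attach
# timestamped descriptive attributes, so new metadata attributes are
# added as new satellite rows rather than as schema changes.
from datetime import datetime, timezone

hubs = {}        # hub_key -> business key
links = []       # (hub_key_a, hub_key_b, relation)
satellites = []  # (hub_key, load_ts, attributes)

def add_hub(key, business_key):
    hubs[key] = business_key

def add_link(key_a, key_b, relation):
    links.append((key_a, key_b, relation))

def add_satellite(hub_key, **attributes):
    ts = datetime.now(timezone.utc).isoformat()
    satellites.append((hub_key, ts, attributes))

add_hub("ds1", "blast_furnace_photos")
add_satellite("ds1", format="JPEG", source="archive")
add_satellite("ds1", license="CC-BY")  # metadata schema evolves freely

history = [s for s in satellites if s[0] == "ds1"]
print(len(history))  # 2
```

The point of the pattern is that each satellite row is additive and timestamped, which is what gives the metadata model its schema evolvability.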
Metadata Systems for Data Lakes: Models and Features
Over the past decade, the data lake concept has emerged as an alternative to data warehouses for storing and analyzing big data. A data lake allows storing data without any predefined schema. Therefore, data querying and analysis depend on a metadata system that must be efficient and comprehensive. However, metadata management in data lakes remains an open issue, and criteria for evaluating its effectiveness are largely nonexistent. In this paper, we introduce MEDAL, a generic, graph-based model for metadata management in data lakes. We also propose evaluation criteria for data lake metadata systems through a list of expected features. Finally, we show that our approach is more comprehensive than existing metadata systems.
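A graph-based metadata model of this kind can be sketched in a few lines. The sketch below captures only the general idea (typed nodes and typed relationship edges over lake objects); it is written in the spirit of, but is not identical to, MEDAL, and all identifiers are invented here.

```python
# Illustrative graph-based metadata store: nodes are data lake objects,
# typed edges record relationships such as lineage or similarity.
nodes = {}  # node_id -> properties
edges = []  # (src, dst, edge_type)

def add_node(node_id, **props):
    nodes[node_id] = props

def add_edge(src, dst, edge_type):
    edges.append((src, dst, edge_type))

def neighbors(node_id, edge_type=None):
    """Follow outgoing edges, optionally filtered by relationship type."""
    return [d for s, d, t in edges
            if s == node_id and (edge_type is None or t == edge_type)]

add_node("raw_csv", zone="raw")
add_node("clean_parquet", zone="trusted")
add_edge("raw_csv", "clean_parquet", "lineage")

print(neighbors("raw_csv", "lineage"))  # ['clean_parquet']
```

Traversals over such a graph are what let a metadata system answer lineage and relatedness queries that a flat catalog cannot.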
A Big Data Lake for Multilevel Streaming Analytics
Large organizations are seeking to create new architectures and scalable platforms to handle the data management challenges posed by explosive data growth rarely seen in the past. These challenges largely stem from the availability of streaming data arriving at high velocity from various sources in multiple formats. This change in data paradigm has led to the emergence of new data analytics and management architectures. This paper focuses on storing high-volume, high-velocity and high-variety data in their raw formats in a data storage architecture called a data lake. First, we present our study of the limitations of traditional data warehouses in handling recent changes in data paradigms. We discuss and compare different open-source and commercial platforms that can be used to develop a data lake. We then describe our end-to-end data lake design and implementation approach using the Hadoop Distributed File System (HDFS) on the Hadoop Data Platform (HDP). Finally, we present a real-world data lake development use case for data stream ingestion, staging, and multilevel streaming analytics which combines structured and unstructured data. This study can serve as a guide for individuals or organizations planning to implement a data lake solution for their use cases.
SemLinker: automating big data integration for casual users
A data integration approach combines data from different sources and builds a unified view for users. Big data integration is inherently a complex task, and existing approaches are either limited in scope or rely on manual input and intervention from experts or skilled users. SemLinker, an ontology-based data integration system, is part of a metadata management framework for the personal data lake (PDL), a personal store-everything architecture. PDL targets casual and unskilled users, so SemLinker adopts an automated data integration workflow to minimize the manual input required. To support the flat architecture of a lake, SemLinker builds and maintains a schema metadata level without any physical transformation of the data during integration, preserving the data in their native formats while still allowing them to be queried and analyzed. SemLinker addresses the big data integration challenges of scalability, heterogeneity, and schema evolution. We evaluate SemLinker on large, real-world datasets of substantial heterogeneity. The results confirm its integration efficiency and robustness, especially its capability to automatically handle data heterogeneities and schema evolutions.
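The core idea of a schema metadata level over untransformed data can be illustrated briefly. This sketch is not SemLinker's actual API or ontology machinery; the mapping table and attribute names are invented here to show how queries can run against a mapping rather than against physically transformed copies.

```python
# Illustrative sketch: keep each source in its native format and
# maintain schema metadata mapping heterogeneous native attributes
# onto unified names, so queries resolve through the mapping.
import csv
import io

# Two sources with different native schemas for the same concept.
src_a = io.StringIO("lat,lon\n41.1,-8.6\n")
src_b = io.StringIO("latitude;longitude\n38.7;-9.1\n")

# Schema metadata: source -> ({unified_attr: native_attr}, delimiter).
mappings = {
    "a": ({"latitude": "lat", "longitude": "lon"}, ","),
    "b": ({"latitude": "latitude", "longitude": "longitude"}, ";"),
}

def query(source_id, stream, attrs):
    """Answer a unified-schema query without rewriting the source."""
    mapping, delim = mappings[source_id]
    for row in csv.DictReader(stream, delimiter=delim):
        yield {a: row[mapping[a]] for a in attrs}

rows = list(query("a", src_a, ["latitude"])) + \
       list(query("b", src_b, ["latitude"]))
print(rows)  # [{'latitude': '41.1'}, {'latitude': '38.7'}]
```

When a source's schema evolves, only its entry in the mapping changes; the native files and the unified query interface stay untouched, which is the property the abstract highlights.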
Realising the right to data portability for the domestic Internet of Things
There is an increasing role for the IT design community to play in the regulation of emerging IT. Article 25 of the EU General Data Protection Regulation (GDPR) 2016 puts this on a strict legal basis by establishing the need for information privacy by design and default (PbD) for personal data-driven technologies. Against this backdrop, we examine legal, commercial and technical perspectives around the newly created legal right to data portability (RTDP) in the GDPR. We are motivated by a pressing need to address regulatory challenges stemming from the Internet of Things (IoT), and to find channels that support the protection of these new legal rights for users in practice. In Part I we introduce the Internet of Things and information PbD in more detail, and briefly consider the regulatory challenges posed by the IoT and the nature and practical challenges surrounding the regulatory response of information privacy by design. In Part II, we look in depth at the legal nature of the RTDP, determining what it requires from IT designers in practice, as well as the limitations of the right and how it relates to the IoT. In Part III we focus on technical approaches that can support the realisation of the right, considering the state of the art in data management architectures, tools and platforms that can provide portability, increased transparency and user control over data flows. In Part IV, we bring these perspectives together to reflect on the technical, legal and business barriers and opportunities that will shape the implementation of the RTDP in practice, and how these relationships may shape emerging IoT innovation and business models. We finish with brief conclusions about the future of the RTDP and PbD in the IoT.
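As a purely illustrative aside on the technical side of the RTDP discussed above: GDPR Article 20 requires personal data to be provided "in a structured, commonly used and machine-readable format". The sketch below shows one hypothetical way a domestic IoT platform might bundle a user's device data for such an export; the record fields and function names are invented here, not drawn from any real platform.

```python
# Hypothetical sketch: export a user's IoT observations as a portable
# JSON bundle, one conceivable answer to Article 20's "structured,
# commonly used and machine-readable format" requirement.
import json

device_log = [
    {"device": "thermostat", "ts": "2018-05-25T09:00:00Z", "temp_c": 20.5},
    {"device": "thermostat", "ts": "2018-05-25T10:00:00Z", "temp_c": 21.0},
]

def export_portable(user_id, records):
    """Bundle a data subject's records for transfer to another controller."""
    return json.dumps(
        {"subject": user_id, "format": "json", "records": records},
        indent=2,
    )

bundle = export_portable("user-42", device_log)
print(json.loads(bundle)["subject"])  # user-42
```

The harder questions the article raises, such as which derived or mixed data fall within the right, are legal rather than technical and cannot be settled by format choices alone.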