84 research outputs found

    SCIT: A Schema Change Interpretation Tool for Dynamic-Schema Data Warehouses

    Data Warehouses (DW) have to continuously adapt to evolving business requirements, which entails structural modifications (schema changes) and data migration in the system design. However, it is challenging for designers to control the performance and cost overhead of different schema change implementations. In this paper, we demonstrate SCIT, a tool for DW designers to test and implement different logical design alternatives in a two-fold manner. As its main functionality, SCIT translates common DW schema modifications into directly executable SQL scripts for relational database systems, facilitating design and testing automation. At the same time, SCIT assesses changes and recommends alternative design decisions to help designers improve logical designs and avoid common dimensional modeling pitfalls and mistakes. This paper serves as a walk-through of the system features, showcasing the interaction with the tool’s user interface in order to easily and effectively modify DW schemata and enable schema change analysis.
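
    As a rough illustration of the translation step, the following Python sketch maps a declarative schema change description to an executable SQL script. The change vocabulary, table name, and default value are hypothetical, invented for illustration; they are not taken from the actual tool.

        # Hypothetical sketch of translating a declarative DW schema change
        # into an executable SQL script; the change format is illustrative.
        def render_change(change: dict) -> str:
            """Render one schema change as a SQL script."""
            if change["op"] == "add_attribute":
                # Extend the dimension table, then backfill existing rows
                # so downstream queries see a consistent default value.
                return (
                    f'ALTER TABLE {change["table"]} '
                    f'ADD COLUMN {change["column"]} {change["type"]};\n'
                    f'UPDATE {change["table"]} '
                    f'SET {change["column"]} = {change["default"]!r};'
                )
            if change["op"] == "drop_attribute":
                return (f'ALTER TABLE {change["table"]} '
                        f'DROP COLUMN {change["column"]};')
            raise ValueError(f'unsupported change: {change["op"]}')

        print(render_change({
            "op": "add_attribute", "table": "dim_customer",
            "column": "segment", "type": "VARCHAR(32)", "default": "unknown",
        }))

    A real tool would also have to weigh the migration cost of the backfilling UPDATE against alternative designs, which is the kind of trade-off SCIT is meant to surface.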

    Metadata Representations for Queryable ML Model Zoos

    Machine learning (ML) practitioners and organizations are building model zoos of pre-trained models, containing metadata that describes properties of the ML models and datasets useful for reporting, auditing, reproducibility, and interpretability purposes. This metadata is currently not standardized; its expressivity is limited; and there is no interoperable way to store and query it. Consequently, model search, reuse, comparison, and composition are hindered. In this paper, we advocate for standardized ML model metadata representation and management, proposing a supporting toolkit to help practitioners manage and query that metadata.
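
    To make "queryable" concrete, here is a minimal sketch assuming a plain relational representation of model-zoo metadata; the schema, fields, and example rows are illustrative and not the paper's proposed standard.

        # Minimal sketch: model-zoo metadata stored relationally and queried
        # with SQL. Schema and rows are invented for illustration.
        import sqlite3

        con = sqlite3.connect(":memory:")
        con.execute("""CREATE TABLE model (
            name TEXT, task TEXT, dataset TEXT,
            accuracy REAL, license TEXT)""")
        con.executemany(
            "INSERT INTO model VALUES (?, ?, ?, ?, ?)",
            [("resnet50", "image-classification", "imagenet", 0.76, "apache-2.0"),
             ("bert-base", "text-classification", "glue", 0.82, "apache-2.0")])

        # Model search and comparison: the kind of query an interoperable
        # metadata representation would enable across zoos.
        for row in con.execute(
                "SELECT name, accuracy FROM model "
                "WHERE task = ? AND accuracy > ? ORDER BY accuracy DESC",
                ("text-classification", 0.8)):
            print(row)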

    Data integration and metadata management in data lakes

    Although big data has been discussed for some years, it still poses many research challenges, such as the variety of data. Diverse data sources often exist as information silos, i.e., collections of non-integrated data management systems with heterogeneous schemas, query languages, and data models. This makes it very difficult to efficiently integrate, access, and query the large volume of diverse data in these silos with traditional ‘schema-on-write’ approaches such as data warehouses. Data lake systems, i.e., repositories that store raw data in its original format and provide a common access interface, have been proposed as a solution to this problem. The challenges of combining multiple heterogeneous data sources in data lakes are rooted in the research area of data integration. To integrate the data in data lakes, the primary tasks are understanding the relationships (e.g., schema mappings) among the data sources and answering user queries over the heterogeneous sources. Moreover, to prevent a data lake from turning into an unusable data swamp, metadata management is crucial, especially for accessing and querying the data. The main challenges for metadata management in data lakes are to acquire, model, store, and enrich the metadata that describes the data sources.

    Therefore, in this thesis, we present a comprehensive and flexible data lake architecture and a prototype system, Constance, which provides data ingestion, integration, querying, and sophisticated metadata management over structured, semi-structured (e.g., JSON, XML), and graph data. First, we propose a native mapping representation that captures the hierarchical structure of nested mappings, together with efficient mapping generation algorithms that avoid producing a large number of intermediate basic mappings. Second, to store heterogeneous data in raw formats, our data lake system enables the coexistence of several data storage systems with different data models. To provide a unified querying interface, we design a novel query rewriting engine that combines logical methods for data integration based on declarative mappings with the big data processing system Apache Spark. Our query rewriting engine efficiently executes the rewritten queries and reconciles the query results into an integrated dataset. Third, we study the formalism of the generated schema mappings as dependencies: with regard to computational complexity and the decidability of certain reasoning tasks, mapping formalisms in second-order logic are less desirable than first-order mapping languages, and our algorithmic approach transforms schema mappings expressed in second-order logic into logically equivalent first-order forms. Finally, we define a generic metadata model to represent the structure of heterogeneous sources and introduce clustering-based algorithms to discover relaxed functional dependencies, which enrich the metadata and improve data quality in the data lake.
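
    One central ingredient described above is rewriting queries over an integrated schema into queries over the heterogeneous sources via declarative mappings. The Python sketch below illustrates only the grouping step of such a rewriting, with an invented mapping format and source names; Constance's actual mapping representation and Spark-based execution are more involved.

        # Minimal sketch of mapping-based query rewriting. The mapping
        # format and source names are hypothetical, for illustration only.

        # Declarative mappings: integrated attribute -> (source, source path)
        MAPPINGS = {
            "patient.id":   ("hospital_json", "$.patientId"),
            "patient.name": ("hospital_json", "$.fullName"),
            "patient.dob":  ("lab_csv",       "birth_date"),
        }

        def rewrite(query_attrs):
            """Group requested integrated attributes into per-source plans."""
            plans = {}
            for attr in query_attrs:
                source, path = MAPPINGS[attr]
                plans.setdefault(source, []).append((attr, path))
            return plans

        # A query over the integrated schema becomes one sub-query per
        # source; the results would then be reconciled into one dataset
        # (in Constance, by a rewriting engine on top of Apache Spark).
        print(rewrite(["patient.id", "patient.name", "patient.dob"]))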

    Rewriting of plain SO tgds into nested tgds

    A Recommender System for Project Collaborations Based on Scientific Publications and Patents

    The successful execution of development and research projects depends on many factors. Innovation potential and future orientation help with proposal approval, but the composition of the project team is just as important. Interdisciplinary projects in particular depend on a team of excellent experts from the respective subfields. Medical technology is a good example of a highly innovative and at the same time highly interdisciplinary field. Yet it is precisely this interdisciplinarity that makes the search for experts difficult and tedious, since one first has to find one's way around unfamiliar domains and may still not reach the desired result. In this paper, we present our work on a recommender system for project partners within the mi-Mappa project, which, based on information from patents, scientific publications, and product information, can recommend experts for a project within an innovation field of medical technology.
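
    The core matching step can be pictured as ranking experts by the textual similarity between a project description and the publications and patents attributed to each expert. The sketch below uses a simple bag-of-words cosine similarity with invented data; the actual mi-Mappa system is not specified at this level of detail here.

        # Illustrative sketch: rank experts by bag-of-words cosine similarity
        # between a project description and their publication/patent texts.
        import math
        from collections import Counter

        def cosine(a: Counter, b: Counter) -> float:
            dot = sum(a[t] * b[t] for t in a)
            norm = (math.sqrt(sum(v * v for v in a.values()))
                    * math.sqrt(sum(v * v for v in b.values())))
            return dot / norm if norm else 0.0

        experts = {  # expert -> concatenated document text (invented)
            "A": "implantable sensor biosignal telemetry",
            "B": "machine learning image segmentation radiology",
        }
        project = "biosignal telemetry for implantable devices"

        query = Counter(project.split())
        ranking = sorted(
            experts,
            key=lambda e: cosine(query, Counter(experts[e].split())),
            reverse=True)
        print(ranking)  # experts ordered by textual fit to the project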

    Metadata Extraction and Management in Data Lakes With GEMMS

    In addition to volume and velocity, big data is also characterized by its variety. Variety in structure and semantics requires new integration approaches that can resolve the integration challenges even for large volumes of data. Data lakes should reduce upfront integration costs and provide a more flexible way to integrate and analyze data, as source data is loaded into the data lake repository in its original structure. Some syntactic transformation might be applied to enable access to the data in one common repository; however, deep semantic integration is done only after the initial loading of the data into the data lake. In this way, data is easily made available and can be restructured, aggregated, and transformed as required by later applications. Metadata management is a crucial component in a data lake, as the source data needs to be described by metadata to capture its semantics. We developed a Generic and Extensible Metadata Management System for data lakes (called GEMMS) that aims at the automatic extraction of metadata from a wide variety of data sources. Furthermore, the metadata is managed in an extensible metamodel that distinguishes between structural and semantic metadata. The use case applied for evaluation is from the life science domain, where data is often stored only in files, which hinders data access and efficient querying. The GEMMS framework has proven useful in this domain. In particular, the extensibility and flexibility of the framework are important, as data and metadata structures in scientific experiments cannot be defined a priori.
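
    A minimal sketch of such a metamodel is given below, assuming a split between structural and semantic metadata; the class and field names are illustrative, not the actual GEMMS API.

        # Illustrative metamodel separating structural from semantic metadata,
        # in the spirit of GEMMS; names are invented for this sketch.
        from dataclasses import dataclass, field

        @dataclass
        class StructuralMetadata:
            fmt: str                     # e.g. "csv", "xml"
            fields: list[str]            # discovered attribute names

        @dataclass
        class SemanticMetadata:
            annotations: dict[str, str]  # attribute -> meaning / ontology term

        @dataclass
        class DataSource:
            path: str
            structure: StructuralMetadata
            semantics: SemanticMetadata = field(
                default_factory=lambda: SemanticMetadata({}))

        # Automatic extraction would populate the structural part from the
        # file; semantic annotations can be added later, which keeps the
        # model extensible when schemas are not known a priori.
        src = DataSource(
            path="experiments/run_042.csv",
            structure=StructuralMetadata("csv", ["sample_id", "od600", "timestamp"]),
        )
        src.semantics.annotations["od600"] = "optical density at 600 nm"
        print(src)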