
    Efficient Detection of XML Integrity Constraints

    Title: Efficient Detection of XML Integrity Constraints. Author: Michal Švirec. Department: Department of Software Engineering, Faculty of Mathematics and Physics. Supervisor: RNDr. Irena Mlýnková, Ph.D. Abstract: Knowledge of the integrity constraints covered in XML data is an important aspect of efficient data processing. However, even when integrity constraints are defined for given data, it is a common phenomenon that the data violate the predefined set of constraints. This has motivated efforts to detect these inconsistencies and subsequently repair them. This work extends and refines recent approaches to repairing XML documents that violate a defined set of integrity constraints, specifically so-called functional dependencies. It proposes a repair algorithm that incorporates a weight model and also involves the user in the process of detecting and subsequently applying an appropriate repair of inconsistent XML documents. Experimental results are part of the work. Keywords: XML, functional dependency, functional dependency violations, violation repair.
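    As a rough illustration of the kind of inconsistency this thesis targets (not its actual algorithm, which adds a weight model and user interaction), the following sketch checks a single XML functional dependency expressed as element paths; the sample document and paths are hypothetical.

```python
# Minimal sketch: find violations of an XML functional dependency
# {lhs paths} -> rhs path, evaluated over repeating "target" elements.
# Illustration only, not the repair algorithm from the thesis.
from collections import defaultdict
import xml.etree.ElementTree as ET

def fd_violations(xml_text, target, lhs_paths, rhs_path):
    """Return LHS value tuples that map to more than one distinct RHS value."""
    root = ET.fromstring(xml_text)
    groups = defaultdict(set)
    for elem in root.iter(target):
        lhs = tuple(elem.findtext(p) for p in lhs_paths)
        groups[lhs].add(elem.findtext(rhs_path))
    return {lhs: vals for lhs, vals in groups.items() if len(vals) > 1}

doc = """<books>
  <book><isbn>1</isbn><title>XML Data</title></book>
  <book><isbn>1</isbn><title>XML Dta</title></book>
</books>"""
print(fd_violations(doc, "book", ["isbn"], "title"))  # {('1',): {'XML Data', 'XML Dta'}}
```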

    From Relations to XML: Cleaning, Integrating and Securing Data

    While relational databases are still the preferred approach for storing data, XML is emerging as the primary standard for representing and exchanging data. Consequently, it has become increasingly important to provide a uniform XML interface to various data sources (integration), and critical to protect sensitive and confidential information in XML data (access control). Moreover, it is preferable to first detect and repair inconsistencies in the data to avoid propagating errors to other data processing steps. In response to these challenges, this thesis presents an integrated framework for cleaning, integrating and securing data. The framework contains three parts. First, the data cleaning sub-framework makes use of a new class of constraints specially designed for improving data quality, referred to as conditional functional dependencies (CFDs), to detect and remove inconsistencies in relational data. Both batch and incremental techniques are developed for efficiently detecting CFD violations with SQL and repairing them based on a cost model. The cleaned relational data, together with other non-XML data, is then converted to XML format by using widely deployed XML publishing facilities. Second, the data integration sub-framework uses a novel formalism, XML integration grammars (XIGs), to integrate multi-source XML data which is either native or published from traditional databases. XIGs automatically support conformance to a target DTD, and allow one to build a large, complex integration via composition of component XIGs. To efficiently materialize the integrated data, algorithms are developed for merging XML queries in XIGs and for scheduling them. Third, to protect sensitive information in the integrated XML data, the data security sub-framework allows users to access the data only through authorized views. User queries posed on these views need to be rewritten into equivalent queries on the underlying document to avoid the prohibitive cost of materializing and maintaining a large number of views. Two algorithms are proposed to support virtual XML views: a rewriting algorithm that characterizes the rewritten queries as a new form of automata and an evaluation algorithm to execute the automata-represented queries. They allow the security sub-framework to answer queries on views in linear time. Using both relational and XML technologies, this framework provides a uniform approach to clean, integrate and secure data. The algorithms and techniques in the framework have been implemented, and the experimental study verifies their effectiveness and efficiency.
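    As a hedged sketch of what detecting a conditional functional dependency (CFD) violation can look like, the snippet below checks one CFD over in-memory rows; the thesis itself detects violations with SQL queries and repairs them via a cost model, and the relation, attributes, and pattern used here are made up.

```python
# Illustrative CFD check (not the SQL-based detection described above).
# A CFD ([country, zip] -> city, pattern country="UK") says: for UK rows,
# zip must determine city; any attribute without a constant is a wildcard.
from collections import defaultdict

def cfd_violations(rows, lhs, rhs, constants):
    """rows: list of dicts; lhs: LHS attributes; rhs: RHS attribute;
    constants: dict of attribute -> required constant (others are wildcards)."""
    groups = defaultdict(set)
    direct = []
    for row in rows:
        if any(row[a] != v for a, v in constants.items() if a != rhs):
            continue  # row does not match the LHS pattern
        if rhs in constants and row[rhs] != constants[rhs]:
            direct.append(row)  # single-tuple violation of a constant RHS
        groups[tuple(row[a] for a in lhs)].add(row[rhs])
    # Pair violations: matching rows that agree on the LHS but not on the RHS.
    pairs = [k for k, vals in groups.items() if len(vals) > 1]
    return direct, pairs

rows = [
    {"country": "UK", "zip": "EH4", "city": "Edinburgh"},
    {"country": "UK", "zip": "EH4", "city": "London"},
]
print(cfd_violations(rows, ["country", "zip"], "city", {"country": "UK"}))
# ([], [('UK', 'EH4')])
```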

    Dependencies revisited for improving data quality


    Profiling relational data: a survey

    Profiling data to determine metadata about a given dataset is an important and frequent activity of any IT professional and researcher, and is necessary for various use cases. It encompasses a vast array of methods to examine datasets and produce metadata. Among the simpler results are statistics, such as the number of null values and distinct values in a column, its data type, or the most frequent patterns of its data values. Metadata that are more difficult to compute involve multiple columns, namely correlations, unique column combinations, functional dependencies, and inclusion dependencies. Further techniques detect conditional properties of the dataset at hand. This survey provides a classification of data profiling tasks and comprehensively reviews the state of the art for each class. In addition, we review data profiling tools and systems from research and industry. We conclude with an outlook on the future of data profiling beyond traditional profiling tasks and beyond relational databases.
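    For the simpler, single-column end of the profiling spectrum mentioned above, a hand-rolled sketch might look like the following; the profilers surveyed in the paper cover much more, such as discovering functional and inclusion dependencies across columns, and the column values here are invented.

```python
# Minimal single-column profiling sketch: row count, null count, distinct
# count, and the most frequent value "shape" patterns (digits -> 9, letters -> A).
from collections import Counter

def profile_column(values):
    non_null = [v for v in values if v is not None]
    patterns = Counter(
        "".join("9" if c.isdigit() else "A" if c.isalpha() else c for c in v)
        for v in non_null
    )
    return {
        "rows": len(values),
        "nulls": len(values) - len(non_null),
        "distinct": len(set(non_null)),
        "top_patterns": patterns.most_common(3),
    }

print(profile_column(["12-AB", "34-CD", None, "7-E"]))
# {'rows': 4, 'nulls': 1, 'distinct': 3, 'top_patterns': [('99-AA', 2), ('9-A', 1)]}
```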

    Data Quality: Theory and Practice


    Leveraging Decision Making in Cyber Security Analysis through Data Cleaning

    Security Operations Centers (SOCs) have been built in many institutions for intrusion detection and incident response. A SOC employs various cyber defense technologies to continually monitor and control network traffic. Given the voluminous monitoring data, cyber security analysts need to identify suspicious network activities to detect potential attacks. Because the network monitoring data are generated at a rapid speed and contain a lot of noise, analysts are so bound by tedious and repetitive data triage tasks that they can hardly concentrate on in-depth analysis for further decision making. Therefore, it is critical to employ data cleaning methods in cyber situational awareness. In this paper, we investigate the main characteristics and categories of cyber security data with a special emphasis on its heterogeneous features. We also discuss how cyber analysts attempt to understand the incoming data through the data analytical process. Based on this understanding, the paper discusses five categories of data cleaning methods for heterogeneous data and addresses the main challenges of applying data cleaning in cyber situational awareness. The goal is to create a dataset that contains accurate information for cyber analysts to work with, and thus to achieve higher levels of data-driven decision making in cyber defense.

    Semi-automatic support for evolving functional dependencies

    During the life of a database, systematic and frequent violations of a given constraint may suggest that the represented reality is changing and thus the constraint should evolve with it. In this paper we propose a method and a tool to (i) find the functional dependencies that are violated by the current data, and (ii) support their evolution when it is necessary to update them. The method relies on the use of confidence, a measure associated with each dependency that allows us to understand "how far" the dependency is from correctly describing the current data; and of goodness, a measure of the balance between the data satisfying the antecedent of the dependency and those satisfying its consequent. Our method compares favorably with literature that approaches the same problem in a different way, and performs effectively and efficiently, as shown by our tests on both real and synthetic databases.
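    To make the idea of a per-dependency confidence concrete, here is a small sketch using one common definition (the largest fraction of tuples that can be kept so that the dependency holds); the paper's exact confidence and goodness measures may be defined differently, and the relation below is invented.

```python
# Illustrative confidence of a functional dependency X -> Y: per X-group,
# keep the most frequent Y value and measure the fraction of tuples kept.
# A low value may indicate systematic violations, i.e. an evolving dependency.
from collections import Counter, defaultdict

def fd_confidence(rows, lhs, rhs):
    groups = defaultdict(Counter)
    for row in rows:
        groups[tuple(row[a] for a in lhs)][row[rhs]] += 1
    kept = sum(counts.most_common(1)[0][1] for counts in groups.values())
    return kept / len(rows)

rows = [
    {"dept": "CS", "building": "A"},
    {"dept": "CS", "building": "A"},
    {"dept": "CS", "building": "B"},
]
print(fd_confidence(rows, ["dept"], "building"))  # 2/3 ~ 0.67
```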