    Data Cleaning: Problems and Current Approaches

    We classify data quality problems that are addressed by data cleaning and provide an overview of the main solution approaches. Data cleaning is especially required when integrating heterogeneous data sources and should be addressed together with schema-related data transformations. In data warehouses, data cleaning is a major part of the so-called ETL process. We also discuss current tool support for data cleaning.
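    To make the ETL cleaning step concrete, the following minimal Python sketch normalizes records from two heterogeneous sources onto one schema and removes the duplicates this exposes; the field names and sample records are illustrative assumptions, not taken from the paper.

        # Hypothetical sketch (not from the paper): one cleaning step in an
        # ETL-style flow that maps two heterogeneous sources onto a single
        # schema and then eliminates duplicate records.
        def normalize(record):
            """Map source-specific fields to the target schema and clean values."""
            return {
                "name": record.get("name", "").strip().title(),
                "email": record.get("email", "").strip().lower(),
            }

        def deduplicate(records):
            """Keep the first record seen for each e-mail (a simple duplicate rule)."""
            seen, cleaned = set(), []
            for r in records:
                if r["email"] and r["email"] not in seen:
                    seen.add(r["email"])
                    cleaned.append(r)
            return cleaned

        source_a = [{"name": " alice SMITH ", "email": "Alice@Example.org"}]
        source_b = [{"name": "Alice Smith", "email": "alice@example.org "}]
        print(deduplicate([normalize(r) for r in source_a + source_b]))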

    On Resolving Semantic Heterogeneities and Deriving Constraints in Schema Integration

    Ph.D. (Doctor of Philosophy)

    A Molecular Biology Database Digest

    Computational Biology or Bioinformatics has been defined as the application of mathematical and Computer Science methods to solving problems in Molecular Biology that require large-scale data, computation, and analysis [18]. As expected, Molecular Biology databases play an essential role in Computational Biology research and development. This paper provides an introduction to current Molecular Biology databases, stressing data modeling, data acquisition, data retrieval, and the integration of Molecular Biology data from different sources. This paper is primarily intended for an audience of computer scientists with a limited background in Biology.

    Towards ensuring scalability, interoperability and efficient access control in a multi-domain grid-based environment

    The application of grid computing has been hampered by three basic challenges: scalability, interoperability and efficient access control, which need to be addressed before full-scale adoption of grid computing can take place. To address these challenges, a novel architectural model was designed for a multi-domain grid-based environment (built on three domains) and modelled using dynamic role-based access control. The architecture's framework assumes that each domain has an independent local security monitoring unit, while a central security monitoring unit monitors security for the entire grid. The architecture was evaluated using the Grid Security Services Simulator, a meta-query language and Java Runtime Environment 1.7.0.5 for implementing the workflows that define the model's task. In terms of scalability, the results show that as the number of grid nodes increases, the average turnaround time decreases, allowing the grid to serve more service requesters (grid users). Grid middleware integration across the domains, together with the appropriate handling of authentication and authorisation through the local and central security monitoring units, demonstrated that the architecture is interoperable. Finally, a case study of access control across the domains shows the efficiency of the role-based access control approach used for granting appropriate access to resources. Based on the results obtained, the proposed framework proved to be interoperable, scalable and efficient at enforcing access control within the parameters evaluated.
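    A rough Python illustration of the kind of check such an architecture performs is sketched below; the domains, users, roles and permissions are hypothetical, and the code is only a toy central authorisation step consulting per-domain role assignments, not the evaluated model.

        # Hypothetical sketch (not the evaluated architecture): each domain's
        # "local security monitoring unit" holds its own role assignments and a
        # central unit authorises cross-domain requests against them.
        LOCAL_UNITS = {
            "domain_a": {"alice": {"researcher"}},
            "domain_b": {"bob": {"admin"}},
        }
        ROLE_PERMISSIONS = {"researcher": {"read"}, "admin": {"read", "write"}}

        def central_authorise(user, home_domain, action, target_domain):
            """Grant access only if the user's home domain knows the user and
            one of the user's roles carries the requested permission."""
            roles = LOCAL_UNITS.get(home_domain, {}).get(user, set())
            allowed = any(action in ROLE_PERMISSIONS.get(r, set()) for r in roles)
            print(f"{user}@{home_domain} -> {action} on {target_domain}:",
                  "granted" if allowed else "denied")
            return allowed

        central_authorise("alice", "domain_a", "read", "domain_b")   # granted
        central_authorise("alice", "domain_a", "write", "domain_b")  # denied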

    INCREMENTAL QUERY PROCESSING IN INFORMATION FUSION SYSTEMS

    This dissertation studies the methodology and techniques of information retrieval in fusion systems, where information referring to the same objects is assessed on the basis of data from multiple heterogeneous data sources. A wide range of important applications can be categorized as information fusion systems, e.g. multisensor surveillance systems, local search systems and multisource medical diagnosis systems. At the time of this dissertation, most information retrieval methods in fusion systems were highly domain specific, and most query systems did not address the fusion problem with enough effort. In this dissertation, I describe a broadly applicable query-based information retrieval approach for general fusion systems: user information needs are interpreted as fusion queries, and the query processing techniques, e.g. the source dependence graph (SDG), query refinement and optimization, are described. Aiming to remove the query-building bottleneck, a novel incremental query method is proposed, which can eliminate the accumulated complexity in query building as well as in query execution. A query pattern is defined to capture and reuse repeated structures in the incremental queries, and several new techniques for query pattern matching and learning are described in detail. Experiments in a real-world multisensor fusion system, the intelligent vehicle tracking (IVET) system, are presented to validate the proposed methodology and techniques.
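    The query-pattern idea can be pictured with a small Python sketch: the structure of a previously built fusion query is cached and reused when an incremental query repeats it with different parameters. The pattern representation, source names and predicates below are assumptions made for illustration and do not come from the dissertation.

        # Hypothetical sketch (not the dissertation's implementation): cache the
        # structure of a fusion query once and re-bind only the changing
        # predicate when an incremental query repeats that structure.
        pattern_store = {}

        def build_query(sources, join_key, predicate):
            """Stand-in for the expensive query-building step we want to reuse."""
            return f"FUSE {', '.join(sources)} ON {join_key} WHERE {predicate}"

        def incremental_query(sources, join_key, predicate):
            key = (tuple(sources), join_key)              # the repeated structure
            if key not in pattern_store:                  # build the pattern once
                pattern_store[key] = build_query(sources, join_key, "{predicate}")
            return pattern_store[key].format(predicate=predicate)

        print(incremental_query(["radar", "camera"], "track_id", "speed > 30"))
        print(incremental_query(["radar", "camera"], "track_id", "speed > 50"))  # reuses the cached pattern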

    Tree algorithms for mining association rules

    With the increasing reliability of digital communication, the falling cost of hardware and increased computational power, the gathering and storage of data has become easier than at any other time in history. Commercial and public agencies are able to hold extensive records about all aspects of their operations: witness the proliferation of point-of-sale (POS) transaction recording within retailing, digital storage of census data and computerized hospital records. Whilst the gathering of such data has uses in terms of answering specific queries and allowing visualisation of certain trends, the volumes of data can hide significant patterns that would be impossible to locate manually. These patterns, once found, could provide insight into customer behaviour, demographic shifts and patient diagnoses hitherto unseen and unexpected. Remaining competitive in a modern business environment, or delivering public services in a timely and cost-effective manner, is a crucial part of modern economics. Analysis of the data held by an organisation, by a system that "learns", can allow predictions to be made based on historical evidence; users may guide the process, but essentially the software explores the data unaided. The research described within this thesis develops current ideas regarding the exploration of large data volumes. Particular areas of research are the reduction of the search space within the dataset and the generation of rules deduced from the patterns within the data. These issues are discussed within an experimental framework which extracts information from binary data.
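    As a point of reference for the rule-generation step, the short Python sketch below computes the standard support and confidence measures for one candidate association rule over a handful of hypothetical binary transactions; it uses a naive scan of the whole dataset, the kind of work the thesis's search-space reduction and tree-based algorithms are aimed at avoiding.

        # Hypothetical sketch (not the thesis algorithms): support and confidence
        # for one candidate rule over a few binary transactions, computed by a
        # naive scan of the whole dataset.
        transactions = [
            {"bread", "milk"},
            {"bread", "butter"},
            {"bread", "milk", "butter"},
            {"milk"},
        ]

        def support(itemset):
            """Fraction of transactions containing every item in the itemset."""
            return sum(itemset <= t for t in transactions) / len(transactions)

        def confidence(antecedent, consequent):
            """Support of the whole rule divided by support of its antecedent."""
            return support(antecedent | consequent) / support(antecedent)

        bread, milk = {"bread"}, {"milk"}
        print("support(bread -> milk):", support(bread | milk))       # 0.5
        print("confidence(bread -> milk):", confidence(bread, milk))  # 0.666...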