Data Cleaning: Problems and Current Approaches
We classify data quality problems that are addressed by data cleaning and provide an overview of the main solution approaches. Data cleaning is especially required when integrating heterogeneous data sources and should be addressed together with schema-related data transformations. In data warehouses, data cleaning is a major part of the so-called ETL process. We also discuss current tool support for data cleaning.
On Resolving Semantic Heterogeneities and Deriving Constraints in Schema Integration
A Molecular Biology Database Digest
Computational Biology or Bioinformatics has been defined as the application of mathematical
and Computer Science methods to solving problems in Molecular Biology that require large scale
data, computation, and analysis [18]. As expected, Molecular Biology databases play an essential
role in Computational Biology research and development. This paper introduces current
Molecular Biology databases, stressing data modeling, data acquisition, data retrieval, and the
integration of Molecular Biology data from different sources. This paper is primarily intended
for an audience of computer scientists with a limited background in Biology.
Towards ensuring scalability, interoperability and efficient access control in a multi-domain grid-based environment
The application of grid computing has been hampered by three basic challenges:
scalability, interoperability and efficient access control, which need to be overcome before a full-scale
adoption of grid computing can take place. To address these challenges, a novel architectural model
was designed for a multi-domain grid-based environment (built on three domains). It was modelled
using dynamic role-based access control. The architecture’s framework assumes that each domain
has an independent local security monitoring unit and a central security monitoring unit that monitors
security for the entire grid. The architecture was evaluated using the Grid Security Services
Simulator, a meta-query language and Java Runtime Environment 1.7.0.5 for implementing the
workflows that define the model’s tasks. In terms of scalability, the results show that as the number of
grid nodes increases, the average turnaround time decreases, thereby increasing the number of
service requesters (grid users) on the grid. Grid middleware integration across various domains as
well as the appropriate handling of authentication and authorisation through a local security
monitoring unit and a central security monitoring unit proved that the architecture is interoperable.
Finally, a case study scenario used for access control across the domains shows the efficiency of the
role-based access control approach used for achieving appropriate access to resources. Based on the
results obtained, the proposed framework proved to be interoperable, scalable and efficient in
enforcing access control within the parameters evaluated.
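The local/central split described in the abstract can be sketched in a few lines. This is an illustrative reading only: the class names (Domain, CentralMonitor), roles, and resources below are invented for the example and are not taken from the thesis or its simulator.

```python
from dataclasses import dataclass, field

@dataclass
class Domain:
    """One grid domain with its own local security monitoring unit."""
    name: str
    # role -> set of resources that role may access within this domain
    role_permissions: dict = field(default_factory=dict)

    def local_check(self, role: str, resource: str) -> bool:
        """Local security monitoring unit: authorises access inside one domain."""
        return resource in self.role_permissions.get(role, set())

class CentralMonitor:
    """Central security monitoring unit: routes cross-domain requests
    to the target domain's local check."""
    def __init__(self, domains):
        self.domains = {d.name: d for d in domains}

    def check(self, user_role: str, domain_name: str, resource: str) -> bool:
        domain = self.domains.get(domain_name)
        return domain is not None and domain.local_check(user_role, resource)

# Three domains, mirroring the three-domain environment the abstract evaluates.
a = Domain("A", {"researcher": {"dataset1"}})
b = Domain("B", {"researcher": {"cluster"}})
c = Domain("C", {"admin": {"config"}})
central = CentralMonitor([a, b, c])

print(central.check("researcher", "A", "dataset1"))  # permitted in domain A
print(central.check("researcher", "C", "config"))    # denied: wrong role for C
```

The point of the split is that each domain keeps authority over its own resources, while the central unit only dispatches; revoking a role in one domain never requires touching another domain's tables.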
INCREMENTAL QUERY PROCESSING IN INFORMATION FUSION SYSTEMS
This dissertation studies the methodology and techniques of information retrieval in fusion systems, where information referring to the same objects is assessed on the basis of data from multiple heterogeneous data sources. A wide range of important applications can be categorized as information fusion systems, e.g. multisensor surveillance systems, local search systems, and multisource medical diagnosis systems. To date, most information retrieval methods in fusion systems have been highly domain specific, and most query systems do not adequately address the fusion problem. In this dissertation, I describe a broadly applicable query-based information retrieval approach for general fusion systems: user information needs are interpreted as fusion queries, and query processing techniques such as the source dependence graph (SDG), query refinement, and query optimization are described. To remove the query-building bottleneck, a novel incremental query method is proposed, which eliminates the accumulated complexity in query building as well as in query execution. A query pattern is defined to capture and reuse repeated structures in the incremental queries, and several new techniques for query pattern matching and learning are described in detail. Experiments in a real-world multisensor fusion system, the intelligent vehicle tracking (IVET) system, validate the proposed methodology and techniques.
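One plausible reading of a source dependence graph is a DAG whose nodes are data sources and processing steps, with an edge from A to B when B consumes A's output; a topological ordering then yields an evaluation plan for a fusion query. The sketch below is an assumption-laden illustration, not the dissertation's actual SDG formalism, and the source names are invented.

```python
from collections import defaultdict, deque

def evaluation_order(edges):
    """Kahn's algorithm: return a topological ordering of the sources,
    i.e. an order in which a fusion query can evaluate them; raise on a cycle."""
    graph = defaultdict(list)
    indeg = defaultdict(int)
    nodes = set()
    for a, b in edges:
        graph[a].append(b)
        indeg[b] += 1
        nodes.update((a, b))
    # Start from sources that depend on nothing (sorted only for determinism).
    queue = deque(sorted(n for n in nodes if indeg[n] == 0))
    order = []
    while queue:
        n = queue.popleft()
        order.append(n)
        for m in graph[n]:
            indeg[m] -= 1
            if indeg[m] == 0:
                queue.append(m)
    if len(order) != len(nodes):
        raise ValueError("cycle in source dependence graph")
    return order

# Hypothetical vehicle-tracking fusion query: a tracker fuses radar,
# camera and map data, and a report step consumes the fused track.
deps = [("radar", "tracker"), ("camera", "tracker"),
        ("map_db", "tracker"), ("tracker", "report")]
print(evaluation_order(deps))
```

An incremental query would extend `deps` with new edges rather than rebuilding the plan from scratch, which is the bottleneck the abstract's incremental method targets.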
Tree algorithms for mining association rules
With the increasing reliability of digital communication, the falling cost of hardware
and increased computational power, the gathering and storage of data has become
easier than at any other time in history. Commercial and public agencies are able to
hold extensive records about all aspects of their operations. Witness the proliferation
of point of sale (POS) transaction recording within retailing, digital storage of
census data and computerized hospital records. Whilst the gathering of such data
has uses in terms of answering specific queries and allowing visualisation of certain
trends, the volumes of data can hide significant patterns that would be impossible to
locate manually. These patterns, once found, could provide an insight into customer
behaviour, demographic shifts and patient diagnosis hitherto unseen and unexpected.
Remaining competitive in a modern business environment, or delivering services in
a timely and cost-effective manner for public services, is a crucial part of modern
economics. Analysis of the data held by an organisation, by a system that "learns",
can allow predictions to be made based on historical evidence. Users may guide the
process but essentially the software is exploring the data unaided.
The research described within this thesis develops current ideas regarding the exploration
of large data volumes. Particular areas of research are the reduction of
the search space within the dataset and the generation of rules which are deduced
from the patterns within the data. These issues are discussed within an experimental
framework which extracts information from binary data.
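The task the abstract describes — finding patterns in binary transaction data and deducing rules from them — is association rule mining. The classic Apriori-style sketch below illustrates that task; it is not the tree-based algorithms the thesis itself develops, and the grocery items are invented example data.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for itemsets whose support
    (fraction of transactions containing them) meets min_support."""
    items = sorted({i for t in transactions for i in t})
    n = len(transactions)
    frequent = {}
    k = 1
    candidates = [frozenset([i]) for i in items]
    while candidates:
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}
        frequent.update(level)
        # Join frequent k-itemsets into candidate (k+1)-itemsets (search-space
        # reduction: supersets of infrequent sets are never generated).
        candidates = list({a | b for a, b in combinations(level, 2)
                           if len(a | b) == k + 1})
        k += 1
    return frequent

def rules(frequent, min_confidence):
    """Derive rules X -> Y with confidence support(X ∪ Y) / support(X)."""
    out = []
    for itemset, supp in frequent.items():
        if len(itemset) < 2:
            continue
        for r in range(1, len(itemset)):
            for lhs in map(frozenset, combinations(itemset, r)):
                conf = supp / frequent[lhs]
                if conf >= min_confidence:
                    out.append((lhs, itemset - lhs, conf))
    return out

# Point-of-sale style binary data: each transaction is a set of items bought.
data = [{"bread", "milk"}, {"bread", "butter"},
        {"bread", "milk", "butter"}, {"milk"}]
freq = frequent_itemsets(data, min_support=0.5)
print(rules(freq, min_confidence=0.6))
```

On this toy data the rule butter → bread holds with confidence 1.0: every transaction containing butter also contains bread, exactly the kind of customer-behaviour insight the abstract has in mind.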