Data Cleaning: Problems and Current Approaches
We classify data quality problems that are addressed by data cleaning and provide an overview of the main solution approaches. Data cleaning is especially required when integrating heterogeneous data sources and should be addressed together with schema-related data transformations. In data warehouses, data cleaning is a major part of the so-called ETL process. We also discuss current tool support for data cleaning.
On Resolving Semantic Heterogeneities and Deriving Constraints in Schema Integration
A Molecular Biology Database Digest
Computational Biology or Bioinformatics has been defined as the application of mathematical
and Computer Science methods to solving problems in Molecular Biology that require large scale
data, computation, and analysis [18]. As expected, Molecular Biology databases play an essential
role in Computational Biology research and development. This paper introduces current
Molecular Biology databases, stressing data modeling, data acquisition, data retrieval, and the
integration of Molecular Biology data from different sources. This paper is primarily intended
for an audience of computer scientists with a limited background in Biology.
Towards ensuring scalability, interoperability and efficient access control in a multi-domain grid-based environment
The application of grid computing has been hampered by three basic challenges:
scalability, interoperability and efficient access control, which need to be overcome before a full-scale
adoption of grid computing can take place. To address these challenges, a novel architectural model
was designed for a multi-domain grid-based environment (built on three domains). It was modelled
using dynamic role-based access control. The architecture’s framework assumes that each domain
has an independent local security monitoring unit and a central security monitoring unit that monitors
security for the entire grid. The architecture was evaluated using the Grid Security Services
Simulator, a meta-query language and Java Runtime Environment 1.7.0.5 for implementing the
workflows that define the model’s tasks. In terms of scalability, the results show that as the number of
grid nodes increases, the average turnaround time decreases, thereby increasing the number of
service requesters (grid users) on the grid. Grid middleware integration across various domains as
well as the appropriate handling of authentication and authorisation through a local security
monitoring unit and a central security monitoring unit proved that the architecture is interoperable.
Finally, a case study scenario used for access control across the domains shows the efficiency of the
role-based access control approach used for achieving appropriate access to resources. Based on the
results obtained, the proposed framework proved to be interoperable, scalable and efficient in
enforcing access control within the parameters evaluated.
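The local/central split described in the abstract can be sketched in a few lines. This is an illustrative reading only: the class names (Domain, CentralMonitor), roles, and resources below are invented for the example and are not taken from the thesis or its simulator.

```python
from dataclasses import dataclass, field

@dataclass
class Domain:
    """One grid domain with its own local security monitoring unit."""
    name: str
    # role -> set of resources that role may access within this domain
    role_permissions: dict = field(default_factory=dict)

    def local_check(self, role: str, resource: str) -> bool:
        """Local security monitoring unit: authorises access inside one domain."""
        return resource in self.role_permissions.get(role, set())

class CentralMonitor:
    """Central security monitoring unit: routes cross-domain requests
    to the target domain's local check."""
    def __init__(self, domains):
        self.domains = {d.name: d for d in domains}

    def check(self, user_role: str, domain_name: str, resource: str) -> bool:
        domain = self.domains.get(domain_name)
        return domain is not None and domain.local_check(user_role, resource)

# Three domains, mirroring the three-domain environment the abstract evaluates.
a = Domain("A", {"researcher": {"dataset1"}})
b = Domain("B", {"researcher": {"cluster"}})
c = Domain("C", {"admin": {"config"}})
central = CentralMonitor([a, b, c])

print(central.check("researcher", "A", "dataset1"))  # permitted in domain A
print(central.check("researcher", "C", "config"))    # denied: wrong role for C
```

The point of the split is that each domain keeps authority over its own resources, while the central unit only dispatches; revoking a role in one domain never requires touching another domain's tables.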
INCREMENTAL QUERY PROCESSING IN INFORMATION FUSION SYSTEMS
This dissertation studies the methodology and techniques of information retrieval in fusion systems, where information referring to the same objects is assessed on the basis of data from multiple heterogeneous data sources. A wide range of important applications can be categorized as information fusion systems, e.g. multisensor surveillance systems, local search systems, and multisource medical diagnosis systems. To date, most information retrieval methods in fusion systems have been highly domain specific, and most query systems do not adequately address the fusion problem. In this dissertation, I describe a broadly applicable query-based information retrieval approach for general fusion systems: user information needs are interpreted as fusion queries, and query processing techniques such as the source dependence graph (SDG), query refinement, and query optimization are described. To remove the query-building bottleneck, a novel incremental query method is proposed, which eliminates the accumulated complexity in query building as well as in query execution. A query pattern is defined to capture and reuse repeated structures in the incremental queries, and several new techniques for query pattern matching and learning are described in detail. Experiments in a real-world multisensor fusion system, the intelligent vehicle tracking (IVET) system, validate the proposed methodology and techniques.
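One plausible reading of a source dependence graph is a DAG whose nodes are data sources and processing steps, with an edge from A to B when B consumes A's output; a topological ordering then yields an evaluation plan for a fusion query. The sketch below is an assumption-laden illustration, not the dissertation's actual SDG formalism, and the source names are invented.

```python
from collections import defaultdict, deque

def evaluation_order(edges):
    """Kahn's algorithm: return a topological ordering of the sources,
    i.e. an order in which a fusion query can evaluate them; raise on a cycle."""
    graph = defaultdict(list)
    indeg = defaultdict(int)
    nodes = set()
    for a, b in edges:
        graph[a].append(b)
        indeg[b] += 1
        nodes.update((a, b))
    # Start from sources that depend on nothing (sorted only for determinism).
    queue = deque(sorted(n for n in nodes if indeg[n] == 0))
    order = []
    while queue:
        n = queue.popleft()
        order.append(n)
        for m in graph[n]:
            indeg[m] -= 1
            if indeg[m] == 0:
                queue.append(m)
    if len(order) != len(nodes):
        raise ValueError("cycle in source dependence graph")
    return order

# Hypothetical vehicle-tracking fusion query: a tracker fuses radar,
# camera and map data, and a report step consumes the fused track.
deps = [("radar", "tracker"), ("camera", "tracker"),
        ("map_db", "tracker"), ("tracker", "report")]
print(evaluation_order(deps))
```

An incremental query would extend `deps` with new edges rather than rebuilding the plan from scratch, which is the bottleneck the abstract's incremental method targets.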
Tree algorithms for mining association rules
With the increasing reliability of digital communication, the falling cost of hardware
and increased computational power, the gathering and storage of data has become
easier than at any other time in history. Commercial and public agencies are able to
hold extensive records about all aspects of their operations. Witness the proliferation
of point of sale (POS) transaction recording within retailing, digital storage of
census data and computerized hospital records. Whilst the gathering of such data
has uses in terms of answering specific queries and allowing visualisation of certain
trends, the volumes of data can hide significant patterns that would be impossible to
locate manually. These patterns, once found, could provide an insight into customer
behaviour, demographic shifts and patient diagnosis hitherto unseen and unexpected.
Remaining competitive in a modern business environment, or delivering services in
a timely and cost-effective manner for public services, is a crucial part of modern
economics. Analysis of the data held by an organisation, by a system that "learns",
can allow predictions to be made based on historical evidence. Users may guide the
process but essentially the software is exploring the data unaided.
The research described within this thesis develops current ideas regarding the exploration
of large data volumes. Particular areas of research are the reduction of
the search space within the dataset and the generation of rules which are deduced
from the patterns within the data. These issues are discussed within an experimental
framework which extracts information from binary data.
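The task the abstract describes — finding patterns in binary transaction data and deducing rules from them — is association rule mining. The classic Apriori-style sketch below illustrates that task; it is not the tree-based algorithms the thesis itself develops, and the grocery items are invented example data.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for itemsets whose support
    (fraction of transactions containing them) meets min_support."""
    items = sorted({i for t in transactions for i in t})
    n = len(transactions)
    frequent = {}
    k = 1
    candidates = [frozenset([i]) for i in items]
    while candidates:
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}
        frequent.update(level)
        # Join frequent k-itemsets into candidate (k+1)-itemsets (search-space
        # reduction: supersets of infrequent sets are never generated).
        candidates = list({a | b for a, b in combinations(level, 2)
                           if len(a | b) == k + 1})
        k += 1
    return frequent

def rules(frequent, min_confidence):
    """Derive rules X -> Y with confidence support(X ∪ Y) / support(X)."""
    out = []
    for itemset, supp in frequent.items():
        if len(itemset) < 2:
            continue
        for r in range(1, len(itemset)):
            for lhs in map(frozenset, combinations(itemset, r)):
                conf = supp / frequent[lhs]
                if conf >= min_confidence:
                    out.append((lhs, itemset - lhs, conf))
    return out

# Point-of-sale style binary data: each transaction is a set of items bought.
data = [{"bread", "milk"}, {"bread", "butter"},
        {"bread", "milk", "butter"}, {"milk"}]
freq = frequent_itemsets(data, min_support=0.5)
print(rules(freq, min_confidence=0.6))
```

On this toy data the rule butter → bread holds with confidence 1.0: every transaction containing butter also contains bread, exactly the kind of customer-behaviour insight the abstract has in mind.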