35 research outputs found

    A comparative study of transaction management services in multidatabase heterogeneous systems

    Multidatabases are being actively researched as a relatively new area in which many aspects are not yet fully understood. Transaction management in multidatabase systems still has many unresolved problems. The problem areas which this dissertation addresses are classification of multidatabase systems, global concurrency control, correctness criterion in a multidatabase environment, global deadlock detection, atomic commitment and crash recovery. A core group of research addressing these problems was identified and studied. The dissertation contributes to the multidatabase transaction management topic by introducing an alternative classification method for such multiple database systems, assessing existing research into transaction management schemes and, based on this assessment, proposing a transaction processing model founded on the optimal properties of transaction management identified during the course of this research. Computing. M. Sc. (Computer Science)

    31st International Conference on Information Modelling and Knowledge Bases

    Information modelling is becoming a more and more important topic for researchers, designers, and users of information systems. The amount and complexity of information itself, the number of abstraction levels of information, and the size of databases and knowledge bases are continuously growing. Conceptual modelling is one of the sub-areas of information modelling. The aim of this conference is to bring together experts from different areas of computer science and other disciplines, who have a common interest in understanding and solving problems in information modelling and knowledge bases, as well as applying the results of research to practice. We also aim to recognize and study new areas in modelling and knowledge bases to which more attention should be paid. Therefore philosophy and logic, cognitive science, knowledge management, linguistics and management science are relevant areas, too. In the conference, there will be three categories of presentations, i.e. full papers, short papers and position papers.

    30th International Conference on Information Modelling and Knowledge Bases

    Information modelling is becoming a more and more important topic for researchers, designers, and users of information systems. The amount and complexity of information itself, the number of abstraction levels of information, and the size of databases and knowledge bases are continuously growing. Conceptual modelling is one of the sub-areas of information modelling. The aim of this conference is to bring together experts from different areas of computer science and other disciplines, who have a common interest in understanding and solving problems in information modelling and knowledge bases, as well as applying the results of research to practice. We also aim to recognize and study new areas in modelling and knowledge bases to which more attention should be paid. Therefore philosophy and logic, cognitive science, knowledge management, linguistics and management science are relevant areas, too. In the conference, there will be three categories of presentations, i.e. full papers, short papers and position papers.

    Data quality and data cleaning in database applications

    Today, data plays an important role in people's daily activities. With the help of database applications such as decision support systems and customer relationship management (CRM) systems, useful information or knowledge can be derived from large quantities of data. However, investigations show that many such applications fail to work successfully. There are many reasons for such failures, such as poor system infrastructure design or query performance. But nothing is more certain to yield failure than lack of concern for the issue of data quality. High quality of data is a key to today's business success. The quality of any large real-world data set depends on a number of factors, among which the source of the data is often the crucial factor. It has now been recognized that an inordinate proportion of data in most data sources is dirty. Obviously, a database application with a high proportion of dirty data is not reliable for the purpose of data mining or deriving business intelligence, and the quality of decisions made on the basis of such business intelligence is also unreliable. In order to ensure high quality of data, enterprises need to have a process, methodologies and resources to monitor and analyze the quality of data, and methodologies for preventing and/or detecting and repairing dirty data. This thesis focuses on the improvement of data quality in database applications with the help of current data cleaning methods. It provides a systematic and comparative description of the research issues related to the improvement of the quality of data, and addresses a number of research issues related to data cleaning. In the first part of the thesis, related literature on data cleaning and data quality is reviewed and discussed. Building on this research, a rule-based taxonomy of dirty data is proposed in the second part of the thesis.
The proposed taxonomy not only summarizes most types of dirty data but is also the basis on which the proposed method for solving the Dirty Data Selection (DDS) problem during the data cleaning process was developed. This helps us to design the DDS process in the proposed data cleaning framework described in the third part of the thesis. This framework retains the most appealing characteristics of existing data cleaning approaches, and improves the efficiency and effectiveness of data cleaning as well as the degree of automation during the data cleaning process. Finally, a set of approximate string matching algorithms is studied and experimental work has been undertaken. Approximate string matching, which has been well studied for many years, is an important part of many data cleaning approaches. The experimental work in the thesis confirmed the statement that there is no clear best technique. It shows that the characteristics of data, such as the size of a dataset, the error rate in a dataset, the type of strings in a dataset and even the type of typo in a string, have a significant effect on the performance of the selected techniques. In addition, the characteristics of data also affect the selection of suitable threshold values for the selected matching algorithms. The achievements based on these experimental results provide a fundamental improvement in the design of the 'algorithm selection mechanism' in the data cleaning framework, which enhances the performance of data cleaning systems in database applications. EThOS - Electronic Theses Online Service. GB. United Kingdom
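
The observation that suitable threshold values depend on the characteristics of the data can be illustrated with a minimal sketch. It assumes a normalized Levenshtein similarity as the matcher, which is only one of many approximate string matching techniques and is not necessarily among those evaluated in the thesis. The function names `levenshtein`, `similarity` and `is_match`, and the threshold of 0.8, are hypothetical choices made for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit cost for insert, delete and substitute."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]: 1 minus distance divided by the longer length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # assumption: two empty strings count as identical
    return 1.0 - levenshtein(a, b) / longest


def is_match(a: str, b: str, threshold: float = 0.8) -> bool:
    """Treat two strings as duplicates when similarity reaches the threshold."""
    return similarity(a, b) >= threshold


# The same kind of error (one substituted character) passes the threshold
# for a longer string but fails it for a shorter one.
print(is_match("Jonathan", "Jonathon"))  # similarity 0.875
print(is_match("Jon", "Jan"))            # similarity about 0.667
```

With a fixed threshold, a single typo is tolerated in the eight-character name but rejected in the three-character one, so a dataset of short strings would need a different threshold from a dataset of long strings.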

    Semantic Similarity of Spatial Scenes

    The formalization of similarity in spatial information systems can unleash their functionality and contribute technology that is not only useful, but also desirable to broad groups of users. As a paradigm for information retrieval, similarity supersedes tedious querying techniques and unveils novel ways for user-system interaction by naturally supporting modalities such as speech and sketching. As a tool within the scope of a broader objective, it can facilitate such diverse tasks as data integration, landmark determination, and prediction making. This potential motivated the development of several similarity models within the geospatial and computer science communities. Despite the merit of these studies, their cognitive plausibility can be limited due to neglect of well-established psychological principles about properties and behaviors of similarity. Moreover, such approaches are typically guided by experience, intuition, and observation, thereby often relying on narrower perspectives or restrictive assumptions that produce inflexible and incompatible measures. This thesis consolidates such fragmentary efforts and integrates them along with novel formalisms into a scalable, comprehensive, and cognitively-sensitive framework for similarity queries in spatial information systems. Three conceptually different similarity queries at the levels of attributes, objects, and scenes are distinguished. An analysis of the relationship between similarity and change provides a unifying basis for the approach and a theoretical foundation for measures satisfying important similarity properties such as asymmetry and context dependence. The classification of attributes into categories with common structural and cognitive characteristics drives the implementation of a small core of generic functions, able to perform any type of attribute value assessment.
Appropriate techniques combine such atomic assessments to compute similarities at the object level and to handle more complex inquiries with multiple constraints. These techniques, along with a solid graph-theoretical methodology adapted to the particularities of the geospatial domain, provide the foundation for reasoning about scene similarity queries. Provisions are made so that all methods comply with major psychological findings about people’s perceptions of similarity. An experimental evaluation supplies the main result of this thesis, which separates psychological findings with a major impact on the results from those that can be safely incorporated into the framework through computationally simpler alternatives.
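
The asymmetry property mentioned in this abstract can be made concrete with a small sketch. It uses Tversky's ratio model over feature sets, a well-known asymmetric similarity measure from psychology, as a stand-in; the abstract does not state which measures the thesis defines, so this is not the thesis's own formalism. The function name `tversky`, the weights `alpha` and `beta`, and the example scenes are all hypothetical.

```python
def tversky(a: set, b: set, alpha: float = 0.8, beta: float = 0.2) -> float:
    """Tversky ratio model: similarity of a to b over feature sets.

    alpha weights features found only in a, beta weights features found
    only in b. Unequal weights make the measure asymmetric.
    """
    common = len(a & b)
    only_a = len(a - b)
    only_b = len(b - a)
    denominator = common + alpha * only_a + beta * only_b
    if denominator == 0:
        return 1.0  # assumption: two empty feature sets count as identical
    return common / denominator


# Hypothetical scenes described by the kinds of objects they contain.
sketch = {"river", "bridge", "road"}
city_map = {"river", "bridge", "road", "park", "school"}

print(tversky(sketch, city_map))  # sparse scene compared to the rich one
print(tversky(city_map, sketch))  # rich scene compared to the sparse one
```

Because the features unique to the first argument are penalized more heavily, the sparse sketch is judged more similar to the detailed map than the map is to the sketch. Setting `alpha` equal to `beta` would make the measure symmetric.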

    On Resolving Semantic Heterogeneities and Deriving Constraints in Schema Integration

    Ph.D. DOCTOR OF PHILOSOPHY

    On Practical Machine Learning and Data Analysis

    This thesis discusses and addresses some of the difficulties associated with practical machine learning and data analysis. Introducing data driven methods in e.g. industrial and business applications can lead to large gains in productivity and efficiency, but the cost and complexity are often overwhelming. Creating machine learning applications in practice often involves a large amount of manual labour, which often needs to be performed by an experienced analyst without significant experience with the application area. We will here discuss some of the hurdles faced in a typical analysis project and suggest measures and methods to simplify the process. One of the most important issues when applying machine learning methods to complex data, e.g. in industrial applications, is that the processes generating the data are modelled in an appropriate way. Relevant aspects have to be formalised and represented in a way that allows us to perform our calculations in an efficient manner. We present a statistical modelling framework, Hierarchical Graph Mixtures, based on a combination of graphical models and mixture models. It allows us to create consistent, expressive statistical models that simplify the modelling of complex systems. Using a Bayesian approach, we allow for encoding of prior knowledge and make the models applicable in situations when relatively little data are available. Detecting structures in data, such as clusters and dependency structure, is very important both for understanding an application area and for specifying the structure of e.g. a hierarchical graph mixture. We will discuss how this structure can be extracted for sequential data. By using the inherent dependency structure of sequential data we construct an information theoretical measure of correlation that does not suffer from the problems most common correlation measures have with this type of data.
In many diagnosis situations it is desirable to perform a classification in an iterative and interactive manner. The matter is often complicated by very limited amounts of knowledge and examples when a new system to be diagnosed is initially brought into use. We describe how to create an incremental classification system based on a statistical model that is trained from empirical data, and show how the limited available background information can still be used initially for a functioning diagnosis system. To minimise the effort with which results are achieved within data analysis projects, we need to address not only the models used, but also the methodology and applications that can help simplify the process. We present a methodology for data preparation and a software library intended for rapid analysis, prototyping, and deployment. Finally, we will study a few example applications, presenting tasks within classification, prediction and anomaly detection. The examples include demand prediction for supply chain management, approximating complex simulators for increased speed in parameter optimisation, and fraud detection and classification within a media-on-demand system.
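
The idea of a diagnosis system that starts from background knowledge and is refined as cases arrive can be sketched as follows. This is a minimal illustration that assumes a naive Bayes classifier over categorical features with prior knowledge encoded as pseudo-counts; it is not the Hierarchical Graph Mixtures framework described in the abstract. The class name `IncrementalNaiveBayes`, its methods, and the example features are hypothetical.

```python
import math


class IncrementalNaiveBayes:
    """Naive Bayes over categorical features, updated one case at a time."""

    def __init__(self, classes, feature_values, prior_strength=1.0):
        self.classes = list(classes)
        # Every count starts at prior_strength, i.e. a uniform prior.
        self.class_counts = {c: prior_strength for c in self.classes}
        self.feature_counts = {
            c: {f: {v: prior_strength for v in values}
                for f, values in feature_values.items()}
            for c in self.classes
        }

    def add_prior(self, cls, feature, value, weight):
        """Encode background knowledge as extra pseudo-counts."""
        self.feature_counts[cls][feature][value] += weight

    def update(self, observation, cls):
        """Incorporate one diagnosed case without retraining from scratch."""
        self.class_counts[cls] += 1
        for feature, value in observation.items():
            self.feature_counts[cls][feature][value] += 1

    def posterior(self, observation):
        """Class probabilities given the features observed so far."""
        total = sum(self.class_counts.values())
        log_scores = {}
        for c in self.classes:
            score = math.log(self.class_counts[c] / total)
            for feature, value in observation.items():
                counts = self.feature_counts[c][feature]
                score += math.log(counts[value] / sum(counts.values()))
            log_scores[c] = score
        top = max(log_scores.values())
        weights = {c: math.exp(s - top) for c, s in log_scores.items()}
        norm = sum(weights.values())
        return {c: w / norm for c, w in weights.items()}


model = IncrementalNaiveBayes(["ok", "fault"], {"temp": ["low", "high"]})
model.add_prior("fault", "temp", "high", 4.0)  # expert: faults run hot
print(model.posterior({"temp": "high"}))       # prior knowledge only
for _ in range(3):
    model.update({"temp": "high"}, "ok")       # field data disagrees
print(model.posterior({"temp": "high"}))       # data now outweighs the prior
```

Since `posterior` accepts any subset of the features, it can be called again after each new observation, which is one simple way to support an iterative and interactive diagnosis session.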