    Optimizing and Implementing Repair Programs for Consistent Query Answering in Databases

    Databases do not always satisfy their integrity constraints (ICs), and there can be a number of different reasons for this. However, in most cases an important part of the data is still consistent with the ICs and can still be retrieved through queries posed to the database. Consistent query answers are characterized as the ordinary answers obtained from every minimally repaired, consistent version of the database. Database repairs with respect to a wide class of ICs can be specified as stable models of disjunctive logic programs. Thus, Consistent Query Answering (CQA) for first-order queries is translated into cautious reasoning under the stable models semantics. The use of logic programs does not exceed the intrinsic complexity of CQA. However, using them in a straightforward manner is usually inefficient. The goal of this thesis is to develop optimized techniques to evaluate queries over inconsisten
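The definition of a consistent answer given in this abstract (an answer returned in every minimal repair) can be illustrated with a minimal Python sketch. It is not the thesis's logic-program approach: it enumerates repairs by brute force, and it assumes a single key constraint, distinct rows, and repairs obtained by tuple deletion only. The names `repairs`, `consistent_answers`, and the `employees` data are hypothetical.

```python
from itertools import product

def repairs(rows, key):
    """Enumerate repairs under one key constraint (an assumption of this sketch):
    each repair keeps exactly one row per key value, i.e. a maximal consistent subset."""
    groups = {}
    for row in rows:
        groups.setdefault(row[key], []).append(row)
    for choice in product(*groups.values()):
        yield list(choice)

def consistent_answers(rows, key, query):
    """Return the answers that the query produces in every repair."""
    answer_sets = [set(query(repair)) for repair in repairs(rows, key)]
    return set.intersection(*answer_sets) if answer_sets else set()

# Hypothetical inconsistent table: "ann" violates the key on "name".
employees = [
    {"name": "ann", "salary": 50},
    {"name": "ann", "salary": 60},
    {"name": "bob", "salary": 40},
]

# Names are consistent answers even though ann's salary is in dispute.
names = consistent_answers(employees, "name", lambda db: [r["name"] for r in db])

# Only bob's (name, salary) pair holds in every repair.
pairs = consistent_answers(
    employees, "name", lambda db: [(r["name"], r["salary"]) for r in db]
)
```

The number of repairs grows exponentially with the number of conflicting key groups, which is one way to see why a straightforward evaluation is usually inefficient.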

    A multidimensional data model with subcategories for flexibly capturing summarizability


    Investigating the attainment of optimum data quality for EHR Big Data: proposing a new methodological approach

    The value derivable from the use of data has been increasing continuously for some years. Both commercial and non-commercial organisations have realised the immense benefits that might be derived if all the data at their disposal could be analysed and used as the basis for decision making. The technological tools required to produce, capture, store, transmit and analyse huge amounts of data form the background to the development of the phenomenon of Big Data. With Big Data, the aim is to generate value from huge amounts of data, often in non-structured format and produced extremely frequently. However, the potential value derivable depends on the general level of data governance, and more precisely on the quality of the data. The field of data quality is well researched for traditional data uses but is still in its infancy for the Big Data context. This dissertation investigates effective methods to enhance data quality for Big Data. The principal deliverable of this research is a methodological approach that can be used to optimize the level of data quality in the Big Data context. Since data quality is contextual (that is, a non-generalizable field), this research study applies the methodological approach to one use case, Electronic Health Records (EHR). The first main contribution to knowledge is a systematic investigation of which data quality dimensions (DQDs) are most important for EHR Big Data. The two most important dimensions ascertained by the research methods applied in this study are accuracy and completeness. These are two well-known dimensions, and this study confirms that they are also very important for EHR Big Data.
The second important contribution to knowledge is an investigation into whether Artificial Intelligence, with a special focus on machine learning, could be used to improve the detection of dirty data, focusing on the two data quality dimensions of accuracy and completeness. Based on the experiments carried out, regression and clustering algorithms proved better suited to accuracy-related and completeness-related issues, respectively. However, the limits of implementing and using machine learning algorithms to detect data quality issues in Big Data were also revealed and discussed in this research study. It can safely be deduced from this part of the research study that the use of machine learning to improve the detection of data quality issues is a promising area, but not yet a panacea that automates the entire process. The third important contribution is a proposed guideline for undertaking data repairs most efficiently for Big Data; this involved surveying existing data cleansing algorithms and comparing them against a prototype developed for data reparation. Weaknesses of existing algorithms are highlighted and are considered areas of practice on which efficient data reparation algorithms must focus. These three contributions form the nucleus of a new data quality methodological approach that could be used to optimize Big Data quality, as applied in the context of EHR. Some of the activities and techniques discussed in the proposed methodological approach can be transposed to other industries and use cases to a large extent. The proposed data quality methodological approach can be used by practitioners of Big Data quality who follow a data-driven strategy. Compared with existing Big Data quality frameworks, the proposed data quality methodological approach has the advantage of being more precise and specific.
It gives clear and proven methods for undertaking the main identified stages of a Big Data quality lifecycle and can therefore be applied by practitioners in the area. This research study provides some promising results and deliverables. It also paves the way for further research in the area. Big Data technology is evolving rapidly, and future research should focus on new representations of Big Data, on the real-time streaming aspect, and on replicating the research methods used in this study on new technologies to validate the current results.

    Dagstuhl News January - December 2011

    "Dagstuhl News" is a publication edited especially for the members of the Foundation "Informatikzentrum Schloss Dagstuhl" to thank them for their support. The News gives a summary of the scientific work being done in Dagstuhl. Each Dagstuhl Seminar is presented with a short abstract describing the contents and scientific highlights of the seminar as well as the perspectives or challenges of the research topic.

    Extension and evaluation of the global cardinality constraints functionality of the Gecode open source toolkit

    Constraint Programming is an Artificial Intelligence methodology that aims to solve real-world problems in an efficient way. In this work, we extend the open source constraint solver Gecode by expanding its features concerning Global Constraints, specifically Global Cardinality Constraints. A Global Cardinality Constraint restricts the number of occurrences of each value among a collection of variables to lie between given bounds.
We develop the Global Cardinality Constraint With Costs, which is similar to the Global Cardinality Constraint and additionally associates a cost with each variable-value assignment, while further requiring that the sum of the costs of the assigned variable-value pairs not exceed a given cost bound. Moreover, we add the Symmetric Global Cardinality Constraint, which is defined on Set variables and introduces additional restrictions on the cardinality of each set, aside from the value occurrences. We attempt to optimize their performance by experimenting with various implementation choices, and finally we evaluate our constraints to discover under which conditions they are beneficial compared to decomposing them into multiple simpler ones.
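To make the two constraint definitions concrete, here is a minimal Python sketch that checks a complete assignment against them. It is only a checker, not a propagator, and it does not use or imitate the Gecode API; the function names, the `bounds` and `cost` dictionaries, and the choice to leave values without declared bounds unconstrained are all assumptions made for illustration.

```python
from collections import Counter

def satisfies_gcc(assignment, bounds):
    """Check a Global Cardinality Constraint on a full assignment.

    assignment: list where assignment[i] is the value of variable i.
    bounds: dict mapping value -> (min_count, max_count).
    Values without an entry in bounds are unconstrained (sketch assumption).
    """
    counts = Counter(assignment)
    return all(low <= counts[value] <= high for value, (low, high) in bounds.items())

def satisfies_gcc_with_costs(assignment, bounds, cost, cost_bound):
    """Check the cardinality bounds and that the total assignment cost stays within cost_bound.

    cost: dict mapping (variable_index, value) -> cost of that assignment.
    """
    if not satisfies_gcc(assignment, bounds):
        return False
    total = sum(cost[(i, value)] for i, value in enumerate(assignment))
    return total <= cost_bound

# Hypothetical instance: four variables over the values 1..3.
assignment = [1, 2, 1, 3]
bounds = {1: (1, 2), 2: (0, 1), 3: (1, 1)}
cost = {(i, v): v for i in range(4) for v in (1, 2, 3)}  # cost equals the value

ok_cardinality = satisfies_gcc(assignment, bounds)
ok_within_7 = satisfies_gcc_with_costs(assignment, bounds, cost, cost_bound=7)
ok_within_6 = satisfies_gcc_with_costs(assignment, bounds, cost, cost_bound=6)
```

A dedicated global constraint in a solver prunes variable domains during search rather than testing finished assignments, which is where the comparison against a decomposition into simpler constraints becomes relevant.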

    Data quality and data cleaning in database applications

    Today, data plays an important role in people's daily activities. With the help of database applications such as decision support systems and customer relationship management (CRM) systems, useful information or knowledge can be derived from large quantities of data. However, investigations show that many such applications fail to work successfully. There are many possible reasons for such failures, such as poor system infrastructure design or poor query performance. But nothing is more certain to yield failure than a lack of concern for data quality. High-quality data is key to today's business success. The quality of any large real-world data set depends on a number of factors, among which the source of the data is often the crucial one. It is now recognized that an inordinate proportion of the data in most data sources is dirty. A database application with a high proportion of dirty data is not reliable for data mining or for deriving business intelligence, and decisions made on the basis of such business intelligence are equally unreliable. To ensure high-quality data, enterprises need processes, methodologies and resources to monitor and analyze the quality of data, as well as methodologies for preventing, detecting and repairing dirty data. This thesis focuses on improving data quality in database applications with the help of current data cleaning methods. It provides a systematic and comparative description of the research issues related to improving the quality of data, and addresses a number of research issues related to data cleaning. In the first part of the thesis, the related literature on data cleaning and data quality is reviewed and discussed. Building on this research, a rule-based taxonomy of dirty data is proposed in the second part of the thesis.
The proposed taxonomy not only summarizes most dirty data types but is also the basis on which the proposed method for solving the Dirty Data Selection (DDS) problem during the data cleaning process was developed. This helps us to design the DDS process in the proposed data cleaning framework described in the third part of the thesis. This framework retains the most appealing characteristics of existing data cleaning approaches, and improves the efficiency and effectiveness of data cleaning as well as the degree of automation during the data cleaning process. Finally, a set of approximate string matching algorithms is studied and experimental work has been undertaken. Approximate string matching, which has been well studied for many years, is an important part of many data cleaning approaches. The experimental work in the thesis confirmed that there is no clear best technique. It shows that characteristics of the data, such as the size of the dataset, its error rate, the type of strings it contains and even the type of typo in a string, have a significant effect on the performance of the selected techniques. In addition, the characteristics of the data also affect the selection of suitable threshold values for the selected matching algorithms. These experimental results provide a fundamental improvement in the design of the 'algorithm selection mechanism' in the data cleaning framework, which enhances the performance of the data cleaning system in database applications.
    EThOS - Electronic Theses Online Service, GB, United Kingdo
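The role that threshold values play in approximate string matching can be sketched in a few lines of Python. The abstract does not say which matching algorithms the thesis studied, so this example assumes Levenshtein edit distance with a length-normalized similarity; the function names and the default threshold of 0.8 are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance between two strings, using two rows of the dynamic-programming table."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string on the inner loop to save memory
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute, free if equal
            ))
        previous = current
    return previous[-1]

def similarity(a, b):
    """Normalized similarity in [0, 1]; 1.0 means the strings are identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest

def is_match(a, b, threshold=0.8):
    """Treat two strings as duplicates when their similarity reaches the threshold."""
    return similarity(a, b) >= threshold

distance = levenshtein("kitten", "sitting")
loose = is_match("Jonathan", "Jonathon", threshold=0.8)
strict = is_match("Jonathan", "Jonathon", threshold=0.9)
```

The same pair of names matches at a threshold of 0.8 and fails at 0.9, a small instance of the point that suitable thresholds depend on the strings and error types in the data.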