
    SQLCheck: Automated Detection and Diagnosis of SQL Anti-Patterns

    The emergence of database-as-a-service platforms has made deploying database applications easier than before. Now, developers can quickly create scalable applications. However, designing performant, maintainable, and accurate applications is challenging. Developers may unknowingly introduce anti-patterns in the application's SQL statements. These anti-patterns are design decisions that are intended to solve a problem, but often lead to other problems by violating fundamental design principles. In this paper, we present SQLCheck, a holistic toolchain for automatically finding and fixing anti-patterns in database applications. We introduce techniques for automatically (1) detecting anti-patterns with high precision and recall, (2) ranking the anti-patterns based on their impact on performance, maintainability, and accuracy of applications, and (3) suggesting alternative queries and changes to the database design to fix these anti-patterns. We demonstrate the prevalence of these anti-patterns in a large collection of queries and databases collected from open-source repositories. We introduce an anti-pattern detection algorithm that augments query analysis with data analysis. We present a ranking model for characterizing the impact of frequently occurring anti-patterns. We discuss how SQLCheck suggests fixes for high-impact anti-patterns using rule-based query refactoring techniques. Our experiments demonstrate that SQLCheck enables developers to create more performant, maintainable, and accurate applications.
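
    As a rough illustration of the rule-based flavour of anti-pattern detection (not SQLCheck's actual algorithm, which augments query analysis with data analysis), the Python sketch below scans a query string against a few hypothetical pattern rules and reports why each match can hurt performance or maintainability.

    # Illustrative sketch only: a minimal rule-based anti-pattern scan over SQL text.
    # The rules and names below are hypothetical examples, not SQLCheck's detection
    # algorithm, which combines query analysis with data analysis.
    import re

    # Each rule: (name, compiled pattern, why the match can hurt the application)
    RULES = [
        ("implicit-columns", re.compile(r"\bSELECT\s+\*", re.IGNORECASE),
         "SELECT * couples the query to the table layout and fetches unused columns"),
        ("leading-wildcard-like", re.compile(r"\bLIKE\s+'%", re.IGNORECASE),
         "a leading % wildcard prevents index use on the filtered column"),
        ("float-for-money", re.compile(r"\b(FLOAT|REAL)\b", re.IGNORECASE),
         "floating-point columns can silently lose precision for monetary values"),
    ]

    def detect_anti_patterns(sql: str):
        """Return (rule_name, explanation) pairs for every rule that matches."""
        return [(name, why) for name, pattern, why in RULES if pattern.search(sql)]

    if __name__ == "__main__":
        query = "SELECT * FROM orders WHERE customer_name LIKE '%smith'"
        for name, why in detect_anti_patterns(query):
            print(f"{name}: {why}")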

    Efficient Discovery of Ontology Functional Dependencies

    Poor data quality has become a pervasive issue due to the increasing complexity and size of modern datasets. Constraint-based data cleaning techniques rely on integrity constraints as a benchmark to identify and correct errors. Data values that do not satisfy the given set of constraints are flagged as dirty, and data updates are made to re-align the data and the constraints. However, many errors require user input to resolve because domain expertise defines specific terminology and relationships. For example, in pharmaceuticals, 'Advil' is-a brand name for 'ibuprofen', a relationship that can be captured in a pharmaceutical ontology. While functional dependencies (FDs) have traditionally been used in existing data cleaning solutions to model syntactic equivalence, they are not able to model broader relationships (e.g., is-a) defined by an ontology. In this paper, we take a first step towards extending the set of data quality constraints used in data cleaning by defining and discovering Ontology Functional Dependencies (OFDs). We lay out theoretical and practical foundations for OFDs, including a set of sound and complete axioms and a linear inference procedure. We then develop effective algorithms for discovering OFDs, and a set of optimizations that efficiently prune the search space. Our experimental evaluation using real data shows the scalability and accuracy of our algorithms.
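
    To make the idea concrete, the hypothetical Python sketch below checks a single OFD of the form X -> Y against a toy pharmaceutical ontology: two tuples that agree on X may differ syntactically on Y as long as their Y values map to the same concept. The data, ontology, and function names are illustrative assumptions; the paper's axioms and discovery algorithms are far more general.

    # Illustrative sketch only: checking one "ontology functional dependency"
    # X -> Y, where right-hand values may agree up to an is-a / synonym relation
    # captured in a small hand-made ontology. Everything here is hypothetical.
    from collections import defaultdict

    # Toy ontology: each term maps to a canonical concept (e.g., brand -> generic).
    ONTOLOGY = {
        "Advil": "ibuprofen",
        "Motrin": "ibuprofen",
        "ibuprofen": "ibuprofen",
        "Tylenol": "acetaminophen",
        "acetaminophen": "acetaminophen",
    }

    def canonical(term: str) -> str:
        return ONTOLOGY.get(term, term)

    def violates_ofd(rows, lhs: str, rhs: str) -> bool:
        """True if two rows agree on lhs but map to different concepts on rhs."""
        concepts_by_key = defaultdict(set)
        for row in rows:
            concepts_by_key[row[lhs]].add(canonical(row[rhs]))
        return any(len(concepts) > 1 for concepts in concepts_by_key.values())

    if __name__ == "__main__":
        rows = [
            {"prescription_id": 1, "drug": "Advil"},
            {"prescription_id": 1, "drug": "ibuprofen"},  # same concept: not a violation
            {"prescription_id": 2, "drug": "Tylenol"},
        ]
        # A plain syntactic FD would flag prescription 1; the OFD does not.
        print(violates_ofd(rows, "prescription_id", "drug"))  # False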

    Data Masking, Encryption, and their Effect on Classification Performance: Trade-offs Between Data Security and Utility

    As data mining increasingly shapes organizational decision-making, the quality of its results must be questioned to ensure trust in the technology. Inaccuracies can mislead decision-makers and cause costly mistakes. With more data collected for analytical purposes, privacy is also a major concern. Data security policies and regulations are increasingly put in place to manage risks, but these policies and regulations often employ technologies that substitute and/or suppress sensitive details contained in the data sets being mined. Masking (substitution) and encryption (suppression) of sensitive attributes can limit access to important details, and it is believed that their use can affect the quality of data mining results. This dissertation investigated and compared the causal effects of data masking and encryption on classification performance as a measure of the quality of knowledge discovery. A review of the literature found a gap in the body of knowledge, indicating that this problem had not been studied before in an experimental setting. The objective of this dissertation was to gain an understanding of the trade-offs between data security and utility in the field of analytics and data mining. The research used a nationally recognized cancer incidence database to show how masking and encryption of potentially sensitive demographic attributes, such as patients' marital status, race/ethnicity, origin, and year of birth, could have a statistically significant impact on the patients' predicted survival. Performance parameters measured by four different classifiers showed sizable variations, in the range of 9% to 10%, between a control group, where the selected attributes were untouched, and two experimental groups, where the attributes were substituted or suppressed to simulate the effects of the data protection techniques. In practice, this corroborates the potential risk of basing medical treatment decisions on data mining applications in which attributes are masked or encrypted to address patient privacy and security concerns.
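
    The hypothetical Python sketch below mimics the control-group / experimental-group design on synthetic data: it trains the same classifier with and without a "sensitive" attribute and compares the resulting accuracies. The data, the single classifier, and the attribute names are placeholder assumptions, not the dissertation's cancer incidence database or its four classifiers.

    # Illustrative sketch only: compare classifier accuracy when a potentially
    # sensitive attribute is available (control) versus suppressed (experimental).
    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score

    rng = np.random.default_rng(0)
    n = 2000

    # Hypothetical columns: [age_group, marital_status, race_ethnicity, tumour_stage]
    X = rng.integers(0, 5, size=(n, 4)).astype(float)
    # The outcome depends partly on a demographic attribute, so hiding it costs signal.
    y = ((X[:, 1] + X[:, 3] + rng.normal(0, 1.0, n)) > 4).astype(int)

    def accuracy(features):
        X_tr, X_te, y_tr, y_te = train_test_split(features, y, test_size=0.3, random_state=0)
        model = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
        return accuracy_score(y_te, model.predict(X_te))

    control = accuracy(X)                      # all attributes untouched
    suppressed = accuracy(np.delete(X, 1, 1))  # marital_status suppressed
    print(f"control: {control:.3f}  suppressed: {suppressed:.3f}")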

    Information System Articulation Development - Managing Veracity Attributes and Quantifying Relationship with Readability of Textual Data

    Textual data are often disorganized or misinterpreted because unstructured Big Data spans multiple dimensions, and managing readable alphanumeric text and its analytics is challenging. Across spatial dimensions, facts can be ambiguous and inconsistent, which complicates interpretation and the discovery of new knowledge; the information can also be wordy, erratic, and noisy. This research aims to capture these data characteristics through Information System (IS) artefacts suited to data analytics, especially in application domains that draw on big data sources. Data heterogeneity and multidimensionality can both enable and hinder IS-guided veracity models in the data integration process, including customer analytics services. The veracity of big data therefore affects its visualization and value, including qualitative knowledge enhancement across large volumes of text. How veracity features are construed along the schematic, semantic, and syntactic attribute dimensions of IS artefacts and relevant documents can substantially enhance the readability of textual data.

    A pattern based approach for data quality requirements modelling


    BlogForever: D3.1 Preservation Strategy Report

    This report describes preservation planning approaches and strategies recommended by the BlogForever project as a core component of a weblog repository design. More specifically, we start by discussing why we would want to preserve weblogs in the first place and what exactly it is that we are trying to preserve. We then review past and present work and highlight why current practices in web archiving do not adequately address the needs of weblog preservation. We make three distinctive contributions in this volume: a) we propose transferable practical workflows for applying a combination of established metadata and repository standards in developing a weblog repository; b) we provide an automated approach to identifying significant properties of weblog content, based on the notion of communities, and show how this affects previous strategies; c) we propose a sustainability plan that draws upon community knowledge through innovative repository design.

    Adapting a quality model for a Big Data application: the case of a feature prediction system

    In the last decade, we have witnessed a considerable increase in projects based on Big Data applications. Some of the most popular types of these applications are recommendation systems, feature prediction, and decision making. In this context, several quality models have been proposed for Big Data applications, but their great heterogeneity makes it difficult to select the ideal quality model for developing a specific type of Big Data application. This Master's thesis conducts a Systematic Mapping Study (SMS) driven by two key research questions. The first asks what the state of the art is in identifying risks, issues, problems, or challenges in Big Data applications. The second asks which quality models have been applied to date to Big Data applications, specifically to feature prediction systems. The main objective is to analyze the available quality models and adapt, from the existing ones, a quality model that can be applied to a specific type of Big Data application: feature prediction systems. The resulting model comprises a set of quality characteristics defined as part of the model and a set of quality metrics to evaluate them. Finally, the model is applied to a case study, the defined quality characteristics are evaluated through their quality metrics, and the results are presented and discussed.
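
    As an illustration of what quality metrics for such a model might look like, the hypothetical Python sketch below computes two common data quality measures, completeness and validity, over a handful of made-up records; the metric definitions and the example rule are assumptions, not the characteristics defined in the thesis.

    # Illustrative sketch only: two simple data quality metrics of the kind a
    # quality model for a feature prediction system might track. The metric
    # definitions and thresholds here are hypothetical.
    def completeness(rows, column):
        """Fraction of rows where the column is present and non-empty."""
        filled = sum(1 for r in rows if r.get(column) not in (None, ""))
        return filled / len(rows) if rows else 0.0

    def validity(rows, column, is_valid):
        """Fraction of non-empty values that satisfy the domain rule is_valid."""
        values = [r[column] for r in rows if r.get(column) not in (None, "")]
        return sum(1 for v in values if is_valid(v)) / len(values) if values else 0.0

    if __name__ == "__main__":
        rows = [
            {"age": 34, "income": 52000},
            {"age": None, "income": 61000},
            {"age": 290, "income": ""},  # out-of-range age, missing income
        ]
        print("age completeness:", round(completeness(rows, "age"), 2))
        print("age validity:", round(validity(rows, "age", lambda v: 0 <= v <= 120), 2))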