19 research outputs found

    Explanation-Based Auditing

    To comply with emerging privacy laws and regulations, it has become common for applications like electronic health records systems (EHRs) to collect access logs, which record each time a user (e.g., a hospital employee) accesses a piece of sensitive data (e.g., a patient record). Using the access log, it is easy to answer simple queries (e.g., Who accessed Alice's medical record?), but this often does not provide enough information. In addition to learning who accessed their medical records, patients will likely want to understand why each access occurred. In this paper, we introduce the problem of generating explanations for individual records in an access log. The problem is motivated by user-centric auditing applications, and it also provides a novel approach to misuse detection. We develop a framework for modeling explanations which is based on a fundamental observation: For certain classes of databases, including EHRs, the reason for most data accesses can be inferred from data stored elsewhere in the database. For example, if Alice has an appointment with Dr. Dave, this information is stored in the database, and it explains why Dr. Dave looked at Alice's record. Large numbers of data accesses can be explained using general forms called explanation templates. Rather than requiring an administrator to manually specify explanation templates, we propose a set of algorithms for automatically discovering frequent templates from the database (i.e., those that explain a large number of accesses). We also propose techniques for inferring collaborative user groups, which can be used to enhance the quality of the discovered explanations. Finally, we have evaluated our proposed techniques using an access log and data from the University of Michigan Health System. Our results demonstrate that in practice we can provide explanations for over 94% of data accesses in the log. Comment: VLDB2012
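    To make the template idea concrete, here is a minimal sketch of one such explanation template, evaluated as a database join (Python with SQLite; the schema, table names, and data are invented for illustration and are not the paper's actual system): an access is explained if the accessing user had an appointment with the accessed patient.

        # One explanation template, evaluated as a join: log entries that
        # match an appointment are explained; the rest go to manual audit.
        import sqlite3

        conn = sqlite3.connect(":memory:")
        conn.executescript("""
            CREATE TABLE access_log (user_id TEXT, patient_id TEXT, ts TEXT);
            CREATE TABLE appointments (doctor_id TEXT, patient_id TEXT, day TEXT);
            INSERT INTO access_log VALUES ('dr_dave', 'alice', '2012-01-10');
            INSERT INTO appointments VALUES ('dr_dave', 'alice', '2012-01-10');
        """)

        explained = conn.execute("""
            SELECT l.user_id, l.patient_id
            FROM access_log AS l
            JOIN appointments AS a
              ON a.doctor_id = l.user_id AND a.patient_id = l.patient_id
        """).fetchall()
        print(explained)  # [('dr_dave', 'alice')] -- this access is explained

    Template discovery in the paper then amounts to finding joins like this one that account for a large fraction of the log, rather than hand-writing them.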

    Explain3D: Explaining Disagreements in Disjoint Datasets

    Data plays an important role in applications, analytic processes, and many aspects of human activity. As data grows in size and complexity, we are met with an imperative need for tools that promote understanding and explanations over data-related operations. Data management research on explanations has focused on the assumption that data resides in a single dataset, under one common schema. But the reality of today's data is that it is frequently un-integrated, coming from different sources with different schemas. When different datasets provide different answers to semantically similar questions, understanding the reasons for the discrepancies is challenging and cannot be handled by the existing single-dataset solutions. In this paper, we propose Explain3D, a framework for explaining the disagreements across disjoint datasets (3D). Explain3D focuses on identifying the reasons for the differences in the results of two semantically similar queries operating on two datasets with potentially different schemas. Our framework leverages the queries to perform a semantic mapping across the relevant parts of their provenance; discrepancies in this mapping point to causes of the queries' differences. Exploiting the queries gives Explain3D an edge over traditional schema matching and record linkage techniques, which are query-agnostic. Our work makes the following contributions: (1) We formalize the problem of deriving optimal explanations for the differences of the results of semantically similar queries over disjoint datasets. (2) We design a 3-stage framework for solving the optimal explanation problem. (3) We develop a smart-partitioning optimizer that improves the efficiency of the framework by orders of magnitude. (4) We experiment with real-world and synthetic data to demonstrate that Explain3D can derive precise explanations efficiently.
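    As a toy illustration of the general approach (not the Explain3D implementation; the datasets, schemas, and queries below are invented), one can diff the answers of two semantically similar queries and, for each disagreement, surface the provenance tuples that produced each side's answer:

        # Two datasets answer the same semantic question under different
        # schemas; diff the answers and trace each side's provenance.
        ds1 = [{"city": "Ann Arbor", "sales": 10}, {"city": "Detroit", "sales": 5}]
        ds2 = [{"town": "Ann Arbor", "amount": 10}, {"town": "Detroit", "amount": 7}]

        answer1 = {r["city"]: r["sales"] for r in ds1}   # query over dataset 1
        answer2 = {r["town"]: r["amount"] for r in ds2}  # query over dataset 2

        for key in answer1.keys() | answer2.keys():
            if answer1.get(key) != answer2.get(key):
                prov1 = [r for r in ds1 if r["city"] == key]  # provenance, side 1
                prov2 = [r for r in ds2 if r["town"] == key]  # provenance, side 2
                print(f"disagreement on {key!r}: {answer1.get(key)} vs {answer2.get(key)}")
                print("  side 1 provenance:", prov1)
                print("  side 2 provenance:", prov2)

    The framework's contribution is doing this mapping across schemas automatically, using the queries themselves to align the relevant provenance.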

    Detecting Insider Threats from Healthcare Access Logs (Sisäisen uhan havaitseminen terveydenhuollon käyttölokeista)

    Social and health care has moved to electronic patient records. To guarantee patient safety, the law requires that log data be collected on their use. Misuse of patient data by users can be detected from access logs through auditing, but the sheer volume of data makes manual review difficult. Data mining and machine learning techniques can be used when trying to find relevant information, similarities, and anomalies in large volumes of data; these techniques are an important part of the research fields known as misuse detection and insider threat detection. This thesis searched for insider threat detection methods suitable for healthcare that make use of access logs. The research method was an integrative literature review whose material comprised 19 quality-assessed scientific publications. The included publications, from the years 2009-2019, were collected from computer science databases. The central result of the thesis is the literature review itself, which presents prior research in the area and forms a synthesis. The synthesis contains a condensed description of the current state of insider threat detection solutions in healthcare environments. A working system determines whether an access log record, a user, or a patient points to misuse. The system's detection strategy draws on simple rules, alert prioritization and reduction, recommendation, explanation templates of normal use, or proximity measures. Data important to such a system include access logs and organizational and care data. Finding detection methods suitable for healthcare is possible through a literature review, although forming consistent search terms poses challenges. The review showed that the set of applicable methods is diverse and that they can likely make detection work more effective. Moreover, insider threat detection is an active research field, so new detection strategies may well emerge in the near future. Because of the special characteristics of the healthcare environment, future solutions will most likely continue to rely heavily on access logs. Future work should investigate the practical applicability of these methods in Finnish healthcare alongside existing systems.
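    As a rough sketch of the "simple rules" strategy the review mentions (the field names, rules, and thresholds below are invented for illustration, not taken from any reviewed system), a rule-based detector flags suspicious access-log records and prioritizes the resulting alerts by user:

        from collections import Counter

        log = [
            {"user": "u1", "patient": "p9", "dept_match": True,  "hour": 10},
            {"user": "u2", "patient": "p3", "dept_match": False, "hour": 3},
            {"user": "u2", "patient": "u2", "dept_match": True,  "hour": 11},
        ]

        def flags(rec):
            out = []
            if rec["user"] == rec["patient"]:
                out.append("self-access")            # user viewed their own record
            if not rec["dept_match"]:
                out.append("cross-department access")
            if rec["hour"] < 6:
                out.append("off-hours access")
            return out

        alerts = [(rec, f) for rec in log if (f := flags(rec))]
        by_user = Counter(rec["user"] for rec, _ in alerts)  # prioritize noisy users
        print(alerts)
        print(by_user.most_common())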

    Causality and explanations in databases

    ABSTRACT
    With the surge in the availability of information, there is a great demand for tools that assist users in understanding their data. While today's exploration tools rely mostly on data visualization, users often want to go deeper and understand the underlying causes of a particular observation. This tutorial surveys research on causality and explanation for data-oriented applications. We will review and summarize the research thus far into causality and explanation in the database and AI communities, giving researchers a snapshot of the current state of the art on this topic, and propose a unified framework as well as directions for future research. We will cover both the theory of causality/explanation and some applications; we also discuss the connections with other topics in database research like provenance, deletion propagation, why-not queries, and OLAP techniques.

    MOTIVATION
    With the surge in the availability of information, there is great need for tools that help users understand data. There are several examples of systems that offer some kind of assistance for users to understand and explore datasets. Humans typically observe the data at a high level of abstraction, by aggregating or by visualizing it in a graph, but often they want to go deeper and understand the ultimate causes of their observations. Over the last few years there have been several efforts in the Database and AI communities to develop general techniques to model causes, or explanations, for observations on the data, some of them enabled by Judea Pearl's seminal book on Causality. (All references are omitted here and will appear in the tutorial, due to space limitations.) Causality has been formalized both for AI applications and for database queries, and formal definitions of explanations have also been proposed both in the AI and the Database literature. Given the importance of developing general-purpose tools to assist users in understanding data, it is likely that research in this space will continue, perhaps even intensify.

    Depth and Coverage. This 1.5-hour tutorial aims at establishing a research checkpoint: its goal is to review, summarize, and systematize the research so far into causality and explanation in databases, giving researchers a snapshot of the current state of the art on this topic, and at the same time to propose a unified framework for future research. We will cover a wide range of work on causality and explanation from the database and AI communities, and we will discuss the connections with other topics in database research.

    Intended Audience. The tutorial is aimed both at active researchers in databases and at graduate students and young researchers seeking a new research topic. Practitioners from industry might find the tutorial useful as a preview of plausible future trends in data analysis tools.

    Assumed Background. Basic knowledge of databases is sufficient to follow the tutorial. Some background in Datalog, provenance, and/or OLAP would be useful, but is not necessary.

    COVERED TOPICS
    Our tutorial is divided into three thematic sections. First, we discuss the notion of causality, its foundations in AI and philosophy, and its applications in the database field. Second, we discuss how the intuition of causality can be used to explain query results. Third, we relate these notions to several other topics of database research, including provenance, missing results, and view updates.

    Causality. Understanding causality in a broad sense is of vital importance in many practical settings, e.g., in determining legal responsibility in multi-car accidents, in diagnosing malfunction of complex systems, or in scientific inquiry. The notion of causality and causation is a topic in philosophy, studied and argued over for centuries. At a high level, causality characterizes the relationship between an event and an outcome: the event is a cause if the outcome is a consequence of the event. The notion of counterfactual causes, which can be traced back to Hume (1748) and was analyzed later by Lewis (1973), explains causality in an intuitive way: if the first event (the cause) had not occurred, then the second event (the effect) would not have occurred. Several philosophers explored an alternative approach to counterfactuals that employs structural equations; Judea Pearl's landmark book on causality defined the state-of-the-art formulation of this framework, and the Halpern-Pearl definitions of actual causality build on it.
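    As a toy example of the counterfactual definition applied to query answers (the data and query are invented; this is not code from the tutorial), a tuple is a counterfactual cause of an answer if removing it from the database makes that answer disappear:

        # A tuple t is a counterfactual cause of a query answer if the
        # answer holds on the database but disappears once t is removed.
        db = [("alice", "cardiology"), ("bob", "cardiology"), ("carol", "oncology")]

        def query(tuples):
            # "Which departments have at least one patient?"
            return {dept for _, dept in tuples}

        answer = "oncology"
        causes = [t for t in db
                  if answer in query(db)
                  and answer not in query([u for u in db if u != t])]
        print(causes)  # [('carol', 'oncology')] -- the only counterfactual cause

    Note that neither cardiology tuple is a counterfactual cause of the "cardiology" answer on its own, which is exactly the subtlety that motivates the richer definitions of causality the tutorial covers.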

    Computational Complexity And Algorithms For Dirty Data Evaluation And Repairing

    In this dissertation, we study the problem of evaluating and repairing dirty data in relational databases. Dirty data is typically inconsistent, inaccurate, incomplete, or stale. Existing methods and theories describe consistency using integrity constraints, such as data dependencies; however, integrity constraints are good at detecting inconsistency but not at evaluating its degree, and they cannot guide data repairing. This dissertation first studies the computational complexity of, and algorithms for, database inconsistency evaluation. We define and use the minimum tuple deletion to evaluate database inconsistency. For this minimum tuple deletion problem, we study the relationship between the size of the rule set and the computational complexity: we show that it is NP-hard to approximate the minimum tuple deletion within a factor of 17/16, even with only three functional dependencies and four involved attributes. A near-optimal approximation algorithm for computing the minimum tuple deletion is proposed, with a ratio of 2 - 1/2^r, where r is the number of given functional dependencies. To guide data repairing, this dissertation also investigates repairing driven by query feedback, formally studying two decision problems, the functional-dependency-restricted deletion propagation and insertion propagation problems, corresponding to deletion and insertion feedback. A comprehensive analysis of both combined and data complexity is provided, considering different relational operators and feedback types. We identify the intractable and tractable cases to map out the complexity hierarchy of these problems, and provide efficient algorithms for the tractable cases. Two improvements are proposed: one computes a minimum vertex cover of the conflict graph to improve the upper bound for the tuple deletion problem, and the other gives a better dichotomy for the deletion and insertion propagation problems in the absence of functional dependencies, considering data, combined, and parameterized complexity respectively.
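    One way to picture the vertex-cover connection (a sketch with invented data, not the dissertation's algorithm): tuples that jointly violate a functional dependency A -> B form an edge in a conflict graph; deleting any vertex cover of that graph restores consistency, and a maximal matching yields the classic factor-2 approximation of the minimum cover:

        from itertools import combinations

        rows = [("a1", "b1"), ("a1", "b2"), ("a2", "b3"), ("a2", "b3")]  # FD: A -> B

        # Conflict edges: pairs of tuples that agree on A but disagree on B.
        edges = [(i, j) for i, j in combinations(range(len(rows)), 2)
                 if rows[i][0] == rows[j][0] and rows[i][1] != rows[j][1]]

        # Greedy maximal matching; taking both endpoints of every matched
        # edge gives a vertex cover at most twice the optimum.
        cover, matched = set(), set()
        for i, j in edges:
            if i not in matched and j not in matched:
                matched |= {i, j}
                cover |= {i, j}

        print("conflicting pairs:", edges)              # [(0, 1)]
        print("tuples to delete (2-approx):", sorted(cover))

    Here the optimum deletes one of the two conflicting tuples; the matching-based cover deletes both, illustrating the factor-2 gap the dissertation's sharper ratio improves on.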

    Privacy-preserving audit for broker-based health information exchange

    Health information technology has spurred the development of distributed systems known as Health Information Exchanges (HIEs) that enable the sharing of patient records between different health care organizations. Because of patient privacy concerns over sensitive medical information, participants in these exchanges wish to disclose the minimum amount of information needed. Broker-based HIEs therefore aim to keep only limited information in exchange repositories while ensuring faster and more efficient patient care. It is essential to audit these exchanges carefully to minimize the risk of illegitimate data sharing. This thesis presents a design for auditing broker-based HIEs in a way that controls the information available in audit logs and regulates its release during audit investigations according to the requirements of the applicable privacy policy. In our design, we use formal rules to verify access to the HIE and adopt Hierarchical Identity-Based Encryption (HIBE) to support the staged release of data required for audits and a balance between automated and manual review. We test our methodology with a consolidated, centralized audit source that combines the Audit Trail and Node Authentication (ATNA) profile, a standard for auditing HIEs, with supplementary audit documentation from HIE participants.
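    The staged-release idea can be sketched with a simple key hierarchy (the HMAC chain below is only a stand-in for HIBE, which additionally provides public-key encryption with delegated key generation; all identifiers are invented): releasing a key for one level of the identity hierarchy opens everything beneath it while sibling branches stay sealed.

        import hashlib, hmac

        def derive(parent_key: bytes, level_id: str) -> bytes:
            """Derive a child key for one level of the identity hierarchy."""
            return hmac.new(parent_key, level_id.encode(), hashlib.sha256).digest()

        root    = b"audit-authority-master-secret"       # held by the audit authority
        k_org   = derive(root,  "org:general-hospital")  # stage 1: organization
        k_dept  = derive(k_org, "dept:cardiology")       # stage 2: department
        k_event = derive(k_dept, "event:access-4711")    # stage 3: one log event

        # Releasing k_dept to an investigator lets them re-derive every event
        # key under cardiology, while other departments stay sealed.
        assert derive(k_dept, "event:access-4711") == k_event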