Explanation-Based Auditing
To comply with emerging privacy laws and regulations, it has become common
for applications like electronic health records systems (EHRs) to collect
access logs, which record each time a user (e.g., a hospital employee) accesses
a piece of sensitive data (e.g., a patient record). Using the access log, it is
easy to answer simple queries (e.g., Who accessed Alice's medical record?), but
this often does not provide enough information. In addition to learning who
accessed their medical records, patients will likely want to understand why
each access occurred. In this paper, we introduce the problem of generating
explanations for individual records in an access log. The problem is motivated
by user-centric auditing applications, and it also provides a novel approach to
misuse detection. We develop a framework for modeling explanations which is
based on a fundamental observation: For certain classes of databases, including
EHRs, the reason for most data accesses can be inferred from data stored
elsewhere in the database. For example, if Alice has an appointment with Dr.
Dave, this information is stored in the database, and it explains why Dr. Dave
looked at Alice's record. Large numbers of data accesses can be explained using
general forms called explanation templates. Rather than requiring an
administrator to manually specify explanation templates, we propose a set of
algorithms for automatically discovering frequent templates from the database
(i.e., those that explain a large number of accesses). We also propose
techniques for inferring collaborative user groups, which can be used to
enhance the quality of the discovered explanations. Finally, we have evaluated
our proposed techniques using an access log and data from the University of
Michigan Health System. Our results demonstrate that in practice we can provide
explanations for over 94% of data accesses in the log.
Comment: VLDB201
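The template idea above can be sketched in a few lines: an access is "explained" when a join against data stored elsewhere in the database (here a hypothetical appointments table; all names and the time window are illustrative, not the paper's schema) yields a stored reason for it.

```python
from datetime import date

# Hypothetical access log and appointments table.
access_log = [
    {"user": "Dr. Dave", "patient": "Alice", "date": date(2024, 3, 1)},
    {"user": "Dr. Eve",  "patient": "Alice", "date": date(2024, 3, 2)},
]
appointments = [
    {"doctor": "Dr. Dave", "patient": "Alice", "date": date(2024, 3, 1)},
]

def explain_by_appointment(access, appointments, window_days=7):
    """Template: 'user accessed patient because of a nearby appointment'."""
    return [a for a in appointments
            if a["doctor"] == access["user"]
            and a["patient"] == access["patient"]
            and abs((a["date"] - access["date"]).days) <= window_days]

explained = [a for a in access_log if explain_by_appointment(a, appointments)]
unexplained = [a for a in access_log if not explain_by_appointment(a, appointments)]
```

Dr. Dave's access joins to an appointment and is explained; Dr. Eve's does not, so it remains a candidate for misuse review.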
Explain3D: Explaining Disagreements in Disjoint Datasets
Data plays an important role in applications, analytic processes, and many
aspects of human activity. As data grows in size and complexity, we are met
with an imperative need for tools that promote understanding and explanations
over data-related operations. Data management research on explanations has
focused on the assumption that data resides in a single dataset, under one
common schema. But the reality of today's data is that it is frequently
un-integrated, coming from different sources with different schemas. When
different datasets provide different answers to semantically similar questions,
understanding the reasons for the discrepancies is challenging and cannot be
handled by the existing single-dataset solutions.
In this paper, we propose Explain3D, a framework for explaining the
disagreements across disjoint datasets (3D). Explain3D focuses on identifying
the reasons for the differences in the results of two semantically similar
queries operating on two datasets with potentially different schemas. Our
framework leverages the queries to perform a semantic mapping across the
relevant parts of their provenance; discrepancies in this mapping point to
causes of the queries' differences. Exploiting the queries gives Explain3D an
edge over traditional schema matching and record linkage techniques, which are
query-agnostic. Our work makes the following contributions: (1) We formalize
the problem of deriving optimal explanations for the differences of the results
of semantically similar queries over disjoint datasets. (2) We design a 3-stage
framework for solving the optimal explanation problem. (3) We develop a
smart-partitioning optimizer that improves the efficiency of the framework by
orders of magnitude. (4) We experiment with real-world and synthetic data to
demonstrate that Explain3D can derive precise explanations efficiently.
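A toy sketch of the underlying intuition (hypothetical data and field positions, not Explain3D's actual mapping algorithm): two semantically similar aggregate queries disagree, and comparing the provenance each query aggregated isolates the rows responsible for the disagreement.

```python
# Hypothetical datasets: each row is (year, region, amount).
ds1 = [("2023", "north", 100), ("2023", "south", 50)]
ds2 = [("2023", "north", 100)]

def provenance(dataset, year):
    """The tuples a 'total sales for `year`' query would aggregate."""
    return {row for row in dataset if row[0] == year}

p1, p2 = provenance(ds1, "2023"), provenance(ds2, "2023")
answer1 = sum(r[2] for r in p1)   # the two queries disagree: 150 vs. 100
answer2 = sum(r[2] for r in p2)

# Rows present in one mapped provenance but not the other explain the gap.
only_in_ds1 = p1 - p2
```

Because the comparison starts from the queries' provenance rather than from raw schema matching, only the rows each query actually touched need to be mapped.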
A Knowledge-Constrained Role-Based Access Control model for protecting patient privacy in hospital information systems
Current access control mechanisms in hospital information systems can hardly identify the real access intention of system users, and relaxed access control increases the risk of compromising patient privacy. To reduce unnecessary access to patient information by hospital staff, this paper proposes a Knowledge-Constrained Role-Based Access Control (KC-RBAC) model in which a variety of medical domain knowledge is considered in access control. Based on the proposed Purpose Tree and knowledge-involved algorithms, the model can dynamically define the boundary of access to patient information according to the context, which helps protect patient privacy by controlling access. According to the experimental results, KC-RBAC can protect patient information more effectively than the Role-Based Access Control model.
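A minimal sketch of such a check, with hypothetical names throughout (this is not the paper's algorithm): an access must pass the role permission, carry a purpose that lies within the purposes granted in a Purpose Tree, and satisfy a context constraint derived from domain knowledge, such as an existing treating relationship.

```python
# Hypothetical Purpose Tree: parent purpose -> child purposes.
purpose_tree = {
    "healthcare": ["treatment", "research"],
    "treatment": ["diagnosis", "billing"],
}

def purpose_allowed(granted, requested):
    """True if `requested` equals `granted` or lies below it in the tree."""
    if granted == requested:
        return True
    return any(purpose_allowed(child, requested)
               for child in purpose_tree.get(granted, []))

def access_allowed(role_perms, role, granted_purpose, requested_purpose,
                   has_treating_relationship):
    # Role check AND purpose check AND domain-knowledge context check.
    return ("read_record" in role_perms.get(role, set())
            and purpose_allowed(granted_purpose, requested_purpose)
            and has_treating_relationship)
```

Under this sketch a physician granted the "treatment" purpose may read a record for "diagnosis", but not for "research", and not for a patient they have no treating relationship with.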
Detecting insider threats from healthcare access logs
Social and health care has moved to electronic patient records. To guarantee patient safety, the law requires collecting log data on their use. Misuse of patient data by users can be detected by auditing the access logs, but the sheer volume of data makes manual review difficult.
Data mining and machine learning techniques can be used to find relevant information, similarities, and anomalies in large volumes of data. These techniques are a key part of the research fields known as misuse detection and insider threat detection. This thesis searched for insider threat detection methods suitable for healthcare that make use of access logs.
The research method used to find detection methods was an integrative literature review, whose material comprised 19 quality-assessed scientific publications. The included publications, from 2009–2019, were collected from computer science databases. The central result of the thesis is the literature review itself, which presents prior research in the area and forms a synthesis. The synthesis contains a condensed description of the current state of insider threat detection solutions in the healthcare environment. A working system determines whether an access log record, a user, or a patient indicates misuse. The system's detection strategy makes use of simple rules, alert prioritization and reduction, recommendation, explanation templates of normal use, or proximity measures. Data important to the system include access logs as well as organizational and care data.
Finding detection methods suitable for healthcare is possible through a literature review, although forming consistent search terms poses challenges. The review showed that the set of applicable methods is diverse and that these methods can likely make detection work more efficient. Moreover, the insider threat detection research field is active, so new detection strategies may well emerge in the near future. Because of the special characteristics of the healthcare environment, it is likely that future solutions will also rely heavily on access logs. Future research should examine the practical applicability of the methods alongside existing systems in Finnish healthcare.
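The simple-rule detection strategy mentioned in the synthesis can be sketched as follows (hypothetical log fields and thresholds; a sketch, not any surveyed system): flag accesses that lack an obvious care relationship or occur outside normal working hours.

```python
def suspicious(access, care_pairs):
    """Flag an access with no known care relationship or made off-hours."""
    no_relationship = (access["user"], access["patient"]) not in care_pairs
    off_hours = not (8 <= access["hour"] < 18)
    return no_relationship or off_hours

# Hypothetical access log and known (user, patient) care relationships.
log = [
    {"user": "nurse01", "patient": "Alice", "hour": 10},
    {"user": "nurse01", "patient": "Bob",   "hour": 10},
    {"user": "nurse01", "patient": "Alice", "hour": 3},
]
care_pairs = {("nurse01", "Alice")}
alerts = [a for a in log if suspicious(a, care_pairs)]
```

Rules like these produce many alerts, which is why the surveyed systems pair them with alert prioritization and reduction.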
Causality and explanations in databases
ABSTRACT
With the surge in the availability of information, there is great demand for tools that assist users in understanding their data. While today's exploration tools rely mostly on data visualization, users often want to go deeper and understand the underlying causes of a particular observation. This tutorial surveys research on causality and explanation for data-oriented applications. We review and summarize the research thus far into causality and explanation in the database and AI communities, giving researchers a snapshot of the current state of the art on this topic, and propose a unified framework as well as directions for future research. We cover both the theory of causality/explanation and some applications, and we discuss the connections with other topics in database research such as provenance, deletion propagation, why-not queries, and OLAP techniques.
MOTIVATION
With the surge in the availability of information, there is great need for tools that help users understand data, and several systems already offer some assistance for understanding and exploring datasets. Humans typically observe the data at a high level of abstraction, by aggregating it or by visualizing it in a graph, but often they want to go deeper and understand the ultimate causes of their observations. Over the last few years there have been several efforts in the database and AI communities to develop general techniques to model causes, or explanations, for observations on the data, some of them enabled by Judea Pearl's seminal book on causality. (All references are omitted and will appear in the tutorial due to space limitations.) Causality has been formalized both for AI applications and for database queries, and formal definitions of explanations have been proposed in both the AI and the database literature. Given the importance of developing general-purpose tools to assist users in understanding data, it is likely that research in this space will continue, perhaps even intensify.
Depth and Coverage. This 1.5-hour tutorial aims at establishing a research checkpoint: its goal is to review, summarize, and systematize the research so far into causality and explanation in databases, giving researchers a snapshot of the current state of the art on this topic, and at the same time to propose a unified framework for future research. We will cover a wide range of work on causality and explanation from the database and AI communities, and we will discuss the connections with other topics in database research.
Intended audience. The tutorial is aimed both at active database researchers and at graduate students and young researchers seeking a new research topic. Practitioners from industry might find the tutorial useful as a preview of plausible future trends in data analysis tools.
Assumed Background. Basic knowledge of databases is sufficient to follow the tutorial. Some background in Datalog, provenance, and/or OLAP would be useful, but is not necessary.
COVERED TOPICS
Our tutorial is divided into three thematic sections. First, we discuss the notion of causality, its foundations in AI and philosophy, and its applications in the database field. Second, we discuss how the intuition of causality can be used to explain query results. Third, we relate these notions to several other topics of database research, including provenance, missing results, and view updates.
Causality. Understanding causality in a broad sense is of vital importance in many practical settings, e.g., in determining legal responsibility in multi-car accidents, in diagnosing malfunctions of complex systems, or in scientific inquiry. Causality and causation have been studied and argued over by philosophers for centuries. At a high level, causality characterizes the relationship between an event and an outcome: the event is a cause if the outcome is a consequence of the event. The notion of counterfactual causes, which can be traced back to Hume (1748) and was analyzed later by Lewis (1973), explains causality in an intuitive way: if the first event (the cause) had not occurred, then the second event (the effect) would not have occurred. Several philosophers explored an alternative approach to counterfactuals that employs structural equations; Judea Pearl's landmark book on causality defined the state-of-the-art formulation of this framework, on which Pearl's and Halpern's formal definitions of causality build.
(Partially supported by NSF Awards IIS-0911036 and CCF-1349784.)
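The counterfactual definition lends itself to a direct sketch for a Boolean query (names and data are illustrative): a tuple is a counterfactual cause of the answer if removing it flips the answer.

```python
def query(db):
    """Boolean query over (doctor, patient) pairs: does anyone treat Alice?"""
    return any(patient == "Alice" for (_doctor, patient) in db)

db = [("Dave", "Alice"), ("Eve", "Bob")]

def is_counterfactual_cause(t, db, q):
    # t is a counterfactual cause iff the answer holds on db
    # but no longer holds once t is removed.
    return q(db) and not q([u for u in db if u != t])
```

Here ("Dave", "Alice") is a counterfactual cause of the answer, while ("Eve", "Bob") is not; the database formalizations surveyed in the tutorial generalize this idea beyond single-tuple, Boolean cases.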
Computational Complexity And Algorithms For Dirty Data Evaluation And Repairing
In this dissertation, we study the dirty data evaluation and repairing problem in relational databases. Dirty data is usually inconsistent, inaccurate, incomplete, and stale. Existing methods and theories describe consistency using integrity constraints, such as data dependencies. However, integrity constraints are good at detecting inconsistency, not at evaluating its degree, and they cannot guide data repairing. This dissertation first studies the computational complexity of, and algorithms for, database inconsistency evaluation. We define and use the minimum tuple deletion to evaluate database inconsistency, and for this minimum tuple deletion problem we study the relationship between the size of the rule set and the computational complexity. We show that the minimum tuple deletion is NP-hard to approximate within 17/16 even when only three functional dependencies and four attributes are involved, and we propose a near-optimal approximation algorithm with a ratio of 2 − 1/2r, where r is the number of given functional dependencies. To guide data repairing, the dissertation also investigates repairing driven by query feedback, formally studying two decision problems, functional-dependency-restricted deletion propagation and insertion propagation, corresponding to deletion and insertion feedback. A comprehensive analysis of both the combined and the data complexity of the cases is provided, considering different relational operators and feedback types. We identify the intractable and the tractable cases, picturing the complexity hierarchy of these problems, and provide efficient algorithms for the tractable cases.
Two further improvements are proposed: one computes a minimum vertex cover of the conflict graph to improve the upper bound for the tuple deletion problem, and the other gives a better dichotomy for the deletion and insertion propagation problems in the absence of functional dependencies, considering data, combined, and parameterized complexity respectively.
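The conflict-graph view can be illustrated with the classic maximal-matching heuristic for vertex cover (hypothetical relation and functional dependency; a sketch of the general technique, not the dissertation's algorithm): each pair of tuples violating a functional dependency is an edge, and any vertex cover of the graph is a set of tuples whose deletion removes all violations, at most twice the minimum.

```python
def conflict_edges(tuples, fd):
    """fd = (lhs_index, rhs_index) encodes X -> Y as tuple positions."""
    lhs, rhs = fd
    edges = []
    for i in range(len(tuples)):
        for j in range(i + 1, len(tuples)):
            if tuples[i][lhs] == tuples[j][lhs] and tuples[i][rhs] != tuples[j][rhs]:
                edges.append((i, j))
    return edges

def vertex_cover_2approx(edges):
    """Maximal matching: take both endpoints of every uncovered edge."""
    cover = set()
    for (u, v) in edges:
        if u not in cover and v not in cover:
            cover.update((u, v))
    return cover

# Example: FD zip -> city, violated by the first two tuples.
rows = [("12345", "Springfield"), ("12345", "Shelbyville"), ("99999", "Ogden")]
to_delete = vertex_cover_2approx(conflict_edges(rows, (0, 1)))
```

Since every violating pair contributes an edge, deleting the cover leaves a consistent instance, and the matching argument bounds the cover at twice the optimum.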
Privacy-preserving audit for broker-based health information exchange
Health Information Technology has spurred the development of distributed systems known as Health Information Exchanges (HIEs) that enable the sharing of patient records between different health care organizations. Because of patient privacy concerns over sensitive medical information, participants in these exchanges wish to disclose no more information than is needed; broker-based HIEs therefore aim to keep limited information in exchange repositories while ensuring faster and more efficient patient care. It is essential to audit these exchanges carefully to minimize the risk of illegitimate data sharing. This thesis presents a design for auditing broker-based HIEs that controls the information available in audit logs and regulates its release during audit investigations according to the requirements of the applicable privacy policy. In our design, we use formal rules to verify access to the HIE and adopt Hierarchical Identity-Based Encryption (HIBE) to support the staged release of data required for audits and a balance between automated and manual reviews. We test our methodology with a consolidated, centralized audit source that combines the Audit Trail and Node Authentication (ATNA) profile, a standard for auditing HIEs, with supplementary audit documentation from HIE participants.
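The staged-release idea can be mimicked with a toy one-way key hierarchy (HMAC-based derivation standing in for real HIBE; all labels and stages are hypothetical): a key disclosed for a limited stage unlocks only the audit fields encrypted under it, while the parent key recomputes every key derived below it.

```python
import hashlib
import hmac

def derive(parent_key: bytes, label: str) -> bytes:
    """One-way child-key derivation: child = HMAC(parent, label)."""
    return hmac.new(parent_key, label.encode(), hashlib.sha256).digest()

root = b"audit-authority-master-secret"  # held only by the audit authority
k_full = derive(root, "case-42")         # unlocks all audit fields for case 42
k_partial = derive(k_full, "metadata")   # unlocks only access metadata

# Stage 1: disclose k_partial -> the reviewer can decrypt metadata only.
# Stage 2 (escalation): disclose k_full -> the reviewer recomputes k_partial
# and can decrypt every field, with no further keys from the authority.
assert derive(k_full, "metadata") == k_partial
```

The one-way property means disclosure only ever flows downward: holding the stage-1 key reveals nothing about the case-level key above it.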