
    A Critical Look at the Evaluation of Knowledge Graph Question Answering

    PhD thesis in Information technology. The field of information retrieval (IR) is concerned with systems that “make a given stored collection of information items available to a user population” [111]. The way in which information is made available to the user depends on how this broad concern of IR is formulated into specific tasks by which a system should address a user’s information need [85]. The specific IR task also dictates how the user may express their information need. The classic IR task is ad hoc retrieval, where the user issues a query to the system and receives in return a list of documents ranked by the estimated relevance of each document to the query [85]. However, it has long been acknowledged that users are often looking for answers to questions, rather than an entire document or a ranked list of documents [17, 141]. Question answering (QA) is thus another IR task; it comes in many flavors, but overall consists of taking in a user’s natural language (NL) question and returning an answer. This thesis describes work done within the scope of the QA task.

    The flavor of QA called knowledge graph question answering (KGQA) is the primary focus; it enables QA over factual questions against structured data in the form of a knowledge graph (KG). This means the KGQA system addresses a structured representation of knowledge rather than, as in other QA flavors, an unstructured prose context. KGs have the benefit that, given some identified entities or predicates, all associated properties are available and relationships can be utilized. KGQA thus enables users to access structured data using only NL questions, without requiring formal query language expertise. Even so, the construction of satisfactory KGQA systems remains a challenge. Machine learning with deep neural networks (DNNs) is a far more promising approach than manually engineering retrieval models [29, 56, 130]. The current era dominated by DNNs began with seminal work on computer vision, where the deep learning paradigm demonstrated its first cases of “superhuman” performance [32, 71]. Subsequent work in other applications has also demonstrated “superhuman” performance with DNNs [58, 87]. As a result of its early position, and hence longer history, as a leading application of deep learning, computer vision with DNNs has been bolstered by much work on approaches for augmenting [120] or synthesizing [94] additional training data.

    The difficulty with machine learning approaches to KGQA appears to rest in large part with the limited volume, quality, and variety of available datasets for this task. Compared to labeled image data for computer vision, the problems of data collection, augmentation, and synthesis are only partially solved for QA, and especially for KGQA. There are few datasets for KGQA overall, and little previous work has found unsupervised or semi-supervised learning approaches that address the sparsity of data. Instead, neural network approaches to KGQA rely on either fully or weakly supervised learning [29]. We are thus concerned with neural models trained in a supervised setting to perform QA tasks, especially of the KGQA flavor. Given a clear task to delegate to a computational system, we want the task performed as well as possible. However, what methodological elements are important to ensure good system performance within the chosen scope? How should the quality of system performance be assessed?

    This thesis describes work done to address these overarching questions through a number of more specific research questions. Altogether, we designate the topic of this thesis as KGQA evaluation, addressed in a broad sense and encompassing four subtopics: (1) the impact on performance of the volume of training data provided; (2) the information leakage between training and test splits due to unhygienic data partitioning; (3) the naturalness of NL questions resulting from a common approach for generating KGQA datasets; and (4) the axiomatic analysis and development of evaluation measures for a specific flavor of the KGQA task. Each of the four subtopics is informed by previous work, but we aim in this thesis to critically examine the assumptions of previous work to uncover, verify, or address weaknesses in current practices surrounding KGQA evaluation.
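
    Although the abstract does not prescribe a single measure, a common starting point for KGQA evaluation is set-based precision, recall, and F1 over the entities a system returns for each question. The sketch below illustrates only that basic measure; the entity identifiers are hypothetical, and the measures analyzed axiomatically in the thesis may differ.

```python
# Minimal sketch of set-based precision/recall/F1 over KGQA answer sets.
# The entity identifiers are hypothetical; real benchmarks use KG-specific IDs.

def answer_set_scores(predicted: set[str], gold: set[str]) -> tuple[float, float, float]:
    """Return (precision, recall, f1) for one question."""
    if not predicted and not gold:
        return 1.0, 1.0, 1.0          # both empty: treat as a perfect match
    if not predicted or not gold:
        return 0.0, 0.0, 0.0
    overlap = len(predicted & gold)
    precision = overlap / len(predicted)
    recall = overlap / len(gold)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Example: the system returns one correct and one spurious entity.
p, r, f1 = answer_set_scores({"Q42", "Q64"}, {"Q42"})
print(f"P={p:.2f} R={r:.2f} F1={f1:.2f}")   # P=0.50 R=1.00 F1=0.67
```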

    Privacy preserving linkage and sharing of sensitive data

    2018 Summer. Includes bibliographical references. Sensitive data, such as personal and business information, is collected by many service providers nowadays. This data is considered a rich source of information for research purposes that could benefit individuals, researchers, and service providers. However, because of the sensitivity of such data, privacy concerns, legislation, and conflicts of interest, data holders are reluctant to share their data with others. Data holders typically filter out or obliterate privacy-related sensitive information from their data before sharing it, which limits the utility of this data and affects the accuracy of research. Such practice protects individuals' privacy; however, it prevents researchers from linking records belonging to the same individual across different sources. This is commonly referred to as the record linkage problem by the healthcare industry.

    In this dissertation, our main focus is on designing and implementing efficient privacy-preserving methods that will encourage sensitive information sources to share their data with researchers without compromising the privacy of the clients or affecting the quality of the research data. The proposed solution should be scalable and efficient for real-world deployments and provide good privacy assurance. While this problem has been investigated before, most of the proposed solutions were either partial solutions, not accurate, or impractical, and therefore subject to further improvements. We have identified several issues and limitations in the state-of-the-art solutions and provided a number of contributions that improve upon existing solutions.

    Our first contribution is the design of a privacy-preserving record linkage protocol using a semi-trusted third party. The protocol allows a set of data publishers (data holders) who compete with each other to share sensitive information with subscribers (researchers) while preserving the privacy of their clients and without sharing encryption keys. Our second contribution is the design and implementation of a probabilistic privacy-preserving record linkage protocol that accommodates discrepancies and errors in the data, such as typos. This work builds upon the previous work by linking records that are similar, where the similarity range is formally defined. Our third contribution is a protocol that performs information integration and sharing without third-party services. We use garbled-circuit secure computation to design and build a system that performs record linkage between two parties without sharing their data. Our design uses Bloom filters as inputs to the garbled circuits and performs a probabilistic record linkage using the Dice coefficient similarity measure. As garbled circuits are known for their expensive computations, we propose new approaches that reduce the computation overhead needed to achieve a given level of privacy. We built a scalable record linkage system using garbled circuits that could be deployed in a distributed computation environment like the cloud, and evaluated its security and performance.

    One of the performance issues in linking large datasets is the amount of secure computation needed to compare every pair of records across the linked datasets to find all possible record matches. To reduce the amount of computation, a method known as blocking is used to filter out as many as possible of the record pairs that will not match and to limit the comparison to a subset of the record pairs (called candidate pairs) that possibly match. Most of the current blocking methods either require the parties to share blocking keys (called block identifiers), extracted from the domain of some record attributes (termed blocking variables), or to share reference data points and group their records around these points using some similarity measure. Though these methods reduce the computation substantially, they leak too much information about the records within each block. Toward this end, we proposed a novel privacy-preserving approximate blocking scheme that allows the parties to generate the list of candidate pairs with high accuracy while protecting the privacy of the records in each block. Our scheme is configurable such that the desired level of performance and accuracy can be achieved according to the required level of privacy. We analyzed the accuracy and privacy of our scheme, implemented a prototype of the scheme, and experimentally evaluated its accuracy and performance against different levels of privacy.
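
    The core comparison described above, Bloom-filter encodings compared with the Dice coefficient, can be illustrated outside the secure-computation setting. The sketch below shows a plaintext version only, with illustrative filter parameters; the dissertation evaluates this comparison inside garbled circuits, which are omitted here.

```python
# Plaintext sketch of the similarity core used in Bloom-filter-based record
# linkage: encode the q-grams of a field into a Bloom filter and compare two
# filters with the Dice coefficient. The secure-computation layer (garbled
# circuits) used in the dissertation is omitted; parameters are illustrative.
import hashlib

BITS = 256          # Bloom filter length (illustrative)
NUM_HASHES = 4      # hash functions per q-gram (illustrative)

def qgrams(value: str, q: int = 2) -> set[str]:
    padded = f"_{value.lower()}_"
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}

def bloom_encode(value: str) -> set[int]:
    """Return the set of bit positions set for this value."""
    positions = set()
    for gram in qgrams(value):
        for seed in range(NUM_HASHES):
            digest = hashlib.sha256(f"{seed}:{gram}".encode()).hexdigest()
            positions.add(int(digest, 16) % BITS)
    return positions

def dice(a: set[int], b: set[int]) -> float:
    """Dice coefficient of two Bloom filters, given their set bit positions."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))

# A typo still yields a high similarity, which is what makes the linkage robust.
print(dice(bloom_encode("Jonathan Smith"), bloom_encode("Jonathon Smith")))
```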

    Abduction and Anonymity in Data Mining

    This thesis investigates two new research problems that arise in modern data mining: reasoning on data mining results, and the privacy implications of data mining results. Most data mining algorithms rely on inductive techniques, trying to infer information that generalizes from the input data. Very often, however, this inductive step over raw data is not enough to answer the user's questions, and the data must be processed again using other inference methods. To answer high-level user needs such as explanation of results, we describe an environment able to perform abductive (hypothetical) reasoning, since the solutions to such queries can often be seen as the set of hypotheses that satisfy some requirements. Using cost-based abduction, we show how classification algorithms can be boosted by performing abductive reasoning over the data mining results, improving the quality of the output.

    Another growing research area in data mining is privacy-preserving data mining. Due to the availability of large amounts of data, easily collected and stored via computer systems, new applications are emerging, but privacy concerns can make data mining on such data problematic. We study the privacy implications of data mining in a mathematical and logical context, focusing on the anonymity of the people whose data are analyzed. A formal theory of anonymity-preserving data mining is given, together with a number of anonymity-preserving algorithms for pattern mining. The post-processing improvement of data mining results (w.r.t. utility and privacy) is the central focus of the problems investigated in this thesis.
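
    Cost-based abduction, as used above to post-process mining results, amounts to finding a cheapest set of hypotheses that explains a set of observations. The toy below illustrates that generic idea with made-up hypotheses and costs; it is not the environment described in the thesis.

```python
# Generic toy of cost-based abduction: pick the cheapest set of hypotheses
# that jointly explains all observations. Hypotheses, costs, and coverage are
# invented for illustration; the thesis applies the idea to mining results.
from itertools import combinations

hypotheses = {            # hypothesis -> (cost, observations it explains)
    "h1": (2.0, {"o1", "o2"}),
    "h2": (1.5, {"o2", "o3"}),
    "h3": (1.0, {"o1"}),
    "h4": (0.8, {"o3"}),
}
observations = {"o1", "o2", "o3"}

def best_explanation(hyps, obs):
    best, best_cost = None, float("inf")
    names = list(hyps)
    for r in range(1, len(names) + 1):
        for combo in combinations(names, r):
            covered = set().union(*(hyps[h][1] for h in combo))
            cost = sum(hyps[h][0] for h in combo)
            if obs <= covered and cost < best_cost:
                best, best_cost = combo, cost
    return best, best_cost

print(best_explanation(hypotheses, observations))   # (('h2', 'h3'), 2.5)
```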

    Unlinkable and Invisible γ-Sanitizable Signatures

    Sanitizable signatures (SaS) allow a (single) sanitizer, chosen by the signer, to modify and re-sign a message in a somewhat controlled way, that is, only editing parts (or blocks) of the message that are admissible for modification. This primitive is an efficient tool with many formally defined security properties, such as unlinkability, transparency, immutability, invisibility, and unforgeability. An SaS scheme that satisfies these properties can be a great asset to privacy in any field to which it is applied, e.g., anonymizing medical files. In this work, we look at the notion of γ-sanitizable signatures (γSaS): we take sanitizable signatures one step further by allowing the signer not only to decide which blocks can be modified, but also how many of them at most can be modified within a single sanitization, setting a limit denoted by γ. We adapt the security properties listed above to γSaS and propose our own scheme, ULISS (Unlinkable Limited Invisible Sanitizable Signature), and show that it satisfies these properties. This extension of SaS can not only improve current use cases but also introduce new ones, e.g., restricting the number of changes to a document within a certain timeframe.
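
    The γ limit itself can be illustrated without any cryptography: a sanitization is acceptable only if every changed block is admissible and at most γ blocks change. The toy check below shows just that policy, with illustrative blocks; it is in no way the ULISS construction, which is a cryptographic signature scheme.

```python
# Non-cryptographic toy of the gamma constraint in gamma-sanitizable
# signatures: a sanitization is acceptable only if every changed block is
# admissible and at most gamma blocks changed. Blocks, admissible indices,
# and gamma are illustrative; the real scheme enforces this cryptographically.

def sanitization_allowed(original: list[str], modified: list[str],
                         admissible: set[int], gamma: int) -> bool:
    if len(original) != len(modified):
        return False
    changed = [i for i, (o, m) in enumerate(zip(original, modified)) if o != m]
    return all(i in admissible for i in changed) and len(changed) <= gamma

doc = ["name: Alice", "diagnosis: flu", "hospital: St. Mary"]
redacted = ["name: REDACTED", "diagnosis: flu", "hospital: St. Mary"]
print(sanitization_allowed(doc, redacted, admissible={0, 2}, gamma=1))  # True
```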

    Search-based Software Testing Driven by Automatically Generated and Manually Defined Fitness Functions

    Search-based software testing (SBST) typically relies on fitness functions to guide the search exploration toward software failures. There are two main techniques for defining fitness functions: (a) automated fitness function computation from the specification of the system requirements and (b) manual fitness function design. Both techniques have advantages. The former uses information from the system requirements to guide the search toward portions of the input domain that are more likely to contain failures. The latter uses the engineers' domain knowledge. We propose ATheNA, a novel SBST framework that combines fitness functions automatically generated from requirements specifications with fitness functions manually defined by engineers. We design and implement ATheNA-S, an instance of ATheNA that targets Simulink models. We evaluate ATheNA-S on a large set of models and requirements from different domains, comparing our solution with an SBST baseline tool that supports automatically generated fitness functions and another that supports manually defined fitness functions. Our results show that ATheNA-S generates more failure-revealing test cases than the baseline tools, and that the difference between the performance of ATheNA-S and the baseline tools is not statistically significant. We also assess whether ATheNA-S can generate failure-revealing test cases when applied to a large case study from the automotive domain. Our results show that ATheNA-S successfully revealed a requirement violation in our case study.
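
    The combination idea can be sketched generically: rescale the automatically generated fitness and the manually defined fitness onto a common range and combine them into a single value for the search to minimize. The weighting, rescaling, and example values below are assumptions for illustration only; ATheNA's actual combination strategy is the one defined in the paper and may differ.

```python
# Generic sketch of combining an automatically generated fitness (e.g. a
# requirement-robustness value) with a manually defined one. The rescaling,
# the weight, and the example values are assumptions for illustration; this
# is not ATheNA's actual combination strategy.

def rescale(value: float, low: float, high: float) -> float:
    """Map a raw fitness value into [0, 1]; lower means closer to a failure."""
    return min(max((value - low) / (high - low), 0.0), 1.0)

def combined_fitness(auto_raw, manual_raw, auto_range, manual_range, w=0.5):
    auto = rescale(auto_raw, *auto_range)
    manual = rescale(manual_raw, *manual_range)
    return w * auto + (1 - w) * manual   # value the search tries to minimize

# A test input rated 0.2 by the automated fitness (close to violating the
# requirement) and 3.0 by an engineer-defined fitness on a 0-10 scale.
print(combined_fitness(0.2, 3.0, auto_range=(0.0, 1.0), manual_range=(0.0, 10.0)))
```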

    Feature Subset Selection in Intrusion Detection Using Soft Computing Techniques

    Intrusions into computer network systems are a major security issue, so preventing them is of utmost importance. Prevention depends entirely on detection, which is a core part of any security tool such as an Intrusion Detection System (IDS), an Intrusion Prevention System (IPS), an Adaptive Security Appliance (ASA), checkpoints, and firewalls. Accurate detection of network attacks is therefore imperative. A variety of intrusion detection approaches are available, but their main problem is performance, which can be enhanced by increasing detection rates and reducing false positives. These weaknesses of existing techniques motivated the research presented in this thesis.

    One weakness of existing intrusion detection approaches is the use of a raw dataset for classification: redundant features can confuse the classifier and degrade its accuracy. To overcome this issue, Principal Component Analysis (PCA) is employed to transform the raw features into a principal feature space and to select features based on their sensitivity, as determined by the corresponding eigenvalues. Recent approaches use PCA to project the feature space onto the principal feature space and select the features corresponding to the highest eigenvalues, but those features may not have the optimal sensitivity for the classifier, since many sensitive features are ignored. Instead of the traditional approach of selecting the features with the highest eigenvalues, this research applies a Genetic Algorithm (GA) to search the principal feature space for a subset of features with optimal sensitivity and the highest discriminatory power. Classification is then performed on the selected features using a Support Vector Machine (SVM) and a Multilayer Perceptron (MLP), chosen for their proven classification ability. This work uses the Knowledge Discovery and Data Mining (KDD) Cup dataset, which is considered a benchmark for evaluating security detection mechanisms. The performance of this approach was analyzed and compared with existing approaches. The results show that the proposed method provides an optimal intrusion detection mechanism that outperforms existing approaches and is able to minimize the number of features while maximizing detection rates.
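
    The described pipeline, PCA projection followed by a genetic search for a sensitive subset of principal components that an SVM then classifies, can be sketched as follows. A synthetic dataset stands in for the KDD Cup data, and the GA settings are arbitrary; this is an illustration of the approach under those assumptions, not the thesis implementation.

```python
# Minimal sketch of the described pipeline: project features with PCA, then
# use a genetic algorithm to pick a subset of principal components for an SVM.
# A synthetic dataset stands in for the KDD Cup data; GA settings are arbitrary.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=400, n_features=20, n_informative=8,
                           random_state=0)
Z = PCA(n_components=20, random_state=0).fit_transform(X)  # principal feature space

def fitness(mask: np.ndarray) -> float:
    """Cross-validated SVM accuracy on the selected principal components."""
    if mask.sum() == 0:
        return 0.0
    return cross_val_score(SVC(), Z[:, mask.astype(bool)], y, cv=3).mean()

# Simple generational GA over bitmasks that select principal components.
pop = rng.integers(0, 2, size=(20, Z.shape[1]))
for _ in range(10):
    scores = np.array([fitness(ind) for ind in pop])
    parents = pop[np.argsort(scores)[-10:]]                 # truncation selection
    children = []
    for _ in range(len(pop)):
        a, b = parents[rng.integers(len(parents), size=2)]
        cut = rng.integers(1, Z.shape[1])                   # one-point crossover
        child = np.concatenate([a[:cut], b[cut:]])
        flip = rng.random(Z.shape[1]) < 0.05                # bit-flip mutation
        children.append(np.where(flip, 1 - child, child))
    pop = np.array(children)

best = max(pop, key=fitness)
print("selected components:", np.flatnonzero(best), "accuracy:", fitness(best))
```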

    Tensions and paradoxes in electronic patient record research: a systematic literature review using the meta-narrative method

    Background: The extensive and rapidly expanding research literature on electronic patient records (EPRs) presents challenges to systematic reviewers. This literature is heterogeneous and at times conflicting, not least because it covers multiple research traditions with different underlying philosophical assumptions and methodological approaches.

    Aim: To map, interpret and critique the range of concepts, theories, methods and empirical findings on EPRs, with a particular emphasis on the implementation and use of EPR systems.

    Method: Using the meta-narrative method of systematic review, and applying search strategies that took us beyond the Medline-indexed literature, we identified over 500 full-text sources. We used ‘conflicting’ findings to address higher-order questions about how the EPR and its implementation were differently conceptualised and studied by different communities of researchers.

    Main findings: Our final synthesis included 24 previous systematic reviews and 94 additional primary studies, most of the latter from outside the biomedical literature. A number of tensions were evident, particularly in relation to: [1] the EPR (‘container’ or ‘itinerary’); [2] the EPR user (‘information-processer’ or ‘member of socio-technical network’); [3] organizational context (‘the setting within which the EPR is implemented’ or ‘the EPR-in-use’); [4] clinical work (‘decision-making’ or ‘situated practice’); [5] the process of change (‘the logic of determinism’ or ‘the logic of opposition’); [6] implementation success (‘objectively defined’ or ‘socially negotiated’); and [7] complexity and scale (‘the bigger the better’ or ‘small is beautiful’). Findings suggest that integration of EPRs will always require human work to re-contextualize knowledge for different uses; that whilst secondary work (audit, research, billing) may be made more efficient by the EPR, primary clinical work may be made less efficient; that paper, far from being technologically obsolete, currently offers greater ecological flexibility than most forms of electronic record; and that smaller systems may sometimes be more efficient and effective than larger ones.

    Conclusions: The tensions and paradoxes revealed in this study extend and challenge previous reviews and suggest that the evidence base for some EPR programs is more limited than is often assumed. We offer this paper as a preliminary contribution to a much-needed debate on this evidence and its implications, and suggest avenues for new research.
