
    Active duplicate detection with Bayesian nonparametric models

    Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2010. Cataloged from PDF version of thesis. Includes bibliographical references (p. 129-137).
    When multiple databases are merged, an essential step is identifying sets of records that refer to the same entity. Called duplicate detection, this task is typically tedious to perform manually, so a variety of automated methods have been developed for partitioning a collection of records into coreference sets. The task is complicated by ambiguous or noisy field values, so systems are typically domain-specific and often fitted to a representative labeled training corpus. Once fitted, such systems can estimate a partition of a similar corpus without human intervention. While this approach has many applications, it is often infeasible to encode the appropriate domain knowledge a priori or to identify suitable training data. To address such cases, this thesis uses an active framework for duplicate detection, wherein the system initially estimates a partition of a test corpus without training, but is then allowed to query a human user about the coreference labeling of a portion of the corpus. The responses to these queries are used to guide the system in producing improved partition estimates and further queries of interest. This thesis describes a complete implementation of this framework with three technical contributions: a domain-independent Bayesian model expressing the relationship between the unobserved partition and the observed field values of a set of database records; a criterion for picking informative queries based on the mutual information between the response and the unobserved partition; and an algorithm for estimating a minimum-error partition under a Bayesian model through a reduction to the well-studied problem of correlation clustering. It also presents experimental results demonstrating the effectiveness of this method in a variety of data domains.
    by Nicholas Elias Matsakis. Ph.D.
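    The query-selection criterion above can be illustrated concretely. The sketch below approximates the mutual information between the answer to a pairwise "do these two records corefer?" question and the unknown partition, using weighted posterior samples of the partition. The function names and the noiseless-response assumption are illustrative and not taken from the thesis.

```python
import math
from itertools import combinations

def same_entity(partition, a, b):
    """True if records a and b fall in the same block of the partition."""
    return any(a in block and b in block for block in partition)

def query_information_gain(samples, weights, a, b):
    """Approximate the mutual information between the answer to
    "do records a and b corefer?" and the unknown partition,
    using weighted posterior samples of the partition."""
    p_yes = sum(w for part, w in zip(samples, weights) if same_entity(part, a, b))
    p_no = 1.0 - p_yes
    # Entropy of the predicted response; with a noiseless (deterministic)
    # response model this equals the mutual information, since the
    # conditional entropy given the partition is zero.
    h = 0.0
    for p in (p_yes, p_no):
        if p > 0:
            h -= p * math.log2(p)
    return h

def pick_query(samples, weights, records):
    """Choose the record pair whose answer is most informative."""
    return max(combinations(records, 2),
               key=lambda ab: query_information_gain(samples, weights, *ab))

# Illustrative use: two sampled partitions of records {0, 1, 2}.
samples = [[{0, 1}, {2}], [{0}, {1}, {2}]]
weights = [0.6, 0.4]
print(pick_query(samples, weights, records=[0, 1, 2]))   # -> (0, 1)
```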

    Doctor of Philosophy

    Temporal reasoning denotes the modeling of causal relationships between different variables across different instances of time, and the prediction of future events or the explanation of past events. Temporal reasoning helps in modeling and understanding interactions between human pathophysiological processes, and in predicting future outcomes such as response to treatment or complications. Dynamic Bayesian Networks (DBN) support modeling changes in patients' condition over time due to both diseases and treatments, using probabilistic relationships between different clinical variables, both within and across different points in time. We describe temporal reasoning and representation in general and DBN in particular, with special attention to DBN parameter learning and inference. We also describe temporal data preparation (aggregation, consolidation, and abstraction) techniques applicable to the medical data used in our research. We describe and evaluate various data discretization methods that are applicable to medical data. Projeny, an open-source probabilistic temporal reasoning toolkit developed as part of this research, is also described. We apply these methods, techniques, and algorithms to two disease processes modeled as Dynamic Bayesian Networks. The first test case is hyperglycemia due to severe illness in patients treated in the Intensive Care Unit (ICU). We model the patients' serum glucose and insulin drip rates using Dynamic Bayesian Networks, and recommend insulin drip rates to maintain the patients' serum glucose within a normal range. The model's safety and efficacy are demonstrated by comparing it to the current gold standard. The second test case is the early prediction of sepsis in the emergency department. Sepsis is an acute, life-threatening condition that requires timely diagnosis and treatment. We present various DBN models and data preparation techniques that detect sepsis with very high accuracy within two hours after the patients' admission to the emergency department. We also discuss factors affecting the computational tractability of the models and appropriate optimization techniques. In this dissertation, we present a guide to temporal reasoning, an evaluation of various data preparation, discretization, learning, and inference methods, demonstrations on two test cases using real clinical data, and an open-source toolkit, and we recommend methods and techniques for temporal reasoning in medicine.
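    As an illustration of two-slice DBN reasoning over discretized clinical variables, the following sketch runs forward filtering on a toy glucose model. The bins, probabilities, and variable names are invented for the example; this is not the dissertation's actual model or the Projeny API.

```python
import numpy as np

# Discretized serum-glucose states (illustrative bins, not the study's).
states = ["low", "normal", "high"]

# Transition model P(glucose_t | glucose_{t-1}) for one fixed insulin
# drip rate; rows = previous state, columns = next state.
transition = np.array([
    [0.7, 0.3, 0.0],
    [0.1, 0.8, 0.1],
    [0.0, 0.3, 0.7],
])

# Observation model P(measured_bin | true_state): measurements are noisy.
emission = np.array([
    [0.90, 0.10, 0.00],
    [0.05, 0.90, 0.05],
    [0.00, 0.10, 0.90],
])

def filter_step(belief, observed_bin):
    """One step of forward filtering: predict with the transition model,
    then condition on the discretized observation."""
    predicted = belief @ transition
    updated = predicted * emission[:, observed_bin]
    return updated / updated.sum()

belief = np.array([0.1, 0.8, 0.1])   # prior over the current state
for obs in [2, 2, 1]:                # observed bins: high, high, normal
    belief = filter_step(belief, obs)
    print(dict(zip(states, np.round(belief, 3))))
```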

    Requirement-based Root Cause Analysis Using Log Data

    Root Cause Analysis for software systems is a challenging diagnostic task due to the complexity emanating from the interactions between system components. Furthermore, the sheer size of the logged data often makes it difficult for human operators and administrators to perform problem diagnosis and root cause analysis. The diagnostic task is further complicated by the lack of models that could be used to support the diagnostic process. Traditionally, this task is conducted by human experts who create mental models of systems in order to generate hypotheses and conduct the analysis even in the presence of incomplete logged data. A challenge in this area is to provide the necessary concepts, tools, and techniques for operators to focus their attention on specific parts of the logged data and, ultimately, to automate the diagnostic process. The work described in this thesis proposes a framework of techniques, formalisms, and algorithms for automating the process of root cause analysis. In particular, this work uses annotated requirement goal models to represent the monitored systems' requirements and runtime behavior. The goal models are used in combination with log data to generate a ranked set of diagnoses that represent the combinations of tasks whose failures led to the observed failure. In addition, the framework uses a combination of word-based and topic-based information retrieval techniques to reduce the size of the log data by filtering out a subset of it to facilitate the diagnostic process. The log filtering and reduction process is based on goal model annotations and generates a sequence of logical literals that represent the system's possible observations. A second level of investigation looks for evidence of malicious activity (i.e., intentionally caused by a third party) leading to task failures. This analysis uses annotated anti-goal models that denote possible actions an external user can take to threaten a given system task. The framework uses a novel probabilistic approach based on Markov Logic Networks. Our experiments show that our approach improves over existing proposals by handling uncertainty in observations, using natively generated log data, and providing ranked diagnoses. The proposed framework has been evaluated using a test environment based on commercial off-the-shelf software components, a publicly available Java-based ATM machine application, and the large, publicly available DARPA 2000 dataset.
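    The log-reduction step can be sketched as follows: goal-model tasks annotated with expected keywords are matched against log lines to emit logical literals about which tasks appear to have executed. The task names and keywords here are hypothetical, and the keyword matching is a crude stand-in for the word- and topic-based retrieval described above.

```python
import re

# Illustrative goal-model annotations: each task is tagged with keywords
# expected to appear in log lines produced when the task runs.
task_annotations = {
    "ValidateCard":  ["card", "validate"],
    "DispenseCash":  ["dispense", "cash"],
    "UpdateBalance": ["balance", "update"],
}

def logs_to_literals(log_lines, annotations):
    """Filter log data down to logical literals occurred(task) /
    not_occurred(task), based on keyword matches against each line."""
    observed = set()
    for line in log_lines:
        tokens = set(re.findall(r"[a-z]+", line.lower()))
        for task, keywords in annotations.items():
            if tokens.issuperset(k.lower() for k in keywords):
                observed.add(task)
    return [f"occurred({t})" if t in observed else f"not_occurred({t})"
            for t in annotations]

logs = ["10:01 card validate OK for account 42",
        "10:02 update of balance failed: timeout"]
print(logs_to_literals(logs, task_annotations))
# ['occurred(ValidateCard)', 'not_occurred(DispenseCash)', 'occurred(UpdateBalance)']
```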

    Social Network Data Management

    With the increasing usage of online social networks and the semantic web's graph-structured RDF framework, and the rising adoption of networks in fields from biology to social science, there is a rapidly growing need for indexing, querying, and analyzing massive graph-structured data. Facebook has amassed over 500 million users, creating huge volumes of highly connected data. Governments have made RDF datasets containing billions of triples available to the public. In the life sciences, researchers have started to connect disparate data sets of research results into one giant network of valuable information. Clearly, networks are becoming increasingly popular and are growing rapidly in size, requiring scalable solutions for network data management. This thesis focuses on the following aspects of network data management. We present a hierarchical index structure for external-memory storage of network data that aims to maximize data locality. We propose efficient algorithms to answer subgraph matching queries against network databases and discuss effective pruning strategies to improve performance. We show how adaptive cost models can speed up subgraph matching query answering by assigning budgets to index retrieval operations and adjusting the query plan during execution. We develop a cloud-oriented social network database, COSI, which handles massive network datasets too large for a single computer by partitioning the data across multiple machines and achieving high-performance query answering through asynchronous parallelization and cluster-aware heuristics. Tracking multiple standing queries against a social network database is much faster with our novel multi-view maintenance algorithm, which exploits common substructures between queries. To capture the uncertainty inherent in social network querying, we define probabilistic subgraph matching queries over deterministic graph data and propose algorithms to answer them efficiently. Finally, we introduce a general relational machine learning framework and rule-based language, Probabilistic Soft Logic, to learn from and probabilistically reason about social network data, and we describe applications to information integration and information fusion.
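    A minimal sketch of subgraph matching with pruning, in the spirit of the query answering described above (not COSI's actual index or interfaces): candidate data vertices are first filtered by label and degree, then a backtracking search enforces edge consistency.

```python
def candidates(query, data):
    """Prune: a data vertex is a candidate for a query vertex only if its
    label matches and its degree is at least the query vertex's degree."""
    return {q: [v for v in data["labels"]
                if data["labels"][v] == query["labels"][q]
                and len(data["adj"][v]) >= len(query["adj"][q])]
            for q in query["labels"]}

def match(query, data):
    """Backtracking subgraph matching over pruned candidate sets."""
    cand = candidates(query, data)
    order = sorted(query["labels"], key=lambda q: len(cand[q]))
    results, assignment = [], {}

    def extend(i):
        if i == len(order):
            results.append(dict(assignment))
            return
        q = order[i]
        for v in cand[q]:
            if v in assignment.values():
                continue
            # Every already-mapped query neighbour must be a data neighbour.
            if all(assignment[p] in data["adj"][v]
                   for p in query["adj"][q] if p in assignment):
                assignment[q] = v
                extend(i + 1)
                del assignment[q]

    extend(0)
    return results

# Tiny example: find person--knows--person patterns in a toy data graph.
query = {"labels": {"a": "person", "b": "person"},
         "adj": {"a": {"b"}, "b": {"a"}}}
data = {"labels": {1: "person", 2: "person", 3: "page"},
        "adj": {1: {2, 3}, 2: {1}, 3: {1}}}
print(match(query, data))   # [{'a': 1, 'b': 2}, {'a': 2, 'b': 1}]
```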

    Proceedings of the 2008 Oxford University Computing Laboratory student conference.

    This conference serves two purposes. First, the event is a useful pedagogical exercise for all participants, from the conference committee and referees to the presenters and the audience. For some presenters, the conference may be the first time their work has been subjected to peer review. For others, the conference is a testing ground for announcing work that will later be presented at international conferences, workshops, and symposia. This leads to the conference's second purpose: an opportunity to expose the latest and greatest research findings within the laboratory. The fourteen abstracts within these proceedings were selected by the programme and conference committee after a round of peer review by both students and staff within the department.

    First IJCAI International Workshop on Graph Structures for Knowledge Representation and Reasoning (GKR@IJCAI'09)

    The development of effective techniques for knowledge representation and reasoning (KRR) is a crucial aspect of successful intelligent systems. Different representation paradigms, as well as their use in dedicated reasoning systems, have been extensively studied in the past. Nevertheless, new challenges, problems, and issues have emerged in the context of knowledge representation in Artificial Intelligence (AI), involving the logical manipulation of increasingly large information sets (see, for example, the Semantic Web, bioinformatics, and so on). Improvements in storage capacity and performance of computing infrastructure have also affected the nature of KRR systems, shifting their focus towards representational power and execution performance. Therefore, KRR research is faced with the challenge of developing knowledge representation structures optimized for large-scale reasoning. This new generation of KRR systems includes graph-based knowledge representation formalisms such as Bayesian Networks (BNs), Semantic Networks (SNs), Conceptual Graphs (CGs), Formal Concept Analysis (FCA), CP-nets, and GAI-nets, all of which have been successfully used in a number of applications. The goal of this workshop is to bring together the researchers involved in the development and application of graph-based knowledge representation formalisms and reasoning techniques.

    Providing Information by Resource-Constrained Data Analysis

    The Collaborative Research Center SFB 876 (Providing Information by Resource-Constrained Data Analysis) brings together the research fields of data analysis (Data Mining, Knowledge Discovery in Data Bases, Machine Learning, Statistics) and embedded systems, and enhances their methods such that information from distributed, dynamic masses of data becomes available anytime and anywhere. The research center approaches these problems with new algorithms respecting the resource constraints in the different scenarios. This Technical Report presents the work of the members of the integrated graduate school.

    Evolutionary genomics: statistical and computational methods

    This open access book addresses the challenge of analyzing and understanding the evolutionary dynamics of complex biological systems at the genomic level, and elaborates on some promising strategies that would bring us closer to uncovering the vital relationships between genotype and phenotype. After a few educational primers, the book continues with sections on sequence homology and alignment, phylogenetic methods to study genome evolution, methodologies for evaluating selective pressures on genomic sequences, genomic evolution in light of protein domain architecture and transposable elements, population genomics and other omics, and discussions of current bottlenecks in handling and analyzing genomic data. Written for the highly successful Methods in Molecular Biology series, the chapters include the kind of detail and expert implementation advice that leads to the best results. Authoritative and comprehensive, Evolutionary Genomics: Statistical and Computational Methods, Second Edition aims to serve both novices in biology with strong statistics and computational skills and molecular biologists with a good grasp of standard mathematical concepts, in moving this important field of study forward.