
    Active duplicate detection with Bayesian nonparametric models

    Thesis (Ph.D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2010. Cataloged from PDF version of thesis. Includes bibliographical references (p. 129-137). When multiple databases are merged, an essential step is identifying sets of records that refer to the same entity. Called duplicate detection, this task is typically tedious to perform manually, so a variety of automated methods have been developed for partitioning a collection of records into coreference sets. The task is complicated by ambiguous or noisy field values, so systems are typically domain-specific and often fitted to a representative labeled training corpus. Once fitted, such systems can estimate a partition of a similar corpus without human intervention. While this approach has many applications, it is often infeasible to encode the appropriate domain knowledge a priori or to identify suitable training data. To address such cases, this thesis uses an active framework for duplicate detection, wherein the system initially estimates a partition of a test corpus without training, but is then allowed to query a human user about the coreference labeling of a portion of the corpus. The responses to these queries guide the system in producing improved partition estimates and further queries of interest. This thesis describes a complete implementation of this framework with three technical contributions: a domain-independent Bayesian model expressing the relationship between the unobserved partition and the observed field values of a set of database records; a criterion for picking informative queries based on the mutual information between the response and the unobserved partition; and an algorithm for estimating a minimum-error partition under a Bayesian model through a reduction to the well-studied problem of correlation clustering. It also presents experimental results demonstrating the effectiveness of this method in a variety of data domains. By Nicholas Elias Matsakis. Ph.D.
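    As a rough illustration of the last contribution (turning pairwise coreference scores into a partition via correlation clustering), the following minimal sketch greedily merges clusters while the summed pairwise log-odds favor a merge. This is not the thesis's algorithm; the function and score values are hypothetical and only show the general shape of the reduction.

```python
# Hypothetical sketch, not the thesis code: greedy correlation clustering over
# pairwise coreference log-odds (positive score favors "same entity").
import itertools

def greedy_correlation_clustering(n_records, log_odds):
    """log_odds[(i, j)]: positive favors merging records i and j, negative favors splitting."""
    clusters = [{i} for i in range(n_records)]
    improved = True
    while improved:
        improved = False
        for a, b in itertools.combinations(range(len(clusters)), 2):
            # Gain from merging clusters a and b is the summed log-odds of
            # every cross pair; merge whenever the gain is positive.
            gain = sum(log_odds.get((min(i, j), max(i, j)), 0.0)
                       for i in clusters[a] for j in clusters[b])
            if gain > 0:
                clusters[a] |= clusters[b]
                del clusters[b]
                improved = True
                break
    return clusters

# Toy example: records 0 and 1 look like duplicates, record 2 does not.
scores = {(0, 1): 2.3, (0, 2): -1.7, (1, 2): -0.9}
print(greedy_correlation_clustering(3, scores))  # [{0, 1}, {2}]
```

    In the active setting the thesis describes, answers to user queries would update these pairwise scores before the partition is re-estimated; the greedy pass above is only one simple heuristic for the clustering step.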

    An active learning framework for duplicate detection in SaaS platforms

    With the rapid growth of users' data in SaaS (Software-as-a-Service) platforms built on micro-services, it becomes essential to detect duplicated entities to ensure the integrity and consistency of data in many companies and businesses (primarily multinational corporations). Due to the large volume of today's databases, duplicate detection algorithms need to be not only accurate but also practical, meaning they can return detection results as fast as possible for a given request. Among existing approaches to the duplicate detection problem, Siamese neural networks trained with the triplet loss have become one of the most robust ways to measure the similarity of two entities (texts, paragraphs, or documents) and thereby identify all possible duplicated items. In this paper, we first propose a practical framework for building a duplicate detection system in a SaaS platform. Second, we present a new active learning schema for training and updating duplicate detection algorithms. In this schema, we not only allow the crowd to provide more annotated data for enhancing the chosen learning model but also use Siamese neural networks with the triplet loss to construct an efficient model for the problem. Finally, we design a user interface for our proposed duplicate detection system, which can easily be applied in empirical applications in different companies.
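    For reference, the triplet loss mentioned in the abstract penalizes an encoder when an anchor record is not closer to its duplicate (positive) than to a non-duplicate (negative) by some margin. The sketch below is a hedged NumPy illustration of that formula, not the paper's model; the toy embeddings and names are hypothetical.

```python
# Hypothetical sketch, not the paper's model: the triplet loss used to train a
# Siamese encoder, written with NumPy on precomputed embeddings.
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge loss pushing the anchor closer to its duplicate (positive)
    than to a non-duplicate (negative) by at least `margin`."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)

# Toy embeddings: the duplicate pair already sits much closer together,
# so the hinge is inactive and the loss is 0.
anchor   = np.array([0.9, 0.1])
positive = np.array([1.0, 0.0])
negative = np.array([0.0, 1.0])
print(triplet_loss(anchor, positive, negative))
```

    In an active learning loop of the kind the paper proposes, newly crowd-annotated duplicate and non-duplicate pairs would supply fresh triplets for further training of the shared encoder.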

    Exploiting multimedia in creating and analysing multimedia Web archives

    The data contained on the web and the social web are inherently multimedia, consisting of a mixture of textual, visual and audio modalities. Community memories embodied on the web and social web contain a rich mixture of data from these modalities. In many ways, the web is the greatest resource ever created by humankind. However, due to the dynamic and distributed nature of the web, its content changes, appears and disappears on a daily basis. Web archiving provides a way of capturing snapshots of (parts of) the web for preservation and future analysis. This paper provides an overview of techniques we have developed within the context of the EU-funded ARCOMEM (ARchiving COmmunity MEMories) project to allow multimedia web content to be leveraged during the archival process and for post-archival analysis. Through a set of use cases, we explore several practical applications of multimedia analytics within the realm of web archiving, web archive analysis and multimedia data on the web in general.

    Duplicate detection methodology for IP network traffic analysis

    Network traffic monitoring systems have to deal with a challenging problem: the traffic capturing process almost invariably produces duplicate packets. In spite of this, and in contrast with other fields, there is no scientific literature addressing it. This paper establishes the theoretical background concerning data duplication in network traffic analysis: generating mechanisms, types of duplicates and their characteristics are described. On this basis, a duplicate detection and removal methodology is proposed. Moreover, an analytical and experimental study is presented, whose results provide a dimensioning rule for this methodology. Comment: 7 pages, 8 figures. For the GitHub project, see https://github.com/Enchufa2/nantool
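    A common shape for such a methodology is to hash packet fields that stay constant across capture points and drop repeats seen within a short time window. The sketch below is a hypothetical illustration under that assumption, not the paper's tool; the field choice and the 0.1 s window are illustrative placeholders for the kind of parameter the paper's dimensioning rule addresses.

```python
# Hypothetical sketch, not the paper's tool: drop duplicate packets whose
# signature (invariant header fields plus payload) repeats within a window.
import hashlib

def dedup_packets(packets, window=0.1):
    """packets: iterable of (timestamp, src, dst, ip_id, payload) tuples.
    Drops a packet if an identical signature was seen within `window` seconds."""
    last_seen = {}  # signature -> timestamp of most recent occurrence
    kept = []
    for ts, src, dst, ip_id, payload in packets:
        sig = hashlib.sha1(f"{src}|{dst}|{ip_id}|".encode() + payload).hexdigest()
        if sig in last_seen and ts - last_seen[sig] <= window:
            last_seen[sig] = ts          # duplicate: refresh timestamp and skip
            continue
        last_seen[sig] = ts
        kept.append((ts, src, dst, ip_id, payload))
    return kept

# Toy trace: the second packet is a capture-level duplicate of the first.
trace = [
    (0.000, "10.0.0.1", "10.0.0.2", 100, b"GET /"),
    (0.001, "10.0.0.1", "10.0.0.2", 100, b"GET /"),
    (0.200, "10.0.0.1", "10.0.0.2", 101, b"GET /img"),
]
print(len(dedup_packets(trace)))  # 2
```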

    Symbolic QED Pre-silicon Verification for Automotive Microcontroller Cores: Industrial Case Study

    We present an industrial case study that demonstrates the practicality and effectiveness of Symbolic Quick Error Detection (Symbolic QED) in detecting logic design flaws (logic bugs) during pre-silicon verification. Our study focuses on several microcontroller core designs (~1,800 flip-flops, ~70,000 logic gates) that have been extensively verified using an industrial verification flow and used for various commercial automotive products. The results of our study are as follows: 1. Symbolic QED detected all logic bugs in the designs that were detected by the industrial verification flow (which includes various flavors of simulation-based verification and formal verification). 2. Symbolic QED detected additional logic bugs that were not recorded as detected by the industrial verification flow (these bugs may also have been found by the industrial flow but were not recorded). 3. Symbolic QED enables significant design productivity improvements: (a) 8X reduced verification effort for a new design (8 person-weeks for Symbolic QED vs. 17 person-months using the industrial verification flow); (b) 60X reduced verification effort for subsequent designs (2 person-days for Symbolic QED vs. 4-7 person-months using the industrial verification flow); (c) quick bug detection (runtime of 20 seconds or less), together with short counterexamples (10 or fewer instructions) for quick debug, using Symbolic QED.