
    A Primer on the Data Cleaning Pipeline

    The availability of both structured and unstructured databases, such as electronic health data, social media data, patent data, and surveys that are often updated in real time, among others, has grown rapidly over the past decade. With this expansion, the statistical and methodological questions around data integration, or rather the merging of multiple data sources, have also grown. Specifically, the science of the "data cleaning pipeline" contains four stages that allow an analyst to perform downstream tasks, predictive analyses, or statistical analyses on "cleaned data." This article provides a review of this emerging field, introducing technical terminology and commonly used methods.
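
    As a toy illustration of one such stage, the sketch below (Python) merges two record sources and collapses near-duplicate entries under a normalised key. The field names and the normalisation rule are invented for the example and are not taken from the article.

        # Minimal sketch of one stage of a data-cleaning pipeline: merging two
        # record sources and collapsing near-duplicate entries. Field names and
        # the normalisation rule are illustrative assumptions.
        import re

        def normalize(name: str) -> str:
            """Lower-case a name and strip punctuation and extra whitespace."""
            return re.sub(r"[^a-z0-9 ]", "", name.lower()).strip()

        def merge_sources(source_a, source_b):
            """Union two lists of {'name': ..., 'value': ...} records,
            keeping one record per normalised name."""
            merged = {}
            for record in source_a + source_b:
                key = normalize(record["name"])
                merged.setdefault(key, record)  # first occurrence wins
            return list(merged.values())

        if __name__ == "__main__":
            ehr = [{"name": "Acme Hospital", "value": 12}]
            survey = [{"name": "ACME Hospital,", "value": 12}]
            print(merge_sources(ehr, survey))  # one consolidated record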

    Web knowledge bases

    Knowledge is key to natural language understanding. References to specific people, places and things in text are crucial to resolving ambiguity and extracting meaning. Knowledge Bases (KBs) codify this information for automated systems, enabling applications such as entity-based search and question answering. This thesis explores the idea that sites on the web may act as a KB, even if that is not their primary intent. Dedicated KBs like Wikipedia are a rich source of entity information, but are built and maintained at an ongoing cost in human effort. As a result, they are generally limited in terms of the breadth and depth of knowledge they index about entities. Web knowledge bases offer a distributed solution to the problem of aggregating entity knowledge. Social networks aggregate content about people, news sites describe events with tags for organizations and locations, and a diverse assortment of web directories aggregate statistics and summaries for long-tail entities notable within niche movie, musical and sporting domains. We aim to develop the potential of these resources for both web-centric entity Information Extraction (IE) and structured KB population. We first investigate the problem of Named Entity Linking (NEL), where systems must resolve ambiguous mentions of entities in text to their corresponding node in a structured KB. We demonstrate that entity disambiguation models derived from inbound web links to Wikipedia are able to complement and in some cases completely replace the role of resources typically derived from the KB. Building on this work, we observe that any page on the web which reliably disambiguates inbound web links may act as an aggregation point for entity knowledge. To uncover these resources, we formalize the task of Web Knowledge Base Discovery (KBD) and develop a system to automatically infer the existence of KB-like endpoints on the web. While extending our framework to multiple KBs increases the breadth of available entity knowledge, we must still consolidate references to the same entity across different web KBs. We investigate this task of Cross-KB Coreference Resolution (KB-Coref) and develop models for efficiently clustering coreferent endpoints across web-scale document collections. Finally, assessing the gap between unstructured web knowledge resources and those of a typical KB, we develop a neural machine translation approach which transforms entity knowledge between unstructured textual mentions and traditional KB structures. The web has great potential as a source of entity knowledge. In this thesis we aim to first discover, distill and finally transform this knowledge into forms which will ultimately be useful in downstream language understanding tasks.
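
    To make the link-derived disambiguation idea concrete, here is a minimal Python sketch in which counts of (anchor text, linked page) pairs stand in for a model learned from inbound web links; the toy counts and page names are invented for illustration and are not from the thesis.

        # Toy anchor-text statistics: how often each surface form links to each page.
        from collections import defaultdict

        link_counts = {
            ("jaguar", "Jaguar_Cars"): 120,
            ("jaguar", "Jaguar_(animal)"): 45,
            ("apple", "Apple_Inc."): 300,
            ("apple", "Apple_(fruit)"): 80,
        }

        def candidates(mention):
            """Return candidate pages for a mention, ranked by 'commonness'."""
            scores = defaultdict(float)
            total = sum(c for (m, _), c in link_counts.items() if m == mention)
            for (m, page), count in link_counts.items():
                if m == mention:
                    scores[page] = count / total
            return sorted(scores.items(), key=lambda kv: -kv[1])

        def link(mention):
            """Resolve a mention to its most frequently linked target, or None."""
            ranked = candidates(mention.lower())
            return ranked[0][0] if ranked else None

        print(link("Jaguar"))  # -> 'Jaguar_Cars'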

    INSERT from Reality: A Schema-driven Approach to Image Capture of Structured Information

    3rd place in the undergraduate 3-Minute Thesis Competition. There is a tremendous amount of structured information locked away on document images, e.g., receipts, invoices, medical testing documents, and banking statements. However, the document images that retain this structured information are often ad hoc and vary between businesses, organizations, or time periods. Although optical character recognition allows us to digitize document images into sequences of words, there still does not exist a means to identify schema attributes in the words of these ad hoc images and extract them into a database. In this thesis, we push beyond optical character recognition: while current information extraction techniques use only optical character recognition from structured images, we infer the visual structure and combine it with the textual information on the document image to create a highly-structured INSERT statement, ready to be executed against a database. We call this approach IFR. We use OCR to obtain the textual contents of the image. Our natural language processing annotates this with relevant information such as data type. We also prune irrelevant words to improve performance in subsequent steps. In parallel to textual analysis, we visually segment the input document image, with no a priori information, to create a visual context window around each textual token. We merge the two analyses to augment the textual information with context from the visual context windows. Using analyst-defined heuristic functions, we can score each of these context-enabled entities to probabilistically construct the final INSERT statement. We evaluated IFR on three real-world datasets and were able to achieve F1 scores of over 83% in INSERT generation on these datasets, spending approximately 2 seconds per image on average. Comparing IFR to natural language processing approaches, such as regular expressions and conditional random fields, we found IFR to perform better at detecting the correct schema attributes. To compare IFR to a human baseline, we conducted a user study to find the human baseline of INSERT quality on our datasets and found IFR to produce INSERT statements that were comparable to or exceeded that baseline. National Science Foundation. No embargo. Academic Major: Computer Science and Engineering.
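
    The Python sketch below illustrates the general recipe in miniature, not the IFR system itself: OCR tokens with bounding boxes plus a simple analyst-style heuristic are assembled into a SQL INSERT. The token layout, schema, and heuristic are assumptions made for the example.

        # Toy OCR output: each token with the (x, y) of its box's top-left corner.
        tokens = [
            {"text": "Total:", "x": 10, "y": 200},
            {"text": "$42.17", "x": 90, "y": 200},
            {"text": "Date:", "x": 10, "y": 20},
            {"text": "2020-05-01", "x": 70, "y": 20},
        ]

        def right_of(label, tokens, max_dy=5):
            """Heuristic: a label's value is the nearest token to its right on the same line."""
            anchors = [t for t in tokens if t["text"].rstrip(":").lower() == label]
            if not anchors:
                return None
            a = anchors[0]
            same_line = [t for t in tokens
                         if t is not a and abs(t["y"] - a["y"]) <= max_dy and t["x"] > a["x"]]
            return min(same_line, key=lambda t: t["x"])["text"] if same_line else None

        def build_insert(table, schema):
            """Assemble an INSERT statement from the heuristic matches."""
            values = {attr: right_of(attr, tokens) for attr in schema}
            cols = ", ".join(values)
            vals = ", ".join(f"'{v}'" for v in values.values())
            return f"INSERT INTO {table} ({cols}) VALUES ({vals});"

        print(build_insert("receipts", ["date", "total"]))
        # INSERT INTO receipts (date, total) VALUES ('2020-05-01', '$42.17');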

    Identifying Graphs from Noisy Observational Data

    There is a growing amount of data describing networks -- examples include social networks, communication networks, and biological networks. As the amount of available data increases, so does our interest in analyzing the properties and characteristics of these networks. However, in most cases the data is noisy, incomplete, and the result of passively acquired observational data; naively analyzing these networks without taking these errors into account can result in inaccurate and misleading conclusions. In my dissertation, I study the tasks of entity resolution, link prediction, and collective classification to address these deficiencies. I describe these tasks in detail and discuss my own work on each of these tasks. For entity resolution, I develop a method for resolving the identities of name mentions in email communications. For link prediction, I develop a method for inferring subordinate-manager relationships between individuals in an email communication network. For collective classification, I propose an adaptive active surveying method to address node labeling in a query-driven setting on network data. In many real-world settings, however, these deficiencies are not found in isolation and all need to be addressed to infer the desired complete and accurate network. Furthermore, because of the dependencies typically found in these tasks, the tasks are inherently inter-related and must be performed jointly. I define the general problem of graph identification, which simultaneously performs these tasks: removing the noise and missing values in the observed input network and inferring the complete and accurate output network. I present a novel approach to graph identification using a collection of Coupled Collective Classifiers, C3, which, in addition to capturing the variety of features typically used for each task, can capture the intra- and inter-dependencies required to correctly infer nodes, edges, and labels in the output network. I discuss variants of C3 using different learning and inference paradigms and show the superior performance of C3, in terms of both prediction quality and runtime performance, over various previous approaches. I then conclude by presenting the Graph Alignment, Identification, and Analysis (GAIA) open-source software library, which not only provides an implementation of C3 but also algorithms for various tasks in network data such as entity resolution, link prediction, collective classification, clustering, active learning, data generation, and analysis.
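
    A minimal Python sketch of the collective-classification component is given below: unlabeled nodes repeatedly take the majority label of their labeled neighbours. The toy graph, seed labels, and voting rule are illustrative assumptions and do not reproduce the C3 model.

        from collections import Counter

        edges = {  # adjacency list of a small undirected graph
            "a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"],
            "d": ["c", "e"], "e": ["d"],
        }
        labels = {"a": "spam", "b": "spam", "e": "ham"}  # observed seed labels

        def collective_classify(edges, labels, iterations=5):
            """Iteratively label unknown nodes by majority vote over neighbours."""
            labels = dict(labels)
            unknown = [n for n in edges if n not in labels]
            for _ in range(iterations):
                for node in unknown:
                    votes = Counter(labels[nb] for nb in edges[node] if nb in labels)
                    if votes:
                        labels[node] = votes.most_common(1)[0][0]
            return labels

        print(collective_classify(edges, labels))  # 'c' and 'd' inherit labels from neighbours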

    Reasoning Between the Lines: a Logic of Relational Propositions

    This paper describes how Rhetorical Structure Theory (RST) and relational propositions can be used to define a method for rendering and analyzing texts as expressions in propositional logic. Relational propositions, the implicit assertions that correspond to RST relations, are defined using standard logical operators and rules of inference. The resulting logical forms are used to construct logical expressions that map to RST tree structures. The resulting expressions show that inference is pervasive within coherent texts. To support reasoning over these expressions, a set of rules for negation is defined. The logical forms and their negation rules can be used to examine the flow of reasoning and the effects of incoherence. Because there is a correspondence between logical coherence and the functional relationships of RST, an RST analysis that cannot pass the test of logic is indicative either of a problematic analysis or of an incoherent text. The result is a method for analyzing the logic implicit within discursive reasoning.
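
    As a rough illustration of rendering an RST analysis as a propositional-logic expression, the Python sketch below maps two relations onto logical operators; the relation-to-operator mapping is invented for the example and does not reproduce the paper's definitions.

        def render(node):
            """Recursively turn an RST node into a propositional-logic string."""
            if isinstance(node, str):          # leaf: an elementary proposition
                return node
            relation, satellite, nucleus = node
            if relation == "evidence":         # satellite offered as grounds for nucleus
                return f"({render(satellite)} -> {render(nucleus)})"
            if relation == "conjunction":
                return f"({render(satellite)} & {render(nucleus)})"
            raise ValueError(f"unmapped relation: {relation}")

        # "It rained" (satellite) is offered as evidence that "the game was cancelled" (nucleus).
        tree = ("evidence", "rained", "cancelled")
        print(render(tree))  # (rained -> cancelled)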

    Doctor of Philosophy

    With the steady increase in online shopping, more and more consumers are resorting to Product Search Engines and shopping sites such as Yahoo! Shopping, Google Product Search, and Bing Shopping as their first stop for purchasing goods online. These sites act as intermediaries between shoppers and merchants to drive user experience by enabling faceted search, comparison of products based on their specifications, and ranking of products based on their attributes. The success of these systems heavily relies on the variety and quality of the products that they present to users. In that sense, product catalogs are to online shopping what the Web index is to Web search. Therefore, comprehensive product catalogs are fundamental to the success of Product Search Engines. Given the large number of products and categories, and the speed at which they are released to the market, constructing and keeping catalogs up-to-date becomes a challenging task, calling for automated techniques that do not rely on human intervention. The main goal of this dissertation is to automatically construct catalogs for product search engines. To achieve this goal, the following problems must be addressed by these search engines: (i) product synthesis: creation of product instances that conform to the catalog schema; (ii) product discovery: derivation of product instances for products whose schemata are not present in the catalog; (iii) schema synthesis: construction of schemata for new product categories. We propose an end-to-end framework that automates, to a great extent, these tasks. We present a detailed experimental evaluation using real data sets which shows that our framework is effective, scaling to a large number of products and categories, and resilient to the noise that is inherent in Web data.
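
    As an illustration of the product-synthesis step, the Python sketch below conforms a merchant's raw attribute-value pairs to a catalog schema via a synonym table; the schema, synonyms, and offer are invented and do not come from the dissertation.

        catalog_schema = ["brand", "model", "screen_size", "price"]

        # Map merchant-specific attribute names onto catalog attributes.
        synonyms = {
            "manufacturer": "brand",
            "model name": "model",
            "display": "screen_size",
            "screen size": "screen_size",
        }

        def synthesize(offer):
            """Produce a catalog-conformant product instance from a raw offer."""
            product = {}
            for raw_name, value in offer.items():
                attr = synonyms.get(raw_name.lower(), raw_name.lower())
                if attr in catalog_schema:
                    product[attr] = value
            return product

        offer = {"Manufacturer": "Acme", "Model Name": "X100",
                 "Display": "13.3 in", "Price": "$899", "Color": "silver"}
        print(synthesize(offer))
        # {'brand': 'Acme', 'model': 'X100', 'screen_size': '13.3 in', 'price': '$899'}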