16 research outputs found

    Toponym Resolution in Text

    Get PDF
    Institute for Communicating and Collaborative SystemsBackground. In the area of Geographic Information Systems (GIS), a shared discipline between informatics and geography, the term geo-parsing is used to describe the process of identifying names in text, which in computational linguistics is known as named entity recognition and classification (NERC). The term geo-coding is used for the task of mapping from implicitly geo-referenced datasets (such as structured address records) to explicitly geo-referenced representations (e.g., using latitude and longitude). However, present-day GIS systems provide no automatic geo-coding functionality for unstructured text. In Information Extraction (IE), processing of named entities in text has traditionally been seen as a two-step process comprising a flat text span recognition sub-task and an atomic classification sub-task; relating the text span to a model of the world has been ignored by evaluations such as MUC or ACE (Chinchor (1998); U.S. NIST (2003)). However, spatial and temporal expressions refer to events in space-time, and the grounding of events is a precondition for accurate reasoning. Thus, automatic grounding can improve many applications such as automatic map drawing (e.g. for choosing a focus) and question answering (e.g. , for questions like How far is London from Edinburgh?, given a story in which both occur and can be resolved). Whereas temporal grounding has received considerable attention in the recent past (Mani and Wilson (2000); Setzer (2001)), robust spatial grounding has long been neglected. Concentrating on geographic names for populated places, I define the task of automatic Toponym Resolution (TR) as computing the mapping from occurrences of names for places as found in a text to a representation of the extensional semantics of the location referred to (its referent), such as a geographic latitude/longitude footprint. The task of mapping from names to locations is hard due to insufficient and noisy databases, and a large degree of ambiguity: common words need to be distinguished from proper names (geo/non-geo ambiguity), and the mapping between names and locations is ambiguous (London can refer to the capital of the UK or to London, Ontario, Canada, or to about forty other Londons on earth). In addition, names of places and the boundaries referred to change over time, and databases are incomplete. Objective. I investigate how referentially ambiguous spatial named entities can be grounded, or resolved, with respect to an extensional coordinate model robustly on open-domain news text. I begin by comparing the few algorithms proposed in the literature, and, comparing semiformal, reconstructed descriptions of them, I factor out a shared repertoire of linguistic heuristics (e.g. rules, patterns) and extra-linguistic knowledge sources (e.g. population sizes). I then investigate how to combine these sources of evidence to obtain a superior method. I also investigate the noise effect introduced by the named entity tagging step that toponym resolution relies on in a sequential system pipeline architecture. Scope. In this thesis, I investigate a present-day snapshot of terrestrial geography as represented in the gazetteer defined and, accordingly, a collection of present-day news text. I limit the investigation to populated places; geo-coding of artifact names (e.g. airports or bridges), compositional geographic descriptions (e.g. 40 miles SW of London, near Berlin), for instance, is not attempted. Historic change is a major factor affecting gazetteer construction and ultimately toponym resolution. However, this is beyond the scope of this thesis. Method. While a small number of previous attempts have been made to solve the toponym resolution problem, these were either not evaluated, or evaluation was done by manual inspection of system output instead of curating a reusable reference corpus. Since the relevant literature is scattered across several disciplines (GIS, digital libraries, information retrieval, natural language processing) and descriptions of algorithms are mostly given in informal prose, I attempt to systematically describe them and aim at a reconstruction in a uniform, semi-formal pseudo-code notation for easier re-implementation. A systematic comparison leads to an inventory of heuristics and other sources of evidence. In order to carry out a comparative evaluation procedure, an evaluation resource is required. Unfortunately, to date no gold standard has been curated in the research community. To this end, a reference gazetteer and an associated novel reference corpus with human-labeled referent annotation are created. These are subsequently used to benchmark a selection of the reconstructed algorithms and a novel re-combination of the heuristics catalogued in the inventory. I then compare the performance of the same TR algorithms under three different conditions, namely applying it to the (i) output of human named entity annotation, (ii) automatic annotation using an existing Maximum Entropy sequence tagging model, and (iii) a našıve toponym lookup procedure in a gazetteer. Evaluation. The algorithms implemented in this thesis are evaluated in an intrinsic or component evaluation. To this end, we define a task-specific matching criterion to be used with traditional Precision (P) and Recall (R) evaluation metrics. This matching criterion is lenient with respect to numerical gazetteer imprecision in situations where one toponym instance is marked up with different gazetteer entries in the gold standard and the test set, respectively, but where these refer to the same candidate referent, caused by multiple near-duplicate entries in the reference gazetteer. Main Contributions. The major contributions of this thesis are as follows: ‱ A new reference corpus in which instances of location named entities have been manually annotated with spatial grounding information for populated places, and an associated reference gazetteer, from which the assigned candidate referents are chosen. This reference gazetteer provides numerical latitude/longitude coordinates (such as 51320 North, 0 50 West) as well as hierarchical path descriptions (such as London > UK) with respect to a world wide-coverage, geographic taxonomy constructed by combining several large, but noisy gazetteers. This corpus contains news stories and comprises two sub-corpora, a subset of the REUTERS RCV1 news corpus used for the CoNLL shared task (Tjong Kim Sang and De Meulder (2003)), and a subset of the Fourth Message Understanding Contest (MUC-4; Chinchor (1995)), both available pre-annotated with gold-standard. This corpus will be made available as a reference evaluation resource; ‱ a new method and implemented system to resolve toponyms that is capable of robustly processing unseen text (open-domain online newswire text) and grounding toponym instances in an extensional model using longitude and latitude coordinates and hierarchical path descriptions, using internal (textual) and external (gazetteer) evidence; ‱ an empirical analysis of the relative utility of various heuristic biases and other sources of evidence with respect to the toponym resolution task when analysing free news genre text; ‱ a comparison between a replicated method as described in the literature, which functions as a baseline, and a novel algorithm based on minimality heuristics; and ‱ several exemplary prototypical applications to show how the resulting toponym resolution methods can be used to create visual surrogates for news stories, a geographic exploration tool for news browsing, geographically-aware document retrieval and to answer spatial questions (How far...?) in an open-domain question answering system. These applications only have demonstrative character, as a thorough quantitative, task-based (extrinsic) evaluation of the utility of automatic toponym resolution is beyond the scope of this thesis and left for future work

    XVI. Magyar Szåmítógépes Nyelvészeti Konferencia

    Get PDF

    XVIII. Magyar Szåmítógépes Nyelvészeti Konferencia

    Get PDF

    Semantic Feature Extraction Using Multi-Sense Embeddings and Lexical Chains

    Full text link
    The relationship between words in a sentence often tell us more about the underlying semantic content of a document than its actual words individually. Natural language understanding has seen an increasing effort in the formation of techniques that try to produce non-trivial features, in the last few years, especially after robust word embeddings models became prominent, when they proved themselves able to capture and represent semantic relationships from massive amounts of data. These new dense vector representations indeed leverage the baseline in natural language processing, but they still fall short in dealing with intrinsic issues in linguistics, such as polysemy and homonymy. Systems that make use of natural language at its core, can be affected by a weak semantic representation of human language, resulting in inaccurate outcomes based on poor decisions. In this subject, word sense disambiguation and lexical chains have been exploring alternatives to alleviate several problems in linguistics, such as semantic representation, definitions, differentiation, polysemy, and homonymy. However, little effort is seen in combining recent advances in token embeddings (e.g. words, documents) with word sense disambiguation and lexical chains. To collaborate in building a bridge between these areas, this work proposes a collection of algorithms to extract semantic features from large corpora as its main contributions, named MSSA, MSSA-D, MSSA-NR, FLLC II, and FXLC II. The MSSA techniques focus on disambiguating and annotating each word by its specific sense, considering the semantic effects of its context. The lexical chains group derive the semantic relations between consecutive words in a document in a dynamic and pre-defined manner. These original techniques' target is to uncover the implicit semantic links between words using their lexical structure, incorporating multi-sense embeddings, word sense disambiguation, lexical chains, and lexical databases. A few natural language problems are selected to validate the contributions of this work, in which our techniques outperform state-of-the-art systems. All the proposed algorithms can be used separately as independent components or combined in one single system to improve the semantic representation of words, sentences, and documents. Additionally, they can also work in a recurrent form, refining even more their results.Ph.D.College of Engineering & Computer ScienceUniversity of Michigan-Dearbornhttps://deepblue.lib.umich.edu/bitstream/2027.42/149647/1/Terry Ruas Final Dissertation.pdfDescription of Terry Ruas Final Dissertation.pdf : Dissertatio

    XVII. Magyar Szåmítógépes Nyelvészeti Konferencia

    Get PDF
    corecore