    Spatial characteristics of a large web n-gram corpus

    N-gram corpora, though prominently used to structure and index large natural language corpora, are rarely the focus of geographic information retrieval (GIR). In this study we describe a step in this direction by characterizing spatial information in a large Web n-gram corpus provided by Microsoft. We explore how continent and country toponyms are represented in this corpus and whether basic topological relations can be correctly retrieved. Results suggest that toponym ambiguity has a major impact and that, although retrieved topological relations are often correct, recall is low. We conclude that further research is required if more fine-grained spatial information is to be retrieved from n-grams.