
    Geocoding location expressions in Twitter messages: A preference learning method

    Resolving location expressions in text to the correct physical location, also known as geocoding or grounding, is complicated by the fact that many places around the world share the same name. Correct resolution is even more difficult when there is little context to determine which place is intended, as in a 140-character Twitter message, or when location cues from different sources conflict, as may happen among the different metadata fields of a Twitter message. We used supervised machine learning to weigh the different fields of the Twitter message and the features of a world gazetteer, creating a model that prefers the correct gazetteer candidate when resolving the extracted expression. We evaluated our model using the F1 measure and compared it to similar algorithms; our method outperformed state-of-the-art competitors.
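The preference-learning idea, learning feature weights so that the correct gazetteer candidate outscores the alternatives, can be sketched as pairwise training. This is a minimal illustration, not the paper's actual model; the feature layout and data are invented for the example:

```python
# Sketch of preference learning for gazetteer candidate ranking.
# Each candidate is a feature vector (hypothetically:
# [log population, name-match score, agreement with tweet metadata]).
# Training pairs (correct, incorrect) drive a perceptron-style update
# until the learned weights prefer the correct candidate.

def train_weights(pairs, epochs=100, lr=0.1):
    """pairs: list of (features_of_correct, features_of_incorrect)."""
    n = len(pairs[0][0])
    w = [0.0] * n
    for _ in range(epochs):
        for good, bad in pairs:
            # We want w·good > w·bad; nudge w toward (good - bad) otherwise.
            margin = sum(wi * (g - b) for wi, g, b in zip(w, good, bad))
            if margin <= 0:
                w = [wi + lr * (g - b) for wi, g, b in zip(w, good, bad)]
    return w

def best_candidate(w, candidates):
    """Return the index of the highest-scoring candidate."""
    return max(range(len(candidates)),
               key=lambda i: sum(wi * f for wi, f in zip(w, candidates[i])))
```

With training pairs in which the candidate agreeing with the tweet's metadata is the correct one, the learned weights come to favour that feature over raw population, which is the kind of trade-off the abstract describes.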

    Discovering the spatial coverage of the documents through the SpatialCIM Methodology.

    The main focus of this paper is to present the SpatialCIM methodology for identifying the spatial coverage of documents in the Brazilian geographic area. The methodology uses a linguistic tool to assist the entity recognition process; the tool classifies the recognized entities as person, organization, time, localization, and others. The localization entities are checked against a geographic information system (GIS) in order to extract the Brazilian geographic paths of each entity. If there are multiple geographic paths for a single entity, a disambiguation process is carried out, which attempts to select the best geographic path for the entity by considering all the geographic entities in the text. Another important objective of this paper is to show that the disambiguation process improves the geographic classification of the documents based on the obtained geographic paths. The validation process compares a set of news articles previously labeled by an expert with the results of the disambiguated and non-disambiguated geographic paths. The results showed that the disambiguation process improves the classification compared with classification without disambiguation. Keywords: Ambiguity problem resolution, spatial coverage identification, toponym resolution
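The path-based disambiguation the abstract describes, choosing among multiple geographic paths by consulting the other geographic entities in the text, can be illustrated with a small sketch. The scoring rule is a simplification, and the place names and paths are invented for the example:

```python
def disambiguate_path(candidate_paths, context_paths):
    """Pick the candidate geographic path (country > state > city) that
    shares the most ancestors with the paths of the other geographic
    entities found in the same document (a deliberately simple score)."""
    def overlap(path):
        return sum(len(set(path) & set(ctx)) for ctx in context_paths)
    return max(candidate_paths, key=overlap)
```

For instance, an ambiguous city name whose candidate paths differ only in the state is resolved toward the state that other entities in the document already point to.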

    Toponym Disambiguation in Information Retrieval

    In recent years, geography has acquired great importance in the context of Information Retrieval (IR) and, in general, of the automated processing of textual information. Mobile devices that can browse the web while reporting their position are now a common reality, together with applications that exploit this data to provide users with locally customised information, such as directions or advertisements. It is therefore important to deal properly with the geographic information included in electronic texts. The majority of such information appears as place names, or toponyms. Toponym ambiguity represents an important issue in Geographical Information Retrieval (GIR), because queries are geographically constrained. There has been a struggle to find specific geographical IR methods that actually outperform traditional IR techniques, and toponym ambiguity may be a relevant factor in the inability of current GIR systems to take advantage of geographical knowledge. Recently, some Ph.D. theses have dealt with Toponym Disambiguation (TD) from different perspectives, from the development of resources for the evaluation of Toponym Disambiguation (Leidner (2007)) to the use of TD to improve geographical scope resolution (Andogah (2010)). The Ph.D. thesis presented here introduces a TD method based on WordNet and carries out a detailed study of the relationship of Toponym Disambiguation to several IR applications, such as GIR, Question Answering (QA) and Web retrieval. The work starts with an introduction to the applications in which TD may be useful, together with an analysis of the ambiguity of toponyms in news collections. The ambiguity of toponyms cannot be studied without studying the resources used as place-name repositories; these resources are the equivalent of language dictionaries, providing the different meanings of a given word.
    Buscaldi, D. (2010). Toponym Disambiguation in Information Retrieval [Unpublished doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/8912
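The degree of ambiguity in a place-name repository can be measured directly. A minimal sketch, using an invented toy gazetteer, of counting how many distinct names map to more than one place:

```python
from collections import Counter

def ambiguity_ratio(gazetteer):
    """gazetteer: iterable of (toponym, place_id) rows.
    Returns the fraction of distinct toponyms naming more than one place.
    The place identifiers below are illustrative, not a real scheme."""
    per_name = Counter(name for name, _ in gazetteer)
    ambiguous = sum(1 for count in per_name.values() if count > 1)
    return ambiguous / len(per_name)
```

Run over a full gazetteer, a statistic like this quantifies the ambiguity problem the thesis analyses in news collections.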

    Domain-specific named entity disambiguation in historical memoirs

    This paper presents the results of the extraction of named entities from a collection of historical memoirs about the Italian Resistance during World War II. The methodology followed for the extraction and disambiguation task is discussed, together with its evaluation. For the semantic annotation of the dataset, we developed a lookup-based pipeline following established practices for extracting and disambiguating named entities. This was necessary given the poor performance of the out-of-the-box Named Entity Recognition and Disambiguation (NERD) tools tested in the initial phase of this work.

    The SpatialCIM methodology for spatial document coverage disambiguation and the entity recognition process aided by linguistic techniques.

    Abstract. Nowadays users increasingly take the geographical localization of documents into account during information retrieval. However, conventional information retrieval systems based on keyword matching do not consider which words may represent geographical entities that are spatially related to other entities in the document. This paper presents the SpatialCIM methodology, which is based on three steps: pre-processing, data expansion and disambiguation. In the pre-processing step, the entity recognition process is carried out with the support of the Rembrandt tool. Additionally, the Rembrandt tool is compared, with respect to discovering location entities in texts, against the use of a controlled vocabulary of Brazilian geographic locations. For the comparison, a set of geographically labeled Portuguese-language news articles covering sugar cane culture is used. The results showed an F-measure increase for the Rembrandt tool from 45% without disambiguation to 50% after disambiguation, and from 35% to 38% using the controlled vocabulary. The results also showed that the Rembrandt tool has a minimal amplitude difference between precision and recall, although the controlled vocabulary always achieves the highest recall values.
    GeoDoc 2012, PAKDD 2012
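The F-measure used in the evaluation combines precision and recall; a minimal implementation of the standard definition:

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall.
    beta=1 gives the balanced F1 commonly used in entity-recognition work."""
    if precision + recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```

Because it is a harmonic-style mean, a system with balanced precision and recall (like Rembrandt, per the abstract) scores close to either value, while a lopsided system is pulled toward its weaker component.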

    Automatic reconstruction of itineraries from descriptive texts

    This thesis falls within the framework of the PERDIDO project, whose objectives are the extraction and reconstruction of itineraries from textual documents. The work was carried out in collaboration between the LIUPPA laboratory of the Université de Pau et des Pays de l'Adour (France), the Advanced Information Systems group (IAAA) of the Universidad de Zaragoza, and the COGIT laboratory of the IGN (France). The goal of this thesis is to design an automatic system that can extract displacements from travel guides or itinerary descriptions and represent them on a map. We propose an approach for the automatic representation of itineraries described in natural language, divided into two main tasks. The first aims to identify and extract, from texts describing itineraries, information such as spatial entities and expressions of displacement or perception. The objective of the second task is the reconstruction of the itinerary. Our proposal combines local information extracted through natural language processing with data from external geographic sources (for example, gazetteers). The annotation of spatial information is performed with an approach that combines part-of-speech tagging and lexico-syntactic patterns (a cascade of transducers) in order to annotate spatial named entities and expressions of displacement and perception. A first contribution to the first task is toponym disambiguation, a problem that remains poorly solved within Named Entity Recognition (NER) and is essential in geographical information retrieval.
    We propose an unsupervised georeferencing algorithm based on a clustering technique that can both disambiguate the toponyms found in external geographic resources and locate unreferenced toponyms. We also propose a generic graph model for the automatic reconstruction of itineraries, where each node represents a place and each edge represents a path linking two places. The originality of our model is that, in addition to the usual elements (paths and waypoints), it can represent other elements involved in the description of an itinerary, such as visual landmarks. A minimum spanning tree is computed from a weighted graph to automatically obtain an itinerary in the form of a graph. Each edge of the initial graph is weighted by a multicriteria analysis method that combines qualitative and quantitative criteria. The values of these criteria are determined from information extracted from the text and information from external geographic resources. For example, information produced by natural language processing, such as spatial relations describing an orientation (e.g., head south), is combined with the geographic coordinates of places found in the resources to determine the value of the "spatial relation" criterion. In addition, starting from the definition of the concept of itinerary and from the information used in language to describe one, we modeled a spatial-information annotation language adapted to the description of displacements, building on the recommendations of the TEI (Text Encoding and Interchange) consortium.
    Finally, the different stages of our approach were implemented and evaluated on a multilingual corpus of descriptions of trails and excursions (French, Spanish, Italian).
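The spanning-tree step, obtaining an itinerary as the minimum spanning tree of a weighted place graph, can be sketched with Prim's algorithm. The places and edge weights below are invented; in the thesis the weights come from the multicriteria analysis of textual and geographic evidence:

```python
import heapq

def minimum_spanning_tree(nodes, weighted_edges):
    """Prim's algorithm. weighted_edges maps (a, b) -> weight (undirected).
    Returns a list of (a, b, weight) edges forming the MST."""
    adjacency = {n: [] for n in nodes}
    for (a, b), w in weighted_edges.items():
        adjacency[a].append((w, b))
        adjacency[b].append((w, a))
    start = nodes[0]
    visited = {start}
    # Frontier entries: (weight, tree_node, outside_node).
    frontier = [(w, start, b) for w, b in adjacency[start]]
    heapq.heapify(frontier)
    tree = []
    while frontier and len(visited) < len(nodes):
        w, a, b = heapq.heappop(frontier)
        if b in visited:
            continue
        visited.add(b)
        tree.append((a, b, w))
        for w2, c in adjacency[b]:
            if c not in visited:
                heapq.heappush(frontier, (w2, b, c))
    return tree
```

Lower weights here stand for "more plausible leg of the itinerary", so the MST keeps the cheapest set of edges connecting every mentioned place, which is the shape the thesis extracts as the itinerary graph.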

    GeoAcademy: a web platform and algorithm for the automatic detection and geolocation of geographic coordinates in scientific articles

    The following study describes the qualities and uses of the GeoAcademy project, a program designed to geolocate scientific articles automatically, such as those found in Scopus, Web of Science, or similar databases. An algorithm has been developed to capture geographical coordinates or toponyms contained within the documents in order to perform reliable geolocation. In the methodology, we describe the stages of the project needed to build a sample database concerning the Sierra Nevada (Spain), as well as the development of the algorithm. The results include the technical data from applying the algorithm to the sample documents and its success rate, together with a description of the web-map platform used to display the geolocated texts. In conclusion, we outline the obstacles faced, potential bibliometric uses, and the advantages the platform offers as a reference resource and source of information.
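The coordinate-capture step can be illustrated with a simple regular expression for decimal latitude/longitude pairs. This is a sketch only; the abstract does not publish GeoAcademy's actual pattern, and real documents also use degree-minute-second notation, which this sketch ignores:

```python
import re

# Matches decimal pairs such as "37.0541, -3.3134" (lat, lon order assumed).
COORD_PAIR = re.compile(r'(-?\d{1,2}\.\d+)\s*,\s*(-?\d{1,3}\.\d+)')

def extract_coordinates(text):
    """Return plausible (lat, lon) pairs found in free text,
    keeping only values inside valid geographic bounds."""
    pairs = []
    for lat_s, lon_s in COORD_PAIR.findall(text):
        lat, lon = float(lat_s), float(lon_s)
        if -90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0:
            pairs.append((lat, lon))
    return pairs
```

The bounds check matters in practice: article text is full of number pairs (sample sizes, p-values, years) that a bare regex would otherwise misread as coordinates.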