24 research outputs found

    Toponym Disambiguation in Information Retrieval

    In recent years, geography has acquired great importance in Information Retrieval (IR) and, more generally, in the automated processing of information in text. Mobile devices that can browse the web while reporting their position are now commonplace, as are applications that exploit this data to provide users with locally customised information, such as directions or advertisements. It is therefore important to deal properly with the geographic information included in electronic texts. Most of this information takes the form of place names, or toponyms. Toponym ambiguity is an important issue in Geographical Information Retrieval (GIR), because queries are geographically constrained. It has proved difficult to find specific geographical IR methods that actually outperform traditional IR techniques. Toponym ambiguity may be a relevant factor in the inability of current GIR systems to take advantage of geographical knowledge. Recently, some Ph.D. theses have dealt with Toponym Disambiguation (TD) from different perspectives, from the development of resources for the evaluation of Toponym Disambiguation (Leidner (2007)) to the use of TD to improve geographical scope resolution (Andogah (2010)). The Ph.D. thesis presented here introduces a TD method based on WordNet and carries out a detailed study of the relationship of Toponym Disambiguation to some IR applications, such as GIR, Question Answering (QA) and Web retrieval. The work presented in this thesis starts with an introduction to the applications in which TD may prove useful, together with an analysis of the ambiguity of toponyms in news collections. It would not be possible to study the ambiguity of toponyms without studying the resources that are used as placename repositories; these resources are the equivalent of language dictionaries, which provide the different meanings of a given word.
    Buscaldi, D. (2010). Toponym Disambiguation in Information Retrieval [Unpublished doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/8912

    Tratamiento lingüístico de las preguntas en español en los sistemas de búsqueda de respuestas / Linguistic treatment of questions in Spanish for question classification in question answering systems

    We propose a procedure for the linguistic treatment of Spanish questions as a step prior to their classification in question answering systems. The main types of question answering systems and their basic architecture are described. We review the principal question classification taxonomies used to date and the different fields from which they have been derived. Finally, we present the stages of linguistic analysis that the text of questions in question answering systems should undergo in order to facilitate the location of appropriate answers.

    Arabic named entity recognition

    This doctoral thesis describes research carried out to determine the best techniques for building a Named Entity Recognizer for Arabic. Such a system would be able to identify and classify the named entities found in open-domain Arabic text. The Named Entity Recognition (NER) task helps other Natural Language Processing tasks (for example, Information Retrieval, Question Answering and Machine Translation) achieve better results thanks to the enrichment it adds to the text. The literature contains a range of work investigating NER for a specific language or from a language-independent perspective. However, very little work studying the task for Arabic has been published so far. Arabic has a distinctive orthography and a complex morphology, and these aspects pose new challenges for NER research. A thorough investigation of NER for Arabic would not only provide the techniques needed to achieve high performance, but would also supply an error analysis and a discussion of the results that benefit the NER research community. The main objective of this thesis is to meet that need. To that end we have: 1. carried out a study of the different aspects of Arabic related to the task; 2. analysed the state of the art in NER; 3. compared the results obtained by different machine learning techniques; 4. developed a method based on combining different classifiers, where each classifier handles a single class of named entities and uses the feature set and machine learning technique best suited to that class. Our experiments were evaluated on nine test sets.
    Benajiba, Y. (2009). Arabic named entity recognition [Unpublished doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/8318

    Knowledge-based and data-driven approaches for geographical information access

    Geographical Information Access (GeoIA) can be defined as a way of retrieving information from textual collections that includes the automatic analysis and interpretation of the geographical constraints and terms present in queries and documents. This PhD thesis presents, describes and evaluates several heterogeneous approaches for three GeoIA tasks: Geographical Information Retrieval (GIR), Geographical Question Answering (GeoQA), and Textual Georeferencing (TG). The GIR task deals with user queries that search over documents (e.g. "vineyards in California") and the GeoQA task deals with questions that retrieve answers (e.g. "What is the capital of France?"). TG, in turn, is the task of associating one or more georeferences (such as polygons or coordinates in a geodetic reference system) with electronic documents. Current state-of-the-art AI algorithms do not yet fully understand the semantic meaning and the geographical constraints and terms present in queries and document collections. This thesis attempts to improve the effectiveness of GeoIA tasks by: 1) improving the detection, understanding, and use of part of the geographical and thematic content of queries and documents with Toponym Recognition, Toponym Disambiguation and Natural Language Processing (NLP) techniques, and 2) combining Geographical Knowledge-Based Heuristics based on common sense with Data-Driven IR algorithms. The main contributions of this thesis to the state of the art of GeoIA tasks are: 1) The presentation of 10 novel approaches for GeoIA tasks: 3 approaches for GIR, 3 for GeoQA, and 4 for TG. 2) The evaluation of these novel approaches in three contexts: within official evaluation benchmarks, after the benchmarks with their test collections, and with other specific datasets. Most of these algorithms have been evaluated in international evaluations and some of them achieved top-ranked state-of-the-art results, including top-performing results in the GIR (GeoCLEF 2007) and TG (MediaEval 2014) benchmarks. 3) The experiments reported in this PhD thesis show that the approaches can effectively combine Geographical Knowledge and NLP with Data-Driven techniques to improve the effectiveness measures of the three Geographical Information Access tasks investigated. 4) TALPGeoIR: a novel GIR approach that combines Geographical Knowledge ReRanking (GeoKR), NLP and Relevance Feedback (RF) and that achieved state-of-the-art results in official GeoCLEF benchmarks (Ferrés and Rodríguez, 2008; Mandl et al., 2008) and in subsequent experiments (Ferrés and Rodríguez, 2015a). This approach has been evaluated with the full GeoCLEF corpus (100 topics) and showed that GeoKR, NLP, and RF techniques, evaluated separately or in combination, improve the MAP and R-Precision results of the state-of-the-art IR algorithms TF-IDF, BM25 and InL2, with statistical significance in most of the experiments. 5) GeoTALP-QA: a scope-based GeoQA approach for Spanish and English and its evaluation with a set of questions about Spanish geography (Ferrés and Rodríguez, 2006). 6) Four Textual Georeferencing approaches for informal and formal documents that achieved state-of-the-art results in evaluation benchmarks (Ferrés and Rodríguez, 2014) and in subsequent experiments (Ferrés and Rodríguez, 2011; Ferrés and Rodríguez, 2015b).
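
    The MAP and R-Precision measures named in this abstract follow standard IR definitions. The sketch below is a minimal Python illustration of those definitions only. The function names, the `runs` and `qrels` dictionaries and the toy document ids are assumptions made for this example; they are not taken from the thesis or from any GeoCLEF tooling.

```python
from typing import Dict, List, Set


def average_precision(ranking: List[str], relevant: Set[str]) -> float:
    """Sum of precision@k at each rank k holding a relevant document,
    divided by the total number of relevant documents."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for k, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / k
    return precision_sum / len(relevant)


def r_precision(ranking: List[str], relevant: Set[str]) -> float:
    """Precision at rank R, where R is the number of relevant documents."""
    r = len(relevant)
    if r == 0:
        return 0.0
    return sum(1 for doc_id in ranking[:r] if doc_id in relevant) / r


def mean_average_precision(runs: Dict[str, List[str]],
                           qrels: Dict[str, Set[str]]) -> float:
    """MAP: mean of the per-topic average precision over all judged topics."""
    if not qrels:
        return 0.0
    total = sum(average_precision(runs.get(topic, []), rel)
                for topic, rel in qrels.items())
    return total / len(qrels)


# Toy data (hypothetical topics and document ids).
runs = {"t1": ["d1", "d2", "d3", "d4"], "t2": ["x", "y"]}
qrels = {"t1": {"d1", "d3"}, "t2": {"y"}}

if __name__ == "__main__":
    print("MAP:", mean_average_precision(runs, qrels))
    print("R-Precision (t1):", r_precision(runs["t1"], qrels["t1"]))
```

    A re-ranking step such as the one the abstract describes would change the order of each list in `runs`; comparing these two scores before and after re-ranking is one simple way to check whether it helped.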

    Linguistic treatment of questions in Spanish for question classification in question answering systems

    We propose a procedure for the linguistic treatment of Spanish questions as a step prior to their classification in question answering systems. The main types of question answering systems and their basic architecture are described. We review the principal question classification taxonomies used to date and the different fields from which they have been derived. Finally, we present the stages of linguistic analysis that the text of questions in question answering systems should undergo in order to facilitate the location of appropriate answers.

    Bootstrapping named entity resources for adaptive question answering systems

    Question Answering (QA) systems extend traditional search engines with the ability to find precise answers to user questions. Their objective is to make information access easier by reducing the time and effort that the user needs to find a specific piece of information in a list of relevant documents. This thesis covers two pieces of work related to QA systems. The first part introduces an architecture for Spanish QA systems based on the combination and adaptation of different techniques from Information Retrieval (IR) and Information Extraction (IE). This architecture is composed of three main modules: question analysis, relevant passage retrieval, and answer extraction and selection. The processing of Named Entities (NE) has received special attention because they are frequently the topic of questions or good candidate answers. The proposed architecture has been implemented in the MIRACLE QA system, which has been independently evaluated over several editions of the CLEF@QA track of the Cross-Language Evaluation Forum (CLEF). The participations and results from the 2004 to 2007 campaigns, together with the details and evolution of the system, are described in depth. The MIRACLE QA system obtained moderate performance, with first-answer accuracy ranging between 20% and 30%. Nevertheless, the results obtained in the 2005 main QA task and in the 2006 RealTimeQA pilot task stand out. The latter included response time as an additional factor in the evaluation, besides answer correctness. These results support the proposed architecture as a viable option for QA over textual collections and confirm similar findings obtained for English and other languages. On the other hand, the analysis of the results across evaluation campaigns and the comparison with other QA systems point to problems with current systems and to new challenges. In our experience, it is more difficult to tailor QA systems to different domains and languages than IR systems. The problem is inherited from the use of complex language analysis tools such as POS taggers, parsers and semantic analyzers, the latter including NE Recognition and Classification (NERC) and Relation Detection and Characterization (RDC) tools. The second part of this thesis tackles this problem and proposes a different approach to adapting QA systems to different languages and collections. The proposal focuses on acquiring knowledge for the semantic analyzers using lightly supervised approaches. The goal is to obtain useful resources that help to perform NERC or RDC using as few annotated resources as possible. In addition, we try to avoid dependencies on other language analysis tools so that these methods can be applied to different languages and domains. First, we studied previous work on building NERC and RDC modules with little supervision, particularly bootstrapping methods. We propose a common framework for different bootstrapping systems that helps to unify and compare the evaluation functions used for intermediate results, both instances and patterns. The main proposal is a new algorithm that simultaneously and iteratively acquires instances and patterns associated with a relation of interest. It can also acquire several relations at once, using mutual exclusion among relations to reduce concept drift and achieve better results. A distinctive characteristic is its query-based exploration of the indexed text collection, which enables its use on large collections. Candidate selection and evaluation are based on incrementally building a graph of instances and patterns, which also justifies our evaluation function. The discovery approach is analogous to the exploration frontier of a web crawler and is able to find the instances most similar to the available seeds. This algorithm has been implemented in the SPINDEL system. For evaluation we selected the task of acquiring resources for the most common NE classes: Person, Location and Organization. The objective is to acquire name instances that belong to each class as well as contextual patterns that help to detect mentions of NEs of that class. We present results for the acquisition of resources from raw text in two different languages, Spanish and English. We also performed experiments for Spanish on two different collections, news and texts from a collaborative encyclopedia, Wikipedia. Both cases are tackled with limited language analysis tools and resources. Starting from fewer than 40 seeds per class, the bootstrapping process acquires large name lists containing on the order of 30,000 instances of variable quality. Large lists of indicative patterns are obtained as well. Our indirect evaluation confirms the utility of both resources for classifying NEs using a simple dictionary-based recognition approach, and the acquired patterns help to improve coverage in both languages. The best configuration obtained an F-score of 67.17 for Spanish and 55.99 for English. The module requires much less development effort than supervised algorithms once annotation is taken into account, although its performance is not yet on par. This research is a first step towards the development of semantic applications such as QA that require less adaptation effort for a new language or domain with no annotated corpora.
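
    The alternating acquisition of instances and patterns, with mutual exclusion between classes, can be illustrated with a toy Python sketch. This is not the SPINDEL algorithm: the graph-based scoring and the index-driven exploration described in the abstract are not reproduced here. The pattern shape (left and right neighbouring tokens), the `min_support` threshold, the function names and the six-sentence corpus are all assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple

Pattern = Tuple[str, str]  # (token to the left, token to the right)


def contexts(sentences: List[List[str]]):
    """Yield (pattern, token) for every token that has both neighbours."""
    for tokens in sentences:
        for i in range(1, len(tokens) - 1):
            yield (tokens[i - 1], tokens[i + 1]), tokens[i]


def bootstrap(sentences: List[List[str]], seeds: Dict[str, Set[str]],
              rounds: int = 2, min_support: int = 2):
    """Alternate between learning patterns from instances and vice versa."""
    instances = {label: set(names) for label, names in seeds.items()}
    patterns: Dict[str, Set[Pattern]] = {label: set() for label in seeds}
    for _ in range(rounds):
        # 1) Pattern step: keep contexts seen with enough distinct known instances.
        for label in instances:
            support: Dict[Pattern, Set[str]] = {}
            for pattern, token in contexts(sentences):
                if token in instances[label]:
                    support.setdefault(pattern, set()).add(token)
            patterns[label] |= {p for p, toks in support.items()
                                if len(toks) >= min_support}
        # 2) Instance step: collect the tokens matched by each class's patterns.
        proposed: Dict[str, Set[str]] = {label: set() for label in instances}
        for pattern, token in contexts(sentences):
            for label in instances:
                if pattern in patterns[label]:
                    proposed[label].add(token)
        # 3) Mutual exclusion: drop tokens claimed by more than one class
        #    or already known as instances of another class.
        claims = Counter(tok for toks in proposed.values() for tok in toks)
        for label in instances:
            others = set().union(*(instances[o] for o in instances if o != label))
            instances[label] |= {t for t in proposed[label]
                                 if claims[t] == 1 and t not in others}
    return instances, patterns


# Toy corpus and seeds (hypothetical).
SENTENCES = [s.split() for s in [
    "the mayor of Madrid said",
    "the mayor of Paris said",
    "the mayor of Lisbon said",
    "president Obama said today",
    "president Merkel said today",
    "president Macron said today",
]]
SEEDS = {"Location": {"Madrid", "Paris"}, "Person": {"Obama", "Merkel"}}

instances, patterns = bootstrap(SENTENCES, SEEDS)

if __name__ == "__main__":
    for label in sorted(instances):
        print(label, sorted(instances[label]), sorted(patterns[label]))
```

    On this toy corpus each class learns a single context pattern from its two seeds and uses it to pick up one further name. A real system would need pattern and instance scoring to cope with noisy matches, which is where the evaluation functions compared in the thesis come in.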

    Finding answers to definition questions on the web

    Fundamentally, question answering systems are designed to respond automatically to queries posed by users in natural language. The first step in the answering process is query analysis, whose goal is to classify the query according to a set of pre-specified types. Traditionally, these classes include: factoid, definition, and list. Systems thereafter choose the answering method according to the class recognised in this early phase. In short, this thesis focuses exclusively on strategies to tackle definition questions (e.g., "Who is Ben Bernanke?"). This sort of question has become especially interesting in recent years, due to the significant number of such questions submitted to search engines. Most advances in definition question answering have been made under the umbrella of the Text REtrieval Conference (TREC). This is, more precisely, a framework for testing systems operating on a collection of news articles. Thus, the objective of chapter one is to describe this framework along with presenting additional introductory aspects of definition question answering including: (a) how definition questions are prompted by individuals; (b) the different conceptions of definition, and thus of answers; and (c) the various metrics exploited for assessing systems. Since the inception of TREC, systems have put to the test manifold approaches to discover answers, throwing some light onto several key aspects of this problem. On this account, chapter four goes over a selection of some notable TREC systems. This selection is not aimed at completeness, but rather at highlighting the leading features of these systems. For the most part, systems benefit from knowledge bases (e.g., Wikipedia) for obtaining descriptions about the concept being defined (a.k.a. definiendum). These descriptions are thereafter projected onto the array of candidate answers as a means of discerning the correct answer. 
In other words, these knowledge bases play the role of annotated resources, and most systems attempt to find the answer candidates across the collection of news articles that are most similar to these descriptions. The cornerstone of this thesis is the assumption that it is plausible to devise competitive, and hopefully better, systems without the necessity of annotated resources. Although this descriptive knowledge is helpful, the author believes that these systems are built on two wrong premises: 1. It is arguable whether the senses or contexts related to the definiendum across knowledge bases are the same senses or contexts as those of the instances across the array of answer candidates. This observation also extends to the fact that not all descriptions within the group of putative answers are necessarily covered by knowledge bases, even though they might refer to the same contexts or senses. 2. Finding an efficient projection strategy does not necessarily entail a good procedure for discerning descriptive knowledge, because it shifts the goal of the task towards "more like this set" instead of analysing whether or not each candidate bears the characteristics of a description. In other words, the coverage given by knowledge bases for a specific definiendum is not wide enough to learn all the characteristics that typify its descriptions, and hence not wide enough for systems to identify all answers within the set of candidates. From another angle, a conventional projection methodology can be seen as a finder of lexical analogies. All in all, this thesis investigates models that disregard this kind of annotated resource and projection strategy. Indeed, the author believes that a robust technique of this sort can be integrated with traditional projection methodologies, thereby bringing about an enhancement in performance. The major contributions of this thesis are presented in chapters five, six and seven. There are several ways of understanding this structure. 
For example, chapter five presents a general framework for answering definition questions in several languages. The primary goal of this study is to design a lightweight definition question answering system operating on web snippets and two languages: English and Spanish. The idea is to utilise web snippets as a source of descriptive information in several languages, and the high degree of language independence is achieved by relying on as little linguistic knowledge as possible. To put it more precisely, this system relies on statistical methods and a list of stop-words, as well as a set of language-dependent definition patterns. In detail, chapter five branches into two more specific studies. The first study is essentially aimed at capitalising on redundancy for detecting answers (e.g., word frequency counts across answer candidates). Although this type of feature has been widely used by TREC systems, this study focuses on its impact on different languages, and its benefits when applied to web snippets instead of a collection of news documents. An additional motivation behind targeting web snippets is the hope of studying systems working on more heterogeneous corpora, without incurring the need to download full documents. For instance, on the Internet, the number of distinct senses for the definiendum considerably increases, making it necessary to consider a sense discrimination technique. For this purpose, the system presented in this chapter takes advantage of an unsupervised approach premised on Latent Semantic Analysis. Although the outcome of this study shows that sense discrimination is hard to achieve when operating solely on web snippets, it also reveals that they are a fruitful source of descriptive knowledge, and that their extraction poses exciting challenges. The second branch extends this first study by exploiting multilingual knowledge bases (i.e. Wikipedia) for ranking putative answers. 
Generally speaking, it makes use of word association norms deduced from sentences that match definition patterns across Wikipedia. In order to adhere to the premise of not profiting from articles related to a specific definiendum, these sentences are anonymised by replacing the concept with a placeholder, and the word norms are learnt from all training sentences, instead of only from the Wikipedia page about the particular definiendum. The results of this study indicate that this use of these resources can also be beneficial; in particular, they reveal that word association norms are a cost-efficient solution. However, the size of the corpus markedly decreases for languages other than English, which indicates that it is insufficient for designing models for other languages. Later, chapter six gets more specific and deals only with the ranking of answer candidates in English. The reason for abandoning the idea of Spanish is the sparseness observed across both the redundancy from the Internet and the training material mined from Wikipedia. This sparseness is considerably greater than in the case of English, and it makes learning powerful statistical models more difficult. This chapter presents a novel way of modeling definitions grounded on n-gram language models inferred from the lexicalised dependency tree representation of the training material acquired in the study of chapter five. These models are contextual in the sense that they are built in relation to the semantics of the sentence. Generally, these semantics can be perceived as the distinct types of definienda (e.g., footballer, language, artist, disease, and tree). This study, in addition, investigates the effect of some features on these context models (i.e., named entities, and part-of-speech tags). Overall, the results obtained by this approach are encouraging, in particular in terms of increasing the accuracy of the pattern matching. 
However, it was experimentally observed that a training corpus comprising only positive examples (descriptions) is not enough to achieve perfect accuracy, because these models cannot deduce the characteristics that typify non-descriptive content. More importantly, as future work, context models give the chance to study how different contexts can be amalgamated (smoothed) in agreement with their semantic similarities in order to improve performance. Subsequently, chapter seven gets even more specific and searches for the set of properties that can aid in discriminating descriptions from other kinds of texts. Note that this study regards all kinds of descriptions, including those that do not match definition patterns. To this end, Maximum Entropy models are constructed on top of an automatically acquired large-scale training corpus, which encompasses descriptions from Wikipedia and non-descriptions from the Internet. Roughly speaking, different models are constructed as a means of studying the impact of assorted properties: surface, named entities, part-of-speech tags, chunks, and more interestingly, attributes derived from the lexicalised dependency graphs. In general, results corroborate the efficiency of features taken from dependency graphs, especially the root node and n-gram paths. Experiments conducted on testing sets of various characteristics suggest that it is also plausible to find attributes that can port to other corpora. Chapters two and three are extra chapters. The former examines different strategies to trawl the Web for descriptive knowledge. In essence, this chapter touches on several strategies geared towards boosting the recall of descriptive sentences across web snippets, especially sentences that match widespread definition patterns. This is a side study, but one instrumental to the core of this thesis, as it is necessary for systems targeted at the Internet to develop effective crawling techniques. 
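To make the redundancy idea concrete, here is a minimal sketch of ranking answer candidates by word frequency counts across the candidate set. The function name `rank_by_redundancy`, the tiny stop-word list and the averaging of counts are assumptions chosen for illustration; the thesis does not specify its scoring function in this abstract.

```python
from collections import Counter

# Illustrative stop-word list; a real system would use a fuller, per-language list.
STOP_WORDS = {"the", "a", "an", "of", "is", "was", "in", "and", "to"}

def rank_by_redundancy(candidates):
    """Order candidate snippets so that those sharing more content words
    with the rest of the candidate set come first."""
    tokenised = [
        [w for w in c.lower().split() if w not in STOP_WORDS]
        for c in candidates
    ]
    # In how many candidates does each content word appear?
    doc_freq = Counter(w for toks in tokenised for w in set(toks))

    def score(toks):
        words = set(toks)
        # Average candidate-level frequency of the snippet's content words.
        return sum(doc_freq[w] for w in words) / len(words) if words else 0.0

    scored = [(score(t), c) for t, c in zip(tokenised, candidates)]
    # sorted() is stable, so ties keep their original order.
    return [c for _, c in sorted(scored, key=lambda pair: -pair[0])]


ranked = rank_by_redundancy([
    "Ben Bernanke is an American economist",
    "Bernanke was chairman of the Federal Reserve",
    "American economist Bernanke chaired the Federal Reserve",
    "Click here to subscribe",
])
```

In this made-up example the unrelated snippet shares no content words with the others, so it drops to the bottom of the ranking. Because only a stop-word list is language-specific, the same scoring applies unchanged to Spanish snippets.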
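The anonymisation step and the learning of word associations from all training sentences can be sketched as follows. Pointwise mutual information is one common reading of "word association norms"; whether the thesis uses exactly this statistic is an assumption, as are the names `anonymise` and `pmi_table` and the placeholder token `CONCEPT`.

```python
import math
from collections import Counter
from itertools import combinations

def anonymise(sentence, definiendum, placeholder="CONCEPT"):
    """Replace the concept being defined with a placeholder token."""
    return sentence.replace(definiendum, placeholder)

def pmi_table(sentences):
    """Sentence-level pointwise mutual information for every word pair.

    PMI(x, y) = log2( P(x, y) / (P(x) * P(y)) ), with probabilities
    estimated as the fraction of sentences containing the word(s).
    """
    n = len(sentences)
    bags = [set(s.lower().split()) for s in sentences]
    single = Counter(w for bag in bags for w in bag)
    pair = Counter(
        frozenset(p) for bag in bags for p in combinations(sorted(bag), 2)
    )
    return {
        p: math.log2(count * n / math.prod(single[w] for w in p))
        for p, count in pair.items()
    }


# Made-up training pairs: (sentence matching a definition pattern, its definiendum).
training = [
    ("Ben Bernanke is an economist", "Ben Bernanke"),
    ("Lionel Messi is a footballer", "Lionel Messi"),
    ("Paul Krugman is an economist", "Paul Krugman"),
    ("Madrid hosts a museum", "Madrid"),
]
table = pmi_table([anonymise(s, d) for s, d in training])
```

Because every sentence is anonymised first, the statistics are pooled over all definienda rather than tied to the page of any single concept, which is the premise stated above.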
Chapter three, in contrast, has two goals: (a) presenting some components used by the strategies outlined in the last three chapters, thereby helping to focus on key aspects of the ranking methodologies and to present clearly the relevant aspects of the approaches laid out in these three chapters; and (b) fleshing out some characteristics that make it difficult to separate the genuine from the misleading answer candidates, particularly across sentences matching definition patterns. Chapter three is helpful for understanding part of the linguistic phenomena that the subsequent chapters deal with. On a final note about the organisation of this thesis, since there is a myriad of techniques, chapters six and seven each begin by dissecting the related work closest to their strategy. The main contributions of these chapters begin at sections 6.5 and 7.6, respectively. These two sections start with a discussion and comparison between the proposed methods and the related work presented in their corresponding preceding sections. This organisation is directed at facilitating the contextualisation of the proposed approaches, as there are different question answering systems with manifold characteristics.

    Contributions to information extraction for Spanish written biomedical text

    285 p. Healthcare practice and clinical research produce vast amounts of digitised, unstructured data in multiple languages that are currently underexploited, despite their potential applications in improving healthcare experiences, supporting trainee education, or enabling biomedical research, for example. To automatically transform those contents into relevant, structured information, advanced Natural Language Processing (NLP) mechanisms are required. In NLP, this task is known as Information Extraction. Our work takes place within this growing field of clinical NLP for the Spanish language, as we tackle three distinct problems. First, we compare several supervised machine learning approaches to the problem of sensitive data detection and classification. Specifically, we study the different approaches and their transferability in two corpora, one synthetic and the other authentic. Second, we present and evaluate UMLSmapper, a knowledge-intensive system for biomedical term identification based on the UMLS Metathesaurus. This system recognises and codifies terms without relying on annotated data or external Named Entity Recognition tools. Although technically naive, it performs on par with more evolved systems, and does not exhibit a considerable deviation from other approaches that rely on oracle terms. Finally, we present and exploit a new corpus of real health records manually annotated with negation and uncertainty information: NUBes. This corpus is the basis for two sets of experiments, one on cue and scope detection, and the other on assertion classification. Throughout the thesis, we apply and compare techniques of varying levels of sophistication and novelty, which reflects the rapid advancement of the field.

    Geographic information extraction from texts

    A large volume of unstructured texts, containing valuable geographic information, is available online. This information – provided implicitly or explicitly – is useful not only for scientific studies (e.g., spatial humanities) but also for many practical applications (e.g., geographic information retrieval). Although considerable progress has been achieved in geographic information extraction from texts, there are still unsolved challenges and issues, ranging from methods, systems, and data, to applications and privacy. Therefore, this workshop will provide a timely opportunity to discuss recent advances, new ideas, and concepts, and also to identify research gaps in geographic information extraction.