Knowledge-based and data-driven approaches for geographical information access
Geographical Information Access (GeoIA) can be defined as a way of retrieving information from textual collections that includes the automatic analysis and interpretation of the geographical constraints and terms present in queries and documents. This PhD thesis presents, describes and evaluates several heterogeneous approaches for the following three GeoIA tasks: Geographical Information Retrieval (GIR), Geographical Question Answering (GeoQA), and Textual Georeferencing (TG). The GIR task deals with user queries that search over documents (e.g. "vineyards in California") and the GeoQA task deals with questions that retrieve answers (e.g. "What is the capital of France?"). TG, in turn, is the task of associating one or more georeferences (such as polygons or coordinates in a geodetic reference system) with electronic documents.
Current state-of-the-art AI algorithms do not yet fully understand the semantic meaning and the geographical constraints and terms present in queries and document collections. This thesis attempts to improve the effectiveness of GeoIA tasks by: 1) improving the detection, understanding, and use of a part of the geographical and thematic content of queries and documents with Toponym Recognition, Toponym Disambiguation and Natural Language Processing (NLP) techniques, and 2) combining common-sense Geographical Knowledge-Based Heuristics with Data-Driven IR algorithms.
The main contributions of this thesis to the state-of-the-art of GeoIA tasks are:
1) The presentation of 10 novel approaches for GeoIA tasks: 3 approaches for GIR, 3 for GeoQA, and 4 for Textual Georeferencing (TG).
2) The evaluation of these novel approaches in three contexts: within official evaluation benchmarks, after the benchmarks using their test collections, and with other specific datasets. Most of these algorithms have been evaluated in international evaluations and some of them achieved top-ranked state-of-the-art results, including top-performing results in the GIR (GeoCLEF 2007) and TG (MediaEval 2014) benchmarks.
3) The experiments reported in this PhD thesis show that the approaches can effectively combine Geographical Knowledge and NLP with Data-Driven techniques to improve the effectiveness measures of the three Geographical Information Access tasks investigated.
4) TALPGeoIR: a novel GIR approach that combines Geographical Knowledge ReRanking (GeoKR), NLP and Relevance Feedback (RF), and that achieved state-of-the-art results in official GeoCLEF benchmarks (Ferrés and Rodríguez, 2008; Mandl et al., 2008) and in later experiments (Ferrés and Rodríguez, 2015a). This approach has been evaluated with the full GeoCLEF corpus (100 topics). The evaluation showed that the GeoKR, NLP, and RF techniques, evaluated separately or in combination, improve the MAP and R-Precision results of the state-of-the-art IR algorithms TF-IDF, BM25 and InL2, with statistical significance in most of the experiments.
5) GeoTALP-QA: a scope-based GeoQA approach for Spanish and English, evaluated with a set of questions about Spanish geography (Ferrés and Rodríguez, 2006).
6) Four Textual Georeferencing approaches for informal and formal documents that achieved state-of-the-art results in evaluation benchmarks (Ferrés and Rodríguez, 2014) and in later experiments (Ferrés and Rodríguez, 2011; Ferrés and Rodríguez, 2015b).
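The abstract above reports its GIR results in MAP and R-Precision. As a reading aid, here is a minimal sketch of how these two standard measures are usually computed for ranked result lists. The function names and the toy document ids are illustrative assumptions, not taken from the thesis.

```python
from typing import Iterable, Sequence, Set, Tuple


def average_precision(ranking: Sequence[str], relevant: Set[str]) -> float:
    """Average precision of one ranked list against the relevant ids of a topic."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank  # precision at the rank of each relevant hit
    return total / len(relevant)


def r_precision(ranking: Sequence[str], relevant: Set[str]) -> float:
    """Precision within the top R results, where R is the number of relevant documents."""
    r = len(relevant)
    if r == 0:
        return 0.0
    return sum(1 for doc_id in ranking[:r] if doc_id in relevant) / r


def mean_average_precision(runs: Iterable[Tuple[Sequence[str], Set[str]]]) -> float:
    """Mean of the per-topic average precision; runs holds (ranking, relevant) pairs."""
    runs = list(runs)
    if not runs:
        return 0.0
    return sum(average_precision(rk, rel) for rk, rel in runs) / len(runs)
```

For a ranking `["d1", "d2", "d3", "d4"]` with relevant set `{"d1", "d3"}`, the relevant documents are found at ranks 1 and 3, so average precision is (1/1 + 2/3) / 2 and R-Precision looks only at the top two results.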
Postprint (published version)
Toponym Disambiguation in Information Retrieval
In recent years, geography has acquired a great importance in the context of
Information Retrieval (IR) and, in general, of the automated processing of
information in text. Mobile devices that are able to surf the web and at the
same time inform about their position are now a common reality, together
with applications that can exploit this data to provide users with locally
customised information, such as directions or advertisements. Therefore,
it is important to deal properly with the geographic information that is
included in electronic texts. Most of this information takes the form of
place names, or toponyms.
Toponym ambiguity represents an important issue in Geographical Information
Retrieval (GIR), due to the fact that queries are geographically constrained.
There has been a struggle to find specific geographical IR methods
that actually outperform traditional IR techniques. Toponym ambiguity
may be a relevant factor in the inability of current GIR systems to
take advantage of geographical knowledge. Recently, some Ph.D. theses
have dealt with Toponym Disambiguation (TD) from different perspectives,
from the development of resources for the evaluation of Toponym Disambiguation
(Leidner, 2007) to the use of TD to improve geographical scope
resolution (Andogah, 2010). The Ph.D. thesis presented here introduces
a TD method based on WordNet and carries out a detailed study of the
relationship of Toponym Disambiguation to some IR applications, such as
GIR, Question Answering (QA) and Web retrieval.
The work presented in this thesis starts with an introduction to the applications
in which TD may prove useful, together with an analysis of the
ambiguity of toponyms in news collections. It would not be possible to
study the ambiguity of toponyms without studying the resources that are
used as placename repositories; these resources are the equivalent of language
dictionaries, which provide the different meanings of a given word.
Buscaldi, D. (2010). Toponym Disambiguation in Information Retrieval [Unpublished doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/8912 Palanci
On Term Selection Techniques for Patent Prior Art Search
A patent is a set of exclusive rights granted to an inventor to protect an invention for a limited period of time. Patent prior art search involves finding previously granted patents, scientific articles, product descriptions, or any other published work that may be relevant to a new patent application. Many well-known information retrieval (IR) techniques (e.g., typical query expansion methods), which are proven effective for ad hoc search, are unsuccessful for patent prior art search. In this thesis, we mainly investigate the reasons that generic IR techniques are not effective for prior art search on the CLEF-IP test collection. First, we analyse the errors caused by data curation and by experimental settings, such as applying the International Patent Classification codes assigned to the patent topics to filter the search results. Then, we investigate the influence of term selection on retrieval performance on the CLEF-IP prior art test collection, starting with the description section of the reference patent and using language models (LM) and BM25 scoring functions. We find that an oracular relevance feedback system, which extracts terms from the judged relevant documents, far outperforms the baseline (i.e., 0.11 vs. 0.48) and performs twice as well on mean average precision (MAP) as the best participant in CLEF-IP 2010 (i.e., 0.22 vs. 0.48). We find a very clear term selection value threshold for use when choosing terms. We also notice that most of the useful feedback terms are actually present in the original query and hypothesise that the baseline system can be substantially improved by removing negative query terms. We try four simple automated approaches to identify negative terms for query reduction but we are unable to improve on the baseline performance with any of them. However, we show that a simple, minimal-feedback interactive approach, where terms are selected from only the first retrieved relevant document, outperforms the best result from CLEF-IP 2010, suggesting the promise of interactive methods for term selection in patent prior art search.
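The abstract does not define its term selection value, so the following is only a sketch of the general idea of oracle-style query reduction under an assumed score: the fraction of judged relevant documents containing a term minus the fraction of other retrieved documents containing it. The function names, the score and the default threshold are all hypothetical.

```python
from typing import Dict, List, Sequence, Set


def term_scores(
    query_terms: Sequence[str],
    relevant_docs: List[Set[str]],
    other_docs: List[Set[str]],
) -> Dict[str, float]:
    """Assumed score: share of relevant docs with the term minus share of other docs with it."""
    scores: Dict[str, float] = {}
    for term in query_terms:
        rel = sum(term in doc for doc in relevant_docs) / len(relevant_docs) if relevant_docs else 0.0
        non = sum(term in doc for doc in other_docs) / len(other_docs) if other_docs else 0.0
        scores[term] = rel - non
    return scores


def select_terms(
    query_terms: Sequence[str],
    relevant_docs: List[Set[str]],
    other_docs: List[Set[str]],
    threshold: float = 0.0,
) -> List[str]:
    """Keep only the query terms whose score exceeds the threshold (query reduction)."""
    scores = term_scores(query_terms, relevant_docs, other_docs)
    return [term for term in query_terms if scores[term] > threshold]
```

With documents represented as sets of terms, a query term that appears in every relevant document and in no other document scores 1.0 and is kept, while a term spread evenly across both groups scores 0.0 and is dropped at the default threshold.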
Rapport: a fact-based question answering system for Portuguese
Question answering is one of the longest-standing problems in natural language processing. Although natural language interfaces for computer systems have become more common these days, the same is not yet true of access to specific textual information. Any full-text search engine can easily retrieve documents containing user-specified or closely related terms; however, it is typically unable to answer user questions with small passages or short answers.
The problem with question answering is that text is hard to process, due to its syntactic structure and, to a higher degree, its semantic contents. At the sentence level, although the syntactic aspects of natural language have well-known rules, the size and complexity of a sentence may make it difficult to analyze its structure. Furthermore, semantic aspects are still arduous to address, with text ambiguity being one of the hardest problems to handle. There is also the need to correctly process the question in order to define its target, and then select and process the answers found in a text. Additionally, the selected text that may yield the answer to a given question must be further processed in order to present just a passage instead of the full text. These issues also take longer to address in languages other than English, such as Portuguese, which have far fewer people working on them.
This work focuses on question answering for Portuguese. In other words, our field of interest is the presentation of short answers, passages, and possibly full sentences, but not whole documents, in response to questions formulated in natural language. For that purpose, we have developed a system, RAPPORT, built upon open information extraction techniques for extracting triples, so-called facts, that characterize the information in text files, and then storing and using them to answer user queries posed in natural language. These facts, in the form of subject, predicate and object, alongside other metadata, constitute the basis of the answers presented by the system. Facts work both by storing short and direct information found in a text, typically entity-related information, and by containing in themselves the answers to the questions, already in the form of small passages. As for the results, although there is room for improvement, they are tangible evidence of the adequacy of our approach and its different modules for storing information and retrieving answers in question answering systems.
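To make the idea of facts as subject, predicate and object triples concrete, here is a minimal sketch of storing such triples with source metadata and matching them against a partially specified question. It is an illustration only: the `Fact` and `FactStore` names, the substring matching and the example sentences are assumptions, not RAPPORT's actual design.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Fact:
    """One extracted triple plus minimal metadata (hypothetical layout)."""
    subject: str
    predicate: str
    obj: str
    source: str = ""  # e.g. the id of the document the fact came from


class FactStore:
    """In-memory store; a real system would likely use an index or a database."""

    def __init__(self) -> None:
        self._facts: List[Fact] = []

    def add(self, fact: Fact) -> None:
        self._facts.append(fact)

    def find(
        self,
        subject: Optional[str] = None,
        predicate: Optional[str] = None,
        obj: Optional[str] = None,
    ) -> List[Fact]:
        """Return facts whose fields contain the given strings; None matches anything."""

        def matches(value: str, wanted: Optional[str]) -> bool:
            return wanted is None or wanted.lower() in value.lower()

        return [
            f for f in self._facts
            if matches(f.subject, subject)
            and matches(f.predicate, predicate)
            and matches(f.obj, obj)
        ]
```

A question such as "What is the capital of Portugal?" would then be reduced to a partial triple, for instance `find(predicate="capital", obj="Portugal")`, and the subject of the matching fact serves as the short answer.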
In the process, in addition to contributing a new approach to question answering for Portuguese and validating the application of open information extraction to question answering, we have developed a set of tools that has been used in other natural language processing work, such as a lemmatizer, LEMPORT, which was built from scratch and has a high accuracy. Many of these tools result from improving those found in the Apache OpenNLP toolkit, by pre-processing their input, post-processing their output, or both, and by training models for use in those tools or others, such as MaltParser. Other tools include interfaces for other resources containing, for example, synonyms, hypernyms, hyponyms, and lists of, for instance, relations between verbs and agents, created using rules.
The Infinite Index: Information Retrieval on Generative Text-To-Image Models
Conditional generative models such as DALL-E and Stable Diffusion generate
images based on a user-defined text, the prompt. Finding and refining prompts
that produce a desired image has become the art of prompt engineering.
Generative models do not provide a built-in retrieval model for a user's
information need expressed through prompts. In light of an extensive literature
review, we reframe prompt engineering for generative models as interactive
text-based retrieval on a novel kind of "infinite index". We apply these
insights for the first time in a case study on image generation for game design
with an expert. Finally, we envision how active learning may help to guide the
retrieval of generated images.
Comment: Final version for CHIIR 202
Geographic information extraction from texts
A large volume of unstructured texts, containing valuable geographic information, is available online. This information (whether provided implicitly or explicitly) is useful not only for scientific studies (e.g., spatial humanities) but also for many practical applications (e.g., geographic information retrieval). Although considerable progress has been achieved in geographic information extraction from texts, there are still unsolved challenges and issues, ranging from methods, systems, and data, to applications and privacy. Therefore, this workshop will provide a timely opportunity to discuss recent advances, new ideas, and concepts, and also to identify research gaps in geographic information extraction.
Foundation, Implementation and Evaluation of the MorphoSaurus System: Subword Indexing, Lexical Learning and Word Sense Disambiguation for Medical Cross-Language Information Retrieval
In everyday medical practice, which involves a great deal of documentation and search work, the majority of textually encoded information is now available electronically. The development of powerful methods for efficient retrieval is therefore of prime importance.
Judged from the perspective of medical sublanguage, common text retrieval systems lack morphological functionality (inflection, derivation and compounding), lexical-semantic functionality, and the ability to analyse large document collections across languages.
This doctoral thesis covers the theoretical foundations of the MorphoSaurus system (an acronym for morpheme thesaurus). Its methodological core is a thesaurus organised around morphemes of medical expert and lay language, whose entries are linked across languages by semantic relations. Building on this, a procedure is presented that segments (complex) words into morphemes, which are then replaced by language-independent, concept-class-like symbols. The resulting representation is the basis for cross-language, morpheme-oriented text retrieval.
In addition to the core technology, a method for the automatic acquisition of lexicon entries is presented, by which existing morpheme lexicons are extended to further languages. Taking cross-language phenomena into account then leads to a novel approach for resolving semantic ambiguities.
The performance of morpheme-oriented text retrieval is tested empirically in extensive, standardised evaluations and compared with common approaches.
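The segmentation step described above, where words are split into morphemes that are replaced by language-independent symbols, can be illustrated with a toy sketch. Greedy longest-match segmentation is an assumption made here for brevity and is not claimed to be the MorphoSaurus procedure; the lexicon entries and the `#stomach` / `#inflamm` symbols are invented for the example.

```python
from typing import Dict, List

# Toy subword lexicon (hypothetical): subwords from different languages map to
# shared concept symbols; an empty string marks an affix that carries no concept.
LEXICON: Dict[str, str] = {
    "gastr": "#stomach",
    "itis": "#inflamm",
    "magen": "#stomach",
    "entzuend": "#inflamm",  # ASCII transliteration of the German stem
    "ung": "",
}


def segment(word: str, lexicon: Dict[str, str] = LEXICON) -> List[str]:
    """Greedy longest-match segmentation into concept symbols (illustrative only)."""
    word = word.lower()
    symbols: List[str] = []
    i = 0
    while i < len(word):
        # Try the longest candidate substring starting at position i first.
        for j in range(len(word), i, -1):
            symbol = lexicon.get(word[i:j])
            if symbol is not None:
                if symbol:  # skip affixes that have no concept symbol
                    symbols.append(symbol)
                i = j
                break
        else:
            i += 1  # no known subword starts here; skip one character
    return symbols
```

Under this toy lexicon, the English "gastritis" and the German "Magenentzuendung" are reduced to the same symbol sequence, which is what allows a query in one language to match documents in another.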