268 research outputs found
Query Expansion with Locally-Trained Word Embeddings
Continuous space word embeddings have received a great deal of attention in
the natural language processing and machine learning communities for their
ability to model term similarity and other relationships. We study the use of
term relatedness in the context of query expansion for ad hoc information
retrieval. We demonstrate that word embeddings such as word2vec and GloVe, when
trained globally, underperform corpus and query specific embeddings for
retrieval tasks. These results suggest that other tasks benefiting from global
embeddings may also benefit from local embeddings
Term Association Modelling in Information Retrieval
Many traditional Information Retrieval (IR) models assume that query terms are independent of each other. For those models, a document is normally represented as a bag of words/terms and their frequencies. Although traditional retrieval models can achieve reasonably good performance in many applications, the corresponding independence assumption has limitations. There are some recent studies that investigate how to model term associations/dependencies by proximity measures. However, the modeling of term associations theoretically under the probabilistic retrieval framework is still largely unexplored.
In this thesis, I propose a new concept named Cross Term, to model term proximity, with the aim of boosting retrieval performance. With Cross Terms, the association of multiple query terms can be modeled in the same way as a simple unigram term. In particular, an occurrence of a query term is assumed to have an impact on its neighboring text. The degree of the query term impact gradually weakens with increasing distance from the place of occurrence. Shape functions are used to characterize such impacts. Based on this assumption, I first propose a bigram CRoss TErm Retrieval (CRTER2) model for probabilistic IR and a Language model based model CRTER2LM. Specifically, a bigram Cross Term occurs when the corresponding query terms appear close to each other, and its impact can be modeled by the intersection of the respective shape functions of the query terms. Second, I propose a generalized n-gram CRoss TErm Retrieval (CRTERn) model recursively for n query terms where n>2. For n-gram Cross Term, I develop several distance metrics with different properties and employ them in the proposed models for ranking. Third, an enhanced context-sensitive proximity model is proposed to boost the CRTER models, where the contextual relevance of term proximity is studied. The models are validated on several large standard data sets, and show improved performance over other state-of-art approaches. I also discusse the practical impact of the proposed models. The approaches in this thesis can also provide helpful benefit for term association modeling in other domains
A visual analytics tool for the incremental evaluation of search engines
In the field of information retrieval (IR), it is of fundamental importance the activity of evaluating the performance of an information retrieval system (IRS), by means of specific metrics. For this reason, we developed AVIATOR, a visual analytics tool capable of incrementally indexing a document collection in order to help IR experts to automatically obtain the values for the evaluation metrics, so that they can be compared dynamically as the indexing process proceeds
Observing Users - Designing clarity a case study on the user-centred design of a cross-language information retrieval system
This paper presents a case study of the development of an interface to a novel and complex form of document retrieval: searching for texts written in foreign languages based on native language queries. Although the underlying technology for achieving such a search is relatively well understood, the appropriate interface design is not. A study involving users (with such searching needs) from the start of the design process is described covering initial examination of user needs and tasks; preliminary
design and testing of interface components; building, testing, and further refining an interface; before
finally conducting usability tests of the system. Lessons are learned at every stage of the process leading to a much more informed view of how such an interface should be built
Execution Performance Issues in Full-Text Information Retrieval
The task of an information retrieval system is to identify documents that will satisfy a user’s information need. Effective fulfillment of this task has long been an active area of research, leading to sophisticated retrieval models for representing information content in documents and queries and measuring similarity between the two. The maturity and proven effectiveness of these systems has resulted in demand for increased capacity, performance, scalability, and functionality, especially as information retrieval is integrated into more traditional database management environments. In this dissertation we explore a number of functionality and performance issues in information retrieval. First, we consider creation and modification of the document collection, concentrating on management of the inverted file index. An inverted file architecture based on a persistent object store is described and experimental results are presented for inverted file creation and modification. Our architecture provides performance that scales well with document collection size and the database features supported by the persistent object store provide many solutions to issues that arise during integration of information retrieval into vii more general database environments. We then turn to query evaluation speed and introduce a new optimization technique for statistical ranking retrieval systems that support structured queries. Experimental results from a variety of query sets show that execution time can be reduced by more than 50% with no noticeable impact on retrieval effectiveness, making these more complex retrieval models attractive alternatives for environments that demand high performance
A Robust Linguistic Platform for Efficient and Domain specific Web Content Analysis
Web semantic access in specific domains calls for specialized search engines
with enhanced semantic querying and indexing capacities, which pertain both to
information retrieval (IR) and to information extraction (IE). A rich
linguistic analysis is required either to identify the relevant semantic units
to index and weight them according to linguistic specific statistical
distribution, or as the basis of an information extraction process. Recent
developments make Natural Language Processing (NLP) techniques reliable enough
to process large collections of documents and to enrich them with semantic
annotations. This paper focuses on the design and the development of a text
processing platform, Ogmios, which has been developed in the ALVIS project. The
Ogmios platform exploits existing NLP modules and resources, which may be tuned
to specific domains and produces linguistically annotated documents. We show
how the three constraints of genericity, domain semantic awareness and
performance can be handled all together
- …