Overview of the TREC 2013 Federated Web Search Track
The TREC Federated Web Search track is intended to promote research related to federated search in a realistic web setting, and to this end provides a large data collection gathered from a series of online search engines. This overview paper discusses the results of the first edition of the track, FedWeb 2013. The focus was on two basic challenges in federated search: (1) resource selection, and (2) results merging. After an overview of the provided data collection and the relevance judgments for the test topics, the participants’ individual approaches and results on both tasks are discussed. Promising research directions and an outlook on the 2014 edition of the track are provided as well.
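As an illustration of the results merging task named above, the following is a minimal sketch that merges ranked lists from several search engines after min-max score normalization. It is not the method of any track participant; the function name `merge_results`, the input layout, and the choice of min-max normalization are assumptions made for this example.

```python
from typing import Dict, List, Tuple


def merge_results(
    per_engine: Dict[str, List[Tuple[str, float]]]
) -> List[Tuple[str, float]]:
    """Merge per-engine (doc_id, score) lists into one ranked list.

    Scores are min-max normalized within each engine so that engines
    with different score scales become comparable (an assumption of
    this sketch, not something the overview paper prescribes).
    """
    merged: Dict[str, float] = {}
    for results in per_engine.values():
        if not results:
            continue
        scores = [score for _, score in results]
        lo, hi = min(scores), max(scores)
        span = hi - lo
        for doc_id, score in results:
            # If all scores are equal, treat every result as top-scored.
            norm = (score - lo) / span if span > 0 else 1.0
            # Keep the best normalized score when a document is
            # returned by more than one engine.
            merged[doc_id] = max(merged.get(doc_id, 0.0), norm)
    # Highest normalized score first; ties broken by document id.
    return sorted(merged.items(), key=lambda kv: (-kv[1], kv[0]))


if __name__ == "__main__":
    lists = {
        "engine_a": [("d1", 10.0), ("d2", 5.0)],
        "engine_b": [("d3", 0.9), ("d1", 0.1)],
    }
    print(merge_results(lists))
```

Resource selection would happen before this step, by choosing which engines to query at all; the sketch assumes that choice has already been made.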
Explicit diversification of event aspects for temporal summarization
During major events, such as emergencies and disasters, a large volume of information is reported on newswire and social media platforms. Temporal summarization (TS) approaches are used to automatically produce concise overviews of such events by extracting text snippets from related articles over time. Current TS approaches rely on a combination of event relevance and textual novelty for snippet selection. However, for events that span multiple days, textual novelty is often a poor criterion for selecting snippets, since many snippets are textually unique but semantically redundant or non-informative. In this article, we propose a framework for the diversification of snippets using explicit event aspects, building on recent work in search result diversification. In particular, we first propose two techniques to identify explicit aspects that a user might want to see covered in a summary for different types of event. We then extend a state-of-the-art explicit diversification framework to maximize the coverage of these aspects when selecting summary snippets for unseen events. Through experimentation over the TREC TS 2013, 2014, and 2015 datasets, we show that explicit diversification for temporal summarization significantly outperforms classical novelty-based diversification, as the use of explicit event aspects reduces the number of redundant and off-topic snippets returned, while also increasing summary timeliness.
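The abstract does not name the explicit diversification framework it extends. As a hedged illustration of the general idea, the sketch below uses an xQuAD-style greedy selection, one well-known explicit diversification scheme: each step picks the snippet that best balances relevance against coverage of aspects not yet covered. All names (`diversify`, `coverage`, `aspect_weight`, `lam`) and the toy values are assumptions for this example.

```python
from typing import Dict, List


def diversify(
    relevance: Dict[str, float],
    coverage: Dict[str, Dict[str, float]],
    aspect_weight: Dict[str, float],
    k: int,
    lam: float = 0.5,
) -> List[str]:
    """Greedy xQuAD-style selection of up to k snippets.

    relevance[s]      -- relevance of snippet s to the event
    coverage[s][a]    -- degree (0..1) to which snippet s covers aspect a
    aspect_weight[a]  -- importance of aspect a
    lam               -- trade-off between relevance and diversity
    """
    selected: List[str] = []
    # not_covered[a]: how much of aspect a is still uncovered so far.
    not_covered = {a: 1.0 for a in aspect_weight}
    candidates = set(relevance)

    def gain(s: str) -> float:
        diversity = sum(
            aspect_weight[a] * coverage.get(s, {}).get(a, 0.0) * not_covered[a]
            for a in aspect_weight
        )
        return (1.0 - lam) * relevance[s] + lam * diversity

    while candidates and len(selected) < k:
        # sorted() makes tie-breaking deterministic.
        best = max(sorted(candidates), key=gain)
        selected.append(best)
        candidates.remove(best)
        # Discount aspects that the chosen snippet already covers.
        for a in aspect_weight:
            not_covered[a] *= 1.0 - coverage.get(best, {}).get(a, 0.0)
    return selected


if __name__ == "__main__":
    rel = {"s1": 0.9, "s2": 0.85, "s3": 0.5}
    cov = {
        "s1": {"casualties": 1.0},
        "s2": {"casualties": 1.0},
        "s3": {"location": 1.0},
    }
    weights = {"casualties": 0.5, "location": 0.5}
    print(diversify(rel, cov, weights, k=2))
```

In this toy input, the second snippet chosen is the less relevant `s3`, because `s2` repeats an aspect that `s1` already covers.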
An Evaluation of Contextual Suggestion
This thesis examines techniques that can be used to evaluate systems that solve the complex task of suggesting points of interest to users. A traveller visiting an unfamiliar, foreign city might be looking for a place to have fun in the last few hours before returning home. Our traveller might browse various search engines and travel websites to find something of interest; however, this process is time-consuming, and the visitor may want to find a suggestion quickly.
We will consider the type of system that is able to handle this complex request in such a way that the user is satisfied. Because the type of suggestion one person wants will differ from the type another person wants, we will consider systems that incorporate some level of personalization. In this work we will develop user profiles that are based on real users and set up experiments in which many research groups can participate, competing to develop the best techniques for implementing this kind of system. These systems will suggest attractions to visit in various US cities to many users.
This thesis is divided into two stages. During the first stage we will look at what information will go into our user profiles and what information we need to know about the users in order to decide whether they would visit an attraction. The second stage will be deciding how to evaluate the suggestions that various systems make in order to determine which system is able to make the best suggestions.
Combining heterogeneous sources in an interactive multimedia content retrieval model
Interactive multimodal information retrieval (IMIR) systems increase the capabilities of traditional search systems by adding the ability to retrieve information of different types (modes) and from different sources. This article describes a formal model for interactive multimodal information retrieval. This model includes formal and widespread definitions of each component of an IMIR system. A use case that focuses on information retrieval regarding sports validates the model, by developing a prototype that implements a subset of the features of the model. Adaptive techniques applied to the retrieval functionality of IMIR systems have been defined by analysing past interactions using decision trees, neural networks, and clustering techniques. This model includes a strategy for selecting sources and combining the results obtained from every source. After modifying the strategy of the prototype for selecting sources, the system is reevaluated using classification techniques. This work was partially supported by the eGovernAbility-Access project (TIN2014-52665-C2-2-R).
Filtering News from Document Streams: Evaluation Aspects and Modeled Stream Utility
Events like hurricanes, earthquakes, or accidents can impact a large number of people. Not only are people in the immediate vicinity of the event affected, but concerns about their well-being are shared by the local government and well-wishers across the world. The latest information about news events could be of use to government and aid agencies in order to make informed decisions on providing necessary support, security and relief. The general public avails of news updates via dedicated news feeds or broadcasts, and lately, via social media services like Facebook or Twitter. Retrieving the latest information about newsworthy events from the world-wide web is thus of importance to a large section of society. As new content on a multitude of topics is continuously being published on the web, specific event-related information needs to be filtered from the resulting stream of documents.
In this thesis, we present a user-centric evaluation measure for evaluating systems that filter news-related information from document streams. Our proposed evaluation measure, Modeled Stream Utility (MSU), models users accessing information from a stream of sentences produced by a news update filtering system. The user model allows for simulating a large number of users with different characteristic stream browsing behavior. Through simulation, MSU estimates the utility of a system for an average user browsing a stream of sentences. Our results show that system performance is sensitive to a user population's stream browsing behavior and that existing evaluation metrics correspond to very specific types of user behavior.
To evaluate systems that filter sentences from a document stream, we need a set of judged sentences. This judged set is a subset of all the sentences returned by all systems, and is typically constructed by pooling together the highest quality sentences, as determined by the scores each system assigns to its sentences. Sentences in the pool are manually assessed and the resulting set of judged sentences is then used to compute system performance metrics. In this thesis, we investigate the effect on system performance evaluation of including duplicates of judged sentences in the judged set. We also develop an alternative pooling methodology that, given the MSU user model, selects sentences for pooling based on the probability of a sentence being read by modeled users.
Our research lays the foundation for interesting future work on utilizing user models in different aspects of the evaluation of stream filtering systems. The MSU measure enables incorporation of different user models. Furthermore, the applicability of MSU could be extended through calibration based on user behavior.
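The conventional pooling step described in this abstract, taking each system's highest-scored sentences and sending their union to assessors, can be sketched as follows. This is a minimal illustration of score-based depth-k pooling, not the thesis's alternative MSU-based pooling; `build_pool`, `depth`, and the data layout are assumptions for this example.

```python
from typing import Dict, List, Set, Tuple


def build_pool(
    runs: Dict[str, List[Tuple[str, float]]], depth: int
) -> Set[str]:
    """Return the union of each system's `depth` highest-scored sentences.

    runs maps a system name to its (sentence_id, system_score) pairs.
    """
    pool: Set[str] = set()
    for scored in runs.values():
        # Rank this system's sentences by its own score, best first.
        top = sorted(scored, key=lambda pair: pair[1], reverse=True)[:depth]
        pool.update(sentence_id for sentence_id, _ in top)
    return pool


if __name__ == "__main__":
    runs = {
        "system_a": [("s1", 0.9), ("s2", 0.4), ("s3", 0.1)],
        "system_b": [("s3", 0.8), ("s4", 0.7), ("s1", 0.2)],
    }
    print(sorted(build_pool(runs, depth=2)))
```

Sentences outside the pool stay unjudged, which is why the choice of pooling criterion can affect the resulting performance metrics.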
Design and Evaluation of Temporal Summarization Systems
Temporal Summarization (TS) is a new track introduced as part of the Text REtrieval Conference (TREC) in 2013. This track aims to develop systems which can return important updates related to an event over time. In TREC 2013, the TS track specifically used disaster-related events such as earthquakes, hurricanes, and bombings. This thesis mainly focuses on building an effective TS system by using a combination of Information Retrieval techniques. The developed TS system returns updates on disaster-related events in a timely manner.
By participating in TREC 2013 and with experiments conducted after TREC, we examine the effectiveness of techniques such as distributional similarity for term expansion, which can be employed in building TS systems. Also, this thesis describes the effectiveness of other techniques such as stemming, adaptive sentence selection over time and de-duplication in our system, by comparing it with other baseline systems.
The second part of the thesis examines the current methodology used for evaluating TS systems. We propose a modified evaluation method which could reduce the manual effort of assessors, and which also correlates well with the official track’s evaluation. We also propose a supervised learning based evaluation method, which correlates well with the official track’s evaluation of systems and could save the assessor’s time by as much as 80%.
Neural Methods for Effective, Efficient, and Exposure-Aware Information Retrieval
Neural networks with deep architectures have demonstrated significant
performance improvements in computer vision, speech recognition, and natural
language processing. The challenges in information retrieval (IR), however, are
different from these other application areas. A common form of IR involves
ranking of documents--or short passages--in response to keyword-based queries.
Effective IR systems must deal with the query-document vocabulary mismatch problem,
by modeling relationships between different query and document terms and how
they indicate relevance. Models should also consider lexical matches when the
query contains rare terms--such as a person's name or a product model
number--not seen during training, and should avoid retrieving semantically related
but irrelevant results. In many real-life IR tasks, the retrieval involves
extremely large collections--such as the document index of a commercial Web
search engine--containing billions of documents. Efficient IR methods should
take advantage of specialized IR data structures, such as the inverted index, to
efficiently retrieve from large collections. Given an information need, the IR
system also mediates how much exposure an information artifact receives by
deciding whether it should be displayed, and where it should be positioned,
among other results. Exposure-aware IR systems may optimize for additional
objectives, besides relevance, such as parity of exposure for retrieved items
and content publishers. In this thesis, we present novel neural architectures
and methods motivated by the specific needs and challenges of IR tasks.
Comment: PhD thesis, University College London (2020)
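The inverted index mentioned in this abstract maps each term to the documents that contain it, so candidate documents can be found without scanning the whole collection. The following is a minimal in-memory sketch, assuming whitespace tokenization and lowercase matching; `build_index` and `candidates` are hypothetical names for this example and do not come from the thesis.

```python
from collections import defaultdict
from typing import Dict, List, Set


def build_index(docs: Dict[str, str]) -> Dict[str, Set[str]]:
    """Map each lowercased term to the set of ids of documents containing it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def candidates(index: Dict[str, Set[str]], query: str) -> List[str]:
    """Return ids of documents that contain at least one query term."""
    hits: Set[str] = set()
    for term in query.lower().split():
        # .get avoids inserting unseen terms into the defaultdict.
        hits |= index.get(term, set())
    return sorted(hits)


if __name__ == "__main__":
    docs = {
        "d1": "neural ranking models",
        "d2": "inverted index structures",
        "d3": "neural index",
    }
    idx = build_index(docs)
    print(candidates(idx, "neural index"))
```

A production index would also store term frequencies and positions and compress its posting lists; the sketch keeps only document ids to show the lookup pattern.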
A user-centred approach to information retrieval
A user model is a fundamental component in user-centred information retrieval systems. It enables personalization of a user's search experience. The development of such a model involves three phases: collecting information about each user, representing such information, and integrating the model into a retrieval application. Progress in this area is typically met with privacy and scalability challenges that hinder the ability to synthesize collective knowledge from each user's search behaviour. In this thesis, I propose a framework that addresses each of these three phases. The proposed framework is based on social role theory from the social science literature and at the centre of this theory is the concept of a social position. A social position is a label for a group of users with similar behavioural patterns. Examples of such positions are traveller, patient, movie fan, and computer scientist. In this thesis, a social position acts as a label for users who are expected to have similar interests. The proposed framework does not require real users' data; rather it uses the web as a resource to model users.
The proposed framework offers a data-driven and modular design for each of the three phases of building a user model. First, I present an approach to identify social positions from natural language sentences. I formulate this task as a binary classification task and develop a method to enumerate candidate social positions. The proposed classifier achieves an accuracy score of 85.8%, which indicates that social positions can be identified with good accuracy. Through an inter-annotator agreement study, I further show a reasonable level of agreement between users when identifying social positions.
Second, I introduce a novel topic modelling-based approach to represent each social position as a multinomial distribution over words. This approach estimates a topic from a document collection for each position. To construct such a collection for a particular position, I propose a seeding algorithm that extracts a set of terms relevant to the social position. Coherence-based evaluation shows that the proposed approach learns significantly more coherent representations when compared with a relevance modelling baseline.
Third, I present a diversification approach based on the proposed framework. Diversification algorithms aim to return a result list for a search query that would potentially satisfy users with diverse information needs. I propose to identify social positions that are relevant to a search query. These positions act as an implicit representation of the many possible interpretations of the search query. Then, relevant positions are provided to a diversification technique that proportionally diversifies results based on each social position's importance. I evaluate my approach using four test collections provided by the diversity task of the Text REtrieval Conference (TREC) web tracks for 2009, 2010, 2011, and 2012. Results demonstrate that my proposed diversification approach is effective and provides statistically significant improvements over various implicit diversification approaches.
Fourth, I introduce a session-based search system under the framework of learning to rank. Such a system aims to improve the retrieval performance for a search query using previous user interactions during the search session. I present a method to match a search session to its most relevant social positions based on the session's interaction data. I then suggest identifying related sessions from query logs that are likely to be issued by users with similar information needs. Novel learning features are then estimated from the session's social positions, related sessions, and interaction data. I evaluate the proposed system using four test collections from the TREC session track. This approach achieves state-of-the-art results compared with effective session-based search systems. I demonstrate that this strong performance is mainly attributable to features that are derived from social positions' data.
Learning representations for Information Retrieval
Information retrieval is concerned, among other things, with answering questions such as: is a document relevant to a query?
Are two queries or two documents similar? How can the similarity between two queries or documents be used to improve
the estimation of relevance? To answer these questions, each document and query must be associated with representations that a computer can interpret.
Once these representations are estimated, similarity can correspond, for example, to a distance or a divergence operating in the representation space.
It is generally accepted that the quality of a representation has a direct impact on the estimation error with respect to the true relevance, as judged by a human.
Estimating good representations of documents and queries has long been a central problem in information retrieval.
The goal of this thesis is to propose new methods for estimating the representations of documents and queries and the relevance relationship between them, and thereby to modestly advance the state of the art in the field.
We present four articles published in international conferences and one article published in an evaluation forum. The first two articles concern methods that build the representation space from a priori knowledge of the features that matter for the task at hand. These lead us to present a new information retrieval model that differs from existing models both theoretically and in experimental effectiveness. The last two articles mark a fundamental change in the approach to building representations. In particular, they benefit from the recent research interest in deep learning with neural networks. These learning models automatically elicit the features that are important for the target task from a large amount of data. We are interested in modeling the semantic relationships between documents and queries, as well as between two or more queries. These last articles are among the first applications of representation learning with neural networks to information retrieval. The proposed models also achieved improved performance on standard test collections. Our work leads us to the following general conclusion: performance in information retrieval could be drastically improved by relying on representation learning approaches.
Information retrieval is generally concerned with answering questions such as: is this document relevant to this query?
How similar are two queries or two documents?
How can query and document similarity be used to enhance relevance estimation?
In order to answer these questions, it is necessary to access computational representations of documents and queries.
For example, similarities between documents and queries may correspond to a distance or a divergence defined on the representation space.
It is generally assumed that the quality of the representation has a direct impact on the bias with respect to the true similarity, estimated by means of human intervention.
Building useful representations for documents and queries has always been central to information retrieval research.
The goal of this thesis is to provide new ways of estimating such representations and the relevance relationship between them.
We present four articles that have been published in international conferences and one published in an information retrieval evaluation
forum. The first two articles can be categorized as feature engineering approaches, which transduce a priori knowledge about the domain into the features of the representation.
We present a novel retrieval model that compares favorably to existing models in terms of both theoretical originality and experimental effectiveness.
The remaining two articles mark a significant change in our vision and originate from the widespread interest in deep learning research that took place during the time they were written.
Therefore, they naturally belong to the category of representation learning approaches, also known as feature learning. Unlike previous approaches, the learning model discovers on its own the most important features for the task at hand, given a considerable amount of labeled data. We propose to model the semantic relationships between documents and queries and between queries themselves.
The models presented have also shown improved effectiveness on standard test collections. These last articles are amongst the first applications of representation learning with neural networks for information retrieval. This series of research leads to the following observation: future improvements in information retrieval effectiveness will have to rely on representation learning techniques instead of manually defining the representation space.
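The abstract notes that similarity between documents and queries may correspond to a distance or a divergence on the representation space. As a hedged illustration, the sketch below computes cosine similarity between two dense vectors and the Kullback-Leibler divergence between two term distributions. The vectors and distributions are toy values, and both function names are assumptions for this example rather than the thesis's models.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention for this sketch: zero vectors match nothing
    return dot / (norm_u * norm_v)


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) for discrete distributions over the same vocabulary.

    Assumes q[i] > 0 wherever p[i] > 0 (for example, after smoothing).
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


if __name__ == "__main__":
    query_vec = [1.0, 0.0, 1.0]
    doc_vec = [1.0, 1.0, 0.0]
    print(cosine_similarity(query_vec, doc_vec))

    query_lm = [0.5, 0.5, 0.0]
    doc_lm = [0.4, 0.4, 0.2]
    print(kl_divergence(query_lm, doc_lm))
```

Whether the vectors are engineered by hand or learned by a neural network, the comparison step has this same shape; what changes is how the representations are obtained.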