Overview of the TREC 2014 Federated Web Search Track
The TREC Federated Web Search track facilitates research on topics related to federated web search by providing a large, realistic data collection sampled from a multitude of online search engines. The FedWeb 2013 challenges of Resource Selection and Results Merging are again included in FedWeb 2014, and we additionally introduced the task of Vertical Selection. Other new aspects are the required link between Resource Selection and Results Merging, and the importance of diversity in the merged results. After an overview of the new data collection and relevance judgments, the individual participants’ results for the tasks are introduced, analyzed, and compared.
Overview of the TREC 2013 Federated Web Search Track
The TREC Federated Web Search track is intended to promote research related to federated search in a realistic web setting, and to this end provides a large data collection gathered from a series of online search engines. This overview paper discusses the results of the first edition of the track, FedWeb 2013. The focus was on two basic challenges in federated search: (1) resource selection and (2) results merging. After an overview of the provided data collection and the relevance judgments for the test topics, the participants’ individual approaches and results on both tasks are discussed. Promising research directions and an outlook on the 2014 edition of the track are provided as well.
Real Time Web Search Framework for Performing Efficient Retrieval of Data
With the rapidly growing amount of information on the internet, real-time systems are one of the key strategies for coping with information overload and helping users find highly relevant information. Real-time events and domain-specific information are important knowledge-base references on the Web that are frequently accessed by millions of users. A real-time system is vital to a product, and to be reliable it must address challenges such as short data life cycles, heterogeneous user interests, strict time constraints, and context-dependent article relevance. Since real-time data have only a short time to live, real-time models have to be adapted continuously so that the data they serve are always up to date. This manuscript focuses on designing a real-time web search approach that aggregates several web search algorithms at query time to tune search results for relevance. We learn a context-aware delegation algorithm that chooses the best real-time algorithm for each query request. The evaluation showed that the proposed approach outperforms traditional models because it can adapt to the specific properties of the considered real-time resources. In the experiments, we found that it is highly relevant for the most recently searched queries, consistent in its performance, and resilient to the drawbacks faced by other algorithms.
Combining heterogeneous sources in an interactive multimedia content retrieval model
Interactive multimodal information retrieval (IMIR) systems extend the capabilities of traditional search systems by adding the ability to retrieve information of different types (modes) and from different sources. This article describes a formal model for interactive multimodal information retrieval. The model includes formal and general definitions of each component of an IMIR system. The model is validated in a use case on sports information retrieval, through a prototype that implements a subset of the model's features. Adaptive techniques applied to the retrieval functionality of IMIR systems have been defined by analysing past interactions using decision trees, neural networks, and clustering techniques. The model includes a strategy for selecting sources and combining the results obtained from every source. After modifying the prototype's source selection strategy, the system is reevaluated using classification techniques. This work was partially supported by the eGovernAbility-Access project (TIN2014-52665-C2-2-R).
Exploiting Social Media Sources for Search, Fusion and Evaluation
The web contains heterogeneous information that is generated with different characteristics and is presented via different media. Social media, as one of the largest content carriers, gathers information from millions of users worldwide, who rapidly create material in many forms, such as comments, images, tags, videos, and ratings. In social applications, the formation of online communities contributes to conversations covering substantially broader aspects, as well as unfiltered opinions about subjects that are rarely covered in public media. Information accrued on social platforms, therefore, presents a unique opportunity to augment web sources such as Wikipedia or news pages, which are usually characterized as being more formal. The goal of this dissertation is to investigate in depth how social data can be exploited and applied in the context of three fundamental information retrieval (IR) tasks: search, fusion, and evaluation. Improving search performance has consistently been a major focus in the IR community. Given the in-depth discussions and active interactions contained in social media, we present approaches to incorporating this type of data to improve search on general web corpora. In particular, we propose two graph-based frameworks, social anchor and information network, to associate related web and social content, where information sources of diverse characteristics can be used to complement each other in a unified manner. We investigate how the enriched representation can potentially reduce vocabulary mismatch and improve retrieval effectiveness. Presenting social media content to users is particularly valuable for queries about time-sensitive events or community opinions. Current major search engines commonly blend results from different search services (or verticals) into core web results. Motivated by this real-world need, we explore ways to merge results from different web and social services into a single ranked list.
We present an optimization framework for fusion, in which the impact of documents, ranked lists, and verticals can be modeled simultaneously to maximize performance. Evaluating search system performance in IR has largely relied on creating reusable test collections. Traditional ways of creating evaluation sets can require substantial manual effort. To reduce this effort, we explore an approach to automating the collection of pairs of queries and relevance judgments, using a high-quality form of social media, Community Question Answering (CQA). Our approach is based on the idea that CQA services provide platforms for users to raise questions and share answers, and therefore encode associations between real user information needs and real user assessments. To demonstrate the effectiveness of our approaches, we conduct extensive retrieval and fusion experiments and verify the reliability of the new CQA-based evaluation test sets.
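As a point of reference for merging ranked lists from several services into one list, the sketch below uses reciprocal rank fusion (RRF), a simple and widely used baseline. It is not the optimization framework of the dissertation, whose details the abstract does not give. The function name, the example document identifiers, and the smoothing constant k=60 are assumptions chosen for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge several ranked lists of document ids into one ranked list.

    Each document receives 1 / (k + rank) from every list it appears in;
    documents are returned in order of decreasing total score.
    k is a smoothing constant (60 is a common default, assumed here).
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Break score ties by document id so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical result lists from a web vertical and a social vertical.
web_results = ["a", "b", "c"]
social_results = ["b", "d", "a"]
merged = reciprocal_rank_fusion([web_results, social_results])
```

Documents returned by both services rise to the top because their per-list contributions add up. A learned fusion model would instead weight documents, lists, and verticals differently.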
New approaches to interactive multimedia content retrieval from different sources
International Mention in the doctoral degree.
Interactive Multimodal Information Retrieval (IMIR) systems extend the capabilities of traditional search systems with the ability to retrieve information of different types (modes) and from different sources. The growth of online content, together with the diversification of the means of access to information (phones, tablets, smart watches), drives a growing need for this type of system.
This thesis defines a formal model for describing interactive multimodal information retrieval systems that query several information retrieval engines. The model includes a formal and general definition of each component of an IMIR system, namely: multimodal information organized in collections, the multimodal query, the different retrieval engines, a source management system (handler), a results management module (fusion), and user interactions.
This model has been validated in two scenarios. The first is a use case focused on information retrieval about sports. A prototype that implements a subset of the features of the model has been developed: a multimodal collection that is semantically related, three types of multimodal queries (text, audio, and text + image), six different retrieval engines (question answering, full-text search, ontology-based search, OCR in images, object detection in images, and audio transcription), a source selection strategy based on rules defined by experts, a strategy for combining results, and the recording of user interactions.
NDCG (normalized discounted cumulative gain) has been used to compare the results obtained by each retrieval engine. These results are: 10.1% (question answering), 80% (full-text search), and 26.8% (ontology search).
These results are in line with state-of-the-art work reported at forums such as CLEF (Cross-Language Evaluation Forum). When the retrieval engines are combined, retrieval performance improves, with percentage gains of 771.4% for question answering, 7.2% for full-text search, and 145.5% for ontology search.
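For readers unfamiliar with the measure, the following is a minimal sketch of how NDCG can be computed for one ranked list. It assumes the common log2 rank discount and linear gains, and it builds the ideal ranking only from the gains present in the list. The thesis may use a different variant, and the function names are illustrative.

```python
import math


def dcg(gains):
    """Discounted cumulative gain: the gain at rank r is divided by log2(r + 1)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg(gains, k=None):
    """NDCG in [0, 1] for a ranked list of graded relevance gains.

    The ideal ordering is obtained by sorting the same gains in decreasing
    order (an assumption; a full evaluation would use all judged documents).
    If k is given, only the top k ranks are scored.
    """
    ideal_dcg = dcg(sorted(gains, reverse=True)[:k])
    if ideal_dcg == 0:
        return 0.0  # no relevant document in the list
    return dcg(gains[:k]) / ideal_dcg


# A list already in ideal order scores 1.0; misplacing the only relevant
# document at rank 2 scores 1 / log2(3).
perfect = ndcg([3, 2, 1])
misplaced = ndcg([0, 1])
```

Averaging this per-query value over all test queries gives the percentage figures of the kind reported above.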
The second scenario is a prototype that retrieves information from social media in the health domain. The prototype is based on the proposed model and integrates user-generated health information from social media, knowledge bases, the query, retrieval engines, a source selection module, a results combination module, and a GUI. In addition, the documents included in the retrieval system have previously been annotated by a process that extracts semantic information in the health domain.
Several adaptation techniques applied to the retrieval functionality of an IMIR system have also been defined by analyzing past interactions using decision trees, neural networks, and clustering.
After modifying the source selection strategy (handler), the system has been reevaluated using classification techniques. The same queries and user relevance judgments from the sports domain prototype were used for this evaluation.
This evaluation compares the normalized discounted cumulative gain (NDCG) obtained with two approaches: the multimodal system using predefined rules, and the same multimodal system once its functionality has been adapted using past user interactions. NDCG changed by between -2.92% and 2.81%, depending on the approach used. We considered three features to classify the approaches: (i) the classification algorithm; (ii) the query features; and (iii) the scores used to compute the order of the retrieval engines. The best result is obtained using a probability-based classification algorithm, a retrieval engine ranking generated with the Averaged-Position score (the mean position of the first relevant result), and the mode, type, length, and entities of the query. Its NDCG value is 81.54%.
Official Doctoral Program in Computer Science and Technology. Chair: Ana García Serrano. Secretary: María Belén Ruiz Mezcua. Member: Davide Buscald
A user-centred approach to information retrieval
A user model is a fundamental component in user-centred information retrieval systems. It enables personalization of a user's search experience. The development of such a model involves three phases: collecting information about each user, representing such information, and integrating the model into a retrieval application. Progress in this area is typically met with privacy and scalability challenges that hinder the ability to synthesize collective knowledge from each user's search behaviour. In this thesis, I propose a framework that addresses each of these three phases. The proposed framework is based on social role theory from the social science literature and at the centre of this theory is the concept of a social position. A social position is a label for a group of users with similar behavioural patterns. Examples of such positions are traveller, patient, movie fan, and computer scientist. In this thesis, a social position acts as a label for users who are expected to have similar interests. The proposed framework does not require real users' data; rather it uses the web as a resource to model users.
The proposed framework offers a data-driven and modular design for each of the three phases of building a user model. First, I present an approach to identify social positions from natural language sentences. I formulate this task as a binary classification task and develop a method to enumerate candidate social positions. The proposed classifier achieves an accuracy score of 85.8%, which indicates that social positions can be identified with good accuracy. Through an inter-annotator agreement study, I further show a reasonable level of agreement between users when identifying social positions.
Second, I introduce a novel topic modelling-based approach to represent each social position as a multinomial distribution over words. This approach estimates a topic from a document collection for each position. To construct such a collection for a particular position, I propose a seeding algorithm that extracts a set of terms relevant to the social position. Coherence-based evaluation shows that the proposed approach learns significantly more coherent representations when compared with a relevance modelling baseline.
Third, I present a diversification approach based on the proposed framework. Diversification algorithms aim to return a result list for a search query that would potentially satisfy users with diverse information needs. I propose to identify social positions that are relevant to a search query. These positions act as an implicit representation of the many possible interpretations of the search query. Then, relevant positions are provided to a diversification technique that proportionally diversifies results based on each social position's importance. I evaluate my approach using four test collections provided by the diversity task of the Text REtrieval Conference (TREC) web tracks for 2009, 2010, 2011, and 2012. Results demonstrate that my proposed diversification approach is effective and provides statistically significant improvements over various implicit diversification approaches.
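To illustrate what diversifying "proportionally" to each position's importance can look like, the sketch below fills a result list with a Sainte-Laguë style seat allocation. This is a stand-in chosen for illustration, not the diversification technique used in the thesis, which the abstract does not specify. It assumes each candidate document is tagged with exactly one social position, and the function name, weights, and document ids are hypothetical.

```python
def proportional_diversify(candidates, position_weights, k):
    """Pick up to k documents so that each position's share tracks its weight.

    candidates: list of (doc_id, position) pairs, best-first by relevance.
    position_weights: dict mapping position -> positive importance weight.
    Candidates whose position has no weight are ignored.
    """
    seats = {p: 0 for p in position_weights}
    remaining = list(candidates)
    selected = []
    while remaining and len(selected) < k:
        # Positions that still have an unselected document.
        available = sorted({p for _, p in remaining if p in seats})
        if not available:
            break
        # Sainte-Lague quotient: weight / (2 * seats_already_won + 1).
        best = max(available, key=lambda p: position_weights[p] / (2 * seats[p] + 1))
        # Take the most relevant remaining document for the winning position.
        idx = next(i for i, (_, p) in enumerate(remaining) if p == best)
        doc_id, _ = remaining.pop(idx)
        selected.append(doc_id)
        seats[best] += 1
    return selected


# Hypothetical query whose relevant positions are "traveller" and "patient".
docs = [("d1", "traveller"), ("d2", "traveller"), ("d3", "traveller"),
        ("d4", "patient"), ("d5", "patient")]
weights = {"traveller": 0.7, "patient": 0.3}
top4 = proportional_diversify(docs, weights, k=4)
```

With these weights the top four results contain three "traveller" documents and one "patient" document, close to the 70/30 split, while the less important position still appears near the top of the list.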
Fourth, I introduce a session-based search system under the framework of learning to rank. Such a system aims to improve the retrieval performance for a search query using previous user interactions during the search session. I present a method to match a search session to its most relevant social positions based on the session's interaction data. I then suggest identifying related sessions from query logs that are likely to have been issued by users with similar information needs. Novel learning features are then estimated from the session's social positions, related sessions, and interaction data. I evaluate the proposed system using four test collections from the TREC session track. This approach achieves state-of-the-art results compared with effective session-based search systems. I demonstrate that this strong performance is mainly attributable to features derived from the social positions' data.