2,845 research outputs found

    A Vertical PRF Architecture for Microblog Search

    Full text link
    In microblog retrieval, query expansion can be essential to obtain good search results due to the short size of queries and posts. Since information in microblogs is highly dynamic, an up-to-date index coupled with pseudo-relevance feedback (PRF) with an external corpus has a higher chance of retrieving more relevant documents and improving ranking. In this paper, we focus on the research question:how can we reduce the query expansion computational cost while maintaining the same retrieval precision as standard PRF? Therefore, we propose to accelerate the query expansion step of pseudo-relevance feedback. The hypothesis is that using an expansion corpus organized into verticals for expanding the query, will lead to a more efficient query expansion process and improved retrieval effectiveness. Thus, the proposed query expansion method uses a distributed search architecture and resource selection algorithms to provide an efficient query expansion process. Experiments on the TREC Microblog datasets show that the proposed approach can match or outperform standard PRF in MAP and NDCG@30, with a computational cost that is three orders of magnitude lower.Comment: To appear in ICTIR 201

    Anticipating Information Needs Based on Check-in Activity

    Full text link
    In this work we address the development of a smart personal assistant that is capable of anticipating a user's information needs based on a novel type of context: the person's activity inferred from her check-in records on a location-based social network. Our main contribution is a method that translates a check-in activity into an information need, which is in turn addressed with an appropriate information card. This task is challenging because of the large number of possible activities and related information needs, which need to be addressed in a mobile dashboard that is limited in size. Our approach considers each possible activity that might follow after the last (and already finished) activity, and selects the top information cards such that they maximize the likelihood of satisfying the user's information needs for all possible future scenarios. The proposed models also incorporate knowledge about the temporal dynamics of information needs. Using a combination of historical check-in data and manual assessments collected via crowdsourcing, we show experimentally the effectiveness of our approach.Comment: Proceedings of the 10th ACM International Conference on Web Search and Data Mining (WSDM '17), 201

    Updating collection representations for federated search

    Get PDF
    To facilitate the search for relevant information across a set of online distributed collections, a federated information retrieval system typically represents each collection, centrally, by a set of vocabularies or sampled documents. Accurate retrieval is therefore related to how precise each representation reflects the underlying content stored in that collection. As collections evolve over time, collection representations should also be updated to reflect any change, however, a current solution has not yet been proposed. In this study we examine both the implications of out-of-date representation sets on retrieval accuracy, as well as proposing three different policies for managing necessary updates. Each policy is evaluated on a testbed of forty-four dynamic collections over an eight-week period. Our findings show that out-of-date representations significantly degrade performance overtime, however, adopting a suitable update policy can minimise this problem

    Grid Databases for Shared Image Analysis in the MammoGrid Project

    Full text link
    The MammoGrid project aims to prove that Grid infrastructures can be used for collaborative clinical analysis of database-resident but geographically distributed medical images. This requires: a) the provision of a clinician-facing front-end workstation and b) the ability to service real-world clinician queries across a distributed and federated database. The MammoGrid project will prove the viability of the Grid by harnessing its power to enable radiologists from geographically dispersed hospitals to share standardized mammograms, to compare diagnoses (with and without computer aided detection of tumours) and to perform sophisticated epidemiological studies across national boundaries. This paper outlines the approach taken in MammoGrid to seamlessly connect radiologist workstations across a Grid using an "information infrastructure" and a DICOM-compliant object model residing in multiple distributed data stores in Italy and the UKComment: 10 pages, 5 figure

    Ranking for Web Data Search Using On-The-Fly Data Integration

    Get PDF
    Ranking - the algorithmic decision on how relevant an information artifact is for a given information need and the sorting of artifacts by their concluded relevancy - is an integral part of every search engine. In this book we investigate how structured Web data can be leveraged for ranking with the goal to improve the effectiveness of search. We propose new solutions for ranking using on-the-fly data integration and experimentally analyze and evaluate them against the latest baselines

    Application of Semantics to Solve Problems in Life Sciences

    Get PDF
    Fecha de lectura de Tesis: 10 de diciembre de 2018La cantidad de información que se genera en la Web se ha incrementado en los últimos años. La mayor parte de esta información se encuentra accesible en texto, siendo el ser humano el principal usuario de la Web. Sin embargo, a pesar de todos los avances producidos en el área del procesamiento del lenguaje natural, los ordenadores tienen problemas para procesar esta información textual. En este cotexto, existen dominios de aplicación en los que se están publicando grandes cantidades de información disponible como datos estructurados como en el área de las Ciencias de la Vida. El análisis de estos datos es de vital importancia no sólo para el avance de la ciencia, sino para producir avances en el ámbito de la salud. Sin embargo, estos datos están localizados en diferentes repositorios y almacenados en diferentes formatos que hacen difícil su integración. En este contexto, el paradigma de los Datos Vinculados como una tecnología que incluye la aplicación de algunos estándares propuestos por la comunidad W3C tales como HTTP URIs, los estándares RDF y OWL. Haciendo uso de esta tecnología, se ha desarrollado esta tesis doctoral basada en cubrir los siguientes objetivos principales: 1) promover el uso de los datos vinculados por parte de la comunidad de usuarios del ámbito de las Ciencias de la Vida 2) facilitar el diseño de consultas SPARQL mediante el descubrimiento del modelo subyacente en los repositorios RDF 3) crear un entorno colaborativo que facilite el consumo de Datos Vinculados por usuarios finales, 4) desarrollar un algoritmo que, de forma automática, permita descubrir el modelo semántico en OWL de un repositorio RDF, 5) desarrollar una representación en OWL de ICD-10-CM llamada Dione que ofrezca una metodología automática para la clasificación de enfermedades de pacientes y su posterior validación haciendo uso de un razonador OWL

    Temporal Information Models for Real-Time Microblog Search

    Get PDF
    Real-time search in Twitter and other social media services is often biased towards the most recent results due to the “in the moment” nature of topic trends and their ephemeral relevance to users and media in general. However, “in the moment”, it is often difficult to look at all emerging topics and single-out the important ones from the rest of the social media chatter. This thesis proposes to leverage on external sources to estimate the duration and burstiness of live Twitter topics. It extends preliminary research where itwas shown that temporal re-ranking using external sources could indeed improve the accuracy of results. To further explore this topic we pursued three significant novel approaches: (1) multi-source information analysis that explores behavioral dynamics of users, such as Wikipedia live edits and page view streams, to detect topic trends and estimate the topic interest over time; (2) efficient methods for federated query expansion towards the improvement of query meaning; and (3) exploiting multiple sources towards the detection of temporal query intent. It differs from past approaches in the sense that it will work over real-time queries, leveraging on live user-generated content. This approach contrasts with previous methods that require an offline preprocessing step
    corecore