
    The Archive Query Log: Mining Millions of Search Result Pages of Hundreds of Search Engines from 25 Years of Web Archives

    The Archive Query Log (AQL) is a previously unused, comprehensive query log collected at the Internet Archive over the last 25 years. Its first version includes 356 million queries, 166 million search result pages, and 1.7 billion search results across 550 search providers. Although many query logs have been studied in the literature, the search providers that own them generally do not publish their logs, to protect user privacy and vital business data. Of the few query logs publicly available, none combines size, scope, and diversity. The AQL is the first to do so, enabling research on new retrieval models and (diachronic) search engine analyses. Provided in a privacy-preserving manner, it promotes open research as well as more transparency and accountability in the search industry. Comment: SIGIR 2023 resource paper, 13 pages
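    The AQL's scale invites programmatic processing. As a minimal sketch, the snippet below tallies queries per search provider from a record dump; the JSONL layout and the field name "provider" are assumptions for illustration, not the AQL's documented schema.

```python
# Hypothetical AQL-style consumer: the JSONL format and the "provider"
# field name are assumptions, not the published schema.
import json
from collections import Counter

def count_queries_per_provider(path):
    """Tally queries by search provider from a JSONL dump."""
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            counts[record["provider"]] += 1
    return counts

# counts = count_queries_per_provider("aql-sample.jsonl")
# print(counts.most_common(10))
```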

    When temporal expressions help to detect vital documents related to an entity

    In this paper we aim at filtering, from a document stream, documents containing timely relevant information about an entity (e.g., a person, a place, an organization). These documents, which we call vital documents, provide relevant and fresh information about the entity. The approach we propose leverages the temporal information reflected by the temporal expressions in the document in order to infer its vitality. Experiments carried out on the 2013 TREC Knowledge Base Acceleration (KBA) collection show the effectiveness of our approach compared to state-of-the-art ones.
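    The underlying intuition can be sketched compactly: a document whose temporal expressions refer to the period around its publication date is more likely to be vital. The year-level regex and the scoring function below are illustrative assumptions, not the paper's exact model.

```python
# Illustrative sketch, not the paper's exact model: score a document's
# "freshness" as the share of year mentions matching the publication year.
import re
from datetime import date

YEAR_RE = re.compile(r"\b(19|20)\d{2}\b")

def freshness_score(text: str, pub_date: date) -> float:
    """Fraction of year expressions referring to the publication year."""
    years = [int(m.group(0)) for m in YEAR_RE.finditer(text)]
    if not years:
        return 0.0
    return sum(1 for y in years if y == pub_date.year) / len(years)

# A 2013 news item that mostly discusses 2013 events scores higher than
# one dwelling on older dates; a threshold on this score decides vitality.
print(freshness_score("In 2013 the firm tripled the revenue it made in 2001.",
                      date(2013, 3, 1)))  # 0.5
```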

    Modeling Temporal Evidence from External Collections

    Newsworthy events are broadcast through multiple mediums and prompt crowds to produce comments on social media. In this paper, we propose to leverage these behavioral dynamics to estimate the most relevant time periods for an event (i.e., query). Recent advances have shown how to improve the estimation of the temporal relevance of such topics. In this approach, we build on two major novelties. First, we mine temporal evidence from hundreds of external sources into topic-based external collections to improve the robustness of the detection of relevant time periods. Second, we propose a formal retrieval model that generalizes the use of the temporal dimension across different aspects of the retrieval process. In particular, we show that temporal evidence from external collections can be used to (i) infer a topic's temporal relevance, (ii) select the query expansion terms, and (iii) re-rank the final results for improved precision. Experiments with TREC Microblog collections show that the proposed time-aware retrieval model makes an effective and extensive use of the temporal dimension to improve search results over the most recent temporal models. Interestingly, we observe a strong correlation between precision and the temporal distribution of retrieved and relevant documents. Comment: To appear in WSDM 2017
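    A hedged sketch of the re-ranking idea follows: a temporal prior is estimated from timestamps mined in external collections, then log-linearly combined with the retrieval score. The histogram prior and the value of alpha are assumptions for illustration, not the paper's formal model.

```python
# Hedged sketch of time-aware re-ranking via a temporal prior mined from
# external collections; not the paper's formal model.
import math
from collections import Counter

def temporal_prior(external_timestamps):
    """Empirical day-level distribution of activity in external sources."""
    counts = Counter(ts.date() for ts in external_timestamps)
    total = sum(counts.values())
    return {day: c / total for day, c in counts.items()}

def rerank(scored_docs, prior, alpha=0.7, eps=1e-9):
    """Sort by alpha * log(retrieval score) + (1 - alpha) * log(prior)."""
    return sorted(
        scored_docs,  # each doc: {"id": ..., "score": ..., "date": ...}
        key=lambda d: alpha * math.log(d["score"] + eps)
                      + (1 - alpha) * math.log(prior.get(d["date"], eps)),
        reverse=True,
    )
```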

    Temporal Information Models for Real-Time Microblog Search

    Real-time search in Twitter and other social media services is often biased towards the most recent results due to the “in the moment” nature of topic trends and their ephemeral relevance to users and media in general. However, “in the moment”, it is often difficult to survey all emerging topics and single out the important ones from the rest of the social media chatter. This thesis proposes to leverage external sources to estimate the duration and burstiness of live Twitter topics. It extends preliminary research in which it was shown that temporal re-ranking using external sources could indeed improve the accuracy of results. To explore this topic further, we pursued three novel approaches: (1) multi-source information analysis that explores the behavioral dynamics of users, such as Wikipedia live edits and page view streams, to detect topic trends and estimate topic interest over time; (2) efficient methods for federated query expansion towards improving the representation of query meaning; and (3) exploiting multiple sources towards the detection of temporal query intent. This work differs from past approaches in that it operates over real-time queries, leveraging live user-generated content, whereas previous methods require an offline preprocessing step.
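    One ingredient named above, detecting bursts in an external behavioral signal such as a Wikipedia page-view series, can be sketched with a standard trailing z-score detector; this detector is an assumption for illustration, not taken verbatim from the thesis.

```python
# Sketch of burst detection in an external behavioral signal (e.g., a
# Wikipedia page-view series); a standard trailing z-score detector.
from statistics import mean, stdev

def burst_days(views, window=7, threshold=2.0):
    """Indices where views exceed the trailing mean by `threshold` stdevs."""
    bursts = []
    for i in range(window, len(views)):
        past = views[i - window:i]
        mu, sigma = mean(past), stdev(past)
        if sigma > 0 and (views[i] - mu) / sigma > threshold:
            bursts.append(i)
    return bursts

print(burst_days([100, 98, 110, 95, 102, 99, 101, 480, 470, 120]))  # [7, 8]
```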

    A Vertical PRF Architecture for Microblog Search

    In microblog retrieval, query expansion can be essential to obtaining good search results due to the short size of queries and posts. Since information in microblogs is highly dynamic, an up-to-date index coupled with pseudo-relevance feedback (PRF) on an external corpus has a higher chance of retrieving more relevant documents and improving ranking. In this paper, we focus on the research question: how can we reduce the computational cost of query expansion while maintaining the same retrieval precision as standard PRF? To that end, we propose to accelerate the query expansion step of pseudo-relevance feedback. The hypothesis is that using an expansion corpus organized into verticals will lead to a more efficient query expansion process and improved retrieval effectiveness. Thus, the proposed query expansion method uses a distributed search architecture and resource selection algorithms to provide an efficient query expansion process. Experiments on the TREC Microblog datasets show that the proposed approach can match or outperform standard PRF in MAP and NDCG@30, at a computational cost three orders of magnitude lower. Comment: To appear in ICTIR 2017
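    An illustrative sketch of vertical PRF: rank verticals by how well they cover the query, run feedback only on the selected verticals, and expand with the top co-occurring terms. The scoring functions below are simplified assumptions; the paper's actual resource selection algorithms are not reproduced here.

```python
# Simplified vertical PRF sketch: vertical selection by query coverage,
# then term-frequency-based expansion from feedback documents.
from collections import Counter

def select_verticals(query_terms, vertical_term_freqs, k=2):
    """Keep the k verticals whose term statistics best cover the query."""
    def coverage(stats):
        return sum(stats.get(t, 0) for t in query_terms)
    ranked = sorted(vertical_term_freqs,
                    key=lambda v: coverage(vertical_term_freqs[v]),
                    reverse=True)
    return ranked[:k]

def expand_query(query_terms, feedback_docs, n_terms=5):
    """Append the most frequent non-query terms from feedback documents."""
    counts = Counter(t for doc in feedback_docs for t in doc
                     if t not in query_terms)
    return list(query_terms) + [t for t, _ in counts.most_common(n_terms)]
```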

    Filtrage et agrégation d'informations vitales relatives à des entités (Filtering and aggregation of vital information related to entities)

    Nowadays, knowledge bases such as Wikipedia and DBpedia are the main sources for accessing information on a wide variety of entities (an entity is a thing that can be distinctly identified, such as a person, an organization, a product, an event, etc.). However, updating these sources with new information about a given entity is done manually by contributors, with significant latency, particularly if the entity is not popular. A system that analyzes documents as they are published on the Web to filter important information about entities would likely accelerate the update of these knowledge bases. In this thesis, we are interested in filtering timely and relevant information, called vital information, about entities; this work falls within information retrieval but also aims to enrich knowledge engineering techniques by helping select the information to process. We address two questions: (1) How to detect whether a document is vital (i.e., provides timely, relevant information) for an entity? and (2) How to extract vital information from these documents to build a temporal summary of the entity that can serve as a reference for updating the corresponding knowledge base entry? For the first question, we proposed two methods: the first is fully supervised and based on a vitality language model; the second measures the freshness of the temporal expressions in a document to decide its vitality. For the second question, we proposed a method that selects sentences based on the presence of trigger words retrieved automatically from the knowledge already represented in the knowledge base (such as the descriptions of similar entities). We carried out our experiments on the 2013 and 2014 TREC Stream corpus, comprising 1.2 billion documents and different types of entities (persons, organizations, facilities, and events). For vital document filtering, we conducted our experiments in the context of the Knowledge Base Acceleration (KBA) task for 2013 and 2014; our method based on the temporal expressions in the document obtained good results, outperforming the best participant system in KBA 2013. For vital information extraction, we relied on the experimental framework of the Temporal Summarization (TS) task and showed that our generated temporal summaries help minimize the latency of knowledge base updates.
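    The sentence-selection step lends itself to a minimal sketch: keep sentences that mention the entity and contain at least one trigger word harvested from the knowledge base (e.g., from descriptions of similar entities). The sentence splitting and matching below are simplified assumptions, not the thesis's exact method.

```python
# Simplified trigger-word sentence selection; `triggers` is assumed to be
# a set of lowercase words harvested from the knowledge base.
import re

def select_vital_sentences(document, entity, triggers):
    """Return sentences mentioning `entity` plus at least one trigger word."""
    sentences = re.split(r"(?<=[.!?])\s+", document)
    selected = []
    for s in sentences:
        tokens = {w.lower() for w in re.findall(r"\w+", s)}
        if entity.lower() in s.lower() and tokens & triggers:
            selected.append(s)
    return selected

# select_vital_sentences(doc, "ACME Corp", {"acquired", "appointed", "launched"})
```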

    Combining heterogeneous sources in an interactive multimedia content retrieval model

    Interactive multimodal information retrieval (IMIR) systems increase the capabilities of traditional search systems by adding the ability to retrieve information of different types (modes) and from different sources. This article describes a formal model for interactive multimodal information retrieval, including formal and widespread definitions of each component of an IMIR system. A use case focused on sports information retrieval validates the model through a prototype that implements a subset of the model's features. Adaptive techniques applied to the retrieval functionality of IMIR systems were defined by analysing past interactions using decision trees, neural networks, and clustering techniques. The model includes a strategy for selecting sources and combining the results obtained from each source. After modifying the prototype's source selection strategy, the system was re-evaluated using classification techniques. This work was partially supported by the eGovernAbility-Access project (TIN2014-52665-C2-2-R).
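    The result-combination step can be sketched as a weighted CombSUM-style fusion over normalised per-source scores. The normalisation and weighting scheme below are assumptions for illustration; the article defines its own combination strategy.

```python
# Weighted CombSUM-style fusion sketch: per-source max-normalisation,
# then a weighted sum per document across sources.
from collections import defaultdict

def fuse(results_by_source, weights):
    """results_by_source: {source: [(doc_id, score), ...]} -> fused ranking."""
    fused = defaultdict(float)
    for source, results in results_by_source.items():
        if not results:
            continue
        top = max(score for _, score in results) or 1.0  # avoid division by 0
        for doc_id, score in results:
            fused[doc_id] += weights.get(source, 1.0) * (score / top)
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)
```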