
    Better contextual suggestions in ClueWeb12 using domain knowledge inferred from the open web

    Proceedings of the 23rd Text REtrieval Conference (TREC 2014), held in Gaithersburg, Maryland, USA, in 2014. This paper provides an overview of our participation in the Contextual Suggestion Track. The TREC 2014 Contextual Suggestion Track allowed participants to submit personalized rankings using documents either from the Open Web or from an archived, static Web collection, the ClueWeb12 dataset. In this paper, we focus on filtering the entire ClueWeb12 collection to exploit domain knowledge from touristic sites available on the Open Web. We show that the recommendations generated for the provided user profiles and contexts improve significantly when using this inferred domain knowledge. This research was supported by the Netherlands Organization for Scientific Research (NWO project #640.005.001).
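
    As a rough illustration of this kind of collection filtering, a minimal sketch that keeps a ClueWeb12 document when its host appears in a whitelist of touristic sites harvested from the Open Web, or when it mentions the location context. The host list and function names are illustrative assumptions, not the authors' actual pipeline:

        from urllib.parse import urlparse

        # Hypothetical whitelist of touristic hosts harvested from the
        # Open Web; the entries are illustrative.
        TOURIST_HOSTS = {"www.yelp.com", "www.tripadvisor.com", "www.nps.gov"}

        def keep_candidate(doc_url: str, doc_text: str, city: str) -> bool:
            """Keep a document if its host is known to be touristic or if
            the document mentions the target location context."""
            host = urlparse(doc_url).netloc.lower()
            return host in TOURIST_HOSTS or city.lower() in doc_text.lower()

        # Usage: filter (url, text) pairs for the context "Gaithersburg".
        docs = [("http://www.yelp.com/biz/some-cafe", "A cozy cafe ..."),
                ("http://example.org/blog", "Unrelated post")]
        candidates = [d for d in docs if keep_candidate(d[0], d[1], "Gaithersburg")]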

    What Makes a Top-Performing Precision Medicine Search Engine? Tracing Main System Features in a Systematic Way

    From 2017 to 2019 the Text REtrieval Conference (TREC) held a challenge task on precision medicine using documents from medical publications (PubMed) and clinical trials. Despite the many performance measurements carried out in these evaluation campaigns, the scientific community still lacks a clear understanding of the impact that individual system features and their weights have on overall system performance. To close this explanatory gap, we first determined optimal feature configurations using the Sequential Model-based Algorithm Configuration (SMAC) program and applied its output to a BM25-based search engine. We then ran an ablation study to systematically assess the individual contributions of the relevant system features: BM25 parameters, query type and weighting schema, query expansion, stop word filtering, and keyword boosting. We evaluated the effectiveness of the different features on the gold standard data from the three TREC-PM installments, using the commonly shared infNDCG metric. Comment: Accepted for SIGIR 2020, 10 pages.
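
    For reference, the "BM25 parameters" tuned in such a configuration search are the term-frequency saturation k1 and the length-normalization b. A minimal sketch of the standard BM25 term weight follows; the defaults shown are conventional values, not the paper's tuned configuration:

        import math

        def bm25_term_weight(tf, df, doc_len, avg_len, n_docs, k1=1.2, b=0.75):
            """Standard BM25 weight of one query term in one document.
            k1 controls term-frequency saturation, b controls document-length
            normalization -- the two knobs a configuration search such as
            SMAC would tune."""
            idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
            return idf * tf * (k1 + 1.0) / (tf + k1 * (1.0 - b + b * doc_len / avg_len))

        # Example: a term occurring 3 times in a 150-word document.
        print(bm25_term_weight(tf=3, df=50, doc_len=150, avg_len=200, n_docs=10000))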

    When temporal expressions help to detect vital documents related to an entity

    In this paper we aim at filtering, from a document stream, documents that contain timely relevant information about an entity (e.g., a person, a place, an organization). These documents, which we call vital documents, provide relevant and fresh information about the entity. The approach we propose leverages the temporal information reflected by the temporal expressions in a document in order to infer its vitality. Experiments carried out on the TREC 2013 Knowledge Base Acceleration (KBA) collection show the effectiveness of our approach compared to state-of-the-art methods.
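
    As a rough illustration of the intuition, a minimal sketch of a freshness score computed from already-extracted temporal expressions; the scoring function and its scale are illustrative assumptions, not the paper's model:

        from datetime import date

        def freshness_score(doc_dates, stream_date):
            """Score a document by how close its temporal expressions are
            to the stream (processing) date; recent mentions suggest the
            document carries vital, fresh content."""
            if not doc_dates:
                return 0.0
            gaps = [abs((stream_date - d).days) for d in doc_dates]
            return 1.0 / (1.0 + min(gaps))  # 1.0 when a date matches the stream day

        # A document mentioning "October 3, 2013", processed on October 4, 2013.
        print(freshness_score([date(2013, 10, 3)], date(2013, 10, 4)))  # 0.5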

    The Archive Query Log: Mining Millions of Search Result Pages of Hundreds of Search Engines from 25 Years of Web Archives

    The Archive Query Log (AQL) is a previously unused, comprehensive query log collected at the Internet Archive over the last 25 years. Its first version includes 356 million queries, 166 million search result pages, and 1.7 billion search results across 550 search providers. Although many query logs have been studied in the literature, the search providers that own them generally do not publish their logs, to protect user privacy and vital business data. Of the few query logs publicly available, none combines size, scope, and diversity. The AQL is the first to do so, enabling research on new retrieval models and (diachronic) search engine analyses. Provided in a privacy-preserving manner, it promotes open research as well as more transparency and accountability in the search industry. Comment: SIGIR 2023 resource paper, 13 pages.

    Temporal Information Models for Real-Time Microblog Search

    Real-time search in Twitter and other social media services is often biased towards the most recent results due to the “in the moment” nature of topic trends and their ephemeral relevance to users and media in general. However, “in the moment”, it is often difficult to examine all emerging topics and single out the important ones from the rest of the social media chatter. This thesis proposes to leverage external sources to estimate the duration and burstiness of live Twitter topics. It extends preliminary research in which it was shown that temporal re-ranking using external sources could indeed improve the accuracy of results. To further explore this topic we pursued three significant novel approaches: (1) multi-source information analysis that explores behavioral dynamics of users, such as Wikipedia live edits and page view streams, to detect topic trends and estimate topic interest over time; (2) efficient methods for federated query expansion towards improving query meaning; and (3) exploiting multiple sources towards detecting temporal query intent. This work differs from past approaches in that it operates over real-time queries, leveraging live user-generated content. This contrasts with previous methods that require an offline preprocessing step.
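
    As a rough illustration of one signal described above (topic interest estimated from Wikipedia page-view streams), a minimal burst test; the window and threshold are illustrative assumptions, not the thesis's estimator:

        def is_bursting(page_views, window=24):
            """Flag a topic as bursting when the latest hourly page-view
            count clearly exceeds the recent average -- a simple stand-in
            for the multi-source topic-interest estimates described above."""
            if len(page_views) <= window:
                return False
            recent = page_views[-window:]
            baseline = sum(recent[:-1]) / (window - 1)
            return recent[-1] > 2.0 * baseline  # threshold is illustrative

        views = [100] * 48 + [450]  # sudden jump in the latest hour
        print(is_bursting(views))   # True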

    Improving Contextual Suggestions using Open Web Domain Knowledge

    Also published online by CEUR Workshop Proceedings (CEUR-WS.org, ISSN 1613-0073). Contextual suggestion aims at recommending items to users given their current context, such as location-based tourist recommendations. Our contextual suggestion ranking model consists of two main components: selecting candidate suggestions and providing a ranked list of personalized suggestions. We focus on selecting appropriate suggestions from the ClueWeb12 collection using tourist domain knowledge inferred from social sites and resources available on the public Web (Open Web). Specifically, we generate two candidate subsets retrieved from the ClueWeb12 collection: one by filtering the content on mentions of the location context, and one by integrating domain knowledge derived from the Open Web. The impact of these candidate selection methods on contextual suggestion effectiveness is analyzed using the test collection constructed for the TREC Contextual Suggestion Track in 2014. Our first main finding is that contextual suggestion performance on the subset created using Open Web domain knowledge is significantly better than on the subset created using only geographical information. Second, using a prior probability estimated from domain knowledge leads to better suggestions and improves performance.
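
    One way to read "using a prior probability estimated from domain knowledge" is a language-modeling style combination, log P(d|q) ∝ log P(q|d) + log P(d). A minimal sketch under that assumption; the scores and priors are illustrative, not the paper's estimates:

        import math

        def rank_with_prior(query_likelihoods, priors):
            """Combine a query-likelihood score with a document prior
            estimated from Open Web domain knowledge:
            log P(d|q) ~ log P(q|d) + log P(d)."""
            scored = {d: math.log(ql) + math.log(priors.get(d, 1e-6))
                      for d, ql in query_likelihoods.items()}
            return sorted(scored, key=scored.get, reverse=True)

        # Illustrative: doc B has a lower query likelihood but a strong
        # tourist-domain prior (e.g. its host is a known attraction site).
        ql = {"A": 0.020, "B": 0.015}
        prior = {"A": 0.01, "B": 0.30}
        print(rank_with_prior(ql, prior))  # ['B', 'A']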

    Filtering and Aggregating Vital Information about Entities (Filtrage et agrégation d'informations vitales relatives à des entités)

    Nowadays, knowledge bases such as Wikipedia and DBpedia are the main sources for accessing information on a wide variety of entities (an entity is a thing that can be distinctly identified, such as a person, an organization, a product, an event, etc.). However, these sources are updated with new information about a given entity manually, by contributors, and with significant latency, particularly if the entity is not popular. A system that analyzes documents as they are published on the Web to filter important information about entities would likely accelerate the update of these knowledge bases. In this thesis, we are interested in filtering timely and relevant information, called vital information, concerning entities. We aim at answering the following two questions: (1) How to detect whether a document is vital (i.e., provides timely, relevant information) for an entity? and (2) How to extract vital information from these documents to build a temporal summary of the entity that can serve as a reference for updating the corresponding knowledge base entry? Regarding the first question, we proposed two methods. The first is fully supervised and is based on a vitality language model. The second measures the freshness of the temporal expressions in a document to decide its vitality. Concerning the second question, we proposed a method that selects sentences based on the presence of trigger words automatically retrieved from the knowledge already represented in the knowledge base (such as the descriptions of similar entities). We carried out our experiments on the TREC Stream corpus of 2013 and 2014, with 1.2 billion documents and different types of entities (persons, organizations, facilities, and events). For the vital document filtering approaches, we conducted our experiments in the context of the Knowledge Base Acceleration (KBA) task for the years 2013 and 2014. Our method based on leveraging the temporal expressions in a document obtained good results, outperforming the best participating system in the KBA 2013 task. For the contributions concerning the extraction of vital information about entities, we relied on the experimental framework of the Temporal Summarization (TS) task and showed that our generated temporal summaries can reduce the latency of knowledge base updates.
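
    As a rough illustration of the trigger-word sentence selection step, a minimal sketch that assumes the trigger terms have already been mined from the knowledge base; the word list and splitting rule are illustrative assumptions:

        import re

        def select_vital_sentences(text, triggers):
            """Keep the sentences that contain at least one trigger word
            mined from the knowledge base (e.g. from the descriptions of
            similar entities)."""
            sentences = re.split(r"(?<=[.!?])\s+", text)
            return [s for s in sentences
                    if any(t in s.lower() for t in triggers)]

        triggers = {"announced", "died", "elected", "acquired"}  # illustrative
        doc = "The company acquired a rival. The weather was mild."
        print(select_vital_sentences(doc, triggers))
        # ['The company acquired a rival.']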

    Overview of the TREC 2014 Federated Web Search Track

    The TREC Federated Web Search track facilitates research in topics related to federated web search by providing a large, realistic data collection sampled from a multitude of online search engines. The FedWeb 2013 tasks of Resource Selection and Results Merging are again included in FedWeb 2014, and we additionally introduced the task of Vertical Selection. Other new aspects are the required link between Resource Selection and Results Merging, and the importance of diversity in the merged results. After an overview of the new data collection and relevance judgments, the individual participants' results for the tasks are presented, analyzed, and compared.
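
    As a rough illustration of the Results Merging task, a minimal sketch that min-max normalizes each engine's scores into [0, 1] before interleaving; this is a generic technique, not a track baseline:

        def merge_results(engine_results):
            """Merge ranked lists from several engines after min-max
            normalizing each engine's scores, so that scores from
            engines with incompatible scales become comparable."""
            merged = []
            for results in engine_results:      # results: [(doc_id, score), ...]
                scores = [s for _, s in results]
                lo, hi = min(scores), max(scores)
                span = (hi - lo) or 1.0
                merged += [(d, (s - lo) / span) for d, s in results]
            return sorted(merged, key=lambda x: x[1], reverse=True)

        # Two engines with very different score scales.
        e1 = [("d1", 12.0), ("d2", 3.0)]
        e2 = [("d3", 0.9), ("d4", 0.1)]
        print(merge_results([e1, e2]))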

    ir_metadata: An Extensible Metadata Schema for IR Experiments

    The information retrieval (IR) community has a strong tradition of making computational artifacts and resources available for future reuse, allowing the validation of experimental results. Besides the actual test collections, the underlying run files are often hosted in data archives as part of conferences like TREC, CLEF, or NTCIR. Unfortunately, the run data itself does not provide much information about the underlying experiment. For instance, a single run file is not of much use without the context of the shared task's website or the run data archive. In other domains, like the social sciences, it is good practice to annotate research data with metadata. In this work, we introduce ir_metadata, an extensible metadata schema for TREC run files based on the PRIMAD model. We propose to align the metadata annotations to PRIMAD, which considers the components of computational experiments that can affect reproducibility. Furthermore, we outline important components and information that should be reported in the metadata and give evidence from the literature. To demonstrate the usefulness of these metadata annotations, we implement new features in repro_eval that support the outlined metadata schema for the use case of reproducibility studies. Additionally, we curate a dataset of run files derived from experiments with different instantiations of PRIMAD components and annotate these with the corresponding metadata. In the experiments, we cover reproducibility experiments that are identified by the metadata and classified by PRIMAD. With this work, we enable IR researchers to annotate TREC run files and improve the reuse value of experimental artifacts even further. Comment: Resource paper.
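
    As a rough illustration of the idea, a minimal sketch that prepends PRIMAD-oriented metadata to a TREC run file as a comment header. The key names below are illustrative placeholders, not the normative ir_metadata schema, and no repro_eval API is used:

        # Illustrative PRIMAD-style annotation; the keys are placeholders,
        # see the ir_metadata schema for the normative field names.
        metadata = {
            "platform": "Linux 5.15, Python 3.10",         # P: platform
            "research goal": "BM25 reproducibility study",  # R: research goal
            "implementation": "bm25-baseline v1.0",         # I: implementation
            "method": "BM25, k1=1.2, b=0.75",               # M: method
            "actor": "anonymous researcher",                # A: actor
            "data": "TREC test collection (illustrative)",  # D: data
        }

        with open("run.txt", "w") as f:     # tiny run file, for the demo only
            f.write("q1 Q0 doc1 1 12.5 myrun\n")

        with open("run.txt") as f:
            run = f.read()
        with open("run_annotated.txt", "w") as f:
            for key, value in metadata.items():
                f.write(f"# {key}: {value}\n")  # metadata as comment header
            f.write(run)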

    Learning to Rank Question Answer Pairs with Holographic Dual LSTM Architecture

    We describe a new deep learning architecture for learning to rank question answer pairs. Our approach extends the long short-term memory (LSTM) network with holographic composition to model the relationship between question and answer representations. As opposed to the neural tensor layer that has been adopted recently, holographic composition provides the benefits of a scalable and rich representation learning approach without incurring huge parameter costs. Overall, we present the Holographic Dual LSTM (HD-LSTM), a unified architecture for both deep sentence modeling and semantic matching. Essentially, our model is trained end-to-end whereby the parameters of the LSTM are optimized in a way that best explains the correlation between question and answer representations. In addition, our proposed deep learning architecture requires no extensive feature engineering. Via extensive experiments, we show that HD-LSTM outperforms many other neural architectures on two popular benchmark QA datasets. Empirical studies confirm the effectiveness of holographic composition over the neural tensor layer. Comment: SIGIR 2017 Full Paper.
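
    Holographic composition is circular correlation: it composes two d-dimensional vectors into a single d-dimensional vector, so no extra parameters are introduced (unlike a neural tensor layer). A minimal numpy sketch; the toy embeddings are illustrative:

        import numpy as np

        def circular_correlation(q, a):
            """Holographic composition of two vectors, computed in
            O(d log d) via the FFT identity
            corr(q, a) = ifft(conj(fft(q)) * fft(a))."""
            return np.real(np.fft.ifft(np.conj(np.fft.fft(q)) * np.fft.fft(a)))

        q = np.array([0.1, 0.4, 0.5])   # toy question embedding
        a = np.array([0.2, 0.3, 0.5])   # toy answer embedding
        print(circular_correlation(q, a))  # 3-dimensional composed vector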
    • 

    corecore