124 research outputs found
Better contextual suggestions in ClueWeb12 using domain knowledge inferred from the open web
Proceedings of the 23rd Text REtrieval Conference (TREC 2014), held in Gaithersburg, Maryland, USA, in 2014.
This paper provides an overview of our participation in the Contextual Suggestion Track. The TREC 2014 Contextual Suggestion Track allowed participants to submit personalized rankings using documents either from the Open Web or from an archived, static Web collection, the ClueWeb12 dataset. In this paper, we focus on filtering the entire ClueWeb12 collection to exploit domain knowledge from touristic sites available on the Open Web. We show that the recommendations generated for the provided user profiles and contexts improve significantly when this inferred domain knowledge is used.
This research was supported by the Netherlands Organization for Scientific Research (NWO project #640.005.001).
What Makes a Top-Performing Precision Medicine Search Engine? Tracing Main System Features in a Systematic Way
From 2017 to 2019 the Text REtrieval Conference (TREC) held a challenge task
on precision medicine using documents from medical publications (PubMed) and
clinical trials. Despite the many performance measurements carried out in these
evaluation campaigns, the scientific community still has little certainty about the
impact individual system features and their weights have on overall system
performance. To close this explanatory gap, we first determined
optimal feature configurations using the Sequential Model-based Algorithm
Configuration (SMAC) program and applied its output to a BM25-based search
engine. We then ran an ablation study to systematically assess the individual
contributions of relevant system features: BM25 parameters, query type and
weighting schema, query expansion, stop word filtering, and keyword boosting.
For evaluation, we employed the gold standard data from the three TREC-PM
installments to evaluate the effectiveness of different features using the
commonly shared infNDCG metric.
Comment: Accepted for SIGIR 2020, 10 pages.
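To make the ablation methodology concrete, the following minimal Python sketch shows the general pattern; the configuration keys and the evaluate helper are hypothetical placeholders, not the paper's actual code:

    # Hypothetical feature-ablation loop: start from a tuned configuration,
    # disable one feature at a time, and attribute the drop in infNDCG to it.
    BEST_CONFIG = {
        "bm25_k1": 1.2, "bm25_b": 0.75,   # BM25 parameters (e.g., SMAC-tuned)
        "query_expansion": True,
        "stopword_filtering": True,
        "keyword_boosting": True,
    }

    def evaluate(config) -> float:
        """Placeholder: run the BM25 engine with `config` on the TREC-PM
        topics and return infNDCG against the gold-standard judgments."""
        raise NotImplementedError

    def ablation_study(config) -> None:
        baseline = evaluate(config)
        for feature in ("query_expansion", "stopword_filtering", "keyword_boosting"):
            ablated = dict(config, **{feature: False})  # switch one feature off
            print(f"{feature}: contributes {baseline - evaluate(ablated):+.4f} infNDCG")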
When temporal expressions help to detect vital documents related to an entity
In this paper we aim at filtering, from a document stream, documents containing timely relevant information about an entity (e.g., a person, a place, an organization). These documents, which we call vital documents, provide relevant and fresh information about the entity. The approach we propose leverages the temporal information reflected by the temporal expressions in the document in order to infer its vitality. Experiments carried out on the 2013 TREC Knowledge Base Acceleration (KBA) collection show the effectiveness of our approach compared to state-of-the-art ones.
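The underlying intuition can be sketched as a freshness score over a document's temporal expressions; the scoring function below is an illustrative simplification, not the paper's exact model:

    from datetime import datetime

    def temporal_freshness(doc_time: datetime, expressions: list[datetime]) -> float:
        """Average closeness of a document's temporal expressions to its
        publication time; values near 1 suggest fresh, potentially vital
        content. Illustrative only."""
        if not expressions:
            return 0.0
        gaps_in_days = (abs((doc_time - t).days) for t in expressions)
        return sum(1.0 / (1 + gap) for gap in gaps_in_days) / len(expressions)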
The Archive Query Log: Mining Millions of Search Result Pages of Hundreds of Search Engines from 25 Years of Web Archives
The Archive Query Log (AQL) is a previously unused, comprehensive query log
collected at the Internet Archive over the last 25 years. Its first version
includes 356 million queries, 166 million search result pages, and 1.7 billion
search results across 550 search providers. Although many query logs have been
studied in the literature, the search providers that own them generally do not
publish their logs to protect user privacy and vital business data. Of the few
query logs publicly available, none combines size, scope, and diversity. The
AQL is the first to do so, enabling research on new retrieval models and
(diachronic) search engine analyses. Provided in a privacy-preserving manner,
it promotes open research as well as more transparency and accountability in
the search industry.
Comment: SIGIR 2023 resource paper, 13 pages.
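A hedged sketch of how such a log might be consumed, assuming a JSONL serialization; the field names below are purely illustrative, so consult the published resource for the actual schema:

    import json

    def iter_aql_records(path: str):
        """Yield (query, provider, results) triples from a hypothetical
        JSONL export of the AQL; all field names are assumptions."""
        with open(path, encoding="utf-8") as f:
            for line in f:
                record = json.loads(line)
                yield record["query"], record.get("provider"), record.get("results", [])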
Temporal Information Models for Real-Time Microblog Search
Real-time search in Twitter and other social media services is often biased
towards the most recent results due to the "in the moment" nature of topic
trends and their ephemeral relevance to users and media in general. However,
"in the moment", it is often difficult to survey all emerging topics and single out
the important ones from the rest of the social media chatter. This thesis proposes
to leverage external sources to estimate the duration and burstiness of live
Twitter topics. It extends preliminary research in which it was shown that temporal
re-ranking using external sources could indeed improve the accuracy of results.
To further explore this topic we pursued three significant novel approaches: (1)
multi-source information analysis that explores behavioral dynamics of users,
such as Wikipedia live edits and page view streams, to detect topic trends
and estimate topic interest over time; (2) efficient methods for federated
query expansion towards improving query meaning; and (3) exploiting
multiple sources towards the detection of temporal query intent. This work differs from
past approaches in that it operates over real-time queries, leveraging
live user-generated content, in contrast with previous methods
that require an offline preprocessing step.
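As an illustration of the multi-source signal in (1), a simple burstiness estimate over Wikipedia page-view counts could be a z-score of recent traffic against history; this is an assumed simplification, not the thesis's actual model:

    import statistics

    def burstiness(hourly_views: list[int], window: int = 24) -> float:
        """Deviation of the last `window` hours of page views from the
        historical mean, in standard deviations; assumes len(hourly_views)
        is comfortably larger than `window`."""
        history, recent = hourly_views[:-window], hourly_views[-window:]
        mu = statistics.mean(history)
        sigma = statistics.pstdev(history) or 1.0  # guard against zero variance
        return (statistics.mean(recent) - mu) / sigma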
Improving Contextual Suggestions using Open Web Domain Knowledge
Also published online by CEUR Workshop Proceedings (CEUR-WS.org, ISSN 1613-0073).
Contextual suggestion aims at recommending items to users given
their current context, such as location-based tourist recommendations.
Our contextual suggestion ranking model consists of two
main components: selecting candidate suggestions and providing a
ranked list of personalized suggestions. We focus on selecting appropriate
suggestions from the ClueWeb12 collection using tourist
domain knowledge inferred from social sites and resources available
on the public Web (Open Web). Specifically, we generate two
candidate subsets retrieved from the ClueWeb12 collection, one by
filtering the content on mentions of the location context, and one
by integrating domain knowledge derived from the Open Web. The
impact of these candidate selection methods on contextual suggestion
effectiveness is analyzed using the test collection constructed
for the TREC Contextual Suggestion Track in 2014. Our main findings are, first, that contextual suggestion performance on the subset created using Open Web domain knowledge is significantly better than on the subset based only on geographical information; and second, that using a prior probability estimated from domain knowledge leads to better suggestions and improves overall performance.
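The prior-based ranking can be illustrated with a standard language-modeling decomposition, ranking by log P(d) + log P(q|d), where the document prior P(d) is estimated from Open Web tourist-domain knowledge; this combination is illustrative and not necessarily the paper's exact formula:

    import math

    def suggestion_score(query_likelihood: float, domain_prior: float) -> float:
        """Combine a text-match likelihood P(q|d) with a domain-knowledge
        prior P(d) (e.g., how prominently the venue appears on touristic
        sites). Both inputs must be positive probabilities."""
        return math.log(domain_prior) + math.log(query_likelihood)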
Filtering and Aggregating Vital Information about Entities (Filtrage et agrégation d'informations vitales relatives à des entités)
Nowadays, knowledge bases such as Wikipedia and DBpedia are the main sources for accessing information on a wide variety of entities (an entity is a thing that can be distinctly identified, such as a person, an organization, a product, an event, etc.). However, these sources are updated with new information about a given entity manually, by contributors, and with significant latency, particularly if the entity is not popular. A system that analyzes documents as they are published on the Web to filter important information about entities would likely accelerate the update of these knowledge bases. In this thesis, we are interested in filtering timely and relevant information, called vital information, concerning entities. This work falls within information retrieval, but also aims to enrich knowledge engineering techniques by helping select the information to be processed. We aim to answer the following two questions: (1) How to detect whether a document is vital (i.e., provides timely and relevant information) with respect to an entity? and (2) How to extract vital information from these documents to build a temporal summary of the entity that can serve as a reference for updating the corresponding knowledge base entry?

Regarding the first question, we proposed two methods. The first is fully supervised and based on a vitality language model. The second measures the freshness of the temporal expressions in a document to decide its vitality. Concerning the second question, we proposed a method that selects sentences based on the presence of trigger words automatically retrieved from the knowledge already represented in the knowledge base (such as the descriptions of similar entities), as sketched below.

We carried out our experiments on the TREC Stream Corpus of 2013 and 2014, with 1.2 billion documents and different types of entities (persons, organizations, facilities, and events). For the vital-document filtering approaches, we conducted our experiments in the context of the "Knowledge Base Acceleration (KBA)" task for the years 2013 and 2014. Our method based on leveraging the temporal expressions in a document obtained good results, outperforming the best participating system in the KBA 2013 task. To evaluate our contributions on extracting vital information about entities, we used the experimental framework of the "Temporal Summarization (TS)" task, and showed that our generated temporal summaries help minimize the latency of knowledge base updates.
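The trigger-word sentence selection described above can be sketched in a few lines; the whitespace tokenization and set intersection are simplifying assumptions, not the thesis's exact method:

    def select_vital_sentences(sentences: list[str], trigger_words: set[str]) -> list[str]:
        """Keep sentences mentioning at least one trigger word; the trigger
        vocabulary is mined from the knowledge base (e.g., descriptions of
        similar entities)."""
        triggers = {w.lower() for w in trigger_words}
        return [s for s in sentences if triggers & {tok.lower() for tok in s.split()}]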
Overview of the TREC 2014 Federated Web Search Track
The TREC Federated Web Search track facilitates research in topics related to federated web search by providing a large, realistic data collection sampled from a multitude of online search engines. The FedWeb 2013 tasks of Resource Selection and Results Merging are again included in FedWeb 2014, and we additionally introduced the task of Vertical Selection. Other new aspects are the required link between Resource Selection and Results Merging, and the importance of diversity in the merged results. After an overview of the new data collection and relevance judgments, the individual participants' results for the tasks are introduced, analyzed, and compared.
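For readers unfamiliar with the Results Merging task, a generic score-normalization baseline looks like the sketch below; it is illustrative only, not the track's official method or any participant's system:

    def merge_results(engine_runs: dict[str, list[tuple[str, float]]]):
        """Min-max normalize each engine's scores to [0, 1], then sort the
        union globally. `engine_runs` maps an engine name to (doc_id, score)
        pairs."""
        merged = []
        for engine, run in engine_runs.items():
            scores = [score for _, score in run]
            lo, hi = min(scores), max(scores)
            span = (hi - lo) or 1.0  # guard against constant scores
            merged += [(doc_id, (s - lo) / span, engine) for doc_id, s in run]
        return sorted(merged, key=lambda item: item[1], reverse=True)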
ir_metadata: An Extensible Metadata Schema for IR Experiments
The information retrieval (IR) community has a strong tradition of making the
computational artifacts and resources available for future reuse, allowing the
validation of experimental results. Besides the actual test collections, the
underlying run files are often hosted in data archives as part of conferences
like TREC, CLEF, or NTCIR. Unfortunately, the run data itself does not provide
much information about the underlying experiment. For instance, the single run
file is not of much use without the context of the shared task's website or the
run data archive. In other domains, like the social sciences, it is good
practice to annotate research data with metadata. In this work, we introduce
ir_metadata - an extensible metadata schema for TREC run files based on the
PRIMAD model. We propose to align the metadata annotations to PRIMAD, which
considers components of computational experiments that can affect
reproducibility. Furthermore, we outline important components and information
that should be reported in the metadata and give evidence from the literature.
To demonstrate the usefulness of these metadata annotations, we implement new
features in repro_eval that support the outlined metadata schema for the use
case of reproducibility studies. Additionally, we curate a dataset with run
files derived from experiments with different instantiations of PRIMAD
components and annotate these with the corresponding metadata. In the
experiments, we cover reproducibility experiments that are identified by the
metadata and classified by PRIMAD. With this work, we enable IR researchers to
annotate TREC run files and improve the reuse value of experimental artifacts
even further.
Comment: Resource paper.
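To make the annotation idea concrete, here is a hedged sketch that prepends a PRIMAD-aligned, '#'-commented YAML header to a plain run file; the field names are illustrative rather than the authoritative ir_metadata keys, and PyYAML is assumed to be installed:

    import yaml  # PyYAML

    # Illustrative PRIMAD-aligned fields (Platform, Implementation, Method, Data);
    # consult the ir_metadata schema for the authoritative key names.
    METADATA = {
        "platform": {"software": "Python 3.10"},
        "implementation": {"repository": "https://example.org/my-ir-system"},  # placeholder
        "method": {"retrieval": "BM25", "k1": 0.9, "b": 0.4},
        "data": {"test_collection": "..."},
    }

    def annotate_run(run_path: str, out_path: str) -> None:
        """Write the metadata as commented YAML lines, followed by the
        unchanged run lines."""
        header = "".join(f"# {line}\n" for line in yaml.safe_dump(METADATA).splitlines())
        with open(run_path) as src, open(out_path, "w") as dst:
            dst.write(header)
            dst.write(src.read())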
Learning to Rank Question Answer Pairs with Holographic Dual LSTM Architecture
We describe a new deep learning architecture for learning to rank question
answer pairs. Our approach extends the long short-term memory (LSTM) network
with holographic composition to model the relationship between question and
answer representations. As opposed to the neural tensor layer that has been
adopted recently, the holographic composition provides the benefits of a scalable
and rich representational learning approach without incurring huge parameter
costs. Overall, we present Holographic Dual LSTM (HD-LSTM), a unified
architecture for both deep sentence modeling and semantic matching.
Essentially, our model is trained end-to-end whereby the parameters of the LSTM
are optimized in a way that best explains the correlation between question and
answer representations. In addition, our proposed deep learning architecture
requires no extensive feature engineering. Via extensive experiments, we show
that HD-LSTM outperforms many other neural architectures on two popular
benchmark QA datasets. Empirical studies confirm the effectiveness of
holographic composition over the neural tensor layer.
Comment: SIGIR 2017 full paper.
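Holographic composition here is circular correlation, which composes two d-dimensional vectors into a single d-dimensional vector with no extra parameters, hence the scalability argument; a minimal NumPy sketch of the operation itself (not the full HD-LSTM model):

    import numpy as np

    def circular_correlation(a: np.ndarray, b: np.ndarray) -> np.ndarray:
        """[a * b]_k = sum_i a_i * b_{(k + i) mod d}, computed in O(d log d)
        via the FFT identity ifft(conj(fft(a)) * fft(b))."""
        return np.fft.ifft(np.conj(np.fft.fft(a)) * np.fft.fft(b)).real

    # e.g., compose question and answer embeddings before the matching layer
    q, a = np.random.randn(128), np.random.randn(128)
    h = circular_correlation(q, a)  # still shape (128,)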
- …