The Archive Query Log: Mining Millions of Search Result Pages of Hundreds of Search Engines from 25 Years of Web Archives
The Archive Query Log (AQL) is a previously unused, comprehensive query log
collected at the Internet Archive over the last 25 years. Its first version
includes 356 million queries, 166 million search result pages, and 1.7 billion
search results across 550 search providers. Although many query logs have been
studied in the literature, the search providers that own them generally do not
publish their logs to protect user privacy and vital business data. Of the few
query logs publicly available, none combines size, scope, and diversity. The
AQL is the first to do so, enabling research on new retrieval models and
(diachronic) search engine analyses. Provided in a privacy-preserving manner,
it promotes open research as well as more transparency and accountability in
the search industry. Comment: SIGIR 2023 resource paper, 13 pages
What Makes a Top-Performing Precision Medicine Search Engine? Tracing Main System Features in a Systematic Way
From 2017 to 2019 the Text REtrieval Conference (TREC) held a challenge task
on precision medicine using documents from medical publications (PubMed) and
clinical trials. Despite the many performance measurements carried out in these
evaluation campaigns, the scientific community remains uncertain about the
impact individual system features and their weights have on the overall system
performance. In order to overcome this explanatory gap, we first determined
optimal feature configurations using the Sequential Model-based Algorithm
Configuration (SMAC) program and applied its output to a BM25-based search
engine. We then ran an ablation study to systematically assess the individual
contributions of relevant system features: BM25 parameters, query type and
weighting schema, query expansion, stop word filtering, and keyword boosting.
For evaluation, we employed the gold standard data from the three TREC-PM
installments and measured the effectiveness of different features using the
commonly shared infNDCG metric. Comment: Accepted for SIGIR 2020, 10 pages
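The BM25 parameters named in the abstract above can be illustrated with a minimal sketch of the standard Okapi BM25 scoring function. This is not the authors' search engine or their SMAC configuration: the function name `bm25_scores`, the default values `k1=1.2` and `b=0.75`, and the toy documents are assumptions chosen for illustration only.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document in `docs` against the tokenized `query`.

    k1 controls term-frequency saturation; b controls length normalization.
    Both defaults are common textbook values, not values from the paper.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Smoothed inverse document frequency (always positive).
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency, dampened by k1 and normalized by document length.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical toy collection, already tokenized.
docs = [
    ["egfr", "mutation", "lung", "cancer"],
    ["breast", "cancer", "trial"],
    ["diabetes", "insulin", "therapy"],
]
scores = bm25_scores(["lung", "cancer"], docs)
```

In this toy collection the first document matches both query terms, the second matches one, and the third matches none, so the scores decrease in that order. An ablation over `k1` and `b` would rerun the scoring with different values and compare the resulting rankings.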
Modeling Temporal Evidence from External Collections
Newsworthy events are broadcast through multiple mediums and prompt the
crowds to produce comments on social media. In this paper, we propose to
leverage these behavioral dynamics to estimate the most relevant time periods
for an event (i.e., query). Recent advances have shown how to improve the
estimation of the temporal relevance of such topics. In this approach, we build
on two major novelties. First, we mine temporal evidence from hundreds of
external sources into topic-based external collections to improve the
robustness of the detection of relevant time periods. Second, we propose a
formal retrieval model that generalizes the use of the temporal dimension
across different aspects of the retrieval process. In particular, we show that
temporal evidence of external collections can be used to (i) infer a topic's
temporal relevance, (ii) select the query expansion terms, and (iii) re-rank
the final results for improved precision. Experiments with TREC Microblog
collections show that the proposed time-aware retrieval model makes an
effective and extensive use of the temporal dimension to improve search results
over the most recent temporal models. Interestingly, we observe a strong
correlation between precision and the temporal distribution of retrieved and
relevant documents. Comment: To appear in WSDM 201
Temporal Information Models for Real-Time Microblog Search
Real-time search in Twitter and other social media services is often biased
towards the most recent results due to the “in the moment” nature of topic
trends and their ephemeral relevance to users and media in general. However,
“in the moment”, it is often difficult to look at all emerging topics and single out
the important ones from the rest of the social media chatter. This thesis proposes
to leverage external sources to estimate the duration and burstiness of live
Twitter topics. It extends preliminary research where it was shown that temporal
re-ranking using external sources could indeed improve the accuracy of results.
To further explore this topic we pursued three significant novel approaches: (1)
multi-source information analysis that explores behavioral dynamics of users,
such as Wikipedia live edits and page view streams, to detect topic trends
and estimate the topic interest over time; (2) efficient methods for federated
query expansion towards the improvement of query meaning; and (3) exploiting
multiple sources towards the detection of temporal query intent. It differs from
past approaches in the sense that it will work over real-time queries, leveraging
live user-generated content. This approach contrasts with previous methods
that require an offline preprocessing step.
Conversational Search with Random Walks over Entity Graphs
Funding Information: This work has been partially funded by the FCT project NOVA LINCS Ref. UIDP/04516/2020, by the Amazon Science - TaskBot Prize Challenge and the CMU|Portugal projects iFetch (LISBOA-01-0247-FEDER-045920) and GoLocal (CMUP-ERI/TIC/0046/2014), and by the FCT Ph.D. scholarship grant SFRH/BD/140924/2018. Any opinions, findings, and conclusions in this paper are the authors’ and do not necessarily reflect those of the sponsors. Publisher Copyright: © 2023 Owner/Author.
The entities that emerge during a conversation can be used to model topics, but not all entities are equally useful for this task. Modeling the conversation with entity graphs and predicting each entity's centrality in the conversation provides additional information that improves the retrieval of answer passages for the current question. Experiments show that using random walks to estimate entity centrality on conversation entity graphs improves top precision answer passage ranking over competitive transformer-based baselines.
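The idea of estimating entity centrality with random walks can be sketched with a generic PageRank-style power iteration over an undirected entity graph. The abstract does not specify the authors' graph construction or walk formulation, so everything here is an assumption: the function name `random_walk_centrality`, the damping factor of 0.85, the iteration count, and the example entities are all hypothetical.

```python
def random_walk_centrality(edges, damping=0.85, iterations=50):
    """Estimate node centrality as the stationary distribution of a
    random walk with restart (PageRank-style) on an undirected graph.

    `edges` is a list of (entity, entity) pairs, e.g. entities that
    co-occur in the same conversation turn (an illustrative assumption).
    """
    # Build an undirected adjacency map from the edge list.
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
    nodes = sorted(neighbors)
    n = len(nodes)
    # Start the walk from a uniform distribution over entities.
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {}
        for node in nodes:
            # Probability mass arriving from each neighbor, split evenly
            # across that neighbor's own edges.
            incoming = sum(rank[nb] / len(neighbors[nb]) for nb in neighbors[node])
            # With probability (1 - damping) the walker jumps to a random node.
            new_rank[node] = (1 - damping) / n + damping * incoming
        rank = new_rank
    return rank


# Hypothetical conversation entity graph centered on one entity.
edges = [("paris", "france"), ("paris", "louvre"), ("paris", "seine")]
centrality = random_walk_centrality(edges)
most_central = max(centrality, key=centrality.get)
```

In this star-shaped example the hub entity receives the largest share of the walk. In a retrieval setting, such scores could serve as an extra signal when ranking answer passages, although how the paper combines them with its transformer-based rankers is not stated in the abstract.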