An Axiomatic Analysis of Diversity Evaluation Metrics: Introducing the Rank-Biased Utility Metric
Many evaluation metrics have been defined to evaluate the effectiveness of
ad-hoc retrieval and search result diversification systems. However, it is
often unclear which evaluation metric should be used to analyze the performance
of retrieval systems given a specific task. Axiomatic analysis is an
informative mechanism to understand the fundamentals of metrics and their
suitability for particular scenarios. In this paper, we define a
constraint-based axiomatic framework to study the suitability of existing
metrics in search result diversification scenarios. The analysis informed the
definition of Rank-Biased Utility (RBU) -- an adaptation of the well-known
Rank-Biased Precision metric -- that takes into account redundancy and the user
effort associated with the inspection of documents in the ranking. Our
experiments over standard diversity evaluation campaigns show that the proposed
metric captures quality criteria reflected by different metrics, being suitable
in the absence of knowledge about particular features of the scenario under
study.
Comment: Original version: 10 pages. Preprint of full paper to appear at
SIGIR'18: The 41st International ACM SIGIR Conference on Research &
Development in Information Retrieval, July 8-12, 2018, Ann Arbor, MI, USA.
ACM, New York, NY, US
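For readers unfamiliar with the Rank-Biased Precision metric that RBU adapts, the following is a minimal sketch of the standard RBP formula, RBP = (1 - p) * sum_i r_i * p^(i-1). It is not the RBU metric itself: the abstract does not give RBU's formula, so redundancy and user effort are not modeled here. The function name and the default persistence value of 0.8 are illustrative assumptions.

```python
def rank_biased_precision(relevances, persistence=0.8):
    """Standard Rank-Biased Precision for one ranking.

    relevances:  relevance gains in [0, 1], listed in rank order (rank 1 first).
    persistence: probability p that the user moves on to the next document
                 (0.8 is an illustrative default, not taken from the abstract).
    """
    if not 0.0 <= persistence < 1.0:
        raise ValueError("persistence must be in [0, 1)")
    # enumerate() starts at 0, which matches the exponent (rank - 1).
    discounted = sum(rel * persistence ** idx for idx, rel in enumerate(relevances))
    return (1.0 - persistence) * discounted


if __name__ == "__main__":
    # Relevant documents at ranks 1 and 3, non-relevant at rank 2.
    print(rank_biased_precision([1, 0, 1], persistence=0.5))
```

A lower persistence models an impatient user and concentrates the score on the top ranks.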
Human Preferences as Dueling Bandits
© 2022 Association for Computing Machinery. This is the author's version of the work. It is posted here for your personal use. Not for redistribution. The definitive Version of Record was published in SIGIR '22: Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval,
http://dx.doi.org/10.1145/3477495.3531991
The dramatic improvements in core information retrieval tasks engendered by neural rankers create a need for novel evaluation methods. If every ranker returns highly relevant items in the top ranks, it becomes difficult to recognize meaningful differences between them and to build reusable test collections. Several recent papers explore pairwise preference judgments as an alternative to traditional graded relevance assessments. Rather than viewing items one at a time, assessors view items side-by-side and indicate the one that provides the better response to a query, allowing fine-grained distinctions. If we employ preference judgments to identify the probably best items for each query, we can measure rankers by their ability to place these items as high as possible. We frame the problem of finding best items as a dueling bandits problem. While many papers explore dueling bandits for online ranker evaluation via interleaving, they have not been considered as a framework for offline evaluation via human preference judgments. We review the literature for possible solutions. For human preference judgments, any usable algorithm must tolerate ties, since two items may appear nearly equal to assessors, and it must minimize the number of judgments required for any specific pair, since each such comparison requires an independent assessor. Since the theoretical guarantees provided by most algorithms depend on assumptions that are not satisfied by human preference judgments, we simulate selected algorithms on representative test cases to provide insight into their practical utility. Based on these simulations, one algorithm stands out for its potential. Our simulations suggest modifications to further improve its performance. Using the modified algorithm, we collect over 10,000 preference judgments for pools derived from submissions to the TREC 2021 Deep Learning Track, confirming its suitability. 
We test the idea of best-item evaluation and suggest ideas for further theoretical and practical progress.
We thank Mark Smucker, Gautam Kamath, and Ben Carterette for
their feedback. This research was supported by the Natural Sciences
and Engineering Research Council of Canada through its Discovery
Grants program
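To make the tie-tolerance requirement concrete, here is a minimal sketch of tallying pairwise preference judgments in which a tie counts as half a win for each item. This is a plain win count, not a dueling bandits algorithm and not the algorithm the paper selects; the data layout (triples with `None` marking a tie) and the function names are assumptions made for illustration.

```python
from collections import defaultdict


def win_scores(judgments):
    """Tally pairwise judgments given as (item_a, item_b, winner) triples.

    winner is item_a, item_b, or None for a tie (hypothetical encoding).
    A win adds 1.0 to the winner; a tie adds 0.5 to both items.
    """
    scores = defaultdict(float)
    for item_a, item_b, winner in judgments:
        # Touch both keys so items that never win still appear with score 0.
        scores[item_a] += 0.0
        scores[item_b] += 0.0
        if winner is None:
            scores[item_a] += 0.5
            scores[item_b] += 0.5
        elif winner in (item_a, item_b):
            scores[winner] += 1.0
        else:
            raise ValueError(f"winner {winner!r} is not part of the pair")
    return dict(scores)


def best_items(judgments):
    """Return every item sharing the top score, in sorted order."""
    scores = win_scores(judgments)
    if not scores:
        return []
    top = max(scores.values())
    return sorted(item for item, score in scores.items() if score == top)


if __name__ == "__main__":
    judged = [("d1", "d2", "d1"), ("d1", "d3", None), ("d2", "d3", "d3")]
    print(win_scores(judged))
    print(best_items(judged))
```

A bandit-style method would additionally decide which pair to judge next, which is what keeps the number of judgments per pair small.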
Controlling Risk of Web Question Answering
Web question answering (QA) has become an indispensable component in modern
search systems, which can significantly improve users' search experience by
providing a direct answer to users' information need. This could be achieved by
applying machine reading comprehension (MRC) models over the retrieved passages
to extract answers with respect to the search query. With the development of
deep learning techniques, state-of-the-art MRC performances have been achieved
by recent deep methods. However, existing studies on MRC seldom address the
predictive uncertainty issue, i.e., how likely the prediction of an MRC model
is wrong, leading to uncontrollable risks in real-world Web QA applications. In
this work, we first conduct an in-depth investigation over the risk of Web QA.
We then introduce a novel risk control framework, which consists of a qualify
model for uncertainty estimation using the probe idea, and a decision model for
selective output. For evaluation, we introduce risk-related metrics, rather
than the traditional EM and F1 in MRC, for the evaluation of risk-aware Web QA.
The empirical results over both the real-world Web QA dataset and the academic
MRC benchmark collection demonstrate the effectiveness of our approach.
Comment: 42nd International ACM SIGIR Conference on Research and Development
in Information Retrieval
Automatic Ground Truth Expansion for Timeline Evaluation
The development of automatic systems that can produce timeline summaries by filtering high-volume streams of text documents, retaining only those that are relevant to a particular information need (e.g. topic or event), remains a very challenging task. To advance the field of automatic timeline generation, robust and reproducible evaluation methodologies are needed. To this end, several evaluation metrics and labeling methodologies have recently been developed - focusing on information nugget or cluster-based ground truth representations, respectively. These methodologies rely on human assessors manually mapping timeline items (e.g. tweets) to an explicit representation of what information a 'good' summary should contain. However, while these evaluation methodologies produce reusable ground truth labels, prior works have reported cases where such labels fail to accurately estimate the performance of new timeline generation systems due to label incompleteness. In this paper, we first quantify the extent to which timeline summary ground truth labels fail to generalize to new summarization systems, then we propose and evaluate new automatic solutions to this issue. In particular, using a depooling methodology over 21 systems and across three high-volume datasets, we quantify the degree of system ranking error caused by excluding those systems when labeling. We show that when considering lower-effectiveness systems, the test collections are robust (the likelihood of systems being mis-ranked is low). However, we show that the risk of systems being mis-ranked increases as the effectiveness of systems held-out from the pool increases. To reduce the risk of mis-ranking systems, we also propose two different automatic ground truth label expansion techniques. 
Our results show that our proposed expansion techniques can be effective for increasing the robustness of the TREC-TS test collections, markedly reducing the number of mis-rankings by up to 50% on average among the scenarios tested
Relevance Prediction from Eye-movements Using Semi-interpretable Convolutional Neural Networks
We propose an image-classification method to predict the perceived-relevance
of text documents from eye-movements. An eye-tracking study was conducted where
participants read short news articles, and rated them as relevant or irrelevant
for answering a trigger question. We encode participants' eye-movement
scanpaths as images, and then train a convolutional neural network classifier
using these scanpath images. The trained classifier is used to predict
participants' perceived-relevance of news articles from the corresponding
scanpath images. This method is content-independent, as the classifier does not
require knowledge of the screen-content, or the user's information-task. Even
with little data, the image classifier can predict perceived-relevance with up
to 80% accuracy. When compared to similar eye-tracking studies from the
literature, this scanpath image classification method outperforms previously
reported metrics by appreciable margins. We also attempt to interpret how the
image classifier differentiates between scanpaths on relevant and irrelevant
documents
Discrete deep learning for fast content-aware recommendation
The cold-start problem and recommendation efficiency are regarded as two crucial challenges in recommender systems. In this paper, we propose a hashing based deep learning framework called Discrete Deep Learning (DDL), to map users and items to Hamming space, where a user's preference for an item can be efficiently calculated by Hamming distance, and this computation scheme significantly improves the efficiency of online recommendation. Besides, DDL unifies the user-item interaction information and the item content information to overcome the issues of data sparsity and cold-start. To be more specific, to integrate content information into our DDL framework, a deep learning model, Deep Belief Network (DBN), is applied to extract effective item representation from the item content information. Besides, the framework imposes balance and irrelevant constraints on binary codes to derive compact but informative binary codes. Due to the discrete constraints in DDL, we propose an efficient alternating optimization method consisting of iteratively solving a series of mixed-integer programming subproblems. Extensive experiments have been conducted to evaluate the performance of our DDL framework on two different Amazon datasets, and the experimental results demonstrate the superiority of DDL over the state-of-the-art methods regarding online recommendation efficiency and cold-start recommendation accuracy
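The Hamming-distance scoring described above can be sketched as follows. This only illustrates ranking with binary codes that are already learned; it does not cover how DDL learns them. The code length, the example codes, the function names, and the convention that a smaller distance means a stronger preference are all assumptions made for illustration.

```python
def hamming_distance(code_a, code_b):
    """Number of positions at which two equal-length binary codes differ."""
    if len(code_a) != len(code_b):
        raise ValueError("codes must have the same length")
    return sum(bit_a != bit_b for bit_a, bit_b in zip(code_a, code_b))


def rank_items(user_code, item_codes):
    """Order item ids by increasing Hamming distance to the user's code.

    Assumes a smaller distance means a stronger predicted preference.
    Ties are broken by item id so the ordering is deterministic.
    """
    return sorted(
        item_codes,
        key=lambda item: (hamming_distance(user_code, item_codes[item]), item),
    )


if __name__ == "__main__":
    user = [1, 0, 1, 1]          # hypothetical 4-bit user code
    items = {                    # hypothetical item codes
        "a": [1, 0, 1, 0],
        "b": [0, 1, 0, 0],
        "c": [1, 0, 1, 1],
    }
    print(rank_items(user, items))
```

In practice the codes would typically be packed into machine words so that the distance reduces to an XOR followed by a bit count, which is the usual source of the speed advantage of hashing-based methods.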
Patent Retrieval in Chemistry based on semantically tagged Named Entities
Gurulingappa H, Müller B, Klinger R, et al. Patent Retrieval in Chemistry based on semantically tagged Named Entities. In: Voorhees EM, Buckland LP, eds. The Eighteenth Text RETrieval Conference (TREC 2009) Proceedings. Gaithersburg, Maryland, USA; 2009.
This paper reports on the work that has been conducted by Fraunhofer SCAI for the TREC Chemistry (TREC-CHEM) track 2009. The team of Fraunhofer SCAI participated in two tasks, namely Technology Survey and Prior Art Search. The core of the framework is an index of 1.2 million chemical patents provided as a data set by TREC. For the technology survey, three runs were submitted based on semantic dictionaries and noun phrases. For the prior art search task, several fields were introduced into the index that contained normalized noun phrases, biomedical as well as chemical entities. Altogether, 36 runs were submitted for this task that were based on automatic querying with tokens, noun phrases and entities along with different search strategies
Ricolinostat plus lenalidomide, and dexamethasone in relapsed or refractory multiple myeloma: a multicentre phase 1b trial
BACKGROUND: Histone deacetylase (HDAC) inhibitors are an important new class of therapeutics for treating multiple myeloma. Ricolinostat (ACY-1215) is the first oral selective HDAC6 inhibitor with reduced class I HDAC activity to be studied clinically. Motivated by findings from preclinical studies showing potent synergistic activity with ricolinostat and lenalidomide, our goal was to assess the safety and preliminary activity of the combination of ricolinostat with lenalidomide and dexamethasone in relapsed or refractory multiple myeloma.
METHODS: In this multicentre phase 1b trial, we recruited patients aged 18 years or older with previously treated relapsed or refractory multiple myeloma from five cancer centres in the USA. Inclusion criteria included a Karnofsky Performance Status score of at least 70, measurable disease, adequate bone marrow reserve, adequate hepatic function, and a creatinine clearance of at least 50 mL per min. Exclusion criteria included previous exposure to HDAC inhibitors; previous allogeneic stem-cell transplantation; previous autologous stem-cell transplantation within 12 weeks of baseline; active systemic infection; malignancy within the last 5 years; known or suspected HIV, hepatitis B, or hepatitis C infection; a QTc Fridericia of more than 480 ms; and substantial cardiovascular, gastrointestinal, psychiatric, or other medical disorders. We gave escalating doses (from 40-240 mg once daily to 160 mg twice daily) of oral ricolinostat according to a standard 3 + 3 design according to three different regimens on days 1-21 with a conventional 28 day schedule of oral lenalidomide (from 15 mg [in one cohort] to 25 mg [in all other cohorts] once daily) and oral dexamethasone (40 mg weekly). Primary outcomes were dose-limiting toxicities, the maximum tolerated dose of ricolinostat in this combination, and the dose and schedule of ricolinostat recommended for further phase 2 investigation. Secondary outcomes were the pharmacokinetics and pharmacodynamics of ricolinostat in this combination and the preliminary anti-tumour activity of this treatment. The trial is closed to accrual and is registered at ClinicalTrials.gov, number NCT01583283.
FINDINGS: Between July 12, 2012, and Aug 20, 2015, we enrolled 38 patients. We observed two dose-limiting toxicities with ricolinostat 160 mg twice daily: one (2%) grade 3 syncope and one (2%) grade 3 myalgia event in different cohorts. A maximum tolerated dose was not reached. We chose ricolinostat 160 mg once daily on days 1-21 of a 28 day cycle as the recommended dose for future phase 2 studies in combination with lenalidomide 25 mg and dexamethasone 40 mg. The most common adverse events were fatigue (grade 1-2 in 14 [37%] patients; grade 3 in seven [18%]) and diarrhoea (grade 1-2 in 15 [39%] patients; grade 3 in two [5%]). Our pharmacodynamic studies showed that at clinically relevant doses, ricolinostat selectively inhibits HDAC6 while retaining a low and tolerable level of class I HDAC inhibition. The pharmacokinetics of ricolinostat and lenalidomide were not affected by co-administration. In a preliminary assessment of antitumour activity, 21 (55% [95% CI 38-71]) of 38 patients had an overall response.
INTERPRETATION: The findings from this study provide preliminary evidence that ricolinostat is a safe and well tolerated selective HDAC6 inhibitor, which might partner well with lenalidomide and dexamethasone to enhance their efficacy in relapsed or refractory multiple myeloma.
FUNDING: Acetylon Pharmaceuticals
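As a worked illustration of the reported response rate (21 of 38 patients, 55%, with a 95% CI), the sketch below computes an exact binomial confidence interval by bisection. The abstract does not say which interval method the authors used, so the Clopper-Pearson construction here is an assumption, as are the function names; the result can be compared against the interval quoted in the FINDINGS paragraph.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(k + 1))


def _solve_decreasing(func, target, iterations=100):
    """Find p in [0, 1] with func(p) == target, for func decreasing in p."""
    low, high = 0.0, 1.0
    for _ in range(iterations):
        mid = (low + high) / 2.0
        if func(mid) > target:
            low = mid      # func is still too large, so move p upward
        else:
            high = mid
    return (low + high) / 2.0


def clopper_pearson(successes, trials, alpha=0.05):
    """Exact two-sided (1 - alpha) interval for a binomial proportion."""
    if not 0 <= successes <= trials or trials == 0:
        raise ValueError("need 0 <= successes <= trials and trials > 0")
    # Lower bound solves P(X >= successes | p) = alpha / 2.
    lower = 0.0 if successes == 0 else _solve_decreasing(
        lambda p: binom_cdf(successes - 1, trials, p), 1.0 - alpha / 2.0)
    # Upper bound solves P(X <= successes | p) = alpha / 2.
    upper = 1.0 if successes == trials else _solve_decreasing(
        lambda p: binom_cdf(successes, trials, p), alpha / 2.0)
    return lower, upper


if __name__ == "__main__":
    low, high = clopper_pearson(21, 38)
    print(f"rate = {21 / 38:.1%}, 95% CI = ({low:.1%}, {high:.1%})")
```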