Query-Specific Knowledge Graphs for Complex Finance Topics
Across the financial domain, researchers answer complex questions by
extensively "searching" for relevant information to generate long-form reports.
This workshop paper discusses automating the construction of query-specific
document and entity knowledge graphs (KGs) for complex research topics. We
focus on the CODEC dataset, where domain experts (1) create challenging
questions, (2) construct long natural language narratives, and (3) iteratively
search and assess the relevance of documents and entities. For the construction
of query-specific KGs, we show that state-of-the-art ranking systems have
headroom for improvement, with specific failings due to a lack of context or
explicit knowledge representation. We demonstrate that entity and document
relevance are positively correlated, and that entity-based query feedback
improves document ranking effectiveness. Furthermore, we construct
query-specific KGs using retrieval and evaluate using CODEC's "ground-truth
graphs", showing the precision and recall trade-offs. Lastly, we point to
future work, including adaptive KG retrieval algorithms and GNN-based weighting
methods, while highlighting key challenges such as high-quality data,
information extraction recall, and the size and sparsity of complex topic
graphs.
Comment: AKBC 2022 Workshop, Knowledge Graphs in Finance and Economic
DREQ: Document Re-Ranking Using Entity-based Query Understanding
While entity-oriented neural IR models have advanced significantly, they often overlook a key nuance: the varying degrees of influence individual entities within a document have on its overall relevance. Addressing this gap, we present DREQ, an entity-oriented dense document re-ranking model. Uniquely, we emphasize the query-relevant entities within a document's representation while simultaneously attenuating the less relevant ones, thus obtaining a query-specific entity-centric document representation. We then combine this entity-centric document representation with the text-centric representation of the document to obtain a "hybrid" representation of the document. We learn a relevance score for the document using this hybrid representation. Using four large-scale benchmarks, we show that DREQ outperforms state-of-the-art neural and non-neural re-ranking methods, highlighting the effectiveness of our entity-oriented representation approach.
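As a rough illustration of the hybrid representation described in the DREQ abstract, the sketch below weights entity embeddings by their similarity to a query embedding and concatenates the weighted sum with a text embedding. The softmax-over-dot-product weighting, the concatenation, and all function names are assumptions made for illustration; the abstract does not specify how DREQ computes entity weights or combines the two representations, and the learned scoring step is omitted.

```python
import math
from typing import List, Sequence

Vector = List[float]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a non-empty list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def entity_centric_representation(
    query_vec: Sequence[float], entity_vecs: Sequence[Sequence[float]]
) -> Vector:
    """Weighted sum of entity embeddings (assumed weighting scheme).

    Entities more similar to the query receive larger weights, so
    query-relevant entities are emphasized and the rest attenuated.
    """
    if not entity_vecs:
        return []
    weights = softmax([dot(query_vec, e) for e in entity_vecs])
    dim = len(entity_vecs[0])
    return [sum(w * e[i] for w, e in zip(weights, entity_vecs)) for i in range(dim)]


def hybrid_representation(
    text_vec: Sequence[float],
    query_vec: Sequence[float],
    entity_vecs: Sequence[Sequence[float]],
) -> Vector:
    """Concatenate the text embedding with the entity-centric embedding."""
    return list(text_vec) + entity_centric_representation(query_vec, entity_vecs)
```

In a full model, a relevance score would then be learned from this hybrid vector; that part depends on training details the abstract does not give.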
Generative and Pseudo-Relevant Feedback for Sparse, Dense and Learned Sparse Retrieval
Pseudo-relevance feedback (PRF) is a classical approach to address lexical mismatch by enriching the query using first-pass retrieval. Moreover, recent work on generative-relevance feedback (GRF) shows that query expansion models using text generated from large language models can improve sparse retrieval without depending on first-pass retrieval effectiveness. This work extends GRF to dense and learned sparse retrieval paradigms with experiments over six standard document ranking benchmarks. We find that GRF improves over comparable PRF techniques by around 10% on both precision and recall-oriented measures. Nonetheless, query analysis shows that GRF and PRF have contrasting benefits, with GRF providing external context not present in first-pass retrieval, whereas PRF grounds the query to the information contained within the target corpus. Thus, we propose combining generative and pseudo-relevance feedback ranking signals to achieve the benefits of both feedback classes, which significantly increases recall over PRF methods on 95% of experiments.
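One simple way to combine two ranking signals, such as the PRF and GRF runs mentioned above, is a weighted sum of normalized scores. The sketch below is only an illustration under that assumption: the abstract does not state which fusion method the authors use, and the function names, the min-max normalization, and the `weight` parameter are all hypothetical.

```python
from typing import Dict, List, Tuple


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1]; a constant run maps every document to 0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def fuse(
    prf_scores: Dict[str, float],
    grf_scores: Dict[str, float],
    weight: float = 0.5,
) -> List[Tuple[str, float]]:
    """Weighted linear fusion of two runs (assumed method, not the paper's).

    `weight` is the share given to the GRF run; documents missing from a
    run contribute 0 for that run. Ties are broken by document id.
    """
    prf_n, grf_n = min_max(prf_scores), min_max(grf_scores)
    docs = set(prf_n) | set(grf_n)
    fused = {
        d: (1.0 - weight) * prf_n.get(d, 0.0) + weight * grf_n.get(d, 0.0)
        for d in docs
    }
    return sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))
```

A document found only by the GRF run (external context) can still enter the fused ranking, while documents both runs agree on are pushed towards the top.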
Generative Relevance Feedback with Large Language Models
Current query expansion models use pseudo-relevance feedback to improve
first-pass retrieval effectiveness; however, this fails when the initial
results are not relevant. Instead of building a language model from retrieved
results, we propose Generative Relevance Feedback (GRF) that builds
probabilistic feedback models from long-form text generated from Large Language
Models. We study the effective methods for generating text by varying the
zero-shot generation subtasks: queries, entities, facts, news articles,
documents, and essays. We evaluate GRF on document retrieval benchmarks
covering a diverse set of queries and document collections, and the results
show that GRF methods significantly outperform previous PRF methods.
Specifically, we improve MAP between 5-19% and NDCG@10 17-24% compared to RM3
expansion, and achieve the best R@1k effectiveness on all datasets compared to
state-of-the-art sparse, dense, and expansion models.
Comment: SIGIR 2023 Preprint, 6 page
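The idea of building a probabilistic feedback model from generated text can be sketched as follows. This is a minimal illustration, assuming a relative-frequency term distribution and an RM3-style interpolation with the original query; the abstract does not give the exact estimator, and the function names and parameters (`num_terms`, `query_weight`) are hypothetical. The generated texts are passed in as plain strings, so no language model call is shown.

```python
import re
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    """Lowercase and split on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


def term_distribution(texts: List[str]) -> Dict[str, float]:
    """Relative term frequencies over all texts (sums to 1 if non-empty)."""
    counts = Counter(t for text in texts for t in tokenize(text))
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()} if total else {}


def expand_query(
    query: str,
    generated_texts: List[str],
    num_terms: int = 5,
    query_weight: float = 0.5,
) -> Dict[str, float]:
    """Interpolate the query model with a feedback model from generated text.

    Keeps the `num_terms` most probable feedback terms (ties broken
    alphabetically), renormalizes them, then mixes with the query model.
    """
    q_model = term_distribution([query])
    fb = term_distribution(generated_texts)
    top = dict(sorted(fb.items(), key=lambda kv: (-kv[1], kv[0]))[:num_terms])
    norm = sum(top.values())
    fb_model = {t: p / norm for t, p in top.items()} if norm else {}
    terms = set(q_model) | set(fb_model)
    return {
        t: query_weight * q_model.get(t, 0.0)
        + (1.0 - query_weight) * fb_model.get(t, 0.0)
        for t in terms
    }
```

Because the feedback text comes from a generator and not from first-pass results, the expansion terms do not depend on how good the initial retrieval was, which is the property the abstract highlights.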
Re-Rank - Expand - Repeat: Adaptive Query Expansion for Document Retrieval Using Words and Entities
Sparse and dense pseudo-relevance feedback (PRF) approaches perform poorly on
challenging queries due to low precision in first-pass retrieval. However,
recent advances in neural language models (NLMs) can re-rank relevant documents
to top ranks, even when few are in the re-ranking pool. This paper first
addresses the problem of poor pseudo-relevance feedback by simply applying
re-ranking prior to query expansion and re-executing this query. We find that
this change alone can improve the retrieval effectiveness of sparse and dense
PRF approaches by 5-8%. Going further, we propose a new expansion model, Latent
Entity Expansion (LEE), a fine-grained word and entity-based relevance
modelling incorporating localized features. Finally, we include an "adaptive"
component to the retrieval process, which iteratively refines the re-ranking
pool during scoring using the expansion model, i.e. we "re-rank - expand -
repeat". Using LEE, we achieve (to our knowledge) the best NDCG, MAP and R@1000
results on the TREC Robust 2004 and CODEC adhoc document datasets,
demonstrating a significant advancement in expansion effectiveness.
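The "re-rank - expand - repeat" loop can be illustrated with a toy sketch. It assumes a term-overlap retriever standing in for sparse retrieval, a term-overlap re-ranker standing in for a neural language model, and a simple frequency-based expansion using words only; none of these are the LEE model itself, and every name and parameter below is hypothetical.

```python
from collections import Counter
from typing import Callable, Dict, List

Corpus = Dict[str, List[str]]
Reranker = Callable[[List[str], List[str]], List[str]]

TOY_CORPUS: Corpus = {
    "d1": ["open", "banking", "regulation", "challenger", "banks"],
    "d2": ["challenger", "banks", "growth", "fintech"],
    "d3": ["football", "results"],
    "d4": ["banking", "fees"],
}


def retrieve(corpus: Corpus, query: Dict[str, float], k: int) -> List[str]:
    """Rank documents by summed query-term weights (toy first-pass retrieval)."""
    scores = {d: sum(query.get(t, 0.0) for t in toks) for d, toks in corpus.items()}
    hits = [d for d in scores if scores[d] > 0]
    return sorted(hits, key=lambda d: (-scores[d], d))[:k]


def expand(corpus: Corpus, query_terms: List[str], feedback_docs: List[str],
           num_terms: int) -> Dict[str, float]:
    """Add the most frequent feedback-document terms to the original query."""
    counts = Counter(t for d in feedback_docs for t in corpus[d])
    total = sum(counts.values())
    expanded = {t: 1.0 for t in query_terms}
    top = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:num_terms]
    for term, c in top:
        expanded[term] = expanded.get(term, 0.0) + c / total
    return expanded


def make_overlap_reranker(corpus: Corpus) -> Reranker:
    """Toy re-ranker: order the pool by distinct original query terms matched."""
    def rerank(query_terms: List[str], pool: List[str]) -> List[str]:
        return sorted(pool, key=lambda d: (-len(set(query_terms) & set(corpus[d])), d))
    return rerank


def rerank_expand_repeat(corpus: Corpus, query_terms: List[str], rerank: Reranker,
                         rounds: int = 2, pool_size: int = 3, fb_docs: int = 1,
                         num_terms: int = 3) -> List[str]:
    """Each round: expand from the top re-ranked docs, retrieve, grow and re-rank the pool."""
    query = {t: 1.0 for t in query_terms}
    pool: List[str] = []
    for _ in range(rounds):
        if pool:
            # Feedback comes from the re-ranked pool, not raw first-pass results.
            query = expand(corpus, query_terms, pool[:fb_docs], num_terms)
        for d in retrieve(corpus, query, pool_size):
            if d not in pool:
                pool.append(d)
        pool = rerank(query_terms, pool)
    return pool
```

With `TOY_CORPUS` and the query terms `["open", "banking"]`, a single round retrieves only the documents that share a query term, whereas a second round uses terms from the top re-ranked document to pull `d2` into the pool.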
Hereditary dentine disorders: dentinogenesis imperfecta and dentine dysplasia
The hereditary dentine disorders, dentinogenesis imperfecta (DGI) and dentine dysplasia (DD), comprise a group of autosomal dominant genetic conditions characterised by abnormal dentine structure affecting either the primary or both the primary and secondary dentitions. DGI is reported to have an incidence of 1 in 6,000 to 1 in 8,000, whereas that of DD type 1 is 1 in 100,000. Clinically, the teeth are discoloured and show structural defects such as bulbous crowns and small pulp chambers radiographically. The underlying defect of mineralisation often results in shearing of the overlying enamel leaving exposed weakened dentine which is prone to wear.
CODEC: Complex Document and Entity Collection
CODEC is a document and entity ranking benchmark that focuses on complex research topics. We target essay-style information needs of social science researchers, e.g. "How has the UK's Open Banking Regulation benefited Challenger Banks?". CODEC includes 42 topics developed by researchers and a new focused web corpus with semantic annotations including entity links. This resource includes expert judgments on 17,509 documents and entities (416.9 per topic) from diverse automatic and interactive manual runs. The manual runs include 387 query reformulations, providing data for query performance prediction and automatic rewriting evaluation.
CODEC includes analysis of state-of-the-art systems, including dense retrieval and neural re-ranking. The results show the topics are challenging with headroom for document and entity ranking improvement. Query expansion with entity information shows significant gains on document ranking, demonstrating the resource's value for evaluating and improving entity-oriented search. We also show that the manual query reformulations significantly improve document ranking and entity ranking performance. Overall, CODEC provides challenging research topics to support the development and evaluation of entity-centric search methods.
Between roost contact is essential for maintenance of European bat lyssavirus type-2 in Myotis daubentonii bat reservoir: 'The Swarming Hypothesis'
Many high-consequence human and animal pathogens persist in wildlife reservoirs. An understanding of the dynamics of these pathogens in their reservoir hosts is crucial to inform the risk of spill-over events, yet our understanding of these dynamics is frequently insufficient. Viral persistence in a wild bat population was investigated by combining empirical data and in-silico analyses to test hypotheses on mechanisms for viral persistence. A fatal zoonotic virus, European Bat lyssavirus type 2 (EBLV-2), in Daubenton's bats (Myotis daubentonii) was used as a model system. A total of 1839 M. daubentonii were sampled for evidence of virus exposure and excretion during a prospective nine year serial cross-sectional survey. Multivariable statistical models demonstrated age-related differences in seroprevalence, with significant variation in seropositivity over time and among roosts. An Approximate Bayesian Computation approach was used to model the infection dynamics incorporating the known host ecology. The results demonstrate that EBLV-2 is endemic in the study population, and suggest that mixing between roosts during seasonal swarming events is necessary to maintain EBLV-2 in the population. These findings contribute to understanding how bat viruses can persist despite low prevalence of infection, and why infection is constrained to certain bat species in multispecies roosts and ecosystems.
An analysis of baseline data from the PROUD study: an open-label randomised trial of pre-exposure prophylaxis
Background: Pre-exposure prophylaxis (PrEP) has proven biological efficacy to reduce the sexual acquisition of the
human immunodeficiency virus (HIV). The PROUD study found that PrEP conferred higher protection than in
placebo-controlled trials, reducing HIV incidence by 86 % in a population with seven-fold higher HIV incidence
than expected. We present the baseline characteristics of the PROUD study population and place the findings in
the context of national sexual health clinic data.
Methods: The PROUD study was designed to explore the real-world effectiveness of PrEP (tenofovir-emtricitabine) by
randomising HIV-negative gay and other men who have sex with men (GMSM) to receive open-label PrEP immediately
or after a deferral period of 12 months. At enrolment, participants self-completed two baseline questionnaires collecting
information on demographics, sexual behaviour and lifestyle in the last 30 and 90 days. These data were compared to
data from HIV-negative GMSM attending sexual health clinics in 2013, collated by Public Health England using
the genitourinary medicine clinic activity database (GUMCAD).
Results: The median age of participants was 35 (IQR: 29–43). Typically participants were white (81 %), educated at a
university level (61 %) and in full-time employment (72 %). Of all participants, 217 (40 %) were born outside the UK. A
sexually transmitted infection (STI) was reported to have been diagnosed in the previous 12 months in 330/515 (64 %)
and 473/544 (87 %) participants reported ever having been diagnosed with an STI. At enrolment, 47/280 (17 %)
participants were diagnosed with an STI. Participants reported a median (IQR) of 10 (5–20) partners in the last 90 days,
a median (IQR) of 2 (1–5) were condomless sex acts where the participant was receptive and 2 (1–6) were condomless
where the participant was insertive. Post-exposure prophylaxis had been prescribed to 184 (34 %) participants in the
past 12 months. The number of STI diagnoses was high compared to those reported in GUMCAD attendees.
Conclusions: The PROUD study population are at substantially higher risk of acquiring HIV infection sexually than the
overall population of GMSM attending sexual health clinics in England. These findings contribute to explaining the
extraordinary HIV incidence rate during follow-up and demonstrate that, despite broad eligibility criteria, the
population interested in PrEP was highly selective.
Trial registration: Current Controlled Trials ISRCTN94465371. Date of registration: 28 February 2013