75 research outputs found

    Query-Specific Knowledge Graphs for Complex Finance Topics

    Across the financial domain, researchers answer complex questions by extensively "searching" for relevant information to generate long-form reports. This workshop paper discusses automating the construction of query-specific document and entity knowledge graphs (KGs) for complex research topics. We focus on the CODEC dataset, where domain experts (1) create challenging questions, (2) construct long natural language narratives, and (3) iteratively search and assess the relevance of documents and entities. For the construction of query-specific KGs, we show that state-of-the-art ranking systems have headroom for improvement, with specific failings due to a lack of context or explicit knowledge representation. We demonstrate that entity and document relevance are positively correlated, and that entity-based query feedback improves document ranking effectiveness. Furthermore, we construct query-specific KGs using retrieval and evaluate using CODEC's "ground-truth graphs", showing the precision and recall trade-offs. Lastly, we point to future work, including adaptive KG retrieval algorithms and GNN-based weighting methods, while highlighting key challenges such as high-quality data, information extraction recall, and the size and sparsity of complex topic graphs. Comment: AKBC 2022 Workshop, Knowledge Graphs in Finance and Economics

    DREQ: Document Re-Ranking Using Entity-based Query Understanding

    While entity-oriented neural IR models have advanced significantly, they often overlook a key nuance: the varying degrees of influence individual entities within a document have on its overall relevance. Addressing this gap, we present DREQ, an entity-oriented dense document re-ranking model. Uniquely, we emphasize the query-relevant entities within a document’s representation while simultaneously attenuating the less relevant ones, thus obtaining a query-specific entity-centric document representation. We then combine this entity-centric document representation with the text-centric representation of the document to obtain a “hybrid” representation of the document. We learn a relevance score for the document using this hybrid representation. Using four large-scale benchmarks, we show that DREQ outperforms state-of-the-art neural and non-neural re-ranking methods, highlighting the effectiveness of our entity-oriented representation approach.
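
The idea of emphasizing query-relevant entities and then combining entity-centric and text-centric representations can be illustrated with a small sketch. The abstract does not specify how entities are weighted or how the two representations are combined, so everything below is an assumption made for illustration: softmax weights over query-entity dot products, a weighted sum of entity vectors, and concatenation for the "hybrid" vector. The function names are hypothetical and this is not the DREQ implementation.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def entity_centric_vector(query_vec, entity_vecs):
    """Weight each entity by its (assumed) dot-product similarity to the query,
    then pool the entity vectors with those weights."""
    weights = softmax([dot(query_vec, e) for e in entity_vecs])
    dim = len(query_vec)
    pooled = [sum(w * e[i] for w, e in zip(weights, entity_vecs)) for i in range(dim)]
    return pooled, weights

def hybrid_vector(text_vec, entity_vec):
    """Assumed combination: simple concatenation of the two representations."""
    return list(text_vec) + list(entity_vec)

# Toy 2-dimensional embeddings (made-up values).
query_vec = [1.0, 0.0]
entity_vecs = [[2.0, 0.0], [0.0, 2.0]]  # first entity aligns with the query
text_vec = [0.5, 0.5]

pooled, weights = entity_centric_vector(query_vec, entity_vecs)
hybrid = hybrid_vector(text_vec, pooled)
```

In this toy setting the first entity receives the larger weight, so the pooled entity vector leans toward it; a learned scoring layer over `hybrid` would be the next step and is omitted here.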

    Generative and Pseudo-Relevant Feedback for Sparse, Dense and Learned Sparse Retrieval

    Pseudo-relevance feedback (PRF) is a classical approach to address lexical mismatch by enriching the query using first-pass retrieval. Moreover, recent work on generative-relevance feedback (GRF) shows that query expansion models using text generated from large language models can improve sparse retrieval without depending on first-pass retrieval effectiveness. This work extends GRF to dense and learned sparse retrieval paradigms with experiments over six standard document ranking benchmarks. We find that GRF improves over comparable PRF techniques by around 10% on both precision and recall-oriented measures. Nonetheless, query analysis shows that GRF and PRF have contrasting benefits, with GRF providing external context not present in first-pass retrieval, whereas PRF grounds the query to the information contained within the target corpus. Thus, we propose combining generative and pseudo-relevance feedback ranking signals to achieve the benefits of both feedback classes, which significantly increases recall over PRF methods on 95% of experiments.
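
One way to combine two ranking signals, such as a PRF run and a GRF run, is rank fusion. The abstract does not say which fusion method is used, so the sketch below assumes weighted reciprocal rank fusion with the conventional constant `k=60`; the run contents and the function name are made up for illustration.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, weights=None, k=60):
    """Fuse several ranked lists of document ids.

    Each document receives weight / (k + rank) from every list it appears in;
    documents are returned by descending fused score (ties broken by id).
    """
    if weights is None:
        weights = [1.0] * len(rankings)
    fused = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += weight / (k + rank)
    return sorted(fused, key=lambda d: (-fused[d], d))

# Hypothetical runs: d4 is only found by the GRF-expanded query.
prf_run = ["d1", "d2", "d3"]
grf_run = ["d3", "d1", "d4"]

fused_run = reciprocal_rank_fusion([prf_run, grf_run])
print(fused_run)
```

The fused list contains documents from both runs, which is how fusion can raise recall: a document found only through the generated context still enters the final ranking.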

    Generative Relevance Feedback with Large Language Models

    Current query expansion models use pseudo-relevance feedback to improve first-pass retrieval effectiveness; however, this fails when the initial results are not relevant. Instead of building a language model from retrieved results, we propose Generative Relevance Feedback (GRF) that builds probabilistic feedback models from long-form text generated from Large Language Models. We study effective methods for generating text by varying the zero-shot generation subtasks: queries, entities, facts, news articles, documents, and essays. We evaluate GRF on document retrieval benchmarks covering a diverse set of queries and document collections, and the results show that GRF methods significantly outperform previous PRF methods. Specifically, we improve MAP by 5-19% and NDCG@10 by 17-24% compared to RM3 expansion, and achieve the best R@1k effectiveness on all datasets compared to state-of-the-art sparse, dense, and expansion models. Comment: SIGIR 2023 Preprint, 6 pages
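
A probabilistic feedback model built from generated text can be sketched as a term distribution that is interpolated with the original query, in the spirit of RM3-style expansion. The abstract gives no estimation details, so the sketch assumes a maximum-likelihood unigram model over whitespace tokens, a `top_n` term cutoff, and a linear interpolation weight `beta`; the generated passages are hand-written stand-ins for LLM output.

```python
from collections import Counter

def feedback_model(texts, top_n=5):
    """Estimate a unigram distribution over the top_n most frequent terms."""
    counts = Counter(tok for text in texts for tok in text.lower().split())
    top = counts.most_common(top_n)
    total = sum(count for _, count in top)
    return {term: count / total for term, count in top}

def interpolate(query, feedback, beta=0.5):
    """Mix the original query model (weight beta) with the feedback model."""
    q_counts = Counter(query.lower().split())
    q_total = sum(q_counts.values())
    expanded = {term: beta * count / q_total for term, count in q_counts.items()}
    for term, prob in feedback.items():
        expanded[term] = expanded.get(term, 0.0) + (1 - beta) * prob
    return expanded

# Stand-ins for text a language model might generate for the query.
generated = [
    "open banking lets challenger banks access data",
    "challenger banks use open banking apis",
]
expanded_query = interpolate("open banking benefits", feedback_model(generated))
```

Because the feedback terms come from generated text rather than retrieved documents, the expansion does not depend on the quality of a first-pass ranking.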

    Re-Rank - Expand - Repeat: Adaptive Query Expansion for Document Retrieval Using Words and Entities

    Sparse and dense pseudo-relevance feedback (PRF) approaches perform poorly on challenging queries due to low precision in first-pass retrieval. However, recent advances in neural language models (NLMs) can re-rank relevant documents to top ranks, even when few are in the re-ranking pool. This paper first addresses the problem of poor pseudo-relevance feedback by simply applying re-ranking prior to query expansion and re-executing this query. We find that this change alone can improve the retrieval effectiveness of sparse and dense PRF approaches by 5-8%. Going further, we propose a new expansion model, Latent Entity Expansion (LEE), a fine-grained word and entity-based relevance modelling approach incorporating localized features. Finally, we include an "adaptive" component to the retrieval process, which iteratively refines the re-ranking pool during scoring using the expansion model, i.e. we "re-rank - expand - repeat". Using LEE, we achieve (to our knowledge) the best NDCG, MAP and R@1000 results on the TREC Robust 2004 and CODEC adhoc document datasets, demonstrating a significant advancement in expansion effectiveness.
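
The "re-rank - expand - repeat" loop can be sketched on a toy corpus. This is not LEE: the retriever, re-ranker, and expansion model below are simple term-overlap stand-ins, and the corpus, function names, and weights are all assumptions chosen only to show the control flow of re-ranking before expansion and growing the pool iteratively.

```python
from collections import Counter

CORPUS = {
    "d1": "open banking regulation helps challenger banks",
    "d2": "challenger banks grow with api access",
    "d3": "api access rules for fintech startups",
    "d4": "weather report london",
}

def retrieve(terms, k, exclude):
    """Toy retrieval: weighted term overlap, skipping documents already pooled."""
    scores = {}
    for doc_id, text in CORPUS.items():
        if doc_id in exclude:
            continue
        tokens = set(text.split())
        score = sum(w for t, w in terms.items() if t in tokens)
        if score > 0:
            scores[doc_id] = score
    return sorted(scores, key=lambda d: (-scores[d], d))[:k]

def rerank(query, doc_ids):
    """Stand-in for a neural re-ranker: overlap with the original query."""
    q = set(query.split())
    return sorted(doc_ids, key=lambda d: (-len(q & set(CORPUS[d].split())), d))

def expand(query, top_docs):
    """Add terms from the top re-ranked documents to the query weights."""
    terms = {t: 1.0 for t in query.split()}
    counts = Counter(t for d in top_docs for t in CORPUS[d].split())
    total = sum(counts.values())
    for term, count in counts.items():
        terms[term] = terms.get(term, 0.0) + 0.5 * count / total
    return terms

def rerank_expand_repeat(query, rounds=3, k=2):
    terms = {t: 1.0 for t in query.split()}
    pool = []
    for _ in range(rounds):
        new_docs = retrieve(terms, k, exclude=set(pool))
        if not new_docs:
            break  # the pool has stopped growing
        pool = rerank(query, pool + new_docs)   # re-rank
        terms = expand(query, pool[:2])         # expand from the top of the pool
    return pool                                 # repeat until no new documents

final_pool = rerank_expand_repeat("challenger banks")
first_pass = retrieve({"challenger": 1.0, "banks": 1.0}, k=4, exclude=set())
```

In this toy run the original query alone never reaches `d3`, while the expanded query pulls it into the pool in the second round through the terms contributed by the top re-ranked documents.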

    Hereditary dentine disorders: dentinogenesis imperfecta and dentine dysplasia

    The hereditary dentine disorders, dentinogenesis imperfecta (DGI) and dentine dysplasia (DD), comprise a group of autosomal dominant genetic conditions characterised by abnormal dentine structure affecting either the primary or both the primary and secondary dentitions. DGI is reported to have an incidence of 1 in 6,000 to 1 in 8,000, whereas that of DD type 1 is 1 in 100,000. Clinically, the teeth are discoloured, and radiographs show structural defects such as bulbous crowns and small pulp chambers. The underlying defect of mineralisation often results in shearing of the overlying enamel, leaving exposed weakened dentine which is prone to wear.

    CODEC: Complex Document and Entity Collection

    CODEC is a document and entity ranking benchmark that focuses on complex research topics. We target essay-style information needs of social science researchers, e.g. "How has the UK's Open Banking Regulation benefited Challenger Banks". CODEC includes 42 topics developed by researchers and a new focused web corpus with semantic annotations including entity links. This resource includes expert judgments on 17,509 documents and entities (416.9 per topic) from diverse automatic and interactive manual runs. The manual runs include 387 query reformulations, providing data for query performance prediction and automatic rewriting evaluation. CODEC includes analysis of state-of-the-art systems, including dense retrieval and neural re-ranking. The results show the topics are challenging with headroom for document and entity ranking improvement. Query expansion with entity information shows significant gains on document ranking, demonstrating the resource's value for evaluating and improving entity-oriented search. We also show that the manual query reformulations significantly improve document ranking and entity ranking performance. Overall, CODEC provides challenging research topics to support the development and evaluation of entity-centric search methods.

    Between roost contact is essential for maintenance of European bat lyssavirus type-2 in Myotis daubentonii bat reservoir: 'The Swarming Hypothesis'

    Many high-consequence human and animal pathogens persist in wildlife reservoirs. An understanding of the dynamics of these pathogens in their reservoir hosts is crucial to inform the risk of spill-over events, yet our understanding of these dynamics is frequently insufficient. Viral persistence in a wild bat population was investigated by combining empirical data and in-silico analyses to test hypotheses on mechanisms for viral persistence. A fatal zoonotic virus, European Bat lyssavirus type 2 (EBLV-2), in Daubenton's bats (Myotis daubentonii) was used as a model system. A total of 1839 M. daubentonii were sampled for evidence of virus exposure and excretion during a prospective nine-year serial cross-sectional survey. Multivariable statistical models demonstrated age-related differences in seroprevalence, with significant variation in seropositivity over time and among roosts. An Approximate Bayesian Computation approach was used to model the infection dynamics incorporating the known host ecology. The results demonstrate that EBLV-2 is endemic in the study population, and suggest that mixing between roosts during seasonal swarming events is necessary to maintain EBLV-2 in the population. These findings contribute to understanding how bat viruses can persist despite low prevalence of infection, and why infection is constrained to certain bat species in multispecies roosts and ecosystems.

    An analysis of baseline data from the PROUD study: an open-label randomised trial of pre-exposure prophylaxis

    Background: Pre-exposure prophylaxis (PrEP) has proven biological efficacy to reduce the sexual acquisition of the human immunodeficiency virus (HIV). The PROUD study found that PrEP conferred higher protection than in placebo-controlled trials, reducing HIV incidence by 86 % in a population with seven-fold higher HIV incidence than expected. We present the baseline characteristics of the PROUD study population and place the findings in the context of national sexual health clinic data. Methods: The PROUD study was designed to explore the real-world effectiveness of PrEP (tenofovir-emtricitabine) by randomising HIV-negative gay and other men who have sex with men (GMSM) to receive open-label PrEP immediately or after a deferral period of 12 months. At enrolment, participants self-completed two baseline questionnaires collecting information on demographics, sexual behaviour and lifestyle in the last 30 and 90 days. These data were compared to data from HIV-negative GMSM attending sexual health clinics in 2013, collated by Public Health England using the genitourinary medicine clinic activity database (GUMCAD). Results: The median age of participants was 35 (IQR: 29–43). Typically participants were white (81 %), educated at a university level (61 %) and in full-time employment (72 %). Of all participants, 217 (40 %) were born outside the UK. A sexually transmitted infection (STI) was reported to have been diagnosed in the previous 12 months in 330/515 (64 %) and 473/544 (87 %) participants reported ever having been diagnosed with an STI. At enrolment, 47/280 (17 %) participants were diagnosed with an STI. Participants reported a median (IQR) of 10 (5–20) partners in the last 90 days, a median (IQR) of 2 (1–5) were condomless sex acts where the participant was receptive and 2 (1–6) were condomless where the participant was insertive. Post-exposure prophylaxis had been prescribed to 184 (34 %) participants in the past 12 months. The number of STI diagnoses was high compared to those reported in GUMCAD attendees. Conclusions: The PROUD study population are at substantially higher risk of acquiring HIV infection sexually than the overall population of GMSM attending sexual health clinics in England. These findings contribute to explaining the extraordinary HIV incidence rate during follow-up and demonstrate that, despite broad eligibility criteria, the population interested in PrEP was highly selective. Trial registration: Current Controlled Trials ISRCTN94465371. Date of registration: 28 February 2013.