Structured Text Retrieval Models
Structured text retrieval models provide a formal definition or mathematical framework for querying semistructured textual databases. A textual database contains both content and structure. The content is the text itself, and the structure divides the database into separate textual parts and relates those textual parts by some criterion. Often, textual databases can be represented as marked-up text, for instance as XML, where the XML elements define the structure on the text content. Retrieval models for textual databases should comprise three parts: 1) a model of the text, 2) a model of the structure, and 3) a query language [4]. The model of the text defines a tokenization into words or other semantic units, as well as stop words, stemming, synonyms, etc. The model of the structure defines parts of the text, typically a contiguous portion of the text called an element, region, or segment, which is defined on top of the text model's word tokens. The query language typically defines a number of operators on content and structure, such as set operators and operators like "containing" and "contained-by", to model relations between content and structure, as well as relations between the structural elements themselves. Using such a query language, the (expert) user can for instance formulate requests like "I want a paragraph discussing formal models near to a table discussing the differences between databases and information retrieval". Here, "formal models" and "differences between databases and information retrieval" should match the content that needs to be retrieved from the database, whereas "paragraph" and "table" refer to structural constraints on the units to retrieve. The features, structuring power, and the expressiveness of the query languages of several models for structured text retrieval are discussed below.
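The region model and the "containing" / "contained-by" operators described above can be sketched as interval containment over token spans. The following is a minimal illustration under that assumption; all names and the span representation are illustrative, not taken from any particular system.

```python
# Elements are (start, end) token spans; "containing" / "contained-by"
# reduce to interval-containment checks over those spans.

def contains(outer, inner):
    """True if region `outer` structurally contains region `inner`."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]

# A tiny "database": token spans for two structural elements and a term hit.
paragraph = (0, 50)    # tokens 0..50 form a paragraph element
table = (60, 90)       # tokens 60..90 form a table element
term_hit = (12, 13)    # the phrase "formal models" occupies tokens 12..13

# "a paragraph containing 'formal models'"
print(contains(paragraph, term_hit))   # True
print(contains(table, term_hit))       # False
```

Real structured retrieval systems additionally score such matches and handle nesting, but the containment primitive is the same.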
HISTORICAL BACKGROUND The STAIRS system (Storage and Information Retrieval System), which was developed at IBM as early as the late 1950s, allowed querying both content and structure. Much like today's On-line Public Access Catalogues, it wa…
DIVKNOWQA: Assessing the Reasoning Ability of LLMs via Open-Domain Question Answering over Knowledge Base and Text
Large Language Models (LLMs) have exhibited impressive generation
capabilities, but they suffer from hallucinations when solely relying on their
internal knowledge, especially when answering questions that require less
commonly known information. Retrieval-augmented LLMs have emerged as a
potential solution to ground LLMs in external knowledge. Nonetheless, recent
approaches have primarily emphasized retrieval from unstructured text corpora,
owing to their seamless integration into prompts. When using structured data such
as knowledge graphs, most methods simplify it into natural text, neglecting the
underlying structures. Moreover, a significant gap in the current landscape is
the absence of a realistic benchmark for evaluating the effectiveness of
grounding LLMs on heterogeneous knowledge sources (e.g., knowledge base and
text). To fill this gap, we have curated a comprehensive dataset that poses two
unique challenges: (1) Two-hop multi-source questions that require retrieving
information from both open-domain structured and unstructured knowledge
sources; retrieving information from structured knowledge sources is a critical
component in correctly answering the questions. (2) The generation of symbolic
queries (e.g., SPARQL for Wikidata) is a key requirement, which adds another
layer of challenge. Our dataset is created using a combination of automatic
generation through predefined reasoning chains and human annotation. We also
introduce a novel approach that leverages multiple retrieval tools, including
text passage retrieval and symbolic language-assisted retrieval. Our model
outperforms previous approaches by a significant margin, demonstrating its
effectiveness in addressing the above-mentioned reasoning challenges.
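The symbolic-query generation step this abstract describes (e.g., SPARQL for Wikidata) can be sketched as filling a query template with resolved entity and relation identifiers. The template and helper below are hypothetical simplifications; a real system would also resolve surface names to Q/P identifiers and execute the query against a Wikidata endpoint.

```python
# Hypothetical sketch: turn a resolved (entity, relation) pair into a
# SPARQL query string for Wikidata. wd:/wdt: are the standard Wikidata
# entity/property prefixes.

SPARQL_TEMPLATE = """SELECT ?answer WHERE {{
  wd:{entity} wdt:{relation} ?answer .
}}"""

def build_sparql(entity_qid, relation_pid):
    """Build a one-hop lookup query from Q/P identifiers."""
    return SPARQL_TEMPLATE.format(entity=entity_qid, relation=relation_pid)

# e.g. "Where was Marie Curie (Q7186) born?" -> place of birth (P19)
query = build_sparql("Q7186", "P19")
print(query)
```

In a two-hop multi-source question, the answer binding from such a query would then feed the text-retrieval hop.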
Improving relevance feedback-based query expansion by the use of a weighted word pairs approach
In this article, the use of a new term extraction method for query expansion (QE) in text retrieval is investigated. The new method expands the initial query with a structured representation made of weighted word pairs (WWP) extracted from a set of training documents (relevance feedback). Standard text retrieval systems can handle a WWP structure through custom Boolean weighted models. We experimented with both the explicit and pseudo-relevance feedback schemas and compared the proposed term extraction method with others in the literature, such as KLD and RM3. Evaluations have been conducted on a number of test collections (Text REtrieval Conference [TREC]-6, -7, -8, -9, and -10). Results demonstrate that the QE method based on this new structure outperforms the baseline.
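The expansion step described here, turning weighted word pairs into a query a Boolean weighted model can score, can be sketched roughly as follows. The query syntax and weighting scheme below are illustrative assumptions, not the article's actual formulation.

```python
# Sketch: expand a keyword query with weighted word pairs (WWP).
# Each extracted pair contributes a weighted conjunction clause that a
# Boolean weighted retrieval model could score alongside the original terms.

def expand_with_wwp(query_terms, weighted_pairs):
    """weighted_pairs: list of ((word1, word2), weight) from feedback docs."""
    clauses = [" ".join(query_terms)]
    for (w1, w2), weight in weighted_pairs:
        clauses.append(f"({w1} AND {w2})^{weight:.2f}")
    return " OR ".join(clauses)

pairs = [(("query", "expansion"), 0.8), (("relevance", "feedback"), 0.5)]
print(expand_with_wwp(["text", "retrieval"], pairs))
# text retrieval OR (query AND expansion)^0.80 OR (relevance AND feedback)^0.50
```

The pair weights would come from the term extraction step run over the (pseudo-)relevance feedback documents.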
Efficient Indexing for Structured and Unstructured Data
The collection of digital data is growing at an exponential rate. Data originates from a wide range of sources, such as text feeds, biological sequencers, internet traffic over routers, and sensors, among many others. To mine intelligent information from these sources, users have to query the data. Indexing techniques aim to reduce the query time by preprocessing the data. The diversity of data sources in the real world makes it imperative to develop application-specific indexing solutions based on the data to be queried. Data can be structured, i.e., relational tables, or unstructured, i.e., free text. Moreover, increasingly many applications need to seamlessly analyze both kinds of data, making data integration a central issue. Integrating text with structured data needs to account for missing values, errors in the data, etc. Probabilistic models have been proposed recently for this purpose. These models are also useful for applications where uncertainty is inherent in the data, e.g., sensor networks. This dissertation aims to propose efficient indexing solutions for several problems that lie at the intersection of databases and information retrieval, such as joining ranked inputs, full-text document searching, etc. Other well-known problems of ranked retrieval and pattern matching are also studied under probabilistic settings. For each problem, the worst-case theoretical bounds of the proposed solutions are established and/or their practicality is demonstrated by thorough experimentation.
From document to entity retrieval: improving precision and performance of focused text search
Text retrieval has been an active area of research for decades. Several issues have been studied over the entire period, like the development of statistical models for the estimation of relevance, or the challenge of keeping retrieval tasks efficient with ever-growing text collections. Especially in the last decade, we have also seen a diversification of retrieval tasks. Passage or XML retrieval systems allow a more focused search. Question answering or expert search systems do not even return a ranked list of text units, but, for instance, persons with expertise on a given topic. The sketched situation forms the starting point of this thesis, which presents a number of task-specific search solutions and tries to set them into more generic frameworks. In particular, we take a look at three areas: (1) context adaptivity of search, (2) efficient XML retrieval, and (3) entity ranking.
In the first case, we show how different types of context information can be incorporated in the retrieval of documents. When users are searching for information, the search task is typically part of a wider working process. This search context, however, is often not reflected by the few search keywords stated to the retrieval system, though it can contain valuable information for query refinement. With this work, we address two research questions related to the aim of developing context-aware retrieval systems. First, we show how already available information about the user's context can be employed effectively to gain highly precise search results. Second, we investigate how such meta-data about the search context can be gathered. The proposed "query profiles" have a central role in the query refinement process. They automatically detect necessary context information and help the user to explicitly express context-dependent search constraints. The effectiveness of the approach is tested with retrieval experiments on newspaper data.
When documents are not regarded as a simple sequence of words, but their content is structured in a machine-readable form, it is attractive to try to develop retrieval systems that make use of the additional structure information. Structured retrieval first asks for the design of a suitable language that enables the user to express queries on content and structure. We investigate whether and how existing query languages support the basic needs of structured querying. However, our main focus lies on the efficiency of structured retrieval systems. Conventional inverted indices for document retrieval systems are not suitable for maintaining structure indices. We identify base operations involved in the execution of structured queries and show how they can be supported by new indices and algorithms on a database system. Efficient query processing has to be concerned with the optimization of query plans as well. We investigate low-level query plans of physical database operators for the execution of simple query patterns. Furthermore, we demonstrate how complex queries benefit from higher-level query optimization.
New search tasks and interfaces for the presentation of search results, like faceted search applications, question answering, expert search, and automatic timeline construction, come with the need to rank entities instead of documents. By entities we mean unique (named) existences, such as persons, organizations, or dates. Modern language processing tools are able to automatically detect and categorize named entities in large text collections. In order to estimate their relevance to a given search topic, we develop retrieval models for entities which are based on the relevance of texts that mention the entity. A graph-based relevance propagation framework is introduced for this purpose that makes it possible to derive the relevance of entities. Several options for the modeling of entity containment graphs and different relevance propagation approaches are tested, demonstrating the usefulness of the graph-based ranking framework.
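The core idea of the entity containment graph above, deriving an entity's relevance from the relevance of the documents that mention it, can be sketched in a few lines. The graph structure and the additive propagation rule below are illustrative assumptions, not the thesis's actual models.

```python
# Sketch of relevance propagation over an entity containment graph:
# documents carry retrieval scores, edges link documents to mentioned
# entities, and entity scores accumulate from mentioning documents.

from collections import defaultdict

def propagate(doc_scores, mentions):
    """mentions: list of (doc_id, entity) edges in the containment graph."""
    entity_scores = defaultdict(float)
    for doc_id, entity in mentions:
        entity_scores[entity] += doc_scores.get(doc_id, 0.0)
    return dict(entity_scores)

doc_scores = {"d1": 0.5, "d2": 0.25}
mentions = [("d1", "Marie Curie"), ("d2", "Marie Curie"), ("d2", "CERN")]
print(propagate(doc_scores, mentions))
# {'Marie Curie': 0.75, 'CERN': 0.25}
```

Variants of this scheme could weight edges by mention frequency or normalize by document length, which is the kind of modeling choice the abstract says was compared experimentally.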
PDFTriage: Question Answering over Long, Structured Documents
Large Language Models (LLMs) have issues with document question answering (QA) in situations where the document does not fit in the limited context length of an LLM. To overcome this issue, most existing works focus on retrieving the relevant context from the document and representing it as plain text. However, documents such as PDFs, web pages, and presentations are
naturally structured with different pages, tables, sections, and so on.
Representing such structured documents as plain text is incongruous with the
user's mental model of these documents with rich structure. When a system has
to query the document for context, this incongruity is brought to the fore, and
seemingly trivial questions can trip up the QA system. To bridge this
fundamental gap in handling structured documents, we propose an approach called
PDFTriage that enables models to retrieve the context based on either structure
or content. Our experiments demonstrate the effectiveness of the proposed
PDFTriage-augmented models across several classes of questions where existing
retrieval-augmented LLMs fail. To facilitate further research on this
fundamental problem, we release our benchmark dataset consisting of 900+
human-generated questions over 80 structured documents from 10 different
categories of question types for document QA.
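The structure-or-content retrieval idea this abstract describes can be sketched as a document represented as structural elements with metadata, where context is fetched either by element type or by content match. The representation and function names below are illustrative assumptions, not PDFTriage's actual API.

```python
# Sketch: a structured document as a list of elements with metadata.
# Context can be retrieved by structure (element type) or by content.

elements = [
    {"type": "section", "page": 1, "text": "Introduction to retrieval."},
    {"type": "table", "page": 2, "text": "Results: precision and recall."},
    {"type": "section", "page": 3, "text": "Conclusion and future work."},
]

def fetch_by_structure(elements, element_type):
    """Structure-based retrieval: select elements of a given type."""
    return [e for e in elements if e["type"] == element_type]

def fetch_by_content(elements, keyword):
    """Content-based retrieval: select elements whose text matches."""
    return [e for e in elements if keyword in e["text"].lower()]

print(fetch_by_structure(elements, "table")[0]["page"])     # 2
print(fetch_by_content(elements, "conclusion")[0]["page"])  # 3
```

Keeping the type and page metadata is what lets a question like "what does the table on page 2 show?" be answered without flattening the document to plain text.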
Exploiting Query Structure and Document Structure to Improve Document Retrieval Effectiveness
In this paper we present a systematic analysis of document retrieval using unstructured and structured queries within the score region algebra (SRA) structured retrieval framework. The behavior of different retrieval models, namely Boolean, tf.idf, GPX, language models, and Okapi, is tested using the transparent SRA framework in our three-level structured retrieval system called TIJAH. The retrieval models are implemented along four elementary retrieval aspects: element and term selection, element score computation, score combination, and score propagation.
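The four retrieval aspects listed above can be illustrated as a tiny scoring pipeline: score an element per query term, combine the per-term scores, and propagate scores to containing elements. The scoring functions below are placeholder assumptions, not TIJAH's actual operators.

```python
# Illustrative pipeline for the four retrieval aspects: (1) element and
# term selection, (2) element score computation, (3) score combination,
# (4) score propagation.

def element_score(element_terms, query_term):
    """Element score computation: a simple term-frequency score."""
    return element_terms.count(query_term) / max(len(element_terms), 1)

def combine(scores):
    """Score combination: sum of per-term scores (product or max are
    other common choices)."""
    return sum(scores)

def propagate_up(child_scores):
    """Score propagation: a parent element takes the mean of its
    children's scores."""
    return sum(child_scores) / len(child_scores)

element = ["structured", "retrieval", "models", "retrieval"]
per_term = [element_score(element, t) for t in ["structured", "retrieval"]]
combined = combine(per_term)            # 0.25 + 0.5 = 0.75
parent = propagate_up([combined, 0.25])
print(combined, parent)                 # 0.75 0.5
```

Different instantiations of these four functions are exactly what distinguishes the Boolean, tf.idf, GPX, language model, and Okapi variants compared in the paper.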
The analysis is performed on numerous experiments evaluated on TREC and CLEF collections, using manually generated unstructured and structured queries. Unstructured queries range from short title queries to long title + description + narrative queries. For generating structured queries, we exploit knowledge of the document structure and the content used to semantically describe or classify documents. We show that such structured information can be utilized in retrieval engines to give more precise answers to user queries than when using unstructured queries.
Distributed Information Retrieval using Keyword Auctions
This report motivates the need for large-scale distributed approaches to information retrieval, and proposes solutions based on keyword auctions.