402 research outputs found
ANSWERING TOPICAL INFORMATION NEEDS USING NEURAL ENTITY-ORIENTED INFORMATION RETRIEVAL AND EXTRACTION
In the modern world, search engines are an integral part of human lives. The field of Information Retrieval (IR) is concerned with finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need (query) from within large collections (usually stored on computers). The search engine then displays a ranked list of results relevant to our query. Traditional document retrieval algorithms match a query to a document using the overlap of words in both. However, the last decade has seen the focus shifting to leveraging the rich semantic information available in the form of entities. Entities are uniquely identifiable objects or things such as places, events, diseases, etc. that exist in the real or fictional world. Entity-oriented search systems leverage the semantic information associated with entities (e.g., names, types, etc.) to better match documents to queries. Web search engines would provide better search results if they understand the meaning of a query.
This dissertation advances the state-of-the-art in IR by developing novel algorithms that understand text (query, document, question, sentence, etc.) at the semantic level. To this end, this dissertation aims to understand the fine-grained meaning of entities from the context in which the entities have been mentioned, for example, “oysters” in the context of food versus ecosystems. Further, we aim to automatically learn (vector) representations of entities that incorporate this fine-grained knowledge and knowledge about the query. This work refines the automatic understanding of text passages using deep learning, a modern artificial intelligence paradigm.
This dissertation utilizes the semantic information extracted from entities to retrieve materials (text and entities) relevant to a query. The interplay between text and entities in the text is studied by addressing three related prediction problems: (1) identifying entities that are relevant for the query, (2) understanding an entity’s meaning in the context of the query, and (3) identifying text passages that elaborate the connection between the query and an entity.
The research presented in this dissertation may be integrated into a larger system designed for answering complex topical queries such as “dark chocolate health benefits”, which require the search engine to automatically understand the connections between the query and the relevant material, thus transforming the search engine into an answering engine.
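The traditional word-overlap matching that the abstract above contrasts with entity-oriented search can be illustrated with a minimal sketch. The tokenizer, the scoring function, and the example documents are assumptions made for illustration; they are not taken from the dissertation.

```python
import re

def tokenize(text):
    """Lowercase a string and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def overlap_score(query, document):
    """Count the distinct query words that also occur in the document."""
    return len(set(tokenize(query)) & set(tokenize(document)))

def rank(query, documents):
    """Return the documents sorted by descending word overlap with the query."""
    return sorted(documents, key=lambda d: overlap_score(query, d), reverse=True)

# Hypothetical mini-collection, invented for this example.
docs = [
    "Oysters filter water and stabilize coastal ecosystems.",
    "Dark chocolate contains flavanols linked to health benefits.",
    "Chocolate cake recipe with dark cocoa.",
]
ranked = rank("dark chocolate health benefits", docs)
```

A scorer of this kind sees only surface words, so it cannot tell "oysters" as food from "oysters" as part of an ecosystem; that gap is what entity-level semantics is meant to close.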
Thematic Annotation: extracting concepts out of documents
Contrary to standard approaches to topic annotation, the technique used in
this work does not centrally rely on some sort of -- possibly statistical --
keyword extraction. In fact, the proposed annotation algorithm uses a large
scale semantic database -- the EDR Electronic Dictionary -- that provides a
concept hierarchy based on hyponym and hypernym relations. This concept
hierarchy is used to generate a synthetic representation of the document by
aggregating the words present in topically homogeneous document segments into a
set of concepts best preserving the document's content.
This new extraction technique uses an unexplored approach to topic selection.
Instead of using semantic similarity measures based on a semantic resource, the
latter is processed to extract the part of the conceptual hierarchy relevant to
the document content. Then this conceptual hierarchy is searched to extract the
most relevant set of concepts to represent the topics discussed in the
document. Notice that this algorithm is able to extract generic concepts that
are not directly present in the document.
Comment: Technical report EPFL/LIA. 81 pages, 16 figures
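The idea of aggregating the words of a segment into a more generic concept through hypernym links can be sketched as follows. This is only an illustration under stated assumptions: the tiny hierarchy is invented (it is not the EDR Electronic Dictionary), and picking the lowest common hypernym is a simplification, not the report's actual selection algorithm.

```python
# Toy hypernym links (child -> parent); hypothetical, not taken from EDR.
HYPERNYM = {
    "oyster": "mollusc",
    "clam": "mollusc",
    "mollusc": "animal",
    "salmon": "fish",
    "fish": "animal",
    "animal": "entity",
}

def ancestors(word):
    """Return the chain from a word up to the root, the word itself first."""
    chain = [word]
    while chain[-1] in HYPERNYM:
        chain.append(HYPERNYM[chain[-1]])
    return chain

def lowest_common_concept(words):
    """Return the most specific concept that subsumes every word, or None."""
    if not words:
        return None
    chains = [ancestors(w) for w in words]
    shared = set(chains[0]).intersection(*chains[1:])
    # Walk up from the first word: the first shared node is the most specific.
    for concept in chains[0]:
        if concept in shared:
            return concept
    return None

segment_concept = lowest_common_concept(["oyster", "clam", "salmon"])
```

Here the segment is summarized by "animal", a concept that appears in none of the input words, which mirrors the abstract's remark about extracting generic concepts that are not directly present in the document.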
Semantic Interaction in Web-based Retrieval Systems : Adopting Semantic Web Technologies and Social Networking Paradigms for Interacting with Semi-structured Web Data
Existing web retrieval models for exploration and interaction with web data do not take into account semantic information, nor do they allow for new forms of interaction by employing meaningful interaction and navigation metaphors in 2D/3D. This thesis researches means for introducing a semantic dimension into the search and exploration process of web content to enable a significantly positive user experience. Therefore, an inherently dynamic view beyond single concepts and models from semantic information processing, information extraction and human-machine interaction is adopted. Essential tasks for semantic interaction such as semantic annotation, semantic mediation and semantic human-computer interaction were identified and elaborated for two general application scenarios in web retrieval: Web-based Question Answering in a knowledge-based dialogue system and semantic exploration of information spaces in 2D/3D.
Uncovering Hidden Semantics of Set Information in Knowledge Bases
Knowledge Bases (KBs) contain a wealth of structured information about entities and predicates. This paper focuses on set-valued predicates, i.e., the relationship between an entity and a set of entities. In KBs, this information is often represented in two formats: (i) via counting predicates such as numberOfChildren and staffSize, which store aggregated integers, and (ii) via enumerating predicates such as parentOf and worksFor, which store individual set memberships. The two formats are typically complementary: unlike enumerating predicates, counting predicates do not give away individuals, but are more likely to be informative about the true set size. This coexistence could therefore enable interesting applications in question answering and KB curation. In this paper, we aim to uncover this hidden knowledge. We proceed in two steps. (i) We identify set-valued predicates among a given KB's predicates via statistical and embedding-based features. (ii) We link counting predicates and enumerating predicates by a combination of co-occurrence, correlation and textual relatedness metrics. We analyze the prevalence of count information in four prominent knowledge bases, and show that our linking method achieves up to 0.55 F1 score in set predicate identification versus a 0.40 F1 score for random selection, and normalized discounted gains of up to 0.84 at position 1 and 0.75 at position 3 in relevant predicate alignments. Our predicate alignments are showcased in a demonstration system available at https://counqer.mpi-inf.mpg.de/spo
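Two of the linking signals named in the abstract, co-occurrence and correlation, can be sketched for a single candidate pair of predicates. This is a minimal illustration, not the paper's method: the triples are invented, and the exact definitions used here (share of counted subjects that also have enumerated members, and Pearson correlation between stated counts and enumerated set sizes) are assumptions.

```python
from collections import defaultdict
from math import sqrt

# Toy triples (subject, predicate, object); entities and values are made up.
TRIPLES = [
    ("alice", "numberOfChildren", 3),
    ("alice", "parentOf", "bob"),
    ("alice", "parentOf", "carol"),
    ("dave", "numberOfChildren", 1),
    ("dave", "parentOf", "erin"),
    ("frank", "numberOfChildren", 2),
    ("frank", "parentOf", "gina"),
    ("frank", "parentOf", "hugo"),
    ("ivy", "numberOfChildren", 4),
]

def pearson(xs, ys):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)

def alignment_features(triples, counting, enumerating):
    """Return (co-occurrence, correlation) for one counting/enumerating pair."""
    counts = {}                  # subject -> stated count
    members = defaultdict(set)   # subject -> enumerated objects
    for subject, predicate, obj in triples:
        if predicate == counting:
            counts[subject] = obj
        elif predicate == enumerating:
            members[subject].add(obj)
    shared = sorted(set(counts) & set(members))
    co_occurrence = len(shared) / len(counts) if counts else 0.0
    stated = [counts[s] for s in shared]
    listed = [len(members[s]) for s in shared]
    correlation = pearson(stated, listed) if len(shared) > 1 else 0.0
    return co_occurrence, correlation

co_occurrence, correlation = alignment_features(
    TRIPLES, "numberOfChildren", "parentOf"
)
```

In the toy data, "alice" has a stated count of 3 but only two enumerated children, which illustrates why the counting predicate can be more informative about the true set size than the enumeration.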
Improving Long Document Topic Segmentation Models With Enhanced Coherence Modeling
Topic segmentation is critical for obtaining structured documents and
improving downstream tasks such as information retrieval. Thanks to their
ability to automatically explore clues of topic shift in abundant labeled data,
recent supervised neural models have greatly advanced long document topic
segmentation, but they leave the deeper relationship between coherence and
topic segmentation underexplored. Therefore, this paper enhances the ability of
supervised models to capture coherence from both logical structure and semantic
similarity perspectives to further improve the topic segmentation performance,
proposing Topic-aware Sentence Structure Prediction (TSSP) and Contrastive
Semantic Similarity Learning (CSSL). Specifically, the TSSP task is proposed to
force the model to comprehend structural information by learning the original
relations between adjacent sentences in a disarrayed document, which is
constructed by jointly disrupting the original document at topic and sentence
levels. Moreover, we utilize inter- and intra-topic information to construct
contrastive samples and design the CSSL objective to ensure that sentence
representations in the same topic have higher similarity, while those in
different topics are less similar. Extensive experiments show that the
Longformer with our approach significantly outperforms previous state-of-the-art
(SOTA) methods. Our approach improves the F1 of the previous SOTA by 3.42 points
(73.74 -> 77.16) and reduces P_k by 1.11 points (15.0 -> 13.89) on WIKI-727K,
and achieves an average relative P_k reduction of 4.3% on WikiSection. The
average relative P_k drop of 8.38% on two out-of-domain datasets also
demonstrates the robustness of our approach.
Comment: Accepted by EMNLP 2023. Code is available at
https://github.com/alibaba-damo-academy/SpokenNLP
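The intuition behind a contrastive similarity objective, where sentences from the same topic should end up closer than sentences from different topics, can be sketched with a generic InfoNCE-style loss. This is an assumption-laden illustration and not the CSSL objective from the paper: the function names, the temperature value, and the two-dimensional toy vectors are all hypothetical.

```python
from math import exp, log, sqrt

def cosine(u, v):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def contrastive_loss(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss: low when the anchor is closer to the positive
    (same topic) than to every negative (different topic)."""
    pos = exp(cosine(anchor, positive) / temperature)
    neg = sum(exp(cosine(anchor, n) / temperature) for n in negatives)
    return -log(pos / (pos + neg))

# Toy sentence vectors, invented for this example.
anchor = [1.0, 0.0]
same_topic = [0.9, 0.1]
other_topics = [[0.0, 1.0], [-1.0, 0.2]]

# Same-topic sentence used as the positive: the loss is small.
good = contrastive_loss(anchor, same_topic, other_topics)
# A different-topic sentence wrongly used as the positive: the loss is large.
bad = contrastive_loss(anchor, other_topics[0], [same_topic])
```

Minimizing a loss of this shape pushes same-topic representations together and different-topic representations apart, which is the behavior the abstract attributes to its contrastive objective.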