
    Knowledge Extraction for Hybrid Question Answering

    Since Tim Berners-Lee's proposal of hypertext to his employer CERN on March 12, 1989, the World Wide Web has grown to more than one billion Web pages and continues to grow. With the later Semantic Web vision, Berners-Lee et al. suggested an extension of the existing (Document) Web to allow better reuse, sharing and understanding of data. Both the Document Web and the Web of Data (the current implementation of the Semantic Web) grow continuously. This is a mixed blessing, as the two forms of the Web grow concurrently and most commonly contain different pieces of information. Modern information systems must thus bridge a Semantic Gap to allow holistic and unified access to information, independent of how the data is represented. One way to bridge this gap is the extraction of structured data, i.e., RDF, from the growing amount of unstructured and semi-structured information (e.g., tables and XML) on the Document Web. Note that unstructured data here stands for any type of textual information, such as news, blogs or tweets. While extracting structured data from unstructured data allows the development of powerful information systems, it requires high-quality and scalable knowledge extraction frameworks to lead to useful results. The dire need for such approaches has led to the development of a multitude of annotation frameworks and tools. However, most of these approaches are not evaluated on the same datasets or using the same measures. The resulting Evaluation Gap needs to be tackled by a concise evaluation framework to foster fine-grained and uniform evaluations of annotation tools and frameworks over any knowledge base.
Moreover, with the constant growth of data and the ongoing decentralization of knowledge, intuitive ways for non-experts to access the generated data are required. Humans have adapted their search behavior to current Web data via access paradigms such as keyword search so as to retrieve high-quality results; hence, most Web users expect only Web documents in return. However, humans think and most commonly express their information needs in natural language rather than in keyword phrases. Answering complex information needs often requires combining knowledge from various, differently structured data sources. Thus, we observe an Information Gap between natural-language questions and current keyword-based search paradigms, which additionally do not make use of the available structured and unstructured data sources. Question Answering (QA) systems provide an easy and efficient way to bridge this gap by allowing users to query data in natural language, thus reducing (1) a possible loss of precision and (2) a potential loss of time incurred while reformulating the search intention into a machine-readable form. Furthermore, QA systems enable answering natural-language queries with concise results instead of links to verbose Web documents. Additionally, they both allow and encourage access to, and the combination of, knowledge from heterogeneous knowledge bases (KBs) within one answer.
Consequently, three main research gaps are considered and addressed in this work. First, addressing the Semantic Gap between the unstructured Document Web and the structured Web of Data requires the development of scalable and accurate approaches for the extraction of structured data in RDF. This research challenge is addressed by several approaches within this thesis.
This thesis presents CETUS, an approach for recognizing entity types to populate RDF KBs. Furthermore, our knowledge-base-agnostic disambiguation framework AGDISTIS can efficiently detect the correct URIs for a given set of named entities. Additionally, we introduce REX, a Web-scale framework for RDF extraction from semi-structured (i.e., templated) websites, which makes use of the semantics of the reference knowledge base to check the extracted data.
The ongoing research on closing the Semantic Gap has already yielded a large number of annotation tools and frameworks. However, these approaches are currently still hard to compare, since the published evaluation results are calculated on diverse datasets and evaluated based on different measures. On the other hand, the issue of comparability of results is not intrinsic to the annotation task itself. Indeed, it is now well established that scientists spend between 60% and 80% of their time preparing data for experiments. That data preparation is such a tedious problem in the annotation domain is mostly due to the different formats of the gold standards as well as the different data representations across reference datasets. We tackle the resulting Evaluation Gap in two ways: First, we introduce a collection of three novel datasets, dubbed N3, to leverage the possibility of optimizing NER and NED algorithms via Linked Data and to ensure maximal interoperability, overcoming the need for corpus-specific parsers. Second, we present GERBIL, an evaluation framework for semantic entity annotation. The rationale behind our framework is to provide developers, end users and researchers with easy-to-use interfaces that allow for the agile, fine-grained and uniform evaluation of annotation tools and frameworks on multiple datasets.
The decentralized architecture behind the Web has led to pieces of information being distributed across data sources with varying structure. Moreover, the increasing demand for natural-language interfaces, as exemplified by current mobile applications, requires systems to deeply understand the underlying user information need. Consequently, a natural-language interface for asking questions requires a hybrid approach to data usage, i.e., simultaneously performing a search on full texts and semantic knowledge bases. To close the Information Gap, this thesis presents HAWK, a novel entity search approach developed for hybrid QA that combines structured RDF and unstructured full-text data sources.
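To make the extraction step concrete, the following is a minimal sketch (not the actual CETUS, AGDISTIS or REX implementation) of the output such frameworks converge on: recognized and disambiguated entities serialized as RDF triples. The surface forms, URIs and entity types are illustrative DBpedia examples, and the use of rdflib is an assumption about tooling, not taken from the thesis.

```python
# A minimal sketch (not the actual CETUS/AGDISTIS/REX code) of the output such
# frameworks converge on: recognized, disambiguated entities serialized as RDF.
# Surface forms, URIs and types are illustrative DBpedia examples.
from rdflib import Graph, Namespace, RDF

DBR = Namespace("http://dbpedia.org/resource/")
DBO = Namespace("http://dbpedia.org/ontology/")

# hypothetical output of an entity recognition + disambiguation pipeline:
# surface form -> (disambiguated URI, detected entity type)
extracted = {
    "Tim Berners-Lee": (DBR["Tim_Berners-Lee"], DBO.Person),
    "CERN": (DBR.CERN, DBO.Organisation),
}

g = Graph()
g.bind("dbr", DBR)
g.bind("dbo", DBO)
for surface, (uri, entity_type) in extracted.items():
    g.add((uri, RDF.type, entity_type))  # type assertion for the RDF KB

print(g.serialize(format="turtle"))
```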

    Systematic review of question answering over knowledge bases

    Over the years, a growing number of semantic data repositories have been made available on the web. However, this has created new challenges in exploiting these resources efficiently. Querying services require knowledge beyond the typical user’s expertise, which is a critical issue in adopting semantic information solutions. Several proposals to overcome this difficulty have suggested using question answering (QA) systems to provide user-friendly interfaces and allow natural language use. Because question answering over knowledge bases (KBQAs) is a very active research topic, a comprehensive view of the field is essential. The purpose of this study was to conduct a systematic review of methods and systems for KBQAs to identify their main advantages and limitations. The inclusion criteria rationale was English full-text articles published since 2015 on methods and systems for KBQAs.
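As a concrete illustration of the expertise barrier the review describes, consider how even the simple question "What is the capital of Germany?" must be expressed against DBpedia. The endpoint, prefixes and property below are real DBpedia identifiers; the surrounding Python/SPARQLWrapper code is only a hedged sketch, not drawn from any reviewed system.

```python
# A hedged sketch of the expertise barrier: the SPARQL a user must write (and
# the DBpedia vocabulary they must know) to answer "What is the capital of
# Germany?". dbr:Germany and dbo:capital are real DBpedia identifiers.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setQuery("""
    PREFIX dbr: <http://dbpedia.org/resource/>
    PREFIX dbo: <http://dbpedia.org/ontology/>
    SELECT ?capital WHERE { dbr:Germany dbo:capital ?capital }
""")
sparql.setReturnFormat(JSON)

for row in sparql.query().convert()["results"]["bindings"]:
    print(row["capital"]["value"])  # expected: http://dbpedia.org/resource/Berlin
```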

    Template-Based Question Answering over Linked Data using Recursive Neural Networks

    The Semantic Web contains large amounts of related information in the form of knowledge graphs such as DBpedia. These knowledge graphs are typically enormous and are not easily accessible for users, as they require specialized knowledge of query languages (such as SPARQL) as well as deep familiarity with the ontologies used by these knowledge graphs. So, to make these knowledge graphs more accessible (even for non-experts), several question answering (QA) systems have been developed over the last decade. Due to the complexity of the task, several approaches have been undertaken that include techniques from natural language processing (NLP), information retrieval (IR), machine learning (ML) and the Semantic Web (SW). At a higher level, most question answering systems approach the task as a conversion from the natural language question to its corresponding SPARQL query. These systems then utilize the query to retrieve the desired entities or literals.
One approach to solve this problem, used by most systems today, is to apply deep syntactic and semantic analysis on the input question to derive the SPARQL query. This has resulted in the evolution of natural language processing pipelines with common characteristics such as answer type detection, segmentation, phrase matching, part-of-speech tagging, named entity recognition, named entity disambiguation, syntactic or dependency parsing, semantic role labeling, etc. This has led to NLP pipeline architectures that integrate components that each solve a specific aspect of the problem and pass their results on to subsequent components for further processing, e.g., DBpedia Spotlight for named entity recognition or RelMatch for relational mapping. A major drawback of this approach is error propagation, a common problem in NLP: mistakes early in the pipeline can adversely affect successive steps further down the pipeline.
Another approach is to use query templates, either manually generated or extracted from existing benchmark datasets such as Question Answering over Linked Data (QALD), to generate the SPARQL queries; a template is essentially a predefined query with various slots that need to be filled. This approach potentially shifts the question answering problem into a classification task in which the system needs to match the input question to the appropriate template (class label). This thesis proposes a neural network approach to automatically learn and classify natural language questions into their corresponding templates using recursive neural networks. An obvious advantage of using neural networks is the elimination of the need for laborious feature engineering, which can be cumbersome and error-prone. The input question is encoded into a vector representation. The model is trained and evaluated on the LC-QuAD dataset (Large-Scale Complex Question Answering Dataset), which was created explicitly for machine learning based QA approaches for learning complex SPARQL queries. The dataset consists of 5000 questions along with their corresponding SPARQL queries over the DBpedia dataset, spanning 5042 entities and 615 predicates. These queries were annotated based on 38 unique templates that the model attempts to classify. The resulting model is evaluated against both the LC-QuAD dataset and the Question Answering over Linked Data (QALD-7) dataset.
The recursive neural network achieves a template classification accuracy of 0.828 on the LC-QuAD dataset and an accuracy of 0.618 on the QALD-7 dataset. When the top-2 most likely templates are considered, the model achieves an accuracy of 0.945 on the LC-QuAD dataset and 0.786 on the QALD-7 dataset. After slot filling, the overall system achieves a macro F-score of 0.419 on the LC-QuAD dataset and a macro F-score of 0.417 on the QALD-7 dataset.
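The template-plus-slot-filling idea described above can be sketched in a few lines. The toy classifier below stands in for the recursive neural network, and the two templates as well as the linked entity and predicate are illustrative, not taken from LC-QuAD; the thesis model is far more involved.

```python
# A toy sketch of template classification plus slot filling (not the thesis
# model): a classifier picks one of a fixed set of SPARQL templates, and the
# output of entity/relation linking fills its slots. Both templates and the
# keyword-based classifier are illustrative, not taken from LC-QuAD.
TEMPLATES = {
    0: "SELECT ?x WHERE {{ <{e1}> <{p1}> ?x }}",
    1: "SELECT (COUNT(?x) AS ?c) WHERE {{ ?x <{p1}> <{e1}> }}",
}

def classify(question: str) -> int:
    # stand-in for the recursive neural network over the encoded question
    return 1 if question.lower().startswith("how many") else 0

def fill(template_id: int, slots: dict) -> str:
    return TEMPLATES[template_id].format(**slots)

question = "How many companies were founded in Berlin?"
slots = {  # hypothetical entity/relation linking output
    "e1": "http://dbpedia.org/resource/Berlin",
    "p1": "http://dbpedia.org/ontology/foundationPlace",
}
print(fill(classify(question), slots))
# SELECT (COUNT(?x) AS ?c) WHERE { ?x <.../foundationPlace> <.../Berlin> }
```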

    Towards Dynamic Composition of Question Answering Pipelines

    Question answering (QA) over knowledge graphs has gained significant momentum over the past five years due to the increasing availability of large knowledge graphs and the rising importance of question answering for user interaction. DBpedia has been the most prominently used knowledge graph in this setting. QA systems implement a pipeline connecting a sequence of QA components for translating an input question into its corresponding formal query (e.g. SPARQL); this query is then executed over a knowledge graph in order to produce the answer to the question. Recent empirical studies have revealed that, albeit effective overall, the performance of QA systems and QA components depends heavily on the features of input questions, and that not even the combination of the best performing QA systems or individual QA components retrieves complete and correct answers. Furthermore, these QA systems cannot be easily reused or extended, and their results cannot be easily reproduced, since the systems are mostly implemented in a monolithic fashion, lack standardised interfaces, and are often neither open source nor available as Web services. These drawbacks of the state of the art prevent many of these approaches from being employed in real-world applications.
In this thesis, we tackle the problem of QA over knowledge graphs and propose a generic approach to promote reusability and build question answering systems in a collaborative effort. Firstly, we define the qa vocabulary and the Qanary methodology to establish an abstraction level over existing QA systems and components. Qanary relies on the qa vocabulary to establish guidelines for semantically describing the knowledge exchange between the components of a QA system. We implement a component-based modular framework called "Qanary Ecosystem", utilising the Qanary methodology to integrate several heterogeneous QA components in a single platform. We further present the Qaestro framework, which provides an approach to semantically describing question answering components and effectively enumerates QA pipelines based on a QA developer's requirements; Qaestro provides all valid combinations of available QA components respecting the input-output requirements of each component to build QA pipelines, as sketched below. Finally, we address the scalability of QA components within a framework and propose a novel approach that chooses the best component per task to automatically build a QA pipeline for each input question. We implement this model within FRANKENSTEIN, a framework able to select QA components and compose pipelines. FRANKENSTEIN extends the Qanary Ecosystem and utilises the qa vocabulary for data exchange. It integrates 29 independent QA components implementing five QA tasks, resulting in 360 unique QA pipelines. Each approach proposed in this thesis (the Qanary methodology, Qaestro, and FRANKENSTEIN) is supported by extensive evaluation to demonstrate its effectiveness. Our contributions target a broader research agenda of offering the QA community an efficient way of applying their research to a field which is driven by many different disciplines and consequently requires a collaborative approach to achieve significant progress in the domain of question answering.
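The pipeline-enumeration idea behind Qaestro can be sketched as a constraint check over component input/output annotations. The component names and their declared inputs and outputs below are hypothetical, not Qaestro's actual semantic descriptions; the sketch only illustrates how valid pipelines are those orderings in which every component's required inputs are available before it runs.

```python
# A hedged sketch of Qaestro-style pipeline enumeration: each component
# declares required inputs and produced outputs, and valid pipelines are
# orderings in which every requirement is met before the component runs.
# Component names and annotations are hypothetical, not Qaestro's own.
from itertools import permutations

COMPONENTS = {
    "NER":       ({"question"}, {"spans"}),
    "NED":       ({"spans"}, {"entities"}),
    "RelLinker": ({"question"}, {"relations"}),
    "QueryGen":  ({"entities", "relations"}, {"sparql"}),
}

def valid_pipelines(goal=frozenset({"sparql"})):
    for order in permutations(COMPONENTS):
        available = {"question"}  # what the user provides
        for name in order:
            required, produced = COMPONENTS[name]
            if not required <= available:
                break  # input-output requirement violated
            available |= produced
        else:
            if goal <= available:
                yield order

for pipeline in valid_pipelines():
    print(" -> ".join(pipeline))
```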