
    Feasibility report: Delivering case-study based learning using artificial intelligence and gaming technologies

    This document describes an investigation into the technical feasibility of a game to support learning based on case studies. Information systems students using the game will conduct fact-finding interviews with virtual characters. We survey relevant technologies in computational linguistics and games, assess the applicability of the various approaches, and propose an architecture for the game based on existing techniques. Finally, we propose a phased plan for developing the game.

    Query-Time Data Integration

    Today, data is collected at ever increasing scale and variety, opening up enormous potential for new insights and data-centric products. However, in many cases the volume and heterogeneity of new data sources preclude up-front integration using traditional ETL processes and data warehouses. In some cases, it is even unclear if and in what context the collected data will be utilized. Therefore, there is a need for agile methods that defer the effort of integration until the usage context is established. This thesis introduces Query-Time Data Integration as an alternative concept to traditional up-front integration. It aims at enabling users to issue ad-hoc queries on their own data as if all potential other data sources were already integrated, without declaring specific sources and mappings to use. Automated data search and integration methods are then coupled directly with query processing on the available data. The ambiguity and uncertainty introduced through fully automated retrieval and mapping methods is compensated for by answering those queries with ranked lists of alternative results. Each result is based on different data sources or query interpretations, allowing users to pick the result most suitable to their information need. To this end, this thesis makes three main contributions. Firstly, we introduce a novel method for Top-k Entity Augmentation, which constructs a top-k list of consistent integration results from a large corpus of heterogeneous data sources. It improves on the state of the art by producing a set of individually consistent but mutually diverse alternative solutions, while minimizing the number of data sources used. Secondly, based on this novel augmentation method, we introduce the DrillBeyond system, which processes Open World SQL queries, i.e., queries referencing arbitrary attributes not defined in the queried database. The original database is then augmented at query time with Web data sources providing those attributes. Its hybrid augmentation/relational query processing enables the use of ad-hoc data search and integration in data analysis queries, and improves both performance and quality compared to using separate systems for the two tasks. Finally, we study the management of large-scale dataset corpora such as data lakes or Open Data platforms, which are used as data sources for our augmentation methods. We introduce Publish-time Data Integration as a new technique for data curation systems managing such corpora, which aims at improving the individual reusability of datasets without requiring up-front global integration. This is achieved by automatically generating metadata and format recommendations, allowing publishers to enhance their datasets with minimal effort. Collectively, these three contributions form the foundation of a Query-time Data Integration architecture that enables ad-hoc data search and integration queries over large heterogeneous dataset collections.
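    As a rough illustration of the top-k entity augmentation idea (not code from the thesis), the sketch below assembles k alternative augmentations of a set of entities with a missing attribute from candidate web tables, greedily preferring solutions that use few sources and penalizing sources already used in earlier alternatives so that the k results stay mutually diverse. CandidateSource, augment_top_k, and the scoring heuristic are invented for the example and do not reflect the actual DrillBeyond implementation.

```python
# Illustrative sketch of top-k entity augmentation (hypothetical, simplified).
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class CandidateSource:          # stand-in for one retrieved web table
    name: str
    values: Dict[str, float]    # entity -> value of the requested attribute

def augment_top_k(entities: List[str],
                  sources: List[CandidateSource],
                  k: int = 3):
    """Return up to k alternative augmentations: entity -> (value, source name)."""
    solutions = []
    for _ in range(k):
        previously_used = {name for sol in solutions for (_, name) in sol.values()}
        remaining, solution = set(entities), {}
        while remaining:
            # Greedy set-cover step: prefer sources covering many remaining
            # entities; mildly penalize sources used in earlier solutions so
            # the k alternatives differ from one another.
            best = max(sources,
                       key=lambda s: len(remaining & s.values.keys())
                                     - 0.5 * (s.name in previously_used))
            covered = remaining & best.values.keys()
            if not covered:
                break              # nothing covers the rest; keep partial result
            for e in covered:
                solution[e] = (best.values[e], best.name)
            remaining -= covered
        solutions.append(solution)
    return solutions

# Tiny usage example with made-up data.
sources = [
    CandidateSource("web_table_A", {"Berlin": 3.6e6, "Hamburg": 1.8e6}),
    CandidateSource("web_table_B", {"Berlin": 3.7e6, "Munich": 1.5e6}),
]
print(augment_top_k(["Berlin", "Hamburg", "Munich"], sources, k=2))
```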

    Keyword-Based Querying for the Social Semantic Web

    Enabling non-experts to publish data on the web is an important achievement of the social web and one of the primary goals of the social semantic web. Making the data easily accessible in turn has received only little attention, which is problematic from the point of view of incentives: users are likely to be less motivated to participate in the creation of content if the use of this content is mostly reserved to experts. Querying in semantic wikis, for example, is typically realized in terms of full-text search over the textual content and a web query language such as SPARQL for the annotations. This approach has two shortcomings that limit the extent to which data can be leveraged by users: combined queries over content and annotations are not possible, and users are either restricted to expressing their query intent using simple but vague keyword queries or have to learn a complex web query language. The work presented in this dissertation investigates a more suitable form of querying for semantic wikis that consolidates two seemingly conflicting characteristics of query languages: ease of use and expressiveness. This work was carried out in the context of the semantic wiki KiWi, but the underlying ideas apply more generally to the social semantic and social web. We begin by defining a simple modular conceptual model for the KiWi wiki that enables rich and expressive knowledge representation. One component of this model is structured tags, an annotation formalism that is simple yet flexible and expressive, and aims at bridging the gap between atomic tags and RDF. The viability of the approach is confirmed by a user study, which finds that structured tags are suitable for quickly annotating evolving knowledge and are perceived well by users. The main contribution of this dissertation is the design and implementation of KWQL, a query language for semantic wikis. KWQL combines keyword search and web querying to enable querying that scales with user experience and information need: basic queries are easy to express; as the search criteria become more complex, more expertise is needed to formulate the corresponding query. A novel aspect of KWQL is that it combines both paradigms in a bottom-up fashion: it treats neither of the two as an extension of the other, but instead integrates both in one framework. The language allows for rich combined queries over full text, metadata, document structure, and informal to formal semantic annotations. KWilt, the KWQL query engine, provides the full expressive power of first-order queries, but at the same time can evaluate basic queries at almost the speed of the underlying search engine. KWQL is accompanied by the visual query language visKWQL, and an editor that displays both the textual and visual form of the current query and reflects changes to either representation in the other. A user study shows that participants quickly learn to construct KWQL and visKWQL queries, even when given only a short introduction. KWQL allows users to sift the wealth of structure and annotations in an information system for relevant data. If relevant data constitutes a substantial fraction of all data, ranking becomes important. To this end, we propose PEST, a novel ranking method that propagates relevance among structurally related or similarly annotated data. Extensive experiments, including a user study on a real-life wiki, show that PEST improves the quality of the ranking over a range of existing ranking approaches.
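    The abstract does not spell out how PEST computes its ranking; purely as a hedged sketch of the general idea of propagating relevance along structural and annotation links, the example below spreads initial keyword-match scores over a small link graph by power iteration. The graph, weights, and damping factor alpha are illustrative assumptions, not PEST's actual formulation.

```python
import numpy as np

# Illustrative relevance propagation over three wiki pages (made-up data).
# adjacency[i, j] > 0 means page j links to / is structurally related to page i.
adjacency = np.array([
    [0.0, 1.0, 0.0],   # page 0 is related to page 1
    [1.0, 0.0, 1.0],   # page 1 is related to pages 0 and 2
    [0.0, 1.0, 0.0],   # page 2 is related to page 1
])
keyword_scores = np.array([1.0, 0.0, 0.2])   # initial full-text match scores

# Column-normalize so each page distributes its relevance among its neighbours.
col_sums = adjacency.sum(axis=0)
col_sums[col_sums == 0] = 1.0
transition = adjacency / col_sums

alpha = 0.5                       # share of relevance that is propagated
scores = keyword_scores.copy()
for _ in range(50):               # power iteration towards a fixpoint
    scores = (1 - alpha) * keyword_scores + alpha * transition @ scores

print(scores)  # pages structurally close to strong keyword matches gain relevance
```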

    Lifecycle of neural semantic parsing

    Humans are born with the ability to learn to perceive, comprehend and communicate with language. Computing machines, on the other hand, only understand programming languages. To bridge the gap between humans and computers, deep semantic parsers convert natural language utterances into machine-understandable logical forms. The technique has a wide range of applications, from spoken dialogue systems to natural language interfaces. This thesis focuses on neural network-based semantic parsing. Traditional semantic parsers function with a domain-specific grammar that pairs utterances and logical forms, and parse with a CKY-like algorithm in polynomial time. Recent advances in neural semantic parsing reformulate the task as a sequence-to-sequence learning problem. Neural semantic parsers parse a sentence in linear time, and reduce the need for domain-specific assumptions, grammar learning, and extensive feature engineering. But this modeling flexibility comes at a cost, since it is no longer possible to interpret how meaning composition is performed, given that logical forms are structured objects (trees or graphs). Such knowledge plays a critical role in understanding modeling limitations so as to build better semantic parsers. Moreover, the sequence-to-sequence learning problem is fairly unconstrained, both in terms of the possible derivations to consider and in terms of the target logical forms, which can be ill-formed or unexecutable. The first contribution of this thesis is an improved neural semantic parser, which produces syntactically valid logical forms by following a transition system and grammar constraints. The transition system integrates the generation of domain-general aspects (i.e., valid tree structures and language-specific predicates) and domain-specific aspects (i.e., domain-specific predicates and entities) in a unified way. The model employs various neural attention mechanisms to handle mismatches between natural language and formal language—a central challenge in semantic parsing. Training data for semantic parsers typically consists of utterances paired with logical forms. Another challenge of semantic parsing concerns the annotation of logical forms, which is labor-intensive. To write down the correct logical form of an utterance, one not only needs expertise in the semantic formalism, but also has to ensure the logical form matches the utterance semantics. We tackle this challenge in two ways. On the one hand, we extend the neural semantic parser to a weakly supervised setting within a parser-ranker framework. The weakly supervised setup uses training data of utterance-denotation (e.g., question-answer) pairs, which are much easier to obtain and therefore allow semantic parsers to scale to complex domains. Our framework combines the advantages of conventional weakly supervised semantic parsers and neural semantic parsing. Candidate logical forms are generated by a neural decoder and subsequently scored by a ranking component. We present methods to efficiently search for candidate logical forms while handling spurious ambiguity: some logical forms do not match the utterance semantics but coincidentally execute to the correct denotation, and should be excluded from training. On the other hand, we focus on how to quickly engineer a practical neural semantic parser for closed domains by directly reducing the annotation difficulty of utterance-logical form pairs. We develop an interface for efficiently collecting compositional utterance-logical form pairs and then leverage the data collection method to train neural semantic parsers. Our method provides an end-to-end solution for closed-domain semantic parsing given only an ontology. We also extend the end-to-end solution to handle sequential utterances simulating a non-interactive user session. Specifically, the data collection interface is modified to collect utterance sequences which exhibit various co-reference patterns, and the neural semantic parser is extended to parse context-dependent utterances. In summary, this thesis covers the lifecycle of designing a neural semantic parser: from model design (i.e., how to model a neural semantic parser with an appropriate inductive bias), through training (i.e., how to perform fully supervised and weakly supervised training), to engineering (i.e., how to build a neural semantic parser from a domain ontology).
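    To make the grammar-constraint idea concrete, here is a minimal hypothetical sketch of constrained decoding: at each step only actions that keep the logical form well-formed are offered to the (here, randomly stubbed) scoring model. The action inventory, predicates, and stack discipline are simplifications invented for the example, not the thesis' actual transition system.

```python
import random

# Hypothetical, highly simplified transition system for building tree-structured
# logical forms: OPEN(p) starts a predicate of known arity, ARG(e) fills one
# argument slot with an entity, REDUCE closes the innermost predicate once all
# of its argument slots are filled.
PREDICATES = {"capital_of": 1, "population": 1, "and": 2}   # predicate -> arity
ENTITIES = ["germany", "france"]

def valid_actions(stack):
    """Grammar constraint: only actions that keep the logical form well-formed."""
    if not stack:                                   # nothing built yet: open a tree
        return [("OPEN", p) for p in PREDICATES]
    pred, filled = stack[-1]
    if filled < PREDICATES[pred]:                   # innermost predicate needs args
        return ([("ARG", e) for e in ENTITIES] +
                [("OPEN", p) for p in PREDICATES])  # nested sub-expressions allowed
    return [("REDUCE", None)]                       # saturated: must close it

def decode(max_steps=20, score=lambda action: random.random()):
    """Greedy constrained decoding; `score` stands in for the neural model."""
    stack, output = [], []
    for _ in range(max_steps):
        action, arg = max(valid_actions(stack), key=score)   # masked choice
        output.append((action, arg))
        if action == "OPEN":
            stack.append((arg, 0))
        elif action == "ARG":
            pred, filled = stack.pop()
            stack.append((pred, filled + 1))
        else:                                       # REDUCE
            stack.pop()
            if not stack:
                break                               # outermost predicate closed
            pred, filled = stack.pop()
            stack.append((pred, filled + 1))        # closed subtree fills parent slot
    return output

print(decode())
```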

    A Principled Framework for Constructing Natural Language Interfaces To Temporal Databases

    Most existing natural language interfaces to databases (NLIDBs) were designed to be used with "snapshot" database systems, which provide very limited facilities for manipulating time-dependent data. Consequently, most NLIDBs also provide very limited support for the notion of time. The database community is becoming increasingly interested in temporal database systems, which are intended to store and manipulate in a principled manner information not only about the present, but also about the past and future. This thesis develops a principled framework for constructing English NLIDBs for temporal databases (NLITDBs), drawing on research in tense and aspect theories, temporal logics, and temporal databases. I first explore temporal linguistic phenomena that are likely to appear in English questions to NLITDBs. Drawing on existing linguistic theories of time, I formulate an account for a large number of these phenomena that is simple enough to be embodied in practical NLITDBs. Exploiting ideas from temporal logics, I then define a temporal meaning representation language, TOP, and show how HPSG grammar theory can be modified to incorporate the tense and aspect account of this thesis and to map a wide range of English questions involving time to appropriate TOP expressions. Finally, I present and prove the correctness of a method to translate from TOP to TSQL2, TSQL2 being a temporal extension of the SQL-92 database language. In this way, I establish a sound route from English questions involving time to a general-purpose temporal database language, which can act as a principled framework for building NLITDBs. To demonstrate that this framework is workable, I employ it to develop a prototype NLITDB, implemented using ALE and Prolog.
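    The abstract names the TOP meaning representation and the TSQL2 target but does not reproduce either; the toy pipeline below only illustrates the general shape of such a translation (English question → logical form with a temporal operator → SQL-like query over a valid-time table). The Past operator, the events table, and the generated SQL dialect are invented for the illustration and are not TOP's or TSQL2's actual syntax.

```python
from dataclasses import dataclass

# Toy temporal meaning representation; NOT the thesis' actual TOP syntax.
@dataclass
class Past:
    predicate: str          # e.g. "depart"
    subject: str            # e.g. "BA737"

def to_sql(expr: Past) -> str:
    """Map the toy logical form to a generic SQL query over a table that keeps
    one valid-time period per row (valid_from / valid_to columns). TSQL2 has
    dedicated temporal constructs instead; this string is only illustrative."""
    return (
        "SELECT * FROM events "
        f"WHERE predicate = '{expr.predicate}' "
        f"AND subject = '{expr.subject}' "
        "AND valid_to < CURRENT_TIMESTAMP"          # constraint added by Past
    )

# "Did flight BA737 depart?"  ->  Past("depart", "BA737")  ->  SQL string
print(to_sql(Past("depart", "BA737")))
```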

    Data Infrastructure for Medical Research

    While we are witnessing rapid growth in data across the sciences and in many applications, this growth is particularly remarkable in the medical domain, be it because of higher-resolution instruments and diagnostic tools (e.g. MRI), new sources of structured data like activity trackers, the widespread use of electronic health records, and many others. The sheer volume of the data is not, however, the only challenge to be faced when using medical data for research. Other crucial challenges include data heterogeneity, data quality, data privacy, and so on. In this article, we review solutions addressing these challenges by discussing the current state of the art in the areas of data integration, data cleaning, data privacy, and scalable data access and processing in the context of medical data. The techniques and tools we present will give practitioners — computer scientists and medical researchers alike — a starting point to understand the challenges and solutions, and ultimately to analyse medical data and gain better and quicker insights.

    GPT-4 Technical Report

    We report the development of GPT-4, a large-scale, multimodal model that can accept image and text inputs and produce text outputs. While less capable than humans in many real-world scenarios, GPT-4 exhibits human-level performance on various professional and academic benchmarks, including passing a simulated bar exam with a score around the top 10% of test takers. GPT-4 is a Transformer-based model pre-trained to predict the next token in a document. The post-training alignment process results in improved performance on measures of factuality and adherence to desired behavior. A core component of this project was developing infrastructure and optimization methods that behave predictably across a wide range of scales. This allowed us to accurately predict some aspects of GPT-4's performance based on models trained with no more than 1/1,000th the compute of GPT-4.
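    The report does not disclose its exact fitting procedure; the snippet below only sketches the generic technique behind such predictable scaling: fit a power law to the final loss of several small training runs and extrapolate it to a much larger compute budget. All numbers and the simple power-law form are made up for the illustration.

```python
import numpy as np

# Hypothetical (compute, loss) measurements from small training runs.
compute = np.array([1e18, 3e18, 1e19, 3e19, 1e20])    # training FLOPs
loss    = np.array([3.10, 2.95, 2.80, 2.67, 2.55])     # final eval loss

# Fit loss ≈ a * compute^(-b), i.e. a straight line in log-log space.
slope, intercept = np.polyfit(np.log(compute), np.log(loss), deg=1)
a, b = np.exp(intercept), -slope

target = 1e23                                           # ~1000x the largest run
predicted = a * target ** (-b)
print(f"predicted loss at {target:.0e} FLOPs: {predicted:.2f}")
```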

    Development and Specification of Virtual Environments

    This thesis concerns the issues involved in the development of virtual environments (VEs). VEs are more than virtual reality. We identify four main characteristics of VEs: graphical interaction, multimodality, interface agents, and multi-user operation. These characteristics are illustrated with an overview of different classes of VE-like applications and a number of state-of-the-art VEs. To further define the topic of research, we propose a general framework for VE systems development, in which we identify five major classes of development tools: methodology, guidelines, design specification, analysis, and development environments. For each, we give an overview of existing best practices.