Feasibility report: Delivering case-study based learning using artificial intelligence and gaming technologies
This document describes an investigation into the technical feasibility of a game to support learning based on case studies. Information systems students using the game will conduct fact-finding interviews with virtual characters. We survey relevant technologies in computational linguistics and games. We assess the applicability of the various approaches and propose an architecture for the game based on existing techniques. Finally, we propose a phased plan for the development of the game.
Query-Time Data Integration
Today, data is collected in ever increasing scale and variety, opening up enormous potential for new insights and data-centric products. However, in many cases the volume and heterogeneity of new data sources precludes up-front integration using traditional ETL processes and data warehouses. In some cases, it is even unclear if and in what context the collected data will be utilized. Therefore, there is a need for agile methods that defer the effort of integration until the usage context is established.
This thesis introduces Query-Time Data Integration as an alternative concept to traditional up-front integration. It aims at enabling users to issue ad-hoc queries on their own data as if all potential other data sources were already integrated, without declaring specific sources and mappings to use. Automated data search and integration methods are then coupled directly with query processing on the available data. The ambiguity and uncertainty introduced through fully automated retrieval and mapping methods is compensated by answering those queries with ranked lists of alternative results. Each result is then based on different data sources or query interpretations, allowing users to pick the result most suitable to their information need.
To this end, this thesis makes three main contributions. Firstly, we introduce a novel method for Top-k Entity Augmentation, which is able to construct a top-k list of consistent integration results from a large corpus of heterogeneous data sources. It improves on the state-of-the-art by producing a set of individually consistent but mutually diverse alternative solutions, while minimizing the number of data sources used. Secondly, based on this novel augmentation method, we introduce the DrillBeyond system, which is able to process Open World SQL queries, i.e., queries referencing arbitrary attributes not defined in the queried database. The original database is then augmented at query time with Web data sources providing those attributes. Its hybrid augmentation/relational query processing enables the use of ad-hoc data search and integration in data analysis queries, and improves both performance and quality when compared to using separate systems for the two tasks. Finally, we study the management of large-scale dataset corpora such as data lakes or Open Data platforms, which are used as data sources for our augmentation methods. We introduce Publish-time Data Integration as a new technique for data curation systems managing such corpora, which aims at improving the individual reusability of datasets without requiring up-front global integration. This is achieved by automatically generating metadata and format recommendations, allowing publishers to enhance their datasets with minimal effort.
Collectively, these three contributions are the foundation of a Query-time Data Integration architecture that enables ad-hoc data search and integration queries over large heterogeneous dataset collections.
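The top-k augmentation idea can be pictured as a greedy selection over candidate integration results: each candidate assigns every requested entity a value drawn from some source, and we repeatedly pick the candidate that uses the fewest distinct sources while overlapping least with the results already chosen. The Python sketch below is illustrative only; the data layout and scoring are assumptions, not the thesis's actual algorithm:

```python
def topk_augmentation(entities, candidates, k=3):
    """Select k augmentation results that are individually consistent
    (cover every entity), mutually diverse, and use few distinct sources.

    candidates: list of dicts mapping entity -> (value, source)."""
    # Consistency filter: keep only candidates covering all requested entities.
    pool = [c for c in candidates if set(c) == set(entities)]
    chosen = []
    while pool and len(chosen) < k:
        def score(cand):
            sources = {src for _, src in cand.values()}
            # Diversity: count value agreements with already-chosen results.
            overlap = sum(
                sum(1 for e in entities if cand[e] == prev[e])
                for prev in chosen
            )
            return (len(sources), overlap)  # fewer sources, less overlap wins
        best = min(pool, key=score)
        chosen.append(best)
        pool.remove(best)
    return chosen

# Hypothetical Web tables offering a "population (millions)" attribute.
candidates = [
    {"Berlin": (3.6, "wiki"), "Paris": (2.1, "wiki")},
    {"Berlin": (3.7, "eurostat"), "Paris": (2.1, "wiki")},
    {"Berlin": (3.6, "wiki"), "Paris": (2.2, "un")},
]
results = topk_augmentation(["Berlin", "Paris"], candidates, k=2)
```

Here the single-source candidate is picked first, and the second pick is the remaining candidate least similar to it, which mirrors the "consistent but mutually diverse, minimal sources" objective described above.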
Keyword-Based Querying for the Social Semantic Web
Enabling non-experts to publish data on the web is an important
achievement of the social web and one of the primary goals of the social
semantic web. Making the data easily accessible, in turn, has received
little attention, which is problematic from the point of view of
incentives: users are likely to be less motivated to participate in the
creation of content if the use of this content is mostly reserved to
experts.
Querying in semantic wikis, for example, is typically realized in terms of
full text search over the textual content and a web query language such as
SPARQL for the annotations. This approach has two shortcomings that limit
the extent to which data can be leveraged by users: combined queries over
content and annotations are not possible, and users either are restricted
to expressing their query intent using simple but vague keyword queries or
have to learn a complex web query language.
The work presented in this dissertation investigates a more suitable form
of querying for semantic wikis that consolidates two seemingly conflicting
characteristics of query languages, ease of use and expressiveness. This
work was carried out in the context of the semantic wiki KiWi, but the
underlying ideas apply more generally to the social semantic and social
web.
We begin by defining a simple modular conceptual model for the KiWi wiki
that enables rich and expressive knowledge representation. A component of
this model are structured tags, an annotation formalism that is simple yet
flexible and expressive, and aims at bridging the gap between atomic tags
and RDF. The viability of the approach is confirmed by a user study, which
finds that structured tags are suitable for quickly annotating evolving
knowledge and are perceived well by the users.
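Structured tags can be pictured as a middle ground between bare labels and full RDF triples. The encoding below is an illustrative Python rendering, not KiWi's actual notation:

```python
# Atomic tag: a bare label, easy to write but semantically vague.
atomic = "trip"

# Structured tag: sub-tags grouped under one label, still quick to author.
structured = {"trip": {"destination": "Rome", "year": "2010"}}

# RDF-style triples: fully formal, but demand expertise up front.
triples = [
    ("doc1", "tag:destination", "Rome"),
    ("doc1", "tag:year", "2010"),
]
```

The structured form keeps the low authoring effort of tagging while already exposing the key/value structure that a later mapping to RDF would need.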
The main contribution of this dissertation is the design and
implementation of KWQL, a query language for semantic wikis. KWQL combines
keyword search and web querying to enable querying that scales with user
experience and information need: basic queries are easy to express; as the
search criteria become more complex, more expertise is needed to formulate
the corresponding query. A novel aspect of KWQL is that it combines both
paradigms in a bottom-up fashion. It treats neither of the two as an
extension to the other, but instead integrates both in one framework. The
language allows for rich combined queries of full text, metadata, document
structure, and informal to formal semantic annotations. KWilt, the KWQL
query engine, provides the full expressive power of first-order queries,
but at the same time can evaluate basic queries at almost the speed of the
underlying search engine. KWQL is accompanied by the visual query language
visKWQL, and an editor that displays both the textual and visual form of
the current query and reflects changes to either representation in the
other. A user study shows that participants quickly learn to construct
KWQL and visKWQL queries, even when given only a short introduction.
KWQL allows users to sift the wealth of structure and annotations in an
information system for relevant data. If relevant data constitutes a
substantial fraction of all data, ranking becomes important. To this end,
we propose PEST, a novel ranking method that propagates relevance among
structurally related or similarly annotated data. Extensive experiments,
including a user study on a real-life wiki, show that PEST improves the
quality of the ranking over a range of existing ranking approaches.
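The propagation idea behind PEST resembles PageRank-style spreading of relevance over a graph of structurally related or similarly annotated items. The sketch below is a generic fixed-point propagation, not PEST's actual transition matrix:

```python
def propagate(scores, edges, alpha=0.5, iterations=30):
    """Spread relevance over a graph: each item keeps (1 - alpha) of its
    own text-match score and receives alpha from its weighted neighbours.

    edges[i] is a list of (j, weight) pairs, weights summing to <= 1."""
    current = list(scores)
    for _ in range(iterations):
        current = [
            (1 - alpha) * scores[i]
            + alpha * sum(w * current[j] for j, w in edges[i])
            for i in range(len(scores))
        ]
    return current

# Item 0 matches the query directly; item 1 links to it; item 2 is isolated.
scores = [1.0, 0.0, 0.0]
edges = [[(1, 1.0)], [(0, 1.0)], []]
ranked = propagate(scores, edges)
```

Item 1 ends up ranked above the isolated item 2 despite having no textual match of its own, which is exactly the effect of propagating relevance along structural links.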
Lifecycle of neural semantic parsing
Humans are born with the ability to learn to perceive, comprehend and communicate
with language. Computing machines, on the other hand, only understand programming
languages. To bridge the gap between humans and computers, deep semantic parsers
convert natural language utterances into machine-understandable logical forms. The
technique has a wide range of applications, from spoken dialogue systems to
natural language interfaces. This thesis focuses on neural network-based semantic
parsing.
Traditional semantic parsers function with a domain-specific grammar that pairs
utterances and logical forms, and parse with a CKY-like algorithm in polynomial
time. Recent advances in neural semantic parsing reformulate the task as a sequence-to-
sequence learning problem. Neural semantic parsers parse a sentence in linear
time, and reduce the need for domain-specific assumptions, grammar learning, and
extensive feature engineering. But this modeling flexibility comes at a cost since
it is no longer possible to interpret how meaning composition is performed, given
that logical forms are structured objects (trees or graphs). Such knowledge plays
a critical role in understanding modeling limitations so as to build better semantic
parsers. Moreover, the sequence-to-sequence learning problem is fairly unconstrained,
both in terms of the possible derivations to consider and in terms of the target logical
forms which can be ill-formed or unexecutable. The first contribution of this thesis is
an improved neural semantic parser, which produces syntactically valid logical forms
following a transition system and grammar constraints. The transition system integrates
the generation of domain-general (i.e., valid tree-structures and language-specific predicates)
and domain-specific aspects (i.e., domain-specific predicates and entities) in a unified
way. The model employs various neural attention mechanisms to handle mismatches
between natural language and formal language—a central challenge in semantic parsing.
Training data for semantic parsers typically consists of utterances paired with logical
forms. Another challenge of semantic parsing concerns the annotation of logical forms,
which is labor-intensive. To write down the correct logical form of an utterance, one
not only needs to have expertise in the semantic formalism, but also has to ensure the
logical form matches the utterance semantics. We tackle this challenge in two ways.
On the one hand, we extend the neural semantic parser to a weakly-supervised setting
within a parser-ranker framework. The weakly-supervised setup uses training data
of utterance-denotation (e.g., question-answer) pairs, which are much easier to obtain
and therefore make it possible to scale semantic parsers to complex domains. Our framework
combines the advantages of conventional weakly-supervised semantic parsers and neural
semantic parsing. Candidate logical forms are generated by a neural decoder and
subsequently scored by a ranking component. We present methods to efficiently search
for candidate logical forms which involve spurious ambiguity—some logical forms do
not match utterance semantics but coincidentally execute to the correct denotation.
They should be excluded from training.
On the other hand, we focus on how to quickly engineer a practical neural semantic
parser for closed domains, by directly reducing the annotation difficulty of utterance-logical
form pairs. We develop an interface for efficiently collecting compositional
utterance-logical form pairs and then leverage the data collection method to train neural
semantic parsers. Our method provides an end-to-end solution for closed-domain
semantic parsing given only an ontology. We also extend the end-to-end solution to
handle sequential utterances simulating a non-interactive user session. Specifically,
the data collection interface is modified to collect utterance sequences which exhibit
various co-reference patterns. Then the neural semantic parser is extended to parse
context-dependent utterances.
In summary, this thesis covers the lifecycle of designing a neural semantic parser:
from model design (i.e., how to model a neural semantic parser with an appropriate
inductive bias), training (i.e., how to perform fully supervised and weakly supervised
training for a neural semantic parser) to engineering (i.e., how to build a neural semantic
parser from a domain ontology).
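The transition system's guarantee of syntactically valid logical forms can be sketched as action masking over a stack: at each step the decoder may only choose among actions the grammar allows in the current state. The action inventory and toy scorer below are invented for illustration and are not the thesis's actual system:

```python
def valid_actions(stack, predicates, max_depth=10):
    """Actions the grammar allows in the current state: OPEN starts a
    subtree (domain-general), a predicate fills a slot (domain-specific),
    REDUCE closes the innermost open subtree."""
    actions = []
    if len(stack) < max_depth:
        actions.append("OPEN")
    if stack:                      # predicates and REDUCE need an open subtree
        actions.extend(predicates)
        actions.append("REDUCE")
    return actions

def decode(score, predicates, max_steps=50):
    """Greedy constrained decoding: take the best-scoring *allowed* action,
    so ill-formed outputs (e.g. REDUCE on an empty stack) cannot occur."""
    stack, output = [], []
    for _ in range(max_steps):
        allowed = valid_actions(stack, predicates)
        if not allowed:
            break
        action = max(allowed, key=lambda a: score(a, output))
        output.append(action)
        if action == "OPEN":
            stack.append("(")
        elif action == "REDUCE":
            stack.pop()
            if not stack:          # the whole logical form is closed
                break
    return output

# Toy scorer: base preference minus a repetition penalty.
base = {"OPEN": 3, "capital_of": 2, "REDUCE": 1}
form = decode(lambda a, out: base[a] - 3 * out.count(a), ["capital_of"])
```

However the scorer behaves, the mask guarantees balanced, well-formed action sequences, which is the point of coupling a neural decoder with a transition system.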
A Principled Framework for Constructing Natural Language Interfaces To Temporal Databases
Most existing natural language interfaces to databases (NLIDBs) were designed
to be used with "snapshot" database systems, which provide very limited
facilities for manipulating time-dependent data. Consequently, most NLIDBs also
provide very limited support for the notion of time. The database community is
becoming increasingly interested in _temporal_ database systems. These are
intended to store and manipulate in a principled manner information not only
about the present, but also about the past and future.
This thesis develops a principled framework for constructing English NLIDBs
for _temporal_ databases (NLITDBs), drawing on research in tense and aspect
theories, temporal logics, and temporal databases. I first explore temporal
linguistic phenomena that are likely to appear in English questions to NLITDBs.
Drawing on existing linguistic theories of time, I formulate an account for a
large number of these phenomena that is simple enough to be embodied in
practical NLITDBs. Exploiting ideas from temporal logics, I then define a
temporal meaning representation language, TOP, and I show how the HPSG grammar
theory can be modified to incorporate the tense and aspect account of this
thesis, and to map a wide range of English questions involving time to
appropriate TOP expressions. Finally, I present and prove the correctness of a
method to translate from TOP to TSQL2, TSQL2 being a temporal extension of the
SQL-92 database language. This way, I establish a sound route from English
questions involving time to a general-purpose temporal database language, that
can act as a principled framework for building NLITDBs. To demonstrate that
this framework is workable, I employ it to develop a prototype NLITDB,
implemented using ALE and Prolog.
(PhD thesis; 405 pages; LaTeX2e; uses the packages/macros amstex, xspace,
avm, examples, dvips, varioref, makeidx, epic, eepic, ecltree; postscript
figures included.)
Data Infrastructure for Medical Research
While we are witnessing rapid growth in data across the sciences and in many applications, this growth is particularly remarkable in the medical domain, be it because of higher resolution instruments and diagnostic tools (e.g. MRI), new sources of structured data like activity trackers, the widespread use of electronic health records and many others. The sheer volume of the data is not, however, the only challenge to be faced when using medical data for research. Other crucial challenges include data heterogeneity, data quality, data privacy and so on. In this article, we review solutions addressing these challenges by discussing the current state of the art in the areas of data integration, data cleaning, data privacy, scalable data access and processing in the context of medical data. The techniques and tools we present will give practitioners — computer scientists and medical researchers alike — a starting point to understand the challenges and solutions and ultimately to analyse medical data and gain better and quicker insights.
GPT-4 Technical Report
We report the development of GPT-4, a large-scale, multimodal model which can
accept image and text inputs and produce text outputs. While less capable than
humans in many real-world scenarios, GPT-4 exhibits human-level performance on
various professional and academic benchmarks, including passing a simulated bar
exam with a score around the top 10% of test takers. GPT-4 is a
Transformer-based model pre-trained to predict the next token in a document.
The post-training alignment process results in improved performance on measures
of factuality and adherence to desired behavior. A core component of this
project was developing infrastructure and optimization methods that behave
predictably across a wide range of scales. This allowed us to accurately
predict some aspects of GPT-4's performance based on models trained with no
more than 1/1,000th the compute of GPT-4.
(100 pages.)
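Predicting a metric from runs at a small fraction of the final compute typically means fitting a power law in log-log space and extrapolating. The sketch below uses synthetic data; GPT-4's actual fitting methodology is only described at a high level in the report:

```python
import math

def fit_power_law(compute, loss):
    """Fit loss ~ a * compute^(-b) by ordinary least squares in log-log space."""
    xs = [math.log(c) for c in compute]
    ys = [math.log(v) for v in loss]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return math.exp(my - slope * mx), -slope   # (a, b)

# Synthetic small-compute runs following loss = 10 * C^(-0.1).
small_runs = [1e18, 1e19, 1e20]
losses = [10 * c ** (-0.1) for c in small_runs]
a, b = fit_power_law(small_runs, losses)

# Extrapolate 1000x beyond the largest fitted run.
predicted = a * (1e23) ** (-b)
```

Because the synthetic points follow the power law exactly, the fit recovers the exponent and the extrapolated value; with real training runs the fit would of course carry noise.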
Development and Specification of Virtual Environments
This thesis concerns the issues involved in the development of virtual environments (VEs). VEs are more than virtual reality. We identify four main characteristics of them: graphical interaction, multimodality, interface agents, and multi-user. These characteristics are illustrated with an overview of different classes of VE-like applications, and a number of state-of-the-art VEs. To further define the topic of research, we propose a general framework for VE systems development, in which we identify five major classes of development tools: methodology, guidelines, design specification, analysis, and development environments. Of each, we give an overview of existing best practices.