18 research outputs found
Language-integrated provenance in Haskell
Scientific progress increasingly depends on data management, particularly to
clean and curate data so that it can be systematically analyzed and reused. A
wealth of techniques for managing and curating data (and its provenance) have
been proposed, largely in the database community. In particular, a number of
influential papers have proposed collecting provenance information explaining
where a piece of data was copied from, or what other records were used to
derive it. Most of these techniques, however, exist only as research prototypes
and are not available in mainstream database systems. This means scientists
must either implement such techniques themselves or (all too often) go without.
This is essentially a code reuse problem: provenance techniques currently
cannot be implemented reusably, only as ad hoc, usually unmaintained extensions
to standard databases. An alternative, relatively unexplored approach is to
support such techniques at a higher abstraction level, using metaprogramming or
reflection techniques. Can advanced programming techniques make it easier to
transfer provenance research results into practice?
We build on a recent approach called language-integrated provenance, which
extends language-integrated query techniques with source-to-source query
translations that record provenance. In previous work, a proof of concept was
developed in a research programming language called Links, which supports
sophisticated Web and database programming. In this paper, we show how to adapt
this approach to work in Haskell, building on top of the Database-Supported
Haskell (DSH) library.
Even though it seemed clear in principle that Haskell's rich programming
features ought to be sufficient, implementing language-integrated provenance in
Haskell required overcoming a number of technical challenges due to
interactions between these capabilities. Our implementation serves as a proof
of concept showing how this combination of metaprogramming features can, for
the first time, make data provenance facilities available to programmers as a
library in a widely-used, general-purpose language.
We successfully implemented two forms of provenance: where-provenance and
lineage. We tested our implementation using a simple database and query set,
and established that the resulting queries execute correctly on the database.
Our implementation is publicly available on GitHub.
Our work makes provenance tracking available to users of DSH at little cost.
Although Haskell is not widely used for scientific database development, our
work suggests which language features are necessary to support provenance as a
library. We also highlight how combining Haskell's advanced type-level
programming features can lead to unexpected complications, which may motivate
further research into type system expressiveness.
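The central typing idea above, giving provenance metadata a distinct type so ordinary code cannot forge or discard it, can be sketched as follows. This is a minimal illustration only, not the actual DSH API; the names (Prov, WhereProv, selectName) and the mock "agencies" table are invented for the example:

```haskell
-- Minimal sketch of where-provenance as a distinct type (NOT the DSH API).
-- All names here (Prov, WhereProv, selectName) are hypothetical.
module Main where

-- Where-provenance: the table, column, and row a value was copied from.
data Prov = Prov { pTable :: String, pColumn :: String, pRow :: Int }
  deriving (Eq, Show)

-- A value paired with its provenance. Keeping this type abstract in a
-- real library would prevent user code from altering or misattributing
-- provenance, which is the safety property the text describes.
data WhereProv a = WhereProv { wpValue :: a, wpProv :: Prov }
  deriving (Eq, Show)

-- Reading a field from a (mock) row attaches its origin automatically.
selectName :: Int -> String -> WhereProv String
selectName key name = WhereProv name (Prov "agencies" "name" key)

main :: IO ()
main = print (selectName 1 "EdinTours")
```

In a real implementation the query rewriter, not the programmer, would be responsible for constructing the annotation.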
Language-integrated provenance by trace analysis
Language-integrated provenance builds on language-integrated query techniques
to make provenance information explaining query results readily available to
programmers. In previous work we have explored language-integrated approaches
to provenance in Links and Haskell. However, implementing a new form of
provenance in a language-integrated way is still a major challenge. We propose
a self-tracing transformation and trace analysis features that, together with
existing techniques for type-directed generic programming, make it possible to
define different forms of provenance as user code. We present our design as an
extension to a core language for Links called LinksT, give examples showing its
capabilities, and outline its metatheory and key correctness properties.
Comment: DBPL 201
Language-integrated provenance
Provenance, or information about the origin or derivation of data, is
important for assessing the trustworthiness of data and identifying and
correcting mistakes. Most prior implementations of data provenance have
involved heavyweight modifications to database systems and little attention has
been paid to how the provenance data can be used outside such a system. We
present extensions to the Links programming language that build on its support
for language-integrated query to support provenance queries by rewriting and
normalizing monadic comprehensions and extending the type system to distinguish
provenance metadata from normal data. The main contribution of this article is
to show that the two most common forms of provenance can be implemented
efficiently and used safely as a programming language feature with no changes
to the database system.
Comment: Accepted to Science of Computer Programming special issue on PPDP 201
Language-integrated provenance
Provenance is metadata about the where, the why, and the how of data. It is
evidence which can answer questions such as: Where exactly did this piece of
data come from? Why is this row in my result? How was it produced? Answers
to these questions are useful for judging the trustworthiness of data, and for
finding and correcting mistakes.
Most programs that use a database at all already use one crude form of
provenance: they manually propagate row identifiers together with database
values, just in case they need to be updated later. More sophisticated forms
of provenance are exceedingly rare, because they are more difficult to implement
manually. Tools that calculate data provenance systematically exist only
as research prototypes. Even standard database systems are hard to set up, as
evidenced by the rise of hosted database services, so it is little surprise that
prototypes of provenance systems are not used much.
This dissertation shows how a programming language can provide support
for provenance. Based on language-integrated query technology, it can systematically
rewrite queries to produce various forms of provenance. We describe
such query transformations for where-provenance and lineage, and discuss
how to enable programmers to define their own forms of provenance. Thanks
to query normalization the resulting queries still execute efficiently on mainstream
database systems. A programming language can help further by giving
provenance metadata precise types to ensure that it is handled appropriately.
Language-integrated queries make it easy to write programs that deal with
data, no special query language needed. Language-integrated provenance
makes it as easy to deal with data provenance, no special database needed.
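The lineage rewriting described above can be illustrated on a toy in-memory join. This is a simplified sketch, not the dissertation's actual transformation: the tables, keys, and data are invented, and real rewritten queries execute on a SQL database rather than over Haskell lists:

```haskell
-- Sketch of lineage tracking on a toy in-memory join. Tables and data
-- are hypothetical; a real implementation rewrites queries to run on SQL.
module Main where

type RowId = (String, Int)                -- (table name, row number)

-- A query result annotated with the input rows it was derived from.
data Lin a = Lin { linOut :: a, linRows :: [RowId] } deriving (Eq, Show)

agencies :: [(Int, String)]
agencies = [(1, "EdinTours"), (2, "Burns")]

tours :: [(Int, (String, String))]        -- row -> (agency, destination)
tours = [(1, ("EdinTours", "Loch Ness")), (2, ("Burns", "Glasgow"))]

-- The rewritten join: each output tuple records which agency row and
-- which tour row contributed to it.
q :: [Lin (String, String)]
q = [ Lin (a, dest) [("agencies", ka), ("tours", kt)]
    | (ka, a)          <- agencies
    , (kt, (a', dest)) <- tours
    , a == a' ]

main :: IO ()
main = mapM_ print q
```

Each element of `q` answers the "why is this row in my result?" question by naming the exact input rows that produced it.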
Language-integrated provenance in Links
Today's programming languages provide no support for data provenance. In a world that increasingly relies on data, we need provenance to judge the reliability of data and therefore should aim to make it easily accessible to programmers. We report our work in progress on an extension to the Links programming language that builds on its support for language-integrated query to support where-provenance queries through query rewriting and a type system extension that distinguishes provenance metadata from other data. Our approach aims to work solely within the language implementation and thus require no changes to the database system. The type system, together with automatic propagation of provenance metadata, will prevent programmers from accidentally changing provenance, losing it, or misattributing it to other data.
Query Lifting: Language-integrated query for heterogeneous nested collections
Language-integrated query based on comprehension syntax is a powerful
technique for safe database programming, and provides a basis for advanced
techniques such as query shredding or query flattening that allow efficient
programming with complex nested collections. However, the foundations of these
techniques are lacking: although SQL, the most widely-used database query
language, supports heterogeneous queries that mix set and multiset semantics,
these important capabilities are not supported by known correctness results or
implementations that assume homogeneous collections. In this paper we study
language-integrated query for a heterogeneous query language
that combines set and multiset constructs. We show how
to normalize and translate queries to SQL, and develop a novel approach to
querying heterogeneous nested collections, based on the insight that "local"
query subexpressions that calculate nested subcollections can be "lifted" to
the top level, analogously to lambda-lifting for local function definitions.
Comment: Full version of ESOP 2021 conference paper
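The lifting idea can be sketched with a toy nested query over in-memory lists. This is only an illustration of the shape of the technique under invented tables and keys; the actual approach normalizes queries and emits flat SQL:

```haskell
-- Sketch of query lifting: a query returning nested collections is
-- evaluated as two FLAT queries, with the inner subquery "lifted" to the
-- top level and keyed by the outer row, then stitched back together.
-- (Tables, keys, and data are hypothetical; real implementations emit SQL.)
module Main where

depts :: [(Int, String)]
depts = [(1, "Sales"), (2, "R&D")]

emps :: [(Int, String)]                   -- (department key, employee)
emps = [(1, "ann"), (1, "bob"), (2, "cyn")]

-- Flat outer query: one row per department.
outerQ :: [(Int, String)]
outerQ = depts

-- Flat inner query, lifted to the top level: keyed by the outer row so
-- it can run independently as a single flat query.
innerQ :: [(Int, String)]
innerQ = [ (k, e) | (k, e) <- emps ]

-- Stitching: group the lifted inner results under their outer rows to
-- reconstruct the nested result the programmer originally wrote.
nested :: [(String, [String])]
nested = [ (d, [ e | (k', e) <- innerQ, k' == k ]) | (k, d) <- outerQ ]

main :: IO ()
main = print nested   -- [("Sales",["ann","bob"]),("R&D",["cyn"])]
```

The analogy to lambda-lifting is that the inner query, like a local function, gains an extra parameter (the outer key) in exchange for being moved to the top level.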
Cross-tier web programming for curated databases: a case study
Curated databases have become important sources of information across several scientific disciplines and, as the product of experts' manual work, often become important reference works. Features such as provenance tracking, archiving, and data citation are widely regarded as important for curated databases, but implementing such features is challenging, and small database projects often lack the resources to do so.
A scientific database application is not just the relational database itself, but also an ecosystem of web applications to display the data, and applications which allow data curation. Supporting advanced curation features requires changing all of these components, and there is currently no way to provide such capabilities in a reusable way.
Cross-tier programming languages have been proposed to simplify the creation of web applications, where developers can write an application in a single, uniform language. Consequently, database queries and updates can be written in the same language as the rest of the program, and at least in principle, it should be possible to provide curation features reusably via program transformations. As a first step towards this goal, it is important to establish that realistic curated databases can be implemented in a cross-tier programming language.
In this paper, we describe such a case study: reimplementing the web front end of a real world scientific database, the IUPHAR/BPS Guide to Pharmacology (GtoPdb), in the Links cross-tier programming language. We show how programming language features such as language-integrated query simplify the development process, and rule out common errors. Through a comparative performance evaluation, we show that the Links implementation performs fewer database queries, while the time needed to handle the queries is comparable to the Java version. Furthermore, while there is some overhead to using Links because of its comparative immaturity compared to Java, the Links version is usable as a proof-of-concept case study of cross-tier programming for curated databases.
[This paper is a conference pre-print presented at IDCC 2020 after lightweight peer review. The most up-to-date version of the paper can be found on arXiv: https://arxiv.org/abs/2003.03845]
The Structure of the Literary Problem in the Formation of the Local Text Substrate
The article aims to study the structure of the literary problem in the formation of the local text substrate. The study uses a methodology for studying language as it changes over time and space. The article explains the basics of the methodological support of the translation complex and the structure of its application in private studies of foreign cultures and communicants. The results of the study showed the possibility of interaction between the subjects of linguistic exchange and the dynamics of the translation and literary component. The novelty of the study lies in the fact that the work defines methods that can be used not only by local researchers but also by foreign-speaking communicants. The research results can be used in practical work to bridge the gap between understanding the local text in translation studies and its structuring in the local versions of individual authors.
Mixing set and bag semantics
The conservativity theorem for nested relational calculus implies that query
expressions can freely use nesting and unnesting, yet as long as the query
result type is a flat relation, these capabilities do not lead to an increase
in expressiveness over flat relational queries. Moreover, Wong showed how such
queries can be translated to SQL via a constructive rewriting algorithm. While
this result holds for queries over either set or multiset semantics, to the
best of our knowledge, the questions of conservativity and normalization have
not been studied for queries that mix set and bag collections, or provide
duplicate-elimination operations such as SQL's DISTINCT. In this paper we formalize the problem,
and present partial progress: specifically, we introduce a calculus with both
set and multiset collection types, along with natural mappings from sets to
bags and vice versa, present a set of valid rewrite rules for normalizing such
queries, and give an inductive characterization of a set of queries whose
normal forms can be translated to SQL. We also consider examples that do not
appear straightforward to translate to SQL, illustrating that the relative
expressiveness of flat and nested queries with mixed set and multiset semantics
remains an open question.
Comment: DBPL 2019 -- short paper
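The two natural mappings between the collection types can be sketched concretely. In this minimal illustration (not the paper's calculus), bags and sets are both modelled as Haskell lists, with duplicate-freedom for sets an invariant rather than an enforced type; the names `delta` and `iota` follow the common convention for duplicate elimination and set inclusion:

```haskell
-- Sketch of the two natural mappings between bag and set collections:
-- duplicate elimination (cf. SQL's DISTINCT) and the inclusion of sets
-- into bags. Both collection kinds are modelled as lists here; for sets,
-- duplicate-freedom is an invariant, not an enforced type.
module Main where

import Data.List (nub)

delta :: Eq a => [a] -> [a]   -- bag -> set: eliminate duplicates
delta = nub

iota :: [a] -> [a]            -- set -> bag: every set is already a bag
iota = id

-- A mixed query: a bag of purchases (duplicates meaningful), from which
-- we compute the set of distinct customers.
purchases :: [(String, Int)]
purchases = [("ann", 3), ("bob", 1), ("ann", 2)]

distinctCustomers :: [String]
distinctCustomers = delta (map fst purchases)

main :: IO ()
main = print distinctCustomers   -- ["ann","bob"]
```

The subtlety the paper addresses is what happens when such conversions appear *inside* larger queries: normalizing and translating those mixed queries to SQL is exactly where the open questions lie.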