Language integrated relational lenses
Relational databases are ubiquitous. Such monolithic databases accumulate large
amounts of data, yet applications typically only work on small portions of the data
at a time. A subset of the database defined as a computation on the underlying
tables is called a view. Querying views is helpful, but it is also desirable to update
them and have these changes be applied to the underlying database. This view
update problem has been the subject of much previous work, but support from database servers is limited and only rarely available.
Lenses are a popular approach to bidirectional transformations, a generalization
of the view update problem in databases to arbitrary data. Yet, perhaps surprisingly, lenses have seldom actually been used to implement updatable views in databases. Bohannon, Pierce and Vaughan propose an approach to updatable views called relational lenses; however, to the best of our knowledge, this proposal had not been implemented or evaluated prior to the work reported in this thesis.
This thesis proposes programming language support for relational lenses. Language integrated relational lenses support expressive and efficient view updates,
without relying on updatable view support from the database server. By integrating relational lenses into the programming language, application development
becomes easier and less error-prone, avoiding the impedance mismatch of having
two programming languages. Integrating relational lenses into the language poses
additional challenges. As defined by Bohannon et al., relational lenses completely recompute the database, making them inefficient as the database grows. The other challenge is that some parts of the well-formedness conditions are too general to implement: Bohannon et al. specify predicates using possibly infinite abstract sets and define the type-checking rules using relational algebra.
Incremental relational lenses equip relational lenses with change-propagating semantics that map small changes to the view into (potentially) small changes
to the source tables. We prove that our incremental semantics are functionally
equivalent to the non-incremental semantics, and our experimental results show
orders of magnitude improvement over the non-incremental approach. This thesis introduces a concrete predicate syntax and shows how the required checks
are performed on these predicates, and shows that they satisfy the abstract predicate specifications. We discuss trade-offs between static predicates, which are fully known at compile time, and dynamic predicates, which are only known during execution, and introduce hybrid predicates that take inspiration from both approaches.
This thesis adapts the typing rules for relational lenses from sequential composition to a functional style of sub-expressions. We prove that any well-typed
functional relational lens expression can be translated into a well-typed sequential lens.
We use these additions to relational lenses as the foundation for two practical implementations: an extension of the Links functional language and a library written
in Haskell. The second implementation demonstrates how type-level computation can be used to implement relational lenses without changes to the compiler.
These two implementations attest to the possibility of turning relational lenses
into a practical language feature.
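To make the lens pattern concrete, here is a minimal, hypothetical Python sketch of a select lens with its get and put directions (for illustration only; the thesis implementations are in Links and Haskell, and the incremental semantics propagate deltas rather than whole views):

    # Minimal select-lens sketch: get filters the source by a predicate;
    # put merges an updated view back, keeping rows the view cannot see.
    def select_lens(pred):
        def get(source):
            return [row for row in source if pred(row)]
        def put(source, view):
            kept = [row for row in source if not pred(row)]
            assert all(pred(row) for row in view), "view rows must satisfy pred"
            return kept + view
        return get, put

    albums = [{"album": "a", "qty": 5}, {"album": "b", "qty": 0}]
    get, put = select_lens(lambda r: r["qty"] > 0)
    view = [dict(r) for r in get(albums)]  # copy so edits stay in the view
    view[0]["qty"] = 7                     # update through the view
    print(put(albums, view))               # qty for 'a' becomes 7; 'b' is kept

Note that get(put(source, view)) returns exactly the view rows, the round-tripping property that the well-formedness conditions discussed above are meant to guarantee.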
LIPIcs, Volume 251, ITCS 2023, Complete Volume
Automated identification and behaviour classification for modelling social dynamics in group-housed mice
Mice are often used in biology as exploratory models of human conditions due to their similar genetics and physiology. Unfortunately, research on behaviour has traditionally been limited to studying individuals in isolated environments and over short periods of time. This can miss critical time effects and, since mice are social creatures, bias results.
This work addresses this gap in research by developing tools to analyse the individual behaviour of group-housed mice in the home-cage over several days and with minimal disruption. Using data provided by the Mary Lyon Centre at MRC Harwell we designed an end-to-end system that (a) tracks and identifies mice in a cage, (b) infers their behaviour, and subsequently (c) models the group dynamics as functions of individual activities. In support of the above, we also curated and made available a large dataset of mouse localisation and behaviour classifications (IMADGE), as well as two smaller annotated datasets for training/evaluating the identification (TIDe) and behaviour inference (ABODe) systems. This research constitutes the first of its kind in terms of the scale and challenges addressed. The data source (side-view single-channel video with clutter and no identification markers for mice) presents challenging conditions for analysis, but has the potential to give richer information while using industry standard housing.
A Tracking and Identification module was developed to automatically detect, track and identify the (visually similar) mice in the cluttered home-cage using only single-channel IR video and coarse position from RFID readings. Existing detectors and trackers were combined with a novel Integer Linear Programming formulation to assign anonymous tracks to mouse identities. This utilised a probabilistic weight model of affinity between detections and RFID pickups.
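As an illustration of the assignment idea, here is a hedged Python sketch: the thesis formulates the problem as an Integer Linear Programming instance over track fragments, while this simplification shows a one-shot linear assignment of whole tracks to identities using affinity weights:

    # Simplified sketch (not the thesis ILP): assign anonymous tracks to
    # mouse identities by maximising total affinity between tracks and
    # RFID-derived identity evidence, via the Hungarian algorithm.
    import numpy as np
    from scipy.optimize import linear_sum_assignment

    # affinity[t, m]: e.g. log-probability that track t belongs to mouse m,
    # derived from agreement between track positions and RFID pickups.
    affinity = np.array([
        [-0.1, -2.3, -4.0],
        [-3.0, -0.2, -2.5],
        [-2.8, -2.6, -0.3],
    ])
    tracks, mice = linear_sum_assignment(-affinity)  # maximise total affinity
    print(list(zip(tracks, mice)))                   # [(0, 0), (1, 1), (2, 2)]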
Next, an Activity Labelling module was implemented to classify the behaviour of each mouse, handling occlusion to avoid giving unreliable classifications when the mice cannot be observed. Two key aspects of this were (a) careful feature selection, and (b) judicious balancing of the errors of the system in line with the repercussions for our setup.
Given these sequences of individual behaviours, we analysed the interaction dynamics between mice in the same cage by collapsing the group behaviour into a sequence of interpretable latent regimes, using both static and temporal (hidden Markov) models. Using a permutation matrix, we were able to automatically assign mice to roles in the HMM, fit a global model to a group of cages, and analyse abnormalities in data from a different demographic.
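The permutation step can be sketched in a few lines of hypothetical Python: score every assignment of a cage's mice against the global model's role statistics and keep the best one (a brute-force stand-in for the permutation-matrix machinery, feasible for the small group sizes involved):

    # Hedged sketch: align per-mouse statistics to global model roles by
    # exhaustively scoring all permutations (fine for 3-4 mice per cage).
    import numpy as np
    from itertools import permutations

    role_means = np.array([[5.0, 1.0], [1.0, 5.0], [3.0, 3.0]])   # global model
    mouse_stats = np.array([[1.1, 4.8], [3.2, 2.9], [4.9, 1.2]])  # one cage

    best = min(permutations(range(3)),
               key=lambda p: np.sum((mouse_stats[list(p)] - role_means) ** 2))
    print({f"role {r}": f"mouse {m}" for r, m in enumerate(best)})
    # {'role 0': 'mouse 2', 'role 1': 'mouse 0', 'role 2': 'mouse 1'}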
Declarative Specification of Intraprocedural Control-flow and Dataflow Analysis
Static program analysis plays a crucial role in ensuring the quality and security of software applications by detecting and fixing bugs and potential security vulnerabilities in the code. The use of declarative paradigms in dataflow analysis as part of static program analysis has become increasingly popular in recent years. This is due to its enhanced expressivity and modularity, allowing for a higher-level programming approach and resulting in easy and efficient development.
The aim of this thesis is to explore the design and implementation of control-flow and dataflow analyses using the declarative Reference Attribute Grammars formalism. Specifically, we focus on the construction of analyses directly on the source code rather than on an intermediate representation.
The main result of this thesis is our language-agnostic framework, called IntraCFG. IntraCFG enables efficient and effective dataflow analysis by allowing the construction of precise, source-level control-flow graphs. The framework superimposes control-flow graphs on top of the abstract syntax tree of the program. The effectiveness of IntraCFG is demonstrated through two case studies, IntraJ and IntraTeal, which showcase the potential and flexibility of IntraCFG in diverse contexts such as bug detection and education. IntraJ supports the Java programming language, while IntraTeal is a tool designed for teaching program analysis using an educational language, Teal.
IntraJ has proven to be faster than, and as precise as, well-known industrial tools. The combination of precision, performance, and on-demand evaluation in IntraJ leads to low latency in querying the analysis results. This makes IntraJ a suitable tool for use in interactive tools. Preliminary experiments have also been conducted to demonstrate how IntraJ can be used to support interactive bug detection and fixing.
Additionally, this thesis presents JFeature, a tool for automatically extracting and summarising the features of a Java corpus, including the use of different Java features (e.g., lambda expressions) across different Java versions. JFeature provides researchers and developers with a deeper understanding of the characteristics of corpora, enabling them to identify suitable benchmarks for the evaluation of their tools and methodologies.
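To illustrate what "superimposing a control-flow graph on the AST and running a dataflow analysis" amounts to, here is a minimal, hypothetical Python sketch (IntraCFG itself is built with Reference Attribute Grammars over Java/Teal ASTs, not Python):

    # Minimal sketch: CFG successor edges attached to AST-like nodes,
    # plus a liveness fixpoint (backward may-analysis).
    from dataclasses import dataclass, field

    @dataclass
    class Node:
        label: str
        defs: set = field(default_factory=set)    # variables written
        uses: set = field(default_factory=set)    # variables read
        succ: list = field(default_factory=list)  # superimposed CFG edges

    def liveness(nodes):
        """Iterate live_in = uses | (live_out - defs) to a fixpoint."""
        live_in = {n.label: set() for n in nodes}
        live_out = {n.label: set() for n in nodes}
        changed = True
        while changed:
            changed = False
            for n in nodes:
                out = set().union(*(live_in[s.label] for s in n.succ)) if n.succ else set()
                inn = n.uses | (out - n.defs)
                if inn != live_in[n.label] or out != live_out[n.label]:
                    live_in[n.label], live_out[n.label] = inn, out
                    changed = True
        return live_in

    n3 = Node("print(y)", uses={"y"})
    n2 = Node("y = x + 1", defs={"y"}, uses={"x"}, succ=[n3])
    n1 = Node("x = 1", defs={"x"}, succ=[n2])
    print(liveness([n1, n2, n3]))  # x live into n2, y live into n3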
A Differential Datalog Interpreter
The core reasoning task for datalog engines is materialization: the evaluation of a datalog program over a database alongside its physical incorporation into the database itself. The de facto method of computing it is the recursive application of inference rules. Because materialization is a costly operation, datalog engines must provide incremental materialization, that is, adjust the computation to new data instead of restarting from scratch. One major caveat is that deleting data is notoriously more involved than adding it, since one has to take into account all data that has been entailed from what is being deleted. Differential Dataflow is a computational model that provides efficient incremental maintenance of iterative dataflows, notably with equal performance for additions and deletions, as well as work distribution. In this paper we investigate the performance of materialization with three reference datalog implementations, of which one is built on top of a lightweight relational engine, and the other two are differential-dataflow and non-differential versions of the same rewrite algorithm, with the same optimizations.
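To make "recursive application of inference rules" concrete, here is a minimal Python sketch of semi-naive materialization for transitive closure (an illustration only, not one of the three reference implementations):

    # Semi-naive evaluation of: path(x,y) :- edge(x,y).
    #                           path(x,z) :- path(x,y), edge(y,z).
    # Only the newly derived facts (the "delta") are joined each round.
    def transitive_closure(edges):
        path = set(edges)       # materialized relation
        delta = set(edges)      # facts derived in the previous round
        while delta:
            new = {(x, z) for (x, y) in delta for (y2, z) in edges if y == y2}
            delta = new - path  # keep only genuinely new facts
            path |= delta
        return path

    edges = {("a", "b"), ("b", "c"), ("c", "d")}
    print(sorted(transitive_closure(edges)))
    # [('a','b'), ('a','c'), ('a','d'), ('b','c'), ('b','d'), ('c','d')]

Deletion is harder than this loop suggests: retracting edge('b','c') requires discovering which derived paths lose all their derivations, which is exactly the asymmetry that Differential Dataflow removes.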
Processing genome-wide association studies within a repository of heterogeneous genomic datasets
Background
Genome-Wide Association Studies (GWAS) are based on the observation of genome-wide sets of genetic variants – typically single-nucleotide polymorphisms (SNPs) – in different individuals that are associated with phenotypic traits. Research efforts have so far been directed at improving GWAS techniques rather than at making the results of GWAS interoperable with other genomic signals; this is currently hindered by the use of heterogeneous formats and uncoordinated experiment descriptions.
Results
To practically facilitate integrative use, we propose to include GWAS datasets within the META-BASE repository, exploiting an integration pipeline previously studied for other genomic datasets, which brings several heterogeneous data types into the same format, queryable from the same systems. We represent GWAS SNPs and metadata by means of the Genomic Data Model and include metadata within a relational representation by extending the Genomic Conceptual Model with a dedicated view. To further reduce the gap with the descriptions of other signals in the repository of genomic datasets, we perform a semantic annotation of phenotypic traits. Our pipeline is demonstrated using two important data sources, initially organized according to different data models: the NHGRI-EBI GWAS Catalog and FinnGen (University of Helsinki). The integration effort finally allows us to use these datasets within multisample processing queries that respond to important biological questions. These are then made usable for multi-omic studies together with, e.g., somatic and reference mutation data, genomic annotations, and epigenetic signals.
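As a rough illustration of the representation, here is a hypothetical Python sketch of a GWAS SNP in a Genomic Data Model-like shape, i.e. region coordinates paired with attribute-value metadata (illustrative values only, not the actual META-BASE schema):

    # Illustrative sketch: region data plus per-sample metadata.
    from dataclasses import dataclass

    @dataclass
    class Region:
        chrom: str
        start: int        # for a SNP, stop = start + 1
        stop: int
        strand: str
        attributes: dict  # e.g. risk allele, p-value

    snp = Region("chr10", 112998589, 112998590, "*",
                 {"risk_allele": "T", "p_value": 1e-12})
    sample_metadata = {                  # free attribute-value pairs
        "source": "NHGRI-EBI GWAS Catalog",
        "trait": "type 2 diabetes",      # hypothetical example trait
        "trait_annotation": "EFO:0001360",  # illustrative ontology term
    }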
Conclusions
As a result of our work on GWAS datasets, we enable 1) their interoperable use with several other homogenized and processed genomic datasets in the context of the META-BASE repository; 2) their big data processing by means of the GenoMetric Query Language and its associated system. Future large-scale tertiary data analysis may extensively benefit from the addition of GWAS results to inform several different downstream analysis workflows.
Generalised latent variable models for location, scale, and shape parameters
Latent Variable Models (LVM) are widely used in social, behavioural, and educational sciences to uncover underlying associations in multivariate data using a smaller number of latent variables. However, the classical LVM framework has certain assumptions that can be restrictive in empirical applications: in particular, that the distribution of the observed variables is from the exponential family, and that the latent variables influence only the conditional mean of the observed variables. This thesis addresses these limitations and contributes to the current literature in two ways. First, we propose a novel class of models called Generalised Latent Variable Models for Location, Scale, and Shape parameters (GLVM-LSS). These models use linear functions of latent factors to model the location, scale, and shape parameters of the items' conditional distributions. By doing so, we model higher-order moments such as variance, skewness, and kurtosis in terms of the latent variables, providing a more flexible framework compared to classical factor models. The model parameters are estimated using maximum likelihood estimation. Second, we address the challenge of interpreting the GLVM-LSS, which can be complex due to its increased number of parameters. We propose a penalised maximum likelihood estimation approach with automatic selection of tuning parameters, extending previous work on penalised estimation in the LVM literature to cases without closed-form solutions. Our findings suggest that modelling the entire distribution of items, not just the conditional mean, leads to improved model fit and deeper insights into how the items reflect the latent constructs they are intended to measure. To assess the performance of the proposed methods, we conduct extensive simulation studies and apply them to real-world data from educational testing and public opinion research. The results highlight the efficacy of the GLVM-LSS framework in capturing complex relationships between observed variables and latent factors, providing valuable insights for researchers in various fields.
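The core idea can be sketched for a single Gaussian item: both the location (mean) and the scale (log of the standard deviation) are linear in the latent factor, rather than only the mean as in a classical factor model. A hedged numerical illustration (hypothetical parameter names, not the thesis code):

    # One Gaussian item under the GLVM-LSS idea:
    #   mu = nu0 + lam0 * z,  log(sigma) = nu1 + lam1 * z.
    import numpy as np

    def item_loglik(y, z, nu0, lam0, nu1, lam1):
        mu = nu0 + lam0 * z
        sigma = np.exp(nu1 + lam1 * z)  # log link keeps sigma positive
        return (-0.5 * np.log(2 * np.pi) - np.log(sigma)
                - 0.5 * ((y - mu) / sigma) ** 2)

    # A latent factor that raises both the expected response and its spread.
    z = np.array([-1.0, 0.0, 1.0])
    print(item_loglik(y=0.5, z=z, nu0=0.0, lam0=1.0, nu1=0.0, lam1=0.3))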
Molecular Signatures of Severe Acute Infections in Hospitalised African Children
Despite the improvement in global health over the last three decades, infectious diseases are still a major cause of morbidity and mortality globally, with the highest burden in developing countries and in children under 5 years. The heterogeneity in the clinical presentation of severely ill patients and the lack of rapid diagnostic and prognostic tools for the aetiological distinction of infecting pathogens complicate care decisions, leading to the indiscriminate use of antibiotics and, consequently, increased antimicrobial resistance and mortality. Reducing childhood mortality and morbidity due to infectious diseases requires better diagnostics and prognostics, particularly in low-resource settings. Understanding the molecular processes that underlie different aetiologies and survival outcomes would enable the initiation of appropriate and timely treatment.
This thesis aimed to characterise protein and transcriptomic signatures associated with (i) different microbial aetiologies of severe acute infection and (ii) an elevated risk of death in African children hospitalised with severe acute infections. An untargeted high-performance liquid chromatography-tandem mass spectrometry (HPLC-MS/MS) approach was used to characterise the plasma proteome of children admitted to hospital with different infectious aetiologies. The resulting data were analysed using a collection of machine learning algorithms and used to identify the protein signature with the highest power to discriminate between children with bacterial or viral infections. A novel protein microarray chip was then developed to validate the discovered signature in an independent cohort of hospitalised children with severe acute infections. Lastly, an RNA-seq-based transcriptomics approach was used to characterise gene expression changes associated with post-admission outcomes. Using gene set enrichment and modular analysis, the correlation between gene expression and impending inpatient mortality was characterised.
Key findings from this work include a validated protein signature that could classify children according to bacterial or viral aetiology, and a description of the immune response to severe disease that could be correlated with an increased likelihood of death in children admitted to hospital with severe acute infection. In the proteomics analysis, a random-forest-derived protein signature made up of CRP, LBP, AGT, SERPINA1, SERPINA3, PON1 and HRG was identified that could correctly distinguish bacterial infection from viral infection in a held-out test set with an AUC of 0.84 (95% CI 0.72–0.95). The performance of individual proteins in the signature was assessed in an independent cohort of children using a novel protein microarray chip; AGT had the best discriminatory power, with an AUC of 0.7 (95% CI 0.64–0.75), relative to a clinically approved CRP test whose AUC was 0.76 (95% CI 0.68–0.84). In addition, there was an association between the expression levels of AGT, SERPINA1, PON1 and HRG and mortality. AGT was also associated with severe acute malnutrition in children. Functional analysis of the proteomics data showed that children with bacterial infections had an enrichment of acute phase response and neutrophil degranulation pathways, while platelet degranulation was negatively associated with bacterial infections.
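The shape of this analysis (random forest on protein abundances, scored by held-out AUC) can be sketched in a few lines of Python with synthetic data (illustrative only, not the thesis pipeline or data):

    # Hedged sketch: train a random forest on protein features and score
    # it with ROC AUC on a held-out test set.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    proteins = ["CRP", "LBP", "AGT", "SERPINA1", "SERPINA3", "PON1", "HRG"]
    n = 200
    y = rng.integers(0, 2, n)                    # 1 = bacterial, 0 = viral
    X = rng.normal(size=(n, len(proteins))) + y[:, None] * 0.8  # shifted means

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
    clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
    print("held-out AUC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))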
Analysis of the transcriptomics data showed that imminent inpatient death was marked by a downregulation of CD8 T cell activation and type I interferon signalling, and an overexpression of the unfolded protein response and heme metabolism pathways.
Resilient and Scalable Forwarding for Software-Defined Networks with P4-Programmable Switches
Traditional networking devices support only fixed features and limited configurability.
Network softwarization leverages programmable software and hardware platforms to remove those limitations.
In this context, the concept of programmable data planes allows the packet processing pipeline of networking devices to be programmed directly and custom control plane algorithms to be created.
This flexibility enables the design of novel networking mechanisms where the status quo struggles to meet the high demands of next-generation networks like 5G, the Internet of Things, cloud computing, and Industry 4.0.
P4 is the most popular technology to implement programmable data planes.
However, programmable data planes, and in particular, the P4 technology, emerged only recently.
Thus, P4 support for some well-established networking concepts is still lacking and several issues remain unsolved due to the different characteristics of programmable data planes in comparison to traditional networking.
The research of this thesis focuses on two open issues of programmable data planes.
First, it develops resilient and efficient forwarding mechanisms for the P4 data plane, as no satisfactory state-of-the-art best practices exist yet.
Second, it enables BIER in high-performance P4 data planes.
BIER (Bit Index Explicit Replication) is a novel, scalable, and efficient transport mechanism for IP multicast traffic that so far has only very limited support on high-performance forwarding platforms.
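For intuition, the BIER forwarding loop (in the style of RFC 8279) can be sketched in Python, here for illustration rather than P4: each egress router owns one bit of the packet's bitstring, and the packet is replicated along next hops using per-neighbor forwarding bit masks (F-BMs), clearing bits as copies are sent.

    # Hedged sketch of BIER replication over a bit-indexed forwarding table.
    def bier_forward(bitstring, bift):
        """bitstring: int bitmap of remaining egress routers.
        bift: list of (fbm, next_hop) entries, one per neighbor."""
        copies = []
        remaining = bitstring
        while remaining:
            bit = remaining & -remaining           # lowest set bit
            for fbm, next_hop in bift:
                if fbm & bit:
                    copies.append((next_hop, remaining & fbm))
                    remaining &= ~fbm              # bits served by this copy
                    break
        return copies

    # Neighbor A reaches egresses with bits 0b011, neighbor B bit 0b100.
    bift = [(0b011, "A"), (0b100, "B")]
    print(bier_forward(0b111, bift))  # [('A', 3), ('B', 4)], i.e. 0b011 and 0b100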
The main results of this thesis are published as eight peer-reviewed publications and one post-publication peer-reviewed publication. The results cover the development of suitable resilience mechanisms for P4 data planes, the development and implementation of resilient BIER forwarding in P4, and the extensive evaluation of all developed and implemented mechanisms. Furthermore, the results contain a comprehensive P4 literature study.
Two more peer-reviewed papers contain additional content that is not directly related to the main results.
They implement congestion avoidance mechanisms in P4 and develop a scheduling concept to find cost-optimized load schedules based on day-ahead forecasts.
Complaint Driven Training Data Debugging for Machine Learning Workflows
As the need for machine learning (ML) increases rapidly across all industry sectors, so has the interest in building ML platforms that manage and automate parts of the ML life-cycle. This has enabled companies to use ML inference as a part of their downstream analytics or their applications. Unfortunately, debugging unexpected outcomes in the results of these ML workflows remains a necessary but difficult task of the ML life-cycle. The challenge of debugging ML workflows is that it requires reasoning about the correctness of the workflow logic, the datasets used for inference and training, the models, and the interactions between them. Even if the workflow logic is correct, errors in the data used across the ML workflow can still lead to wrong outcomes. In short, developers are not just debugging the code, but also the data.
We advocate in favor of a complaint driven approach towards specifying and debugging data errors in ML workflows. The approach takes as input user complaints, specified as constraints over the final or intermediate outputs of workflows that use trained ML models. It outputs explanations in the form of specific operator(s) or data subsets, together with how they may be changed to address the constraint violations.
In this thesis we take the first steps towards our complaint driven approach to data debugging. As a stepping stone, we focus our attention on complaints specified on top of relational workflows that use ML model inference and whose errors are caused by errors in the ML model's training data. To the best of our knowledge, we contribute the first debugging system for this task, which we call Rain. In response to a user complaint, Rain ranks the ML model's training examples based on their ability to address the user's complaint if they were removed. Our experiments show that users can use Rain to debug training data errors by specifying complaints over aggregations of model predictions, without having to specify the correct label for each individual prediction.
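A brute-force Python sketch of this ranking idea follows (hypothetical code: it retrains per candidate deletion, which is exactly the cost that Rain's design must avoid, as discussed next):

    # Hedged sketch: rank training examples by how much deleting each one
    # reduces the violation of a complained-about aggregate.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def rank_by_complaint(X_tr, y_tr, X_q, target_count):
        """Complaint: 'the predicted-positive count over X_q should be
        target_count'. Score each training point by the violation drop
        when it is removed and the model is retrained."""
        def violation(mask):
            clf = LogisticRegression().fit(X_tr[mask], y_tr[mask])
            return abs(clf.predict(X_q).sum() - target_count)
        base = violation(np.ones(len(y_tr), bool))
        scores = []
        for i in range(len(y_tr)):
            mask = np.ones(len(y_tr), bool)
            mask[i] = False
            scores.append(base - violation(mask))  # > 0 helps the complaint
        return np.argsort(scores)[::-1]            # most helpful deletions first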
Unfortunately, Rain’s latency may be prohibitive for use in interactive applications like analytical dashboards or business intelligence tools where users are likely to observe errors and complain. To address Rain’s latency problem when scaling to large ML models and training sets, we propose Rain++. Rain++ pushes the majority of Rain’s computation offline ahead of user interaction, achieving orders of magnitude online latency improvements compared to Rain.
To go beyond Rain’s and Rain++’s approach that evaluates individual training example deletionsindependently we propose MetaRain, a framework for training classifiers that detect training data corruptions in response to user complaints. Thanks to the generality of MetaRain, users can adapt the classifiers chosen to the training corruptions and the complaints they seek to resolve. Our experiments indicate that making use of this ability results in improved debugging outcomes.
Last but not least, we study the problem of updating relational workflow results in response to changes to the inference ML model used. This can be leveraged by current or future complaint driven debugging systems that repeatedly change the model and reevaluate the relational workflow. We propose FaDE, a compiler that generates efficient code for the workflow update problem by casting it as view maintenance under input tuple deletions. Our experiments indicate that the code generated by FaDE has orders of magnitude lower latency than existing view maintenance systems.
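The view-maintenance-under-deletions formulation can be illustrated with a minimal Python sketch (illustrative code, not FaDE's generated output): maintain a per-group count view and apply a batch of deletions as negative deltas instead of recomputing the view from scratch.

    # Hedged sketch: incremental maintenance of a group-count view.
    from collections import Counter

    def build_view(rows, key):
        return Counter(key(r) for r in rows)

    def apply_deletions(view, deleted_rows, key):
        for r in deleted_rows:
            g = key(r)
            view[g] -= 1
            if view[g] == 0:
                del view[g]
        return view

    rows = [("a", 1), ("a", 2), ("b", 3)]
    view = build_view(rows, key=lambda r: r[0])    # Counter({'a': 2, 'b': 1})
    apply_deletions(view, [("a", 1), ("b", 3)], key=lambda r: r[0])
    print(view)                                    # Counter({'a': 1})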