132 research outputs found
Recommended from our members
Complaint Driven Training Data Debugging for Machine Learning Workflows
As the need for machine learning (ML) has increased rapidly across all industry sectors, so has the interest in building ML platforms that manage and automate parts of the ML life-cycle. This has enabled companies to use ML inference as a part of their downstream analytics or their applications. Unfortunately, debugging unexpected outcomes in the result of these ML workflows remains a necessary but difficult task of the ML life-cycle. The challenge of debugging ML workflows is that it requires reasoning about the correctness of the workflow logic, the datasets used for inference and training, the models, and the interactions between them. Even if the workflow logic is correct, errors in the data used across the ML workflow can still lead to wrong outcomes. In short, developers are not just debugging the code, but also the data.
We advocate a complaint driven approach to specifying and debugging data errors in ML workflows. The approach takes as input user complaints, specified as constraints over the final or intermediate outputs of workflows that use trained ML models. It outputs explanations in the form of specific operator(s) or data subsets, together with how they may be changed to address the constraint violations.
In this thesis, we take the first steps towards our complaint driven approach to data debugging. As a stepping stone, we focus our attention on complaints specified on top of relational workflows that use ML model inference and whose errors are caused by errors in the ML model's training data. To the best of our knowledge, we contribute the first debugging system for this task, which we call Rain. In response to a user complaint, Rain ranks the ML model's training examples based on their ability to address the user's complaint if they were removed. Our experiments show that users can use Rain to debug training data errors by specifying complaints over aggregations of model predictions without having to specify the correct label for each individual prediction.
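The ranking idea can be illustrated with a deliberately naive sketch. It is not Rain's algorithm: it simply retrains a toy 1-nearest-neighbour model once per deleted training example (brute-force leave-one-out) and measures how much each deletion reduces the violation of a complaint over an aggregate. All names here (`predict`, `positive_count`, `rank_by_removal`) and the toy data are assumptions made for illustration.

```python
from typing import List, Tuple

Point = Tuple[float, int]  # (single numeric feature, binary label)


def predict(train: List[Point], x: float) -> int:
    """1-nearest-neighbour prediction on a single numeric feature."""
    return min(train, key=lambda p: abs(p[0] - x))[1]


def positive_count(train: List[Point], queries: List[float]) -> int:
    """Aggregate over model predictions: how many queries are predicted as 1."""
    return sum(predict(train, q) for q in queries)


def rank_by_removal(train: List[Point], queries: List[float],
                    expected_count: int) -> List[int]:
    """Rank training indices by how much deleting each one reduces the
    complaint violation |COUNT(prediction = 1) - expected_count|."""
    base = abs(positive_count(train, queries) - expected_count)
    scored = []
    for i in range(len(train)):
        reduced = train[:i] + train[i + 1:]  # "retrain" without example i
        violation = abs(positive_count(reduced, queries) - expected_count)
        scored.append((base - violation, i))
    # Largest improvement first; the stable sort keeps index order on ties.
    scored.sort(key=lambda s: -s[0])
    return [i for _, i in scored]


# Toy data: the example at index 2 is assumed to carry a corrupted label.
train = [(0.0, 0), (1.0, 0), (2.0, 1), (5.0, 1), (6.0, 1)]
queries = [1.8, 2.2, 2.4, 5.6]
# Complaint: the user expects exactly one positive prediction.
ranking = rank_by_removal(train, queries, expected_count=1)
```

Note that the complaint is stated only over the aggregate count; no per-query correct label is supplied. Retraining per deletion is only feasible for toy models, which is why this is a sketch rather than a practical method.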
Unfortunately, Rain's latency may be prohibitive for use in interactive applications like analytical dashboards or business intelligence tools where users are likely to observe errors and complain. To address Rain's latency problem when scaling to large ML models and training sets, we propose Rain++. Rain++ pushes the majority of Rain's computation offline ahead of user interaction, achieving orders of magnitude online latency improvements compared to Rain.
To go beyond Rain's and Rain++'s approach that evaluates individual training example deletions independently, we propose MetaRain, a framework for training classifiers that detect training data corruptions in response to user complaints. Thanks to the generality of MetaRain, users can adapt the classifiers chosen to the training corruptions and the complaints they seek to resolve. Our experiments indicate that making use of this ability results in improved debugging outcomes.
Last but not least, we study the problem of updating relational workflow results in response to changes to the inference ML model used. This can be leveraged by current or future complaint driven debugging systems that repeatedly change the model and reevaluate the relational workflow. We propose FaDE, a compiler that generates efficient code for the workflow update problem by casting it as view maintenance under input tuple deletions. Our experiments indicate that the code generated by FaDE has orders of magnitude lower latency than existing view maintenance systems.
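As a rough illustration of view maintenance under input tuple deletions, the sketch below keeps a grouped SUM view up to date by subtracting a deleted tuple's contribution instead of recomputing the aggregate from scratch. It is a hand-written toy and says nothing about the code FaDE actually generates; the class `GroupSumView` and its methods are hypothetical names.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Row = Tuple[str, float]  # (group key, value)


class GroupSumView:
    """Materialized `SELECT key, SUM(value) ... GROUP BY key`, maintained
    incrementally when input tuples are deleted."""

    def __init__(self, rows: Iterable[Row]) -> None:
        self.sums: Dict[str, float] = defaultdict(float)
        self.counts: Dict[str, int] = defaultdict(int)
        for key, value in rows:
            self.sums[key] += value
            self.counts[key] += 1

    def delete(self, row: Row) -> None:
        """Apply the deletion of one input tuple as a delta to the view."""
        key, value = row
        if self.counts.get(key, 0) == 0:
            raise KeyError(f"no tuple left in group {key!r}")
        self.sums[key] -= value
        self.counts[key] -= 1
        if self.counts[key] == 0:
            # A group disappears once its last tuple has been deleted.
            del self.sums[key]
            del self.counts[key]

    def result(self) -> Dict[str, float]:
        return dict(self.sums)


rows = [("a", 1.0), ("a", 2.0), ("b", 5.0)]
view = GroupSumView(rows)
view.delete(("a", 1.0))      # touches only group "a"
after_first = view.result()
view.delete(("b", 5.0))      # group "b" becomes empty and is dropped
after_second = view.result()
```

Each deletion costs constant time per affected group, whereas recomputation scans the whole input. The per-group count is kept so that empty groups can be removed from the view.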
Analytical Query Execution Optimized for all Layers of Modern Hardware
Analytical database queries are at the core of business intelligence and decision support. To analyze the vast amounts of data available today, query execution needs to be orders of magnitude faster. Hardware advances have made a profound impact on database design and implementation. The large main memory capacity allows queries to execute exclusively in memory and shifts the bottleneck from disk access to memory bandwidth. In the new setting, to optimize query performance, databases must be aware of an unprecedented multitude of complicated hardware features. This thesis focuses on the design and implementation of highly efficient database systems by optimizing analytical query execution for all layers of modern hardware. The hardware layers include the network across multiple machines, main memory and the NUMA interconnection across multiple processors, the multiple levels of caches across multiple processor cores, and the execution pipeline within each core. For the network layer, we introduce a distributed join algorithm that minimizes the network traffic. For the memory hierarchy, we describe partitioning variants aware of the dynamics of the CPU caches and the NUMA interconnection. To improve the memory access rate of linear scans, we optimize lightweight compression variants and evaluate their trade-offs. To accelerate query execution within the core pipeline, we introduce advanced SIMD vectorization techniques generalizable across multiple operators. We evaluate our algorithms and techniques on both mainstream hardware and on many-integrated-core platforms, and combine our techniques in a new query engine design that can better utilize the features of many-core CPUs. In the era of hardware becoming increasingly parallel and datasets consistently growing in size, this thesis can serve as a compass for developing hardware-conscious databases with truly high-performance analytical query execution.
Abstract delta modeling : software product lines and beyond
To prevent a large software system from collapsing under its own complexity, its code needs to be well-structured. Ideally we want all code related to a certain feature to be grouped together (called feature modularization) and code belonging to different features not to mix (called separation of concerns). But many concerns are known as 'cross-cutting concerns'. By their very nature their implementation needs to be spread around the code base. The software engineering discipline that has the most to gain from those properties is Software Product Line Engineering. It is concerned with the development and maintenance of multiple software systems at the same time, each possessing a different (but often overlapping) set of features. This gives rise to an additional need: the code for a given feature must not only be separated and modular; it also needs to be composable and able to deal gracefully with the presence or absence of other features. This thesis presents Abstract Delta Modeling, a formal framework developed to achieve these goals in software. The thesis is a product of the European HATS project. It formalizes the techniques of delta modeling, the main approach to variability used by HATS.
Algorithms and the Foundations of Software technolog
Semantic In-Network Complex Event Processing for an Energy Efficient Wireless Sensor Network
Wireless Sensor Networks (WSNs) consist of spatially distributed sensor nodes that perform monitoring tasks in a region and the gateway nodes that provide the acquired sensor data to the end user. With advances in the WSN technology, it has now become possible to have different types of sensor nodes within a region to monitor the environment. This provides the flexibility to monitor the environment in a more extensive manner than before.
Sensor nodes are severely constrained devices with very limited battery sources and their resource scarcity remains a challenge. In traditional WSNs, the sensor nodes are used only for capturing data that is analysed later in more powerful gateway nodes. This continuous communication of data between sensor nodes and gateway nodes wastes energy at the sensor nodes, and consequently, the overall network lifetime is greatly reduced. Existing approaches to reduce energy consumption by processing at the sensor node level only work for homogeneous networks.
This thesis presents a sensor node architecture for heterogeneous WSNs, called SEPSen, where data is processed locally at the sensor node level to reduce energy consumption. We use ontology fragments at the sensor nodes to enable data exchange between heterogeneous sensor nodes within the WSN. We employ a rule engine based on a pattern matching algorithm for filtering events at the sensor node level. The event routing towards the gateway nodes is performed using a context-aware routing scheme that takes both the energy consumption and the heterogeneity of the sensor nodes into account.
As a proof of concept, we present a prototypical implementation of the SEPSen design in a simulation environment. By providing semantic support, in-network data processing capabilities and context-aware routing in SEPSen, the sensor nodes (1) communicate with each other despite their different sensor types, (2) filter events at their own level to conserve the limited sensor node energy resources and (3) share the nodes' knowledge bases for collaboration between the sensor nodes using node-centric context-awareness in changing conditions. The SEPSen prototype has been evaluated based on a test case for water quality management. The results from the experiments show that the energy saved in SEPSen reaches almost 50% by processing events at the sensor node level and the overall network lifetime is increased by at least a factor of two compared with the shortest-path-first (Min-Hop) routing approach.
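Filtering events at the sensor node level can be pictured with a small sketch. This is not SEPSen's rule engine, and it omits the ontology fragments and the routing scheme entirely: it only shows simple range rules deciding locally which readings are worth forwarding to the gateway. The sensor types, thresholds, and names (`Reading`, `make_range_rule`, `filter_events`) are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Reading:
    sensor_type: str
    value: float


Rule = Callable[[Reading], bool]


def make_range_rule(sensor_type: str, low: float, high: float) -> Rule:
    """Build a rule that fires when a reading of `sensor_type` leaves [low, high]."""
    def rule(reading: Reading) -> bool:
        in_range = low <= reading.value <= high
        return reading.sensor_type == sensor_type and not in_range
    return rule


def filter_events(readings: List[Reading], rules: List[Rule]) -> List[Reading]:
    """Keep only readings that match at least one rule; the rest stay on the node."""
    return [r for r in readings if any(rule(r) for rule in rules)]


# Hypothetical water-quality thresholds, chosen only for the example.
rules = [
    make_range_rule("ph", 6.5, 8.5),
    make_range_rule("turbidity", 0.0, 5.0),
]
readings = [
    Reading("ph", 7.0),
    Reading("ph", 9.5),
    Reading("turbidity", 2.0),
    Reading("turbidity", 12.0),
]
to_forward = filter_events(readings, rules)
```

In this toy run only two of the four readings would be transmitted, which is the mechanism by which local filtering reduces radio use.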
Mobility-awareness in complex event processing systems
The proliferation and vast deployment of mobile devices and sensors over the last couple of years enables a huge number of Mobile Situation Awareness (MSA) applications. These applications need to react in near real-time to situations in the environment of mobile objects like vehicles, pedestrians, or cargo. To this end, Complex Event Processing (CEP) is becoming increasingly important, as it allows situations to be detected scalably "on-the-fly" by continuously processing distributed sensor data streams. Furthermore, recent trends in communication networks promise high real-time conformance to CEP systems by processing sensor data streams on distributed computing resources at the edge of the network, where low network latencies can be achieved. Yet, supporting MSA applications with a CEP middleware that utilizes distributed computing resources proves to be challenging due to the dynamics of mobile devices and sensors. In particular, situations need to be efficiently, scalably, and consistently detected with respect to ever-changing sensors in the environment of a mobile object. Moreover, the computing resources that provide low latencies change with the access points of mobile devices and sensors.
The goal of this thesis is to provide concepts and algorithms to i) continuously detect situations that recently occurred close to a mobile object, ii) support bandwidth- and computation-efficient detection of such situations on distributed computing resources, and iii) support consistent, low-latency, and high-quality detection of such situations. To this end, we introduce the distributed Mobile CEP (MCEP) system, which automatically adapts the processing of sensor data streams according to a mobile object's location. MCEP provides an expressive, location-aware query model for situations that recently occurred at a location close to a mobile object. MCEP significantly reduces latency, bandwidth, and processing overhead by providing on-demand and opportunistic adaptation algorithms to dynamically assign event streams to queries of the MCEP system. Moreover, MCEP incorporates algorithms to adapt the deployment of MCEP queries in a network of computing resources. This way, MCEP supports latency-sensitive, large-scale deployments of MSA applications and ensures a low network utilization while mobile objects change their access points to the system. MCEP also provides methods to increase the scalability in terms of deployed MCEP queries by reusing event streams and computations for detecting common situations for several mobile objects.
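The notion of situations that "recently occurred close to a mobile object" can be made concrete with a minimal sketch. It is not MCEP's query model: it just selects events that fall inside both a time window and a circular range around the object's current position, using planar coordinates. The `Event` fields and the function `nearby_recent` are assumptions for illustration.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Event:
    x: float
    y: float
    timestamp: float
    kind: str


def nearby_recent(events: List[Event], position: Tuple[float, float],
                  now: float, radius: float, window: float) -> List[Event]:
    """Return events within `radius` of `position` that occurred during the
    last `window` time units before `now`."""
    px, py = position
    return [
        e for e in events
        if now - window <= e.timestamp <= now
        and math.hypot(e.x - px, e.y - py) <= radius
    ]


events = [
    Event(0, 0, 95, "jam"),
    Event(10, 0, 99, "jam"),
    Event(1, 1, 50, "accident"),  # close to the first position, but too old
]
# The same query is evaluated again after the mobile object has moved.
at_start = nearby_recent(events, (0.5, 0.5), now=100, radius=2, window=10)
after_move = nearby_recent(events, (9.0, 0.0), now=100, radius=2, window=10)
```

The relevant set of events changes as the object moves, which is why the assignment of event streams to queries has to be adapted over time.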