Labeling Workflow Views with Fine-Grained Dependencies

Authors: Bao Zhuowei; Davidson Susan B.; Milo Tova
Publication date: 01/01/2012

This paper considers the problem of efficiently answering reachability queries over views of provenance graphs, derived from executions of workflows that may include recursion. Such views include composite modules and model fine-grained dependencies between module inputs and outputs. A novel view-adaptive dynamic labeling scheme is developed for efficient query evaluation, in which view specifications are labeled statically (i.e., as they are created) and data items are labeled dynamically as they are produced during a workflow execution. Although the combination of fine-grained dependencies and recursive workflows entails, in general, long (linear-size) data labels, we show that for a large natural class of workflows and views, labels are compact (logarithmic-size) and reachability queries can be evaluated in constant time. Experimental results demonstrate the benefit of this approach over the state-of-the-art technique when applied for labeling multiple views. Comment: VLDB201
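Illustrative sketch (not taken from the paper): the abstract's claim that reachability can be decided in constant time from labels alone can be pictured with the classic interval labeling of a rooted tree. This is a much simpler, static scheme than the view-adaptive dynamic labeling the paper develops, and it does not handle views, fine-grained dependencies, or recursion. The function names `interval_labels` and `reaches` and the toy tree are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Label = Tuple[int, int]


def interval_labels(tree: Dict[str, List[str]], root: str) -> Dict[str, Label]:
    """Assign (enter, exit) DFS timestamps to every node of a rooted tree."""
    labels: Dict[str, Label] = {}
    enter: Dict[str, int] = {}
    clock = 0
    # Iterative DFS; each stack entry is (node, children_already_pushed).
    stack = [(root, False)]
    while stack:
        node, expanded = stack.pop()
        if expanded:
            # All descendants are finished, so close the interval.
            labels[node] = (enter[node], clock)
            clock += 1
        else:
            enter[node] = clock
            clock += 1
            stack.append((node, True))
            for child in reversed(tree.get(node, [])):
                stack.append((child, False))
    return labels


def reaches(labels: Dict[str, Label], u: str, v: str) -> bool:
    """True if v is u or a descendant of u; uses two integer comparisons."""
    (u_in, u_out), (v_in, v_out) = labels[u], labels[v]
    return u_in <= v_in and v_out <= u_out


# Toy tree: a -> b -> d, and a -> c (hypothetical data).
toy_tree = {"a": ["b", "c"], "b": ["d"]}
toy_labels = interval_labels(toy_tree, "a")
```

Once the labels exist, each query compares two pairs of integers and never traverses the graph, which is the property the abstract refers to as constant-time evaluation.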

arXiv.org e-Print Archive

CiteSeerX

ScholarlyCommons@Penn

Answering Regular Path Queries on Workflow Provenance

Authors: Bao Zhuowei; Davidson Susan B.; Huang Xiaocheng; Milo Tova; Yuan Xiaojie
Publication date: 04/08/2014

This paper proposes a novel approach for efficiently evaluating regular path queries over provenance graphs of workflows that may include recursion. The approach assumes that an execution g of a workflow G is labeled with query-agnostic reachability labels using an existing technique. At query time, given g, G and a regular path query R, the approach decomposes R into a set of subqueries R1, ..., Rk that are safe for G. For each safe subquery Ri, G is rewritten so that, using the reachability labels of nodes in g, whether or not there is a path which matches Ri between two nodes can be decided in constant time. The results of each safe subquery are then composed, possibly with some small unsafe remainder, to produce an answer to R. The approach results in an algorithm that significantly reduces the number of subqueries k over existing techniques by increasing their size and complexity, and that evaluates each subquery in time bounded by its input and output size. Experimental results demonstrate the benefit of this approach
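Illustrative sketch (not the paper's algorithm): for readers unfamiliar with regular path queries, the textbook baseline explores the product of the graph with a finite automaton for the query. The paper's contribution, decomposing R into safe subqueries answered from reachability labels, is not shown here. The function name `rpq_reachable`, the edge-labeled toy graph, and the hand-written DFA for the expression `a b*` are all assumptions made for illustration.

```python
from collections import deque
from typing import Dict, Hashable, List, Set, Tuple

Edges = Dict[str, List[Tuple[str, str]]]          # node -> [(edge label, next node)]
Dfa = Dict[Tuple[Hashable, str], Hashable]        # (state, edge label) -> next state


def rpq_reachable(edges: Edges, dfa: Dfa, start_state: Hashable,
                  accepting: Set[Hashable], source: str) -> Set[str]:
    """Nodes v such that some path source -> v spells a word the DFA accepts."""
    first = (source, start_state)
    seen = {first}
    queue = deque([first])
    answers: Set[str] = set()
    while queue:
        node, state = queue.popleft()
        if state in accepting:
            answers.add(node)
        for label, nxt in edges.get(node, []):
            nstate = dfa.get((state, label))
            if nstate is None:
                continue  # the automaton has no transition on this label
            pair = (nxt, nstate)
            if pair not in seen:
                seen.add(pair)
                queue.append(pair)
    return answers


# Hypothetical edge-labeled graph and a DFA for the expression: a b*
toy_edges: Edges = {
    "x": [("a", "y")],
    "y": [("b", "z"), ("a", "w")],
    "z": [("b", "z2")],
}
toy_dfa: Dfa = {(0, "a"): 1, (1, "b"): 1}
toy_answers = rpq_reachable(toy_edges, toy_dfa, 0, {1}, "x")
```

This baseline visits each (node, state) pair at most once, so its cost grows with the size of the graph times the size of the automaton, which is the kind of query-time traversal that label-based evaluation aims to avoid.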

arXiv.org e-Print Archive

Crossref

Towards Exascale Scientific Metadata Management

Authors: Blanas Spyros; Byna Surendra
Publication date: 29/03/2015

Advances in technology and computing hardware are enabling scientists from all areas of science to produce massive amounts of data using large-scale simulations or observational facilities. In this era of data deluge, effective coordination between the data production and the analysis phases hinges on the availability of metadata that describe the scientific datasets. Existing workflow engines have been capturing a limited form of metadata to provide provenance information about the identity and lineage of the data. However, much of the data produced by simulations, experiments, and analyses still needs to be annotated manually in an ad hoc manner by domain scientists. Systematic and transparent acquisition of rich metadata becomes a crucial prerequisite to sustain and accelerate the pace of scientific innovation. Yet, ubiquitous and domain-agnostic metadata management infrastructure that can meet the demands of extreme-scale science is notable by its absence. To address this gap in scientific data management research and practice, we present our vision for an integrated approach that (1) automatically captures and manipulates information-rich metadata while the data is being produced or analyzed and (2) stores metadata within each dataset to permeate metadata-oblivious processes and to query metadata through established and standardized data access interfaces. We motivate the need for the proposed integrated approach using applications from plasma physics, climate modeling and neuroscience, and then discuss research challenges and possible solutions

arXiv.org e-Print Archive

eScholarship - University of California

Re-thinking Workflow Provenance against Data-Oriented Investigation Lifecycle

Author: Alper Pinar
Publication venue: No publisher name
Publication date: 06/05/2014

The University of Manchester - Institutional Repository

Towards Automated Machine Learning on Imperfect Data for Situational Awareness in Power System

Author: Liu Yunchuan
Publication date: 13/09/2022

The increasing penetration of renewable energy sources (such as solar and wind) and the incoming widespread charging of electric vehicles introduce new challenges in the power system. Due to the variability and uncertainty of these sources, reliable and cost-effective operation of the power system relies on a high level of situational awareness. Thanks to the wide deployment of sensors (e.g., phasor measurement units (PMUs) and smart meters) and the emerging smart Internet of Things (IoT) sensing devices in the electric grid, large amounts of data are being collected, which provide golden opportunities to achieve a high level of situational awareness for reliable and cost-effective grid operations. To better utilize the data, this dissertation aims to develop Machine Learning (ML) methods and provide fundamental understanding and systematic exploitation of ML for situational awareness using large amounts of imperfect data collected in power systems, in order to improve the reliability and resilience of power systems. However, building excellent ML models requires clean, accurate and sufficient training data, and the data collected from real-world power systems is of low quality. For example, the data collected from wind farms mixes ramp and non-ramp events as well as heterogeneous dynamics; the data in the transmission grid is noisy, incomplete and insufficient, and carries inaccurate timestamps. Employing ML without considering these distinct features of real-world applications cannot build good ML models. This dissertation aims to address these challenges in two applications, wind generation forecast and power system event classification, by developing ML models in an automated way with less effort from domain experts, as the cost of having experts process such large amounts of imperfect data can be prohibitive in practice. First, we take heterogeneous dynamics into consideration, especially for ramp events.
A Drifting Streaming Peaks-over-Threshold (DSPOT) enhanced, self-evolving neural network-based short-term wind farm generation forecast is proposed. It uses dynamic ramp thresholds to separate the ramp and non-ramp events, based on which different neural networks are trained to learn different dynamics of wind farm generation. As the efficacy of the neural networks relies on the quality of the training datasets (i.e., the classification accuracy of ramp and non-ramp events), a Bayesian optimization based approach is developed to optimize the parameters of DSPOT to enhance the quality of the training datasets and the corresponding performance of the neural networks. Experimental results show that, compared with other forecast approaches, the proposed forecast approach can substantially improve the forecast accuracy, especially for ramp events. Next, we address the challenges of event classification due to the low-quality PMU measurements and event logs. A novel machine learning framework is proposed for robust event classification, which consists of three main steps: data preprocessing, fine-grained event data extraction, and feature engineering. Specifically, the data preprocessing step addresses the data quality issues of PMU measurements (e.g., bad data and missing data); in the fine-grained event data extraction step, a model-free event detection method is developed to accurately localize the events from the inaccurate event timestamps in the event logs; and the feature engineering step constructs the event features based on the patterns of different event types, in order to improve the performance and the interpretability of the event classifiers. Moreover, with a small number of good features, much less training data is needed to train a good event classifier, which addresses the challenge of insufficient and imbalanced training data, and the training time is negligible compared to neural network based approaches.
Based on the proposed framework, we developed a workflow for event classification using real-world PMU data streaming into the system in real time. Using the proposed framework, robust event classifiers can be efficiently trained based on many off-the-shelf lightweight machine learning models. Numerical experiments using a real-world dataset from the Western Interconnection of the U.S. power transmission grid show that the event classifiers trained under the proposed framework can achieve high classification accuracy while being robust against low-quality data. Subsequently, we address the challenge of insufficient training labels. Real-world PMU data is often incomplete and noisy, which can significantly reduce the efficacy of existing machine learning techniques that require high-quality labeled training data. Obtaining high-quality event logs for large amounts of PMU measurements requires significant effort from domain experts to maintain the event logs and even hand-label the events, which can be prohibitively costly or impractical. We therefore develop a weakly supervised machine learning approach that can learn a good event classifier from a small amount of labeled PMU data. The key idea is to learn the labels from unlabeled data using a probabilistic generative model, in order to improve the training of the event classifiers. Experimental results show that even when 95% of the data is unlabeled, the proposed method still achieves an average accuracy of 78.4%. This provides a promising way for domain experts to maintain the event logs in a less expensive and automated manner. Finally, we conclude the dissertation and discuss future directions

University of Nevada, Reno ScholarWorks Repository