Search CORE

205 research outputs found

Supporting Better Insights of Data Science Pipelines with Fine-grained Provenance

Author: Chapman Adriane
Lauro Luca
Missier Paolo
Torlone Riccardo
Publication venue
Publication date: 27/10/2023
Field of study

Successful data-driven science requires complex data engineering pipelines to clean, transform, and alter data in preparation for machine learning, and robust results can only be achieved when each step in the pipeline can be justified, and its effect on the data explained. In this framework, our aim is to provide data scientists with facilities to gain an in-depth understanding of how each step in the pipeline affects the data, from the raw input to training sets ready to be used for learning. Starting from an extensible set of data preparation operators commonly used within a data science setting, in this work we present a provenance management infrastructure for generating, storing, and querying very granular accounts of data transformations, at the level of individual elements within datasets whenever possible. Then, from the formal definition of a core set of data science preprocessing operators, we derive a provenance semantics embodied by a collection of templates expressed in PROV, a standard model for data provenance. Using those templates as a reference, our provenance generation algorithm generalises to any operator with observable input/output pairs. We provide a prototype implementation of an application-level provenance capture library to produce, in a semi-automatic way, complete provenance documents that account for the entire pipeline. We report on the ability of our implementations to capture provenance in real ML benchmark pipelines and over TCP-DI synthetic data. We finally show how the collected provenance can be used to answer a suite of provenance benchmark queries that underpin some common pipeline inspection questions, as expressed on the Data Science Stack Exchange.Comment: 37 pages, 27 figures, submitted to a journa

arXiv.org e-Print Archive

Data distribution debugging in machine learning pipelines

Author: Grafberger S.
Groth P.
Schelter S.
Stoyanovich J.
Publication venue
Publication date: 01/09/2022
Field of study

International Migration, Integration and Social Cohesion online publications

UvA-DARE

High-Fidelity Provenance:Exploring the Intersection of Provenance and Security

Author: Stamatogiannakis Emmanouil
Publication venue
Publication date: 18/10/2021
Field of study

In the past 25 years, the World Wide Web has disrupted the way news are disseminated and consumed. However, the euphoria for the democratization of news publishing was soon followed by scepticism, as a new phenomenon emerged: fake news. With no gatekeepers to vouch for it, the veracity of the information served over the World Wide Web became a major public concern. The Reuters Digital News Report 2020 cites that in at least half of the EU member countries, 50% or more of the population is concerned about online fake news. To help address the problem of trust on information communi- cated over the World Wide Web, it has been proposed to also make available the provenance metadata of the information. Similar to artwork provenance, this would include a detailed track of how the information was created, updated and propagated to produce the result we read, as well as what agents—human or software—were involved in the process. However, keeping track of provenance information is a non-trivial task. Current approaches, are often of limited scope and may require modifying existing applications to also generate provenance information along with thei regular output. This thesis explores how provenance can be automatically tracked in an application-agnostic manner, without having to modify the individual applications. We frame provenance capture as a data flow analysis problem and explore the use of dynamic taint analysis in this context. Our work shows that this appoach improves on the quality of provenance captured compared to traditonal approaches, yielding what we term as high-fidelity provenance. We explore the performance cost of this approach and use deterministic record and replay to bring it down to a more practical level. Furthermore, we create and present the tooling necessary for the expanding the use of using deterministic record and replay for provenance analysis. The thesis concludes with an application of high-fidelity provenance as a tool for state-of-the art offensive security analysis, based on the intuition that software too can be misguided by "fake news". This demonstrates that the potential uses of high-fidelity provenance for security extend beyond traditional forensics analysis

VU Research Portal

Fatias de rede fim-a-fim : da extração de perfis de funções de rede a SLAs granulares

Author: Rosa Raphael Vicente, 1988-
Publication venue: [s.n.]
Publication date: 08/07/2020
Field of study

Orientador: Christian Rodolfo Esteve RothenbergTese (doutorado) - Universidade Estadual de Campinas, Faculdade de Engenharia Elétrica e de ComputaçãoResumo: Nos últimos dez anos, processos de softwarização de redes vêm sendo continuamente diversi- ficados e gradativamente incorporados em produção, principalmente através dos paradigmas de Redes Definidas por Software (ex.: regras de fluxos de rede programáveis) e Virtualização de Funções de Rede (ex.: orquestração de funções virtualizadas de rede). Embasado neste processo o conceito de network slice surge como forma de definição de caminhos de rede fim- a-fim programáveis, possivelmente sobre infrastruturas compartilhadas, contendo requisitos estritos de desempenho e dedicado a um modelo particular de negócios. Esta tese investiga a hipótese de que a desagregação de métricas de desempenho de funções virtualizadas de rede impactam e compõe critérios de alocação de network slices (i.e., diversas opções de utiliza- ção de recursos), os quais quando realizados devem ter seu gerenciamento de ciclo de vida implementado de forma transparente em correspondência ao seu caso de negócios de comu- nicação fim-a-fim. A verificação de tal assertiva se dá em três aspectos: entender os graus de liberdade nos quais métricas de desempenho de funções virtualizadas de rede podem ser expressas; métodos de racionalização da alocação de recursos por network slices e seus re- spectivos critérios; e formas transparentes de rastrear e gerenciar recursos de rede fim-a-fim entre múltiplos domínios administrativos. Para atingir estes objetivos, diversas contribuições são realizadas por esta tese, dentre elas: a construção de uma plataforma para automatização de metodologias de testes de desempenho de funções virtualizadas de redes; a elaboração de uma metodologia para análises de alocações de recursos de network slices baseada em um algoritmo classificador de aprendizado de máquinas e outro algoritmo de análise multi- critério; e a construção de um protótipo utilizando blockchain para a realização de contratos inteligentes envolvendo acordos de serviços entre domínios administrativos de rede. Por meio de experimentos e análises sugerimos que: métricas de desempenho de funções virtualizadas de rede dependem da alocação de recursos, configurações internas e estímulo de tráfego de testes; network slices podem ter suas alocações de recursos coerentemente classificadas por diferentes critérios; e acordos entre domínios administrativos podem ser realizados de forma transparente e em variadas formas de granularidade por meio de contratos inteligentes uti- lizando blockchain. Ao final deste trabalho, com base em uma ampla discussão as perguntas de pesquisa associadas à hipótese são respondidas, de forma que a avaliação da hipótese proposta seja realizada perante uma ampla visão das contribuições e trabalhos futuros desta teseAbstract: In the last ten years, network softwarisation processes have been continuously diversified and gradually incorporated into production, mainly through the paradigms of Software Defined Networks (e.g., programmable network flow rules) and Network Functions Virtualization (e.g., orchestration of virtualized network functions). Based on this process, the concept of network slice emerges as a way of defining end-to-end network programmable paths, possibly over shared network infrastructures, requiring strict performance metrics associated to a par- ticular business case. This thesis investigate the hypothesis that the disaggregation of network function performance metrics impacts and composes a network slice footprint incurring in di- verse slicing feature options, which when realized should have their Service Level Agreement (SLA) life cycle management transparently implemented in correspondence to their fulfilling end-to-end communication business case. The validation of such assertive takes place in three aspects: the degrees of freedom by which performance of virtualized network functions can be expressed; the methods of rationalizing the footprint of network slices; and transparent ways to track and manage network assets among multiple administrative domains. In order to achieve such goals, a series of contributions were achieved by this thesis, among them: the construction of a platform for automating methodologies for performance testing of virtual- ized network functions; an elaboration of a methodology for the analysis of footprint features of network slices based on a machine learning classifier algorithm and a multi-criteria analysis algorithm; and the construction of a prototype using blockchain to carry out smart contracts involving service level agreements between administrative systems. Through experiments and analysis we suggest that: performance metrics of virtualized network functions depend on the allocation of resources, internal configurations and test traffic stimulus; network slices can have their resource allocations consistently analyzed/classified by different criteria; and agree- ments between administrative domains can be performed transparently and in various forms of granularity through blockchain smart contracts. At the end of his thesis, through a wide discussion we answer all the research questions associated to the investigated hypothesis in such way its evaluation is performed in face of wide view of the contributions and future work of this thesisDoutoradoEngenharia de ComputaçãoDoutor em Engenharia ElétricaFUNCAM

Repositorio da Producao Cientifica e Intelectual da Unicamp

Understanding Legacy Workflows through Runtime Trace Analysis

Author
Publication venue
Publication date: 01/01/2015
Field of study

abstract: When scientific software is written to specify processes, it takes the form of a workflow, and is often written in an ad-hoc manner in a dynamic programming language. There is a proliferation of legacy workflows implemented by non-expert programmers due to the accessibility of dynamic languages. Unfortunately, ad-hoc workflows lack a structured description as provided by specialized management systems, making ad-hoc workflow maintenance and reuse difficult, and motivating the need for analysis methods. The analysis of ad-hoc workflows using compiler techniques does not address dynamic languages - a program has so few constrains that its behavior cannot be predicted. In contrast, workflow provenance tracking has had success using run-time techniques to record data. The aim of this work is to develop a new analysis method for extracting workflow structure at run-time, thus avoiding issues with dynamics. The method captures the dataflow of an ad-hoc workflow through its execution and abstracts it with a process for simplifying repetition. An instrumentation system first processes the workflow to produce an instrumented version, capable of logging events, which is then executed on an input to produce a trace. The trace undergoes dataflow construction to produce a provenance graph. The dataflow is examined for equivalent regions, which are collected into a single unit. The workflow is thus characterized in terms of its treatment of an input. Unlike other methods, a run-time approach characterizes the workflow's actual behavior; including elements which static analysis cannot predict (for example, code dynamically evaluated based on input parameters). This also enables the characterization of dataflow through external tools. The contributions of this work are: a run-time method for recording a provenance graph from an ad-hoc Python workflow, and a method to analyze the structure of a workflow from provenance. Methods are implemented in Python and are demonstrated on real world Python workflows. These contributions enable users to derive graph structure from workflows. Empowered by a graphical view, users can better understand a legacy workflow. This makes the wealth of legacy ad-hoc workflows accessible, enabling workflow reuse instead of investing time and resources into creating a workflow.Dissertation/ThesisMasters Thesis Computer Science 201

ASU Digital Repository

Recommended from our members

Automated Testing and Debugging for Big Data Analytics

Author: Gulzar Muhammad Ali
Publication venue: eScholarship, University of California
Publication date: 01/01/2020
Field of study

The prevalence of big data analytics in almost every large-scale software system has generated a substantial push to build data-intensive scalable computing (DISC) frameworks such as Google MapReduce and Apache Spark that can fully harness the power of existing data centers. However, frameworks once used by domain experts are now being leveraged by data scientists, business analysts, and researchers. This shift in user demographics calls for immediate advancements in the development, debugging, and testing practices of big data applications, which are falling behind compared to the DISC framework design and implementation. In practice, big data applications often fail as users are unable to test all behaviors emerging from interleaving dataflow operators, user-defined functions, and framework's code. "Testing based on a random sample" rarely guarantees the reliability and "trial and error" and "print" debugging methods are expensive and time-consuming. Thus, the current practice of developing a big data application must be improved and the tools built to enhance the developer's productivity must adapt to the distinct characteristics of data-intensive scalable computing. By synthesizing ideas from software engineering and database systems, our hypothesis is that we can design effective and scalable testing and debugging algorithms for big data analytics without compromising the performance and efficiency of the underlying DISC framework. To design such techniques, we investigate how we can build interactive and responsive debugging primitives that significantly reduce the debugging time, yet do not pose much performance overhead on big data applications. Furthermore, we investigate how we can leverage data provenance techniques from databases and fault-isolation algorithms from software engineering to pinpoint the minimal subset of failure-inducing inputs efficiently. To improve the reliability of big data analytics, we investigate how we can abstract the semantics of dataflow operators and use them in tandem with the semantics of user-defined functions to generate a minimum set of synthetic test inputs capable of revealing more defects than the entire input dataset.To examine the first hypothesis, we introduce interactive, real-time debugging primitives for big data analytics through innovative and scalable debugging features such as simulated breakpoint, dynamic watchpoint, and crash culprit identification. Second, we design a new automated fault localization approach that combines insights from both the software engineering and database literature to bring delta debugging closer to a reality in the big data applications by leveraging data provenance and by constructing systems optimizations for debugging provenance queries. Lastly, we devise a new symbolic-execution based white-box testing algorithm for big data applications that abstracts the implementation of dataflow operators using logical specifications instead of modeling their implementations and combines them with the semantics of any arbitrary user-defined function. We instantiate the idea of an interactive debugging algorithm as BigDebug, the idea of an automated debugging algorithm as BigSift, and the idea of symbolic execution-based testing as BigTest. Our investigation shows that the interactive debugging primitives can scale to terabytes---our record-level tracing incurs less than 25% overhead on average and provides up to 100% time saving compared to the baseline replay debugger. Second, we observe that by combining data provenance with delta debugging, we can identify the minimum faulty input in just under 30% of the original job execution time. Lastly, we verify that by abstracting dataflow operators using logical specifications, we can efficiently generate the most concise test data suitable for local testing while revealing twice as many faults as prior approaches. Our investigations collectively demonstrate that developer productivity can be significantly improved through effective and scalable testing and debugging techniques for big data analytics, without impacting the DISC framework's performance. This dissertation affirms the feasibility of automated debugging and testing techniques for big data analytics---techniques that were previously considered infeasible for large-scale data processing

eScholarship - University of California

Operationalizing Machine Learning: An Interview Study

Author: Garcia Rolando
Hellerstein Joseph M.
Parameswaran Aditya G.
Shankar Shreya
Publication venue
Publication date: 16/09/2022
Field of study

Organizations rely on machine learning engineers (MLEs) to operationalize ML, i.e., deploy and maintain ML pipelines in production. The process of operationalizing ML, or MLOps, consists of a continual loop of (i) data collection and labeling, (ii) experimentation to improve ML performance, (iii) evaluation throughout a multi-staged deployment process, and (iv) monitoring of performance drops in production. When considered together, these responsibilities seem staggering -- how does anyone do MLOps, what are the unaddressed challenges, and what are the implications for tool builders? We conducted semi-structured ethnographic interviews with 18 MLEs working across many applications, including chatbots, autonomous vehicles, and finance. Our interviews expose three variables that govern success for a production ML deployment: Velocity, Validation, and Versioning. We summarize common practices for successful ML experimentation, deployment, and sustaining production performance. Finally, we discuss interviewees' pain points and anti-patterns, with implications for tool design.Comment: 20 pages, 4 figure

arXiv.org e-Print Archive