205 research outputs found

    Supporting Better Insights of Data Science Pipelines with Fine-grained Provenance

    Full text link
    Successful data-driven science requires complex data engineering pipelines to clean, transform, and alter data in preparation for machine learning, and robust results can only be achieved when each step in the pipeline can be justified, and its effect on the data explained. In this framework, our aim is to provide data scientists with facilities to gain an in-depth understanding of how each step in the pipeline affects the data, from the raw input to training sets ready to be used for learning. Starting from an extensible set of data preparation operators commonly used within a data science setting, in this work we present a provenance management infrastructure for generating, storing, and querying very granular accounts of data transformations, at the level of individual elements within datasets whenever possible. Then, from the formal definition of a core set of data science preprocessing operators, we derive a provenance semantics embodied by a collection of templates expressed in PROV, a standard model for data provenance. Using those templates as a reference, our provenance generation algorithm generalises to any operator with observable input/output pairs. We provide a prototype implementation of an application-level provenance capture library to produce, in a semi-automatic way, complete provenance documents that account for the entire pipeline. We report on the ability of our implementations to capture provenance in real ML benchmark pipelines and over TCP-DI synthetic data. We finally show how the collected provenance can be used to answer a suite of provenance benchmark queries that underpin some common pipeline inspection questions, as expressed on the Data Science Stack Exchange.Comment: 37 pages, 27 figures, submitted to a journa

    High-Fidelity Provenance:Exploring the Intersection of Provenance and Security

    Get PDF
    In the past 25 years, the World Wide Web has disrupted the way news are disseminated and consumed. However, the euphoria for the democratization of news publishing was soon followed by scepticism, as a new phenomenon emerged: fake news. With no gatekeepers to vouch for it, the veracity of the information served over the World Wide Web became a major public concern. The Reuters Digital News Report 2020 cites that in at least half of the EU member countries, 50% or more of the population is concerned about online fake news. To help address the problem of trust on information communi- cated over the World Wide Web, it has been proposed to also make available the provenance metadata of the information. Similar to artwork provenance, this would include a detailed track of how the information was created, updated and propagated to produce the result we read, as well as what agents—human or software—were involved in the process. However, keeping track of provenance information is a non-trivial task. Current approaches, are often of limited scope and may require modifying existing applications to also generate provenance information along with thei regular output. This thesis explores how provenance can be automatically tracked in an application-agnostic manner, without having to modify the individual applications. We frame provenance capture as a data flow analysis problem and explore the use of dynamic taint analysis in this context. Our work shows that this appoach improves on the quality of provenance captured compared to traditonal approaches, yielding what we term as high-fidelity provenance. We explore the performance cost of this approach and use deterministic record and replay to bring it down to a more practical level. Furthermore, we create and present the tooling necessary for the expanding the use of using deterministic record and replay for provenance analysis. The thesis concludes with an application of high-fidelity provenance as a tool for state-of-the art offensive security analysis, based on the intuition that software too can be misguided by "fake news". This demonstrates that the potential uses of high-fidelity provenance for security extend beyond traditional forensics analysis

    Fatias de rede fim-a-fim : da extração de perfis de funções de rede a SLAs granulares

    Get PDF
    Orientador: Christian Rodolfo Esteve RothenbergTese (doutorado) - Universidade Estadual de Campinas, Faculdade de Engenharia Elétrica e de ComputaçãoResumo: Nos últimos dez anos, processos de softwarização de redes vêm sendo continuamente diversi- ficados e gradativamente incorporados em produção, principalmente através dos paradigmas de Redes Definidas por Software (ex.: regras de fluxos de rede programáveis) e Virtualização de Funções de Rede (ex.: orquestração de funções virtualizadas de rede). Embasado neste processo o conceito de network slice surge como forma de definição de caminhos de rede fim- a-fim programáveis, possivelmente sobre infrastruturas compartilhadas, contendo requisitos estritos de desempenho e dedicado a um modelo particular de negócios. Esta tese investiga a hipótese de que a desagregação de métricas de desempenho de funções virtualizadas de rede impactam e compõe critérios de alocação de network slices (i.e., diversas opções de utiliza- ção de recursos), os quais quando realizados devem ter seu gerenciamento de ciclo de vida implementado de forma transparente em correspondência ao seu caso de negócios de comu- nicação fim-a-fim. A verificação de tal assertiva se dá em três aspectos: entender os graus de liberdade nos quais métricas de desempenho de funções virtualizadas de rede podem ser expressas; métodos de racionalização da alocação de recursos por network slices e seus re- spectivos critérios; e formas transparentes de rastrear e gerenciar recursos de rede fim-a-fim entre múltiplos domínios administrativos. Para atingir estes objetivos, diversas contribuições são realizadas por esta tese, dentre elas: a construção de uma plataforma para automatização de metodologias de testes de desempenho de funções virtualizadas de redes; a elaboração de uma metodologia para análises de alocações de recursos de network slices baseada em um algoritmo classificador de aprendizado de máquinas e outro algoritmo de análise multi- critério; e a construção de um protótipo utilizando blockchain para a realização de contratos inteligentes envolvendo acordos de serviços entre domínios administrativos de rede. Por meio de experimentos e análises sugerimos que: métricas de desempenho de funções virtualizadas de rede dependem da alocação de recursos, configurações internas e estímulo de tráfego de testes; network slices podem ter suas alocações de recursos coerentemente classificadas por diferentes critérios; e acordos entre domínios administrativos podem ser realizados de forma transparente e em variadas formas de granularidade por meio de contratos inteligentes uti- lizando blockchain. Ao final deste trabalho, com base em uma ampla discussão as perguntas de pesquisa associadas à hipótese são respondidas, de forma que a avaliação da hipótese proposta seja realizada perante uma ampla visão das contribuições e trabalhos futuros desta teseAbstract: In the last ten years, network softwarisation processes have been continuously diversified and gradually incorporated into production, mainly through the paradigms of Software Defined Networks (e.g., programmable network flow rules) and Network Functions Virtualization (e.g., orchestration of virtualized network functions). Based on this process, the concept of network slice emerges as a way of defining end-to-end network programmable paths, possibly over shared network infrastructures, requiring strict performance metrics associated to a par- ticular business case. This thesis investigate the hypothesis that the disaggregation of network function performance metrics impacts and composes a network slice footprint incurring in di- verse slicing feature options, which when realized should have their Service Level Agreement (SLA) life cycle management transparently implemented in correspondence to their fulfilling end-to-end communication business case. The validation of such assertive takes place in three aspects: the degrees of freedom by which performance of virtualized network functions can be expressed; the methods of rationalizing the footprint of network slices; and transparent ways to track and manage network assets among multiple administrative domains. In order to achieve such goals, a series of contributions were achieved by this thesis, among them: the construction of a platform for automating methodologies for performance testing of virtual- ized network functions; an elaboration of a methodology for the analysis of footprint features of network slices based on a machine learning classifier algorithm and a multi-criteria analysis algorithm; and the construction of a prototype using blockchain to carry out smart contracts involving service level agreements between administrative systems. Through experiments and analysis we suggest that: performance metrics of virtualized network functions depend on the allocation of resources, internal configurations and test traffic stimulus; network slices can have their resource allocations consistently analyzed/classified by different criteria; and agree- ments between administrative domains can be performed transparently and in various forms of granularity through blockchain smart contracts. At the end of his thesis, through a wide discussion we answer all the research questions associated to the investigated hypothesis in such way its evaluation is performed in face of wide view of the contributions and future work of this thesisDoutoradoEngenharia de ComputaçãoDoutor em Engenharia ElétricaFUNCAM

    Understanding Legacy Workflows through Runtime Trace Analysis

    Get PDF
    abstract: When scientific software is written to specify processes, it takes the form of a workflow, and is often written in an ad-hoc manner in a dynamic programming language. There is a proliferation of legacy workflows implemented by non-expert programmers due to the accessibility of dynamic languages. Unfortunately, ad-hoc workflows lack a structured description as provided by specialized management systems, making ad-hoc workflow maintenance and reuse difficult, and motivating the need for analysis methods. The analysis of ad-hoc workflows using compiler techniques does not address dynamic languages - a program has so few constrains that its behavior cannot be predicted. In contrast, workflow provenance tracking has had success using run-time techniques to record data. The aim of this work is to develop a new analysis method for extracting workflow structure at run-time, thus avoiding issues with dynamics. The method captures the dataflow of an ad-hoc workflow through its execution and abstracts it with a process for simplifying repetition. An instrumentation system first processes the workflow to produce an instrumented version, capable of logging events, which is then executed on an input to produce a trace. The trace undergoes dataflow construction to produce a provenance graph. The dataflow is examined for equivalent regions, which are collected into a single unit. The workflow is thus characterized in terms of its treatment of an input. Unlike other methods, a run-time approach characterizes the workflow's actual behavior; including elements which static analysis cannot predict (for example, code dynamically evaluated based on input parameters). This also enables the characterization of dataflow through external tools. The contributions of this work are: a run-time method for recording a provenance graph from an ad-hoc Python workflow, and a method to analyze the structure of a workflow from provenance. Methods are implemented in Python and are demonstrated on real world Python workflows. These contributions enable users to derive graph structure from workflows. Empowered by a graphical view, users can better understand a legacy workflow. This makes the wealth of legacy ad-hoc workflows accessible, enabling workflow reuse instead of investing time and resources into creating a workflow.Dissertation/ThesisMasters Thesis Computer Science 201

    Operationalizing Machine Learning: An Interview Study

    Full text link
    Organizations rely on machine learning engineers (MLEs) to operationalize ML, i.e., deploy and maintain ML pipelines in production. The process of operationalizing ML, or MLOps, consists of a continual loop of (i) data collection and labeling, (ii) experimentation to improve ML performance, (iii) evaluation throughout a multi-staged deployment process, and (iv) monitoring of performance drops in production. When considered together, these responsibilities seem staggering -- how does anyone do MLOps, what are the unaddressed challenges, and what are the implications for tool builders? We conducted semi-structured ethnographic interviews with 18 MLEs working across many applications, including chatbots, autonomous vehicles, and finance. Our interviews expose three variables that govern success for a production ML deployment: Velocity, Validation, and Versioning. We summarize common practices for successful ML experimentation, deployment, and sustaining production performance. Finally, we discuss interviewees' pain points and anti-patterns, with implications for tool design.Comment: 20 pages, 4 figure
    corecore