1,603 research outputs found

    Understanding Legacy Workflows through Runtime Trace Analysis

    Get PDF
    abstract: When scientific software is written to specify processes, it takes the form of a workflow, and is often written in an ad-hoc manner in a dynamic programming language. There is a proliferation of legacy workflows implemented by non-expert programmers due to the accessibility of dynamic languages. Unfortunately, ad-hoc workflows lack a structured description as provided by specialized management systems, making ad-hoc workflow maintenance and reuse difficult, and motivating the need for analysis methods. The analysis of ad-hoc workflows using compiler techniques does not address dynamic languages - a program has so few constrains that its behavior cannot be predicted. In contrast, workflow provenance tracking has had success using run-time techniques to record data. The aim of this work is to develop a new analysis method for extracting workflow structure at run-time, thus avoiding issues with dynamics. The method captures the dataflow of an ad-hoc workflow through its execution and abstracts it with a process for simplifying repetition. An instrumentation system first processes the workflow to produce an instrumented version, capable of logging events, which is then executed on an input to produce a trace. The trace undergoes dataflow construction to produce a provenance graph. The dataflow is examined for equivalent regions, which are collected into a single unit. The workflow is thus characterized in terms of its treatment of an input. Unlike other methods, a run-time approach characterizes the workflow's actual behavior; including elements which static analysis cannot predict (for example, code dynamically evaluated based on input parameters). This also enables the characterization of dataflow through external tools. The contributions of this work are: a run-time method for recording a provenance graph from an ad-hoc Python workflow, and a method to analyze the structure of a workflow from provenance. Methods are implemented in Python and are demonstrated on real world Python workflows. These contributions enable users to derive graph structure from workflows. Empowered by a graphical view, users can better understand a legacy workflow. This makes the wealth of legacy ad-hoc workflows accessible, enabling workflow reuse instead of investing time and resources into creating a workflow.Dissertation/ThesisMasters Thesis Computer Science 201

    Querying and managing opm-compliant scientific workflow provenance

    Get PDF
    Provenance, the metadata that records the derivation history of scientific results, is important in scientific workflows to interpret, validate, and analyze the result of scientific computing. Recently, to promote and facilitate interoperability among heterogeneous provenance systems, the Open Provenance Model (OPM) has been proposed and has played an important role in the community. In this dissertation, to efficiently query and manage OPM-compliant provenance, we first propose a provenance collection framework that collects both prospective provenance, which captures an abstract workflow specification as a recipe for future data derivation and retrospective provenance, which captures past workflow execution and data derivation information. We then propose a relational database-based provenance system, called OPMPROV that stores, reasons, and queries prospective and retrospective provenance, which is OPM-compliant provenance. We finally propose OPQL, an OPM-level provenance query language, that is directly defined over the OPM model. An OPQL query takes an OPM graph as input and produces an OPM graph as output; therefore, OPQL queries are not tightly coupled to the underlying provenance storage strategies. Our provenance store, provenance collection framework, and provenance query language feature the native support of the OPM model

    ExplainIt! -- A declarative root-cause analysis engine for time series data (extended version)

    Full text link
    We present ExplainIt!, a declarative, unsupervised root-cause analysis engine that uses time series monitoring data from large complex systems such as data centres. ExplainIt! empowers operators to succinctly specify a large number of causal hypotheses to search for causes of interesting events. ExplainIt! then ranks these hypotheses, reducing the number of causal dependencies from hundreds of thousands to a handful for human understanding. We show how a declarative language, such as SQL, can be effective in declaratively enumerating hypotheses that probe the structure of an unknown probabilistic graphical causal model of the underlying system. Our thesis is that databases are in a unique position to enable users to rapidly explore the possible causal mechanisms in data collected from diverse sources. We empirically demonstrate how ExplainIt! had helped us resolve over 30 performance issues in a commercial product since late 2014, of which we discuss a few cases in detail.Comment: SIGMOD Industry Track 201

    Data mining and fusion

    No full text

    Metadata and provenance management

    Get PDF
    Scientists today collect, analyze, and generate TeraBytes and PetaBytes of data. These data are often shared and further processed and analyzed among collaborators. In order to facilitate sharing and data interpretations, data need to carry with it metadata about how the data was collected or generated, and provenance information about how the data was processed. This chapter describes metadata and provenance in the context of the data lifecycle. It also gives an overview of the approaches to metadata and provenance management, followed by examples of how applications use metadata and provenance in their scientific processes

    Work ow-based systematic design of high throughput genome annotation

    No full text
    The genus Eimeria belongs to the phylum Apicomplexa, which includes many obligate intra-cellular protozoan parasites of man and livestock. E. tenella is one of seven species that infect the domestic chicken and cause the intestinal disease coccidiosis which is economy important for poultry industry. E. tenella is highly pathogenic and is often used as a model species for the Eimeria biology studies. In this PhD thesis, a comprehensive annotation system named as \WAGA" (Workflow-based Automatically Genome Annotation) was built and applied to the E. tenella genome. InforSense KDE, and its BioSense plug-in (products of the InforSense Company), were the core softwares used to build the workflows. Workflows were made by integrating individual bioinformatics tools into a single platform. Each workflow was designed to provide a standalone service for a particular task. Three major workflows were developed based on the genomic resources currently available for E. tenella. These were of ESTs-based gene construction, HMM-based gene prediction and protein-based annotation. Finally, a combining workflow was built to sit above the individual ones to generate a set of automatic annotations using all of the available information. The overall system and its three major components were deployed as web servers that are fully tuneable and reusable for end users. WAGA does not require users to have programming skills or knowledge of the underlying algorithms or mechanisms of its low level components. E. tenella was the target genome here and all the results obtained were displayed by GBrowse. A sample of the results is selected for experimental validation. For evaluation purpose, WAGA was also applied to another Apicomplexa parasite, Plasmodium falciparum, the causative agent of human malaria, which has been extensively annotated. The results obtained were compared with gene predictions of PHAT, a gene finder designed for and used in the P. falciparum genome project

    Metodología dirigida por modelos para las pruebas de un sistema distribuido multiagente de fabricación

    Get PDF
    Las presiones del mercado han empujado a las empresas de fabricación a reducir costes a la vez que mejoran sus productos, especializándose en las actividades sobre las que pueden añadir valor y colaborando con especialistas de las otras áreas para el resto. Estos sistemas distribuidos de fabricación conllevan nuevos retos, dado que es difícil integrar los distintos sistemas de información y organizarlos de forma coherente. Esto ha llevado a los investigadores a proponer una variedad de abstracciones, arquitecturas y especificaciones que tratan de atacar esta complejidad. Entre ellas, los sistemas de fabricación holónicos han recibido una atención especial: ven las empresas como redes de holones, entidades que a la vez están formados y forman parte de varios otros holones. Hasta ahora, los holones se han implementado para control de fabricación como agentes inteligentes autoconscientes, pero su curva de aprendizaje y las dificultades a la hora de integrarlos con sistemas tradicionales han dificultado su adopción en la industria. Por otro lado, su comportamiento emergente puede que no sea deseable si se necesita que las tareas cumplan ciertas garantías, como ocurren en las relaciones de negocio a negocio o de negocio a cliente y en las operaciones de alto nivel de gestión de planta. Esta tesis propone una visión más flexible del concepto de holón, permitiendo que se sitúe en un espectro más amplio de niveles de inteligencia, y defiende que sea mejor implementar los holones de negocio como servicios, componentes software que pueden ser reutilizados a través de tecnologías estándar desde cualquier parte de la organización. Estos servicios suelen organizarse como catálogos coherentes, conocidos como Arquitecturas Orientadas a Servicios (‘Service Oriented Architectures’ o SOA). Una iniciativa SOA exitosa puede reportar importantes beneficios, pero no es una tarea trivial. Por este motivo, se han propuesto muchas metodologías SOA en la literatura, pero ninguna de ellas cubre explícitamente la necesidad de probar los servicios. Considerando que la meta de las SOA es incrementar la reutilización del software en la organización, es una carencia importante: tener servicios de alta calidad es crucial para una SOA exitosa. Por este motivo, el objetivo principal de la presente Tesis es definir una metodología extendida que ayude a los usuarios a probar los servicios que implementan a sus holones de negocio. Tras considerar las opciones disponibles, se tomó la metodología dirigida por modelos SODM como punto de partida y se reescribió en su mayor parte con el framework Epsilon de código abierto, permitiendo a los usuarios que modelen su conocimiento parcial sobre el rendimiento esperado de los servicios. Este conocimiento parcial es aprovechado por varios nuevos algoritmos de inferencia de requisitos de rendimiento, que extraen los requisitos específicos de cada servicio. Aunque el algoritmo de inferencia de peticiones por segundo es sencillo, el algoritmo de inferencia de tiempos límite pasó por numerosas revisiones hasta obtener el nivel deseado de funcionalidad y rendimiento. Tras una primera formulación basada en programación lineal, se reemplazó con un algoritmo sencillo ad-hoc que recorría el grafo y después con un algoritmo incremental mucho más rápido y avanzado. El algoritmo incremental produce resultados equivalentes y tarda mucho menos, incluso con modelos grandes. Para sacar más partidos de los modelos, esta Tesis también propone un enfoque general para generar artefactos de prueba para múltiples tecnologías a partir de los modelos anotados por los algoritmos. Para evaluar la viabilidad de este enfoque, se implementó para dos posibles usos: reutilizar pruebas unitarias escritas en Java como pruebas de rendimiento, y generar proyectos completos de prueba de rendimiento usando el framework The Grinder para cualquier Servicio Web que esté descrito usando el estándar Web Services Description Language. La metodología completa es finalmente aplicada con éxito a un caso de estudio basado en un área de fabricación de losas cerámicas rectificadas de un grupo de empresas español. En este caso de estudio se parte de una descripción de alto nivel del negocio y se termina con la implementación de parte de uno de los holones y la generación de pruebas de rendimiento para uno de sus Servicios Web. Con su soporte para tanto diseñar como implementar pruebas de rendimiento de los servicios, se puede concluir que SODM+T ayuda a que los usuarios tengan una mayor confianza en sus implementaciones de los holones de negocio observados en sus empresas

    Knowledge Components and Methods for Policy Propagation in Data Flows

    Get PDF
    Data-oriented systems and applications are at the centre of current developments of the World Wide Web (WWW). On the Web of Data (WoD), information sources can be accessed and processed for many purposes. Users need to be aware of any licences or terms of use, which are associated with the data sources they want to use. Conversely, publishers need support in assigning the appropriate policies alongside the data they distribute. In this work, we tackle the problem of policy propagation in data flows - an expression that refers to the way data is consumed, manipulated and produced within processes. We pose the question of what kind of components are required, and how they can be acquired, managed, and deployed, to support users on deciding what policies propagate to the output of a data-intensive system from the ones associated with its input. We observe three scenarios: applications of the Semantic Web, workflow reuse in Open Science, and the exploitation of urban data in City Data Hubs. Starting from the analysis of Semantic Web applications, we propose a data-centric approach to semantically describe processes as data flows: the Datanode ontology, which comprises a hierarchy of the possible relations between data objects. By means of Policy Propagation Rules, it is possible to link data flow steps and policies derivable from semantic descriptions of data licences. We show how these components can be designed, how they can be effectively managed, and how to reason efficiently with them. In a second phase, the developed components are verified using a Smart City Data Hub as a case study, where we developed an end-to-end solution for policy propagation. Finally, we evaluate our approach and report on a user study aimed at assessing both the quality and the value of the proposed solution
    • …
    corecore