
    PREDON Scientific Data Preservation 2014

    LPSC14037
    Scientific data collected with modern sensors or dedicated detectors very often exceed the scope of the initial scientific design, and these data are obtained with ever larger material and human efforts. A large class of scientific experiments is in fact unique because of its scale, with very small chances of being repeated or superseded by new experiments in the same domain: high energy physics and astrophysics experiments, for instance, involve multi-annual developments, and simply duplicating those efforts in order to reproduce old data is not affordable. Other scientific experiments are unique by nature: in earth science, the medical sciences and similar fields, the collected data are "time-stamped" and thereby non-reproducible by new experiments or observations. In addition, scientific data collection has increased dramatically in recent years, contributing to the so-called "data deluge" and inviting common reflection in the context of "big data" investigations. The new knowledge obtained from these data should be preserved over the long term, so that access and re-use remain possible and enhance the initial investment. Data observatories, based on open access policies and coupled with multi-disciplinary techniques for indexing and mining, may lead to truly new paradigms in science. It is therefore of utmost importance to pursue a coherent and vigorous approach to the long-term preservation of scientific data. Preservation nevertheless remains a challenge, due to the complexity of data structures, the fragility of custom-made software environments, and the lack of rigorous approaches to workflows and algorithms. To address this challenge, the PREDON project was initiated in France in 2012 within the MASTODONS program, a Big Data scientific challenge initiated and supported by the Interdisciplinary Mission of the National Centre for Scientific Research (CNRS). PREDON is a study group formed by researchers from different disciplines and institutes, and its meetings and workshops have led to a rich exchange of ideas, paradigms and methods. The present document includes contributions from the participants in the PREDON Study Group, as well as invited papers, related to the scientific case, methodology and technology. It should be read as a "fact-finding" resource pointing to a concrete and significant scientific interest in long-term research data preservation, as well as to cutting-edge methods and technologies for achieving this goal. A sustained, coherent and long-term action in the area of scientific data preservation would be highly beneficial.

    SimiFlow: an architecture for clustering workflows by similarity

    Scientists have been using Scientific Workflow Management Systems (SWfMS) to support scientific experiments. However, each SWfMS uses its own language for modeling a workflow that will later be executed, and scientists have no assistance or guidance in arriving at the modeled workflow. Experiment lines, a new approach for dealing with these limitations, allow an abstract representation and a systematic composition of experiments. Since many scientific workflows have already been modeled, scientists can use them to leverage the construction of new abstract representations; these previous experiments can help form an abstract structure if we can group them by similarity criteria. This project proposes SimiFlow, an architecture for similarity-based comparison and clustering that builds experiment lines through a bottom-up approach (see the sketch below).
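    The abstract does not specify SimiFlow's similarity measure, so the following minimal sketch only illustrates the general idea of bottom-up grouping: each workflow is reduced to its set of activity names, compared with Jaccard similarity, and greedily assigned to the first sufficiently similar group. The measure, the threshold, and the greedy loop are all illustrative assumptions, not SimiFlow's actual algorithm.

```python
# Illustrative sketch of similarity-based workflow grouping (bottom-up).
# The Jaccard measure over activity-name sets and the greedy threshold
# grouping are assumptions for illustration, not SimiFlow's algorithm.

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity between two sets of workflow activities."""
    return len(a & b) / len(a | b) if a | b else 1.0

def group_workflows(workflows: dict, threshold: float = 0.5) -> list:
    """Place each workflow into the first group whose representative
    (its first member) is similar enough; otherwise start a new group."""
    groups = []
    for name, activities in workflows.items():
        for group in groups:
            if jaccard(activities, workflows[group[0]]) >= threshold:
                group.append(name)
                break
        else:  # no existing group matched
            groups.append([name])
    return groups

# Toy workflows described only by their activity sets.
wfs = {
    "align_v1": {"fetch", "align", "report"},
    "align_v2": {"fetch", "align", "filter", "report"},
    "simulate": {"configure", "simulate", "plot"},
}
print(group_workflows(wfs))  # [['align_v1', 'align_v2'], ['simulate']]
```

    A real implementation would also compare workflow structure (control and data flow), not just activity sets, but the bottom-up grouping loop would remain the same.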

    Clustering of RCE Workflow Graphs

    RCE is an integration environment that allows users to create automated workflows orchestrating multi-disciplinary simulation tools in a distributed manner. A workflow consists of components representing tools and of connections between these components. Components can be grouped by users within the GUI by creating colored labels; this requires specialist knowledge and is a fully manual task. We investigate the feasibility of automating this task by applying graph clustering methods to such workflows. To this end, we model graphs based on workflows by adopting components as vertices and connections as edges, transferring connection properties to edge weights. We examine three different hierarchical clustering algorithms: edge betweenness, spectral bisection and agglomerative clustering. Additionally, we apply four different metrics to stop the algorithms when a cluster is found: cluster density, global clustering coefficient, average local clustering coefficient and modularity. We examine different mappings of edge weights in combination with the mentioned algorithms and metrics. As groups in workflows have no canonical definition, we evaluate our approach qualitatively. We consider 27 of the 1008 parameter combinations to yield useful results. The most expedient approach across multiple workflows is the edge betweenness algorithm with the modularity metric on an undirected graph representation (see the sketch below). The scores for the metrics and the mappings vary across workflows and do not allow general conclusions. We show that our approach is feasible, while noting that a quantitative study is necessary to validate our results in general.
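    The most expedient combination reported above, edge betweenness with the modularity stopping metric on an undirected graph, corresponds to the classic Girvan-Newman procedure scored by modularity. The sketch below shows that combination with networkx; the toy workflow graph and its edge weights are invented for illustration and are not taken from RCE.

```python
# Girvan-Newman (edge betweenness) clustering of an undirected workflow
# graph, keeping the partition with the highest modularity.
# The graph below is a toy example; components become vertices and
# connections become weighted edges, as in the paper's modeling step.
import networkx as nx
from networkx.algorithms import community

G = nx.Graph()
G.add_weighted_edges_from([
    ("input", "preproc", 2.0), ("preproc", "solver", 3.0),
    ("solver", "postproc", 3.0), ("postproc", "report", 2.0),
    ("input", "mesher", 1.0), ("mesher", "solver", 1.0),
])

best_partition, best_q = None, float("-inf")
for partition in community.girvan_newman(G):
    # Score each successive partition; modularity acts as the stop metric.
    q = community.modularity(G, partition, weight="weight")
    if q > best_q:
        best_partition, best_q = partition, q

print(round(best_q, 3), [sorted(c) for c in best_partition])
```

    Girvan-Newman repeatedly removes the edge with the highest betweenness, producing ever finer partitions; keeping the partition that maximizes modularity plays the role of the paper's stopping criterion.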

    Big Data Analytics in Static and Streaming Provenance

    Thesis (Ph.D.) - Indiana University, Informatics and Computing, 2016.
    With recent technological and computational advances, scientists increasingly integrate sensors and model simulations to understand spatial, temporal, social, and ecological relationships at unprecedented scale. Data provenance traces the relationships of entities over time, thus providing a unique view of the over-time behavior under study. However, provenance can be overwhelming in both volume and complexity, and the forecasting potential of provenance now creates additional demands. This dissertation focuses on Big Data analytics of static and streaming provenance. It develops filters and a non-preprocessing slicing technique for in-situ querying of static provenance, and it presents a stream processing framework for online processing of provenance data at high arrival rates. While the former is sufficient for answering queries that are given prior to application start (forward queries), the latter deals with queries whose targets are unknown beforehand (backward queries). Finally, it explores data mining on large collections of provenance and proposes a temporal representation of provenance that reduces the high dimensionality while effectively supporting mining tasks such as clustering, classification and association rule mining; this temporal representation can further be applied to streaming provenance as well. The proposed techniques are verified through software prototypes applied to Big Data provenance captured from computer network data, weather models, ocean models, remote (satellite) imagery data, and agent-based simulations of agricultural decision making.
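    The temporal representation is only described at a high level here, but the dimensionality-reduction idea can be made concrete: bucket timestamped provenance events into fixed windows and summarize each window by a count, so that an arbitrarily large trace becomes a fixed-length vector that standard clustering or classification can consume. The windowing scheme below is an assumption for illustration, not the dissertation's actual representation.

```python
# Illustrative temporal signature for a provenance trace: event counts per
# fixed time window. This reduces a trace of arbitrary size to a vector of
# length n_windows; the scheme is an assumption, not the thesis's method.
from collections import Counter

def temporal_signature(events, window, n_windows):
    """Map (timestamp, relation) provenance events to a fixed-length
    vector of event counts per time window."""
    counts = Counter(min(int(t // window), n_windows - 1) for t, _ in events)
    return [counts.get(i, 0) for i in range(n_windows)]

# Toy trace: (seconds since run start, PROV relation name).
trace = [(0.2, "used"), (0.9, "used"), (1.4, "wasGeneratedBy"),
         (3.1, "used"), (3.8, "wasGeneratedBy")]
print(temporal_signature(trace, window=1.0, n_windows=4))  # [2, 1, 0, 2]
```

    Because the vector length is fixed in advance, the same signature can be maintained incrementally over a stream, which matches the abstract's remark that the representation also applies to streaming provenance.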

    Utilizing the blackboard paradigm to implement a workflow engine

    Get PDF
    Workflow management has evolved into a mature field with numerous workflow management systems offering scores of features. These systems are designed to automate the business processes of organisations; however, many workflow engines struggle to support complex workflows. There has been relatively little research into building a workflow engine using the blackboard paradigm. The blackboard paradigm can be characterized as specialists interacting with and updating a centralized data structure, the blackboard, with partial and complete solutions. The opportunistic control innate to the blackboard paradigm can be leveraged to support the execution of complex workflows, and the blackboard architecture can accommodate comprehensive workflow functionality. This research aims to verify whether the blackboard paradigm can be used to build a workflow engine. To validate this research, a prototype was designed and developed following stringent guidelines in order to remain true to the blackboard paradigm (a sketch of the idea follows below). Four main perspectives of workflow management, namely the functional, behavioural, informational and operational aspects, together with their quality indicators and requirements, were used to evaluate the prototype. This evaluation approach was chosen because it is universally applicable to any workflow engine and thereby provides a common platform on which the prototype can be judged and compared against other workflow engines. The two most important quality indicators are the level of support a workflow engine provides for 20 main workflow patterns and 40 main data patterns. Test cases based on these patterns were developed and executed within the prototype to determine the level of support. The prototype was found to support 85% of the workflow patterns and 72.5% of the data patterns. This reveals some functional limitations in the prototype, and improvement suggestions are given that could boost these scores to 95% and 90% for workflow and data patterns respectively; the nature of the blackboard paradigm itself prevents support of only 5% and 10% of the workflow and data patterns respectively. The prototype is shown to substantially outperform most other workflow engines in the level of pattern support. Besides support for these patterns, other less important quality indicators provided by the main aspects of workflow management are also present in the prototype. Given the above evidence, it is possible to conclude that a workflow engine can be successfully built using the blackboard paradigm.
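    The blackboard idea as characterized above, specialists opportunistically reading and updating a shared data structure, can be made concrete with a small sketch. Below, workflow tasks act as knowledge sources that fire as soon as the data they need appears on the blackboard; the class names and the control loop are illustrative assumptions, not the thesis prototype's design.

```python
# Minimal blackboard-style workflow execution sketch (illustrative only).
# Knowledge sources declare the blackboard entries they need and produce;
# an opportunistic control loop fires whichever source is ready.

class KnowledgeSource:
    def __init__(self, name, needs, produces, action):
        self.name = name          # label for tracing
        self.needs = needs        # set of blackboard keys required
        self.produces = produces  # blackboard key this source writes
        self.action = action      # callable computing the produced value

    def ready(self, blackboard):
        return self.needs <= blackboard.keys() and self.produces not in blackboard

def run(sources, blackboard):
    """Opportunistic control: keep firing any ready source until none can
    contribute further (no fixed execution order is prescribed)."""
    fired = True
    while fired:
        fired = False
        for ks in sources:
            if ks.ready(blackboard):
                blackboard[ks.produces] = ks.action(blackboard)
                print("fired", ks.name)
                fired = True

bb = {"order": {"id": 42}}
sources = [
    KnowledgeSource("validate", {"order"}, "valid_order",
                    lambda b: b["order"]),
    KnowledgeSource("invoice", {"valid_order"}, "invoice",
                    lambda b: {"total": 99.0}),
]
run(sources, bb)
print(sorted(bb))  # ['invoice', 'order', 'valid_order']
```

    Because control is driven by the state of the blackboard rather than by a fixed sequence, constructs such as parallel branches and data-driven synchronization arise naturally, which is consistent with the high pattern support reported above.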