    Improving Pipelining Tools for Pre-processing Data

    The last several years have seen the emergence of data mining and its transformation into a powerful tool that adds value to business and research. Data mining makes it possible to explore and find unseen connections between variables and facts observed in different domains, helping us to better understand reality. The programming methods and frameworks used to analyse data have evolved over time. Pipelining schemes are currently the most reliable way of analysing data, and several major companies now offer services of this kind. Moreover, several frameworks compatible with different programming languages are available for developing computational pipelines, and many research studies have addressed the optimization of data processing speed. However, as this study shows, early error detection techniques and developer support mechanisms are very limited in these frameworks. In this context, this study introduces several improvements: different types of constraints for the early detection of errors, functions that facilitate debugging of specific tasks included in a pipeline, the invalidation of erroneous instances, and a burst-processing scheme. Incorporating these functionalities, we developed Big Data Pipelining for Java (BDP4J, https://github.com/sing-group/bdp4j), a fully functional new pipelining framework that demonstrates the potential of these features.
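    To make the described features concrete, the following is a minimal, hypothetical Java sketch of a pipeline with assembly-time type constraints, invalidation of erroneous instances, and burst processing. The names (Task, Instance, Pipeline) are illustrative assumptions and do not reflect the actual BDP4J API.

        // Illustrative sketch only: hypothetical types, not the BDP4J API.
        import java.util.ArrayList;
        import java.util.List;

        interface Task {
            Class<?> inputType();            // type the task consumes
            Class<?> outputType();           // type the task produces
            Object process(Object data);     // transform one instance's data
        }

        final class Instance {
            Object data;
            boolean valid = true;            // erroneous instances are flagged, not fatal
            Instance(Object data) { this.data = data; }
        }

        final class Pipeline {
            private final List<Task> tasks = new ArrayList<>();

            // Early error detection: reject a task whose input type does not
            // match the previous task's output type, before any data is processed.
            Pipeline add(Task t) {
                if (!tasks.isEmpty()) {
                    Class<?> prevOut = tasks.get(tasks.size() - 1).outputType();
                    if (!t.inputType().isAssignableFrom(prevOut)) {
                        throw new IllegalStateException(
                            "Type constraint violated: " + prevOut + " -> " + t.inputType());
                    }
                }
                tasks.add(t);
                return this;
            }

            // Burst processing: each task consumes the whole batch before the
            // next task runs. Failing instances are invalidated and skipped
            // downstream instead of aborting the run.
            void run(List<Instance> burst) {
                for (Task t : tasks) {
                    for (Instance i : burst) {
                        if (!i.valid) continue;
                        try {
                            i.data = t.process(i.data);
                        } catch (RuntimeException e) {
                            i.valid = false;
                        }
                    }
                }
            }
        }

    Checking type constraints when the pipeline is assembled surfaces wiring mistakes before any data is read, and flagging failed instances as invalid lets the rest of the burst continue instead of aborting the whole run.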

    Fine-Grain Interoperability of Scientific Workflows in Distributed Computing Infrastructures

    Today there exists a wide variety of scientific workflow management systems, each designed to fulfill the needs of a certain scientific community. Unfortunately, once a workflow application has been designed in one particular system, it becomes very hard to share it with users working with different systems. Portability of workflows and interoperability between current systems barely exist. In this work, we present the fine-grained interoperability solution proposed in the SHIWA European project, which brings together four representative European workflow systems: ASKALON, MOTEUR, WS-PGRADE, and Triana. The proposed interoperability is realised at two levels of abstraction: abstract and concrete. At the abstract level, we propose a generic Interoperable Workflow Intermediate Representation (IWIR) that can be used as a common bridge for translating workflows between different languages, independent of the underlying distributed computing infrastructure. At the concrete level, we propose a bundling technique that aggregates the abstract IWIR representation and concrete task representations to enable workflow instantiation, execution, and scheduling. We illustrate these ideas with case studies in which two real workflow applications, designed in their native environments, are translated and then executed by a foreign workflow system on a foreign distributed computing infrastructure. © 2013 Springer Science+Business Media Dordrecht.
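    The two-level design can be illustrated with a short sketch. The following Java model is only in the spirit of IWIR; the class names and fields are assumptions, not the IWIR schema. It separates an engine-independent abstract task graph from a bundle that pairs it with concrete task implementations.

        // Illustrative sketch only: a toy intermediate representation in the
        // spirit of IWIR; names and fields are assumptions, not the IWIR schema.
        import java.util.List;
        import java.util.Map;

        // Abstract level: a workflow is a typed task graph, independent of any
        // engine or distributed computing infrastructure.
        record AbstractTask(String name, String taskType,
                            List<String> inputPorts, List<String> outputPorts) {}

        record DataLink(String fromPort, String toPort) {}

        record AbstractWorkflow(String name,
                                List<AbstractTask> tasks,
                                List<DataLink> links) {}

        // Concrete level: a bundle pairs the abstract graph with concrete task
        // implementations so a foreign engine can instantiate and schedule it.
        record ConcreteTask(String taskType, String executableUri,
                            Map<String, String> deploymentHints) {}

        record Bundle(AbstractWorkflow workflow, List<ConcreteTask> implementations) {}

    With such a split, a translator between two systems only needs to map its native workflow language to and from the abstract graph, while the bundle carries the concrete task implementations that the receiving engine instantiates and schedules on its own infrastructure.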