
    Optimisation of the enactment of fine-grained distributed data-intensive workflows

    The emergence of data-intensive science as the fourth science paradigm has posed a data-deluge challenge for enacting scientific workflows. The scientific community is facing an imminent flood of data from the next generation of experiments and simulations, besides dealing with the heterogeneity and complexity of data, applications and execution environments. New scientific workflows involve execution on distributed and heterogeneous computing resources across organisational and geographical boundaries, processing gigabytes of live data streams and petabytes of archived and simulation data, in various formats and from multiple sources. Managing the enactment of such workflows requires not only larger storage space and faster machines, but also the capability to support scalability and diversity of the users, applications, data, computing resources and the enactment technologies. We argue that the enactment process can be made efficient using optimisation techniques in an appropriate architecture. This architecture should support the creation of diversified applications and their enactment on diversified execution environments, with a standard interface, i.e. a workflow language. The workflow language should be both human-readable and suitable for communication between the enactment environments. The data-streaming model central to this architecture provides a scalable approach to large-scale data exploitation. Data-flow between computational elements in the scientific workflow is implemented as streams. To cope with the exploratory nature of scientific workflows, the architecture should support fast workflow prototyping, and the re-use of workflows and workflow components. Above all, the enactment process should be easily repeated and automated. In this thesis, we present a candidate data-intensive architecture that includes an intermediate workflow language, named DISPEL. We create a new fine-grained measurement framework to capture performance-related data during enactments, and design a performance database to organise them systematically. We propose a new enactment strategy to demonstrate that optimisation of data-streaming workflows can be automated by exploiting performance data gathered during previous enactments.
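    The closing claim, that optimisation of data-streaming workflows can be automated from performance data gathered in earlier enactments, can be made concrete with a small sketch. The following is a hypothetical illustration rather than the thesis's measurement framework: the performance database is reduced to a plain dict, and record and suggest_replicas are invented names.

        import statistics

        perf_db = {}  # stage name -> per-item costs (seconds) from past enactments

        def record(stage, items, elapsed):
            """Store the average per-item cost observed in one enactment."""
            perf_db.setdefault(stage, []).append(elapsed / max(items, 1))

        def suggest_replicas(stage, upstream_rate):
            """Replicate a stage until its throughput matches its input stream."""
            costs = perf_db.get(stage)
            if not costs:
                return 1  # no history yet: enact unoptimised and measure
            stage_rate = 1.0 / statistics.mean(costs)  # items/s per replica
            return max(1, round(upstream_rate / stage_rate))

        # A previous enactment processed 500 items in 10 s (0.02 s/item), so
        # keeping up with a 200 items/s input stream needs 4 parallel replicas.
        record("transform", items=500, elapsed=10.0)
        print(suggest_replicas("transform", upstream_rate=200))  # -> 4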

    Towards optimising distributed data streaming graphs using parallel streams

    Modern scientific collaborations have opened up the opportunity of solving complex problems that involve multidisciplinary expertise and large-scale computational experiments. These experiments usually involve large amounts of data that are located in distributed data repositories running various software systems, and managed by different organisations. A common strategy to make the experiments more manageable is executing the processing steps as a workflow. In this paper, we look into the implementation of fine-grained data-flow between computational elements in a scientific workflow as streams. We model the distributed computation as a directed acyclic graph where the nodes represent the processing elements that incrementally implement specific subtasks. The processing elements are connected in a pipelined streaming manner, which allows task executions to overlap. We further optimise the execution by splitting pipelines across processes and by introducing extra parallel streams. We identify performance metrics and design a measurement tool to evaluate each enactment. We conducted experiments to evaluate our optimisation strategies with a real-world problem in the Life Sciences: EURExpress-II. The paper presents our distributed data-handling model, the optimisation and instrumentation strategies and the evaluation experiments. We demonstrate linear speed-up and argue that this use of data-streaming to enable both overlapped pipeline and parallelised enactment is a generally applicable optimisation strategy.
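    The two optimisations named above can be sketched in a few lines. This is a hypothetical illustration using Python's multiprocessing, not the paper's implementation: stages communicate through queues so their executions overlap (the pipelined streaming), and the middle stage is replicated to introduce extra parallel streams.

        from multiprocessing import Process, Queue

        def produce(out_q, n_items, n_replicas):
            """First stage: emit the stream, then one end-marker per replica."""
            for i in range(n_items):
                out_q.put(i)
            for _ in range(n_replicas):
                out_q.put(None)

        def transform(in_q, out_q):
            """Middle stage; replicated to create the extra parallel streams."""
            while True:
                item = in_q.get()
                if item is None:
                    out_q.put(None)  # forward the end-marker and stop
                    return
                out_q.put(item * item)  # stand-in for a real subtask

        def consume(in_q, n_replicas):
            """Final stage: results arrive while upstream is still producing."""
            total, ended = 0, 0
            while ended < n_replicas:
                item = in_q.get()
                if item is None:
                    ended += 1
                else:
                    total += item
            print("sum of squares:", total)

        if __name__ == "__main__":
            q1, q2, replicas = Queue(), Queue(), 4
            procs = [Process(target=produce, args=(q1, 1000, replicas)),
                     Process(target=consume, args=(q2, replicas))]
            procs += [Process(target=transform, args=(q1, q2))
                      for _ in range(replicas)]
            for p in procs:
                p.start()
            for p in procs:
                p.join()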

    Scientific Workflows: Moving Across Paradigms

    Modern scientific collaborations have opened up the opportunity to solve complex problems that require both multidisciplinary expertise and large-scale computational experiments. These experiments typically consist of a sequence of processing steps that need to be executed on selected computing platforms. Execution poses a challenge, however, due to (1) the complexity and diversity of applications, (2) the diversity of analysis goals, (3) the heterogeneity of computing platforms, and (4) the volume and distribution of data. A common strategy to make these in silico experiments more manageable is to model them as workflows and to use a workflow management system to organize their execution. This article looks at the overall challenge posed by a new order of scientific experiments and the systems they need to be run on, and examines how this challenge can be addressed by workflows and workflow management systems. It proposes a taxonomy of workflow management system (WMS) characteristics, including aspects previously overlooked. This frames a review of prevalent WMSs used by the scientific community, elucidates their evolution to handle the challenges arising with the emergence of the “fourth paradigm,” and identifies research needed to maintain progress in this area.

    Scientific Workflows: Past, Present and Future

    This special issue and our editorial celebrate 10 years of progress with data-intensive or scientific workflows. There have been very substantial advances in the representation of workflows and in the engineering of workflow management systems (WMS). The creation and refinement stages are now well supported, with a significant improvement in usability. Improved abstraction supports cross-fertilisation between different workflow communities and consistent interpretation as WMS evolve. Through such re-engineering, WMS deliver much improved performance, significantly increased scale and sophisticated reliability mechanisms. Further improvement is anticipated from substantial advances in optimisation. We invited papers from those who have delivered these advances and selected 14 to represent today's achievements and representative plans for future progress. This editorial introduces those contributions with an overview and categorisation of the papers. Furthermore, it elucidates responses from a survey of major workflow systems, which provides evidence of substantial progress and a structured index of related papers. We conclude with suggestions on areas where further research and development is needed and offer a vision of future research directions.

    Microservices-based IoT Applications Scheduling in Edge and Fog Computing: A Taxonomy and Future Directions

    Edge and Fog computing paradigms utilise distributed, heterogeneous and resource-constrained devices at the edge of the network for efficient deployment of latency-critical and bandwidth-hungry IoT application services. Moreover, MicroService Architecture (MSA) is increasingly adopted to keep up with the rapid development and deployment needs of the fast-evolving IoT applications. Due to the fine-grained modularity of the microservices along with their independently deployable and scalable nature, MSA exhibits great potential in harnessing both Fog and Cloud resources to meet diverse QoS requirements of the IoT application services, thus giving rise to novel paradigms like Osmotic computing. However, efficient and scalable scheduling algorithms are required to utilise the said characteristics of the MSA while overcoming novel challenges introduced by the architecture. To this end, we present a comprehensive taxonomy of recent literature on microservices-based IoT applications scheduling in Edge and Fog computing environments. Furthermore, we organise multiple taxonomies to capture the main aspects of the scheduling problem, analyse and classify related works, identify research gaps within each category, and discuss future research directions.

    DARE: A Reflective Platform Designed to Enable Agile Data-Driven Research on the Cloud

    The DARE platform has been designed to help research developers deliver user-facing applications and solutions over diverse underlying e-infrastructures, data and computational contexts. The platform is Cloud-ready, and relies on the exposure of APIs, which are suitable for raising the abstraction level and hiding complexity. At its core, the platform implements the cataloguing and execution of fine-grained and Python-based dispel4py workflows as services. Reflection is achieved via a logical knowledge base, comprising multiple internal catalogues, registries and semantics, while it supports persistent and pervasive data provenance. This paper presents design and implementation aspects of the DARE platform, and provides directions for future development.
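    To give a flavour of the API-driven interaction described above, here is a purely hypothetical sketch of registering a dispel4py workflow with such a platform and executing it as a service. The base URL, endpoint paths and payload fields are invented for illustration; only the register-then-execute pattern and the server-side provenance capture follow the text.

        import requests

        BASE = "https://dare.example.org/api"  # placeholder, not a real endpoint

        # Catalogue a dispel4py workflow (hypothetical endpoint and fields).
        with open("my_workflow.py") as f:
            resp = requests.post(f"{BASE}/workflows",
                                 json={"name": "event_prep",
                                       "language": "dispel4py",
                                       "source": f.read()})
        wf_id = resp.json()["id"]

        # Request an enactment; provenance would be recorded platform-side.
        run = requests.post(f"{BASE}/workflows/{wf_id}/executions",
                            json={"inputs": {"catalogue": "events-2019"}})
        print(run.json()["status"])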

    Mechanisms Driving Digital New Venture Creation & Performance: An Insider Action Research Study of Pure Digital Entrepreneurship in EdTech

    Digitisation has ushered in a new era of value creation where cross-border data flows generate more economic value than traditional flows of goods. The powerful new combination of digital and traditional forms of innovation has seen several new industries branded with a ‘tech’ suffix. In the education technology sector (EdTech), which is the industry context of this research, digitisation is driving double-digit growth into a projected $240 billion industry by 2021. Yet, despite its contemporary significance, the field of entrepreneurship has paid little attention to the phenomenon of digital entrepreneurship. As several scholars observe, digitisation challenges core organising axioms of entrepreneurship, with significant implications for the new venture creation process in new sectors such as EdTech. New venture creation no longer appears to follow discrete and linear models of innovation, as spatial and temporal boundaries get compressed. Given the paradigmatic shift, this study investigates three interrelated themes. Firstly, it seeks to determine how a Pure Digital Entrepreneurship (PDE) process develops over time; and more importantly, how the journey challenges extant assumptions of the entrepreneurial process. Secondly, it strives to identify and theorise the deep structures which underlie the PDE process through mechanism-based explanations. Consequently, the study also seeks to determine the causal pathways and enablers which overtly or covertly interrelate to power new venture emergence and performance. Thirdly, it aims to offer practical guidelines for nurturing the growth of PDE ventures, and for the development of supportive ecosystems. To meet the stated objectives, this study utilises an Insider Action Research (IAR) approach to inquiry, which incorporates reflective practice, collaborative inquiry and design research for third-person knowledge production. This three-pronged approach to inquiry allows for the enactment of a PDE journey in real-time, while acquiring a holistic narrative in the ‘swampy lowlands’ of new venture creation. The findings indicate that the PDE process is differentiated by the centrality of digital artifacts in new venture ideas, which in turn result in less-bounded processes that deliver temporal efficiencies; hence new venture creation processes are shorter than in traditional forms of entrepreneurship. Further, PDE action is defined by two interrelated events – digital product development and digital growth marketing. These events are characterised by the constant forking, merging and termination of diverse activities. Secondly, concurrent enactment and piecemeal co-creation were found to be consequential mechanisms driving temporal efficiencies in digital product development. Meanwhile, data-driven operation and flexibility combine in digital growth marketing to form higher-order mechanisms which considerably reduce the levels of task-specific and outcome uncertainties. Finally, the study finds that digital growth marketing is differentiated from traditional marketing by the critical role of algorithmic agencies in their capacity as gatekeepers. Thus, unlike traditional marketing, which emphasises customer sovereignty, digital growth marketing involves a dual focus on the needs of human and algorithmic stakeholders. Based on the findings, this research develops a pragmatic model of pure digital new venture creation and suggests critical policy guidelines for nurturing the growth of PDE ventures and ecosystems.

    dispel4py: A Python framework for data-intensive scientific computing

    This paper presents dispel4py, a new Python framework for describing abstract stream-based workflows for distributed data-intensive applications. These combine the familiarity of Python programming with the scalability of workflows. Data streaming is used to gain performance, rapid prototyping and applicability to live observations. dispel4py enables scientists to focus on their scientific goals, avoiding distracting details and retaining flexibility over the computing infrastructure they use. The implementation, therefore, has to map dispel4py abstract workflows optimally onto target platforms chosen dynamically. We present four dispel4py mappings: Apache Storm, message-passing interface (MPI), multi-threading and sequential, showing two major benefits: a) smooth transitions from local development on a laptop to scalable execution for production work, and b) scalable enactment on significantly different distributed computing infrastructures. Three application domains are reported and measurements on multiple infrastructures show the optimisations achieved; they have provided demanding real applications and helped us develop effective training. dispel4py.org is an open-source project to which we invite participation. The effective mapping of dispel4py onto multiple target infrastructures demonstrates exploitation of data-intensive and high-performance computing (HPC) architectures and consistent scalability.
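    A minimal example of the programming model, based on dispel4py's documented base classes; the exact class and port names below are taken from the project's tutorials and should be treated as assumptions rather than a definitive reference. The key point is that the graph stays abstract: none of this code names a target platform.

        from dispel4py.base import ProducerPE, IterativePE, ConsumerPE
        from dispel4py.workflow_graph import WorkflowGraph

        class Numbers(ProducerPE):
            def _process(self, inputs):
                return 1  # each invocation streams one item to 'output'

        class Square(IterativePE):
            def _process(self, data):
                return data * data  # result streams on to the next PE

        class Show(ConsumerPE):
            def _process(self, data):
                self.log(data)

        graph = WorkflowGraph()
        prod, square, show = Numbers(), Square(), Show()
        graph.connect(prod, 'output', square, 'input')
        graph.connect(square, 'output', show, 'input')
        # A mapping (sequential, multi-threading, MPI or Storm) is chosen at
        # enactment time, so the same graph runs on a laptop or a cluster.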

    Dispel4Py: A Python Framework for Data-intensive eScience

    We present dispel4py, a novel data-intensive and high-performance computing middleware provided as a standard Python library for describing stream-based workflows. It allows its users to develop their scientific applications locally and then run them on a wide range of HPC infrastructures without any changes to the code. Moreover, it provides automated and efficient parallel mappings to MPI, multiprocessing, Storm and Spark frameworks, commonly used in big data applications. It builds on the wide availability of Python in many environments and only requires familiarity with basic Python syntax. We will show the dispel4py advantages by walking through an example. We will conclude by demonstrating how dispel4py can be employed as an easy-to-use tool for designing scientific applications using real-world scenarios.
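    The "without any changes to the code" claim refers to selecting a mapping at launch time. The sketch below follows the command-line pattern in dispel4py's documentation, written as comments because the exact flags vary between versions and should be checked against the current docs; my_workflow stands for a Python module containing a graph like the one in the previous example.

        # Sequential mapping: local development and debugging on a laptop.
        #   dispel4py simple my_workflow -i 10
        #
        # Multiprocessing mapping: the same module with four worker processes.
        #   dispel4py multi my_workflow -i 10 -n 4
        #
        # MPI mapping: again the same module, launched across cluster nodes.
        #   mpiexec -n 16 dispel4py mpi my_workflow -i 10
        #
        # Only the mapping argument changes; the workflow module is untouched.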

    A workflow runtime environment for manycore parallel architectures

    We introduce a new Manycore Workflow Runtime Environment (MWRE) to efficiently enact traditional scientific workflows on modern manycore computing architectures. MWRE is compiler-based and translates workflows specified in the XML-based Interoperable Workflow Intermediate Representation (IWIR) into an equivalent C++-based program. This program efficiently enacts the workflow as a stand-alone executable by means of a new callback mechanism that resolves dependencies, transfers data, and handles composite activities. Furthermore, a core feature of MWRE is explicit support for full-ahead scheduling and enactment. Experimental results on a number of real-world workflows demonstrate that MWRE clearly outperforms existing Java-based workflow engines designed for distributed (Grid or Cloud) computing infrastructures in terms of enactment time, is generally better than an existing script-based engine for manycore architectures (Swift), and sometimes even comes close to an artificial baseline implementation of the workflows in the standard OpenMP language for shared-memory systems. Experimental results also show that full-ahead scheduling with MWRE using a state-of-the-art heuristic can improve workflow performance by up to 40%.
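    The callback mechanism described above can be modelled conceptually. The following is an illustrative Python sketch, not MWRE's generated C++: each activity keeps a set of unresolved dependencies, and a callback enacts it the moment that set empties, so dependency resolution needs no central scheduler loop.

        class Activity:
            def __init__(self, name, deps, action):
                self.name, self.action = name, action
                self.pending = set(deps)  # unresolved upstream activities
                self.subscribers = []     # downstream callbacks to notify

            def on_done(self, callback):
                self.subscribers.append(callback)

            def notify(self, completed):
                self.pending.discard(completed)
                if not self.pending:      # all inputs resolved: enact now
                    self.action()
                    for cb in self.subscribers:
                        cb(self.name)

        def build(graph, actions):
            acts = {n: Activity(n, deps, actions[n]) for n, deps in graph.items()}
            for name, deps in graph.items():
                for dep in deps:
                    acts[dep].on_done(acts[name].notify)
            return acts

        # A three-activity chain: read -> filter -> plot.
        names = ["read", "filter", "plot"]
        acts = build({"read": [], "filter": ["read"], "plot": ["filter"]},
                     {n: (lambda n=n: print("run", n)) for n in names})
        acts["read"].notify(None)  # the source has no dependencies: start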