
    A unified view of data-intensive flows in business intelligence systems: a survey

    Data-intensive flows are central processes in today’s business intelligence (BI) systems, deploying different technologies to deliver data, from a multitude of data sources, in user-preferred and analysis-ready formats. To meet the complex requirements of next-generation BI systems, we often need an effective combination of the traditionally batched extract-transform-load (ETL) processes that populate a data warehouse (DW) from integrated data sources, and more real-time and operational data flows that integrate source data at runtime. Both academia and industry thus need a clear understanding of the foundations of data-intensive flows and of the challenges of moving towards next-generation BI environments. In this paper we present a survey of today’s research on data-intensive flows and the related fundamental fields of database theory. The study is based on a proposed set of dimensions describing the important challenges of data-intensive flows in the next-generation BI setting. As a result of this survey, we envision an architecture of a system for managing the lifecycle of data-intensive flows. The results further provide a comprehensive understanding of data-intensive flows, recognizing challenges that still need to be addressed and showing how current solutions can be applied to address them.
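
    As an aside not drawn from the survey itself, a minimal sketch of the traditionally batched ETL step described above may help make the contrast with real-time flows concrete. All names here (the extract/transform/load functions, orders.csv, the fact_orders warehouse table) are illustrative assumptions, not artifacts of the paper.

# Minimal batch ETL sketch: pull rows from a CSV source, shape them into an
# analysis-ready form, and load them into a small warehouse table.
import csv
import sqlite3

def extract(path):
    # Read raw records from a source system export.
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    # Cast types and keep only the analysis-ready fields.
    return [(r["order_id"], r["customer"], float(r["amount"])) for r in rows]

def load(rows, db_path="warehouse.db"):
    # Append the cleaned rows into a warehouse fact table.
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS fact_orders "
                "(order_id TEXT, customer TEXT, amount REAL)")
    con.executemany("INSERT INTO fact_orders VALUES (?, ?, ?)", rows)
    con.commit()
    con.close()

if __name__ == "__main__":
    load(transform(extract("orders.csv")))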

    Digital cockpits and decision support systems: design of techniques and tools to extract and process data from heterogeneous databases

    Honor roll (Tableau d'honneur) of the Faculté des études supérieures et postdoctorales, 2006-200

    Workflow Provenance: from Modeling to Reporting

    Workflow provenance is a crucial part of a workflow system, as it enables data lineage analysis, error tracking, workflow monitoring, usage pattern discovery, and so on. Integrating provenance into a workflow system, or modifying a workflow system to capture or analyze different provenance information, is burdensome and requires extensive development, because provenance mechanisms rely heavily on the modeling, architecture, and design of the workflow system. Various tools and technologies exist for logging events in a software system. Unfortunately, logging tools and technologies are not designed for capturing and analyzing provenance information: workflow provenance is not only about logging, but also about retrieving workflow-related information from logs. In this work, we propose a taxonomy of provenance questions and, guided by these questions, we create a workflow programming model, ProvMod, with a supporting run-time library that provides automated provenance and log analysis for any workflow system. The design and provenance mechanism of ProvMod are based on recommendations from prominent research, and ProvMod is easy to integrate into any workflow system. ProvMod offers Neo4j graph database support to manage semi-structured heterogeneous JSON logs; the log structure is adaptable to any NoSQL technology. For each provenance question in our taxonomy, ProvMod provides the answer with data visualization using Neo4j and the ELK Stack. Besides analyzing performance from various angles, we demonstrate the ease of integration by integrating ProvMod with Apache Taverna and evaluate ProvMod's usability by engaging users. Finally, we present two software engineering research cases (clone detection and architecture extraction) where the proposed model ProvMod and the provenance questions taxonomy can be applied to discover meaningful insights.
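
    Since ProvMod's actual API is not given in the abstract, the following is only a rough sketch of the general pattern it describes: a task-level provenance event captured as semi-structured JSON and stored as a node in Neo4j via the official Python driver. The event fields, node label, and connection settings are assumptions for illustration, not ProvMod's schema.

# Sketch: record one workflow provenance event as JSON and store it in Neo4j.
import json
from datetime import datetime, timezone

from neo4j import GraphDatabase  # official Neo4j Python driver

def record_provenance_event(driver, workflow_id, task, status, inputs, outputs):
    # Build a semi-structured event; extra fields can be added per workflow.
    event = {
        "workflow_id": workflow_id,
        "task": task,
        "status": status,
        "inputs": inputs,
        "outputs": outputs,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    with driver.session() as session:
        # Keep the raw JSON payload alongside a few queryable properties.
        session.run(
            "CREATE (e:ProvenanceEvent {workflow_id: $wid, task: $task, payload: $payload})",
            wid=workflow_id, task=task, payload=json.dumps(event),
        )

if __name__ == "__main__":
    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
    record_provenance_event(driver, "wf-42", "align_reads", "completed",
                            inputs=["sample.fastq"], outputs=["aligned.bam"])
    driver.close()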

    The Family of MapReduce and Large Scale Data Processing Systems

    In the last two decades, the continuous increase of computational power has produced an overwhelming flow of data, which has called for a paradigm shift in computing architectures and large-scale data processing mechanisms. MapReduce is a simple and powerful programming model that enables easy development of scalable parallel applications to process vast amounts of data on large clusters of commodity machines. It isolates the application from the details of running a distributed program, such as data distribution, scheduling, and fault tolerance. However, the original implementation of the MapReduce framework had some limitations that have been tackled by many research efforts in follow-up works after its introduction. This article provides a comprehensive survey of a family of approaches and mechanisms for large-scale data processing that have been implemented based on the original idea of the MapReduce framework and are currently gaining a lot of momentum in both the research and industrial communities. We also cover a set of systems that have been introduced to provide declarative programming interfaces on top of the MapReduce framework. In addition, we review several large-scale data processing systems that resemble some of the ideas of the MapReduce framework for different purposes and application scenarios. Finally, we discuss some of the future research directions for implementing the next generation of MapReduce-like solutions. (arXiv admin note: text overlap with arXiv:1105.4252 by another author.)
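
    For readers unfamiliar with the model surveyed above, a single-process word-count sketch illustrates the two user-supplied functions that MapReduce revolves around; the in-memory shuffle stands in for the grouping, distribution, and fault tolerance that the framework itself would normally handle.

# Word count expressed in the MapReduce programming model, simulated in one process.
from collections import defaultdict

def map_phase(document):
    # Emit (word, 1) pairs for every word in one input record.
    for word in document.split():
        yield word.lower(), 1

def reduce_phase(word, counts):
    # Sum the partial counts grouped under one key.
    return word, sum(counts)

def run_job(documents):
    shuffle = defaultdict(list)
    for doc in documents:
        for key, value in map_phase(doc):
            shuffle[key].append(value)  # group intermediate pairs by key
    return dict(reduce_phase(key, values) for key, values in shuffle.items())

if __name__ == "__main__":
    print(run_job(["the quick brown fox", "the lazy dog"]))
    # {'the': 2, 'quick': 1, 'brown': 1, 'fox': 1, 'lazy': 1, 'dog': 1}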

    10381 Summary and Abstracts Collection -- Robust Query Processing

    Dagstuhl seminar 10381 on robust query processing (held 19.09.10 - 24.09.10) brought together a diverse set of researchers and practitioners with a broad range of expertise for the purpose of fostering discussion and collaboration regarding causes, opportunities, and solutions for achieving robust query processing. The seminar strove to build a unified view across the loosely-coupled system components responsible for the various stages of database query processing. Participants were chosen for their experience with database query processing and, where possible, their prior work in academic research or in product development towards robustness in database query processing. In order to pave the way to motivate, measure, and protect future advances in robust query processing, seminar 10381 focused on developing tests for measuring the robustness of query processing. In these proceedings, we first review the seminar topics, goals, and results, then present abstracts or notes of some of the seminar break-out sessions. We also include, as an appendix, the robust query processing reading list that was collected and distributed to participants before the seminar began, as well as summaries of a few of those papers that were contributed by some participants.

    On Evaluating Commercial Cloud Services: A Systematic Review

    Background: Cloud computing is booming in industry, with many competing providers and services. Accordingly, evaluation of commercial Cloud services is necessary. However, the existing evaluation studies are relatively chaotic; there is tremendous confusion and a gap between practice and theory in Cloud services evaluation. Aim: To help relieve this chaos, this work aims to synthesize the existing evaluation implementations to outline the state of the practice and to identify research opportunities in Cloud services evaluation. Method: Based on a conceptual evaluation model comprising six steps, the Systematic Literature Review (SLR) method was employed to collect relevant evidence and investigate Cloud services evaluation step by step. Results: This SLR identified 82 relevant evaluation studies. The overall data collected from these studies essentially represent the current practical landscape of implementing Cloud services evaluation, and can in turn be reused to facilitate future evaluation work. Conclusions: Evaluation of commercial Cloud services has become a worldwide research topic. Some of the findings of this SLR identify several research gaps in the area of Cloud services evaluation (e.g., the Elasticity and Security evaluation of commercial Cloud services could be a long-term challenge), while other findings suggest trends in applying commercial Cloud services (e.g., compared with PaaS, IaaS seems more suitable for customers and is particularly important in industry). This SLR study itself also confirms some previous experiences and reveals new Evidence-Based Software Engineering (EBSE) lessons.

    Scientific Workflows for Metabolic Flux Analysis

    Metabolic engineering is a highly interdisciplinary research domain that interfaces biology, mathematics, computer science, and engineering. Metabolic flux analysis with carbon tracer experiments (13C-MFA) is a particularly challenging metabolic engineering application that consists of several tightly interwoven building blocks such as modeling, simulation, and experimental design. While several general-purpose workflow solutions have emerged in recent years to support the realization of complex scientific applications, these approaches are only partially transferable to 13C-MFA workflows. While problems in other research fields (e.g., bioinformatics) are primarily centered around scientific data processing, 13C-MFA workflows have more in common with business workflows. For instance, many bioinformatics workflows are designed to identify, compare, and annotate genomic sequences by "pipelining" them through standard tools like BLAST; typically, the next workflow task in the pipeline can be determined automatically by the outcome of the previous step. Five computational challenges have been identified in the endeavor of conducting 13C-MFA studies: organization of heterogeneous data, standardization of processes and the unification of tools and data, interactive workflow steering, distributed computing, and service orientation. The outcome of this thesis is a scientific workflow framework (SWF) that is custom-tailored to the specific requirements of 13C-MFA applications. The proposed approach, namely designing the SWF as a collection of loosely coupled modules that are glued together with web services, eases the realization of 13C-MFA workflows by offering several features. By design, existing tools are integrated into the SWF using web service interfaces and foreign programming language bindings (e.g., Java or Python). Although the attributes "easy-to-use" and "general-purpose" are rarely associated with distributed computing software, the presented use cases show that the proposed Hadoop MapReduce-based framework eases the deployment of computationally demanding simulations on cloud and cluster computing resources. An important building block for allowing interactive, researcher-driven workflows is the ability to track all data that is needed to understand and reproduce a workflow. The standardization of 13C-MFA studies using a folder structure template and the corresponding services and web interfaces improves the exchange of information within a group of researchers. Finally, several auxiliary tools are developed in the course of this work to complement the SWF modules, ranging from simple helper scripts to visualization and data conversion programs. This solution distinguishes itself from other scientific workflow approaches by offering a system of loosely coupled components that are flexibly arranged to match the typical requirements of the metabolic engineering domain. Being a modern and service-oriented software framework, it allows new applications to be easily composed by reusing existing components.
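
    The thesis's concrete module interfaces are not spelled out in the abstract, so the following is only a sketch of the stated design idea of loosely coupled modules glued together with web services: a stand-in simulation step exposed behind an HTTP endpoint so that other modules, in any language, can call it. The endpoint name, payload fields, use of Flask, and the trivial computation are all illustrative assumptions.

# Sketch: one workflow building block wrapped as a small web service.
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/simulate", methods=["POST"])
def simulate():
    model = request.get_json()
    # Stand-in for a flux simulation: pretend to compute one flux value per
    # reaction listed in the posted model description.
    fluxes = {reaction: 1.0 for reaction in model.get("reactions", [])}
    return jsonify({"model": model.get("name"), "fluxes": fluxes})

if __name__ == "__main__":
    # Another module could call this step with, for example:
    #   curl -X POST localhost:5000/simulate -H 'Content-Type: application/json' \
    #        -d '{"name": "toy", "reactions": ["v1", "v2"]}'
    app.run(port=5000)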

    An integrative framework for cooperative production resources in smart manufacturing

    Under the push of the Industry 4.0 paradigm, modern manufacturing companies are dealing with a significant digital transition, with the aim of better addressing the challenges posed by the growing complexity of globalized businesses (Hermann, Pentek, & Otto, Design principles for industrie 4.0 scenarios, 2016). One basic principle of this paradigm is that products, machines, systems, and businesses are always connected, creating an intelligent network along the entire factory’s value chain. According to this vision, manufacturing resources are being transformed from monolithic entities into distributed components, which are loosely coupled and autonomous but nevertheless equipped with the networking and connectivity capabilities enabled by the increasingly widespread Industrial Internet of Things technology. Under these conditions, they become capable of working together in a reliable and predictable manner, collaborating among themselves in a highly efficient way. Such a mechanism of synergistic collaboration is crucial for the correct evolution of any organization, ranging from a multi-cellular organism to a complex modern manufacturing system (Moghaddam & Nof, 2017). In the latter scenario specifically, which is the field of our study, collaboration enables the involved resources to exchange relevant information about the evolution of their context. This information can in turn be processed to make decisions and trigger actions. In this way, connected resources can modify their structure and configuration in response to specific business or operational variations (Alexopoulos, Makris, Xanthakis, Sipsas, & Chryssolouris, 2016). Such a model of “social” and context-aware resources can contribute to the realization of a highly flexible, robust, and responsive manufacturing system, an objective particularly relevant to modern factories, as demonstrated by its inclusion among the priority research lines for the H2020 three-year period 2018-2020 (EFFRA, 2016). Interesting examples of such resources are self-organizing logistics that can react to unexpected changes in production, or machines capable of predicting failures on the basis of contextual information and then autonomously triggering adjustment processes. This vision of collaborative and cooperative resources can be realized with the support of several studies in various fields, ranging from information and communication technologies to artificial intelligence. An updated state of the art highlights significant recent achievements that have been making these resources more intelligent and closer to user needs. However, we are still far from an overall implementation of the vision, which is hindered by three major issues. The first is the limited capability of a large part of the resources distributed within the shop floor to automatically interpret the exchanged information in a meaningful manner (semantic interoperability) (Atzori, Iera, & Morabito, 2010). This issue is mainly due to the high heterogeneity of the data model formats adopted by the different resources used within the shop floor (Modoni, Doukas, Terkaj, Sacco, & Mourtzis, 2016). Another open issue is the lack of efficient methods to fully virtualize the physical resources (Rosen, von Wichert, Lo, & Bettenhausen, 2015), since only by pairing a physical resource with a digital counterpart that abstracts the complexity of the real world is it possible to augment the communication and collaboration capabilities of the physical component.
The third issue is a side effect of the ongoing ICT evolution affecting all manufacturing companies and consists in the continuous growth of the number of threats and vulnerabilities that can jeopardize the cybersecurity of the overall manufacturing system (Wells, Camelio, Williams, & White, 2014). For this reason, cybersecurity-related aspects should be considered at an early stage of the design of any ICT solution, in order to prevent potential threats and vulnerabilities. All three of the above-mentioned open issues have been addressed in this research work, with the aim of exploring and identifying a precise, secure, and efficient model of collaboration among the production resources distributed within the shop floor. This document illustrates the main outcomes of the research, focusing mainly on the Virtual Integrative Manufacturing Framework for resources Interaction (VICKI), a potential reference architecture for a middleware application enabling semantic-based cooperation among manufacturing resources. Specifically, this framework provides a technological and service-oriented infrastructure offering an event-driven mechanism that dynamically propagates changing factors to the interested devices. The proposed system supports the coexistence and combination of physical components and their virtual counterparts in a network of interacting collaborative elements in constant connection, thus allowing the manufacturing system to evolve into a cooperative Cyber-Physical Production System (CPPS) (Monostori, 2014). Within this network, the information coming from the productive chain can be promptly and seamlessly shared, distributed, and understood by any actor operating in such a context. In order to overcome the problem of limited interoperability among the connected resources, the framework leverages a common data model based on Semantic Web technologies (SWT) (Berners-Lee, Hendler, & Lassila, 2001). The model provides a shared understanding of the vocabulary adopted by the distributed resources during their knowledge exchange. In this way, the model makes it possible to integrate heterogeneous data streams into a coherent, semantically enriched scheme that represents the evolution of the factory objects, their context, and their smart reactions to all kinds of situations. The semantic model is also machine-interpretable and reusable. In addition to modeling, the virtualization of the overall manufacturing system is empowered by the adoption of agent-based modeling, which helps hide and abstract the complexity of the control functions of the cooperating entities, thus providing the foundations to achieve a flexible and reconfigurable system. Finally, in order to mitigate the risk of internal and external attacks against the proposed infrastructure, the potential of a strategy based on the analysis and assessment of the manufacturing system’s cybersecurity aspects, integrated into the context of the organization’s business model, is explored. To test and validate the proposed framework, demonstration scenarios have been identified that represent different significant case studies of the factory’s life cycle. To prove the correctness of the approach, an instance of the framework is validated within a real case study.
Moreover, since for data-intensive systems such as a manufacturing system the quality of service (QoS) requirements in terms of latency, efficiency, and scalability are stringent, these requirements are also evaluated in a real case study by means of a defined benchmark, showing the impact of the data storage, of the connected resources, and of their requests.
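
    The abstract does not give VICKI's actual ontology, so the following is only a minimal sketch of the kind of shared, semantically enriched data model that Semantic Web technologies enable: a manufacturing resource and its current state expressed as RDF triples under a common vocabulary that any connected resource can query. The namespace, class, and property names are illustrative assumptions.

# Sketch: describe a machine and its state as RDF and query it with SPARQL.
from rdflib import Graph, Literal, Namespace, RDF
from rdflib.namespace import XSD

FACTORY = Namespace("http://example.org/factory#")  # hypothetical shared vocabulary

g = Graph()
g.bind("factory", FACTORY)

# Triples describing a physical resource through its virtual counterpart.
g.add((FACTORY.Mill01, RDF.type, FACTORY.Machine))
g.add((FACTORY.Mill01, FACTORY.hasStatus, Literal("running")))
g.add((FACTORY.Mill01, FACTORY.spindleSpeedRpm, Literal(4200, datatype=XSD.integer)))

# Any resource that understands the shared vocabulary can query the model.
query = """
PREFIX factory: <http://example.org/factory#>
SELECT ?machine ?status WHERE {
    ?machine a factory:Machine ;
             factory:hasStatus ?status .
}
"""
for machine, status in g.query(query):
    print(machine, status)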