
    Rumble: Data Independence for Large Messy Data Sets

    This paper introduces Rumble, an engine that executes JSONiq queries on large, heterogeneous, and nested collections of JSON objects, leveraging the parallel capabilities of Spark to provide a high degree of data independence. The design is based on two key insights: (i) how to map JSONiq expressions to Spark transformations on RDDs and (ii) how to map JSONiq FLWOR clauses to Spark SQL on DataFrames. We have developed a working implementation of these mappings, showing that JSONiq can efficiently run on Spark to query billions of objects, into at least the terabyte range. The JSONiq code is concise in comparison to Spark's host languages while seamlessly supporting the nested, heterogeneous data sets that Spark SQL does not. The ability to process this kind of input, which is commonly encountered, is paramount for data cleaning and curation. The experimental analysis indicates that there is no excessive performance loss, and occasionally even a gain, over Spark SQL for structured data, and a performance gain over PySpark. This demonstrates that a language such as JSONiq is a simple and viable approach to large-scale querying of denormalized, heterogeneous, arborescent data sets, in the same way that SQL can be leveraged for structured data sets. The results also illustrate that Codd's concept of data independence makes as much sense for heterogeneous, nested data sets as it does for highly structured tables.
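    To make the data-independence argument concrete, the sketch below contrasts a PySpark query over a heterogeneous JSON collection with the equivalent JSONiq, shown as a comment; the input path and field names are hypothetical illustrations, not taken from the paper.

```python
# Hypothetical sketch: querying a heterogeneous JSON collection, the kind of
# workload Rumble targets with JSONiq. Path and field names are invented.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("heterogeneous-json").getOrCreate()

# Spark SQL infers one schema up front; heterogeneous records become sparse
# nullable columns, and deeply nested, schema-less data is awkward to express.
df = spark.read.json("s3://bucket/events/*.json")
df.filter(df["user.age"] > 30).select("user.name").show()

# The same logic in JSONiq (as run by Rumble) tolerates missing or
# differently typed fields without up-front schema inference, e.g.:
#   for $e in json-file("s3://bucket/events/*.json")
#   where $e.user.age gt 30
#   return $e.user.name
```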

    Implementation of a Large-Scale Platform for Cyber-Physical System Real-Time Monitoring

    The emergence of Industry 4.0 and the Internet of Things (IoT) has meant that the manufacturing industry has evolved from embedded systems to cyber-physical systems (CPSs). This transformation has given manufacturers the ability to measure the performance of industrial equipment using data gathered from on-board sensors, which allows the status of industrial systems to be monitored and anomalies to be detected. However, the increased amount of measured data has prompted many companies to investigate innovative ways to manage these volumes of data. In recent years, cloud computing and big data technologies have emerged among the scientific communities as key enabling technologies to address the current needs of CPSs. This paper presents a large-scale platform for CPS real-time monitoring based on big data technologies, which aims to perform real-time analysis targeting the monitoring of industrial machines in a real work environment. The proposal is validated by implementing the solution on a real industrial use case that includes several industrial press machines. Formal experiments in a real scenario are conducted to demonstrate the effectiveness of the solution, as well as its adequacy and scalability for future demand. As a result of deploying this solution, the overall equipment effectiveness has been improved. The authors are grateful to Goizper and Fagor Arrasate for providing the industrial case study, and specifically to Jon Rodriguez and David Chico (Fagor Arrasate) for their help and support. Any opinions, findings, and conclusions expressed in this article are those of the authors and do not necessarily reflect the views of the funding agencies.
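    As an illustration of the kind of pipeline such a platform builds on, here is a minimal sketch assuming a Kafka topic of press-machine sensor readings and a simple threshold rule; the topic name, schema, and threshold are invented for the example and are not details from the paper.

```python
# Minimal sketch of a near-real-time CPS monitoring job: sensor readings
# arrive on Kafka and are checked with Spark Structured Streaming.
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import (StructType, StructField, StringType,
                               DoubleType, TimestampType)

spark = SparkSession.builder.appName("cps-monitoring").getOrCreate()

schema = StructType([
    StructField("machine_id", StringType()),
    StructField("ts", TimestampType()),
    StructField("pressure_bar", DoubleType()),
])

readings = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "press-sensors")      # hypothetical topic
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("r"))
    .select("r.*"))

# Flag readings outside an assumed operating envelope as anomalies.
anomalies = readings.filter(col("pressure_bar") > 250.0)

query = anomalies.writeStream.format("console").start()
query.awaitTermination()
```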

    SCABBARD: single-node fault-tolerant stream processing

    Single-node multi-core stream processing engines (SPEs) can process hundreds of millions of tuples per second. Yet making them fault-tolerant with exactly-once semantics while retaining this performance is an open challenge: due to the limited I/O bandwidth of a single node, it becomes infeasible to persist all stream data and operator state during execution. Instead, single-node SPEs rely on upstream distributed systems, such as Apache Kafka, to recover stream data after failure, necessitating complex cluster-based deployments. This lack of built-in fault-tolerance features has hindered the adoption of single-node SPEs. We describe Scabbard, the first single-node SPE that supports exactly-once fault-tolerance semantics despite limited local I/O bandwidth. Scabbard achieves this by integrating persistence operations with the query workload. Within the operator graph, Scabbard determines when to persist streams based on the selectivity of operators: by persisting streams after operators that discard data, it can substantially reduce the required I/O bandwidth. As part of the operator graph, Scabbard supports parallel persistence operations and uses markers to decide when persisted data can be discarded. The persisted data volume is further reduced using workload-specific compression: Scabbard monitors stream statistics and dynamically generates computationally efficient compression operators. Our experiments show that Scabbard can execute stream queries that process over 200 million tuples per second while recovering from failures with sub-second latencies.
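    A toy sketch of the selectivity-driven placement idea (not Scabbard's actual code): given per-operator selectivities, pick the earliest point in the pipeline where the surviving data rate fits within the available disk I/O bandwidth.

```python
# Illustrative sketch: persist a stream *after* selective operators, so only
# tuples that survive filtering are written to disk, shrinking required I/O.
from dataclasses import dataclass

@dataclass
class Operator:
    name: str
    selectivity: float  # fraction of input tuples that survive (0.0-1.0)

def persistence_point(pipeline, input_rate, io_budget):
    """Return the index of the earliest operator after which the surviving
    tuple rate fits within the disk I/O budget, or None if none does."""
    rate = input_rate
    for i, op in enumerate(pipeline):
        rate *= op.selectivity
        if rate <= io_budget:
            return i  # persist the stream after operator i
    return None

pipeline = [Operator("parse", 1.0), Operator("filter", 0.05), Operator("project", 1.0)]
print(persistence_point(pipeline, input_rate=200e6, io_budget=20e6))  # -> 1
```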

    Antibacterial Ti-Cu alloy with enhanced mechanical properties as implant applications

    The service life of hard tissue implants in clinical applications requires compatible mechanical properties, e.g. strength and modulus, as well as a degree of protection against internal infection. Therefore, to improve the properties of Ti-Cu alloys, the microstructure, mechanical properties, corrosion resistance, and antibacterial properties of Ti-xCu alloys (x = 2, 5, 7, and 10 wt.%) prepared by Ar-arc melting followed by heat treatment were studied. The results show that the Ti-Cu alloys were mainly composed of an α-Ti matrix and a precipitated Ti2Cu phase. The Cu mainly accumulates in the lamellar structure and forms the precipitated Ti2Cu phase. With increasing Cu content, the amount of lamellar Ti2Cu phase increases, and the compressive strength and elastic modulus change accordingly. The Ti-7Cu alloy exhibited a higher compressive strength (2169 MPa) and a lower elastic modulus (108 GPa) than the other Ti-Cu alloys. The corrosion resistance of the Ti-xCu alloys increases with Cu content. When the Cu content was greater than 5 wt.%, the corrosion current density of the Ti-Cu alloy was less than 1 μA·cm−2, which is also significantly lower than that of CP-Ti. The antibacterial test revealed that only Ti-Cu alloys with 5 wt.% or more Cu displayed a strong antibacterial rate against E. coli and S. aureus. The prepared, heat-treated Ti-7Cu alloy therefore shows excellent mechanical properties, corrosion resistance, and antibacterial properties, making it suitable for the replacement of human hard tissue in clinical applications.

    ETL and analysis of IoT data using OpenTSDB, Kafka, and Spark

    Master's thesis in Computer Science. The Internet of Things (IoT) is becoming increasingly prevalent in today's society. Innovations in storage and processing methodologies enable the processing of large amounts of data in a scalable manner and the generation of insights in near real time. Data from IoT are typically time-series data, but they may also have a strong spatial correlation. In addition, many industries still place their time-series data in ill-suited relational databases. Many open-source time-series databases exist today with compelling features in terms of storage, analytic representation, and visualization. Finding an efficient method to migrate data into a time-series database is the first objective of this thesis. In recent decades, machine learning has become one of the backbones of data innovation. With the constantly expanding amounts of information available, there is good reason to expect that smart data analysis will become more pervasive as an essential element of innovative progress. This thesis also explores methods for modeling time-series data in machine learning and for migrating time-series data from a database to a big data machine learning framework such as Apache Spark.
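    As a flavor of the migration step, the sketch below moves rows from a hypothetical relational table into OpenTSDB through its HTTP /api/put endpoint; the table layout, metric name, and host are assumptions made for illustration, not taken from the thesis.

```python
# Minimal sketch: migrate relational time-series rows into OpenTSDB via its
# REST API. The /api/put endpoint accepts a JSON array of data points.
import sqlite3
import requests

conn = sqlite3.connect("legacy_sensors.db")  # hypothetical legacy store
rows = conn.execute("SELECT epoch_sec, sensor_id, value FROM readings")

points = [
    {
        "metric": "iot.sensor.value",
        "timestamp": epoch_sec,          # OpenTSDB accepts Unix timestamps
        "value": value,
        "tags": {"sensor": sensor_id},   # tags make each series queryable
    }
    for epoch_sec, sensor_id, value in rows
]

# Send in batches to avoid oversized requests.
for i in range(0, len(points), 500):
    requests.post("http://opentsdb:4242/api/put",
                  json=points[i:i + 500]).raise_for_status()
```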

    Why High-Performance Modelling and Simulation for Big Data Applications Matters

    Modelling and Simulation (M&S) offer adequate abstractions to manage the complexity of analysing big data in scientific and engineering domains. Unfortunately, big data problems are often not easily amenable to efficient and effective use of High Performance Computing (HPC) facilities and technologies. Furthermore, M&S communities typically lack the detailed expertise required to exploit the full potential of HPC solutions, while HPC specialists may not be fully aware of specific modelling and simulation requirements and applications. The COST Action IC1406 High-Performance Modelling and Simulation for Big Data Applications has created a strategic framework to foster interaction between M&S experts from various application domains on the one hand and HPC experts on the other, in order to develop effective solutions for big data applications. One of the tangible outcomes of the COST Action is a collection of case studies from various computing domains. Each case study brought together HPC and M&S experts, testifying to the effective cross-pollination facilitated by the COST Action. In this introductory article, we argue why joining forces between the M&S and HPC communities is both timely in the big data era and crucial for success in many application domains. Moreover, we provide an overview of the state of the art in the various research areas concerned.

    Scalable and fault-tolerant data stream processing on multi-core architectures

    With increasing data volumes and velocity, many applications are shifting from the classical “process-after-store” paradigm to a stream processing model: data is produced and consumed as continuous streams. Stream processing captures latency-sensitive applications as diverse as credit card fraud detection and high-frequency trading. These applications are expressed as queries of algebraic operations (e.g., aggregation) over the most recent data using windows, i.e., finite evolving views over the input streams. To guarantee correct results, streaming applications require precise window semantics (e.g., temporal ordering) for operations that maintain state. While high processing throughput and low latency are performance desiderata for stateful streaming applications, achieving both poses challenges. Computing the state of overlapping windows causes redundant aggregation operations: incremental execution (i.e., reusing previous results) reduces latency but prevents parallelization; at the same time, parallelizing window execution for stateful operations with precise semantics demands ordering guarantees and state access coordination. Finally, streams and state must be recovered to produce consistent and repeatable results in the event of failures. Given the rise of shared-memory multi-core CPU architectures and high-speed networking, we argue that it is possible to address these challenges in a single node without compromising window semantics, performance, or fault-tolerance. In this thesis, we analyze, design, and implement stream processing engines (SPEs) that achieve high performance on multi-core architectures. To this end, we introduce new approaches for in-memory processing that address the previous challenges: (i) for overlapping windows, we provide a family of window aggregation techniques that enable computation sharing based on the algebraic properties of aggregation functions; (ii) for parallel window execution, we balance parallelism and incremental execution by developing abstractions for both and combining them to a novel design; and (iii) for reliable single-node execution, we enable strong fault-tolerance guarantees without sacrificing performance by reducing the required disk I/O bandwidth using a novel persistence model. We combine the above to implement an SPE that processes hundreds of millions of tuples per second with sub-second latencies. These results reveal the opportunity to reduce resource and maintenance footprint by replacing cluster-based SPEs with single-node deployments.
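    To illustrate why the algebraic properties of aggregation functions matter for computation sharing, here is a minimal sketch (not the thesis code): sum is invertible, so a sliding window can be maintained in O(1) per tuple by subtracting evicted values, whereas a non-invertible function such as max cannot undo an eviction and needs a different sharing scheme.

```python
# Illustrative sketch: O(1)-per-tuple sliding aggregation for an invertible
# function (sum), exploiting the fact that eviction can be "undone".
from collections import deque

class SlidingSum:
    def __init__(self, size):
        self.size = size
        self.buf = deque()
        self.total = 0.0

    def insert(self, v):
        self.buf.append(v)
        self.total += v
        if len(self.buf) > self.size:
            self.total -= self.buf.popleft()  # invert: subtract evicted value

    def result(self):
        return self.total

w = SlidingSum(size=3)
for v in [1, 2, 3, 4, 5]:
    w.insert(v)
    print(w.result())  # 1, 3, 6, 9, 12
```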

    A Survey on the Evolution of Stream Processing Systems

    Stream processing has been an active research field for more than 20 years, but it is now witnessing its prime time due to recent successful efforts by the research community and numerous worldwide open-source communities. This survey provides a comprehensive overview of fundamental aspects of stream processing systems and their evolution in the functional areas of out-of-order data management, state management, fault tolerance, high availability, load management, elasticity, and reconfiguration. We review noteworthy past research findings, outline the similarities and differences between early ('00-'10) and modern ('11-'18) streaming systems, and discuss recent trends and open problems.
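    As a taste of one functional area the survey covers, out-of-order data management, here is a small watermark sketch under invented parameters: a watermark of the maximum observed event time minus an allowed lateness bounds how late an event may be before its window would be finalized.

```python
# Illustrative watermark sketch: events older than the watermark are "late".
class WatermarkTracker:
    def __init__(self, allowed_lateness):
        self.allowed_lateness = allowed_lateness
        self.max_event_time = float("-inf")

    def observe(self, event_time):
        self.max_event_time = max(self.max_event_time, event_time)

    def watermark(self):
        # No event with a smaller timestamp is expected after this point,
        # so windows ending before it can be finalized.
        return self.max_event_time - self.allowed_lateness

    def is_late(self, event_time):
        return event_time < self.watermark()

wm = WatermarkTracker(allowed_lateness=5)
for t in [10, 12, 11, 20, 8]:
    wm.observe(t)
    print(t, "late" if wm.is_late(t) else "on time", wm.watermark())
```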