Real-Time Data Processing With Lambda Architecture
Data has evolved immensely in recent years in type, volume, and velocity, and several frameworks exist to handle big data applications. The project focuses on the Lambda Architecture (LA) proposed by Marz and its application to real-time data processing. The architecture unites the benefits of batch and stream processing: data can be processed historically, with high precision and involved algorithms, without losing short-term information, alerts, and insights. The Lambda Architecture can serve a wide range of use cases and workloads while withstanding hardware and human mistakes. Its layered design promotes loose coupling and flexibility in the system, a major benefit that allows reasoning about the trade-offs and application of various tools and technologies across the layers. Improvements in the underlying tools have advanced the way the LA is built, and the project demonstrates a simplified, maintainable architecture for the LA.
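As a minimal sketch of the batch/speed-layer split the abstract describes, the serving layer can answer queries by merging a periodically recomputed batch view with a small real-time view of recent deltas. All names and numbers here are illustrative, not from a specific framework.

```python
# Lambda Architecture serving-layer merge (illustrative sketch):
# the batch view holds precise historical counts, recomputed periodically;
# the real-time (speed-layer) view holds counts since the last batch run.

def merge_views(batch_view: dict, realtime_view: dict) -> dict:
    """Combine per-key counts from the batch and speed layers."""
    merged = dict(batch_view)
    for key, delta in realtime_view.items():
        merged[key] = merged.get(key, 0) + delta
    return merged

batch_view = {"page_a": 1000, "page_b": 250}   # from the batch layer
realtime_view = {"page_a": 7, "page_c": 3}     # from the speed layer

print(merge_views(batch_view, realtime_view))
# {'page_a': 1007, 'page_b': 250, 'page_c': 3}
```

Queries see both layers at once: historical precision from the batch view plus fresh updates from the speed layer, which is the trade-off the architecture is built around.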
When Things Matter: A Data-Centric View of the Internet of Things
With the recent advances in radio-frequency identification (RFID), low-cost
wireless sensor devices, and Web technologies, the Internet of Things (IoT)
approach has gained momentum in connecting everyday objects to the Internet and
facilitating machine-to-human and machine-to-machine communication with the
physical world. While IoT offers the capability to connect and integrate both
digital and physical entities, enabling a whole new class of applications and
services, several significant challenges need to be addressed before these
applications and services can be fully realized. A fundamental challenge
centers around managing IoT data, typically produced in dynamic and volatile
environments, which is not only extremely large in scale and volume, but also
noisy, and continuous. This article surveys the main techniques and
state-of-the-art research efforts in IoT from data-centric perspectives,
including data stream processing, data storage models, complex event
processing, and searching in IoT. Open research issues for IoT data management
are also discussed.
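One building block common to the stream-processing and complex-event-processing techniques the survey covers is a sliding window over a noisy sensor stream. The sketch below, with an illustrative window size and threshold, flags readings that deviate sharply from the recent baseline.

```python
from collections import deque

# Sliding-window spike detection over a sensor stream (illustrative sketch).
# A reading is flagged when it deviates from the mean of the last `window`
# readings by more than `threshold`.

def detect_spikes(readings, window=5, threshold=10.0):
    recent = deque(maxlen=window)
    alerts = []
    for i, value in enumerate(readings):
        if len(recent) == recent.maxlen:
            baseline = sum(recent) / len(recent)
            if abs(value - baseline) > threshold:
                alerts.append((i, value))
        recent.append(value)
    return alerts

stream = [20.1, 20.3, 19.8, 20.0, 20.2, 45.0, 20.1]
print(detect_spikes(stream))  # [(5, 45.0)]
```

The bounded window is what makes this viable on continuous, unbounded IoT data: memory stays constant no matter how long the stream runs.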
Pathway: a fast and flexible unified stream data processing framework for analytical and Machine Learning applications
We present Pathway, a new unified data processing framework that can run
workloads on both bounded and unbounded data streams. The framework was created
with the original motivation of resolving challenges faced when analyzing and
processing data from the physical economy, including streams of data generated
by IoT and enterprise systems. These required rapid reaction while calling for
the application of advanced computation paradigms (machine-learning-powered
analytics, contextual analysis, and other elements of complex event
processing). Pathway is equipped with a Table API tailored for Python and
Python/SQL workflows, and is powered by a distributed incremental dataflow in
Rust. We describe the system and present benchmarking results which demonstrate
its capabilities in both batch and streaming contexts, where it is able to
surpass state-of-the-art industry frameworks in both scenarios. We also discuss
streaming use cases handled by Pathway which cannot be easily resolved with
state-of-the-art industry frameworks, such as streaming iterative graph
algorithms (PageRank, etc.).
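The core idea behind an incremental dataflow such as Pathway's is that aggregates are maintained by applying (key, value, diff) deltas rather than recomputed from scratch, so bounded (batch) and unbounded (streaming) inputs share one code path. The sketch below illustrates that idea in plain Python; the names are illustrative and this is not Pathway's actual Table API.

```python
from collections import defaultdict

# Incremental aggregation via deltas (illustrative sketch of the
# incremental-dataflow idea; not Pathway's real API).

class IncrementalSum:
    def __init__(self):
        self.totals = defaultdict(float)

    def apply(self, key, value, diff):
        """diff=+1 inserts a row, diff=-1 retracts it; returns current totals."""
        self.totals[key] += diff * value
        return dict(self.totals)

agg = IncrementalSum()
agg.apply("sensor_a", 10.0, +1)
agg.apply("sensor_a", 5.0, +1)
print(agg.apply("sensor_a", 10.0, -1))  # {'sensor_a': 5.0}
```

Retractions (diff = -1) are what let such an engine revise earlier outputs when late or corrected data arrives, without replaying the whole stream.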
Progressive Analytics: A Computation Paradigm for Exploratory Data Analysis
Exploring data requires a fast feedback loop from the analyst to the system,
with a latency below about 10 seconds because of human cognitive limitations.
When data becomes large or analysis becomes complex, sequential computations
can no longer be completed in a few seconds and data exploration is severely
hampered. This article describes a novel computation paradigm called
Progressive Computation for Data Analysis or more concisely Progressive
Analytics, that brings at the programming language level a low-latency
guarantee by performing computations in a progressive fashion. Moving this
progressive computation at the language level relieves the programmer of
exploratory data analysis systems from implementing the whole analytics
pipeline in a progressive way from scratch, streamlining the implementation of
scalable exploratory data analysis systems. This article describes the new
paradigm through a prototype implementation called ProgressiVis, and explains
the requirements it implies through examples.
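The progressive-computation idea can be sketched as a generator that processes data in chunks and yields an improving partial result after each one, keeping per-step latency bounded for the analyst. Chunk size and data below are illustrative, not taken from ProgressiVis.

```python
# Progressive mean (illustrative sketch): each yielded value is a usable
# partial estimate that is refined as more chunks are consumed.

def progressive_mean(data, chunk_size=1000):
    total, count = 0.0, 0
    for start in range(0, len(data), chunk_size):
        chunk = data[start:start + chunk_size]
        total += sum(chunk)
        count += len(chunk)
        yield total / count  # partial result after each chunk

estimates = list(progressive_mean(list(range(10_000)), chunk_size=2_500))
print(estimates[0], estimates[-1])  # 1249.5 4999.5
```

An exploration UI can render each intermediate estimate as it arrives, so the analyst gets feedback well under the roughly 10-second cognitive limit the article cites, even when the full computation takes much longer.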
IIoT Data Ness: From Streaming to Added Value
In the emerging Industry 4.0 paradigm, the internet of things has been an innovation driver, allowing for
environment visibility and control through sensor data analysis. However, the data is of such volume and
velocity that its quality cannot be assured by conventional architectures. It has been argued that the
quality and observability of data are key to a project’s success, allowing users to interact with data more
effectively and rapidly. For a project to succeed in this context, it is imperative to incorporate data
quality mechanisms in order to extract the most value from the data. Achieving this goal promises
substantial advantages that could lead to financial and innovation gains
for the industry. To cope with this reality, this work presents a data-mesh-oriented methodology, based
on state-of-the-art data management tools, to design a solution that leverages data quality
in the Industrial Internet of Things (IIoT) space through data contextualization. To achieve this
goal, practices such as FAIR data principles and data observability concepts were incorporated into the
solution. The result of this work allowed for the creation of an architecture that focuses on data and
metadata management to elevate data context, ownership, and quality.
The concept of the Internet of Things (IoT) is one of the main success factors for the new Industry 4.0. By analyzing the values that sensors collect from their environment, it is possible to build a platform capable of identifying favorable conditions and potential problems before they occur, resulting in relevant monetary gains for companies. However, this use case is not easy to implement, owing to the sheer volume and velocity of data coming from an IIoT (Industrial Internet of Things) environment.
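A concrete form of the data-quality mechanism the methodology calls for is a validation gate that checks incoming IIoT readings against a small contract (required fields, value range, freshness) before they enter the mesh. The field names and limits below are illustrative assumptions, not from the described architecture.

```python
import time

# Data-quality gate for IIoT readings (illustrative sketch):
# returns a list of violations; an empty list means the reading passes.

def validate_reading(reading, max_age_s=60.0, temp_range=(-40.0, 125.0)):
    errors = []
    for field in ("sensor_id", "timestamp", "temperature"):
        if field not in reading:
            errors.append(f"missing field: {field}")
    if not errors:
        lo, hi = temp_range
        if not lo <= reading["temperature"] <= hi:
            errors.append("temperature out of range")
        if time.time() - reading["timestamp"] > max_age_s:
            errors.append("reading is stale")
    return errors

ok = {"sensor_id": "s1", "timestamp": time.time(), "temperature": 21.5}
print(validate_reading(ok))  # []
```

Recording which contract each reading passed or failed is itself metadata, which is how such a gate feeds the observability and FAIR-data goals the abstract mentions.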
Performance Analysis of Cloud-Based Stream Processing Pipelines for Real-Time Vehicle Data
The recent advancements in stream processing systems have enabled applications to exploit fast-changing data and provide real-time services to companies and users. Such applications require high throughput and low latency to provide the most value. This thesis work, in collaboration with Scania, provides fundamental building blocks for the efficient development of latency-optimized, cloud-based, real-time processing pipelines.
Through investigation and analysis of the real-time Scania pipeline, this thesis delivers three contributions that can be employed to speed up the process of developing, testing, and optimizing low-latency streaming pipelines in many different contexts.
The first contribution is the design and implementation of a generic framework for testing and benchmarking AWS-based streaming pipelines. This framework collects latency statistics from every step of the pipeline, and the insights it produces can be used to quickly identify pipeline bottlenecks.
Employing this framework, the study then analyzes the behaviour of Scania's serverless streaming pipeline, which is built on the AWS Kinesis and AWS Lambda services. The results show the importance of tuning configuration parameters such as memory size and batch size, and several suggestions for optimal configurations and pipeline optimizations are discussed.
Finally, the thesis offers a survey of the main alternatives to the Scania pipeline, including Apache Spark Streaming and Apache Flink. After analyzing the benefits and drawbacks of each framework, we choose Flink as an alternative solution. The Scania pipeline is adapted to Flink with a new design and implementation, and the benefits of the Flink pipeline and a performance comparison are discussed in detail.
Overall, this work can be used as an extensive guide to the design and implementation of efficient, low-latency pipelines to be deployed on the cloud.
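The per-step latency collection such a benchmarking framework performs can be sketched by stamping each record with a timestamp at every pipeline stage and then decomposing end-to-end latency per stage. Stage names and the sleep stand-in below are illustrative, not details of the Scania framework.

```python
import time

# Per-stage latency measurement (illustrative sketch): each record carries
# (stage, timestamp) pairs, so end-to-end latency decomposes per stage.

def stamp(record, stage):
    record.setdefault("stamps", []).append((stage, time.perf_counter()))
    return record

def stage_latencies(record):
    stamps = record["stamps"]
    return {f"{a}->{b}": t2 - t1
            for (a, t1), (b, t2) in zip(stamps, stamps[1:])}

rec = stamp({}, "ingest")
time.sleep(0.01)            # stand-in for processing work
rec = stamp(rec, "transform")
rec = stamp(rec, "sink")
print(stage_latencies(rec))
```

Aggregating these per-stage deltas over many records is what turns raw timings into the bottleneck report the thesis uses to guide configuration tuning.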
Quality of Service Aware Data Stream Processing for Highly Dynamic and Scalable Applications
Huge amounts of georeferenced data streams arrive daily at data stream management systems deployed for serving highly scalable and dynamic applications. There are innumerable ways in which those loads can be exploited to gain deep insights in various domains. Decision makers require interactive visualization of such data in the form of maps and dashboards for decision making and strategic planning. Data streams normally exhibit fluctuation and oscillation in arrival rates and skewness; these are the two predominant factors that greatly impact the overall quality of service. Data stream management systems must therefore be attuned to those factors, in addition to the spatial shape of the data, which may exaggerate their negative impact. Current systems do not natively support services with quality guarantees for dynamic scenarios, leaving the handling of those logistics to the user, which is challenging and cumbersome. Three workloads are predominant for any data stream: batch processing, scalable storage, and stream processing. In this thesis, we have designed a quality-of-service-aware system, SpatialDSMS, comprising several subsystems that cover those workloads and any mixed load that results from intermixing them. Most importantly, we have natively incorporated quality-of-service optimizations for processing avalanches of georeferenced data streams in highly dynamic application scenarios. This has been achieved transparently on top of the codebases of emerging de facto standard, best-in-class representatives, thus relieving users in the presentation layer from having to reason about those services. Instead, users express their queries with quality goals, and our system optimizer compiles them down into query plans with an embedded quality guarantee, leaving logistic handling to the underlying layers.
We have developed standards-compliant prototypes for all the subsystems that constitute SpatialDSMS.
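One common mechanism a quality-of-service-aware stream system can employ when arrival rates fluctuate beyond capacity is random load shedding: dropping just enough tuples that the operator can keep meeting its latency goal. The rates and shed fraction below are illustrative; this is a generic QoS technique, not SpatialDSMS's specific optimizer.

```python
import random

# Random load shedding (illustrative sketch): when tuples arrive faster
# than the operator can process them, drop a fraction so the sustained
# input rate matches the service rate.

def shed_fraction(arrival_rate, service_rate):
    """Fraction of tuples to drop so the operator is not overloaded."""
    if arrival_rate <= service_rate:
        return 0.0
    return 1.0 - service_rate / arrival_rate

def process(stream, arrival_rate, service_rate, rng=random.Random(42)):
    drop_p = shed_fraction(arrival_rate, service_rate)
    return [t for t in stream if rng.random() >= drop_p]

kept = process(list(range(10_000)), arrival_rate=2_000, service_rate=1_000)
print(shed_fraction(2_000, 1_000))  # 0.5
```

In a QoS-aware optimizer, the shed fraction would be derived from the user's stated quality goal rather than fixed, trading bounded result approximation for a guaranteed latency.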