Real-Time Data Processing With Lambda Architecture
Data has evolved immensely in recent years in type, volume, and velocity, and several frameworks exist to handle big data applications. The project focuses on the Lambda Architecture (LA) proposed by Marz and its application to real-time data processing. The architecture unites the benefits of batch and stream processing: data can be processed historically, with high precision and involved algorithms, without losing short-term information, alerts, and insights. The Lambda Architecture can serve a wide range of use cases and workloads while withstanding hardware and human mistakes. Its layered design promotes loose coupling and flexibility in the system, a major benefit that allows reasoning about the trade-offs and application of various tools and technologies across the layers. Improvements in the underlying tools have advanced the way the LA is built, and the project demonstrates a simplified, maintainable architecture for the LA.
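As a minimal sketch of the batch/speed-layer split the abstract describes, the serving layer can answer queries by merging a periodically recomputed batch view with a small real-time view of recent deltas. All names and numbers here are illustrative, not from a specific framework.

```python
# Lambda Architecture serving-layer merge (illustrative sketch):
# the batch view holds precise historical counts, recomputed periodically;
# the real-time (speed-layer) view holds counts since the last batch run.

def merge_views(batch_view: dict, realtime_view: dict) -> dict:
    """Combine per-key counts from the batch and speed layers."""
    merged = dict(batch_view)
    for key, delta in realtime_view.items():
        merged[key] = merged.get(key, 0) + delta
    return merged

batch_view = {"page_a": 1000, "page_b": 250}   # from the batch layer
realtime_view = {"page_a": 7, "page_c": 3}     # from the speed layer

print(merge_views(batch_view, realtime_view))
# {'page_a': 1007, 'page_b': 250, 'page_c': 3}
```

Queries see both layers at once: historical precision from the batch view plus fresh updates from the speed layer, which is the trade-off the architecture is built around.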
When Things Matter: A Data-Centric View of the Internet of Things
With the recent advances in radio-frequency identification (RFID), low-cost
wireless sensor devices, and Web technologies, the Internet of Things (IoT)
approach has gained momentum in connecting everyday objects to the Internet and
facilitating machine-to-human and machine-to-machine communication with the
physical world. While IoT offers the capability to connect and integrate both
digital and physical entities, enabling a whole new class of applications and
services, several significant challenges need to be addressed before these
applications and services can be fully realized. A fundamental challenge
centers around managing IoT data, typically produced in dynamic and volatile
environments, which is not only extremely large in scale and volume, but also
noisy, and continuous. This article surveys the main techniques and
state-of-the-art research efforts in IoT from data-centric perspectives,
including data stream processing, data storage models, complex event
processing, and searching in IoT. Open research issues for IoT data management
are also discussed.
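One building block common to the stream-processing and complex-event-processing techniques the survey covers is a sliding window over a noisy sensor stream. The sketch below, with an illustrative window size and threshold, flags readings that deviate sharply from the recent baseline.

```python
from collections import deque

# Sliding-window spike detection over a sensor stream (illustrative sketch).
# A reading is flagged when it deviates from the mean of the last `window`
# readings by more than `threshold`.

def detect_spikes(readings, window=5, threshold=10.0):
    recent = deque(maxlen=window)
    alerts = []
    for i, value in enumerate(readings):
        if len(recent) == recent.maxlen:
            baseline = sum(recent) / len(recent)
            if abs(value - baseline) > threshold:
                alerts.append((i, value))
        recent.append(value)
    return alerts

stream = [20.1, 20.3, 19.8, 20.0, 20.2, 45.0, 20.1]
print(detect_spikes(stream))  # [(5, 45.0)]
```

The bounded window is what makes this viable on continuous, unbounded IoT data: memory stays constant no matter how long the stream runs.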
Pathway: a fast and flexible unified stream data processing framework for analytical and Machine Learning applications
We present Pathway, a new unified data processing framework that can run
workloads on both bounded and unbounded data streams. The framework was created
with the original motivation of resolving challenges faced when analyzing and
processing data from the physical economy, including streams of data generated
by IoT and enterprise systems. These required rapid reaction while calling for
the application of advanced computation paradigms (machine-learning-powered
analytics, contextual analysis, and other elements of complex event
processing). Pathway is equipped with a Table API tailored for Python and
Python/SQL workflows, and is powered by a distributed incremental dataflow in
Rust. We describe the system and present benchmarking results which demonstrate
its capabilities in both batch and streaming contexts, where it is able to
surpass state-of-the-art industry frameworks in both scenarios. We also discuss
streaming use cases handled by Pathway which cannot be easily resolved with
state-of-the-art industry frameworks, such as streaming iterative graph
algorithms (PageRank, etc.).
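The core idea behind an incremental dataflow such as Pathway's is that aggregates are maintained by applying (key, value, diff) deltas rather than recomputed from scratch, so bounded (batch) and unbounded (streaming) inputs share one code path. The sketch below illustrates that idea in plain Python; the names are illustrative and this is not Pathway's actual Table API.

```python
from collections import defaultdict

# Incremental aggregation via deltas (illustrative sketch of the
# incremental-dataflow idea; not Pathway's real API).

class IncrementalSum:
    def __init__(self):
        self.totals = defaultdict(float)

    def apply(self, key, value, diff):
        """diff=+1 inserts a row, diff=-1 retracts it; returns current totals."""
        self.totals[key] += diff * value
        return dict(self.totals)

agg = IncrementalSum()
agg.apply("sensor_a", 10.0, +1)
agg.apply("sensor_a", 5.0, +1)
print(agg.apply("sensor_a", 10.0, -1))  # {'sensor_a': 5.0}
```

Retractions (diff = -1) are what let such an engine revise earlier outputs when late or corrected data arrives, without replaying the whole stream.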
Progressive Analytics: A Computation Paradigm for Exploratory Data Analysis
Exploring data requires a fast feedback loop from the analyst to the system,
with a latency below about 10 seconds because of human cognitive limitations.
When data becomes large or analysis becomes complex, sequential computations
can no longer be completed in a few seconds and data exploration is severely
hampered. This article describes a novel computation paradigm called
Progressive Computation for Data Analysis or more concisely Progressive
Analytics, that brings at the programming language level a low-latency
guarantee by performing computations in a progressive fashion. Moving this
progressive computation at the language level relieves the programmer of
exploratory data analysis systems from implementing the whole analytics
pipeline in a progressive way from scratch, streamlining the implementation of
scalable exploratory data analysis systems. This article describes the new
paradigm through a prototype implementation called ProgressiVis, and explains
the requirements it implies through examples.
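The progressive-computation idea can be sketched as a generator that processes data in chunks and yields an improving partial result after each one, keeping per-step latency bounded for the analyst. Chunk size and data below are illustrative, not taken from ProgressiVis.

```python
# Progressive mean (illustrative sketch): each yielded value is a usable
# partial estimate that is refined as more chunks are consumed.

def progressive_mean(data, chunk_size=1000):
    total, count = 0.0, 0
    for start in range(0, len(data), chunk_size):
        chunk = data[start:start + chunk_size]
        total += sum(chunk)
        count += len(chunk)
        yield total / count  # partial result after each chunk

estimates = list(progressive_mean(list(range(10_000)), chunk_size=2_500))
print(estimates[0], estimates[-1])  # 1249.5 4999.5
```

An exploration UI can render each intermediate estimate as it arrives, so the analyst gets feedback well under the roughly 10-second cognitive limit the article cites, even when the full computation takes much longer.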
IIoT Data Ness: From Streaming to Added Value
In the emerging Industry 4.0 paradigm, the internet of things has been an innovation driver, allowing for
environment visibility and control through sensor data analysis. However, the data is of such volume and
velocity that its quality cannot be assured by conventional architectures. It has been argued that the
quality and observability of data are key to a project’s success, allowing users to interact with data more
effectively and rapidly. For a project to succeed in this context, it is imperative to incorporate data
quality mechanisms in order to extract the most value from the data. Achieving this goal promises
substantial advantages that could lead to financial and innovation gains
for the industry. To cope with this reality, this work presents a data-mesh-oriented methodology, based
on state-of-the-art data management tools, to design a solution that leverages data quality
in the Industrial Internet of Things (IIoT) space through data contextualization. To achieve this
goal, practices such as FAIR data principles and data observability concepts were incorporated into the
solution. The result of this work allowed for the creation of an architecture that focuses on data and
metadata management to elevate data context, ownership, and quality.
The concept of the Internet of Things (IoT) is one of the main success factors for the new Industry 4.0. By analyzing the values that sensors collect from their environment, it is possible to build a platform capable of identifying favorable conditions and potential problems before they occur, resulting in relevant monetary gains for companies. However, this use case is not easy to implement, owing to the sheer volume and velocity of data coming from an IIoT (Industrial Internet of Things) environment.
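A concrete form of the data-quality mechanism the methodology calls for is a validation gate that checks incoming IIoT readings against a small contract (required fields, value range, freshness) before they enter the mesh. The field names and limits below are illustrative assumptions, not from the described architecture.

```python
import time

# Data-quality gate for IIoT readings (illustrative sketch):
# returns a list of violations; an empty list means the reading passes.

def validate_reading(reading, max_age_s=60.0, temp_range=(-40.0, 125.0)):
    errors = []
    for field in ("sensor_id", "timestamp", "temperature"):
        if field not in reading:
            errors.append(f"missing field: {field}")
    if not errors:
        lo, hi = temp_range
        if not lo <= reading["temperature"] <= hi:
            errors.append("temperature out of range")
        if time.time() - reading["timestamp"] > max_age_s:
            errors.append("reading is stale")
    return errors

ok = {"sensor_id": "s1", "timestamp": time.time(), "temperature": 21.5}
print(validate_reading(ok))  # []
```

Recording which contract each reading passed or failed is itself metadata, which is how such a gate feeds the observability and FAIR-data goals the abstract mentions.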
Performance Analysis of Cloud-Based Stream Processing Pipelines for Real-Time Vehicle Data
The recent advancements in stream processing systems have enabled applications to exploit fast-changing data and provide real-time services to companies and users. Such applications require high throughput and low latency to provide the most value. This thesis work, in collaboration with Scania, provides fundamental building blocks for the efficient development of latency-optimized, cloud-based, real-time processing pipelines.
Through investigation and analysis of the real-time Scania pipeline, this thesis delivers three contributions that can be employed to speed up the process of developing, testing, and optimizing low-latency streaming pipelines in many different contexts.
The first contribution is the design and implementation of a generic framework for testing and benchmarking AWS-based streaming pipelines. This framework collects latency statistics from every step of the pipeline, and the insights it produces can be used to quickly identify pipeline bottlenecks.
Employing this framework, the study then analyzes the behaviour of Scania's serverless streaming pipeline, which is built on the AWS Kinesis and AWS Lambda services. The results show the importance of tuning configuration parameters such as memory size and batch size, and several suggestions for optimal configurations and pipeline optimizations are discussed.
Finally, the thesis offers a survey of the main alternatives to the Scania pipeline, including Apache Spark Streaming and Apache Flink. After analyzing the benefits and drawbacks of each framework, we choose Flink as an alternative solution. The Scania pipeline is adapted to Flink with a new design and implementation, and the benefits of the Flink pipeline and a performance comparison are discussed in detail.
Overall, this work can be used as an extensive guide to the design and implementation of efficient, low-latency pipelines to be deployed on the cloud.
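The per-step latency collection such a benchmarking framework performs can be sketched by stamping each record with a timestamp at every pipeline stage and then decomposing end-to-end latency per stage. Stage names and the sleep stand-in below are illustrative, not details of the Scania framework.

```python
import time

# Per-stage latency measurement (illustrative sketch): each record carries
# (stage, timestamp) pairs, so end-to-end latency decomposes per stage.

def stamp(record, stage):
    record.setdefault("stamps", []).append((stage, time.perf_counter()))
    return record

def stage_latencies(record):
    stamps = record["stamps"]
    return {f"{a}->{b}": t2 - t1
            for (a, t1), (b, t2) in zip(stamps, stamps[1:])}

rec = stamp({}, "ingest")
time.sleep(0.01)            # stand-in for processing work
rec = stamp(rec, "transform")
rec = stamp(rec, "sink")
print(stage_latencies(rec))
```

Aggregating these per-stage deltas over many records is what turns raw timings into the bottleneck report the thesis uses to guide configuration tuning.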
Quality of Service Aware Data Stream Processing for Highly Dynamic and Scalable Applications
Huge amounts of georeferenced data streams arrive daily at data stream management systems deployed for serving highly scalable and dynamic applications. There are innumerable ways in which those loads can be exploited to gain deep insights in various domains. Decision makers require interactive visualization of such data in the form of maps and dashboards for decision making and strategic planning. Data streams normally exhibit fluctuation and oscillation in arrival rates and skewness; these are the two predominant factors that greatly impact the overall quality of service. Data stream management systems must therefore be attuned to those factors, in addition to the spatial shape of the data, which may exaggerate their negative impact. Current systems do not natively support services with quality guarantees for dynamic scenarios, leaving the handling of those logistics to the user, which is challenging and cumbersome. Three workloads are predominant for any data stream: batch processing, scalable storage, and stream processing. In this thesis, we have designed a quality-of-service-aware system, SpatialDSMS, comprising several subsystems that cover those workloads and any mixed load that results from intermixing them. Most importantly, we have natively incorporated quality-of-service optimizations for processing avalanches of georeferenced data streams in highly dynamic application scenarios. This has been achieved transparently on top of the codebases of emerging de facto standard, best-in-class representatives, thus relieving users in the presentation layer from having to reason about those services. Instead, users express their queries with quality goals, and our system optimizer compiles them down into query plans with an embedded quality guarantee, leaving logistic handling to the underlying layers.
We have developed standards-compliant prototypes for all the subsystems that constitute SpatialDSMS.
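One common mechanism a quality-of-service-aware stream system can employ when arrival rates fluctuate beyond capacity is random load shedding: dropping just enough tuples that the operator can keep meeting its latency goal. The rates and shed fraction below are illustrative; this is a generic QoS technique, not SpatialDSMS's specific optimizer.

```python
import random

# Random load shedding (illustrative sketch): when tuples arrive faster
# than the operator can process them, drop a fraction so the sustained
# input rate matches the service rate.

def shed_fraction(arrival_rate, service_rate):
    """Fraction of tuples to drop so the operator is not overloaded."""
    if arrival_rate <= service_rate:
        return 0.0
    return 1.0 - service_rate / arrival_rate

def process(stream, arrival_rate, service_rate, rng=random.Random(42)):
    drop_p = shed_fraction(arrival_rate, service_rate)
    return [t for t in stream if rng.random() >= drop_p]

kept = process(list(range(10_000)), arrival_rate=2_000, service_rate=1_000)
print(shed_fraction(2_000, 1_000))  # 0.5
```

In a QoS-aware optimizer, the shed fraction would be derived from the user's stated quality goal rather than fixed, trading bounded result approximation for a guaranteed latency.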