Performance and Reliability Evaluation of Apache Kafka Messaging System
Streaming data now flows across the devices and applications around us. The term refers to any unbounded, ever-growing data set that is continuously generated by all kinds of sources. Examples include sensor data transmitted among Internet of Things (IoT) devices, user activity records collected on websites, and payment requests sent from mobile devices. In many application scenarios, streaming data needs to be processed in real time because its value diminishes over time. A variety of stream processing systems have been developed in the last decade and continue to evolve to address rising challenges.
A typical stream processing system consists of multiple processing nodes arranged in the topology of a directed acyclic graph (DAG). To build real-time streaming data pipelines across those nodes, message middleware is widely applied. As a distributed messaging system with high durability and scalability, Apache Kafka has become very popular among modern companies. It ingests streaming data from upstream applications and stores the data in its distributed cluster, which provides a fault-tolerant data source for stream processors. Kafka therefore plays a critical role in ensuring the completeness, correctness, and timeliness of streaming data delivery.
However, no simple, fixed data delivery strategy can meet all user requirements in real-time cases. In this thesis, we address the challenge of choosing a proper configuration to guarantee both the performance and the reliability of Kafka for complex streaming application scenarios. We investigate the features that affect the performance and reliability metrics. We propose a queueing-based prediction model for the performance metrics, including producer throughput and packet latency. We define two reliability metrics, the probability of message loss and the probability of message duplication, and we create an ANN model to predict them given unstable network conditions such as network delay and packet loss rate. To collect sufficient training data, we build a Docker-based Kafka testbed with a fault injection module. We use a new quality-of-service metric, timely throughput, to guide the choice of a proper batch size in Kafka. Based on this metric, we propose a dynamic configuration method that reactively guarantees both the performance and the reliability of Kafka under complex operating conditions.
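The abstract does not include the thesis's configuration values; as a hedged illustration of the producer-side trade-off it studies (batch size versus delivery guarantees), the following sketch uses the confluent-kafka Python client. The broker address, topic name, and specific values are assumptions for illustration, not the thesis's settings.

    # Hypothetical illustration of trading latency for reliability, in the
    # spirit of the thesis's batch-size and delivery-guarantee tuning.
    from confluent_kafka import Producer

    conf = {
        "bootstrap.servers": "localhost:9092",  # assumed local test cluster
        "acks": "all",               # wait for all in-sync replicas (reliability)
        "enable.idempotence": True,  # avoid message duplication on retries
        "batch.size": 65536,         # larger batches raise throughput...
        "linger.ms": 20,             # ...at the cost of per-message latency
        "retries": 5,                # recover from transient broker failures
    }

    producer = Producer(conf)

    def delivery_report(err, msg):
        # A delivery callback makes message loss observable to the application.
        if err is not None:
            print(f"delivery failed: {err}")

    producer.produce("sensor-events", value=b"payload", callback=delivery_report)
    producer.flush()

Raising batch.size and linger.ms improves producer throughput, while acks=all and idempotence reduce the loss and duplication probabilities the thesis models; a dynamic configuration method would adjust such knobs at runtime.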
Seer: Empowering Software Defined Networking with Data Analytics
Network complexity is increasing, making network control and orchestration a challenging task. The proliferation of network information and tools for data analytics can provide important insight into resource provisioning and optimisation. The network knowledge incorporated in software defined networking can facilitate knowledge-driven control, leveraging network programmability. We present Seer: a flexible, highly configurable data analytics platform for network intelligence based on software defined networking and big data principles. Seer combines a computational engine with a distributed messaging system to provide a scalable, fault-tolerant, real-time platform for knowledge extraction. Our first prototype uses Apache Spark for streaming analytics and the Open Network Operating System (ONOS) controller to program a network in real time. The first application we developed aims to predict the mobility patterns of mobile devices inside a smart-city environment.

Comment: 8 pages, 6 figures. Big data, data analytics, data mining, knowledge centric networking (KCN), software defined networking (SDN), Seer. 2016 15th International Conference on Ubiquitous Computing and Communications and 2016 International Symposium on Cyberspace and Security (IUCC-CSS 2016).
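Seer's source is not shown here; the sketch below illustrates, under assumptions, the kind of wiring its architecture describes: Spark consuming a distributed messaging system (Kafka) in real time. The topic name and broker address are invented, and the snippet assumes the Spark-Kafka connector package is available on the classpath.

    # A minimal sketch (not the Seer code) of a Spark computational engine
    # reading a Kafka feed for real-time knowledge extraction.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("seer-like-pipeline").getOrCreate()

    events = (
        spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "localhost:9092")  # assumed broker
        .option("subscribe", "network-telemetry")             # assumed topic
        .load()
    )

    # Kafka rows expose key/value as binary; cast before analysis.
    decoded = events.selectExpr("CAST(value AS STRING) AS event")

    query = (
        decoded.writeStream
        .format("console")      # stand-in for a knowledge-extraction sink
        .outputMode("append")
        .start()
    )
    query.awaitTermination()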
ETL and analysis of IoT data using OpenTSDB, Kafka, and Spark
Master's thesis in Computer Science.

The Internet of Things (IoT) is becoming increasingly prevalent in today's society. Innovations in storage and processing methodologies enable large amounts of data to be processed in a scalable manner and insights to be generated in near real time. IoT data are typically time-series data, but they may also have a strong spatial correlation. In addition, in many industries time-series data are still placed in relational databases that are ill-suited to them.
Many open-source time-series databases exist today with compelling features for storage, analytic representation, and visualization. Finding an efficient method to migrate data into a time-series database is the first objective of this thesis.
In recent decades, machine learning has become one of the backbones of data innovation. With the constantly expanding amounts of information available, there is good reason to expect that smart data analysis will become more pervasive as an essential element of innovative progress. This thesis also explores methods for modeling time-series data in machine learning and for migrating time-series data from a database to a big data machine learning framework such as Apache Spark.
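As a rough illustration of the migration path the thesis investigates (relational database to a big data ML framework), the following PySpark sketch reads time-series rows over JDBC and builds a simple lagged feature. The connection URL, table, and column names are placeholders, and a matching JDBC driver is assumed to be on the classpath.

    # Hypothetical sketch: pull time-series rows out of a relational database
    # into Spark, then prepare them for supervised modeling.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    spark = SparkSession.builder.appName("ts-migration").getOrCreate()

    readings = (
        spark.read.format("jdbc")
        .option("url", "jdbc:postgresql://dbhost:5432/iot")  # placeholder URL
        .option("dbtable", "sensor_readings")                # placeholder table
        .option("user", "reader")
        .option("password", "secret")
        .load()
    )

    # Typical time-series preparation: order by timestamp per sensor and build
    # a lagged feature so a model can learn temporal structure.
    w = Window.partitionBy("sensor_id").orderBy("ts")
    features = readings.withColumn("value_lag1", F.lag("value", 1).over(w))
    features.show(5)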
Real-time predictive maintenance for wind turbines using Big Data frameworks
This work presents the evolution of a predictive maintenance solution into a Big Data environment. The proposed adaptation aims to predict failures in wind turbines using a data-driven solution deployed in the cloud and composed of three main modules: (i) a predictive model generator, which generates predictive models for each monitored wind turbine by means of the Random Forest algorithm; (ii) a monitoring agent, which makes predictions every 10 minutes about failures in wind turbines during the next hour; and (iii) a dashboard where the resulting predictions can be visualized. Apache Spark, Apache Kafka, Apache Mesos, and HDFS have been used to implement the solution. As a result, we have improved on the previous work in terms of data processing speed, scalability, and automation. In addition, we have provided fault tolerance and a centralized access point from which the status of all of a company's wind turbines, located all over the world, can be monitored, reducing O&M costs.
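The paper's code is not reproduced here; as an assumed-parameters sketch of module (i), the following trains a Random Forest failure classifier with Spark ML and saves it where a monitoring agent could load it. The sensor columns, label column, and HDFS paths are illustrative only.

    # Illustrative only: a Random Forest failure classifier in Spark ML.
    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import RandomForestClassifier

    spark = SparkSession.builder.appName("turbine-pm").getOrCreate()

    # 10-minute sensor snapshots, labeled with failure-in-the-next-hour.
    data = spark.read.parquet("hdfs:///turbines/training")  # placeholder path

    assembler = VectorAssembler(
        inputCols=["wind_speed", "rotor_rpm", "gearbox_temp"],  # assumed sensors
        outputCol="features",
    )
    rf = RandomForestClassifier(
        labelCol="fails_next_hour", featuresCol="features", numTrees=100
    )

    model = rf.fit(assembler.transform(data))
    # Persist so a separate monitoring agent can score new data every 10 minutes.
    model.write().overwrite().save("hdfs:///turbines/models/rf")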
Anomaly Detection in Cloud-Native systems
In recent years, microservices have gained popularity due to benefits such as increased maintainability and scalability. The microservice architectural pattern was adopted for the development of a large-scale system that is commonly deployed on public and private clouds, and the aim is therefore to ensure that it always maintains an optimal level of performance. Consequently, the system is monitored by collecting different metrics, including performance-related metrics.
The first part of this thesis focuses on the creation of a dataset of realistic time series with anomalies at deterministic locations. This dataset addresses the lack of labeled data for training supervised models and the absence of publicly available data, since such data are rarely shared due to privacy concerns.
The second part consists of an empirical study on the detection of anomalies occurring in the different services that compose the system. Specifically, the aim is to understand whether anomalies can be predicted so that action can be taken before system failures or performance degradation occur. To this end, eight classification-based machine learning algorithms were compared on accuracy, training time, and testing time, to determine which technique is most suitable for reducing system overload.
The results showed that there are strong correlations between metrics and that anomalies in the system can be predicted with approximately 90% accuracy. The most important outcome is that performance-related anomalies can be detected by monitoring a limited number of metrics collected at runtime, with a short training time. Future work includes the adoption of prediction-based approaches and the development of tools for the prediction of anomalies in cloud-native environments.
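As a sketch of the comparison methodology described above, the loop below scores a few scikit-learn classifiers on accuracy, training time, and testing time. The synthetic arrays stand in for the non-public runtime metrics and anomaly labels, and the classifiers shown are examples rather than the study's exact eight.

    # Sketch under assumptions: compare classifiers on accuracy and timing.
    import time
    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.metrics import accuracy_score

    rng = np.random.default_rng(0)
    X = rng.normal(size=(5000, 20))           # placeholder runtime metrics
    y = (X[:, 0] + X[:, 1] > 1).astype(int)   # placeholder anomaly labels

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

    for clf in (RandomForestClassifier(), LogisticRegression(max_iter=1000),
                KNeighborsClassifier()):
        t0 = time.perf_counter()
        clf.fit(X_tr, y_tr)                   # training time
        t1 = time.perf_counter()
        pred = clf.predict(X_te)              # testing time
        t2 = time.perf_counter()
        print(type(clf).__name__,
              f"acc={accuracy_score(y_te, pred):.3f}",
              f"train={t1 - t0:.2f}s test={t2 - t1:.2f}s")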
On Efficiently Partitioning a Topic in Apache Kafka
Apache Kafka addresses the general problem of delivering extremely high-volume event data to diverse consumers via a publish-subscribe messaging system. It uses partitions to scale a topic across many brokers, allowing producers to write data in parallel and facilitating parallel reading by consumers. Even though Apache Kafka provides some out-of-the-box optimizations, it does not strictly define how each topic should be efficiently distributed into partitions. The well-formulated fine-tuning needed to improve the performance of an Apache Kafka cluster is still an open research problem. In this paper, we first model the Apache Kafka topic partitioning process for a given topic. Then, given the set of brokers, constraints, and application requirements on throughput, OS load, replication latency, and unavailability, we formulate the optimization problem of finding how many partitions are needed and show that it is computationally intractable, being an integer program. Furthermore, we propose two simple yet efficient heuristics to solve the problem: the first tries to minimize and the second to maximize the number of brokers used in the cluster. Finally, we evaluate their performance via large-scale simulations, considering as benchmarks some Apache Kafka cluster configuration recommendations provided by Microsoft and Confluent. We demonstrate that, unlike the recommendations, the proposed heuristics respect the hard constraints on replication latency and perform better with respect to unavailability time and OS load, using the system resources in a more prudent way.

Comment: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible. This work was funded by the European Union's Horizon 2020 research and innovation programme MARVEL under grant agreement No 95733.
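The paper's heuristics are only summarized above; the following is a minimal sketch, under assumptions, of a greedy search in the spirit of the broker-minimizing variant. The feasibility test bundling the hard constraints (throughput, OS load, replication latency, unavailability) is a hypothetical callable, not the paper's formulation.

    # Not the paper's algorithm verbatim: a greedy sketch that uses as few
    # brokers as possible while satisfying all hard constraints.
    def min_brokers_partitioning(brokers, feasible, max_partitions=1024):
        """Return (num_brokers, num_partitions), or None if nothing fits.

        `feasible(broker_subset, p)` is a hypothetical predicate standing in
        for the paper's throughput/OS-load/latency/unavailability constraints.
        """
        for b in range(1, len(brokers) + 1):        # smallest broker sets first
            for p in range(b, max_partitions + 1):  # >= one partition per broker
                if feasible(brokers[:b], p):        # all hard constraints met?
                    return b, p
        return None

Because the exact problem is an integer program and hence intractable in general, such a heuristic trades optimality for a simple pass over candidate configurations.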
Orchestration of machine learning workflows on Internet of Things data
Applications empowered by machine learning (ML) and the Internet of Things (IoT) are changing the way people live and are impacting a broad range of industries. However, creating and automating ML workflows at scale using real-world IoT data often leads to complex systems-integration and production issues. Challenges faced during the development of these ML applications include glue code, hidden dependencies, and data pipeline jungles. This research proposes the Machine Learning Framework for IoT data (ML4IoT), which is designed to orchestrate ML workflows to perform training and enable inference by ML models on IoT data. In the proposed framework, containerized microservices are used to automate the execution of tasks specified in ML workflows, which are defined through REST APIs. To address the problem of integrating big data tools and machine learning into a unified platform, the framework enables the definition and execution of end-to-end ML workflows on large volumes of IoT data. To address the challenge of running multiple ML workflows in parallel, ML4IoT uses container-based components that provide a convenient mechanism for training and deploying numerous ML models in parallel. Finally, to address the common production issues faced during the development of ML applications, the framework uses a microservices architecture to bring flexibility, reusability, and extensibility. Through experiments, we demonstrated the feasibility of ML4IoT, which managed to train and deploy predictive ML models on two types of IoT data. The results suggest that the proposed framework can manage real-world IoT data, providing the elasticity to execute 32 ML workflows in parallel, which were used to train 128 ML models simultaneously. The results also demonstrated that in ML4IoT the performance of rendering online predictions is not affected when 64 ML models are deployed concurrently to infer new information from online IoT data.
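The abstract does not specify the ML4IoT REST API; the snippet below is a hypothetical illustration of what defining a workflow over REST could look like. The endpoint, payload schema, and container image names are all invented for the example.

    # Hypothetical only: submitting an ML workflow definition over HTTP,
    # mirroring the framework's description of REST-defined workflows.
    import requests

    workflow = {
        "name": "train-temperature-forecaster",
        "source": {"type": "kafka", "topic": "iot-temperature"},  # assumed fields
        "tasks": [
            {"type": "preprocess", "image": "ml4iot/preprocess:latest"},
            {"type": "train", "image": "ml4iot/train:latest",
             "params": {"model": "random_forest"}},
        ],
    }

    # The endpoint below is a placeholder, not a documented ML4IoT URL.
    resp = requests.post("http://ml4iot.example/api/workflows", json=workflow)
    resp.raise_for_status()
    print("workflow id:", resp.json().get("id"))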