279 research outputs found
On Efficiently Partitioning a Topic in Apache Kafka
Apache Kafka addresses the general problem of delivering extremely high-volume
event data to diverse consumers via a publish-subscribe messaging system. It
uses partitions to scale a topic across many brokers, so that producers can
write data in parallel and consumers can read it in parallel. Even
though Apache Kafka provides some out-of-the-box optimizations, it does not
strictly define how each topic should be efficiently distributed into
partitions. The well-formulated fine-tuning needed to improve the performance
of an Apache Kafka cluster is still an open research problem. In this
paper, we first model the Apache Kafka topic partitioning process for a given
topic. Then, given the set of brokers, constraints and application requirements
on throughput, OS load, replication latency and unavailability, we formulate
the optimization problem of finding how many partitions are needed and show
that it is computationally intractable, being an integer program. Furthermore,
we propose two simple yet efficient heuristics to solve the problem: the first
tries to minimize and the second to maximize the number of brokers used in the
cluster. Finally, we evaluate their performance via large-scale simulations,
considering as benchmarks some Apache Kafka cluster configuration
recommendations provided by Microsoft and Confluent. We demonstrate that,
unlike the recommendations, the proposed heuristics respect the hard
constraints on replication latency and perform better w.r.t. unavailability
time and OS load, using the system resources in a more prudent way.

Comment: This work has been submitted to the IEEE for possible publication.
Copyright may be transferred without notice, after which this version may no
longer be accessible. This work was funded by the European Union's Horizon
2020 research and innovation programme MARVEL under grant agreement No 95733
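As a rough illustration of the kind of sizing rule used as a benchmark in this setting (e.g., the Confluent-style recommendation the paper compares against), the sketch below picks the minimum partition count so that neither producing nor consuming is the bottleneck. The function name and throughput figures are assumptions for illustration, not values from the paper:

```python
import math

def partitions_needed(target_mbps, per_partition_produce_mbps,
                      per_partition_consume_mbps):
    """Minimum partition count meeting a target throughput: enough
    partitions for producers AND enough for consumers."""
    return max(
        math.ceil(target_mbps / per_partition_produce_mbps),
        math.ceil(target_mbps / per_partition_consume_mbps),
    )

# e.g. 100 MB/s target, 10 MB/s per producer partition, 5 MB/s per consumer
print(partitions_needed(100, 10, 5))  # → 20
```

The paper's point is that such rules ignore the hard constraints (replication latency, OS load, unavailability) that its integer-programming formulation and heuristics take into account.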
Fast Data in the Era of Big Data: Twitter's Real-Time Related Query Suggestion Architecture
We present the architecture behind Twitter's real-time related query
suggestion and spelling correction service. Although these tasks have received
much attention in the web search literature, the Twitter context introduces a
real-time "twist": after significant breaking news events, we aim to provide
relevant results within minutes. This paper provides a case study illustrating
the challenges of real-time data processing in the era of "big data". We tell
the story of how our system was built twice: our first implementation was built
on a typical Hadoop-based analytics stack, but was later replaced because it
did not meet the latency requirements necessary to generate meaningful
real-time results. The second implementation, which is the system deployed in
production, is a custom in-memory processing engine specifically designed for
the task. This experience taught us that the current typical usage of Hadoop as
a "big data" platform, while great for experimentation, is not well suited to
low-latency processing, and points the way to future work on data analytics
platforms that can handle "big" as well as "fast" data.
The Intersection of Function-as-a-Service and Stream Computing
With recent advancements in the field of computing, including the emergence of cloud computing, the consumption and accessibility of computational resources have increased drastically. Although there have been significant movements towards more sustainable computing, many more steps must be taken to decrease the amount of energy consumed and greenhouse gases released by the computing sector. Historically, the switch from on-premises computing to cloud computing has led to lower energy consumption through the design of efficient data centers. By releasing direct control of the hardware that its software runs on, an organization can also increase efficiency and reduce costs. A newer development in cloud computing is serverless computing. Even though the term "serverless" is a misnomer, because all applications are still executed on servers, serverless lets an organization hand over another level of control, the management of virtual machine instances, to its cloud provider in order to reduce costs. The cloud provider then provisions resources on demand, enabling less idle time. This reduction of idle time is a direct reduction of computing resources used, and therefore a decrease in energy consumption. One form of serverless computing, Function-as-a-Service (FaaS), may have a promising future replacing some stream computing applications in order to increase efficiency and reduce waste. To explore these possibilities, a stream processing application was developed both with a traditional method, Kafka Streams, and with FaaS, through AWS Lambda, in order to demonstrate that FaaS can be used for stateless stream processing.
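A minimal sketch of what "stateless" means in this context: each invocation of a FaaS handler depends only on its input event, with no state carried between calls. The handler shape follows the common AWS-Lambda-style `(event, context)` convention, but the record fields below are assumptions for illustration, not taken from the study:

```python
import json

def handler(event, context=None):
    """Stateless per-record transform: the result depends only on the
    incoming batch, so any invocation can run on any worker."""
    results = []
    for record in event.get("records", []):
        payload = json.loads(record["body"])
        # No counters, windows, or joins: purely stateless processing.
        payload["value_squared"] = payload["value"] ** 2
        results.append(payload)
    return results

out = handler({"records": [{"body": json.dumps({"value": 3})}]})
print(out)  # → [{'value': 3, 'value_squared': 9}]
```

Stateful operations (windowed aggregates, joins), by contrast, are where frameworks like Kafka Streams retain an advantage, which is why the study restricts the comparison to the stateless case.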
DMLA: A Dynamic Model-Based Lambda Architecture for Learning and Recognition of Features in Big Data
Title from PDF of title page, viewed April 19, 2017. Thesis advisor: Yugyung Lee. Vita. Includes bibliographical references (pages 57-58). Thesis (M.S.)--School of Computing and Engineering, University of Missouri--Kansas City, 2016.

Real-time event modeling and recognition is one of the major research areas yet to reach its full potential. In the search for a system that can meet the tremendous challenges posed by data growth, several big data ecosystems have evolved. Big data ecosystems currently offer various architectural models, each aimed at solving a real-time problem with ease. There is an increasing demand for a dynamic architecture that combines real-time processing and computational intelligence in a single workflow to handle fast-changing business environments effectively. To the best of our knowledge, there has been no attempt to support a distributed machine-learning paradigm by separating learning and recognition tasks across big data ecosystems. The focus of our study is to design a distributed machine learning model by evaluating various machine-learning algorithms for event detection learning and predictive analysis with different features in audio domains. We propose an integrated architectural model, called DMLA, to handle real-time problems, one that can enhance the richness of the information while reducing the overhead of dealing with diverse architectural constraints. The DMLA architecture is a variant of the Lambda Architecture that combines the power of Apache Spark, Apache Storm (Heron), and Apache Kafka to handle massive amounts of data using both streaming and batch processing techniques. The primary aim of this study is to demonstrate how DMLA recognizes real-time, real-world events (e.g., fire alarm alerts, babies needing immediate attention, etc.) that require a quick response from users.
Detecting contextual information and dynamically selecting the appropriate model are tasks distributed among the components of the DMLA architecture. In the DMLA framework, a dynamic predictive model, learned from the training data in Spark, is loaded from the context information into a Storm topology to recognize and predict possible events. This event-based, context-aware solution was designed for real-time, real-world events. The Spark-based learning achieved the highest accuracy, over 80%, among several machine-learning models, and the Storm topology model achieved a recognition rate of 75% at best. We verify that the proposed architecture is effective for real-time event-based recognition in audio domains.

Introduction -- Background and related work -- Proposed framework -- Results and evaluation -- Conclusion and future work
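The learn/recognize split described above can be sketched in miniature: a model is trained in a batch layer, serialized, and then loaded by a streaming layer that applies it per event. The trivial threshold model and function names below are illustrative stand-ins, not the thesis's actual Spark/Storm code:

```python
import pickle

def train_batch(examples):
    """Batch layer (Spark in the thesis): learn a trivial threshold
    model from labeled (value, label) training examples."""
    threshold = sum(x for x, _ in examples) / len(examples)
    return {"threshold": threshold}

def recognize(model, value):
    """Speed layer (Storm in the thesis): apply the loaded model to
    each incoming event independently."""
    return "event" if value > model["threshold"] else "normal"

model = train_batch([(0.2, "normal"), (0.8, "event")])
blob = pickle.dumps(model)      # model handed off between the two layers
loaded = pickle.loads(blob)
print(recognize(loaded, 0.9))   # → event
```

The key design point is that the expensive learning step and the low-latency recognition step never share a process: only the serialized model crosses the boundary, which in DMLA is mediated by Kafka.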
Benchmarking real-time vehicle data streaming models for a smart city
The information systems of smart cities offer project developers, institutions, industry and experts the possibility to handle massive incoming data from diverse information sources in order to produce new information services for citizens. Much of this information has to be processed as it arrives, because a real-time response is often needed. Stream processing architectures solve this kind of problem, but it is not always easy to benchmark the load capacity or the efficiency of a proposed architecture. This work presents a real project in which an infrastructure was needed for gathering information from drivers in a big city, analyzing that information, and sending real-time recommendations to improve driving efficiency and safety on roads. The challenge was to support the real-time recommendation service in a city with thousands of simultaneous drivers at the lowest possible cost. In addition, in order to estimate the ability of an infrastructure to handle load, a simulator that emulates the data produced by a given number of simultaneous drivers was also developed. Experiments with the simulator show how recent stream processing platforms like Apache Kafka could replace custom-made streaming servers in a smart city to achieve higher scalability and faster responses, together with cost reduction.

This research is partially supported by the Spanish Ministry of Economy and Competitiveness and the European Regional Development Fund (ERDF) through the "HERMES – SmartDriver" project (TIN2013-46801-C4-2-R), the "HERMES – Smart Citizen" project (TIN2013-46801-C4-1-R), and the "HERMES – Space&Time" project (TIN2013-46801-C4-3-R).
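A toy sketch of such a load simulator: emit one position/speed event per simulated driver per tick, so that the total event rate scales linearly with the driver count being benchmarked. The event fields below are assumptions, since the abstract does not specify the real simulator's message format:

```python
import itertools
import json
import random

def driver_events(n_drivers, seed=0):
    """Infinite generator of JSON driver events: every tick, each
    simulated driver emits one event (id, tick, current speed)."""
    rng = random.Random(seed)
    for tick in itertools.count():
        for driver in range(n_drivers):
            yield json.dumps({
                "tick": tick,
                "driver": driver,
                "speed_kmh": rng.uniform(0, 120),
            })

gen = driver_events(3)
batch = [next(gen) for _ in range(6)]  # two full ticks of three drivers
```

In a benchmark, each yielded message would be published to the system under test (e.g., a Kafka topic) at a controlled rate, and the measured end-to-end latency indicates whether that driver count can be sustained.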
Near Real Time Data Aggregation for NLP
With the increasing use of social networks, the number of network options to use and the
variety of functionalities that they allow leads to the need for sports managers to pay special
attention to these media. It is in this context that the PLAYOFF Project, and
consequently this thesis, emerged.
A survey of the existing literature on solutions combining Apache Kafka with
machine learning models was carried out, showing that, although the solutions
differ, references in these domains already exist.
A comparison between Apache Kafka and RabbitMQ and the reasons for choosing
Kafka are presented. A general architecture of a Kafka project is then
described, along with the different approaches designed and developed within
the scope of the dissertation and the format of the messages exchanged through
this system.
A series of tests and their results are described in order to justify these
choices. In these tests, different parallel execution approaches (threads and
processes) are presented, as are different approaches to obtaining data from
the social networks' APIs.
The changes made to the original models are described, together with the
reasons for these changes and how they fit into the developed tool.
A final, global test, called the "Pilot Test", was carried out in a real
environment with a real event: all the components of this project were tested,
including the external systems developed by MOG Technologies and the
components developed within the scope of this dissertation.
Finally, the solutions presented and the final options chosen for the project
are validated through the results obtained in the different tests. Future work
to continue this development is also proposed.
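The threads-versus-processes trade-off mentioned above can be sketched briefly: API polling is I/O-bound, which favors threads, while heavy model inference is CPU-bound, which favors processes. The fetch function and source names below are placeholders, since the real social-network APIs and client code are not named in the abstract:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch(source):
    """Stand-in for a social-network API call (hypothetical): a real
    implementation would perform an HTTP request and block on I/O,
    which is exactly the case where threads pay off despite the GIL."""
    return f"payload from {source}"

sources = ["api_a", "api_b", "api_c"]

# Threads overlap the I/O waits of concurrent API calls; for CPU-bound
# work (e.g., running ML models on the fetched text), a
# ProcessPoolExecutor would be the better fit.
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(fetch, sources))
```

Either executor could then hand its results to a Kafka producer, keeping the fetching layer decoupled from the downstream machine-learning consumers.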
- …