279 research outputs found

    On Efficiently Partitioning a Topic in Apache Kafka

    Full text link
    Apache Kafka addresses the general problem of delivering extremely high-volume event data to diverse consumers via a publish-subscribe messaging system. It uses partitions to scale a topic across many brokers, letting producers write data in parallel and also facilitating parallel reading by consumers. Even though Apache Kafka provides some out-of-the-box optimizations, it does not strictly define how each topic should be efficiently distributed into partitions. The well-formulated fine-tuning needed to improve the performance of an Apache Kafka cluster is still an open research problem. In this paper, we first model the Apache Kafka topic partitioning process for a given topic. Then, given the set of brokers, constraints, and application requirements on throughput, OS load, replication latency, and unavailability, we formulate the optimization problem of finding how many partitions are needed and show that it is computationally intractable, being an integer program. Furthermore, we propose two simple yet efficient heuristics to solve the problem: the first tries to minimize and the second to maximize the number of brokers used in the cluster. Finally, we evaluate their performance via large-scale simulations, taking as benchmarks some Apache Kafka cluster configuration recommendations provided by Microsoft and Confluent. We demonstrate that, unlike the recommendations, the proposed heuristics respect the hard constraints on replication latency and perform better w.r.t. unavailability time and OS load, using the system resources in a more prudent way.
    Comment: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible. This work was funded by the European Union's Horizon 2020 research and innovation programme MARVEL under grant agreement No 95733
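    A minimal Python sketch of what the broker-minimizing heuristic could look like. The constraint model, latency formula, names, and numbers below are all hypothetical stand-ins; the paper's actual integer-program formulation is not reproduced here.

        import math
        from dataclasses import dataclass

        @dataclass
        class Constraints:
            target_throughput: float        # MB/s the topic must sustain
            per_partition_rate: float       # MB/s one partition can absorb
            max_partitions_per_broker: int  # proxy for the OS-load cap
            max_replication_latency_ms: float

        def latency_estimate(partitions: int, brokers: int) -> float:
            # Hypothetical model: replication latency grows with the number
            # of partitions each broker has to replicate.
            return 5.0 + 0.2 * (partitions / brokers)

        def min_broker_heuristic(c: Constraints, cluster_size: int):
            # Smallest partition count that meets the throughput target.
            p = math.ceil(c.target_throughput / c.per_partition_rate)
            # Start from the fewest brokers the OS-load cap allows and add
            # brokers until the replication-latency constraint is met.
            for b in range(math.ceil(p / c.max_partitions_per_broker), cluster_size + 1):
                if latency_estimate(p, b) <= c.max_replication_latency_ms:
                    return p, b
            return None  # infeasible under this toy model

        print(min_broker_heuristic(Constraints(400.0, 10.0, 20, 12.0), 8))
        # -> (40, 2) under the toy numbers above

    The maximizing variant would instead iterate from cluster_size downwards; both remain simple greedy searches rather than exact integer-program solutions.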

    Fast Data in the Era of Big Data: Twitter's Real-Time Related Query Suggestion Architecture

    Full text link
    We present the architecture behind Twitter's real-time related-query suggestion and spelling correction service. Although these tasks have received much attention in the web search literature, the Twitter context introduces a real-time "twist": after significant breaking news events, we aim to provide relevant results within minutes. This paper provides a case study illustrating the challenges of real-time data processing in the era of "big data". We tell the story of how our system was built twice: our first implementation was built on a typical Hadoop-based analytics stack, but was later replaced because it did not meet the latency requirements necessary to generate meaningful real-time results. The second implementation, which is the system deployed in production, is a custom in-memory processing engine specifically designed for the task. This experience taught us that the current typical usage of Hadoop as a "big data" platform, while great for experimentation, is not well suited to low-latency processing, and it points the way to future work on data analytics platforms that can handle "big" as well as "fast" data.
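    To make the in-memory, low-latency approach concrete, here is an illustrative Python sketch of a session-windowed query co-occurrence counter of the kind such an engine might maintain; it is a toy model only and does not reproduce Twitter's production engine.

        import time
        from collections import defaultdict, deque

        WINDOW_SECS = 600  # after breaking news, only recent events matter

        events = deque()                              # (timestamp, user, query)
        cooccur = defaultdict(lambda: defaultdict(int))
        last_query = {}                               # user -> previous query

        def observe(user, query, now=None):
            now = now or time.time()
            prev = last_query.get(user)
            if prev and prev != query:
                # Queries issued by the same user in a session are related.
                cooccur[prev][query] += 1
                cooccur[query][prev] += 1
            last_query[user] = query
            events.append((now, user, query))
            # Expire old events so suggestions track the current window
            # (decay of the counts themselves is omitted for brevity).
            while events and events[0][0] < now - WINDOW_SECS:
                events.popleft()

        def suggest(query, k=3):
            neighbours = cooccur.get(query, {})
            return sorted(neighbours, key=neighbours.get, reverse=True)[:k]

        observe("u1", "earthquake")
        observe("u1", "earthquake magnitude")
        print(suggest("earthquake"))  # -> ['earthquake magnitude']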

    The Intersection of Function-as-a-Service and Stream Computing

    Get PDF
    With recent advancements in computing, including the emergence of cloud computing, the consumption and accessibility of computational resources have increased drastically. Although there have been significant movements towards more sustainable computing, many more steps must be taken to decrease the energy consumed and the greenhouse gases released by the computing sector. Historically, the switch from on-premises computing to cloud computing has reduced energy consumption through the design of efficient data centers. By releasing direct control of the hardware its software runs on, an organization can also increase efficiency and reduce costs. A recent development in cloud computing is serverless computing. Even though the term "serverless" is a misnomer, because all applications are still executed on servers, serverless lets an organization cede another level of control, the management of virtual machine instances, to its cloud provider in order to reduce costs. The cloud provider then provisions resources on demand, enabling less idle time. This reduction of idle time directly reduces the computing resources used, and therefore the energy consumed. One form of serverless computing, Function-as-a-Service (FaaS), may have a promising future replacing some stream computing applications in order to increase efficiency and reduce waste. To explore these possibilities, a stream processing application was developed both with a traditional framework, Kafka Streams, and with FaaS, through AWS Lambda, to demonstrate that FaaS can be used for stateless stream processing.
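    As an illustration of why stateless logic ports cleanly to FaaS, here is a minimal Python sketch: a hypothetical transformation wrapped in an AWS Lambda handler for a Kafka event-source mapping, assuming the usual base64-encoded record values in event["records"]. The transform, field names, and payloads are illustrative; the thesis's actual application is not reproduced.

        import base64
        import json

        def transform(record: dict) -> dict:
            # Stateless step: no state is carried between records, which is
            # what makes the logic portable between Kafka Streams and FaaS.
            return {"id": record["id"], "celsius": (record["fahrenheit"] - 32) / 1.8}

        def handler(event, context):
            # Lambda entry point for a Kafka event-source mapping: records
            # arrive grouped by topic-partition, values base64-encoded.
            out = []
            for records in event.get("records", {}).values():
                for r in records:
                    payload = json.loads(base64.b64decode(r["value"]))
                    out.append(transform(payload))
            return {"processed": len(out)}

        # Local check with a fake event (no AWS needed):
        fake = {"records": {"t-0": [{"value": base64.b64encode(
            json.dumps({"id": 1, "fahrenheit": 212}).encode()).decode()}]}}
        print(handler(fake, None))  # -> {'processed': 1}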

    DMLA: A Dynamic Model-Based Lambda Architecture for Learning and Recognition of Features in Big Data

    Get PDF
    Title from PDF of title page, viewed April 19, 2017. Thesis advisor: Yugyung Lee. Vita. Includes bibliographical references (pages 57-58). Thesis (M.S.)--School of Computing and Engineering, University of Missouri--Kansas City, 2016.
    Real-time event modeling and recognition is one of the major research areas that has yet to reach its full potential. In the search for a system that can meet the tremendous challenges posed by data growth, several big data ecosystems have evolved. Big data ecosystems currently employ various architectural models, each aimed at solving a real-time problem with ease. There is an increasing demand for a dynamic architecture that combines real-time processing and computational intelligence in a single workflow to handle fast-changing business environments effectively. To the best of our knowledge, there has been no attempt to support a distributed machine-learning paradigm by separating learning and recognition tasks across big data ecosystems. The focus of our study is to design a distributed machine-learning model by evaluating various machine-learning algorithms for event-detection learning and predictive analysis with different features in audio domains. We propose an integrated architectural model, called DMLA, to handle real-time problems, one that can enhance the richness of the available information while reducing the overhead of dealing with diverse architectural constraints. The DMLA architecture is a variant of the Lambda Architecture that combines the power of Apache Spark, Apache Storm (Heron), and Apache Kafka to handle massive amounts of data using both streaming and batch processing techniques. The primary aim of this study is to demonstrate how DMLA recognizes real-time, real-world events (e.g., fire alarm alerts, babies needing immediate attention) that require a quick response from users. Detecting contextual information and dynamically selecting the appropriate model is distributed among the components of the DMLA architecture. In the DMLA framework, a dynamic predictive model, learned from the training data in Spark, is loaded from the context information into a Storm topology to recognize or predict possible events. The event-based, context-aware solution was designed for real-time, real-world events. The Spark-based learning achieved the highest accuracy, over 80%, among several machine-learning models, and the Storm topology model achieved a recognition rate of 75% at best. We verify that the proposed architecture is effective for real-time event-based recognition in audio domains.
    Contents: Introduction -- Background and related work -- Proposed framework -- Results and evaluation -- Conclusion and future work
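    A minimal Python sketch of the learn/recognize split described above: a model trained offline (e.g., in Spark) is serialized, and a separate streaming component loads it and classifies incoming Kafka events. The topic name, model file, labels, and feature format are hypothetical, and the sketch uses the kafka-python client in place of the thesis's Storm topology.

        import json
        import pickle

        from kafka import KafkaConsumer  # kafka-python client

        # Recognition side: load a model that was trained and pickled offline.
        with open("audio_event_model.pkl", "rb") as f:
            model = pickle.load(f)       # e.g., a scikit-learn classifier

        consumer = KafkaConsumer(
            "audio-features",
            bootstrap_servers="localhost:9092",
            value_deserializer=lambda v: json.loads(v.decode("utf-8")),
        )

        for msg in consumer:
            features = msg.value["features"]      # precomputed audio features
            label = model.predict([features])[0]  # e.g., "fire_alarm", "baby_cry"
            if label in ("fire_alarm", "baby_cry"):
                print(f"ALERT: detected {label}")  # a real system would notify users

    The point of the split is that the learning job and this recognizer share only the serialized model and the Kafka topic, so each side can be scaled and updated independently.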

    Benchmarking real-time vehicle data streaming models for a smart city

    Get PDF
    The information systems of smart cities offer project developers, institutions, industry, and experts the possibility of handling massive incoming data from diverse information sources in order to produce new information services for citizens. Much of this information has to be processed as it arrives, because a real-time response is often needed. Stream processing architectures solve this kind of problem, but it is not always easy to benchmark the load capacity or the efficiency of a proposed architecture. This work presents a real project in which an infrastructure was needed for gathering information from drivers in a big city, analyzing that information, and sending real-time recommendations to improve driving efficiency and safety on the roads. The challenge was to support the real-time recommendation service in a city with thousands of simultaneous drivers at the lowest possible cost. In addition, to estimate the ability of an infrastructure to handle load, a simulator that emulates the data produced by a given number of simultaneous drivers was also developed. Experiments with the simulator show how recent stream processing platforms like Apache Kafka could replace custom-made streaming servers in a smart city to achieve higher scalability and faster responses, together with cost reduction.
    This research is partially supported by the Spanish Ministry of Economy and Competitiveness and the European Regional Development Fund (ERDF) through the “HERMES – SmartDriver” project (TIN2013-46801-C4-2-R), the “HERMES – Smart Citizen” project (TIN2013-46801-C4-1-R), and the “HERMES – Space&Time” project (TIN2013-46801-C4-3-R).
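    A minimal Python sketch of the kind of load simulator described: N simulated drivers each emit one telemetry message per second to a Kafka topic via the kafka-python client. The topic name and message fields are hypothetical stand-ins for the project's actual schema.

        import json
        import random
        import time

        from kafka import KafkaProducer  # kafka-python client

        producer = KafkaProducer(
            bootstrap_servers="localhost:9092",
            value_serializer=lambda v: json.dumps(v).encode("utf-8"),
        )

        N_DRIVERS = 1000  # scale this up to probe the broker's load capacity

        while True:  # the simulator runs until interrupted
            start = time.time()
            for driver_id in range(N_DRIVERS):
                producer.send("driver-telemetry", {
                    "driver": driver_id,
                    "lat": 40.4168 + random.uniform(-0.05, 0.05),
                    "lon": -3.7038 + random.uniform(-0.05, 0.05),
                    "speed_kmh": random.uniform(0, 90),
                    "ts": start,
                })
            producer.flush()
            # Hold a 1 Hz emission rate per driver; if the loop overruns the
            # one-second budget, the infrastructure is falling behind.
            time.sleep(max(0.0, 1.0 - (time.time() - start)))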

    Near Real Time Data Aggregation for NLP

    Get PDF
    With the increasing use of social networks, the growing number of networks to choose from and the variety of functionality they offer mean that sports managers need to pay special attention to these media. It is from this observation that the PLAYOFF Project, and consequently this thesis, emerged. A survey of the existing literature on solutions that combine Apache Kafka with machine learning models was carried out, and it showed that, although the solutions differ, references in these domains already exist. A comparison between Apache Kafka and RabbitMQ is presented, along with the reasons for choosing Kafka. The general architecture of a Kafka project is described, followed by the different approaches designed and developed within the scope of the dissertation and the format of the messages exchanged through this system. A series of tests and their results are described in order to justify these choices. The tests cover different parallel-execution approaches (threads and processes), as well as different approaches to obtaining data from the social networks' APIs. The changes made to the original models are described, together with the reasons for those changes and how they fit into the developed tool. A final, global test, called the "Pilot Test", was carried out in a real environment with a real event, exercising all the components of this project, including the external systems developed by MOG Technologies and the components developed within the scope of this dissertation.
Finally, the solutions presented and the final options chosen for the project are validated through the results obtained in the different tests. Future work continuing this development is also proposed.
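    A minimal Python sketch of the thread-based fetching variant discussed above: one worker per social network polls its API and forwards posts to a Kafka topic for downstream NLP models. fetch_posts is a stub; real API calls, credentials, topic names, and the PLAYOFF message schema are not reproduced.

        import json
        import threading
        import time

        from kafka import KafkaProducer  # kafka-python client

        producer = KafkaProducer(
            bootstrap_servers="localhost:9092",
            value_serializer=lambda v: json.dumps(v).encode("utf-8"),
        )

        def fetch_posts(network):
            # Stub standing in for a real social-network API call.
            return [{"network": network, "text": f"sample post from {network}"}]

        def worker(network, period_s=5.0):
            # Each network gets its own thread, so a slow API does not
            # block the others.
            while True:
                for post in fetch_posts(network):
                    producer.send("social-posts", post)
                time.sleep(period_s)  # respect API rate limits

        for net in ("twitter", "facebook", "instagram"):
            threading.Thread(target=worker, args=(net,), daemon=True).start()

        time.sleep(30)  # let the fetchers run briefly in this demo

    A process-based variant would replace threading with multiprocessing at the cost of one producer per process; which performs better is exactly what the dissertation's tests compare.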
