
    A Comparative Analysis of Batch, Real-Time, Stream Processing, and Lambda Architecture for Modern Analytics Workloads

    The explosion of big data has necessitated robust, scalable, and low-latency data processing paradigms to address modern analytics workloads. This paper provides a technical comparative analysis of batch processing, real-time processing, stream processing, and the hybrid Lambda architecture, highlighting their architectural principles, data flow models, performance characteristics, and trade-offs. Batch processing operates on static, large-scale datasets and prioritizes high throughput but incurs significant latency. Real-time and stream processing frameworks enable continuous or near-instant processing of unbounded data streams, focusing on minimal latency while maintaining system resilience. The Lambda architecture integrates batch and stream layers to provide fault-tolerant, scalable analytics with accurate and timely results. This paper dissects these paradigms based on technical metrics such as latency, fault tolerance, scalability, data consistency, resource utilization, and operational complexity. We further analyze real-world use cases, highlighting how each paradigm addresses specific workload requirements in domains such as IoT, finance, and big data systems. Our findings emphasize that while no single paradigm is universally optimal, selecting the right architecture requires balancing latency, throughput, and computational efficiency based on workload characteristics and business priorities.
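
    As a minimal sketch of the hybrid approach discussed above, the Python example below merges a high-throughput batch view with a low-latency speed view at query time, which is the essence of the Lambda architecture; the event data and view names are illustrative assumptions, not taken from the paper.

        from collections import defaultdict

        # Illustrative events: (user_id, count) pairs arriving over time.
        historical_events = [("u1", 3), ("u2", 5), ("u1", 2)]  # covered by the last batch run
        recent_events = [("u1", 1), ("u3", 4)]                 # seen only by the speed layer

        def batch_layer(events):
            # High throughput, high latency: recompute the full view from all historical data.
            view = defaultdict(int)
            for user, count in events:
                view[user] += count
            return dict(view)

        def speed_layer(events):
            # Low latency: incrementally aggregate only events the batch view has not seen yet.
            view = defaultdict(int)
            for user, count in events:
                view[user] += count
            return dict(view)

        def serving_layer(batch_view, speed_view, user):
            # Merge both views at query time: accurate batch results plus timely recent updates.
            return batch_view.get(user, 0) + speed_view.get(user, 0)

        batch_view = batch_layer(historical_events)
        speed_view = speed_layer(recent_events)
        print(serving_layer(batch_view, speed_view, "u1"))  # 3 + 2 + 1 = 6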

    Scalable architecture for big data financial analytics: user-defined functions vs. SQL

    Large financial organizations have hundreds of millions of financial contracts on their balance sheets. Moreover, highly volatile financial markets and heterogeneous data sets within and across banks worldwide make near real-time financial analytics very challenging, and their handling thus requires cutting-edge financial algorithms. However, due to a lack of data modeling standards, current financial risk algorithms are typically inconsistent and non-scalable. In this paper, we present a novel implementation of a real-world use case for performing large-scale financial analytics leveraging Big Data technology. We first provide detailed background information on the financial underpinnings of our framework along with the major financial calculations. Afterwards, we analyze the performance of different parallel implementations in Apache Spark based on existing computation kernels that apply the ACTUS data and algorithmic standard for financial contract modeling. The major contribution is a detailed discussion of the design trade-offs between applying user-defined functions on existing computation kernels vs. partially re-writing the kernel in SQL and thus taking advantage of the underlying SQL query optimizer. Our performance evaluation demonstrates almost linear scalability for the best design choice.
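
    The sketch below illustrates the UDF-versus-SQL trade-off in Apache Spark in heavily simplified form: an opaque Python UDF hides its logic from the Catalyst optimizer, while an equivalent built-in column expression can be analyzed and optimized. The toy notional-times-rate calculation merely stands in for the ACTUS computation kernels and is an assumption for illustration, not part of the paper.

        from pyspark.sql import SparkSession
        from pyspark.sql import functions as F
        from pyspark.sql.types import DoubleType

        spark = SparkSession.builder.appName("udf-vs-sql").getOrCreate()

        # Toy contract data standing in for ACTUS-modeled financial contracts.
        df = spark.createDataFrame(
            [("c1", 1000000.0, 0.02), ("c2", 250000.0, 0.035)],
            ["contract_id", "notional", "rate"],
        )

        # Variant 1: a user-defined function wrapping an existing computation kernel.
        # The UDF is a black box to the optimizer, so it cannot be rewritten or pushed down.
        interest_udf = F.udf(lambda notional, rate: notional * rate, DoubleType())
        udf_result = df.withColumn("interest", interest_udf("notional", "rate"))

        # Variant 2: the same logic as a built-in SQL column expression, which the
        # Catalyst query optimizer can analyze and optimize.
        sql_result = df.withColumn("interest", F.col("notional") * F.col("rate"))

        udf_result.show()
        sql_result.show()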

    Towards a unifying modeling framework for data-intensive tools

    Over the past two decades, the volume of data collected and processed by computer systems has grown exponentially. Databases, originally employed to store and query data, are now inadequate for big data problems: their architectures, conceived for a single machine, limit horizontal scalability. New big data requirements such as large volume, high velocity, and heterogeneity of formats have led to the development of a new family of tools described as data-intensive. With the advent of these technologies, new computational paradigms were introduced. MapReduce was among the first and brought the idea of sending the computation to the data stored across multiple nodes. Later, another technique proposed the concept of a data stream, keeping the computation statically allocated and making the information flow through the system; with this approach it is no longer necessary to store the data before processing it. Today's big data workloads must not only transform raw data into valuable knowledge but also provide database-like capabilities such as state management and transactional guarantees. In this scenario, the design patterns of the centralized databases of the '80s would impose a severely limiting trade-off between strong guarantees and scalability. The NoSQL approach was the first attempt to address the issue: this category of systems is able to scale, but it provides neither strong guarantees nor a standardized SQL interface. Subsequently, the strong demand to regain transactional guarantees and to reintroduce the SQL language at scale motivated the origin of NewSQL: databases with the classic relational model that provide guarantees while also scaling horizontally. In this thesis, we studied data-intensive tools to compare their computational paradigms and the new approaches introduced to support strong guarantees in distributed scenarios. We analyzed the differences and common aspects that are emerging and could drive future design patterns, and we developed a modeling framework to conduct our analysis systematically. The proposed models could be employed in the design of next-generation systems, leveraging the techniques that have become established or are emerging in today's data-intensive tools for storing, querying, and analyzing data.
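
    The contrast between the two computational paradigms described above can be sketched in a few lines of Python: a MapReduce-style word count that runs over stored documents versus a dataflow-style pipeline whose operator state is updated as records flow through, with no prior storage step. The data and function names are illustrative only.

        from collections import defaultdict
        from itertools import chain

        documents = ["big data systems", "data intensive tools", "big systems"]

        # MapReduce paradigm: map over stored documents, then reduce the intermediate pairs by key.
        def map_phase(doc):
            return [(word, 1) for word in doc.split()]

        def reduce_phase(pairs):
            counts = defaultdict(int)
            for word, one in pairs:
                counts[word] += one
            return dict(counts)

        batch_counts = reduce_phase(chain.from_iterable(map_phase(d) for d in documents))

        # Dataflow paradigm: the operator stays allocated and records flow through it as
        # they arrive; state is updated incrementally, with nothing stored beforehand.
        def word_stream(docs):
            for doc in docs:          # stands in for an unbounded source
                yield from doc.split()

        streaming_counts = defaultdict(int)
        for word in word_stream(documents):
            streaming_counts[word] += 1

        print(batch_counts == dict(streaming_counts))  # True: same result, different paradigm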

    Seer: Empowering Software Defined Networking with Data Analytics

    Network complexity is increasing, making network control and orchestration a challenging task. The proliferation of network information and tools for data analytics can provide important insight into resource provisioning and optimisation. The network knowledge incorporated in software defined networking can facilitate knowledge-driven control, leveraging network programmability. We present Seer: a flexible, highly configurable data analytics platform for network intelligence based on software defined networking and big data principles. Seer combines a computational engine with a distributed messaging system to provide a scalable, fault-tolerant and real-time platform for knowledge extraction. Our first prototype uses Apache Spark for streaming analytics and the open network operating system (ONOS) controller to program a network in real time. The first application we developed aims to predict the mobility pattern of mobile devices inside a smart city environment.
    Comment: 8 pages, 6 figures. Keywords: big data, data analytics, data mining, knowledge centric networking (KCN), software defined networking (SDN), Seer. 2016 15th International Conference on Ubiquitous Computing and Communications and 2016 International Symposium on Cyberspace and Security (IUCC-CSS 2016).
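
    A rough sketch of the kind of pipeline Seer describes, using Spark Structured Streaming as a stand-in for the prototype's streaming component: telemetry is consumed from a distributed messaging system, aggregated into per-device mobility statistics, and handed to a step that would program the network through the ONOS controller. The topic name, message schema, and controller interaction are assumptions for illustration, not details from the paper.

        from pyspark.sql import SparkSession
        from pyspark.sql import functions as F
        from pyspark.sql.types import StructType, StructField, StringType, DoubleType

        spark = SparkSession.builder.appName("seer-sketch").getOrCreate()

        # Assumed telemetry schema: one JSON record per mobile-device measurement.
        schema = StructType([
            StructField("device_id", StringType()),
            StructField("access_point", StringType()),
            StructField("signal", DoubleType()),
        ])

        # Consume telemetry from a distributed messaging system (hypothetical topic name;
        # requires the spark-sql-kafka connector package at runtime).
        telemetry = (
            spark.readStream.format("kafka")
            .option("kafka.bootstrap.servers", "broker:9092")
            .option("subscribe", "network-telemetry")
            .load()
            .select(F.from_json(F.col("value").cast("string"), schema).alias("m"))
            .select("m.*")
        )

        # Knowledge extraction step: count observations per device and access point.
        mobility = telemetry.groupBy("device_id", "access_point").count()

        def push_to_controller(batch_df, batch_id):
            # Placeholder for programming the network via the ONOS controller
            # (e.g. its REST API); the actual endpoint and payload are assumptions.
            for row in batch_df.collect():
                print(f"would notify ONOS: {row['device_id']} -> {row['access_point']} ({row['count']})")

        query = (
            mobility.writeStream.outputMode("complete")
            .foreachBatch(push_to_controller)
            .start()
        )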