A Comparative Analysis of Batch, Real-Time, Stream Processing, and Lambda Architecture for Modern Analytics Workloads
The explosion of big data has necessitated robust, scalable, and low-latency data processing paradigms to address modern analytics workloads. This paper provides a technical comparative analysis of batch processing, real-time processing, stream processing, and the hybrid Lambda architecture, highlighting their architectural principles, data flow models, performance characteristics, and trade-offs. Batch processing operates on static, large-scale datasets and prioritizes high throughput but incurs significant latency. Real-time and stream processing frameworks enable continuous or near-instant processing of unbounded data streams, focusing on minimal latency while maintaining system resilience. The Lambda architecture integrates batch and stream layers to provide fault-tolerant, scalable analytics with accurate and timely results. This paper dissects these paradigms based on technical metrics such as latency, fault tolerance, scalability, data consistency, resource utilization, and operational complexity. We further analyze real-world use cases, highlighting how each paradigm addresses specific workload requirements in domains such as IoT, finance, and big data systems. Our findings emphasize that while no single paradigm is universally optimal, selecting the right architecture requires balancing latency, throughput, and computational efficiency based on workload characteristics and business priorities.
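To make the Lambda pattern described in this abstract concrete, here is a minimal Python sketch (not from the paper; the event data and layer names are hypothetical): a high-latency batch layer recomputes a view over the full dataset, a low-latency speed layer updates an incremental view per event, and a serving layer merges both.

```python
# Minimal, illustrative sketch of the Lambda pattern: batch layer, speed
# layer, and a serving layer that merges their views. Data is hypothetical.
from collections import Counter
from typing import Iterable

master_dataset: list[str] = []        # append-only event log (batch layer input)
realtime_view: Counter = Counter()    # incremental view (speed layer output)
batch_view: Counter = Counter()       # precomputed view (batch layer output)

def batch_recompute(events: Iterable[str]) -> Counter:
    """High throughput, high latency: recompute the view from all events."""
    return Counter(events)

def speed_update(event: str) -> None:
    """Low latency: update the realtime view as each event arrives."""
    realtime_view[event] += 1

def serve(key: str) -> int:
    """Serving layer: merge batch and realtime views to answer a query."""
    return batch_view[key] + realtime_view[key]

# Ingest: every event lands in the master dataset and the speed layer.
for e in ["click", "view", "click"]:
    master_dataset.append(e)
    speed_update(e)

# Periodically, the batch layer recomputes and the realtime view is reset.
batch_view = batch_recompute(master_dataset)
realtime_view.clear()

print(serve("click"))  # -> 2
```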
Proceedings of the 16th ACM International Conference on Distributed and Event-based Systems, DEBS 2022, Copenhagen, Denmark, June 27 - 30, 2022
Scalable architecture for big data financial analytics: user-defined functions vs. SQL
Large financial organizations have hundreds of millions of financial contracts on their balance sheets. Moreover, highly volatile financial markets and heterogeneous data sets within and across banks worldwide make near real-time financial analytics very challenging, and handling them thus requires cutting-edge financial algorithms. However, due to a lack of data modeling standards, current financial risk algorithms are typically inconsistent and non-scalable.
In this paper, we present a novel implementation of a real-world use case for performing large-scale financial analytics leveraging Big Data technology. We first provide detailed background information on the financial underpinnings of our framework along with the major financial calculations. Afterwards, we analyze the performance of different parallel implementations in Apache Spark based on existing computation kernels that apply the ACTUS data and algorithmic standard for financial contract modeling. The major contribution is a detailed discussion of the design trade-offs between applying user-defined functions to existing computation kernels and partially re-writing the kernel in SQL, thus taking advantage of the underlying SQL query optimizer. Our performance evaluation demonstrates almost linear scalability for the best design choice.
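As a rough illustration of the UDF-versus-SQL trade-off discussed above, the following hedged PySpark sketch (not from the paper; the contract data and the compounding formula are hypothetical placeholders for the ACTUS kernels) contrasts a Python user-defined function, which is opaque to Spark's query optimizer, with the same calculation expressed through built-in SQL functions, which the optimizer can analyze.

```python
# Hedged sketch of the UDF-vs-SQL design trade-off; the cash-flow logic
# here is a hypothetical placeholder, not an ACTUS kernel.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.appName("udf-vs-sql").getOrCreate()
df = spark.createDataFrame(
    [(1000.0, 0.05, 2), (2500.0, 0.03, 5)],
    ["notional", "rate", "years"],
)

# Variant 1: wrap an existing computation kernel as a user-defined function.
# Opaque to the Catalyst optimizer; rows are serialized out to Python.
@F.udf(returnType=DoubleType())
def compound_udf(notional, rate, years):
    return float(notional * (1.0 + rate) ** years)

udf_result = df.withColumn("value", compound_udf("notional", "rate", "years"))

# Variant 2: express the same calculation with Spark SQL built-ins, so the
# query optimizer can see and optimize the whole expression.
sql_result = df.withColumn(
    "value", F.col("notional") * F.pow(1.0 + F.col("rate"), F.col("years"))
)

udf_result.show()
sql_result.show()
```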
Towards a unifying modeling framework for data-intensive tools
In the past two decades, the volume of data collected and processed by computer systems has grown exponentially. Databases, initially employed to store and query data, are now inadequate for big data problems: they are limited in horizontal scalability because their architectures were conceived for a single machine. New big data requirements, such as large volume, high velocity, and heterogeneous formats, have led to the development of a new family of tools defined as data-intensive.
With the advent of these technologies, new computational paradigms were introduced. MapReduce was one of the first and introduced the idea of sending the processing towards the data stored on multiple nodes. Later, another technique proposed the concept of data streams, keeping the computation statically allocated and letting the information flow through the system. With this approach, it is no longer necessary to store the data before processing it.
Today's big data requirements are not only to transform raw data into valuable knowledge, but also to provide database-like capabilities such as state management and transactional guarantees. In this scenario, the design patterns of the centralized databases of the '80s would impose a very limiting trade-off between strong guarantees and scalability.
The NoSQL approach was the first attempt to address the issue. This category of systems is able to scale, but it provides neither strong guarantees nor a standard SQL interface. Subsequently, the strong demand to regain transactional guarantees and to reintroduce the SQL language at scale motivated the rise of NewSQL: databases with the classic relational model that provide strong guarantees while also scaling horizontally.
In this work, we studied data-intensive tools to compare their functioning paradigms and the new approaches to strong guarantees in distributed scenarios. We analyzed the differences and common aspects that are emerging and that could drive future design patterns, and we developed a modeling framework to conduct our analysis systematically. The proposed models could be exploited in the design of next-generation systems, leveraging techniques that are established or emerging in today's data-intensive tools to store, query, and analyze data.
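As a toy illustration of the two paradigms contrasted in this abstract, the following Python sketch (illustrative only, single-process; names are hypothetical) performs the same word count in a MapReduce style, shipping the computation to already-stored partitions, and in a dataflow style, where a stateful operator processes records as they flow through.

```python
# Minimal sketch contrasting MapReduce and dataflow/streaming on a toy
# word count. Purely illustrative; real systems distribute these steps.
from collections import Counter
from itertools import chain

partitions = [["a", "b", "a"], ["b", "c"]]   # data already stored on "nodes"

# MapReduce style: ship the computation to each stored partition (map),
# then combine the partial results (reduce).
partials = [Counter(p) for p in partitions]          # map, runs "near the data"
batch_counts = sum(partials, Counter())              # reduce / combine

# Dataflow / streaming style: the operator holds state and the records
# flow through it one by one, with no prior storage step required.
stream_counts: Counter = Counter()

def on_record(word: str) -> None:
    stream_counts[word] += 1                         # stateful operator

for word in chain.from_iterable(partitions):         # unbounded in practice
    on_record(word)

assert batch_counts == stream_counts
```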
Seer: Empowering Software Defined Networking with Data Analytics
Network complexity is increasing, making network control and orchestration a challenging task. The proliferation of network information and tools for data analytics can provide important insight into resource provisioning and optimisation. The network knowledge incorporated in software-defined networking can facilitate knowledge-driven control by leveraging network programmability. We present Seer: a flexible, highly configurable data analytics platform for network intelligence based on software-defined networking and big data principles. Seer combines a computational engine with a distributed messaging system to provide a scalable, fault-tolerant, and real-time platform for knowledge extraction. Our first prototype uses Apache Spark for streaming analytics and the Open Network Operating System (ONOS) controller to program a network in real time. The first application we developed aims to predict the mobility pattern of mobile devices inside a smart city environment.
Comment: 8 pages, 6 figures. Keywords: big data, data analytics, data mining, knowledge-centric networking (KCN), software-defined networking (SDN), Seer. 2016 15th International Conference on Ubiquitous Computing and Communications and 2016 International Symposium on Cyberspace and Security (IUCC-CSS 2016).
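In the spirit of the pipeline described above, here is a hedged PySpark Structured Streaming sketch (not Seer's actual code; the socket source, the record format, and the aggregation are assumptions, and the ONOS integration is omitted) that consumes an unbounded stream of events and maintains a running per-cell count as a stand-in for mobility-pattern features.

```python
# Hedged, Seer-like sketch: a Spark streaming job consuming events from a
# simple source and keeping a running aggregate. Source and schema are
# hypothetical; the ONOS control loop is not shown.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("seer-like-stream").getOrCreate()

# Read an unbounded stream (a TCP socket here for simplicity; Seer uses a
# distributed messaging system as its ingestion layer).
events = (
    spark.readStream.format("socket")
    .option("host", "localhost")
    .option("port", 9999)
    .load()
)

# Each line is treated as "<device_id> <cell_id>"; count sightings per cell
# as a stand-in for mobility-pattern features.
parsed = events.select(
    F.split("value", " ").getItem(0).alias("device_id"),
    F.split("value", " ").getItem(1).alias("cell_id"),
)
counts = parsed.groupBy("cell_id").count()

query = (
    counts.writeStream.outputMode("complete")
    .format("console")
    .start()
)
query.awaitTermination()
```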