Search CORE

4 research outputs found

A Memory-Optimal Many-To-Many Semi-Stream Join

Author: Lutteroth Christof
Naeem Asif
Weber Gerald
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/12/2019
Field of study

Semi-stream join algorithms join a fast stream input with a disk-based master data relation. A common class of these algorithms is derived from hash joins: they use the stream as build input for a main hash table, and also include a cache for frequent master data. The composition of the cache is very important for performance; however, the decision of which master data to cache has so far been solely based on heuristics. We present the first formal criterion, a cache inequality that leads to a provably optimal composition of the cache in a semi-stream many-to-many equijoin algorithm. We propose a novel algorithm, Semi-Stream Balanced Join (SSBJ), which exploits this cache inequality to achieve a given service rate with a provably minimal amount of memory for all stream distributions. We present a cost model for SSBJ and compare its service rate empirically and analytically with other related approaches.<br/

OPUS

Parallelisation of a cache-based stream-relation join for a near-real-time data warehouse

Author: Asif Naeem M.
Aslam Saad
Jamil Noreen
Khan Habib Ullah
Publication venue: 'MDPI AG'
Publication date: 12/08/2020
Field of study

Near real-time data warehousing is an important area of research, as business organisations want to analyse their businesses sales with minimal latency. Therefore, sales data generated by data sources need to reflect immediately in the data warehouse. This requires near-real-time transformation of the stream of sales data with a disk-based relation called master data in the staging area. For this purpose, a stream-relation join is required. The main problem in stream-relation joins is the different nature of inputs; stream data is fast and bursty, whereas the disk-based relation is slow due to high disk I/O cost. To resolve this problem, a famous algorithm CACHEJOIN (cache join) was published in the literature. The algorithm has two phases, the disk-probing phase and the stream-probing phase. These two phases execute sequentially; that means stream tuples wait unnecessarily due to the sequential execution of both phases. This limits the algorithm to exploiting CPU resources optimally. In this paper, we address this issue by presenting a robust algorithm called PCSRJ (parallelised cache-based stream relation join). The new algorithm enables the execution of both disk-probing and stream-probing phases of CACHEJOIN in parallel. The algorithm distributes the disk-based relation on two separate nodes and enables parallel execution of CACHEJOIN on each node. The algorithm also implements a strategy of splitting the stream data on each node depending on the relevant part of the relation. We developed a cost model for PCSRJ and validated it empirically. We compared the service rates of both algorithms using a synthetic dataset. Our experiments showed that PCSRJ significantly outperforms CACHEJOIN

Qatar University Institutional Repository

Big data velocity management-from stream to warehouse via high performance memory optimized index join

Author: Jamil Noreen
Khan Habib Ullah
Mirza Farhaan
Naeem M. Asif
Sundaram David
Weber Gerald
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 23/10/2020
Field of study

Efficient resource optimization is critical to manage the velocity and volume of real-time streaming data in near-real-time data warehousing and business intelligence. This article presents a memory optimisation algorithm for rapidly joining streaming data with persistent master data in order to reduce data latency. Typically during the transformation phase of ETL (Extraction, Transformation, and Loading) a stream of transactional data needs to be joined with master data stored on disk. To implement this process, a semi-stream join operator is commonly used. Most semi-stream join operators cache frequent parts of the master data to improve their performance, this process requires careful distribution of allocated memory among the components of the join operator. This article presents a cache inequality approach to optimise cache size and memory. To test this approach, we present a novel Memory Optimal Index-based Join (MOIJ) algorithm. MOIJ supports many-to-many types of joins and adapts to dynamic streaming data. We also present a cost model for MOIJ and compare the performance with existing algorithms empirically as well as analytically. We envisage the enhanced ability of processing near-real-time streaming data using minimal memory will reduce latency in processing big data and will contribute to the development of highperformance real-time business intelligence systems

Qatar University Institutional Repository

A memory-optimal many-to-many semi-stream join

Author: C Anderson
Christof Lutteroth
CJ Hahn
DJ DeWitt
Gerald Weber
J Chen
M. Asif Naeem
MA Naeem
MW Abadi
MW Blasgen
N. Polyzotis
S Chakravarthy
SR Jeffery
T Urhan
ZG Ives
Publication venue: 'Springer Science and Business Media LLC'
Publication date
Field of study

Crossref