NMSTREAM: A SCALABLE EVENT-DRIVEN ETL FRAMEWORK FOR PROCESSING HETEROGENEOUS STREAMING DATA
ETL (Extract-Transform-Load) tools, traditionally developed to operate offline on historical data for feeding data warehouses, need to be enhanced to deal with continuously increasing streaming data and to be executed at the network level during data stream acquisition. In this paper, a scalable, web-based ETL system called NMStream is presented. NMStream is based on an event-driven architecture and is designed for integrating distributed and heterogeneous streaming data by combining Apache Flume and the Cassandra DB system; the ETL processes are conducted through the Flume agent object. NMStream can be used for feeding traditional or real-time data warehouses or data analytics tools in a stable and effective manner.
Understanding the Performance of Low Power Raspberry Pi Cloud for Big Data
Nowadays, Internet-of-Things (IoT) devices generate data at high speed and in large volumes.
Often the data require real-time processing to support high system responsiveness, which can be
supported by localised Cloud and/or Fog computing paradigms. However, there are considerably
large deployments of IoT such as sensor networks in remote areas where Internet connectivity is
sparse, challenging the localised Cloud and/or Fog computing paradigms. With the advent of the
Raspberry Pi, a credit card-sized single board computer, there is a great opportunity to construct
a low-cost, low-power portable cloud to support real-time data processing next to IoT deployments.
In this paper, we extend our previous work on constructing a Raspberry Pi Cloud to study its
feasibility for real-time big data analytics under realistic application-level workloads in both native
and virtualised environments. We have extensively tested the performance of a single-node Raspberry
Pi 2 Model B with httperf and a cluster of 12 nodes with Apache Spark and HDFS (Hadoop Distributed
File System). Our results have demonstrated that our portable cloud is useful for supporting real-time
big data analytics. On the other hand, our results have also revealed that the overhead for CPU-bound
workloads in a virtualised environment is surprisingly high, at 67.2%. We have found that, for big data
applications, the virtualisation overhead is fractional for small jobs but becomes more significant for
large jobs, up to 28.6%.