Modelling low power compute clusters for cloud simulation
In order to minimise their energy use, data centre operators are constantly exploring new ways to construct computing infrastructures. As low power CPUs, exemplified by ARM-based devices, become increasingly popular, there is a growing trend towards the large-scale deployment of low power servers in data centres. For example, recent research has shown promising results on constructing small-scale data centres using Raspberry Pi (RPi) single-board computers as their building blocks. To enable larger-scale experimentation and feasibility studies, cloud simulators could be utilised. Unfortunately, state-of-the-art simulators often need significant modification to include such low power devices as core data centre components. In this paper, we introduce models and extensions to estimate the behaviour of these new components in the DISSECT-CF cloud computing simulator. We show how an RPi-based cloud can be simulated with the use of the new models. We evaluate the precision and behaviour of the implemented models using a Hadoop-based application scenario executed in both real-life and simulated clouds.
Harnessing single board computers for military data analytics
Executive summary: This chapter covers the use of Single Board Computers (SBCs) to expedite onsite data analytics for a variety of military applications. Onsite data summarization and analytics is increasingly critical for command, control, and intelligence (C2I) operations, as excessive power consumption and communication latency can restrict the efficacy of down-range operations. SBCs offer power-efficient, inexpensive data-processing capabilities while maintaining a small form factor. We discuss the use of SBCs in a variety of domains, including wireless sensor networks, unmanned vehicles, and cluster computing. We conclude with a discussion of existing challenges and opportunities for future use.
Big Data and Large-scale Data Analytics: Efficiency of Sustainable Scalability and Security of Centralized Clouds and Edge Deployment Architectures
One of the significant shifts in next-generation computing technologies will certainly be in the development of Big Data (BD) deployment architectures. Apache Hadoop, the BD landmark, has evolved into a widely deployed BD operating system. Its new features include a federation structure and many associated frameworks, which give Hadoop 3.x the maturity to serve different markets. This dissertation addresses two leading issues involved in exploiting the BD and large-scale data analytics realm using the Hadoop platform: (i) scalability, which directly affects system performance and overall throughput, addressed using portable Docker containers; and (ii) security, which spreads the adoption of data protection practices among practitioners, addressed using access controls. An Enhanced MapReduce Environment (EME), an OPportunistic and Elastic Resource Allocation (OPERA) scheduler, a BD Federation Access Broker (BDFAB), and a Secure Intelligent Transportation System (SITS) with a multi-tier architecture for data streaming to the cloud are the main contributions of this thesis.
Understanding the Performance of Low Power Raspberry Pi Cloud for Big Data
Nowadays, Internet-of-Things (IoT) devices generate data at high speed and in large volume. Often the data require real-time processing to support high system responsiveness, which can be provided by localised Cloud and/or Fog computing paradigms. However, there are considerably large IoT deployments, such as sensor networks in remote areas where Internet connectivity is sparse, that challenge the localised Cloud and/or Fog computing paradigms. With the advent of the Raspberry Pi, a credit card-sized single board computer, there is a great opportunity to construct a low-cost, low-power portable cloud to support real-time data processing next to IoT deployments.
In this paper, we extend our previous work on constructing Raspberry Pi Cloud to study its
feasibility for real-time big data analytics under realistic application-level workload in both native
and virtualised environments. We have extensively tested the performance of a single node Raspberry
Pi 2 Model B with httperf and a cluster of 12 nodes with Apache Spark and HDFS (Hadoop Distributed
File System). Our results have demonstrated that our portable cloud is useful for supporting real-time
big data analytics. On the other hand, our results have also unveiled that the overhead for CPU-bound workloads in the virtualised environment is surprisingly high, at 67.2%. We have found that, for big data applications, the virtualisation overhead is fractional for small jobs but becomes more significant for large jobs, up to 28.6%.
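The overhead percentages above follow from a standard relative-slowdown calculation over measured completion times; a minimal sketch (the function name and the timings are illustrative placeholders, not the paper's measurements):

```python
def virtualisation_overhead(native_s: float, virtualised_s: float) -> float:
    """Overhead of running in a virtualised environment, expressed as a
    percentage of the native completion time."""
    return (virtualised_s - native_s) / native_s * 100.0

# Placeholder timings in seconds, not figures from the paper:
print(round(virtualisation_overhead(100.0, 128.6), 1))  # 28.6
```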
Design and management of image processing pipelines within CPS: Acquired experience towards the end of the FitOptiVis ECSEL Project
Cyber-Physical Systems (CPSs) are dynamic and reactive systems interacting with processes, the environment and, sometimes, humans. They are often distributed with sensors and actuators, and are characterised as smart, adaptive and predictive, reacting in real time. Image- and video-processing pipelines are a prime source of environmental information for such systems, allowing them to take better decisions according to what they see. Therefore, in FitOptiVis, we are developing novel methods and tools to integrate complex image- and video-processing pipelines. FitOptiVis aims to deliver a reference architecture for describing and optimising quality and resource management for imaging and video pipelines in CPSs at both design- and run-time. The architecture is concretised in low-power, high-performance, smart components, and in methods and tools for combined design-time and run-time multi-objective optimisation and adaptation within system and environment constraints.
Dynamic resource provisioning for data center workloads with data constraints
Dynamic resource provisioning, as an important data center software building block, helps to achieve high resource usage efficiency, leading to enormous monetary benefits. Most existing work on data center dynamic provisioning targets stateless servers, where any request can be routed to any server. However, the assumption of stateless behavior no longer holds for subsystems subject to data constraints, as a request may depend on a certain dataset stored on a small subset of servers. Routing a request to a server without the required dataset violates data locality or data availability properties, which may negatively impact response times.
To solve this problem, this thesis provides a unified framework consisting of two main steps: 1) determining the proper amount of resources to serve the workload by analyzing the schedulability utilization bound; 2) avoiding transition penalties during cluster resizing operations by deliberately designing data distribution policies. We apply this framework to both storage and computing subsystems, where the former includes distributed file systems, databases and memory caches, and the latter refers to systems such as Hadoop, Spark, and Storm. The proposed solutions are implemented in Memcached, HBase/HDFS, and Spark, and evaluated using various datasets, including Wikipedia, NYC taxi traces, and Twitter traces.
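The thesis does not spell out its data distribution policies here, but consistent hashing is one standard technique for limiting transition penalties during cluster resizing: adding a server remaps only a fraction of the data. A toy sketch under that assumption (all names and node counts are illustrative):

```python
import bisect
import hashlib

def _h(key: str) -> int:
    # Deterministic hash of a key onto the ring's integer space.
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class HashRing:
    """Toy consistent-hash ring: resizing the cluster moves only the
    keys whose ring successor becomes the new node."""

    def __init__(self, nodes, vnodes=64):
        # Each node is placed at many virtual points for load balance.
        self._ring = sorted((_h(f"{n}#{v}"), n) for n in nodes for v in range(vnodes))
        self._hashes = [h for h, _ in self._ring]

    def node_for(self, key: str) -> str:
        i = bisect.bisect(self._hashes, _h(key)) % len(self._ring)
        return self._ring[i][1]

keys = [f"block-{i}" for i in range(1000)]
before = HashRing(["s1", "s2", "s3"])
after = HashRing(["s1", "s2", "s3", "s4"])
moved = sum(before.node_for(k) != after.node_for(k) for k in keys)
# Roughly a quarter of the keys move to the new node s4;
# the rest keep their placement, preserving data locality.
```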
The Role of Distributed Computing in Big Data Science: Case Studies in Forensics and Bioinformatics
2014 - 2015
The era of Big Data is leading to the generation of large amounts of data, which require storage and analysis capabilities that can only be addressed by distributed computing systems. To facilitate large-scale distributed computing, many programming paradigms and frameworks have been proposed, such as MapReduce and Apache Hadoop, which transparently address some issues of distributed systems and hide most of their technical details.
Hadoop is currently the most popular and mature framework supporting the MapReduce paradigm, and it is widely used to store and process Big Data using a cluster of computers. Solutions such as Hadoop are attractive, since they simplify the transformation of an application from a non-parallel to a distributed one by means of general utilities and without requiring specialised skills. However, without any algorithm engineering activity, some target applications are not altogether fast and efficient, and they can suffer from several problems and drawbacks when executed on a distributed system. In fact, a distributed implementation is a necessary but not sufficient condition for obtaining remarkable performance with respect to a non-parallel counterpart. Therefore, it is necessary to assess how distributed solutions run on a Hadoop cluster, and/or how their performance can be improved to reduce resource consumption and completion times.
In this dissertation, we show how Hadoop-based implementations can be enhanced through careful algorithm engineering, tuning, profiling and code improvements. We also analyze how to achieve these goals by working on some critical points, such as: data-local computation, input split size, number and granularity of tasks, cluster configuration, input/output representation, etc.
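Several of the knobs listed (input split size, number of tasks) are exposed through Hadoop's job configuration. An illustrative fragment with placeholder values, not the tuning actually used in the dissertation:

```xml
<!-- mapred-site.xml: illustrative values only -->
<configuration>
  <property>
    <name>mapreduce.input.fileinputformat.split.minsize</name>
    <value>134217728</value> <!-- 128 MB: fewer, coarser map tasks -->
  </property>
  <property>
    <name>mapreduce.job.reduces</name>
    <value>12</value> <!-- match reduce count to cluster parallelism -->
  </property>
</configuration>
```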
In particular, to address these issues, we choose some case studies coming from two research areas where the amount of data is rapidly increasing, namely, Digital Image Forensics and Bioinformatics. We mainly describe full-fledged implementations to show how to design, engineer, improve and evaluate Hadoop-based solutions for the Source Camera Identification problem, i.e., recognizing the camera used for taking a given digital image, adopting the algorithm by Fridrich et al., and for two of the main problems in Bioinformatics, i.e., alignment-free sequence comparison and the extraction of k-mer cumulative or local statistics.
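The k-mer extraction task mentioned above maps naturally onto the MapReduce paradigm: the map phase emits every overlapping k-mer of each sequence, and the reduce phase sums their counts. A minimal single-machine sketch of the idea, not the dissertation's Hadoop implementation:

```python
from collections import Counter
from itertools import chain

def kmers(seq: str, k: int):
    """Map step: emit every overlapping k-mer of a sequence."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))

def kmer_statistics(sequences, k: int) -> Counter:
    """Reduce step: aggregate k-mer counts across all sequences."""
    return Counter(chain.from_iterable(kmers(s, k) for s in sequences))

stats = kmer_statistics(["ACGTAC", "GTACGT"], k=3)
print(stats["GTA"])  # appears once in each sequence -> 2
```

In a real Hadoop job the map output would be shuffled by k-mer before reduction; here `Counter` plays both the shuffle and reduce roles.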
The results achieved by our improved implementations show that they are substantially faster than the non-parallel counterparts, and remarkably faster than the corresponding naive Hadoop-based implementations. In some cases, for example, our solution for k-mer statistics is approximately 30× faster than our naive Hadoop-based implementation, and about 40× faster than an analogous tool built on Hadoop. In addition, our applications are also scalable, i.e., execution times are (approximately) halved by doubling the computing units. Indeed, algorithm engineering activities based on the implementation of smart improvements and supported by careful profiling and tuning may lead to much better experimental performance while avoiding potential problems.
We also highlight how the proposed solutions, tips, tricks and insights can be used in other research areas and problems.
Although Hadoop simplifies some tasks of distributed environments, we must know it thoroughly to achieve remarkable performance. It is not enough to be an expert of the application domain to build Hadoop-based implementations; in order to achieve good performance, expertise in distributed systems, algorithm engineering, tuning, profiling, etc. is also required. Therefore, the best performance depends heavily on the degree of cooperation between the domain expert and the distributed algorithm engineer. [edited by Author]