3,069 research outputs found
Data locality in Hadoop
Current market trends show the need to store and process rapidly growing
amounts of data, which in turn creates demand for distributed storage and
data processing systems. Apache Hadoop is an open-source framework for
managing such computing clusters in an effective, fault-tolerant way.
When dealing with large volumes of data, Hadoop and its storage system HDFS
(Hadoop Distributed File System) face the challenge of maintaining high
efficiency while completing computations in reasonable time. A typical Hadoop
deployment moves computation to the data rather than shipping data across the
cluster, since moving large quantities of data through the network could
significantly delay processing tasks. Accordingly, while a task is running,
Hadoop favours local data access and chooses blocks from the nearest nodes;
the remaining blocks are fetched only when they are needed by the given task,
as the simplified sketch below illustrates.
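As a rough illustration of this preference order, the following sketch ranks
candidate replica locations by network distance (node-local, then rack-local,
then off-rack). It is a simplified model of the idea described above, not
Hadoop's actual implementation; all class and method names here are invented
for illustration, whereas real HDFS uses its NetworkTopology and block
placement machinery.

    import java.util.Comparator;
    import java.util.List;

    // Simplified model of Hadoop-style locality-aware replica selection.
    // Names are illustrative only.
    class ReplicaSelector {

        // A candidate node that stores a replica of the needed block.
        record Replica(String host, String rack) {}

        // Lower score = "closer" to the requesting task: 0 for the same
        // node, 1 for the same rack, 2 for everything else (off-rack).
        static int distance(Replica r, String taskHost, String taskRack) {
            if (r.host().equals(taskHost)) return 0;   // node-local
            if (r.rack().equals(taskRack)) return 1;   // rack-local
            return 2;                                  // off-rack
        }

        // Pick the nearest replica, mirroring Hadoop's preference for
        // local data access while a task is running.
        static Replica pickNearest(List<Replica> replicas,
                                   String taskHost, String taskRack) {
            return replicas.stream()
                    .min(Comparator.comparingInt(
                            (Replica r) -> distance(r, taskHost, taskRack)))
                    .orElseThrow();
        }
    }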
To support Hadoop's data locality preferences, this thesis proposes adding
new functionality to its distributed file system (HDFS) that enables moving
data blocks on request. Shipping data in advance makes it possible to
redistribute blocks between nodes so that their placement matches the given
processing tasks. The new functionality enables instructed movement of data
blocks within the cluster: data can be shifted either by a user running the
appropriate HDFS shell command or programmatically by another module, such as
a suitable scheduler. A sketch of how such a request might look follows below.
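To make the proposed interface concrete, the following sketch imagines how
the instructed block movement might be invoked programmatically. The
BlockMover interface, its moveBlock method, and the shell command shown in
the comment are all assumptions made for illustration; the abstract does not
name the actual API or command added by the thesis.

    // Hypothetical sketch of the proposed on-request block movement.
    // BlockMover, moveBlock(), and the shell command in the comment are
    // invented names, not the thesis's actual interface.
    public class BlockMoveExample {

        interface BlockMover {
            // Ask HDFS to ship the block with the given ID to a target
            // DataNode before the task that needs it starts running.
            void moveBlock(long blockId, String targetDataNode);
        }

        static void redistribute(BlockMover mover) {
            // A scheduler (or a user via a shell command along the lines of
            //   hdfs blockmove -blockId 1073741825 -target datanode-07
            // ) could request in-advance shipping like this:
            mover.moveBlock(1073741825L, "datanode-07");
        }
    }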
To develop this functionality, a detailed analysis of the Apache Hadoop
source code and its components (specifically HDFS) was conducted. The
research resulted in a deep understanding of the internal architecture, which
made it possible to compare the candidate approaches to the desired solution
and to develop the chosen one.
Storage Solutions for Big Data Systems: A Qualitative Study and Comparison
Big data systems development is full of challenges in view of the variety of
application areas and domains that this technology promises to serve.
Typically, fundamental design decisions in big data systems design include
choosing appropriate storage and computing infrastructures. In this age of
heterogeneous systems that integrate different technologies into an optimized
solution for a specific real-world problem, big data systems are no
exception. As far as the storage aspect of any big data system is concerned,
the primary facet is the storage infrastructure, and NoSQL appears to be the
technology that best fulfills its requirements. However, every big data
application has its own data characteristics, and thus the corresponding data
fits a different data model. This paper presents a feature and use-case
analysis and comparison of the four main data models, namely
document-oriented, key-value, graph, and wide-column (illustrated
schematically below). Moreover, a feature analysis of 80 NoSQL solutions is
provided, elaborating on the criteria and points that a developer must
consider while making a choice.
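As a quick illustration of how these four models differ, the sketch below
encodes one "user" record in each of them. The layouts are generic, schematic
examples invented here; they are not drawn from any particular NoSQL product
surveyed in the paper.

    // Generic illustration of one "user" record in the four NoSQL data
    // models compared in the paper; layouts are schematic examples only.
    public class DataModelExamples {

        // Document-oriented: the whole entity as one nested document.
        static final String DOCUMENT =
            "{ \"_id\": \"u42\", \"name\": \"Ada\","
            + " \"follows\": [\"u7\", \"u19\"] }";

        // Key-value: an opaque value looked up by a single key.
        static final String KEY   = "user:u42";
        static final String VALUE = "name=Ada";

        // Wide column: row key plus columns grouped into families.
        static final String WIDE_COLUMN =
            "row=u42  info:name=Ada  social:follows=u7,u19";

        // Graph: entities as vertices, relationships as typed edges.
        static final String GRAPH =
            "(u42:User {name:'Ada'})-[:FOLLOWS]->(u7:User)";
    }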
Typically, big data storage needs to communicate with the execution engine
and other processing and visualization technologies to create a comprehensive
solution. This brings the second facet of big data storage, big data file
formats, into the picture. The second half of the paper compares the
advantages, shortcomings, and possible use cases of the available big data
file formats for Hadoop, which is the foundation of most big data computing
technologies. Decentralized storage and blockchain are seen as the next
generation of big data storage, and their challenges and future prospects are
also discussed.
Sketch of Big Data Real-Time Analytics Model
Big Data has drawn huge attention from researchers in the information sciences as well as from decision makers in governments and enterprises, as there is a great deal of potential and highly useful value hidden in huge volumes of data. Data is the new oil, but unlike oil, data can be refined further to create even more value. A new scientific paradigm has therefore been born: data-intensive scientific discovery, also known as Big Data. The growing volume of real-time data requires new techniques and technologies to discover insights of value. In this paper we introduce the Big Data real-time analytics model as a new technique. We discuss and compare several Big Data technologies for real-time processing, along with various challenges and issues in adopting Big Data. Real-time Big Data analysis based on a cloud computing approach is our future research direction.
Big Data and the Internet of Things
Advances in sensing and computing capabilities are making it possible to
embed increasing computing power in small devices. This has enabled sensing
devices not just to passively capture data at very high resolution but also
to take sophisticated actions in response. Combined with advances in
communication, this is resulting in an ecosystem of highly interconnected
devices referred to as the Internet of Things (IoT). In conjunction, advances
in machine learning have allowed building models on these ever-increasing
amounts of data. Consequently, devices all the way from heavy assets such as
aircraft engines to wearables such as health monitors can now not only
generate massive amounts of data but can also draw on aggregate analytics to
"improve" their performance over time. Big data analytics has been identified
as a key enabler for the IoT. In this chapter, we discuss various avenues of
the IoT where big data analytics either is already making a significant
impact or is on the cusp of doing so. We also discuss social implications and
areas of concern.
Comment: 33 pages. Draft of an upcoming book chapter in Japkowicz and
Stefanowski (eds.), Big Data Analysis: New Algorithms for a New Society,
Springer Series on Studies in Big Data, to appear.