A Survey on Geographically Distributed Big-Data Processing using MapReduce
Hadoop and Spark are widely used distributed processing frameworks for
large-scale data processing in an efficient and fault-tolerant manner on
private or public clouds. These big-data processing systems are extensively
used by many industries, e.g., Google, Facebook, and Amazon, for solving a
large class of problems, e.g., search, clustering, log analysis, different
types of join operations, matrix multiplication, pattern matching, and social
network analysis. However, all these popular systems share a major drawback: they
assume locally distributed computation, which prevents them from supporting
geographically distributed data processing. The growing volume of
geographically distributed massive data is pushing industry and academia to
rethink current big-data processing systems. Novel frameworks, going beyond the
state-of-the-art architectures and technologies of current systems, are
expected to process geographically distributed data at its locations, without
moving entire raw datasets to a single location. In
this paper, we investigate and discuss challenges and requirements in designing
geographically distributed data processing frameworks and protocols. We
classify and study batch processing (MapReduce-based systems), stream
processing (Spark-based systems), and SQL-style processing geo-distributed
frameworks, models, and algorithms with their overhead issues. Comment: IEEE Transactions on Big Data; Accepted June 2017. 20 pages
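To make the programming model under discussion concrete, here is a minimal single-process simulation of MapReduce word count in Python. This is only an illustrative sketch of the map/shuffle/reduce phases the surveyed frameworks implement at scale, not code from any of the systems surveyed:

```python
from collections import defaultdict

def map_phase(records, mapper):
    """Apply the user-supplied mapper to every record, emitting (key, value) pairs."""
    pairs = []
    for record in records:
        pairs.extend(mapper(record))
    return pairs

def shuffle(pairs):
    """Group intermediate values by key. In a real cluster this step moves data
    over the network, which is exactly what becomes expensive when the inputs
    are geo-distributed."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reducer):
    """Apply the user-supplied reducer to each key's grouped values."""
    return {key: reducer(key, values) for key, values in groups.items()}

# Classic word count, the "hello world" of the model.
lines = ["big data on big clusters", "big data"]
pairs = map_phase(lines, lambda line: [(w, 1) for w in line.split()])
counts = reduce_phase(shuffle(pairs), lambda k, vs: sum(vs))
```

The shuffle step is the crux of the geo-distribution problem the survey raises: grouping by key across distant sites implies bulk data movement between locations.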
A Survey on Large Scale Metadata Server for Big Data Storage
Big Data is defined as a high volume and variety of data with an exponential
growth rate. Data are amalgamated to generate revenue, which results in large
data silos. Data are the oil of the modern IT industry, and therefore they are
growing at an exponential pace. The access mechanism of these data silos is
defined by metadata, which are decoupled from the data servers for various
reasons, for instance, ease of maintenance. The metadata are stored in a
metadata server (MDS); therefore, studying the MDS is essential when designing
a large-scale storage system. An MDS must balance many parameters in its
architecture, which depends on the demands of the storage system's
requirements. Thus, MDSs are categorized in various ways depending on the
underlying architecture and design methodology. This article surveys the
various kinds of MDS architectures, designs, and methodologies. It emphasizes
clustered MDS (cMDS), and the reports are prepared based on a) Bloom
filter-based MDS, b) Client-funded MDS, c) Geo-aware MDS, d) Cache-aware MDS,
e) Load-aware MDS, f) Hash-based MDS, and g) Tree-based MDS. Additionally, the
article presents the issues and challenges of MDS for mammoth-sized
data. Comment: Submitted to ACM for possible publication
Energy-efficient Analytics for Geographically Distributed Big Data
Big data analytics on geographically distributed datasets (across data
centers or clusters) has been attracting increasing interest from both
academia and industry, but it also significantly complicates system and
algorithm design. In this article, we systematically investigate the
geo-distributed big-data analytics framework by analyzing the fine-grained
paradigm and the key design principles. We present a dynamic global manager
selection algorithm (GMSA) to minimize the energy consumption cost by fully
exploiting the system's diversity in geography and its variation over time.
The algorithm makes real-time decisions based on measurable system parameters
through stochastic optimization methods, while balancing
energy cost against latency. Extensive trace-driven simulations
verify the effectiveness and efficiency of the proposed algorithm. We also
highlight several potential research directions that remain open and require
further exploration in analyzing geo-distributed big data.
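The idea of a manager-selection algorithm exploiting geographic diversity can be sketched as follows. This is a toy weighted-sum illustration with made-up site names and numbers; the paper's GMSA makes this decision via stochastic optimization over time-varying measurements, not a static formula:

```python
# Hypothetical per-site measurements: electricity price and access latency
# vary across geography and over time, which is the diversity GMSA exploits.
sites = {
    "us-east": {"energy_price": 0.9, "latency_ms": 40.0},
    "eu-west": {"energy_price": 1.3, "latency_ms": 25.0},
    "ap-south": {"energy_price": 0.7, "latency_ms": 90.0},
}

def select_global_manager(sites, tradeoff=1.0):
    """Pick the site minimizing a weighted energy-plus-latency cost.

    `tradeoff` plays the role of the knob balancing energy cost against
    latency; larger values favor low-latency sites, smaller values favor
    cheap-energy sites.
    """
    def cost(site):
        m = sites[site]
        return m["energy_price"] + tradeoff * m["latency_ms"] / 100.0
    return min(sites, key=cost)
```

Re-running the selection as prices and latencies change over time is what makes the decision dynamic, mirroring the energy/latency balance the abstract describes.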
Fog Computing: Focusing on Mobile Users at the Edge
With smart devices, particularly smartphones, becoming our everyday companions,
ubiquitous mobile Internet and computing applications pervade people's daily
lives. With surging demand for high-quality mobile services anywhere, how to
address ubiquitous user demand and accommodate the explosive growth of mobile
traffic is the key issue for next-generation mobile networks. Fog computing is
a promising solution towards this goal: it extends cloud computing by providing
virtualized resources and engaged location-based services at the edge of the
mobile network so as to better serve mobile traffic. Fog computing thus acts as
a lubricant between cloud computing and mobile applications. In this article,
we outline the main features of Fog computing and describe its concept,
architecture, and design goals. Lastly, we discuss some of the future research
issues from the networking perspective. Comment: 11 pages, 6 figures
Edge Intelligence: Paving the Last Mile of Artificial Intelligence with Edge Computing
With the breakthroughs in deep learning, the recent years have witnessed a
booming of artificial intelligence (AI) applications and services, spanning
from personal assistant to recommendation systems to video/audio surveillance.
More recently, with the proliferation of mobile computing and
Internet-of-Things (IoT), billions of mobile and IoT devices are connected to
the Internet, generating zillions of bytes of data at the network edge. Driven
by this trend, there is an urgent need to push the AI frontier to the network
edge so as to fully unleash the potential of edge big data. To meet this
demand, edge computing, an emerging paradigm that pushes computing tasks and
services from the network core to the network edge, has been widely recognized
as a promising solution. The resulting new interdiscipline, edge AI or edge
intelligence, is beginning to receive a tremendous amount of interest. However,
research on edge intelligence is still in its infancy, and a dedicated
venue for exchanging the recent advances of edge intelligence is highly desired
by both the computer system and artificial intelligence communities. To this
end, we conduct a comprehensive survey of the recent research efforts on edge
intelligence. Specifically, we first review the background and motivation for
artificial intelligence running at the network edge. We then provide an
overview of the overarching architectures, frameworks, and emerging key
technologies for deep learning model training and inference at the network
edge. Finally, we discuss future research opportunities on edge intelligence.
We believe that this survey will elicit escalating attention, stimulate
fruitful discussions, and inspire further research ideas on edge
intelligence. Comment: Zhi Zhou, Xu Chen, En Li, Liekang Zeng, Ke Luo, and
Junshan Zhang, "Edge Intelligence: Paving the Last Mile of Artificial
Intelligence with Edge Computing," Proceedings of the IEEE
Designing and Implementing Data Warehouse for Agricultural Big Data
In recent years, precision agriculture that uses modern information and
communication technologies is becoming very popular. Raw and semi-processed
agricultural data are usually collected through various sources, such as
Internet of Things (IoT) devices, sensors, satellites, weather stations,
robots, farm equipment, farmers, and agribusinesses. Besides, agricultural
datasets are very large, complex, unstructured, heterogeneous,
non-standardized, and inconsistent. Hence, agricultural data mining is
considered a Big Data application in terms of volume, variety, velocity, and
veracity. It is a key foundation for establishing a crop intelligence platform,
which will enable resource-efficient agronomy decision making and
recommendations. In this paper, we designed and implemented a continental-level
agricultural data warehouse by combining Hive, MongoDB, and Cassandra. Our data
warehouse provides: (1) a flexible schema; (2) data integration from multiple
real agricultural datasets; (3) data science and business intelligence support;
(4) high performance; (5) high storage capacity; (6) security; (7) governance
and monitoring; (8) replication and recovery; (9) consistency, availability,
and partition tolerance; (10) distributed and cloud deployment. We also
evaluate the performance of our data warehouse. Comment: business intelligence,
data warehouse, constellation schema, Big Data, precision agriculture
Achieving Energy Efficiency in Cloud Brokering
The proliferation of cloud providers has brought substantial interoperability
complexity to the public cloud market, in which cloud brokering has been
playing an important role. However, energy-related issues for public clouds
have not been well addressed in the literature. In this paper, we claim that
the broker is also situated in a perfect position where necessary actions can
be taken to achieve energy efficiency for public cloud systems, particularly
through job assignment and scheduling. We formulate the problem by a mixed
integer program and prove its NP-hardness. Based on the complexity analysis, we
simplify the problem by introducing admission control on jobs. In the sequel,
optimal job assignment can be done straightforwardly and the problem is
transformed into improving job admission rate by scheduling on two coupled
phases: data transfer and job execution. The two scheduling phases are further
decoupled, and we develop an efficient scheduling algorithm for each of them.
Experimental results show that the proposed solution can achieve a significant
reduction in energy consumption while also improving admission rates, even in
large-scale public cloud systems.
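The two-phase structure described above can be sketched with a toy admission-control loop in Python. The earliest-deadline-first ordering and the single-link/single-machine model are simplifying assumptions for illustration, not the paper's actual algorithm:

```python
def admit_jobs(jobs):
    """Greedy admission over two coupled phases: data transfer on a shared
    link, then execution on a shared machine. A job is admitted only if it
    can finish both phases by its deadline.

    Jobs are (name, transfer_time, exec_time, deadline) tuples; sorting by
    deadline (EDF) is a common scheduling heuristic used here as a stand-in.
    """
    admitted = []
    link_free = 0.0   # time at which the transfer link is next available
    cpu_free = 0.0    # time at which the execution resource is next available
    for name, xfer, run, deadline in sorted(jobs, key=lambda j: j[3]):
        transfer_done = link_free + xfer
        # Execution cannot start before the transfer finishes (phase coupling)
        # nor before the machine frees up.
        finish = max(transfer_done, cpu_free) + run
        if finish <= deadline:
            admitted.append(name)
            link_free = transfer_done
            cpu_free = finish
    return admitted

# Job "c" is rejected: its transfer alone nearly exhausts its deadline.
admitted = admit_jobs([("a", 1, 2, 4), ("b", 1, 1, 3), ("c", 5, 5, 6)])
```

The point of the sketch is the coupling the abstract mentions: execution can only begin once the transfer phase completes, so the two resources must be scheduled jointly even after they are conceptually decoupled.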
An edge-fog-cloud platform for anticipatory learning process designed for Internet of Mobile Things
This paper presents a novel architecture for data analytics targeting an
anticipatory learning process in the context of the Internet of Mobile Things.
The architecture is geo-distributed and composed by edge, fog, and cloud
resources that operate collectively to support such an anticipatory learning
process. We designed the architecture to manage large volumes of data streams
coming from the IoMT devices, analyze in successive phases climbing up in the
hierarchy of resources from edge, fog and cloud. We discuss the characteristics
of the analytical tasks at each layer. We notice that the amount of data being
transported in the network decreases going from the edge, to the fog and
finally to the cloud, while the complexity of the computation increases. Such
design allows to support different kind of analytical needs, from real-time to
historical according to the type of resource being utilized. We have
implemented the proposed architecture as a proof-of-concept using the transit
data feeds from the area of Greater Moncton, Canada.Comment: Keywords: Internet of Mobile Things, data streams, edge-fog-cloud
platform, anticipatory learnin
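The layered volume/complexity trade-off described in the abstract can be sketched as a three-stage pipeline. The transit-style fields below are hypothetical (loosely inspired by the mention of transit data feeds), and each layer is reduced to a one-liner:

```python
def edge_layer(readings):
    """Edge: cheap real-time filtering of raw sensor readings
    (high volume in, much less out)."""
    return [r for r in readings if r["speed_kmh"] > 0]  # drop idle vehicles

def fog_layer(moving):
    """Fog: near-real-time aggregation, here a mean speed per transit route."""
    totals, counts = {}, {}
    for r in moving:
        totals[r["route"]] = totals.get(r["route"], 0) + r["speed_kmh"]
        counts[r["route"]] = counts.get(r["route"], 0) + 1
    return {route: totals[route] / counts[route] for route in totals}

def cloud_layer(route_means):
    """Cloud: heavier historical/global analysis, here flagging slow routes."""
    return sorted(route for route, mean in route_means.items() if mean < 20)

readings = [
    {"route": "R1", "speed_kmh": 0},
    {"route": "R1", "speed_kmh": 30},
    {"route": "R2", "speed_kmh": 10},
    {"route": "R2", "speed_kmh": 14},
]
slow_routes = cloud_layer(fog_layer(edge_layer(readings)))
```

Note how each stage consumes strictly less data than the one below it while performing a more global computation, which is the property the architecture is designed around.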
Workflow-Based Big Data Analytics in the Cloud Environment: Present Research Status and Future Prospects
Workflow is a common term used to describe a systematic breakdown of tasks
that need to be performed to solve a problem. This concept has found its best use
in scientific and business applications for streamlining and improving the
performance of the underlying processes targeted towards achieving an outcome.
The growing complexity of big data analytical problems has invited the use of
scientific workflows for performing complex tasks for specific domain
applications. This research investigates the efficacy of workflow-based big
data analytics in the cloud environment, giving insights on the research
already performed in the area and possible future research directions in the
field.
Consistency models in distributed systems: A survey on definitions, disciplines, challenges and applications
The replication mechanism resolves some challenges with big data, such as data
durability, data access, and fault tolerance. Yet, replication itself gives
rise to another challenge, known as consistency in distributed systems.
Scalability and availability, the challenging criteria on which replication is
based in distributed systems, themselves require consistency. Consistency in
distributed computing systems has been employed in three different fields:
system architecture, distributed databases, and distributed systems.
Consistency models can be ordered by their applicability from strong to weak.
Our goal is to propose a novel viewpoint on the different consistency models
utilized in distributed systems. This research proposes two different
categorizations of consistency models. Initially, consistency models are
categorized into three groups: data-centric, client-centric, and hybrid models.
Each group is then divided into three subcategories: traditional, extended, and
novel consistency models. Consequently, the concepts and procedures are
expressed in mathematical terms, introduced in order to present the models'
behavior without implementation. Moreover, we survey different aspects of
challenges with respect to consistency, i.e., availability, scalability,
security, fault tolerance, latency, violation, and staleness, of which the
latter two, violation and staleness, play the most pivotal roles in terms of
consistency and trade-off balancing. Finally, the contribution of each
consistency model and the growing need for them in distributed systems are
investigated. Comment: 52 pages, 13 figures
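The strong-versus-weak spectrum the survey organizes can be illustrated with a toy replicated key-value store. The read-all/primary-write scheme below is one simple way to contrast a strong read with an eventual read; it is not any specific model from the survey's taxonomy:

```python
class ReplicatedStore:
    """Toy replicated key-value store contrasting two read disciplines.

    Writes go to a primary replica and propagate lazily via sync(); a
    "strong" read consults every replica (read-all) and returns the newest
    version, while an "eventual" read consults one arbitrary replica and
    may return stale data until propagation completes.
    """

    def __init__(self, num_replicas=3):
        self.replicas = [dict() for _ in range(num_replicas)]
        self.version = 0

    def write(self, key, value):
        self.version += 1
        self.replicas[0][key] = (self.version, value)  # primary only

    def sync(self):
        # Lazy propagation: copy the primary's entries to every replica.
        for key, entry in self.replicas[0].items():
            for rep in self.replicas[1:]:
                rep[key] = entry

    def read_strong(self, key):
        # Read-all: return the value with the newest version on any replica.
        versions = [rep[key] for rep in self.replicas if key in rep]
        return max(versions)[1]

    def read_eventual(self, key, replica_index):
        # Read one replica; before sync() this can be stale or missing.
        return self.replicas[replica_index].get(key, (0, None))[1]

store = ReplicatedStore()
store.write("x", "v1")
stale = store.read_eventual("x", replica_index=1)   # None: not yet propagated
store.sync()
fresh = store.read_eventual("x", replica_index=1)   # now sees "v1"
```

The gap between `stale` and `fresh` is precisely the staleness the survey identifies as pivotal: the eventual read trades freshness for contacting fewer replicas, while the strong read pays the cost of a read-all.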