A Survey on Geographically Distributed Big-Data Processing using MapReduce
Hadoop and Spark are widely used distributed processing frameworks for
large-scale data processing in an efficient and fault-tolerant manner on
private or public clouds. These big-data processing systems are extensively
used by many industries, e.g., Google, Facebook, and Amazon, for solving a
large class of problems, e.g., search, clustering, log analysis, different
types of join operations, matrix multiplication, pattern matching, and social
network analysis. However, all these popular systems share a major drawback: they
assume locally distributed computation, which prevents them from supporting
geographically distributed data processing. The growing volume of
geographically distributed massive data is pushing industry and academia to
rethink current big-data processing systems. Novel frameworks, going beyond the
state-of-the-art architectures and technologies of current systems, are
expected to process geographically distributed data at its locations, without
moving entire raw datasets to a single location. In
this paper, we investigate and discuss challenges and requirements in designing
geographically distributed data processing frameworks and protocols. We
classify and study batch processing (MapReduce-based systems), stream
processing (Spark-based systems), and SQL-style processing geo-distributed
frameworks, models, and algorithms with their overhead issues. Comment: IEEE Transactions on Big Data; Accepted June 2017. 20 pages
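To make the programming model under discussion concrete, here is a minimal single-process simulation of MapReduce word count in Python. This is only an illustrative sketch of the map/shuffle/reduce phases the surveyed frameworks implement at scale, not code from any of the systems surveyed:

```python
from collections import defaultdict

def map_phase(records, mapper):
    """Apply the user-supplied mapper to every record, emitting (key, value) pairs."""
    pairs = []
    for record in records:
        pairs.extend(mapper(record))
    return pairs

def shuffle(pairs):
    """Group intermediate values by key. In a real cluster this step moves data
    over the network, which is exactly what becomes expensive when the inputs
    are geo-distributed."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reducer):
    """Apply the user-supplied reducer to each key's grouped values."""
    return {key: reducer(key, values) for key, values in groups.items()}

# Classic word count, the "hello world" of the model.
lines = ["big data on big clusters", "big data"]
pairs = map_phase(lines, lambda line: [(w, 1) for w in line.split()])
counts = reduce_phase(shuffle(pairs), lambda k, vs: sum(vs))
```

The shuffle step is the crux of the geo-distribution problem the survey raises: grouping by key across distant sites implies bulk data movement between locations.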
A Survey on Large Scale Metadata Server for Big Data Storage
Big Data is defined as a high volume and variety of data with an exponential
growth rate. Data are amalgamated to generate revenue, which results in large
data silos. Data are the oil of the modern IT industry, and therefore they are
growing at an exponential pace. The access mechanism of these data silos is
defined by metadata, which are decoupled from the data servers for various
reasons, for instance, ease of maintenance. The metadata are stored in a
metadata server (MDS); therefore, studying the MDS is essential when designing
a large-scale storage system. An MDS must balance many parameters in its
architecture, which depends on the demands of the storage system's
requirements. Thus, MDSs are categorized in various ways depending on the
underlying architecture and design methodology. This article surveys the
various kinds of MDS architectures, designs, and methodologies. It emphasizes
clustered MDS (cMDS), and the reports are prepared based on a) Bloom
filter-based MDS, b) Client-funded MDS, c) Geo-aware MDS, d) Cache-aware MDS,
e) Load-aware MDS, f) Hash-based MDS, and g) Tree-based MDS. Additionally, the
article presents the issues and challenges of MDS for mammoth-sized
data. Comment: Submitted to ACM for possible publication
Energy-efficient Analytics for Geographically Distributed Big Data
Big data analytics on geographically distributed datasets (across data
centers or clusters) has been attracting increasing interest from both
academia and industry, but it also significantly complicates system and
algorithm design. In this article, we systematically investigate the
geo-distributed big-data analytics framework by analyzing the fine-grained
paradigm and the key design principles. We present a dynamic global manager
selection algorithm (GMSA) to minimize the energy consumption cost by fully
exploiting the system's diversity in geography and its variation over time.
The algorithm makes real-time decisions based on measurable system parameters
through stochastic optimization methods, while balancing
energy cost against latency. Extensive trace-driven simulations
verify the effectiveness and efficiency of the proposed algorithm. We also
highlight several potential research directions that remain open and require
further exploration in analyzing geo-distributed big data.
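The idea of a manager-selection algorithm exploiting geographic diversity can be sketched as follows. This is a toy weighted-sum illustration with made-up site names and numbers; the paper's GMSA makes this decision via stochastic optimization over time-varying measurements, not a static formula:

```python
# Hypothetical per-site measurements: electricity price and access latency
# vary across geography and over time, which is the diversity GMSA exploits.
sites = {
    "us-east": {"energy_price": 0.9, "latency_ms": 40.0},
    "eu-west": {"energy_price": 1.3, "latency_ms": 25.0},
    "ap-south": {"energy_price": 0.7, "latency_ms": 90.0},
}

def select_global_manager(sites, tradeoff=1.0):
    """Pick the site minimizing a weighted energy-plus-latency cost.

    `tradeoff` plays the role of the knob balancing energy cost against
    latency; larger values favor low-latency sites, smaller values favor
    cheap-energy sites.
    """
    def cost(site):
        m = sites[site]
        return m["energy_price"] + tradeoff * m["latency_ms"] / 100.0
    return min(sites, key=cost)
```

Re-running the selection as prices and latencies change over time is what makes the decision dynamic, mirroring the energy/latency balance the abstract describes.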
Fog Computing: Focusing on Mobile Users at the Edge
With smart devices, particularly smartphones, becoming our everyday companions,
ubiquitous mobile Internet and computing applications pervade people's daily
lives. With surging demand for high-quality mobile services anywhere, how to
address ubiquitous user demand and accommodate the explosive growth of mobile
traffic is the key issue for next-generation mobile networks. Fog computing is
a promising solution towards this goal: it extends cloud computing by providing
virtualized resources and engaged location-based services at the edge of the
mobile network so as to better serve mobile traffic. Fog computing thus acts as
a lubricant between cloud computing and mobile applications. In this article,
we outline the main features of Fog computing and describe its concept,
architecture, and design goals. Lastly, we discuss some of the future research
issues from the networking perspective. Comment: 11 pages, 6 figures
Edge Intelligence: Paving the Last Mile of Artificial Intelligence with Edge Computing
With the breakthroughs in deep learning, the recent years have witnessed a
booming of artificial intelligence (AI) applications and services, spanning
from personal assistant to recommendation systems to video/audio surveillance.
More recently, with the proliferation of mobile computing and
Internet-of-Things (IoT), billions of mobile and IoT devices are connected to
the Internet, generating zillions of bytes of data at the network edge. Driven
by this trend, there is an urgent need to push the AI frontier to the network
edge so as to fully unleash the potential of edge big data. To meet this
demand, edge computing, an emerging paradigm that pushes computing tasks and
services from the network core to the network edge, has been widely recognized
as a promising solution. The resulting new interdiscipline, edge AI or edge
intelligence, is beginning to receive a tremendous amount of interest. However,
research on edge intelligence is still in its infancy, and a dedicated
venue for exchanging the recent advances of edge intelligence is highly desired
by both the computer system and artificial intelligence communities. To this
end, we conduct a comprehensive survey of the recent research efforts on edge
intelligence. Specifically, we first review the background and motivation for
artificial intelligence running at the network edge. We then provide an
overview of the overarching architectures, frameworks, and emerging key
technologies for deep learning model training and inference at the network
edge. Finally, we discuss future research opportunities on edge intelligence.
We believe that this survey will elicit escalating attention, stimulate
fruitful discussions, and inspire further research ideas on edge
intelligence. Comment: Zhi Zhou, Xu Chen, En Li, Liekang Zeng, Ke Luo, and
Junshan Zhang, "Edge Intelligence: Paving the Last Mile of Artificial
Intelligence with Edge Computing," Proceedings of the IEEE
Designing and Implementing Data Warehouse for Agricultural Big Data
In recent years, precision agriculture that uses modern information and
communication technologies is becoming very popular. Raw and semi-processed
agricultural data are usually collected through various sources, such as
Internet of Things (IoT) devices, sensors, satellites, weather stations,
robots, farm equipment, farmers, and agribusinesses. Besides, agricultural
datasets are very large, complex, unstructured, heterogeneous,
non-standardized, and inconsistent. Hence, agricultural data mining is
considered a Big Data application in terms of volume, variety, velocity, and
veracity. It is a key foundation for establishing a crop intelligence platform,
which will enable resource-efficient agronomy decision making and
recommendations. In this paper, we designed and implemented a continental-level
agricultural data warehouse by combining Hive, MongoDB, and Cassandra. Our data
warehouse provides: (1) a flexible schema; (2) data integration from multiple
real agricultural datasets; (3) data science and business intelligence support;
(4) high performance; (5) high storage capacity; (6) security; (7) governance
and monitoring; (8) replication and recovery; (9) consistency, availability,
and partition tolerance; (10) distributed and cloud deployment. We also
evaluate the performance of our data warehouse. Comment: business intelligence,
data warehouse, constellation schema, Big Data, precision agriculture
Achieving Energy Efficiency in Cloud Brokering
The proliferation of cloud providers has brought substantial interoperability
complexity to the public cloud market, in which cloud brokering has been
playing an important role. However, energy-related issues for public clouds
have not been well addressed in the literature. In this paper, we claim that
the broker is also situated in a perfect position where necessary actions can
be taken to achieve energy efficiency for public cloud systems, particularly
through job assignment and scheduling. We formulate the problem by a mixed
integer program and prove its NP-hardness. Based on the complexity analysis, we
simplify the problem by introducing admission control on jobs. In the sequel,
optimal job assignment can be done straightforwardly and the problem is
transformed into improving job admission rate by scheduling on two coupled
phases: data transfer and job execution. The two scheduling phases are further
decoupled, and we develop an efficient scheduling algorithm for each of them.
Experimental results show that the proposed solution can achieve a significant
reduction in energy consumption while also improving admission rates, even in
large-scale public cloud systems.
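The two-phase structure described above can be sketched with a toy admission-control loop in Python. The earliest-deadline-first ordering and the single-link/single-machine model are simplifying assumptions for illustration, not the paper's actual algorithm:

```python
def admit_jobs(jobs):
    """Greedy admission over two coupled phases: data transfer on a shared
    link, then execution on a shared machine. A job is admitted only if it
    can finish both phases by its deadline.

    Jobs are (name, transfer_time, exec_time, deadline) tuples; sorting by
    deadline (EDF) is a common scheduling heuristic used here as a stand-in.
    """
    admitted = []
    link_free = 0.0   # time at which the transfer link is next available
    cpu_free = 0.0    # time at which the execution resource is next available
    for name, xfer, run, deadline in sorted(jobs, key=lambda j: j[3]):
        transfer_done = link_free + xfer
        # Execution cannot start before the transfer finishes (phase coupling)
        # nor before the machine frees up.
        finish = max(transfer_done, cpu_free) + run
        if finish <= deadline:
            admitted.append(name)
            link_free = transfer_done
            cpu_free = finish
    return admitted

# Job "c" is rejected: its transfer alone nearly exhausts its deadline.
admitted = admit_jobs([("a", 1, 2, 4), ("b", 1, 1, 3), ("c", 5, 5, 6)])
```

The point of the sketch is the coupling the abstract mentions: execution can only begin once the transfer phase completes, so the two resources must be scheduled jointly even after they are conceptually decoupled.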
An edge-fog-cloud platform for anticipatory learning process designed for Internet of Mobile Things
This paper presents a novel architecture for data analytics targeting an
anticipatory learning process in the context of the Internet of Mobile Things.
The architecture is geo-distributed and composed by edge, fog, and cloud
resources that operate collectively to support such an anticipatory learning
process. We designed the architecture to manage large volumes of data streams
coming from the IoMT devices, analyze in successive phases climbing up in the
hierarchy of resources from edge, fog and cloud. We discuss the characteristics
of the analytical tasks at each layer. We notice that the amount of data being
transported in the network decreases going from the edge, to the fog and
finally to the cloud, while the complexity of the computation increases. Such
design allows to support different kind of analytical needs, from real-time to
historical according to the type of resource being utilized. We have
implemented the proposed architecture as a proof-of-concept using the transit
data feeds from the area of Greater Moncton, Canada.Comment: Keywords: Internet of Mobile Things, data streams, edge-fog-cloud
platform, anticipatory learnin
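The layered volume/complexity trade-off described in the abstract can be sketched as a three-stage pipeline. The transit-style fields below are hypothetical (loosely inspired by the mention of transit data feeds), and each layer is reduced to a one-liner:

```python
def edge_layer(readings):
    """Edge: cheap real-time filtering of raw sensor readings
    (high volume in, much less out)."""
    return [r for r in readings if r["speed_kmh"] > 0]  # drop idle vehicles

def fog_layer(moving):
    """Fog: near-real-time aggregation, here a mean speed per transit route."""
    totals, counts = {}, {}
    for r in moving:
        totals[r["route"]] = totals.get(r["route"], 0) + r["speed_kmh"]
        counts[r["route"]] = counts.get(r["route"], 0) + 1
    return {route: totals[route] / counts[route] for route in totals}

def cloud_layer(route_means):
    """Cloud: heavier historical/global analysis, here flagging slow routes."""
    return sorted(route for route, mean in route_means.items() if mean < 20)

readings = [
    {"route": "R1", "speed_kmh": 0},
    {"route": "R1", "speed_kmh": 30},
    {"route": "R2", "speed_kmh": 10},
    {"route": "R2", "speed_kmh": 14},
]
slow_routes = cloud_layer(fog_layer(edge_layer(readings)))
```

Note how each stage consumes strictly less data than the one below it while performing a more global computation, which is the property the architecture is designed around.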
Workflow-Based Big Data Analytics in the Cloud Environment: Present Research Status and Future Prospects
Workflow is a common term used to describe a systematic breakdown of tasks
that need to be performed to solve a problem. This concept has found its best use
in scientific and business applications for streamlining and improving the
performance of the underlying processes targeted towards achieving an outcome.
The growing complexity of big data analytical problems has invited the use of
scientific workflows for performing complex tasks for specific domain
applications. This research investigates the efficacy of workflow-based big
data analytics in the cloud environment, giving insights on the research
already performed in the area and possible future research directions in the
field.
Consistency models in distributed systems: A survey on definitions, disciplines, challenges and applications
The replication mechanism resolves some challenges with big data, such as data
durability, data access, and fault tolerance. Yet, replication itself gives
rise to another challenge, known as consistency in distributed systems.
Scalability and availability, the challenging criteria on which replication is
based in distributed systems, themselves require consistency. Consistency in
distributed computing systems has been employed in three different fields:
system architecture, distributed databases, and distributed systems.
Consistency models can be ordered by their applicability from strong to weak.
Our goal is to propose a novel viewpoint on the different consistency models
utilized in distributed systems. This research proposes two different
categorizations of consistency models. Initially, consistency models are
categorized into three groups: data-centric, client-centric, and hybrid models.
Each group is then divided into three subcategories: traditional, extended, and
novel consistency models. Consequently, the concepts and procedures are
expressed in mathematical terms, introduced in order to present the models'
behavior without implementation. Moreover, we survey different aspects of
challenges with respect to consistency, i.e., availability, scalability,
security, fault tolerance, latency, violation, and staleness, of which the
latter two, violation and staleness, play the most pivotal roles in terms of
consistency and trade-off balancing. Finally, the contribution of each
consistency model and the growing need for them in distributed systems are
investigated. Comment: 52 pages, 13 figures
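The strong-versus-weak spectrum the survey organizes can be illustrated with a toy replicated key-value store. The read-all/primary-write scheme below is one simple way to contrast a strong read with an eventual read; it is not any specific model from the survey's taxonomy:

```python
class ReplicatedStore:
    """Toy replicated key-value store contrasting two read disciplines.

    Writes go to a primary replica and propagate lazily via sync(); a
    "strong" read consults every replica (read-all) and returns the newest
    version, while an "eventual" read consults one arbitrary replica and
    may return stale data until propagation completes.
    """

    def __init__(self, num_replicas=3):
        self.replicas = [dict() for _ in range(num_replicas)]
        self.version = 0

    def write(self, key, value):
        self.version += 1
        self.replicas[0][key] = (self.version, value)  # primary only

    def sync(self):
        # Lazy propagation: copy the primary's entries to every replica.
        for key, entry in self.replicas[0].items():
            for rep in self.replicas[1:]:
                rep[key] = entry

    def read_strong(self, key):
        # Read-all: return the value with the newest version on any replica.
        versions = [rep[key] for rep in self.replicas if key in rep]
        return max(versions)[1]

    def read_eventual(self, key, replica_index):
        # Read one replica; before sync() this can be stale or missing.
        return self.replicas[replica_index].get(key, (0, None))[1]

store = ReplicatedStore()
store.write("x", "v1")
stale = store.read_eventual("x", replica_index=1)   # None: not yet propagated
store.sync()
fresh = store.read_eventual("x", replica_index=1)   # now sees "v1"
```

The gap between `stale` and `fresh` is precisely the staleness the survey identifies as pivotal: the eventual read trades freshness for contacting fewer replicas, while the strong read pays the cost of a read-all.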