Overview of Caching Mechanisms to Improve Hadoop Performance
In today's distributed computing environments, large amounts of data are
generated from different sources at high velocity, making the data
difficult to capture, manage, and process within existing relational databases.
Hadoop is a tool to store and process large datasets in a parallel manner
across a cluster of machines in a distributed environment. Hadoop brings many
benefits like flexibility, scalability, and high fault tolerance; however, it
faces challenges in terms of data access time, I/O operations, and
duplicate computations, resulting in extra overhead, resource wastage, and poor
performance. Many researchers have utilized caching mechanisms to tackle these
challenges. For example, they have presented approaches to improve data access
time, enhance data locality rate, remove repetitive calculations, reduce the
number of I/O operations, decrease the job execution time, and increase
resource efficiency. In the current study, we provide a comprehensive overview
of caching strategies to improve Hadoop performance. Additionally, a novel
classification is introduced based on cache utilization. Using this
classification, we analyze the impact on Hadoop performance and discuss the
advantages and disadvantages of each group. Finally, a novel hybrid approach
called Hybrid Intelligent Cache (HIC) that combines the benefits of two methods
from different groups, H-SVM-LRU and CLQLMRS, is presented. Experimental
results show that our hybrid method achieves an average improvement of 31.2% in
job execution time.
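The mechanisms surveyed above largely revolve around caching HDFS blocks and intermediate results. As a point of reference only, here is a minimal Python sketch of an LRU cache with a pluggable admission hook; the hook stands in for a learned policy such as the SVM classifier used by H-SVM-LRU, and every name in the sketch is an illustrative assumption rather than code from the paper.

```python
from collections import OrderedDict

class LRUCache:
    """Minimal LRU cache with a pluggable admission hook (illustrative;
    not the paper's implementation)."""

    def __init__(self, capacity, admit=lambda key: True):
        self.capacity = capacity
        self.admit = admit          # hypothetical learned admission policy
        self.store = OrderedDict()  # insertion order doubles as recency order

    def get(self, key):
        if key not in self.store:
            return None             # miss: caller fetches the block from HDFS
        self.store.move_to_end(key) # mark as most recently used
        return self.store[key]

    def put(self, key, value):
        if not self.admit(key):     # skip blocks the policy predicts are cold
            return
        self.store[key] = value
        self.store.move_to_end(key)
        if len(self.store) > self.capacity:
            self.store.popitem(last=False)  # evict the least recently used
```

An SVM-assisted variant would replace the default `admit` with a classifier trained on block access features, which is the general idea behind intelligent-cache schemes such as H-SVM-LRU.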
Datacenter Traffic Control: Understanding Techniques and Trade-offs
Datacenters provide cost-effective and flexible access to scalable compute
and storage resources necessary for today's cloud computing needs. A typical
datacenter is made up of thousands of servers connected with a large network
and usually managed by one operator. To provide quality access to the variety
of applications and services hosted on datacenters and to maximize performance,
it is necessary to use datacenter networks effectively and efficiently.
Datacenter traffic is often a mix of several classes with different priorities
and requirements. This includes user-generated interactive traffic, traffic
with deadlines, and long-running traffic. To this end, custom transport
protocols and traffic management techniques have been developed to improve
datacenter network performance.
In this tutorial paper, we review the general architecture of datacenter
networks, various topologies proposed for them, their traffic properties,
general traffic control challenges in datacenters and general traffic control
objectives. The purpose of this paper is to bring out the important
characteristics of traffic control in datacenters and not to survey all
existing solutions (as this is virtually impossible given the massive body of
existing research). We hope to provide readers with a broad view of the options
and factors to weigh when considering a variety of traffic control mechanisms. We discuss
various characteristics of datacenter traffic control including management
schemes, transmission control, traffic shaping, prioritization, load balancing,
multipathing, and traffic scheduling. Next, we point to several open challenges
as well as new and interesting networking paradigms. At the end of this paper,
we briefly review inter-datacenter networks that connect geographically
dispersed datacenters, which have been receiving increasing attention recently
and pose interesting and novel research problems.
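Of the mechanisms listed above, traffic shaping is perhaps the simplest to illustrate. The following is a generic token-bucket shaper in Python, standard textbook material rather than any scheme from the paper; names and parameters are illustrative.

```python
import time

class TokenBucket:
    """Generic token-bucket traffic shaper: a packet may be sent only if
    enough tokens have accumulated; tokens refill at `rate` bytes/s up to
    a burst ceiling of `capacity` bytes."""

    def __init__(self, rate, capacity):
        self.rate = rate              # sustained rate (bytes per second)
        self.capacity = capacity      # maximum burst size (bytes)
        self.tokens = capacity        # start with a full bucket
        self.last = time.monotonic()

    def allow(self, packet_size):
        now = time.monotonic()
        # Refill in proportion to elapsed time, capped at the bucket size.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if packet_size <= self.tokens:
            self.tokens -= packet_size
            return True               # conforming packet: transmit now
        return False                  # non-conforming: queue, delay, or drop
```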
Resource Allocation In Large-Scale Distributed Systems
The focus of this dissertation is the design and analysis of scheduling algorithms for distributed computer systems, i.e., data centers. Today's data centers can contain thousands of servers and typically use a multi-tier switch network to provide connectivity among the servers. Data centers host the execution of various data-parallel applications. As an abstraction, a job in a data center can be thought of as a group of interdependent tasks, each with its own requirements, that must be scheduled for execution on the servers, together with data flows between the tasks that must be scheduled in the switch network. In this thesis, we study both flow and task scheduling problems under the features of modern parallel computing frameworks. For the flow scheduling problem, we study three models.
The first model considers a general network topology where flows among the various source-destination pairs of servers are generated dynamically over time. The goal is to assign the end-to-end data flows among the available paths in order to efficiently balance the load in the network. We propose a myopic algorithm that is computationally efficient and prove that it asymptotically minimizes the total network cost, using a convex optimization model, fluid limits, and Lyapunov analysis. We further propose randomized versions of our myopic algorithm.
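As a rough illustration of this myopic idea (not the dissertation's algorithm or its analysis), the sketch below routes each arriving flow on the path with the smallest marginal increase in a convex per-link cost; the quadratic cost and the data structures are assumptions made for the example.

```python
def assign_flow(paths, link_load, flow_rate, cost=lambda x: x * x):
    """Myopic path choice: pick the path whose marginal increase in total
    convex link cost is smallest, then commit the flow to it.

    paths     -- candidate paths, each a list of link ids
    link_load -- dict mapping link id -> current load
    cost      -- convex per-link cost of load (quadratic here, a common
                 stand-in for congestion cost)
    """
    def marginal(path):
        return sum(cost(link_load[link] + flow_rate) - cost(link_load[link])
                   for link in path)

    best = min(paths, key=marginal)
    for link in best:                 # update loads along the chosen path
        link_load[link] += flow_rate
    return best
```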
The second model considers the case in which there are dependencies among flows. Specifically, a coflow is defined as a collection of parallel flows whose completion time is determined by the completion time of the last flow in the collection. Our main result is a deterministic 5-approximation algorithm that schedules coflows in polynomial time so as to minimize the total weighted completion time. The key ingredient of our approach is an improved linear program formulation for ordering the coflows, followed by a simple list scheduling policy.
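To convey the flavor of the list-scheduling step, here is a heavily simplified sketch: coflows are processed in a fixed order (in the dissertation the order comes from the LP relaxation) on a switch whose ports serve one flow at a time. Real coflow schedulers allow concurrent port matchings, so this is only an illustration; all names are assumptions.

```python
def list_schedule(ordered_coflows, num_ports):
    """Greedy list scheduling of coflows on an m x m switch with
    unit-capacity ports that serve one flow at a time (simplified).

    Each coflow is a dict mapping (src_port, dst_port) -> demand.
    Returns each coflow's completion time, i.e. when its last flow ends.
    """
    free_at = {('in', p): 0.0 for p in range(num_ports)}
    free_at.update({('out', p): 0.0 for p in range(num_ports)})
    completions = []
    for coflow in ordered_coflows:
        finish = 0.0
        for (src, dst), demand in coflow.items():
            start = max(free_at[('in', src)], free_at[('out', dst)])
            end = start + demand                    # unit port capacity
            free_at[('in', src)] = free_at[('out', dst)] = end
            finish = max(finish, end)
        completions.append(finish)  # a coflow completes with its last flow
    return completions
```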
Lastly, we study scheduling the coflows of multi-stage jobs to minimize the jobs' total weighted completion time. Each job is represented by a DAG (directed acyclic graph) over its coflows that captures the dependencies among them. We define g(m) = log(m)/log(log(m)) and h(m, μ) = log(mμ)/log(log(mμ)), where m is the number of servers and μ is the maximum number of coflows in a job. We develop two algorithms with approximation ratios O(√μ g(m)) and O(√μ g(m) h(m, μ)) for jobs with general DAGs and rooted trees, respectively. The algorithms rely on randomly delaying and merging optimal schedules of the coflows in the jobs' DAGs, followed by enforcing the dependencies among coflows and the links' capacity constraints.
For the task scheduling problem, we study two models. We consider a setting where each job consists of a set of parallel tasks that need to be processed on different servers, and the job is completed once all its tasks finish processing. In the first model, each job is associated with a utility that is a decreasing function of its completion time. The objective is to schedule tasks in a way that achieves max-min fairness for the jobs' utilities. We first establish a strong NP-hardness result for this problem. We then define two notions of approximate solutions and develop scheduling algorithms that provide guarantees under these notions, using dynamic programming and random perturbation of the tasks' processing times. In the second model, we further assume that the processing times of tasks can be server-dependent and that a server can process (pack) multiple tasks at the same time, subject to its capacity. We then propose three algorithms with approximation ratios of 4, (6 + ε), and 24 for different cases in which preemption and migration of tasks among the servers are or are not allowed. Our algorithms use a combination of linear programming relaxation and greedy packing techniques.
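A basic building block of such LP-plus-greedy schemes is the packing step itself. The sketch below is plain first-fit-decreasing packing of tasks onto capacitated servers; it carries none of the approximation guarantees stated above, and its names are illustrative.

```python
def greedy_pack(tasks, servers):
    """First-fit-decreasing packing: place each task, largest first, on
    the first server with enough remaining capacity.

    tasks   -- dict mapping task id -> size
    servers -- dict mapping server id -> capacity
    Returns a dict mapping task id -> server id (or None if unplaced).
    """
    remaining = dict(servers)
    placement = {}
    for tid, size in sorted(tasks.items(), key=lambda kv: -kv[1]):
        for sid, cap in remaining.items():
            if size <= cap:
                placement[tid] = sid
                remaining[sid] = cap - size
                break
        else:
            placement[tid] = None   # no server can currently host the task
    return placement
```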
To demonstrate the gains in practice, we evaluate all the proposed algorithms and compare their performance with prior approaches through extensive simulations using real and synthesized traffic traces. We hope this work inspires improvements to existing job management and scheduling in distributed computer systems.
Towards Deadline Guaranteed Cloud Storage Services
More and more organizations move their data and workloads to commercial cloud storage systems. However, the multiplexing and sharing of resources in a cloud storage system present unpredictable data access latency to tenants, which may make online data-intensive applications unable to satisfy their deadline requirements. Thus, it is important for cloud storage systems to provide deadline-guaranteed services. In this paper, to meet a current form of service level objective (SLO) that constrains the percentage of each tenant's data access requests failing to meet its required deadline to below a given threshold, we build a mathematical model to derive the upper bound of the acceptable request arrival rate on each server. We then propose a Deadline Guaranteed storage service (called DGCloud) that incorporates three algorithms. Its deadline-aware load balancing scheme redirects requests and creates replicas to release the excess load of each server beyond the derived upper bound. Its workload consolidation algorithm reduces the number of active servers as much as possible while still satisfying the SLO, in order to maximize resource utilization. Its data placement optimization algorithm re-schedules the data placement to minimize the transmission cost of data replication. Our trace-driven experiments in simulation and on Amazon EC2 show that DGCloud outperforms previous methods in terms of deadline guarantees and system resource utilization, and demonstrate the effectiveness of its individual algorithms.
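As a rough sketch of the deadline-aware load balancing idea (not DGCloud's actual code), the routine below sheds any arrival rate above the derived upper bound onto less-loaded servers, stopping where replica creation would be needed; all names and structures are assumptions for illustration.

```python
def rebalance(arrival_rate, upper_bound):
    """Shift request load from overloaded servers to servers with headroom,
    keeping every server at or below the derived upper bound where possible.

    arrival_rate -- dict mapping server id -> current request arrival rate
    Returns a list of (source, target, shifted_rate) redirections.
    """
    moves = []
    for sid in list(arrival_rate):
        while arrival_rate[sid] > upper_bound:
            excess = arrival_rate[sid] - upper_bound
            target = min(arrival_rate, key=arrival_rate.get)
            room = upper_bound - arrival_rate[target]
            if target == sid or room <= 0:
                break               # no headroom left: create a replica instead
            shift = min(excess, room)
            arrival_rate[sid] -= shift
            arrival_rate[target] += shift
            moves.append((sid, target, shift))
    return moves
```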
Frugal Topology Construction for Stream Aggregation in the Cloud
Aggregation of streamed data is key to the expansion of the Internet of Things. This paper addresses the problem of designing a topology for reliably aggregating data flows from many devices arriving at a datacenter. Reliability here means ensuring operation without data loss. We seek a frugal solution that prevents wasteful resource consumption (over-provisioning). This problem is salient when building an aggregation service out of components (here, aggregation nodes) that exhibit hard constraints on the amount of information they can handle per unit of time. We first formalize the problem and analyze the relation between the monitored devices (and the information they send) and the operations performed at aggregation nodes, in terms of data rates. Building on this rate analysis, we devise a novel algorithm, which we call CSA, that outputs an aggregation topology capable of handling those incoming data rates, thereby avoiding empirical trial-and-error design. We analyze the algorithm before validating it on the Amazon Kinesis platform, using a device dataset from a European telco operator.
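CSA itself is more elaborate, but the core rate-packing intuition can be sketched as follows: pack incoming stream rates into aggregation nodes under a hard per-node rate cap, then recurse on the nodes' output rates until a single root remains. The `reduction` factor, which models how much an aggregation step shrinks its input, and all other names are assumptions for illustration.

```python
def build_topology(device_rates, node_capacity, reduction=1.0):
    """Greedily build a loss-free aggregation tree level by level.

    device_rates  -- incoming data rates of the monitored devices
    node_capacity -- hard per-node limit on total incoming rate
    reduction     -- output/input rate ratio of a node (1.0 = no shrinkage)
    Returns a list of levels; each level is a list of nodes, each node
    being the list of input rates assigned to it.
    """
    levels = []
    rates = sorted(device_rates, reverse=True)
    while len(rates) > 1:
        nodes, load = [], []
        for r in rates:
            if r > node_capacity:
                raise ValueError("a single stream exceeds node capacity")
            for i in range(len(nodes)):          # first-fit into an open node
                if load[i] + r <= node_capacity:
                    nodes[i].append(r)
                    load[i] += r
                    break
            else:                                # open a new aggregation node
                nodes.append([r])
                load.append(r)
        if len(nodes) >= len(rates):             # no fan-in: cannot aggregate
            raise ValueError("capacity too small to aggregate further")
        levels.append(nodes)
        rates = sorted((x * reduction for x in load), reverse=True)
    return levels
```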
A Tutorial on Clique Problems in Communications and Signal Processing
Since its first use by Euler on the problem of the seven bridges of
Königsberg, graph theory has shown excellent abilities in solving and
unveiling the properties of multiple discrete optimization problems. The study
of the structure of some integer programs reveals equivalence with graph theory
problems making a large body of the literature readily available for solving
and characterizing the complexity of these problems. This tutorial presents a
framework for utilizing a particular graph theory problem, known as the clique
problem, for solving communications and signal processing problems. In
particular, the paper aims to illustrate the structural properties of integer
programs that can be formulated as clique problems through multiple examples in
communications and signal processing. To that end, the first part of the
tutorial provides various optimal and heuristic solutions for the maximum
clique, maximum weight clique, and k-clique problems. The tutorial further
illustrates the use of the clique formulation through numerous contemporary
examples in communications and signal processing, mainly in maximum access for
non-orthogonal multiple access networks, throughput maximization using index
and instantly decodable network coding, collision-free radio frequency
identification networks, and resource allocation in cloud-radio access
networks. Finally, the tutorial sheds light on the recent advances of such
applications, and provides technical insights on ways of dealing with mixed
discrete-continuous optimization problems.
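For a concrete flavor of the heuristic solutions the tutorial covers, here is a standard greedy heuristic for the maximum weight clique problem. It is generic textbook material rather than an algorithm from the paper, and it assumes adjacency sets that exclude the vertex itself.

```python
def greedy_max_weight_clique(adj, weight):
    """Greedy heuristic: repeatedly add the heaviest vertex that is
    adjacent to every vertex already in the clique.

    adj    -- dict mapping vertex -> set of neighbours (no self-loops)
    weight -- dict mapping vertex -> positive weight
    """
    clique = set()
    candidates = set(adj)
    while candidates:
        v = max(candidates, key=weight.get)
        clique.add(v)
        candidates &= adj[v]   # keep only vertices adjacent to all chosen
    return clique
```

Exact branch-and-bound solvers and the other optimal methods surveyed in the tutorial replace this single greedy pass with systematic search.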
ON SCHEDULING AND COMMUNICATION ISSUES IN DATA CENTERS
The proliferation of datacenters to handle the rapidly growing amount of data managed in the cloud necessitates the design, management, and effective utilization of the thousands of machines that constitute a data center. Many modern big data applications require access to a large number of machines and datasets for training neural nets or for other big data processing.
In this thesis, we present research challenges and progress along two fronts. The first challenge addresses the need to schedule communication between machines much more effectively, as several running applications compete for network bandwidth. We address a basic question known as coflow scheduling: optimizing the weighted average completion time of tasks that run across different machines in a datacenter while effectively handling their communication needs. Sometimes we are forced to distribute a task among multiple datacenters for cost or legal reasons. For this case, we also study a related model for tasks that process data on multiple data centers, handling their communication requirements across a wide area network whose bandwidth and structure may vary widely across different pairs of machines.
The second challenge is from a cloud user's perspective: since access to resources such as those provided by Amazon AWS can be expensive at scale, cloud computing providers often sell underutilized resources at a significant discount via a spot instance market. However, these instances are not dedicated, and while they offer a cheaper alternative, there is a chance that the user's job will be interrupted to make room for higher-priority tasks. Certain non-critical applications are not significantly impacted by delays due to interruptions, and we develop an initial framework to study some basic scheduling questions under this circumstance.
In all of these topics, the problems we study are NP-hard, and our focus is on developing good approximation algorithms. In addition, while we attack these problems from a theoretical perspective, all the algorithms developed in this thesis are practical and efficient and can be easily deployed in practice; some already have been.