    Simulation and Performance Evaluation of Hadoop Capacity Scheduler

    MapReduce is a parallel programming paradigm for processing huge datasets on certain classes of distributable problems using a cluster. Budgetary constraints and the need for better utilization of resources in a MapReduce cluster often lead organizations to rent or share hardware for their main data processing and analysis tasks. As a result, many competing jobs from different clients may issue simultaneous requests to the MapReduce framework on a particular cluster. Schedulers such as Fair Share and Capacity have been designed specifically for this purpose. Administrators and users nonetheless run into performance problems because they do not know the exact meaning of the various task scheduler settings, or what impact those settings have on the resource allocation scheme across organizations sharing a MapReduce cluster. In this work, the Capacity Scheduler is integrated into the existing MRPerf simulator to predict the performance of MapReduce jobs in a shared cluster under different Capacity Scheduler settings. A few case studies on the behaviour of the Capacity Scheduler across different job patterns are also conducted using the integrated simulator.
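    A minimal sketch of the kind of settings the abstract refers to is shown below, using the classic (pre-YARN) Capacity Scheduler property names from the Hadoop era that MRPerf targets; the queue names orgA and orgB and the specific percentages are hypothetical.

```java
import org.apache.hadoop.conf.Configuration;

public class SharedClusterQueues {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // Two organizations share the cluster through two named queues
        // (hypothetical names, illustrative values throughout).
        conf.set("mapred.queue.names", "orgA,orgB");

        // Guaranteed shares (percent of cluster slots); they sum to 100,
        // but an idle queue's slack can be borrowed by the other queue.
        conf.setInt("mapred.capacity-scheduler.queue.orgA.capacity", 60);
        conf.setInt("mapred.capacity-scheduler.queue.orgB.capacity", 40);

        // Ceiling on borrowing: orgA may never exceed 80% of the cluster.
        conf.setInt("mapred.capacity-scheduler.queue.orgA.maximum-capacity", 80);

        // Within orgA, each active user is guaranteed at least 25% of the
        // queue's slots when there is competition.
        conf.setInt(
            "mapred.capacity-scheduler.queue.orgA.minimum-user-limit-percent", 25);
    }
}
```

    The interplay between a queue's guaranteed capacity, its borrowing ceiling, and the per-user limit is exactly the kind of cross-organization effect the integrated simulator is meant to predict.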

    Application profiling and resource management for MapReduce

    The scale of data generated and processed has grown exponentially in the Big Data era, posing a challenge far beyond the capacity of a single computing system: processing such a vast amount of data on a single machine is impracticable in terms of time or cost. Distributed systems, which can harness very large clusters of commodity computers and process data within restrictive time deadlines, are therefore imperative. In this thesis, we target two aspects of distributed systems: application profiling and resource management. We study a MapReduce system in detail, MapReduce being a programming paradigm for large-scale distributed computing, and present solutions to three key problems. First, this thesis analyzes the characteristics of jobs running on the MapReduce system and reveals that the application scope of MapReduce has extended beyond its original design goal of large-scale data processing. This observation motivates a Workload Characteristic Oriented scheduler (WCO), which strives to co-locate tasks of possibly different MapReduce jobs with complementary resource usage characteristics. Second, this thesis studies the current job priority mechanism with a focus on resource management. In the MapReduce system, job priority exists only at the scheduling level: high-priority jobs are placed at the front of the scheduling queue and dispatched first, yet resources are shared fairly among jobs running on the same worker node without any regard for their priorities. To resolve this, the thesis presents a non-intrusive slot layering solution that dynamically allocates resources between running jobs based on their priority, reducing the execution time of high-priority jobs while improving overall throughput. Last, motivated by the underutilization of resources at each individual worker node, this thesis proposes Local Resource Shaper (LRS), which smooths the resource consumption of each individual job by automatically tuning the execution of concurrent jobs to maximize resource utilization while minimizing resource contention.
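    To make the WCO idea concrete, the sketch below shows one plausible reading of complementary co-location; the class, the two-resource profile, and the scoring function are illustrative assumptions, not the thesis's actual API.

```java
import java.util.List;

// Illustrative sketch of complementary co-location (an assumed reading
// of WCO, not the thesis's code): prefer the pending task whose dominant
// resource demand targets whatever the node currently has spare.
public class WcoSketch {

    /** Normalized resource figures in [0,1]. */
    static class Profile {
        final double cpu, io;
        Profile(double cpu, double io) { this.cpu = cpu; this.io = io; }
    }

    /** Lower score = less contention: demand that overlaps the node's
     *  already-busy resource is penalized. */
    static double contention(Profile nodeUsage, Profile taskDemand) {
        return nodeUsage.cpu * taskDemand.cpu + nodeUsage.io * taskDemand.io;
    }

    /** Pick the most complementary pending task for this node. */
    static int pickTask(Profile nodeUsage, List<Profile> pending) {
        int best = -1;
        double bestScore = Double.MAX_VALUE;
        for (int i = 0; i < pending.size(); i++) {
            double score = contention(nodeUsage, pending.get(i));
            if (score < bestScore) { bestScore = score; best = i; }
        }
        return best;
    }

    public static void main(String[] args) {
        Profile node = new Profile(0.9, 0.2);   // CPU-saturated, idle disks
        List<Profile> pending = List.of(
                new Profile(0.8, 0.1),          // another CPU-bound task
                new Profile(0.1, 0.9));         // I/O-bound task
        // Prints 1: the I/O-bound task complements the CPU-busy node.
        System.out.println("co-locate task " + pickTask(node, pending));
    }
}
```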

    Performance Improvement of Distributed Computing Framework and Scientific Big Data Analysis

    Analysis of Big Data to gain better insights has been a focus of researchers in the recent past. Traditional desktop computers and database management systems may not be suitable for efficient and timely analysis because massive parallel processing is required, so distributed computing frameworks are being explored as a viable solution. For example, Google proposed MapReduce, which has become a de facto computing architecture for Big Data solutions. However, scheduling in MapReduce is coarse grained and remains a challenge for improvement. For the MapReduce scheduler configured over distributed clusters, we identify two issues: data locality disruption and random assignment of non-local map tasks. We propose a network-aware scheduler that extends the existing rack awareness: tasks are scheduled in the order of node, rack, and any other rack within the same cluster to achieve cluster-level data locality. The issue of randomly assigned non-local map tasks is handled by enhancing the scheduler to consider network parameters, such as delay, bandwidth, and packet loss between remote clusters. As part of Big Data analysis in computational biology, we consider two major data-intensive applications: indexing genome sequences and de novo assembly, both of which deal with the massive amount of data generated by DNA sequencers. We developed a scalable algorithm that constructs sub-trees of a suffix tree in parallel to address the huge memory requirements of indexing the human genome. For de novo assembly, we propose the Parallel Giraph-based Assembler (PGA) to address the challenges of assembling large genomes on commodity hardware. PGA uses the de Bruijn graph to represent the data generated by sequencers; its huge memory demands and performance expectations are addressed by parallel algorithms built on the distributed graph-processing framework Apache Giraph.
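    The locality ordering and the network-parameter scoring can be pictured with the short sketch below; the cost formula and all names are illustrative assumptions rather than the paper's implementation.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Illustrative sketch (not the paper's code) of the proposed placement:
// node -> rack -> other rack within the cluster, and only then a remote
// cluster chosen by measured network quality instead of at random.
public class NetworkAwarePlacement {

    /** Measured link quality from the local site to a remote cluster. */
    static class Link {
        final String cluster;
        final double delayMs, bandwidthMbps, lossRate;
        Link(String c, double d, double b, double l) {
            cluster = c; delayMs = d; bandwidthMbps = b; lossRate = l;
        }
    }

    /** Locality tiers inside the local cluster; smaller is tried first. */
    static int localityTier(boolean sameNode, boolean sameRack) {
        if (sameNode) return 0;   // data-local map task
        if (sameRack) return 1;   // rack-local
        return 2;                 // any other rack, still cluster-local
    }

    /** Hypothetical cost for a remote cluster: high delay and loss hurt,
     *  high bandwidth helps. Used only once tiers 0-2 are exhausted. */
    static double remoteCost(Link link) {
        return link.delayMs * (1.0 + 10.0 * link.lossRate) / link.bandwidthMbps;
    }

    public static void main(String[] args) {
        List<Link> remotes = new ArrayList<>(List.of(
                new Link("clusterB", 40.0, 1000.0, 0.001),
                new Link("clusterC", 15.0, 100.0, 0.02)));
        remotes.sort(Comparator.comparingDouble(NetworkAwarePlacement::remoteCost));
        // Non-local map tasks go to the cheapest remote cluster.
        System.out.println("preferred remote: " + remotes.get(0).cluster);
    }
}
```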

    Efficient Mapping of Large-scale Data under Heterogeneous Big Data Computing Systems

    As Hadoop ecosystems become increasingly important to practitioners of large-scale data analysis, they also incur significant energy costs. This trend is driving up the need to design energy-efficient Hadoop clusters in order to reduce both the operational costs and the carbon emissions associated with their energy consumption. However, despite extensive study of the problem, existing approaches to energy efficiency have not fully accounted for the heterogeneity of both workloads and machines. We find that heterogeneity-unaware task assignment strategies are detrimental to both the performance and the energy efficiency of Hadoop clusters, and our observations show that even heterogeneity-aware techniques aimed at reducing job completion time do not guarantee a reduction in the energy consumption of heterogeneous machines. We propose E-Ant, which aims to improve the overall energy consumption of a heterogeneous Hadoop cluster without sacrificing job performance, adaptively scheduling heterogeneous workloads on energy-efficient machines. E-Ant uses an ant colony optimization approach that generates task assignment solutions based on feedback about each job's energy consumption reported by the TaskTrackers. We also incorporate DVFS with E-Ant to further improve energy efficiency.
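    The following sketch shows the general shape of an ant-colony-style assignment loop with energy feedback, in the spirit of E-Ant; every name, constant, and formula here is an illustrative assumption, not the paper's published algorithm.

```java
import java.util.Arrays;
import java.util.Random;

// Hedged sketch of ant-colony-style task assignment: pheromone records
// how energy-efficient each machine type has been for each workload
// type, and new tasks are assigned in proportion to that pheromone.
public class AcoAssignmentSketch {
    static final double EVAPORATION = 0.1;  // pheromone decay per update
    final double[][] pheromone;             // [workloadType][machineType]
    final Random rng = new Random(42);

    AcoAssignmentSketch(int workloadTypes, int machineTypes) {
        pheromone = new double[workloadTypes][machineTypes];
        for (double[] row : pheromone) Arrays.fill(row, 1.0);
    }

    /** Sample a machine type, biased toward stronger pheromone trails. */
    int assign(int workload) {
        double total = 0;
        for (double p : pheromone[workload]) total += p;
        double r = rng.nextDouble() * total;
        for (int m = 0; m < pheromone[workload].length; m++) {
            r -= pheromone[workload][m];
            if (r <= 0) return m;
        }
        return pheromone[workload].length - 1;
    }

    /** Energy feedback: cheaper runs deposit more pheromone. */
    void update(int workload, int machine, double joulesUsed) {
        pheromone[workload][machine] =
                (1 - EVAPORATION) * pheromone[workload][machine] + 100.0 / joulesUsed;
    }

    public static void main(String[] args) {
        AcoAssignmentSketch aco = new AcoAssignmentSketch(2, 3);
        for (int round = 0; round < 200; round++) {
            int m = aco.assign(0);
            double joules = (m == 2) ? 50.0 : 200.0; // machine 2 is efficient
            aco.update(0, m, joules);
        }
        // Pheromone for workload 0 should now favour machine type 2.
        System.out.println(Arrays.toString(aco.pheromone[0]));
    }
}
```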