4,191 research outputs found
A Survey on Job and Task Scheduling in Big Data
Bigdata handles the datasets which exceeds the ability of commonly used software tools for storing, sharing and processing the data. Classification of workload is a major issue to the Big Data community namely job type evolution and job size evolution. On the basis of job type, job size and disk performance, clusters are been formed with data node, name node and secondary name node. To classify the workload and to perform the job scheduling, mapreduce algorithm is going to be applied. Based on the performance of individual machine, workload has been allocated. Mapreduce has two phases for processing the data: map and reduce phases. In map phase, the input dataset taken is splitted into keyvalue pairs and an intermediate output is obtained and in reduce phase that key value pair undergoes shuffle and sort operation. Intermediate files are created from map tasks are written to local disk and output files are written to distributed file system of Hadoop. Scheduling of different jobs to different disks are identified after completing mapreduce tasks. Johnson algorithm is used to schedule the jobs and used to find out the optimal solution of different jobs. It schedules the jobs into different pools and performs the scheduling. The main task to be carried out is to minimize the computation time for entire jobs and analyze the performance using response time factors in hadoop distributed file system. Based on the dataset size and number of nodes which is formed in hadoop cluster, the performance of individual jobs are identified\ud
Keywords — \ud
hadoop; mapreduce; johnson algorith
A Mathematical Appraisal: Evolution of Distributed File System and Hadoop
The fast growing technology has left a great impact on the human life. Many traditional systems are either replaced or running in parallel with their electronic counterpart. As for example: - the traditional postal system is now nearly replaced by mobile phones and emails. The electronic system is providing more functionalities than their traditional counterparts. Due to social media, peoples may communicate with each other, share their thoughts and moments of life in form of texts, images or videos. On the other hand, to enhance technologies and knowledge many research activities are propelled and data from different sources are gathered in large volume for further analysis. In short today’s world is surrounded with large volume of data in different form. This put a requirement for effective management of these billions of terabytes of electronic data generally called BIG DATA. The effective management must be based on proven mathematical concepts so that chance of casualties may be reduced. This paper presents a mathematical appraisal for evolution of distribution of file data and explains some basic solution of primitive problems based on probability theory. Keywords: BIG DATA, Distributed System, DFS, Commodity Hardware, Hadoop, Hadoop File System, HDF
ALOJA: A benchmarking and predictive platform for big data performance analysis
The main goals of the ALOJA research project from BSC-MSR, are to explore and automate the characterization of cost-effectivenessof Big Data deployments. The development of the project over its first year, has resulted in a open source benchmarking platform, an online public repository of results with over 42,000 Hadoop job runs, and web-based analytic tools to gather insights about system's cost-performance1.
This article describes the evolution of the project's focus and research
lines from over a year of continuously benchmarking Hadoop under dif-
ferent configuration and deployments options, presents results, and dis
cusses the motivation both technical and market-based of such changes.
During this time, ALOJA's target has evolved from a previous low-level
profiling of Hadoop runtime, passing through extensive benchmarking
and evaluation of a large body of results via aggregation, to currently
leveraging Predictive Analytics (PA) techniques. Modeling benchmark
executions allow us to estimate the results of new or untested configu-
rations or hardware set-ups automatically, by learning techniques from
past observations saving in benchmarking time and costs.This work is partially supported the BSC-Microsoft Research Centre, the Span-
ish Ministry of Education (TIN2012-34557), the MINECO Severo Ochoa Research program (SEV-2011-0067) and the Generalitat de Catalunya (2014-SGR-1051).Peer ReviewedPostprint (author's final draft
A Tale of Two Data-Intensive Paradigms: Applications, Abstractions, and Architectures
Scientific problems that depend on processing large amounts of data require
overcoming challenges in multiple areas: managing large-scale data
distribution, co-placement and scheduling of data with compute resources, and
storing and transferring large volumes of data. We analyze the ecosystems of
the two prominent paradigms for data-intensive applications, hereafter referred
to as the high-performance computing and the Apache-Hadoop paradigm. We propose
a basis, common terminology and functional factors upon which to analyze the
two approaches of both paradigms. We discuss the concept of "Big Data Ogres"
and their facets as means of understanding and characterizing the most common
application workloads found across the two paradigms. We then discuss the
salient features of the two paradigms, and compare and contrast the two
approaches. Specifically, we examine common implementation/approaches of these
paradigms, shed light upon the reasons for their current "architecture" and
discuss some typical workloads that utilize them. In spite of the significant
software distinctions, we believe there is architectural similarity. We discuss
the potential integration of different implementations, across the different
levels and components. Our comparison progresses from a fully qualitative
examination of the two paradigms, to a semi-quantitative methodology. We use a
simple and broadly used Ogre (K-means clustering), characterize its performance
on a range of representative platforms, covering several implementations from
both paradigms. Our experiments provide an insight into the relative strengths
of the two paradigms. We propose that the set of Ogres will serve as a
benchmark to evaluate the two paradigms along different dimensions.Comment: 8 pages, 2 figure
Storage Solutions for Big Data Systems: A Qualitative Study and Comparison
Big data systems development is full of challenges in view of the variety of
application areas and domains that this technology promises to serve.
Typically, fundamental design decisions involved in big data systems design
include choosing appropriate storage and computing infrastructures. In this age
of heterogeneous systems that integrate different technologies for optimized
solution to a specific real world problem, big data system are not an exception
to any such rule. As far as the storage aspect of any big data system is
concerned, the primary facet in this regard is a storage infrastructure and
NoSQL seems to be the right technology that fulfills its requirements. However,
every big data application has variable data characteristics and thus, the
corresponding data fits into a different data model. This paper presents
feature and use case analysis and comparison of the four main data models
namely document oriented, key value, graph and wide column. Moreover, a feature
analysis of 80 NoSQL solutions has been provided, elaborating on the criteria
and points that a developer must consider while making a possible choice.
Typically, big data storage needs to communicate with the execution engine and
other processing and visualization technologies to create a comprehensive
solution. This brings forth second facet of big data storage, big data file
formats, into picture. The second half of the research paper compares the
advantages, shortcomings and possible use cases of available big data file
formats for Hadoop, which is the foundation for most big data computing
technologies. Decentralized storage and blockchain are seen as the next
generation of big data storage and its challenges and future prospects have
also been discussed
Multi-Objective Big Data Optimization with jMetal and Spark
Big Data Optimization is the term used to refer to optimization problems which have to manage very large amounts of data. In this paper, we focus on the parallelization of metaheuristics with the Apache Spark cluster computing system for solving multi-objective Big Data Optimization problems. Our purpose is to study the influence of accessing data stored in the Hadoop File System (HDFS) in each evaluation step of a metaheuristic and to provide a software tool to solve these kinds of problems. This tool combines the jMetal multi-objective optimization framework with Apache Spark. We have carried out experiments to measure the performance of the proposed parallel infrastructure in an environment based on virtual machines in a local cluster comprising up to 100 cores. We obtained interesting results for computational e ort and propose guidelines to face multi-objective Big Data Optimization
problems.Universidad de Málaga. Campus de Excelencia Internacional Andalucía Tech
- …