
    Automatically configuring parallelism for hybrid layouts

    Distributed processing frameworks process data in parallel by dividing it into multiple partitions, each of which is processed in a separate task. The number of tasks is typically derived from the total file size. In the case of hybrid layouts, however, this can lead to launching more tasks than needed, because such layouts allow certain operations (i.e., projection, selection) to read less data. Over-provisioning tasks may increase the job execution time and waste significant computing resources, since each task introduces extra overhead (e.g., initialization, garbage collection). To allow a more efficient use of resources and reduce the job execution time, we propose a cost-based approach that decides the number of tasks based on the data being read. The proposed cost model can also be used within a multi-objective approach to decide both the number of tasks and the number of machines for execution.
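
    As a rough illustration of the idea (not the paper's actual cost model), the sketch below sizes parallelism by an estimate of the bytes a hybrid layout actually reads after selection and projection; all names and constants here are hypothetical.

    ```python
    # Size parallelism by the data actually read, not by the total file size.

    def estimate_bytes_read(total_bytes, selectivity, projected_fraction):
        """Estimate the data a hybrid layout actually scans after
        selection (row filtering) and projection (column pruning)."""
        return total_bytes * selectivity * projected_fraction

    def choose_num_tasks(total_bytes, selectivity, projected_fraction,
                         target_bytes_per_task=128 * 1024 * 1024):
        bytes_read = estimate_bytes_read(total_bytes, selectivity,
                                         projected_fraction)
        # At least one task; otherwise one task per ~128 MB actually read.
        return max(1, round(bytes_read / target_bytes_per_task))

    # A 1 TB dataset where the query keeps 10% of rows and 20% of columns
    # needs ~164 tasks instead of the 8,192 a naive size-based rule launches.
    print(choose_num_tasks(1 << 40, 0.10, 0.20))
    ```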

    A methodology for Spark parameter tuning

    Spark has been established as an attractive platform for big data analysis, since it manages to hide most of the complexities related to parallelism, fault tolerance and cluster setting from developers. However, this comes at the expense of having over 150 configurable parameters, whose impact cannot be examined exhaustively due to the exponential number of their combinations. The default values allow developers to deploy their applications quickly, but leave open the question of whether performance can be improved. In this work, we investigate the impact of the most important tunable Spark parameters regarding shuffling, compression and serialization on application performance, through extensive experimentation on the Spark-enabled Marenostrum III (MN3) computing infrastructure of the Barcelona Supercomputing Center. The overarching aim is to guide developers in deciding how to change the default values. We build upon our previous work, where we mapped our experience to a trial-and-error iterative improvement methodology for tuning parameters in arbitrary applications, based on evidence from a very small number of experimental runs. The main contribution of this work is an alternative systematic methodology for parameter tuning, which can be easily applied to any computing infrastructure and is shown to yield comparable, if not better, results than the initial one when applied to MN3; observed speedups in our validating test case studies start from 20%. In addition, the new methodology can rely on runs using samples instead of runs on the complete datasets, which renders it significantly more practical.
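
    By way of illustration only (this is not the paper's methodology, and the values are placeholders rather than recommendations), the snippet below sets a few of the shuffle, compression and serialization parameters the study examines, using PySpark's standard configuration API.

    ```python
    from pyspark.sql import SparkSession

    # Illustrative values only; the paper's point is precisely that such
    # settings should be tuned per application and per cluster.
    spark = (
        SparkSession.builder
        .appName("tuning-sketch")
        # Serialization: Kryo is usually faster than Java serialization.
        .config("spark.serializer",
                "org.apache.spark.serializer.KryoSerializer")
        # Compression: codec used for shuffle outputs and RDD blocks.
        .config("spark.io.compression.codec", "lz4")
        .config("spark.shuffle.compress", "true")
        # Shuffle behavior: buffer sizes traded against memory pressure.
        .config("spark.shuffle.file.buffer", "64k")
        .config("spark.reducer.maxSizeInFlight", "96m")
        .getOrCreate()
    )
    ```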

    Exploring Algorithmic Limits of Matrix Rank Minimization under Affine Constraints

    Many applications require recovering a matrix of minimal rank within an affine constraint set, with matrix completion a notable special case. Because the problem is NP-hard in general, it is common to replace the matrix rank with the nuclear norm, which acts as a convenient convex surrogate. While elegant theoretical conditions elucidate when this replacement is likely to be successful, they are highly restrictive, and convex algorithms fail when the ambient rank is too high or when the constraint set is poorly structured. Non-convex alternatives fare somewhat better when carefully tuned; however, convergence to locally optimal solutions remains a continuing source of failure. Against this backdrop we derive a deceptively simple and parameter-free probabilistic PCA-like algorithm that is capable, over a wide battery of empirical tests, of successful recovery even at the theoretical limit where the number of measurements equals the degrees of freedom in the unknown low-rank matrix. Somewhat surprisingly, this is possible even when the affine constraint set is highly ill-conditioned. While proving general recovery guarantees remains elusive for non-convex algorithms, Bayesian-inspired or otherwise, we nonetheless show conditions under which the underlying cost function has a unique stationary point located at the global optimum; no existing cost function we are aware of satisfies this same property. We conclude with a simple computer vision application involving image rectification and a standard collaborative filtering benchmark.
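
    The paper's own method is probabilistic and PCA-like; as a point of reference, here is a minimal sketch of the convex baseline it argues against: nuclear-norm matrix completion via singular value thresholding. The observation mask, parameter choices, and test problem below are our own assumptions.

    ```python
    import numpy as np

    def svt_complete(M_obs, mask, tau=200.0, step=1.2, iters=500):
        """Nuclear-norm matrix completion by singular value thresholding.
        M_obs holds observed entries (zeros elsewhere); mask is 1 where
        observed. tau ~ 5 * n is a common heuristic for an n x n matrix."""
        Y = np.zeros_like(M_obs)
        X = Y
        for _ in range(iters):
            U, s, Vt = np.linalg.svd(Y, full_matrices=False)
            X = U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt  # shrink spectrum
            Y = Y + step * mask * (M_obs - X)   # correct observed entries
        return X

    # Rank-2 ground truth with 50% of entries observed.
    rng = np.random.default_rng(0)
    A = rng.standard_normal((40, 2)) @ rng.standard_normal((2, 40))
    mask = (rng.random(A.shape) < 0.5).astype(float)
    X = svt_complete(A * mask, mask)
    print(np.linalg.norm(X - A) / np.linalg.norm(A))  # relative error
    ```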

    A Survey on Automatic Parameter Tuning for Big Data Processing Systems

    Big data processing systems (e.g., Hadoop, Spark, Storm) contain a vast number of configuration parameters controlling parallelism, I/O behavior, memory settings, and compression. Improper parameter settings can cause significant performance degradation and stability issues. However, regular users and even expert administrators struggle to understand and tune them to achieve good performance. We investigate existing approaches to parameter tuning for both batch and stream data processing systems and classify them into six categories: rule-based, cost modeling, simulation-based, experiment-driven, machine learning, and adaptive tuning. We summarize the pros and cons of each approach and raise some open research problems for automatic parameter tuning.

    Performance modelling, analysis and prediction of Spark jobs in Hadoop cluster : a thesis by publications presented in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Computer Science, School of Mathematical & Computational Sciences, Massey University, Auckland, New Zealand

    Big Data frameworks have received tremendous attention from industry and from academic research over the past decade. Distributed computing frameworks such as Hadoop MapReduce and Spark offer efficient solutions for analysing large-scale datasets on a Hadoop cluster. Spark has been established as one of the most popular large-scale data processing engines because of its speed, low-latency in-memory computation, and advanced analytics. Spark's computational performance heavily depends on the selection of suitable parameters, and configuring these parameters is a challenging task. Although Spark has default parameters and can deploy applications without much effort, a significant drawback of default parameter selection is that it is not always best for cluster performance. A major limitation of existing models for Spark performance prediction is that they require either large input data or time-consuming system configuration. An analytical model could therefore be a better solution for predicting performance and for establishing appropriate job configurations. This thesis proposes two distinct parallelisation models for performance prediction: the 2D-Plate model and the Fully-Connected Node model. Both models were constructed based on serial boundaries for a certain arrangement of executors and size of the data. To evaluate cluster performance, various HiBench workloads were used, and each workload's empirical data were fitted to the models for performance prediction analysis. The developed models were benchmarked against existing models such as Amdahl's law, Gustafson's law, ERNEST, and machine learning. Our experimental results show that the two proposed models can quickly and accurately predict performance in terms of runtime, and they can outperform the accuracy of machine learning models when extrapolating predictions.
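
    The abstract does not give the form of the 2D-Plate or Fully-Connected Node models, but the Amdahl's-law baseline they are benchmarked against is standard; a minimal sketch, with illustrative numbers of our own choosing:

    ```python
    def amdahl_speedup(p, n):
        """Amdahl's law: speedup on n executors when a fraction p of
        the job is parallelisable (the rest runs serially)."""
        return 1.0 / ((1.0 - p) + p / n)

    def predict_runtime(t1, p, n):
        """Predicted runtime on n executors, given single-executor
        runtime t1."""
        return t1 / amdahl_speedup(p, n)

    # A job that runs in 600 s on one executor and is 95% parallel gets
    # only ~12.5x speedup on 32 executors: the serial 5% dominates.
    print(predict_runtime(600.0, 0.95, 32))
    ```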

    Meta-heuristic algorithms in car engine design: a literature survey

    Meta-heuristic algorithms are often inspired by natural phenomena, including the evolution of species in Darwinian natural selection theory, ant behaviors in biology, flock behaviors of some birds, and annealing in metallurgy. Due to their great potential in solving difficult optimization problems, meta-heuristic algorithms have found their way into automobile engine design. Different optimization problems arise in different areas of car engine management, including calibration, control systems, fault diagnosis, and modeling. In this paper, we review the state-of-the-art applications of different meta-heuristic algorithms in engine management systems. The review covers a wide range of research, including the application of meta-heuristic algorithms in engine calibration, optimizing engine control systems, engine fault diagnosis, and optimizing different parts of engines and modeling. The meta-heuristic algorithms reviewed in this paper include evolutionary algorithms, evolution strategy, evolutionary programming, genetic programming, differential evolution, estimation of distribution algorithms, ant colony optimization, particle swarm optimization, memetic algorithms, and artificial immune systems.
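
    As a generic illustration of one family the survey covers (not tied to any specific engine application in it), here is a minimal particle swarm optimization run on a toy objective; all hyperparameters are conventional defaults, not the survey's recommendations.

    ```python
    import numpy as np

    def pso(objective, dim, n_particles=30, iters=200,
            w=0.7, c1=1.5, c2=1.5, bounds=(-5.0, 5.0)):
        """Minimal particle swarm optimization (minimization)."""
        rng = np.random.default_rng(0)
        lo, hi = bounds
        x = rng.uniform(lo, hi, (n_particles, dim))   # positions
        v = np.zeros_like(x)                          # velocities
        pbest = x.copy()                              # personal bests
        pbest_f = np.apply_along_axis(objective, 1, x)
        gbest = pbest[np.argmin(pbest_f)]             # global best
        for _ in range(iters):
            r1, r2 = rng.random(x.shape), rng.random(x.shape)
            v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
            x = np.clip(x + v, lo, hi)
            f = np.apply_along_axis(objective, 1, x)
            improved = f < pbest_f
            pbest[improved], pbest_f[improved] = x[improved], f[improved]
            gbest = pbest[np.argmin(pbest_f)]
        return gbest, pbest_f.min()

    # Toy objective standing in for, e.g., an engine calibration cost.
    print(pso(lambda z: np.sum(z ** 2), dim=4))
    ```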

    A controlled migration genetic algorithm operator for hardware-in-the-loop experimentation

    In this paper, we describe the development of an extended migration operator, which combats the negative effects of noise on the effective search capabilities of genetic algorithms. The research is motivated by the need to minimize the number of evaluations during hardware-in-the-loop experimentation, which can carry a significant cost penalty in terms of time or financial expense. The authors build on previous research, where convergence for search methods such as simulated annealing and variable neighbourhood search was accelerated by the implementation of an adaptive decision support operator. This methodology was found to be effective in searching noisy data surfaces. Provided that noise is not too significant, genetic algorithms can prove even more effective at guiding experimentation. It will be shown that with the introduction of a Controlled Migration operator into the GA heuristic, data with a significant signal-to-noise ratio can be searched with significant beneficial effects on the efficiency of hardware-in-the-loop experimentation, without a priori parameter tuning. The method is tested on an engine-in-the-loop experimental example and shown to bring significant performance benefits.
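
    The abstract does not spell the operator out, so the sketch below is only a hypothetical illustration of migration between GA subpopulations under noisy fitness, with control exercised by re-evaluating migrants before acceptance; the names and the acceptance rule are assumptions, not the paper's operator.

    ```python
    import random

    def noisy_fitness(genes, evaluate, repeats=3):
        """Average repeated evaluations to damp measurement noise (each
        repeat costs one hardware-in-the-loop experiment)."""
        return sum(evaluate(genes) for _ in range(repeats)) / repeats

    def migrate(islands, evaluate, n_migrants=2):
        """Move the apparently-best individuals around a ring of islands,
        re-scoring them on arrival so a lucky noisy score cannot spread
        unchecked. Individuals are (genes, fitness); lower is better."""
        for src, dst in zip(islands, islands[1:] + islands[:1]):
            migrants = sorted(src, key=lambda ind: ind[1])[:n_migrants]
            for genes, _ in migrants:
                score = noisy_fitness(genes, evaluate)  # controlled step
                worst = max(range(len(dst)), key=lambda i: dst[i][1])
                if score < dst[worst][1]:               # accept if better
                    dst[worst] = (genes, score)

    rng = random.Random(0)
    islands = [[([rng.uniform(-1, 1)], rng.random()) for _ in range(10)]
               for _ in range(3)]
    migrate(islands, evaluate=lambda g: g[0] ** 2 + rng.gauss(0, 0.05))
    ```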

    On-line PID tuning for engine idle-speed control using continuous action reinforcement learning automata

    PID systems are widely used to apply control without the need to obtain a dynamic model. However, the performance of controllers designed using standard on-line tuning methods, such as Ziegler-Nichols, can often be significantly improved. In this paper the tuning process is automated through the use of continuous action reinforcement learning automata (CARLA). These are used to simultaneously tune the parameters of a three-term controller on-line to minimise a performance objective. Here the method is demonstrated in the context of engine idle-speed control; the algorithm is first applied in simulation on a nominal engine model, and this is followed by a practical study using a Ford Zetec engine in a test cell. The CARLA provides marked performance benefits over a comparable Ziegler-Nichols tuned controller in this application.
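
    A minimal sketch of the CARLA mechanism for a single parameter (the paper runs automata over the three controller gains simultaneously); the grid representation, bump width, and reward scaling below are assumptions.

    ```python
    import numpy as np

    class Carla:
        """One continuous action reinforcement learning automaton: a
        probability density over a parameter range, reinforced around
        actions that earn high reward."""
        def __init__(self, lo, hi, bins=200, spread=0.05, rate=0.2):
            self.grid = np.linspace(lo, hi, bins)
            self.pdf = np.full(bins, 1.0 / (hi - lo))   # start uniform
            self.spread = spread * (hi - lo)
            self.rate = rate

        def sample(self):
            cdf = np.cumsum(self.pdf)
            return float(np.interp(np.random.rand(), cdf / cdf[-1],
                                   self.grid))

        def reinforce(self, action, reward):
            bump = np.exp(-0.5 * ((self.grid - action) / self.spread) ** 2)
            self.pdf += self.rate * reward * bump   # raise density nearby
            self.pdf /= self.pdf.sum() * (self.grid[1] - self.grid[0])

    # Toy stand-in for the idle-speed objective, optimum at gain k = 2.
    auto = Carla(0.0, 5.0)
    for _ in range(300):
        k = auto.sample()
        auto.reinforce(k, reward=max(0.0, 1.0 - (k - 2.0) ** 2))
    print(auto.grid[int(np.argmax(auto.pdf))])  # density peaks near 2
    ```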

    Hive on Spark and MapReduce: a methodology for parameter tuning

    Project Work presented as the partial requirement for obtaining a Master's degree in Information Management, specialization in Information Systems and Technologies Management. As the era of "big data" has arrived, more and more companies are starting to use distributed file systems to manage and process their data streams, such as the Hadoop distributed file system (HDFS). This software library offers a way to store large files across multiple machines, and large data sets are processed using its native programming model, MapReduce. Apache Spark is a relatively new alternative to Hadoop MapReduce that claims to offer a performance boost of up to 10 times for certain applications, while maintaining automatic fault tolerance. To leverage the data warehouse capabilities of Hadoop, Apache Hive was introduced. It is a Big Data analytics layer that works on top of Hadoop, provides data analysis tools and, most importantly, translates queries into MapReduce and Spark jobs. It thereby exploits the scalability of Hadoop and offers data exploration and mining capabilities to non-developers. However, it is difficult for users to exploit the full potential of the Apache Spark execution engine, which results in very long execution times. This project work therefore gives researchers and companies a tuning methodology that can significantly improve query execution time. Applied to a real-world batch-processing query, the methodology yielded a fivefold speed-up. Moreover, it gives insights into the underlying reasons for this improvement by using Apache Spark monitoring tools. The result can be helpful for many practitioners and researchers who would like to optimise the performance of Spark and MapReduce queries executed in Hive on top of an Apache Hadoop cluster.
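
    The abstract credits Apache Spark monitoring tools for explaining the speed-up; one lightweight way to do this kind of inspection is Spark's REST monitoring API, sketched below. The host, port, and the way stages are summarised here are assumptions, not the project's actual tooling.

    ```python
    import requests

    # Spark driver UI of the running query (port 4040 is Spark's default).
    DRIVER = "http://localhost:4040"

    app_id = requests.get(f"{DRIVER}/api/v1/applications").json()[0]["id"]
    stages = requests.get(
        f"{DRIVER}/api/v1/applications/{app_id}/stages").json()

    # Rank stages by executor run time to see where tuning pays off most.
    for s in sorted(stages, key=lambda s: s.get("executorRunTime", 0),
                    reverse=True)[:5]:
        print(s["stageId"], s["name"], s.get("executorRunTime", 0), "ms")
    ```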