4,490 research outputs found

A genetic algorithm-based job scheduling model for big data analytics

Author: A Konak
DY Chen
J Berlińska
J Dean
JJ Durillo
K Deb
Lei Zhang
Qinghua Lu
Shanshan Li
SM Thede
T Nykiel
Weishan Zhang
Publication venue: 'Springer Science and Business Media LLC'

A Survey on Automatic Parameter Tuning for Big Data Processing Systems

Author: Chen Yuxing
Herodotou Herodotos
Lu Jiaheng
Publication date: 01/04/2020

Big data processing systems (e.g., Hadoop, Spark, Storm) contain a vast number of configuration parameters controlling parallelism, I/O behavior, memory settings, and compression. Improper parameter settings can cause significant performance degradation and stability issues. However, regular users and even expert administrators struggle to understand and tune them to achieve good performance. We investigate existing approaches to parameter tuning for both batch and stream data processing systems and classify them into six categories: rule-based, cost modeling, simulation-based, experiment-driven, machine learning, and adaptive tuning. We summarize the pros and cons of each approach and raise some open research problems for automatic parameter tuning. Peer reviewed.

Ktisis

Helsingin yliopiston digitaalinen arkisto

Hadoop performance modeling and job optimization for big data analytics

Author: Khan Mukhtaj
Publication venue: Brunel University London
Publication date: 01/01/2015

This thesis was submitted for the award of Doctor of Philosophy and was awarded by Brunel University London. Big data has gained momentum in both academia and industry. The MapReduce model has emerged as a major computing model in support of big data analytics. Hadoop, an open source implementation of the MapReduce model, has been widely taken up by the community. Cloud service providers such as Amazon EC2 now support Hadoop user applications. However, a key challenge is that cloud service providers do not have a resource provisioning mechanism to satisfy user jobs with deadline requirements. Currently, it is solely the user's responsibility to estimate the required amount of resources for a job running in a public cloud. This thesis presents a Hadoop performance model that accurately estimates the execution duration of a job and further provisions the required amount of resources for the job to be completed within a deadline. The proposed model employs a Locally Weighted Linear Regression (LWLR) model to estimate the execution time of a job and the Lagrange multiplier technique for resource provisioning to satisfy a user job with a given deadline. The performance of the proposed model is extensively evaluated on both an in-house Hadoop cluster and the Amazon EC2 cloud. Experimental results show that the proposed model is highly accurate in job execution estimation and that jobs are completed within the required deadlines when following the resource provisioning scheme of the proposed model. In addition, the Hadoop framework has over 190 configuration parameters, some of which have significant effects on the performance of a Hadoop job. Manually setting the optimum values for these parameters is a challenging and time-consuming task. This thesis presents optimization work that enhances the performance of Hadoop by automatically tuning its parameter values.
It employs the Gene Expression Programming (GEP) technique to build an objective function that represents the performance of a job and the correlation among the configuration parameters. For optimization, Particle Swarm Optimization (PSO) is employed to automatically find optimal or near-optimal configuration settings. The performance of the proposed work is intensively evaluated on a Hadoop cluster, and the experimental results show that it enhances the performance of Hadoop significantly compared with the default settings. Abdul Wali Khan University Marda
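
The abstract above names Locally Weighted Linear Regression (LWLR) as the estimator for job execution time but gives no formulation. As a rough illustration of the general technique only, not the thesis's actual model, the sketch below fits a weighted straight line around each query point for a single input feature. The feature (input size), the Gaussian kernel, the bandwidth `tau`, and the sample numbers are all assumptions made for this example.

```python
import math


def lwlr_predict(x_query, xs, ys, tau=1.0):
    """Locally weighted linear regression for one input feature.

    Fits y = intercept + slope * x by weighted least squares, where each
    training sample is weighted by a Gaussian kernel centred on x_query.
    `tau` (kernel bandwidth) is an illustrative assumption.
    """
    # Samples close to the query point get weights near 1, distant ones near 0.
    ws = [math.exp(-((x - x_query) ** 2) / (2.0 * tau ** 2)) for x in xs]

    # Weighted sums needed by the closed-form solution of the normal equations.
    sw = sum(ws)
    sx = sum(w * x for w, x in zip(ws, xs))
    sy = sum(w * y for w, y in zip(ws, ys))
    sxx = sum(w * x * x for w, x in zip(ws, xs))
    sxy = sum(w * x * y for w, x, y in zip(ws, xs, ys))

    denom = sw * sxx - sx * sx
    if abs(denom) < 1e-12:
        # Degenerate local fit: fall back to the weighted mean of the targets.
        return sy / sw

    slope = (sw * sxy - sx * sy) / denom
    intercept = (sy - slope * sx) / sw
    return intercept + slope * x_query


# Hypothetical history: input size in GB -> observed job run time in seconds.
sizes_gb = [1.0, 2.0, 3.0, 4.0, 5.0]
runtimes_s = [3.0, 5.0, 7.0, 9.0, 11.0]
estimate = lwlr_predict(2.5, sizes_gb, runtimes_s, tau=2.0)
```

Because the fit is recomputed for every query, a smaller `tau` follows local behaviour more closely, while a larger `tau` approaches ordinary linear regression.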

Brunel University Research Archive

An autonomous system for maintenance scheduling data-rich complex infrastructure: Fusing the railways’ condition, planning and cost

Author: Amorim-Melo
Andrew Starr
Andrews
Antonios Tsourdos
Ashutosh Tiwari
Bernardi
Bevilacqua
Bouillaut
Christopher J. Turner
Christos Emmanouilidis
Deb
DeLone
DeLone
Dhillon
Dote
Durazo-Cardenas
Essam Shehab
Esteban
García Márquez
Guler
Guler
Hall
Hall
Hess
Isidro Durazo-Cardenas
Jabri
Jovanović
Khaleghi
Kirkwood
Klein
Knowles
Le
Leigh Kirkwood
Li
Lidén
Lidén
Liggins
Magee
Maurizio Bevilacqua
Morant
Niu
Nunez
Nunez
Nyström
Patra
Paul Baguley
Popovic
Provost
Raheja
Santos
Scholz
Selby
Sinha
Su
Thaduri
Turner
Waltz
Wang
Yuchun Xu
Zhang
Zhao
Publication venue: 'Elsevier BV'
Publication date: 22/02/2018

National railways are typically large and complex systems. Their network infrastructure usually includes extended track sections, bridges, stations and other supporting assets. In recent years, railways have also become a data-rich environment. Railway infrastructure assets have a very long life, but inherently degrade. Interventions are necessary but they can cause lateness, damage and hazards. Every day, thousands of discrete maintenance jobs are scheduled according to time and urgency. Service disruption has a direct economic impact. Planning for maintenance can be complex, expensive and uncertain. Autonomous scheduling of maintenance jobs is essential. The design strategy of a novel integrated system for automatic job scheduling is presented, from concept formulation to the examination of the data-to-information transitional level interface, and at the decision-making level. The underlying architecture configures high-level fusion of technical and business drivers, scheduling optimized intervention plans that factor in cost impact and added value. A proof-of-concept demonstrator was developed to validate the system principle and to test algorithm functionality. It employs a dashboard for visualization of the system response and to present key information. Real track incident and inspection datasets were analyzed to raise degradation alarms that initiate the automatic scheduling of maintenance tasks. Optimum scheduling was realized through data analytics and job sequencing heuristic and genetic algorithms, taking into account specific cost and value inputs from comprehensive task cost modelling. Formal face validation was conducted with railway infrastructure specialists and stakeholders. The demonstrator structure was found fit for purpose with logical component relationships, offering further scope for research and commercial exploitation.
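
The abstract above mentions job sequencing with genetic algorithms but does not describe the encoding or the cost function. The following is a minimal sketch of a permutation-based genetic algorithm under stated assumptions, not the paper's method: jobs run back to back on a single resource, and the cost is the total weighted completion time. The function names, durations, weights, and GA settings are all hypothetical.

```python
import random


def total_weighted_completion(order, durations, weights):
    """Sum of weight * completion time for jobs run back to back in `order`."""
    clock = 0.0
    cost = 0.0
    for job in order:
        clock += durations[job]
        cost += weights[job] * clock
    return cost


def order_crossover(parent_a, parent_b, rng):
    """Keep a slice of parent_a in place; fill the rest in parent_b's order."""
    n = len(parent_a)
    lo, hi = sorted(rng.sample(range(n + 1), 2))
    kept = parent_a[lo:hi]
    kept_set = set(kept)
    rest = [job for job in parent_b if job not in kept_set]
    return rest[:lo] + kept + rest[lo:]


def swap_mutation(order, rng, rate=0.2):
    """With probability `rate`, swap two randomly chosen positions."""
    order = list(order)
    if rng.random() < rate:
        i, j = rng.sample(range(len(order)), 2)
        order[i], order[j] = order[j], order[i]
    return order


def genetic_sequence(durations, weights, pop_size=30, generations=60, seed=0):
    """Search for a low-cost job order; returns the best permutation found."""
    rng = random.Random(seed)
    n = len(durations)

    def cost(order):
        return total_weighted_completion(order, durations, weights)

    # Start from the identity order plus random permutations.
    population = [list(range(n))]
    while len(population) < pop_size:
        population.append(rng.sample(range(n), n))

    for _ in range(generations):
        population.sort(key=cost)
        elite = population[: pop_size // 2]  # elitism: the best half survives
        children = []
        while len(elite) + len(children) < pop_size:
            a, b = rng.sample(elite, 2)
            children.append(swap_mutation(order_crossover(a, b, rng), rng))
        population = elite + children

    return min(population, key=cost)


# Hypothetical maintenance jobs: duration in hours and urgency weight.
job_hours = [4.0, 2.0, 6.0, 1.0, 3.0]
job_urgency = [1.0, 3.0, 2.0, 5.0, 1.0]
best_order = genetic_sequence(job_hours, job_urgency)
```

Because the best half of each generation is carried over unchanged, the returned order is never worse than the initial identity order; a real system would replace the cost function with one built from its own cost and value inputs.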

Crossref

University of Surrey

Aston Publications Explorer

Cranfield CERES

White Rose Research Online

Surrey Research Insight

DLCD-CCE: A Local Community Detection Algorithm for Complex IoT Networks

Author: Hu Nan
Palmieri Francesco
Pandey Hari Mohan
Ray Jeffrey
Trovati Marcello
Xu Xiaolong
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2020

Edge Hill University Research Information Repository

Archivio della Ricerca - Università di Salerno

Cloud computing resource scheduling and a survey of its evolutionary approaches

Author: Chung Henry Shu-Hung
Gong Yue-Jiao
Li Yun
Liu Xiao-Fang
Zhan Zhi-Hui
Zhang Jun
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/07/2015

A disruptive technology fundamentally transforming the way that computing services are delivered, cloud computing offers information and communication technology users a new dimension of convenience of resources, as services via the Internet. Because the cloud provides a finite pool of virtualized on-demand resources, optimally scheduling them has become an essential and rewarding topic, where a trend of using Evolutionary Computation (EC) algorithms is emerging rapidly. Through analyzing the cloud computing architecture, this survey first presents a taxonomy at two levels of scheduling cloud resources. It then paints a landscape of the scheduling problem and solutions. According to the taxonomy, a comprehensive survey of state-of-the-art approaches is presented systematically. Looking forward, challenges and potential future research directions are investigated and invited, including real-time scheduling, adaptive dynamic scheduling, large-scale scheduling, multiobjective scheduling, and distributed and parallel scheduling. At the dawn of Industry 4.0, cloud computing scheduling for cyber-physical integration with the presence of big data is also discussed. Research in this area is only in its infancy, but with the rapid fusion of information and data technology, more exciting and agenda-setting topics are likely to emerge on the horizon.

Crossref

Enlighten