77 research outputs found

Overview of Caching Mechanisms to Improve Hadoop Performance

Author: Down Douglas G.
Ghazali Rana
Publication date: 23/10/2023

In today's distributed computing environments, large amounts of data are generated from different resources at high velocity, rendering the data difficult to capture, manage, and process within existing relational databases. Hadoop is a tool to store and process large datasets in a parallel manner across a cluster of machines in a distributed environment. Hadoop brings many benefits like flexibility, scalability, and high fault tolerance; however, it faces some challenges in terms of data access time, I/O operations, and duplicate computations, resulting in extra overhead, resource wastage, and poor performance. Many researchers have utilized caching mechanisms to tackle these challenges. For example, they have presented approaches to improve data access time, enhance data locality rate, remove repetitive calculations, reduce the number of I/O operations, decrease the job execution time, and increase resource efficiency. In the current study, we provide a comprehensive overview of caching strategies to improve Hadoop performance. Additionally, a novel classification is introduced based on cache utilization. Using this classification, we analyze the impact on Hadoop performance and discuss the advantages and disadvantages of each group. Finally, a novel hybrid approach called Hybrid Intelligent Cache (HIC) that combines the benefits of two methods from different groups, H-SVM-LRU and CLQLMRS, is presented. Experimental results show that our hybrid method achieves an average improvement of 31.2% in job execution time.

arXiv.org e-Print Archive

SEH: Size Estimate Hedging for Single-Server Queues

Author: Akbari-Moghaddam Maryam
Down Douglas G.
Publication date: 27/06/2021

For a single-server system, Shortest Remaining Processing Time (SRPT) is an optimal size-based policy. In this paper, we discuss scheduling a single-server system when exact information about the jobs' processing times is not available. When the SRPT policy uses estimated processing times, the underestimation of large jobs can significantly degrade performance. We propose a simple heuristic, Size Estimate Hedging (SEH), that only uses jobs' estimated processing times for scheduling decisions. A job's priority is increased dynamically according to an SRPT rule until it is determined that it is underestimated, at which time the priority is frozen. Numerical results suggest that SEH has desirable performance when estimation errors are not unreasonably large.
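The underestimation problem described in this abstract can be illustrated with a minimal sketch. It shows only plain SRPT applied to estimated sizes, not the SEH heuristic itself, whose exact priority rule is not given here. The `Job` class and `srpt_pick` function are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class Job:
    """Hypothetical job record; the scheduler only sees the estimate."""
    name: str
    estimate: float          # estimated processing time
    attained: float = 0.0    # service received so far

    @property
    def estimated_remaining(self) -> float:
        # Goes negative once a job has run longer than its estimate,
        # i.e. once the job is known to have been underestimated.
        return self.estimate - self.attained


def srpt_pick(jobs):
    """SRPT on estimates: serve the job with the smallest estimated remaining time."""
    return min(jobs, key=lambda j: j.estimated_remaining)


# A large job that was estimated at 2 time units has already received 3.
big = Job("big", estimate=2.0, attained=3.0)
# A genuinely small job arrives.
small = Job("small", estimate=1.0)

# The underestimated job still wins, so the small job is blocked behind it.
print(srpt_pick([big, small]).name)  # big
```

Because the estimated remaining time of an underestimated job keeps decreasing, that job outranks every later arrival. The abstract states that SEH stops raising a job's priority once underestimation is detected; the value at which it is frozen is not stated here and is left out of the sketch.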

arXiv.org e-Print Archive

Linearized Data Center Workload and Cooling Management

Author: Down Douglas G.
Karakostas George
Rostami Somayye
Publication date: 10/04/2023

With the current high levels of energy consumption of data centers, reducing power consumption by even a small percentage is beneficial. We propose a framework for thermal-aware workload distribution in a data center to reduce cooling power consumption. The framework includes a linearization of the general optimization problem and a heuristic to approximate the solution of the resulting Integer Linear Programming (ILP) problems. We first define a general nonlinear power optimization problem including several cooling parameters, heat recirculation effects, and constraints on server temperatures. We propose to study a linearized version of the problem, which is easier to analyze. As an energy saving scenario and as a proof of concept for our approach, we also consider the possibility that the red-line temperature for idle servers is higher than that for busy servers. For the resulting ILP problem, we propose a heuristic for intelligent rounding of the fractional solution. Through numerical simulations, we compare our heuristics with two baseline algorithms. We also evaluate the performance of the solution of the linearized system on the original system. The results show that the proposed approach can reduce the cooling power consumption by more than 30 percent compared to the case of continuous utilizations and a single red-line temperature.

arXiv.org e-Print Archive

Maximizing throughput in zero-buffer tandem lines with dedicated and flexible servers

Author: Douglas G Down
Mohammad H Yarmand
Publication date: 24/04/2020

For tandem queues with no buffer spaces and both dedicated and flexible servers, we study how flexible servers should be assigned to maximize the throughput. When there is one flexible server and two stations each with a dedicated server, we completely characterize the optimal policy. We use the insights gained from applying the Policy Iteration algorithm on systems with three, four, and five stations to devise heuristics for systems of arbitrary size. These heuristics are verified by numerical analysis. We also discuss the throughput improvement when, for a given server assignment, dedicated servers are changed to flexible servers.

CiteSeerX

Hadoop-Oriented SVM-LRU (H-SVM-LRU): An Intelligent Cache Replacement Algorithm to Improve MapReduce Performance

Author: Adabi Sahar
Down Douglas G.
Ghazali Rana
Movaghar Ali
Rezaee Ali
Publication date: 28/09/2023

Modern applications can generate a large amount of data from different sources with high velocity, a combination that is difficult to store and process via traditional tools. Hadoop is one framework that is used for the parallel processing of a large amount of data in a distributed environment; however, various challenges can lead to poor performance. Two particular issues that can limit performance are the high access time for I/O operations and the recomputation of intermediate data. The combination of these two issues can result in resource wastage. In recent years, there have been attempts to overcome these problems by using caching mechanisms. Due to cache space limitations, it is crucial to use this space efficiently and avoid cache pollution (the cache holding data that is not used in the future). We propose Hadoop-oriented SVM-LRU (H-SVM-LRU) to improve Hadoop performance. For this purpose, we use an intelligent cache replacement algorithm, SVM-LRU, that combines the well-known LRU mechanism with a machine learning algorithm, SVM, to classify cached data into two groups based on their future usage. Experimental results show a significant decrease in execution time as a result of an increased cache hit ratio, leading to a positive impact on Hadoop performance.
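One way an LRU cache might consult a two-class prediction when choosing a victim is sketched below. This is an illustrative assumption, not the authors' H-SVM-LRU algorithm: the `ClassifiedLRU` class and the `will_reuse` predicate are hypothetical, and the predicate merely stands in for a trained classifier such as an SVM.

```python
from collections import OrderedDict
from typing import Any, Callable, Hashable, Optional


class ClassifiedLRU:
    """LRU cache that first evicts entries predicted not to be reused (sketch)."""

    def __init__(self, capacity: int, will_reuse: Callable[[Hashable], bool]):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        # Hypothetical stand-in for a trained classifier's prediction.
        self.will_reuse = will_reuse
        # Keys are kept in recency order, least recently used first.
        self.items: "OrderedDict[Hashable, Any]" = OrderedDict()

    def get(self, key: Hashable) -> Optional[Any]:
        if key not in self.items:
            return None
        self.items.move_to_end(key)  # mark as most recently used
        return self.items[key]

    def put(self, key: Hashable, value: Any) -> None:
        if key in self.items:
            self.items.move_to_end(key)
        elif len(self.items) >= self.capacity:
            self._evict()
        self.items[key] = value

    def _evict(self) -> None:
        # Oldest entry predicted as "not reused", if there is one.
        victim = next((k for k in self.items if not self.will_reuse(k)), None)
        if victim is not None:
            del self.items[victim]
        else:
            self.items.popitem(last=False)  # fall back to plain LRU


cache = ClassifiedLRU(2, will_reuse=lambda k: str(k).startswith("hot"))
cache.put("hot1", 1)
cache.put("cold1", 2)
cache.put("hot2", 3)      # evicts "cold1", although "hot1" is older
print(list(cache.items))  # ['hot1', 'hot2']
```

Plain LRU would have evicted `hot1` in this example. Keeping it is the kind of cache-pollution avoidance the abstract describes, under the assumption that the prediction is correct.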

arXiv.org e-Print Archive

On Accommodating Customer Flexibility in Service Systems

Author: Douglas G. Down
He Y.-T.
Reiman M.I.
Yu-Tong He
Publication venue: 'University of Toronto Press Inc. (UTPress)'

Crossref

Dynamic control of a single-server system with abandonments

Author: A.R. Ward
C. Buyukkoc
Douglas G. Down
F. Iravani
Ger Koole
J. Harrison
J.-P. Gayon
M. Armony
M. Puterman
Mark E. Lewis
N. Gans
N.T. Argon
P. Nain
R. Serfozo
S. Lippman
T. Prieto-Rumeau
X. Guo
Z. Aksin
Publication date: 01/01/2011

In this paper, we discuss dynamic server control in a two-class service system with abandonments. Two models are considered. In the first case, rewards are received upon service completion, and there are no abandonment costs (other than the lost opportunity to gain rewards). In the second, holding costs per customer per unit time are accrued, and each abandonment involves a fixed cost. Both cases are considered under the discounted or average reward/cost criterion. These are extensions of the classic scheduling question (without abandonments) where it is well known that simple priority rules hold. The contributions in this paper are twofold. First, we show that the classic c-μ rule does not hold in general. An added condition on the ordering of the abandonment rates is sufficient to recover the priority rule. Counterexamples show that this condition is not necessary, but when it is violated, significant loss can occur. In the reward case, we show that the decision involves an intuitive tradeoff between getting more rewards and avoiding idling. Second, we note that traditional solution techniques are not directly applicable. Since customers may leave in between services, an interchange argument cannot be applied. Since the abandonment rates are unbounded, we cannot apply uniformization and thus cannot use the usual discrete-time Markov decision process techniques. After formulating the problem as a continuous-time Markov decision process (CTMDP), we use sample path arguments in the reward case and a savvy use of truncation in the holding cost case to yield the results. As far as we know, this is the first time that either has been used in conjunction with the CTMDP to show structure in a queueing control problem. The insights made in each model are supported by a detailed numerical study. © 2010 Springer Science+Business Media, LLC
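For reference, the classic c-μ rule mentioned in this abstract is commonly stated as the index policy below, for the setting without abandonments. The notation is assumed here, not taken from the paper, and the additional condition on the abandonment rates is not reproduced because the abstract does not state it.

```latex
% Assumed notation (not from the paper):
%   c_i   = holding cost rate of class i
%   \mu_i = service rate of class i
%   x_i   = number of class-i customers currently present
% Classic c-mu rule: serve a nonempty class with the largest index c_i \mu_i.
i^{*} \in \arg\max_{i \,:\, x_i > 0} \; c_i \mu_i
```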

Crossref

VU Research Portal

Improved imputation of low-frequency and rare variants using the UK10K haplotype reference panel

Author: A. Reghan Foley
Aarno Palotie
Aaron Day-Williams
Aaron Isaacs
Adam Shaw
Alastair Kent
Alexandros Onoufriadis
Alireza Moayyeri
Allan Daly
Ana M. Valdes
Andrew G. McKechanie
Andrew M. McIntosh
Andrew McQuillin
Anette Varbo
Angela Matchan
Anja Kolb-Kokocinski
Anne Tybjaerg-Hansen
Anthony M. Vandersteen
Antoinette Amuzu
Antonio Ciampi
Audrey E. Hendricks
Beate St Pourcain
Bryan Howie
Børge G. Nordestgaard
Carl A. Anderson
Carol Smee
Catherine Cosgrove
Celia Greenwood
ChangJiang Xu
Cheryl K. Ridout
Chris Boustred
Chris Joyce
Christopher S. Franklin
Claudia Langenberg
Cordelia Langford
Cornelia M. van Duijn
Crispian Wilson
Daniel G. MacArthur
Daniel Geschwind
Daniel Lawson
Daniela Toniolo
David A. Collier
David B. Savage
David Curtis
David K. Jackson
David M. Evans
David R. Fitzpatrick
David Skuse
David St Clair
Dawn Muddyman
Deborah Hart
Detelina Grozeva
Dinu Antony
Douglas Blackwood
Eleanor Wheeler
Eleftheria Zeggini
Elena Bochukova
Elisabeth M. van Leeuwen
Elisabetta Trabetti
Elizabeth Stevens
Eva Serra
Ewan Birney
Farrah Khawaja
Felicity Payne
Feng Zhang
Francesco Muntoni
Gabriela Surdulescu
Gail Clement
Gaëlle Marenne
Genevieve Lachance
George Davey Smith
George Dedoussis
Gerome Breen
Gianluigi Zaza
Giovanni Gambaro
Giovanni Malerba
Graham R. S. Ritchie
Guangbiao Wang
Guy Coates
Hannah M. Mitchison
Hashem A. Shihab
Heather Griffin
Hong Lin
Hou-Feng Zheng
Hugh Gurling
Hywel J. Williams
I. Sadaf Farooqi
Iain Mathieson
Ian Dunham
Ian N. M. Day
Inês Barroso
Ioanna Tachmazidou
Irene Lee
J. Brent Richards
Jaana Suvisaari
James Floyd
James Morris
James T. R. Walters
Jamie Bentham
Jane Kaye
Jaspal S. Kooner
Jeffrey C. Barrett
Jeremy R. Parr
Jian Yang
Jian'an Luan
Jianping Sun
Jie Huang
Jieqin Liang
Jim Stalker
Jing Tian
John C. Chambers
John Maslen
John P. Kemp
John R. B. Perry
Jon Johnson
Jonathan Marchini
Josine L. Min
Jouko Lönnqvist
Juan Pablo Casas
Julia Keogh
Jun Wang
Kalliope Panoutsopoulou
Karen Kennedy
Karim Oualkacha
Karola Rehnström
Kate Northstone
Kathleen A. Williamson
Keren Carss
Kerrin S. Small
Kim Wong
Kirsten Ward
Klaudia Walter
Konrad J. Karczewski
Krishna Chatterjee
Lavinia Paternoster
Liren Huang
Lorraine Southam
Louise Gallagher
Louise V. Wain
Lu Chen
Lucy Crooks
Lucy Raymond
Luis R. Lopes
Lydia Quaye
Marcus E. Kleber
Margarida Lopes
Margriet van Kogelenberg
Marianne Benn
Marta Futema
Martin Bobrow
Martin D. Tobin
María Soler Artigas
Massimiliano Cocca
Massimo Mangino
Matthew E. Hurles
Matthias Geihs
Mattia Calissano
Michael A. Quail
Michael C. O'Donovan
Michael J. Owen
Michela Traglia
Miriam Schmidts
Monkol Lek
Muhammad Ayub
Nadia Schoenmakers
Nicholas J. Timpson
Nick Craddock
Nicola Migone
Nicola Roberts
Nicole Soranzo
Olivera Spasic-Boskovic
Olli Pietilainen
Paolo Gasparini
Parthiban Vijayarangakannan
Patrick F. Bolton
Paul Flicek
Peter Clapham
Peter Ellis
Peter Holmans
Peter M. Visscher
Peter McGuffin
Peter Scambler
Peter Whincup
Petr Danecek
Petros Syrris
Phil Beales
Pingbo Zhang
Pirro Hysi
Rachel L. Robinson
Rebecca Bounds
Rebecca C. Pollitt
Richard Anney
Richard Durbin
Richard H. Scott
Richard Morris
Robert A. Scott
Robert K. Semple
Rohan Taylor
Rosemary Ekong
Rui Li
Ruth Charlton
Ryan Liu
Saeed Al Turki
Sally I. Sharp
Sarah Curran
Sarah Edkins
Sarah Metrustry
Scott G. Wilson
Sebahattin Cirak
Senduran Bala
Shane McCarthy
Shoumo Bhattacharya
So-Youn Shin
Stephan Schiffels
Stephen O'Rahilly
Steve E. Humphries
Stewart J. Payne
Sue Povey
Susan Ring
Tamieka Whyte
Thomas Down
Thomas Keane
Tiina Paunio
Tim Hubbard
Timothy D. Spector
Tom R. Gaunt
Tony Cox
Valentina Iotchkova
Victoria Parker
Vincent Plagnol
Weihua Zhang
Winfried März
Xiaosen Guo
Xueqin Guo
Yalda Jamshidi
Yasin Memari
Yingrui Li
Yu Wang
Yuanping Du
Publication venue: NATURE PUBLISHING GROUP
Publication date: 01/01/2015

Imputing genotypes from reference panels created by whole-genome sequencing (WGS) provides a cost-effective strategy for augmenting the single-nucleotide polymorphism (SNP) content of genome-wide arrays. The UK10K Cohorts project has generated a data set of 3,781 whole genomes sequenced at low depth (average 7x), aiming to exhaustively characterize genetic variation down to 0.1% minor allele frequency in the British population. Here we demonstrate the value of this resource for improving imputation accuracy at rare and low-frequency variants in both a UK and an Italian population. We show that large increases in imputation accuracy can be achieved by re-phasing WGS reference panels after initial genotype calling. We also present a method for combining WGS panels to improve variant coverage and downstream imputation accuracy, which we illustrate by integrating 7,562 WGS haplotypes from the UK10K project with 2,184 haplotypes from the 1000 Genomes Project. Finally, we introduce a novel approximation that maintains speed without sacrificing imputation accuracy for rare variants.