Overview of Caching Mechanisms to Improve Hadoop Performance
In today's distributed computing environments, large amounts of data are
generated from different resources with a high velocity, rendering the data
difficult to capture, manage, and process within existing relational databases.
Hadoop is a tool to store and process large datasets in a parallel manner
across a cluster of machines in a distributed environment. Hadoop brings many
benefits like flexibility, scalability, and high fault tolerance; however, it
faces some challenges in terms of data access time, I/O operation, and
duplicate computations resulting in extra overhead, resource wastage, and poor
performance. Many researchers have utilized caching mechanisms to tackle these
challenges. For example, they have presented approaches to improve data access
time, enhance data locality rate, remove repetitive calculations, reduce the
number of I/O operations, decrease the job execution time, and increase
resource efficiency. In the current study, we provide a comprehensive overview
of caching strategies to improve Hadoop performance. Additionally, a novel
classification is introduced based on cache utilization. Using this
classification, we analyze the impact on Hadoop performance and discuss the
advantages and disadvantages of each group. Finally, a novel hybrid approach
called Hybrid Intelligent Cache (HIC) that combines the benefits of two methods
from different groups, H-SVM-LRU and CLQLMRS, is presented. Experimental
results show that our hybrid method achieves an average improvement of 31.2% in
job execution time.
SEH: Size Estimate Hedging for Single-Server Queues
For a single server system, Shortest Remaining Processing Time (SRPT) is an
optimal size-based policy. In this paper, we discuss scheduling a single-server
system when exact information about the jobs' processing times is not
available. When the SRPT policy uses estimated processing times, the
underestimation of large jobs can significantly degrade performance. We propose
a simple heuristic, Size Estimate Hedging (SEH), that only uses jobs' estimated
processing times for scheduling decisions. A job's priority is increased
dynamically according to an SRPT rule until it is determined that it is
underestimated, at which time the priority is frozen. Numerical results suggest
that SEH has desirable performance when estimation errors are not unreasonably
large.
Linearized Data Center Workload and Cooling Management
With the current high levels of energy consumption of data centers, reducing
power consumption by even a small percentage is beneficial. We propose a
framework for thermal-aware workload distribution in a data center to reduce
cooling power consumption. The framework includes a linearization of the general
optimization problem and a heuristic to approximate the solution of
the resulting Integer Linear Programming (ILP) problems. We first define a
general nonlinear power optimization problem including several cooling
parameters, heat recirculation effects, and constraints on server temperatures.
We propose to study a linearized version of the problem, which is easier to
analyze. As an energy saving scenario and as a proof of concept for our
approach, we also consider the possibility that the red-line temperature for
idle servers is higher than that for busy servers. For the resulting ILP
problem, we propose a heuristic for intelligent rounding of the fractional
solution. Through numerical simulations, we compare our heuristics with two
baseline algorithms. We also evaluate the performance of the solution of the
linearized system on the original system. The results show that the proposed
approach can reduce the cooling power consumption by more than 30 percent
compared to the case of continuous utilizations and a single red-line
temperature.
Maximizing throughput in zero-buffer tandem lines with dedicated and flexible servers
For tandem queues with no buffer spaces and both dedicated and flexible servers, we study how flexible servers should be assigned to maximize the throughput. When there is one flexible server and two stations each with a dedicated server, we completely characterize the optimal policy. We use the insights gained from applying the Policy Iteration algorithm on systems with three, four, and five stations to devise heuristics for systems of arbitrary size. These heuristics are verified by numerical analysis. We also discuss the throughput improvement when, for a given server assignment, dedicated servers are changed to flexible servers.
Hadoop-Oriented SVM-LRU (H-SVM-LRU): An Intelligent Cache Replacement Algorithm to Improve MapReduce Performance
Modern applications can generate a large amount of data from different
sources with high velocity, a combination that is difficult to store and
process via traditional tools. Hadoop is one framework that is used for the
parallel processing of a large amount of data in a distributed environment;
however, various challenges can lead to poor performance. Two particular issues
that can limit performance are the high access time for I/O operations and the
recomputation of intermediate data. The combination of these two issues can
result in resource wastage. In recent years, there have been attempts to
overcome these problems by using caching mechanisms. Due to cache space
limitations, it is crucial to use this space efficiently and avoid cache
pollution (caching data that will not be used in the future). We propose
Hadoop-oriented SVM-LRU (H-SVM-LRU) to improve Hadoop performance. For this
purpose, we use an intelligent cache replacement algorithm, SVM-LRU, that
combines the well-known LRU mechanism with a machine learning algorithm, SVM,
to classify cached data into two groups based on their future usage.
Experimental results show a significant decrease in execution time as a result
of an increased cache hit ratio, leading to a positive impact on Hadoop
performance.
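As a rough illustration of the idea described in this abstract, the following Python sketch shows an LRU cache whose placement decisions are guided by a binary classifier. It is not the authors' implementation: the class name `ClassifierGuidedLRU`, the `will_be_reused` callback (a stand-in for a trained SVM), and the rule of placing items predicted as "not reused" at the eviction end are all assumptions made for illustration.

```python
from collections import OrderedDict
from typing import Any, Callable, Hashable, Optional


class ClassifierGuidedLRU:
    """LRU cache whose ordering is adjusted by a reuse predictor (hypothetical sketch)."""

    def __init__(self, capacity: int, will_be_reused: Callable[[Hashable], bool]):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        # Stand-in for a trained classifier such as an SVM (assumption).
        self.will_be_reused = will_be_reused
        # Front of the dict = next eviction victim, back = most protected.
        self._items: "OrderedDict[Hashable, Any]" = OrderedDict()

    def _place(self, key: Hashable) -> None:
        # Predicted-reused items go to the protected end; others go to the
        # eviction end so that they leave the cache first.
        self._items.move_to_end(key, last=self.will_be_reused(key))

    def get(self, key: Hashable) -> Optional[Any]:
        if key not in self._items:
            return None  # cache miss
        self._place(key)
        return self._items[key]

    def put(self, key: Hashable, value: Any) -> None:
        if key not in self._items and len(self._items) >= self.capacity:
            self._items.popitem(last=False)  # evict from the front
        self._items[key] = value
        self._place(key)


# Usage: keys starting with "tmp" are predicted not to be reused.
cache = ClassifierGuidedLRU(2, lambda k: not str(k).startswith("tmp"))
cache.put("a", 1)
cache.put("tmp1", 2)
cache.put("b", 3)  # evicts "tmp1", whereas plain LRU would evict "a"
```

With a plain LRU policy, the oldest entry would be evicted regardless of its predicted usefulness; here the predictor's output decides which end of the queue an item occupies, which is one simple way to limit cache pollution.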
Dynamic control of a single-server system with abandonments
In this paper, we discuss the dynamic server control in a two-class service system with abandonments. Two models are considered. In the first case, rewards are received upon service completion, and there are no abandonment costs (other than the lost opportunity to gain rewards). In the second, holding costs per customer per unit time are accrued, and each abandonment involves a fixed cost. Both cases are considered under the discounted or average reward/cost criterion. These are extensions of the classic scheduling question (without abandonments) where it is well known that simple priority rules hold. The contributions in this paper are twofold. First, we show that the classic c-μ rule does not hold in general. An added condition on the ordering of the abandonment rates is sufficient to recover the priority rule. Counterexamples show that this condition is not necessary, but when it is violated, significant loss can occur. In the reward case, we show that the decision involves an intuitive tradeoff between getting more rewards and avoiding idling. Second, we note that traditional solution techniques are not directly applicable. Since customers may leave in between services, an interchange argument cannot be applied. Since the abandonment rates are unbounded, we cannot apply uniformization, and thus cannot use the usual discrete-time Markov decision process techniques. After formulating the problem as a continuous-time Markov decision process (CTMDP), we use sample path arguments in the reward case and a savvy use of truncation in the holding cost case to yield the results. As far as we know, this is the first time that either has been used in conjunction with the CTMDP to show structure in a queueing control problem. The insights made in each model are supported by a detailed numerical study. © 2010 Springer Science+Business Media, LLC
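For readers unfamiliar with the classic c-μ rule mentioned in this abstract, the following Python sketch shows the rule in its standard form without abandonments: serve the class with the largest product of holding cost rate c and service rate μ. The function name `c_mu_priority` and the example numbers are hypothetical, and the sketch does not model abandonments or the additional ordering condition the paper studies.

```python
from typing import Dict, List, Tuple


def c_mu_priority(classes: Dict[str, Tuple[float, float]]) -> List[str]:
    """Return class names in descending order of the index c * mu.

    `classes` maps a class name to (holding cost rate c, service rate mu).
    """
    return sorted(
        classes,
        key=lambda name: classes[name][0] * classes[name][1],
        reverse=True,
    )


# Hypothetical two-class example: indices are 2.0 * 1.0 = 2.0 and 1.0 * 3.0 = 3.0.
order = c_mu_priority({"A": (2.0, 1.0), "B": (1.0, 3.0)})
```

In this example, class B is served first even though class A has the higher holding cost, because B's faster service rate gives it the larger index.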
Improved imputation of low-frequency and rare variants using the UK10K haplotype reference panel
Imputing genotypes from reference panels created by whole-genome sequencing (WGS) provides a cost-effective strategy for augmenting the single-nucleotide polymorphism (SNP) content of genome-wide arrays. The UK10K Cohorts project has generated a data set of 3,781 whole genomes sequenced at low depth (average 7x), aiming to exhaustively characterize genetic variation down to 0.1% minor allele frequency in the British population. Here we demonstrate the value of this resource for improving imputation accuracy at rare and low-frequency variants in both a UK and an Italian population. We show that large increases in imputation accuracy can be achieved by re-phasing WGS reference panels after initial genotype calling. We also present a method for combining WGS panels to improve variant coverage and downstream imputation accuracy, which we illustrate by integrating 7,562 WGS haplotypes from the UK10K project with 2,184 haplotypes from the 1000 Genomes Project. Finally, we introduce a novel approximation that maintains speed without sacrificing imputation accuracy for rare variants.