77 research outputs found

    Overview of Caching Mechanisms to Improve Hadoop Performance

    In today's distributed computing environments, large amounts of data are generated from different sources at high velocity, making the data difficult to capture, manage, and process within existing relational databases. Hadoop is a tool to store and process large datasets in parallel across a cluster of machines in a distributed environment. Hadoop brings many benefits, such as flexibility, scalability, and high fault tolerance; however, it faces challenges in terms of data access time, I/O operations, and duplicate computations, which result in extra overhead, resource wastage, and poor performance. Many researchers have used caching mechanisms to tackle these challenges. For example, they have presented approaches to improve data access time, enhance the data locality rate, remove repetitive calculations, reduce the number of I/O operations, decrease job execution time, and increase resource efficiency. In the current study, we provide a comprehensive overview of caching strategies to improve Hadoop performance. Additionally, a novel classification is introduced based on cache utilization. Using this classification, we analyze the impact on Hadoop performance and discuss the advantages and disadvantages of each group. Finally, a novel hybrid approach called Hybrid Intelligent Cache (HIC), which combines the benefits of two methods from different groups, H-SVM-LRU and CLQLMRS, is presented. Experimental results show that our hybrid method achieves an average improvement of 31.2% in job execution time.

    SEH: Size Estimate Hedging for Single-Server Queues

    For a single-server system, Shortest Remaining Processing Time (SRPT) is an optimal size-based policy. In this paper, we discuss scheduling a single-server system when exact information about the jobs' processing times is not available. When the SRPT policy uses estimated processing times, the underestimation of large jobs can significantly degrade performance. We propose a simple heuristic, Size Estimate Hedging (SEH), that only uses jobs' estimated processing times for scheduling decisions. A job's priority is increased dynamically according to an SRPT rule until it is determined that it is underestimated, at which time the priority is frozen. Numerical results suggest that SEH has desirable performance when estimation errors are not unreasonably large.
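    The degradation mentioned above can be illustrated with a toy simulation. The sketch below is not SEH itself: it only runs preemptive SRPT on integer time slots, ranking jobs by estimated remaining time (estimate minus attained service). The function name `mean_response_time` and the job tuples are assumptions made for this illustration.

```python
def mean_response_time(jobs):
    """Simulate preemptive SRPT driven by size estimates on unit time slots.

    jobs: list of (arrival, true_size, estimate) with positive integer sizes.
    Returns the mean response time (completion time minus arrival time).
    """
    attained = [0] * len(jobs)      # service received so far, per job
    finish = [None] * len(jobs)     # completion time, per job
    t = 0
    while any(f is None for f in finish):
        ready = [i for i, (arrival, _, _) in enumerate(jobs)
                 if arrival <= t and finish[i] is None]
        if ready:
            # Smallest estimated remaining time wins; ties go to the lower index.
            i = min(ready, key=lambda k: (jobs[k][2] - attained[k], k))
            attained[i] += 1
            if attained[i] == jobs[i][1]:
                finish[i] = t + 1
        t += 1
    return sum(f - arrival for f, (arrival, _, _) in zip(finish, jobs)) / len(jobs)


# One large job (true size 10) followed by four unit-size jobs.
small_jobs = [(t, 1, 1) for t in (1, 2, 3, 4)]
exact = [(0, 10, 10)] + small_jobs          # the large job is estimated correctly
underestimated = [(0, 10, 1)] + small_jobs  # the large job is estimated as size 1

print(mean_response_time(exact))
print(mean_response_time(underestimated))
```

    In this toy instance the underestimated large job keeps the smallest estimated remaining time once it has started, so every small job waits behind it and the mean response time rises.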

    Linearized Data Center Workload and Cooling Management

    With the current high levels of energy consumption of data centers, reducing power consumption by even a small percentage is beneficial. We propose a framework for thermal-aware workload distribution in a data center to reduce cooling power consumption. The framework linearizes the general optimization problem and proposes a heuristic to approximate the solution of the resulting Integer Linear Programming (ILP) problems. We first define a general nonlinear power optimization problem including several cooling parameters, heat recirculation effects, and constraints on server temperatures. We propose to study a linearized version of the problem, which is easier to analyze. As an energy-saving scenario and as a proof of concept for our approach, we also consider the possibility that the red-line temperature for idle servers is higher than that for busy servers. For the resulting ILP problem, we propose a heuristic for intelligent rounding of the fractional solution. Through numerical simulations, we compare our heuristics with two baseline algorithms. We also evaluate the performance of the solution of the linearized system on the original system. The results show that the proposed approach can reduce the cooling power consumption by more than 30 percent compared to the case of continuous utilizations and a single red-line temperature.

    Maximizing throughput in zero-buffer tandem lines with dedicated and flexible servers

    For tandem queues with no buffer spaces and both dedicated and flexible servers, we study how flexible servers should be assigned to maximize the throughput. When there is one flexible server and two stations each with a dedicated server, we completely characterize the optimal policy. We use the insights gained from applying the Policy Iteration algorithm on systems with three, four, and five stations to devise heuristics for systems of arbitrary size. These heuristics are verified by numerical analysis. We also discuss the throughput improvement obtained when, for a given server assignment, dedicated servers are changed to flexible servers.

    Hadoop-Oriented SVM-LRU (H-SVM-LRU): An Intelligent Cache Replacement Algorithm to Improve MapReduce Performance

    Modern applications can generate a large amount of data from different sources with high velocity, a combination that is difficult to store and process via traditional tools. Hadoop is one framework that is used for the parallel processing of a large amount of data in a distributed environment; however, various challenges can lead to poor performance. Two particular issues that can limit performance are the high access time for I/O operations and the recomputation of intermediate data. The combination of these two issues can result in resource wastage. In recent years, there have been attempts to overcome these problems by using caching mechanisms. Due to cache space limitations, it is crucial to use this space efficiently and avoid cache pollution (the cache contains data that is not used in the future). We propose Hadoop-oriented SVM-LRU (H-SVM-LRU) to improve Hadoop performance. For this purpose, we use an intelligent cache replacement algorithm, SVM-LRU, that combines the well-known LRU mechanism with a machine learning algorithm, SVM, to classify cached data into two groups based on their future usage. Experimental results show a significant decrease in execution time as a result of an increased cache hit ratio, leading to a positive impact on Hadoop performance.
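    The idea of pairing LRU with a two-class prediction can be sketched as follows. This is a minimal illustration, not the H-SVM-LRU algorithm: the abstract does not give its replacement rule, so the eviction order here (evict the least recently used entry predicted as not reused, otherwise fall back to plain LRU) is an assumption. The class name `ClassifierLRU` and the callable `will_be_reused`, which stands in for a trained SVM, are hypothetical.

```python
from collections import OrderedDict


class ClassifierLRU:
    """LRU cache whose eviction consults a binary 'will be reused' predictor."""

    def __init__(self, capacity, will_be_reused):
        # capacity is assumed to be at least 1.
        self.capacity = capacity
        self.will_be_reused = will_be_reused  # hypothetical stand-in for an SVM
        self.items = OrderedDict()            # least recently used entry first

    def get(self, key):
        if key not in self.items:
            return None
        self.items.move_to_end(key)           # mark as most recently used
        return self.items[key]

    def put(self, key, value):
        if key in self.items:
            self.items.move_to_end(key)
        self.items[key] = value
        if len(self.items) > self.capacity:
            self._evict(protect=key)

    def _evict(self, protect):
        # Candidates in LRU order, never evicting the entry just inserted.
        candidates = [k for k in self.items if k != protect]
        # Assumed rule: prefer the oldest entry predicted as not reused;
        # if every candidate is predicted as reused, fall back to plain LRU.
        victim = next((k for k in candidates if not self.will_be_reused(k)),
                      candidates[0])
        del self.items[victim]


cache = ClassifierLRU(2, will_be_reused=lambda key: key.startswith("hot"))
cache.put("hot_a", 1)
cache.put("cold_b", 2)
cache.put("hot_c", 3)   # plain LRU would evict "hot_a"; here "cold_b" goes
```

    Evicting entries predicted as unused first is one way to limit the cache pollution described above; a real deployment would replace the lambda with a trained classifier.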

    On Accommodating Customer Flexibility in Service Systems


    Dynamic control of a single-server system with abandonments

    In this paper, we discuss dynamic server control in a two-class service system with abandonments. Two models are considered. In the first case, rewards are received upon service completion, and there are no abandonment costs (other than the lost opportunity to gain rewards). In the second, holding costs per customer per unit time are accrued, and each abandonment involves a fixed cost. Both cases are considered under the discounted or average reward/cost criterion. These are extensions of the classic scheduling question (without abandonments) where it is well known that simple priority rules hold. The contributions in this paper are twofold. First, we show that the classic c-μ rule does not hold in general. An added condition on the ordering of the abandonment rates is sufficient to recover the priority rule. Counterexamples show that this condition is not necessary, but when it is violated, significant loss can occur. In the reward case, we show that the decision involves an intuitive tradeoff between getting more rewards and avoiding idling. Second, we note that traditional solution techniques are not directly applicable. Since customers may leave in between services, an interchange argument cannot be applied. Since the abandonment rates are unbounded, we cannot apply uniformization and thus cannot use the usual discrete-time Markov decision process techniques. After formulating the problem as a continuous-time Markov decision process (CTMDP), we use sample path arguments in the reward case and a savvy use of truncation in the holding cost case to yield the results. As far as we know, this is the first time that either has been used in conjunction with the CTMDP to show structure in a queueing control problem. The insights made in each model are supported by a detailed numerical study. © 2010 Springer Science+Business Media, LLC
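    For reference, the classic c-μ rule (the setting without abandonments) ranks customer classes by the product of holding cost rate and service rate. The sketch below shows only that classic index; it does not encode the abandonment-rate condition from the paper, which the abstract does not state. The function name `c_mu_priority_order` and the example values are assumptions.

```python
def c_mu_priority_order(classes):
    """Order customer classes by the classic c-mu index, highest first.

    classes: dict mapping class name -> (c, mu), where c is the holding cost
    per customer per unit time and mu is the service rate.
    """
    return sorted(classes, key=lambda name: classes[name][0] * classes[name][1],
                  reverse=True)


# Class A is costlier to hold, but class B is served three times faster.
example = {"A": (2.0, 1.0), "B": (1.0, 3.0)}
order = c_mu_priority_order(example)
print(order)
```

    Here class B has index 3.0 and class A has index 2.0, so the classic rule serves B first whenever both classes are present.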

    Improved imputation of low-frequency and rare variants using the UK10K haplotype reference panel

    Imputing genotypes from reference panels created by whole-genome sequencing (WGS) provides a cost-effective strategy for augmenting the single-nucleotide polymorphism (SNP) content of genome-wide arrays. The UK10K Cohorts project has generated a data set of 3,781 whole genomes sequenced at low depth (average 7x), aiming to exhaustively characterize genetic variation down to 0.1% minor allele frequency in the British population. Here we demonstrate the value of this resource for improving imputation accuracy at rare and low-frequency variants in both a UK and an Italian population. We show that large increases in imputation accuracy can be achieved by re-phasing WGS reference panels after initial genotype calling. We also present a method for combining WGS panels to improve variant coverage and downstream imputation accuracy, which we illustrate by integrating 7,562 WGS haplotypes from the UK10K project with 2,184 haplotypes from the 1000 Genomes Project. Finally, we introduce a novel approximation that maintains speed without sacrificing imputation accuracy for rare variants.