File Fragmentation over an Unreliable Channel
It has been recently discovered that heavy-tailed file completion time can result from protocol interaction even when file sizes are light-tailed. A key to this phenomenon is the RESTART feature, where if a file transfer is interrupted before it is completed, the transfer needs to restart from the beginning. In this paper, we show that independent or bounded fragmentation guarantees light-tailed file completion time as long as the file size is light-tailed; i.e., in this case, heavy-tailed file completion time can only originate from heavy-tailed file sizes. If the file size is heavy-tailed, then the file completion time is necessarily heavy-tailed. For this case, we show that when the file size distribution is regularly varying, then under independent or bounded fragmentation, the completion time tail distribution function is asymptotically upper bounded by that of the original file size stretched by a constant factor. We then prove that if the failure distribution has a non-decreasing failure rate, the expected completion time is minimized by dividing the file into equal-sized fragments; this optimal fragment size is unique but depends on the file size. We also present a simple blind fragmentation policy where the fragment sizes are constant and independent of the file size and prove that it is asymptotically optimal. Finally, we bound the error in expected completion time due to error in modeling of the failure process.
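The RESTART effect described above can be seen in a small Monte Carlo sketch. Everything below is illustrative, not taken from the paper: unit transfer rate, exponentially distributed (light-tailed) file sizes, exponential times between failures, and the fragment bound of 0.5 are all assumptions.

```python
import random

def completion_time(size, fail_rate, frag=None):
    """Simulate RESTART: a failure wipes progress on the current fragment only."""
    frag = frag if frag is not None else size  # no fragmentation: whole file restarts
    total, remaining = 0.0, size
    while remaining > 0:
        piece = min(frag, remaining)
        while True:
            t_fail = random.expovariate(fail_rate)  # time until next failure
            if t_fail >= piece:   # fragment completes before the failure
                total += piece
                break
            total += t_fail       # partial work on this fragment is lost
        remaining -= piece
    return total

random.seed(0)
sizes = [random.expovariate(1.0) for _ in range(20000)]  # light-tailed file sizes
no_frag = sorted(completion_time(s, 1.0) for s in sizes)
bounded = sorted(completion_time(s, 1.0, frag=0.5) for s in sizes)
# Without fragmentation the upper quantiles blow up (heavy tail);
# bounded fragments keep them moderate (light tail).
print("99.9th percentile, no fragmentation:", no_frag[19980])
print("99.9th percentile, bounded fragments:", bounded[19980])
```

In this toy setup the no-fragmentation tail quantiles are orders of magnitude larger than the bounded-fragmentation ones, which is the qualitative phenomenon the abstract addresses.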
On Channel Failures, File Fragmentation Policies, and Heavy-Tailed Completion Times
It has been recently discovered that heavy-tailed completion times can result from protocol interaction even when file sizes are light-tailed. A key to this phenomenon is the use of a restart policy where if the file is interrupted before it is completed, it needs to restart from the beginning. In this paper, we show that fragmenting a file into pieces whose sizes are either bounded or independently chosen after each interruption guarantees light-tailed completion time as long as the file size is light-tailed; i.e., in this case, heavy-tailed completion time can only originate from heavy-tailed file sizes. If the file size is heavy-tailed, then the completion time is necessarily heavy-tailed. For this case, we show that when the file size distribution is regularly varying, then under independent or bounded fragmentation, the completion time tail distribution function is asymptotically bounded above by that of the original file size stretched by a constant factor. We then prove that if the distribution of times between interruptions has nondecreasing failure rate, the expected completion time is minimized by dividing the file into equal-sized fragments; this optimal fragment size is unique but depends on the file size. We also present a simple blind fragmentation policy where the fragment sizes are constant and independent of the file size and prove that it is asymptotically optimal. Both these policies are also shown to have desirable completion time tail behavior. Finally, we bound the error in expected completion time due to error in modeling of the failure process.
Heavy Tails and Instabilities in Large-Scale Systems with Failures
Modern engineering systems, e.g., wireless communication networks, distributed computing systems, etc., are characterized by high variability and susceptibility to failures. Failure recovery is required to guarantee the successful operation of these systems. One straightforward and widely used mechanism is to restart the interrupted jobs from the beginning after a failure occurs. In network design, retransmissions are the primary building blocks of the network architecture that guarantee data delivery in the presence of channel failures. Retransmissions have recently been identified as a new origin of power laws in modern information networks. In particular, it was discovered that retransmissions give rise to long tails (delays) and possibly zero throughput. To this end, we investigate the impact of the ‘retransmission phenomenon’ on the performance of failure prone systems and propose adaptive solutions to address emerging instabilities.
The preceding finding of power law phenomena due to retransmissions holds under the assumption that data sizes have infinite support. In practice, however, data sizes are upper bounded 0 ≤ L ≤ b, e.g., WaveLAN’s maximum transfer unit is 1500 bytes, YouTube videos are of limited duration, e-mail attachments cannot exceed 10MB, etc. To this end, we first provide a uniform characterization of the entire body of the distribution of the number of retransmissions, which can be represented as a product of a power law and the Gamma distribution. This rigorous approximation clearly demonstrates the transition from power law distributions in the main body to exponential tails. Furthermore, the results highlight the importance of wisely determining the size of data fragments in order to accommodate the performance needs in these systems as well as provide the appropriate tools for this fragmentation.
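The bounded-support effect can be sketched with a standard retransmission model (all parameters below, including the bound b and the exponential rates, are assumptions for illustration, not the dissertation's actual setup): the number of retransmissions N counts attempts until one channel availability period exceeds the data size, so truncating sizes at b turns the power-law body into a geometric (exponential) tail.

```python
import random

def retransmissions(size, ch_rate=1.0):
    """Attempts until one channel availability period exceeds the data size."""
    n = 1
    while random.expovariate(ch_rate) < size:
        n += 1  # transfer interrupted: retransmit from scratch
    return n

random.seed(1)
b = 5.0  # assumed upper bound on data sizes
samples = sorted(retransmissions(min(random.expovariate(1.0), b))
                 for _ in range(50000))
# Body: with unbounded exp(1) sizes and an exp(1) channel, P(N > n) ~ 1/n (power law).
# Tail: once sizes are pinned at b, P(N > n) decays like (1 - exp(-b))**n (geometric).
median = samples[len(samples) // 2]
p999 = samples[int(0.999 * len(samples))]
print("median attempts:", median, " 99.9th percentile:", p999)
```

The transition happens around e^b attempts: below it the empirical tail looks like a power law, beyond it the truncation forces geometric decay, matching the power-law-times-Gamma characterization described above.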
Second, we extend the analysis to the practically important case of correlated channels using modulated processes, e.g., Markov modulated, to capture the underlying dependencies. Our study shows that the tails of the retransmission and delay distributions are asymptotically insensitive to the channel correlations and are determined by the state that generates the lightest tail in the independent channel case. This insight is beneficial both for capacity planning and channel modeling since the independent model is sufficient and the correlation details do not matter. However, the preceding finding may be overly optimistic when the best state is atypical, since the effects of ‘bad’ states may still downgrade the performance.
Third, we examine the effects of scheduling policies in queueing systems with failures and restarts. Fair sharing, e.g., processor sharing (PS), is a widely accepted approach to resource allocation among multiple users. We revisit the well-studied M/G/1 PS queue with a new focus on server failures and restarts. Interestingly, we discover a new phenomenon showing that PS-based scheduling induces complete instability in the presence of retransmissions, regardless of how low the traffic load may be. This novel phenomenon occurs even when the job sizes are bounded/fragmented, e.g., deterministic. This work demonstrates that scheduling one job at a time, such as first-come-first-serve, achieves a larger stability region and should be preferred in these systems.
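The PS instability mechanism can be seen in a back-of-the-envelope calculation (a toy model, not the dissertation's analysis): with n deterministic jobs of size s sharing a unit-rate PS server under exponential(lam) failures that restart all in-service work, a tagged job completes only if the server stays up for n*s time units, which happens with probability e^(-lam*n*s). This vanishes geometrically in n, so any sustained backlog makes completions ever rarer, whereas a one-at-a-time discipline such as FCFS always needs only an up-interval of length s.

```python
import math

lam, s = 0.5, 1.0  # assumed failure rate and (deterministic) job size

# Probability a tagged job finishes before the next failure:
# PS with n jobs in service needs an (n * s)-long failure-free window;
# FCFS serves one job at a time and needs only an s-long window.
for n in (1, 2, 5, 10, 20):
    ps = math.exp(-lam * n * s)
    fcfs = math.exp(-lam * s)
    print(f"n={n:2d}  PS: {ps:.5f}   FCFS: {fcfs:.5f}")
```

Even in this crude picture, the PS completion probability collapses as the backlog grows while the FCFS probability stays constant, which is the intuition behind the larger stability region of one-at-a-time scheduling.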
Last, we delve into the area of distributed computing and study the effects of commonly used mechanisms, i.e., restarts, fragmentation, replication, especially in cloud computing services. We evaluate the efficiency of these techniques under different assumptions on the data streams and discuss the corresponding optimization problem. These findings are useful for optimal resource allocation and fault tolerance in rapidly developing computing networks. In addition to networking and distributed computing systems, the aforementioned results improve our understanding of failure recovery management in large manufacturing and service systems, e.g., call centers. Scalable solutions to this problem increase in significance as these systems continuously grow in scale and complexity. The new phenomena and the techniques developed herein provide new insights in the areas of parallel computing, probability and statistics, as well as financial engineering
Optimal job fragmentation
It has been recently discovered that on an unreliable server, the job completion time distribution function (df) can be heavy-tailed (HT) even when job size df is light-tailed (LT) [1, 5]. A key to this phenomenon is the RESTART feature where if a job is interrupted in the middle of its processing, the entire job needs to restart from the beginning, i.e., the work that is partially completed is lost.
A standard mechanism for reducing the job completion time in an unreliable service environment is checkpointing [3, 4, 6]. We view checkpointing as a job fragmentation operation, where the server processes one fragment of the job at a time. If the server becomes unavailable, say due to failure, then only the work corresponding to the fragment being processed at the time of failure is lost. In this paper, we are motivated by the question: can fragmentation ‘lighten’ the tail df of the completion time? In Section 3, we provide sufficient conditions on the fragmentation policy that give rise to a LT completion time so long as the job size df is LT. We then characterize the optimal fragmentation policy seeking to minimize the expected job completion time. This policy requires a priori knowledge of the job size. We then describe a sub-optimal fragmentation policy that is blind to the job size and is provably very close to optimal. We also describe the asymptotic tail behavior of the job completion time df under both policies. Assuming the server unavailability periods are LT, both policies produce LT completion times when the job size df is LT. For the case of a regularly varying job size df, the job completion time under both policies is regularly varying with the same degree; this is the lightest possible asymptotic tail behavior (in the degree sense).
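For intuition on the equal-fragment result, here is a toy calculation under assumptions that go beyond the abstract: exponential(lam) failures (the memoryless special case of a non-decreasing failure rate) and a fixed checkpoint overhead c added to each fragment, without which ever-finer fragmentation would be free. Under RESTART with exponential(lam) failures, the expected time to complete x units of work is (exp(lam*x) - 1)/lam, so splitting a job of size L into n equal fragments costs n*(exp(lam*(L/n + c)) - 1)/lam in expectation.

```python
import math

def expected_completion(L, n, lam=1.0, c=0.05):
    """Expected RESTART completion time: n equal fragments, each carrying
    L/n units of work plus a checkpoint overhead c, exponential(lam) failures."""
    x = L / n + c
    return n * math.expm1(lam * x) / lam

L = 10.0
best_n = min(range(1, 1000), key=lambda n: expected_completion(L, n))
print("optimal number of fragments:", best_n)
print("expected completion time:", round(expected_completion(L, best_n), 3))
print("no fragmentation (n=1):   ", round(expected_completion(L, 1), 3))
```

In the continuous relaxation (ignoring that n must be an integer), the optimal per-fragment work x* solves exp(lam*(x* + c))*(1 - lam*x*) = 1 and does not depend on L; the dependence on job size in the exact result comes from requiring the fragments to divide the job evenly. This L-insensitivity is also the intuition for why a blind, constant-fragment-size policy can be asymptotically near-optimal.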