Performance modeling of a distributed file-system
Data centers have become the center of big data processing. Most programs
running in a data center process big data. The storage requirements of such
programs cannot be met by a single node in the data center, and hence a
distributed file system is used, in which the storage resources of more than
one node are pooled together and presented as a unified view to the outside
world. Optimum performance of these distributed file systems for a given
workload is of paramount importance, as the disk is the slowest component in
the framework.
Owing to this fact, many big data processing frameworks implement their own
file system to get optimal performance by fine-tuning it for their specific
workloads. However, fine-tuning a file system for a particular workload results
in poor performance for workloads that do not match the profile of the desired
workload. Hence, these file systems cannot be used for general-purpose usage,
where the workload characteristics show high variation. In this paper, we model
the performance of a general-purpose file system and analyse the impact of
tuning the file system on its performance. The performance of these parallel
file systems is not easy to model because it depends on many configuration
parameters, such as the network, disk, underlying file system, number of
servers, number of clients, and parallel file system configuration.
We present a multiple linear regression model that captures the relationship
between the configuration parameters of a file system, the hardware
configuration, and the workload configuration (collectively called features)
and the performance metrics. We use this model to rank the features according
to their importance in determining the performance of the file system.
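
The idea of ranking features with a multiple linear regression model can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes features are standardized and ranked by the magnitude of their fitted coefficients, which is one common way to compare importance. The feature names (stripe_size_kb, num_servers, num_clients), the helper rank_features, and the data are all hypothetical; the data is generated synthetically and is not taken from the paper.

```python
import numpy as np

def rank_features(X, y, names):
    """Fit y ~ X by ordinary least squares on standardized data and
    rank features by the magnitude of their standardized coefficients."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Standardize so coefficients are comparable across different units.
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    ys = (y - y.mean()) / y.std()
    # No intercept term is needed because both sides are centered.
    coef, *_ = np.linalg.lstsq(Xs, ys, rcond=None)
    order = np.argsort(-np.abs(coef))
    return [(names[i], float(coef[i])) for i in order]

# Synthetic, illustrative data only (hypothetical features and effects).
rng = np.random.default_rng(0)
names = ["stripe_size_kb", "num_servers", "num_clients"]
X = np.column_stack([
    rng.choice([64, 128, 256, 512], size=200),
    rng.integers(1, 9, size=200),
    rng.integers(1, 33, size=200),
])
# Assumed relationship: throughput is driven mostly by the server count.
y = 0.01 * X[:, 0] + 20.0 * X[:, 1] - 0.5 * X[:, 2] + rng.normal(0, 1.0, size=200)

ranking = rank_features(X, y, names)
for name, beta in ranking:
    print(f"{name}: {beta:+.3f}")
```

Standardizing before fitting matters here: without it, a coefficient's size would reflect the unit of its feature (kilobytes versus node counts) rather than its influence on the performance metric.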