46,125 research outputs found
CATS: linearizability and partition tolerance in scalable and self-organizing key-value stores
Distributed key-value stores provide scalable, fault-tolerant, and self-organizing
storage services, but fall short of guaranteeing linearizable consistency
in partially synchronous, lossy, partitionable, and dynamic networks, when data
is distributed and replicated automatically by the principle of consistent hashing.
This paper introduces consistent quorums as a solution for achieving atomic
consistency. We present the design and implementation of CATS, a distributed
key-value store which uses consistent quorums to guarantee linearizability and partition tolerance in such adverse and dynamic network conditions. CATS is
scalable, elastic, and self-organizing; key properties for modern cloud storage
middleware. Our system shows that consistency can be achieved with practical
performance and modest throughput overhead (5%) for read-intensive workloads
Partial mixture model for tight clustering of gene expression time-course
Background: Tight clustering arose recently from a desire to obtain tighter and potentially more informative clusters in gene expression studies. Scattered genes with relatively loose correlations should be excluded from the clusters. However, in the literature there is little work dedicated to
this area of research. On the other hand, there has been extensive use of maximum likelihood techniques for model parameter estimation. By contrast, the minimum distance estimator has been largely ignored.
Results: In this paper we show the inherent robustness of the minimum distance estimator that makes it a powerful tool for parameter estimation in model-based time-course clustering. To apply minimum distance estimation, a partial mixture model that can naturally incorporate replicate
information and allow scattered genes is formulated. We provide experimental results of simulated data fitting, where the minimum distance estimator demonstrates superior performance to the maximum likelihood estimator. Both biological and statistical validations are conducted on a
simulated dataset and two real gene expression datasets. Our proposed partial regression clustering algorithm scores top in Gene Ontology driven evaluation, in comparison with four other popular clustering algorithms.
Conclusion: For the first time partial mixture model is successfully extended to time-course data analysis. The robustness of our partial regression clustering algorithm proves the suitability of the ombination of both partial mixture model and minimum distance estimator in this field. We show that tight clustering not only is capable to generate more profound understanding of the dataset
under study well in accordance to established biological knowledge, but also presents interesting new hypotheses during interpretation of clustering results. In particular, we provide biological evidences that scattered genes can be relevant and are interesting subjects for study, in contrast to prevailing opinion
Approximation Algorithms for Energy Minimization in Cloud Service Allocation under Reliability Constraints
We consider allocation problems that arise in the context of service
allocation in Clouds. More specifically, we assume on the one part that each
computing resource is associated to a capacity constraint, that can be chosen
using Dynamic Voltage and Frequency Scaling (DVFS) method, and to a probability
of failure. On the other hand, we assume that the service runs as a set of
independent instances of identical Virtual Machines. Moreover, there exists a
Service Level Agreement (SLA) between the Cloud provider and the client that
can be expressed as follows: the client comes with a minimal number of service
instances which must be alive at the end of the day, and the Cloud provider
offers a list of pairs (price,compensation), this compensation being paid by
the Cloud provider if it fails to keep alive the required number of services.
On the Cloud provider side, each pair corresponds actually to a guaranteed
success probability of fulfilling the constraint on the minimal number of
instances. In this context, given a minimal number of instances and a
probability of success, the question for the Cloud provider is to find the
number of necessary resources, their clock frequency and an allocation of the
instances (possibly using replication) onto machines. This solution should
satisfy all types of constraints during a given time period while minimizing
the energy consumption of used resources. We consider two energy consumption
models based on DVFS techniques, where the clock frequency of physical
resources can be changed. For each allocation problem and each energy model, we
prove deterministic approximation ratios on the consumed energy for algorithms
that provide guaranteed probability failures, as well as an efficient
heuristic, whose energy ratio is not guaranteed
Instant restore after a media failure
Media failures usually leave database systems unavailable for several hours
until recovery is complete, especially in applications with large devices and
high transaction volume. Previous work introduced a technique called
single-pass restore, which increases restore bandwidth and thus substantially
decreases time to repair. Instant restore goes further as it permits read/write
access to any data on a device undergoing restore--even data not yet
restored--by restoring individual data segments on demand. Thus, the restore
process is guided primarily by the needs of applications, and the observed mean
time to repair is effectively reduced from several hours to a few seconds.
This paper presents an implementation and evaluation of instant restore. The
technique is incrementally implemented on a system starting with the
traditional ARIES design for logging and recovery. Experiments show that the
transaction latency perceived after a media failure can be cut down to less
than a second and that the overhead imposed by the technique on normal
processing is minimal. The net effect is that a few "nines" of availability are
added to the system using simple and low-overhead software techniques
Algorithms of maximum likelihood data clustering with applications
We address the problem of data clustering by introducing an unsupervised,
parameter free approach based on maximum likelihood principle. Starting from
the observation that data sets belonging to the same cluster share a common
information, we construct an expression for the likelihood of any possible
cluster structure. The likelihood in turn depends only on the Pearson's
coefficient of the data. We discuss clustering algorithms that provide a fast
and reliable approximation to maximum likelihood configurations. Compared to
standard clustering methods, our approach has the advantages that i) it is
parameter free, ii) the number of clusters need not be fixed in advance and
iii) the interpretation of the results is transparent. In order to test our
approach and compare it with standard clustering algorithms, we analyze two
very different data sets: Time series of financial market returns and gene
expression data. We find that different maximization algorithms produce similar
cluster structures whereas the outcome of standard algorithms has a much wider
variability.Comment: Accepted by Physica A; 12 pag., 5 figures. More information at:
http://www.sissa.it/dataclusterin
- âŠ