15 research outputs found
Application of geographic information systems to archaeological intra-site recording and analysis : a case study of the Kissonerga Chalcolithic site, Cyprus
Optimal Locally Repairable Codes and Connections to Matroid Theory
Petabyte-scale distributed storage systems are currently transitioning to
erasure codes to achieve higher storage efficiency. Classical codes like
Reed-Solomon are highly sub-optimal for distributed environments due to their
high overhead in single-failure events. Locally Repairable Codes (LRCs) form a
new family of codes that are repair efficient. In particular, LRCs minimize the
number of nodes participating in single node repairs during which they generate
small network traffic. Two large-scale distributed storage systems have already
implemented different types of LRCs: Windows Azure Storage and the Hadoop
Distributed File System RAID used by Facebook. The fundamental bounds for LRCs,
namely the best possible distance for a given code locality, were recently
discovered, but few explicit constructions exist. In this work, we present
explicit and optimal LRCs that are simple to construct. Our construction is
based on grouping Reed-Solomon (RS) coded symbols to obtain RS coded symbols
over a larger finite field. We then partition these RS symbols in small groups,
and re-encode them using a simple local code that offers low repair locality.
For the analysis of the optimality of the code, we derive a new result on the
matroid represented by the code generator matrix. Comment: Submitted for publication, a shorter version was presented at ISIT
201
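The idea of repair locality described in this abstract can be illustrated with a toy example. The sketch below is not the paper's construction (which re-encodes Reed-Solomon symbols over a larger field); it assumes integer symbols and a single XOR parity per group, and the names `add_local_parities` and `repair` are hypothetical. It only shows that a lost symbol can be rebuilt from the few other symbols in its own group rather than from the whole codeword.

```python
from functools import reduce
from operator import xor


def add_local_parities(symbols, r):
    """Split symbols into groups of at most r and append one XOR parity to each group."""
    groups = [symbols[i:i + r] for i in range(0, len(symbols), r)]
    return [group + [reduce(xor, group)] for group in groups]


def repair(group, lost_index):
    """Rebuild the symbol at lost_index by XOR-ing the surviving symbols of its group.

    Because the XOR of a whole group (data plus parity) is zero, the XOR of
    the survivors equals the missing symbol. Only the group is read.
    """
    survivors = [s for i, s in enumerate(group) if i != lost_index]
    return reduce(xor, survivors)


# Six data symbols, locality r = 3: two groups of four stored symbols each.
coded = add_local_parities([5, 9, 12, 7, 3, 8], r=3)
rebuilt = repair(coded[0], lost_index=1)  # reads 3 symbols instead of all 6
```

With r = 3, any single lost symbol is repaired by reading three symbols, at the cost of one extra parity per group.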
A Repair Framework for Scalar MDS Codes
Several works have developed vector-linear maximum-distance separable (MDS)
storage codes that minimize the total communication cost required to repair a
single coded symbol after an erasure, referred to as repair bandwidth (BW).
Vector codes allow communicating fewer sub-symbols per node, instead of the
entire content. This allows nontrivial savings in repair BW. In sharp
contrast, classic codes, like Reed-Solomon (RS), used in current storage
systems, are deemed to suffer from naive repair, i.e. downloading the entire
stored message to repair one failed node. This mainly happens because they are
scalar-linear. In this work, we present a simple framework that treats scalar
codes as vector-linear. In some cases, this allows significant savings in
repair BW. We show that vectorized scalar codes exhibit properties that
simplify the design of repair schemes. Our framework can be seen as a finite
field analogue of real interference alignment. Using our simplified framework,
we design a scheme that we call clique-repair which provably identifies the
best linear repair strategy for any scalar 2-parity MDS code, under some
conditions on the sub-field chosen for vectorization. We specify optimal repair
schemes for specific (5,3)- and (6,4)-Reed-Solomon (RS) codes. Further, we
present a repair strategy for the RS code currently deployed in the Facebook
Analytics Hadoop cluster that leads to 20% repair BW savings over naive
repair, which is the repair scheme currently used for this code. Comment: 10 Pages; accepted to IEEE JSAC -Distributed Storage 201
Locality and Availability in Distributed Storage
This paper studies the problem of code symbol availability: a code symbol is
said to have (r, t)-availability if it can be reconstructed from t disjoint
groups of other symbols, each of size at most r. For example, (t+1)-replication
supports (1, t)-availability as each symbol can be read from its t other
(disjoint) replicas, i.e., r = 1. However, the rate of replication must vanish
like 1/(t+1) as the availability increases.
This paper shows that it is possible to construct codes that can support a
scaling number of parallel reads while keeping the rate to be an arbitrarily
high constant. It further shows that this is possible with the minimum distance
arbitrarily close to the Singleton bound. This paper also presents a bound
demonstrating a trade-off between minimum distance, availability and locality.
Our codes match the aforementioned bound and their construction relies on
combinatorial objects called resolvable designs.
From a practical standpoint, our codes seem useful for distributed storage
applications involving hot data, i.e., the information which is frequently
accessed by multiple processes in parallel. Comment: Submitted to ISIT 201
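The notion of availability in this abstract (several disjoint groups of other symbols, each able to reconstruct the same symbol) can be made concrete with a small example. The sketch below is an assumption-laden illustration and not the paper's resolvable-design construction: it arranges integer data symbols in a grid and stores one XOR parity per row and per column, so every data symbol has two disjoint repair groups. The function names are hypothetical.

```python
from functools import reduce
from operator import xor


def encode_grid(data):
    """Return (row_parities, column_parities) for a rectangular grid of integers."""
    row_par = [reduce(xor, row) for row in data]
    col_par = [reduce(xor, col) for col in zip(*data)]
    return row_par, col_par


def repair_groups(data, row_par, col_par, i, j):
    """Return two disjoint groups of symbols, each of which XORs to data[i][j].

    The first group is the rest of row i plus its row parity; the second is
    the rest of column j plus its column parity. They share no stored symbol,
    so two readers can recover data[i][j] in parallel.
    """
    row_group = [data[i][c] for c in range(len(data[i])) if c != j] + [row_par[i]]
    col_group = [data[r][j] for r in range(len(data)) if r != i] + [col_par[j]]
    return row_group, col_group


data = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
row_par, col_par = encode_grid(data)
row_group, col_group = repair_groups(data, row_par, col_par, 1, 1)
via_row = reduce(xor, row_group)  # recovers data[1][1] from row 1
via_col = reduce(xor, col_group)  # recovers data[1][1] from column 1
```

In this 3-by-3 grid each data symbol has two disjoint repair groups of size 3, and the rate is 9/15; plain replication with two extra copies would give the same two parallel reads at rate 1/3.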
MCMC methods for integer least-squares problems
We consider the problem of finding the least-squares solution to a system of linear equations where the unknown vector has integer entries (or, more precisely, has entries belonging to a subset of the integers), yet where the coefficient matrix and given vector are composed of real numbers. Geometrically, this problem is equivalent to finding the closest lattice point to a given point and is known to be NP-hard. In communication applications, however, the given vector is not arbitrary, but is a lattice point perturbed by some noise vector. Therefore it is of interest to study the computational complexity of various algorithms as a function of the noise variance or, often more appropriately, the SNR.
In this paper, we apply a particular version of the Markov chain Monte Carlo (MCMC) approach, called a "heat bath", to solving this problem. We show that there is a trade-off between the mixing time of the Markov chain (how long it takes until the chain reaches its stationary distribution) and how long it takes for the algorithm to find the optimal solution once the chain has mixed. The complexity of the algorithm is essentially the sum of these two times. More specifically, the higher the temperature, the faster the mixing, yet the slower the discovery of the optimal solution in steady state. Conversely, the lower the temperature, the slower the mixing, yet the faster the discovery of the optimal solution once the chain is mixed.
We first show that for the probability of error of the maximum-likelihood (ML) solution to go to zero, the SNR must scale at least as 2 ln N + α(N), where N is the ambient problem dimension and α(N) is any sequence that tends to positive infinity. We further obtain the optimal value of the temperature such that the average time required to encounter the optimal solution in steady state is polynomial. Simulations show that, with this choice of the temperature parameter, the optimal solution can be found in reasonable time. This suggests that the Markov chain mixes in polynomial time, though we have not been able to prove this. It seems reasonable to conjecture that for SNR scaling as O((ln N)^(1+ε)), and for an appropriate choice of the temperature parameter, the heat bath algorithm finds the optimal solution in polynomial time.
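The heat bath sampler described in this abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the unknown entries are taken from a small finite alphabet, each coordinate is resampled with probability proportional to exp(-cost / (2 * temperature^2)) (this parameterization is an assumption), and the function names, the sweep count, and the example matrix are hypothetical.

```python
import math
import random


def residual_norm_sq(H, x, y):
    """Return ||y - Hx||^2 for a matrix H given as a list of rows."""
    return sum((yi - sum(h * xj for h, xj in zip(row, x))) ** 2
               for row, yi in zip(H, y))


def heat_bath(H, y, alphabet=(-1, 1), temperature=1.0, sweeps=200, seed=0):
    """Gibbs-style search for the integer vector x minimizing ||y - Hx||^2.

    Returns the lowest-cost vector visited and its cost.
    """
    rng = random.Random(seed)
    n = len(H[0])
    x = [rng.choice(alphabet) for _ in range(n)]
    best, best_cost = list(x), residual_norm_sq(H, x, y)
    for _ in range(sweeps):
        for j in range(n):
            # Cost of every candidate value for coordinate j, others held fixed.
            costs = []
            for v in alphabet:
                x[j] = v
                costs.append(residual_norm_sq(H, x, y))
            low = min(costs)  # subtract the minimum to keep exp() well scaled
            weights = [math.exp(-(c - low) / (2 * temperature ** 2)) for c in costs]
            x[j] = rng.choices(alphabet, weights=weights)[0]
            cost = costs[alphabet.index(x[j])]
            if cost < best_cost:
                best, best_cost = list(x), cost
    return best, best_cost


# Noiseless toy instance: y = H x_true with x_true = [1, -1, 1].
H = [[4, 1, 0], [0, 4, 1], [1, 0, 4]]
y = [3, -3, 5]
best, best_cost = heat_bath(H, y)
```

Raising `temperature` flattens the sampling weights, so the chain moves more freely but spends less time at the minimizer; lowering it concentrates the weights on low-cost values, which is the trade-off the abstract describes.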
Orthogonal NMF through Subspace Exploration
Orthogonal Nonnegative Matrix Factorization (ONMF) aims to approximate a nonnegative matrix as the product of two k-dimensional nonnegative factors, one of which has orthonormal columns. It yields potentially useful data representations as a superposition of disjoint parts, and it has been shown to work well for clustering tasks where traditional methods underperform. Existing algorithms rely mostly on heuristics, which, despite their good empirical performance, lack provable performance guarantees. We present a new ONMF algorithm with provable approximation guarantees. For any constant dimension k, we obtain an additive EPTAS without any assumptions on the input. Our algorithm relies on a novel approximation to the related Nonnegative Principal Component Analysis (NNPCA) problem; given an arbitrary data matrix, NNPCA seeks k nonnegative components that jointly capture most of the variance. Our NNPCA algorithm is of independent interest and generalizes previous work that could only obtain guarantees for a single component. We evaluate our algorithms on several real and synthetic datasets and show that their performance matches or outperforms the state of the art.