A tight lower bound instance for k-means++ in constant dimension
The k-means++ seeding algorithm is one of the most popular algorithms that is
used for finding the initial centers when using the k-means heuristic. The
algorithm is a simple sampling procedure and can be described as follows: Pick
the first center randomly from the given points. For $i > 1$, pick a point to
be the $i$-th center with probability proportional to the square of the
Euclidean distance of this point to the closest of the $i-1$ previously chosen
centers.
The k-means++ seeding algorithm is not only simple and fast but also gives an
$O(\log k)$ approximation in expectation as shown by Arthur and Vassilvitskii.
There are datasets on which this seeding algorithm gives an approximation
factor of $\Omega(\log k)$ in expectation. However, it is not clear from these
results if the algorithm achieves a good approximation factor with reasonably
high probability (say $1/\mathrm{poly}(k)$). Brunsch and Röglin gave a dataset where
the k-means++ seeding algorithm achieves an $O(\log k)$ approximation ratio
with probability that is exponentially small in $k$. However, this and all
other known lower-bound examples are high dimensional. So, an open problem was
to understand the behavior of the algorithm on low dimensional datasets. In
this work, we give a simple two dimensional dataset on which the seeding
algorithm achieves an $O(\log k)$ approximation ratio with probability
exponentially small in $k$. This solves open problems posed by Mahajan et al.
and by Brunsch and Röglin.
Comment: To appear in TAMC 2014. arXiv admin note: text overlap with
arXiv:1306.420
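The sampling procedure described in the abstract above can be sketched as follows. This is a minimal illustration, not code from the paper: the function names `sq_dist` and `kmeanspp_seed`, the tuple representation of points, and the fallback used when every point already coincides with a chosen center are all assumptions made for the example.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeanspp_seed(points, k, rng=None):
    """Choose k initial centers from `points` by D^2 sampling."""
    rng = rng or random.Random()
    # First center: uniformly at random from the given points.
    centers = [rng.choice(points)]
    # d2[j] = squared distance from points[j] to its closest chosen center.
    d2 = [sq_dist(p, centers[0]) for p in points]
    while len(centers) < k:
        if sum(d2) == 0:
            # Assumed fallback: all points coincide with a center already,
            # so the weights are all zero and we pick uniformly instead.
            nxt = rng.choice(points)
        else:
            # i-th center: probability proportional to squared distance
            # to the closest of the previously chosen centers.
            nxt = rng.choices(points, weights=d2, k=1)[0]
        centers.append(nxt)
        d2 = [min(d, sq_dist(p, nxt)) for d, p in zip(d2, points)]
    return centers
```

A call such as `kmeanspp_seed([(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)], 2)` returns two of the input points. Already-chosen points have weight zero, so they are not drawn again while any other point remains at positive distance.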
Identifying hidden contexts
In this study we investigate how to identify hidden contexts from the data in classification tasks.
Contexts are artifacts in the data, which do not predict the class label directly.
For instance, in a speech recognition task speakers might have different accents, which do not directly discriminate between the spoken words.
Identifying hidden contexts is treated as a data preprocessing task, which can help to build more accurate classifiers tailored to particular contexts and give insight into the data structure.
We present three techniques to identify hidden contexts, which hide class label information from the input data and partition it using clustering techniques.
We form a collection of performance measures to ensure that the resulting contexts are valid.
We evaluate the performance of the proposed techniques on thirty real datasets.
We present a case study illustrating how the identified contexts can be used to build specialized, more accurate classifiers.
On-line PCA with Optimal Regrets
We carefully investigate the on-line version of PCA, where in each trial a
learning algorithm plays a k-dimensional subspace, and suffers the compression
loss on the next instance when projected into the chosen subspace. In this
setting, we analyze two popular on-line algorithms, Gradient Descent (GD) and
Exponentiated Gradient (EG). We show that both algorithms are essentially
optimal in the worst case. This comes as a surprise, since EG is known to
perform sub-optimally when the instances are sparse. This different behavior of
EG for PCA is mainly related to the non-negativity of the loss in this case,
which makes the PCA setting qualitatively different from other settings studied
in the literature. Furthermore, we show that when considering regret bounds as
a function of a loss budget, EG remains optimal and strictly outperforms GD.
Next, we study an extension of the PCA setting in which Nature is allowed
to play dense instances, which are positive matrices with bounded largest
eigenvalue. Again we show that EG is optimal and strictly better than GD in
this setting.
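For orientation, the compression loss and regret mentioned in the abstract above are commonly written as below. This is a sketch under assumptions, since the abstract does not fix notation: $x_t \in \mathbb{R}^n$ is taken to be the instance in trial $t$, $P_t$ the rank-$k$ orthogonal projection played by the algorithm, and $T$ the number of trials. Randomized or mixture predictions are not covered here.

```latex
% Compression loss in trial t (assumed notation):
\ell_t(P_t) = \lVert x_t - P_t x_t \rVert^2
            = \operatorname{tr}\!\bigl((I - P_t)\, x_t x_t^{\top}\bigr)

% Regret against the best fixed rank-k projection in hindsight:
R_T = \sum_{t=1}^{T} \ell_t(P_t)
      - \min_{P \,:\, \operatorname{rank}(P) = k} \sum_{t=1}^{T} \lVert x_t - P x_t \rVert^2
```

The second expression for the loss follows because $I - P_t$ is itself an orthogonal projection, so $\lVert (I - P_t) x_t \rVert^2 = x_t^{\top} (I - P_t) x_t$.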
Noetherianity up to symmetry
These lecture notes for the 2013 CIME/CIRM summer school Combinatorial
Algebraic Geometry deal with manifestly infinite-dimensional algebraic
varieties with large symmetry groups. So large, in fact, that subvarieties
stable under those symmetry groups are defined by finitely many orbits of
equations---whence the title Noetherianity up to symmetry. It is not the
purpose of these notes to give a systematic, exhaustive treatment of such
varieties, but rather to discuss a few "personal favourites": exciting examples
drawn from applications in algebraic statistics and multilinear algebra. My
hope is that these notes will attract other mathematicians to this vibrant area
at the crossroads of combinatorics, commutative algebra, algebraic geometry,
statistics, and other applications.
Comment: To appear in Springer's LNM C.I.M.E. series; several typos fixed
Alternative efficiency measures for multiple-output production
This paper has two main purposes. Firstly, we develop various ways of defining efficiency in the case of multiple-output production. Our framework extends a previous model by allowing for nonseparability of inputs and outputs. We also specifically consider the case where some of the outputs are undesirable, such as pollutants. We investigate how these efficiency definitions relate to one another and to other approaches proposed in the literature. Secondly, we examine the behavior of these definitions in two examples of practically relevant size and complexity. One of these involves banking and the other agricultural data. Our main findings can be summarized as follows. For a given efficiency definition, efficiency rankings are found to be informative, despite the considerable uncertainty in the inference on efficiencies. It is, however, important for the researcher to select an efficiency concept appropriate to the particular issue under study, since different efficiency definitions can lead to quite different conclusions.
Composable security proof for continuous-variable quantum key distribution with coherent states
We give the first composable security proof for continuous-variable quantum
key distribution with coherent states against collective attacks. Crucially, in
the limit of large blocks the secret key rate converges to the usual value
computed from the Holevo bound. Combining our proof with either the de Finetti
theorem or the Postselection technique then shows the security of the protocol
against general attacks, thereby confirming the long-standing conjecture that
Gaussian attacks are optimal asymptotically in the composable security
framework.
We expect that our parameter estimation procedure, which does not rely on any
assumption, will find applications elsewhere, for instance for the reliable
quantification of continuous-variable entanglement in finite-size settings.
Comment: 27 pages, 1 figure. v2: added a version of the AEP valid for
conditional state