A tight lower bound instance for k-means++ in constant dimension

Author: A. Aggarwal
B. Bahmani
D. Arthur
M. Agarwal
M.R. Ackermann
R. Jaiswal
Publication date: 01/01/2014

The k-means++ seeding algorithm is one of the most popular algorithms for finding the initial $k$ centers when using the k-means heuristic. The algorithm is a simple sampling procedure and can be described as follows: Pick the first center randomly from the given points. For $i > 1$, pick a point to be the $i^{th}$ center with probability proportional to the square of the Euclidean distance of this point to the closest of the $(i-1)$ previously chosen centers. The k-means++ seeding algorithm is not only simple and fast but also gives an $O(\log k)$ approximation in expectation, as shown by Arthur and Vassilvitskii. There are datasets on which this seeding algorithm gives an approximation factor of $\Omega(\log k)$ in expectation. However, it is not clear from these results whether the algorithm achieves a good approximation factor with reasonably high probability (say $1/\mathrm{poly}(k)$). Brunsch and Röglin gave a dataset where the k-means++ seeding algorithm achieves an $O(\log k)$ approximation ratio with probability that is exponentially small in $k$. However, this and all other known lower-bound examples are high dimensional. So, an open problem was to understand the behavior of the algorithm on low dimensional datasets. In this work, we give a simple two dimensional dataset on which the seeding algorithm achieves an $O(\log k)$ approximation ratio with probability exponentially small in $k$. This solves open problems posed by Mahajan et al. and by Brunsch and Röglin. Comment: To appear in TAMC 2014. arXiv admin note: text overlap with arXiv:1306.420
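The sampling procedure described in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names `squared_distance` and `kmeanspp_seed`, the tuple representation of points, and the uniform fallback when every point already coincides with a center are assumptions made here for the example. The first center is drawn uniformly at random, which is the usual reading of "randomly" for k-means++.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeanspp_seed(points, k, rng=None):
    """Pick k initial centers from `points` by D^2 sampling (k-means++ seeding)."""
    rng = rng or random.Random()
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")

    # First center: uniformly at random from the given points.
    centers = [rng.choice(points)]
    # d2[j] is the squared distance from points[j] to its closest chosen center.
    d2 = [squared_distance(p, centers[0]) for p in points]

    while len(centers) < k:
        if sum(d2) == 0:
            # Assumption for this sketch: if all points sit on a center,
            # fall back to a uniform choice.
            nxt = rng.choice(points)
        else:
            # i-th center: probability proportional to squared distance
            # to the closest previously chosen center.
            nxt = rng.choices(points, weights=d2, k=1)[0]
        centers.append(nxt)
        d2 = [min(d, squared_distance(p, nxt)) for d, p in zip(d2, points)]
    return centers
```

For example, `kmeanspp_seed([(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)], 2)` returns two of the four points, and the second is far more likely to come from the cluster that does not contain the first.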

arXiv.org e-Print Archive

CiteSeerX

Identifying hidden contexts

Author: Zliobaite Indre
Publication venue: Springer LNAI
Publication date: 24/05/2011

In this study we investigate how to identify hidden contexts from the data in classification tasks. Contexts are artifacts in the data that do not predict the class label directly. For instance, in a speech recognition task, speakers might have different accents, which do not directly discriminate between the spoken words. Identifying hidden contexts is considered a data preprocessing task, which can help to build more accurate classifiers tailored for particular contexts and give insight into the data structure. We present three techniques to identify hidden contexts, which hide class label information from the input data and partition it using clustering techniques. We form a collection of performance measures to ensure that the resulting contexts are valid. We evaluate the performance of the proposed techniques on thirty real datasets. We present a case study illustrating how the identified contexts can be used to build specialized, more accurate classifiers.

Bournemouth University Research Online

On-line PCA with Optimal Regrets

Author: A.T. Kalai
D.P. Helmbold
J. Kivinen
K. Tsuda
K.S. Azoury
M. Herbster
M.K. Warmuth
N. Cesa-Bianchi
Publication date: 01/01/2013

We carefully investigate the on-line version of PCA, where in each trial a learning algorithm plays a k-dimensional subspace and suffers the compression loss on the next instance when projected into the chosen subspace. In this setting, we analyze two popular on-line algorithms, Gradient Descent (GD) and Exponentiated Gradient (EG). We show that both algorithms are essentially optimal in the worst case. This comes as a surprise, since EG is known to perform sub-optimally when the instances are sparse. This different behavior of EG for PCA is mainly related to the non-negativity of the loss in this case, which makes the PCA setting qualitatively different from other settings studied in the literature. Furthermore, we show that when considering regret bounds as a function of a loss budget, EG remains optimal and strictly outperforms GD. Next, we study the extension of the PCA setting in which Nature is allowed to play with dense instances, which are positive matrices with bounded largest eigenvalue. Again we can show that EG is optimal and strictly better than GD in this setting.
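As a small illustration of the per-trial loss mentioned in the abstract, the sketch below computes the compression loss of one instance, taken here to be the squared Euclidean norm of the part of the instance lying outside the chosen subspace. The function name `compression_loss` and the representation of the subspace as a list of orthonormal basis vectors are assumptions for this example; the GD and EG updates themselves are not shown.

```python
def compression_loss(x, basis):
    """Squared norm of x minus its projection onto span(basis).

    Assumes `basis` is a list of orthonormal vectors (tuples of floats),
    so the projected part has squared norm sum_i (u_i . x)^2.
    """
    norm_sq = sum(v * v for v in x)
    captured = sum(
        sum(u_j * x_j for u_j, x_j in zip(u, x)) ** 2 for u in basis
    )
    return norm_sq - captured
```

With `x = (3.0, 4.0, 12.0)` and the two-dimensional subspace spanned by the first two coordinate axes, only the third coordinate is lost, so the loss is `12.0 ** 2`.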

arXiv.org e-Print Archive

CiteSeerX

Noetherianity up to symmetry

Author: Draisma Jan
Publication date: 01/01/2013

These lecture notes for the 2013 CIME/CIRM summer school Combinatorial Algebraic Geometry deal with manifestly infinite-dimensional algebraic varieties with large symmetry groups. So large, in fact, that subvarieties stable under those symmetry groups are defined by finitely many orbits of equations, whence the title Noetherianity up to symmetry. It is not the purpose of these notes to give a systematic, exhaustive treatment of such varieties, but rather to discuss a few "personal favourites": exciting examples drawn from applications in algebraic statistics and multilinear algebra. My hope is that these notes will attract other mathematicians to this vibrant area at the crossroads of combinatorics, commutative algebra, algebraic geometry, statistics, and other applications. Comment: To appear in Springer's LNM C.I.M.E. series; several typos fixed

arXiv.org e-Print Archive

CiteSeerX

Repository TU/e

Alternative efficiency measures for multiple-output production

Author: Fernandez Carmen
Koop Gary
Steel Mark
Publication venue: Elsevier BV
Publication date: 01/01/2005

This paper has two main purposes. Firstly, we develop various ways of defining efficiency in the case of multiple-output production. Our framework extends a previous model by allowing for nonseparability of inputs and outputs. We also specifically consider the case where some of the outputs are undesirable, such as pollutants. We investigate how these efficiency definitions relate to one another and to other approaches proposed in the literature. Secondly, we examine the behavior of these definitions in two examples of practically relevant size and complexity. One of these involves banking and the other agricultural data. Our main findings can be summarized as follows. For a given efficiency definition, efficiency rankings are found to be informative, despite the considerable uncertainty in the inference on efficiencies. It is, however, important for the researcher to select an efficiency concept appropriate to the particular issue under study, since different efficiency definitions can lead to quite different conclusions.

CiteSeerX

Warwick Research Archives Portal Repository

Composable security proof for continuous-variable quantum key distribution with coherent states

Author: Leverrier Anthony
Publication venue: American Physical Society (APS)
Publication date: 01/09/2014

We give the first composable security proof for continuous-variable quantum key distribution with coherent states against collective attacks. Crucially, in the limit of large blocks the secret key rate converges to the usual value computed from the Holevo bound. Combining our proof with either the de Finetti theorem or the Postselection technique then shows the security of the protocol against general attacks, thereby confirming the long-standing conjecture that Gaussian attacks are optimal asymptotically in the composable security framework. We expect that our parameter estimation procedure, which does not rely on any assumption, will find applications elsewhere, for instance for the reliable quantification of continuous-variable entanglement in finite-size settings. Comment: 27 pages, 1 figure. v2: added a version of the AEP valid for conditional state

arXiv.org e-Print Archive

INRIA a CCSD electronic archive server