10 research outputs found

    Constructing practical Fuzzy Extractors using QIM

    Get PDF
    Fuzzy extractors are a powerful tool for extracting randomness from noisy data. A fuzzy extractor can extract randomness only from discrete source data, while in practice source data is often continuous. Using quantizers to transform continuous data into discrete data is a common solution. However, as far as we know, no study has examined the effect of the quantization strategy on the performance of fuzzy extractors. We construct the encoding and decoding functions of a fuzzy extractor using quantization index modulation (QIM) and express the properties of this fuzzy extractor in terms of the parameters of the QIM used. We present and analyze an optimal (in the sense of embedding rate) two-dimensional construction. Our 6-hexagonal tiling construction offers $(\log_2 6)/2 - 1 \approx 0.3$ extra bits per dimension of the space compared to the known square-quantization-based fuzzy extractor.
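    As a sanity check on the claimed gain, the arithmetic can be reproduced directly; a minimal sketch in Python (the 1 bit per dimension baseline for square quantization is read off the abstract's formula):

```python
import math

# Embedding rate of the 6-hexagonal tiling QIM:
# 6 distinguishable tiles spread over 2 dimensions.
hex_rate = math.log2(6) / 2

# Square-quantization baseline implied by the abstract's formula
# (log2(6)/2 - 1): 1 bit per dimension.
square_rate = 1.0

print(f"hexagonal rate: {hex_rate:.3f} bits/dim")
print(f"extra bits:     {hex_rate - square_rate:.3f} bits/dim")  # ~0.29
```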

    Stein's method and Poisson process approximation for a class of Wasserstein metrics

    Full text link
    Based on Stein's method, we derive upper bounds for Poisson process approximation in the $L_1$-Wasserstein metric $d_2^{(p)}$, which is based on a slightly adapted $L_p$-Wasserstein metric between point measures. For the case $p=1$, this construction yields the metric $d_2$ introduced in [Barbour and Brown, Stochastic Process. Appl. 43 (1992) 9--31], for which Poisson process approximation is well studied in the literature. We demonstrate the usefulness of the extension to general $p$ by showing that $d_2^{(p)}$-bounds control differences between expectations of certain $p$th-order average statistics of point processes. To illustrate the bounds obtained for Poisson process approximation, we consider the structure of 2-runs and the hard core model as concrete examples.
    Comment: Published at http://dx.doi.org/10.3150/08-BEJ161 in Bernoulli (http://isi.cbs.nl/bernoulli/) by the International Statistical Institute/Bernoulli Society (http://isi.cbs.nl/BS/bshome.htm).
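    On two point measures with the same number of atoms, an $L_p$-Wasserstein-type distance reduces to an optimal matching of the atoms. A minimal illustrative sketch (this shows only the matching core; the paper's $d_2^{(p)}$ is a slightly adapted, bounded variant that also handles point measures of unequal size):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def lp_wasserstein(xs, ys, p=1):
    """L_p-Wasserstein distance between two point measures with
    equally many atoms, via an optimal matching of the atoms."""
    # cost[i, j] = |x_i - y_j|^p with Euclidean ground distance
    cost = np.linalg.norm(xs[:, None, :] - ys[None, :, :], axis=-1) ** p
    rows, cols = linear_sum_assignment(cost)       # optimal matching
    return cost[rows, cols].mean() ** (1.0 / p)    # (avg p-th power)^(1/p)

rng = np.random.default_rng(0)
xs, ys = rng.random((10, 2)), rng.random((10, 2))
print(lp_wasserstein(xs, ys, p=1))
print(lp_wasserstein(xs, ys, p=2))  # larger p weights big displacements more
```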

    Information theoretic bounds for Compressed Sensing

    Full text link
    In this paper we derive information-theoretic performance bounds for sensing and reconstruction of sparse phenomena from noisy projections. We consider two settings: output noise models, where the noise enters after the projection, and input noise models, where the noise enters before the projection. We consider two types of distortion for reconstruction: support errors and mean-squared errors. Our goal is to relate the number of measurements, $m$, and the SNR, to signal sparsity, $k$, distortion level, $d$, and signal dimension, $n$. We consider support errors in a worst-case setting. We employ different variations of Fano's inequality to derive necessary conditions on the number of measurements and SNR required for exact reconstruction. To derive sufficient conditions we develop new insights on max-likelihood analysis based on a novel superposition property. In particular, this property implies that small support errors are the dominant error events. Consequently, our ML analysis does not suffer the conservatism of the union bound and leads to a tighter analysis of max-likelihood. These results provide order-wise tight bounds. For output noise models we show that asymptotically an SNR of $\Theta(\log(n))$ together with $\Theta(k \log(n/k))$ measurements is necessary and sufficient for exact support recovery. Furthermore, if a small fraction of support errors can be tolerated, a constant SNR turns out to be sufficient in the linear sparsity regime. In contrast, for input noise models we show that support recovery fails if the number of measurements scales as $o(n\log(n)/\mathrm{SNR})$, implying poor compression performance for such cases. We also consider the Bayesian setup and characterize tradeoffs between mean-squared distortion and the number of measurements using rate-distortion theory.
    Comment: 30 pages, 2 figures, submitted to IEEE Trans. on I
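    The distinction between the two noise placements is easy to write down. A minimal sketch of the two measurement models (Gaussian sensing matrix and noise assumed purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, m = 256, 8, 64          # signal dimension, sparsity, measurements

# k-sparse signal on a random support
x = np.zeros(n)
support = rng.choice(n, size=k, replace=False)
x[support] = rng.standard_normal(k)

A = rng.standard_normal((m, n)) / np.sqrt(m)   # random projection
sigma = 0.1                                    # noise level

# Output noise model: noise enters AFTER the projection.
y_out = A @ x + sigma * rng.standard_normal(m)

# Input noise model: noise enters BEFORE the projection.
y_in = A @ (x + sigma * rng.standard_normal(n))
```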

    Representations and metrics for machine learning and the analysis of high-dimensional data

    Get PDF
    In the current information age, massive amounts of data are gathered at a rate prohibiting their effective structuring, analysis, and conversion into useful knowledge. This information overload is manifested both in large numbers of data objects recorded in data sets and in large numbers of attributes, also known as high dimensionality. This dissertation deals with problems originating from the high dimensionality of data representation, referred to as the “curse of dimensionality,” in the context of machine learning, data mining, and information retrieval. The described research follows two angles: studying the behavior of (dis)similarity metrics with increasing dimensionality, and exploring feature-selection methods, primarily with regard to document representation schemes for text classification. The main results of the dissertation, relevant to the first research angle, include theoretical insights into the concentration behavior of cosine similarity and a detailed analysis of the phenomenon of hubness, which refers to the tendency of some points in a data set to become hubs by being included in unexpectedly many k-nearest-neighbor lists of other points. The mechanisms behind the phenomenon are studied in detail, from both a theoretical and an empirical perspective: hubness is linked with the (intrinsic) dimensionality of data, its interaction with the cluster structure of data and with the information provided by class labels is described, and its interplay with well-known algorithms for classification, semi-supervised learning, clustering, and outlier detection is demonstrated, with special consideration given to time-series classification and information retrieval. Results pertaining to the second research angle include quantification of the interaction between various transformations of high-dimensional document representations and feature selection, in the context of text classification.
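    The k-occurrence counts behind hubness are simple to compute: for each point, count how many k-nearest-neighbor lists of other points it appears in. A minimal sketch (illustrative only, using scikit-learn; the skewness of the resulting N_k distribution is one common summary of hubness, not necessarily the dissertation's exact measure):

```python
import numpy as np
from scipy.stats import skew
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 100))   # high-dimensional data set

k = 10
# Ask for k+1 neighbors, since each point is its own nearest neighbor.
_, idx = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
knn_lists = idx[:, 1:]                 # drop self

# N_k(x): number of k-NN lists of other points that include x.
N_k = np.bincount(knn_lists.ravel(), minlength=len(X))

# A heavily right-skewed N_k distribution indicates hubs: points that
# appear in unexpectedly many neighbor lists.
print("skewness of N_k:", skew(N_k))
```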
