103 research outputs found

    High-Dimensional Density Ratio Estimation with Extensions to Approximate Likelihood Computation

    The ratio between two probability density functions is an important component of various tasks, including selection bias correction, novelty detection and classification. Recently, several estimators of this ratio have been proposed. Most of these methods fail if the sample space is high-dimensional, and hence require a dimension reduction step, the result of which can be a significant loss of information. Here we propose a simple-to-implement, fully nonparametric density ratio estimator that expands the ratio in terms of the eigenfunctions of a kernel-based operator; these functions reflect the underlying geometry of the data (e.g., submanifold structure), often leading to better estimates without an explicit dimension reduction step. We show how our general framework can be extended to address another important problem, the estimation of a likelihood function in situations where that function cannot be well-approximated by an analytical form. One is often faced with this situation when performing statistical inference with data from the sciences, due to the complexity of the data and of the processes that generated them. We emphasize applications where using existing likelihood-free methods of inference would be challenging due to the high dimensionality of the sample space, but where our spectral series method yields a reasonable estimate of the likelihood function. We provide theoretical guarantees and illustrate the effectiveness of our proposed method with numerical experiments. Comment: With supplementary material.
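As a rough illustration of the spectral series idea, the sketch below estimates a one-dimensional density ratio by expanding it in the empirical eigenfunctions of a Gaussian kernel operator (via a Nystroem extension); the bandwidth, truncation level, and all function names are our own choices for illustration, not the paper's code.

```python
import numpy as np

def gaussian_kernel(A, B, bandwidth):
    # Pairwise Gaussian kernel between the rows of A and B.
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * bandwidth ** 2))

def spectral_density_ratio(x_p, x_q, bandwidth=0.5, n_terms=15):
    """Estimate beta(x) ~ p(x)/q(x) by a truncated eigenfunction expansion."""
    m = len(x_q)
    K = gaussian_kernel(x_q, x_q, bandwidth)
    eigvals, eigvecs = np.linalg.eigh(K)
    top = np.argsort(eigvals)[::-1][:n_terms]
    lam, U = eigvals[top], eigvecs[:, top]
    # Nystroem extension of the eigenfunctions (orthonormal w.r.t. q):
    # psi_j(x) = (sqrt(m) / lam_j) * sum_i K(x, x_q[i]) U[i, j]
    def psi(X):
        return gaussian_kernel(X, x_q, bandwidth) @ U * (np.sqrt(m) / lam)
    # Expansion coefficients beta_j = E_p[psi_j(X)], estimated on the p-sample.
    beta = psi(x_p).mean(axis=0)
    return lambda X: np.clip(psi(X) @ beta, 0.0, None)

# Toy check with p = N(1, 1) and q = N(0, 1): the true ratio exp(x - 1/2)
# is increasing in x, and the estimate should be too.
rng = np.random.default_rng(0)
x_p = rng.normal(1.0, 1.0, size=(500, 1))
x_q = rng.normal(0.0, 1.0, size=(500, 1))
ratio = spectral_density_ratio(x_p, x_q)
print(ratio(np.array([[0.0], [1.0], [2.0]])))
```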

    Dimensionality reduction and simultaneous classification approaches for complex data: methods and applications

    Statistical learning (SL) is the study of the generalizable extraction of knowledge from data (Friedman et al. 2001). The concept of learning is invoked when human expertise does not exist, when humans are unable to explain their expertise, when the solution changes over time, or when the solution must be adapted to particular cases. The principal algorithms used in SL are classified as: (i) supervised learning (e.g., regression and classification), which is trained on labelled examples, i.e., inputs for which the desired output is known; a supervised learning algorithm attempts to generalize a function or mapping from inputs to outputs that can then be used to generate an output for previously unseen inputs; (ii) unsupervised learning (e.g., association and clustering), which operates on unlabelled examples, i.e., inputs for which the desired output is unknown; here the objective is to discover structure in the data (e.g., through a cluster analysis), not to generalize a mapping from inputs to outputs; (iii) semi-supervised learning, which combines labelled and unlabelled examples to generate an appropriate function or classifier.

In a multidimensional context, when the number of variables is very large, or when some of them are believed to contribute little to identifying the group structure in the data set, researchers apply a continuous model for dimensionality reduction (principal component analysis, factor analysis, correspondence analysis, etc.) and subsequently a discrete clustering model (K-means, mixture models, etc.) on the computed object scores. This approach is called tandem analysis (TA) by Arabie & Hubert (1994). However, De Sarbo et al. (1990) and De Soete & Carroll (1994) warn against this approach, because the dimension-reduction step may identify dimensions that contribute little to revealing the group structure and that may, on the contrary, obscure or mask the group structure present in the data. A solution to this problem is given by methodologies that detect factors and clusters simultaneously. For continuous data, many methods combining cluster analysis with the search for a reduced set of factors have been proposed, focusing on factorial methods, multidimensional scaling or unfolding analysis, and clustering (e.g., Heiser 1993, De Soete & Heiser 1993). De Soete & Carroll (1994) proposed an alternative to the K-means procedure, named reduced K-means (RKM), which turned out to equal the earlier proposed projection pursuit clustering (PPC) (Bolton & Krzanowski 2012). RKM simultaneously searches for a clustering of objects, based on the K-means criterion (MacQueen 1967), and a dimensionality reduction of the variables, based on principal component analysis (PCA). However, this approach may fail to recover the clustering of objects when the data contain much variance in directions orthogonal to the subspace in which the clusters reside (Timmerman et al. 2010). To solve this problem, Vichi & Kiers (2001) proposed the factorial K-means (FKM) model. FKM combines K-means cluster analysis with PCA, finding the subspace that best represents the clustering structure in the data: it works in the reduced space, simultaneously searching for the best partition of objects under the K-means criterion and for the best reduced orthogonal space under PCA.
When categorical variables are observed, TA corresponds to applying multiple correspondence analysis (MCA) first and K-means clustering on the resulting factors afterwards. Hwang et al. (2007) proposed an extension of MCA that takes into account cluster-level heterogeneity in respondents’ preferences/choices. The method combines MCA and K-means in a unified framework: the former uncovers a low-dimensional space of the multivariate categorical variables, while the latter identifies relatively homogeneous clusters of respondents. In recent years the dimensionality reduction problem has also become prominent in other statistical contexts, such as structural equation modeling (SEM). In a wide range of SEM applications, the assumption that data are collected from a single homogeneous population is often unrealistic, and the identification of different groups (clusters) of observations constitutes a critical issue in many fields. Following this research line, this doctoral thesis reviews the recent statistical models proposed to solve the dimensionality problem discussed above. In particular, the first chapter presents an application to hyperspectral data classification using the discriminant functions most commonly employed against high dimensionality, e.g., partial least squares discriminant analysis (PLS-DA); the second chapter presents the multiple correspondence K-means (MCKM) model proposed by Fordellone & Vichi (2017), which simultaneously identifies the best partition of the N objects described by the best orthogonal linear combination of categorical variables according to a single objective function; finally, the third chapter presents partial least squares structural equation modeling K-means (PLS-SEM-KM), proposed by Fordellone & Vichi (2018), which simultaneously identifies the best partition of the N objects described by the best causal relationship among the latent constructs.
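To make the tandem-analysis baseline concrete, here is a minimal sketch in Python of the two-step procedure (PCA, then K-means on the component scores) on synthetic data constructed so that high-variance noise directions mask the cluster subspace; this is exactly the failure mode that motivates RKM and FKM, and the data and parameter choices are ours for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three clusters living in 2 informative dimensions, plus 8 noise dimensions
# whose variance dominates -- the setting in which the reduction step can mask
# the group structure (Timmerman et al. 2010).
centers = rng.normal(0, 5, size=(3, 2))
informative = np.vstack([rng.normal(c, 1.0, size=(100, 2)) for c in centers])
noise = rng.normal(0, 10.0, size=(300, 8))
X = np.hstack([informative, noise])

# Step 1 of tandem analysis: continuous dimension reduction.
scores = PCA(n_components=2).fit_transform(X)
# Step 2: discrete clustering on the computed object scores.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(scores)
# Because the leading components chase the noise variance, these labels need
# not recover the true partition; RKM and FKM avoid this by optimizing the
# reduction and the partition under a single criterion.
```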

    CLADAG 2021 BOOK OF ABSTRACTS AND SHORT PAPERS

    The book collects the short papers presented at the 13th Scientific Meeting of the Classification and Data Analysis Group (CLADAG) of the Italian Statistical Society (SIS). The meeting was organized by the Department of Statistics, Computer Science and Applications of the University of Florence, under the auspices of the Italian Statistical Society and the International Federation of Classification Societies (IFCS). CLADAG is a member of the IFCS, a federation of national, regional, and linguistically-based classification societies. It is a non-profit, non-political scientific organization whose aim is to further classification research.

    Optimal L2-norm empirical importance weights for the change of probability measure

    This work proposes an optimization formulation to determine a set of empirical importance weights that achieve a change of probability measure. The objective is to estimate statistics of a target distribution using random samples generated from a (different) proposal distribution. This work considers the specific case in which the proposal distribution from which the random samples are generated is unknown; that is, the samples are available but no explicit description of their underlying distribution is. In this setting, the Radon–Nikodym theorem provides a valid but indeterminable solution to the task, since the distribution from which the random samples are generated is inaccessible. The proposed approach instead employs the well-defined and determinable empirical distribution function associated with the available samples. The core idea is to compute importance weights for the random samples such that the distance between the weighted proposal empirical distribution function and the desired target distribution function is minimized. The distance metric selected for this work is the L2-norm, and the importance weights are constrained to define a probability measure. The resulting optimization problem is a quadratic program with a single linear equality constraint and box constraints, which can be solved efficiently using optimization algorithms that scale well to high dimensions. Under some conditions restricting the class of distribution functions, the solution of the optimization problem is shown to yield a weighted proposal empirical distribution function that converges to the target distribution function in the L1-norm as the number of samples tends to infinity. Results on a variety of test cases show that the proposed approach performs well in comparison with other well-known approaches.
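A minimal sketch of the optimization just described, under our own choice of a one-dimensional Gaussian proposal/target pair and a generic solver (SLSQP) rather than a specialized QP method: the weighted proposal ECDF is linear in the weights, so the squared L2 distance to the target CDF at the sample points is a quadratic objective with one equality constraint and box constraints.

```python
import numpy as np
from scipy import stats
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = np.sort(rng.normal(0.0, 1.0, size=100))  # proposal samples, here N(0, 1)
b = stats.norm(0.5, 1.0).cdf(x)              # target CDF, here N(0.5, 1)

# Weighted ECDF at the sorted sample points: F_w(x_j) = sum_{i <= j} w_i,
# i.e. F_w = A @ w with A the lower-triangular all-ones matrix.
n = len(x)
A = np.tril(np.ones((n, n)))

def objective(w):
    r = A @ w - b
    return r @ r

def gradient(w):
    return 2.0 * A.T @ (A @ w - b)

res = minimize(objective, np.full(n, 1.0 / n), jac=gradient, method="SLSQP",
               bounds=[(0.0, 1.0)] * n,
               constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1.0}])
w = res.x
# The weights now act as an importance-weighted change of measure:
print("weighted estimate of the target mean:", float(w @ x))  # ~ 0.5
```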

    Transfer k-means: a new supervised clustering approach

    Supervised and unsupervised learning are two fundamental learning schemes whose difference lies in the presence or absence of a supervisor (i.e., an entity that provides examples). Transfer learning, on the other hand, aims at improving the learning of a task by using auxiliary knowledge. The goal of this thesis was to investigate how these two fundamental paradigms, supervised and unsupervised learning, can collaborate in the setting of transfer learning. As a result, we developed transfer k-means, a transfer learning variant of the popular k-means heuristic. The proposed method enhances the unsupervised nature of k-means by using supervision from a different but related context as a seeding technique, in order to steer the heuristic toward more meaningful results. We provide approximation guarantees based on the nature of the input, and we experimentally validate the benefits of the proposed method using natural-language documents as a real-world example.
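A minimal sketch of the seeding idea described above, assuming synthetic source and target domains of our own construction: class centroids computed on the labeled source data initialize k-means on the unlabeled target data in place of random seeds.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
k, d = 3, 5
true_centers = rng.normal(0, 5, size=(k, d))

# Labeled source domain: the same clusters, slightly shifted.
src_X = np.vstack([rng.normal(c + 0.5, 1.0, size=(50, d)) for c in true_centers])
src_y = np.repeat(np.arange(k), 50)

# Unlabeled target domain.
tgt_X = np.vstack([rng.normal(c, 1.0, size=(80, d)) for c in true_centers])

# Transfer step: seed k-means with the source class centroids.
seeds = np.vstack([src_X[src_y == j].mean(axis=0) for j in range(k)])
km = KMeans(n_clusters=k, init=seeds, n_init=1).fit(tgt_X)
print(km.cluster_centers_.round(2))
```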

    Discriminant feature extraction: exploiting structures within each sample and across samples.

    Zhang, Wei. Thesis (M.Phil.)--Chinese University of Hong Kong, 2009. Includes bibliographical references (leaves 95-109). Abstract also in Chinese.
    Contents: Abstract (p.i); Acknowledgement (p.iv)
    Chapter 1: Introduction (p.1) -- 1.1 Area of Machine Learning (p.1); 1.1.1 Types of Algorithms (p.2); 1.1.2 Modeling Assumptions (p.4); 1.2 Dimensionality Reduction (p.4); 1.3 Structure of the Thesis (p.8)
    Chapter 2: Dimensionality Reduction (p.10) -- 2.1 Feature Extraction (p.11); 2.1.1 Linear Feature Extraction (p.11); 2.1.2 Nonlinear Feature Extraction (p.16); 2.1.3 Sparse Feature Extraction (p.19); 2.1.4 Nonnegative Feature Extraction (p.19); 2.1.5 Incremental Feature Extraction (p.20); 2.2 Feature Selection (p.20); 2.2.1 Viewpoint of Feature Extraction (p.21); 2.2.2 Feature-Level Score (p.22); 2.2.3 Subset-Level Score (p.22)
    Chapter 3: Various Views of Feature Extraction (p.24) -- 3.1 Probabilistic Models (p.25); 3.2 Matrix Factorization (p.26); 3.3 Graph Embedding (p.28); 3.4 Manifold Learning (p.28); 3.5 Distance Metric Learning (p.32)
    Chapter 4: Tensor Linear Laplacian Discrimination (p.34) -- 4.1 Motivation (p.35); 4.2 Tensor Linear Laplacian Discrimination (p.37); 4.2.1 Preliminaries of Tensor Operations (p.38); 4.2.2 Discriminant Scatters (p.38); 4.2.3 Solving for Projection Matrices (p.40); 4.3 Definition of Weights (p.44); 4.3.1 Contextual Distance (p.44); 4.3.2 Tensor Coding Length (p.45); 4.4 Experimental Results (p.47); 4.4.1 Face Recognition (p.48); 4.4.2 Texture Classification (p.50); 4.4.3 Handwritten Digit Recognition (p.52); 4.5 Conclusions (p.54)
    Chapter 5: Semi-Supervised Semi-Riemannian Metric Map (p.56) -- 5.1 Introduction (p.57); 5.2 Semi-Riemannian Spaces (p.60); 5.3 Semi-Supervised Semi-Riemannian Metric Map (p.61); 5.3.1 The Discrepancy Criterion (p.61); 5.3.2 Semi-Riemannian Geometry Based Feature Extraction Framework (p.63); 5.3.3 Semi-Supervised Learning of Semi-Riemannian Metrics (p.65); 5.4 Discussion (p.72); 5.4.1 A General Framework for Semi-Supervised Dimensionality Reduction (p.72); 5.4.2 Comparison to SRDA (p.74); 5.4.3 Advantages over Semi-supervised Discriminant Analysis (p.74); 5.5 Experiments (p.75); 5.5.1 Experimental Setup (p.76); 5.5.2 Face Recognition (p.76); 5.5.3 Handwritten Digit Classification (p.82); 5.6 Conclusion (p.84)
    Chapter 6: Summary (p.86)
    Appendix A: The Relationship between LDA and LLD (p.89); Appendix B: Coding Length (p.91); Appendix C: Connection between SRDA and ANMM (p.92); Appendix D: From S3RMM to Graph-Based Approaches (p.93)
    Bibliography (p.95)

    Proceedings of the 35th International Workshop on Statistical Modelling: July 20-24, 2020, Bilbao, Basque Country, Spain

    466 p. The International Workshop on Statistical Modelling (IWSM) is a reference workshop promoting statistical modelling and the application of statistics for researchers, academics and industrialists in a broad sense. Unfortunately, the global COVID-19 pandemic did not allow the 35th edition of the IWSM to be held in Bilbao in July 2020. Despite the situation, and following the spirit of the Workshop and the Statistical Modelling Society, we are delighted to bring you this proceedings book of extended abstracts.

    Untangling hotel industry’s inefficiency: An SFA approach applied to a renowned Portuguese hotel chain

    The present paper explores the technical efficiency of four hotels of the Teixeira Duarte Group, a renowned Portuguese hotel chain. An efficiency ranking of these four hotel units located in Portugal is established using Stochastic Frontier Analysis (SFA). This methodology makes it possible to discriminate between measurement error and systematic inefficiencies in the estimation process, enabling investigation of the main causes of inefficiency. Several suggestions for efficiency improvement are offered for each hotel studied.
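For readers unfamiliar with how SFA separates noise from inefficiency, here is a minimal sketch of a stochastic frontier production model with the standard normal/half-normal composed error, fit by maximum likelihood on synthetic data; the data, starting values, and solver are our own illustrative choices, not the paper's hotel data or estimation code.

```python
import numpy as np
from scipy import stats
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n = 300
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # intercept + one input
beta_true, sv, su = np.array([1.0, 0.8]), 0.2, 0.4
v = rng.normal(0, sv, n)            # symmetric noise (measurement error)
u = np.abs(rng.normal(0, su, n))    # one-sided technical inefficiency
y = X @ beta_true + v - u           # stochastic production frontier

def neg_loglik(theta):
    beta, log_sv, log_su = theta[:2], theta[2], theta[3]
    s_v, s_u = np.exp(log_sv), np.exp(log_su)
    sigma, lam = np.hypot(s_v, s_u), s_u / s_v
    eps = y - X @ beta
    # Density of eps = v - u: (2/sigma) phi(eps/sigma) Phi(-lam*eps/sigma)
    ll = (np.log(2) - np.log(sigma) + stats.norm.logpdf(eps / sigma)
          + stats.norm.logcdf(-lam * eps / sigma))
    return -ll.sum()

res = minimize(neg_loglik, x0=np.array([0.0, 0.0, np.log(0.3), np.log(0.3)]),
               method="BFGS")
print("beta:", res.x[:2].round(3),
      "sigma_v, sigma_u:", np.exp(res.x[2:]).round(3))
```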