
    Bayesian Hyperbolic Multidimensional Scaling

    Multidimensional scaling (MDS) is a widely used approach to representing high-dimensional, dependent data. MDS works by assigning each observation a location on a low-dimensional geometric manifold, with distance on the manifold representing similarity. We propose a Bayesian approach to multidimensional scaling when the low-dimensional manifold is hyperbolic. Using hyperbolic space facilitates representing tree-like structures common in many settings (e.g. text or genetic data with hierarchical structure). A Bayesian approach provides regularization that minimizes the impact of measurement error in the observed data and assesses uncertainty. We also propose a case-control likelihood approximation that allows for efficient sampling from the posterior distribution in larger data settings, reducing computational complexity from approximately O(n^2) to O(n). We evaluate the proposed method against state-of-the-art alternatives using simulations, canonical reference datasets, Indian village network data, and human gene expression data.
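
    The abstract does not spell out the geometry, but hyperbolic MDS methods of this kind typically measure similarity with the Poincaré-ball distance. The sketch below is a generic illustration of that formula; the function and variable names are ours, not the paper's.

```python
import numpy as np

def poincare_distance(u, v):
    """Hyperbolic distance between two points inside the unit Poincare ball."""
    diff = np.sum((u - v) ** 2)
    denom = (1.0 - np.sum(u ** 2)) * (1.0 - np.sum(v ** 2))
    return np.arccosh(1.0 + 2.0 * diff / denom)

# Example: two latent positions in a 2-D Poincare disk
x = np.array([0.1, 0.2])
y = np.array([-0.3, 0.4])
print(poincare_distance(x, y))
```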

    More on Multidimensional Scaling and Unfolding in R: smacof Version 2

    The smacof package offers a comprehensive implementation of multidimensional scaling (MDS) techniques in R. Since its first publication (De Leeuw and Mair 2009b), the functionality of the package has been enhanced, and several additional methods, features and utilities have been added. Major updates include a complete re-implementation of multidimensional unfolding allowing for monotone dissimilarity transformations, including row-conditional, circular, and external unfolding. Additionally, the constrained MDS implementation was extended in terms of optimal scaling of the external variables. Further package additions include various tools and functions for goodness-of-fit assessment, unidimensional scaling, gravity MDS, asymmetric MDS, Procrustes, and MDS biplots. All these new package functionalities are illustrated using a variety of real-life applications.

    Visualising Mutually Non-dominating Solution Sets in Many-objective Optimisation

    As many-objective optimization algorithms mature, the problem owner is faced with visualizing and understanding a set of mutually nondominating solutions in a high-dimensional space. We review existing methods and present new techniques to address this problem. The well-known heatmap visualization suffers from a common problem: the often arbitrary ordering of rows and columns renders the heatmap unclear. We address this by using spectral seriation to rearrange the solutions and objectives and thus enhance the clarity of the heatmap. A multiobjective evolutionary optimizer is used to further enhance the simultaneous visualization of solutions in objective and parameter space. Two methods for visualizing multiobjective solutions in the plane are introduced. First, we use RadViz and exploit interpretations of barycentric coordinates for convex polygons and simplices to map a mutually nondominating set to the interior of a regular convex polygon in the plane, providing an intuitive representation of the solutions and objectives. Second, we introduce a new measure of the similarity of solutions, the dominance distance, which captures the order relations between solutions. This metric provides an embedding in Euclidean space, which is shown to yield coherent visualizations in two dimensions. The methods are illustrated on standard test problems and data from a benchmark many-objective problem.
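
    As a rough illustration of the RadViz-style mapping described above, the sketch below places one anchor per objective on the unit circle and positions each solution at the weighted average of the anchors. Details such as the min-max normalisation are our assumptions, not the paper's exact procedure.

```python
import numpy as np

def radviz(objectives):
    """Map an (n_solutions, n_objectives) array to 2-D RadViz coordinates."""
    F = np.asarray(objectives, dtype=float)
    # Normalise each objective to [0, 1] so it acts as a pull towards its anchor
    F = (F - F.min(axis=0)) / (np.ptp(F, axis=0) + 1e-12)
    m = F.shape[1]
    angles = 2 * np.pi * np.arange(m) / m
    anchors = np.stack([np.cos(angles), np.sin(angles)], axis=1)   # (m, 2)
    weights = F / (F.sum(axis=1, keepdims=True) + 1e-12)
    return weights @ anchors                                        # (n, 2)

# Toy set of 50 solutions with three objectives
points = radviz(np.random.rand(50, 3))
```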

    Euclidean Distance Matrices: Properties, Algorithms and Applications

    Euclidean distance matrices (EDMs) are central players in many diverse fields including psychometrics, NMR spectroscopy, machine learning and sensor networks. However, they are not often exploited in signal processing. In this thesis, we analyze attributes of EDMs and derive new key properties of them. These analyses allow us to propose algorithms to approximate EDMs and provide analytic bounds on the performance of our methods. We use these techniques to suggest new solutions for several practical problems in signal processing. Together with these properties, algorithms and applications, EDMs can thus be considered as a fundamental toolbox to be used in signal processing.

    In more detail, we start by introducing the structure and properties of EDMs. In particular, we focus on their rank property: the rank of an EDM is at most the dimension of the set of points generating it plus 2. Using this property, we introduce the use of low-rank matrix completion methods for approximating and completing noisy and partially revealed EDMs. We apply this algorithm to the problem of sensor position calibration in ultrasound tomography devices. By adapting the matrix completion framework, in addition to proposing a self-calibration process for these devices, we also provide analytic bounds for the calibration error.

    We then study the problem of sensor localization using distance information by minimizing a non-linear cost function known as the s-stress function in the multidimensional scaling (MDS) community. We derive key properties of this cost function that can be used to reduce the search domain for finding its global minimum. We provide an efficient, low-cost and distributed algorithm for minimizing this cost function for incomplete networks and noisy measurements. In randomized experiments, the proposed method converges to the global minimum of the s-stress in more than 99% of the cases. We also address the open problem of the existence of non-global minimizers of the s-stress and reduce this problem to a hypothesis: if the hypothesis is true, then the cost function has only global minimizers; otherwise, it has non-global minimizers.

    Using the rank property of EDMs and the proposed minimization algorithm for approximating them, we address an interesting and practical problem in acoustics. We show that using five microphones and one loudspeaker, we can hear the shape of a room. We reformulate this problem as finding the locations of the image sources of the loudspeaker with respect to the walls. We propose an algorithm to find these positions using only first-order echoes. We prove that the reconstruction of the room is almost surely unique. We further introduce a new algorithm for locating a microphone inside a known room using only one loudspeaker. Our experimental evaluations, conducted on the EPFL campus and also in the Lausanne cathedral, confirm the robustness and accuracy of the proposed methods.

    By integrating further properties of EDMs into the matrix completion framework, we propose a new method for calibrating microphone arrays in a diffuse noise field. We use a specific characterization of diffuse noise fields to relate the coherence of signals recorded by two microphones to their mutual distance. As this model is not reliable for large distances between microphones, we use matrix completion coupled with other properties of EDMs to estimate these distances and calibrate the microphone array. Evaluation of our algorithm using real data measurements demonstrates, for the first time, the possibility of accurately calibrating large ad-hoc microphone arrays in a diffuse noise field.

    The last part of the thesis addresses a central problem in signal processing: the design of discrete-time filters (equivalently window functions) that are compact both in time and frequency. By properly adapting the definitions of compactness in continuous time to discrete time, we formulate the search for maximally compact sequences as solving a semi-definite program. We show that the spectra of maximally compact sequences are a special class of Mathieu’s cosine functions. Using the asymptotic behavior of these functions, we provide a tight bound for the time-frequency spread of discrete-time sequences. Our analysis shows that the Heisenberg uncertainty bound on the time-frequency spread of sequences is not tight and the lower bound depends on the frequency spread, unlike in the continuous-time case.
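
    A short numerical illustration of the rank property quoted above (the rank of an EDM is at most the point-set dimension plus 2) and of the classical-MDS reconstruction it enables; all names are illustrative and the code is not taken from the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 20
X = rng.normal(size=(n, d))                      # n points in d dimensions

# Squared-distance (EDM) matrix: D[i, j] = ||x_i - x_j||^2
G = X @ X.T
D = np.diag(G)[:, None] + np.diag(G)[None, :] - 2 * G

print(np.linalg.matrix_rank(D))                  # at most d + 2 = 5

# Classical MDS: recover a point set (up to rotation/translation) from D
J = np.eye(n) - np.ones((n, n)) / n              # centering matrix
B = -0.5 * J @ D @ J                             # Gram matrix of centered points
w, V = np.linalg.eigh(B)
idx = np.argsort(w)[::-1][:d]
Y = V[:, idx] * np.sqrt(np.maximum(w[idx], 0))   # embedding in d dimensions
```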

    Smartphone-based photogrammetry for the 3D modeling of a geomorphological structure

    The geomatic survey in the speleological field is one of the main activities that allows both scientific and popular value to be added to cave exploration, and it is of fundamental importance for a detailed knowledge of the hypogean cavity. Today, the available instruments, such as laser scanners and metric cameras, allow us to quickly acquire data and obtain accurate three-dimensional models, but they are still expensive and require a careful planning phase of the survey, as well as some operator experience for their management. This work analyzes the performance of a smartphone device in a close-range photogrammetry approach for the extraction of accurate three-dimensional information of an underground cave. The image datasets acquired with a high-end smartphone were processed using the Structure from Motion (SfM)-based approach for dense point cloud generation: different image-matching algorithms implemented in a commercial and an open-source software package and in a smartphone application were tested. In order to assess the reachable accuracy of the proposed procedure, the achieved results were compared with a reference dense point cloud obtained with a professional camera or a terrestrial laser scanner. The approach has shown good performance in terms of geometrical accuracy, computational time and applicability.

    Detecting Events and Patterns in Large-Scale User Generated Textual Streams with Statistical Learning Methods

    A vast amount of textual web-stream content is influenced by events or phenomena emerging in the real world. The social web forms an excellent modern paradigm, where unstructured user-generated content is published on a regular basis and in most cases is freely distributed. The present Ph.D. thesis deals with the problem of inferring information - or patterns in general - about events emerging in real life based on the contents of this textual stream. We show that it is possible to extract valuable information about social phenomena, such as an epidemic or even rainfall rates, by automatic analysis of the content published in social media, and in particular Twitter, using statistical machine learning methods. An important intermediate task concerns the formation and identification of features which characterise a target event; we select and use those textual features in several linear, non-linear and hybrid inference approaches, achieving good performance in terms of the applied loss function. By examining this rich data set further, we also propose methods for extracting various types of mood signals, revealing how affective norms - at least within the social web's population - evolve during the day and how significant events emerging in the real world influence them. Lastly, we present some preliminary findings showing several spatiotemporal characteristics of this textual information, as well as the potential of using it to tackle tasks such as the prediction of voting intentions.

    Adaptive prototype-based dissimilarity learning

    Zhu X. Adaptive prototype-based dissimilarity learning. Bielefeld: Universitätsbibliothek Bielefeld; 2015. In this thesis we focus on prototype-based learning techniques, namely three unsupervised techniques: generative topographic mapping (GTM), neural gas (NG) and affinity propagation (AP), and two supervised techniques: generalized learning vector quantization (GLVQ) and robust soft learning vector quantization (RSLVQ). We extend their abilities with respect to the following central aspects (a sketch of one of the approximation techniques follows the list):
    • Applicability on dissimilarity data: Due to the increased complexity of data, in many cases data are only available in the form of (dis)similarities which describe the relations between objects. Classical methods cannot directly deal with this kind of data. For unsupervised methods this problem has been studied; here we transfer the same idea to the two supervised prototype-based techniques such that they can directly deal with dissimilarities without an explicit embedding into a vector space.
    • Quadratic complexity issue: For dealing with dissimilarity data, due to the need for the full dissimilarity matrix, the complexity becomes quadratic, which is infeasible for large data sets. In this thesis we investigate two linear approximation techniques, Nyström approximation and patch processing, and integrate them into unsupervised and supervised prototype-based techniques.
    • Reliability of prototype-based classifiers: In practical applications, a reliability measure is beneficial for evaluating the classification quality expected by the end users. Here we adopt concepts from conformal prediction (CP), which provides a point-wise confidence measure of the prediction, and we combine those with supervised prototype-based techniques.
    • Model complexity: By means of the confidence values provided by CP, the model complexity can be automatically adjusted by adding new prototypes to cover low-confidence regions of the data space.
    • Extendability to semi-supervised problems: Besides its ability to evaluate a classifier, conformal prediction can also be considered as a classifier. This opens a way for supervised techniques to be easily extended to semi-supervised settings by means of a self-training approach.
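
    Nyström approximation, mentioned above as one of the linear-complexity techniques, is usually stated for positive semi-definite similarity matrices; the minimal generic sketch below uses our own naming and is not the thesis code. The thesis adapts this idea to dissimilarity data, which the sketch does not cover.

```python
import numpy as np

def nystroem(landmark_cols, landmark_idx):
    """Nystroem approximation K ~ C W^+ C^T from m landmark columns.

    landmark_cols : (n, m) columns of the full similarity matrix
    landmark_idx  : indices of the m landmark points
    """
    C = landmark_cols
    W = C[landmark_idx, :]                 # (m, m) landmark-by-landmark block
    return C @ np.linalg.pinv(W) @ C.T     # (n, n) low-rank approximation

# Usage with a toy RBF similarity matrix: only 50 of 500 columns are computed
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 10))
idx = rng.choice(500, size=50, replace=False)
sq = ((X[:, None, :] - X[idx][None, :, :]) ** 2).sum(-1)
C = np.exp(-sq)                            # (500, 50) landmark columns
K_hat = nystroem(C, idx)
```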

    Grassmann Learning for Recognition and Classification

    Computational performance associated with high-dimensional data is a common challenge for real-world classification and recognition systems. Subspace learning has received considerable attention as a means of finding an efficient low-dimensional representation that leads to better classification and efficient processing. A Grassmann manifold is a smooth manifold whose points represent subspaces, with the relationship between points defined through mappings of orthogonal basis matrices. Grassmann learning involves embedding high-dimensional subspaces and kernelizing the embedding onto a projection space where distance computations can be performed effectively. In this dissertation, Grassmann learning and its benefits towards action classification and face recognition in terms of accuracy and performance are investigated and evaluated. Grassmannian Sparse Representation (GSR) and Grassmannian Spectral Regression (GRASP) are proposed as Grassmann-inspired subspace learning algorithms. GSR is a novel subspace learning algorithm that combines the benefits of Grassmann manifolds with sparse representations using a least-squares loss with ℓ1-norm minimization for improved classification. GRASP is a novel subspace learning algorithm that leverages the benefits of Grassmann manifolds and Spectral Regression in a framework that supports high discrimination between classes and achieves computational benefits by using manifold modeling and avoiding eigen-decomposition. The effectiveness of GSR and GRASP is demonstrated for computationally intensive classification problems: (a) multi-view action classification using the IXMAS Multi-View dataset, the i3DPost Multi-View dataset, and the WVU Multi-View dataset, (b) 3D action classification using the MSRAction3D dataset and MSRGesture3D dataset, and (c) face recognition using the AT&T Face Database, Labeled Faces in the Wild (LFW), and the Extended Yale Face Database B (YALE). Additional contributions include the definition of Motion History Surfaces (MHS) and Motion Depth Surfaces (MDS) as descriptors suitable for activity representations in video sequences and 3D depth sequences. An in-depth analysis of Grassmann metrics is applied on high-dimensional data with different levels of noise and data distributions, which reveals that standardized Grassmann kernels are favorable over geodesic metrics on a Grassmann manifold. Finally, an extensive performance analysis is made that supports Grassmann subspace learning as an effective approach for classification and recognition.
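
    For orientation, the sketch below shows two standard Grassmann quantities of the kind compared in the dissertation: the geodesic distance obtained from principal angles and the projection (Frobenius) kernel. This is textbook material, not the GSR or GRASP implementation.

```python
import numpy as np

def principal_angles(A, B):
    """Principal angles between the column spans of two tall matrices."""
    Qa, _ = np.linalg.qr(A)
    Qb, _ = np.linalg.qr(B)
    s = np.linalg.svd(Qa.T @ Qb, compute_uv=False)
    return np.arccos(np.clip(s, -1.0, 1.0))

def geodesic_distance(A, B):
    """Arc-length (geodesic) distance on the Grassmann manifold."""
    return np.linalg.norm(principal_angles(A, B))

def projection_kernel(A, B):
    """Projection (Frobenius) kernel, a standard Grassmann kernel."""
    Qa, _ = np.linalg.qr(A)
    Qb, _ = np.linalg.qr(B)
    return np.linalg.norm(Qa.T @ Qb, "fro") ** 2

# Two 3-dimensional subspaces of R^10, each given by a basis matrix
rng = np.random.default_rng(0)
A, B = rng.normal(size=(10, 3)), rng.normal(size=(10, 3))
print(geodesic_distance(A, B), projection_kernel(A, B))
```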

    Distance-based analysis of dynamical systems and time series by optimal transport

    The concept of distance is a fundamental notion that forms a basis for orientation in space. It is related to the scientific measurement process: quantitative measurements result in numerical values, and these can be immediately translated into distances. Vice versa, a set of mutual distances defines an abstract Euclidean space. Each system is thereby represented as a point, whose Euclidean distances approximate the original distances as closely as possible. If the original distance measures interesting properties, these can be recovered as interesting patterns in this space. This idea is applied to complex systems: the act of breathing, the structure and activity of the brain, and dynamical systems and time series in general. In all these situations, optimal transportation distances are used; these measure how much work is needed to transform one probability distribution into another. The reconstructed Euclidean space then permits the application of multivariate statistical methods. In particular, canonical discriminant analysis makes it possible to distinguish between distinct classes of systems, e.g., between healthy and diseased lungs. This offers new diagnostic perspectives in the assessment of lung and brain diseases, and also offers a new approach to numerical bifurcation analysis and to quantifying synchronization in dynamical systems.
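
    As a minimal illustration of an optimal-transport distance, the sketch below computes the one-dimensional Wasserstein distance between two empirical samples with SciPy; the transport problems treated in the thesis are more general, and the sample data here are made up.

```python
import numpy as np
from scipy.stats import wasserstein_distance

# Two empirical distributions, e.g. summary signals from two recordings
rng = np.random.default_rng(0)
a = rng.normal(loc=0.0, scale=1.0, size=1000)
b = rng.normal(loc=0.5, scale=1.2, size=1000)

# 1-D optimal-transport (earth mover's) distance between the two samples
d = wasserstein_distance(a, b)
print(d)
```

    A matrix of such pairwise distances between systems could then be embedded with classical MDS to obtain the Euclidean representation described above.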

    Statistical analysis of high-dimensional biomedical data: a gentle introduction to analytical goals, common approaches and challenges

    Background: In high-dimensional data (HDD) settings, the number of variables associated with each observation is very large. Prominent examples of HDD in biomedical research include omics data with a large number of variables such as many measurements across the genome, proteome, or metabolome, as well as electronic health records data that have large numbers of variables recorded for each patient. The statistical analysis of such data requires knowledge and experience, sometimes of complex methods adapted to the respective research questions.
    Methods: Advances in statistical methodology and machine learning methods offer new opportunities for innovative analyses of HDD, but at the same time require a deeper understanding of some fundamental statistical concepts. Topic group TG9 “High-dimensional data” of the STRATOS (STRengthening Analytical Thinking for Observational Studies) initiative provides guidance for the analysis of observational studies, addressing particular statistical challenges and opportunities for the analysis of studies involving HDD. In this overview, we discuss key aspects of HDD analysis to provide a gentle introduction for non-statisticians and for classically trained statisticians with little experience specific to HDD.
    Results: The paper is organized with respect to subtopics that are most relevant for the analysis of HDD, in particular initial data analysis, exploratory data analysis, multiple testing, and prediction. For each subtopic, main analytical goals in HDD settings are outlined. For each of these goals, basic explanations for some commonly used analysis methods are provided. Situations are identified where traditional statistical methods cannot, or should not, be used in the HDD setting, or where adequate analytic tools are still lacking. Many key references are provided.
    Conclusions: This review aims to provide a solid statistical foundation for researchers, including statisticians and non-statisticians, who are new to research with HDD or simply want to better evaluate and understand the results of HDD analyses.
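
    As one concrete example of the multiple-testing subtopic listed above, the sketch below implements the standard Benjamini-Hochberg false-discovery-rate procedure often used in HDD settings; it is our own minimal version, not code from the paper, and the example data are synthetic.

```python
import numpy as np

def benjamini_hochberg(pvals, alpha=0.05):
    """Return a boolean mask of rejected hypotheses at FDR level alpha."""
    p = np.asarray(pvals)
    m = p.size
    order = np.argsort(p)
    thresholds = alpha * np.arange(1, m + 1) / m     # (k / m) * alpha
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max()               # largest k meeting p_(k) <= k*alpha/m
        reject[order[: k + 1]] = True
    return reject

# Example: 10,000 tests, of which 300 carry a real signal (small p-values)
rng = np.random.default_rng(0)
p = np.concatenate([rng.uniform(size=9700), rng.beta(1, 50, size=300)])
print(benjamini_hochberg(p).sum(), "rejections")
```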