
    A Survey on Metric Learning for Feature Vectors and Structured Data

    The need for appropriate ways to measure the distance or similarity between data is ubiquitous in machine learning, pattern recognition and data mining, but handcrafting good metrics for specific problems is generally difficult. This has led to the emergence of metric learning, which aims to learn a metric automatically from data and has attracted considerable interest in machine learning and related fields over the past ten years. This survey proposes a systematic review of the metric learning literature, highlighting the pros and cons of each approach. We pay particular attention to Mahalanobis distance metric learning, a well-studied and successful framework, but additionally present a wide range of methods that have recently emerged as powerful alternatives, including nonlinear metric learning, similarity learning and local metric learning. Recent trends and extensions, such as semi-supervised metric learning, metric learning for histogram data and the derivation of generalization guarantees, are also covered. Finally, this survey addresses metric learning for structured data, in particular edit distance learning, and attempts to give an overview of the remaining challenges in metric learning for the years to come. (Technical report, 59 pages.)
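
    As a concrete illustration of the Mahalanobis-form distances this survey centers on, here is a minimal NumPy sketch of d_M(x, y) = sqrt((x - y)^T M (x - y)). For simplicity M is fixed to the inverse sample covariance (the classical, non-learned choice); the surveyed methods instead learn M from side information such as pairwise constraints.

```python
import numpy as np

def mahalanobis(x, y, M):
    """d_M(x, y) = sqrt((x - y)^T M (x - y)) for a positive semidefinite M."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.sqrt(diff @ M @ diff))

# Classical (non-learned) choice: M = inverse sample covariance,
# which recovers the textbook Mahalanobis distance.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
M = np.linalg.inv(np.cov(X, rowvar=False))
print(mahalanobis(X[0], X[1], M))
```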

    Kernel depth functions for functional data

    In recent years the concept of data depth has been increasingly used in Statistics as a center-outward ordering of sample points in multivariate data sets. Recently, data depth has been extended to functional data. In this paper we propose new intrinsic functional data depths based on the representation of functional data in Reproducing Kernel Hilbert Spaces, and test their performance against a number of well-known alternatives on the problem of functional outlier detection. The authors acknowledge financial support from the Spanish Ministry of Economy and Competitiveness (grant ECO2015-66593-P).
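
    A minimal sketch of the kind of kernel-based functional depth at issue here: the depth of a curve is its average kernel similarity to the sample curves (in the spirit of h-depth), and low-depth curves are flagged as outliers. The Gaussian kernel, bandwidth, and 5% cutoff below are illustrative assumptions, not the paper's exact RKHS construction.

```python
import numpy as np

def kernel_depth(x, sample, bandwidth=1.0):
    """Depth of curve x (values on a common grid): average Gaussian-kernel
    similarity to the sample curves. Higher depth = more central curve."""
    dists = np.linalg.norm(sample - x, axis=1)   # L2 distances on the grid
    return float(np.mean(np.exp(-(dists / bandwidth) ** 2)))

# Flag as outliers the curves whose depth falls below a low sample quantile.
rng = np.random.default_rng(1)
grid = np.linspace(0.0, 1.0, 50)
curves = np.sin(2 * np.pi * grid) + 0.1 * rng.normal(size=(100, 50))
curves[:3] += 1.0                                # three shifted (outlying) curves
depths = np.array([kernel_depth(c, curves, bandwidth=2.0) for c in curves])
print(np.where(depths < np.quantile(depths, 0.05))[0])
```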

    Statistical distances and probability metrics for multivariate data, ensembles and probability distributions

    The use of distance measures in Statistics is of fundamental importance for solving practical problems such as hypothesis testing, independence testing, goodness-of-fit testing, classification, outlier detection and density estimation, to name just a few. The Mahalanobis distance was originally developed to compute the distance from a point to the center of a distribution while taking into account the distribution of the data, in this case the normal distribution; it is the only distance measure in the statistical literature that takes into account the probabilistic information of the data. In this thesis we study different distance measures that share a fundamental characteristic: all of the proposed distances incorporate probabilistic information. The thesis is organized as follows. Chapter 1 motivates the problems addressed in this thesis. Chapter 2 presents the usual definitions and properties of the distance measures for multivariate data and for probability distributions treated in the statistical literature. Chapter 3 proposes a distance that generalizes the Mahalanobis distance to the case where the distribution of the data is not Gaussian; to this aim, we introduce a Mercer kernel based on the distribution of the data at hand, which induces distances from a point to the center of a distribution. This chapter also presents a plug-in estimator of the distance that allows us to solve classification and outlier detection problems efficiently. Chapter 4 presents two new distance measures for multivariate data that incorporate the probabilistic information contained in the sample, introduces two estimation methods for the proposed distances, and studies their convergence empirically; in the experimental section we solve classification problems and obtain better results than several standard methods from the discriminant analysis literature. Chapter 5 proposes a new family of probability metrics and studies its theoretical properties; we introduce an estimation method for the proposed distances based on the estimation of level sets, avoiding the difficult task of density estimation, and show that the proposed distance can solve hypothesis-testing and classification problems in general contexts, obtaining better results than other standard methods in statistics. Chapter 6 introduces a new distance for sets of points: we define a dissimilarity measure for points using a Mercer kernel, which is then extended to a Mercer kernel for sets of points, inducing a dissimilarity index for sets of points that is used as input to an adaptive k-means clustering algorithm. The proposed clustering algorithm aligns the sets of points, taking into account a wide range of possible warping functions, and the chapter presents an application to clustering neuronal spike trains, a relevant problem in neural coding. Finally, Chapter 7 presents the general conclusions of this thesis and future lines of research.
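
    To make the Chapter 3 idea concrete, the following sketch computes one standard kernel-induced distance from a point to the "center" of a sample: the RKHS distance to the empirical kernel mean embedding, d(x, P_n)^2 = K(x,x) - (2/n) sum_i K(x, x_i) + (1/n^2) sum_{i,j} K(x_i, x_j). The Gaussian kernel is an illustrative assumption; the thesis instead builds its Mercer kernel from the distribution of the data itself.

```python
import numpy as np

def gaussian_gram(A, B, sigma=1.0):
    """Gram matrix K[i, j] = exp(-||a_i - b_j||^2 / (2 sigma^2))."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma**2))

def kernel_distance_to_center(x, X, sigma=1.0):
    """RKHS distance from point x to the empirical kernel mean of sample X:
    d^2 = K(x,x) - 2*mean_i K(x,x_i) + mean_{ij} K(x_i,x_j)."""
    x = np.atleast_2d(x)
    d2 = (gaussian_gram(x, x, sigma)[0, 0]
          - 2 * gaussian_gram(x, X, sigma).mean()
          + gaussian_gram(X, X, sigma).mean())
    return float(np.sqrt(max(d2, 0.0)))

# Points far from the bulk of the sample get a larger distance (outlier score).
rng = np.random.default_rng(2)
X = rng.normal(size=(300, 2))
print(kernel_distance_to_center(np.zeros(2), X),
      kernel_distance_to_center(np.array([5.0, 5.0]), X))
```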

    Spitzer/IRAC precision photometry: a machine learning approach

    The largest source of noise in exoplanet and brown dwarf photometric time series made with Spitzer/IRAC is the coupling between intra-pixel gain variations and spacecraft pointing fluctuations. Observers typically correct for this systematic in science data by deriving an instrumental noise model simultaneously with the astrophysical light curve and removing the noise model. Such techniques for self-calibrating Spitzer photometric datasets have been extremely successful, and in many cases have enabled near-photon-limited precision on exoplanet transit and eclipse depths. Self-calibration, however, can suffer from certain limitations: (1) temporal astrophysical signals can become aliased as part of the instrument model; (2) for some techniques, adequate model estimation often requires a high degree of intra-pixel positional redundancy (multiple samples with nearby centroids) over long time spans; (3) many techniques do not account for sporadic high-frequency telescope vibrations that smear out the point spread function. We have begun to build independent general-purpose intra-pixel systematics removal algorithms using three machine learning techniques: K-Nearest Neighbors (with kernel regression), Random Decision Forests, and Artificial Neural Networks. These methods remove many of the limitations of self-calibration: (1) they operate on a dedicated calibration database of approximately one million measurements per IRAC waveband (3.6 and 4.5 microns) of non-variable stars, and thus are independent of the time series science data to be corrected; (2) the database covers a large area of the "Sweet Spot", so the methods do not require positional redundancy in the science data; (3) machine learning techniques in general allow for flexibility in training with multiple, sometimes unorthodox, variables, including those that trace PSF smear. In this report we focus on the K-Nearest Neighbors with kernel regression technique. (Additional communications are in preparation describing Decision Forests and Neural Networks.)
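
    A minimal sketch of the K-Nearest Neighbors with kernel regression idea described above: predict the intra-pixel gain at each science-frame centroid as a Gaussian-kernel-weighted (Nadaraya-Watson) average of the fluxes of the k nearest calibration measurements in centroid space. The function names, bandwidth, k, and toy data below are assumptions for illustration, not the IRAC pipeline's actual interface.

```python
import numpy as np
from scipy.spatial import cKDTree

def knn_kernel_gain(cal_xy, cal_flux, sci_xy, k=50, bandwidth=0.01):
    """Predict the intra-pixel gain at each science centroid as a Gaussian-kernel
    weighted (Nadaraya-Watson) average of the k nearest calibration fluxes."""
    dists, idx = cKDTree(cal_xy).query(sci_xy, k=k)
    w = np.exp(-(dists / bandwidth) ** 2)            # kernel weights in centroid space
    return (w * cal_flux[idx]).sum(axis=1) / w.sum(axis=1)

# Toy stand-in for the calibration database of non-variable stars.
rng = np.random.default_rng(3)
cal_xy = 0.02 * rng.normal(size=(10_000, 2))         # centroids (pixel offsets)
cal_flux = 1.0 + 0.05 * np.sin(50 * cal_xy[:, 0])    # normalized flux = gain proxy
sci_xy = 0.02 * rng.normal(size=(500, 2))
gain = knn_kernel_gain(cal_xy, cal_flux, sci_xy)
# A science light curve would then be divided by `gain` to remove the systematic.
```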