
    Statistical applications of the multivariate skew-normal distribution

    Azzalini & Dalla Valle (1996) have recently discussed the multivariate skew-normal distribution, which extends the class of normal distributions by the addition of a shape parameter. The first part of the present paper examines further probabilistic properties of the distribution, with special emphasis on aspects of statistical relevance. Inferential and other statistical issues are discussed in the second part, with applications to some multivariate statistics problems, illustrated by numerical examples. Finally, a further extension is described which introduces a skewing factor of an elliptical density. (Full-length version of the published paper, 32 pages, with 7 figures.)

    Eigenvectors of a kurtosis matrix as interesting directions to reveal cluster structure

    In this paper we study the properties of a kurtosis matrix and propose its eigenvectors as interesting directions to reveal the possible cluster structure of a data set. Under a mixture of elliptical distributions with proportional scatter matrices, it is shown that a subset of the eigenvectors of the fourth-order moment matrix corresponds to Fisher's linear discriminant subspace. The eigenvectors of the estimated kurtosis matrix are consistent estimators of this subspace, and their calculation is easy to implement and computationally efficient, which is particularly favourable when the ratio n/p is large.
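A minimal sketch of this idea, under assumptions of ours: we use a common fourth-moment kurtosis matrix of the whitened data (the paper's exact definition may differ) and a simulated two-cluster mixture. Directions along which clusters lie are bimodal, hence sub-Gaussian, so they show up among the smallest eigenvalues:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative mixture: two spherical Gaussian clusters separated along
# the first coordinate axis (synthetic data, not from the paper).
n, p = 2000, 5
labels = rng.integers(0, 2, n)
X = rng.standard_normal((n, p))
X[:, 0] += np.where(labels == 1, 3.0, -3.0)

# Whiten the sample: Z = (X - mean) S^{-1/2}.
Xc = X - X.mean(axis=0)
S = Xc.T @ Xc / n
w, V = np.linalg.eigh(S)
W = V @ np.diag(w ** -0.5) @ V.T
Z = Xc @ W

# A common fourth-moment kurtosis matrix: K = (1/n) sum_i ||z_i||^2 z_i z_i'.
r2 = (Z ** 2).sum(axis=1)
K = (Z * r2[:, None]).T @ Z / n

# The bimodal (cluster) direction has low projected kurtosis, so it
# corresponds to the smallest eigenvalue of K.
vals, vecs = np.linalg.eigh(K)
direction = W @ vecs[:, 0]            # map back to the original coordinates
direction /= np.linalg.norm(direction)
```

On this data the recovered direction aligns with the separation axis, which is exactly Fisher's discriminant direction for two spherical components.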

    Contributions to the problem of cluster analysis

    Given a random sample generated by a mixture of distributions, the goal of cluster analysis is to partition the sample into groups that are homogeneous with respect to the populations that generated them. Algorithms such as k-means and mclust solve the clustering problem in the original space. An alternative approach is to first reduce the dimension of the data by projecting the sample onto a lower-dimensional space, and to identify the clusters in this subspace. In this way the curse of dimensionality can be avoided, but one must ensure that the projected data preserve the cluster structure of the original sample. In this context, projection pursuit methods aim to find directions, or low-dimensional subspaces, that show the most interesting views of the data (Friedman and Tukey, 1974; Friedman, 1987). Reducing the dimension of the sample is effective because not all the information in the data is tied to its cluster structure. The reduction is intended to discard the irrelevant information and retain a lower-dimensional space in which the clustering problem is easier to solve. This requires a procedure that keeps the key information about the clusters. In this context, Peña and Prieto (2001) show that the directions that minimize and maximize the kurtosis have optimal properties for visualizing the clusters, and they propose a clustering algorithm that projects the data onto both types of directions and assigns the observations to clusters according to the gaps found along them. Chapter 1 of the thesis reviews the concept of kurtosis in detail. The univariate kurtosis coefficient and the different interpretations it has received in the literature are analyzed. We also study the ways in which kurtosis can be defined for a multivariate sample and explore its properties for detecting clusters.
In Chapter 2 we study the properties of a kurtosis matrix and propose a subset of its eigenvectors as interesting directions to reveal the possible cluster structure of the data. This idea extends the algorithm proposed in Peña and Prieto (2001) to the multivariate case. The advantage of using the eigenvectors of a matrix to specify the subspace of interest is that no optimization algorithm is needed to find it, as is the case in Peña and Prieto (2001). Moreover, under a mixture of elliptical distributions with proportional covariance matrices, we show that a subset of the eigenvectors of the matrix coincides with Fisher's linear discriminant subspace. The eigenvectors of the estimated kurtosis matrix are consistent estimators of this subspace, and their computation is easy to implement and computationally efficient. The matrix therefore provides a way to reduce the dimension of the data with a view to solving the clustering problem in a lower-dimensional subspace. Following the discussion in Chapter 2, in Chapter 3 we study alternative kurtosis matrices based on local modifications of the data, with the aim of improving the results obtained with the eigenvectors of the kurtosis matrix studied in Chapter 2. By replacing the observations of the sample with the mean of their neighbours, the covariance matrices of the components of the mixture shrink, giving a dominant role to the between-group variability in the decomposition of the kurtosis matrix. In particular, we show that the separation properties of the eigenvectors of the new kurtosis matrix are better, in the sense that the proposed modification of the observations produces standardized means that are further apart than those of the original observations.
Chapter 4 proposes some ideas for the identification of non-linear clusters in a low-dimensional space, by projecting onto random directions only the observations contained in a local neighbourhood defined by the direction. These directions can be understood as trimmed directions, and they make it possible to detect specific shapes that traditional clustering algorithms with good performance in low dimension do not detect easily. The suggested algorithm is intended to be used once the dimension of the data space has been reduced. Finally, in Chapter 5 we propose a nonparametric clustering algorithm based on local medians. Each observation is replaced by its local median, thereby moving towards the peaks and away from the valleys of the distribution. This process is repeated iteratively until each observation converges to a fixed point. The result is a partition of the sample based on where the sequences of local medians converge. The algorithm determines the number of clusters and the partition of the observations given the proportion of neighbours. A fast version of the algorithm, in which only a subset of the observations is processed, is also provided. In the univariate case, we prove the convergence of each observation to the nearest fixed point, as well as the existence and uniqueness of a fixed point in a neighbourhood of each mode of the distribution.
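The local-medians idea from Chapter 5 can be sketched in the univariate case. This is our reading of the iteration, not the thesis's exact algorithm: the neighbour rule (nearest points of the original sample) and the gap threshold used to merge fixed points are assumptions made here for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Two univariate clusters (illustrative synthetic data).
x = np.concatenate([rng.normal(0.0, 1.0, 50), rng.normal(10.0, 1.0, 50)])

def local_median_clustering(x, k, gap=2.0, max_iter=200, tol=1e-9):
    """Replace each point by the median of its k nearest sample points and
    iterate until the sequences stabilise; points whose limits fall close
    together (within `gap`) form one cluster."""
    x = np.asarray(x, dtype=float)
    y = x.copy()
    for _ in range(max_iter):
        # neighbours are always taken among the ORIGINAL sample points
        idx = np.argsort(np.abs(y[:, None] - x[None, :]), axis=1)[:, :k]
        new = np.median(x[idx], axis=1)
        if np.max(np.abs(new - y)) < tol:
            y = new
            break
        y = new
    # chain the sorted limits: a jump larger than `gap` starts a new cluster
    order = np.argsort(y)
    labels = np.empty(len(x), dtype=int)
    lab = 0
    labels[order[0]] = 0
    for prev, cur in zip(order[:-1], order[1:]):
        if y[cur] - y[prev] > gap:
            lab += 1
        labels[cur] = lab
    return labels, y

labels, fixed = local_median_clustering(x, k=20)
```

Each sequence moves towards the peak of its own cluster and cannot cross the empty region between clusters, so the number of clusters falls out of the converged values rather than being fixed in advance.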

    Tandem clustering with invariant coordinate selection

    For high-dimensional data or data with noise variables, tandem clustering is a well-known technique that aims to improve cluster identification by first reducing the dimension. However, the usual approach using principal component analysis (PCA) has been criticized for focusing only on inertia, so that the first components do not necessarily retain the structure of interest for clustering. To overcome this drawback, we propose a new tandem clustering approach based on invariant coordinate selection (ICS). By jointly diagonalizing two scatter matrices, ICS is designed to find structure in the data while returning affine invariant components. Some theoretical results have already been derived and guarantee that, under some elliptical mixture models, the structure of the data can be highlighted on a subset of the first and/or last components. Nevertheless, ICS has received little attention in a clustering context. Two challenges are the choice of the pair of scatter matrices and the selection of the components to retain. For clustering purposes, we demonstrate that the best scatter pairs consist of one scatter matrix that captures the within-cluster structure and another that captures the global structure. For the former, local shape or pairwise scatters are of great interest, as is the minimum covariance determinant (MCD) estimator based on a carefully selected subset size that is smaller than usual. We evaluate the performance of ICS as a dimension reduction method in terms of preserving the cluster structure present in data. In an extensive simulation study and in empirical applications with benchmark data sets, we compare different combinations of scatter matrices, component selection criteria, and the impact of outliers. Overall, the new approach of tandem clustering with ICS shows promising results and clearly outperforms the approach with PCA.
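A minimal sketch of the tandem idea with the simplest scatter pair, the covariance matrix and the fourth-moment scatter Cov4 (the classical FOBI special case of ICS; the paper studies many richer pairs). The data and the trivial threshold "clusterer" below are our illustrations:

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative mixture: two clusters hidden in 4 dimensions.
n, p = 2000, 4
labels = rng.integers(0, 2, n)
X = rng.standard_normal((n, p))
X[:, 0] += np.where(labels == 1, 3.0, -3.0)

# ICS with the pair (Cov, Cov4): whiten with the covariance, then
# diagonalize the fourth-moment scatter of the whitened data.
Xc = X - X.mean(axis=0)
S = Xc.T @ Xc / n
w, V = np.linalg.eigh(S)
Z = Xc @ V @ np.diag(w ** -0.5) @ V.T          # whitened data
r2 = (Z ** 2).sum(axis=1)
S4 = (Z * r2[:, None]).T @ Z / (n * (p + 2))   # Cov4; equals I for a Gaussian
vals, vecs = np.linalg.eigh(S4)
components = Z @ vecs                          # invariant coordinates

# Tandem step: retain the extreme component (smallest eigenvalue, where
# sub-Gaussian bimodal structure appears) and cluster it by thresholding.
ic1 = components[:, 0]
pred = (ic1 > 0).astype(int)
accuracy = max((pred == labels).mean(), (pred != labels).mean())
```

Because ICS components are affine invariant, the same component would be selected after any invertible linear transformation of the data, which is precisely what PCA cannot guarantee.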

    Linear discrimination for three-level multivariate data with a separable additive mean vector and a doubly exchangeable covariance structure

    In this article, we study a new linear discriminant function for three-level m-variate observations under the assumption of multivariate normality. We assume that the m-variate observations have a doubly exchangeable covariance structure consisting of three unstructured covariance matrices for the three multivariate levels, and a separable additive structure on the mean vector. The new discriminant function is very efficient in discriminating individuals in a small-sample scenario. An iterative algorithm is proposed to calculate the maximum likelihood estimates of the unknown population parameters, as closed-form solutions do not exist for these parameters. The new discriminant function is applied to a real data set as well as to simulated data sets. We compare our findings with other linear discriminant functions for three-level multivariate data as well as with the traditional linear discriminant function. (Leiva, Ricardo Anibal: Universidad Nacional de Cuyo and CONICET, Centro Científico Tecnológico Mendoza, Argentina; Roy, Anuradha: University of Texas, United States.)
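A doubly exchangeable covariance matrix can be written compactly with Kronecker products. The construction below is our reading of the structure (three unstructured m x m matrices: U0 within a sub-block, U1 between sub-blocks of the same block, W between blocks); the paper's notation may differ:

```python
import numpy as np

# Three-level data: u blocks, v sub-blocks per block, m variables each.
u, v, m = 2, 3, 2
rng = np.random.default_rng(3)
A = rng.standard_normal((m, m))
U0 = A @ A.T + 5 * np.eye(m)   # within-sub-block covariance (dominant)
U1 = 0.5 * np.eye(m)           # between sub-blocks, same block
W = 0.1 * np.eye(m)            # between blocks

I_u, J_u = np.eye(u), np.ones((u, u))
I_v, J_v = np.eye(v), np.ones((v, v))
# Same (block, sub-block): (U0-U1)+(U1-W)+W = U0; same block only: U1;
# different blocks: W -- the three-level exchangeable pattern.
Gamma = (np.kron(I_u, np.kron(I_v, U0 - U1))
         + np.kron(I_u, np.kron(J_v, U1 - W))
         + np.kron(J_u, np.kron(J_v, W)))

# Exchangeability check: permuting sub-blocks within blocks leaves
# Gamma unchanged.
P_v = np.eye(v)[[1, 0, 2]]                    # swap sub-blocks 1 and 2
P = np.kron(I_u, np.kron(P_v, np.eye(m)))
```

Only 3 m(m+1)/2 parameters describe the full (uvm) x (uvm) covariance, which is why the approach remains feasible in small-sample scenarios.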

    Human identification: an investigation of 3D models of paranasal sinuses to establish a biological profile on a modern UK population

    Forensic anthropology traditionally aims to assist law enforcement with human identification by physically examining skeletal remains and assigning a biological profile using various metric and visual methods. These methods are crucial when a body undergoes extreme damage and standard approaches to positive identification are not possible. However, the traditional methods employed by forensic anthropologists were primarily developed from North American reference populations and have demonstrated varying accuracy rates when assigning age, sex, and ancestry to individuals outside of the reference collection. Medical imaging is a valuable source for facilitating empirical research and an accessible gateway for developing novel forensic anthropological methods of analysis, including 3D modelling. This is especially critical for the United Kingdom (UK), where biological profiling methods developed from modern UK populations do not currently exist. Researchers have quantified the variability of the paranasal sinuses between individuals and have begun to explore their ability to provide biological information. However, the published literature that addresses these structures in a forensic context presents extremely varied insights, and to date there has been no standardisation. This thesis presents research that addresses this gap and introduces a new approach to human identification using 3D models of the paranasal sinuses. The models were produced from a database of modern CT scans provided by University College London Hospital (UCLH), London, UK. Linear measurements and elliptic Fourier coefficients taken from 1,500 three-dimensional models across six ethnic groups, assessed by one-way ANOVA and discriminant function analysis, showed a range of classification rates, with some reaching 75-85.7% (p<0.05) in correctly classifying age and sex according to size and shape.
The findings offer insights into the potential for employing CT scans to develop identification methods within the UK and establish a foundation for using the paranasal sinuses as an attribute for identifying unknown human remains in future crime reconstructions.
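The discriminant-function step can be sketched with a two-class Fisher discriminant. The measurement names, group means, and spreads below are hypothetical, invented for illustration; they are not values from the thesis:

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical task: classify sex from three linear sinus measurements
# (say, maximum length, width, and height, in mm).
n = 300
male = rng.normal([40.0, 30.0, 25.0], [4.0, 3.5, 3.0], (n, 3))
female = rng.normal([36.0, 27.0, 22.5], [4.0, 3.5, 3.0], (n, 3))
X = np.vstack([male, female])
y = np.array([0] * n + [1] * n)

# Two-class Fisher discriminant: w = Sw^{-1} (m0 - m1), with the cutoff
# halfway between the projected group means.
m0, m1 = male.mean(axis=0), female.mean(axis=0)
Sw = np.cov(male.T) + np.cov(female.T)
wvec = np.linalg.solve(Sw, m0 - m1)
c = wvec @ (m0 + m1) / 2
pred = (X @ wvec < c).astype(int)   # below the cutoff -> class 1
rate = (pred == y).mean()
```

With overlapping distributions like these, correct classification rates naturally land in the 75-85% range reported for size-and-shape discriminators rather than at 100%.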

    Clustering in high dimension for multivariate and functional data using extreme kurtosis projections

    Cluster analysis consists of analyzing the existence of clusters in a multivariate sample. This analysis is performed by algorithms that differ significantly in their notion of what constitutes a cluster and how to find clusters efficiently. In this thesis we are interested in large data problems, and we therefore consider algorithms that use dimension reduction techniques for the identification of interesting structures in large data sets, particularly algorithms that use the kurtosis coefficient to detect the clusters present in the data. The thesis extends the work of Peña and Prieto (2001a), who identify clusters in multivariate data using the univariate projections of the sample on the directions that minimize and maximize the kurtosis coefficient of the projected data, and of Peña et al. (2010), who used the eigenvectors of a kurtosis matrix to reduce the dimension. This thesis has two main contributions. First, we prove that the extreme kurtosis projections have some optimality properties for mixtures of normal distributions, and we propose an algorithm to identify clusters when the data dimension and the number of clusters present in the sample are high. The good performance of the algorithm is shown through a simulation study in which it is compared with the MCLUST, K-means, and CLARA methods. Second, we propose an extension of multivariate kurtosis to functional data, and we analyze some of its properties for clustering. Additionally, we propose an algorithm based on kurtosis projections for functional data. Its good properties are compared with the results obtained by Functional Principal Components, Functional K-means, and the FunClust method. The thesis is structured as follows. Chapter 1 is an introductory chapter in which we review some theoretical concepts that will be used throughout the thesis. In Chapter 2 we review the concept of kurtosis in detail, study its properties, and give a detailed description of some algorithms proposed in the literature that use the kurtosis coefficient to detect the clusters present in the data. In Chapter 3 we study the directions that may be interesting for the detection of several clusters in the sample, and we analyze how the extreme kurtosis directions are related to these directions. In addition, we present a clustering algorithm for high-dimensional data using extreme kurtosis directions. In Chapter 4 we introduce an extension of multivariate kurtosis to functional data and analyze the properties of this measure regarding the identification of clusters. In addition, we present a clustering algorithm for functional data using extreme kurtosis directions. We finish with some remarks and conclusions in the final chapter. (Programa Oficial de Doctorado en Ingeniería Matemática. Committee: Ana María Justel Eusebio, chair; Andrés Modesto Alonso Fernández, secretary; José Manuel Mira Mcwilliam, member.)
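The core quantity, the kurtosis of a projection, is easy to state. Peña and Prieto (2001a) optimize it over the unit sphere with a specialized algorithm; as a hedged sketch on synthetic 2-D data, a grid over angles is enough to show why the minimizing direction reveals two clusters:

```python
import numpy as np

rng = np.random.default_rng(5)

# Two 2-D clusters separated along the x-axis (illustrative data).
n = 2000
labels = rng.integers(0, 2, n)
X = rng.standard_normal((n, 2))
X[:, 0] += np.where(labels == 1, 3.0, -3.0)

def projected_kurtosis(X, u):
    """Kurtosis coefficient of the univariate projection X @ u."""
    t = X @ u
    t = t - t.mean()
    return np.mean(t ** 4) / np.mean(t ** 2) ** 2

# Grid search over directions (feasible only in 2-D; higher dimensions
# need the optimization algorithm the thesis builds on).
angles = np.linspace(0.0, np.pi, 360, endpoint=False)
kurts = np.array([projected_kurtosis(X, np.array([np.cos(a), np.sin(a)]))
                  for a in angles])
best = angles[np.argmin(kurts)]     # a bimodal projection has LOW kurtosis
direction = np.array([np.cos(best), np.sin(best)])
```

A projection onto the separation axis is bimodal, with kurtosis well below the Gaussian value of 3, while orthogonal projections look Gaussian; minimizing the projected kurtosis therefore picks out the cluster direction.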

    A Comparison of Depth Functions in Maximal Depth Classification Rules

    Data depth has been described as an alternative to some parametric approaches for analyzing multivariate data. Many depth functions have emerged over the past two decades and have been studied in the literature. In this study, a nonparametric approach to classification based on the notions of different data depth functions is considered, and some properties of these methods are studied. The performance of different depth functions in maximal depth classifiers is investigated using simulation and real data, with an application to the agricultural industry.
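A maximal depth rule assigns a point to the class in which it is deepest. As a sketch with one of the simplest depth functions, Mahalanobis depth (the paper compares several; the data here is synthetic, not the agricultural data set):

```python
import numpy as np

rng = np.random.default_rng(6)

# Mahalanobis depth: D(x; F) = 1 / (1 + (x - mu)' S^{-1} (x - mu)).
def mahalanobis_depth(x, mu, Sinv):
    d = x - mu
    return 1.0 / (1.0 + d @ Sinv @ d)

# Illustrative two-class training data.
n = 500
c0 = rng.multivariate_normal([0, 0], [[1, 0.3], [0.3, 1]], n)
c1 = rng.multivariate_normal([3, 3], [[1, -0.2], [-0.2, 1]], n)

# Estimate the location and scatter of each class from its sample.
params = [(c.mean(axis=0), np.linalg.inv(np.cov(c.T))) for c in (c0, c1)]

def classify(x):
    """Maximal depth rule: pick the class where x is deepest."""
    return int(np.argmax([mahalanobis_depth(x, mu, Sinv)
                          for mu, Sinv in params]))

test0 = rng.multivariate_normal([0, 0], [[1, 0.3], [0.3, 1]], 200)
test1 = rng.multivariate_normal([3, 3], [[1, -0.2], [-0.2, 1]], 200)
correct = (sum(classify(x) == 0 for x in test0)
           + sum(classify(x) == 1 for x in test1))
accuracy = correct / 400
```

Swapping in a different depth function (half-space, simplicial, projection depth) changes only `mahalanobis_depth`, which is what makes the comparison across depth functions natural.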

    The topography of multivariate normal mixtures

    Multivariate normal mixtures provide a flexible method of fitting high-dimensional data. It is shown that their topography, in the sense of their key features as a density, can be analyzed rigorously in lower dimensions by use of a ridgeline manifold that contains all critical points, as well as the ridges of the density. A plot of the elevations on the ridgeline shows the key features of the mixed density. In addition, by use of the ridgeline, we uncover a function that determines the number of modes of the mixed density when there are two components being mixed. A follow-up analysis then gives a curvature function that can be used to prove a set of modality theorems. (Published in the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics at http://dx.doi.org/10.1214/009053605000000417.)
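For two components the ridgeline is the explicit curve x*(a) = [(1-a)S1^{-1} + aS2^{-1}]^{-1} [(1-a)S1^{-1}m1 + aS2^{-1}m2] for a in [0, 1], and since it contains all critical points, counting local maxima of the mixture density along it counts the modes. A short numerical sketch (the component parameters are our illustration):

```python
import numpy as np

# Two bivariate normal components, equal weights.
m1, m2 = np.array([0.0, 0.0]), np.array([4.0, 0.0])
S1 = np.eye(2)
S2 = np.array([[1.0, 0.4], [0.4, 1.0]])
w1, w2 = 0.5, 0.5

def npdf(x, m, S):
    """Bivariate normal density at x."""
    d = x - m
    return (np.exp(-0.5 * d @ np.linalg.inv(S) @ d)
            / (2 * np.pi * np.sqrt(np.linalg.det(S))))

# Trace the ridgeline x*(a) on a grid of a in [0, 1].
S1i, S2i = np.linalg.inv(S1), np.linalg.inv(S2)
alphas = np.linspace(0.0, 1.0, 501)
ridge = np.array([np.linalg.solve((1 - a) * S1i + a * S2i,
                                  (1 - a) * S1i @ m1 + a * S2i @ m2)
                  for a in alphas])

# Elevation plot: mixture density along the ridgeline; its local maxima
# (including the endpoints) are exactly the modes of the 2-D mixture.
h = np.array([w1 * npdf(x, m1, S1) + w2 * npdf(x, m2, S2) for x in ridge])
interior = (h[1:-1] > h[:-2]) & (h[1:-1] > h[2:])
n_modes = int(interior.sum()) + int(h[0] > h[1]) + int(h[-1] > h[-2])
```

The 2-D mode-counting problem is thus reduced to inspecting a 1-D elevation curve, which is the practical payoff of the ridgeline result.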

    Evaluating Surface Area-Basin Volume Relationships for Prairie Potholes

    Establishing a relationship between the surface area and volume of prairie potholes provides a simple method to estimate changes in water storage across the landscape. Applications include better prediction of floods and improved design for wetland restoration. Length, width, depth, surface area, and volume were surveyed for eighty-two potholes within the upper Turtle River watershed, which lies sixty kilometers west of Grand Forks, ND. These data were used to determine the relationship, and its uncertainty, between pothole surface area and volume. Chi-squared tests defined the distributions of each variable. F- and t-tests resolved similarities in variance and mean. The eighty-two potholes were separated according to their National Wetlands Inventory (NWI) classification and tested using chi-squared tests. T- and F-tests on the separate classes verified whether the populations have different means and variances. The difference in depth, in particular, suggests that the two most common NWI classes in the watershed, PEMC and PEMA, are separate and distinct, based on the results from discriminant analysis. Despite this conclusion, and the fact that PEMC wetlands are physically larger than PEMA wetlands, there is a stronger correlation between surface area and volume when the two classes remain combined. Regression of surface area and volume leads to an equation that can be applied to similar watersheds throughout the prairie pothole region.
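The regression step can be sketched with a power-law model V = a * A^b, which becomes linear after taking logarithms. The coefficients, noise level, and data below are invented for illustration; they are not the surveyed Turtle River values:

```python
import numpy as np

rng = np.random.default_rng(7)

# Synthetic surface areas (m^2) and volumes for 82 hypothetical potholes.
n = 82
A = 10 ** rng.uniform(2.0, 4.5, n)
true_a, true_b = 0.012, 1.40                       # made-up coefficients
V = true_a * A ** true_b * np.exp(rng.normal(0.0, 0.15, n))

# Ordinary least squares on log V = log a + b log A.
b_hat, loga_hat = np.polyfit(np.log(A), np.log(V), 1)
a_hat = np.exp(loga_hat)

def predict_volume(area):
    """Storage estimate from surface area via the fitted power law."""
    return a_hat * area ** b_hat
```

Once fitted, a single remotely sensed surface area is enough to estimate storage volume, which is what makes an area-volume equation useful across a whole pothole region.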