Statistical applications of the multivariate skew-normal distribution
Azzalini & Dalla Valle (1996) have recently discussed the multivariate
skew-normal distribution which extends the class of normal distributions by the
addition of a shape parameter. The first part of the present paper examines
further probabilistic properties of the distribution, with special emphasis on
aspects of statistical relevance. Inferential and other statistical issues are
discussed in the following part, with applications to some multivariate
statistics problems, illustrated by numerical examples. Finally, a further
extension is described which introduces a skewing factor of an elliptical
density.

Comment: full-length version of the published paper, 32 pages, with 7 figures; uses psfrag.
Eigenvectors of a kurtosis matrix as interesting directions to reveal cluster structure
In this paper we study the properties of a kurtosis matrix and propose its eigenvectors
as interesting directions to reveal the possible cluster structure of a data set. Under a
mixture of elliptical distributions with proportional scatter matrix, it is shown that a
subset of the eigenvectors of the fourth-order moment matrix corresponds to Fisher's linear
discriminant subspace. The eigenvectors of the estimated kurtosis matrix are consistent
estimators of this subspace, and their calculation is easy to implement and computationally
efficient, which is particularly favourable when the ratio n/p is large.
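The kurtosis matrix described above can be sketched in a few lines of NumPy. This is a minimal illustration of one common definition (the fourth-moment matrix of the whitened data); the exact scaling and estimator used in the paper may differ.

```python
import numpy as np

def kurtosis_matrix(X):
    """Fourth-moment (kurtosis) matrix of the whitened sample:
    K = (1/n) * sum_i ||z_i||^2 z_i z_i', with z_i the whitened rows.
    One common definition; the paper's scaling may differ."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    vals, vecs = np.linalg.eigh(np.cov(X, rowvar=False))
    S_inv_sqrt = vecs @ np.diag(vals ** -0.5) @ vecs.T   # symmetric S^{-1/2}
    Z = Xc @ S_inv_sqrt                                  # whitened data
    r2 = np.sum(Z ** 2, axis=1)                          # squared norms
    return (Z * r2[:, None]).T @ Z / n

# Two well-separated Gaussian clusters in 5 dimensions.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (200, 5)),
               rng.normal(4, 1, (200, 5))])
K = kurtosis_matrix(X)
# Eigenvectors with extreme eigenvalues are candidate projection
# directions for revealing the cluster structure.
eigvals, eigvecs = np.linalg.eigh(K)
```

No iterative optimization is involved: a single eigendecomposition yields the candidate subspace, which is the computational advantage the abstract emphasizes.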
Contributions to the problem of cluster analysis
Given a random sample generated by a mixture of distributions, the goal of cluster analysis is to partition the sample into groups that are homogeneous with respect to the populations that generated them. Algorithms such as k-means and mclust solve the clustering problem in the original space. An alternative approach is to first reduce the dimension of the data by projecting the sample onto a lower-dimensional space, and to identify the clusters in this subspace. In this way the curse of dimensionality can be avoided, but one must ensure that the projected data preserve the cluster structure of the original sample. In this context, projection pursuit methods aim to find directions, or low-dimensional subspaces, that show the most interesting views of the data (Friedman and Tukey, 1974; Friedman, 1987). Reducing the dimension of the sample is effective because not all the information in the data is tied to its cluster structure. The reduction is intended to discard the irrelevant information and retain a lower-dimensional space in which the clustering problem is easier to solve. This requires a procedure that preserves the key information about the clusters. In this setting, Peña and Prieto (2001) show that the directions that minimize and maximize the kurtosis have optimal properties for visualizing the clusters, and they propose a clustering algorithm that projects the data onto both types of directions and assigns the observations to clusters according to the gaps found along them. Chapter 1 of the thesis reviews the concept of kurtosis in detail. The univariate kurtosis coefficient and the various interpretations it has received in the literature are analyzed. We also study the ways in which kurtosis can be defined for a multivariate sample and explore its properties for detecting clusters.
In Chapter 2 we study the properties of a kurtosis matrix and propose a subset of its eigenvectors as interesting directions for revealing the possible cluster structure of the data. This idea extends the algorithm proposed in Peña and Prieto (2001) to the multivariate case. The advantage of using the eigenvectors of a matrix to specify the subspace of interest is that no optimization algorithm is needed to find it, unlike in Peña and Prieto (2001). Moreover, under a mixture of elliptical distributions with proportional covariance matrices, we show that a subset of the eigenvectors of the matrix coincides with Fisher's linear discriminant subspace. The eigenvectors of the estimated kurtosis matrix are consistent estimators of this subspace, and their computation is easy to implement and computationally efficient. The matrix therefore provides a way to reduce the dimension of the data with a view to solving the clustering problem in a lower-dimensional subspace.
Following the discussion in Chapter 2, in Chapter 3 we study alternative kurtosis matrices based on local modifications of the data, with the aim of improving the results obtained with the eigenvectors of the kurtosis matrix studied in Chapter 2. By replacing each observation in the sample with the mean of its neighbours, the covariance matrices of the components of the mixture shrink, giving the between-group variability a dominant role in the decomposition of the kurtosis matrix. In particular, we show that the separation properties of the eigenvectors of the new kurtosis matrix are better, in the sense that the proposed modification of the observations yields standardized means that are farther apart than those of the original observations.
Chapter 4 proposes some ideas for identifying nonlinear clusters in a low-dimensional space by projecting onto random directions only the observations contained in a local neighbourhood defined by each direction. These directions can be understood as trimmed directions, and they make it possible to detect specific shapes that traditional clustering algorithms that perform well in low dimension do not detect easily. The suggested algorithm is intended for use after the dimension of the data space has been reduced.
Finally, in Chapter 5 we propose a nonparametric clustering algorithm based on local medians. Each observation is replaced by its local median, thereby moving towards the peaks and away from the valleys of the distribution. This process is repeated iteratively until every observation converges to a fixed point. The result is a partition of the sample based on where the sequences of local medians converge. The algorithm determines the number of clusters and the partition of the observations given the proportion of neighbours. A fast version of the algorithm, in which only a subset of the observations is processed, is also provided. In the univariate case, we prove the convergence of each observation to the nearest fixed point, as well as the existence and uniqueness of a fixed point in a neighbourhood of each mode of the distribution.
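The local-median idea in the last chapter can be sketched as follows. This is a simplified reading of the algorithm, assuming a fixed number of neighbours k and a merge radius for grouping fixed points; the thesis parameterizes the neighbourhood by a proportion and proves convergence properties this sketch does not attempt to reproduce.

```python
import numpy as np

def local_median_clustering(X, k=15, merge_tol=1.0, max_iter=100, tol=1e-8):
    """Sketch: move each point to the coordinatewise median of its k
    nearest neighbours in the ORIGINAL sample, iterating to a fixed
    point; points whose fixed points (nearly) coincide form a cluster."""
    X = np.asarray(X, dtype=float)
    pts = X.copy()
    for _ in range(max_iter):
        new_pts = np.empty_like(pts)
        for i, x in enumerate(pts):
            nn = np.argsort(np.linalg.norm(X - x, axis=1))[:k]
            new_pts[i] = np.median(X[nn], axis=0)
        done = np.max(np.linalg.norm(new_pts - pts, axis=1)) < tol
        pts = new_pts
        if done:
            break
    # Greedy grouping of fixed points within merge_tol of each other.
    labels = np.full(len(pts), -1, dtype=int)
    centers = []
    for i, p in enumerate(pts):
        for j, c in enumerate(centers):
            if np.linalg.norm(p - c) < merge_tol:
                labels[i] = j
                break
        else:
            centers.append(p)
            labels[i] = len(centers) - 1
    return labels, np.array(centers)

rng = np.random.default_rng(1)
# Two tight, well-separated clusters in the plane.
X = np.vstack([rng.normal(-5, 0.3, (50, 2)),
               rng.normal(5, 0.3, (50, 2))])
labels, centers = local_median_clustering(X)
```

Because each point moves towards the median of its neighbours, the iteration drifts towards the peaks of the distribution and away from the valleys, as the abstract describes; the number of clusters falls out of where the sequences converge rather than being fixed in advance.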
Tandem clustering with invariant coordinate selection
For high-dimensional data or data with noise variables, tandem clustering is
a well-known technique that aims to improve cluster identification by first
reducing the dimension. However, the usual approach using principal component
analysis (PCA) has been criticized for focusing only on inertia so that the
first components do not necessarily retain the structure of interest for
clustering. To overcome this drawback, we propose a new tandem clustering
approach based on invariant coordinate selection (ICS). By jointly
diagonalizing two scatter matrices, ICS is designed to find structure in the
data while returning affine invariant components. Some theoretical results have
already been derived and guarantee that under some elliptical mixture models,
the structure of the data can be highlighted on a subset of the first and/or
last components. Nevertheless, ICS has received little attention in a
clustering context. Two challenges are the choice of the pair of scatter
matrices and the selection of the components to retain. For clustering
purposes, we demonstrate that the best scatter pairs consist of one scatter
matrix that captures the within-cluster structure and another that captures the
global structure. For the former, local shape or pairwise scatters are of great
interest, as is the minimum covariance determinant (MCD) estimator based on a
carefully selected subset size that is smaller than usual. We evaluate the
performance of ICS as a dimension reduction method in terms of preserving the
cluster structure present in data. In an extensive simulation study and in
empirical applications with benchmark data sets, we compare different
combinations of scatter matrices, component selection criteria, and the impact
of outliers. Overall, the new approach of tandem clustering with ICS shows
promising results and clearly outperforms the approach with PCA.
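The joint diagonalization at the heart of ICS can be sketched with the classical cov–cov4 scatter pair. The paper argues for local-shape, pairwise, or MCD scatters instead; cov4 is used here only because it is simple and well known, so this is an illustration of the mechanism rather than the recommended configuration.

```python
import numpy as np

def cov4(X):
    """Fourth-moment scatter: covariance reweighted by squared
    Mahalanobis distance, scaled to equal cov at the normal model."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    S = np.cov(X, rowvar=False)
    d2 = np.sum((Xc @ np.linalg.inv(S)) * Xc, axis=1)
    return (Xc * d2[:, None]).T @ Xc / (n * (p + 2))

def ics(X, scatter2=cov4):
    """Jointly diagonalize S1 = cov and S2 = scatter2 by solving the
    generalized eigenproblem S2 b = rho S1 b via whitening with S1."""
    S1 = np.cov(X, rowvar=False)
    vals, vecs = np.linalg.eigh(S1)
    W = vecs @ np.diag(vals ** -0.5) @ vecs.T    # symmetric S1^{-1/2}
    rho, U = np.linalg.eigh(W @ scatter2(X) @ W)
    B = W @ U                                    # unmixing directions
    Z = (X - X.mean(axis=0)) @ B                 # invariant coordinates
    return rho[::-1], Z[:, ::-1]                 # sort rho descending

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, (300, 4)),
               rng.normal([6, 0, 0, 0], 1, (300, 4))])
rho, Z = ics(X)
# For a balanced two-group mixture the discriminating component has
# low generalized kurtosis, so it appears among the LAST components.
sep = np.abs(Z[:300].mean(axis=0) - Z[300:].mean(axis=0))
```

This makes concrete why component selection matters: unlike PCA, the interesting coordinate need not be the first one, and which end of the spectrum carries the structure depends on the mixture proportions.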
Linear discrimination for three-level multivariate data with a separable additive mean vector and a doubly exchangeable covariance structure
In this article, we study a new linear discriminant function for three-level m-variate observations under the assumption of multivariate normality. We assume that the m-variate observations have a doubly exchangeable covariance structure consisting of three unstructured covariance matrices for three multivariate levels and a separable additive structure on the mean vector. The new discriminant function is very efficient in discriminating individuals in a small sample scenario. An iterative algorithm is proposed to calculate the maximum likelihood estimates of the unknown population parameters, as closed-form solutions do not exist for these unknown parameters. The new discriminant function is applied to a real data set as well as to simulated data sets. We compare our findings with other linear discriminant functions for three-level multivariate data as well as with the traditional linear discriminant function.
Fil: Leiva, Ricardo Anibal. Universidad Nacional de Cuyo; Argentina. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Mendoza; Argentina.
Fil: Roy, Anuradha. University of Texas; Estados Unidos.
Human identification: an investigation of 3D models of paranasal sinuses to establish a biological profile on a modern UK population
Forensic anthropology traditionally aims to assist law enforcement with human identification by physically examining skeletal remains and assigning a biological profile using various metric and visual methods. These methods are crucial when a body undergoes extreme damage and standard approaches for positive identification are not possible. However, the traditional methods employed by forensic anthropologists were primarily developed from North American reference populations and have demonstrated varying accuracy rates when assigning age, sex, and ancestry to individuals outside of the reference collection. Medical imaging is a valuable source for facilitating empirical research and an accessible gateway for developing novel forensic anthropological methods of analysis, including 3D modelling. This is especially critical for the United Kingdom (UK), where biological profiling methods developed from modern UK populations do not currently exist. Researchers have quantified the variability of the paranasal sinuses between individuals and have begun to explore their ability to provide biological information. However, the published literature that addresses these structures in a forensic context presents extremely varied insights, and to date there has been no standardisation. This thesis presents research that addresses this gap and introduces a new approach to human identification using 3D models of the paranasal sinuses. The models were produced from a database of modern CT scans provided by University College London Hospital (UCLH), London, UK. Linear measurements and elliptic Fourier coefficients taken from 1,500 three-dimensional models across six ethnic groups, assessed by one-way ANOVA and discriminant function analysis, showed a range of classification rates, with certain rates reaching 75-85.7% (p<0.05) in correctly classifying age and sex according to size and shape.
The findings offer insights into the potential for employing CT scans to develop identification methods within the UK and establish a foundation for using the paranasal sinuses as an attribute for establishing identification of unknown human remains in future crime reconstructions.
Clustering in high dimension for multivariate and functional data using extreme kurtosis projections
Cluster analysis is the problem of determining whether clusters exist in a multivariate
sample. The analysis is performed by algorithms that differ significantly in their
notion of what constitutes a cluster and in how to find clusters efficiently. In this
thesis we are interested in large data problems, and we therefore consider algorithms
that use dimension reduction techniques to identify interesting structures in large
data sets, particularly algorithms that use the kurtosis coefficient to detect the
clusters present in the data.
The thesis extends the work of Peña and Prieto (2001a), who identify clusters
in multivariate data using univariate projections of the sample data onto the
directions that minimize and maximize the kurtosis coefficient of the projected
data, and of Peña et al. (2010), who used the eigenvectors of a kurtosis matrix to
reduce the dimension.
This thesis has two main contributions:
First, we prove that the extreme kurtosis projections have certain optimality
properties for mixtures of normal distributions, and we propose an algorithm to
identify clusters when the data dimension and the number of clusters present in
the sample are high. The good performance of the algorithm is shown through a
simulation study in which it is compared with the MCLUST, k-means and CLARA
methods.
Second, we propose an extension of multivariate kurtosis to functional data, and we analyze some of its properties for clustering. Additionally, we propose an
algorithm based on kurtosis projections for functional data. Its good properties
are compared with the results obtained by functional principal components,
functional k-means and the FunClust method.
The thesis is structured as follows. Chapter 1 is an introductory chapter in
which we review some theoretical concepts that will be used throughout the thesis.
In Chapter 2 we review the concept of kurtosis in detail: we study its
properties and give a detailed description of some algorithms proposed in the
literature that use the kurtosis coefficient to detect the clusters present in
the data.
In Chapter 3 we study the directions that may be interesting for the detection
of several clusters in the sample and we analyze how the extreme kurtosis directions
are related to these directions. In addition, we present a clustering algorithm for
high-dimensional data using extreme kurtosis directions.
In Chapter 4 we introduce an extension of the multivariate kurtosis for the
functional data and we analyze the properties of this measure regarding the
identification of clusters. In addition, we present a clustering algorithm for
functional data using extreme kurtosis directions.
We finish with some remarks and conclusions in the final Chapter.

Programa Oficial de Doctorado en Ingeniería Matemática. Presidente: Ana María Justel Eusebio. Secretario: Andrés Modesto Alonso Fernández. Vocal: José Manuel Mira Mcwilliam.
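The search for extreme kurtosis projections can be sketched numerically. The optimizer below is a generic stand-in for the thesis's specific algorithm, and the whitening-plus-search structure is an assumption about the overall workflow rather than a reproduction of it.

```python
import numpy as np
from scipy.optimize import minimize

def proj_kurtosis(a, Z):
    """Kurtosis coefficient of Z projected on direction a
    (invariant to the scale of a, since a is normalized)."""
    y = Z @ (a / np.linalg.norm(a))
    y = y - y.mean()
    return np.mean(y ** 4) / np.mean(y ** 2) ** 2

def extreme_kurtosis_directions(X, seed=0):
    """Whiten the data, then search numerically for unit directions
    minimizing and maximizing the projected kurtosis."""
    Xc = X - X.mean(axis=0)
    vals, vecs = np.linalg.eigh(np.cov(X, rowvar=False))
    Z = Xc @ vecs @ np.diag(vals ** -0.5) @ vecs.T
    a0 = np.random.default_rng(seed).normal(size=X.shape[1])
    a_min = minimize(proj_kurtosis, a0, args=(Z,)).x
    a_max = minimize(lambda a: -proj_kurtosis(a, Z), a0).x
    a_min /= np.linalg.norm(a_min)
    a_max /= np.linalg.norm(a_max)
    return a_min, a_max, Z

rng = np.random.default_rng(3)
# Balanced two-cluster data: the minimum-kurtosis projection is
# expected to be bimodal along the direction joining the cluster means.
X = np.vstack([rng.normal(0, 1, (250, 3)),
               rng.normal([5, 0, 0], 1, (250, 3))])
a_min, a_max, Z = extreme_kurtosis_directions(X)
y = Z @ a_min  # projected data; clusters typically separate here
```

For a balanced mixture the bimodal cluster direction has low kurtosis, while heavy-tailed or outlier-dominated directions have high kurtosis, which is why both extremes are worth projecting on.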
A Comparison of Depth Functions in Maximal Depth Classification Rules
Data depth has been described as an alternative to some parametric approaches for analyzing multivariate data. Many depth functions have emerged over the past two decades and have been studied in the literature. In this study, a nonparametric approach to classification based on notions of different data depth functions is considered, and some properties of these methods are studied. The performance of different depth functions in maximal depth classifiers is investigated using simulation and real data, with an application to the agricultural industry.
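A maximal depth classification rule can be sketched with Mahalanobis depth, one of the simpler depth notions such studies compare; Tukey or simplicial depth would slot into the same rule, and this sketch is an illustration of the principle, not the paper's specific comparison.

```python
import numpy as np

def mahalanobis_depth(x, X):
    """Mahalanobis depth of x w.r.t. sample X:
    D(x) = 1 / (1 + (x - mean)' S^{-1} (x - mean))."""
    mu = X.mean(axis=0)
    S = np.cov(X, rowvar=False)
    d = x - mu
    return 1.0 / (1.0 + d @ np.linalg.solve(S, d))

def max_depth_classify(x, samples):
    """Maximal depth rule: assign x to the class in whose training
    sample it is deepest. `samples` is a list of per-class arrays."""
    depths = [mahalanobis_depth(x, X) for X in samples]
    return int(np.argmax(depths))

rng = np.random.default_rng(5)
class0 = rng.normal(0, 1, (200, 2))
class1 = rng.normal(4, 1, (200, 2))
label = max_depth_classify(np.array([3.8, 4.1]), [class0, class1])
```

The rule is fully nonparametric once a depth function is chosen, which is the appeal over parametric discriminant rules; the choice of depth function is exactly what the paper's simulations compare.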
The topography of multivariate normal mixtures
Multivariate normal mixtures provide a flexible method of fitting
high-dimensional data. It is shown that their topography, in the sense of their
key features as a density, can be analyzed rigorously in lower dimensions by
use of a ridgeline manifold that contains all critical points, as well as the
ridges of the density. A plot of the elevations on the ridgeline shows the key
features of the mixed density. In addition, by use of the ridgeline, we uncover
a function that determines the number of modes of the mixed density when there
are two components being mixed. A followup analysis then gives a curvature
function that can be used to prove a set of modality theorems.

Comment: Published at http://dx.doi.org/10.1214/009053605000000417 in the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org).
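For two components the ridgeline is an explicit curve, x(a) = [(1-a)S1^-1 + a S2^-1]^-1 [(1-a)S1^-1 mu1 + a S2^-1 mu2] for a in [0, 1], and plotting the mixture density along it reveals the modes. The sketch below assumes this standard form and counts strict interior local maxima of the elevation on a grid.

```python
import numpy as np

def ridgeline(a, mu1, S1, mu2, S2):
    """Point on the ridgeline of a two-component normal mixture."""
    P1, P2 = np.linalg.inv(S1), np.linalg.inv(S2)
    return np.linalg.solve((1 - a) * P1 + a * P2,
                           (1 - a) * P1 @ mu1 + a * P2 @ mu2)

def normal_pdf(x, mu, S):
    """Multivariate normal density at x."""
    p = len(mu)
    d = x - mu
    quad = d @ np.linalg.solve(S, d)
    return np.exp(-0.5 * quad) / np.sqrt((2 * np.pi) ** p * np.linalg.det(S))

# Equal spherical components whose means are 3 standard deviations
# apart: the mixture is bimodal, and both modes lie on the ridgeline.
mu1, mu2 = np.array([0.0, 0.0]), np.array([3.0, 0.0])
S1 = S2 = np.eye(2)
alphas = np.linspace(0.0, 1.0, 501)
curve = np.array([ridgeline(a, mu1, S1, mu2, S2) for a in alphas])
elev = np.array([0.5 * normal_pdf(x, mu1, S1) + 0.5 * normal_pdf(x, mu2, S2)
                 for x in curve])
# Strict interior local maxima of the elevation are modes of the mixture.
n_modes = int(np.sum((elev[1:-1] > elev[:-2]) & (elev[1:-1] > elev[2:])))
```

The elevation plot reduces a p-dimensional modality question to a one-dimensional curve, which is the "topography in lower dimensions" the abstract refers to.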
Evaluating Surface Area-Basin Volume Relationships for Prairie Potholes
Establishing a relationship between the surface area and volume of prairie potholes provides a simple method to estimate changes in water storage across the landscape. Applications include better prediction of floods and improved design for wetland restoration. Length, width, depth, surface area, and volume were surveyed for eighty-two potholes within the upper Turtle River watershed, which lies sixty kilometers west of Grand Forks, ND. These data were used to determine the relationship, and its uncertainty, between pothole surface area and volume. Chi-squared tests defined the distributions of each variable. F and t statistical tests resolved similarities in variance and mean. The eighty-two potholes were separated according to their National Wetlands Inventory (NWI) classification and tested using chi-squared tests. T and F tests on the separate classes verified whether the populations have different means and variances. The difference in depth, in particular, suggests that the two most common NWI classes in the watershed, PEMC and PEMA, are separate and distinct, based on the results from discriminant analysis. Despite this conclusion, and the fact that PEMC wetlands are physically larger than PEMA wetlands, there is a stronger correlation between surface area and volume when the two classes remain combined. Regression of surface area and volume leads to an equation that can be applied to similar watersheds throughout the prairie pothole region.
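A regression of this kind is often fitted as a power law on the log-log scale. The sketch below uses entirely synthetic data with an arbitrary exponent; the study's actual functional form, coefficients, and units are not reproduced here.

```python
import numpy as np

# Synthetic stand-in for the survey: 82 potholes with a hypothetical
# power-law relation V = c * A^b plus lognormal noise. The exponent
# 1.2 and coefficient 0.25 are illustrative assumptions only.
rng = np.random.default_rng(4)
area = rng.uniform(100, 10_000, 82)                              # m^2
volume = 0.25 * area ** 1.2 * np.exp(rng.normal(0, 0.1, 82))     # m^3

# Ordinary least squares on log(V) = b * log(A) + log(c).
b, log_c = np.polyfit(np.log(area), np.log(volume), 1)
c = np.exp(log_c)

def predict_volume(a):
    """Predicted storage volume for a pothole of surface area a."""
    return c * a ** b
```

Fitting on the log scale keeps the multiplicative error structure typical of size data and yields a single equation that, as the abstract notes, can then be applied across a watershed from surface area alone.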