
    Fast DD-classification of functional data

    A fast nonparametric procedure for classifying functional data is introduced. It consists of a two-step transformation of the original data plus a classifier operating on a low-dimensional hypercube. The functional data are first mapped into a finite-dimensional location-slope space and then transformed by a multivariate depth function into the DD-plot, which is a subset of the unit hypercube. This transformation yields a new notion of depth for functional data. Three alternative depth functions are employed for this, as well as two rules for the final classification on $[0,1]^q$. The resulting classifier has to be cross-validated over a small range of parameters only, which is restricted by a Vapnik-Chervonenkis bound. The entire methodology does not involve smoothing techniques, is completely nonparametric and allows one to achieve Bayes optimality under standard distributional settings. It is robust, efficiently computable, and has been implemented in an R environment. Applicability of the new approach is demonstrated by simulations as well as a benchmark study.
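
    A minimal sketch of the two-step idea, assuming curves observed on a common grid and only two classes: each curve is reduced to a location-slope pair, Mahalanobis depth stands in for the depth function, and the DD-plot coordinates are the depths with respect to the two training classes. The maximum-depth rule at the end is a placeholder for the cross-validated classifiers described above; all names are illustrative, not the authors' implementation.

```python
import numpy as np

def location_slope(curves, grid):
    """Map each discretised curve (one row per curve) to its mean level and mean slope."""
    loc = curves.mean(axis=1)
    slope = np.gradient(curves, grid, axis=1).mean(axis=1)
    return np.column_stack([loc, slope])

def mahalanobis_depth(points, sample):
    """Illustrative depth: 1 / (1 + squared Mahalanobis distance to the sample mean)."""
    mu = sample.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(sample, rowvar=False))
    diff = points - mu
    d2 = np.einsum('ij,jk,ik->i', diff, cov_inv, diff)
    return 1.0 / (1.0 + d2)

def dd_classify(test_curves, class0_curves, class1_curves, grid):
    """DD-plot coordinates (depth w.r.t. class 0, depth w.r.t. class 1) and a max-depth rule."""
    x0 = location_slope(class0_curves, grid)
    x1 = location_slope(class1_curves, grid)
    xt = location_slope(test_curves, grid)
    dd = np.column_stack([mahalanobis_depth(xt, x0), mahalanobis_depth(xt, x1)])
    return dd, (dd[:, 1] > dd[:, 0]).astype(int)  # label 1 if deeper in class 1
```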

    Depth and Depth-Based Classification with R Package ddalpha

    Following the seminal idea of Tukey (1975), data depth is a function that measures how close an arbitrary point of the space is located to an implicitly defined center of a data cloud. Having undergone theoretical and computational developments, it is now employed in numerous applications, with classification being the most popular one. The R package ddalpha is software that aims to fuse the user's experience with recent achievements in the area of data depth and depth-based classification. ddalpha provides an implementation for exact and approximate computation of the most reasonable and widely applied notions of data depth. These can be further used in the depth-based multivariate and functional classifiers implemented in the package, where the DDα-procedure is the main focus. The package is expandable with user-defined custom depth methods and separators. The implemented functions for depth visualization and the built-in benchmark procedures may also serve to provide insights into the geometry of the data and the quality of pattern recognition.
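
    The package itself is written in R; as a language-neutral illustration of what a depth function and a maximum-depth classifier do, rather than of the ddalpha API, the sketch below implements spatial depth, one widely applied notion, and assigns a point to the class in which it lies deepest.

```python
import numpy as np

def spatial_depth(x, sample):
    """Spatial (L1) depth of point x w.r.t. a data cloud: one minus the norm of
    the average unit vector pointing from the sample points towards x."""
    diff = x - sample
    norms = np.linalg.norm(diff, axis=1)
    units = diff[norms > 0] / norms[norms > 0, None]
    return 1.0 - np.linalg.norm(units.mean(axis=0))

def max_depth_classify(x, class_samples):
    """Assign x to the class w.r.t. which it attains the largest depth."""
    depths = [spatial_depth(x, s) for s in class_samples]
    return int(np.argmax(depths)), depths
```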

    Depth- and Potential-Based Supervised Learning

    The task of supervised learning is to define a data-based rule by which new objects are assigned to one of the classes. For this, a training data set is used that contains objects with known class membership. In this thesis, two procedures for supervised classification are introduced. The first procedure is based on potential functions. The potential of a class is defined as a kernel density estimate multiplied by the class's prior probability. The method transforms the data to a potential-potential (pot-pot) plot, where each data point is mapped to a vector of potentials, similarly to the DD-plot. Separation of the classes, as well as classification of new data points, is performed on this plot; thus the bias in kernel density estimates due to insufficiently adapted multivariate kernels is compensated by a flexible classifier on the pot-pot plot. The proposed method has been implemented in the R package ddalpha, software that aims to fuse the user's experience with recent theoretical and computational achievements in the area of data depth and depth-based classification. It implements various depth functions and classifiers for multivariate and functional data under one roof. The package is expandable with user-defined custom depth methods and separators. The second classification procedure focuses on the centers of the classes and is based on data depth. The classifier adds a depth term to the objective function of the Bayes classifier, so that the cost of misclassifying a point depends not only on its class membership but also on its centrality within that class. Classification of more central points is enforced, while outliers are underweighted. The proposed objective function may also be used to evaluate the performance of other classifiers instead of the usual average misclassification rate. The thesis also contains a new algorithm for the exact calculation of the Oja median. It modifies the algorithm of Ronkainen, Oja and Orponen (2003) by employing bounded regions which contain the median. The new algorithm is faster and has lower complexity than the previous one; it has been implemented as part of the R package OjaNP.
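
    A rough sketch of the pot-pot construction described above, assuming two classes and a Gaussian kernel density estimate: each point is mapped to its vector of class potentials (KDE times prior). The thesis trains a flexible separator on this plot; here, simply comparing the two coordinates serves as a baseline rule, and all names are illustrative.

```python
import numpy as np
from scipy.stats import gaussian_kde

def pot_pot_classify(test_points, class0, class1):
    """Map each test point to (potential w.r.t. class 0, potential w.r.t. class 1),
    where a potential is a Gaussian KDE scaled by the class prior."""
    n0, n1 = len(class0), len(class1)
    prior0, prior1 = n0 / (n0 + n1), n1 / (n0 + n1)
    kde0 = gaussian_kde(class0.T)   # gaussian_kde expects data of shape (d, n)
    kde1 = gaussian_kde(class1.T)
    pots = np.column_stack([prior0 * kde0(test_points.T),
                            prior1 * kde1(test_points.T)])
    # comparing the two potentials directly recovers the plug-in Bayes rule;
    # the thesis replaces this with a flexible classifier trained on the plot
    return pots, (pots[:, 1] > pots[:, 0]).astype(int)
```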

    Choosing among notions of multivariate depth statistics

    Classical multivariate statistics measures the outlyingness of a point by its Mahalanobis distance from the mean, which is based on the mean and the covariance matrix of the data. A multivariate depth function is a function which, given a point and a distribution in d-space, measures centrality by a number between 0 and 1, while satisfying certain postulates regarding invariance, monotonicity, convexity and continuity. Accordingly, numerous notions of multivariate depth have been proposed in the literature, some of which are also robust against extremely outlying data. The departure from the classical Mahalanobis distance does not come without cost: there is a trade-off between invariance, robustness and computational feasibility. In the last few years, efficient exact algorithms as well as approximate ones have been constructed and made available in R packages. Consequently, in practical applications the choice of a depth statistic is no longer restricted to one or two notions by computational limits; rather, several notions are often feasible, and the researcher has to decide among them. The article debates theoretical and practical aspects of this choice, including invariance and uniqueness, robustness and computational feasibility. Complexity and speed of exact algorithms are compared. The accuracy of approximate approaches like the random Tukey depth is discussed, as well as the application to large and high-dimensional data. Extensions to local and functional depths and connections to regression depth are briefly addressed.
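
    As an illustration of the approximate computation mentioned above, a minimal sketch of the random Tukey depth: the exact halfspace depth minimises the projected empirical mass over all directions, while the approximation minimises over a finite set of random directions, yielding an upper bound that is cheap to compute in higher dimensions. The number of directions below is arbitrary.

```python
import numpy as np

def random_tukey_depth(x, sample, n_dir=1000, seed=None):
    """Approximate halfspace (Tukey) depth of x w.r.t. sample: minimum over random
    unit directions of the univariate halfspace depth of the projections."""
    rng = np.random.default_rng(seed)
    u = rng.standard_normal((n_dir, sample.shape[1]))
    u /= np.linalg.norm(u, axis=1, keepdims=True)
    proj_x = u @ x                      # shape (n_dir,)
    proj_s = sample @ u.T               # shape (n, n_dir)
    per_direction = np.minimum((proj_s <= proj_x).mean(axis=0),
                               (proj_s >= proj_x).mean(axis=0))
    return per_direction.min()
```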

    A Pseudo-Metric between Probability Distributions based on Depth-Trimmed Regions

    The design of a metric between probability distributions is a longstanding problem motivated by numerous applications in Machine Learning. Focusing on continuous probability distributions on the Euclidean space $\mathbb{R}^d$, we introduce a novel pseudo-metric between probability distributions by leveraging the extension of univariate quantiles to multivariate spaces. Data depth is a nonparametric statistical tool that measures the centrality of any element $x \in \mathbb{R}^d$ with respect to (w.r.t.) a probability distribution or a data set. It is a natural median-oriented extension of the cumulative distribution function (cdf) to the multivariate case. Thus, its upper-level sets -- the depth-trimmed regions -- give rise to a definition of multivariate quantiles. The new pseudo-metric relies on the average of the Hausdorff distance between the depth-based quantile regions w.r.t. each distribution. Its good behavior w.r.t. major transformation groups, as well as its ability to factor out translations, is shown. Robustness, an appealing feature of this pseudo-metric, is studied through the finite sample breakdown point. Moreover, we propose an efficient approximation method with linear time complexity w.r.t. the size of the data set and its dimension. The quality of this approximation as well as the performance of the proposed approach are illustrated in numerical experiments.
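
    A rough numerical sketch of the construction, under clearly stated simplifications: spatial depth stands in for the depth notion, each depth-trimmed region is approximated by the deepest sample points, the grid of levels is arbitrary, and translations are not factored out. It is meant only to show how depths, upper-level sets and the Hausdorff distance fit together, not to reproduce the paper's estimator.

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def spatial_depths(points, sample):
    """Spatial depth of each row of `points` w.r.t. `sample` (illustrative depth choice)."""
    out = []
    for x in points:
        diff = x - sample
        norms = np.linalg.norm(diff, axis=1)
        units = diff[norms > 0] / norms[norms > 0, None]
        out.append(1.0 - np.linalg.norm(units.mean(axis=0)))
    return np.array(out)

def depth_region_pseudo_metric(sample_p, sample_q, levels=(0.1, 0.2, 0.3, 0.4, 0.5)):
    """Average Hausdorff distance between empirical depth-trimmed regions, each
    region approximated by the deepest fraction `a` of the sample points."""
    dp = spatial_depths(sample_p, sample_p)
    dq = spatial_depths(sample_q, sample_q)
    dists = []
    for a in levels:
        rp = sample_p[dp >= np.quantile(dp, 1.0 - a)]
        rq = sample_q[dq >= np.quantile(dq, 1.0 - a)]
        h = max(directed_hausdorff(rp, rq)[0], directed_hausdorff(rq, rp)[0])
        dists.append(h)
    return float(np.mean(dists))
```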

    A generalized spatial sign covariance matrix

    The well-known spatial sign covariance matrix (SSCM) carries out a radial transform which moves all data points to a sphere, followed by computing the classical covariance matrix of the transformed data. Its popularity stems from its robustness to outliers, fast computation, and applications to correlation and principal component analysis. In this paper we study more general radial functions. It is shown that the eigenvectors of the generalized SSCM are still consistent and the ranks of the eigenvalues are preserved. The influence function of the resulting scatter matrix is derived, and it is shown that its breakdown value is as high as that of the original SSCM. A simulation study indicates that the best results are obtained when the inner half of the data points are not transformed and points lying far away are moved to the center.
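
    A minimal sketch of the construction, not of the paper's specific radial functions: the data are centred (here at the coordinatewise median, an illustrative choice), each point's radius is replaced by radial(radius) while its direction is kept, and the classical covariance of the transformed points is returned. The classical SSCM corresponds to a constant radial function; the redescending example below mimics the recommendation above of leaving the inner half untouched and pulling far-away points towards the centre.

```python
import numpy as np

def generalized_sscm(X, radial, center=None):
    """Generalised spatial sign covariance: rescale each centred point's radius by
    radial(.) while keeping its direction, then take the classical covariance."""
    if center is None:
        center = np.median(X, axis=0)
    D = X - center
    r = np.linalg.norm(D, axis=1)
    safe_r = np.where(r > 0, r, 1.0)
    Z = D * np.where(r > 0, radial(r) / safe_r, 0.0)[:, None]
    return np.cov(Z, rowvar=False)

# classical SSCM: a constant radial function moves every point to the unit sphere
sscm = lambda X: generalized_sscm(X, radial=np.ones_like)

def redescending_radial(r, k=2.0):
    """Illustrative radial function: keep the inner half of the radii unchanged and
    shrink points beyond the median radius linearly towards the centre."""
    c = np.median(r)
    return np.where(r <= c, r, np.maximum(c * (k - r / c) / (k - 1.0), 0.0))

# usage sketch: generalized_sscm(X, radial=redescending_radial)
```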

    Multi-source heterogeneous intelligence fusion


    Essays on productivity dynamics and labour market outcomes

    This thesis studies the interaction of recent productivity dynamics and their various drivers with other economic outcomes, in particular those related to the labour market. Chapter 1 exploits harmonised and comparable data for 13 OECD countries from the MultiProd database to shed new light on the relationship between productivity divergence, i.e. increasing gaps between the most and least productive firms in an industry, and aggregate productivity growth (APG). Chapter 2 again builds on the MultiProd database to comprehensively analyse the productivity-employment nexus at different levels of aggregation. The evidence suggests that both micro- and industry-level productivity growth translate positively into employment growth on average; this, however, is the outcome of counteracting mechanisms, and the quantitative extent of the positive link depends on firm and industry characteristics. In Chapter 3, I theoretically investigate the previously neglected role of workers as task-aggregating institutions for the impact of automation technologies on labour demand. The analysis rationalises the positive micro-level relationship between automation and labour demand that prevails despite the task-replacing nature of automation. At the same time, automation may reduce employment at the more aggregate level and contribute to the fall of the labour share.