Fast DD-classification of functional data
A fast nonparametric procedure for classifying functional data is introduced.
It consists of a two-step transformation of the original data plus a classifier
operating on a low-dimensional hypercube. The functional data are first mapped
into a finite-dimensional location-slope space and then transformed by a
multivariate depth function into the DD-plot, which is a subset of the unit
hypercube. This transformation yields a new notion of depth for functional
data. Three alternative depth functions are employed for this, as well as two
rules for the final classification on the DD-plot. The resulting classifier has
to be cross-validated over a small range of parameters only, which is
restricted by a Vapnik-Chervonenkis bound. The entire methodology does not
involve smoothing techniques, is completely nonparametric, and achieves
Bayes optimality under standard distributional settings. It is robust,
efficiently computable, and has been implemented in an R environment.
Applicability of the new approach is demonstrated by simulations as well as by a
benchmark study.
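To make the two-step transformation concrete, here is a minimal Python sketch (not the authors' R implementation; the simple mean-level/mean-slope summary and the Mahalanobis depth are stand-ins for the paper's richer location-slope space and its three depth choices):

```python
import numpy as np

def location_slope(curves, t):
    """Step 1: map each discretized curve (row of `curves`, sampled at
    points `t`) to a finite-dimensional summary -- here just the pair
    (average level, average slope)."""
    loc = curves.mean(axis=1)
    slo = np.gradient(curves, t, axis=1).mean(axis=1)
    return np.column_stack([loc, slo])

def mahalanobis_depth(points, cloud):
    """Mahalanobis depth D(x) = 1 / (1 + (x - mu)' S^{-1} (x - mu))."""
    mu = cloud.mean(axis=0)
    S_inv = np.linalg.inv(np.cov(cloud, rowvar=False))
    diff = points - mu
    return 1.0 / (1.0 + np.einsum('ij,jk,ik->i', diff, S_inv, diff))

def dd_plot(classes, points):
    """Step 2: depth of each point w.r.t. every class -- a point of the
    DD-plot inside the unit hypercube [0, 1]^q."""
    return np.column_stack([mahalanobis_depth(points, c) for c in classes])
```

A final rule (e.g., k-nearest neighbors operating on the DD-plot coordinates) then assigns each new curve to a class.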
Depth and Depth-Based Classification with R Package ddalpha
Following the seminal idea of Tukey (1975), data depth is a function that measures how close an arbitrary point of the space lies to an implicitly defined center of a data cloud. Having undergone theoretical and computational developments, it is now employed in numerous applications, classification being the most popular one. The R package ddalpha is software designed to combine the user's experience with recent achievements in the area of data depth and depth-based classification. ddalpha provides an implementation for exact and approximate computation of the most reasonable and widely applied notions of data depth. These can be further used in the depth-based multivariate and functional classifiers implemented in the package, where the DDα-procedure is the main focus. The package is expandable with user-defined custom depth methods and separators. The implemented functions for depth visualization and the built-in benchmark procedures may also serve to provide insights into the geometry of the data and the quality of pattern recognition.
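As a concrete illustration of what a depth function computes, here is a naive Python sketch (not ddalpha's implementation, which is in R and far more efficient) of Liu's simplicial depth in the plane: the fraction of data triangles that contain the query point.

```python
from itertools import combinations
from math import comb

def simplicial_depth_2d(x, X):
    """Liu's simplicial depth of point x w.r.t. a 2-D data cloud X:
    the fraction of triangles (X_i, X_j, X_k) containing x.
    Naive O(n^3) enumeration, for illustration only."""
    def orient(a, b, c):
        # sign of the cross product (b - a) x (c - a)
        v = (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])
        return (v > 0) - (v < 0)
    count = 0
    for i, j, k in combinations(range(len(X)), 3):
        s1 = orient(X[i], X[j], x)
        s2 = orient(X[j], X[k], x)
        s3 = orient(X[k], X[i], x)
        if s1 == s2 == s3 != 0:   # x strictly inside the triangle
            count += 1
    return count / comb(len(X), 3)
```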
Depth- and Potential-Based Supervised Learning
The task of supervised learning is to define a data-based rule by which new objects are assigned to one of the classes. For this, a training data set is used that contains objects with known class membership. In this thesis, two procedures for supervised classification are introduced.
The first procedure is based on potential functions. The potential of a class is defined as a kernel density estimate multiplied by the class's prior probability. The method transforms the data to a potential-potential (pot-pot) plot, where each data point is mapped to a vector of potentials, similarly to the DD-plot. Separation of the classes, as well as classification of new data points, is performed on this plot; thus the bias in the kernel density estimates due to insufficiently adapted multivariate kernels is compensated for by a flexible classifier on the pot-pot plot.
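A minimal Python sketch of the pot-pot transform (illustrative only; the thesis's implementation lives in the R package ddalpha, and the Gaussian KDE with scipy's default bandwidth is an assumption here, exactly the kind of imperfectly adapted kernel the flexible classifier is meant to compensate for):

```python
import numpy as np
from scipy.stats import gaussian_kde

def pot_pot_plot(classes, points):
    """Map each point to its vector of class potentials: kernel density
    estimate times the class's prior (here estimated by class frequency),
    analogously to the DD-plot."""
    n = sum(len(c) for c in classes)
    kdes = [gaussian_kde(c.T) for c in classes]     # expects (d, n) data
    priors = [len(c) / n for c in classes]
    return np.column_stack([p * k(points.T) for p, k in zip(priors, kdes)])
```

On the resulting q-dimensional plot, any flexible classifier (e.g., k-nearest neighbors) can separate the classes.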
The proposed method has been implemented in the R-package ddalpha, software designed to combine the user's experience with recent theoretical and computational achievements in the area of data depth and depth-based classification. It implements various depth functions and classifiers for multivariate and functional data under one roof. The package is expandable with user-defined custom depth methods and separators.
The second classification procedure focuses on the centers of the classes and is based on data depth. The classifier adds a depth term to the objective function of the Bayes classifier, so that the cost of misclassifying a point depends not only on the class it belongs to but also on its centrality within that class. Classification of more central points is enforced, while outliers are underweighted. The proposed objective function may also be used to evaluate the performance of other classifiers in place of the usual average misclassification rate.
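One plausible reading of that evaluation criterion, sketched in Python (the weighting scheme is an assumption, not the thesis's exact objective function):

```python
import numpy as np

def depth_weighted_error(y_true, y_pred, depths):
    """Depth-weighted misclassification rate: errors on central (deep)
    points of their own class cost more than errors on outlying points.
    `depths` holds each point's depth within its true class."""
    errors = (np.asarray(y_true) != np.asarray(y_pred)).astype(float)
    return float(np.sum(depths * errors) / np.sum(depths))
```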
The thesis also contains a new algorithm for the exact calculation of the Oja median. It modifies the algorithm of Ronkainen, Oja and Orponen (2003) by employing bounded regions which contain the median. The new algorithm is faster and has lower complexity than its predecessor, and it has been implemented as part of the R-package OjaNP.
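For orientation, the Oja median minimizes the total volume of the simplices spanned by a candidate point and all tuples of data points. A naive numerical sketch in Python for the bivariate case (illustrative only; this is not the exact bounded-region algorithm of the thesis, nor the algorithm in OjaNP):

```python
import numpy as np
from itertools import combinations
from scipy.optimize import minimize

def oja_objective(x, X):
    """Sum of areas of triangles (x, X_i, X_j) over all pairs i < j (2-D)."""
    total = 0.0
    for i, j in combinations(range(len(X)), 2):
        a, b = X[i] - x, X[j] - x
        total += 0.5 * abs(a[0] * b[1] - a[1] * b[0])
    return total

def oja_median_naive(X):
    """Naive numerical minimization of the (convex) Oja objective,
    started from the coordinatewise mean."""
    res = minimize(oja_objective, X.mean(axis=0), args=(X,),
                   method='Nelder-Mead')
    return res.x
```

Exact algorithms instead exploit the structure of the objective; the naive search above only approximates the minimizer.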
Choosing among notions of multivariate depth statistics
Classical multivariate statistics measures the outlyingness of a point by its
Mahalanobis distance from the mean, which is based on the mean and the
covariance matrix of the data. A multivariate depth function is a function
which, given a point and a distribution in d-space, measures centrality by a
number between 0 and 1, while satisfying certain postulates regarding
invariance, monotonicity, convexity and continuity. Accordingly, numerous
notions of multivariate depth have been proposed in the literature, some of
which are also robust against extremely outlying data. The departure from
classical Mahalanobis distance does not come without cost. There is a trade-off
between invariance, robustness and computational feasibility. In the last few
years, efficient exact algorithms as well as approximate ones have been
constructed and made available in R-packages. Consequently, in practical
applications the choice of a depth statistic is no longer restricted to one or
two notions by computational limits; rather, several notions are often
feasible, among which the researcher has to decide. The article debates
theoretical and practical aspects of this choice, including invariance and
uniqueness, robustness and computational feasibility. Complexity and speed of
exact algorithms are compared. The accuracy of approximate approaches like the
random Tukey depth is discussed as well as the application to large and
high-dimensional data. Extensions to local and functional depths and
connections to regression depth are briefly addressed.
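The random Tukey depth mentioned above admits a very short implementation. Here is a Python sketch (the number of directions and the Gaussian direction sampler are illustrative choices, not prescriptions):

```python
import numpy as np

def random_tukey_depth(x, X, n_dirs=1000, seed=None):
    """Approximate halfspace (Tukey) depth of x w.r.t. data cloud X:
    the minimum, over random unit directions u, of the fraction of
    points in the closed halfspace {y : u'y >= u'x}
    (Cuesta-Albertos and Nieto-Reyes, 2008)."""
    rng = np.random.default_rng(seed)
    U = rng.standard_normal((n_dirs, X.shape[1]))
    U /= np.linalg.norm(U, axis=1, keepdims=True)
    proj_X = X @ U.T            # shape (n, n_dirs)
    proj_x = x @ U.T            # shape (n_dirs,)
    counts = (proj_X >= proj_x).sum(axis=0)
    return counts.min() / len(X)
```

Because each direction yields an upper bound on the exact halfspace depth, the approximation decreases monotonically toward the exact value as directions are added.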
A Pseudo-Metric between Probability Distributions based on Depth-Trimmed Regions
The design of a metric between probability distributions is a longstanding
problem motivated by numerous applications in Machine Learning. Focusing on
continuous probability distributions on the Euclidean space ℝ^d, we
introduce a novel pseudo-metric between probability distributions by leveraging
the extension of univariate quantiles to multivariate spaces. Data depth is a
nonparametric statistical tool that measures the centrality of any element
x ∈ ℝ^d with respect to (w.r.t.) a probability distribution or a
data set. It is a natural median-oriented extension of the cumulative
distribution function (cdf) to the multivariate case. Thus, its upper-level
sets -- the depth-trimmed regions -- give rise to a definition of multivariate
quantiles. The new pseudo-metric relies on the average of the Hausdorff
distance between the depth-based quantile regions w.r.t. each distribution. Its
good behavior w.r.t. major transformation groups, as well as its ability to
factor out translations, are depicted. Robustness, an appealing feature of this
pseudo-metric, is studied through the finite sample breakdown point. Moreover,
we propose an efficient approximation method with linear time complexity w.r.t.
the size of the data set and its dimension. The quality of this approximation
as well as the performance of the proposed approach are illustrated in
numerical experiments.
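A rough Python sketch of the construction (the level grid, the empirical region estimator, and the plug-in spatial depth are simplifying assumptions, not the paper's exact definitions or its linear-time approximation):

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def spatial_depth(points, cloud):
    """Plug-in depth: spatial depth D(x) = 1 - ||mean_i (x - X_i)/||x - X_i||||."""
    out = []
    for x in points:
        diff = x - cloud
        norms = np.linalg.norm(diff, axis=1)
        unit = diff[norms > 0] / norms[norms > 0, None]
        out.append(1.0 - np.linalg.norm(unit.mean(axis=0)))
    return np.array(out)

def hausdorff(A, B):
    """Symmetric Hausdorff distance between two finite point sets."""
    return max(directed_hausdorff(A, B)[0], directed_hausdorff(B, A)[0])

def depth_region_pseudo_metric(X, Y, levels=np.linspace(0.1, 0.9, 9)):
    """Average Hausdorff distance between empirical depth-trimmed regions
    (upper-level sets of the depth) of the two samples."""
    dX = spatial_depth(X, X)
    dY = spatial_depth(Y, Y)
    dists = [hausdorff(X[dX >= a], Y[dY >= a])
             for a in levels
             if (dX >= a).any() and (dY >= a).any()]
    return float(np.mean(dists))
```

Translations shift both samples' regions identically, so they contribute nothing once factored out, which is the behavior the abstract highlights.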
A generalized spatial sign covariance matrix
The well-known spatial sign covariance matrix (SSCM) carries out a radial
transform which moves all data points to a sphere, followed by computing the
classical covariance matrix of the transformed data. Its popularity stems from
its robustness to outliers, fast computation, and applications to correlation
and principal component analysis. In this paper we study more general radial
functions. It is shown that the eigenvectors of the generalized SSCM are still
consistent and the ranks of the eigenvalues are preserved. The influence
function of the resulting scatter matrix is derived, and it is shown that its
breakdown value is as high as that of the original SSCM. A simulation study
indicates that the best results are obtained when the inner half of the data
points are not transformed and points lying far away are moved to the center.
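A Python sketch of the construction (the coordinatewise-median center and the specific cutoffs are assumptions on my part; the paper studies general radial functions g):

```python
import numpy as np

def generalized_sscm(X, g):
    """Generalized spatial sign covariance matrix: apply a radial function
    g to each point's distance from a center, keep its direction, then
    take the classical covariance of the transformed points.  g(r) = 1
    recovers the ordinary SSCM (all points moved to the unit sphere)."""
    Z = X - np.median(X, axis=0)     # assumed center: coordinatewise median
    r = np.linalg.norm(Z, axis=1)
    safe = np.where(r == 0, 1.0, r)  # avoid division by zero
    T = (g(r) / safe)[:, None] * Z
    return np.cov(T, rowvar=False)

def g_redescending(r):
    """Radial function in the spirit of the simulation's best performer:
    identity on the inner half of the points, then linearly redescending
    to zero so that far-away points are moved to the center (the cutoffs
    below are illustrative assumptions)."""
    q1, q2 = np.median(r), 3.0 * np.median(r)
    return np.clip(np.where(r <= q1, r, q1 * (q2 - r) / (q2 - q1)), 0.0, None)
```

An eigendecomposition of `generalized_sscm(X, g_redescending)` then yields robust principal directions, the use case the abstract mentions.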
Essays on productivity dynamics and labour market outcomes
This thesis studies the interaction of recent productivity dynamics and their various drivers with other economic outcomes, in particular those related to the labour market. Chapter 1 exploits harmonised and comparable data for 13 OECD countries from the MultiProd database to shed new light on the relationship between productivity divergence, i.e. increasing gaps between the most and least productive firms in an industry, and aggregate productivity growth (APG). Chapter 2 again builds on the MultiProd database to comprehensively analyse the productivity-employment nexus at different levels of aggregation. The evidence suggests that both micro- and industry-level productivity growth translate positively into employment growth on average; this average, however, is the outcome of counteracting mechanisms, and the quantitative extent of the positive link depends on firm and industry characteristics. In Chapter 3, I theoretically investigate the previously neglected role of workers as task-aggregating institutions for the impact of automation technologies on labour demand. The analysis rationalises the positive micro-level relationship between automation and labour demand that prevails despite the task-replacing nature of automation. At the same time, automation may reduce employment at more aggregate levels and contribute to the fall of the labour share.