1,043 research outputs found

    Entropy-based fuzzy clustering of interval-valued time series

    This paper proposes a fuzzy C-medoids-based clustering method with entropy regularization to address the problem of grouping complex data such as interval-valued time series. The dual nature of the data, which are both time-varying and interval-valued, must be considered and embedded into clustering techniques. In this work a new dissimilarity measure, based on Dynamic Time Warping, is proposed. The performance of the new clustering procedure is evaluated through a simulation study and an application to financial time series.
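    The combination of interval-valued observations with Dynamic Time Warping can be sketched as follows. This is a minimal illustration, not the paper's measure: the pointwise cost on (center, radius) representations of intervals is one common choice and an assumption here.

```python
import numpy as np

def interval_dtw(a, b):
    """DTW dissimilarity between two interval-valued series.

    a, b: arrays of shape (n, 2) and (m, 2) holding [lower, upper] bounds
    at each time point. The pointwise cost is the Euclidean distance
    between the (center, radius) representations of the two intervals --
    an illustrative choice, not necessarily the paper's exact measure.
    """
    def cost(u, v):
        cu, ru = (u[0] + u[1]) / 2, (u[1] - u[0]) / 2
        cv, rv = (v[0] + v[1]) / 2, (v[1] - v[0]) / 2
        return np.hypot(cu - cv, ru - rv)

    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)   # accumulated-cost matrix
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i, j] = cost(a[i - 1], b[j - 1]) + min(
                D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]
```

    Such a dissimilarity could then be plugged into a fuzzy C-medoids objective in place of a pointwise distance.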

    Contributions in classification: visual pruning for decision trees, P-spline based clustering of correlated series, boosted-oriented probabilistic clustering of series

    This work consists of three papers written during my Ph.D. period. The thesis consists of five chapters. In Chapter 2 the basic building blocks of our work are introduced; in particular, we briefly recall the concepts of classification (supervised and unsupervised) and penalized splines. In Chapter 3 we present a paper whose idea was presented at the Cladag 2013 Symposium. Within the framework of recursive partitioning algorithms by tree-based methods, this paper contributes both a visual representation of the data partition in a geometrical space and a method for selecting the decision tree. In our visual approach, both the best tree and the weakest links can be identified immediately from a graphical analysis of the tree structure, without considering the pruning sequence. The error rates are very similar to those returned by the Classification And Regression Trees procedure, showing that this new way of selecting the best tree is a valid alternative to the well-known cost-complexity pruning.
    In Chapter 4 we present a paper on parsimonious clustering of correlated series. Clustering of time series has become an important topic, motivated by increased interest in this type of data. Most existing procedures do not facilitate the removal of noise from the data, have difficulty handling time series of unequal length, and require a preprocessing step, e.g. modeling each series with an appropriate time-series model. In this work we propose a new way of clustering (time) series, which can be considered as belonging to both the model-based and the feature-based approaches. Our method models each series with penalized spline (P-spline) smoothers and performs clustering directly on the spline coefficients. The P-spline smoothers separate the signal of a series from the noise, capturing the different shapes of the series. The P-spline coefficients are close to the fitted curve and represent the skeleton of the fit. Summarizing each series by its coefficients thus reduces the dimensionality of the problem, improving computation time significantly without degrading the performance of the clustering procedure. To select the smoothing parameter we adopt the V-curve procedure. This criterion does not require computing the effective model dimension and is insensitive to serial correlation in the noise around the trend. Using the P-spline smoothers, the moments of the original data are preserved: the mean and variance of the estimated series equal those of the raw series. This allows a similar approach to be used for series of different lengths. The performance is evaluated on a simulated data set, also considering series of different lengths, and an application of our proposal to financial time series is also performed.
    In Chapter 5 we present a paper that proposes a fuzzy clustering algorithm that is independent of the choice of the fuzzifier. It combines two approaches, theoretically motivated for the unsupervised and supervised classification cases respectively: the Probabilistic Distance (PD) clustering procedure and the well-known Boosting philosophy. From the PD approach we take the idea of determining the probability that each series belongs to any of the k clusters. As this probability is unequivocally related to the distance of each series from the cluster centers, there are no degrees of freedom in determining the membership matrix. From the Boosting approach we take the idea of weighting each series according to some measure of badness of fit, in order to define an unsupervised learning process based on a weighted resampling procedure. Our idea is to adapt the boosting philosophy to unsupervised learning problems, especially non-hierarchical cluster analysis. In this case there is no target variable, but since the goal is to assign each instance (i.e. a series) of a data set to a cluster, we have a target instance: the representative instance of a given cluster (i.e. its center) can be taken as the target instance, a loss function to be minimized as a synthetic index of global performance, and the probability that each series belongs to a given cluster as the individual contribution of that instance to the overall solution. In contrast to the boosting approach, the higher the probability that a given series is a member of a given cluster, the higher the weight of that instance in the resampling process. As a learner we use a P-spline smoother. To define the probability that each series belongs to a given cluster we use the PD clustering approach. This allows us to define a suitable loss function and, at the same time, to propose a fuzzy clustering procedure that does not depend on the definition of a fuzzifier parameter. The global performance of the proposed method is investigated in three experiments (one on simulated data and two on data sets known in the literature), evaluated using a fuzzy variant of the Rand Index. Chapter 6 concludes the thesis.
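    The PD membership rule used above, in which memberships are fully determined by the distances to the cluster centers, can be sketched as follows. The function name and the distance-matrix layout are illustrative assumptions, not the thesis's code:

```python
import numpy as np

def pd_memberships(D):
    """Probabilistic Distance (PD) clustering memberships.

    D: (n_series, k) array of distances from each series to each cluster
    center. PD clustering imposes p_ik * d_ik = constant over k, which
    makes the memberships proportional to the inverse distances -- so
    no fuzzifier parameter is needed.
    """
    inv = 1.0 / np.maximum(D, 1e-12)          # guard against zero distances
    return inv / inv.sum(axis=1, keepdims=True)
```

    For instance, a series at distances 1 and 3 from two centers gets memberships 0.75 and 0.25: the membership matrix has no free parameters once the distances are fixed.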

    Holistic analysis of the life course: Methodological challenges and new perspectives

    We survey state-of-the-art approaches to studying trajectories in their entirety, adopting a holistic perspective, and discuss their strengths and weaknesses. We begin by considering sequence analysis (SA), one of the most established holistic approaches. We discuss the inherent problems arising in SA, particularly in the study of the relationship between trajectories and covariates. We describe some recent developments combining SA and Event History Analysis, and illustrate how weakening the holistic perspective—focusing on sub-trajectories—might result in a more flexible analysis of life courses. We then move to some model-based approaches (included in the broad classes of multistate and mixture latent Markov models) that further weaken the holistic perspective, assuming that the difficult task of predicting and explaining trajectories can be simplified by focusing on the collection of observed transitions. Our goal is twofold. On one hand, we aim to provide social scientists with indications for informed methodological choices and to emphasize issues that require consideration for proper application of the described approaches. On the other hand, by identifying relevant and open methodological challenges, we highlight and encourage promising directions for future research.

    Recognizing Handwriting Styles in a Historical Scanned Document Using Unsupervised Fuzzy Clustering

    The forensic attribution of the handwriting in a digitized document to multiple scribes is a challenging problem of high dimensionality. Unique handwriting styles may differ in a blend of several factors, including character size, stroke width, loops, ductus, slant angles, and cursive ligatures. Previous work on labeled data with Hidden Markov models, support vector machines, and semi-supervised recurrent neural networks has achieved moderate to high success. In this study, we successfully detect hand shifts in a historical manuscript through fuzzy soft clustering in combination with linear principal component analysis. This advance demonstrates the successful deployment of unsupervised methods for writer attribution of historical documents and forensic document analysis.
    Comment: 26 pages in total, 5 figures and 2 tables
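    The fuzzy soft clustering step can be sketched with a minimal fuzzy c-means in NumPy. This is one standard way to do soft clustering, assumed here for illustration; the study's exact pipeline, including its PCA preprocessing of the handwriting features, is not reproduced:

```python
import numpy as np

def fuzzy_cmeans(X, k, m=2.0, iters=100, seed=0):
    """Minimal fuzzy c-means: soft memberships U (n, k) and centers C (k, d).

    m is the fuzzifier; m=2 is the common default. An illustrative sketch,
    not the paper's implementation.
    """
    rng = np.random.default_rng(seed)
    U = rng.dirichlet(np.ones(k), size=len(X))       # random soft memberships
    for _ in range(iters):
        W = U ** m
        C = (W.T @ X) / W.sum(axis=0)[:, None]       # membership-weighted centers
        d = np.linalg.norm(X[:, None, :] - C[None], axis=2)
        d = np.maximum(d, 1e-12)                     # avoid division by zero
        U = 1.0 / d ** (2.0 / (m - 1.0))             # standard FCM update
        U /= U.sum(axis=1, keepdims=True)
    return U, C
```

    Applied to PCA-reduced stroke features, the rows of U give each sample's graded membership in each candidate hand, and a change of dominant membership along the manuscript suggests a hand shift.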

    Scoring and assessment in medical VR training simulators with dynamic time series classification

    This is the author accepted manuscript; the final version is available from Elsevier via the DOI in this record.
    This research proposes and evaluates scoring and assessment methods for Virtual Reality (VR) training simulators. VR simulators capture detailed n-dimensional human motion data which is useful for performance analysis. Custom-made medical haptic VR training simulators were developed and used to record data from 271 trainees of multiple clinical experience levels. DTW Multivariate Prototyping (DTW-MP) is proposed. VR data was classified as Novice, Intermediate or Expert. The accuracies of the algorithms applied for time-series classification were: dynamic time warping 1-nearest neighbor (DTW-1NN) 60%, nearest-centroid SoftDTW classification 77.5%, and the deep learning models ResNet 85%, FCN 75%, CNN 72.5%, and MCDCNN 28.5%. Expert VR data recordings can be used for guidance of novices. Assessment feedback can help trainees to improve skills and consistency. Motion analysis can identify different techniques used by individuals. Mistakes can be detected dynamically in real time, raising alarms to prevent injuries.
    Funders: Royal Academy of Engineering (RAEng); University of Exeter; University of Technology Sydney; Bournemouth University
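    The DTW-1NN baseline scored above can be sketched as follows. The helper names, the toy motion traces, and the labels are hypothetical illustrations, not the study's data or code:

```python
import numpy as np

def dtw(x, y):
    """Classic dynamic-time-warping distance between two 1-D sequences."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)   # accumulated-cost matrix
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = abs(x[i - 1] - y[j - 1])
            D[i, j] = c + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def dtw_1nn(train, labels, query):
    """Label a query trace by its nearest training trace under DTW."""
    dists = [dtw(t, query) for t in train]
    return labels[int(np.argmin(dists))]
```

    Because DTW aligns sequences elastically, traces performed at different speeds can still be matched to the closest expert or novice prototype.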

    Vol. 15, No. 1 (Full Issue)


    Hierarchical Bayesian Fuzzy Clustering Approach for High Dimensional Linear Time-Series

    This paper develops a computational approach to improve fuzzy clustering and forecasting performance when dealing with endogeneity issues and misspecified dynamics in high-dimensional dynamic data. Hierarchical Bayesian methods are used to structure linear time variations, reduce dimensionality, and compute a distance function capturing the most probable set of clusters among univariate and multivariate time series. Nonlinearities involved in the procedure appear as permanent shifts and are replaced by coefficient changes. Monte Carlo implementations are also addressed to compute exact posterior probabilities for each chosen cluster and thus minimize the increasing probability of outliers that plagues traditional time-series clustering techniques. An empirical example highlights the strengths and limitations of the estimation procedure. Comparisons with related work are also discussed.

    Vol. 13, No. 1 (Full Issue)
