1,280 research outputs found

    Exploratory Analysis of Functional Data via Clustering and Optimal Segmentation

    In this paper we propose an exploratory analysis algorithm for functional data. The method partitions a set of functions into K clusters and represents each cluster by a simple prototype (e.g., piecewise constant). The total number of segments in the prototypes, P, is chosen by the user and optimally distributed among the clusters via two dynamic programming algorithms. The practical relevance of the method is shown on two real-world datasets.
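As a rough illustration of the kind of dynamic program involved, the sketch below finds the best piecewise-constant approximation of a single sampled function with a fixed number of segments, minimising squared error. It is not the paper's algorithm: the clustering step and the distribution of segments across clusters are left out, and the names `optimal_segmentation`, `values` and `num_segments` are hypothetical.

```python
from typing import List, Sequence, Tuple


def optimal_segmentation(
    values: Sequence[float], num_segments: int
) -> Tuple[float, List[Tuple[int, int]]]:
    """Split `values` into `num_segments` contiguous pieces, each replaced by
    its mean, so that the total squared error is minimal.

    Returns (total_error, segments), where each segment is a half-open index
    range (start, end).
    """
    n = len(values)
    if not 1 <= num_segments <= n:
        raise ValueError("num_segments must be between 1 and len(values)")

    # Prefix sums of values and squared values give O(1) segment costs.
    prefix = [0.0] * (n + 1)
    prefix_sq = [0.0] * (n + 1)
    for idx, v in enumerate(values):
        prefix[idx + 1] = prefix[idx] + v
        prefix_sq[idx + 1] = prefix_sq[idx] + v * v

    def cost(start: int, end: int) -> float:
        """Squared error of values[start:end] around its own mean."""
        total = prefix[end] - prefix[start]
        return (prefix_sq[end] - prefix_sq[start]) - total * total / (end - start)

    inf = float("inf")
    # best[p][j]: minimal error when the first j points use exactly p segments.
    best = [[inf] * (n + 1) for _ in range(num_segments + 1)]
    back = [[0] * (n + 1) for _ in range(num_segments + 1)]
    best[0][0] = 0.0
    for p in range(1, num_segments + 1):
        for j in range(p, n + 1):
            for i in range(p - 1, j):
                candidate = best[p - 1][i] + cost(i, j)
                if candidate < best[p][j]:
                    best[p][j] = candidate
                    back[p][j] = i

    # Walk the back-pointers from the end to recover the segment boundaries.
    segments: List[Tuple[int, int]] = []
    end = n
    for p in range(num_segments, 0, -1):
        start = back[p][end]
        segments.append((start, end))
        end = start
    segments.reverse()
    return best[num_segments][n], segments
```

For example, `optimal_segmentation([1, 1, 1, 5, 5, 5, 9, 9], 3)` recovers the three constant runs. The triple loop costs O(P n^2) time, which is acceptable for short curves but would need refinement for long ones.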

    Model-Based Clustering and Classification of Functional Data

    The problem of complex data analysis is a central topic of modern statistical science and learning systems and is becoming of broader interest with the increasing prevalence of high-dimensional data. The challenge is to develop statistical models and autonomous algorithms that are able to acquire knowledge from raw data, either for exploratory analysis, which can be achieved through clustering techniques, or for predicting future data via classification (i.e., discriminant analysis) techniques. Latent data models, including mixture model-based approaches, are among the most popular and successful approaches in both the unsupervised context (i.e., clustering) and the supervised one (i.e., classification or discrimination). Although traditionally tools of multivariate analysis, they are growing in popularity in the framework of functional data analysis (FDA). FDA is the data analysis paradigm in which the individual data units are functions (e.g., curves, surfaces) rather than simple vectors. In many areas of application, the analyzed data are indeed often available in the form of discretized values of functions or curves (e.g., time series, waveforms) and surfaces (e.g., 2D images, spatio-temporal data). This functional aspect of the data introduces difficulties beyond those of classical multivariate (non-functional) data analysis. We review and present approaches for model-based clustering and classification of functional data. We derive well-established statistical models along with efficient algorithmic tools to address problems regarding the clustering and the classification of these high-dimensional data, including their heterogeneity, missing information, and dynamical hidden structure. The presented models and algorithms are illustrated on real-world functional data analysis problems from several application areas.

    Optimisation based approaches for machine learning

    Machine learning has attracted a lot of attention in recent years and has become an integral part of many commercial and research projects, with a wide range of applications. With current developments in technology, more data is generated and stored than ever before. Identifying patterns, trends and anomalies in these datasets and summarising them with simple quantitative models is a vital task. This thesis focuses on the development of machine learning algorithms based on mathematical programming for datasets that are relatively small in size. The first topic of this doctoral thesis is piecewise regression, where a dataset is partitioned into multiple regions and a regression model is fitted to each one. This work uses an existing algorithm from the literature and extends the mathematical formulation to include information criteria. The inclusion of such criteria aims to address overfitting, which is a common problem in supervised learning tasks, by finding a balance between predictive performance and model complexity. The improvement in overall performance is demonstrated by testing and comparing the proposed method with various algorithms from the literature on various regression datasets. Extending the topic of regression, a decision tree regressor is also proposed. Decision trees are powerful and easy-to-understand structures that can be used for both regression and classification. In this work, an optimisation model is used for the binary splitting of nodes. A statistical test is introduced to check whether the partitioning of nodes is statistically meaningful and, as a result, to control the tree generation process. Additionally, a novel mathematical formulation is proposed to perform feature selection and ultimately identify the appropriate variable to be selected for the splitting of nodes.
The performance of the proposed algorithm is once again compared with a number of algorithms from the literature, and it is shown that the introduction of the variable selection model is useful for reducing the training time of the algorithm without major sacrifices in performance. Lastly, a novel decision tree classifier is proposed. This algorithm is based on a mathematical formulation that identifies the optimal splitting variable and break value, applies a linear transformation to the data and then assigns them to a class while minimising the number of misclassified samples. The introduction of the linear transformation step reduces the dimensionality of the examined dataset down to a single variable, aiding the classification accuracy of the algorithm for more complex datasets. Popular classifiers from the literature have been used to compare the accuracy of the proposed algorithm on both synthetic and publicly available classification datasets.
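The trade-off between fit and complexity that an information criterion encodes can be sketched on a toy case. The example below is not the thesis's mathematical programming formulation: it swaps in an exhaustive search over a single break point with ordinary least squares on each side, and uses the Bayesian information criterion in its common Gaussian form, n ln(RSS/n) + k ln(n). The helper names `fit_line`, `bic` and `choose_break`, and the parameter counts (2 for one line, 5 for two lines plus a break location), are assumptions made for illustration.

```python
import math
from typing import Optional, Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float, float]:
    """Ordinary least squares for y = a + b*x; returns (a, b, rss)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    intercept = mean_y - slope * mean_x
    rss = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
    return intercept, slope, rss


def bic(rss: float, n: int, num_params: int) -> float:
    """Gaussian BIC up to an additive constant; rss is floored to avoid log(0)."""
    return n * math.log(max(rss, 1e-12) / n) + num_params * math.log(n)


def choose_break(
    xs: Sequence[float], ys: Sequence[float], min_points: int = 3
) -> Optional[int]:
    """Return the index where the second segment starts, or None when a single
    line has the lower BIC. Assumes xs is sorted in increasing order."""
    n = len(xs)
    best_break: Optional[int] = None
    best_score = bic(fit_line(xs, ys)[2], n, 2)  # one line: intercept + slope
    for k in range(min_points, n - min_points + 1):
        rss = fit_line(xs[:k], ys[:k])[2] + fit_line(xs[k:], ys[k:])[2]
        score = bic(rss, n, 5)  # two lines (4 coefficients) + 1 break location
        if score < best_score:
            best_score, best_break = score, k
    return best_break
```

On data that follow one line, the extra penalty of 3 ln(n) for the two-segment model outweighs the small reduction in residual error, so `choose_break` returns `None`; on data with a genuine jump, the drop in residual error dominates and a break is selected.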

    Heterogeneous Change Point Inference

    We propose HSMUCE (heterogeneous simultaneous multiscale change-point estimator) for the detection of multiple change-points of the signal in a heterogeneous Gaussian regression model. A piecewise constant function is estimated by minimizing the number of change-points over the acceptance region of a multiscale test which locally adapts to changes in the variance. The multiscale test is a combination of local likelihood ratio tests which are properly calibrated by scale-dependent critical values in order to keep a global nominal level α, even for finite samples. We show that HSMUCE controls the error of over- and underestimation of the number of change-points. To this end, new deviation bounds for F-type statistics are derived. Moreover, we obtain confidence sets for the whole signal. All results are non-asymptotic and uniform over a large class of heterogeneous change-point models. HSMUCE is fast to compute, achieves the optimal detection rate and estimates the number of change-points at almost optimal accuracy for vanishing signals, while still being robust. We compare HSMUCE with several state-of-the-art methods in simulations and analyse current recordings of a transmembrane protein in the bacterial outer membrane with pronounced heterogeneity for its states. An R package is available online.

    Approximate Dynamic Programming: Health Care Applications

    This dissertation considers different approximate solutions to Markov decision problems formulated within the dynamic programming framework in two health care applications. Dynamic formulations are appropriate for problems which require optimization over time and a variety of settings for different scenarios and policies. This is the situation in many health care applications, for which, because of the curses of dimensionality, exact solutions do not always exist. Approximate analyses that find near-optimal solutions are therefore motivated. To check the quality of the approximation, additional evidence such as bounds, consistency analysis, or asymptotic behavior evaluation is required. Emergency vehicle management and dose-finding clinical trials are the two health care applications considered here in order to investigate dynamic formulations, approximate solutions, and solution quality assessments. Dynamic programming formulations are presented for real-time ambulance dispatching and relocation policies, response-adaptive dose-finding clinical trials, and optimal stopping of adaptive clinical trials. Approximate solutions are derived by multiple methods such as basis function regression, a one-step look-ahead policy, a simulation-based gridding algorithm, and diffusion approximation. Finally, some bounds to assess the optimality gap and a proof of consistency for approximate solutions are presented to ensure the quality of the approximation.