58 research outputs found

    Cross-product Penalized Component Analysis (XCAN)

    Matrix factorization methods are extensively employed to understand complex data. In this paper, we introduce the cross-product penalized component analysis (XCAN), a sparse matrix factorization based on the optimization of a loss function that allows a trade-off between variance maximization and structural preservation. The approach is based on previous developments, notably (i) the Sparse Principal Component Analysis (SPCA) framework based on the LASSO, (ii) extensions of SPCA that constrain both modes of the factorization, like co-clustering or the Penalized Matrix Decomposition (PMD), and (iii) the Group-wise Principal Component Analysis (GPCA) method. The result is a flexible modeling approach that can be used for data exploration in a large variety of problems. We demonstrate its use with applications from different disciplines.

    Optimal Price Regulation for Natural and Legal Monopolies

    Optimal price regulation for natural and legal monopolies is an impossible task. The still difficult task of good price regulation can be systematized by considering separately the price level and the price structure of the regulated firm. Various methods of price level and price structure regulation are evaluated and then considered for the regulation of electricity transmission, both in the context of an independent transmission company and of vertical integration between transmission and most of the generation capacity. The regulatory approach suggested uses price caps defined on two-part tariffs. In this way, flexibility for short-term capacity utilization can be combined with incentives for investment in new transmission capacity.

    Forward Selection Component Analysis: Algorithms and Applications


    Learning nonparametric DAGs with incremental information via high-order HSIC

    Score-based methods for learning Bayesian networks (BNs) aim to maximize a global score function. However, if local variables have direct and indirect dependence simultaneously, the global optimization of the score function misses edges between variables with an indirect dependence relationship, whose scores are smaller than those of variables with a direct dependence relationship. In this paper, we present an identifiability condition based on a determined subset of parents to identify the underlying DAG. Using this identifiability condition, we develop a two-phase algorithm, the optimal-tuning (OT) algorithm, to locally amend the global optimization. In the optimal phase, an optimization problem based on the first-order Hilbert-Schmidt independence criterion (HSIC) gives an estimated skeleton as the initial determined parents subset. In the tuning phase, the skeleton is locally tuned by deletion, addition and DAG-formalization strategies using the theoretically proved incremental properties of high-order HSIC. Numerical experiments on different synthetic and real-world datasets show that the OT algorithm outperforms existing methods. In particular, in the Sigmoid Mix model with graph size d = 40, the structure intervention distance (SID) of the OT algorithm is 329.7 smaller than the one obtained by CAM, which indicates that the graph estimated by the OT algorithm misses fewer edges compared with CAM. Source code of the OT algorithm is available at https://github.com/YafeiannWang/optimal-tune-algorithm
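For readers unfamiliar with the independence measure named in this abstract, the following is a minimal sketch of the standard biased empirical HSIC estimator, tr(KHLH)/(n-1)^2, with Gaussian kernels. It is not the OT algorithm and not the authors' code; the function names `rbf_gram` and `hsic`, the kernel choice and the bandwidth `sigma` are assumptions made for this illustration.

```python
import numpy as np


def rbf_gram(x, sigma=1.0):
    """Gaussian (RBF) Gram matrix of the samples in x (one sample per row)."""
    x = np.asarray(x, dtype=float)
    x = x.reshape(len(x), -1)  # treat 1-D input as n samples of dimension 1
    sq = np.sum(x ** 2, axis=1)
    # Squared Euclidean distances, clipped at zero against round-off.
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (x @ x.T), 0.0)
    return np.exp(-d2 / (2.0 * sigma ** 2))


def hsic(x, y, sigma=1.0):
    """Biased empirical HSIC: trace(K H L H) / (n - 1)^2.

    Assumption: the same bandwidth sigma is used for both variables.
    """
    n = len(x)
    K = rbf_gram(x, sigma)
    L = rbf_gram(y, sigma)
    H = np.eye(n) - np.ones((n, n)) / n  # centering matrix
    return float(np.trace(K @ H @ L @ H)) / (n - 1) ** 2


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=200)
    y_dependent = x ** 2 + 0.1 * rng.normal(size=200)  # nonlinear dependence
    y_independent = rng.normal(size=200)
    print("HSIC(x, dependent y)  :", hsic(x, y_dependent))
    print("HSIC(x, independent y):", hsic(x, y_independent))
```

A larger value suggests stronger statistical dependence. In practice the bandwidth is often chosen from the data (for example by a median-distance heuristic) rather than fixed at 1.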

    Improved fMRI-based Pain Prediction using Bayesian Group-wise Functional Registration

    In recent years, neuroimaging has undergone a paradigm shift, moving away from the traditional brain mapping approach toward developing integrated, multivariate brain models that can predict categories of mental events. However, large interindividual differences in brain anatomy and functional localization after standard anatomical alignment remain a major limitation in performing this analysis, as they lead to feature misalignment across subjects in subsequent predictive models.

    Stratified Staged Trees: Modelling, Software and Applications

    The thesis is focused on Probabilistic Graphical Models (PGMs), which are a rich framework for encoding probability distributions over complex domains. In particular, joint multivariate distributions over large numbers of random variables that interact with each other can be investigated through PGMs, and conditional independence statements can be succinctly represented with graphical representations. These representations sit at the intersection of statistics and computer science, relying on concepts mainly from probability theory, graph algorithms and machine learning. They are applied in a wide variety of fields, such as medical diagnosis, image understanding, speech recognition, natural language processing, and many more. Over the years, theory and methodology have developed and been extended in a multitude of directions. In particular, in this thesis different aspects of new classes of PGMs called Staged Trees and Chain Event Graphs (CEGs) are studied. In some sense, Staged Trees are a generalization of Bayesian Networks (BNs). Indeed, BNs provide a transparent graphical tool to define a complex process in terms of conditional independence structures. Despite their strengths in reducing the dimensionality of the joint probability distributions of the statistical model and in providing a transparent framework for causal inference, BNs are not the optimal graphical model in all situations. The biggest problems with their usage mainly occur when the event space is not a simple product of the sample spaces of the random variables of interest, and when conditional independence statements are true only under certain values of variables. This happens when there are context-specific conditional independence structures. Some extensions to the BN framework have been proposed to handle these issues: context-specific BNs, Bayesian Multinets, or Similarity Networks (Geiger and Heckerman, 1996).
These adopt a hypothesis variable to encode the context-specific statements over a particular set of random variables. For each value taken by the hypothesis variable, the graphical modeller has to construct a particular BN model called a local network. The collection of these local networks constitutes a Bayesian Multinet. Other proposals include Probabilistic Decision Graphs, among others. It has been shown that Chain Event Graph (CEG) models encompass all discrete BN models and their discrete variants described above as a special subclass, and that they are also richer than Probabilistic Decision Graphs, whose semantics are actually somewhat distinct. Unlike most of their competitors, CEGs can capture all conditional independences, including context-specific ones, in a unique graph, obtained by a coalescence over the vertices of an appropriately constructed probability tree, called a Staged Tree. CEGs have been developed for categorical variables and have been used for cohort studies, causal analysis and case-control studies. The user's toolbox to efficiently and effectively perform uncertainty reasoning with CEGs further includes methods for inference and probability propagation, the exploration of equivalence classes and robustness studies. The main contributions of this thesis to the literature on Staged Trees are related to Stratified Staged Trees, with a keen eye on applications. A few observations are made on non-Stratified Staged Trees in the last part of the thesis. A core output of the thesis is an R software package which efficiently implements a host of functions for learning and estimating Staged Trees from data, relying on likelihood principles. Structural learning algorithms are also developed, based on the distance or divergence between pairs of categorical probability distributions and on the clustering of probability distributions into a fixed number of stages for each stratum of the tree.
A new class of Directed Acyclic Graph is also introduced, named the Asymmetric-labeled DAG (ALDAG), which gives a BN representation of a given Staged Tree. The ALDAG is a minimal DAG such that the statistical model embedded in the Staged Tree is contained in the one associated with the ALDAG. This is possible thanks to the use of colored edges, where each color indicates a different type of conditional dependence: total, context-specific, partial or local. Staged Trees are also adopted in this thesis as a statistical tool for classification purposes. Staged Tree Classifiers are introduced, which exhibit predictive accuracy comparable to that of state-of-the-art machine learning algorithms such as neural networks and random forests. Finally, algorithms to obtain an ordering of variables for the construction of the Staged Tree are designed.

    Time and frequency domain statistical methods for high-frequency time series

    Advances in sensor technology enable environmental monitoring programmes to record and store measurements at high temporal resolution over long time periods. These large volumes of high-frequency data promote an increasingly comprehensive picture of many environmental processes that would not have been accessible in the past with monthly, fortnightly or even daily sampling. However, benefiting from these increasing amounts of high-frequency data presents various challenges in terms of data processing and statistical modeling using standard methods and software tools. These challenges are attributed to the large volumes of data, the persistent and long-memory serial correlation in the data, the signal-to-noise ratio, and the complex and time-varying dynamics and inter-relationships between the different drivers of the process at different timescales. This thesis aims at using and developing a variety of statistical methods in both the time and frequency domains to effectively explore and analyze high-frequency time series data as well as to reduce their dimensionality, with specific application to a 3-year hydrological time series. Firstly, the thesis investigates the statistical challenges of exploring, modeling and analyzing these large volumes of high-frequency time series. Thereafter, it uses and develops more advanced statistical techniques to: (i) better visualize and identify the different modes of variability and common patterns in such data, and (ii) provide a more adequate dimension reduction representation of the data, which takes into account the persistent serial dependence structure and non-stationarity in the series. Throughout the thesis, a 15-minute resolution time series of excess partial pressure of carbon dioxide (EpCO2) obtained for a small catchment in the River Dee in Scotland has been used as an illustrative data set.
Understanding the bio-geochemical and hydrological drivers of EpCO2 is very important to the assessment of the global carbon budget. Specifically, Chapters 1 and 2 present a range of advanced statistical approaches in both the time and frequency domains, including wavelet analysis and additive models, to visualize and explore temporal variations and relationships between variables for the River Dee data across the different timescales, and to investigate the statistical challenges posed by such data. In Chapter 3, a functional data analysis approach is employed to identify the common daily patterns of EpCO2 by means of functional principal component analysis and functional cluster analysis. The techniques used in this chapter assume independent functional data. However, in numerous applications, functional observations are serially correlated over time, e.g. where each curve represents a segment of the whole time interval. In this situation, ignoring the temporal dependence may result in an inappropriate dimension reduction of the data and inefficient inference procedures. Subsequently, the dynamic functional principal components, recently developed by Hörmann et al. (2014), are considered in Chapter 4 to account for the temporal correlation using a frequency domain approach. A specific contribution of this thesis is the extension of the methodology of dynamic functional principal components to temporally dependent functional data estimated using any type of basis functions, not only orthogonal basis functions. Based on the scores of the proposed general version of dynamic functional principal components, a novel clustering approach is proposed and used to cluster the daily curves of EpCO2, taking into account the dependence structure in the data. The construction of the dynamic functional principal components depends on the assumption of second-order stationarity, which is not a realistic assumption in most environmental applications.
Therefore, in Chapter 5, a second specific contribution of this thesis is the development of time-varying dynamic functional principal components, which allow the components to vary smoothly over time. The performance of these smooth dynamic functional principal components is evaluated empirically using the EpCO2 data and using a simulation study. The simulation study compares the performance of smooth and original dynamic functional principal components under both stationary and non-stationary conditions. The smooth dynamic functional principal components have shown considerable improvement in representing non-stationary dependent functional data in smaller dimensions. Using a bootstrap inference procedure, the smooth dynamic functional principal components have been subsequently employed to investigate whether or not the spectral density and covariance structure of the functional time series under study change over time. To account for the possible changes in the covariance structure, a clustering approach based on the proposed smooth dynamic functional principal components is suggested and the results of its application are discussed. Finally, Chapter 6 provides a summary of the work presented within this thesis, discusses the limitations and implications, and proposes areas for future research.