13 research outputs found

    Time Series Cluster Kernel for Learning Similarities between Multivariate Time Series with Missing Data

    Similarity-based approaches represent a promising direction for time series analysis. However, many such methods rely on parameter tuning, and some have shortcomings if the time series are multivariate (MTS), due to dependencies between attributes, or if the time series contain missing data. In this paper, we address these challenges within the powerful context of kernel methods by proposing the robust time series cluster kernel (TCK). The approach leverages the missing-data handling properties of Gaussian mixture models (GMMs) augmented with informative prior distributions. An ensemble learning approach is exploited to ensure robustness to parameters by combining the clustering results of many GMMs to form the final kernel. We evaluate the TCK on synthetic and real data and compare it to other state-of-the-art techniques. The experimental results demonstrate that the TCK is robust to parameter choices, provides competitive results for MTS without missing data, and gives outstanding results for missing data. (Comment: 23 pages, 6 figures)
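    The ensemble idea behind the TCK can be sketched in a few lines: fit many Gaussian mixtures with randomized settings and let co-assignment votes form the kernel. The snippet below is a minimal Python illustration, assuming complete (or pre-imputed) data flattened to fixed-length vectors; the full TCK instead handles missing values directly through GMMs with informative priors, and all function and parameter names here are ours, not the authors'.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def ensemble_cluster_kernel(X, n_runs=30, max_components=10, random_state=0):
    """Build a similarity matrix by combining the cluster assignments of many
    Gaussian mixture models fitted with randomized hyperparameters.

    X : array of shape (n_samples, n_features); multivariate time series are
        assumed to be flattened (and imputed, if needed) into fixed-length vectors.
    """
    rng = np.random.default_rng(random_state)
    n = X.shape[0]
    K = np.zeros((n, n))
    for _ in range(n_runs):
        n_comp = int(rng.integers(2, max_components + 1))
        gmm = GaussianMixture(n_components=n_comp,
                              covariance_type="diag",
                              random_state=int(rng.integers(1_000_000)))
        labels = gmm.fit_predict(X)
        # Two series that land in the same mixture component vote for similarity.
        K += (labels[:, None] == labels[None, :]).astype(float)
    return K / n_runs
```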

    The Matrix Pencil for Power System Modal Extraction

    This work introduces the matrix pencil modal extraction method through several illustrative examples. The method estimates the eigenvalues of reduced-order models of large nonlinear systems from their dynamic responses.
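    For readers unfamiliar with the technique, the following sketch shows the standard SVD-based matrix pencil estimator applied to a single sampled response. It is a generic textbook formulation, not the exact implementation used in this work, and the function name and defaults are ours.

```python
import numpy as np

def matrix_pencil_modes(y, dt, model_order, pencil=None):
    """Estimate the dominant modes (eigenvalues) of a sampled response y
    using the matrix pencil method.

    y           : 1-D array of equally spaced samples of the response
    dt          : sampling interval in seconds
    model_order : assumed number of modes in the reduced-order model
    """
    N = len(y)
    L = pencil if pencil is not None else N // 2            # pencil parameter
    # Hankel data matrix of size (N - L) x (L + 1)
    Y = np.array([y[i:i + L + 1] for i in range(N - L)])
    # Noise-filtered, rank-truncated representation via the SVD
    U, s, Vh = np.linalg.svd(Y, full_matrices=False)
    V = Vh[:model_order, :]                                  # dominant right singular vectors
    V1, V2 = V[:, :-1], V[:, 1:]                             # drop last / first column
    # Discrete-time poles are the eigenvalues of pinv(V1') V2' (M x M problem)
    z = np.linalg.eigvals(np.linalg.pinv(V1.T) @ V2.T)
    return np.log(z) / dt                                    # continuous-time eigenvalues

# Example: a damped 1 Hz oscillation should yield s approximately -0.2 +/- j*2*pi
t = np.arange(0, 5, 0.01)
y = np.exp(-0.2 * t) * np.cos(2 * np.pi * t)
print(matrix_pencil_modes(y, dt=0.01, model_order=2))
```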

    Definition of MV Load Diagrams via Weighted Evidence Accumulation Clustering using Subsampling

    A definition of medium voltage (MV) load diagrams was developed, based on the knowledge discovery in databases process. Clustering techniques were used to support agents in the electric power retail markets in obtaining specific knowledge of their customers' consumption habits. Each customer class resulting from the clustering operation is represented by its load diagram. The Two-step clustering algorithm and the WEACS approach based on evidence accumulation (EAC) were applied to electricity consumption data from a utility client's database in order to form the customer classes and to find a set of representative consumption patterns. The WEACS approach is a clustering ensemble combination approach that uses subsampling and weights the partitions differently in the co-association matrix. As a complementary step to the WEACS approach, all the final data partitions produced by the different variations of the method are combined, and the Ward Link algorithm is used to obtain the final data partition. Experimental results showed that the WEACS approach led to better accuracy than many other clustering approaches. In this paper, the WEACS approach separates the customer population better than the Two-step clustering algorithm.
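    A minimal sketch of the evidence-accumulation idea with subsampling and per-partition weights is given below. It uses k-means base partitions, a silhouette-based weight, and a Ward-link combination step; the actual WEACS weighting schemes and base clusterings differ, so treat the names, weights, and defaults as illustrative assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def weighted_eac(X, n_partitions=50, subsample=0.8, k_range=(2, 10),
                 n_final_clusters=5, random_state=0):
    """Weighted evidence accumulation over subsampled k-means partitions,
    followed by hierarchical clustering of the co-association matrix."""
    rng = np.random.default_rng(random_state)
    n = X.shape[0]
    votes = np.zeros((n, n))    # weighted co-occurrence evidence
    counts = np.zeros((n, n))   # how often each pair appeared in the same subsample
    for _ in range(n_partitions):
        idx = rng.choice(n, size=int(subsample * n), replace=False)
        k = int(rng.integers(k_range[0], k_range[1] + 1))
        labels = KMeans(n_clusters=k, n_init=5,
                        random_state=int(rng.integers(1_000_000))).fit_predict(X[idx])
        # One illustrative partition weight: silhouette score rescaled to [0, 1].
        w = 0.5 * (silhouette_score(X[idx], labels) + 1.0)
        same = (labels[:, None] == labels[None, :]).astype(float)
        votes[np.ix_(idx, idx)] += w * same
        counts[np.ix_(idx, idx)] += 1.0
    coassoc = np.divide(votes, counts, out=np.zeros_like(votes), where=counts > 0)
    dist = 1.0 - coassoc
    np.fill_diagonal(dist, 0.0)
    # Ward-link hierarchical clustering on the co-association distances,
    # mirroring the final combination step described in the abstract.
    Z = linkage(squareform(dist, checks=False), method="ward")
    return fcluster(Z, t=n_final_clusters, criterion="maxclust")
```

    With daily load curves stacked as the rows of X, a call like weighted_eac(X, n_final_clusters=6) returns one class label per consumer, which can then be summarized into a representative load diagram per class.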

    Determination of electricity consumers’ load profiles via weighted evidence accumulation clustering using subsampling

    With the electricity market liberalization, the distribution and retail companies are looking for better market strategies based on adequate information about the consumption patterns of their electricity consumers. A fair insight into the consumers' behavior will permit the definition of specific contract aspects based on the different consumption patterns. In order to form the different consumer classes and find a set of representative consumption patterns, we use electricity consumption data from a utility client's database and two approaches: the Two-step clustering algorithm and the WEACS approach based on evidence accumulation (EAC) for combining partitions in a clustering ensemble. While EAC uses a voting mechanism to produce a co-association matrix based on the pairwise associations obtained from N partitions, where each partition has equal weight in the combination process, the WEACS approach uses subsampling and weights the partitions differently. As a complementary step to the WEACS approach, we combine the partitions obtained in the WEACS approach with the ALL clustering ensemble construction method and use the Ward Link algorithm to obtain the final data partition. The characterization of the obtained consumer clusters was performed using the C5.0 classification algorithm. Experimental results showed that the WEACS approach leads to better results than many other clustering approaches.
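    The cluster characterization step can be approximated with any rule-producing classifier. Since C5.0 is not available in scikit-learn, the sketch below substitutes a shallow CART decision tree to extract human-readable rules describing each consumer class; the substitution and all names are ours, not the authors'.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

def characterize_clusters(features, cluster_labels, feature_names, max_depth=3):
    """Fit a shallow decision tree that predicts the cluster label from the
    consumption features, then print the rules that describe each cluster."""
    tree = DecisionTreeClassifier(max_depth=max_depth, random_state=0)
    tree.fit(features, cluster_labels)
    print(export_text(tree, feature_names=list(feature_names)))
    return tree

# e.g. characterize_clusters(profiles, labels, ["peak_kW", "night_ratio", "weekend_ratio"])
# (hypothetical feature names for illustration only)
```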

    Phenotype clustering of breast epithelial cells in confocal images based on nuclear protein distribution analysis

    Background: The distribution of chromatin-associated proteins plays a key role in directing nuclear function. Previously, we developed an image-based method to quantify the nuclear distributions of proteins and showed that these distributions depended on the phenotype of human mammary epithelial cells. Here we describe a method that creates a hierarchical tree of the given cell phenotypes and calculates the statistical significance between them, based on clustering analysis of nuclear protein distributions. Results: Nuclear distributions of the nuclear mitotic apparatus protein were previously obtained for non-neoplastic S1 and malignant T4-2 human mammary epithelial cells cultured for up to 12 days. Cell phenotype was defined as S1 or T4-2 and the number of days in culture. A probabilistic ensemble approach was used to define a set of consensus clusters from the results of multiple traditional cluster analysis techniques applied to the nuclear distribution data. Cluster histograms were constructed to show how cells in any one phenotype were distributed across the consensus clusters. Grouping various phenotypes allowed us to build phenotype trees and calculate the statistical difference between each group. The results showed that non-neoplastic S1 cells could be distinguished from malignant T4-2 cells with 94.19 percent accuracy; that proliferating S1 cells could be distinguished from differentiated S1 cells with 92.86 percent accuracy; and that there was no significant difference between the various phenotypes of T4-2 cells corresponding to increasing tumor sizes. Conclusion: This work presents a cluster analysis method that can identify significant cell phenotypes, based on the nuclear distribution of specific proteins, with high accuracy.
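    The cluster-histogram and significance steps can be mimicked as follows: tabulate how each phenotype distributes over the consensus clusters and compare two phenotype groups with a chi-square test. This is only a plausible stand-in for the paper's statistical procedure, which the abstract does not fully specify; all names are ours.

```python
import numpy as np
from scipy.stats import chi2_contingency

def phenotype_cluster_histograms(phenotypes, clusters):
    """Contingency table of phenotype vs. consensus cluster, plus row-normalized
    histograms showing how each phenotype distributes across the clusters."""
    phenotypes, clusters = np.asarray(phenotypes), np.asarray(clusters)
    phen_ids, p_idx = np.unique(phenotypes, return_inverse=True)
    clus_ids, c_idx = np.unique(clusters, return_inverse=True)
    table = np.zeros((len(phen_ids), len(clus_ids)))
    np.add.at(table, (p_idx, c_idx), 1)
    hist = table / table.sum(axis=1, keepdims=True)
    return phen_ids, table, hist

def phenotype_difference(phenotypes, clusters, group_a, group_b):
    """Chi-square test of whether two phenotype groups occupy the consensus
    clusters differently (a stand-in for the paper's significance calculation)."""
    phen_ids, table, _ = phenotype_cluster_histograms(phenotypes, clusters)
    rows_a = table[np.isin(phen_ids, group_a)].sum(axis=0)
    rows_b = table[np.isin(phen_ids, group_b)].sum(axis=0)
    stacked = np.vstack([rows_a, rows_b])
    stacked = stacked[:, stacked.sum(axis=0) > 0]   # drop clusters empty in both groups
    _, p_value, _, _ = chi2_contingency(stacked)
    return p_value
```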

    Voting-Based Consensus of Data Partitions

    Over the past few years, there has been a renewed interest in the consensus problem for ensembles of partitions. Recent work is primarily motivated by the developments in the area of combining multiple supervised learners. Unlike the consensus of supervised classifications, the consensus of data partitions is a challenging problem due to the lack of globally defined cluster labels and to the inherent difficulty of data clustering as an unsupervised learning problem. Moreover, the true number of clusters may be unknown. A fundamental goal of consensus methods for partitions is to obtain an optimal summary of an ensemble and to discover a cluster structure with accuracy and robustness exceeding those of the individual ensemble partitions. The quality of the consensus partitions highly depends on the ensemble generation mechanism and on the suitability of the consensus method for combining the generated ensemble. Typically, consensus methods derive an ensemble representation that is used as the basis for extracting the consensus partition. Most ensemble representations circumvent the labeling problem. On the other hand, voting-based methods establish direct parallels with consensus methods for supervised classifications, by seeking an optimal relabeling of the ensemble partitions and deriving an ensemble representation consisting of a central aggregated partition. An important element of the voting-based aggregation problem is the pairwise relabeling of an ensemble partition with respect to a representative partition of the ensemble, which is referred to here as the voting problem. The voting problem is commonly formulated as a weighted bipartite matching problem. In this dissertation, a general theoretical framework for the voting problem as a multi-response regression problem is proposed. The problem is formulated as seeking to estimate the uncertainties associated with the assignments of the objects to the representative clusters, given their assignments to the clusters of an ensemble partition. A new voting scheme, referred to as cumulative voting, is derived as a special instance of the proposed regression formulation corresponding to fitting a linear model by least squares estimation. The proposed formulation reveals the close relationships between the underlying loss functions of the cumulative voting and bipartite matching schemes. A useful feature of the proposed framework is that it can be applied to model substantial variability between partitions, such as a variable number of clusters. A general aggregation algorithm with variants corresponding to cumulative voting and bipartite matching is applied and a simulation-based analysis is presented to compare the suitability of each scheme to different ensemble generation mechanisms. The bipartite matching is found to be more suitable than cumulative voting for a particular generation model, whereby each ensemble partition is generated as a noisy permutation of an underlying labeling, according to a probability of error. For ensembles with a variable number of clusters, it is proposed that the aggregated partition be viewed as an estimated distributional representation of the ensemble, on the basis of which, a criterion may be defined to seek an optimally compressed consensus partition. The properties and features of the proposed cumulative voting scheme are studied. In particular, the relationship between cumulative voting and the well-known co-association matrix is highlighted.
Furthermore, an adaptive aggregation algorithm that is suited for the cumulative voting scheme is proposed. The algorithm aims at selecting the initial reference partition and the aggregation sequence of the ensemble partitions such that the loss of mutual information associated with the aggregated partition is minimized. In order to subsequently extract the final consensus partition, an efficient agglomerative algorithm is developed. The algorithm merges the aggregated clusters such that the maximum amount of information is preserved. Furthermore, it allows the optimal number of consensus clusters to be estimated. An empirical study using several artificial and real-world datasets demonstrates that the proposed cumulative voting scheme leads to discovering substantially more accurate consensus partitions compared to bipartite matching, in the case of ensembles with a relatively large or a variable number of clusters. Compared to other recent consensus methods, the proposed method is found to be comparable with or better than the best performing methods. Moreover, accurate estimates of the true number of clusters are often achieved using cumulative voting, whereas consistently poor estimates are achieved based on bipartite matching. The empirical evidence demonstrates that the bipartite matching scheme is not suitable for these types of ensembles.
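    The two relabeling schemes discussed above can be sketched compactly: bipartite matching solves a maximum-overlap assignment between cluster labels, while cumulative voting spreads each ensemble cluster over the reference clusters in proportion to overlap (the least-squares solution) and averages the resulting votes over the ensemble. The Python code below is a simplified illustration with a fixed, caller-supplied reference partition; the dissertation's adaptive reference selection and information-theoretic extraction steps are omitted.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def relabel_bipartite(reference, partition):
    """Relabel `partition` so that its labels best agree with `reference` by
    maximum-overlap bipartite matching (the classical voting problem).
    Assumes both partitions use the same number of clusters."""
    reference, partition = np.asarray(reference), np.asarray(partition)
    ref_ids, part_ids = np.unique(reference), np.unique(partition)
    overlap = np.array([[np.sum((partition == p) & (reference == r))
                         for r in ref_ids] for p in part_ids])
    rows, cols = linear_sum_assignment(-overlap)          # maximize total overlap
    mapping = {part_ids[i]: ref_ids[j] for i, j in zip(rows, cols)}
    return np.array([mapping[p] for p in partition])

def cumulative_voting(reference, partitions):
    """Soft relabeling: each ensemble cluster distributes its members over the
    reference clusters in proportion to overlap (the least-squares solution),
    and the per-object vote vectors are averaged over the ensemble."""
    reference = np.asarray(reference)
    ref_ids = np.unique(reference)
    votes = np.zeros((len(reference), len(ref_ids)))
    for part in partitions:
        part = np.asarray(part)
        for p in np.unique(part):
            members = part == p
            dist = np.array([np.mean(reference[members] == r) for r in ref_ids])
            votes[members] += dist
    votes /= len(partitions)
    return ref_ids[votes.argmax(axis=1)], votes           # hard consensus + soft votes
```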

    Enhancement of spatial data analysis

    Ph.D. (Doctor of Philosophy)

    Sensing Structured Signals with Active and Ensemble Methods

    Modern problems in signal processing and machine learning involve the analysis of data that is high-volume, high-dimensional, or both. In one example, scientists studying the environment must choose their set of measurements from an infinite set of possible sample locations. In another, performing inference on high-resolution images involves operating on vectors whose dimensionality is on the order of tens of thousands. To combat the challenges presented by these and other applications, researchers rely on two key features intrinsic to many large datasets. First, large volumes of data can often be accurately represented by a few key points, allowing for efficient processing, summary, and collection of data. Second, high-dimensional data often has low-dimensional intrinsic structure that can be leveraged for processing and storage. This thesis leverages these facts to develop and analyze algorithms capable of handling the challenges presented by modern data. The first scenario considered in this thesis is that of monitoring regions of low oxygen concentration (hypoxia) in lakes via an autonomous robot. Tracking the spatial extent of such hypoxic regions is of great interest and importance to scientists studying the Great Lakes, but current systems rely heavily on hydrodynamic models and a very small number of measurements at predefined sample locations. Existing active learning algorithms minimize the samples required to determine the spatial extent but do not consider the distance traveled during the estimation procedure. We propose a novel active learning algorithm for tracking such regions that balances both the number of measurements taken and the distance traveled in estimating the boundary of the hypoxic zone. The second scenario considered is learning a union of subspaces (UoS) model that best fits a given collection of points. This model can be viewed as a generalization of principal components analysis (PCA) in which data vectors are drawn from one of several low-dimensional linear subspaces of the ambient space and has applications in image segmentation and object recognition. The problem of automatically sorting the data according to nearest subspace is known as subspace clustering, and existing unsupervised algorithms perform this task well in many situations. However, state-of-the-art algorithms do not fully leverage the problem geometry, and the resulting clustering errors are far from the best possible using the UoS model. We present two novel means of bridging this gap. We first present a method of incorporating semi-supervised information into existing unsupervised subspace clustering algorithms in the form of pairwise constraints between items. We next study an ensemble algorithm for unsupervised subspace clustering that functions by combining the outputs from many efficient but inaccurate base clusterings to achieve state-of-the-art performance. Finally, we perform the first principled study of model selection for subspace clustering, in which we define clustering quality metrics that do not rely on the ground truth and evaluate their ability to reliably predict clustering accuracy. The contributions of this thesis demonstrate the applicability of tools from signal processing and machine learning to problems ranging from scientific exploration to computer vision.
By utilizing inherent structure in the data, we develop algorithms that are efficient in terms of computational complexity and other realistic costs, making them truly practical for modern problems in data science.
    Ph.D., Electrical Engineering: Systems, University of Michigan, Horace H. Rackham School of Graduate Studies
    https://deepblue.lib.umich.edu/bitstream/2027.42/140795/1/lipor_1.pd
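    As a rough illustration of the ensemble idea described in this abstract, the sketch below averages co-assignment affinities from many cheap base clusterings and extracts a consensus with spectral clustering. The base learners here (k-means on random projections) are a stand-in chosen for brevity; the thesis uses subspace-aware base clusterings and a more careful combination scheme, and all names and defaults are ours.

```python
import numpy as np
from sklearn.cluster import KMeans, SpectralClustering

def ensemble_subspace_clustering(X, n_clusters, n_base=30, proj_dim=5, random_state=0):
    """Average co-assignment affinities from many cheap base clusterings and
    extract a consensus partition with spectral clustering."""
    rng = np.random.default_rng(random_state)
    n = X.shape[0]
    affinity = np.zeros((n, n))
    for _ in range(n_base):
        # Base learner: k-means on a random low-dimensional projection (a stand-in
        # for the more specialized base clusterings studied in the thesis).
        P = rng.standard_normal((X.shape[1], proj_dim)) / np.sqrt(proj_dim)
        labels = KMeans(n_clusters=n_clusters, n_init=3,
                        random_state=int(rng.integers(1_000_000))).fit_predict(X @ P)
        affinity += (labels[:, None] == labels[None, :])
    affinity /= n_base
    consensus = SpectralClustering(n_clusters=n_clusters, affinity="precomputed",
                                   random_state=random_state)
    return consensus.fit_predict(affinity)
```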