33 research outputs found

    Definition of MV Load Diagrams via Weighted Evidence Accumulation Clustering using Subsampling

    Medium voltage (MV) load diagrams were defined following the knowledge discovery in databases process. Clustering techniques were used to help agents in the electric power retail markets obtain specific knowledge of their customers’ consumption habits. Each customer class resulting from the clustering operation is represented by its load diagram. The Two-step clustering algorithm and the WEACS approach, based on evidence accumulation clustering (EAC), were applied to electricity consumption data from a utility client’s database in order to form the customer classes and to find a set of representative consumption patterns. WEACS is a clustering ensemble combination approach that uses subsampling and weights the partitions differently in the co-association matrix. As a complementary step to the WEACS approach, all the final data partitions produced by the different variations of the method are combined, and the Ward Link algorithm is used to obtain the final data partition. Experimental results showed that the WEACS approach led to better accuracy than many other clustering approaches. In this paper, the WEACS approach separates the customer population better than the Two-step clustering algorithm.

    The Consensus Clustering as a Contribution to Parental Recognition Problem Based on Hand Biometrics

    Cluster analysis has attracted researchers from several areas, such as health (medical diagnosis, clustering of proteins and genes), marketing (market analysis and image segmentation), and information management (clustering of web pages). Clustering algorithms are usually applied in data mining, allowing the identification of natural groups in a given data set. Applying different clustering methods to the same data set can produce different groups, so several studies have sought to validate the resulting clusters. There has been increasing interest in how to determine a consensus clustering that combines the different individual clusterings, reflecting the main cluster structure inherent to each of them, as a way to obtain a higher-quality clustering. As several consensus clustering techniques have been researched, the present work focuses on the problem of finding the best partition in consensus clustering. We analyze the techniques most often referred to in the literature, which use different mechanisms to achieve consensus, namely voting mechanisms, the co-association matrix, and mutual information and hypergraphs, as well as a multi-objective consensus clustering technique from the literature. In this paper we discuss these approaches and present a comparative study based on a set of experiments using two-dimensional synthetic data sets with different characteristics, such as the number of clusters and their cardinality, shape, homogeneity and separability, and a real-world data set based on hand-shape biometrics, in the context of recognizing parental relationships between people. With this data we intend to investigate the ability of consensus clustering algorithms to correctly cluster a child with her/his parents. This has enormous business potential and great economic value, since with this technology a website could match data, such as photographs of hands, and say whether A and B are somehow related.
We conclude that, in some cases, the multi-objective technique outperformed the other techniques and, unlike them, is little influenced by poor base clusterings, even in situations such as noise introduction and clusters that overlap or differ in homogeneity. Furthermore, it can capture the performance of the best base clustering and still outperform it. Regarding the real data, no technique was capable of identifying a person's mother or father. However, studying the distances between the hands of a person and those of their father, mother and siblings can yield the probability that one person is a relative of another. This does not enable the identification of relatives, but it does reduce the size of the database in which to search for matches.

    Determination of electricity consumers’ load profiles via weighted evidence accumulation clustering using subsampling

    With electricity market liberalization, distribution and retail companies are looking for better market strategies based on adequate information on the consumption patterns of their electricity consumers. A fair insight into consumers’ behavior will permit the definition of specific contract aspects based on the different consumption patterns. In order to form the different consumer classes and find a set of representative consumption patterns, we use electricity consumption data from a utility client’s database and two approaches: the Two-step clustering algorithm and the WEACS approach, which is based on evidence accumulation clustering (EAC) for combining partitions in a clustering ensemble. EAC uses a voting mechanism to produce a co-association matrix from the pairwise associations obtained from N partitions, with each partition having equal weight in the combination process, whereas the WEACS approach uses subsampling and weights the partitions differently. As a complementary step to the WEACS approach, we combine the partitions obtained in the WEACS approach with the ALL clustering ensemble construction method and use the Ward Link algorithm to obtain the final data partition. The resulting consumer clusters were characterized using the C5.0 classification algorithm. Experimental results showed that the WEACS approach leads to better results than many other clustering approaches.
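
The voting step described in this abstract can be illustrated with a minimal sketch. It assumes every partition labels every object, so no subsampling is modelled. The function name `co_association` and the `weights` argument are hypothetical, and the weights are placeholders rather than the weighting scheme WEACS actually uses, which the abstract does not specify. With equal weights the result corresponds to the plain EAC vote; extracting the final partition (for example with Ward linkage) is not shown.

```python
def co_association(partitions, weights=None):
    """Build a (weighted) co-association matrix from a clustering ensemble.

    partitions: list of label lists, one per base clustering, all of length n.
    weights:    optional per-partition weights (hypothetical placeholder values);
                None means every partition casts an equal vote, as in plain EAC.
    Entry [i][j] is the weighted fraction of partitions that put objects
    i and j in the same cluster.
    """
    n = len(partitions[0])
    if weights is None:
        weights = [1.0] * len(partitions)
    total = float(sum(weights))
    matrix = [[0.0] * n for _ in range(n)]
    for labels, weight in zip(partitions, weights):
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    # this partition votes that i and j belong together
                    matrix[i][j] += weight
    # normalise the accumulated votes to the range [0, 1]
    return [[value / total for value in row] for row in matrix]


# Three toy partitions of four objects (label values are arbitrary per partition).
partitions = [
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 0],
]
C = co_association(partitions)
for row in C:
    print([round(value, 2) for value in row])
```

The resulting matrix can be read as a similarity measure between objects, which is what allows a hierarchical algorithm to be run on it afterwards.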

    LinkCluE: A MATLAB Package for Link-Based Cluster Ensembles

    Cluster ensembles have emerged as a powerful meta-learning paradigm that provides improved accuracy and robustness by aggregating several input data clusterings. In particular, link-based similarity methods have recently been introduced with superior performance to the conventional co-association approach. This paper presents a MATLAB package, LinkCluE, that implements the link-based cluster ensemble framework. A variety of functional methods for evaluating clustering results, based on both internal and external criteria, are also provided. Additionally, the underlying algorithms together with the sample uses of the package with interesting real and synthetic datasets are demonstrated herein.

    Ensemble attribute profile clustering: discovering and characterizing groups of genes with similar patterns of biological features

    BACKGROUND: Ensemble attribute profile clustering is a novel, text-based strategy for analyzing a user-defined list of genes and/or proteins. The strategy exploits annotation data present in gene-centered corpora and utilizes ideas from statistical information retrieval to discover and characterize properties shared by subsets of the list. The practical utility of this method is demonstrated by employing it in a retrospective study of two non-overlapping sets of genes defined by a published investigation as markers for normal human breast luminal epithelial cells and myoepithelial cells. RESULTS: Each genetic locus was characterized using a finite set of biological properties and represented as a vector of features indicating attributes associated with the locus (a gene attribute profile). In this study, the vector space models for a pre-defined list of genes were constructed from the Gene Ontology (GO) terms and the Conserved Domain Database (CDD) protein domain terms assigned to the loci by the gene-centered corpus LocusLink. This data set of GO- and CDD-based gene attribute profiles, vectors of binary random variables, was used to estimate multiple finite mixture models, and each ensuing model was used to partition the profiles into clusters. The resultant partitionings were combined using a unanimous voting scheme to produce consensus clusters, sets of profiles that co-occurred consistently in the same cluster. Attributes that were important in defining the genes assigned to a consensus cluster were identified. The clusters and their attributes were inspected to ascertain the GO and CDD terms most associated with subsets of genes and, in conjunction with external knowledge such as chromosomal location, were used to gain functional insights into human breast biology. The 52 luminal epithelial cell markers and 89 myoepithelial cell markers are disjoint sets of genes. 
Ensemble attribute profile clustering-based analysis indicated that both lists contained groups of genes with the functional properties of membrane receptor biology/signal transduction and nucleic acid binding/transcription. A subset of the luminal markers was associated with metabolic and oxidoreductase activities, whereas a subset of myoepithelial markers was associated with protein hydrolase activity. CONCLUSION: Given a set of genes and/or proteins associated with a phenomenon, process or system of interest, ensemble attribute profile clustering provides a simple method for collating and synthesizing the annotation data pertaining to them that are present in text-based, gene-centered corpora. The results provide information about properties common and unique to subsets of the list and hence insights into the biology of the problem under investigation.
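
One plausible reading of the unanimous voting scheme mentioned in this abstract is that two profiles share a consensus cluster only if every partitioning places them together. The sketch below follows that reading; the function name `unanimous_consensus` and the toy labels are illustrative assumptions, not the authors' implementation.

```python
from collections import defaultdict


def unanimous_consensus(partitions):
    """Group objects that share a cluster in every partition.

    partitions: list of label lists, one per partitioning, all of length n.
    Two objects co-occur in every partition exactly when their tuples of
    labels across the partitions are identical, so grouping by that tuple
    yields the unanimous consensus clusters.
    """
    n = len(partitions[0])
    groups = defaultdict(list)
    for index in range(n):
        signature = tuple(labels[index] for labels in partitions)
        groups[signature].append(index)
    return sorted(groups.values())


# Two toy partitionings of five profiles.
toy_partitions = [
    [0, 0, 1, 1, 1],
    [1, 1, 0, 0, 1],
]
consensus = unanimous_consensus(toy_partitions)
print(consensus)
```

Under this rule an object that disagrees with its neighbours in even one partitioning ends up in its own consensus cluster, which makes the scheme conservative.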

    Variability analysis of the hierarchical clustering algorithms and its implication on consensus clustering

    Clustering is one of the most important unsupervised learning tools when no prior knowledge about the data set is available. Clustering algorithms aim to find the underlying structure of data sets, taking into account clustering criteria, properties of the data and the specific way in which data are compared. Many clustering algorithms have been proposed in the literature with a common goal: given a set of objects, group similar objects in the same cluster and dissimilar objects in different clusters. Hierarchical clustering algorithms are of great importance in data analysis, providing knowledge about the data structure. Because the resulting partitions are represented graphically as a dendrogram, they may give more information than the clusterings obtained by non-hierarchical algorithms. Using different clustering methods on the same data set, or the same clustering method with different initializations (different parameters), can produce different clusterings. Several studies have therefore been concerned with validating the resulting clusterings by analyzing them in terms of stability/variability, and there has also been increasing interest in the problem of determining a consensus clustering. This work empirically analyzes the clustering variability delivered by hierarchical algorithms, and some consensus clustering techniques are also investigated. Based on the variability of hierarchical clustering, we select the most suitable consensus clustering technique from the literature. Results on a range of synthetic and real data sets reveal significant differences in the variability of hierarchical clustering, as well as different performances of the consensus clustering techniques.

    Considering Currency in Decision Trees in the Context of Big Data

    In the current age of big data, decision trees are one of the most commonly applied data mining methods. However, for reliable results they require up-to-date input data, which is not always available in practice. We present a two-phase approach based on probability theory for considering the currency of stored data in decision trees. Our approach is efficient and thus suitable for big data applications. Moreover, it is independent of the particular decision tree classifier. Finally, it is context-specific, since the decision tree structure and supplemental data are taken into account. We demonstrate the benefits of the novel approach by applying it to three datasets. The results show a substantial increase in the classification success rate compared with not considering currency. Thus, applying our approach prevents wrong classifications and consequently wrong decisions.