Feature Selection in k-Median Clustering

Author: Mangasarian Olvi
Wild Edward
Publication date: 01/01/2004

An effective method for selecting features in clustering unlabeled data is proposed based on changing the objective function of the standard k-median clustering algorithm. The change consists of perturbing the objective function by a term that drives the medians of each of the k clusters toward the (shifted) global median of zero for the entire dataset. As the perturbation parameter is increased, more and more features are driven automatically toward the global zero median and are eliminated from the problem until one last feature remains. An error curve for unlabeled data clustering as a function of the number of features used gives reduced-feature clustering error relative to the "gold standard" of the full-feature clustering. This clustering error curve parallels a classification error curve based on real data labels. This justifies the utility of the former error curve for unlabeled data as a means of choosing an appropriate number of reduced features in order to achieve a correctness comparable to that obtained by the full set of original features. For example, on the 3-class Wine dataset, clustering with 4 selected input space features is comparable to within 4% to clustering using the original 13 features of the problem.
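
The perturbed objective described in this abstract can be illustrated with a rough sketch. The code below assumes the perturbation is an L1 penalty `lam * |c|` added to each per-feature median update, so that each update becomes a weighted median with an extra point of weight `lam` at zero. The penalty form, the names `penalized_median` and `penalized_k_median`, and the parameter `lam` are assumptions made for illustration, not the authors' published algorithm.

```python
import random
from statistics import median

def penalized_median(values, lam):
    """Return a minimizer of sum_i |v_i - c| + lam * |c| over c.

    Assumed penalty form: this is a weighted median of the values
    (weight 1 each) plus one extra point at 0 with weight lam.
    """
    pairs = sorted([(v, 1.0) for v in values] + [(0.0, lam)])
    half = 0.5 * sum(w for _, w in pairs)
    running = 0.0
    for point, weight in pairs:
        running += weight
        if running >= half:
            return point
    return 0.0

def penalized_k_median(rows, k, lam, n_iter=20, seed=0):
    """Toy k-median loop with the assumed zero-pulling penalty."""
    n_features = len(rows[0])
    # Shift every feature so that the global median is zero.
    shifts = [median(r[j] for r in rows) for j in range(n_features)]
    X = [[r[j] - shifts[j] for j in range(n_features)] for r in rows]
    centers = [list(x) for x in random.Random(seed).sample(X, k)]
    labels = [0] * len(X)
    for _ in range(n_iter):
        # Assign each point to the nearest center in 1-norm distance.
        labels = [
            min(range(k), key=lambda l: sum(abs(a - c) for a, c in zip(x, centers[l])))
            for x in X
        ]
        # Update each non-empty cluster's center feature by feature.
        for l in range(k):
            members = [x for x, lab in zip(X, labels) if lab == l]
            if members:
                centers[l] = [
                    penalized_median([m[j] for m in members], lam)
                    for j in range(n_features)
                ]
    # A feature survives if at least one cluster center is non-zero in it.
    kept = [j for j in range(n_features) if any(c[j] != 0 for c in centers)]
    return labels, centers, kept
```

Under these assumptions, sweeping `lam` upward and recording `kept` at each value would trace how features drop out one by one, which is the behavior the abstract describes.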

Minds@University of Wisconsin

Utility-driven assessment of anonymized data via clustering

Author: Fazendeiro Paulo
Ferrão Maria Eugénia
Prata Paula
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 30/07/2022

In this study, clustering is conceived as an auxiliary tool to identify groups of special interest. This approach was applied to a real dataset concerning an entire Portuguese cohort of higher education Law students. Several anonymized clustering scenarios were compared against the original cluster solution. The clustering techniques were explored as data utility models in the context of data anonymization, using k-anonymity and (ε, δ)-differential privacy as privacy models. The purpose was to assess anonymized data utility by standard metrics, by the characteristics of the groups obtained, and by the relative risk (a relevant metric in social sciences research). For the sake of self-containment, we present an overview of anonymization and clustering methods. We used a partitional clustering algorithm and analyzed several clustering validity indices to understand to what extent the data structure is preserved, or not, after data anonymization. The results suggest that for low dimensionality/cardinality datasets the anonymization procedure easily jeopardizes the clustering endeavor. In addition, there is evidence that relevant field-of-study estimates obtained from anonymized data are biased.
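
For readers unfamiliar with the k-anonymity model named in this abstract, here is a minimal sketch of how the k level of a released table can be checked. The record layout, the column names, and the function name `k_anonymity_level` are hypothetical, and the toy records are made up for illustration; nothing here comes from the study's data or code.

```python
from collections import Counter

def k_anonymity_level(records, quasi_identifiers):
    """Return the size of the smallest group of records that share
    the same values on all quasi-identifier columns.

    A table is k-anonymous for any k up to this value.
    """
    groups = Counter(
        tuple(record[q] for q in quasi_identifiers) for record in records
    )
    return min(groups.values()) if groups else 0

# Toy, invented records with generalized age and postcode values.
toy_records = [
    {"age": "20-29", "zip": "62**", "grade": 14},
    {"age": "20-29", "zip": "62**", "grade": 12},
    {"age": "30-39", "zip": "62**", "grade": 15},
]
level = k_anonymity_level(toy_records, ["age", "zip"])
```

In this toy table the third record is alone in its age group, so the table is only 1-anonymous on `age` and `zip`; generalizing `age` further would raise the level at the cost of the data utility the study measures.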

UBibliorum repositorio digital da ubi

PubMed Central

Investigation Of Multi-Criteria Clustering Techniques For Smart Grid Datasets

Author: Campion Mitch J
Publication venue: UND Scholarly Commons
Publication date: 01/01/2018

The processing of data arising from connected smart grid technology is an important area of research for the next generation power system. The volume of data allows for increased awareness and efficiency of operation but poses challenges for analyzing the data and turning it into meaningful information. This thesis showcases the utility of clustering algorithms applied to three separate smart-grid data sets and analyzes their ability to improve awareness and operational efficiency. Hierarchical clustering for anomaly detection in phasor measurement unit (PMU) datasets is identified as an appropriate method for fault and anomaly detection. It showed an increase in anomaly detection efficiency according to Dunn Index (DI) and improved computational considerations compared to currently employed techniques such as Density Based Spatial Clustering of Applications with Noise (DBSCAN). The efficacy of betweenness-centrality (BC) based clustering in a novel clustering scheme for the determination of microgrids from large scale bus systems is demonstrated and compared against a multitude of other graph clustering algorithms. The BC based clustering showed an overall decrease in economic dispatch cost when compared to other methods of graph clustering. Additionally, the utility of BC for identification of critical buses was showcased. Finally, this work demonstrates the utility of partitional dynamic time warping (DTW) and k-shape clustering methods for classifying power demand profiles of households with and without electric vehicles (EVs). The utility of DTW time-series clustering was compared against other methods of time-series clustering and tested based upon demand forecasting using traditional and deep-learning techniques. Additionally, a novel process for selecting an optimal time-series clustering scheme based upon a scaled sum of cluster validity indices (CVIs) was developed. Forecasting schemes based on DTW and k-shape demand profiles showed an overall increase in forecast accuracy. In summary, the use of clustering methods for three distinct types of smart grid datasets is demonstrated. The use of clustering algorithms as a means of processing data can lead to overall methods that improve forecasting, economic dispatch, event detection, and overall system operation. Ultimately, the techniques demonstrated in this thesis give analytical insights and foster data-driven management and automation for smart grid power systems of the future.
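
The dynamic time warping (DTW) distance that underlies the time-series clustering mentioned in this abstract can be sketched with the textbook dynamic program below. This is a generic formulation with an absolute-difference local cost and no warping window; the function name `dtw_distance` is an illustrative choice, and the thesis may use a different variant or library.

```python
import math

def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) DTW distance between two sequences,
    using |x - y| as the local cost and no window constraint."""
    n, m = len(a), len(b)
    # cost[i][j] is the best alignment cost of a[:i] with b[:j].
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b repeats its sample
                cost[i][j - 1],      # b advances, a repeats its sample
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]
```

Two demand profiles with the same shape but a stretched plateau, such as `[1, 2, 3]` and `[1, 2, 2, 3]`, get a DTW distance of zero, which is why the measure suits load curves whose peaks shift in time.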

UND Scholarly Commons (University of North Dakota)

Graph Summarization

Author: Bonifati Angela
Dumbrava Stefania
Kondylakis Haridimos
Publication date: 01/04/2020

The continuous and rapid growth of highly interconnected datasets, which are both voluminous and complex, calls for the development of adequate processing and analytical techniques. One method for condensing and simplifying such datasets is graph summarization. It denotes a series of application-specific algorithms designed to transform graphs into more compact representations while preserving structural patterns, query answers, or specific property distributions. As this problem is common to several areas studying graph topologies, different approaches, such as clustering, compression, sampling, or influence detection, have been proposed, primarily based on statistical and optimization methods. The focus of our chapter is to pinpoint the main graph summarization methods, with particular attention to the most recent approaches and novel research trends on this topic, not yet covered by previous surveys.

Comment: To appear in the Encyclopedia of Big Data Technologies

arXiv.org e-Print Archive

Hal - Université Grenoble Alpes

INRIA a CCSD electronic archive server

Hal-Diderot

Cross-species analysis of genetically engineered mouse models of MAPK-driven colorectal cancer identifies hallmarks of the human disease

Author: Belmont Peter J.
Budinska Eva
Coffee Erin
Delorenzi Mauro
Derkits Sahra
Hung Kenneth E.
Jiang Ping
Martin Eric S.
Rejto Paul A.
Roper Jatin
Sansom Owen J.
Sinnamon Mark J.
Tejpar Sabine
Xie Tao
Publication venue: 'The Company of Biologists'
Publication date: 17/04/2014

Effective treatment options for advanced colorectal cancer (CRC) are limited, survival rates are poor and this disease continues to be a leading cause of cancer-related deaths worldwide. Despite being a highly heterogeneous disease, a large subset of individuals with sporadic CRC typically harbor relatively few established ‘driver’ lesions. Here, we describe a collection of genetically engineered mouse models (GEMMs) of sporadic CRC that combine lesions frequently altered in human patients, including well-characterized tumor suppressors and activators of MAPK signaling. Primary tumors from these models were profiled, and individual GEMM tumors segregated into groups based on their genotypes. Unique allelic and genotypic expression signatures were generated from these GEMMs and applied to clinically annotated human CRC patient samples. We provide evidence that a Kras signature derived from these GEMMs is capable of distinguishing human tumors harboring KRAS mutation, and tracks with poor prognosis in two independent human patient cohorts. Furthermore, the analysis of a panel of human CRC cell lines suggests that high expression of the GEMM Kras signature correlates with sensitivity to targeted pathway inhibitors. Together, these findings implicate GEMMs as powerful preclinical tools with the capacity to recapitulate relevant human disease biology, and support the use of genetic signatures generated in these models to facilitate future drug discovery and validation efforts.

Directory of Open Access Journals

PubMed Central

Enlighten

Approaches to conceptual clustering

Author: Fisher Douglas
Langley Pat
Publication venue: eScholarship, University of California
Publication date: 12/07/1985

Methods for conceptual clustering may be explicated in two lights. Conceptual clustering methods may be viewed as extensions to techniques of numerical taxonomy, a collection of methods developed by social and natural scientists for creating classification schemes over object sets. Alternatively, conceptual clustering may be viewed as a form of learning by observation or concept formation, as opposed to methods of learning from examples or concept identification. In this paper we survey and compare a number of conceptual clustering methods along dimensions suggested by each of these views. The point we most wish to clarify is that conceptual clustering processes can be explicated as being composed of three distinct but inter-dependent subprocesses: the process of deriving a hierarchical classification scheme; the process of aggregating objects into individual classes; and the process of assigning conceptual descriptions to object classes. Each subprocess may be characterized along a number of dimensions related to search, thus facilitating a better understanding of the conceptual clustering process as a whole.

eScholarship - University of California