Feature Selection in k-Median Clustering
An effective method for selecting features in clustering
unlabeled data is proposed based on changing the objective
function of the standard k-median clustering algorithm. The
change consists of perturbing the objective function by a
term that drives the medians of each of the k clusters toward
the (shifted) global median of zero for the entire dataset.
As the perturbation parameter is increased, more and more
features are driven automatically toward the global zero
median and are eliminated from the problem until one last
feature remains. An error curve for unlabeled data clustering
as a function of the number of features used gives reduced-feature
clustering error relative to the "gold standard" of the
full-feature clustering. This clustering error curve parallels
a classification error curve based on real data labels. This
justifies the utility of the former error curve for unlabeled
data as a means of choosing an appropriate number of
reduced features in order to achieve a correctness comparable
to that obtained by the full set of original features. For
example, on the 3-class Wine dataset, clustering with 4
selected input space features is comparable to within 4%
to clustering using the original 13 features of the problem.
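The perturbation idea in this abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm: it assumes the perturbation is an L1 penalty `lam * |c|` on each median coordinate, that the data have already been shifted so each feature's global median is zero, and it uses a naive initialisation from the first `k` rows. The names `shrunken_median`, `perturbed_k_median`, and `surviving_features` are hypothetical. Under that assumed penalty, each coordinate update is a weighted median of the cluster's values plus a pseudo-point at zero with weight `lam`.

```python
def shrunken_median(values, lam):
    """Return a minimizer of sum(|v - c| for v in values) + lam * |c|.

    Under the assumed L1 penalty this is a weighted median of the values
    (weight 1 each) together with a pseudo-point at 0 carrying weight lam.
    """
    points = sorted([(v, 1.0) for v in values] + [(0.0, lam)])
    half = (len(values) + lam) / 2.0
    running = 0.0
    for value, weight in points:
        running += weight
        if running >= half:
            return value
    return points[-1][0]


def perturbed_k_median(rows, k, lam, iters=20):
    """Lloyd-style k-median with L1 distances and shrunken medians."""
    centers = [list(r) for r in rows[:k]]  # naive initialisation (assumption)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for row in rows:
            dists = [sum(abs(a - b) for a, b in zip(row, c)) for c in centers]
            clusters[dists.index(min(dists))].append(row)
        for j, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous center
                centers[j] = [shrunken_median(col, lam) for col in zip(*members)]
    return centers


def surviving_features(centers):
    """Indices of features whose median is nonzero in at least one cluster."""
    return [f for f in range(len(centers[0])) if any(c[f] != 0 for c in centers)]


# Toy data: feature 0 separates two groups, feature 1 is small noise around 0.
rows = [(-5, 0.1), (5, -0.2), (-4, -0.1), (4, 0.1), (-6, 0.2), (6, -0.1)]
print(surviving_features(perturbed_k_median(rows, k=2, lam=0.0)))
print(surviving_features(perturbed_k_median(rows, k=2, lam=1.5)))
```

On this toy data, raising `lam` drives the noisy feature's medians to zero in both clusters while the separating feature survives, which is the qualitative behaviour the abstract describes.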
Utility-driven assessment of anonymized data via clustering
In this study, clustering is conceived as an auxiliary tool to identify groups of special interest. This
approach was applied to a real dataset concerning an entire Portuguese cohort of higher education Law
students. Several anonymized clustering scenarios were compared against the original cluster solution.
The clustering techniques were explored as data utility models in the context of data anonymization,
using k-anonymity and (ε, δ)-differential privacy as privacy models. The purpose was to assess anonymized
data utility by standard metrics, by the characteristics of the groups obtained, and the relative risk (a
relevant metric in social sciences research). For self-containment, we present an overview
of anonymization and clustering methods. We used a partitional clustering algorithm and analyzed
several clustering validity indices to understand to what extent the data structure is preserved, or not,
after data anonymization. The results suggest that for low dimensionality/cardinality datasets the
anonymization procedure easily jeopardizes the clustering endeavor. In addition, there is evidence that
relevant field-of-study estimates obtained from anonymized data are biased.
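As a rough illustration of the k-anonymity model mentioned in this abstract, the Python sketch below measures the smallest group of records sharing the same quasi-identifier values, before and after a simple generalization step. The column names, the toy records, and the helpers `k_anonymity_level` and `generalize_age` are hypothetical and are not taken from the study.

```python
from collections import Counter


def k_anonymity_level(records, quasi_identifiers):
    """Size of the smallest group sharing the same quasi-identifier values.

    A table is k-anonymous for every k up to this value.
    """
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values()) if groups else 0


def generalize_age(record, width=10):
    """Coarsen the (hypothetical) 'age' column into bins of the given width."""
    return {**record, "age": record["age"] // width * width}


# Hypothetical student records; 'age' and 'region' act as quasi-identifiers.
records = [
    {"age": 21, "region": "North", "grade": 14},
    {"age": 23, "region": "North", "grade": 12},
    {"age": 27, "region": "North", "grade": 16},
    {"age": 34, "region": "South", "grade": 11},
    {"age": 38, "region": "South", "grade": 15},
]
generalized = [generalize_age(r) for r in records]

print(k_anonymity_level(records, ["age", "region"]))      # every record unique
print(k_anonymity_level(generalized, ["age", "region"]))  # groups after binning
```

Generalization raises the anonymity level by making records coarser, which is also why clustering on the anonymized table can diverge from clustering on the original one.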
Investigation Of Multi-Criteria Clustering Techniques For Smart Grid Datasets
The processing of data arising from connected smart grid technology is an important area of research for the next generation power system. The volume of data allows for increased awareness and efficiency of operation but poses challenges for analyzing the data and turning it into meaningful information. This thesis showcases the utility of clustering algorithms applied to three separate smart-grid data sets and analyzes their ability to improve awareness and operational efficiency.
Hierarchical clustering for anomaly detection in phasor measurement unit (PMU) datasets is identified as an appropriate method for fault and anomaly detection. It showed an increase in anomaly detection efficiency according to Dunn Index (DI) and improved computational considerations compared to currently employed techniques such as Density Based Spatial Clustering of Applications with Noise (DBSCAN).
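For readers unfamiliar with the Dunn Index, here is a minimal Python sketch of one common variant: the smallest distance between points in different clusters divided by the largest within-cluster diameter, with higher values indicating compact, well-separated clusters. The thesis may use a different variant; the function name `dunn_index` and the toy points are assumptions for illustration.

```python
from itertools import combinations
from math import dist


def dunn_index(clusters):
    """Dunn Index: minimum inter-cluster distance / maximum cluster diameter.

    `clusters` is a list of clusters, each a list of equal-length point tuples.
    """
    if len(clusters) < 2:
        raise ValueError("need at least two clusters")
    min_separation = min(
        dist(p, q)
        for a, b in combinations(clusters, 2)
        for p in a
        for q in b
    )
    max_diameter = max(
        (dist(p, q) for c in clusters for p, q in combinations(c, 2)),
        default=0.0,
    )
    if max_diameter == 0.0:
        raise ValueError("all clusters are single points; index undefined")
    return min_separation / max_diameter


# Two tight, well-separated toy clusters.
clusters = [[(0, 0), (0, 1)], [(3, 0), (3, 1)]]
print(dunn_index(clusters))
```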
The efficacy of betweenness-centrality (BC) based clustering in a novel clustering scheme for the determination of microgrids from large scale bus systems is demonstrated and compared against a multitude of other graph clustering algorithms. The BC based clustering showed an overall decrease in economic dispatch cost when compared to other methods of graph clustering. Additionally, the utility of BC for identification of critical buses was showcased.
Finally, this work demonstrates the utility of partitional dynamic time warping (DTW) and k-shape clustering methods for classifying power demand profiles of households with and without electric vehicles (EVs). The utility of DTW time-series clustering was compared against other methods of time-series clustering and tested based upon demand forecasting using traditional and deep-learning techniques. Additionally, a novel process for selecting an optimal time-series clustering scheme based upon a scaled sum of cluster validity indices (CVIs) was developed. Forecasting schemes based on DTW and k-shape demand profiles showed an overall increase in forecast accuracy.
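The dynamic time warping distance that underlies DTW-based clustering can be sketched with the textbook dynamic program below. This is a generic illustration in Python, not the thesis's implementation: it assumes an absolute-difference local cost and no warping-window constraint, and the demand profiles are invented toy values.

```python
def dtw_distance(a, b):
    """Classic DTW distance between two numeric sequences.

    cost[i][j] is the cheapest alignment of a[:i] with b[:j].
    """
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # a[i-1] repeats against b[j-1]
                cost[i][j - 1],      # b[j-1] repeats against a[i-1]
                cost[i - 1][j - 1],  # both sequences advance
            )
    return cost[n][m]


# Two toy demand profiles with the same peak at different times.
early_peak = [0, 0, 5, 0, 0]
late_peak = [0, 0, 0, 5, 0]
pointwise = sum(abs(x - y) for x, y in zip(early_peak, late_peak))
print(pointwise, dtw_distance(early_peak, late_peak))
```

The pointwise difference penalizes the time shift heavily, while DTW aligns the two peaks, which is why it is a natural distance for grouping load profiles with similar shape but different timing.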
In summary, the use of clustering methods for three distinct types of smart grid datasets is demonstrated. The use of clustering algorithms as a means of processing data can lead to overall methods that improve forecasting, economic dispatch, event detection, and overall system operation. Ultimately, the techniques demonstrated in this thesis give analytical insights and foster data-driven management and automation for smart grid power systems of the future.
Graph Summarization
The continuous and rapid growth of highly interconnected datasets, which are
both voluminous and complex, calls for the development of adequate processing
and analytical techniques. One method for condensing and simplifying such
datasets is graph summarization. It denotes a series of application-specific
algorithms designed to transform graphs into more compact representations while
preserving structural patterns, query answers, or specific property
distributions. As this problem is common to several areas studying graph
topologies, different approaches, such as clustering, compression, sampling, or
influence detection, have been proposed, primarily based on statistical and
optimization methods. The focus of our chapter is to pinpoint the main graph
summarization methods, but especially to focus on the most recent approaches
and novel research trends on this topic, not yet covered by previous surveys.
Comment: To appear in the Encyclopedia of Big Data Technologies
Cross-species analysis of genetically engineered mouse models of MAPK-driven colorectal cancer identifies hallmarks of the human disease
Effective treatment options for advanced colorectal cancer (CRC) are limited, survival rates are poor and this disease continues to be a leading cause of cancer-related deaths worldwide. Despite being a highly heterogeneous disease, a large subset of individuals with sporadic CRC typically harbor relatively few established ‘driver’ lesions. Here, we describe a collection of genetically engineered mouse models (GEMMs) of sporadic CRC that combine lesions frequently altered in human patients, including well-characterized tumor suppressors and activators of MAPK signaling. Primary tumors from these models were profiled, and individual GEMM tumors segregated into groups based on their genotypes. Unique allelic and genotypic expression signatures were generated from these GEMMs and applied to clinically annotated human CRC patient samples. We provide evidence that a Kras signature derived from these GEMMs is capable of distinguishing human tumors harboring KRAS mutation, and tracks with poor prognosis in two independent human patient cohorts. Furthermore, the analysis of a panel of human CRC cell lines suggests that high expression of the GEMM Kras signature correlates with sensitivity to targeted pathway inhibitors. Together, these findings implicate GEMMs as powerful preclinical tools with the capacity to recapitulate relevant human disease biology, and support the use of genetic signatures generated in these models to facilitate future drug discovery and validation efforts.
Approaches to conceptual clustering
Methods for Conceptual Clustering may be explicated in two lights. Conceptual Clustering methods may be viewed as extensions to techniques of numerical taxonomy, a collection of methods developed by social and natural scientists for creating classification schemes over object sets. Alternatively, conceptual clustering may be viewed as a form of learning by observation or concept formation, as opposed to methods of learning from examples or concept identification. In this paper we survey and compare a number of conceptual clustering methods along dimensions suggested by each of these views. The point we most wish to clarify is that conceptual clustering processes can be explicated as being composed of three distinct but inter-dependent subprocesses: the process of deriving a hierarchical classification scheme; the process of aggregating objects into individual classes; and the process of assigning conceptual descriptions to object classes. Each subprocess may be characterized along a number of dimensions related to search, thus facilitating a better understanding of the conceptual clustering process as a whole.