    Learning Feature Weights for Density-Based Clustering

    K-Means is the most popular and widely used clustering algorithm, but it cannot recover non-spherical clusters in data sets. DBSCAN is arguably the most popular algorithm for recovering arbitrarily shaped clusters, which is why addressing the weaknesses of this density-based clustering algorithm is of great interest. One issue of concern is that DBSCAN requires two parameters and cannot recover clusters of widely varying density. The problem at the heart of this thesis is that, during the clustering process, DBSCAN uses all available features and treats them equally regardless of their degree of relevance in the data set, which can have a negative impact. This thesis addresses the above problems by laying the foundation of feature-weighted density-based clustering. Specifically, the thesis introduces a density-based clustering algorithm using reverse nearest neighbour, DBSCANR, that requires fewer parameters than DBSCAN for recovering clusters. DBSCANR is based on the insight that, in real-world data sets, the densities of the arbitrarily shaped clusters to be recovered within a data set are very different from each other. The thesis extends DBSCANR to what is referred to as weighted DBSCANR, W-DBSCANR, by exploiting a feature weighting technique to assign different levels of relevance to the features in a data set. The thesis extends W-DBSCANR further by using the Minkowski metric, so that the weights can be interpreted as feature re-scaling factors; this variant is named MW-DBSCANR. Experiments on both artificial and real-world data sets demonstrate the superiority of our method over DBSCAN-type algorithms. These weighted algorithms considerably reduce the impact of irrelevant features while recovering arbitrarily shaped clusters of different densities in a high-dimensional data set.
Within this context, this thesis incorporates a popular algorithm, feature selection using feature similarity (FSFS), into both W-DBSCANR and MW-DBSCANR to address the problem of feature selection. This unsupervised feature selection algorithm makes use of feature clustering and feature similarity to reduce the number of features in a data set. With a similar aim, and exploiting the concept of feature similarity, the thesis introduces a method, density-based feature selection using feature similarity (DBFSFS), that takes density-based cluster structure into consideration when reducing the number of features in a data set. This thesis then applies the developed method to real-world high-dimensional gene expression data sets. DBFSFS improves cluster recovery by substantially reducing the number of features in high-dimensional, low-sample-size data sets.
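The remark that Minkowski weights act as feature re-scaling factors can be illustrated with a minimal sketch. It assumes one common form of the weighted Minkowski distance, the sum over features of (w_v |x_v - y_v|)^p without the final root; the abstract does not state the exact formula used in MW-DBSCANR, and the function names and weight values below are hypothetical.

```python
def weighted_minkowski(x, y, weights, p):
    """Sum over features of (w_v * |x_v - y_v|) ** p (no final 1/p root).

    Assumed form; the thesis may define the distance differently.
    """
    return sum((w * abs(a - b)) ** p for a, b, w in zip(x, y, weights))


def rescale(x, weights):
    """Multiply each feature value by its weight."""
    return [w * a for a, w in zip(x, weights)]


x = [1.0, 4.0, 10.0]
y = [2.0, 6.0, -5.0]
weights = [0.6, 0.4, 0.0]  # hypothetical weights; the third feature is treated as irrelevant
p = 2.0

# Weighted distance on the raw features.
d_weighted = weighted_minkowski(x, y, weights, p)

# Unweighted distance on features that were rescaled by the weights first.
d_rescaled = weighted_minkowski(rescale(x, weights), rescale(y, weights), [1.0] * 3, p)
```

Under this assumed form the two quantities agree up to floating-point rounding, so a feature with weight 0 contributes nothing to the distance, however large its raw difference.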

    Effect Size Estimation and Misclassification Rate Based Variable Selection in Linear Discriminant Analysis

    Supervised classification of biological samples based on genetic information (e.g. gene expression profiles) is an important problem in biostatistics. To find classification rules that are both accurate and interpretable, variable selection is indispensable. This article explores how an assessment of the individual importance of variables (effect size estimation) can be used to perform variable selection. I review recent effect size estimation approaches in the context of linear discriminant analysis (LDA) and propose a new, conceptually simple effect size estimation method that is also computationally efficient. I then show how to use effect sizes to perform variable selection based on the misclassification rate, which is the data-independent expectation of the prediction error. Simulation studies and real data analyses illustrate that the proposed effect size estimation and variable selection methods are competitive. In particular, they lead to feature sets that are both compact and interpretable. Comment: 21 pages, 2 figures
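The link between effect sizes and the misclassification rate can be sketched under strong simplifying assumptions that are not taken from the article: two equally likely Gaussian classes, independent variables with common variances, and known standardized effect sizes. In that textbook setting the optimal linear rule has error Phi(-Delta/2), where Delta^2 is the sum of squared effect sizes. The selection rule and all names below are hypothetical, and the article's own estimators are not reproduced.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def misclassification_rate(effect_sizes):
    """Phi(-Delta / 2) with Delta^2 = sum of squared standardized effect sizes.

    Valid for two equally likely Gaussian classes with independent variables
    (an assumption of this sketch).
    """
    delta = math.sqrt(sum(d * d for d in effect_sizes))
    return normal_cdf(-delta / 2.0)


def select_variables(effect_sizes, tolerance):
    """Add variables by decreasing |effect size| until the rate is within
    `tolerance` of the rate obtained with all variables (hypothetical rule)."""
    target = misclassification_rate(effect_sizes)
    order = sorted(range(len(effect_sizes)), key=lambda j: -abs(effect_sizes[j]))
    chosen, rate = [], 0.5  # with no variables the rule is a coin flip
    for j in order:
        chosen.append(j)
        rate = misclassification_rate([effect_sizes[k] for k in chosen])
        if rate - target <= tolerance:
            break
    return chosen, rate


effects = [2.0, 0.1, -1.5, 0.05]  # made-up standardized mean differences
chosen, rate = select_variables(effects, tolerance=0.01)
```

With these made-up values the two variables with the largest effects are enough to come within the tolerance of the full-model rate, which is the sense in which effect sizes can yield compact feature sets.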

    Bandwidth selection for kernel estimation in mixed multi-dimensional spaces

    Kernel estimation techniques, such as mean shift, suffer from one major drawback: kernel bandwidth selection. The bandwidth can be fixed for the whole data set or can vary at each point. Automatic bandwidth selection becomes a real challenge in the case of multidimensional heterogeneous features. This paper presents a solution to this problem. It extends \cite{Comaniciu03a}, which was based on the fundamental property of normal distributions regarding the bias of the normalized density gradient. The selection is done iteratively for each type of feature, by looking for the stability of local bandwidth estimates across a predefined range of bandwidths. A pseudo-balloon mean shift filtering and partitioning are introduced. The validity of the method is demonstrated in the context of color image segmentation based on a 5-dimensional space.
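To show why the bandwidth matters, here is a minimal sketch of plain fixed-bandwidth mean shift with a Gaussian kernel in one dimension. It is not the paper's iterative per-feature selection scheme or its pseudo-balloon variant; the function name, data, and bandwidth values are illustrative assumptions.

```python
import math


def mean_shift_mode(start, points, bandwidth, tol=1e-9, max_iter=500):
    """Follow fixed-bandwidth Gaussian mean shift updates from `start` (1-D)."""
    x = start
    for _ in range(max_iter):
        # Gaussian kernel weight of every data point relative to the current position.
        weights = [math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in points]
        # The mean shift update moves x to the kernel-weighted mean of the data.
        new_x = sum(w * p for w, p in zip(weights, points)) / sum(weights)
        if abs(new_x - x) < tol:
            return new_x
        x = new_x
    return x


points = [0.9, 1.0, 1.1, 4.9, 5.0, 5.1]  # two well-separated groups

# A small bandwidth keeps the two groups apart.
low_mode = mean_shift_mode(1.3, points, bandwidth=0.5)
high_mode = mean_shift_mode(4.6, points, bandwidth=0.5)

# A large bandwidth smooths both groups into a single mode between them.
merged_mode = mean_shift_mode(1.3, points, bandwidth=5.0)
```

The same data produce two modes or one depending only on the bandwidth, which is the sensitivity that automatic bandwidth selection aims to remove.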

    Analysis of group evolution prediction in complex networks

    In a world in which acceptance by and identification with social communities are highly desired, the ability to predict the evolution of groups over time appears to be a vital but very complex research problem. Therefore, we propose a new, adaptable, generic and multi-stage method for Group Evolution Prediction (GEP) in complex networks that facilitates reasoning about the future states of recently discovered groups. The modularity of GEP enabled us to carry out extensive and versatile empirical studies on many real-world complex/social networks to analyze the impact of numerous setups and parameters, such as time window type and size, group detection method, evolution chain length, and prediction models. Additionally, many new predictive features reflecting the group state at a given time have been identified and tested. Some other research problems, such as enriching learning evolution chains with external data, have been analyzed as well.

    Optimising Selective Sampling for Bootstrapping Named Entity Recognition

    Training a statistical named entity recognition system in a new domain requires costly manual annotation of large quantities of in-domain data. Active learning promises to reduce the annotation cost by selecting only highly informative data points. This paper is concerned with a real active learning experiment to bootstrap a named entity recognition system for a new domain of radio astronomical abstracts. We evaluate several committee-based metrics for quantifying the disagreement between classifiers built using multiple views, and demonstrate that the choice of metric can be optimised in simulation experiments with existing annotated data from different domains. A final evaluation shows that we gained substantial savings compared to a randomly sampled baseline.
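As an illustration of a committee-based disagreement metric, the sketch below uses vote entropy, a commonly used measure in query-by-committee active learning. The abstract does not say which metrics the paper evaluates, so this is an assumed example; the label names and function names are hypothetical.

```python
import math
from collections import Counter


def vote_entropy(votes):
    """Entropy of the labels that committee members assign to one item."""
    total = len(votes)
    counts = Counter(votes)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def rank_by_disagreement(committee_votes):
    """Return item indices sorted from most to least committee disagreement."""
    scores = [vote_entropy(v) for v in committee_votes]
    return sorted(range(len(scores)), key=lambda i: -scores[i])


# Each inner list holds the labels predicted by three classifiers for one token.
committee_votes = [
    ["O", "O", "O"],                # full agreement
    ["SOURCE", "O", "INSTRUMENT"],  # every classifier disagrees
    ["SOURCE", "SOURCE", "O"],      # partial disagreement
]
ranking = rank_by_disagreement(committee_votes)
```

Items at the front of the ranking are the ones a selective sampler would send for annotation first, because the committee is least certain about them.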

    Resolving transition metal chemical space: feature selection for machine learning and structure-property relationships

    Machine learning (ML) of quantum mechanical properties shows promise for accelerating chemical discovery. For transition metal chemistry, where accurate calculations are computationally costly and available training data sets are small, the molecular representation becomes a critical ingredient in ML model predictive accuracy. We introduce a series of revised autocorrelation functions (RACs) that encode relationships between the heuristic atomic properties (e.g., size, connectivity, and electronegativity) on a molecular graph. We alter the starting point, scope, and nature of the quantities evaluated in standard ACs to make these RACs amenable to inorganic chemistry. On an organic molecule set, we first demonstrate that standard ACs outperform other presently available topological descriptors for ML model training, with mean unsigned errors (MUEs) for atomization energies on set-aside test molecules as low as 6 kcal/mol. For inorganic chemistry, our RACs yield 1 kcal/mol ML MUEs on set-aside test molecules in spin-state splitting, in comparison to 15-20x higher errors from feature sets that encode whole-molecule structural information. Systematic feature selection methods including univariate filtering, recursive feature elimination, and direct optimization (e.g., random forest and LASSO) are compared. Random-forest- or LASSO-selected subsets 4-5x smaller than RAC-155 produce sub- to 1-kcal/mol spin-splitting MUEs, with good transferability to metal-ligand bond length prediction (0.004-5 Å MUE) and redox potential on a smaller data set (0.2-0.3 eV MUE). Evaluation of feature selection results across property sets reveals the relative importance of local, electronic descriptors (e.g., electronegativity, atomic number) in spin-splitting and distal, steric effects in redox potential and bond lengths. Comment: 43 double-spaced pages, 11 figures, 4 tables
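A standard autocorrelation descriptor on a molecular graph can be sketched as the sum of products of an atomic property over all atom pairs separated by a given number of bonds. The sketch below assumes this conventional definition and counts ordered pairs, including each atom with itself at depth 0. The revised variants described in the abstract, which alter the starting point, scope, and nature of the quantities, are not reproduced, and the toy graph and property values are made up.

```python
from collections import deque


def bond_distances(adjacency, source):
    """Breadth-first search: number of bonds on the shortest path from `source`."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        atom = queue.popleft()
        for neighbour in adjacency[atom]:
            if neighbour not in dist:
                dist[neighbour] = dist[atom] + 1
                queue.append(neighbour)
    return dist


def autocorrelation(adjacency, prop, depth):
    """Sum of prop[i] * prop[j] over ordered atom pairs exactly `depth` bonds apart."""
    total = 0.0
    for i in adjacency:
        for j, d in bond_distances(adjacency, i).items():
            if d == depth:
                total += prop[i] * prop[j]
    return total


# Toy three-atom chain 0-1-2 with a made-up atomic property per atom.
adjacency = {0: [1], 1: [0, 2], 2: [1]}
prop = {0: 1.0, 1: 2.0, 2: 3.0}

# One descriptor entry per bond depth 0, 1, 2.
descriptor = [autocorrelation(adjacency, prop, d) for d in range(3)]
```

Each depth gives one fixed-length feature regardless of molecule size, which is what makes such descriptors convenient inputs for the feature selection methods compared in the abstract.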