Learning Feature Weights for Density-Based Clustering
K-Means is the most popular and widely used clustering algorithm, but it cannot
recover non-spherical clusters in data sets. DBSCAN is arguably the most popular
algorithm for recovering arbitrary-shape clusters, which is why this
density-based clustering algorithm is of great interest and why tackling its
weaknesses matters. One issue of concern is that DBSCAN requires two parameters
and cannot recover clusters of widely varying density. The problem at the heart
of this thesis is that, during the clustering process, DBSCAN takes all the
available features and treats them equally regardless of their degree of
relevance to the data set, which can have negative impacts.
This thesis addresses the above problems by laying the foundation of
feature-weighted density-based clustering. Specifically, the thesis introduces a
density-based clustering algorithm using reverse nearest neighbours, DBSCANR,
which requires fewer parameters than DBSCAN for recovering clusters. DBSCANR is
based on the insight that, in real-world data sets, the densities of the
arbitrary-shape clusters to be recovered are often very different from each other.
The thesis extends DBSCANR to what is referred to as weighted DBSCANR,
W-DBSCANR, by exploiting a feature-weighting technique to assign different
levels of relevance to the features in a data set. The thesis extends W-DBSCANR
further by using the Minkowski metric, so that the weights can be interpreted as
feature re-scaling factors; this variant is named MW-DBSCANR. Experiments on
both artificial and real-world data sets demonstrate the superiority of our
methods over DBSCAN-type algorithms. These weighted algorithms considerably
reduce the impact of irrelevant features while recovering arbitrary-shape
clusters of different densities in high-dimensional data sets.
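The weight-as-rescaling idea behind MW-DBSCANR can be illustrated with a weighted Minkowski distance plugged into an off-the-shelf DBSCAN. This is a minimal sketch, not the thesis's algorithm: the weights here are hand-set rather than learned, and the data is synthetic.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def weighted_minkowski(x, y, w, p=2.0):
    # Minkowski distance with per-feature weights acting as re-scaling factors
    return np.sum((w * np.abs(x - y)) ** p) ** (1.0 / p)

# Two Gaussian clusters in the first two features plus one irrelevant feature.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.05, size=(50, 2)),
               rng.normal(1.0, 0.05, size=(50, 2))])
X = np.hstack([X, rng.uniform(-5, 5, size=(100, 1))])  # noisy third feature

w = np.array([1.0, 1.0, 0.05])  # hand-set weights down-weighting the noise
db = DBSCAN(eps=0.4, min_samples=5,
            metric=lambda a, b: weighted_minkowski(a, b, w))
labels = db.fit_predict(X)
```

With the noisy feature down-weighted, the two planted clusters separate cleanly in the weighted metric; with uniform weights, the uninformative feature dominates the pairwise distances.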
Within this context, this thesis incorporates a popular algorithm, feature
selection using feature similarity (FSFS), into both W-DBSCANR and MW-DBSCANR to
address the problem of feature selection. This unsupervised feature-selection
algorithm makes use of feature clustering and feature similarity to reduce the
number of features in a data set. With a similar aim, exploiting the concept of
feature similarity, the thesis introduces a method, density-based feature
selection using feature similarity (DBFSFS), that takes density-based cluster
structure into consideration when reducing the number of features in a data set.
This thesis then applies the developed method to real-world high-dimensional
gene expression data sets. DBFSFS improves clustering recovery by substantially
reducing the number of features in high-dimensional, low-sample-size data sets.
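The feature-similarity idea that FSFS and DBFSFS build on can be caricatured as a greedy filter that repeatedly drops the feature most similar to another remaining feature. This simplified sketch uses absolute correlation as the similarity measure, which is not the exact criterion of either algorithm:

```python
import numpy as np

def select_by_feature_similarity(X, k):
    """Greedy unsupervised feature selection: while more than k features
    remain, drop the one most similar (here: most correlated) to another
    remaining feature."""
    keep = list(range(X.shape[1]))
    while len(keep) > k:
        C = np.abs(np.corrcoef(X[:, keep], rowvar=False))
        np.fill_diagonal(C, 0.0)          # ignore self-similarity
        keep.pop(int(np.argmax(C.max(axis=0))))
    return keep

# Feature 2 is a near-copy of feature 0, so one of the pair is redundant.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
X = np.hstack([X, X[:, :1] + 0.01 * rng.normal(size=(200, 1))])
selected = select_by_feature_similarity(X, 2)
```

The redundant duplicate is eliminated while the independent feature survives, which is the behaviour such similarity-based filters aim for on correlated gene-expression features.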
Effect Size Estimation and Misclassification Rate Based Variable Selection in Linear Discriminant Analysis
Supervised classification of biological samples based on genetic information
(e.g. gene expression profiles) is an important problem in biostatistics. In
order to find both accurate and interpretable classification rules, variable
selection is indispensable. This article explores how an assessment of the
individual importance of variables (effect size estimation) can be used to
perform variable selection. I review recent effect size estimation approaches
in the context of linear discriminant analysis (LDA) and propose a new
conceptually simple effect size estimation method which is at the same time
computationally efficient. I then show how to use effect sizes to perform
variable selection based on the misclassification rate, which is the
data-independent expectation of the prediction error. Simulation studies and real
data analyses illustrate that the proposed effect size estimation and variable
selection methods are competitive. In particular, they lead to both compact and
interpretable feature sets.
Comment: 21 pages, 2 figures
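A standard effect-size estimate of this kind is the standardized mean difference between the two classes, computed per variable and then used to rank variables. This is a generic sketch, not the article's proposed estimator:

```python
import numpy as np

def effect_sizes(X, y):
    """Per-variable standardized mean difference (Cohen's-d style)
    between two classes labelled 0 and 1."""
    X0, X1 = X[y == 0], X[y == 1]
    pooled_sd = np.sqrt((X0.var(axis=0, ddof=1) + X1.var(axis=0, ddof=1)) / 2)
    return (X1.mean(axis=0) - X0.mean(axis=0)) / pooled_sd

# Synthetic example: only the first of five variables carries signal.
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 5))
y = np.repeat([0, 1], 50)
X[y == 1, 0] += 2.0          # class shift on variable 0 only
d = effect_sizes(X, y)       # variable 0 gets the largest |effect size|
```

Ranking |d| and keeping the top variables yields a selection whose size can then be tuned against an estimate of the misclassification rate, as the article proposes.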
Bandwidth selection for kernel estimation in mixed multi-dimensional spaces
Kernel estimation techniques, such as mean shift, suffer from one major
drawback: the kernel bandwidth selection. The bandwidth can be fixed for the
whole data set or can vary at each point. Automatic bandwidth selection becomes
a real challenge in case of multidimensional heterogeneous features. This paper
presents a solution to this problem. It is an extension of \cite{Comaniciu03a}
which was based on the fundamental property of normal distributions regarding
the bias of the normalized density gradient. The selection is done iteratively
for each type of features, by looking for the stability of local bandwidth
estimates across a predefined range of bandwidths. A pseudo balloon mean shift
filtering and partitioning are introduced. The validity of the method is
demonstrated in the context of color image segmentation based on a
5-dimensional space.
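The stability criterion can be caricatured in one dimension: sweep a predefined range of candidate bandwidths and keep the one whose Gaussian kernel density estimates change least between neighbouring candidates. This is a crude stand-in for the paper's iterative, per-feature-type procedure, not a reimplementation of it:

```python
import numpy as np

def stable_bandwidth(x, candidates):
    """Return the candidate bandwidth whose kernel density estimates
    are most stable, i.e. change least from the previous candidate."""
    def kde(h):
        # Gaussian KDE evaluated at the sample points themselves
        d = (x[:, None] - x[None, :]) / h
        return np.exp(-0.5 * d ** 2).sum(axis=1) / (len(x) * h * np.sqrt(2 * np.pi))
    dens = np.array([kde(h) for h in candidates])
    instability = np.abs(np.diff(dens, axis=0)).mean(axis=1)
    return candidates[1:][np.argmin(instability)]

rng = np.random.default_rng(3)
x = rng.normal(size=200)
h = stable_bandwidth(x, np.linspace(0.1, 1.0, 10))
```

In the mixed multi-dimensional setting of the paper, this kind of search is run separately for each type of feature (e.g. spatial vs. color) rather than once over the joint space.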
Analysis of group evolution prediction in complex networks
In a world in which acceptance and identification with social communities are
highly desired, the ability to predict the evolution of groups over time appears
to be a vital but very complex research problem. Therefore, we propose a new,
adaptable, generic and multi-stage method for Group Evolution Prediction (GEP)
in complex networks that facilitates reasoning about the
future states of recently discovered groups. The modular design of GEP enabled
us to carry out extensive and versatile empirical studies on many real-world
complex and social networks to analyze the impact of numerous setups
and parameters like time window type and size, group detection method,
evolution chain length, prediction models, etc. Additionally, many new
predictive features reflecting the group state at a given time have been
identified and tested. Some other research problems like enriching learning
evolution chains with external data have been analyzed as well.
Optimising Selective Sampling for Bootstrapping Named Entity Recognition
Training a statistical named entity recognition system in a new domain requires costly manual annotation of large quantities of in-domain data. Active learning promises to reduce the annotation cost by selecting only highly informative data points. This paper is concerned with a real active learning experiment to bootstrap a named entity recognition system for a new domain of radio astronomical abstracts. We evaluate several committee-based metrics for quantifying the disagreement between classifiers built using multiple views, and demonstrate that the choice of metric can be optimised in simulation experiments with existing annotated data from different domains. A final evaluation shows that we gained substantial savings compared to a randomly sampled baseline.
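A typical committee-based disagreement metric of this kind is vote entropy over the committee members' label votes for a candidate example. This is a generic formulation, not necessarily the exact variant evaluated in the paper:

```python
import math
from collections import Counter

def vote_entropy(votes):
    """Entropy of the committee's label votes for one example:
    0 when all members agree, maximal when the votes split evenly."""
    k = len(votes)
    return -sum((c / k) * math.log2(c / k) for c in Counter(votes).values())

# Unanimous committee -> no disagreement; even two-way split -> 1 bit.
low = vote_entropy(["PER", "PER", "PER"])
high = vote_entropy(["PER", "LOC"])
```

In selective sampling, the examples with the highest disagreement are sent to annotators first, since the committee's uncertainty marks them as the most informative.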
Resolving transition metal chemical space: feature selection for machine learning and structure-property relationships
Machine learning (ML) of quantum mechanical properties shows promise for
accelerating chemical discovery. For transition metal chemistry where accurate
calculations are computationally costly and available training data sets are
small, the molecular representation becomes a critical ingredient in ML model
predictive accuracy. We introduce a series of revised autocorrelation functions
(RACs) that encode relationships between the heuristic atomic properties (e.g.,
size, connectivity, and electronegativity) on a molecular graph. We alter the
starting point, scope, and nature of the quantities evaluated in standard ACs
to make these RACs amenable to inorganic chemistry. On an organic molecule set,
we first demonstrate superior standard AC performance to other
presently-available topological descriptors for ML model training, with mean
unsigned errors (MUEs) for atomization energies on set-aside test molecules as
low as 6 kcal/mol. For inorganic chemistry, our RACs yield 1 kcal/mol ML MUEs
on set-aside test molecules in spin-state splitting in comparison to 15-20x
higher errors from feature sets that encode whole-molecule structural
information. Systematic feature selection methods including univariate
filtering, recursive feature elimination, and direct optimization (e.g., random
forest and LASSO) are compared. Random-forest- or LASSO-selected subsets 4-5x
smaller than RAC-155 produce sub- to 1-kcal/mol spin-splitting MUEs, with good
transferability to metal-ligand bond length prediction (0.004-5 {\AA} MUE) and
redox potential on a smaller data set (0.2-0.3 eV MUE). Evaluation of feature
selection results across property sets reveals the relative importance of
local, electronic descriptors (e.g., electronegativity, atomic number) in
spin-splitting and distal, steric effects in redox potential and bond lengths.
Comment: 43 double-spaced pages, 11 figures, 4 tables
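The standard product autocorrelations that the RACs revise sum the product of an atomic property over all atom pairs at a fixed graph distance. A minimal sketch on an adjacency-list molecular graph (the graph and property values below are made up for illustration):

```python
from collections import deque

def autocorrelation(adj, props, depth):
    """Product autocorrelation: sum of props[i] * props[j] over all
    ordered atom pairs (i, j) at shortest-path distance `depth`."""
    total = 0.0
    for i in adj:
        dist = {i: 0}                 # BFS shortest paths from atom i
        queue = deque([i])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(props[i] * props[j] for j, d in dist.items() if d == depth)
    return total

# Toy 3-atom chain with made-up property values (e.g. electronegativities).
adj = {0: [1], 1: [0, 2], 2: [1]}
props = [1.0, 2.0, 3.0]
ac0 = autocorrelation(adj, props, 0)  # 1 + 4 + 9 = 14
ac1 = autocorrelation(adj, props, 1)
```

The RACs then alter the starting atoms, the scope of the sum, and the nature of the pairwise operation to make such descriptors suitable for inorganic complexes, as the abstract describes.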