Pathway Histogram Analysis of Trajectories: A general strategy for quantification of molecular mechanisms
A key overall goal of biomolecular simulations is the characterization of
"mechanism" -- the pathways through configuration space of processes such as
conformational transitions and binding. Some amount of heterogeneity is
intrinsic to the ensemble of pathways, in direct analogy to thermal
configurational ensembles. Quantification of that heterogeneity is essential to
a complete understanding of mechanism. We propose a general approach for
characterizing path ensembles based on mapping individual trajectories into
pathway classes whose populations and uncertainties can be analyzed as an
ordinary histogram, providing a quantitative "fingerprint" of mechanism. In
contrast to prior flux-based analyses used for discrete-state models,
stochastic deviations from average behavior are explicitly included via direct
classification of trajectories. The histogram approach, furthermore, is
applicable to analysis of continuous trajectories. It enables straightforward
comparison between ensembles produced by different methods or under different
conditions. To implement the formulation, we develop approaches for classifying
trajectories, including a clustering-based approach suitable for both
continuous-space (e.g., molecular dynamics) or discrete-state (e.g., Markov
state model) trajectories, as well as a "fundamental sequence" approach
tailored for discrete-state trajectories but also applicable to continuous
trajectories through a mapping process. We apply the pathway histogram analysis
to a toy model and an extremely long atomistic molecular dynamics trajectory of
protein folding.
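As a minimal sketch of the classification step, the toy code below maps discrete-state trajectories into pathway classes and counts them as a histogram. The loop-erasing rule used here is a hypothetical simplification of the paper's "fundamental sequence" construction, and the state names are made up:

```python
from collections import Counter

def fundamental_sequence(traj):
    """Collapse a discrete-state trajectory to a loop-erased state sequence
    (a hypothetical stand-in for the paper's fundamental-sequence mapping)."""
    seq = []
    for s in traj:
        if s in seq:                       # revisiting a state erases the loop
            seq = seq[:seq.index(s) + 1]
        else:
            seq.append(s)
    return tuple(seq)

# Toy ensemble of discrete-state trajectories from state A to state D.
trajectories = [
    ["A", "B", "C", "D"],
    ["A", "B", "A", "B", "C", "D"],        # the A->B->A loop is erased
    ["A", "C", "D"],
    ["A", "B", "C", "D"],
]

# The pathway histogram: populations of each pathway class.
histogram = Counter(fundamental_sequence(t) for t in trajectories)
```

On this toy ensemble the histogram exposes the heterogeneity directly: three trajectories fall into the A-B-C-D class and one into A-C-D.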
Predictive Modeling of ICU Healthcare-Associated Infections from Imbalanced Data. Using Ensembles and a Clustering-Based Undersampling Approach
Early detection of patients vulnerable to infections acquired in the hospital
environment is a challenge in current health systems given the impact that such
infections have on patient mortality and healthcare costs. This work is focused
on both the identification of risk factors and the prediction of
healthcare-associated infections in intensive-care units by means of
machine-learning methods. The aim is to support decision making addressed at
reducing the incidence rate of infections. In this field, it is necessary to
deal with the problem of building reliable classifiers from imbalanced
datasets. We propose a clustering-based undersampling strategy to be used in
combination with ensemble classifiers. A comparative study with data from 4616
patients was conducted in order to validate our proposal. We applied several
single and ensemble classifiers both to the original dataset and to data
preprocessed by means of different resampling methods. The results were
analyzed by means of classic and recent metrics specifically designed for
imbalanced data classification. They revealed that the proposal is more
efficient in comparison with other approaches
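The clustering-based undersampling idea can be sketched in miniature as follows, assuming a toy one-dimensional feature and a naive single-pass leader clustering in place of the actual clustering algorithm used in the paper:

```python
def leader_clusters(points, radius):
    """Naive single-pass leader clustering of 1-D points (a stand-in for
    the real clustering algorithm used to group majority samples)."""
    clusters = []                            # each cluster is a list of points
    for p in points:
        for c in clusters:
            if abs(c[0] - p) <= radius:      # close enough to this leader
                c.append(p)
                break
        else:
            clusters.append([p])             # p becomes a new leader
    return clusters

def cluster_undersample(majority, minority, radius):
    """Replace the majority class by one centroid per cluster, yielding
    a roughly balanced training set of (x, label) pairs."""
    reps = [sum(c) / len(c) for c in leader_clusters(majority, radius)]
    return [(x, 1) for x in minority] + [(x, 0) for x in reps]

majority = [0.0, 0.1, 0.2, 5.0, 5.1, 9.7, 9.9]   # 7 "no infection" samples
minority = [3.0, 6.5, 8.0]                        # 3 "infection" samples
balanced = cluster_undersample(majority, minority, radius=0.5)
```

Because one representative is kept per majority cluster rather than per random draw, dense regions of the majority class are preserved instead of being discarded at random; each base classifier of the ensemble would then be trained on such a balanced set.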
Probabilistic Combination of Classifier and Cluster Ensembles for Non-transductive Learning
Unsupervised models can provide supplementary soft constraints to help
classify new target data under the assumption that similar objects in the
target set are more likely to share the same class label. Such models can also
help detect possible differences between training and target distributions,
which is useful in applications where concept drift may take place. This paper
describes a Bayesian framework that takes as input class labels from existing
classifiers (designed based on labeled data from the source domain), as well as
cluster labels from a cluster ensemble operating solely on the target data to
be classified, and yields a consensus labeling of the target data. This
framework is particularly useful when the statistics of the target data drift
or change from those of the training data. We also show that the proposed
framework is privacy-aware and allows performing distributed learning when
data/models have sharing restrictions. Experiments show that our framework can
yield superior results to those obtained by applying classifier ensembles alone.
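The core intuition — similar target instances likely share a label — can be illustrated by naively averaging classifier posteriors within target clusters. This is only an illustrative stand-in for the paper's Bayesian consensus, and all names below are hypothetical:

```python
from collections import defaultdict

def cluster_smoothed_labels(posteriors, clusters):
    """Average classifier posteriors over cluster-mates in the target
    data, then take the argmax class per instance."""
    groups = defaultdict(list)
    for i, c in enumerate(clusters):
        groups[c].append(i)
    labels = []
    for i, p in enumerate(posteriors):
        mates = groups[clusters[i]]
        avg = [sum(posteriors[j][k] for j in mates) / len(mates)
               for k in range(len(p))]
        labels.append(max(range(len(avg)), key=avg.__getitem__))
    return labels

# Classifier posteriors for 4 target points; point 1 alone would be
# labeled class 1 (0.55), but its cluster-mate pulls it to class 0.
posteriors = [[0.9, 0.1], [0.45, 0.55], [0.4, 0.6], [0.2, 0.8]]
clusters = [0, 0, 1, 1]          # cluster-ensemble labels on the target data
consensus = cluster_smoothed_labels(posteriors, clusters)
```

The cluster constraint overrides the classifier on point 1, which is exactly the kind of correction that helps when the target distribution has drifted from the source.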
Progressive Boosting for Class Imbalance
Pattern recognition applications often suffer from skewed data distributions
between classes, which may vary during operations w.r.t. the design data.
Two-class classification systems designed using skewed data tend to recognize
the majority class better than the minority class of interest. Several
data-level techniques have been proposed to alleviate this issue by up-sampling
minority samples or under-sampling majority samples. However, some informative
samples may be neglected by random under-sampling and adding synthetic positive
samples through up-sampling adds to training complexity. In this paper, a new
ensemble learning algorithm called Progressive Boosting (PBoost) is proposed
that progressively inserts uncorrelated groups of samples into a Boosting
procedure to avoid loss of information while generating a diverse pool of
classifiers. Base classifiers in this ensemble are generated from one iteration
to the next, using subsets from a validation set that grows gradually in size
and imbalance. Consequently, PBoost is more robust to unknown and variable
levels of skew in operational data, and has lower computational complexity than
Boosting ensembles in the literature. In PBoost, a new loss factor is proposed to
avoid bias of performance towards the negative class. Using this loss factor,
the weight update of samples and classifier contribution in final predictions
are set based on the ability to recognize both classes. Using the proposed loss
factor instead of standard accuracy can avoid biasing performance in any
Boosting ensemble. The proposed approach was validated and compared using
synthetic data, videos from the FIA dataset that emulates face
re-identification applications, and the KEEL collection of datasets. Results show
that PBoost can outperform state-of-the-art techniques in terms of both
accuracy and complexity over different levels of imbalance and overlap between
classes.
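The idea of a loss factor that avoids bias toward the negative class can be illustrated by averaging the per-class error rates; this is an illustrative stand-in, not PBoost's exact formula:

```python
def balanced_loss(y_true, y_pred):
    """Mean of the per-class error rates, so both classes count equally
    regardless of skew (illustrative; not PBoost's exact loss factor)."""
    errs = []
    for cls in (0, 1):
        idx = [i for i, y in enumerate(y_true) if y == cls]
        errs.append(sum(y_pred[i] != cls for i in idx) / len(idx))
    return sum(errs) / len(errs)

# Skewed data: 8 negatives, 2 positives.  Always predicting the majority
# class looks good under plain error but not under the balanced loss.
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 10
plain_error = sum(a != b for a, b in zip(y_true, y_pred)) / len(y_true)
bal_error = balanced_loss(y_true, y_pred)
```

Here the plain error is only 0.2 while the balanced loss is 0.5, reflecting that the minority class is missed entirely; weighting sample updates and classifier votes by such a measure keeps a Boosting ensemble from drifting toward the negative class.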
A Classifier-free Ensemble Selection Method based on Data Diversity in Random Subspaces
The Ensemble of Classifiers (EoC) has been shown to be effective in improving
the performance of single classifiers by combining their outputs, and one of
the most important properties involved in the selection of the best EoC from a
pool of classifiers is considered to be classifier diversity. In general,
classifier diversity does not occur randomly, but is generated systematically
by various ensemble creation methods. By using diverse data subsets to train
classifiers, these methods can create diverse classifiers for the EoC. In this
work, we propose a scheme to measure data diversity directly from random
subspaces, and explore the possibility of using it to select the best data
subsets for the construction of the EoC. Our scheme is the first ensemble
selection method to be presented in the literature based on the concept of data
diversity. Its main advantage over the traditional framework (ensemble creation
then selection) is that it obviates the need for classifier training prior to
ensemble selection. A single Genetic Algorithm (GA) and a Multi-Objective
Genetic Algorithm (MOGA) were evaluated to search for the best solutions for
the classifier-free ensemble selection. In both cases, objective functions
based on different clustering diversity measures were implemented and tested.
All the results obtained with the proposed classifier-free ensemble selection
method were compared with the traditional classifier-based ensemble selection
using Mean Classifier Error (ME) and Majority Voting Error (MVE). The
applicability of the method is tested on UCI machine learning problems and NIST
SD19 handwritten numerals.
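A simple pairwise-disagreement measure between two clusterings illustrates the kind of clustering-diversity objective a GA could optimize over random subspaces; this particular measure is an illustrative choice, as the paper evaluates several:

```python
import random

def partition_diversity(p1, p2):
    """Fraction of sample pairs on which two clusterings disagree about
    being co-clustered (one simple clustering-diversity measure)."""
    n, disagree, pairs = len(p1), 0, 0
    for i in range(n):
        for j in range(i + 1, n):
            pairs += 1
            disagree += (p1[i] == p1[j]) != (p2[i] == p2[j])
    return disagree / pairs

# Hypothetical random subspaces: feature-index subsets a GA would score
# by the diversity of the clusterings they induce.
rng = random.Random(0)
subspaces = [sorted(rng.sample(range(10), 4)) for _ in range(3)]

# Diversity of two clusterings of the same 4 samples, e.g. obtained
# from data projected onto two different subspaces.
div = partition_diversity([0, 0, 1, 1], [0, 1, 0, 1])
```

Because this score is computed from the clusterings of the data subsets alone, the GA can search for a diverse ensemble before any classifier is ever trained, which is the key saving over the create-then-select framework.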
An Optimization Framework for Semi-Supervised and Transfer Learning using Multiple Classifiers and Clusterers
Unsupervised models can provide supplementary soft constraints to help
classify new, "target" data since similar instances in the target set are more
likely to share the same class label. Such models can also help detect possible
differences between training and target distributions, which is useful in
applications where concept drift may take place, as in transfer learning
settings. This paper describes a general optimization framework that takes as
input class membership estimates from existing classifiers learnt on previously
encountered "source" data, as well as a similarity matrix from a cluster
ensemble operating solely on the target data to be classified, and yields a
consensus labeling of the target data. This framework admits a wide range of
loss functions and classification/clustering methods. It exploits properties of
Bregman divergences in conjunction with Legendre duality to yield a principled
and scalable approach. A variety of experiments show that the proposed
framework can yield results substantially superior to those provided by popular
transductive learning techniques or by naively applying classifiers learnt on
the original task to the target data.
Ensemble Classifiers and Their Applications: A Review
An ensemble classifier is a group of individual classifiers that are
cooperatively trained on a data set in a supervised classification problem. In
this paper we present a review of commonly used ensemble classifiers in the
literature. Some ensemble classifiers are also developed targeting specific
applications. We also present some application-driven ensemble classifiers in
this paper.
Comment: Published in the International Journal of Computer Trends and
Technology (IJCTT).
Heuristic Ternary Error-Correcting Output Codes Via Weight Optimization and Layered Clustering-Based Approach
One important classifier ensemble for multiclass classification problems is
Error-Correcting Output Codes (ECOCs). It bridges multiclass problems and
binary-class classifiers by decomposing a multiclass problem into a series of
binary-class problems. In this paper, we present a heuristic ternary code,
named Weight Optimization and Layered Clustering-based ECOC (WOLC-ECOC). It
starts with an arbitrary valid ECOC and iterates the following two steps until
the training risk converges. The first step, named Layered Clustering based
ECOC (LC-ECOC), constructs multiple strong classifiers on the most confusing
binary-class problem. The second step adds the new classifiers to ECOC by a
novel Optimized Weighted (OW) decoding algorithm, where the optimization
problem of the decoding is solved by the cutting plane algorithm. Technically,
LC-ECOC prevents the heuristic training process from being blocked by a
difficult binary-class problem, and OW decoding guarantees that the training
risk does not increase, which keeps the code length small. Results on 14 UCI
datasets and a music genre classification problem demonstrate the effectiveness
of WOLC-ECOC.
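Standard ternary ECOC decoding — the baseline that OW decoding refines — can be sketched as follows; the code matrix and dichotomizer outputs below are made up for illustration:

```python
def ecoc_decode(code_matrix, outputs):
    """Hamming-style ternary ECOC decoding: a 0 codeword entry means the
    class is ignored by that dichotomizer, so it is skipped in the distance."""
    best, best_dist = None, float("inf")
    for cls, codeword in code_matrix.items():
        dist = sum(1 for c, o in zip(codeword, outputs) if c != 0 and c != o)
        if dist < best_dist:
            best, best_dist = cls, dist
    return best

# A made-up ternary code for three classes over three binary problems.
codes = {
    "A": [+1, +1,  0],
    "B": [-1,  0, +1],
    "C": [ 0, -1, -1],
}
pred = ecoc_decode(codes, [+1, +1, -1])   # outputs of the three dichotomizers
```

OW decoding replaces this uniform Hamming distance with optimized per-column weights, which is what allows new classifiers for the most confusing binary problem to be added without increasing the training risk.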
Features in Concert: Discriminative Feature Selection meets Unsupervised Clustering
Feature selection is an essential problem in computer vision, important for
category learning and recognition. Along with the rapid development of a wide
variety of visual features and classifiers, there is a growing need for
efficient feature selection and combination methods, to construct powerful
classifiers for more complex and higher-level recognition tasks. We propose an
algorithm that efficiently discovers sparse, compact representations of input
features or classifiers, from a vast sea of candidates, with important
optimality properties, low computational cost and excellent accuracy in
practice. Different from boosting, we start with a discriminant linear
classification formulation that encourages sparse solutions. Then we obtain an
equivalent unsupervised clustering problem that jointly discovers ensembles of
diverse features. They are independently valuable but even more powerful when
united in a cluster of classifiers. We evaluate our method on the task of
large-scale recognition in video and show that it significantly outperforms
classical selection approaches, such as AdaBoost and greedy forward-backward
selection, and powerful classifiers such as SVMs, in both training speed and
performance, especially in the case of limited training data.
Cluster ensembles for high-dimensional clustering: an empirical study
This paper studies cluster ensembles for high-dimensional data clustering. We examine three different approaches to constructing cluster ensembles. To address high dimensionality, we focus on ensemble construction methods that build on two popular dimension reduction techniques, random projection and principal component analysis (PCA). We present evidence showing that ensembles generated by random projection perform better than those by PCA and further that this can be attributed to the capability of random projection to produce diverse base clusterings. We also examine four different consensus functions for combining the clusterings of the ensemble. We compare their performance using two types of ensembles, each with different properties. In both cases, we show that a recent consensus function based on bipartite graph partitioning achieves the best performance.
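The co-association (evidence-accumulation) consensus, one of the simpler consensus functions in this family, can be sketched as follows; the paper's best performer, bipartite graph partitioning, is more involved, and the base clusterings here are made up:

```python
def coassociation(base_labelings):
    """Co-association matrix: entry (i, j) is the fraction of base
    clusterings that put samples i and j in the same cluster."""
    n = len(base_labelings[0])
    M = [[0.0] * n for _ in range(n)]
    for labels in base_labelings:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    M[i][j] += 1.0 / len(base_labelings)
    return M

def consensus(base_labelings, threshold=0.5):
    """Consensus clustering as connected components of the thresholded
    co-association graph (evidence accumulation)."""
    M = coassociation(base_labelings)
    n, labels, cur = len(M), [-1] * len(M), 0
    for i in range(n):
        if labels[i] == -1:
            stack, labels[i] = [i], cur
            while stack:                      # flood-fill one component
                u = stack.pop()
                for v in range(n):
                    if labels[v] == -1 and M[u][v] > threshold:
                        labels[v] = cur
                        stack.append(v)
            cur += 1
    return labels

# Three base clusterings of 4 samples (e.g., from different random
# projections); the third is noisy, and the consensus outvotes it.
base = [[0, 0, 1, 1], [0, 0, 1, 1], [1, 1, 1, 0]]
final = consensus(base)
```

The majority structure survives the noisy base clustering, which is the diversity-averaging effect the paper credits for random projection's advantage over PCA-based ensembles.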