CUSBoost: Cluster-based Under-sampling with Boosting for Imbalanced Classification
Class imbalance classification is a challenging research problem in data
mining and machine learning, as most of the real-life datasets are often
imbalanced in nature. Existing learning algorithms maximise the classification
accuracy by correctly classifying the majority class, but misclassify the
minority class. However, in real-life applications the minority class
instances represent the concept of greater interest than the majority class
instances. Recently, several techniques based on sampling methods
(under-sampling of the majority class and over-sampling of the minority class),
cost-sensitive learning methods, and ensemble learning have been used in the
literature for classifying imbalanced datasets. In this paper, we introduce a
new clustering-based under-sampling approach combined with the boosting
(AdaBoost) algorithm, called CUSBoost, for effective imbalanced classification.
The proposed algorithm provides an alternative to the RUSBoost (random
under-sampling with AdaBoost) and SMOTEBoost (synthetic minority over-sampling
with AdaBoost) algorithms. We compared the performance of the CUSBoost
algorithm with state-of-the-art ensemble learning methods such as AdaBoost,
RUSBoost, and SMOTEBoost on 13 imbalanced binary and multi-class datasets with
various imbalance ratios. The experimental results show that CUSBoost is a
promising and effective approach for dealing with highly imbalanced datasets.
Comment: CSITSS-201
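The cluster-based under-sampling idea can be illustrated with a minimal sketch. The abstract does not give the algorithm's details, so everything here is an assumption made for illustration: the tiny k-means routine, the hypothetical helper name `cluster_undersample`, and the parameters `k` and `keep`. The sketch clusters the majority class and keeps a random fraction of each cluster, so that the reduced majority set still covers every region of the majority class. In a boosting setup, a step like this would presumably be repeated before each round's weak learner is trained.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Tiny k-means on tuples of floats; returns a cluster index per point."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign each point to its nearest centroid (squared Euclidean distance).
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Move each centroid to the mean of its members; leave it if the cluster is empty.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels

def cluster_undersample(majority, minority, k=3, keep=0.5, seed=0):
    """Hypothetical helper: keep a random fraction `keep` of every majority cluster."""
    rng = random.Random(seed)
    labels = kmeans(majority, k, seed=seed)
    sampled = []
    for c in range(k):
        members = [p for p, lab in zip(majority, labels) if lab == c]
        if members:
            n = max(1, round(keep * len(members)))  # at least one point per cluster
            sampled.extend(rng.sample(members, n))
    # The minority class is returned untouched.
    return sampled, list(minority)
```

Sampling within clusters, rather than uniformly at random as in random under-sampling, is what keeps sparse regions of the majority class represented in the reduced training set.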
An evaluation of DNA-damage response and cell-cycle pathways for breast cancer classification
Accurate subtyping or classification of breast cancer is important for
ensuring proper treatment of patients and also for understanding the molecular
mechanisms driving this disease. While there have been several gene signatures
proposed in the literature to classify breast tumours, these signatures show
very low overlaps, different classification performance, and not much relevance
to the underlying biology of these tumours. Here we evaluate DNA-damage
response (DDR) and cell cycle pathways, which are critical pathways implicated
in a considerable proportion of breast tumours, for their usefulness and
ability in breast tumour subtyping. We think that subtyping breast tumours
based on these two pathways could lead to vital insights into molecular
mechanisms driving these tumours. Here, we performed a systematic evaluation of
DDR and cell-cycle pathways for subtyping of breast tumours into the five known
intrinsic subtypes. The Homologous Recombination (HR) pathway showed the best
performance in subtyping breast tumours, indicating that HR genes are strongly
involved in all breast tumours. Comparisons of pathway-based signatures and two
standard gene signatures supported the use of known pathways for breast tumour
subtyping. Further, the evaluation of these standard gene signatures showed
that breast tumour subtyping, prognosis and survival estimation are all closely
related. Finally, we constructed an all-inclusive super-signature by combining
(union of) all genes and performing a stringent feature selection, and found it
to be reasonably accurate and robust in both classification and prognostic
value. Adopting DDR and cell-cycle pathways achieved robust and accurate breast
tumour subtyping, and constructing a super-signature that contains a
feature-selected mix of genes from these molecular pathways, as well as
clinical aspects, is valuable in clinical practice.
Comment: 28 pages, 7 figures, 6 table
How to use the Kohonen algorithm to simultaneously analyse individuals in a survey
The Kohonen algorithm (SOM, Kohonen, 1984, 1995) is a very powerful tool for
data analysis. It was originally designed to model organized connections
between some biological neural networks. It was also immediately recognized as
a very good algorithm for vector quantization and, at the same time, for
pertinent classification, with nice properties for visualization. If the
individuals are described by quantitative variables (ratios, frequencies,
measurements, amounts, etc.), the straightforward application of the original
algorithm builds code vectors and associates with each of them the class of all
the individuals that are more similar to this code vector than to the others.
However, for individuals described by categorical (qualitative) variables
having a finite number of modalities (as in a survey), it is necessary to
define a specific algorithm. In this paper, we present a new algorithm inspired
by the SOM algorithm, which provides a simultaneous classification of the
individuals and of their modalities.
Comment: Special issue ESANN 0
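For the quantitative case described above, a standard one-dimensional SOM can be sketched as follows. This illustrates only the original algorithm (code vectors, with each individual assigned to its most similar code vector), not the paper's new method for categorical variables, which the abstract does not specify. The function names, the linear decay schedules, and the Gaussian neighbourhood are assumptions chosen for brevity.

```python
import math
import random

def best_unit(codes, x):
    """Index of the code vector closest to x (squared Euclidean distance)."""
    return min(range(len(codes)),
               key=lambda u: sum((a - b) ** 2 for a, b in zip(x, codes[u])))

def train_som(data, n_units=4, epochs=50, lr0=0.5, radius0=2.0, seed=0):
    """Train a 1-D chain of code vectors on quantitative data (list of tuples)."""
    rng = random.Random(seed)
    dim = len(data[0])
    # Initialise every code vector on a randomly chosen data point.
    codes = [list(rng.choice(data)) for _ in range(n_units)]
    total = epochs * len(data)
    step = 0
    for _ in range(epochs):
        order = list(data)
        rng.shuffle(order)
        for x in order:
            frac = step / total
            lr = lr0 * (1 - frac)                     # learning rate decays to 0
            radius = max(radius0 * (1 - frac), 0.5)   # neighbourhood shrinks over time
            winner = best_unit(codes, x)
            for u in range(n_units):
                # Units near the winner on the chain are pulled more strongly toward x.
                h = math.exp(-((u - winner) ** 2) / (2 * radius ** 2))
                for d in range(dim):
                    codes[u][d] += lr * h * (x[d] - codes[u][d])
            step += 1
    return codes
```

After training, `best_unit(codes, x)` gives the class of an individual `x`, and neighbouring units on the chain hold similar code vectors, which is the property that makes the map useful for visualization.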
Steps toward a classifier for the Virtual Observatory. I. Classifying the SDSS photometric archive
Modern photometric multiband digital surveys produce large amounts of data
that, in order to be effectively exploited, require automatic tools capable of
extracting an objective classification from photometric data. We present here a
new method for classifying objects in large multi-parametric photometric
databases, consisting of a combination of a clustering algorithm and a cluster
agglomeration tool. The generalization capabilities and the potential of
this approach are tested against the complexity of the Sloan Digital Sky Survey
archive, for which an example application is reported.
Comment: To appear in the Proceedings of the "1st Workshop of Astronomy and
Astrophysics for Students" - Naples, 19-20 April 200