Coping with new Challenges in Clustering and Biomedical Imaging
Recent years have seen a tremendous increase in data acquisition in scientific fields such as molecular biology, bioinformatics and biomedicine. Novel methods are therefore needed for the automatic processing and analysis of these large amounts of data. Data mining is the process of applying methods such as clustering or classification to large databases in order to uncover hidden patterns. Clustering is the task of partitioning the points of a data set into distinct groups so as to maximize the intra-cluster similarity and minimize the inter-cluster similarity. In contrast to unsupervised learning tasks such as clustering, classification is a supervised learning problem: it aims to predict the group membership of data objects on the basis of rules learned from a training set in which the group membership is known.
Specialized methods have been proposed for hierarchical and partitioning clustering. However, these methods suffer from several drawbacks. In the first part of this work, new clustering methods are proposed that address problems of conventional clustering algorithms. ITCH (Information-Theoretic Cluster Hierarchies) is a hierarchical clustering method based on a hierarchical variant of the Minimum Description Length (MDL) principle; it finds hierarchies of clusters without requiring input parameters. As ITCH may converge only to a local optimum, we propose GACH (Genetic Algorithm for Finding Cluster Hierarchies), which combines the benefits of genetic algorithms with information theory. In this way the search space is explored more effectively.
Furthermore, we propose INTEGRATE, a novel clustering method for data with mixed numerical and categorical attributes. Supported by the MDL principle, our method integrates the information provided by heterogeneous numerical and categorical attributes and thus naturally balances the influence of both sources of information. A comparative evaluation illustrates that INTEGRATE is more effective than existing clustering methods for mixed-type data. Besides clustering methods for single data objects, we provide a solution for clustering different data sets that are represented by their skylines. The skyline operator is a well-established database primitive for finding database objects which minimize two or more attributes with an unknown weighting between these attributes. In this thesis, we define a similarity measure, called SkyDist, for comparing skylines of different data sets that can be integrated directly into data mining tasks such as clustering or classification. The experiments show that SkyDist in combination with different clustering algorithms can give useful insights into many applications.
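The skyline operator described above can be illustrated with a minimal sketch. It computes only the skyline of a single data set by pairwise dominance checks; it is not the SkyDist measure, whose definition is not given here. The names `dominates`, `skyline` and the example data `hotels` are hypothetical and chosen for illustration only.

```python
from typing import List, Tuple

Point = Tuple[float, ...]


def dominates(a: Point, b: Point) -> bool:
    """Return True if a is no worse than b in every attribute and
    strictly better in at least one (all attributes are minimized)."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def skyline(points: List[Point]) -> List[Point]:
    """Naive O(n^2) skyline: keep every point that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical example: (price, distance to beach), both to be minimized.
hotels = [(50, 8.0), (80, 2.0), (60, 5.0), (90, 6.0), (70, 5.0)]
result = skyline(hotels)
print(result)
```

In this example `(90, 6.0)` and `(70, 5.0)` are dropped because `(60, 5.0)` is at least as good in both attributes. The remaining points are optimal for some weighting of price against distance, which is why no weighting has to be fixed in advance.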
In the second part, we focus on the analysis of high-resolution magnetic resonance images (MRI) that are clinically relevant and may allow for an early detection and diagnosis of several diseases. In particular, we propose a framework for the classification of Alzheimer's disease in MR images combining the data mining steps of feature selection, clustering and classification. As a result, a set of highly selective features discriminating between patients with Alzheimer's disease and healthy people has been identified. However, the analysis of the high-dimensional MR images is extremely time-consuming. Therefore we developed JGrid, a scalable distributed computing solution designed to allow for a large-scale analysis of MRI and thus an optimized prediction of diagnosis. In another study we apply efficient algorithms for motif discovery to task-fMRI scans in order to identify patterns in the brain that are characteristic of patients with somatoform pain disorder. We find groups of brain compartments that occur frequently within the brain networks and discriminate well between healthy and diseased people.
Model-Based Method for Social Network Clustering
We propose a simple mixed membership model for social network clustering in
this note. A flexible function is adopted to measure affinities among a set of
entities in a social network. The model not only allows each entity in the
network to possess more than one membership, but also provides accurate
statistical inference about network structure. We estimate the membership
parameters by using an MCMC algorithm. We evaluate the performance of the
proposed algorithm by applying our model to two empirical social network data sets,
the Zachary club data and the bottlenose dolphin network data. We also conduct
some numerical studies for different types of simulated networks for assessing
the effectiveness of our algorithm. Finally, we briefly present some concluding
remarks and directions for future work.
A taxonomy framework for unsupervised outlier detection techniques for multi-type data sets
The term "outlier" can generally be defined as an observation that is significantly different from
the other values in a data set. The outliers may be instances of error or indicate events. The
task of outlier detection aims at identifying such outliers in order to improve the analysis of
data and further discover interesting and useful knowledge about unusual events within numerous
application domains. In this paper, we report on contemporary unsupervised outlier detection
techniques for multiple types of data sets and provide a comprehensive taxonomy framework and
two decision trees for selecting the most suitable technique based on the data set. Furthermore, we
highlight the advantages, disadvantages and performance issues of each class of outlier detection
techniques under this taxonomy framework.
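As a concrete instance of the definition above (an observation that differs significantly from the other values), the following sketch flags outliers in one-dimensional numerical data using the modified z-score based on the median absolute deviation (MAD). It is one simple statistical technique chosen for illustration and is not taken from the taxonomy in the paper. The function name `mad_outliers`, the threshold of 3.5 and the sample readings are assumptions.

```python
from statistics import median
from typing import List


def mad_outliers(values: List[float], threshold: float = 3.5) -> List[float]:
    """Return the values whose modified z-score exceeds the threshold.

    The modified z-score is 0.6745 * |x - median| / MAD, where MAD is the
    median absolute deviation. The threshold 3.5 is a common rule of thumb
    and is an assumption here, not a value from the paper.
    """
    if not values:
        return []
    center = median(values)
    mad = median(abs(x - center) for x in values)
    if mad == 0:
        # More than half of the values are identical; the score is undefined.
        return []
    return [x for x in values if 0.6745 * abs(x - center) / mad > threshold]


# Hypothetical sensor readings with one obviously deviating value.
readings = [10.1, 9.8, 10.3, 10.0, 9.9, 25.0, 10.2]
flagged = mad_outliers(readings)
print(flagged)
```

The median and MAD are used instead of the mean and standard deviation because a single extreme value shifts them far less, so the outlier does not mask itself.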
Relation between Financial Market Structure and the Real Economy: Comparison between Clustering Methods
We quantify the amount of information filtered by different hierarchical
clustering methods on correlations between stock returns comparing it with the
underlying industrial activity structure. Specifically, we apply, for the first
time to financial data, a novel hierarchical clustering approach, the Directed
Bubble Hierarchical Tree and we compare it with other methods including the
Linkage and k-medoids. In particular, by taking the industrial sector
classification of stocks as a benchmark partition, we evaluate how the
different methods retrieve this classification. The results show that the
Directed Bubble Hierarchical Tree can outperform other methods, being able to
retrieve more information with fewer clusters. Moreover, we show that the
economic information is hidden at different levels of the hierarchical
structures depending on the clustering method. The dynamical analysis on a
rolling window also reveals that the different methods show different degrees
of sensitivity to events affecting financial markets, like crises. These
results can be of interest for all the applications of clustering methods to
portfolio optimization and risk hedging.
Comment: 31 pages, 17 figures
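Comparing a clustering against a benchmark partition, as described above for the industrial sector classification, can be sketched with the Adjusted Rand Index (ARI). The abstract does not say which agreement measure the authors use, so the ARI is an assumption here, and the function name `adjusted_rand_index` and the toy labels are hypothetical.

```python
from collections import Counter
from math import comb
from typing import Hashable, Sequence


def adjusted_rand_index(labels_a: Sequence[Hashable],
                        labels_b: Sequence[Hashable]) -> float:
    """Chance-corrected agreement between two partitions of the same items.

    1.0 means identical partitions; values near 0 mean agreement no better
    than expected at random.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same items")
    n = len(labels_a)
    if n < 2:
        raise ValueError("need at least two items to compare partitions")

    # Number of item pairs placed together in both, in A, and in B.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    expected = together_a * together_b / comb(n, 2)
    maximum = (together_a + together_b) / 2
    if maximum == expected:
        # Degenerate case, e.g. both partitions put every item in one cluster.
        return 1.0
    return (together_both - expected) / (maximum - expected)


# Hypothetical benchmark sectors for six stocks and two candidate clusterings.
sectors = ["tech", "tech", "tech", "energy", "energy", "energy"]
perfect = [0, 0, 0, 1, 1, 1]
partial = [0, 0, 1, 1, 2, 2]
ari_perfect = adjusted_rand_index(sectors, perfect)
ari_partial = adjusted_rand_index(sectors, partial)
print(ari_perfect, ari_partial)
```

Cutting a hierarchical tree at different levels and computing such a score at each level is one way to see at which level of the hierarchy the sector information is best recovered.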