41,572 research outputs found
Partial mixture model for tight clustering of gene expression time-course
Background: Tight clustering arose recently from a desire to obtain tighter and potentially more informative clusters in gene expression studies. Scattered genes with relatively loose correlations should be excluded from the clusters. However, in the literature there is little work dedicated to
this area of research. On the other hand, there has been extensive use of maximum likelihood techniques for model parameter estimation. By contrast, the minimum distance estimator has been largely ignored.
Results: In this paper we show the inherent robustness of the minimum distance estimator that makes it a powerful tool for parameter estimation in model-based time-course clustering. To apply minimum distance estimation, a partial mixture model that can naturally incorporate replicate
information and allow scattered genes is formulated. We provide experimental results of simulated data fitting, where the minimum distance estimator demonstrates superior performance to the maximum likelihood estimator. Both biological and statistical validations are conducted on a
simulated dataset and two real gene expression datasets. Our proposed partial regression clustering algorithm scores top in Gene Ontology driven evaluation, in comparison with four other popular clustering algorithms.
Conclusion: For the first time partial mixture model is successfully extended to time-course data analysis. The robustness of our partial regression clustering algorithm proves the suitability of the ombination of both partial mixture model and minimum distance estimator in this field. We show that tight clustering not only is capable to generate more profound understanding of the dataset
under study well in accordance to established biological knowledge, but also presents interesting new hypotheses during interpretation of clustering results. In particular, we provide biological evidences that scattered genes can be relevant and are interesting subjects for study, in contrast to prevailing opinion
Merging -means with hierarchical clustering for identifying general-shaped groups
Clustering partitions a dataset such that observations placed together in a
group are similar but different from those in other groups. Hierarchical and
-means clustering are two approaches but have different strengths and
weaknesses. For instance, hierarchical clustering identifies groups in a
tree-like structure but suffers from computational complexity in large datasets
while -means clustering is efficient but designed to identify homogeneous
spherically-shaped clusters. We present a hybrid non-parametric clustering
approach that amalgamates the two methods to identify general-shaped clusters
and that can be applied to larger datasets. Specifically, we first partition
the dataset into spherical groups using -means. We next merge these groups
using hierarchical methods with a data-driven distance measure as a stopping
criterion. Our proposal has the potential to reveal groups with general shapes
and structure in a dataset. We demonstrate good performance on several
simulated and real datasets.Comment: 16 pages, 1 table, 9 figures; accepted for publication in Sta
Steganographer Identification
Conventional steganalysis detects the presence of steganography within single
objects. In the real-world, we may face a complex scenario that one or some of
multiple users called actors are guilty of using steganography, which is
typically defined as the Steganographer Identification Problem (SIP). One might
use the conventional steganalysis algorithms to separate stego objects from
cover objects and then identify the guilty actors. However, the guilty actors
may be lost due to a number of false alarms. To deal with the SIP, most of the
state-of-the-arts use unsupervised learning based approaches. In their
solutions, each actor holds multiple digital objects, from which a set of
feature vectors can be extracted. The well-defined distances between these
feature sets are determined to measure the similarity between the corresponding
actors. By applying clustering or outlier detection, the most suspicious
actor(s) will be judged as the steganographer(s). Though the SIP needs further
study, the existing works have good ability to identify the steganographer(s)
when non-adaptive steganographic embedding was applied. In this chapter, we
will present foundational concepts and review advanced methodologies in SIP.
This chapter is self-contained and intended as a tutorial introducing the SIP
in the context of media steganography.Comment: A tutorial with 30 page
A survey on utilization of data mining approaches for dermatological (skin) diseases prediction
Due to recent technology advances, large volumes of medical data is obtained. These data contain valuable information. Therefore data mining techniques can be used to extract useful patterns. This paper is intended to introduce data mining and its various techniques and a survey of the available literature on medical data mining. We emphasize mainly on the application of data mining on skin diseases. A categorization has been provided based on the different data mining techniques. The utility of the various data mining methodologies is highlighted. Generally association mining is suitable for extracting rules. It has been used especially in cancer diagnosis. Classification is a robust method in medical mining. In this paper, we have summarized the different uses of classification in dermatology. It is one of the most important methods for diagnosis of erythemato-squamous diseases. There are different methods like Neural Networks, Genetic Algorithms and fuzzy classifiaction in this topic. Clustering is a useful method in medical images mining. The purpose of clustering techniques is to find a structure for the given data by finding similarities between data according to data characteristics. Clustering has some applications in dermatology. Besides introducing different mining methods, we have investigated some challenges which exist in mining skin data
- …