An Agent-Based Algorithm exploiting Multiple Local Dissimilarities for Clusters Mining and Knowledge Discovery
We propose a multi-agent algorithm able to automatically discover relevant
regularities in a given dataset, determining at the same time the set of
configurations of the adopted parametric dissimilarity measure yielding compact
and separated clusters. Each agent operates independently by performing a
Markovian random walk on a suitable weighted graph representation of the input
dataset. Such a weighted graph representation is induced by the specific
parameter configuration of the dissimilarity measure adopted by the agent,
which searches and takes decisions autonomously for one cluster at a time.
Results show that the algorithm is able to discover parameter configurations
that yield a consistent and interpretable collection of clusters. Moreover, we
demonstrate that our algorithm performs comparably to similar
state-of-the-art algorithms when facing specific clustering problems
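The mechanism described above, a Markovian random walk on a weighted graph induced by one parameter configuration of the dissimilarity measure, can be illustrated with a minimal sketch. The weighted Euclidean dissimilarity, the exponential edge weighting, and all function names are assumptions chosen for illustration; the abstract does not specify them.

```python
import math
import random

def weighted_dissimilarity(x, y, weights):
    """Parametric dissimilarity: weighted Euclidean distance (an assumed form)."""
    return math.sqrt(sum(w * (a - b) ** 2 for w, a, b in zip(weights, x, y)))

def transition_matrix(points, weights, scale=1.0):
    """Row-stochastic transition matrix of a random walk on the induced graph."""
    n = len(points)
    matrix = []
    for i in range(n):
        # Edge weight decays with dissimilarity (assumed kernel); no self-loops.
        row = [
            0.0 if i == j else
            math.exp(-weighted_dissimilarity(points[i], points[j], weights) / scale)
            for j in range(n)
        ]
        total = sum(row)
        matrix.append([v / total for v in row])
    return matrix

def random_walk(matrix, start, steps, rng):
    """Markovian walk: the next node depends only on the current node."""
    path = [start]
    for _ in range(steps):
        current = path[-1]
        path.append(rng.choices(range(len(matrix)), weights=matrix[current])[0])
    return path

points = [(0.0, 0.0), (0.1, 5.0), (4.0, 0.2), (4.1, 5.1)]
# This configuration ignores the second feature, so points 0 and 1 become neighbours.
P = transition_matrix(points, weights=(1.0, 0.0))
walk = random_walk(P, start=0, steps=10, rng=random.Random(0))
```

A different weight vector, for example `(0.0, 1.0)`, induces a different graph on the same data, which is why each agent's parameter configuration determines which clusters its walk can find.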
Modelling and recognition of protein contact networks by multiple kernel learning and dissimilarity representations
Multiple kernel learning is a paradigm which employs a properly constructed chain of kernel functions able to simultaneously analyse different data or different representations of the same data. In this paper, we propose a hybrid classification system based on a linear combination of multiple kernels defined over multiple dissimilarity spaces. The core of the training procedure is the joint optimisation of kernel weights and representatives selection in the dissimilarity spaces. This equips the system with a two-fold knowledge discovery phase: by analysing the weights, it is possible to check which representations are more suitable for solving the classification problem, whereas the pivotal patterns selected as representatives can give further insights on the modelled system, possibly with the help of field-experts. The proposed classification system is tested on real proteomic data in order to predict proteins' functional role starting from their folded structure: specifically, a set of eight representations is drawn from the graph-based protein folded description. The proposed multiple kernel-based system has also been benchmarked against a clustering-based classification system that is likewise able to exploit multiple dissimilarities simultaneously. Computational results show remarkable classification capabilities and the knowledge discovery analysis is in line with current biological knowledge, suggesting the reliability of the proposed system
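As a rough illustration of a linear combination of kernels defined over dissimilarity spaces, the sketch below embeds each object as its vector of dissimilarities to a set of representatives and sums weighted RBF kernels over those embeddings. The RBF kernel, the toy dissimilarities, the fixed weights, and every name here are assumptions; in the paper the weights and representatives are optimised jointly rather than fixed by hand.

```python
import math

def embed(x, representatives, dissimilarity):
    """Dissimilarity-space embedding: distances from x to each representative."""
    return [dissimilarity(x, r) for r in representatives]

def rbf_kernel(u, v, gamma=1.0):
    """Gaussian kernel between two embedded vectors (assumed kernel choice)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))

def combined_kernel(x, y, views, kernel_weights):
    """Linear combination: sum over views of weight * kernel in that view's space."""
    total = 0.0
    for (dissimilarity, representatives), beta in zip(views, kernel_weights):
        u = embed(x, representatives, dissimilarity)
        v = embed(y, representatives, dissimilarity)
        total += beta * rbf_kernel(u, v)
    return total

# Two hypothetical dissimilarities, each looking at one coordinate of a pair.
def first_diff(a, b):
    return abs(a[0] - b[0])

def second_diff(a, b):
    return abs(a[1] - b[1])

views = [
    (first_diff, [(0.0, 0.0), (1.0, 1.0)]),  # view 1 with two representatives
    (second_diff, [(0.0, 0.0)]),             # view 2 with one representative
]
kernel_weights = [0.7, 0.3]  # non-negative and summing to one (assumed constraint)

k_same = combined_kernel((0.5, 2.0), (0.5, 2.0), views, kernel_weights)
k_diff = combined_kernel((0.5, 2.0), (0.5, 0.0), views, kernel_weights)
```

Here the two objects in `k_diff` differ only in the second coordinate, so only the lower-weighted view reduces their similarity; reading the learned weights in this way is the kind of knowledge discovery the abstract refers to.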
A COLLABORATIVE FILTERING APPROACH TO PREDICT WEB PAGES OF INTEREST FROM NAVIGATION PATTERNS OF PAST USERS WITHIN AN ACADEMIC WEBSITE
This dissertation is a simulation study of factors and techniques involved in designing hyperlink recommender systems that recommend to users, web pages that past users with similar navigation behaviors found interesting. The methodology involves identification of pertinent factors or techniques, and for each one, addresses the following questions: (a) room for improvement; (b) better approach, if any; and (c) performance characteristics of the technique in environments that hyperlink recommender systems operate in. The following four problems are addressed:
Web Page Classification. A new metric (PageRank × Inverse Links-to-Word count ratio) is proposed for classifying web pages as content or navigation, to help in the discovery of user navigation behaviors from web user access logs. Results of a small user study suggest that this metric leads to desirable results.
Data Mining. A new apriori algorithm for mining association rules from large databases is proposed. The new algorithm addresses the problem of scaling of the classical apriori algorithm by eliminating an expensive join step, and applying the apriori property to every row of the database. In this study, association rules show the correlation relationships between user navigation behaviors and web pages they find interesting. The new algorithm has better space complexity than the classical one, and better time efficiency under some conditions and comparable time efficiency under other conditions.
Prediction Models for User Interests. We demonstrate that association rules that show the correlation relationships between user navigation patterns and web pages they find interesting can be transformed into collaborative filtering data. We investigate collaborative filtering prediction models based on two approaches for computing prediction scores: using simple averages and weighted averages. Our findings suggest that the weighted averages scheme more accurately computes predictions of user interests than the simple averages scheme does.
Clustering. Clustering techniques are frequently applied in the design of personalization systems. We studied the performance of the CLARANS clustering algorithm in high dimensional space in relation to the PAM and CLARA clustering algorithms. While CLARA had the best time performance, CLARANS resulted in clusters with the lowest intra-cluster dissimilarities, and so was most effective in this regard
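The difference between the simple-average and weighted-average prediction schemes can be sketched as follows. This is a generic user-based collaborative filtering example, not the dissertation's model: the cosine similarity over shared pages, the interest scores, and the page names are all assumptions made for illustration.

```python
import math

def cosine_on_shared(u, v):
    """Cosine similarity restricted to pages both users have scores for."""
    shared = [page for page in u if page in v]
    if not shared:
        return 0.0
    dot = sum(u[p] * v[p] for p in shared)
    norm_u = math.sqrt(sum(u[p] ** 2 for p in shared))
    norm_v = math.sqrt(sum(v[p] ** 2 for p in shared))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def simple_average(page, past_users):
    """Every past user who scored the page counts equally."""
    scores = [user[page] for user in past_users if page in user]
    return sum(scores) / len(scores)

def weighted_average(page, target, past_users):
    """Past users count in proportion to their similarity to the target user."""
    numerator = 0.0
    denominator = 0.0
    for user in past_users:
        if page in user:
            similarity = cosine_on_shared(target, user)
            numerator += similarity * user[page]
            denominator += abs(similarity)
    return numerator / denominator if denominator else 0.0

# Hypothetical interest scores in [0, 1] derived from navigation behaviour.
past_users = [
    {"courses": 1.0, "admissions": 0.1, "library": 0.9},
    {"courses": 0.1, "admissions": 1.0, "library": 0.1},
]
target = {"courses": 1.0, "admissions": 0.1}

simple = simple_average("library", past_users)
weighted = weighted_average("library", target, past_users)
```

In this toy case the target user navigates like the first past user, so the weighted scheme pulls the prediction for `library` toward that user's score, while the simple scheme returns the plain mean.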
Document Collection Visualization and Clustering Using An Atom Metaphor for Display and Interaction
Visual Data Mining has proven to be of high value in exploratory data analysis and data mining because it provides intuitive feedback on data analysis and supports decision-making activities. Several visualization techniques have been developed for cluster discovery, such as Grand Tour, HD-Eye, Star Coordinates, etc. They are very useful tools that visualize data in 2D or 3D; however, they are not simple for untrained users. This thesis proposes a new approach to build a 3D clustering visualization system for document clustering by using the k-means algorithm. A cluster is represented by a neutron (centroid) and electrons (documents), which are kept at a distance from the neutron by a force. Our approach employs quantified domain knowledge and explorative observation as prediction to map high dimensional data onto 3D space for revealing the relationship among documents. Users can perform an intuitive visual assessment of the consistency of the cluster structure
Image Based Biomarkers from Magnetic Resonance Modalities: Blending Multiple Modalities, Dimensions and Scales.
The successful analysis and processing of medical
imaging data is a multidisciplinary work that requires the
application and combination of knowledge from diverse fields,
such as medical engineering, medicine, computer science and
pattern classification. Imaging biomarkers are biologic features
detectable by imaging modalities, and their use offers the prospect
of more efficient clinical studies and improvement in both
diagnosis and therapy assessment. The use of Dynamic Contrast
Enhanced Magnetic Resonance Imaging (DCE-MRI) and its
application to diagnosis and therapy has been extensively
validated; nevertheless, the issue of an appropriate or optimal
processing of data that helps to extract relevant biomarkers
to highlight the difference between heterogeneous tissue still
remains. Together with DCE-MRI, the data extracted from
Diffusion MRI (DWI-MR and DTI-MR) represents a promising
and complementary tool. This project initially proposes the
exploration of diverse techniques and methodologies for the
characterization of tissue, following an analysis and classification
of voxel-level time-intensity curves from DCE-MRI data mainly
through the exploration of dissimilarity based representations
and models. We will explore metrics and representations to
correlate the multidimensional data acquired through diverse
imaging modalities, a work which starts with the appropriate
elastic registration methodology between DCE-MRI and DWI-
MR on the breast and its corresponding validation.
It has been shown that the combination of multi-modal MRI
images improves the discrimination of diseased tissue. However, the fusion
of dissimilar imaging data for classification and segmentation purposes is
not a trivial task: there is an inherent difference in information domains,
dimensionality and scales. This work also proposes a multi-view consensus
clustering methodology for the integration of multi-modal MR images
into a unified segmentation of tumoral lesions for heterogeneity assessment. Using a variety of metrics and distance functions this multi-view
imaging approach calculates multiple vectorial dissimilarity-spaces for
each one of the MRI modalities and makes use of the concepts behind
cluster ensembles to combine a set of base unsupervised segmentations
into a unified partition of the voxel-based data. The methodology is
specifically designed for combining DCE-MRI and DTI-MR, for which a
manifold learning step is implemented in order to account for the geometric constraints of the high dimensional diffusion information
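The cluster-ensemble idea mentioned above, combining several base unsupervised segmentations into one partition, can be sketched with a co-association matrix. This is a generic evidence-accumulation scheme, not necessarily the consensus function used in this work; the threshold, the toy label vectors, and the function names are assumptions.

```python
def co_association(partitions):
    """Fraction of base partitions in which each pair of voxels shares a cluster."""
    n = len(partitions[0])
    matrix = [[0.0] * n for _ in range(n)]
    for labels in partitions:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    matrix[i][j] += 1.0 / len(partitions)
    return matrix

def consensus_labels(matrix, threshold=0.5):
    """Link voxels that co-occur more often than the threshold, then label components."""
    n = len(matrix)
    labels = [-1] * n
    current = 0
    for seed in range(n):
        if labels[seed] != -1:
            continue
        labels[seed] = current
        stack = [seed]
        while stack:  # depth-first traversal of one connected component
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and matrix[i][j] > threshold:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels

# Three hypothetical base segmentations of six voxels, e.g. one per
# dissimilarity space; cluster identifiers need not match across partitions.
base_partitions = [
    [0, 0, 0, 1, 1, 1],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 2],
]
consensus = consensus_labels(co_association(base_partitions))
```

Because the co-association matrix only records whether two voxels share a cluster, the base segmentations can come from different modalities and use unrelated label values.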
Projection-Based Clustering through Self-Organization and Swarm Intelligence
This work covers aspects of unsupervised machine learning used for knowledge discovery in data science and introduces a data-driven approach to cluster analysis, the Databionic swarm (DBS). DBS consists of the 3D landscape visualization and clustering of data. The 3D landscape enables 3D printing of high-dimensional data structures. The clustering and the number of clusters, or the absence of cluster structure, can be verified at a glance using the 3D landscape. DBS is the first swarm-based technique that shows emergent properties while exploiting concepts of swarm intelligence, self-organization and the Nash equilibrium concept from game theory. This eliminates the need for a global objective function and for parameter setting. By downloading the R package, DBS can be applied to data drawn from diverse research fields and used even by non-professionals in the field of data mining
Multilevel mixed-type data analysis for validating partitions of scrapie isolates
The dissertation arises from a joint study with the Department of Food Safety and Veterinary Public Health of the Istituto Superiore di Sanità. The aim is to investigate and validate the existence of distinct strains of the scrapie disease taking into account the availability of a priori benchmark partition formulated by researchers. Scrapie of small ruminants is caused by prions, which are unconventional infectious agents of proteinaceous nature affecting humans and animals. Due to the absence of nucleic acids, which precludes direct analysis of strain variation by molecular methods, the presence of different sheep scrapie strains is usually investigated by bioassay in laboratory rodents. Data are collected by an experimental study on scrapie conducted at the Istituto Superiore di Sanità by experimental transmission of scrapie isolates to bank voles.
We aim to discuss the validation of a given partition in a statistical classification framework using a multi-step procedure. Firstly, we use unsupervised classification to see how alternative clustering results match researchers’ understanding of the heterogeneity of the isolates. We discuss whether and how clustering results can be eventually exploited to extend the preliminary partition elicited by researchers. Then we motivate the subsequent partition validation based on the predictive performance of several supervised classifiers.
Our data-driven approach contains two main original methodological contributions. We advocate the use of partition validation measures to investigate a given benchmark partition: firstly we discuss the issue of how the data can be used to evaluate a preliminary benchmark partition and eventually modify it with statistical results to find a conclusive partition that could be used as a “gold standard” in future studies. Moreover, collected data have a multilevel structure and for each lower-level unit, mixed-type data are available. Each step in the procedure is then adapted to deal with multilevel mixed-type data. We extend distance-based clustering algorithms to deal with multilevel mixed-type data. In supervised classification, by contrast, we propose a two-step approach to classify the higher-level units starting from the lower-level observations. In this framework, we also need to define an ad-hoc cross validation algorithm
Ensemble deep learning: A review
Ensemble learning combines several individual models to obtain better
generalization performance. Currently, deep learning models with multilayer
processing architectures are showing better performance as compared to the
shallow or traditional classification models. Deep ensemble learning models
combine the advantages of both the deep learning models as well as the ensemble
learning such that the final model has better generalization performance. This
paper reviews the state-of-the-art deep ensemble models and hence serves as an
extensive summary for the researchers. The ensemble models are broadly
categorised into ensemble models like bagging, boosting and stacking, negative
correlation based deep ensemble models, explicit/implicit ensembles,
homogeneous/heterogeneous ensemble, decision fusion strategies, unsupervised,
semi-supervised, reinforcement learning and online/incremental, multilabel
based deep ensemble models. Application of deep ensemble models in different
domains is also briefly discussed. Finally, we conclude this paper with some
future recommendations and research directions
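Two of the decision fusion strategies such reviews typically cover, majority voting and averaging of predicted probabilities, can be contrasted in a minimal sketch. The probability values below are made up and stand in for the outputs of three hypothetical trained models; no particular method from the review is reproduced.

```python
from collections import Counter

def hard_labels(probabilities):
    """Convert one model's per-sample class probabilities to predicted labels."""
    return [sample.index(max(sample)) for sample in probabilities]

def majority_vote(model_probabilities):
    """Each model casts one vote per sample; the most frequent label wins."""
    votes = [hard_labels(p) for p in model_probabilities]
    return [Counter(column).most_common(1)[0][0] for column in zip(*votes)]

def average_probabilities(model_probabilities):
    """Average the class probabilities across models, then take the best class."""
    fused = []
    for sample in zip(*model_probabilities):
        n_classes = len(sample[0])
        mean = [sum(p[c] for p in sample) / len(sample) for c in range(n_classes)]
        fused.append(mean.index(max(mean)))
    return fused

# Hypothetical outputs of three models on two samples with two classes.
model_probabilities = [
    [[0.90, 0.10], [0.40, 0.60]],  # model 1: confident about class 0 on sample 0
    [[0.45, 0.55], [0.45, 0.55]],  # model 2: weakly prefers class 1
    [[0.40, 0.60], [0.20, 0.80]],  # model 3: prefers class 1
]

voted = majority_vote(model_probabilities)
averaged = average_probabilities(model_probabilities)
```

On the first sample the two strategies disagree: two weak votes for class 1 win the majority vote, whereas averaging lets the single confident model tip the decision to class 0.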