Catégorisation par mesures de dissimilitude et caractérisation d'images en multi échelle (Categorization by dissimilarity measures and multi-scale image characterization)
This thesis introduces the "Shape Coefficient" metric for classifying dissimilarity data. The approach is inspired by geometrical discriminant analysis, and decision rules are defined to mimic the behavior of the linear and quadratic classifiers. The number of parameters is limited (two per class). This fast approach is also extended and improved to learn solely from dissimilarity representations by exploiting the effectiveness of the Support Vector Machines classifier. As an application context for dissimilarity-based classification, image retrieval is used, with images represented at multiple scales by means of the "Reduced Differential Pyramid" ("Pyramide Réduite Différentielle"). An application to face description is developed. Classification results obtained with the Shape Coefficient and with an adapted version of Support Vector Machines, on databases drawn from real-world applications, are presented and compared with other dissimilarity-based classification methods. The proposed method proves highly robust, with performance equal to or better than that of state-of-the-art algorithms.
The dissimilarity representation is an alternative to the use of features in the recognition of real-world objects such as images, spectra and time signals. Instead of an absolute characterization of objects by a set of features, the expert or the system is asked to define a measure that estimates the dissimilarity between pairs of objects. Such a measure may also be defined for structural representations such as strings and graphs. The dissimilarity representation is potentially able to bridge structural and statistical pattern recognition. In this thesis we introduce a new, fast, Mahalanobis-like metric, the Shape Coefficient, for the classification of dissimilarity data. Our approach is inspired by Geometrical Discriminant Analysis, and we have defined decision rules to mimic the behavior of the linear and quadratic classifiers. The number of parameters is limited (two per class). We also extend and improve this fast adaptive approach to learn only from dissimilarity representations by using the effectiveness of the Support Vector Machines classifier for real-world classification tasks. Several methods for incorporating dissimilarity representations are presented, investigated and compared to the Shape Coefficient in this thesis: the prototype dissimilarity-based classifiers of Pekalska and Duin; Haasdonk's kernel-based SVM classifier; and the KNN classifier. Numerical experiments on artificial and real data show interesting behavior compared to Support Vector Machines and to the KNN classifier: (a) lower or equivalent error rate, (b) equivalent CPU time, (c) more robustness with sparse dissimilarity data. The experimental results on real-world dissimilarity databases show that the Shape Coefficient can be an alternative to these known methods and can be as effective as them in terms of classification accuracy.
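To make the idea of classifying from pairwise dissimilarities concrete, here is a minimal sketch of the KNN baseline named in the abstract, applied to strings through an edit distance. It is not the Shape Coefficient, whose definition the abstract does not give. The function names, the choice of edit distance and the toy data are all assumptions made for illustration.

```python
from collections import Counter


def edit_distance(a, b):
    """Levenshtein distance between two strings (illustrative dissimilarity)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def knn_from_dissimilarities(d_row, train_labels, k=3):
    """Classify a query from its dissimilarities to labelled training objects.

    d_row[i] is the dissimilarity between the query and training object i;
    no feature vectors are used at any point.
    """
    # Indices of the k training objects closest to the query.
    nearest = sorted(range(len(d_row)), key=lambda i: d_row[i])[:k]
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two classes of short words.
train = ["cat", "cart", "care", "dog", "dig", "dot"]
labels = ["c", "c", "c", "d", "d", "d"]


def classify(word, k=3):
    d_row = [edit_distance(word, t) for t in train]
    return knn_from_dissimilarities(d_row, labels, k)
```

The classifier only ever sees the row of dissimilarities, so the same function would work unchanged for graphs, spectra or images, given a suitable dissimilarity measure.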
Data Clustering and Partial Supervision with Some Parallel Developments
by Sameh A. Salem
Clustering is an important and irreplaceable step towards the search for structures in the
data. Many different clustering algorithms have been proposed. Yet, the sources of variability
in most clustering algorithms affect the reliability of their results. Moreover, the
majority tend to be based on the knowledge of the number of clusters as one of the input
parameters. Unfortunately, there are many scenarios, where this knowledge may not be
available. In addition, clustering algorithms are computationally intensive, which makes
scaling up to large datasets a major challenge. This thesis gives possible
solutions for such problems.
First, new measures - called clustering performance measures (CPMs) - for assessing
the reliability of a clustering algorithm are introduced. These CPMs can be used to evaluate:
1) clustering algorithms that have a structure bias to certain types of data distribution
as well as those that have no such biases, 2) clustering algorithms that have initialisation
dependency as well as the clustering algorithms that have a unique solution for a given set
of parameter values with no initialisation dependency.
Then, a novel clustering algorithm, which is a RAdius based Clustering ALgorithm
(RACAL), is proposed. RACAL uses a distance based principle to map the distributions of
the data assuming that clusters are determined by a distance parameter, without having to
specify the number of clusters. Furthermore, RACAL is enhanced by a validity index to
choose the best clustering result, i.e. the result with compact clusters and wide cluster separations,
for a given input parameter. Comparisons with other clustering algorithms indicate
the applicability and reliability of the proposed clustering algorithm. Additionally, an adaptive
partial supervision strategy is proposed for use in conjunction with RACAL to make
it act as a classifier. Results from RACAL with partial supervision, RACAL-PS, indicate
its robustness in classification. Additionally, a parallel version of RACAL (P-RACAL) is
proposed. The parallel evaluations of P-RACAL indicate that P-RACAL is scalable in terms
of speedup and scaleup, which gives the ability to handle large datasets of high dimensions
in a reasonable time.
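
The abstract does not spell out RACAL's steps, so the following is only a generic sketch of the underlying idea: a distance (radius) parameter, not a preset number of clusters, determines how many clusters emerge. It uses a simple "leader"-style pass and is an assumption made for illustration, not the author's algorithm; the name `radius_clustering` is hypothetical.

```python
import math


def radius_clustering(points, radius):
    """Assign each point to the first cluster centre within `radius`.

    A point that is farther than `radius` from every existing centre
    starts a new cluster, so the number of clusters is an output of the
    procedure rather than an input.
    """
    centres = []
    labels = []
    for p in points:
        for idx, c in enumerate(centres):
            if math.dist(p, c) <= radius:
                labels.append(idx)
                break
        else:
            # No centre is close enough: p becomes a new centre.
            centres.append(p)
            labels.append(len(centres) - 1)
    return labels, centres
```

A smaller radius yields more, tighter clusters and a larger radius yields fewer, which is why such a scheme is naturally paired with a validity index to choose among results.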
Next, a novel clustering algorithm, which achieves clustering without any control of
cluster sizes, is introduced. This algorithm, called the Nearest Neighbour Clustering
Algorithm (NNCA), uses the same concept as the K-Nearest Neighbour (KNN) classifier
with the advantage that the algorithm needs no training set and is completely unsupervised.
Additionally, NNCA is augmented with a partial supervision strategy, NNCA-PS, to
act as a classifier. Comparisons with other methods indicate the robustness of the proposed
method in classification. Additionally, experiments in a parallel environment indicate the
suitability and scalability of the parallel NNCA, P-NNCA, in handling large datasets.
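
NNCA's exact procedure is not given in the abstract. As a rough illustration of how nearest-neighbour relations alone can produce clusters with no training set and no constraint on cluster size, the sketch below links every point to its single nearest neighbour and reports the connected groups. This is an assumed stand-in, not the NNCA itself, and the function name is hypothetical.

```python
import math


def nearest_neighbour_components(points):
    """Group points by linking each one to its nearest neighbour.

    Returns one integer label per point; points joined by a chain of
    nearest-neighbour links share a label.
    """
    n = len(points)
    if n < 2:
        return [0] * n

    parent = list(range(n))  # union-find forest

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        # Nearest other point to point i (ties go to the lowest index).
        j = min((k for k in range(n) if k != i),
                key=lambda k: math.dist(points[i], points[k]))
        parent[find(i)] = find(j)

    # Relabel roots as 0, 1, 2, ... in order of first appearance.
    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(n)]
```

No number of clusters and no size limit is supplied; both follow from the data, at a quadratic cost in the number of points for this naive version.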
Further investigations on more challenging data are carried out. In this context, microarray
data is considered. In such data, the number of clusters is not clearly defined.
This points directly towards the clustering algorithms that do not require the knowledge
of the number of clusters. Therefore, the efficacy of one of these algorithms is examined.
Finally, a novel integrated clustering performance measure (ICPM) is proposed to be used
as a guideline for choosing the proper clustering algorithm that has the ability to extract
useful biological information in a particular dataset.
Supplied by The British Library - 'The world's knowledge'