    Multivariate classification of gene expression microarray data

    Gene expression data obtained from microarray analysis are often used to classify cells. In this thesis, a probabilistic version of the Discriminant Partial Least Squares method (p-DPLS) is used to classify samples from their gene expression profiles. p-DPLS is based on the Bayes rule for the posterior probability. Classifiers of this kind are forced to always assign a class. To overcome this limitation, a reject option has been implemented. This option makes it possible to reject samples with a high risk of misclassification (i.e., ambiguous samples and outliers). The reject option combines criteria based on the x-residuals, the leverage and the predicted values. In addition, a variable selection method is developed to choose the most relevant genes, since most of the genes analysed with a microarray are irrelevant to the particular classification purpose and can confuse the classifier. Finally, DPLS is extended to multi-class classification by combining PLS with linear discriminant analysis.
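
The reject option described in this abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: it assumes that the PLS predicted value of each class follows a univariate Gaussian, and the function names, the class parameters and all three thresholds (`min_posterior`, `max_residual`, `max_leverage`) are hypothetical values chosen only for illustration.

```python
import math

def gaussian_pdf(y, mean, std):
    """Density of a univariate normal distribution at y."""
    z = (y - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def posterior(y_hat, class_params, priors):
    """Bayes rule: P(class | y_hat) from class-conditional densities and priors."""
    joint = {c: priors[c] * gaussian_pdf(y_hat, *class_params[c])
             for c in class_params}
    total = sum(joint.values())
    return {c: v / total for c, v in joint.items()}

def classify_with_reject(y_hat, x_residual, leverage, class_params, priors,
                         min_posterior=0.9, max_residual=3.0, max_leverage=0.5):
    """Return a class label, or a rejection reason for risky samples.

    All thresholds are hypothetical; in practice they would be tuned
    on training or validation data.
    """
    # Outlier criteria: the sample lies far from the model space.
    if x_residual > max_residual or leverage > max_leverage:
        return "reject: outlier"
    # Ambiguity criterion: no class has a sufficiently high posterior.
    post = posterior(y_hat, class_params, priors)
    best = max(post, key=post.get)
    if post[best] < min_posterior:
        return "reject: ambiguous"
    return best

# Hypothetical two-class model: (mean, std) of the predicted value per class.
class_params = {"A": (0.0, 0.2), "B": (1.0, 0.2)}
priors = {"A": 0.5, "B": 0.5}

print(classify_with_reject(0.05, 1.0, 0.1, class_params, priors))
print(classify_with_reject(0.50, 1.0, 0.1, class_params, priors))
print(classify_with_reject(0.05, 5.0, 0.1, class_params, priors))
```

A sample whose predicted value lies midway between the two class means gets a posterior of 0.5 for each class and is rejected as ambiguous, while a sample with a large x-residual is rejected as an outlier before any posterior is computed.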

    Reliability Estimation of a Statistical Classifier

    Pattern classification techniques derived from statistical principles have been widely studied and have proven powerful in addressing practical classification problems. In real-world applications, the challenge is often to cope with unseen patterns, i.e., patterns that are very different from those examined during the training phase. The issue with unseen patterns is the lack of accuracy of the classifier output in regions of pattern space where the density of training data is low, which can lead to a false classification output. This paper proposes a method for estimating the reliability of a classifier to cope with these situations. While existing methods for quantifying reliability are often based on the class membership probability estimated from global approximations, the proposed method takes into account the local density of training data in the neighborhood of a test pattern. The calculations are further simplified by using a Gaussian mixture model (GMM) to calculate the local density of the training data. The reliability of a classifier output is defined in terms of a confidence interval on the class membership probability. The lower bound of the confidence interval or the local density of training data may be used to detect unseen patterns. The effectiveness of the proposed method is demonstrated using real data sets, and its performance is compared with that of other reliability estimation methods.
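
The idea of using local training-data density to detect unseen patterns can be sketched as follows. This is only an illustration under stated assumptions, not the paper's method: it evaluates a one-dimensional GMM whose parameters are taken as already fitted, and it does not reproduce the paper's confidence interval. The function names, mixture parameters and the `min_density` threshold are hypothetical.

```python
import math

def gmm_density(x, weights, means, stds):
    """Density of a one-dimensional Gaussian mixture model at x."""
    total = 0.0
    for w, m, s in zip(weights, means, stds):
        z = (x - m) / s
        total += w * math.exp(-0.5 * z * z) / (s * math.sqrt(2.0 * math.pi))
    return total

def is_unseen(x, weights, means, stds, min_density=1e-3):
    """Flag x as an unseen pattern when the local density is below a threshold.

    min_density is a hypothetical cut-off that would be tuned on held-out data.
    """
    return gmm_density(x, weights, means, stds) < min_density

# Hypothetical mixture assumed to have been fitted to the training data.
weights = [0.5, 0.5]
means = [0.0, 5.0]
stds = [1.0, 1.0]

print(is_unseen(0.0, weights, means, stds))   # close to a training cluster
print(is_unseen(20.0, weights, means, stds))  # far from all training data
```

A test pattern near one of the mixture components has a high density and is treated as seen, whereas a pattern far from every component has a density close to zero and is flagged, regardless of how confident the classifier's posterior appears there.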