    Exponential Family Hybrid Semi-Supervised Learning

    We present an approach to semi-supervised learning based on an exponential family characterization. Our approach generalizes previous work on coupled priors for hybrid generative/discriminative models. Our model is more flexible and natural than previous approaches. Experimental results on several data sets show that our approach also performs better in practice. Comment: 6 pages, 3 figures

    Altitude Training: Strong Bounds for Single-Layer Dropout

    Dropout training, originally designed for deep neural networks, has been successful on high-dimensional single-layer natural language tasks. This paper proposes a theoretical explanation for this phenomenon: we show that, under a generative Poisson topic model with long documents, dropout training improves the exponent in the generalization bound for empirical risk minimization. Dropout achieves this gain much like a marathon runner who practices at altitude: once a classifier learns to perform reasonably well on training examples that have been artificially corrupted by dropout, it will do very well on the uncorrupted test set. We also show that, under similar conditions, dropout preserves the Bayes decision boundary and should therefore induce minimal bias in high dimensions. Comment: Advances in Neural Information Processing Systems (NIPS), 201
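As a rough illustration of the corruption step this abstract refers to, the sketch below applies dropout noise to a bag-of-words count vector before it would be passed to a single-layer classifier. The function name `dropout_counts`, the drop probability `delta`, and the rescaling of surviving features by 1/(1 - delta) are illustrative assumptions and are not taken from the paper.

```python
import random


def dropout_counts(counts, delta, rng):
    """Return a dropout-corrupted copy of a feature (word-count) vector.

    Each feature is independently set to zero with probability `delta`.
    Surviving features are rescaled by 1 / (1 - delta) so that the
    corrupted vector equals the original one in expectation.
    """
    if not 0.0 <= delta < 1.0:
        raise ValueError("delta must lie in [0, 1)")
    scale = 1.0 / (1.0 - delta)
    return [0.0 if rng.random() < delta else c * scale for c in counts]


# Hypothetical document with four vocabulary entries.
rng = random.Random(0)
document = [3, 0, 1, 5]
corrupted = dropout_counts(document, 0.5, rng)
```

During training, a fresh corrupted copy would be drawn for every example at every pass, while test documents are left uncorrupted.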

    Unifying generative and discriminative learning principles

    Background: The recognition of functional binding sites in genomic DNA remains one of the fundamental challenges of genome research. During the last decades, a plethora of different and well-adapted models has been developed, but little attention has been paid to the development of different and similarly well-adapted learning principles. Only recently was it noticed that discriminative learning principles can be superior to generative ones in diverse bioinformatics applications, too. Results: Here, we propose a generalization of generative and discriminative learning principles containing the maximum likelihood, maximum a posteriori, maximum conditional likelihood, maximum supervised posterior, generative-discriminative trade-off, and penalized generative-discriminative trade-off learning principles as special cases, and we illustrate its efficacy for the recognition of vertebrate transcription factor binding sites. Conclusions: We find that the proposed learning principle helps to improve the recognition of transcription factor binding sites, enabling better computational approaches for extracting as much information as possible from valuable wet-lab data. We make all implementations available in the open-source library Jstacs so that this learning principle can be easily applied to other classification problems in the field of genome and epigenome analysis.

    Methods for the Automatic Classification of Brain Magnetic Resonance Images (Métodos para la Clasificación Automática de Imágenes de Resonancia Magnética del Cerebro)

    Digital images acquired by Magnetic Resonance Imaging (MRI) are widely used to diagnose, study and predict the evolution and response to treatment of a wide variety of pathologies. Adequate interpretation of these images requires an extensive and complex associated analysis involving numerous computational techniques. In practice, the main difficulties in classifying MRI scans with machine learning methods are the large volume of information associated with this format, the intrinsic noise, and the small number of subjects normally available in real research conditions, all of which make the data remarkably difficult to process. In this work we systematically investigate the performance in real conditions of the most widely used techniques for automatic classification of brain MRI scans, such as Discriminant Analysis (DA), Artificial Neural Networks (ANN) and Support Vector Machines (SVM), as well as the influence of different image pre-processing methods (alignment, image cropping, ROI extraction) on the classification results. In addition, we investigate other important factors, such as the use of different types of magnetic resonance images (T2w and Diffusion) and the incorporation of additional control subjects when training the classifier. We used two different MRI databases (cerebral activation by appetite, and response of brain tumours to treatment), both provided by the Laboratory of Magnetic Resonance Imaging and Spectroscopy at the Institute of Biomedical Research Alberto Sols (IIB) CSIC/UAM, Madrid, Spain. The main goal of the study is to identify which pre-processing strategies and classification algorithms provide the best automatic classification results. The results show that the best processing sequence is alignment with cropping and dimensionality reduction prior to classification with an SVM (Support Vector Machine).

    Bias-variance tradeoff in hybrid generative-discriminative models

    Aspects of generative and discriminative classifiers

    In recent years, under the new terminology of generative and discriminative classifiers, research interest in classical statistical approaches to discriminant analysis has re-emerged in the machine learning community. In discriminant analysis, observations with measured features $\mathbf{x}$ are classified into classes labelled by a categorical variable $y$. {\em Generative classifiers}, also termed the sampling paradigm, such as normal-based discriminant analysis and the na\"{i}ve Bayes classifier, model the joint distribution $p(\mathbf{x}, y)$ of the measured features $\mathbf{x}$ and the class labels $y$ factorised in the form $p(\mathbf{x}|y)p(y)$, where $p(\mathbf{x}|y)$ is a data-generating process (DGP), and learn the model parameters through maximisation of the likelihood with respect to $p(\mathbf{x}|y)p(y)$. {\em Discriminative classifiers}, also termed the diagnostic paradigm, such as logistic regression, model the conditional distribution $p(y|\mathbf{x})$ of the class labels given the features, and learn the model parameters through maximising the conditional likelihood based on $p(y|\mathbf{x})$. In order to exploit the best of both worlds, it is necessary to first compare generative and discriminative classifiers and then combine them. In this thesis, we first performed some empirical and simulation studies to extend and comment on a highly-cited report~\citep{Ng:01}, which compared the na\"{i}ve Bayes classifier or normal-based linear discriminant analysis (LDA) with linear logistic regression (LLR). Then we studied extensively two hybrid-learning techniques, namely the hybrid generative-discriminative algorithm~\citep{Raina:03} and the generative-discriminative tradeoff (GDT) approach~\citep{Bouchard:04}, for combining the generative and discriminative classifiers. Based on our results from these studies, we proposed a joint generative-discriminative modelling approach to classification.
In addition, we extended our investigation to generative and discriminative hidden Markov models, the latent variable models for structured data. We also developed discriminative approaches for a specific application, that of histogram-based image thresholding. The contributions of this thesis are the following. First,~\citet{Ng:01} claimed that there exist two distinct regimes of performance between the generative and discriminative classifiers with regard to the training-set size; however, our empirical and simulation studies, as presented in Chapter \ref{ch:ng}, suggest that it is not so reliable to claim such an existence of the two distinct regimes. In addition, for real world datasets, so far there is no theoretically correct, general criterion for choosing between the discriminative and the generative approaches to classification of an observation $\mathbf{x}$ into a class $y$; the choice depends on the relative confidence you have in the correctness of the specification of either $p(y|\mathbf{x})$ or $p(\mathbf{x}, y)$. This can be to some extent a demonstration of why~\citet{Efron:75} and~\citet{ONeill:80} prefer LDA but other empirical studies may prefer LLR instead. Furthermore, we suggest that pairing of either LDA assuming a common diagonal covariance matrix (LDA-$\Lambda$) or the na\"{i}ve Bayes classifier and LLR may not be perfect, and hence it may not be reliable for any claim that was derived from the comparison between LDA-$\Lambda$ or the na\"{i}ve Bayes classifier and LLR to be generalised to all generative and discriminative classifiers. Secondly, in Chapter \ref{ch:gdt}, we present the interpretation and asymptotic relative efficiency (ARE) of the GDT approach for linear and quadratic normal discrimination without model mis-specification, and compare its ARE with those of its generative and discriminative counterparts. The classification performance of the GDT is compared with those of LDA and LLR on simulated datasets.
We argue that the GDT is a generative model integrating both discriminative and generative learning. It is therefore sensitive to model mis-specification of the data-generating process and, in practice, its discriminative component may behave differently from a truly discriminative approach. Amongst the three approaches that we compare, the asymptotic efficiency of the GDT is lower than that of the generative approach when no model mis-specification occurs. In addition, without model mis-specification, LDA performs the best; with model mis-specification, the GDT may perform the best at an optimal tradeoff between its discriminative and generative components, and LLR, a truly discriminative classifier, in general performs well when the training-sample size is reasonably large. Thirdly, in Chapter \ref{ch:hyb}, we interpret the hybrid algorithm from three perspectives, namely class-conditional probabilities, class-posterior probabilities and loss functions underlying the model. We suggest that the hybrid algorithm is by nature a generative model with its parameters learnt through both generative and discriminative approaches, in the sense that it assumes a scaled data-generation process and uses scaled class-posterior probabilities to perform discrimination. Our suggestion can also be applied to its multi-class extension. In addition, using simulated and real-world data, we compare the performance of the normalised hybrid algorithm as a classifier with that of the na\"{i}ve Bayes classifier and LLR. Our simulation studies suggest in general the following: if the covariance matrices are diagonal matrices, the na\"{i}ve Bayes classifier performs the best; if the covariance matrices are full matrices, LLR performs the best. Our studies also suggest that the hybrid algorithm may provide worse performance than either the na\"{i}ve Bayes classifier or LLR alone. 
Fourthly, based on our studies presented in Chapters \ref{ch:ng},~\ref{ch:gdt} and~\ref{ch:hyb}, we propose in Chapter \ref{ch:jgd} a joint generative-discriminative modelling (JGD) approach to classification, by partitioning variables into two subsets based on statistical tests of the DGP. Our JGD approach adopts statistical tests, such as normality tests, of the assumed DGP for each variable to justify the use of generative approaches for the variables which satisfy the tests and of discriminative approaches for other variables. Such a partition of variables and a combination of generative and discriminative approaches are derived in a probabilistic rather than a heuristic way. We have concentrated on particular choices for the generative and discriminative components of our models, but the overall principle is quite general and can accommodate many other special versions. Of course, we must ensure that the assumptions underlying the resulting generative classifiers can be tested statistically. Numerical results from real UCI and gene-expression data and from simulated data demonstrate promising performance of this new approach for practical application to both low- and high-dimensional data. Fifthly, in Chapter \ref{ch:hmm}, we study the assumption of ``mutual information independence'', which is used by~\citet{Zhou:05} for deriving the so-called discriminative hidden Markov model (D-HMM). We suggest that the mutual information assumption (\ref{equ:dhmm:mi1}) results in the D-HMM, while another mutual information assumption (\ref{equ:ghmm2:mi1}) results in its generative counterpart, the G-HMM. However, in practice, whether or not the assumptions are reasonable and how the corresponding HMMs perform can be data-dependent; research efforts to explore an adaptive switching between or combination of these two models may be worthwhile.
Meanwhile, we suggest that the so-called output-dependent HMMs could be represented in a state-dependent manner, and vice versa, essentially by application of Bayes' theorem. Finally, in Chapter \ref{ch:img}, we present discriminative approaches to histogram-based image thresholding, in which the optimal threshold is derived from the maximum likelihood based on the conditional distribution $p(y|x)$ of $y$, the class indicator of a grey level $x$, given $x$. The discriminative approaches can be regarded as discriminative extensions of the traditional generative approaches to thresholding, such as Otsu's method~\citep{Otsu:79} and Kittler and Illingworth's minimum error thresholding (MET)~\citep{Kittler:86}. As illustrations, we develop discriminative versions of Otsu's method and MET by using discriminant functions corresponding to the original methods to represent $p(y|x)$. These two discriminative thresholding approaches are compared with their original counterparts on selecting thresholds for a variety of histograms of mixture distributions. Results show that the discriminative Otsu method consistently provides relatively good performance. Although being of higher computational complexity than the original methods in parameter estimation, its robustness and model simplicity can justify the discriminative Otsu method for scenarios in which the risk of model mis-specification is high and the computation is not demanding.
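For readers unfamiliar with the baseline mentioned above, the following is a minimal sketch of the original (generative) Otsu method on a grey-level histogram, which picks the threshold that maximises the between-class variance. It is not the discriminative variant developed in the thesis; the function name `otsu_threshold` and the convention that grey levels less than or equal to the threshold form the first class are assumptions made for illustration.

```python
def otsu_threshold(hist):
    """Return the grey level t that maximises the between-class variance.

    hist[i] is the number of pixels with grey level i. Levels <= t are
    assigned to class 0 and levels > t to class 1.
    """
    total = sum(hist)
    if total == 0:
        raise ValueError("histogram is empty")
    total_sum = sum(level * count for level, count in enumerate(hist))

    best_t, best_score = 0, -1.0
    w0 = 0    # number of pixels in class 0 so far
    sum0 = 0  # sum of grey levels in class 0 so far
    for t, count in enumerate(hist):
        w0 += count
        sum0 += t * count
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue  # one class is empty, so the variance is undefined
        mu0 = sum0 / w0
        mu1 = (total_sum - sum0) / w1
        # Between-class variance up to the constant factor 1 / total**2.
        score = w0 * w1 * (mu0 - mu1) ** 2
        if score > best_score:
            best_score, best_t = score, t
    return best_t


# Hypothetical bimodal histogram: dark pixels at levels 0-1, bright at 5-6.
threshold = otsu_threshold([10, 10, 0, 0, 0, 10, 10])
```

When several thresholds tie, as they do across the empty bins of this toy histogram, the sketch keeps the lowest one.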

    Inferring relevance from eye movements with wrong models

    Statistical inference forms the backbone of modern science. It is often viewed as giving an objective validation for hypotheses or models. Perhaps for this reason the theory of statistical inference is often derived with the assumption that the "truth" is within the model family. However, in many real-world applications the applied statistical models are incorrect. A more appropriate probabilistic model may be computationally too complex, or the problem to be modelled may be so new that there is little prior information to be incorporated. However, in statistical theory the theoretical and practical implications of the incorrectness of the model family are to a large extent unexplored. This thesis focusses on conditional statistical inference, that is, modeling of classes of future observations given observed data, under the assumption that the model is incorrect. Conditional inference or prediction is one of the main application areas of statistical models which is still lacking a conclusive theoretical justification of Bayesian inference. The main result of the thesis is an axiomatic derivation where, given an incorrect model and assuming that the utility is conditional likelihood, a discriminative posterior yields a distribution on model parameters which best agrees with the utility. The devised discriminative posterior outperforms the classical Bayesian joint likelihood-based approach in conditional inference. Additionally, a theoretically justified expectation maximization-type algorithm is presented for obtaining conditional maximum likelihood point estimates for conditional inference tasks. The convergence of the algorithm is shown to be more stable than in earlier partly heuristic variants. The practical application field of the thesis is inference of relevance from eye movement signals in an information retrieval setup. 
It is shown that relevance can be predicted to some extent, and that this information can be exploited in a new kind of task, proactive information retrieval. Besides making it possible to design new kinds of engineering applications, statistical modeling of eye tracking data can also be applied in basic psychological research to form hypotheses about the cognitive processes affecting eye movements, which is the second application area of the thesis.