A robust approach to model-based classification based on trimming and constraints
In a standard classification framework a set of trustworthy learning data are
employed to build a decision rule, with the final aim of classifying unlabelled
units belonging to the test set. Therefore, unreliable labelled observations,
namely outliers and data with incorrect labels, can strongly undermine the
classifier performance, especially if the training size is small. The present
work introduces a robust modification to the Model-Based Classification
framework, employing impartial trimming and constraints on the ratio between
the maximum and the minimum eigenvalue of the group scatter matrices. The
proposed method effectively handles noise in both the response and the
explanatory variables, providing reliable classification even when dealing with
contaminated datasets. A robust information criterion is proposed for model
selection. Experiments on real and simulated data, artificially adulterated,
are provided to underline the benefits of the proposed method.
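As a rough illustration of the two ingredients named in this abstract, the sketch below shows an eigenvalue-ratio constraint and impartial trimming in their simplest form. It is not the authors' implementation: the function names, the inputs (pre-computed eigenvalues per group and per-unit log-densities), and the plain flooring rule are assumptions made for the example, and published constrained estimators typically use an optimal truncation step rather than this simple floor.

```python
import math


def constrain_eigenvalues(group_eigenvalues, max_ratio):
    """Floor small eigenvalues so that, across ALL groups,
    (largest eigenvalue) / (smallest eigenvalue) <= max_ratio.

    group_eigenvalues: list of lists, one list of eigenvalues per group
    scatter matrix (assumed already computed elsewhere).
    """
    largest = max(max(vals) for vals in group_eigenvalues)
    floor = largest / max_ratio
    # Simplified rule (assumption): raise anything below the floor to the floor.
    return [[max(v, floor) for v in vals] for vals in group_eigenvalues]


def impartial_trim_mask(log_densities, alpha):
    """Return a keep-mask that discards the ceil(alpha * n) units with the
    lowest log-density, i.e. the units least plausible under the fitted model."""
    n = len(log_densities)
    n_trim = math.ceil(alpha * n)
    order = sorted(range(n), key=lambda i: log_densities[i])
    trimmed = set(order[:n_trim])
    return [i not in trimmed for i in range(n)]


# Hypothetical inputs: two groups with ill-conditioned scatter matrices,
# and five units of which two are far less plausible than the rest.
constrained = constrain_eigenvalues([[100.0, 1.0], [4.0, 0.01]], max_ratio=10.0)
keep = impartial_trim_mask([-1.0, -50.0, -2.0, -3.0, -40.0], alpha=0.4)
```

In an estimation loop, the trimming mask would be recomputed at every iteration from the current parameter estimates, so that no unit is discarded permanently on the basis of an early, possibly poor, fit.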
The supervised hierarchical Dirichlet process
We propose the supervised hierarchical Dirichlet process (sHDP), a
nonparametric generative model for the joint distribution of a group of
observations and a response variable directly associated with that whole group.
We compare the sHDP with another leading method for regression on grouped data,
the supervised latent Dirichlet allocation (sLDA) model. We evaluate our method
on two real-world classification problems and two real-world regression
problems. Bayesian nonparametric regression models based on the Dirichlet
process, such as the Dirichlet process-generalised linear models (DP-GLM), have
previously been explored; these models allow flexibility in modelling nonlinear
relationships. However, until now, Hierarchical Dirichlet Process (HDP)
mixtures have not seen significant use in supervised problems with grouped data
since a straightforward application of the HDP on the grouped data results in
learnt clusters that are not predictive of the responses. The sHDP solves this
problem by allowing for clusters to be learnt jointly from the group structure
and from the label assigned to each group.
Comment: 14 pages
Constraint based approaches to interpretable and semi-supervised machine learning
Interpretability and Explainability of machine learning algorithms are becoming increasingly important as Machine Learning (ML) systems get widely applied to domains like clinical healthcare, social media and governance. A related major challenge in deploying ML systems pertains to reliable learning when expert annotation is severely limited. This dissertation prescribes a common framework to address these challenges, based on the use of constraints that can make an ML model more interpretable, lead to novel methods for explaining ML models, or help to learn reliably with limited supervision.
In particular, we focus on the class of latent variable models and develop a general learning framework by constraining realizations of latent variables and/or model parameters. We propose specific constraints that can be used to develop identifiable latent variable models, that in turn learn interpretable outcomes. The proposed framework is first used in Non-negative Matrix Factorization and Probabilistic Graphical Models. For both models, algorithms are proposed to incorporate such constraints with seamless and tractable augmentation of the associated learning and inference procedures. The utility of the proposed methods is demonstrated for our working application domain: identifiable phenotyping using Electronic Health Records (EHRs). Evaluation by domain experts reveals that the proposed models are indeed more clinically relevant (and hence more interpretable) than existing counterparts. The work also demonstrates that while there may be inherent trade-offs between constraining models to encourage interpretability, the quantitative performance of downstream tasks remains competitive.
We then focus on constraint based mechanisms to explain decisions or outcomes of supervised black-box models. We propose an explanation model based on generating examples where the nature of the examples is constrained, i.e., they have to be sampled from the underlying data domain. To do so, we train a generative model to characterize the data manifold in a high dimensional ambient space. Constrained sampling then allows us to generate naturalistic examples that lie along the data manifold. We propose ways to summarize model behavior using such constrained examples.
In the last part of the contributions, we argue that heterogeneity of data sources is useful in situations where very little to no supervision is available. This thesis leverages such heterogeneity (via constraints) for two critical but widely different machine learning algorithms. In each case, a novel algorithm in the sub-class of co-regularization is developed to combine information from heterogeneous sources. Co-regularization is a framework of constraining latent variables and/or latent distributions in order to leverage heterogeneity. The proposed algorithms are utilized for clustering, where the intent is to generate a partition or grouping of observed samples, and for Learning to Rank algorithms, which are used to rank a set of observed samples in order of preference with respect to a specific search query. The proposed methods are evaluated on clustering web documents, social network users, and information retrieval applications for ranking search queries.
Electrical and Computer Engineering