    Sanitized Clustering against Confounding Bias

    Real-world datasets inevitably contain biases that arise from different sources or conditions during data collection. Such inconsistency itself acts as a confounding factor that disturbs cluster analysis. Existing methods eliminate the biases by projecting data onto the orthogonal complement of the subspace spanned by the confounding factor before clustering. There, the clustering factor of interest and the confounding factor are coarsely treated in the raw feature space, where the correlation between the data and the confounding factor is assumed to be linear for the sake of convenient solutions. These approaches are thus limited in scope, as data in real applications is usually complex and non-linearly correlated with the confounding factor. This paper presents a new clustering framework named Sanitized Clustering Against confounding Bias (SCAB), which removes the confounding factor in the semantic latent space of complex data through a non-linear dependence measure. Specifically, we eliminate the bias information in the latent space by minimizing the mutual information between the confounding factor and the latent representation delivered by a Variational Auto-Encoder (VAE). Meanwhile, a clustering module is introduced to cluster over the purified latent representations. Extensive experiments on complex datasets demonstrate that SCAB achieves a significant gain in clustering performance by removing the confounding bias. The code is available at https://github.com/EvaFlower/SCAB. Comment: Machine Learning, in press.
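    The objective can be sketched compactly. Below is a minimal, hypothetical PyTorch-style sketch of the idea, not the paper's implementation: the mutual-information penalty MI(z, c) is approximated adversarially by a classifier that tries to recover the confounder from the latent code, with the encoder trained to defeat it. The layer sizes, names, and the adversarial MI proxy are all assumptions.

```python
import torch
import torch.nn as nn

class SanitizedVAE(nn.Module):
    """Hypothetical sketch of the SCAB idea, not the paper's code.

    A VAE encodes x into z; an adversary tries to predict the discrete
    confounder c from z. Training the encoder to defeat the adversary is
    a common practical stand-in for minimizing MI(z, c).
    """
    def __init__(self, x_dim=784, z_dim=10, n_confounds=3):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, 128), nn.ReLU())
        self.mu = nn.Linear(128, z_dim)
        self.logvar = nn.Linear(128, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim, 128), nn.ReLU(),
                                 nn.Linear(128, x_dim))
        self.adv = nn.Linear(z_dim, n_confounds)  # confounder predictor

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparam.
        return self.dec(z), mu, logvar, z

def encoder_loss(model, x, c, lam=1.0):
    """Reconstruction + KL, minus the adversary's confounder loss, so the
    encoder is pushed to make z uninformative about c. The adversary
    itself is trained separately to minimize its cross-entropy on (z, c).
    c is a LongTensor of confounder class indices."""
    x_hat, mu, logvar, z = model(x)
    rec = ((x - x_hat) ** 2).sum(1).mean()
    kld = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(1).mean()
    adv = nn.functional.cross_entropy(model.adv(z), c)
    return rec + kld - lam * adv
```

    A clustering head (for instance, learnable centroids or k-means over encoded batches) would then operate on the purified z, as the abstract describes.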

    Statistics meets Machine Learning

    Theory and application go hand in hand in most areas of statistics. In a world flooded with huge amounts of data waiting to be analyzed, classified and transformed into useful outputs, the design of fast, robust and stable algorithms has never been as important as it is today. On the other hand, irrespective of whether the focus is on estimation, prediction, classification or other purposes, it is equally crucial to provide clear theoretical guarantees for such algorithms. Many statisticians, independently of their original research interests, have become increasingly aware of the numerical needs faced in numerous applications, including gene expression profiling, health care, pattern and speech recognition, data security, marketing personalization and natural language processing, to name just a few. The goal of this workshop is twofold: (a) to exchange knowledge on successful algorithmic approaches and discuss some of the existing challenges, and (b) to bring together researchers in statistics and machine learning with the aim of sharing expertise and exploiting differences in points of view to obtain a better understanding of some of the common important problems.

    Finding Optimal Diverse Feature Sets with Alternative Feature Selection

    Feature selection is popular for obtaining small, interpretable, yet highly accurate prediction models. Conventional feature-selection methods typically yield only one feature set, which might not suffice in some scenarios. For example, users might be interested in finding alternative feature sets with similar prediction quality, offering different explanations of the data. In this article, we introduce alternative feature selection and formalize it as an optimization problem. In particular, we define alternatives via constraints and enable users to control the number and dissimilarity of alternatives. Next, we analyze the complexity of this optimization problem and show NP-hardness. Further, we discuss how to integrate conventional feature-selection methods as objectives. Finally, we evaluate alternative feature selection with 30 classification datasets. We observe that alternative feature sets may indeed have high prediction quality, and we analyze several factors influencing this outcome.
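    To make the constraint idea concrete, here is a toy Python sketch, not the article's method: feature quality is scored with scikit-learn's mutual_info_classif (one of many possible objectives), and each alternative set may overlap with every previously found set in at most max_overlap features. The greedy search and all parameter names are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

def alternative_feature_sets(X, y, k=5, num_alternatives=3, max_overlap=1):
    """Greedy sketch: each alternative keeps at most `max_overlap`
    features in common with every previously selected set."""
    scores = mutual_info_classif(X, y, random_state=0)
    order = np.argsort(scores)[::-1]          # features, best first
    found = []
    for _ in range(num_alternatives + 1):     # original set + alternatives
        chosen = []
        for f in order:
            # Constraint: overlap with each earlier set stays small.
            if all(len(set(chosen + [f]) & set(s)) <= max_overlap
                   for s in found):
                chosen.append(f)
            if len(chosen) == k:
                break
        if len(chosen) < k:
            break                             # constraints became infeasible
        found.append(chosen)
    return found

X, y = make_classification(n_samples=300, n_features=20, n_informative=8,
                           random_state=0)
for i, s in enumerate(alternative_feature_sets(X, y)):
    print(f"set {i}: {sorted(int(f) for f in s)}")
```

    Tightening max_overlap or raising num_alternatives makes the constraints harder to satisfy, which mirrors the trade-off between the number and dissimilarity of alternatives that the article lets users control.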

    Learning With An Insufficient Supply Of Data Via Knowledge Transfer And Sharing

    As machine learning methods extend to more complex and diverse sets of problems, situations arise where the available data is not adequate to generate a representative hypothesis. Learning from multiple sources of data is a promising research direction as researchers leverage ever more diverse sources of information. Since data is not readily available, knowledge has to be transferred from other sources, and new methods (both supervised and unsupervised) have to be developed to selectively share and transfer knowledge. In this dissertation, we present both supervised and unsupervised techniques to tackle problems where learning algorithms cannot generalize and require an extension to leverage knowledge from different sources of data. Knowledge transfer is a difficult problem, as diverse sources of data can overwhelm each individual dataset's distribution, and a careful set of transformations has to be applied to increase the relevant knowledge at the risk of biasing a dataset's distribution and inducing negative transfer that can degrade a learner's performance. We give an overview of the issues encountered when the learning dataset does not have a sufficient supply of training examples. We categorize the structure of small datasets and highlight the need for further research. We present an instance-transfer supervised classification algorithm that improves classification performance on a target dataset via knowledge transfer from an auxiliary dataset. The improved classification performance of our algorithm is demonstrated with several real-world experiments. We extend the instance-transfer paradigm to supervised classification with 'Absolute Rarity', where a dataset has an insufficient supply of training examples and a skewed class distribution. We demonstrate one solution with a transfer learning approach and another with an imbalanced learning approach, and show the effectiveness of our algorithms on several real-world text and demographics classification problems (among others). Finally, we present an unsupervised multi-task clustering algorithm in which several small datasets are clustered simultaneously and knowledge is transferred between them to improve clustering performance on each individual dataset; the improved clustering performance is demonstrated with an extensive set of experiments.
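    As one illustration of the instance-transfer idea, and not the dissertation's algorithm, the sketch below reweights auxiliary instances in a boosting-like loop in the spirit of TrAdaBoost: auxiliary examples that the current model misclassifies are gradually down-weighted, a simple guard against the negative transfer the abstract warns about. The base learner, round count and weighting schedule are assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def instance_transfer_fit(X_aux, y_aux, X_tgt, y_tgt, rounds=10):
    """Train on auxiliary + target data; repeatedly down-weight auxiliary
    instances the model gets wrong so that harmful source examples fade
    out. Simplified sketch, not the dissertation's method."""
    X = np.vstack([X_aux, X_tgt])
    y = np.concatenate([y_aux, y_tgt])
    n_aux = len(X_aux)
    w = np.ones(len(X)) / len(X)
    # TrAdaBoost-style decay factor for misclassified source instances.
    beta = 1.0 / (1.0 + np.sqrt(2.0 * np.log(n_aux) / rounds))
    clf = None
    for _ in range(rounds):
        clf = DecisionTreeClassifier(max_depth=3)
        clf.fit(X, y, sample_weight=w)
        wrong = clf.predict(X) != y
        # Down-weight only the misclassified *auxiliary* instances.
        w[:n_aux][wrong[:n_aux]] *= beta
        w /= w.sum()
    return clf
```

    The same reweighting principle extends naturally to the skewed-class 'Absolute Rarity' setting, where the weights of the rare target class can additionally be boosted.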

    Modelling of Brainstem Toxicity Including Variable Relative Biological Effectiveness in Paediatric Proton Therapy

    Brainstem necrosis is a rare but severe side effect of paediatric proton therapy. Substructures of the brainstem may be associated with regional differences in radiosensitivity, but these are not accounted for clinically. The relative biological effectiveness (RBE), the ratio between a test dose and a reference dose resulting in the same clinical endpoint, is also assumed to be constant in proton therapy. This may underestimate the biological effect of the radiation, since the RBE is thought to vary across the beam profile. Current dose constraints and normal tissue complication probability (NTCP) models are further developed for adult tissues than for paediatric patients; however, paediatric tissue is associated with different radiosensitivity than adult tissue, and more data is required to quantify this. This study aimed to further explore the association between variable RBE, regional radiosensitivity of the brainstem and brainstem toxicity in paediatric proton therapy patients. A cohort of 36 paediatric proton therapy patients who received significant dose to the brainstem, and were therefore at risk of brainstem necrosis, was included in a case-control study. The patients had RBE-weighted dose distributions and dose-averaged linear energy transfer (LETd) distributions recalculated with the FLUKA Monte Carlo code for variable RBE models. The brainstem was delineated into substructures. Dose-volume histograms and dose statistics of the cohort were used to fit Lyman-Kutcher-Burman (LKB) models to the data for different RBE-weighted dose distributions and substructures. Dose statistics were also used as a basis for cluster analyses to explore regional differences across the brainstem. The results showed higher average variable RBE-weighted doses and LETd for cases than for controls, a trend not observed with the constant RBE factor. This thesis presents the first fitting of LKB models to substructures of the brainstem. For the full brainstem structure, the tolerance dose (TD50) range was 61.7-68.6 Gy(RBE) using RBE1.1 and 65.4-70.0 Gy(RBE) based on the variable RBE models. The cluster analysis separated the data points into a small number of relatively compact clusters but overall did not show clear trends in separating cases from controls. (Master's thesis in physics.)
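    The LKB model itself has a standard closed form: a dose-volume histogram is reduced to a generalized equivalent uniform dose, gEUD = (sum_i v_i * D_i^(1/n))^n, and the complication probability is NTCP = Phi((gEUD - TD50) / (m * TD50)), where Phi is the standard normal CDF. A minimal Python rendering follows; the example parameter values are illustrative only, not the thesis fits.

```python
import numpy as np
from scipy.stats import norm

def lkb_ntcp(doses, volumes, td50, m, n):
    """Lyman-Kutcher-Burman NTCP from a differential DVH.

    doses   : dose per DVH bin, in Gy(RBE)
    volumes : fractional volume per bin (sums to 1)
    td50    : dose giving 50% complication probability
    m       : slope parameter of the dose-response curve
    n       : volume-effect parameter (small n: serial-like organ)
    """
    geud = np.sum(volumes * doses ** (1.0 / n)) ** n
    t = (geud - td50) / (m * td50)
    return norm.cdf(t)

# Illustrative numbers only; these are not the cohort's fitted values.
doses = np.array([50.0, 55.0, 60.0])
volumes = np.array([0.5, 0.3, 0.2])
print(lkb_ntcp(doses, volumes, td50=65.0, m=0.1, n=0.1))
```

    Fitting the model to the cohort, as the thesis does per substructure and per RBE-weighted dose distribution, amounts to choosing TD50, m and n that maximize the likelihood of the observed case/control outcomes.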

    Low-Density Cluster Separators for Large, High-Dimensional, Mixed and Non-Linearly Separable Data.

    The location of groups of similar observations (clusters) in data is a well-studied problem with many practical applications. There is a wide range of approaches to clustering, which rely on different definitions of similarity and are appropriate for datasets with different characteristics. Despite a rich literature, there remain a number of open problems in clustering and limitations to existing algorithms. This thesis develops methodology for clustering high-dimensional, mixed datasets with complex clustering structures, using low-density cluster separators that bi-partition datasets along boundaries passing through regions of minimal density, thereby separating regions of high probability density associated with clusters. The bi-partitions arising from a succession of minimum-density cluster separators are combined using divisive hierarchical and partitional algorithms to locate a complete clustering, while estimating the number of clusters. The proposed algorithms locate cluster separators in one-dimensional, arbitrarily oriented subspaces, circumventing the challenges associated with clustering in high-dimensional spaces. This requires continuous observations; thus, to extend the applicability of the proposed algorithms to mixed datasets, methods for producing an appropriate continuous representation of datasets containing non-continuous features are investigated. The exact evaluation of the density intersected by a cluster boundary is restricted to linear separators. This limitation is lifted by a non-linear mapping of the original observations into a feature space, in which a linear separator permits the correct identification of non-linearly separable clusters in the original dataset. In large, high-dimensional datasets, searching for one-dimensional subspaces that yield a minimum-density separator is computationally expensive. Therefore, a computationally efficient approach to low-density cluster separation using approximately optimal projection directions is proposed, which searches over a collection of one-dimensional random projections for an appropriate subspace for cluster identification. The proposed approaches produce high-quality partitions that are competitive with well-established and state-of-the-art algorithms.
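    The random-projection search lends itself to a compact illustration. The sketch below assumes a simple KDE-based density estimate and naive direction sampling rather than the thesis's optimized projection pursuit: it scans random one-dimensional projections of the data and keeps the split point of lowest interior density.

```python
import numpy as np
from scipy.stats import gaussian_kde

def min_density_split(X, n_dirs=200, seed=0):
    """Search random 1-D projections for a low-density cluster separator.

    Sketch of the random-projection idea only: the thesis optimizes the
    projection direction, whereas here we merely sample directions and
    keep the one whose best split point has the lowest density.
    """
    rng = np.random.default_rng(seed)
    best = (np.inf, None, None)   # (density, direction, split point)
    for _ in range(n_dirs):
        v = rng.normal(size=X.shape[1])
        v /= np.linalg.norm(v)    # random unit direction
        p = X @ v                 # 1-D projection of the data
        kde = gaussian_kde(p)
        # Restrict the split to the central mass of the projection, to
        # avoid trivial cuts through the empty tails.
        grid = np.linspace(np.quantile(p, 0.1), np.quantile(p, 0.9), 100)
        dens = kde(grid)
        i = int(np.argmin(dens))
        if dens[i] < best[0]:
            best = (dens[i], v, grid[i])
    _, v, b = best
    return X @ v <= b             # boolean bipartition of the rows of X
```

    Applying such bipartitions recursively, hierarchically or within a partitional scheme, yields a complete clustering in the manner the abstract describes.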