1,113 research outputs found
Feature Selection in the Contrastive Analysis Setting
Contrastive analysis (CA) refers to the exploration of variations uniquely
enriched in a target dataset as compared to a corresponding background dataset
generated from sources of variation that are irrelevant to a given task. For
example, a biomedical data analyst may wish to find a small set of genes to use
as a proxy for variations in genomic data only present among patients with a
given disease (target) as opposed to healthy control subjects (background).
However, as of yet the problem of feature selection in the CA setting has
received little attention from the machine learning community. In this work we
present contrastive feature selection (CFS), a method for performing feature
selection in the CA setting. We motivate our approach with a novel
information-theoretic analysis of representation learning in the CA setting,
and we empirically validate CFS on a semi-synthetic dataset and four real-world
biomedical datasets. We find that our method consistently outperforms
previously proposed state-of-the-art supervised and fully unsupervised feature
selection methods not designed for the CA setting. An open-source
implementation of our method is available at https://github.com/suinleelab/CFS.Comment: NeurIPS 202
Recommended from our members
End-to-End Machine Learning Frameworks for Medicine: Data Imputation, Model Interpretation and Synthetic Data Generation
Tremendous successes in machine learning have been achieved in a variety of applications such as image classification and language translation via supervised learning frameworks. Recently, with the rapid increase of electronic health records (EHR), machine learning researchers got immense opportunities to adopt the successful supervised learning frameworks to diverse clinical applications. To properly employ machine learning frameworks for medicine, we need to handle the special properties of the EHR and clinical applications: (1) extensive missing data, (2) model interpretation, (3) privacy of the data. This dissertation addresses those specialties to construct end-to-end machine learning frameworks for clinical decision support. We focus on the following three problems: (1) how to deal with incomplete data (data imputation), (2) how to explain the decisions of the trained model (model interpretation), (3) how to generate synthetic data for better sharing private clinical data (synthetic data generation). To appropriately handle those problems, we propose novel machine learning algorithms for both static and longitudinal settings. For data imputation, we propose modified Generative Adversarial Networks and Recurrent Neural Networks to accurately impute the missing values and return the complete data for applying state-of-the-art supervised learning models. For model interpretation, we utilize the actor-critic framework to estimate feature importance of the trained model's decision in an instance level. We expand this algorithm to active sensing framework that recommends which observations should we measure and when. For synthetic data generation, we extend well-known Generative Adversarial Network frameworks from static setting to longitudinal setting, and propose a novel differentially private synthetic data generation framework.To demonstrate the utilities of the proposed models, we evaluate those models on various real-world medical datasets including cohorts in the intensive care units, wards, and primary care hospitals. We show that the proposed algorithms consistently outperform state-of-the-art for handling missing data, understanding the trained model, and generating private synthetic data that are critical for building end-to-end machine learning frameworks for medicine
A comprehensive survey on deep active learning and its applications in medical image analysis
Deep learning has achieved widespread success in medical image analysis,
leading to an increasing demand for large-scale expert-annotated medical image
datasets. Yet, the high cost of annotating medical images severely hampers the
development of deep learning in this field. To reduce annotation costs, active
learning aims to select the most informative samples for annotation and train
high-performance models with as few labeled samples as possible. In this
survey, we review the core methods of active learning, including the evaluation
of informativeness and sampling strategy. For the first time, we provide a
detailed summary of the integration of active learning with other
label-efficient techniques, such as semi-supervised, self-supervised learning,
and so on. Additionally, we also highlight active learning works that are
specifically tailored to medical image analysis. In the end, we offer our
perspectives on the future trends and challenges of active learning and its
applications in medical image analysis.Comment: Paper List on Github:
https://github.com/LightersWang/Awesome-Active-Learning-for-Medical-Image-Analysi
Feature selection for modular GA-based classification
Genetic algorithms (GAs) have been used as conventional methods for classifiers to adaptively evolve solutions for classification problems. Feature selection plays an important role in finding relevant features in classification. In this paper, feature selection is explored with modular GA-based classification. A new feature selection technique, Relative Importance Factor (RIF), is proposed to find less relevant features in the input domain of each class module. By removing these features, it is aimed to reduce the classification error and dimensionality of classification problems. Benchmark classification data sets are used to evaluate the proposed approach. The experiment results show that RIF can be used to find less relevant features and help achieve lower classification error with the feature space dimension reduced
Neural networks in multiphase reactors data mining: feature selection, prior knowledge, and model design
Les réseaux de neurones artificiels (RNA) suscitent toujours un vif intérêt dans la plupart des domaines d’ingénierie non seulement pour leur attirante « capacité d’apprentissage » mais aussi pour leur flexibilité et leur bonne performance, par rapport aux approches classiques. Les RNA sont capables «d’approximer» des relations complexes et non linéaires entre un vecteur de variables d’entrées x et une sortie y. Dans le contexte des réacteurs multiphasiques le potentiel des RNA est élevé car la modélisation via la résolution des équations d’écoulement est presque impossible pour les systèmes gaz-liquide-solide. L’utilisation des RNA dans les approches de régression et de classification rencontre cependant certaines difficultés. Un premier problème, général à tous les types de modélisation empirique, est celui de la sélection des variables explicatives qui consiste à décider quel sous-ensemble xs ⊂ x des variables indépendantes doit être retenu pour former les entrées du modèle. Les autres difficultés à surmonter, plus spécifiques aux RNA, sont : le sur-apprentissage, l’ambiguïté dans l’identification de l’architecture et des paramètres des RNA et le manque de compréhension phénoménologique du modèle résultant. Ce travail se concentre principalement sur trois problématiques dans l’utilisation des RNA: i) la sélection des variables, ii) l’utilisation de la connaissance apriori, et iii) le design du modèle. La sélection des variables, dans le contexte de la régression avec des groupes adimensionnels, a été menée avec les algorithmes génétiques. Dans le contexte de la classification, cette sélection a été faite avec des méthodes séquentielles. Les types de connaissance a priori que nous avons insérés dans le processus de construction des RNA sont : i) la monotonie et la concavité pour la régression, ii) la connectivité des classes et des coûts non égaux associés aux différentes erreurs, pour la classification. Les méthodologies développées dans ce travail ont permis de construire plusieurs modèles neuronaux fiables pour les prédictions de la rétention liquide et de la perte de charge dans les colonnes garnies à contre-courant ainsi que pour la prédiction des régimes d’écoulement dans les colonnes garnies à co-courant.Artificial neural networks (ANN) have recently gained enormous popularity in many engineering fields, not only for their appealing “learning ability, ” but also for their versatility and superior performance with respect to classical approaches. Without supposing a particular equational form, ANNs mimic complex nonlinear relationships that might exist between an input feature vector x and a dependent (output) variable y. In the context of multiphase reactors the potential of neural networks is high as the modeling by resolution of first principle equations to forecast sought key hydrodynamics and transfer characteristics is intractable. The general-purpose applicability of neural networks in regression and classification, however, poses some subsidiary difficulties that can make their use inappropriate for certain modeling problems. Some of these problems are general to any empirical modeling technique, including the feature selection step, in which one has to decide which subset xs ⊂ x should constitute the inputs (regressors) of the model. Other weaknesses specific to the neural networks are overfitting, model design ambiguity (architecture and parameters identification), and the lack of interpretability of resulting models. This work addresses three issues in the application of neural networks: i) feature selection ii) prior knowledge matching within the models (to answer to some extent the overfitting and interpretability issues), and iii) the model design. Feature selection was conducted with genetic algorithms (yet another companion from artificial intelligence area), which allowed identification of good combinations of dimensionless inputs to use in regression ANNs, or with sequential methods in a classification context. The type of a priori knowledge we wanted the resulting ANN models to match was the monotonicity and/or concavity in regression or class connectivity and different misclassification costs in classification. Even the purpose of the study was rather methodological; some resulting ANN models might be considered contributions per se. These models-- direct proofs for the underlying methodologies-- are useful for predicting liquid hold-up and pressure drop in counter-current packed beds and flow regime type in trickle beds
- …