1,113 research outputs found

    Feature Selection in the Contrastive Analysis Setting

    Full text link
    Contrastive analysis (CA) refers to the exploration of variations uniquely enriched in a target dataset as compared to a corresponding background dataset generated from sources of variation that are irrelevant to a given task. For example, a biomedical data analyst may wish to find a small set of genes to use as a proxy for variations in genomic data only present among patients with a given disease (target) as opposed to healthy control subjects (background). However, as of yet the problem of feature selection in the CA setting has received little attention from the machine learning community. In this work we present contrastive feature selection (CFS), a method for performing feature selection in the CA setting. We motivate our approach with a novel information-theoretic analysis of representation learning in the CA setting, and we empirically validate CFS on a semi-synthetic dataset and four real-world biomedical datasets. We find that our method consistently outperforms previously proposed state-of-the-art supervised and fully unsupervised feature selection methods not designed for the CA setting. An open-source implementation of our method is available at https://github.com/suinleelab/CFS.Comment: NeurIPS 202

    A comprehensive survey on deep active learning and its applications in medical image analysis

    Full text link
    Deep learning has achieved widespread success in medical image analysis, leading to an increasing demand for large-scale expert-annotated medical image datasets. Yet, the high cost of annotating medical images severely hampers the development of deep learning in this field. To reduce annotation costs, active learning aims to select the most informative samples for annotation and train high-performance models with as few labeled samples as possible. In this survey, we review the core methods of active learning, including the evaluation of informativeness and sampling strategy. For the first time, we provide a detailed summary of the integration of active learning with other label-efficient techniques, such as semi-supervised, self-supervised learning, and so on. Additionally, we also highlight active learning works that are specifically tailored to medical image analysis. In the end, we offer our perspectives on the future trends and challenges of active learning and its applications in medical image analysis.Comment: Paper List on Github: https://github.com/LightersWang/Awesome-Active-Learning-for-Medical-Image-Analysi

    Feature selection for modular GA-based classification

    Get PDF
    Genetic algorithms (GAs) have been used as conventional methods for classifiers to adaptively evolve solutions for classification problems. Feature selection plays an important role in finding relevant features in classification. In this paper, feature selection is explored with modular GA-based classification. A new feature selection technique, Relative Importance Factor (RIF), is proposed to find less relevant features in the input domain of each class module. By removing these features, it is aimed to reduce the classification error and dimensionality of classification problems. Benchmark classification data sets are used to evaluate the proposed approach. The experiment results show that RIF can be used to find less relevant features and help achieve lower classification error with the feature space dimension reduced

    Neural networks in multiphase reactors data mining: feature selection, prior knowledge, and model design

    Get PDF
    Les réseaux de neurones artificiels (RNA) suscitent toujours un vif intérêt dans la plupart des domaines d’ingénierie non seulement pour leur attirante « capacité d’apprentissage » mais aussi pour leur flexibilité et leur bonne performance, par rapport aux approches classiques. Les RNA sont capables «d’approximer» des relations complexes et non linéaires entre un vecteur de variables d’entrées x et une sortie y. Dans le contexte des réacteurs multiphasiques le potentiel des RNA est élevé car la modélisation via la résolution des équations d’écoulement est presque impossible pour les systèmes gaz-liquide-solide. L’utilisation des RNA dans les approches de régression et de classification rencontre cependant certaines difficultés. Un premier problème, général à tous les types de modélisation empirique, est celui de la sélection des variables explicatives qui consiste à décider quel sous-ensemble xs ⊂ x des variables indépendantes doit être retenu pour former les entrées du modèle. Les autres difficultés à surmonter, plus spécifiques aux RNA, sont : le sur-apprentissage, l’ambiguïté dans l’identification de l’architecture et des paramètres des RNA et le manque de compréhension phénoménologique du modèle résultant. Ce travail se concentre principalement sur trois problématiques dans l’utilisation des RNA: i) la sélection des variables, ii) l’utilisation de la connaissance apriori, et iii) le design du modèle. La sélection des variables, dans le contexte de la régression avec des groupes adimensionnels, a été menée avec les algorithmes génétiques. Dans le contexte de la classification, cette sélection a été faite avec des méthodes séquentielles. Les types de connaissance a priori que nous avons insérés dans le processus de construction des RNA sont : i) la monotonie et la concavité pour la régression, ii) la connectivité des classes et des coûts non égaux associés aux différentes erreurs, pour la classification. Les méthodologies développées dans ce travail ont permis de construire plusieurs modèles neuronaux fiables pour les prédictions de la rétention liquide et de la perte de charge dans les colonnes garnies à contre-courant ainsi que pour la prédiction des régimes d’écoulement dans les colonnes garnies à co-courant.Artificial neural networks (ANN) have recently gained enormous popularity in many engineering fields, not only for their appealing “learning ability, ” but also for their versatility and superior performance with respect to classical approaches. Without supposing a particular equational form, ANNs mimic complex nonlinear relationships that might exist between an input feature vector x and a dependent (output) variable y. In the context of multiphase reactors the potential of neural networks is high as the modeling by resolution of first principle equations to forecast sought key hydrodynamics and transfer characteristics is intractable. The general-purpose applicability of neural networks in regression and classification, however, poses some subsidiary difficulties that can make their use inappropriate for certain modeling problems. Some of these problems are general to any empirical modeling technique, including the feature selection step, in which one has to decide which subset xs ⊂ x should constitute the inputs (regressors) of the model. Other weaknesses specific to the neural networks are overfitting, model design ambiguity (architecture and parameters identification), and the lack of interpretability of resulting models. This work addresses three issues in the application of neural networks: i) feature selection ii) prior knowledge matching within the models (to answer to some extent the overfitting and interpretability issues), and iii) the model design. Feature selection was conducted with genetic algorithms (yet another companion from artificial intelligence area), which allowed identification of good combinations of dimensionless inputs to use in regression ANNs, or with sequential methods in a classification context. The type of a priori knowledge we wanted the resulting ANN models to match was the monotonicity and/or concavity in regression or class connectivity and different misclassification costs in classification. Even the purpose of the study was rather methodological; some resulting ANN models might be considered contributions per se. These models-- direct proofs for the underlying methodologies-- are useful for predicting liquid hold-up and pressure drop in counter-current packed beds and flow regime type in trickle beds
    corecore