    VaB-AL: Incorporating Class Imbalance and Difficulty with Variational Bayes for Active Learning

    Active Learning for discriminative models has largely been studied with a focus on individual samples, with less emphasis on how classes are distributed or which classes are hard to deal with. In this work, we show that this is harmful. We propose a method based on Bayes' rule that can naturally incorporate class imbalance into the Active Learning framework. We derive that three terms should be considered together when estimating the probability of a classifier making a mistake for a given sample: i) the probability of mislabelling a class, ii) the likelihood of the data given a predicted class, and iii) the prior probability on the abundance of a predicted class. Implementing these terms requires a generative model and an intractable likelihood estimation. Therefore, we train a Variational Auto Encoder (VAE) for this purpose. To further tie the VAE with the classifier and facilitate VAE training, we use the classifier's deep feature representations as input to the VAE. By considering all three probabilities, among them especially the data imbalance, we can substantially improve the potential of existing methods under a limited data budget. We show that our method can be applied to classification tasks on multiple different datasets, including one that is a real-world dataset with heavy data imbalance, significantly outperforming the state of the art.
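
One way to read the three terms is as a Bayes' rule mixture: the class-conditional likelihood and the class prior give a posterior over predicted classes, and the per-class mislabelling probability is averaged under that posterior. The sketch below is only an illustration of that reading, not the authors' implementation. The function name `mistake_probability`, the input layout, and the assumption that the mistake depends on the sample only through its predicted class are all hypothetical; in practice the log-likelihoods would come from a generative model such as the VAE.

```python
import math


def mistake_probability(p_mislabel, log_lik, prior):
    """Estimate P(mistake | x) for one sample (illustrative assumption).

    p_mislabel[c]: probability that predictions of class c are wrong.
    log_lik[c]:    log p(x | predicted class c), e.g. from a generative model.
    prior[c]:      prior probability (abundance) of predicted class c.
    """
    # Bayes' rule in log space: log p(x | c) + log P(c), up to a constant.
    log_joint = [ll + math.log(pr) for ll, pr in zip(log_lik, prior)]
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(log_joint)
    weights = [math.exp(v - top) for v in log_joint]
    total = sum(weights)
    # Average the per-class mislabelling probability under the posterior.
    return sum(pm * w / total for pm, w in zip(p_mislabel, weights))


def select_for_labelling(scores, budget):
    """Return indices of the `budget` samples with the highest scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:budget]
```

Under this reading, a rare class with a high mislabelling rate raises the score only to the extent that the likelihood and the prior together make that class plausible for the sample, which is how class imbalance enters the acquisition score.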

    Cross-scanner and cross-protocol multi-shell diffusion MRI data harmonization: algorithms and results

    Cross-scanner and cross-protocol variability of diffusion magnetic resonance imaging (dMRI) data are known to be major obstacles in multi-site clinical studies since they limit the ability to aggregate dMRI data and derived measures. Computational algorithms that harmonize the data and minimize such variability are critical to reliably combine datasets acquired from different scanners and/or protocols, thus improving the statistical power and sensitivity of multi-site studies. Different computational approaches have been proposed to harmonize diffusion MRI data or remove scanner-specific differences. To date, these methods have mostly been developed for or evaluated on single b-value diffusion MRI data. In this work, we present the evaluation results of 19 algorithms that were developed to harmonize the cross-scanner and cross-protocol variability of multi-shell diffusion MRI using a benchmark database. The proposed algorithms rely on various signal representation approaches and computational tools, such as rotationally invariant spherical harmonics, deep neural networks and hybrid biophysical and statistical approaches. The benchmark database consists of data acquired from the same subjects on two scanners with different maximum gradient strengths (80 and 300 mT/m) and with two protocols. We evaluated the performance of these algorithms for mapping multi-shell diffusion MRI data across scanners and across protocols using several state-of-the-art imaging measures. The results show that data harmonization algorithms can reduce the cross-scanner and cross-protocol variabilities to a similar level as scan-rescan variability using the same scanner and protocol.
In particular, the LinearRISH algorithm, based on adaptive linear mapping of rotationally invariant spherical harmonics features, yields the lowest variability for our data in predicting the fractional anisotropy (FA), mean diffusivity (MD), mean kurtosis (MK) and the rotationally invariant spherical harmonic (RISH) features. However, other algorithms, such as DIAMOND, SHResNet, DIQT and CMResNet, show further improvement in harmonizing the return-to-origin probability (RTOP). The performance of different approaches provides useful guidelines on data harmonization in future multi-site studies.
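
To make the idea of a linear mapping of RISH features concrete, the sketch below computes a RISH feature per spherical harmonic order as the sum of squared coefficients of that order, then rescales a target scan's coefficients so that its RISH features match reference values. This is a minimal sketch under stated assumptions, not the LinearRISH implementation evaluated in the benchmark. The function names, the coefficient layout (even orders only, 2l+1 coefficients per order, stored in increasing order), and the use of precomputed site-level RISH values are all hypothetical.

```python
import math


def rish_features(sh_coeffs, max_order):
    """Return {order: sum of squared SH coefficients of that order}.

    Assumes even orders 0, 2, ..., max_order with 2*l + 1 coefficients each,
    concatenated in increasing order (an assumed layout).
    """
    feats = {}
    start = 0
    for order in range(0, max_order + 1, 2):
        count = 2 * order + 1
        feats[order] = sum(c * c for c in sh_coeffs[start:start + count])
        start += count
    return feats


def scale_to_reference(sh_coeffs, target_rish, reference_rish, max_order,
                       eps=1e-12):
    """Rescale each order so its RISH feature moves toward the reference.

    target_rish / reference_rish map order -> expected RISH value at the
    target and reference sites (assumed to be estimated elsewhere, for
    example by averaging over matched subjects).
    """
    out = []
    start = 0
    for order in range(0, max_order + 1, 2):
        count = 2 * order + 1
        # RISH is quadratic in the coefficients, so scale by a square root.
        scale = math.sqrt(reference_rish[order] / (target_rish[order] + eps))
        out.extend(c * scale for c in sh_coeffs[start:start + count])
        start += count
    return out
```

Because the scaling is applied per order and not per coefficient, the orientation information carried by the relative values within each order is left unchanged; only the per-order energy is adjusted.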

    Study of emerging HPC applications and their impact on scheduling strategies

    With the expected convergence between HPC, BigData and AI, new applications with different profiles are coming to HPC infrastructures. We aim at better understanding the features and needs of these applications in order to be able to run them efficiently on HPC platforms. The approach followed is bottom-up: we thoroughly study an emerging application, Spatially Localized Atlas Network (SLANT, originating from the neuroscience community), to understand its behavior. Based on these observations, we derive a generic, yet simple, application model (namely, a linear sequence of stochastic jobs). We expect this model to be representative of a large set of upcoming applications that require the computational power of HPC clusters without fitting the typical behavior of large-scale traditional applications. In a second step, we show how one can manipulate this generic model in a scheduling framework. Specifically, we consider the problem of making reservations (both time and memory) for an execution on an HPC platform. We derive solutions using the model from the first step of this work. We experimentally show the robustness of the model, even when very little data or another application is used to generate the model, and provide performance gains.

    The convergence between high-performance computing, BigData and artificial intelligence is bringing new application profiles to HPC infrastructures. In this work, we study these new applications to better understand their characteristics and needs, with the goal of optimizing their execution on HPC platforms. To do so, we adopt a bottom-up approach. First, we study in detail an emerging application, SLANT, from the field of neuroscience. Through detailed profiling of the application, we expose its main characteristics as well as its needs in terms of computing resources. From these observations, we propose a generic application model, simple for now, made of a linear sequence of stochastic tasks. We believe this model should suit a wide variety of these emerging applications, which require the computing power of HPC clusters without exhibiting the typical behavior of applications that run on large-scale machines. Second, we show how to use the generic application model when developing scheduling strategies. More precisely, we focus on designing reservation strategies (in terms of both compute time and memory). We propose such solutions using the generic application model formulated in the first step of this work. Finally, we show the robustness of the application model and of our scheduling strategies through experimental evaluations. In particular, we demonstrate that our solutions outperform the standard approaches of the neuroscience community, even with partial data or when extended to applications other than SLANT.
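
The reservation problem for a stochastic job can be illustrated with a small sketch: given observed run times and an increasing sequence of reservation lengths, a run tries each reservation in turn until one is long enough. The cost model here is an assumption made for illustration only (every attempted reservation is paid in full, and a failed attempt restarts from scratch), the function name is hypothetical, and the memory dimension mentioned above is left out.

```python
def expected_reservation_cost(samples, reservations):
    """Average cost of a reservation sequence over observed run times.

    samples:      observed execution times of the job (any time unit).
    reservations: increasing reservation lengths, tried in order.
    Assumed cost model: each attempted reservation is paid in full.
    """
    if not samples or not reservations:
        raise ValueError("samples and reservations must be non-empty")
    if max(samples) > reservations[-1]:
        raise ValueError("last reservation must cover the longest run")
    total = 0.0
    for runtime in samples:
        for length in reservations:
            total += length  # pay for this attempt
            if runtime <= length:
                break  # the job finished within this reservation
    return total / len(samples)
```

With run times `[1, 2, 4]`, a single reservation of length 4 costs 4 on average, while the sequence `[2, 4]` costs 10/3, since two of the three runs finish in the short first slot. Comparing candidate sequences this way is the kind of trade-off a reservation strategy has to resolve.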