
    Non-parametric feature selection for machine learning in complex settings

    In our big data society, the possibilities for acquiring and storing data are continuously increasing. As a result, very large datasets are often constructed for a given prediction problem. Besides a large number of data points, these datasets often contain many features, corresponding to the characteristics measured for each data point. Lacking prior information about the relevance of the collected features, practitioners often consider all of the available information; they thus have to deal with databases containing a huge number of features, many of which are generally redundant or irrelevant for the problem at hand.

    The task of feature selection consists in detecting a small set of features that are relevant and informative for the considered problem. Feature selection is widely recognized as a preprocessing step of major importance, with many potential benefits: improving the performance of prediction models and speeding up their construction, reducing data acquisition and storage costs, and allowing the problems at hand to be better understood and more easily interpreted. For these reasons, feature selection has become an extremely active research field. However, most existing work is dedicated to classical prediction problems, where a complete dataset of continuous or categorical features and a full output target vector are available. In many situations, data is not encountered in such an ideal form. The main objective of the thesis is therefore to propose efficient feature selection algorithms for non-standard situations, which we generically denote as complex settings. More precisely, four kinds of problems are tackled in the thesis: semi-supervised problems, multi-label classification, datasets with missing values, and problems with uncertain/noisy labels. All of these settings correspond to frequently encountered problems that are important to address in practice.
The second objective of this thesis is more theoretical. We study the adequacy of mutual information as a feature selection criterion, with respect to the Bayes risk for classification problems and to the mean squared error and mean absolute error for regression problems. This study gives valuable insights into the behaviour of mutual information for feature selection. (FSA - Sciences de l'ingénieur) -- UCL, 201
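To illustrate the general idea of using mutual information as a filter criterion (this is a minimal sketch of the standard filter approach, not the specific algorithms developed in the thesis), one can estimate the empirical mutual information between each discrete feature and the class label, then rank features by decreasing score. The function names below are illustrative:

```python
import numpy as np

def mutual_information(x, y):
    """Empirical mutual information I(X;Y), in nats, for discrete arrays x and y."""
    mi = 0.0
    for xv in np.unique(x):
        for yv in np.unique(y):
            p_xy = np.mean((x == xv) & (y == yv))  # joint probability estimate
            p_x = np.mean(x == xv)                 # marginal of X
            p_y = np.mean(y == yv)                 # marginal of Y
            if p_xy > 0:
                mi += p_xy * np.log(p_xy / (p_x * p_y))
    return mi

def rank_features(X, y):
    """Rank the columns of X by decreasing mutual information with y."""
    scores = [mutual_information(X[:, j], y) for j in range(X.shape[1])]
    return np.argsort(scores)[::-1], scores

# Toy example: feature 0 is a copy of the label, feature 1 is pure noise,
# so feature 0 should be ranked first.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, 200)
X = np.column_stack([y, rng.integers(0, 2, 200)])
order, scores = rank_features(X, y)
print(order[0])  # → 0
```

Such univariate ranking ignores redundancy between features; greedy multivariate schemes that score candidate features jointly with those already selected address this, at a higher computational cost.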