344 research outputs found

    Stable variable selection for right censored data: comparison of methods

    The instability of model selection is a major concern with data sets containing a large number of covariates. This paper deals with variable selection methodology in high-dimensional problems where the response variable can be right censored. We focus on new stable variable selection methods based on the bootstrap for two methodologies: the Cox proportional hazards model and survival trees. For the Cox model, we investigate bootstrapping applied to two variable selection techniques: the stepwise algorithm based on the AIC criterion and Lasso L1 penalization. Regarding survival trees, we review two methodologies: bootstrap node-level stabilization and random survival forests. We apply these different approaches to two real data sets and compare the methods on the prediction error rate, based on the Harrell concordance index, and on the relevance of the interpretation of the corresponding selected models. The aim is to find a compromise between good prediction performance and ease of interpretation for clinicians. Results suggest that, with a small number of individuals, bootstrapping adapted to L1 penalization in the Cox model or bootstrap node-level stabilization in survival trees provide a good alternative to random survival forests, which are known to give the smallest prediction error rate but are difficult for non-statisticians to interpret. From a clinical perspective, the complementarity between the methods based on the Cox model and those based on survival trees would make it possible to build reliable models that are easy for clinicians to interpret.
    Comment: number of pages: 29; number of tables: 2; number of figures:
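
    To make the selection scheme concrete, the sketch below resamples the data with replacement, fits an L1-penalized Cox model on each resample, and keeps the covariates selected in most resamples. It is a minimal illustration under stated assumptions, not the paper's code: the synthetic data, the lifelines-based fit, the penalty value, and the 80% stability threshold are all illustrative choices.

    # Bootstrap-stabilized variable selection with an L1-penalized Cox model.
    # All numerical settings below are illustrative assumptions.
    import numpy as np
    import pandas as pd
    from lifelines import CoxPHFitter

    rng = np.random.default_rng(0)
    n, p = 200, 10
    X = rng.normal(size=(n, p))
    # Only the first two covariates truly affect the hazard.
    linpred = 1.0 * X[:, 0] - 1.0 * X[:, 1]
    time = rng.exponential(scale=np.exp(-linpred))
    censor = rng.exponential(scale=2.0, size=n)          # right censoring
    df = pd.DataFrame(X, columns=[f"x{j}" for j in range(p)])
    df["T"] = np.minimum(time, censor)
    df["E"] = (time <= censor).astype(int)

    B = 100                      # number of bootstrap resamples
    counts = np.zeros(p)
    for _ in range(B):
        boot = df.sample(n=n, replace=True).reset_index(drop=True)
        cph = CoxPHFitter(penalizer=0.1, l1_ratio=1.0)   # Lasso-type penalty
        cph.fit(boot, duration_col="T", event_col="E")
        # lifelines shrinks coefficients towards zero but may not zero them
        # exactly, so count a variable as "selected" above a small threshold.
        counts += (cph.params_.abs().values > 1e-2).astype(int)

    freq = pd.Series(counts / B, index=df.columns[:p])
    stable = freq[freq >= 0.8]   # keep variables selected in >= 80% of resamples
    print(freq.sort_values(ascending=False))
    print("Stable set:", list(stable.index))

    On the stable subset one would then refit an unpenalized Cox model and report Harrell's concordance index (for example via lifelines' concordance_index_ attribute) to compare against the tree-based alternatives.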

    De Statisticien à Data Scientist: Développements pédagogiques à l'INSA de Toulouse

    According to a recent report from the European Commission, the world generates 1.7 million billion bytes of data every minute, the equivalent of 360,000 DVDs, and companies that build their decision-making processes by exploiting these data increase their productivity. The processing and exploitation of massive data has consequences for the employment of statistics graduates. Which additional skills do students trained in statistics need to acquire to become data scientists? How should these programs evolve so that future graduates can adapt to rapid changes in this field, without neglecting traditional careers and the fundamental, lasting core of the training? After considering the notion of big data and questioning the emergence of a "new" science, Data Science, we present the current developments in the Mathematical Engineering and Modeling (Génie Mathématique et Modélisation) engineering program at INSA Toulouse.

    Statistique et Big Data Analytics; Volumétrie, L'Attaque des Clones

    This article assumes the reader has already acquired the skills and expertise of a statistician in unsupervised (NMF, k-means, SVD) and supervised learning (regression, CART, random forests). What skills and knowledge must a statistician acquire to reach the "Volume" scale of big data? After a quick overview of the available strategies, especially those imposed by Hadoop, the algorithms of some available learning methods are outlined in order to understand how they adapt to the strong constraints of the Map-Reduce functionalities.
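
    As a concrete illustration of that constraint, the sketch below expresses one k-means iteration as a map step (each data block emits per-center partial sums and counts) and a reduce step (partial results combine by addition). Plain Python stands in for a real Hadoop cluster; the split into eight blocks, the helper names, and the random data are illustrative assumptions, not Hadoop's actual API.

    # One k-means iteration decomposed into map and reduce steps.
    from functools import reduce
    import numpy as np

    def map_partition(points, centers):
        """Map step: assign each point of one data block to its nearest
        center and emit per-center partial sums and counts."""
        k, d = centers.shape
        sums, counts = np.zeros((k, d)), np.zeros(k)
        for x in points:
            j = np.argmin(((centers - x) ** 2).sum(axis=1))
            sums[j] += x
            counts[j] += 1
        return sums, counts

    def reduce_partials(a, b):
        """Reduce step: partial sums and counts combine by addition,
        which is why k-means ports naturally to Map-Reduce."""
        return a[0] + b[0], a[1] + b[1]

    rng = np.random.default_rng(0)
    data = rng.normal(size=(10_000, 2))
    splits = np.array_split(data, 8)        # 8 "blocks" of the distributed file
    centers = data[rng.choice(len(data), 3, replace=False)]

    for _ in range(10):                     # one Map-Reduce job per iteration
        partials = [map_partition(s, centers) for s in splits]  # parallelizable
        sums, counts = reduce(reduce_partials, partials)
        centers = sums / np.maximum(counts, 1)[:, None]  # guard empty clusters
    print(centers)

    Only algorithms whose updates aggregate this way (sums, counts, histograms) port directly; methods that need many fine-grained passes over the data pay the price of one Map-Reduce job per iteration.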