Search CORE

43 research outputs found

Analyse supervisée multibloc en grande dimension

Author: Lorenzo Hadrien
Publication venue: HAL CCSD
Publication date: 27/11/2019
Field of study

The objective of statistical learning is to learn from observed data in order to predict the response for a new sample. In the context of vaccination, the number of features exceeds the number of individuals. This is a degenerate case of statistical analysis that requires specific tools, and regularization algorithms can handle it. Several types of regularization methods exist, and the choice depends on the structure of the data set and on the question being asked. In this work, the main objective was to use the information available from SVD decompositions of soft-thresholded empirical covariance matrix estimates. This solution is particularly efficient in terms of variable selection and computation time. Heterogeneous data sets (coming from different sources, also called multiblock data) are at the core of our methodology. Because some types of data are expensive to generate, they are often acquired only on a subsample of the population, which leads to block-wise missing data patterns. The second objective of our methodology is to handle those missing values using the response values. Because the response values are not available in the test data sets, we designed the methodology to handle missing values in both the training and the test data sets. Thanks to soft-thresholding, our methodology can regularize and select features. The estimator requires only two parameters to be set: the number of components and the maximum number of features to be selected. Both are tuned by cross-validation. According to simulations, the proposed method shows very good results compared with benchmark methods, especially in terms of prediction and computation time. The method has also been applied to several real data sets from vaccine, thromboembolic and food research.
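To make the idea of "soft-thresholding a covariance matrix, then taking an SVD" concrete, here is a minimal sketch in plain Python. It is not the thesis's estimator: the threshold `lam`, the toy data and all function names are assumptions made for illustration, and the abstract parameterizes the method by a maximum number of selected features rather than by a threshold value. The sketch soft-thresholds the empirical cross-covariance between features and responses, then extracts the leading left singular vector by power iteration; features whose weight is zero are treated as deselected.

```python
import math


def soft_threshold(value, lam):
    """Shrink value toward zero by lam; anything within [-lam, lam] becomes 0."""
    return math.copysign(max(abs(value) - lam, 0.0), value)


def center(column):
    mean = sum(column) / len(column)
    return [v - mean for v in column]


def cross_covariance(x_cols, y_cols):
    """Empirical cross-covariance: one row per feature, one column per response."""
    n = len(x_cols[0])
    xc = [center(c) for c in x_cols]
    yc = [center(c) for c in y_cols]
    return [[sum(a * b for a, b in zip(xj, yk)) / (n - 1) for yk in yc] for xj in xc]


def leading_left_singular_vector(m, iterations=100):
    """Power iteration on M M^T; returns zeros if the iteration collapses
    (for example when M is entirely zero)."""
    p, q = len(m), len(m[0])
    u = [1.0] * p
    for _ in range(iterations):
        v = [sum(m[j][k] * u[j] for j in range(p)) for k in range(q)]  # M^T u
        w = [sum(m[j][k] * v[k] for k in range(q)) for j in range(p)]  # M v
        norm = math.sqrt(sum(t * t for t in w))
        if norm == 0.0:
            return [0.0] * p
        u = [t / norm for t in w]
    return u


# Toy data (hypothetical): 5 individuals, 3 features, 2 responses, stored by column.
x_cols = [[1, 2, 3, 4, 5], [1, 0, 1, 0, 1], [2, 1, 2, 3, 2]]
y_cols = [[2, 4, 6, 8, 10], [1, 1, 2, 2, 3]]

lam = 1.1  # hypothetical threshold, chosen by hand for this toy example
m = [[soft_threshold(c, lam) for c in row] for row in cross_covariance(x_cols, y_cols)]
weights = leading_left_singular_vector(m)
selected = [j for j, w in enumerate(weights) if abs(w) > 1e-12]
print(selected)
```

On this toy data only the first feature keeps a non-zero weight, which shows how the thresholding step performs regularization and variable selection in a single pass. In practice the threshold (or the number of features kept) would be tuned by cross-validation, as the abstract states.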

INRIA a CCSD electronic archive server

A Roadmap for HEP Software and Computing R&D for the 2020s

Author: Albrecht Johannes
Alves Antonio Augusto, Jr
Amadio Guilherme
Andronico Giuseppe
Anh-Ky Nguyen
Aphecetche Laurent
Apostolakis John
Asai Makoto
Atzori Luca
Babik Marian
Bagliesi Giuseppe
Bandieramonte Marilena
Banerjee Sunanda
Barisits Martin
Bauerdick Lothar A. T.
Belforte Stefano
Benjamin Douglas
Bernius Catrin
Bhimji Wahid
Bianchi Riccardo Maria
Bird Ian
Biscarat Catherine
Blomer Jakob
Bloom Kenneth
Boccali Tommaso
Bockelman Brian
Bold Tomasz
Bonacorsi Daniele
Boveia Antonio
Bozzi Concezio
Bracko Marko
Britton David
Buckley Andy
Buncic Predrag
Calafiura Paolo
Campana Simone
Canal Philippe
Canali Luca
Carlino Gianpaolo
Castro Nuno
Cattaneo Marco
Cerminara Gianluca
Cervantes Villanueva Javier
Chang Philip
Chapman John
Chen Gang
Childers Taylor
Clarke Peter
Clemencic Marco
Cogneras Eric
Coles Jeremy
Collier Ian
Colling David
Corti Gloria
Cosmo Gabriele
Costanzo Davide
Couturier Ben
Cranmer Kyle
Cranshaw Jack
Cristella Leonardo
Crooks David
Crépé-Renaudin Sabine
Currie Robert
Dallmeier-Tiessen Sünje
De Cian Michel
De Roeck Albert
De Kaushik
Delgado Peris Antonio
Derue Frédéric
Di Girolamo Alessandro
Di Guida Salvatore
Dimitrov Gancho
Doglioni Caterina
Dotti Andrea
Duellmann Dirk
Duflot Laurent
Dykstra Dave
Dziedziniewicz-Wojcik Katarzyna
Dziurda Agnieszka
Egede Ulrik
Elmer Peter
Elmsheuser Johannes
Elvira V. Daniel
Eulisse Giulio
Farrell Steven
Ferber Torben
Filipcic Andrej
Fisk Ian
Fitzpatrick Conor
Flix José
Formica Andrea
Forti Alessandra
HEP Software Foundation
Franzoni Giovanni
Frost James
Fuess Stu
Gaede Frank
Ganis Gerardo
Gardner Robert
Garonne Vincent
Gellrich Andreas
Genser Krzysztof
George Simon
Geurts Frank
Gheata Andrei
Gheata Mihaela
Giacomini Francesco
Giagu Stefano
Giffels Manuel
Gingrich Douglas
Girone Maria
Gligorov Vladimir V.
Glushkov Ivan
Gohn Wesley
Gonzalez Lopez Jose Benito
González Caballero Isidro
González Fernández Juan R.
Govi Giacomo
Grandi Claudio
Grasland Hadrien
Gray Heather
Grillo Lucia
Guan Wen
Gutsche Oliver
Gyurjyan Vardan
Hanushevsky Andrew
Hariri Farah
Hartmann Thomas
Harvey John
Hauth Thomas
Hegner Benedikt
Heinemann Beate
Heinrich Lukas
Heiss Andreas
Hernández José M.
Hildreth Michael
Hodgkinson Mark
Hoeche Stefan
Holzman Burt
Hristov Peter
Huang Xingtao
Ivanchenko Vladimir N.
Ivanov Todor
Iven Jan
Jashal Brij
Jayatilaka Bodhitha
Jones Roger
Jouvin Michel
Jun Soon Yung
Kagan Michael
Kalderon Charles William
Kane Meghan
Karavakis Edward
Katz Daniel S.
Kcira Dorian
Keeble Oliver
Kersevan Borut Paul
Kirby Michael
Klimentov Alexei
Klute Markus
Komarov Ilya
Konstantinov Dmitri
Koppenburg Patrick
Kowalkowski Jim
Kreczko Luke
Kuhr Thomas
Kutschke Robert
Kuznetsov Valentin
Lampl Walter
Lancon Eric
Lange David
Lassnig Mario
Laycock Paul
Leggett Charles
Letts James
Lewendel Birgit
Li Teng
Lima Guilherme
Linacre Jacob
Linden Tomas
Livny Miron
Lo Presti Giuseppe
Lopienski Sebastian
Love Peter
Lyon Adam
Magini Nicolò
Marshall Zachary L
Martelli Edoardo
Martin-Haugh Stewart
Mato Pere
Mazumdar Kajari
McCauley Thomas
McFayden Josh
McKee Shawn
McNab Andrew
Mehdiyev Rashid
Meinhard Helge
Menasce Dario
Mendez Lorenzo Patricia
Mete Alaettin Serhan
Michelotto Michele
Mitrevski Jovan
Moneta Lorenzo
Morgan Ben
Mount Richard
Moyse Edward
Murray Sean
Nairz Armin
Neubauer Mark S
Norman Andrew
Novaes Sérgio
Novak Mihaly
Oyanguren Arantza
Ozturk Nurcan
Pacheco Pages Andres
Paganini Michela
Pansanel Jerome
Pascuzzi Vincent R.
Patrick Glenn
Pearce Alex
Pearson Ben
Pedro Kevin
Perdue Gabriel
Perez-Calero Yzquierdo Antonio
Perrozzi Luca
Petersen Troels
Petric Marko
Petzold Andreas
Piedra Jónatan
Piilonen Leo
Piparo Danilo
Pivarski Jim
Pokorski Witold
Polci Francesco
Potamianos Karolos
Psihas Fernanda
Puig Navarro Albert
Quast Günter
Raven Gerhard
Reuter Jürgen
Ribon Alberto
Rinaldi Lorenzo
Ritter Martin
Robinson James
Rodrigues Eduardo
Roiser Stefan
Rousseau David
Roy Gareth
Rybkine Grigori
Sailer Andre
Sakuma Tai
Santana Renato
Sartirana Andrea
Schellman Heidi
Schovancová Jaroslava
Schramm Steven
Schulz Markus
Sciabà Andrea
Seidel Sally
Sekmen Sezen
Serfon Cedric
Severini Horst
Sexton-Kennedy Elizabeth
Seymour Michael
Sgalaberna Davide
Shapoval Illya
Shiers Jamie
Shiu Jing-Ge
Short Hannah
Siroli Gian Piero
Skipsey Sam
Smith Tim
Snyder Scott
Sokoloff Michael D
Spentzouris Panagiotis
Stadie Hartmut
Stark Giordon
Stewart Gordon
Stewart Graeme
Sánchez Arturo
Sánchez-Hernández Alberto
Taffard Anyes
Tamponi Umberto
Templon Jeff
Tenaglia Giacomo
Tsulaia Vakhtang
Tunnell Christopher
Vaandering Eric
Valassi Andrea
Vallecorsa Sofia
Valsan Liviu
Van Gemmeren Peter
Vernet Renaud
Viren Brett
Vlimant Jean-Roch
Voss Christian
Votava Margaret
Vuosalo Carl
Vázquez Sierra Carlos
Wartel Romain
Watts Gordon T.
Wenaus Torre
Wenzel Sandro
Williams Mike
Winklmeier Frank
Wissing Christoph
Wuerthwein Frank
Wynne Benjamin
Xiaomei Zhang
Yang Wei
Yazgan Efe
Publication venue
Publication date: 18/12/2017
Field of study

Particle physics has an ambitious and broad experimental programme for the coming decades. This programme requires large investments in detector hardware, either to build new facilities and experiments, or to upgrade existing ones. Similarly, it requires commensurate investment in the R&D of software to acquire, manage, process, and analyse the sheer amounts of data to be recorded. In planning for the HL-LHC in particular, it is critical that all of the collaborating stakeholders agree on the software goals and priorities, and that the efforts complement each other. In this spirit, this white paper describes the R&D activities required to prepare for this software upgrade. Peer reviewed.

Universidade do Minho: RepositoriUM

Hal - Université Grenoble Alpes

HAL Clermont Université

Helsingin yliopiston digitaalinen arkisto

HAL-CEA

Hal-Diderot

arXiv.org e-Print Archive

HAL-IN2P3

DSpace@MIT

Archivio istituzionale della ricerca - Alma Mater Studiorum Università di Bologna

Archivio della ricerca- Università di Roma La Sapienza

CERN Document Server

HAL-Polytechnique

Explore Bristol Research

Supervised analysis of high dimensional multibloc data

Author: LORENZO Hadrien
Publication venue
Publication date: 30/04/2021
Field of study

The objective of statistical learning is to learn from observed data in order to predict the response for a new sample. In the context of vaccination, the number of features exceeds the number of individuals. This is a degenerate case of statistical analysis that requires specific tools, and regularization algorithms can handle it. Several types of regularization methods exist, and the choice depends on the structure of the data set and on the question being asked. In this work, the main objective was to use the information available from SVD decompositions of soft-thresholded empirical covariance matrix estimates. This solution is particularly efficient in terms of variable selection and computation time. Heterogeneous data sets (coming from different sources, also called multiblock data) are at the core of our methodology. Because some types of data are expensive to generate, they are often acquired only on a subsample of the population, which leads to block-wise missing data patterns. The second objective of our methodology is to handle those missing values using the response values. Because the response values are not available in the test data sets, we designed the methodology to handle missing values in both the training and the test data sets. Thanks to soft-thresholding, our methodology can regularize and select features. The estimator requires only two parameters to be set: the number of components and the maximum number of features to be selected. Both are tuned by cross-validation. According to simulations, the proposed method shows very good results compared with benchmark methods, especially in terms of prediction and computation time. The method has also been applied to several real data sets from vaccine, thromboembolic and food research.

Oskar Bordeaux

Supervised analysis of high dimensional multibloc data

Author: Lorenzo Hadrien
Publication venue
Publication date: 27/11/2019
Field of study

Theses.fr

Analyse supervisée multibloc en grande dimension

Author: Lorenzo Hadrien
Publication venue: HAL CCSD
Publication date: 27/11/2019
Field of study

Thèses en Ligne

INRIA a CCSD electronic archive server

HAL-Inserm

Hal-Diderot

Sélection de variables en régression SIR par seuillage doux ou seuillage dur de la matrice d’intérêt

Author: Lorenzo Hadrien
Saracco Jérôme
Publication venue: HAL CCSD
Publication date: 01/06/2022
Field of study

International audience

INRIA a CCSD electronic archive server

HoPSIR: Homogeneous Penalization of Sliced Inverse Regression

Author: GIRARD Stéphane
LORENZO Hadrien
Publication venue
Publication date: 03/07/2023
Field of study

In regression, purely parametric approaches require a model that is sometimes complex to set up. Conversely, non-parametric methods suffer when the dimension of the covariate increases, because the data points become isolated from each other. Semi-parametric approaches have been proposed to combine the benefits of both. The SIR (Sliced Inverse Regression) method is one of them, its parametric part providing a dimension reduction. In high dimension, however, SIR is no longer applicable because it requires inverting the empirical covariance matrix. Different approaches have been proposed to overcome this technical limitation, but none of them derives its solution from a statistical model, which is precisely what this work proposes. Through a particular class of functions, the homogeneous functions of positive degree, we introduce a family of prior distributions that makes it possible to build a penalized version of SIR by maximizing the posterior distribution. This approach behaves very well in simulations compared with current approaches.
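For context, the classical (unpenalized) SIR estimator that this abstract starts from can be written as a generalized eigenproblem. This is the standard textbook formulation, not the penalized HoPSIR estimator; the symbols below (slices h = 1, ..., H of the response, slice counts n_h, slice means of x) are notation chosen here for illustration.

```latex
\[
\hat\Gamma \, b = \lambda \, \hat\Sigma \, b,
\qquad
\hat\Sigma = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar x)(x_i - \bar x)^\top,
\qquad
\hat\Gamma = \sum_{h=1}^{H} \frac{n_h}{n} \, (\bar x_h - \bar x)(\bar x_h - \bar x)^\top .
\]
```

The estimated direction is the eigenvector of \hat\Sigma^{-1} \hat\Gamma associated with the largest eigenvalue. When the dimension p exceeds the sample size n, the rank of \hat\Sigma is at most n - 1, so the matrix is singular and the inverse does not exist. This is the limitation that penalized versions of SIR are designed to work around.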

Oskar Bordeaux

Détection d’individus atypiques en régression SIR (sliced inverse regression)

Author: Lorenzo Hadrien
Saracco Jérôme
Publication venue: HAL CCSD
Publication date: 01/06/2021
Field of study

International audience

INRIA a CCSD electronic archive server

Computational outlier detection methods in sliced inverse regression

Author: LORENZO Hadrien
SARACCO Jérôme
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 15/06/2021
Field of study

Sliced inverse regression (SIR) focuses on the relationship between a dependent variable y and a p-dimensional explanatory variable x in a semiparametric regression model in which the link relies on an index x'β and a link function f. SIR makes it possible to estimate the direction of β, which spans the effective dimension reduction (EDR) space. Based on the estimated index, the link function f can then be estimated nonparametrically using a kernel estimator. This two-step approach is sensitive to the presence of outliers in the data. The aim of this paper is to propose computational methods to detect outliers in this kind of single-index regression model. Three outlier detection methods are proposed and their numerical behavior is illustrated on a simulated sample. To discriminate outliers from "normal" observations, they use IB (in-bag) or OOB (out-of-bag) prediction errors from subsampling or resampling approaches. These methods, implemented in R, are compared with each other in a simulation study. An application to real data is also provided.
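The out-of-bag idea described above can be sketched as follows. This is a simplified illustration in Python, not one of the paper's three methods (which are implemented in R): it assumes the index has already been estimated, skips the SIR step entirely, uses a Gaussian Nadaraya-Watson smoother as the kernel estimator, and flags the observation with the largest mean OOB squared error. The function names, bandwidth, subsample fraction and toy data are all assumptions.

```python
import math
import random


def nadaraya_watson(t_train, y_train, t_new, bandwidth):
    """Gaussian-kernel estimate of the link function at t_new."""
    weights = [math.exp(-0.5 * ((t_new - t) / bandwidth) ** 2) for t in t_train]
    return sum(w * y for w, y in zip(weights, y_train)) / sum(weights)


def mean_oob_errors(index, y, n_rounds=200, frac=0.7, bandwidth=0.3, seed=0):
    """Average squared prediction error of each point over the rounds
    in which it was left out of the subsample (out-of-bag)."""
    rng = random.Random(seed)
    n = len(y)
    sums, counts = [0.0] * n, [0] * n
    size = int(frac * n)
    for _ in range(n_rounds):
        in_bag = set(rng.sample(range(n), size))
        t_in = [index[i] for i in in_bag]
        y_in = [y[i] for i in in_bag]
        for i in range(n):
            if i not in in_bag:
                pred = nadaraya_watson(t_in, y_in, index[i], bandwidth)
                sums[i] += (y[i] - pred) ** 2
                counts[i] += 1
    # A point that was never out-of-bag gets score 0 (no evidence against it).
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]


# Toy data (hypothetical): a smooth link on a 1-D index, with one planted outlier.
index = [-2 + 0.1 * i for i in range(41)]
y = [math.sin(t) for t in index]
y[20] += 3.0  # planted outlier at index value 0

scores = mean_oob_errors(index, y)
flagged = max(range(len(scores)), key=scores.__getitem__)
print(flagged)
```

Using out-of-bag rather than in-bag errors means an outlier is scored by models that never saw it, so it cannot pull the fitted curve toward itself and hide. A real procedure would also need a rule for deciding how large an error must be before an observation is declared an outlier, which this sketch leaves out.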

INRIA a CCSD electronic archive server

Oskar Bordeaux