73 research outputs found

Prix Nobel d'Economie et mathématiques [The Nobel Prize in Economics and mathematics]

Authors: Bair Jacques, Haesbroeck Gentiane
Publication venue: Société Belge des Professeurs de Mathématiques d'Expression Française (SBPMef)
Publication date: 01/03/2013

The 2012 Nobel Prize in Economics was awarded to the two American mathematicians Lloyd Shapley and Alvin Roth. This event gave us the opportunity to take a brief look at international prizes that can be awarded to mathematicians.

Open Repository and Bibliography - Liège

Detection of influential observations on the error rate based on the generalized k-means clustering procedure

Authors: Haesbroeck Gentiane, Ruwet Christel
Publication date: 14/10/2009

Cluster analysis may be performed when one wishes to group similar objects into a given number of clusters. Several algorithms are available to construct these clusters. In this talk, the focus will be on the generalized k-means algorithm, while the data of interest are assumed to come from an underlying population consisting of a mixture of two groups. Among the outputs of this clustering technique, a classification rule is provided in order to classify the objects into one of the clusters. When classification is the main objective of the statistical analysis, performance is often measured by means of an error rate ER(F; Fm), where F is the distribution of the training sample used to set up the classification rule and Fm (model distribution) is the distribution under which the quality of the rule is assessed (via a test sample). Under contamination, one has to replace the distribution F of the training sample by a contaminated one, F(eps) say (where eps corresponds to the fraction of contamination). In that case, the error rate will be corrupted since it relies on a contaminated rule, while the test sample may still be considered as being distributed according to the model distribution. To measure the robustness of classification based on this clustering procedure, influence functions of the error rate may be computed. The idea has already been exploited by Croux et al. (2008) and Croux et al. (2008) in the context of linear and logistic discrimination. In this setup, the contaminated distribution takes the form F(eps) = (1 - eps)*Fm + eps*Dx, where Dx is the Dirac distribution putting all its mass at x. After studying the influence function of the error rate of the generalized k-means procedure, which depends on the influence functions of the generalized k-means centers derived by Garcia-Escudero and Gordaliza (1999), a diagnostic tool based on its value will be presented. The aim is to detect observations in the training sample which can be influential for the error rate.
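The effect of the contaminated distribution F(eps) = (1 - eps)*Fm + eps*Dx on the error rate can be illustrated numerically. The sketch below is only an illustration under stated assumptions, not the authors' procedure: it uses the classical 2-means algorithm (Lloyd iterations) rather than the generalized k-means, works in one dimension, and assumes the model Fm = 0.5 N(-mu, 1) + 0.5 N(mu, 1). All function names are hypothetical.

```python
import math
import random


def normal_cdf(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def draw_sample(n, eps, x, mu, rng):
    """Training sample from F(eps): round(eps * n) copies of the point x,
    the rest drawn from the assumed model Fm = 0.5 N(-mu, 1) + 0.5 N(mu, 1)."""
    n_bad = round(eps * n)
    good = [rng.gauss(rng.choice((-mu, mu)), 1.0) for _ in range(n - n_bad)]
    return good + [x] * n_bad


def two_means_cutoff(sample, n_iter=100):
    """Classical 2-means in one dimension (Lloyd iterations started at the
    sample minimum and maximum). Returns the midpoint between the two
    centers, which is the cutoff of the resulting classification rule."""
    c1, c2 = min(sample), max(sample)
    for _ in range(n_iter):
        cut = 0.5 * (c1 + c2)
        left = [v for v in sample if v <= cut]
        right = [v for v in sample if v > cut]
        if not left or not right:
            break
        new_c1, new_c2 = sum(left) / len(left), sum(right) / len(right)
        if (new_c1, new_c2) == (c1, c2):
            break
        c1, c2 = new_c1, new_c2
    return 0.5 * (c1 + c2)


def error_rate_under_model(cut, mu):
    """ER(F, Fm): the rule 'group 1 if v <= cut' evaluated under Fm."""
    miss_group1 = 1.0 - normal_cdf(cut + mu)  # group 1 has mean -mu
    miss_group2 = normal_cdf(cut - mu)        # group 2 has mean +mu
    return 0.5 * miss_group1 + 0.5 * miss_group2


if __name__ == "__main__":
    mu = 2.0
    clean = two_means_cutoff(draw_sample(2000, 0.0, 0.0, mu, random.Random(0)))
    bad = two_means_cutoff(draw_sample(2000, 0.1, 50.0, mu, random.Random(0)))
    print("error rate, clean training sample:       ", error_rate_under_model(clean, mu))
    print("error rate, contaminated training sample:", error_rate_under_model(bad, mu))
```

In this toy setting a 10% point mass far from the data captures one of the two centers, so the rule built on the contaminated sample classifies almost everything into a single group even though it is evaluated under the uncontaminated model.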

Open Repository and Bibliography - Liège

Robust detection techniques for multivariate spatial data

Authors: Ernst Marie, Haesbroeck Gentiane
Publication date: 26/11/2013

Spatial data are characterized by statistical units, with known geographical positions, on which non-spatial attributes are measured. Two types of atypical observations can be defined: global and/or local outliers. The attribute values of a global outlier are outlying with respect to the values taken by the majority of the data points, while the attribute values of a local outlier are extreme when compared to those of its neighbors. Classical outlier detection techniques may be used to find global outliers, as the geographical positions of the data are not taken into account in this search. The detection of local outliers is more complex, especially when there is more than one non-spatial attribute. In this poster, two new procedures for local outlier detection are defined. The first approach is to adapt an existing technique using in particular a regularized estimator of the covariance matrix. The second technique measures outlyingness using a depth function.
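The distinction between global and local outliers can be made concrete with a small sketch. This is a simplified univariate illustration only, not either of the two procedures announced in the abstract (which rely on a regularized covariance estimator and on depth functions). The function names, the choice of k nearest neighbors, and both cutoffs are assumptions made for the example.

```python
import math
from statistics import median


def global_outliers(values, cutoff=3.0):
    """Flag values far from the overall median, measured in robust
    z-score units (median and MAD). Positions are ignored."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    scale = 1.4826 * mad  # makes the MAD consistent for normal data
    return {i for i, v in enumerate(values) if abs(v - med) > cutoff * scale}


def local_outliers(coords, values, k=4, threshold=5.0):
    """Flag units whose value differs from the median of their k nearest
    spatial neighbors by more than a fixed (assumed) threshold."""
    flagged = set()
    for i, p in enumerate(coords):
        others = [j for j in range(len(coords)) if j != i]
        others.sort(key=lambda j: math.dist(p, coords[j]))
        neighbour_values = [values[j] for j in others[:k]]
        if abs(values[i] - median(neighbour_values)) > threshold:
            flagged.add(i)
    return flagged


if __name__ == "__main__":
    # Twenty units on a line with a smooth spatial trend in the attribute.
    coords = [(float(i), 0.0) for i in range(20)]
    values = [float(i) for i in range(20)]
    values[5] = 15.0    # ordinary overall, but extreme among its neighbors
    values[10] = 100.0  # extreme with respect to the whole data set
    print("global:", sorted(global_outliers(values)))
    print("local: ", sorted(local_outliers(coords, values)))
```

Unit 5 is flagged only by the neighborhood comparison, which is the situation the abstract describes as a local outlier; unit 10 is flagged by both checks.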

Open Repository and Bibliography - Liège

Règles de l'art de l'analyse statistique de données [Best practices in statistical data analysis]

Author: Haesbroeck Gentiane
Publication date: 25/11/2022

Statistics as used in research are not always applied according to best practice. A few examples illustrate the need to develop training and a sound understanding of the concepts in order to improve the reproducibility of results.

Open Repository and Bibliography - Liège

Les probabilités : comment allier intuition et raisonnement ? Que peut apporter l'outil informatique ? [Probability: how can intuition and reasoning be combined? What can computing tools contribute?]

Authors: Haesbroeck Gentiane, Henry Valérie
Publication date: 07/02/2018

Open Repository and Bibliography - Liège

Influence function of the error rate of classification based on clustering

Authors: Haesbroeck Gentiane, Ruwet Christel
Publication date: 19/05/2009

Cluster analysis may be performed when one wishes to group similar objects into a given number of clusters. Several algorithms are available to construct these clusters. In this talk, the focus will be on two particular cases of the generalized k-means algorithm: the classical k-means procedure and the k-medoids algorithm, while the data of interest are assumed to come from an underlying population consisting of a mixture of two groups. Among the outputs of these clustering techniques, a classification rule is provided in order to classify the objects into one of the clusters. When classification is the main objective of the statistical analysis, performance is often measured by means of an error rate. Two types of error rates can be computed: a theoretical one and a more empirical one. The first one can be written as ER(F, Fm), where F is the distribution of the training sample used to set up the classification rule and Fm (model distribution) is the distribution under which the quality of the rule is assessed (via a test sample). The empirical error rate corresponds to ER(F, F), meaning that the classification rule is tested on the same sample as the one used to set up the rule. This talk will present the results concerning the theoretical error rate. If there are outliers in the data, the classification rule may be corrupted. Even if it is evaluated at the model distribution, the theoretical error rate may then be contaminated. To measure the robustness of classification based on clustering, influence functions have been computed. Results similar to those derived by Croux et al (2008) and Croux et al (2008) in discriminant analysis were observed. More specifically, under optimality (which happens when the model distribution is FN = 0.5 N(μ1, σ) + 0.5 N(μ2, σ), Qiu and Tamhane 2007), the contaminated error rate can never be smaller than the optimal value, resulting in a first order influence function identically equal to 0. Second order influence functions then need to be computed. When optimality does not hold, the first order influence function of the theoretical error rate no longer vanishes and shows that contamination may improve the error rate achieved under the non-optimal model. The first and, when required, second order influence functions of the theoretical error rate are useful in their own right to compare the robustness of the 2-means and 2-medoids classification procedures. They also have other applications. For example, they may be used to derive diagnostic tools in order to detect observations having an unduly large influence on the error rate. Also, under optimality, the second order influence function of the theoretical error rate can yield asymptotic relative classification efficiencies.
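Why the first order influence function vanishes under optimality can be checked numerically in a simple case. The sketch below assumes a univariate model FN = 0.5 N(mu1, sigma) + 0.5 N(mu2, sigma) with mu1 < mu2 and a rule of the form "assign to group 1 if x <= cut"; it is an illustration of the argument, not the authors' derivation, and the function names are hypothetical. The error rate, seen as a function of the cutoff, is minimal at the midpoint of the two means, so its first derivative there is zero and only second order terms remain.

```python
import math


def normal_cdf(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def theoretical_error_rate(cut, mu1, mu2, sigma):
    """Error rate under FN of the rule 'group 1 if x <= cut' (mu1 < mu2)."""
    miss_group1 = 1.0 - normal_cdf((cut - mu1) / sigma)
    miss_group2 = normal_cdf((cut - mu2) / sigma)
    return 0.5 * miss_group1 + 0.5 * miss_group2


def central_differences(f, x, h=1e-3):
    """Finite-difference approximations of f'(x) and f''(x)."""
    first = (f(x + h) - f(x - h)) / (2.0 * h)
    second = (f(x + h) - 2.0 * f(x) + f(x - h)) / h ** 2
    return first, second


if __name__ == "__main__":
    mu1, mu2, sigma = -1.0, 1.0, 1.0
    optimal_cut = 0.5 * (mu1 + mu2)

    def er(cut):
        return theoretical_error_rate(cut, mu1, mu2, sigma)

    first, second = central_differences(er, optimal_cut)
    print("first derivative at the optimal cutoff: ", first)
    print("second derivative at the optimal cutoff:", second)
```

Any perturbation of the cutoff caused by contamination can therefore only increase the error rate in this setting, and the size of the increase is governed by the positive second derivative.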

Open Repository and Bibliography - Liège

Impact of contamination on empirical and theoretical error

Authors: Haesbroeck Gentiane, Ruwet Christel
Publication date: 18/06/2009

Classification analysis makes it possible to group similar objects into a given number of groups by means of a classification rule. Many classification procedures are available: linear discrimination, logistic discrimination, etc. The focus in this poster will be on classification resulting from a clustering analysis. Indeed, among the outputs of classical clustering techniques, a classification rule is provided in order to classify the objects into one of the clusters. More precisely, let F denote the underlying distribution and assume that the generalized k-means algorithm with penalty function is used to construct the k clusters C1(F), ..., Ck(F) with centers T1(F), ..., Tk(F). When one believes that k true groups exist among the data, classification might be the main objective of the statistical analysis. Performance of a particular classification technique can be measured by means of an error rate. Depending on the availability of data, two types of error rates may be computed: a theoretical one and a more empirical one. In the first case, the rule is estimated on a training sample with distribution F, while the classification performance is evaluated on a test sample distributed according to a model distribution of interest, Fm say. In the second case, the same data are used to set up the rule and to evaluate the performance. Under contamination, one has to replace the distribution F of the training sample by a contaminated one, F(eps) say (where eps corresponds to the fraction of contamination). In that case, the theoretical error rate will be corrupted since it relies on a contaminated rule, but it may still consider a test sample distributed according to the model distribution. The empirical error rate will be affected twice: via the rule and also via the sample used for the evaluation of the classification performance. To measure the robustness of classification based on clustering, influence functions of the error rate may be computed. The idea has already been exploited by Croux et al (2008) and Croux et al (2008) in the context of linear and logistic discrimination. In the computation of influence functions, the contaminated distribution takes the form F(eps) = (1 - eps)*Fm + eps*Dx, where Dx is the Dirac distribution putting all its mass at x. It is interesting to note that the impact of the point mass x may be positive, i.e. may decrease the error rate, when the data at hand are used to evaluate the error.
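For reference, the influence functions mentioned in the abstract can be written out as follows. This is the standard textbook definition applied to the theoretical error rate, given as a sketch; the labels IF and IF2 are assumed here and do not come from the abstract.

```latex
% Contaminated training distribution, as in the abstract
F_\varepsilon = (1-\varepsilon)\,F_m + \varepsilon\,D_x

% First order influence function of the theoretical error rate
\mathrm{IF}(x;\mathrm{ER},F_m)
  = \lim_{\varepsilon \downarrow 0}
    \frac{\mathrm{ER}(F_\varepsilon,F_m)-\mathrm{ER}(F_m,F_m)}{\varepsilon}
  = \left.\frac{\partial}{\partial\varepsilon}
    \mathrm{ER}(F_\varepsilon,F_m)\right|_{\varepsilon=0}

% Second order influence function
\mathrm{IF2}(x;\mathrm{ER},F_m)
  = \left.\frac{\partial^{2}}{\partial\varepsilon^{2}}
    \mathrm{ER}(F_\varepsilon,F_m)\right|_{\varepsilon=0}

% Resulting expansion for small contamination fractions
\mathrm{ER}(F_\varepsilon,F_m) \approx \mathrm{ER}(F_m,F_m)
  + \varepsilon\,\mathrm{IF}(x;\mathrm{ER},F_m)
  + \tfrac{\varepsilon^{2}}{2}\,\mathrm{IF2}(x;\mathrm{ER},F_m)
```

A negative value of the first order term at some x corresponds to the situation noted in the abstract, where a point mass at x decreases the error rate.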

Open Repository and Bibliography - Liège

Detection of Local and Global Outliers in Spatial Data

Authors: Ernst Marie, Haesbroeck Gentiane
Publication date: 11/07/2013

Spatial data are characterized by statistical units, with known geographical positions, on which non-spatial attributes are measured. Two types of atypical observations can be defined: global and/or local outliers. The attribute values of a global outlier are outlying with respect to the values taken by the majority of the data points, while the attribute values of a local outlier are extreme when compared to those of its neighbors. Classical outlier detection techniques may be used to find global outliers, as the geographical positions of the data are not taken into account in this search. The detection of local outliers is more complex, especially when there is more than one non-spatial attribute. In this talk, existing techniques were outlined and two new procedures were defined. The first approach is to adapt an existing technique using in particular a regularized estimator of the covariance matrix. The second technique measures outlyingness using a depth function.

Open Repository and Bibliography - Liège

Classification performance resulting from a 2-means

Authors: Haesbroeck Gentiane, Ruwet Christel
Publication venue: Elsevier BV
Publication date: 01/02/2013

The k-means procedure is probably one of the most common nonhierarchical clustering techniques. From a theoretical point of view, it is related to the search for the k principal points of the underlying distribution. In this paper, the classification resulting from that procedure for k=2 is shown to be optimal under a balanced mixture of two spherically symmetric and homoscedastic distributions. Then, the classification efficiency of the 2-means rule is assessed using the second order influence function and compared to the classification efficiencies of the Fisher and logistic discriminations. Influence functions are also considered here to compare the robustness to infinitesimal contamination of the 2-means method with that of the generalized 2-means technique.

Crossref

Open Repository and Bibliography - Liège

Chiffres et interprétation: focus sur les statistiques dans la presse [Numbers and interpretation: a focus on statistics in the press]

Author: Haesbroeck Gentiane
Publication date: 28/04/2022

Citizens are increasingly bombarded with "figures" of all kinds, which are widely disseminated, interpreted, compared and used by various actors (journalists, policy makers, scientific experts...), ultimately yielding varied, sometimes contradictory or erroneous results and reinforcing the idea that "There are lies, damned lies and statistics". The secondary-school mathematics curriculum provides for training pupils in "statistical literacy". In the era of "big data" and "open access" to large volumes of data, it is important to ensure that every pupil learns to interpret data or statistical results correctly and critically. Several concrete cases of misleading or erroneous interpretation will be presented, drawn from articles or reports published in the press. The aim is to highlight certain good practices and the possible errors to avoid when analysing data.

Open Repository and Bibliography - Liège