753 research outputs found

    Partial recovery bounds for clustering with the relaxed K-means

    We investigate the clustering performance of the relaxed K-means in the settings of the sub-Gaussian Mixture Model (sGMM) and the Stochastic Block Model (SBM). After identifying the appropriate signal-to-noise ratio (SNR), we prove that the misclassification error decays exponentially fast with respect to this SNR. These partial recovery bounds for the relaxed K-means improve upon results currently known in the sGMM setting. In the SBM setting, applying the relaxed K-means SDP makes it possible to handle general connection probabilities, whereas other SDPs investigated in the literature are restricted to the assortative case (where within-group probabilities are larger than between-group probabilities). Again, this partial recovery bound complements the state-of-the-art results. Altogether, these results highlight the versatility of the relaxed K-means. Comment: 39 pages
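
    For orientation, one commonly cited semidefinite relaxation of K-means (usually attributed to Peng and Wei) can be written as below. This is a sketch under assumptions, not necessarily the exact program analysed in the paper: the symbols $X \in \mathbb{R}^{n \times p}$ (data matrix with one observation per row), $B$ (relaxed partnership matrix) and $K$ (number of clusters) are introduced here for illustration only.

```latex
\begin{aligned}
\widehat{B} \in \operatorname*{arg\,max}_{B \in \mathbb{R}^{n \times n}} \quad & \langle X X^{\top}, B \rangle \\
\text{subject to} \quad & B = B^{\top}, \quad B \succeq 0, \\
& B_{ij} \ge 0 \quad \text{for all } i, j, \\
& B \mathbf{1} = \mathbf{1}, \quad \operatorname{tr}(B) = K .
\end{aligned}
```

    Exact K-means corresponds to additionally requiring $B$ to be a normalized partition matrix, with $B_{ij} = 1/|G_k|$ when $i$ and $j$ both belong to cluster $G_k$ and $B_{ij} = 0$ otherwise; dropping that requirement is what makes the program convex.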

    Model Assisted Variable Clustering: Minimax-optimal Recovery and Algorithms

    Model-based clustering defines population-level clusters relative to a model that embeds notions of similarity. Algorithms tailored to such models yield estimated clusters with a clear statistical interpretation. We take this view here and introduce the class of G-block covariance models as a background model for variable clustering. In such models, two variables in a cluster are deemed similar if they have similar associations with all other variables. This can arise, for instance, when groups of variables are noise-corrupted versions of the same latent factor. We quantify the difficulty of clustering data generated from a G-block covariance model in terms of cluster proximity, measured with respect to two related, but different, cluster separation metrics. We derive minimax cluster separation thresholds, which are the metric values below which no algorithm can recover the model-defined clusters exactly, and show that they are different for the two metrics. We therefore develop two algorithms, COD and PECOK, tailored to G-block covariance models, and study their minimax-optimality with respect to each metric. Of independent interest is the fact that the analysis of the PECOK algorithm, which is based on a corrected convex relaxation of the popular K-means algorithm, provides the first statistical analysis of such algorithms for variable clustering. Additionally, we contrast our methods with another popular clustering method, spectral clustering, specialized to variable clustering, and show that ensuring exact cluster recovery via this method requires clusters to have a higher separation, relative to the minimax threshold. Extensive simulation studies, as well as our data analyses, confirm the applicability of our approach. Comment: Main text: 38 pages; supplementary information: 37 pages

    La carte du sang de l'immobilier chinois, un cas de cyber-activisme

    The aim of this paper is to explore new forms of social mobilization in urban areas that use Internet technologies. The development of online social networks offers new possibilities for expression and protest. Web 2.0 thus becomes a digital public space that complements the traditional physical public space, especially when the latter is tightly controlled. The Chinese “bloody map” of real estate is particularly representative of these changes. Published online in October 2010, this cooperative map draws on the knowledge of Internet users to list real estate developments that led to physical violence, ranging from the repression of protests to self-immolation. The verified version of the map shows 85 events and the open version 199 cases. Coverage of the map in Chinese and international media helped put the social issues linked to urban and real estate development in China on the international political agenda.

    Modéliser l'efficacité d'un réseau: Le cas de la poste aux chevaux dans la France pré-industrielle (1632-1833)

    This paper examines the relationship between the creation of a new transportation network and its structuring effects on space, here in Ancien Régime France. Using a modeling approach, we reconstruct the space of postal relations in pre-industrial France and its evolution over two centuries. Postal roads are digitized from historical lists of post houses at seven different dates. Using a 900-point grid, we calculate accessibility “at every place in France” with the Shimbel index, taking into account a secondary road network with slower travel speeds. We weight the accessibility values by geometrical accessibility and generalize the results by interpolation. This method highlights strong regional contrasts between areas that were particularly favored by the creation of postal roads and others that were less well served. The regional differences in accessibility and their evolution reveal deliberate spatial planning by the French State as early as the 18th century.
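
    As a rough illustration of the accessibility measure named in this abstract, the Shimbel index of a node is the sum of its shortest-path distances to every other node, so lower values mean better access. The sketch below is only an assumed, minimal version: the toy network, its travel times, and the function names `shortest_times` and `shimbel_index` are hypothetical and are not taken from the paper, which works on a 900-point grid with a secondary road network.

```python
import heapq


def shortest_times(graph, source):
    """Dijkstra's algorithm: least travel time from source to each reachable node."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry, a shorter route was already found
        for v, w in graph[u].items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


def shimbel_index(graph):
    """Sum of shortest travel times from each node to all nodes it can reach."""
    return {u: sum(shortest_times(graph, u).values()) for u in graph}


# Hypothetical toy network: edge weights are travel times in arbitrary units.
toy = {
    "A": {"B": 1.0, "C": 4.0},
    "B": {"A": 1.0, "C": 1.0, "D": 5.0},
    "C": {"A": 4.0, "B": 1.0, "D": 1.0},
    "D": {"B": 5.0, "C": 1.0},
}

index = shimbel_index(toy)
for node, value in sorted(index.items()):
    print(node, value)
```

    In this toy network the central nodes B and C obtain the lowest index. Nodes that cannot be reached are simply left out of the sum here, which is a simplification; a real analysis would need to decide how to treat disconnected places.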

    High-dimensional regression with unknown variance

    38 pages. We review recent results for high-dimensional sparse linear regression in the practical case of unknown variance. Different sparsity settings are covered, including coordinate-sparsity, group-sparsity and variation-sparsity. The emphasis is on non-asymptotic analyses and feasible procedures. In addition, a small numerical study compares the practical performance of three schemes for tuning the Lasso estimator, and references are collected for some more general models, including multivariate regression and nonparametric regression.