Partial recovery bounds for clustering with the relaxed means
We investigate the clustering performance of the relaxed means in the
setting of sub-Gaussian Mixture Model (sGMM) and Stochastic Block Model (SBM).
After identifying the appropriate signal-to-noise ratio (SNR), we prove that
the misclassification error decays exponentially fast with respect to this SNR.
These partial recovery bounds for the relaxed means improve upon results
currently known in the sGMM setting. In the SBM setting, applying the relaxed
means SDP makes it possible to handle general connection probabilities, whereas other
SDPs investigated in the literature are restricted to the assortative case
(where within group probabilities are larger than between group probabilities).
Again, this partial recovery bound complements the state-of-the-art results.
Altogether, these results put forward the versatility of the relaxed
means. Comment: 39 pages
Model Assisted Variable Clustering: Minimax-optimal Recovery and Algorithms
Model-based clustering defines population level clusters relative to a model
that embeds notions of similarity. Algorithms tailored to such models yield
estimated clusters with a clear statistical interpretation. We take this view
here and introduce the class of G-block covariance models as a background model
for variable clustering. In such models, two variables in a cluster are deemed
similar if they have similar associations with all other variables. This can
arise, for instance, when groups of variables are noise corrupted versions of
the same latent factor. We quantify the difficulty of clustering data generated
from a G-block covariance model in terms of cluster proximity, measured with
respect to two related, but different, cluster separation metrics. We derive
minimax cluster separation thresholds, which are the metric values below which
no algorithm can recover the model-defined clusters exactly, and show that they
are different for the two metrics. We therefore develop two algorithms, COD and
PECOK, tailored to G-block covariance models, and study their
minimax-optimality with respect to each metric. Of independent interest is the
fact that the analysis of the PECOK algorithm, which is based on a corrected
convex relaxation of the popular K-means algorithm, provides the first
statistical analysis of such algorithms for variable clustering. Additionally,
we contrast our methods with another popular clustering method, spectral
clustering, specialized to variable clustering, and show that ensuring exact
cluster recovery via this method requires clusters to have a higher separation,
relative to the minimax threshold. Extensive simulation studies, as well as our
data analyses, confirm the applicability of our approach. Comment: Main text: 38 pages; supplementary information: 37 pages
La carte du sang de l'immobilier chinois, un cas de cyber-activisme
This paper explores new forms of social mobilization in urban areas that rely on Internet technologies. The development of online social networks offers new possibilities for expression and protest. Web 2.0 thus becomes a digital public space that complements the traditional physical public space, especially when the latter is tightly controlled. The "blood map of Chinese real estate" is particularly representative of these changes. Published online in October 2010, this cooperative map draws on the knowledge of Internet users to list real estate developments that led to physical violence, ranging from the repression of protests to self-immolation. The verified version of the map shows 85 events and the open version 199 cases. Coverage of the map in Chinese and international media helped put the social issues tied to urban and real estate development in China on the international political agenda.
Modéliser l'efficacité d'un réseau: Le cas de la poste aux chevaux dans la France pré-industrielle (1632-1833)
We address the structuring effects of a new transportation network on a territory, here Ancien Régime France. Using a modeling approach, we reconstruct the space of postal relations in pre-industrial France and its evolution over two centuries. Postal roads are digitized from historical lists of post houses at seven different dates. Using a 900-point grid, we compute accessibility "at every place in France" with the Shimbel index, taking into account a secondary road network characterized by a slower travel speed. We weight the accessibility values by geometrical accessibility and generalize the results by interpolation. This method highlights strong regional contrasts between areas that were particularly favored by the creation of postal roads and others that were less well served. The regional differences in accessibility and their evolution reveal deliberate spatial planning by the French State through the postal roads as early as the 18th century.
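As an illustration of the accessibility measure named in this abstract, the sketch below computes a Shimbel index on a toy weighted network, taking the index of a node to be the sum of its shortest-path travel times to all other nodes. The network, the node names, and the travel times are hypothetical and are not taken from the paper; the slow link to "D" only mimics the idea of a slower secondary road. The weighting by geometrical accessibility and the interpolation step described above are not reproduced here.

```python
import heapq


def shortest_path_lengths(graph, source):
    """Dijkstra's algorithm: shortest travel time from source to every reachable node."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry, a shorter path was already found
        for v, w in graph[u].items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


def shimbel_index(graph):
    """Sum of shortest-path lengths from each node to all others (lower = more accessible)."""
    return {u: sum(shortest_path_lengths(graph, u).values()) for u in graph}


# Hypothetical toy network: edge weights are travel times in hours.
# The C-D link stands in for a slower secondary road.
toy = {
    "A": {"B": 1.0, "C": 4.0},
    "B": {"A": 1.0, "C": 2.0},
    "C": {"A": 4.0, "B": 2.0, "D": 6.0},
    "D": {"C": 6.0},
}

index = shimbel_index(toy)
print(index)
```

In this toy network the node reached only through the slow link ends up with the largest index, that is, the poorest accessibility.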
High-dimensional regression with unknown variance
38 pages. We review recent results for high-dimensional sparse linear regression in the practical case of unknown variance. Different sparsity settings are covered, including coordinate-sparsity, group-sparsity and variation-sparsity. The emphasis is put on non-asymptotic analyses and feasible procedures. In addition, a small numerical study compares the practical performance of three schemes for tuning the Lasso estimator, and references are collected for some more general models, including multivariate regression and nonparametric regression.