Interpolating Thin-Shell and Sharp Large-Deviation Estimates For Isotropic Log-Concave Measures
Given an isotropic random vector X with log-concave density in Euclidean
space \Real^n, we study the concentration properties of |X| on all scales,
both above and below its expectation. We show in particular that:
\P(\abs{|X| - \sqrt{n}} \geq t \sqrt{n}) \leq C \exp(-c n^{1/2} \min(t^3, t)) \quad \forall t \geq 0,
for some universal constants c, C > 0. This
improves the best known deviation results on the thin-shell and mesoscopic
scales due to Fleury and Klartag, respectively, and recovers the sharp
large-deviation estimate of Paouris. Another new feature of our estimate is
that it improves when X is \psi_\alpha (\alpha \in (1,2]), in precise
agreement with Paouris' estimates. The upper bound on the thin-shell width
\sqrt{\Var(|X|)} we obtain is of the order of n^{1/3}, and improves down to
n^{1/4} when X is \psi_2. Our estimates thus continuously interpolate
between a new best known thin-shell estimate and the sharp large-deviation
estimate of Paouris. As a consequence, a new best known bound on the Cheeger
isoperimetric constant appearing in a conjecture of
Kannan-Lovász-Simonovits is deduced. Comment: 29 pages - formulation is now general, estimating deviation of a
linear image of X, and dependence on the \psi_\alpha constant is explicit.
Corrected typos and refined explanations. To appear in GAF
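As a rough illustration of how a thin-shell width of order n^{1/3} follows from the deviation bound stated above, the tail estimate can be integrated directly. This is only a sketch: C' and C'' are generic constants introduced here, not notation from the paper.

```latex
% Sketch: integrating the tail bound to control Var(|X|).
% C', C'' are generic constants assumed for illustration.
\begin{align*}
\Var(|X|) &\le \mathbb{E}\bigl(|X| - \sqrt{n}\bigr)^2
  = 2n \int_0^\infty t \, \P\bigl(\abs{|X| - \sqrt{n}} \ge t\sqrt{n}\bigr)\, dt \\
 &\le 2Cn \int_0^\infty t \exp\bigl(-c\, n^{1/2} \min(t^3, t)\bigr)\, dt \\
 &\le 2Cn \Bigl( \int_0^\infty t\, e^{-c n^{1/2} t^3}\, dt
      + \int_1^\infty t\, e^{-c n^{1/2} t}\, dt \Bigr) \\
 &= 2Cn \Bigl( \tfrac{1}{3}\Gamma\bigl(\tfrac{2}{3}\bigr) (c\, n^{1/2})^{-2/3}
      + e^{-c n^{1/2}} \bigl( (c\, n^{1/2})^{-1} + (c\, n^{1/2})^{-2} \bigr) \Bigr) \\
 &\le C'' n^{2/3},
\end{align*}
% since the first term is C' n^{-1/3} and the second is exponentially small.
% Taking square roots gives \sqrt{\Var(|X|)} \le \sqrt{C''}\, n^{1/3}.
```

The cubic regime t \le 1 dominates the integral, which is why the exponent 1/3 appears.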
A new specification of generalized linear models for categorical data
Regression models for categorical data are specified in heterogeneous ways.
We propose to unify the specification of such models. This allows us to define
the family of reference models for nominal data. We introduce the notion of
reversible models for ordinal data that distinguishes adjacent and cumulative
models from sequential ones. The combination of the proposed specification with
the definition of reference and reversible models and various invariance
properties leads to a new view of regression models for categorical data. Comment: 31 pages, 13 figures
Partitioned conditional generalized linear models for categorical data
In categorical data analysis, several regression models have been proposed
for hierarchically-structured response variables, e.g. the nested logit model.
But they have been formally defined for only two or three levels in the
hierarchy. Here, we introduce the class of partitioned conditional generalized
linear models (PCGLMs) defined for any number of levels. The hierarchical
structure of these models is fully specified by a partition tree of categories.
Using the genericity of the (r,F,Z) specification, the PCGLM can handle
nominal and ordinal, but also partially ordered, response variables. Comment: 25 pages, 13 figures
Thin-shell concentration for convex measures
We prove that for s < 0, s-concave measures on \Real^n satisfy a
thin-shell concentration similar to the log-concave one. It leads to a
Berry-Esseen type estimate for their one-dimensional marginal distributions. We
also establish sharp reverse Hölder inequalities for s-concave measures.
High-Resolution Road Vehicle Collision Prediction for the City of Montreal
Road accidents are a major issue in modern societies, responsible
for millions of deaths and injuries worldwide every year. In Quebec alone,
road accidents in 2018 were responsible for 359 deaths and 33 thousand
injuries. In this paper, we show how one can leverage open datasets of a city
like Montreal, Canada, to create high-resolution accident prediction models,
using big data analytics. Compared to other studies in road accident
prediction, we have a much higher prediction resolution, i.e., our models
predict the occurrence of an accident within an hour, on road segments defined
by intersections. Such models could be used in the context of road accident
prevention, but also to identify key factors that can lead to a road accident,
and consequently, help elaborate new policies.
We tested various machine learning methods to deal with the severe class
imbalance inherent to accident prediction problems. In particular, we
implemented the Balanced Random Forest algorithm, a variant of the Random
Forest machine learning algorithm, in Apache Spark. Interestingly, we found that
in our case, Balanced Random Forest does not perform significantly better than
Random Forest.
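The core idea usually associated with Balanced Random Forest is that each tree is grown on a bootstrap sample containing equally many examples of every class. The following is a minimal sketch of that sampling step in plain Python, not the authors' Apache Spark implementation; the function name `balanced_bootstrap` and the toy labels are assumptions made for illustration.

```python
import random
from collections import defaultdict


def balanced_bootstrap(labels, rng):
    """Return row indices for one tree: equal-size draws, with replacement, from each class."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    # The rarest class sets the number of rows drawn from every class.
    n_minority = min(len(rows) for rows in by_class.values())
    sample = []
    for rows in by_class.values():
        sample.extend(rng.choices(rows, k=n_minority))
    return sample


# Hypothetical toy labels: 1 = collision (rare), 0 = no collision.
labels = [1] * 5 + [0] * 95
sample = balanced_bootstrap(labels, random.Random(0))
counts = {c: sum(labels[i] == c for i in sample) for c in (0, 1)}
print(counts)  # {0: 5, 1: 5}
```

A standard Random Forest would instead draw the bootstrap sample from all rows at once, so most trees would see very few positive examples.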
Experimental results show that 85% of road vehicle collisions are detected by
our model with a false positive rate of 13%. The examples identified as
positive are likely to correspond to high-risk situations. In addition, we
identify the most important predictors of vehicle collisions for the area of
Montreal: the count of accidents on the same road segment during previous
years, the temperature, the day of the year, the hour and the visibility.
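For readers less familiar with the two rates quoted above, they can be computed from a confusion matrix as sketched below. The counts are hypothetical, chosen only so that the rates come out at 85% and 13%; they are not the paper's data.

```python
def detection_rates(tp, fn, fp, tn):
    """Return (detection rate, false positive rate) from confusion-matrix counts."""
    detection_rate = tp / (tp + fn)        # share of actual collisions that are flagged
    false_positive_rate = fp / (fp + tn)   # share of non-collisions that are flagged
    return detection_rate, false_positive_rate


# Hypothetical counts, not taken from the paper.
detection_rate, false_positive_rate = detection_rates(tp=85, fn=15, fp=1300, tn=8700)
print(f"{detection_rate:.2f} {false_positive_rate:.2f}")  # 0.85 0.13
```

With strongly imbalanced classes, even a moderate false positive rate can correspond to many more false alarms than true detections in absolute terms, as the hypothetical counts show.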
Parametric Modelling of Multivariate Count Data Using Probabilistic Graphical Models
Multivariate count data are defined as the numbers of items of different
categories obtained by sampling within a population whose individuals are
grouped into categories. The analysis of multivariate count data is a recurrent
and crucial issue in numerous modelling problems, particularly in the fields of
biology and ecology (where the data can represent, for example, children counts
associated with multitype branching processes), sociology and econometrics. We
focus on I) Identifying categories that appear simultaneously or, on the
contrary, are mutually exclusive. This is achieved by identifying
conditional independence relationships between the variables; II) Building
parsimonious parametric models consistent with these relationships; III)
Characterising and testing the effects of covariates on the joint distribution
of the counts. To achieve these goals, we propose an approach based on
graphical probabilistic models, and more specifically partially directed
acyclic graphs