Four lectures on probabilistic methods for data science
Methods of high-dimensional probability play a central role in applications
in statistics, signal processing, theoretical computer science, and related
fields. These lectures present a sample of particularly useful tools of
high-dimensional probability, focusing on the classical and matrix Bernstein
inequalities and the uniform matrix deviation inequality. We illustrate these
tools with applications for dimension reduction, network analysis, covariance
estimation, matrix completion and sparse signal recovery. The lectures are
geared towards beginning graduate students who have taken a rigorous course in
probability but may not have any experience in data science applications.

Comment: Lectures given at the 2016 PCMI Graduate Summer School in Mathematics of
Data. Some typos and inaccuracies fixed.
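One of the dimension-reduction applications these tools support is random projection in the Johnson-Lindenstrauss style. The following is a minimal illustrative sketch (not taken from the lectures; all names and parameter values are assumptions) of projecting high-dimensional points to a lower dimension while approximately preserving pairwise distances:

```python
import numpy as np

rng = np.random.default_rng(0)

n, d, k = 50, 1000, 200          # points, ambient dimension, target dimension
X = rng.normal(size=(n, d))      # data matrix, one point per row

# Gaussian random projection: entries i.i.d. N(0, 1/k), so squared
# norms are preserved in expectation.
G = rng.normal(scale=1.0 / np.sqrt(k), size=(d, k))
Y = X @ G

# Pairwise distances are approximately preserved with high probability.
orig = np.linalg.norm(X[0] - X[1])
proj = np.linalg.norm(Y[0] - Y[1])
ratio = proj / orig              # close to 1 for k large enough
```

The concentration of `ratio` around 1 is exactly the kind of statement the Bernstein-type inequalities discussed in the lectures make quantitative.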
To steal or not to steal: Firm attributes, legal environment, and valuation
Newly released data on corporate governance and disclosure practices reveal wide within-country variation, with the variation increasing as the legal environment becomes less investor-friendly. This paper examines why firms practice high-quality governance when the law does not require it; which firm attributes are related to the quality of governance; how those attributes interact with the legal environment; and the relation between firm valuation and corporate governance. A simple model, in which a controlling shareholder trades off the private benefits of diversion against costs that vary across countries and over time, identifies three relevant firm attributes: investment opportunities, external financing, and ownership structure. Using firm-level governance and transparency data on 859 firms in 27 countries, we find that firms with greater growth opportunities, greater need for external financing, and more concentrated cash-flow rights practice higher-quality governance and disclose more. Moreover, firms that score higher in governance and transparency rankings are valued more highly in the stock market. Equally important, all these relations are stronger in countries that are less investor-friendly, demonstrating that firms adapt to poor legal environments to establish efficient governance practices.
http://deepblue.lib.umich.edu/bitstream/2027.42/39939/3/wp554.pd
Unbiased sampling of network ensembles
Sampling random graphs with given properties is a key step in the analysis of
networks, as random ensembles represent basic null models required to identify
patterns such as communities and motifs. An important requirement is that the
sampling process is unbiased and efficient. The main approaches are
microcanonical, i.e. they sample graphs that match the enforced constraints
exactly. Unfortunately, when applied to strongly heterogeneous networks (like
most real-world examples), the majority of these approaches become biased
and/or time-consuming. Moreover, the algorithms defined in the simplest cases,
such as binary graphs with given degrees, are not easily generalizable to more
complicated ensembles. Here we propose a solution to the problem via the
introduction of a "Maximize and Sample" ("Max & Sam" for short) method to
correctly sample ensembles of networks where the constraints are "soft", i.e.
realized as ensemble averages. Our method is based on exact maximum-entropy
distributions and is therefore unbiased by construction, even for strongly
heterogeneous networks. It is also more computationally efficient than most
microcanonical alternatives. Finally, it works for both binary and weighted
networks with a variety of constraints, including combined degree-strength
sequences and full reciprocity structure, for which no alternative method
exists. Our canonical approach can in principle be turned into an unbiased
microcanonical one, via a restriction to the relevant subset. Importantly, the
analysis of the fluctuations of the constraints suggests that the
microcanonical and canonical versions of all the ensembles considered here are
not equivalent. We show various real-world applications and provide a code
implementing all our algorithms.

Comment: MATLAB code available at
http://www.mathworks.it/matlabcentral/fileexchange/46912-max-sam-package-zi
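The canonical ("soft constraint") approach described above can be sketched for the simplest case, binary undirected graphs with a given degree sequence: fit the maximum-entropy edge probabilities p_ij = x_i x_j / (1 + x_i x_j) so that expected degrees match the targets, then draw graphs by independent Bernoulli trials. This is an illustrative Python sketch, not the authors' Max & Sam code, and the fixed-point iteration and variable names are assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

degrees = np.array([1.0, 2.0, 2.0, 3.0, 3.0, 3.0])  # target degree sequence
n = len(degrees)
x = degrees / np.sqrt(degrees.sum())                 # standard starting point

# Fixed-point iteration enforcing <k_i> = sum_{j != i} p_ij = k_i,
# where p_ij = x_i x_j / (1 + x_i x_j).
for _ in range(2000):
    XX = np.outer(x, x)
    R = x[None, :] / (1.0 + XX)   # R[i, j] = x_j / (1 + x_i x_j)
    np.fill_diagonal(R, 0.0)
    x = degrees / R.sum(axis=1)

P = np.outer(x, x) / (1.0 + np.outer(x, x))          # edge probabilities
np.fill_diagonal(P, 0.0)
expected = P.sum(axis=1)          # matches `degrees` as ensemble averages

# Draw one unbiased sample from the canonical ensemble.
upper = np.triu(rng.random((n, n)) < P, k=1)
A = (upper | upper.T).astype(int)
```

Because each graph is drawn from the exact maximum-entropy distribution, the sample is unbiased by construction; the constraints hold on average over the ensemble rather than in every individual draw.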
Statistical guarantees for the EM algorithm: From population to sample-based analysis
We develop a general framework for proving rigorous guarantees on the
performance of the EM algorithm and a variant known as gradient EM. Our
analysis is divided into two parts: a treatment of these algorithms at the
population level (in the limit of infinite data), followed by results that
apply to updates based on a finite set of samples. First, we characterize the
domain of attraction of any global maximizer of the population likelihood. This
characterization is based on a novel view of the EM updates as a perturbed form
of likelihood ascent, or in parallel, of the gradient EM updates as a perturbed
form of standard gradient ascent. Leveraging this characterization, we then
provide non-asymptotic guarantees on the EM and gradient EM algorithms when
applied to a finite set of samples. We develop consequences of our general
theory for three canonical examples of incomplete-data problems: mixture of
Gaussians, mixture of regressions, and linear regression with covariates
missing completely at random. In each case, our theory guarantees that with a
suitable initialization, a relatively small number of EM (or gradient EM) steps
will yield (with high probability) an estimate that is within statistical error
of the MLE. We provide simulations to confirm this theoretically predicted
behavior.
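The first canonical example above, a mixture of Gaussians, admits closed-form EM updates. The following is an illustrative sketch (not the paper's code; the model and all names are assumptions) for a balanced two-component 1-D mixture N(-theta, 1)/2 + N(theta, 1)/2 with known unit variances, estimating the mean parameter theta:

```python
import numpy as np

rng = np.random.default_rng(0)

theta_star = 2.0                  # true component mean
n = 5000
signs = rng.choice([-1.0, 1.0], size=n)
X = signs * theta_star + rng.normal(size=n)

def em_step(theta, X):
    # E-step: posterior weight of the +theta component for each point,
    # which simplifies to a logistic function of 2 * theta * x.
    w = 1.0 / (1.0 + np.exp(-2.0 * theta * X))
    # M-step: the weighted average is the closed-form maximizer.
    return np.mean((2.0 * w - 1.0) * X)

theta = 0.5                       # initialization in the basin of attraction
for _ in range(50):
    theta = em_step(theta, X)
# theta is now close to theta_star, up to statistical error
```

This matches the paper's qualitative conclusion: starting from a suitable initialization, a modest number of EM steps brings the iterate within statistical error of the MLE.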