Stream Sampling for Frequency Cap Statistics
Unaggregated data, in streamed or distributed form, is prevalent and comes
from diverse application domains which include interactions of users with web
services and IP traffic. Data elements have {\em keys} (cookies, users,
queries) and elements with different keys interleave. Analytics on such data
typically utilizes statistics stated in terms of the frequencies of keys. The
two most common statistics are {\em distinct}, which is the number of active
keys in a specified segment, and {\em sum}, which is the sum of the frequencies
of keys in the segment. Both are special cases of {\em cap} statistics, defined
as the sum of frequencies {\em capped} by a parameter $T$, which are popular in
online advertising platforms. Aggregation by key, however, is costly, requiring
state proportional to the number of distinct keys, and therefore we are
interested in estimating these statistics, or more generally sampling the data,
without aggregation. We present a sampling framework for unaggregated data that
uses a single pass (for streams) or two passes (for distributed data) and state
proportional to the desired sample size. Our design provides the first
effective solution for general frequency cap statistics. Our $\ell$-capped
samples provide estimates with tight statistical guarantees for cap statistics
with $T = \Theta(\ell)$ and nonnegative unbiased estimates of {\em any} monotone
non-decreasing frequency statistics. An added benefit of our unified design is
facilitating {\em multi-objective samples}, which provide estimates with
statistical guarantees for a specified set of different statistics, using a
single, smaller sample.
Comment: 21 pages, 4 figures, preliminary version will appear in KDD 2015
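To make the cap statistic concrete, here is a minimal Python sketch of the definition itself, not of the paper's one-pass sampling framework; the stream and keys are invented for illustration:

```python
from collections import Counter

def cap_statistic(stream, T):
    """Sum of per-key frequencies, each capped at T.

    T = 1 recovers the distinct count; T >= the maximum frequency recovers
    the plain sum. Note that this direct computation aggregates by key,
    i.e. it uses state proportional to the number of distinct keys --
    exactly the cost the paper's sampling framework avoids.
    """
    freq = Counter(stream)  # frequency of each key
    return sum(min(f, T) for f in freq.values())

# Hypothetical stream of key occurrences (e.g. user cookies).
stream = ["u1", "u2", "u1", "u3", "u1", "u2"]
print(cap_statistic(stream, T=1))   # 3: number of distinct keys
print(cap_statistic(stream, T=2))   # 5: sum of min(freq, 2)
print(cap_statistic(stream, T=10))  # 6: total number of elements
```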
Dirichlet Bayesian Network Scores and the Maximum Relative Entropy Principle
A classic approach for learning Bayesian networks from data is to identify a
maximum a posteriori (MAP) network structure. In the case of discrete Bayesian
networks, MAP networks are selected by maximising one of several possible
Bayesian Dirichlet (BD) scores; the most famous is the Bayesian Dirichlet
equivalent uniform (BDeu) score from Heckerman et al. (1995). The key
properties of BDeu arise from its uniform prior over the parameters of each
local distribution in the network: it makes structure learning computationally
efficient, it does not require eliciting prior knowledge from experts, and it
satisfies score equivalence.
In this paper we will review the derivation and the properties of BD scores,
and of BDeu in particular, and we will link them to the corresponding entropy
estimates to study them from an information theoretic perspective. To this end,
we will work in the context of the foundational work of Giffin and Caticha
(2007), who showed that Bayesian inference can be framed as a particular case
of the maximum relative entropy principle. We will use this connection to show
that BDeu should not be used for structure learning from sparse data, since it
violates the maximum relative entropy principle; and that it is also
problematic from a more classic Bayesian model selection perspective, because
it produces Bayes factors that are sensitive to the value of its only
hyperparameter. Using a large simulation study, we found in our previous work
(Scutari, 2016) that the Bayesian Dirichlet sparse (BDs) score seems to provide
better accuracy in structure learning; in this paper we further show that BDs
does not suffer from the issues above, and we recommend using it instead of
BDeu for sparse data. Finally, we will show that these issues are in fact
different aspects of the same problem and a consequence of the distributional
assumptions of the prior.
Comment: 20 pages, 4 figures; extended version submitted to Behaviormetrika
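For reference, the BDeu score discussed above has a standard closed form (Heckerman et al., 1995). The sketch below computes the log BDeu score of a single node from its contingency counts; the counts are invented, and this illustrates the textbook formula rather than code from the paper:

```python
import numpy as np
from scipy.special import gammaln

def bdeu_node_score(counts, alpha=1.0):
    """Log BDeu score of one node given its parents.

    counts: array of shape (q, r) holding n_ijk, the count of each of the
    r node states (k) under each of the q parent configurations (j). BDeu
    places the uniform prior alpha_ijk = alpha / (q * r), where alpha is
    the imaginary sample size, the score's only hyperparameter.
    """
    q, r = counts.shape
    a_jk = alpha / (q * r)   # Dirichlet prior mass per cell
    a_j = alpha / q          # prior mass per parent configuration
    n_j = counts.sum(axis=1)
    return np.sum(
        gammaln(a_j) - gammaln(a_j + n_j)
        + np.sum(gammaln(a_jk + counts) - gammaln(a_jk), axis=1)
    )

# Binary node with one binary parent (hypothetical counts). Varying alpha
# changes the score, the hyperparameter sensitivity the abstract refers to.
counts = np.array([[10, 2], [1, 12]])
for alpha in (1.0, 10.0):
    print(alpha, bdeu_node_score(counts, alpha))
```

The score of a full network structure is the sum of these per-node terms.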
A statistical physics perspective on criticality in financial markets
Stock markets are complex systems exhibiting collective phenomena and
particular features such as synchronization, fluctuations distributed as
power-laws, non-random structures and similarity to neural networks. Such
specific properties suggest that markets operate at a very special point.
Financial markets are believed to be critical by analogy to physical systems
but little statistically founded evidence has been given. Through a data-based
methodology and comparison to simulations inspired by statistical physics of
complex systems, we show that the Dow Jones and other index sets are not
rigorously critical. However, financial systems are closer to criticality in
the neighborhood of a crash.
Comment: 23 pages, 19 figures
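The abstract does not spell out its methodology, but one stylized fact it relies on, power-law distributed fluctuations, is easy to illustrate. A generic sketch on synthetic heavy-tailed returns (not the paper's data or procedure):

```python
import numpy as np

rng = np.random.default_rng(0)
returns = rng.standard_t(df=3, size=10_000)  # heavy-tailed synthetic returns

# In a power-law tail P(|r| > x) ~ x^(-alpha), the log-log rank plot is
# linear; fit its slope over the largest 1% of observations.
x = np.sort(np.abs(returns))[::-1]   # absolute returns, sorted descending
rank = np.arange(1, x.size + 1)      # empirical CCDF is rank / n
tail = slice(0, x.size // 100)
alpha = -np.polyfit(np.log(x[tail]), np.log(rank[tail]), 1)[0]
print(f"estimated tail exponent: {alpha:.2f}")  # roughly 3 for Student-t(3)
```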
Measuring the Complexity of Continuous Distributions
We extend previously proposed measures of complexity, emergence, and
self-organization to continuous distributions using differential entropy. This
allows us to calculate the complexity of phenomena for which distributions are
known. We find that Gaussian and scale-free distributions present high
complexity values for a broad range of common parameter values. We also explore
the relationship between our measure of complexity and information adaptation.
Comment: 21 pages, 5 tables, 4 figures
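A minimal sketch of the underlying idea, assuming the formulation of Gershenson and Fernández in which complexity is C = 4E(1 - E) with emergence E a normalized entropy; the normalization bound h_max below is an invented placeholder, since the paper defines its own scheme:

```python
import numpy as np

def gaussian_differential_entropy(sigma):
    """Differential entropy of N(mu, sigma^2) in nats: 0.5 * ln(2*pi*e*sigma^2)."""
    return 0.5 * np.log(2 * np.pi * np.e * sigma**2)

def complexity(h, h_max):
    """C = 4 * E * (1 - E), with emergence E = entropy normalized into [0, 1].

    h_max is an assumed normalization constant; differential entropy can be
    negative, so the ratio is clipped into [0, 1].
    """
    E = np.clip(h / h_max, 0.0, 1.0)
    return 4.0 * E * (1.0 - E)

# C peaks at intermediate entropy: neither fully ordered nor fully random.
for sigma in (0.5, 1.0, 4.0):
    h = gaussian_differential_entropy(sigma)
    print(f"sigma={sigma}: h={h:.3f} nats, C={complexity(h, h_max=4.0):.3f}")
```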