A low variance consistent test of relative dependency
We describe a novel non-parametric statistical hypothesis test of relative
dependence between a source variable and two candidate target variables. Such a
test enables us to determine whether one source variable is significantly more
dependent on a first target variable or a second. Dependence is measured via
the Hilbert-Schmidt Independence Criterion (HSIC), resulting in a pair of
empirical dependence measures (source-target 1, source-target 2). We test
whether the first dependence measure is significantly larger than the second.
Modeling the covariance between these HSIC statistics leads to a provably more
powerful test than the construction of independent HSIC statistics by
sub-sampling. The resulting test is consistent and unbiased, and (being based
on U-statistics) has favorable convergence properties. The test can be computed
in quadratic time, matching the computational complexity of standard empirical
HSIC estimators. The effectiveness of the test is demonstrated on several
real-world problems: we identify language groups from a multilingual corpus,
and we prove that tumor location is more dependent on gene expression than
chromosomal imbalances. Source code is available for download at
https://github.com/wbounliphone/reldep. Comment: International Conference on Machine Learning, Jul 2015, Lille, France
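As a rough illustration of the pair of empirical dependence measures mentioned in this abstract, the sketch below computes a biased empirical HSIC between a source and each of two targets. It is not the authors' relative-dependency test: it omits the covariance modeling and the U-statistic estimator, and the Gaussian kernel, the bandwidth `sigma`, the function names, and the synthetic data are all assumptions made for the example.

```python
import numpy as np


def gaussian_gram(x, sigma=1.0):
    """Gaussian-kernel Gram matrix for samples x (assumed kernel choice)."""
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    sq = np.sum(x ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (x @ x.T)
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma ** 2))


def hsic_biased(x, y, sigma=1.0):
    """Biased empirical HSIC: trace(K H L H) / n^2, computed in quadratic time."""
    n = len(x)
    K = gaussian_gram(x, sigma)
    L = gaussian_gram(y, sigma)
    H = np.eye(n) - np.ones((n, n)) / n  # centering matrix
    return np.trace(K @ H @ L @ H) / n ** 2


# Synthetic data (hypothetical): target 1 depends on the source, target 2 does not.
rng = np.random.default_rng(0)
source = rng.normal(size=200)
target_1 = source + 0.1 * rng.normal(size=200)
target_2 = rng.normal(size=200)

h_near = hsic_biased(source, target_1)
h_far = hsic_biased(source, target_2)
print("HSIC(source, target 1):", h_near)
print("HSIC(source, target 2):", h_far)
```

Comparing the two numbers directly says nothing about significance; deciding whether the first is significantly larger than the second is what the test described in the abstract addresses.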
Robust Unit Root and Cointegration Rank Tests for Panels and Large Systems
This study develops new tests for unit roots and cointegration rank in heterogeneous time series panels using methods that are robust to the presence of both incidental trends and cross sectional dependency of unknown form. Furthermore, the procedures do not require a choice of lag truncation or bandwidth to accommodate higher order serial correlation. The cointegration rank tests can also be implemented in relatively large dimensioned systems of equations for which conventional VECM based tests become infeasible. Monte Carlo simulations demonstrate that the procedures have high power and good size properties even in panels with relatively small dimensions. Keywords: Panel Unit Roots, Cointegration Rank Tests, Robust Autocovariance Estimation
Asymptotic Analysis of Generative Semi-Supervised Learning
Semi-supervised learning has emerged as a popular framework for improving
modeling accuracy while controlling labeling cost. Based on an extension of
stochastic composite likelihood we quantify the asymptotic accuracy of
generative semi-supervised learning. In doing so, we complement
distribution-free analysis by providing an alternative framework to measure the
value associated with different labeling policies and resolve the fundamental
question of how much data to label and in what manner. We demonstrate our
approach with both simulation studies and real world experiments using naive
Bayes for text classification and MRFs and CRFs for structured prediction in
NLP. Comment: 12 pages, 9 figures
Robust Inference of Trees
This paper is concerned with the reliable inference of optimal
tree-approximations to the dependency structure of an unknown distribution
generating data. The traditional approach to the problem measures the
dependency strength between random variables by the index called mutual
information. In this paper reliability is achieved by Walley's imprecise
Dirichlet model, which generalizes Bayesian learning with Dirichlet priors.
Adopting the imprecise Dirichlet model results in posterior interval
expectation for mutual information, and in a set of plausible trees consistent
with the data. Reliable inference about the actual tree is achieved by focusing
on the substructure common to all the plausible trees. We develop an exact
algorithm that infers the substructure in time O(m^4), m being the number of
random variables. The new algorithm is applied to a set of data sampled from a
known distribution. The method is shown to reliably infer edges of the actual
tree even when the data are very scarce, unlike the traditional approach.
Finally, we provide lower and upper credibility limits for mutual information
under the imprecise Dirichlet model. These enable the previous developments to
be extended to a full inferential method for trees. Comment: 26 pages, 7 figures
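For context, the traditional approach that this abstract contrasts itself with can be sketched as follows: estimate the mutual information of every pair of variables from empirical frequencies, then keep a maximum-weight spanning tree (commonly known as the Chow-Liu procedure). This is only a minimal sketch of that baseline, not the imprecise Dirichlet algorithm the paper develops; the function names and the toy data are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import log


def mutual_information(xs, ys):
    """Plug-in mutual information (in nats) of two discrete samples."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    return sum(
        (c / n) * log(c * n / (px[a] * py[b])) for (a, b), c in pxy.items()
    )


def max_mi_tree(columns):
    """Edges of a maximum-weight spanning tree under pairwise mutual information."""
    m = len(columns)
    edges = sorted(
        (
            (mutual_information(columns[i], columns[j]), i, j)
            for i, j in combinations(range(m), 2)
        ),
        reverse=True,
    )
    parent = list(range(m))  # union-find forest for Kruskal's algorithm

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    tree = []
    for _, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:  # adding this edge does not create a cycle
            parent[ri] = rj
            tree.append((i, j))
    return tree


# Toy data (hypothetical): variable 1 copies variable 0; variable 2 is unrelated.
columns = [
    [0, 0, 0, 0, 1, 1, 1, 1],
    [0, 0, 0, 0, 1, 1, 1, 1],
    [0, 1, 0, 1, 0, 1, 0, 1],
]
tree = max_mi_tree(columns)
print(tree)
```

With very little data the plug-in estimates are noisy, so ties and near-ties between candidate edges are resolved arbitrarily; that fragility is the kind of unreliability the abstract sets out to address.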
Why has (reasonably accurate) Automatic Speech Recognition been so hard to achieve?
Hidden Markov models (HMMs) have been successfully applied to automatic
speech recognition for more than 35 years in spite of the fact that a key HMM
assumption -- the statistical independence of frames -- is obviously violated
by speech data. In fact, this data/model mismatch has inspired many attempts to
modify or replace HMMs with alternative models that are better able to take
into account the statistical dependence of frames. However, it is fair to say
that in 2010 the HMM is the consensus model of choice for speech recognition
and that HMMs are at the heart of both commercially available products and
contemporary research systems. In this paper we present a preliminary
exploration aimed at understanding how speech data depart from HMMs and what
effect this departure has on the accuracy of HMM-based speech recognition. Our
analysis uses standard diagnostic tools from the field of statistics --
hypothesis testing, simulation and resampling -- which are rarely used in the
field of speech recognition. Our main result, obtained by novel manipulations
of real and resampled data, demonstrates that real data have statistical
dependency and that this dependency is responsible for significant numbers of
recognition errors. We also demonstrate, using simulation and resampling, that
if we 'remove' the statistical dependency from data, then the resulting
recognition error rates become negligible. Taken together, these results
suggest that a better understanding of the structure of the statistical
dependency in speech data is a crucial first step towards improving HMM-based
speech recognition.
Endogenous Correlation
We model endogenous correlation in asset returns via the role of heterogeneous expectations in investor types, and the dynamic impact of imitative learning by investors. Learning is driven by relative performance. In addition, we allow a cautious slow learning pace to reflect institutional conditions. Imitative learning shapes the market ecology that influences price formation. Using the model of non-imitative agents as a benchmark, our results show that the dynamics of imitative learning endogenously induce a significant degree of asset dependency and patterns of non-constant correlation. The asymmetric learning effect on correlation, however, implies a self-reinforcing process, where a bearish condition amplifies the effect that further exacerbates asset dependency. We conclude that imitative learning, even when rational, can to a certain extent account for the phenomena of market crashes. Our results have implications for transparency in regulation issues.
Development of filtered Euler–Euler two-phase model for circulating fluidised bed: High resolution simulation, formulation and a priori analyses
Euler–Euler two-phase model simulations are usually performed with mesh sizes larger than the small-scale structure size of gas–solid flows in industrial fluidised beds because of computational resource limitations. Thus, these simulations do not fully account for the particle segregation effect at the small scale and this causes poor prediction of bed hydrodynamics. An appropriate modelling approach accounting for the influence of unresolved structures needs to be proposed for practical simulations. For this purpose, computational grids are refined to a cell size of a few particle diameters to obtain mesh-independent results requiring up to 17 million cells in a 3D periodic circulating fluidised bed. These mesh-independent results are filtered by volume averaging and used to perform a priori analyses on the filtered phase balance equations. Results show that filtered momentum equations can be used for practical simulations but must take account of a drift velocity due to the sub-grid correlation between the local fluid velocity and the local particle volume fraction, and particle sub-grid stresses due to the filtering of the non-linear convection term. This paper proposes models for sub-grid drift velocity and particle sub-grid stresses and assesses these models by a priori tests.