
    A low variance consistent test of relative dependency

    We describe a novel non-parametric statistical hypothesis test of relative dependence between a source variable and two candidate target variables. Such a test enables us to determine whether one source variable is significantly more dependent on the first target variable than on the second. Dependence is measured via the Hilbert-Schmidt Independence Criterion (HSIC), resulting in a pair of empirical dependence measures (source-target 1, source-target 2). We test whether the first dependence measure is significantly larger than the second. Modeling the covariance between these HSIC statistics leads to a provably more powerful test than the construction of independent HSIC statistics by sub-sampling. The resulting test is consistent and unbiased, and (being based on U-statistics) has favorable convergence properties. The test can be computed in quadratic time, matching the computational complexity of standard empirical HSIC estimators. The effectiveness of the test is demonstrated on several real-world problems: we identify language groups from a multilingual corpus, and we show that tumor location is more dependent on gene expression than on chromosomal imbalances. Source code is available for download at https://github.com/wbounliphone/reldep.
    Comment: International Conference on Machine Learning, Jul 2015, Lille, France
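    The building block of the test is easy to sketch numerically. Below is a minimal illustration (not the paper's test itself, which additionally models the covariance between the two statistics to form a significance test): a biased empirical HSIC estimate for each (source, target) pair, showing that the more dependent pair yields the larger value. The variable names and the fixed kernel bandwidth are illustrative choices.

```python
import numpy as np

def rbf_gram(x, sigma=1.0):
    # Pairwise squared distances -> Gaussian (RBF) Gram matrix.
    sq = np.sum(x**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * x @ x.T
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma**2))

def hsic(x, y, sigma=1.0):
    # Biased empirical HSIC: tr(K H L H) / m^2, with H the centering matrix.
    m = x.shape[0]
    H = np.eye(m) - np.ones((m, m)) / m
    return np.trace(rbf_gram(x, sigma) @ H @ rbf_gram(y, sigma) @ H) / m**2

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 1))
y1 = x + 0.1 * rng.normal(size=(200, 1))  # strongly dependent target
y2 = rng.normal(size=(200, 1))            # independent target

print(hsic(x, y1) > hsic(x, y2))  # the more dependent pair scores higher
```

    A quadratic-time estimator like this is what the abstract means by "matching the computational complexity of standard empirical HSIC estimators": each Gram matrix costs O(m^2).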

    Robust Unit Root and Cointegration Rank Tests for Panels and Large Systems

    This study develops new tests for unit roots and cointegration rank in heterogeneous time series panels using methods that are robust to the presence of both incidental trends and cross-sectional dependency of unknown form. Furthermore, the procedures do not require a choice of lag truncation or bandwidth to accommodate higher-order serial correlation. The cointegration rank tests can also be implemented in relatively large-dimensioned systems of equations for which conventional VECM-based tests become infeasible. Monte Carlo simulations demonstrate that the procedures have high power and good size properties even in panels with relatively small dimensions.
    Keywords: Panel Unit Roots, Cointegration Rank Tests, Robust Autocovariance Estimation
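    For intuition about what a panel unit-root test measures, here is a deliberately naive sketch (not the robust procedure this study develops): per-unit Dickey-Fuller t-statistics pooled by averaging across a simulated panel. A panel of unit-root series and a panel of stationary AR(1) series separate cleanly.

```python
import numpy as np

rng = np.random.default_rng(1)

def df_tstat(y):
    # Dickey-Fuller regression without drift: dy_t = rho * y_{t-1} + e_t;
    # returns the t-statistic on rho (negative under mean reversion).
    dy, ylag = np.diff(y), y[:-1]
    rho = ylag @ dy / (ylag @ ylag)
    resid = dy - rho * ylag
    se = np.sqrt(resid @ resid / (len(dy) - 1) / (ylag @ ylag))
    return rho / se

def ar1(phi, T):
    # Simulate y_t = phi * y_{t-1} + e_t (phi = 1 gives a unit root).
    y = np.zeros(T)
    for t in range(1, T):
        y[t] = phi * y[t - 1] + rng.normal()
    return y

def pooled(panel):
    # Naive pooled statistic: average of per-unit DF t-stats.
    return float(np.mean([df_tstat(y) for y in panel]))

T, N = 200, 20
unit_root = [ar1(1.0, T) for _ in range(N)]   # null: unit root in every unit
stationary = [ar1(0.5, T) for _ in range(N)]  # alternative: mean reversion

print(pooled(unit_root), pooled(stationary))
```

    This toy version ignores exactly the complications the study addresses: incidental trends, cross-sectional dependence, and higher-order serial correlation that would normally force a lag-truncation or bandwidth choice.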

    Asymptotic Analysis of Generative Semi-Supervised Learning

    Semi-supervised learning has emerged as a popular framework for improving modeling accuracy while controlling labeling cost. Based on an extension of stochastic composite likelihood, we quantify the asymptotic accuracy of generative semi-supervised learning. In doing so, we complement distribution-free analysis by providing an alternative framework to measure the value associated with different labeling policies and resolve the fundamental question of how much data to label and in what manner. We demonstrate our approach with both simulation studies and real-world experiments using naive Bayes for text classification and MRFs and CRFs for structured prediction in NLP.
    Comment: 12 pages, 9 figures
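    As a concrete stand-in for the generative setting analyzed here (not the paper's composite-likelihood machinery), the sketch below runs semi-supervised EM on a two-component Gaussian mixture where a simple labeling policy labels only the first 50 of 1000 points; the labeled subset anchors the component identities while the unlabeled data sharpen the estimates.

```python
import numpy as np

rng = np.random.default_rng(3)
n, n_lab = 1000, 50
y = rng.integers(0, 2, size=n)                    # latent class labels
x = rng.normal(loc=np.where(y == 0, -2.0, 2.0))   # true means -2 and +2
labeled = np.arange(n_lab)                        # policy: label the first 50

mu = np.array([-1.0, 1.0])  # deliberately poor initial means
for _ in range(50):
    # E-step: responsibilities under unit-variance Gaussians, equal priors;
    # labeled points keep their hard labels.
    dens = np.stack([np.exp(-0.5 * (x - m) ** 2) for m in mu])
    resp = dens / dens.sum(axis=0)
    resp[:, labeled] = np.eye(2)[:, y[labeled]]
    # M-step: responsibility-weighted means (variances held fixed at 1).
    mu = (resp * x).sum(axis=1) / resp.sum(axis=1)

print(mu)  # close to the true means (-2, 2)
```

    Varying `n_lab` in a sketch like this is the experimental analogue of the question the paper answers analytically: how much data to label, and in what manner.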

    Robust Inference of Trees

    This paper is concerned with the reliable inference of optimal tree approximations to the dependency structure of an unknown distribution generating data. The traditional approach to the problem measures the dependency strength between random variables by the index called mutual information. In this paper, reliability is achieved by Walley's imprecise Dirichlet model, which generalizes Bayesian learning with Dirichlet priors. Adopting the imprecise Dirichlet model results in a posterior interval expectation for mutual information, and in a set of plausible trees consistent with the data. Reliable inference about the actual tree is achieved by focusing on the substructure common to all the plausible trees. We develop an exact algorithm that infers the substructure in time O(m^4), m being the number of random variables. The new algorithm is applied to a set of data sampled from a known distribution. The method is shown to reliably infer edges of the actual tree even when the data are very scarce, unlike the traditional approach. Finally, we provide lower and upper credibility limits for mutual information under the imprecise Dirichlet model. These enable the previous developments to be extended to a full inferential method for trees.
    Comment: 26 pages, 7 figures
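    The "traditional approach" this paper makes robust is essentially the Chow-Liu algorithm: estimate pairwise mutual information from data, then take a maximum-weight spanning tree. A minimal sketch of that baseline (point estimates of MI, with none of the imprecise-Dirichlet interval machinery):

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(4)

def mutual_info(a, b):
    # Empirical mutual information (in nats) between two discrete columns.
    mi = 0.0
    for va in np.unique(a):
        for vb in np.unique(b):
            pab = np.mean((a == va) & (b == vb))
            if pab > 0:
                mi += pab * np.log(pab / (np.mean(a == va) * np.mean(b == vb)))
    return mi

def chow_liu_edges(data):
    # Maximum-weight spanning tree over pairwise MI (Kruskal + union-find).
    m = data.shape[1]
    weights = sorted(((mutual_info(data[:, i], data[:, j]), i, j)
                      for i, j in combinations(range(m), 2)), reverse=True)
    parent = list(range(m))
    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u
    tree = []
    for _, i, j in weights:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            tree.append((i, j))
    return tree

# Markov chain X0 -> X1 -> X2 with 10% flip noise on each edge.
n = 2000
x0 = rng.integers(0, 2, size=n)
x1 = x0 ^ (rng.random(n) < 0.1)
x2 = x1 ^ (rng.random(n) < 0.1)
data = np.column_stack([x0, x1, x2])
print(chow_liu_edges(data))  # recovers the chain edges (0,1) and (1,2)
```

    The paper's criticism is that with very scarce data these MI point estimates are unreliable; replacing them with posterior intervals yields a set of plausible trees, and only edges common to all of them are asserted.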

    Why has (reasonably accurate) Automatic Speech Recognition been so hard to achieve?

    Hidden Markov models (HMMs) have been successfully applied to automatic speech recognition for more than 35 years, in spite of the fact that a key HMM assumption -- the statistical independence of frames -- is obviously violated by speech data. In fact, this data/model mismatch has inspired many attempts to modify or replace HMMs with alternative models that are better able to take into account the statistical dependence of frames. However, it is fair to say that in 2010 the HMM is the consensus model of choice for speech recognition, and that HMMs are at the heart of both commercially available products and contemporary research systems. In this paper we present a preliminary exploration aimed at understanding how speech data depart from HMMs and what effect this departure has on the accuracy of HMM-based speech recognition. Our analysis uses standard diagnostic tools from the field of statistics -- hypothesis testing, simulation and resampling -- which are rarely used in the field of speech recognition. Our main result, obtained by novel manipulations of real and resampled data, demonstrates that real data have statistical dependency and that this dependency is responsible for significant numbers of recognition errors. We also demonstrate, using simulation and resampling, that if we `remove' the statistical dependency from the data, then the resulting recognition error rates become negligible. Taken together, these results suggest that a better understanding of the structure of the statistical dependency in speech data is a crucial first step towards improving HMM-based speech recognition.
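    The resampling trick at the heart of the diagnostic is simple to illustrate on synthetic data (a toy stand-in, not the paper's experiments): frames drawn from a single state with AR(1) temporal dependence violate the HMM independence assumption, and shuffling the frames "removes" the dependency while preserving the per-frame marginal distribution.

```python
import numpy as np

rng = np.random.default_rng(5)

# "Frames" emitted from one state with AR(1) dependence, violating the
# HMM assumption that frames are conditionally independent given the state.
n, phi = 5000, 0.8
frames = np.zeros(n)
for t in range(1, n):
    frames[t] = phi * frames[t - 1] + rng.normal()

def lag1_autocorr(x):
    # Sample lag-1 autocorrelation.
    x = x - x.mean()
    return float(x[:-1] @ x[1:] / (x @ x))

# Resampling frames i.i.d. destroys the temporal dependency but keeps the marginal.
shuffled = rng.permutation(frames)
print(lag1_autocorr(frames), lag1_autocorr(shuffled))
```

    The first number is close to phi = 0.8; the second is near zero. In the paper's setting, recognizing the shuffled (dependency-free) analogue of real speech is what drives the error rate toward zero.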

    Development of filtered Euler–Euler two-phase model for circulating fluidised bed: High resolution simulation, formulation and a priori analyses

    Euler–Euler two-phase model simulations are usually performed with mesh sizes larger than the small-scale structure size of gas–solid flows in industrial fluidised beds because of computational resource limitations. Thus, these simulations do not fully account for the particle segregation effect at the small scale, and this causes poor prediction of bed hydrodynamics. An appropriate modelling approach accounting for the influence of unresolved structures needs to be proposed for practical simulations. For this purpose, computational grids are refined to a cell size of a few particle diameters to obtain mesh-independent results, requiring up to 17 million cells in a 3D periodic circulating fluidised bed. These mesh-independent results are filtered by volume averaging and used to perform a priori analyses on the filtered phase balance equations. Results show that filtered momentum equations can be used for practical simulations, but must account for a drift velocity due to the sub-grid correlation between the local fluid velocity and the local particle volume fraction, and for particle sub-grid stresses due to the filtering of the non-linear convection term. This paper proposes models for the sub-grid drift velocity and particle sub-grid stresses and assesses these models by a priori tests.
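    The a priori test for the drift velocity can be sketched in one dimension on hypothetical synthetic fields (not the paper's CFD data): apply a top-hat volume-average filter to resolved fields, then measure the gap between the particle-weighted filtered gas velocity and the plainly filtered gas velocity.

```python
import numpy as np

rng = np.random.default_rng(6)

def box_filter(f, w):
    # Top-hat (volume-average) filter of width w on a periodic 1-D field.
    padded = np.concatenate([f[-w:], f, f[:w]])
    return np.convolve(padded, np.ones(w) / w, mode="same")[w:-w]

# Hypothetical resolved fields: particle volume fraction anti-correlated
# with gas velocity at the sub-filter scale.
n = 512
s = np.sin(np.linspace(0.0, 8.0 * np.pi, n))
alpha_p = 0.3 + 0.1 * s + 0.02 * rng.normal(size=n)  # particle volume fraction
u_g = 1.0 - 0.5 * s                                  # gas velocity

# Drift velocity: particle-weighted filtered gas velocity minus the
# filtered gas velocity; non-zero only through sub-filter correlation.
w = 32
v_drift = box_filter(alpha_p * u_g, w) / box_filter(alpha_p, w) - box_filter(u_g, w)
print(v_drift.mean())
```

    Because alpha_p and u_g are anti-correlated below the filter width, the drift velocity here is systematically negative; on an unfiltered (fully resolved) field it would vanish, which is why coarse-grid simulations must model it explicitly.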