
    Detecting low-complexity unobserved causes

    We describe a method that infers whether statistical dependences between two observed variables X and Y are due to a "direct" causal link or only to a connecting causal path that contains an unobserved variable of low complexity, e.g., a binary variable. The problem is motivated by statistical genetics: given a genetic marker that is correlated with a phenotype of interest, we want to detect whether the marker is itself causal or only correlates with a causal one. Our method is based on analyzing the location of the conditional distributions P(Y|x) in the simplex of all distributions of Y. We report encouraging results on semi-empirical data.
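A toy version of the geometric idea (our illustration, not the authors' implementation): if X and Y are connected only through a binary Z, every conditional P(Y|x) is a mixture of the two distributions P(Y|Z=0) and P(Y|Z=1), so the conditionals lie on a line segment inside the simplex. Near-collinearity of the conditionals can be checked with an SVD:

```python
import numpy as np

def collinearity_score(cond_dists):
    """How close the rows P(Y|x) lie to a line in the simplex.

    Returns the fraction of (centered) variance NOT captured by the
    leading singular direction: a value near 0 is consistent with a
    binary unobserved variable on the path between X and Y.
    """
    M = np.asarray(cond_dists, dtype=float)
    C = M - M.mean(axis=0)                 # center the conditionals
    s = np.linalg.svd(C, compute_uv=False)
    total = np.sum(s ** 2)
    if total == 0:
        return 0.0
    return 1.0 - s[0] ** 2 / total

# conditionals generated through a binary Z: exact mixtures of two rows
p0 = np.array([0.7, 0.2, 0.1])
p1 = np.array([0.1, 0.3, 0.6])
mixed = np.array([w * p1 + (1 - w) * p0 for w in (0.0, 0.25, 0.6, 0.9)])
print(collinearity_score(mixed))   # ~0: consistent with a binary latent
```

For contrast, conditionals spread over the whole simplex (e.g. the three vertex distributions) give a strictly positive score.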

    We Are Not Your Real Parents: Telling Causal from Confounded using MDL

    Given data over variables (X_1, ..., X_m, Y), we consider the problem of finding out whether X jointly causes Y or whether they are all confounded by an unobserved latent variable Z. To do so, we take an information-theoretic approach based on Kolmogorov complexity. In a nutshell, we follow the postulate that first encoding the true cause, and then the effects given that cause, results in a shorter description than any other encoding of the observed variables. The ideal score is not computable, so we approximate it using the Minimum Description Length (MDL) principle. We compare the MDL scores under the model where X causes Y and the model where a latent variable Z confounds both X and Y, and show that our scores are consistent. To find potential confounders we propose latent factor modeling, in particular probabilistic PCA (PPCA). Empirical evaluation on both synthetic and real-world data shows that our method, CoCa, performs very well, even when the true generating process of the data is far from the assumptions made by the models we use. Moreover, it is robust, as its accuracy goes hand in hand with its confidence.
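The MDL comparison can be sketched under strong simplifying assumptions (Gaussian data, BIC-style parameter costs; this is our illustration, not the CoCa implementation): score the joint data once with an unrestricted Gaussian, standing in for "X causes Y", and once with one-factor PPCA (Tipping & Bishop's closed-form ML solution), standing in for "a single confounder Z generates everything", and prefer the shorter description:

```python
import numpy as np

def gaussian_codelength(data):
    """Two-part code length (nats) for a full-covariance Gaussian:
    negative log-likelihood at the MLE plus a BIC-style parameter cost."""
    n, d = data.shape
    S = np.cov(data, rowvar=False, bias=True)
    nll = 0.5 * n * (d * np.log(2 * np.pi) + np.linalg.slogdet(S)[1] + d)
    k = d + d * (d + 1) / 2            # mean + covariance parameters
    return nll + 0.5 * k * np.log(n)

def ppca_codelength(data):
    """Code length under a single-confounder model: probabilistic PCA
    with one latent factor, using the closed-form ML profile likelihood."""
    n, d = data.shape
    S = np.cov(data, rowvar=False, bias=True)
    lam = np.sort(np.linalg.eigvalsh(S))[::-1]
    sigma2 = lam[1:].mean()            # noise = mean of discarded variance
    logdetC = np.log(lam[0]) + (d - 1) * np.log(sigma2)
    nll = 0.5 * n * (d * np.log(2 * np.pi) + logdetC + d)
    k = d + d + 1                      # mean + loading vector + noise var
    return nll + 0.5 * k * np.log(n)

rng = np.random.default_rng(1)
n, d = 2000, 4
z = rng.normal(size=(n, 1))                            # hidden confounder
data = z @ rng.normal(size=(1, d)) + 0.3 * rng.normal(size=(n, d))
# on truly confounded data the one-factor model should win
print(ppca_codelength(data) < gaussian_codelength(data))
```

The unrestricted Gaussian always attains at least as high a likelihood, but pays for 5 extra parameters here; on genuinely one-factor data the confounded score comes out shorter.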

    Incomplete graphical model inference via latent tree aggregation

    Graphical network inference is used in many fields, such as genomics or ecology, to infer the conditional independence structure between variables from measurements of, for instance, gene expression or species abundances. In many practical cases, not all variables involved in the network have been observed, and the samples are actually drawn from a distribution where some variables have been marginalized out. This challenges the sparsity assumption commonly made in graphical model inference, since marginalization yields locally dense structures even when the original network is sparse. We present a procedure for inferring Gaussian graphical models when some variables are unobserved that accounts for both the influence of missing variables and the low density of the original network. Our model is based on the aggregation of spanning trees, and the estimation procedure on the Expectation-Maximization algorithm. We treat the graph structure and the unobserved nodes as missing variables and compute posterior probabilities of edge appearance. To provide a complete methodology, we also propose several model selection criteria to estimate the number of missing nodes. A simulation study and an illustration on flow cytometry data show that our method has favorable edge detection properties compared to existing graph inference techniques. The methods are implemented in an R package.
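The building block behind spanning-tree aggregation can be illustrated with a generic identity from the matrix-tree theorem (this is not the paper's full EM procedure): when spanning trees are drawn with probability proportional to the product of their edge weights, the probability that an edge appears in the tree equals its weight times the effective resistance between its endpoints, computable from the pseudo-inverse of the weighted Laplacian:

```python
import numpy as np

def edge_in_tree_prob(W):
    """Per-edge appearance probabilities under a random spanning tree
    drawn proportionally to the product of edge weights W.

    Uses P(u~v in T) = W[u,v] * R_eff(u,v), where the effective
    resistance R_eff comes from the Laplacian pseudo-inverse. Averaging
    edge indicators over trees in this way is the kind of aggregation
    used to score edges in tree-based graphical model inference.
    """
    W = np.asarray(W, dtype=float)
    L = np.diag(W.sum(axis=1)) - W          # weighted graph Laplacian
    Lp = np.linalg.pinv(L)
    d = np.diag(Lp)
    reff = d[:, None] + d[None, :] - 2 * Lp  # effective resistances
    return W * reff

# toy weighted graph on 4 nodes (symmetric weight matrix)
W = np.array([[0, 2, 1, 0],
              [2, 0, 1, 1],
              [1, 1, 0, 2],
              [0, 1, 2, 0]], float)
P = edge_in_tree_prob(W)
print(P.sum() / 2)   # every spanning tree has n - 1 = 3 edges
```

As a sanity check, the edge probabilities over unordered pairs must sum to n - 1, since every spanning tree of a 4-node graph has exactly 3 edges.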

    On specifying heterogeneity in knowledge production functions

    Within the Geography of Innovation literature, the Knowledge Production Function (KPF) approach has become the reference framework for investigating localized knowledge spillovers, and spatial econometric tools have been applied to study interregional spillovers. A linear specification of the KPF is typically assumed, linking patents to R&D expenditure. This approach, however, suffers from several drawbacks. First, patent applications are count data in nature, and patents per inhabitant may produce an unrealistic picture of the spatial distribution of innovative activities. Second, spatial heterogeneity is usually unobserved, producing both omitted-variable bias and spatial correlation in the error structure. Third, a positive R&D-patents linkage may arise as a spurious correlation if market size is not observed, making R&D endogenous. This paper uses a regional cross-section model of the spatial distribution of high-tech patents across 232 European regions in 2005/2006 to address these issues. Two main processes drive technological change in the model: research activities, and knowledge generated outside firms and subsequently embedded through either formal or informal acquisition. Among the different knowledge sources we focus in particular on the role of firms in Knowledge Intensive Business Services (KIBS) and on that of universities. In developing the empirical model we take into account that a) patents are count data; b) excluding market size would yield biased and inconsistent parameter estimates; and c) estimates of interregional spillovers may be biased by the omission of heterogeneity from the model specification. Empirical results indicate that, as expected, a count-data distribution best fits the data, producing less spatially autocorrelated residuals. Regional innovative activity is explained by both investment in research and the localization of KIBS, but only the former generates positive interregional externalities. Scientific universities do not directly affect the production of new knowledge. However, different knowledge production processes characterize regions with and without scientific universities, with R&D driving innovation in the former and KIBS in the latter. Finally, most of what are assumed to be interregional spillovers turn out, on closer inspection, to be effects of unaccounted spatial heterogeneity in regional innovation.
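The count-data point can be illustrated with a minimal Poisson regression fitted by iteratively reweighted least squares (a toy numpy sketch, not the paper's spatial-econometric specification; the variable names are illustrative):

```python
import numpy as np

def poisson_irls(X, y, n_iter=50):
    """Maximum-likelihood Poisson regression, log E[y] = X @ beta,
    fitted by iteratively reweighted least squares (Newton's method).
    Assumes the first column of X is an intercept."""
    beta = np.zeros(X.shape[1])
    beta[0] = np.log(y.mean() + 1e-12)    # start near the mean rate
    for _ in range(n_iter):
        mu = np.exp(X @ beta)             # current fitted means
        z = X @ beta + (y - mu) / mu      # working response
        XtW = X.T * mu                    # IRLS weights are mu
        beta = np.linalg.solve(XtW @ X, XtW @ z)
    return beta

rng = np.random.default_rng(0)
n = 5000
log_rd = rng.normal(size=n)                      # stand-in for log R&D
X = np.column_stack([np.ones(n), log_rd])
y = rng.poisson(np.exp(1.0 + 0.5 * log_rd))      # simulated patent counts
beta_hat = poisson_irls(X, y)
print(np.round(beta_hat, 2))   # recovers the generating (1.0, 0.5)
```

Unlike a linear fit on counts, this specification keeps predicted patent counts non-negative and lets the error variance grow with the mean.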

    Chaos detection in economics. Metric versus topological tools

    In their paper, Frank F., Gencay R., and Stengos T. (1988) analyze quarterly macroeconomic data from 1960 to 1988 for West Germany, Italy, Japan and England, with the goal of checking for the presence of deterministic chaos. To ensure that the data analysed were stationary, they took first differences and then tried a linear fit. Using a reasonable AR specification for each time series, they concluded that the series showed different structures; in particular, nonlinear structure was present in the Japanese series. Nevertheless, the application of metric tools for detecting chaos (correlation dimension and Lyapunov exponent) showed no presence of chaos in any of the series. Starting from this conclusion, we apply a topological tool, Visual Recurrence Analysis, to the same time series and compare the results. The purpose is to verify whether the analysis performed by a topological tool can give results different from those obtained with a metric tool.
    Keywords: economics time series, chaos, topological tools
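For reference, one of the metric tools mentioned, the correlation dimension, is estimated from the Grassberger-Procaccia correlation sum. A minimal sketch on a synthetic chaotic series (not the paper's macroeconomic data):

```python
import numpy as np

def correlation_sum(x, m, tau, r):
    """Grassberger-Procaccia correlation sum C(r): delay-embed the
    scalar series x in m dimensions with lag tau and return the
    fraction of point pairs closer than r (Euclidean distance)."""
    N = len(x) - (m - 1) * tau
    emb = np.column_stack([x[i * tau : i * tau + N] for i in range(m)])
    dist = np.sqrt(((emb[:, None, :] - emb[None, :, :]) ** 2).sum(-1))
    iu = np.triu_indices(N, k=1)       # each unordered pair once
    return (dist[iu] < r).mean()

# logistic map in its chaotic regime as a stand-in deterministic series
x = np.empty(600)
x[0] = 0.4
for t in range(599):
    x[t + 1] = 4.0 * x[t] * (1.0 - x[t])

rs = np.array([0.05, 0.1, 0.2, 0.4])
Cs = np.array([correlation_sum(x, m=2, tau=1, r=r) for r in rs])
# the slope of log C(r) vs log r estimates the correlation dimension
slope = np.polyfit(np.log(rs), np.log(Cs), 1)[0]
print(np.round(Cs, 3), round(slope, 2))
```

A low-dimensional scaling slope on a deterministic series, versus a slope near the embedding dimension on noise, is what the metric test looks for.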

    Non-Parametric Causality Detection: An Application to Social Media and Financial Data

    According to behavioral finance, stock market returns are influenced by emotional, social and psychological factors. Several recent works support this theory by providing evidence of correlation between stock market prices and collective sentiment indexes measured using social media data. However, a pure correlation analysis is not sufficient to prove that stock market returns are influenced by such emotional factors, since both stock market prices and collective sentiment may be driven by a third unmeasured factor. Controlling for factors that could influence the study by applying multivariate regression models is challenging given the complexity of stock market data. False assumptions about the linearity or non-linearity of the model and inaccuracies in model specification may result in misleading conclusions. In this work, we propose a novel framework for causal inference that does not require any assumption about the statistical relationships among the variables of the study and can effectively control a large number of factors. We apply our method in order to estimate the causal impact that information posted in social media may have on stock market returns of four big companies. Our results indicate that social media data not only correlate with stock market returns but also influence them.
    Comment: Physica A: Statistical Mechanics and its Applications 201
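The abstract does not spell out the framework, so as a generic illustration of non-parametric causality testing (our sketch on synthetic data, not the authors' method), one can ask whether lagged sentiment reduces the prediction error of returns beyond what shuffled surrogate sentiment achieves:

```python
import numpy as np

def knn_mse(X, y, k=5):
    """Leave-one-out MSE of k-nearest-neighbour regression: a
    model-free measure of how well X predicts y."""
    D = np.abs(X[:, None, :] - X[None, :, :]).sum(-1)   # L1 distances
    np.fill_diagonal(D, np.inf)                         # leave one out
    idx = np.argsort(D, axis=1)[:, :k]
    return np.mean((y - y[idx].mean(axis=1)) ** 2)

def causality_pvalue(sentiment, returns, k=5, n_perm=200, seed=0):
    """Permutation test: fraction of shuffled-sentiment predictors
    that do at least as well as the real lagged sentiment."""
    rng = np.random.default_rng(seed)
    X = np.column_stack([returns[:-1], sentiment[:-1]])
    y = returns[1:]
    obs = knn_mse(X, y, k)
    null = [knn_mse(np.column_stack([returns[:-1],
                                     rng.permutation(sentiment[:-1])]),
                    y, k)
            for _ in range(n_perm)]
    return np.mean(np.array(null) <= obs)

rng = np.random.default_rng(42)
n = 400
s = rng.normal(size=n)                  # synthetic "sentiment" index
r = np.empty(n)
r[0] = 0.0
for t in range(1, n):                   # returns driven by lagged sentiment
    r[t] = 0.8 * s[t - 1] + 0.3 * rng.normal()
p = causality_pvalue(s, r)
print(p)   # a small p-value flags predictive influence
```

Because no functional form is assumed, the same test runs unchanged whether the sentiment-returns link is linear or not, which is the appeal of the non-parametric approach.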