Detecting low-complexity unobserved causes
We describe a method that infers whether statistical dependences between two
observed variables X and Y are due to a "direct" causal link or only due to a
connecting causal path that contains an unobserved variable of low complexity,
e.g., a binary variable. This problem is motivated by statistical genetics.
Given a genetic marker that is correlated with a phenotype of interest, we want
to detect whether this marker is causal or only correlates with a causal
one. Our method is based on the analysis of the location of the conditional
distributions P(Y|x) in the simplex of all distributions of Y. We report
encouraging results on semi-empirical data.
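
The geometric idea can be illustrated with a minimal sketch, assuming Y is independent of X given a binary unobserved variable Z. Then P(Y|x) = sum_z P(z|x) P(Y|z), so every conditional lies on the line segment between P(Y|Z=0) and P(Y|Z=1) in the simplex. The function name `affine_dimension`, the tolerance, and the simulated distributions are illustrative assumptions, not the paper's procedure.

```python
import numpy as np

def affine_dimension(P_y_given_x, tol=1e-9):
    """Dimension of the affine hull of the rows P(Y|x) inside the simplex."""
    centered = P_y_given_x - P_y_given_x.mean(axis=0)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    return int((singular_values > tol).sum())

rng = np.random.default_rng(0)

# Path through a binary Z: each P(Y|x) is a mixture of P(Y|Z=0) and P(Y|Z=1).
P_y_given_z = rng.dirichlet(np.ones(4), size=2)            # shape (2, 4)
p_z1_given_x = rng.uniform(size=6)                         # P(Z=1|x) for 6 values of x
P_z_given_x = np.column_stack([1 - p_z1_given_x, p_z1_given_x])
P_mediated = P_z_given_x @ P_y_given_z                     # shape (6, 4)

# Direct link: unconstrained conditionals, one random distribution per x.
P_direct = rng.dirichlet(np.ones(4), size=6)

dim_mediated = affine_dimension(P_mediated)  # points lie on a line segment
dim_direct = affine_dimension(P_direct)      # generically higher-dimensional
print(dim_mediated, dim_direct)
```

With estimated rather than exact conditionals, the singular values would not vanish exactly, so a statistical criterion would be needed in place of the fixed tolerance.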
We Are Not Your Real Parents: Telling Causal from Confounded using MDL
Given data over variables (X_1, ..., X_m, Y), we consider the problem of finding out whether X jointly causes Y or whether they are all confounded by an unobserved latent variable Z. To do so, we take an information-theoretic approach based on Kolmogorov complexity. In a nutshell, we follow the postulate that first encoding the true cause, and then the effects given that cause, results in a shorter description than any other encoding of the observed variables. The ideal score is not computable, and hence we have to approximate it. We propose to do so using the Minimum Description Length (MDL) principle. We compare the MDL scores under the model where X causes Y and the model where there exists a latent variable Z confounding both X and Y, and show our scores are consistent. To find potential confounders we propose using latent factor modeling, in particular, probabilistic PCA (PPCA). Empirical evaluation on both synthetic and real-world data shows that our method, CoCa, performs very well, even when the true generating process of the data is far from the assumptions made by the models we use. Moreover, it is robust, as its accuracy goes hand in hand with its confidence.
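
One way to read the postulate is as a comparison of two two-part code lengths. The notation below is an assumed sketch, with X the candidate causes, Y the effect, Z a latent confounder, and L(.) a description length in bits; the exact encodings used by CoCa are not given in the abstract.

```latex
\begin{align}
  L_{\text{causal}}     &= L(X) + L(Y \mid X), \\
  L_{\text{confounded}} &= L(Z) + L(X, Y \mid Z).
\end{align}
% Decision rule (sketch): prefer the causal model when
% L_causal < L_confounded, and the confounded model otherwise.
% The gap |L_causal - L_confounded| can serve as a confidence score.
```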
Incomplete graphical model inference via latent tree aggregation
Graphical network inference is used in many fields such as genomics or
ecology to infer the conditional independence structure between variables, from
measurements of gene expression or species abundances for instance. In many
practical cases, not all variables involved in the network have been observed,
and the samples are actually drawn from a distribution where some variables
have been marginalized out. This challenges the sparsity assumption commonly
made in graphical model inference, since marginalization yields locally dense
structures, even when the original network is sparse. We present a procedure
for inferring Gaussian graphical models when some variables are unobserved,
that accounts both for the influence of missing variables and the low density
of the original network. Our model is based on the aggregation of spanning
trees, and the estimation procedure on the Expectation-Maximization algorithm.
We treat the graph structure and the unobserved nodes as missing variables and
compute posterior probabilities of edge appearance. To provide a complete
methodology, we also propose several model selection criteria to estimate the
number of missing nodes. A simulation study and an illustration on flow cytometry
data reveal that our method has favorable edge detection properties compared to
existing graph inference techniques. The methods are implemented in an R
package.
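
The claim that marginalization yields locally dense structures can be checked on a toy example. The sketch below assumes a hypothetical star-shaped Gaussian graphical model with one hub and four leaves; the edge weights are arbitrary. Marginalizing out the hub corresponds to taking a Schur complement of the precision matrix, which connects all leaves to each other.

```python
import numpy as np

n_leaves = 4
# Precision matrix of a star graph: node 0 is the hub, nodes 1..4 are leaves.
K = 2.0 * np.eye(n_leaves + 1)
K[0, 1:] = -0.6
K[1:, 0] = -0.6

observed = np.arange(1, n_leaves + 1)
hidden = np.array([0])

# Marginal precision of the observed nodes (Schur complement of the hub block).
K_oo = K[np.ix_(observed, observed)]
K_oh = K[np.ix_(observed, hidden)]
K_hh = K[np.ix_(hidden, hidden)]
K_marginal = K_oo - K_oh @ np.linalg.inv(K_hh) @ K_oh.T

def count_edges(precision, tol=1e-12):
    """Number of nonzero off-diagonal pairs, i.e. edges of the graph."""
    off_diagonal = precision[~np.eye(len(precision), dtype=bool)]
    return int((np.abs(off_diagonal) > tol).sum() // 2)

edges_before = count_edges(K_oo)        # leaves are not linked in the full model
edges_after = count_edges(K_marginal)   # every pair of leaves becomes linked
print(edges_before, edges_after)
```

The original graph has four edges on five nodes, while the marginal graph on the four observed nodes is complete, which is why a sparsity prior on the observed variables alone can be misleading.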
On specifying heterogeneity in knowledge production functions
Within the Geography of Innovation literature, the Knowledge Production Function (KPF) approach has become a reference framework for investigating the presence of localized knowledge spillovers, and spatial econometric tools have been applied to study interregional spillovers. A linear specification for the KPF is assumed, linking patents to R&D expenditure. This approach, however, suffers from several drawbacks. First, patent applications are count data in nature, and patents per inhabitant may produce an unrealistic picture of the spatial distribution of innovative activities. Second, spatial heterogeneity is not usually observed, producing both omitted-variable bias and spatial correlation in the error structure. Third, a positive R&D-patents linkage may arise as a spurious correlation if market size is not observed, causing R&D to be endogenous. To address these issues, this paper uses a regional cross-section model to study the spatial distribution of high-tech patents across 232 European regions in the period 2005/2006. Two main processes drive technological change in the model: research activities, and knowledge generated outside firms and subsequently embedded through either formal or informal acquisition. Among the different knowledge sources, we focus particularly on the role of firms working in Knowledge Intensive Business Services (KIBS) and on that of universities. In developing the empirical model we take into account that a) patents are count data; b) the exclusion of market size will cause biased and inconsistent parameter estimates; c) estimates of interregional spillovers may be biased by the omission of heterogeneity in the model specification. Empirical results indicate that, as expected, a count data distribution best fits the data, producing less spatially autocorrelated residuals. Regional innovative activity is explained by both investments in research and localization of KIBS, but only the former generates positive interregional externalities.
Scientific universities do not directly affect the production of new knowledge. However, different knowledge production processes characterize regions with and without scientific universities, with R&D driving innovation in the former and KIBS in the latter. Finally, most of what are assumed to be interregional spillovers prove, on closer inspection, to be effects of unaccounted spatial heterogeneity in regional innovation.
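
Points a) and b) can be illustrated on simulated data. The sketch below is not the paper's model: the variables `rd` and `market`, the coefficients, and the sample size are hypothetical, and the Poisson regression is fitted with a plain Newton-Raphson loop. It shows that omitting a regressor correlated with R&D inflates the estimated R&D coefficient in a count-data model.

```python
import numpy as np

def fit_poisson(X, y, n_iter=100, tol=1e-10):
    """Poisson regression with log link via Newton-Raphson.

    Assumes the first column of X is an intercept.
    """
    beta = np.zeros(X.shape[1])
    beta[0] = np.log(y.mean())               # start at the intercept-only fit
    for _ in range(n_iter):
        mu = np.exp(X @ beta)
        gradient = X.T @ (y - mu)
        hessian = X.T @ (X * mu[:, None])
        step = np.linalg.solve(hessian, gradient)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

rng = np.random.default_rng(0)
n = 5000
market = rng.normal(size=n)                  # hypothetical market size (standardized)
rd = 0.7 * market + rng.normal(size=n)       # R&D correlated with market size
patents = rng.poisson(np.exp(0.2 + 0.3 * rd + 0.5 * market))

intercept = np.ones(n)
beta_full = fit_poisson(np.column_stack([intercept, rd, market]), patents)
beta_short = fit_poisson(np.column_stack([intercept, rd]), patents)

print("R&D coefficient with market size:   ", beta_full[1])
print("R&D coefficient without market size:", beta_short[1])
```

In this simulation the true R&D coefficient is 0.3; the model that omits market size attributes part of the market effect to R&D.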
Chaos detection in economics. Metric versus topological tools
In their paper, Frank F., Gencay R., and Stengos T. (1988) analyzed quarterly macroeconomic data from 1960 to 1988 for West Germany, Italy, Japan and England. The goal was to check for the presence of deterministic chaos. To ensure that the analysed data were stationary, they took first differences and then fitted a linear model. Using a reasonable AR specification for each time series, they concluded that the series showed different structures. In particular, nonlinear structure was present in the time series for Japan. Nevertheless, the application of metric tools for detecting chaos (correlation dimension and Lyapunov exponent) did not show the presence of chaos in any time series. Starting from this conclusion, we applied a topological tool, Visual Recurrence Analysis, to these time series to compare the results. The purpose is to verify whether the analysis performed with a topological tool could give results different from those obtained using a metric tool.
Keywords: economics time series, chaos, topological tool
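
The generic construction behind a recurrence plot can be sketched as follows. This is not the implementation of the Visual Recurrence Analysis tool; the embedding dimension, lag, radius, and the sine test series are assumptions chosen for illustration. A series is delay-embedded, and two time points are marked as recurrent when their embedded states are closer than a chosen radius.

```python
import numpy as np

def delay_embed(series, dim, lag):
    """Stack lagged copies of the series into state vectors of length `dim`."""
    n_points = len(series) - (dim - 1) * lag
    return np.column_stack(
        [series[i * lag : i * lag + n_points] for i in range(dim)]
    )

def recurrence_matrix(series, dim=2, lag=1, radius=0.1):
    """Boolean matrix R[i, j] = True when embedded states i and j are close."""
    points = delay_embed(np.asarray(series, dtype=float), dim, lag)
    distances = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    return distances <= radius

# A periodic test signal with period 20: states recur every 20 steps.
t = np.arange(200)
signal = np.sin(2 * np.pi * t / 20)
R = recurrence_matrix(signal, dim=2, lag=1, radius=0.1)

recurrence_rate = R.mean()   # fraction of recurrent pairs
print(R.shape, recurrence_rate)
```

For a periodic signal the matrix shows diagonal lines spaced one period apart; for an economic series one would typically pass the first-differenced data, for example `np.diff(levels)`, where `levels` is a hypothetical array of raw observations.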
Non-Parametric Causality Detection: An Application to Social Media and Financial Data
According to behavioral finance, stock market returns are influenced by
emotional, social and psychological factors. Several recent works support this
theory by providing evidence of correlation between stock market prices and
collective sentiment indexes measured using social media data. However, a pure
correlation analysis is not sufficient to prove that stock market returns are
influenced by such emotional factors since both stock market prices and
collective sentiment may be driven by a third unmeasured factor. Controlling
for factors that could influence the study by applying multivariate regression
models is challenging given the complexity of stock market data. False
assumptions about the linearity or non-linearity of the model and inaccuracies
in model specification may result in misleading conclusions.
In this work, we propose a novel framework for causal inference that does not
require any assumption about the statistical relationships among the variables
of the study and can effectively control a large number of factors. We apply
our method in order to estimate the causal impact that information posted in
social media may have on stock market returns of four big companies. Our
results indicate that social media data not only correlate with stock market
returns but also influence them.
Comment: Physica A: Statistical Mechanics and its Applications 201
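
The concern about a third factor can be made concrete with a small simulation. The sketch below does not reproduce the framework proposed in the paper; it uses plain stratification on a hypothetical binary factor that, unlike in the abstract's scenario, is assumed to be observed. The variable names and effect sizes are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20000

# Hypothetical binary third factor that shifts both series; neither series
# has any effect on the other in this simulation.
factor = rng.integers(0, 2, size=n)
sentiment = factor + rng.normal(size=n)
returns = factor + rng.normal(size=n)

# Pooled correlation is clearly positive despite the absence of a causal link.
raw_corr = np.corrcoef(sentiment, returns)[0, 1]

# Within each stratum of the factor the correlation is close to zero.
within_corr = [
    np.corrcoef(sentiment[factor == k], returns[factor == k])[0, 1]
    for k in (0, 1)
]

print("pooled correlation:      ", raw_corr)
print("within-stratum estimates:", within_corr)
```

When the factor is unmeasured, this stratification is unavailable, which is why correlation alone cannot establish influence.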