Pragmatic Causal Inference
Data-driven causal inference from real-world multivariate systems can be biased for a number of reasons. These include unmeasured confounding, systematic censoring of observations, data dependence induced by a network of unit interactions, and misspecification of parametric models. This dissertation proposes statistical methods spanning three major steps of the causal inference workflow: discovery of a suitable causal model, which in our case can be visualized via one of several classes of causal graphical models; identification of target causal parameters as functions of the observed data distribution; and estimation of these parameters from finite samples. The overarching goal of these methods is to augment the data scientist's toolkit to tackle the aforementioned challenges in real-world systems in theoretically sound yet practical ways. We provide a continuous optimization procedure for causal discovery in the presence of latent confounders, and a computationally efficient discrete search procedure for discovery and downstream estimation of causal effects in causal graphs encoding interactions between units in a network. For identification, we provide an algorithm that generalizes the state-of-the-art for recovery of target parameters in missing not at random distributions that can be represented graphically via directed acyclic graphs. Finally, for estimation, we provide results on the tangent space of causal graphical models with latent variables, which may be used to improve the efficiency of semiparametric estimators for any target parameter of interest. We also provide novel estimators, including influence-function-based estimators, for the average causal effect of a point exposure on an outcome when there are latent variables in the system.
COLOR IMAGE QUANTIZATION USING GDBSCAN
Color image quantization is one of the most widely used techniques in the field of image compression. DBSCAN is a density-based data clustering technique. Although DBSCAN is widely used for data clustering, it is not popular for color image quantization because of several issues. In particular, it becomes expensive when run on the whole image data, and noise points are left unmapped. In this paper we propose a new color image quantization scheme that overcomes these problems. Our proposed algorithm is GDBSCAN (Grid-Based DBSCAN), in which we first decompose the image data into grids and then apply the DBSCAN algorithm to each grid.
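The grid-then-cluster idea can be sketched in a few lines. This is an illustrative sketch, not the paper's implementation: it assumes "grids" means square spatial tiles, uses a plain brute-force DBSCAN, replaces each pixel with its cluster's mean color, and maps noise pixels to the nearest cluster mean in the same tile. The names `dbscan`, `quantize`, `grid`, `eps`, and `min_pts` are hypothetical.

```python
from collections import deque

def dist2(a, b):
    """Squared Euclidean distance between two RGB tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def dbscan(points, eps, min_pts):
    """Brute-force DBSCAN; returns one label per point, -1 for noise."""
    eps2 = eps * eps
    labels = [None] * len(points)

    def neighbors(i):
        return [j for j, q in enumerate(points) if dist2(points[i], q) <= eps2]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisionally noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = deque(seeds)
        while queue:
            j = queue.popleft()
            if labels[j] == -1:
                labels[j] = cluster  # former noise point joins as border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:  # j is a core point: keep expanding
                queue.extend(nbrs)
    return labels

def quantize(image, grid, eps, min_pts):
    """Quantize an image (list of rows of RGB tuples) one tile at a time."""
    h, w = len(image), len(image[0])
    out = [row[:] for row in image]
    for top in range(0, h, grid):
        for left in range(0, w, grid):
            cells = [(r, c)
                     for r in range(top, min(top + grid, h))
                     for c in range(left, min(left + grid, w))]
            colors = [image[r][c] for r, c in cells]
            labels = dbscan(colors, eps, min_pts)
            groups = {}
            for lab, col in zip(labels, colors):
                if lab != -1:
                    groups.setdefault(lab, []).append(col)
            # Mean color of each cluster, rounded to integer channels.
            centers = {lab: tuple(round(sum(ch) / len(cols)) for ch in zip(*cols))
                       for lab, cols in groups.items()}
            for (r, c), lab, col in zip(cells, labels, colors):
                if lab in centers:
                    out[r][c] = centers[lab]
                elif centers:
                    # Assumption: noise goes to the nearest cluster mean.
                    out[r][c] = min(centers.values(), key=lambda m: dist2(m, col))
                # If the tile has no clusters, the pixel is left unchanged.
    return out

image = [
    [(10, 10, 10), (12, 10, 10), (50, 60, 70), (50, 60, 70)],
    [(10, 12, 10), (200, 200, 200), (50, 60, 70), (50, 60, 70)],
]
result = quantize(image, grid=2, eps=5, min_pts=3)
```

Because each tile is clustered independently, the cost of the pairwise neighbor search is bounded by the tile size rather than the image size. The trade-off in this sketch is that each tile gets its own palette, so a global palette would need an additional merging step.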
Causal Inference With Outcome-Dependent Missingness And Self-Censoring
We consider missingness in the context of causal inference when the outcome
of interest may be missing. If the outcome directly affects its own missingness
status, i.e., it is "self-censoring", this may lead to severely biased causal
effect estimates. Miao et al. [2015] proposed the shadow variable method to
correct for bias due to self-censoring; however, verifying the required model
assumptions can be difficult. Here, we propose a test based on a randomized
incentive variable offered to encourage reporting of the outcome that can be
used to verify identification assumptions that are sufficient to correct for
both self-censoring and confounding bias. Concretely, the test confirms whether
a given set of pre-treatment covariates is sufficient to block all backdoor
paths between the treatment and outcome as well as all paths between the
treatment and missingness indicator after conditioning on the outcome. We show
that under these conditions, the causal effect is identified by using the
treatment as a shadow variable, and it leads to an intuitive inverse
probability weighting estimator that uses a product of the treatment and
response weights. We evaluate the efficacy of our test and downstream estimator
via simulations.
Comment: 15 pages. In Proceedings of the 39th Conference on Uncertainty in Artificial Intelligence.
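The "product of the treatment and response weights" can be written out explicitly. The following is a sketch under assumed notation, not the paper's stated formula: A is the treatment, Y the outcome, R = 1 when Y is observed, and X the pre-treatment covariates. Positivity of both weights is also assumed.

```latex
% Assumptions (labelled here, not taken verbatim from the source):
%   Y(a) \perp A \mid X     -- X blocks all backdoor paths between A and Y
%   R \perp A \mid Y, X     -- A can serve as a shadow variable for R
\mathbb{E}[Y(a)]
  = \mathbb{E}\!\left[
      \frac{\mathbb{1}(A = a)}{p(A = a \mid X)}
      \cdot
      \frac{R}{p(R = 1 \mid Y, X)}
      \cdot Y
    \right]
```

Taking the expectation of R given (A, Y, X) cancels the response weight, which leaves the usual inverse probability of treatment weighting expression. The shadow variable is what makes p(R = 1 | Y, X) identifiable, since Y is unobserved whenever R = 0.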
RCT Rejection Sampling for Causal Estimation Evaluation
Confounding is a significant obstacle to unbiased estimation of causal
effects from observational data. For settings with high-dimensional covariates
-- such as text data, genomics, or the behavioral social sciences --
researchers have proposed methods to adjust for confounding by adapting machine
learning methods to the goal of causal estimation. However, empirical
evaluation of these adjustment methods has been challenging and limited. In
this work, we build on a promising empirical evaluation strategy that
simplifies evaluation design and uses real data: subsampling randomized
controlled trials (RCTs) to create confounded observational datasets while
using the average causal effects from the RCTs as ground-truth. We contribute a
new sampling algorithm, which we call RCT rejection sampling, and provide
theoretical guarantees that causal identification holds in the observational
data to allow for valid comparisons to the ground-truth RCT. Using synthetic
data, we show our algorithm indeed results in low bias when oracle estimators
are evaluated on the confounded samples, which is not always the case for a
previously proposed algorithm. In addition to this identification result, we
highlight several finite data considerations for evaluation designers who plan
to use RCT rejection sampling on their own datasets. As a proof of concept, we
implement an example evaluation pipeline and walk through these finite data
considerations with a novel, real-world RCT -- which we release publicly --
consisting of approximately 70k observations and text data as high-dimensional
covariates. Together, these contributions build towards a broader agenda of
improved empirical evaluation for causal estimation.
Comment: Code and data at https://github.com/kakeith/rct_rejection_samplin
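The subsampling idea can be illustrated on synthetic data. The abstract does not state the acceptance rule, so this sketch assumes one of the form P*(T|C) / (P(T) · M), where P* is a researcher-chosen confounded treatment distribution and M bounds the ratio P*(T|C) / P(T). The function names and the data-generating process are hypothetical.

```python
import random

def rct_rejection_sample(units, p_star, p_treat, rng):
    """Subsample RCT units (c, t, y) so that T depends on C in the kept data.

    p_star(t, c) is the target P*(T=t | C=c); p_treat(t) is P(T=t) in the RCT.
    Assumed acceptance probability: p_star(t, c) / (p_treat(t) * m).
    """
    # m bounds the ratio so every acceptance probability lies in [0, 1].
    m = max(p_star(t, c) / p_treat(t) for c, t, _ in units)
    return [(c, t, y) for c, t, y in units
            if rng.random() < p_star(t, c) / (p_treat(t) * m)]

def mean(xs):
    return sum(xs) / len(xs)

def naive_diff(sample):
    """Unadjusted difference in mean outcome between treated and control."""
    return (mean([y for _, t, y in sample if t == 1])
            - mean([y for _, t, y in sample if t == 0]))

def backdoor_adjusted(sample):
    """Oracle estimator: stratify on C, then weight strata by their share."""
    est = 0.0
    for c in {c for c, _, _ in sample}:
        stratum = [u for u in sample if u[0] == c]
        est += len(stratum) / len(sample) * naive_diff(stratum)
    return est

# Synthetic RCT: T is randomized with P(T=1) = 0.5; the true effect is 1.0.
rng = random.Random(0)
rct = []
for _ in range(20000):
    c = int(rng.random() < 0.5)
    t = int(rng.random() < 0.5)
    y = 1.0 * t + 2.0 * c + rng.gauss(0.0, 1.0)
    rct.append((c, t, y))

def p_star(t, c):
    """Target confounded policy: units with C=1 are mostly treated."""
    p1 = 0.8 if c == 1 else 0.2
    return p1 if t == 1 else 1.0 - p1

obs = rct_rejection_sample(rct, p_star, lambda t: 0.5, rng)
ground_truth = naive_diff(rct)        # valid because T was randomized
biased = naive_diff(obs)              # confounded by C after subsampling
adjusted = backdoor_adjusted(obs)     # should be close to ground_truth
```

In this setup the unadjusted estimate on the subsample is pulled away from the RCT effect, while adjusting for C recovers it. That is the property an evaluation designer would want from the confounded sample.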
On Feeding Business Systems with Linked Resources from the Web of Data
Business systems that are fed with data from the Web of Data require transparent interoperability. The Linked Data principles establish that different resources that represent the same real-world entities must be linked for this purpose. Link rules are paramount to transparent interoperability since they produce the links between resources. State-of-the-art link rules are learnt by genetic programming and build on comparing the values of the attributes of the resources. Unfortunately, this approach falls short in cases in which resources have similar values for their attributes but represent different real-world entities. In this paper, we present a proposal that combines a genetic programming approach that learns link rules with an ad-hoc filtering technique that boosts them by deciding whether the links that they produce must be selected or not. Our analysis of the literature reveals that our approach is novel, and our experimental analysis confirms that it helps improve the F1 score by increasing precision without a significant penalty on recall.
Funding: Ministerio de Economía y Competitividad TIN2013-40848-R; Ministerio de Economía y Competitividad TIN2016-75394-
Impact of age-related macular degeneration on diabetic retinopathy: An electronic health record based big data analysis from a tertiary eye centre in South India.
PURPOSE: To determine whether the presence of age-related macular degeneration (AMD) decreases the risk of diabetic retinopathy. METHODS: This was a retrospective, case-cohort study performed in patients with a systemic diagnosis of diabetes at a tertiary health care center from May 2011 to April 2020. A total of 43,153 patients (1,024 AMD patients and 42,129 non-AMD patients) were included in the analysis. A total of 1,024 age and diabetes mellitus (DM) duration-matched controls were chosen from the non-AMD group for risk factor analysis. The severity of diabetic retinopathy was compared between the patients with AMD and the patients without AMD. RESULTS: Out of the enrolled 43,153 diabetic patients, 26,906 were males and 16,247 were females. A total of 1,024 patients had AMD and 42,129 had no AMD. The mean age of the cohort was 58.60 ± 0.09 years. The overall prevalence of DR was noted to be 22.8% (9,825 out of 43,153 eyes). A significantly lower prevalence of diabetic retinopathy (DR) (23% in non-AMD, 11.4% in AMD, OR = 0.43, P < 0.001), non-proliferative diabetic retinopathy (NPDR) (12% in non-AMD, 8.2% in AMD, OR = 0.66, P < 0.001), and proliferative diabetic retinopathy (PDR) (11% in non-AMD, 3.2% in AMD, OR = 0.27, P < 0.001) was seen in the AMD patients. No significant difference was seen between dry and wet AMD. On multivariate logistic regression analysis, lower age, absence of AMD, and male gender were associated with a higher risk of PDR. CONCLUSION: The presence of AMD was associated with a statistically significant reduction in the risk of DR. Our results may be useful in the field of resource allocation and awareness of DR.
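An odds ratio is positive by definition, and the magnitudes quoted in the results can be checked against the stated prevalences. This is a minimal sketch, assuming the percentages are group-level prevalences and that each odds ratio compares AMD patients with non-AMD patients. The helper `odds_ratio` is illustrative and not part of the study.

```python
def odds_ratio(p_exposed: float, p_unexposed: float) -> float:
    """Odds ratio from two prevalences given as proportions in (0, 1)."""
    odds_exposed = p_exposed / (1.0 - p_exposed)
    odds_unexposed = p_unexposed / (1.0 - p_unexposed)
    return odds_exposed / odds_unexposed

# Prevalences quoted in the abstract: (AMD group, non-AMD group).
reported = {
    "DR": (0.114, 0.23),
    "NPDR": (0.082, 0.12),
    "PDR": (0.032, 0.11),
}

ratios = {name: round(odds_ratio(amd, non_amd), 2)
          for name, (amd, non_amd) in reported.items()}
print(ratios)  # {'DR': 0.43, 'NPDR': 0.66, 'PDR': 0.27}
```

Values below 1 indicate lower odds of each outcome in the AMD group, which matches the direction of the reported prevalences.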