A Bayesian Approach to Graphical Record Linkage and Deduplication
© 2016 American Statistical Association. We propose an unsupervised approach for linking records across arbitrarily many files, while simultaneously detecting duplicate records within files. Our key innovation involves the representation of the pattern of links between records as a bipartite graph, in which records are directly linked to latent true individuals, and only indirectly linked to other records. This flexible representation of the linkage structure naturally allows us to estimate the attributes of the unique observable people in the population, calculate transitive linkage probabilities across records (and represent this visually), and propagate the uncertainty of record linkage into later analyses. Our method makes it particularly easy to integrate record linkage with post-processing procedures such as logistic regression, capture–recapture, etc. Our linkage structure lends itself to an efficient, linear-time, hybrid Markov chain Monte Carlo algorithm, which overcomes many obstacles encountered by previous record linkage approaches, despite the high-dimensional parameter space. We illustrate our method using longitudinal data from the National Long Term Care Survey and with data from the Italian Survey on Household and Wealth, where we assess the accuracy of our method and show it to be better in terms of error rates and empirical scalability than other approaches in the literature. Supplementary materials for this article are available online.
SMERED: A Bayesian Approach to Graphical Record Linkage and De-duplication
We propose a novel unsupervised approach for linking records across arbitrarily many files, while simultaneously detecting duplicate records within files. Our key innovation is to represent the pattern of links between records as a bipartite graph, in which records are directly linked to latent true individuals, and only indirectly linked to other records. This flexible new representation of the linkage structure naturally allows us to estimate the attributes of the unique observable people in the population, calculate k-way posterior probabilities of matches across records, and propagate the uncertainty of record linkage into later analyses. Our linkage structure lends itself to an efficient, linear-time, hybrid Markov chain Monte Carlo algorithm, which overcomes many obstacles encountered by previously proposed methods of record linkage, despite the high-dimensional parameter space. We assess our results on real and simulated data.
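The bipartite representation described in the two abstracts above can be illustrated with a small sketch. It is not the authors' sampler: it assumes that posterior samples of the linkage structure are already available, each one mapping a record to the label of a latent individual. The record keys, the sample values, and the function names `linked_groups` and `match_probability` are hypothetical.

```python
def linked_groups(assignment):
    """Group records by the latent individual they point to.

    Records are never linked to each other directly; two records are
    linked only because they share the same latent individual.
    """
    groups = {}
    for record, individual in assignment.items():
        groups.setdefault(individual, set()).add(record)
    return groups


def match_probability(samples, records):
    """Fraction of samples in which all given records share one individual.

    With two records this is a pairwise match probability; with k records
    it is a k-way match probability.
    """
    hits = sum(1 for s in samples if len({s[r] for r in records}) == 1)
    return hits / len(samples)


# Hypothetical posterior samples: keys are (file, row) records,
# values are latent individual labels. Not real data.
samples = [
    {("A", 0): 1, ("A", 1): 1, ("B", 0): 1, ("B", 1): 2},
    {("A", 0): 1, ("A", 1): 3, ("B", 0): 1, ("B", 1): 2},
    {("A", 0): 1, ("A", 1): 1, ("B", 0): 1, ("B", 1): 2},
    {("A", 0): 4, ("A", 1): 4, ("B", 0): 1, ("B", 1): 2},
]

# Cross-file link, within-file duplicate, and a 3-way match.
p_cross = match_probability(samples, [("A", 0), ("B", 0)])
p_dup = match_probability(samples, [("A", 0), ("A", 1)])
p_three = match_probability(samples, [("A", 0), ("A", 1), ("B", 0)])
```

Because every probability is an average over samples of the whole linkage structure, the same samples could in principle be reused in a downstream analysis, which is one way to read the abstracts' remark about propagating linkage uncertainty.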
Structure in the nucleus of NGC 1068 at 10 microns
New 8 to 13 micron array camera images of the central kiloparsec of Seyfert 2 galaxy NGC 1068 resolve structure that is similar to that observed at visible and radio wavelengths. The images reveal an infrared source which is extended and asymmetric, with its long axis oriented at P.A. 33 deg. Maps of the spatial distribution of 8 to 13 micron color temperature and warm dust opacity are derived from the multiwavelength infrared images. The results suggest that there exist two pointlike luminosity sources in the central regions of NGC 1068, with the brighter source at the nucleus and the fainter one some 100 pc to the northeast. This geometry strengthens the possibility that the 10 micron emission observed from grains in the nucleus is powered by a nonthermal source. In the context of earlier visible and radio studies, these results considerably strengthen the case for jet-induced star formation in NGC 1068.
The 8.3 and 12.4 micron imaging of the Galactic Center source complex with the Goddard infrared array camera
A 30 x 30 arcsec field at the Galactic Center (1.5 x 1.5 parsec) was mapped at 8.3 microns and 12.41 microns with high spatial resolution and accurate relative astrometry, using the 16 x 16 Si:Bi accumulation mode charge injection device Goddard infrared array camera. The design and performance of the array camera detector electronics system and image data processing techniques are discussed. Color temperature and dust opacity distributions derived from the spatially accurate images indicate that the compact infrared sources and the large-scale ridge structure are bounded by warmer, more diffuse material. None of the objects appears to be heated appreciably by internal luminosity sources. These results are consistent with the model proposing that the complex is heated externally by a strong luminosity source at the Galactic Center, which dominates the energetics of the inner few parsecs of the galaxy.
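The abstracts do not say how the color temperature maps were computed. As a rough illustration of the idea, the sketch below assumes pure blackbody emission with no wavelength-dependent emissivity and inverts the flux-density ratio between two bands by bisection. The default wavelengths of 8.3 and 12.4 microns, the temperature bounds, and the function names are assumptions made for this example.

```python
import math

H = 6.62607015e-34   # Planck constant, J s
C = 2.99792458e8     # speed of light, m/s
K_B = 1.380649e-23   # Boltzmann constant, J/K


def planck_ratio(temp_k, lam1_m, lam2_m):
    """Blackbody flux-density ratio B_nu(lam1) / B_nu(lam2) at temp_k."""
    x1 = H * C / (lam1_m * K_B * temp_k)
    x2 = H * C / (lam2_m * K_B * temp_k)
    # (nu1 / nu2)^3 equals (lam2 / lam1)^3.
    return (lam2_m / lam1_m) ** 3 * math.expm1(x2) / math.expm1(x1)


def color_temperature(ratio, lam1_m=8.3e-6, lam2_m=12.4e-6,
                      lo=30.0, hi=5000.0):
    """Temperature whose blackbody ratio matches the observed ratio.

    Assumes lam1 < lam2, so the ratio increases with temperature and
    bisection on [lo, hi] is valid.
    """
    r_lo = planck_ratio(lo, lam1_m, lam2_m)
    r_hi = planck_ratio(hi, lam1_m, lam2_m)
    if not r_lo <= ratio <= r_hi:
        raise ValueError("ratio outside the range covered by [lo, hi]")
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if planck_ratio(mid, lam1_m, lam2_m) < ratio:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Round trip: the ratio of a 300 K blackbody should invert to about 300 K.
t_recovered = color_temperature(planck_ratio(300.0, 8.3e-6, 12.4e-6))
```

Applied pixel by pixel to two registered images, a function like this would yield a color temperature map; real dust emission would also need an emissivity law, which this sketch leaves out.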
TRAIL-induced variation of cell signaling states provides nonheritable resistance to apoptosis.
TNFα-related apoptosis-inducing ligand (TRAIL) specifically initiates programmed cell death, but often fails to eradicate all cells, making it an ineffective therapy for cancer. This fractional killing is linked to cellular variation that bulk assays cannot capture. Here, we quantify the diversity in cellular signaling responses to TRAIL, linking it to apoptotic frequency across numerous cell systems with single-cell mass cytometry (CyTOF). Although all cells respond to TRAIL, a variable fraction persists without apoptotic progression. This cell-specific behavior is nonheritable: both the TRAIL-induced signaling responses and the frequency of apoptotic resistance remain unaffected by prior exposure. The diversity of signaling states upon exposure is correlated to TRAIL resistance. Concomitantly, constricting the variation in signaling response with kinase inhibitors proportionally decreases TRAIL resistance. Simultaneously, TRAIL-induced de novo translation in resistant cells, when blocked by cycloheximide, abrogated all TRAIL resistance. This work highlights how cell signaling diversity, and subsequent translation response, relates to nonheritable fractional escape from TRAIL-induced apoptosis. This refined view of TRAIL resistance provides new avenues to study death ligands in general.
Differentially Private Model Selection with Penalized and Constrained Likelihood
In statistical disclosure control, the goal of data analysis is twofold: The
released information must provide accurate and useful statistics about the
underlying population of interest, while minimizing the potential for an
individual record to be identified. In recent years, the notion of differential
privacy has received much attention in theoretical computer science, machine
learning, and statistics. It provides a rigorous and strong notion of
protection for individuals' sensitive information. A fundamental question is
how to incorporate differential privacy into traditional statistical inference
procedures. In this paper we study model selection in multivariate linear
regression under the constraint of differential privacy. We show that model
selection procedures based on penalized least squares or likelihood can be made
differentially private by a combination of regularization and randomization,
and propose two algorithms to do so. We show that our private procedures are
consistent under essentially the same conditions as the corresponding
non-private procedures. We also find that under differential privacy, the
procedure becomes more sensitive to the tuning parameters. We illustrate and
evaluate our method using simulation studies and two real data examples.
Sharing Social Network Data: Differentially Private Estimation of Exponential-Family Random Graph Models
Motivated by a real-life problem of sharing social network data that contain
sensitive personal information, we propose a novel approach to release and
analyze synthetic graphs in order to protect privacy of individual
relationships captured by the social network while maintaining the validity of
statistical results. A case study using a version of the Enron e-mail corpus
dataset demonstrates the application and usefulness of the proposed techniques
in solving the challenging problem of maintaining privacy and supporting
open access to network data to ensure reproducibility of existing studies and
discovering new scientific insights that can be obtained by analyzing such
data. We use a simple yet effective randomized response mechanism to generate
synthetic networks under ε-edge differential privacy, and then use
likelihood based inference for missing data and Markov chain Monte Carlo
techniques to fit exponential-family random graph models to the generated
synthetic networks. Comment: Updated, 39 pages
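The randomized response step mentioned in this abstract can be sketched as follows. This shows one common form of the mechanism, not necessarily the exact variant used in the paper: each possible edge of an undirected graph is flipped independently with probability 1 / (1 + e^ε), so the ratio between the probabilities of keeping and flipping is e^ε. The function names and the toy graph are hypothetical.

```python
import math
import random
from itertools import combinations


def flip_probability(epsilon):
    """Per-dyad flip probability for symmetric randomized response."""
    return 1.0 / (1.0 + math.exp(epsilon))


def randomized_response_graph(n_nodes, edges, epsilon, rng):
    """Release a synthetic undirected graph by flipping each dyad independently.

    n_nodes : number of nodes, labelled 0 .. n_nodes - 1
    edges   : iterable of (i, j) pairs in the original graph
    rng     : random.Random instance, passed in for reproducibility
    """
    p = flip_probability(epsilon)
    true_edges = {frozenset(e) for e in edges}
    released = set()
    for i, j in combinations(range(n_nodes), 2):
        present = frozenset((i, j)) in true_edges
        if rng.random() < p:
            present = not present  # flip edge to non-edge or vice versa
        if present:
            released.add((i, j))
    return released


# Toy example (hypothetical graph): a smaller epsilon gives more flips.
synthetic = randomized_response_graph(
    5, [(0, 1), (1, 2), (3, 4)], epsilon=1.0, rng=random.Random(42)
)
```

The released graph is noisy by design, which is why the abstract pairs the mechanism with likelihood-based inference for missing data when fitting models to the synthetic networks.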
Bayesian Exponential Random Graph Models with Nodal Random Effects
We extend the well-known and widely used Exponential Random Graph Model
(ERGM) by including nodal random effects to compensate for heterogeneity in the
nodes of a network. The Bayesian framework for ERGMs proposed by Caimo and
Friel (2011) yields the basis of our modelling algorithm. A central question in
network models is the question of model selection and following the Bayesian
paradigm we focus on estimating Bayes factors. To do so we develop an
approximate but feasible calculation of the Bayes factor which allows one to
pursue model selection. Two data examples and a small simulation study
illustrate our mixed model approach and the corresponding model selection. Comment: 23 pages, 9 figures, 3 tables
R.A.Fisher, design theory, and the Indian connection
Design Theory, a branch of mathematics, was born out of the experimental
statistics research of the population geneticist R. A. Fisher and of Indian
mathematical statisticians in the 1930s. The field combines elements of
combinatorics, finite projective geometries, Latin squares, and a variety of
further mathematical structures, brought together in surprising ways. This
essay will present these structures and ideas as well as how the field came
together, in itself an interesting story. Comment: 11 pages, 3 figures