67 research outputs found
Efficient Inference of Gaussian Process Modulated Renewal Processes with Application to Medical Event Data
The episodic, irregular and asynchronous nature of medical data renders them
difficult substrates for standard machine learning algorithms. We would like to
abstract away this difficulty for the class of time-stamped categorical
variables (or events) by modeling them as a renewal process and inferring a
probability density over continuous, longitudinal, nonparametric intensity
functions modulating that process. Several methods exist for inferring such a
density over intensity functions, but either their constraints and assumptions
prevent their use with our potentially bursty event streams, or their time
complexity renders their use intractable on our long-duration observations of
high-resolution events, or both. In this paper we present a new and efficient
method for inferring a distribution over intensity functions that uses direct
numeric integration and smooth interpolation over Gaussian processes. We
demonstrate that our direct method is up to twice as accurate and two orders of
magnitude more efficient than the best existing method (thinning). Importantly,
the direct method can infer intensity functions over the full range of bursty
to memoryless to regular events, which thinning and many other methods cannot.
Finally, we apply the method to clinical event data and demonstrate the
face-validity of the abstraction, which is now amenable to standard learning
algorithms.
Comment: 8 pages, 4 figures
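The range from bursty to memoryless to regular events mentioned in this abstract can be illustrated with a small simulation. The sketch below is not the paper's inference method: it only generates a renewal process with a constant intensity, and it assumes gamma-distributed inter-event intervals, which the abstract does not specify. The function names `gamma_renewal_events` and `interval_cv` are hypothetical.

```python
import random


def gamma_renewal_events(shape, rate, t_end, seed=0):
    """Simulate event times on [0, t_end) for a renewal process.

    Assumption: inter-event intervals are gamma distributed with mean 1/rate.
    shape < 1 gives bursty events, shape == 1 gives a memoryless (Poisson)
    process, and shape > 1 gives increasingly regular events.
    """
    rng = random.Random(seed)
    scale = 1.0 / (shape * rate)  # gamma mean = shape * scale = 1 / rate
    t, events = 0.0, []
    while True:
        t += rng.gammavariate(shape, scale)
        if t >= t_end:
            return events
        events.append(t)


def interval_cv(events):
    """Coefficient of variation of the inter-event intervals."""
    gaps = [b - a for a, b in zip(events, events[1:])]
    mean = sum(gaps) / len(gaps)
    var = sum((g - mean) ** 2 for g in gaps) / len(gaps)
    return var ** 0.5 / mean
```

For gamma intervals the coefficient of variation is 1/sqrt(shape), so comparing `interval_cv` across shapes such as 0.25, 1 and 16 gives a rough sense of how far a stream is from memoryless. Modulating the process with a time-varying intensity function, as the paper does, would require an additional time-rescaling step that is not shown here.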
Spectral anonymization of data
Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2007. This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections. Includes bibliographical references (p. 87-96).
Data anonymization is the process of conditioning a dataset such that no sensitive information can be learned about any specific individual, but valid scientific analysis can nevertheless be performed on it. It is not sufficient to simply remove identifying information because the remaining data may be enough to infer the individual source of the record (a reidentification disclosure) or to otherwise learn sensitive information about a person (a predictive disclosure). The only known way to prevent these disclosures is to remove additional information from the dataset. Dozens of anonymization methods have been proposed over the past few decades; most work by perturbing or suppressing variable values. None have been successful at simultaneously providing perfect privacy protection and allowing perfectly accurate scientific analysis. This dissertation makes the new observation that the anonymizing operations do not need to be made in the original basis of the dataset. Operating in a different, judiciously chosen basis can improve privacy protection, analytic utility, and computational efficiency. I use the term 'spectral anonymization' to refer to anonymizing in a spectral basis, such as the basis provided by the data's eigenvectors. Additionally, I propose new measures of reidentification and prediction risk that are more generally applicable and more informative than existing measures. I also propose a measure of analytic utility that assesses the preservation of the multivariate probability distribution. Finally, I propose the demanding reference standard of nonparticipation in the study to define adequate privacy protection.
I give three examples of spectral anonymization in practice. The first example improves basic cell swapping from a weak algorithm to one competitive with state-of-the-art methods merely by a change of basis. The second example demonstrates avoiding the curse of dimensionality in microaggregation. The third describes a powerful algorithm that reduces computational disclosure risk to the same level as that of nonparticipants and preserves at least 4th order interactions in the multivariate distribution. No previously reported algorithm has achieved this combination of results.
by Thomas Anton Lasko. Ph.D.
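The change of basis behind the first example can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the thesis's algorithm: it assumes numeric data, uses the singular value decomposition as the spectral basis, and swaps cells by independently permuting each column of the left singular vectors. The function name `spectral_swap` is hypothetical.

```python
import numpy as np


def spectral_swap(X, seed=0):
    """Cell swapping performed in an SVD basis instead of the original basis.

    Assumption: X is an (n_records, n_variables) numeric array. Each column
    of U is shuffled independently, then the data are mapped back.
    """
    rng = np.random.default_rng(seed)
    mean = X.mean(axis=0)
    # Thin SVD of the centered data: X - mean = U @ diag(s) @ Vt
    U, s, Vt = np.linalg.svd(X - mean, full_matrices=False)
    # Swap cells within each spectral coordinate (column of U) independently
    U_swapped = np.column_stack(
        [rng.permutation(U[:, j]) for j in range(U.shape[1])]
    )
    return U_swapped @ np.diag(s) @ Vt + mean
```

Because each column of `U` keeps the same set of values, the column means and the total variance of the data are preserved exactly, while covariances between the original variables are preserved only approximately. Swapping cells directly in the original basis would instead destroy the correlations between variables.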
Identifying Patient-Specific Root Causes with the Heteroscedastic Noise Model
Complex diseases are caused by a multitude of factors that may differ between
patients even within the same diagnostic category. A few underlying root causes
may nevertheless initiate the development of disease within each patient. We
therefore focus on identifying patient-specific root causes of disease, which
we equate to the sample-specific predictivity of the exogenous error terms in a
structural equation model. We generalize from the linear setting to the
heteroscedastic noise model where $Y = m(X) + \varepsilon\sigma(X)$ with
non-linear functions $m(X)$ and $\sigma(X)$ representing the conditional mean
and mean absolute deviation, respectively. This model preserves identifiability
but introduces non-trivial challenges that require a customized algorithm
called Generalized Root Causal Inference (GRCI) to extract the error terms
correctly. GRCI recovers patient-specific root causes more accurately than
existing alternatives.
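To make the role of the error term concrete, the sketch below simulates a heteroscedastic noise model of the usual form Y = m(X) + εσ(X) and recovers ε by standardizing the residual. It is not GRCI: the functions `m` and `sigma` here are hypothetical and treated as known, whereas the paper's algorithm has to estimate the conditional mean and scale from data.

```python
import math
import random


def m(x):
    """Hypothetical non-linear conditional mean."""
    return math.sin(x)


def sigma(x):
    """Hypothetical non-linear conditional scale, strictly positive."""
    return 0.5 + x * x


def simulate(n, seed=0):
    """Draw (x, y, eps) triples from Y = m(X) + eps * sigma(X)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.uniform(-2.0, 2.0)
        # Uniform(-2, 2) has mean 0 and E|eps| = 1, so sigma(x) is the
        # conditional mean absolute deviation of Y given X = x.
        eps = rng.uniform(-2.0, 2.0)
        rows.append((x, m(x) + eps * sigma(x), eps))
    return rows


def extract_error(x, y):
    """Recover the error term by subtracting the mean and dividing by the scale."""
    return (y - m(x)) / sigma(x)
```

In the linear, homoscedastic setting the division by `sigma(x)` is unnecessary. Under heteroscedasticity, a plain residual `y - m(x)` would still depend on `x`, which is one way to see why the extraction step needs more care.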
Sample-Specific Root Causal Inference with Latent Variables
Root causal analysis seeks to identify the set of initial perturbations that
induce an unwanted outcome. In prior work, we defined sample-specific root
causes of disease using exogenous error terms that predict a diagnosis in a
structural equation model. We rigorously quantified predictivity using Shapley
values. However, the associated algorithms for inferring root causes assume no
latent confounding. We relax this assumption by permitting confounding among
the predictors. We then introduce a corresponding procedure called Extract
Errors with Latents (EEL) for recovering the error terms up to contamination by
vertices on certain paths under the linear non-Gaussian acyclic model. EEL also
identifies the smallest sets of dependent errors for fast computation of the
Shapley values. The algorithm bypasses the hard problem of estimating the
underlying causal graph in both cases. Experiments highlight the superior
accuracy and robustness of EEL relative to its predecessors.
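The cost that motivates finding small sets of dependent errors can be seen in a direct Shapley computation. The sketch below is a generic exact Shapley calculation, not EEL: the player names `e1`, `e2`, `e3` and the value function `example_value` are hypothetical stand-ins for error terms and a predictivity score.

```python
from itertools import permutations


def shapley_values(players, value):
    """Exact Shapley values by averaging marginal contributions over all orderings.

    `value` maps a frozenset of players to a number. The loop visits every
    ordering, so the cost grows factorially with the number of players.
    """
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: t / len(orderings) for p, t in totals.items()}


def example_value(coalition):
    """Hypothetical score: two additive effects plus an e1-e3 interaction."""
    return (
        2.0 * ("e1" in coalition)
        + 1.0 * ("e2" in coalition)
        + 3.0 * ("e1" in coalition and "e3" in coalition)
    )


phi = shapley_values(["e1", "e2", "e3"], example_value)
print(phi)  # {'e1': 3.5, 'e2': 1.0, 'e3': 1.5}
```

Here `e2` is independent of the others and receives exactly its additive effect, while the interaction is split evenly between `e1` and `e3`. Restricting the enumeration to small groups of mutually dependent terms is what keeps this kind of computation tractable.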
- …