1,115 research outputs found

    Spectral anonymization of data

    Get PDF
    Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2007.This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections.Includes bibliographical references (p. 87-96).Data anonymization is the process of conditioning a dataset such that no sensitive information can be learned about any specific individual, but valid scientific analysis can nevertheless be performed on it. It is not sufficient to simply remove identifying information because the remaining data may be enough to infer the individual source of the record (a reidentification disclosure) or to otherwise learn sensitive information about a person (a predictive disclosure). The only known way to prevent these disclosures is to remove additional information from the dataset. Dozens of anonymization methods have been proposed over the past few decades; most work by perturbing or suppressing variable values. None have been successful at simultaneously providing perfect privacy protection and allowing perfectly accurate scientific analysis. This dissertation makes the new observation that the anonymizing operations do not need to be made in the original basis of the dataset. Operating in a different, judiciously chosen basis can improve privacy protection, analytic utility, and computational efficiency. I use the term 'spectral anonymization' to refer to anonymizing in a spectral basis, such as the basis provided by the data's eigenvectors. Additionally, I propose new measures of reidentification and prediction risk that are more generally applicable and more informative than existing measures. I also propose a measure of analytic utility that assesses the preservation of the multivariate probability distribution. Finally, I propose the demanding reference standard of nonparticipation in the study to define adequate privacy protection. I give three examples of spectral anonymization in practice. The first example improves basic cell swapping from a weak algorithm to one competitive with state of-the-art methods merely by a change of basis.(cont) The second example demonstrates avoiding the curse of dimensionality in microaggregation. The third describes a powerful algorithm that reduces computational disclosure risk to the same level as that of nonparticipants and preserves at least 4th order interactions in the multivariate distribution. No previously reported algorithm has achieved this combination of results.by Thomas Anton Lasko.Ph.D

    Statistical methods for NHS incident reporting data

    Get PDF
    The National Reporting and Learning System (NRLS) is the English and Welsh NHS’ national repository of incident reports from healthcare. It aims to capture details of incident reports, at national level, and facilitate clinical review and learning to improve patient safety. These incident reports range from minor ‘near-misses’ to critical incidents that may lead to severe harm or death. NRLS data are currently reported as crude counts and proportions, but their major use is clinical review of the free-text descriptions of incidents. There are few well-developed quantitative analysis approaches for NRLS, and this thesis investigates these methods. A literature review revealed a wealth of clinical detail, but also systematic constraints of NRLS’ structure, including non-mandatory reporting, missing data and misclassification. Summary statistics for reports from 2010/11 – 2016/17 supported this and suggest NRLS was not suitable for statistical modelling in isolation. Modelling methods were advanced by creating a hybrid dataset using other sources of hospital casemix data from Hospital Episode Statistics (HES). A theoretical model was established, based on ‘exposure’ variables (using casemix proxies), and ‘culture’ as a random-effect. The initial modelling approach examined Poisson regression, mixture and multilevel models. Overdispersion was significant, generated mainly by clustering and aggregation in the hybrid dataset, but models were chosen to reflect these structures. Further modelling approaches were examined, using Generalized Additive Models to smooth predictor variables, regression tree-based models including Random Forests, and Artificial Neural Networks. Models were also extended to examine a subset of death and severe harm incidents, exploring how sparse counts affect models. Text mining techniques were examined for analysis of incident descriptions and showed how term frequency might be used. Terms were used to generate latent topics models used, in-turn, to predict the harm level of incidents. Model outputs were used to create a ‘Standardised Incident Reporting Ratio’ (SIRR) and cast this in the mould of current regulatory frameworks, using process control techniques such as funnel plots and cusum charts. A prototype online reporting tool was developed to allow NHS organisations to examine their SIRRs, provide supporting analyses, and link data points back to individual incident reports

    Untangling hotel industry’s inefficiency: An SFA approach applied to a renowned Portuguese hotel chain

    Get PDF
    The present paper explores the technical efficiency of four hotels from Teixeira Duarte Group - a renowned Portuguese hotel chain. An efficiency ranking is established from these four hotel units located in Portugal using Stochastic Frontier Analysis. This methodology allows to discriminate between measurement error and systematic inefficiencies in the estimation process enabling to investigate the main inefficiency causes. Several suggestions concerning efficiency improvement are undertaken for each hotel studied.info:eu-repo/semantics/publishedVersio

    Probability Models for Health Care Operations with Application to Emergency Medicine

    Get PDF
    This thesis consists of four contributing chapters; two of which are inspired by practical problems related to emergency department (ED) operations management and the remaining two are motivated by the theoretical problem related to the time-dependent priority queue. Unlike classical priority queue, priorities in the time-dependent priority queue depends on the amount of time an arrival waits for service in addition to the priority class they belong. The mismatch between the demand for ED services and the available resources have direct and indirect negative consequences. Moreover, ED physician pay in some jurisdictions reflects pay-for-performance contracts based on operational benchmarks. To assist in capacity planning and meeting these benchmarks, in chapter 4, I built a forecasting model to produce short-term forecasts of ED arrivals. In chapter 5, I empirically investigated the effect of workload on the productivity of ED services. Specifically, under discretionary work setting, different statistical models were fitted to identify the effect of workload and census on four measures of ED service processes, namely, number discharged, length of stay, service time, and waiting time. The time-dependent priority model was first proposed by Kleinrock (1964), and, more recently, naming it accumulating priority queue (APQ), Stanford et al. (2014) derived the waiting time distributions for the various priority classes when the queue has a single server. In chapter 6, I derived expressions for the waiting time distributions for a multi-server APQ with Poisson arrivals for each class, and a common exponential service time distribution. In chapter 7, I worked with a KPI based service system where there are specific time targets by which each class of customers should commence their service and a compliance probability indicating the proportion of customers from that class meeting the target. Recognizing the fact that customer who misses their KPI target is of greater, not lesser importance, I seek to minimize a weighted sum of the expected amount of excess waiting for each class. When minimizing the total expected excess, our numerical examples lead to an easily-implemented rule of thumb for the optimal priority accumulation rates, which can have an immediate impact on health care delivery
    • …
    corecore