
    Data Analytics with Differential Privacy

    Differential privacy is the state-of-the-art definition for privacy, guaranteeing that any analysis performed on a sensitive dataset leaks no information about the individuals whose data are contained therein. In this thesis, we develop differentially private algorithms to analyze distributed and streaming data. In the distributed model, we consider the particular problem of learning -- in a distributed fashion -- a global model of the data, that can subsequently be used for arbitrary analyses. We build upon PrivBayes, a differentially private method that approximates the high-dimensional distribution of a centralized dataset as a product of low-order distributions, utilizing a Bayesian Network model. We examine three novel approaches to learning a global Bayesian Network from distributed data, while offering the differential privacy guarantee to all local datasets. Our work includes a detailed theoretical analysis of the distributed, differentially private entropy estimator which we use in one of our algorithms, as well as a detailed experimental evaluation, using both synthetic and real-world data. In the streaming model, we focus on the problem of estimating the density of a stream of users, which expresses the fraction of all users that actually appear in the stream. We offer one of the strongest privacy guarantees for the streaming model, user-level pan-privacy, which ensures that the privacy of any user is protected, even against an adversary that observes the internal state of the algorithm. We provide a detailed analysis of an existing, sampling-based algorithm for the problem and propose two novel modifications that significantly improve it, both theoretically and experimentally, by optimally using all the allocated "privacy budget."
    Comment: Diploma Thesis, School of Electrical and Computer Engineering, Technical University of Crete, Chania, Greece, 201
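
    Differentially private estimators of the kind described above are typically built from standard mechanisms. As a minimal sketch, assuming nothing about the thesis's exact estimators, the Laplace mechanism below shows how a single numeric statistic (for example, a count feeding an entropy or density estimate) can be released with epsilon-differential privacy.

        import numpy as np

        def laplace_mechanism(true_value, sensitivity, epsilon, rng=None):
            """Release a real-valued statistic with epsilon-differential privacy
            by adding Laplace noise with scale sensitivity / epsilon."""
            rng = np.random.default_rng() if rng is None else rng
            return true_value + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

        # Example (illustrative values): a private count query over a local dataset.
        # Counting queries have sensitivity 1, since adding or removing one person
        # changes the count by at most 1, so the noise scale is 1 / epsilon.
        local_count = 1234
        private_count = laplace_mechanism(local_count, sensitivity=1.0, epsilon=0.5)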

    Frequency Estimation in Data Streams: Learning the Optimal Hashing Scheme

    We present a novel approach for the problem of frequency estimation in data streams that is based on optimization and machine learning. Contrary to state-of-the-art streaming frequency estimation algorithms, which heavily rely on random hashing to maintain the frequency distribution of the data stream using limited storage, the proposed approach exploits an observed stream prefix to near-optimally hash elements and compress the target frequency distribution. We develop an exact mixed-integer linear optimization formulation, which enables us to compute optimal or near-optimal hashing schemes for elements seen in the observed stream prefix; then, we use machine learning to hash unseen elements. Further, we develop an efficient block coordinate descent algorithm, which, as we empirically show, produces high-quality solutions, and, in a special case, we are able to solve the proposed formulation exactly in linear time using dynamic programming. We empirically evaluate the proposed approach both on synthetic datasets and on real-world search query data. We show that the proposed approach outperforms existing approaches by one to two orders of magnitude in terms of its average (per-element) estimation error and by 45-90% in terms of its expected magnitude of estimation error.
    Comment: Submitted to IEEE Transactions on Knowledge and Data Engineering on 07/2020. Revised on 05/202
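
    For context, the random-hashing baseline that this work contrasts with is typified by the count-min sketch. The snippet below is a minimal, illustrative implementation; the hash family, width, and depth are arbitrary choices, not taken from the paper.

        import numpy as np

        class CountMinSketch:
            """Classic random-hashing frequency estimator: depth rows of width
            counters, each row indexed by an independent hash of the element."""
            def __init__(self, width=2048, depth=4, seed=0):
                rng = np.random.default_rng(seed)
                self.width, self.depth = width, depth
                self.counts = np.zeros((depth, width), dtype=np.int64)
                # Per-row seeds so each row uses a different hash of the element.
                self.seeds = rng.integers(0, 2**31 - 1, size=depth)

            def _index(self, item, row):
                return hash((int(self.seeds[row]), item)) % self.width

            def update(self, item, count=1):
                for row in range(self.depth):
                    self.counts[row, self._index(item, row)] += count

            def estimate(self, item):
                # The minimum over rows is an upper bound on the true frequency.
                return min(self.counts[row, self._index(item, row)] for row in range(self.depth))

        cms = CountMinSketch()
        for token in ["a", "b", "a", "c", "a"]:
            cms.update(token)
        print(cms.estimate("a"))  # >= 3, typically exactly 3 at this sketch size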

    Improving Stability in Decision Tree Models

    Owing to their inherently interpretable structure, decision trees are commonly used in applications where interpretability is essential. Recent work has focused on improving various aspects of decision trees, including their predictive power and robustness; however, their instability, albeit well documented, has been addressed to a lesser extent. In this paper, we take a step towards the stabilization of decision tree models through the lens of real-world health care applications, given the relevance of both stability and interpretability in this space. We introduce a new distance metric for decision trees and use it to determine a tree's level of stability. We propose a novel methodology to train stable decision trees and investigate the trade-offs inherent to decision tree models, including between stability, predictive power, and interpretability. We demonstrate the value of the proposed methodology through an extensive quantitative and qualitative analysis of six case studies from real-world health care applications, and we show that, on average, with a small 4.6% decrease in predictive power, we gain a significant 38% improvement in the model's stability.
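
    A simple way to make tree instability concrete is to measure how often two trees trained on perturbed versions of the same data disagree. The sketch below uses a prediction-disagreement distance purely for illustration; the paper defines its own distance metric over tree structure.

        import numpy as np
        from sklearn.datasets import load_breast_cancer
        from sklearn.tree import DecisionTreeClassifier

        def disagreement_distance(tree_a, tree_b, X_ref):
            """Fraction of reference points on which two fitted trees disagree
            (an illustrative distance, not the paper's metric)."""
            return float(np.mean(tree_a.predict(X_ref) != tree_b.predict(X_ref)))

        X, y = load_breast_cancer(return_X_y=True)
        rng = np.random.default_rng(0)

        # Fit two trees on bootstrap resamples of the same data to probe instability.
        idx_a = rng.integers(0, len(X), size=len(X))
        idx_b = rng.integers(0, len(X), size=len(X))
        tree_a = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X[idx_a], y[idx_a])
        tree_b = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X[idx_b], y[idx_b])

        print(disagreement_distance(tree_a, tree_b, X))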

    The Backbone Method for Ultra-High Dimensional Sparse Machine Learning

    We present the backbone method, a generic framework that enables sparse and interpretable supervised machine learning methods to scale to ultra-high dimensional problems. We solve sparse regression problems with 10^7 features in minutes and 10^8 features in hours, as well as decision tree problems with 10^5 features in minutes. The proposed method operates in two phases: we first determine the backbone set, consisting of potentially relevant features, by solving a number of tractable subproblems; then, we solve a reduced problem, considering only the backbone features. For the sparse regression problem, our theoretical analysis shows that, under certain assumptions and with high probability, the backbone set consists of the truly relevant features. Numerical experiments on both synthetic and real-world datasets demonstrate that our method outperforms or competes with state-of-the-art methods in ultra-high dimensional problems, and competes with optimal solutions in problems where exact methods scale, both in terms of recovering the truly relevant features and in its out-of-sample predictive performance.
    Comment: First submission to Machine Learning: 06/2020. Revised: 10/202
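
    The two-phase workflow is easy to sketch. The toy example below screens random feature blocks with a Lasso and unions the selected features into a backbone set before solving a reduced problem; the actual backbone method solves different (sparsity-constrained) subproblems, so treat this only as an illustration of the workflow under assumed parameters.

        import numpy as np
        from sklearn.linear_model import Lasso

        def backbone_feature_screen(X, y, n_blocks=10, alpha=0.1, seed=0):
            """Phase 1 (illustrative): fit a sparse model on random feature blocks
            and keep the union of selected features as the 'backbone' set."""
            rng = np.random.default_rng(seed)
            perm = rng.permutation(X.shape[1])
            backbone = set()
            for block in np.array_split(perm, n_blocks):
                coef = Lasso(alpha=alpha).fit(X[:, block], y).coef_
                backbone.update(block[np.abs(coef) > 1e-8])
            return np.array(sorted(backbone))

        # Synthetic data: only the first 5 of 5,000 features are relevant.
        rng = np.random.default_rng(1)
        X = rng.standard_normal((200, 5000))
        beta = np.zeros(5000)
        beta[:5] = 3.0
        y = X @ beta + 0.1 * rng.standard_normal(200)

        # Phase 2: solve the reduced problem on the backbone features only.
        backbone = backbone_feature_screen(X, y)
        reduced_model = Lasso(alpha=0.1).fit(X[:, backbone], y)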

    Serum Profiles of C-Reactive Protein, Interleukin-8, and Tumor Necrosis Factor-α in Patients with Acute Pancreatitis

    Background-Aims. Early prediction of the severity of acute pancreatitis would allow prompt intensive treatment, improving the outcome. The present study investigated the use of C-reactive protein (CRP), interleukin-8 (IL-8), and tumor necrosis factor-α (TNF-α) as prognosticators of the severity of the disease. Methods. Twenty-six patients with acute pancreatitis were studied. Patients with an APACHE II score of 9 or more formed the severe group, while the mild group consisted of patients with an APACHE II score of less than 9. Serum samples for measurement of CRP, IL-8, and TNF-α were collected on the day of admission and additionally on the 2nd, 3rd, and 7th days. Results. Significantly higher levels of IL-8 were found in patients with severe acute pancreatitis compared to those with mild disease, especially on the 2nd and 3rd days (P = .001 and P = .014, resp.). No significant difference in CRP or TNF-α was observed between the two groups. The optimal cut-offs for IL-8 for discriminating severe from mild disease on the 2nd and 3rd days were 25.4 pg/mL and 14.5 pg/mL, respectively. Conclusions. IL-8 in the early phase of acute pancreatitis is a superior marker to CRP and TNF-α for distinguishing patients with severe disease.

    Slowly Varying Regression under Sparsity

    We consider the problem of parameter estimation in slowly varying regression models with sparsity constraints. We formulate the problem as a mixed-integer optimization problem and demonstrate that it can be reformulated exactly as a binary convex optimization problem through a novel exact relaxation. The relaxation utilizes a new equality on Moore-Penrose inverses that convexifies the non-convex objective function while coinciding with the original objective on all feasible binary points. This allows us to solve the problem significantly more efficiently and to provable optimality using a cutting-plane-type algorithm. We develop a highly optimized implementation of this algorithm, which substantially improves upon the asymptotic computational complexity of a straightforward implementation. We further develop a heuristic method that is guaranteed to produce a feasible solution and, as we empirically illustrate, generates high-quality warm-start solutions for the binary optimization problem. We show, on both synthetic and real-world datasets, that the resulting algorithm outperforms competing formulations in comparable times across a variety of metrics, including out-of-sample predictive performance, support recovery accuracy, and false positive rate. The algorithm enables us to train models with tens of thousands of parameters, is robust to noise, and effectively captures the underlying slowly changing support of the data-generating process.
    Comment: Submitted to Operations Research. First submission: 02/202
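
    To make the modeling idea concrete, here is a small convex surrogate of slowly varying sparse regression written with cvxpy: per-period coefficients, an L1 penalty for sparsity, and a fused L1 penalty that discourages abrupt changes between consecutive periods. This is an assumed illustration only; the paper works with an exact mixed-integer formulation solved by a cutting-plane algorithm, not these convex penalties.

        import cvxpy as cp
        import numpy as np

        # Synthetic data: T periods, each with its own design matrix and response.
        T, n, p = 5, 50, 20
        rng = np.random.default_rng(0)
        true_beta = np.r_[np.ones(3), np.zeros(p - 3)]
        X = [rng.standard_normal((n, p)) for _ in range(T)]
        y = [X[t] @ true_beta + 0.1 * rng.standard_normal(n) for t in range(T)]

        # beta[t] are the period-t coefficients; the fused term penalizes changes
        # between consecutive periods so the fitted coefficients vary slowly.
        beta = cp.Variable((T, p))
        loss = sum(cp.sum_squares(X[t] @ beta[t] - y[t]) for t in range(T))
        sparsity = cp.sum(cp.abs(beta))
        slow_variation = sum(cp.sum(cp.abs(beta[t] - beta[t - 1])) for t in range(1, T))

        problem = cp.Problem(cp.Minimize(loss + 1.0 * sparsity + 5.0 * slow_variation))
        problem.solve()
        print(problem.value)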

    Non-operative management of blunt abdominal trauma. Is it safe and feasible in a district general hospital?

    Background. To evaluate the feasibility and safety of non-operative management (NOM) of blunt abdominal trauma in a district general hospital with a middle-volume trauma case load. Methods. Prospective, protocol-driven study including 30 consecutive patients treated in our department over a 30-month period. Demographic, medical, and trauma characteristics, type of treatment, and outcome were examined. Patients were divided into 3 groups: those who underwent immediate laparotomy (OP group), those who had a successful NOM (NOM-S group), and those with a NOM failure (NOM-F group). Results. NOM was applied in 73.3% (22 patients) of all blunt abdominal injuries, with a failure rate of 13.6% (3 patients). Injury severity score (ISS), admission hematocrit, hemodynamic status, and need for transfusion were significantly different between the NOM and OP groups. NOM failure occurred mainly in patients with splenic trauma. Conclusion. According to our experience, the hemodynamically stable or easily stabilized trauma patient can be admitted to a non-ICU ward with the provision of close monitoring. Splenic injury, especially with multiple-site free intra-abdominal fluid on abdominal computed tomography, carries a high risk of NOM failure. In this series, the main criterion for laparotomy in a NOM patient was hemodynamic deterioration after a second rapid fluid load.

    Adaptive hybrid optimization strategy for calibration and parameter estimation of physical models

    A new adaptive hybrid optimization strategy, called squads, is proposed for complex inverse analysis of computationally intensive physical models. The new strategy is designed to be computationally efficient and robust in identifying the global optimum (e.g., the maximum or minimum value of an objective function). It integrates a global Adaptive Particle Swarm Optimization (APSO) strategy with a local Levenberg-Marquardt (LM) optimization strategy using adaptive rules based on runtime performance. The global strategy optimizes the location of a set of solutions (particles) in the parameter space. The LM strategy is applied only to a subset of the particles at different stages of the optimization, based on the adaptive rules. After the LM adjustment of the subset of particle positions, the updated particles are returned to the APSO strategy. The advantages of coupling APSO and LM in the manner implemented in squads are demonstrated by comparing the performance of squads against Levenberg-Marquardt (LM), Particle Swarm Optimization (PSO), Adaptive Particle Swarm Optimization (APSO; the TRIBES strategy), and an existing hybrid optimization strategy (hPSO). All the strategies are tested on 2D, 5D, and 10D Rosenbrock and Griewank polynomial test functions and on a synthetic hydrogeologic application to identify the source of a contaminant plume in an aquifer. Tests are performed using a series of runs with random initial guesses for the estimated (function/model) parameters. When both robustness and efficiency are taken into consideration, squads is observed to outperform the other strategies for all test functions and the hydrogeologic application.
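
    The overall global-plus-local pattern can be sketched with a crude particle swarm loop that periodically hands its incumbent to a Levenberg-Marquardt-style local solver, here scipy.optimize.least_squares on the Rosenbrock function. The adaptive rules and the TRIBES-based APSO of squads are not reproduced; every parameter below is an assumption chosen only for illustration.

        import numpy as np
        from scipy.optimize import least_squares

        def rosenbrock_residuals(x):
            # Rosenbrock in least-squares form: f(x) = sum of squared residuals.
            return np.concatenate([10.0 * (x[1:] - x[:-1] ** 2), 1.0 - x[:-1]])

        def objective(x):
            r = rosenbrock_residuals(x)
            return float(r @ r)

        rng = np.random.default_rng(0)
        dim, n_particles = 5, 20
        pos = rng.uniform(-2.0, 2.0, size=(n_particles, dim))
        vel = np.zeros_like(pos)
        best = pos[np.argmin([objective(p) for p in pos])].copy()

        for it in range(50):
            # Global step: pull particles toward the best known position.
            vel = 0.7 * vel + 0.3 * rng.random((n_particles, 1)) * (best - pos)
            pos += vel
            values = [objective(p) for p in pos]
            i = int(np.argmin(values))
            if values[i] < objective(best):
                best = pos[i].copy()
            # Local step every 10 iterations: refine the incumbent with an LM solver.
            if it % 10 == 0:
                best = least_squares(rosenbrock_residuals, best, method="lm").x

        print(objective(best))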