Data Analytics with Differential Privacy
Differential privacy is the state-of-the-art definition for privacy,
guaranteeing that any analysis performed on a sensitive dataset leaks no
information about the individuals whose data are contained therein. In this
thesis, we develop differentially private algorithms to analyze distributed and
streaming data. In the distributed model, we consider the particular problem of
learning -- in a distributed fashion -- a global model of the data, that can
subsequently be used for arbitrary analyses. We build upon PrivBayes, a
differentially private method that approximates the high-dimensional
distribution of a centralized dataset as a product of low-order distributions,
utilizing a Bayesian Network model. We examine three novel approaches to
learning a global Bayesian Network from distributed data, while offering the
differential privacy guarantee to all local datasets. Our work includes a
detailed theoretical analysis of the distributed, differentially private
entropy estimator which we use in one of our algorithms, as well as a detailed
experimental evaluation, using both synthetic and real-world data. In the
streaming model, we focus on the problem of estimating the density of a stream
of users, which expresses the fraction of all users that actually appear in the
stream. We offer one of the strongest privacy guarantees for the streaming
model, user-level pan-privacy, which ensures that the privacy of any user is
protected, even against an adversary that observes the internal state of the
algorithm. We provide a detailed analysis of an existing, sampling-based
algorithm for the problem and propose two novel modifications that
significantly improve it, both theoretically and experimentally, by optimally
using all the allocated "privacy budget."
Comment: Diploma Thesis, School of Electrical and Computer Engineering,
Technical University of Crete, Chania, Greece, 201
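To make the notions of stream density and "privacy budget" concrete, here is a minimal sketch that releases a noisy density estimate with the standard Laplace mechanism. It is not the thesis's sampling-based, pan-private algorithm: it stores the set of users seen in the clear, so it offers no protection against an adversary who observes the internal state. The names `private_density`, `universe_size`, and `epsilon` are illustrative assumptions.

```python
import random


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # follows a Laplace(0, scale) distribution.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_density(stream, universe_size, epsilon, rng=None):
    """Noisy estimate of the fraction of users that appear in `stream`.

    Assumption: user-level adjacency (all occurrences of one user are
    added or removed), so the density changes by at most 1/universe_size.
    """
    rng = rng or random.Random()
    distinct_users = len(set(stream))        # NOT pan-private: state is in the clear
    true_density = distinct_users / universe_size
    sensitivity = 1.0 / universe_size
    noisy = true_density + laplace_noise(sensitivity / epsilon, rng)
    # Clamping to [0, 1] is post-processing and does not weaken the guarantee.
    return min(1.0, max(0.0, noisy))


if __name__ == "__main__":
    stream = [1, 2, 2, 3, 1]                 # 3 distinct users out of 10
    print(private_density(stream, universe_size=10, epsilon=0.5))
```

A smaller `epsilon` spends less of the privacy budget and yields a noisier estimate; spending the whole budget on this single release is the simplest way to use it fully.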
Frequency Estimation in Data Streams: Learning the Optimal Hashing Scheme
We present a novel approach for the problem of frequency estimation in data
streams that is based on optimization and machine learning. Contrary to
state-of-the-art streaming frequency estimation algorithms, which heavily rely
on random hashing to maintain the frequency distribution of the data stream
using limited storage, the proposed approach exploits an observed stream prefix
to near-optimally hash elements and compress the target frequency distribution.
We develop an exact mixed-integer linear optimization formulation, which
enables us to compute optimal or near-optimal hashing schemes for elements seen
in the observed stream prefix; then, we use machine learning to hash unseen
elements. Further, we develop an efficient block coordinate descent algorithm,
which, as we empirically show, produces high quality solutions, and, in a
special case, we are able to solve the proposed formulation exactly in linear
time using dynamic programming. We empirically evaluate the proposed approach
both on synthetic datasets and on real-world search query data. We show that
the proposed approach outperforms existing approaches by one to two orders of
magnitude in terms of its average (per element) estimation error and by 45-90%
in terms of its expected magnitude of estimation error.
Comment: Submitted to IEEE Transactions on Knowledge and Data Engineering on
07/2020. Revised on 05/202
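For context, the random-hashing baseline that the abstract contrasts itself with can be sketched as a Count-Min Sketch, a widely used streaming frequency estimator. This is not the learned hashing scheme proposed in the paper; the class name, the `width` and `depth` parameters, and the use of salted BLAKE2 as the hash family are assumptions made for illustration.

```python
import hashlib


class CountMinSketch:
    """Random-hashing frequency estimator with depth x width counters."""

    def __init__(self, width, depth):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _bucket(self, item, row):
        # One salted hash per row stands in for independent random hash functions.
        digest = hashlib.blake2b(
            repr(item).encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._bucket(item, row)] += count

    def estimate(self, item):
        # Collisions can only inflate counters, so the minimum over rows
        # is the tightest available upper bound on the true frequency.
        return min(self.table[row][self._bucket(item, row)]
                   for row in range(self.depth))


if __name__ == "__main__":
    cms = CountMinSketch(width=64, depth=4)
    for query in ["a", "b", "a", "c", "a", "b"]:
        cms.add(query)
    print(cms.estimate("a"))  # never below the true count of 3
```

Because buckets are assigned at random, frequent and rare elements may collide; the approach in the abstract instead chooses the assignment by optimization over an observed stream prefix.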
Improving Stability in Decision Tree Models
Owing to their inherently interpretable structure, decision trees are
commonly used in applications where interpretability is essential. Recent work
has focused on improving various aspects of decision trees, including their
predictive power and robustness; however, their instability, albeit
well-documented, has been addressed to a lesser extent. In this paper, we take
a step towards the stabilization of decision tree models through the lens of
real-world health care applications due to the relevance of stability and
interpretability in this space. We introduce a new distance metric for decision
trees and use it to determine a tree's level of stability. We propose a novel
methodology to train stable decision trees and investigate the existence of
trade-offs that are inherent to decision tree models, including those among
stability, predictive power, and interpretability. We demonstrate the value of
the proposed methodology through an extensive quantitative and qualitative
analysis of six case studies from real-world health care applications, and we
show that, on average, with a small 4.6% decrease in predictive power, we gain
a significant 38% improvement in the model's stability.
The Backbone Method for Ultra-High Dimensional Sparse Machine Learning
We present the backbone method, a generic framework that enables sparse and
interpretable supervised machine learning methods to scale to ultra-high
dimensional problems. We solve sparse regression problems with features
in minutes and features in hours, as well as decision tree problems with
features in minutes. The proposed method operates in two phases: we first
determine the backbone set, consisting of potentially relevant features, by
solving a number of tractable subproblems; then, we solve a reduced problem,
considering only the backbone features. For the sparse regression problem, our
theoretical analysis shows that, under certain assumptions and with high
probability, the backbone set consists of the truly relevant features.
Numerical experiments on both synthetic and real-world datasets demonstrate
that our method outperforms or competes with state-of-the-art methods in
ultra-high dimensional problems, and competes with optimal solutions in
problems where exact methods scale, both in terms of recovering the truly
relevant features and in its out-of-sample predictive performance.
Comment: First submission to Machine Learning: 06/2020. Revised: 10/202
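The two-phase structure described above can be sketched as follows. This is only an illustration of the control flow: a simple absolute-correlation ranking stands in for the sparse regression solver in both phases, and the subproblems are built by shuffling the features and splitting them into chunks. The function names and the parameters `k`, `num_rounds`, and `chunk_size` are hypothetical and are not taken from the paper.

```python
import random


def abs_correlation(xs, ys):
    # Absolute Pearson correlation between two equal-length sequences.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return abs(cov) / (vx * vy) ** 0.5


def top_k_features(columns, y, candidates, k):
    # Stand-in "solver": keep the k candidates most correlated with y.
    ranked = sorted(candidates,
                    key=lambda j: abs_correlation(columns[j], y),
                    reverse=True)
    return ranked[:k]


def backbone_select(columns, y, k, num_rounds=3, chunk_size=10, seed=0):
    rng = random.Random(seed)
    features = list(range(len(columns)))
    backbone = set()
    # Phase 1: solve many small subproblems and collect the features they select.
    for _ in range(num_rounds):
        rng.shuffle(features)
        for start in range(0, len(features), chunk_size):
            chunk = features[start:start + chunk_size]
            backbone.update(top_k_features(columns, y, chunk, k))
    # Phase 2: solve the reduced problem over the backbone features only.
    return sorted(top_k_features(columns, y, sorted(backbone), k))


if __name__ == "__main__":
    rng = random.Random(1)
    n, p = 200, 50
    columns = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(p)]
    # Only features 0 and 1 carry signal in this synthetic example.
    y = [3 * columns[0][i] - 2 * columns[1][i] + 0.1 * rng.gauss(0, 1)
         for i in range(n)]
    print(backbone_select(columns, y, k=2))
```

In a faithful implementation, phase 2 would run an exact or near-exact sparse learning method, which becomes tractable because the backbone is much smaller than the full feature set.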
Serum Profiles of C-Reactive Protein, Interleukin-8, and Tumor Necrosis Factor-α in Patients with Acute Pancreatitis
Background-Aims. Early prediction of the severity of acute pancreatitis would lead to prompt intensive treatment resulting in improvement of the outcome. The present study investigated the use of C-reactive protein (CRP), interleukin-8 (IL-8) and tumor necrosis factor-α (TNF-α) as prognosticators of the severity of the disease.
Methods. Twenty-six patients with acute pancreatitis were studied. Patients with APACHE II score of 9 or more formed the severe group, while the mild group consisted of patients with APACHE II score of less than 9. Serum samples for measurement of CRP, IL-8 and TNF-α were collected on the day of admission and additionally on the 2nd, 3rd and 7th days.
Results. Significantly higher levels of IL-8 were found in patients with severe acute pancreatitis compared to those with mild disease especially at the 2nd and 3rd days (P = .001 and P = .014, resp.). No significant difference for CRP and TNF-α was observed between the two groups. The optimal cut-offs for IL-8 in order to discriminate severe from mild disease at the 2nd and 3rd days were 25.4 pg/mL and 14.5 pg/mL, respectively.
Conclusions. IL-8 in the early phase of acute pancreatitis is a superior marker compared to CRP and TNF-α for distinguishing patients with severe disease.
Slowly Varying Regression under Sparsity
We consider the problem of parameter estimation in slowly varying regression
models with sparsity constraints. We formulate the problem as a mixed integer
optimization problem and demonstrate that it can be reformulated exactly as a
binary convex optimization problem through a novel exact relaxation. The
relaxation utilizes a new equality on Moore-Penrose inverses that convexifies
the non-convex objective function while coinciding with the original objective
on all feasible binary points. This allows us to solve the problem
significantly more efficiently and to provable optimality using a cutting
plane-type algorithm. We develop a highly optimized implementation of this
algorithm, which substantially improves upon the asymptotic computational
complexity of a straightforward implementation. We further develop a heuristic
method that is guaranteed to produce a feasible solution and, as we empirically
illustrate, generates high quality warm-start solutions for the binary
optimization problem. We show, on both synthetic and real-world datasets, that
the resulting algorithm outperforms competing formulations in comparable times
across a variety of metrics including out-of-sample predictive performance,
support recovery accuracy, and false positive rate. The algorithm enables us to
train models with tens of thousands of parameters, is robust to noise, and is able to
effectively capture the underlying slowly changing support of the data
generating process.
Comment: Submitted to Operations Research. First submission: 02/202
Non-operative management of blunt abdominal trauma. Is it safe and feasible in a district general hospital?
Background. To evaluate the feasibility and safety of non-operative management (NOM) of blunt abdominal trauma in a district general hospital with middle volume trauma case load.
Methods. Prospective protocol-driven study including 30 consecutive patients who have been treated in our Department during a 30-month-period. Demographic, medical and trauma characteristics, type of treatment and outcome were examined. Patients were divided in 3 groups: those who underwent immediate laparotomy (OP group), those who had a successful NOM (NOM-S group) and those with a NOM failure (NOM-F group).
Results. NOM was applied in 73.3% (22 patients) of all blunt abdominal injuries with a failure rate of 13.6% (3 patients). Injury severity score (ISS), admission hematocrit, hemodynamic status and need for transfusion were significantly different between NOM and OP group. NOM failure occurred mainly in patients with splenic trauma.
Conclusion. According to our experience, the hemodynamically stable or easily stabilized trauma patient can be admitted in a non-ICU ward with the provision of close monitoring. Splenic injury, especially with multiple-site free intra-abdominal fluid in abdominal computed tomography, carries a high risk for NOM failure. In this series, the main criterion for a laparotomy in a NOM patient was hemodynamic deterioration after a second rapid fluid load.
Adaptive hybrid optimization strategy for calibration and parameter estimation of physical models
A new adaptive hybrid optimization strategy, entitled squads, is proposed for
complex inverse analysis of computationally intensive physical models. The new
strategy is designed to be computationally efficient and robust in
identification of the global optimum (e.g. maximum or minimum value of an
objective function). It integrates a global Adaptive Particle Swarm
Optimization (APSO) strategy with a local Levenberg-Marquardt (LM) optimization
strategy using adaptive rules based on runtime performance. The global strategy
optimizes the location of a set of solutions (particles) in the parameter
space. The LM strategy is applied only to a subset of the particles at
different stages of the optimization based on the adaptive rules. After the LM
adjustment of the subset of particle positions, the updated particles are
returned to the APSO strategy. The advantages of coupling APSO and LM in the
manner implemented in squads are demonstrated by comparisons of squads
performance against Levenberg-Marquardt (LM), Particle Swarm Optimization
(PSO), Adaptive Particle Swarm Optimization (APSO; the TRIBES strategy), and an
existing hybrid optimization strategy (hPSO). All the strategies are tested on
2D, 5D and 10D Rosenbrock and Griewank polynomial test functions and a
synthetic hydrogeologic application to identify the source of a contaminant
plume in an aquifer. Tests are performed using a series of runs with random
initial guesses for the estimated (function/model) parameters. When both
robustness and efficiency are taken into consideration, squads is observed to
perform better than the other strategies on all test functions and the
hydrogeologic application.