Data Analytics with Differential Privacy
Differential privacy is the state-of-the-art definition for privacy,
guaranteeing that any analysis performed on a sensitive dataset leaks only a
strictly bounded amount of information about the individuals whose data are
contained therein. In this
thesis, we develop differentially private algorithms to analyze distributed and
streaming data. In the distributed model, we consider the particular problem of
learning -- in a distributed fashion -- a global model of the data, that can
subsequently be used for arbitrary analyses. We build upon PrivBayes, a
differentially private method that approximates the high-dimensional
distribution of a centralized dataset as a product of low-order distributions,
utilizing a Bayesian Network model. We examine three novel approaches to
learning a global Bayesian Network from distributed data, while offering the
differential privacy guarantee to all local datasets. Our work includes a
detailed theoretical analysis of the distributed, differentially private
entropy estimator which we use in one of our algorithms, as well as a detailed
experimental evaluation, using both synthetic and real-world data. In the
streaming model, we focus on the problem of estimating the density of a stream
of users, which expresses the fraction of all users that actually appear in the
stream. We offer one of the strongest privacy guarantees for the streaming
model, user-level pan-privacy, which ensures that the privacy of any user is
protected, even against an adversary that observes the internal state of the
algorithm. We provide a detailed analysis of an existing, sampling-based
algorithm for the problem and propose two novel modifications that
significantly improve it, both theoretically and experimentally, by optimally
using all the allocated "privacy budget."
Comment: Diploma Thesis, School of Electrical and Computer Engineering, Technical University of Crete, Chania, Greece, 201
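The sampling-based pan-private density estimator discussed above can be illustrated with a minimal sketch. The version below is a hypothetical simplification of the classical algorithm (keep a noisy bit per sampled user, redrawn from a slightly biased coin whenever that user appears), not the thesis's improved variants; the function name and parameters are illustrative.

```python
import random

def pan_private_density(stream, universe, eps, m, seed=0):
    """Sketch of the classical sampling-based pan-private density estimator.

    The internal state (a list of noisy bits) is distributed identically
    whether or not any particular user has appeared, which is what makes
    the algorithm pan-private against an adversary observing the state.
    """
    rng = random.Random(seed)
    sample = rng.sample(universe, m)           # fixed random sample of users
    idx = {u: i for i, u in enumerate(sample)}
    p = 0.5 + eps / 4                          # biased coin for "seen" users
    # initialize every bit with a fair coin: the state reveals nothing yet
    bits = [rng.random() < 0.5 for _ in range(m)]
    for u in stream:
        if u in idx:
            # redraw from the biased coin; repeated appearances leave the
            # bit's distribution unchanged (still Bernoulli(p))
            bits[idx[u]] = rng.random() < p
    f = sum(bits) / m
    return (f - 0.5) / (eps / 4)               # debiased density estimate
```

With a stream containing half of the user universe, the estimate concentrates around 0.5, with noise that shrinks as the sample size m grows.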
Frequency Estimation in Data Streams: Learning the Optimal Hashing Scheme
We present a novel approach for the problem of frequency estimation in data
streams that is based on optimization and machine learning. Contrary to
state-of-the-art streaming frequency estimation algorithms, which heavily rely
on random hashing to maintain the frequency distribution of the data stream
using limited storage, the proposed approach exploits an observed stream prefix
to near-optimally hash elements and compress the target frequency distribution.
We develop an exact mixed-integer linear optimization formulation, which
enables us to compute optimal or near-optimal hashing schemes for elements seen
in the observed stream prefix; then, we use machine learning to hash unseen
elements. Further, we develop an efficient block coordinate descent algorithm,
which, as we empirically show, produces high quality solutions, and, in a
special case, we are able to solve the proposed formulation exactly in linear
time using dynamic programming. We empirically evaluate the proposed approach
both on synthetic datasets and on real-world search query data. We show that
the proposed approach outperforms existing approaches by one to two orders of
magnitude in terms of its average (per element) estimation error and by 45-90%
in terms of its expected magnitude of estimation error.
Comment: Submitted to IEEE Transactions on Knowledge and Data Engineering on 07/2020. Revised on 05/202
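For contrast with the learned hashing scheme, the random-hashing baseline family the paper improves upon can be sketched with a classical Count-Min sketch. This is a minimal illustration of that baseline, not the paper's method; names and parameters are illustrative.

```python
import random

class CountMin:
    """Classical random-hashing frequency estimator (Count-Min sketch)."""

    def __init__(self, width, depth, seed=0):
        rng = random.Random(seed)
        self.width = width
        # one random salt per row stands in for an independent hash function
        self.salts = [rng.getrandbits(64) for _ in range(depth)]
        self.table = [[0] * width for _ in range(depth)]

    def _bucket(self, salt, x):
        return hash((salt, x)) % self.width

    def update(self, x, count=1):
        for salt, row in zip(self.salts, self.table):
            row[self._bucket(salt, x)] += count

    def query(self, x):
        # collisions only inflate counts, so the minimum over rows
        # upper-bounds the true frequency
        return min(row[self._bucket(salt, x)]
                   for salt, row in zip(self.salts, self.table))
```

Because every collision adds to a counter, `query` never underestimates the true frequency; the learned schemes in the paper aim to place elements so that such collisions cost as little as possible.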
Improving Stability in Decision Tree Models
Owing to their inherently interpretable structure, decision trees are
commonly used in applications where interpretability is essential. Recent work
has focused on improving various aspects of decision trees, including their
predictive power and robustness; however, their instability, albeit
well-documented, has been addressed to a lesser extent. In this paper, we take
a step towards the stabilization of decision tree models through the lens of
real-world health care applications due to the relevance of stability and
interpretability in this space. We introduce a new distance metric for decision
trees and use it to determine a tree's level of stability. We propose a novel
methodology to train stable decision trees and investigate the existence of
trade-offs that are inherent to decision tree models -- including between
stability, predictive power, and interpretability. We demonstrate the value of
the proposed methodology through an extensive quantitative and qualitative
analysis of six case studies from real-world health care applications, and we
show that, on average, with a small 4.6% decrease in predictive power, we gain
a significant 38% improvement in the model's stability.
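The paper's tree distance metric is its own contribution; as a hypothetical stand-in, one can quantify instability as the prediction disagreement between a reference tree and trees refit on bootstrap resamples. The sketch below uses toy one-split stumps, and all names are illustrative.

```python
import random

def fit_stump(X, y):
    """Fit a one-split decision stump minimizing misclassification (toy model)."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            for left_label in (0, 1):
                err = sum((left_label if row[j] <= t else 1 - left_label) != yi
                          for row, yi in zip(X, y))
                if best is None or err < best[0]:
                    best = (err, j, t, left_label)
    _, j, t, left_label = best
    return lambda row: left_label if row[j] <= t else 1 - left_label

def disagreement(tree_a, tree_b, X):
    """A simple (hypothetical) tree distance: fraction of points where predictions differ."""
    return sum(tree_a(r) != tree_b(r) for r in X) / len(X)

def stability(X, y, n_boot=20, seed=0):
    """Average closeness of bootstrap-refit trees to the reference tree (1 = perfectly stable)."""
    rng = random.Random(seed)
    base = fit_stump(X, y)
    dists = []
    for _ in range(n_boot):
        idx = [rng.randrange(len(X)) for _ in range(len(X))]
        boot = fit_stump([X[i] for i in idx], [y[i] for i in idx])
        dists.append(disagreement(base, boot, X))
    return 1 - sum(dists) / len(dists)
```

On cleanly separable data the refit stumps barely move, so this score stays close to 1; noisier data drives it down, which is the instability phenomenon the paper targets.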
The Backbone Method for Ultra-High Dimensional Sparse Machine Learning
We present the backbone method, a generic framework that enables sparse and
interpretable supervised machine learning methods to scale to ultra-high
dimensional problems. We solve sparse regression problems with features
in minutes and features in hours, as well as decision tree problems with
features in minutes. The proposed method operates in two phases: we first
determine the backbone set, consisting of potentially relevant features, by
solving a number of tractable subproblems; then, we solve a reduced problem,
considering only the backbone features. For the sparse regression problem, our
theoretical analysis shows that, under certain assumptions and with high
probability, the backbone set consists of the truly relevant features.
Numerical experiments on both synthetic and real-world datasets demonstrate
that our method outperforms or competes with state-of-the-art methods in
ultra-high dimensional problems, and competes with optimal solutions in
problems where exact methods scale, both in terms of recovering the truly
relevant features and in its out-of-sample predictive performance.
Comment: First submission to Machine Learning: 06/2020. Revised: 10/202
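The two-phase backbone idea can be sketched as follows, assuming a hypothetical screening rule (absolute correlation with the response, applied within random feature subproblems) in place of the paper's actual subproblem solvers; phase two would then fit any sparse method restricted to the backbone columns.

```python
import random

def backbone(X, y, n_sub=10, sub_size=50, keep=5, seed=0):
    """Phase 1 of a backbone-style screen (hypothetical sketch):
    union of top-correlated features across random feature subproblems."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])

    def score(j):
        # absolute Pearson correlation of feature j with the response
        col = [row[j] for row in X]
        mc, my = sum(col) / n, sum(y) / n
        cov = sum((a - mc) * (b - my) for a, b in zip(col, y))
        sx = sum((a - mc) ** 2 for a in col) ** 0.5
        sy = sum((b - my) ** 2 for b in y) ** 0.5
        return abs(cov / (sx * sy + 1e-12))

    kept = set()
    for _ in range(n_sub):
        feats = rng.sample(range(p), min(sub_size, p))  # tractable subproblem
        feats.sort(key=score, reverse=True)
        kept.update(feats[:keep])                       # union into the backbone
    return sorted(kept)
```

On synthetic data where only the first two features drive the response, the backbone set reliably contains them while staying far smaller than the full feature set, which is what makes the reduced phase-two problem tractable.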
Slowly Varying Regression under Sparsity
We consider the problem of parameter estimation in slowly varying regression
models with sparsity constraints. We formulate the problem as a mixed integer
optimization problem and demonstrate that it can be reformulated exactly as a
binary convex optimization problem through a novel exact relaxation. The
relaxation utilizes a new equality on Moore-Penrose inverses that convexifies
the non-convex objective function while coinciding with the original objective
on all feasible binary points. This allows us to solve the problem
significantly more efficiently and to provable optimality using a cutting
plane-type algorithm. We develop a highly optimized implementation of this
algorithm, which substantially improves upon the asymptotic computational
complexity of a straightforward implementation. We further develop a heuristic
method that is guaranteed to produce a feasible solution and, as we empirically
illustrate, generates high quality warm-start solutions for the binary
optimization problem. We show, on both synthetic and real-world datasets, that
the resulting algorithm outperforms competing formulations in comparable times
across a variety of metrics including out-of-sample predictive performance,
support recovery accuracy, and false positive rate. The algorithm enables us to
train models with 10,000s of parameters, is robust to noise, and able to
effectively capture the underlying slowly changing support of the data
generating process.
Comment: Submitted to Operations Research. First submission: 02/202
Where to locate COVID-19 mass vaccination facilities?
The outbreak of COVID-19 led to a record-breaking race to develop a vaccine.
However, the limited vaccine capacity creates another massive challenge: how to
distribute vaccines to mitigate the near-end impact of the pandemic? In the
United States in particular, the new Biden administration is launching mass
vaccination sites across the country, raising the obvious question of where to
locate these clinics to maximize the effectiveness of the vaccination campaign.
This paper tackles this question with a novel data-driven approach to optimize
COVID-19 vaccine distribution. We first augment a state-of-the-art
epidemiological model, called DELPHI, to capture the effects of vaccinations
and the variability in mortality rates across age groups. We then integrate
this predictive model into a prescriptive model to optimize the location of
vaccination sites and subsequent vaccine allocation. The model is formulated as
a bilinear, non-convex optimization model. To solve it, we propose a coordinate
descent algorithm that iterates between optimizing vaccine distribution and
simulating the dynamics of the pandemic. As compared to benchmarks based on
demographic and epidemiological information, the proposed optimization approach
increases the effectiveness of the vaccination campaign, saving extra lives in
the United States over a three-month
period. The proposed solution achieves critical fairness objectives -- by
reducing the death toll of the pandemic in several states without hurting
others -- and is highly robust to uncertainties and forecast errors -- by
achieving similar benefits under a vast range of perturbations.
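The alternation between optimizing vaccine distribution and simulating the pandemic can be illustrated with a toy greedy sketch. The death model below is a hypothetical exponential-decay stand-in, not DELPHI, and every name and constant is illustrative.

```python
import math

def simulate_deaths(alloc, base_deaths, pop):
    """Toy stand-in for an epidemiological simulator (not DELPHI):
    each region's deaths shrink with its per-capita vaccination level."""
    return [d * math.exp(-5 * v / p) for d, v, p in zip(base_deaths, alloc, pop)]

def allocate(base_deaths, pop, budget, step=1000.0):
    """Greedy descent: repeatedly give the next batch of doses to the
    region with the largest simulated marginal reduction in deaths."""
    alloc = [0.0] * len(pop)
    for _ in range(int(budget / step)):
        cur = simulate_deaths(alloc, base_deaths, pop)
        best, gain = 0, -1.0
        for i in range(len(pop)):
            trial = list(alloc)
            trial[i] += step
            saved = cur[i] - simulate_deaths(trial, base_deaths, pop)[i]
            if saved > gain:
                best, gain = i, saved
        alloc[best] += step
    return alloc
```

With two equal-population regions where the first faces ten times the projected deaths, the greedy loop front-loads doses there until diminishing returns make the second region competitive, mirroring the predict-then-optimize alternation described above.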
Rapid Speech Recognizer Adaptation to New Speakers
This paper summarizes the work of the "Rapid Speech Recognizer Adaptation" team in the workshop held at Johns Hopkins University in the summer of 1998. The project addressed the modeling of dependencies between units of speech with the goal of making more effective use of small amounts of data for speaker adaptation. A variety of statistical dependence models were investigated, including (i) a Gaussian multiscale process governed by a stochastic linear dynamical system on a tree, (ii) a simple hierarchical tree-structured prior, (iii) explicit correlation models and (iv) Markov Random Fields. In particular, we investigated dependence models of the bias components of "cascade" transforms of the Gaussian means, which improved the accuracy of the widely used adaptation by transform (constrained re-estimation). Modeling methodologies are contrasted, and comparative performance on the Switchboard task is presented under identical test conditions for supervised and unsupervised adaptation with controlled amounts of adaptation speech.