SoK: Chasing Accuracy and Privacy, and Catching Both in Differentially Private Histogram Publication
Histograms and synthetic data are of key importance in data analysis.
However, researchers have shown that even aggregated data such as histograms,
containing no obvious sensitive attributes, can result in privacy leakage. To
enable data analysis, a strong notion of privacy is required to avoid risking
unintended privacy violations.
Such a strong notion of privacy is differential privacy, a statistical notion
of privacy that makes privacy leakage quantifiable. The caveat regarding
differential privacy is that while it has strong guarantees for privacy,
privacy comes at a cost of accuracy. Despite this trade-off being a central and
important issue in the adoption of differential privacy, there is a gap in
the literature when it comes to understanding the trade-off and how to
address it appropriately.
Through a systematic literature review (SLR), we investigate the
state of the art in accuracy-improving differentially private algorithms
for histogram and synthetic data publishing. Our contribution is two-fold: 1)
we identify trends and connections in the contributions to the field of
differential privacy for histograms and synthetic data and 2) we provide an
understanding of the privacy/accuracy trade-off challenge by crystallizing
different dimensions of accuracy improvement. Accordingly, we position and
visualize the ideas in relation to each other and external work, and
deconstruct each algorithm to examine the building blocks separately with the
aim of pinpointing which dimension of accuracy improvement each
technique/approach is targeting. Hence, this systematization of knowledge (SoK)
provides an understanding of the dimensions in which, and how, accuracy improvement
can be pursued without sacrificing privacy.
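As a point of reference for the privacy/accuracy trade-off described above, the following is a minimal sketch of the textbook Laplace mechanism applied to histogram counts. It is not one of the surveyed algorithms. The function names, the example data, and the choice of add/remove-one-record neighbouring datasets (which gives an L1 sensitivity of 1) are assumptions made for illustration.

```python
import random


def laplace_noise(scale, rng):
    """Draw one Laplace(0, scale) sample as the difference of two exponentials."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def dp_histogram(values, bin_edges, epsilon, rng=None):
    """Return noisy bin counts under the Laplace mechanism.

    Assumption: neighbouring datasets differ by adding or removing one
    record, so the L1 sensitivity of the count vector is 1.
    Values outside [bin_edges[0], bin_edges[-1]) are ignored.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = rng or random.Random()
    counts = [0] * (len(bin_edges) - 1)
    for v in values:
        for i in range(len(counts)):
            if bin_edges[i] <= v < bin_edges[i + 1]:
                counts[i] += 1
                break
    scale = 1.0 / epsilon  # sensitivity / epsilon, with sensitivity = 1
    return [c + laplace_noise(scale, rng) for c in counts]


# Hypothetical example data: smaller epsilon means more noise, less accuracy.
ages = [23, 35, 41, 29, 52, 37, 64, 45]
edges = [20, 30, 40, 50, 60, 70]
noisy_counts = dp_histogram(ages, edges, epsilon=0.5, rng=random.Random(0))
```

The noise scale grows as epsilon shrinks, which is the accuracy cost the abstract refers to; the surveyed algorithms aim to reduce that cost without weakening the guarantee.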
Demographic inference from multiple whole genomes using a particle filter for continuous Markov jump processes
Demographic events shape a population's genetic diversity, a process described by the coalescent-with-recombination model that relates demography and genetics by an unobserved sequence of genealogies along the genome. As the space of genealogies over genomes is large and complex, inference under this model is challenging. Formulating the coalescent-with-recombination model as a continuous-time and -space Markov jump process, we develop a particle filter for such processes, and use waypoints that under appropriate conditions allow the problem to be reduced to the discrete-time case. To improve inference, we generalise the Auxiliary Particle Filter for discrete-time models, and use Variational Bayes to model the uncertainty in parameter estimates for rare events, avoiding biases seen with Expectation Maximization. Using real and simulated genomes, we show that past population sizes can be accurately inferred over a larger range of epochs than was previously possible, opening the possibility of jointly analyzing multiple genomes under complex demographic models. Code is available at https://github.com/luntergroup/smcsmc.
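For readers unfamiliar with particle filters, the following is a minimal sketch of a generic bootstrap particle filter for a discrete-time model. It is not the continuous-time method or the generalised Auxiliary Particle Filter of the abstract, and it is unrelated to the smcsmc code. The random-walk state model, Gaussian observation noise, and all names and parameters are assumptions chosen only to show the propagate, weight, resample cycle.

```python
import math
import random


def bootstrap_particle_filter(observations, n_particles=500,
                              process_sd=1.0, obs_sd=1.0, rng=None):
    """Return the filtered posterior mean of the hidden state at each step.

    Assumed toy model: x_t = x_{t-1} + N(0, process_sd^2),
                       y_t = x_t + N(0, obs_sd^2).
    """
    rng = rng or random.Random()
    particles = [rng.gauss(0.0, 1.0) for _ in range(n_particles)]
    means = []
    for y in observations:
        # Propagate each particle through the state transition.
        particles = [x + rng.gauss(0.0, process_sd) for x in particles]
        # Weight by the observation likelihood (log-space for stability).
        log_w = [-0.5 * ((y - x) / obs_sd) ** 2 for x in particles]
        top = max(log_w)
        weights = [math.exp(lw - top) for lw in log_w]
        total = sum(weights)
        weights = [w / total for w in weights]
        means.append(sum(w * x for w, x in zip(weights, particles)))
        # Multinomial resampling to counter weight degeneracy.
        particles = rng.choices(particles, weights=weights, k=n_particles)
    return means


# Hypothetical observations of a slowly drifting state.
estimates = bootstrap_particle_filter([0.0, 0.5, 1.0, 1.5],
                                      rng=random.Random(42))
```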
Data Mining with Newton's Method.
Capable and well-organized data mining algorithms are essential and fundamental to helpful, useful, and successful knowledge discovery in databases. We discuss several data mining algorithms including genetic algorithms (GAs). In addition, we propose a modified multivariate Newton's method (NM) approach to data mining of technical data. Several strategies are employed to stabilize Newton's method against pathological function behavior. NM is compared to GAs and to the simplex evolutionary operation algorithm (EVOP). We find that GAs, NM, and EVOP all perform efficiently for well-behaved global optimization functions, with NM providing an exponential improvement in convergence rate. For local optimization problems, we find that GAs and EVOP do not provide the desired convergence rate, accuracy, or precision compared to NM for technical data. We find that GAs are favored for their simplicity while NM would be favored for its performance.
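The abstract does not say which stabilization strategies the authors use. As an illustration only, the sketch below shows one common way to stabilize a multivariate Newton iteration: a backtracking (Armijo) line search with a steepest-descent fallback, written for two variables. The test function, its derivatives, and all names are assumptions, not taken from the work.

```python
import math


def solve_2x2(h, rhs):
    """Solve the 2x2 system h @ p = rhs; return None if h is near-singular."""
    (a, b), (c, d) = h
    det = a * d - b * c
    if abs(det) < 1e-12:
        return None
    return ((d * rhs[0] - b * rhs[1]) / det, (a * rhs[1] - c * rhs[0]) / det)


def damped_newton(f, grad, hess, x0, tol=1e-8, max_iter=50):
    """Minimize f over two variables with a line-search-damped Newton step."""
    x = tuple(x0)
    for _ in range(max_iter):
        g = grad(x)
        if math.hypot(g[0], g[1]) < tol:
            break
        step = solve_2x2(hess(x), (-g[0], -g[1]))
        # Fall back to steepest descent if the Newton step is unusable.
        if step is None or g[0] * step[0] + g[1] * step[1] >= 0.0:
            step = (-g[0], -g[1])
        slope = g[0] * step[0] + g[1] * step[1]
        t, fx = 1.0, f(x)
        # Halve the step until the Armijo sufficient-decrease test passes.
        while t > 1e-12:
            trial = (x[0] + t * step[0], x[1] + t * step[1])
            if f(trial) <= fx + 1e-4 * t * slope:
                break
            t *= 0.5
        x = (x[0] + t * step[0], x[1] + t * step[1])
    return x


# Hypothetical strictly convex test function with its minimum at (1, -2).
def example_f(p):
    return math.cosh(p[0] - 1) + (p[1] + 2) ** 2 + 0.5 * (p[0] - 1) * (p[1] + 2)


def example_grad(p):
    return (math.sinh(p[0] - 1) + 0.5 * (p[1] + 2),
            2 * (p[1] + 2) + 0.5 * (p[0] - 1))


def example_hess(p):
    return ((math.cosh(p[0] - 1), 0.5), (0.5, 2.0))


minimum = damped_newton(example_f, example_grad, example_hess, (4.0, 3.0))
```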
Digital Advertising: the Measure of Mobile Visits Lifts
Mobile-phone advertising enables marketers to reach customers at a personal
level, and it enables measuring customers' reactions by novel approaches, in
real time, and at scale. By keeping a device anonymous, we can deliver custom
adverts and we can check when the device owner will visit a specific
brick-and-mortar location. This is the first step in a sale. By measuring
visits and sales, the original marketers can determine their return on
advertising and they can prove the efficacy of the marketing investments. We
turn our attention to the measure of lift: we define it as the visit
acceleration during the campaign flight with respect to a controlled baseline.
We present a theoretical description; we describe a general and a simplified
approach in composing the exposed and the control baseline; we develop two
different vertical approaches with different comparable solutions; finally, we
present how to carry out the experiments and the measurements for a few dozen
campaigns; these campaigns range from hundreds of thousands of devices and
a few hundred visits to a handful of locations, to sixty million devices and
millions of visits to thousands of locations. We care about experiments at
scale. Comment: 27 pages, 18 figures
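The abstract defines lift only in words, as visit acceleration during the campaign flight relative to a controlled baseline. One simple way to make that concrete, assumed here and not necessarily the paper's formulation, is the relative difference between the visit rate of exposed devices and that of control devices. All names and numbers below are hypothetical.

```python
def visit_rate(visits, devices):
    """Visits per device over the campaign flight."""
    if devices <= 0:
        raise ValueError("devices must be positive")
    return visits / devices


def relative_lift(exposed_visits, exposed_devices,
                  control_visits, control_devices):
    """Relative lift: exposed visit rate over control visit rate, minus 1.

    Assumption: the control group is a valid baseline for the exposed group.
    """
    exposed = visit_rate(exposed_visits, exposed_devices)
    control = visit_rate(control_visits, control_devices)
    if control == 0:
        raise ValueError("control visit rate is zero; lift is undefined")
    return exposed / control - 1.0


# Hypothetical campaign: 300 visits from 100,000 exposed devices versus
# 200 visits from 100,000 control devices gives a lift of about 0.5 (50%).
lift = relative_lift(300, 100_000, 200, 100_000)
```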
Distributed Differential Privacy and Applications
Recent growth in the size and scope of databases has resulted in more
research into making productive use of this data. Unfortunately, a
significant stumbling block which remains is protecting the privacy of
the individuals that populate these datasets. As people spend more
time connected to the Internet, and conduct more of their daily lives
online, privacy becomes a more important consideration, just as the
data becomes more useful for researchers, companies, and
individuals. As a result, plenty of important information remains
locked down and unavailable to honest researchers today, due to fears
that data leakages will harm individuals.
Recent research in differential privacy opens a promising pathway to
guarantee individual privacy while simultaneously making use of the
data to answer useful queries. Differential privacy is a theory that
provides provable information-theoretic guarantees on what any answer
may reveal about any single individual in the database. This approach
has resulted in a flurry of recent research, presenting novel
algorithms that can perform a rich class of computations in this
setting.
In this dissertation, we focus on some real-world challenges that
arise when trying to provide differential privacy guarantees in
practice. We design and build runtimes that achieve the mathematical
differential privacy guarantee in the face of three real-world
challenges: securing the runtimes against adversaries, enabling
readers to verify that the answers are accurate, and dealing with data
distributed across multiple domains.
NEW METHODS FOR MINING SEQUENTIAL AND TIME SERIES DATA
Data mining is the process of extracting knowledge from large amounts of data. It covers a variety of techniques aimed at discovering diverse types of patterns on the basis of the requirements of the domain. These techniques include association rules mining, classification, cluster analysis and outlier detection. The availability of applications that produce massive amounts of spatial, spatio-temporal (ST) and time series data (TSD) is the rationale for developing specialized techniques to excavate such data. In spatial data mining, the spatial co-location rule problem is different from the association rule problem, since there is no natural notion of transactions in spatial datasets that are embedded in continuous geographic space. Therefore, we have proposed an efficient algorithm (GridClique) to mine interesting spatial co-location patterns (maximal cliques). These patterns are used as the raw transactions for an association rule mining technique to discover complex co-location rules. Our proposal includes certain types of complex relationships – especially negative relationships – in the patterns. The relationships can be obtained from only the maximal clique patterns, which have never been used until now. Our approach is applied on a well-known astronomy dataset obtained from the Sloan Digital Sky Survey (SDSS). ST data is continuously collected and made accessible in the public domain. We present an approach to mine and query large ST data with the aim of finding interesting patterns and understanding the underlying process of data generation. An important class of queries is based on the flock pattern. A flock is a large subset of objects moving along paths close to each other for a predefined time. 
One approach to processing a “flock query” is to map ST data into high-dimensional space and to reduce the query to a sequence of standard range queries that can be answered using a spatial indexing structure; however, the performance of spatial indexing structures rapidly deteriorates in high-dimensional space. This thesis sets out a preprocessing strategy that uses a random projection to reduce the dimensionality of the transformed space. We use probabilistic arguments to prove the accuracy of the projection and present experimental results that show the possibility of managing the curse of dimensionality in an ST setting by combining random projections with traditional data structures. In time series data mining, we devised a new space-efficient algorithm (SparseDTW) to compute the dynamic time warping (DTW) distance between two time series, which always yields the optimal result. This is in contrast to other approaches, which typically sacrifice optimality to attain space efficiency. The main idea behind our approach is to dynamically exploit the existence of similarity and/or correlation between the time series: the greater the similarity between the time series, the less space is required to compute the DTW between them. Other techniques for speeding up DTW impose a priori constraints and do not exploit similarity characteristics that may be present in the data. Our experiments demonstrate that SparseDTW outperforms these approaches. By applying the SparseDTW algorithm, we discover an interesting pattern, “pairs trading”, in a large stock-market dataset of daily index prices from the Australian Stock Exchange (ASX) from 1980 to 2002.
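For context on the DTW distance mentioned above, the sketch below implements the classic dynamic-programming recurrence, keeping only two rows of the cost table. It is the standard full-grid computation, not SparseDTW, which additionally exploits similarity between the series to skip cells. The squared difference as local cost and the function name are assumptions.

```python
def dtw_distance(a, b):
    """Classic DTW distance between two numeric sequences.

    Assumption: the local cost is the squared difference of the two values.
    Only the previous and current rows of the cost table are stored.
    """
    n, m = len(a), len(b)
    if n == 0 or m == 0:
        raise ValueError("both sequences must be non-empty")
    inf = float("inf")
    prev = [inf] * (m + 1)
    prev[0] = 0.0  # cost of aligning two empty prefixes
    for i in range(1, n + 1):
        curr = [inf] * (m + 1)
        for j in range(1, m + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            # Extend the cheapest of: insertion, deletion, or match.
            curr[j] = cost + min(prev[j], curr[j - 1], prev[j - 1])
        prev = curr
    return prev[m]


# Warping lets the repeated 2 align with a single 2 at no cost.
same_shape = dtw_distance([1, 2, 3], [1, 2, 2, 3])
```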
Internal waves in fluid flows. Possible coexistence with turbulence
Waves in fluid flows represent the underlying theme of this research work. Wave interactions in fluid flows are part of multidisciplinary physics. It is known that many ideas and phenomena recur in such apparently diverse fields as solar physics, meteorology, oceanography, aeronautical and hydraulic engineering, optics, and population dynamics. In short, waves in fluids include, on the one hand, surface and internal waves, their evolution, interaction and associated wave-driven mean flows; on the other hand, phenomena related to nonlinear hydrodynamic stability and, in particular, those leading to the onset of turbulence. Close similarities and key differences exist between these two classes of phenomena. In the hope of gaining hints toward a possible overall picture, this study considers two different systems located at the opposite limits of the range of existing physical fluid flow situations: first, sheared parallel continuum flows (perfect incompressibility and charge neutrality); second, the solar wind (extreme rarefaction and electrical conductivity). Therefore, the activity carried out during the doctoral period consists of two parts. The first is focused on the propagation properties of small internal waves in parallel flows. This work was partly carried out in the framework of a MISTI-Seeds MITOR project proposed by Prof. D. Tordella (PoliTo) and Prof. G. Staffilani (MIT) on the long-term interaction in fluid flows. The second part regards the analysis of solar-wind fluctuations from in situ measurements by the Voyager spacecraft at the edge of the heliosphere. This work was supported by a second MISTI-Seeds MITOR project, proposed by D. Tordella (PoliTo) and J. D. Richardson (MIT, Kavli Institute), with the collaboration of M. Opher (BU).