    SoK: Chasing Accuracy and Privacy, and Catching Both in Differentially Private Histogram Publication

    Histograms and synthetic data are of key importance in data analysis. However, researchers have shown that even aggregated data such as histograms, containing no obvious sensitive attributes, can result in privacy leakage. To enable data analysis, a strong notion of privacy is required to avoid risking unintended privacy violations. One such notion is differential privacy, a statistical notion of privacy that makes privacy leakage quantifiable. The caveat is that while differential privacy offers strong privacy guarantees, privacy comes at a cost in accuracy. Although this trade-off is a central issue in the adoption of differential privacy, there is a gap in the literature when it comes to understanding the trade-off and how to address it appropriately. Through a systematic literature review (SLR), we investigate the state of the art in accuracy-improving differentially private algorithms for histogram and synthetic data publishing. Our contribution is two-fold: 1) we identify trends and connections in the contributions to the field of differential privacy for histograms and synthetic data, and 2) we provide an understanding of the privacy/accuracy trade-off challenge by crystallizing different dimensions of accuracy improvement. Accordingly, we position and visualize the ideas in relation to each other and to external work, and deconstruct each algorithm to examine its building blocks separately, with the aim of pinpointing which dimension of accuracy improvement each technique or approach targets. Hence, this systematization of knowledge (SoK) provides an understanding of the dimensions in which, and the means by which, accuracy improvement can be pursued without sacrificing privacy.
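
To make the privacy/accuracy trade-off concrete, the following is a minimal sketch of the standard Laplace mechanism applied to a histogram. It is a common baseline, not one of the accuracy-improving algorithms the abstract surveys. The function name `dp_histogram` and its parameters are illustrative assumptions. The sketch also assumes add/remove-one-record neighbouring datasets (so each record affects exactly one bin count by 1) and bin edges fixed independently of the data.

```python
import numpy as np

def dp_histogram(values, edges, epsilon, rng=None):
    """Release a histogram with epsilon-differential privacy (Laplace mechanism).

    Assumptions (illustrative, not taken from the abstract):
    - `edges` are chosen without looking at the data;
    - neighbouring datasets differ by adding or removing one record,
      so the L1 sensitivity of the vector of bin counts is 1.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = np.random.default_rng() if rng is None else rng
    counts, edges = np.histogram(values, bins=edges)
    # Laplace noise with scale = sensitivity / epsilon = 1 / epsilon per bin.
    noise = rng.laplace(loc=0.0, scale=1.0 / epsilon, size=counts.shape)
    return counts + noise, edges
```

A smaller `epsilon` gives stronger privacy but a larger noise scale, which is the accuracy cost the abstract refers to. The released counts may be negative or non-integer; post-processing them (for example, clipping at zero) does not weaken the privacy guarantee.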

    Demographic inference from multiple whole genomes using a particle filter for continuous Markov jump processes

    Demographic events shape a population's genetic diversity, a process described by the coalescent-with-recombination model that relates demography and genetics by an unobserved sequence of genealogies along the genome. As the space of genealogies over genomes is large and complex, inference under this model is challenging. Formulating the coalescent-with-recombination model as a continuous-time and -space Markov jump process, we develop a particle filter for such processes, and use waypoints that under appropriate conditions allow the problem to be reduced to the discrete-time case. To improve inference, we generalise the Auxiliary Particle Filter for discrete-time models, and use Variational Bayes to model the uncertainty in parameter estimates for rare events, avoiding biases seen with Expectation Maximization. Using real and simulated genomes, we show that past population sizes can be accurately inferred over a larger range of epochs than was previously possible, opening the possibility of jointly analyzing multiple genomes under complex demographic models. Code is available at https://github.com/luntergroup/smcsmc.

    Data Mining with Newton's Method.

    Capable and well-organized data mining algorithms are fundamental to successful knowledge discovery in databases. We discuss several data mining algorithms, including genetic algorithms (GAs). In addition, we propose a modified multivariate Newton's method (NM) approach to data mining of technical data. Several strategies are employed to stabilize Newton's method against pathological function behavior. NM is compared to GAs and to the simplex evolutionary operation algorithm (EVOP). We find that GAs, NM, and EVOP all perform efficiently for well-behaved global optimization functions, with NM providing an exponential improvement in convergence rate. For local optimization problems, we find that GAs and EVOP do not provide the desired convergence rate, accuracy, or precision compared to NM for technical data. We find that GAs are favored for their simplicity, while NM would be favored for its performance.
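
The abstract does not say which stabilization strategies the authors use. As a hedged illustration only, the sketch below shows two common ones for multivariate Newton's method: regularising the Hessian until it is positive definite, and a backtracking (Armijo) line search. The function name `damped_newton` and all tolerances are assumptions made for this example.

```python
import numpy as np

def damped_newton(f, grad, hess, x0, tol=1e-8, max_iter=100):
    """Minimise f with a regularised, line-searched Newton iteration.

    This is a generic textbook variant, not the modified method
    proposed in the paper.
    """
    x = np.asarray(x0, dtype=float)
    identity = np.eye(x.size)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        H = hess(x)
        # Increase the diagonal shift until H + lam*I is positive definite,
        # so that the resulting step is a descent direction.
        lam = 0.0
        while True:
            try:
                np.linalg.cholesky(H + lam * identity)
                break
            except np.linalg.LinAlgError:
                lam = max(10.0 * lam, 1e-6)
        step = np.linalg.solve(H + lam * identity, -g)
        # Backtracking line search: halve the step until the Armijo
        # sufficient-decrease condition holds.
        t, fx = 1.0, f(x)
        while f(x + t * step) > fx + 1e-4 * t * g.dot(step) and t > 1e-12:
            t *= 0.5
        x = x + t * step
    return x
```

On a well-behaved function the full step (`t = 1`, `lam = 0`) is accepted and the iteration reduces to plain Newton's method; the safeguards only act where the Hessian is indefinite or the full step overshoots.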

    Digital Advertising: the Measure of Mobile Visits Lifts

    Mobile-phone advertising enables marketers to reach customers at a personal level, and it enables the measurement of customers' reactions by novel approaches, in real time, and at scale. By keeping a device anonymous, we can deliver custom adverts and check when the device owner visits a specific brick-and-mortar location. This is the first step in a sale. By measuring visits and sales, the original marketers can determine their return on advertising and prove the efficacy of their marketing investments. We turn our attention to the measurement of lift: we define it as the visit acceleration during the campaign flight with respect to a controlled baseline. We present a theoretical description; we describe a general and a simplified approach to composing the exposed and the control baseline; we develop two different vertical approaches with different comparable solutions; finally, we present how to carry out the experiments and the measurements for a few dozen campaigns. These campaigns range from hundreds of thousands of devices, counting a few hundred visits to a handful of locations, to sixty million devices, counting millions of visits to thousands of locations. We care about experiments at scale.
    Comment: 27 pages, 18 figures

    Distributed Differential Privacy and Applications

    Recent growth in the size and scope of databases has resulted in more research into making productive use of this data. Unfortunately, a significant remaining stumbling block is protecting the privacy of the individuals who populate these datasets. As people spend more time connected to the Internet, and conduct more of their daily lives online, privacy becomes a more important consideration, just as the data becomes more useful for researchers, companies, and individuals. As a result, plenty of important information remains locked down and unavailable to honest researchers today, due to fears that data leakages will harm individuals. Recent research in differential privacy opens a promising pathway to guarantee individual privacy while simultaneously making use of the data to answer useful queries. Differential privacy is a theory that provides provable information-theoretic guarantees on what any answer may reveal about any single individual in the database. This approach has resulted in a flurry of recent research, presenting novel algorithms that can perform a rich class of computations in this setting. In this dissertation, we focus on some of the challenges that arise when trying to provide differential privacy guarantees in practice. We design and build runtimes that achieve the mathematical differential privacy guarantee in the face of three real-world challenges: securing the runtimes against adversaries, enabling readers to verify that the answers are accurate, and dealing with data distributed across multiple domains.

    NEW METHODS FOR MINING SEQUENTIAL AND TIME SERIES DATA

    Data mining is the process of extracting knowledge from large amounts of data. It covers a variety of techniques aimed at discovering diverse types of patterns on the basis of the requirements of the domain. These techniques include association rules mining, classification, cluster analysis and outlier detection. The availability of applications that produce massive amounts of spatial, spatio-temporal (ST) and time series data (TSD) is the rationale for developing specialized techniques to excavate such data. In spatial data mining, the spatial co-location rule problem is different from the association rule problem, since there is no natural notion of transactions in spatial datasets that are embedded in continuous geographic space. Therefore, we have proposed an efficient algorithm (GridClique) to mine interesting spatial co-location patterns (maximal cliques). These patterns are used as the raw transactions for an association rule mining technique to discover complex co-location rules. Our proposal includes certain types of complex relationships – especially negative relationships – in the patterns. The relationships can be obtained from only the maximal clique patterns, which have never been used until now. Our approach is applied on a well-known astronomy dataset obtained from the Sloan Digital Sky Survey (SDSS). ST data is continuously collected and made accessible in the public domain. We present an approach to mine and query large ST data with the aim of finding interesting patterns and understanding the underlying process of data generation. An important class of queries is based on the flock pattern. A flock is a large subset of objects moving along paths close to each other for a predefined time. 
One approach to processing a “flock query” is to map ST data into high-dimensional space and to reduce the query to a sequence of standard range queries that can be answered using a spatial indexing structure; however, the performance of spatial indexing structures rapidly deteriorates in high-dimensional space. This thesis sets out a preprocessing strategy that uses a random projection to reduce the dimensionality of the transformed space. We use probabilistic arguments to prove the accuracy of the projection and present experimental results that show the possibility of managing the curse of dimensionality in an ST setting by combining random projections with traditional data structures. In time series data mining, we devised a new space-efficient algorithm (SparseDTW) to compute the dynamic time warping (DTW) distance between two time series, which always yields the optimal result. This is in contrast to other approaches, which typically sacrifice optimality to attain space efficiency. The main idea behind our approach is to dynamically exploit the existence of similarity and/or correlation between the time series: the greater the similarity between the time series, the less space is required to compute the DTW between them. Other techniques for speeding up DTW impose a priori constraints and do not exploit similarity characteristics that may be present in the data. Our experiments demonstrate that SparseDTW outperforms these approaches. By applying the SparseDTW algorithm, we discover an interesting pattern, “pairs trading”, in a large stock-market dataset of daily index prices from the Australian stock exchange (ASX) from 1980 to 2002.
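
For reference, the DTW distance mentioned above can be computed with the textbook dynamic program sketched below. This is explicitly not SparseDTW, whose details the abstract does not give; it is the unconstrained baseline that fills every cell of the cost table. The function name `dtw_distance` and the choice of absolute difference as the local cost are assumptions made for this example.

```python
def dtw_distance(a, b):
    """Exact DTW distance between two numeric sequences.

    Textbook O(len(a) * len(b)) dynamic program with absolute difference
    as the local cost. Only two rows of the table are kept, so the
    distance is returned but the warping path is not recoverable.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    prev = [inf] * (m + 1)
    prev[0] = 0.0  # aligning two empty prefixes costs nothing
    for i in range(1, n + 1):
        curr = [inf] * (m + 1)
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible predecessors:
            # step in a only, step in b only, or step in both.
            curr[j] = cost + min(prev[j], curr[j - 1], prev[j - 1])
        prev = curr
    return prev[m]
```

For example, `dtw_distance([1, 2, 3], [1, 2, 2, 3])` is `0.0`, because warping lets the repeated `2` align with the single `2` at no cost.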

    Internal waves in fluid flows. Possible coexistence with turbulence

    Waves in fluid flows are the underlying theme of this research work. Wave interactions in fluid flows are part of multidisciplinary physics. It is known that many ideas and phenomena recur in such apparently diverse fields as solar physics, meteorology, oceanography, aeronautical and hydraulic engineering, optics, and population dynamics. In brief, waves in fluids include, on the one hand, surface and internal waves, their evolution, interaction and associated wave-driven mean flows; on the other hand, phenomena related to nonlinear hydrodynamic stability and, in particular, those leading to the onset of turbulence. Close similarities and key differences exist between these two classes of phenomena. In the hope of gaining hints toward a potential overall vision, this study considers two different systems located at opposite limits of the range of existing physical fluid flow situations: first, sheared parallel continuum flows (perfect incompressibility and charge neutrality); second, the solar wind (extreme rarefaction and electrical conductivity). Accordingly, the activity carried out during the doctoral period consists of two parts. The first is focused on the propagation properties of small internal waves in parallel flows. This work was partly carried out in the framework of a MISTI-Seeds MITOR project proposed by Prof. D. Tordella (PoliTo) and Prof. G. Staffilani (MIT) on the long-term interaction in fluid flows. The second part concerns the analysis of solar-wind fluctuations from in situ measurements by the Voyager spacecraft at the edge of the heliosphere. This work was supported by a second MISTI-Seeds MITOR project, proposed by D. Tordella (PoliTo) and J. D. Richardson (MIT, Kavli Institute), with the collaboration of M. Opher (BU).