Near-Optimal Algorithms for Differentially-Private Principal Components

Author: Chaudhuri Kamalika
Sarwate Anand D.
Sinha Kaushik
Publication date: 07/08/2013

Principal components analysis (PCA) is a standard tool for identifying good low-dimensional approximations to high-dimensional data. Many data sets of interest contain private or sensitive information about individuals. Algorithms that operate on such data should be sensitive to the privacy risks in publishing their outputs. Differential privacy is a framework for developing tradeoffs between privacy and the utility of these outputs. In this paper we investigate the theory and empirical performance of differentially private approximations to PCA and propose a new method which explicitly optimizes the utility of the output. We show that the sample complexity of the proposed method differs from the existing procedure in the scaling with the data dimension, and that our method is nearly optimal in terms of this scaling. We furthermore illustrate our results, showing that on real data there is a large performance gap between the existing method and our method.

Comment: 37 pages, 8 figures; final version to appear in the Journal of Machine Learning Research, preliminary version was at NIPS 201

arXiv.org e-Print Archive

CiteSeerX

Shocker Open Access Repository

Event Stream Processing with Multiple Threads

Author: DA Basin
G Graefe
H Nazarpour
J Ha
JJ Harrow
L Kuhtz
M Paes
PMG Apers
S Berkovich
S Hallé
S Qadeer
S Savage
Publication date: 09/07/2017

Current runtime verification tools seldom make use of multi-threading to speed up the evaluation of a property on a large event trace. In this paper, we present an extension to the BeepBeep 3 event stream engine that allows the use of multiple threads during the evaluation of a query. Various parallelization strategies are presented and described on simple examples. The implementation of these strategies is then evaluated empirically on a sample of problems. Compared to the previous, single-threaded version of the BeepBeep engine, the allocation of just a few threads to specific portions of a query provides a dramatic improvement in running time.

arXiv.org e-Print Archive

Crossref

An Optimal Algorithm for Sliding Window Order Statistics

Author: Raykov Pavel
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 26th International Conference on Database Theory (ICDT 2023)
Publication date: 01/01/2023

Assume there is a data stream of elements and a window of size m. Sliding window algorithms compute various statistic functions over the last m elements of the data stream seen so far. The time complexity of a sliding window algorithm is measured as the time required to output an updated statistic function value every time a new element is read. For example, it is well known that computing the sliding window maximum/minimum has time complexity O(1) while computing the sliding window median has time complexity O(log m). In this paper we close the gap between these two cases by (1) presenting an algorithm for computing the sliding window k-th smallest element in O(log k) time and (2) proving that this time complexity is optimal.
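To make the sliding window maximum case concrete, here is a minimal sketch of the classic monotonic-deque technique, which spends amortized O(1) time per new element. It is a textbook illustration only, not the k-th smallest algorithm from the paper; the function name `sliding_window_max` and its parameters are hypothetical.

```python
from collections import deque


def sliding_window_max(stream, m):
    """Yield the maximum of the last m elements after each new element.

    Hypothetical helper for illustration; amortized O(1) work per element.
    """
    # Each entry is (index, value). Values are kept strictly decreasing
    # from front to back, so the front is always the window maximum.
    candidates = deque()
    for i, x in enumerate(stream):
        # Older elements that are <= x can never be the maximum again.
        while candidates and candidates[-1][1] <= x:
            candidates.pop()
        candidates.append((i, x))
        # Drop the front entry once it has slid out of the window.
        if candidates[0][0] <= i - m:
            candidates.popleft()
        yield candidates[0][1]


if __name__ == "__main__":
    print(list(sliding_window_max([1, 3, 2, 5, 4, 1, 0], 3)))
```

Each element is appended once and removed at most once, which is where the amortized constant cost per update comes from. For the first m - 1 elements the sketch reports the maximum of the elements seen so far.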

Dagstuhl Research Online Publication Server

Self-supervised automated wrapper generation for weblog data extraction

Author: A. Laender
B. Adelberg
C. Kohlschütter
I. Muslea
N. Kushmerick
P. Geibel
R. Baumgartner
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2013

Data extraction from the web is notoriously hard. Of the types of resources available on the web, weblogs are becoming increasingly important due to the continued growth of the blogosphere, but remain poorly explored. Past approaches to data extraction from weblogs have often involved manual intervention and suffer from low scalability. This paper proposes a fully automated information extraction methodology based on the use of web feeds and processing of HTML. The approach includes a model for generating a wrapper that exploits web feeds for deriving a set of extraction rules automatically. Instead of performing a pairwise comparison between posts, the model matches the values of the web feeds against their corresponding HTML elements retrieved from multiple weblog posts. It adopts a probabilistic approach for deriving a set of rules and automating the process of wrapper generation. An evaluation of the model is conducted on a dataset of 2,393 posts and the results (92% accuracy) show that the proposed technique enables robust extraction of weblog properties and can be applied across the blogosphere for applications such as improved information retrieval and more robust web preservation initiatives.

Durham Research Online

Crossref

UCL Discovery

Warwick Research Archives Portal Repository

The state of peer-to-peer network simulators

Author: Agosti M.
Anirban Basu
Annapureddy S.
Barcellos M.
Baumgart I.
Boufkhad Y.
Cheng B.
Clarke I.
Dabek F.
de Vogeleer K.
Doulkeridis C.
Ghinita G.
Haridasan M.
Huebsch R. J.
Ian Wakeman
Iliofotou M.
James Stanier
Johansson B.
Leonini L.
Likert R.
Mavlankar A.
Maymounkov P.
Naicken S.
Ren D.
Rosenberg J.
Rowstron A. I. T.
Simon Fleming
Stephen Naicken
Stingl D.
Urban P.
Vijay K. Gurbani
Wang S.
Webb S.
Zantout B.
Zhang D.
Zhao B.
Zhou Y.
Zhu W.
Publication venue: Association for Computing Machinery (ACM)
Publication date: 01/08/2013

Networking research often relies on simulation in order to test and evaluate new ideas. An important requirement of this process is that results must be reproducible so that other researchers can replicate, validate and extend existing work. We look at the landscape of simulators for research in peer-to-peer (P2P) networks by conducting a survey of a combined total of over 280 papers from before and after 2007 (the year of the last survey in this area), and comment on the large quantity of research using bespoke, closed-source simulators. We propose a set of criteria that P2P simulators should meet, and poll the P2P research community for their agreement. We aim to drive the community towards performing their experiments on simulators that allow others to validate their results.

Crossref

Sussex Research Online

Differentially Private Empirical Risk Minimization

Author: Anand D. Sarwate
Claire Monteleoni
Kamalika Chaudhuri
Nicolas Vayatis
Publication date: 01/01/2011

Privacy-preserving machine learning algorithms are crucial for the increasingly common setting in which personal data, such as medical or financial records, are analyzed. We provide general techniques to produce privacy-preserving approximations of classifiers learned via (regularized) empirical risk minimization (ERM). These algorithms are private under the $\epsilon$-differential privacy definition due to Dwork et al. (2006). First we apply the output perturbation ideas of Dwork et al. (2006) to ERM classification. Then we propose a new method, objective perturbation, for privacy-preserving machine learning algorithm design. This method entails perturbing the objective function before optimizing over classifiers. If the loss and regularizer satisfy certain convexity and differentiability criteria, we prove theoretical results showing that our algorithms preserve privacy, and provide generalization bounds for linear and nonlinear kernels. We further present a privacy-preserving technique for tuning the parameters in general machine learning algorithms, thereby providing end-to-end privacy guarantees for the training process. We apply these results to produce privacy-preserving analogues of regularized logistic regression and support vector machines. We obtain encouraging results from evaluating their performance on real demographic and benchmark data sets. Our results show that both theoretically and empirically, objective perturbation is superior to the previous state-of-the-art, output perturbation, in managing the inherent tradeoff between privacy and learning performance.

Comment: 40 pages, 7 figures, accepted to the Journal of Machine Learning Researc
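As a rough illustration of the output perturbation idea mentioned in the abstract, the sketch below adds noise with density proportional to exp(-epsilon * ||b|| / sensitivity) to an already-trained weight vector. It is a sketch under stated assumptions, not the paper's algorithm: the function names are hypothetical, and the caller is assumed to supply a valid bound `sensitivity` on how far (in L2 norm) the trained weights can move when one record changes. That bound depends on the loss, the regularizer, and the sample size, and is not derived here.

```python
import math
import random


def sample_noise(dim, sensitivity, epsilon, rng=random):
    """Draw b in R^dim with density proportional to exp(-epsilon*||b||/sensitivity).

    Hypothetical helper. The norm of such a vector follows a Gamma
    distribution with shape dim and scale sensitivity/epsilon, and its
    direction is uniform on the unit sphere.
    """
    scale = sensitivity / epsilon
    norm = rng.gammavariate(dim, scale)
    # A normalized standard Gaussian vector is uniform on the sphere.
    direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    length = math.sqrt(sum(g * g for g in direction))
    return [norm * g / length for g in direction]


def output_perturbation(weights, sensitivity, epsilon, rng=random):
    """Return the trained weights plus calibrated noise (assumed sensitivity bound)."""
    noise = sample_noise(len(weights), sensitivity, epsilon, rng)
    return [w + b for w, b in zip(weights, noise)]


if __name__ == "__main__":
    rng = random.Random(0)
    trained = [0.5, -1.0, 2.0]  # stand-in for a non-private ERM solution
    print(output_perturbation(trained, sensitivity=0.1, epsilon=1.0, rng=rng))
```

A smaller `epsilon` or a larger `sensitivity` increases the expected noise norm, which is `dim * sensitivity / epsilon` in this sketch. Objective perturbation, the method the abstract reports as superior, instead adds a random term to the training objective before optimization and is not shown here.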

arXiv.org e-Print Archive

CiteSeerX

The Swiss army knife of time series data mining: ten useful things you can do with the matrix profile and ten lines of code

Author: Almaslukh Abdulaziz
Dau Hoang Anh
Funning Gareth
Gharghabi Shaghayegh
Kamgar Kaveh
Keogh Eamonn
Mueen Abdullah
Shakibay Senobari Nader
Silva Diego Furtado
Yeh Chin-Chia Michael
Zhu Yan
Zimmerman Zachary
Publication venue: eScholarship, University of California
Publication date: 01/07/2020

eScholarship - University of California