125 research outputs found

Exploring Sketches for Probability Estimation with Sublinear Memory

Author: Brown Gavin
Kleerekoper Anthony
Lujan Mikel
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 23/12/2013

As data sets become ever larger it becomes increasingly complex to apply traditional machine learning techniques to them. Feature selection can greatly reduce the computational requirements of machine learning but it too can be memory intensive. In this paper we explore the use of succinct data structures called sketches for probability estimation as a component of information theoretic feature selection. These data structures are sublinear in the number of items but were designed only for estimating the frequency of the most frequent items. To the best of our knowledge this is the first time they have been examined for estimating the frequency of all items and we find that often some information theoretic measures can be estimated to within a few percent of the correct values

E-space: Manchester Metropolitan University's Research Repository

Algorithmic Techniques for Processing Data Streams

Author: Ikonomovska Elena
Zelke Mariano
Publication venue: Dagstuhl Follow-Ups. Data Exchange, Integration, and Streams
Publication date: 01/01/2013

We survey some algorithmic techniques for processing data streams. After covering the basic methods of sampling and sketching, we present more evolved procedures that build on those basic ones. In particular, we examine algorithmic schemes for similarity mining, the concept of group testing, and techniques for clustering and summarizing data streams

Dagstuhl Research Online Publication Server

Approximating Properties of Data Streams

Author: Zhou Samson
Publication venue: 'Purdue University (bepress)'
Publication date: 01/01/2018

In this dissertation, we present algorithms that approximate properties in the data stream model, where elements of an underlying data set arrive sequentially, but algorithms must use space sublinear in the size of the underlying data set. We first study the problem of finding all k-periods of a length-n string S, presented as a data stream. S is said to have k-period p if its prefix of length n − p differs from its suffix of length n − p in at most k locations. We give algorithms to compute the k-periods of a string S using poly(k, log n) bits of space and we complement these results with comparable lower bounds. We then study the problem of identifying a longest substring of strings S and T of length n that forms a d-near-alignment under the edit distance, in the simultaneous streaming model. In this model, symbols of strings S and T are streamed at the same time and form a d-near-alignment if the distance between them in some given metric is at most d. We give several algorithms, including an exact one-pass algorithm that uses O(d^2 + d log n) bits of space. We then consider the distinct elements and \ell_p-heavy hitters problems in the sliding window model, where only the most recent n elements in the data stream form the underlying set. We first introduce the composable histogram, a simple twist on the exponential (Datar et al., SODA 2002) and smooth histograms (Braverman and Ostrovsky, FOCS 2007) that may be of independent interest. We then show that the composable histogram along with a careful combination of existing techniques to track either the identity or frequency of a few specific items suffices to obtain algorithms for both distinct elements and \ell_p-heavy hitters that are nearly optimal in both n and c. Finally, we consider the problem of estimating the maximum weighted matching of a graph whose edges are revealed in a streaming fashion. We develop a reduction from the maximum weighted matching problem to the maximum cardinality matching problem that only doubles the approximation factor of a streaming algorithm developed for the maximum cardinality matching problem. As an application, we obtain an estimator for the weight of a maximum weighted matching in bounded-arboricity graphs and in particular, a (48 + \epsilon)-approximation estimator for the weight of a maximum weighted matching in planar graphs
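
The k-period definition in this abstract can be checked directly on a string held in memory. The sketch below is only an illustration of that definition, not the dissertation's streaming algorithm: it stores the whole string and uses linear time per candidate period. The function names and the choice to test every p from 1 to n are assumptions made for this example.

```python
def mismatches(a: str, b: str) -> int:
    """Count positions where two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def has_k_period(s: str, p: int, k: int) -> bool:
    """Return True if the prefix and suffix of length n - p differ in at most k places."""
    n = len(s)
    if not 0 < p <= n:
        raise ValueError("p must satisfy 0 < p <= len(s)")
    # s[:n - p] is the prefix of length n - p; s[p:] is the suffix of the same length.
    return mismatches(s[: n - p], s[p:]) <= k


def k_periods(s: str, k: int) -> list[int]:
    """List every p in 1..n that is a k-period of s (assumed candidate range)."""
    return [p for p in range(1, len(s) + 1) if has_k_period(s, p, k)]
```

For example, `has_k_period("abcabcabd", 3, 1)` holds because "abcabc" and "abcabd" differ in one position, while the same call with k = 0 does not. Note that p = n is always reported here, since both the prefix and the suffix are then empty.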

Purdue E-Pubs

How to Make Your Approximation Algorithm Private: A Black-Box Differentially-Private Transformation for Tunable Approximation Algorithms of Functions with Low Sensitivity

Author: Blocki Jeremiah
Grigorescu Elena
Mukherjee Tamalika
Zhou Samson
Publication date: 07/10/2022

We develop a framework for efficiently transforming certain approximation algorithms into differentially-private variants, in a black-box manner. Our results focus on algorithms A that output an approximation to a function f of the form (1-a)f(x)-k <= A(x) <= (1+a)f(x)+k, where 0 <= a < 1 is a parameter that can be "tuned" to small-enough values while incurring only a poly blowup in the running time/space. We show that such algorithms can be made DP without sacrificing accuracy, as long as the function f has small global sensitivity. We achieve these results by applying the smooth sensitivity framework developed by Nissim, Raskhodnikova, and Smith (STOC 2007). Our framework naturally applies to transform non-private FPRAS (resp. FPTAS) algorithms into (\epsilon,\delta)-DP (resp. \epsilon-DP) approximation algorithms. We apply our framework in the context of sublinear-time and sublinear-space algorithms, while preserving the nature of the algorithm in meaningful ranges of the parameters. Our results include the first (to the best of our knowledge) (\epsilon,\delta)-edge DP sublinear-time algorithm for estimating the number of triangles, the number of connected components, and the weight of a MST of a graph, as well as a more efficient algorithm (while sacrificing pure DP in contrast to previous results) for estimating the average degree of a graph. In the area of streaming algorithms, our results include (\epsilon,\delta)-DP algorithms for estimating L_p-norms, distinct elements, and weighted MST for both insertion-only and turnstile streams. Our transformation also provides a private version of the smooth histogram framework, which is commonly used for converting streaming algorithms into sliding window variants, and achieves a multiplicative approximation to many problems, such as estimating L_p-norms, distinct elements, and the length of the longest increasing subsequence

arXiv.org e-Print Archive

Dagstuhl Research Online Publication Server

Fractional Hitting Sets for Efficient and Lightweight Genomic Data Sketching

Author: Limasset Antoine
Marchet Camille
Martayan Igor
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 23rd International Workshop on Algorithms in Bioinformatics (WABI 2023)
Publication date: 01/01/2023

The exponential increase in publicly available sequencing data and genomic resources necessitates the development of highly efficient methods for data processing and analysis. Locality-sensitive hashing techniques have successfully transformed large datasets into smaller, more manageable sketches while maintaining comparability using metrics such as Jaccard and containment indices. However, fixed-size sketches encounter difficulties when applied to divergent datasets. Scalable sketching methods, such as Sourmash, provide valuable solutions but still lack resource-efficient, tailored indexing. Our objective is to create lighter sketches with comparable results while enhancing efficiency. We introduce the concept of Fractional Hitting Sets, a generalization of Universal Hitting Sets, which uniformly cover a specified fraction of the k-mer space. In theory and practice, we demonstrate the feasibility of achieving such coverage with simple but highly efficient schemes. By encoding the covered k-mers as super-k-mers, we provide a space-efficient exact representation that also enables optimized comparisons. Our novel tool, SuperSampler, implements this scheme, and experimental results with real bacterial collections closely match our theoretical findings. In comparison to Sourmash, SuperSampler achieves similar outcomes while utilizing an order of magnitude less space and memory and operating several times faster. This highlights the potential of our approach in addressing the challenges presented by the ever-expanding landscape of genomic data
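
To make the Jaccard and containment indices on sampled k-mers concrete, here is a minimal sketch of hash-threshold subsampling, where a k-mer is kept when its hash falls below a chosen fraction of the hash range. This is a generic illustration of fraction-based sketching and is not the Fractional Hitting Sets scheme or the SuperSampler implementation described in the abstract. The function names, the BLAKE2b hash, and the 64-bit hash range are all assumptions for this example.

```python
import hashlib


def kmer_set(seq: str, k: int) -> set[str]:
    """All distinct substrings of length k in seq."""
    return {seq[i : i + k] for i in range(len(seq) - k + 1)}


def hash64(kmer: str) -> int:
    """Deterministic 64-bit hash of a k-mer (hash choice is an assumption)."""
    return int.from_bytes(hashlib.blake2b(kmer.encode(), digest_size=8).digest(), "big")


def fractional_sketch(seq: str, k: int, fraction: float) -> set[int]:
    """Keep the hashes of k-mers that fall below fraction * 2**64."""
    threshold = int(fraction * 2**64)
    return {h for h in map(hash64, kmer_set(seq, k)) if h < threshold}


def jaccard(a: set[int], b: set[int]) -> float:
    """|a & b| / |a | b|, defined here as 0.0 when both sketches are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def containment(a: set[int], b: set[int]) -> float:
    """Fraction of a that also appears in b, defined here as 0.0 for empty a."""
    return len(a & b) / len(a) if a else 0.0
```

Because every sequence is filtered with the same threshold, two sketches stay comparable, and the sketch size grows with the number of distinct k-mers instead of being fixed in advance. With `fraction=1.0` every k-mer is kept and the indices are computed on the full k-mer sets.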

Dagstuhl Research Online Publication Server

ALGORITHMS FOR MASSIVE, EXPENSIVE, OR OTHERWISE INCONVENIENT GRAPHS

Author: Tench David
Publication venue: ScholarWorks@UMass Amherst
Publication date: 18/12/2020

A long-standing assumption common in algorithm design is that any part of the input is accessible at any time for unit cost. However, as we work with increasingly large data sets, or as we build smaller devices, we must revisit this assumption. In this thesis, I present some of my work on graph algorithms designed for circumstances where traditional assumptions about inputs do not apply. 1. Classical graph algorithms require direct access to the input graph and this is not feasible when the graph is too large to fit in memory. For computation on massive graphs we consider the dynamic streaming graph model. Given an input graph defined as a stream of edge insertions and deletions, our goal is to approximate properties of this graph using space that is sublinear in the size of the stream. In this thesis, I present algorithms for approximating vertex connectivity, hypergraph edge connectivity, maximum coverage, unique coverage, and temporal connectivity in graph streams. 2. In certain applications the input graph is not explicitly represented, but its edges may be discovered via queries which require costly computation or measurement. I present two open-source systems which solve real-world problems via graph algorithms which may access their inputs only through costly edge queries. Mesh is a memory manager which compacts memory efficiently by finding an approximate graph matching subject to stringent time and edge query restrictions. PathCache is an efficiently scalable network measurement platform that outperforms the current state of the art

ScholarWorks@UMass Amherst

These are not the k-mers you are looking for: efficient online k-mer counting using a probabilistic data structure

Author: Brown C. Titus
Canino-Koning Rosangela
Howe Adina Chuang
Pell Jason
Zhang Qingpeng
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 14/07/2014

K-mer abundance analysis is widely used for many purposes in nucleotide sequence analysis, including data preprocessing for de novo assembly, repeat detection, and sequencing coverage estimation. We present the khmer software package for fast and memory efficient online counting of k-mers in sequencing data sets. Unlike previous methods based on data structures such as hash tables, suffix arrays, and trie structures, khmer relies entirely on a simple probabilistic data structure, a Count-Min Sketch. The Count-Min Sketch permits online updating and retrieval of k-mer counts in memory which is necessary to support online k-mer analysis algorithms. On sparse data sets this data structure is considerably more memory efficient than any exact data structure. In exchange, the use of a Count-Min Sketch introduces a systematic overcount for k-mers; moreover, only the counts, and not the k-mers, are stored. Here we analyze the speed, the memory usage, and the miscount rate of khmer for generating k-mer frequency distributions and retrieving k-mer counts for individual k-mers. We also compare the performance of khmer to several other k-mer counting packages, including Tallymer, Jellyfish, BFCounter, DSK, KMC, Turtle and KAnalyze. Finally, we examine the effectiveness of profiling sequencing error, k-mer abundance trimming, and digital normalization of reads in the context of high khmer false positive rates. khmer is implemented in C++ wrapped in a Python interface, offers a tested and robust API, and is freely available under the BSD license at github.com/ged-lab/khmer
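
The behaviour the abstract attributes to the Count-Min Sketch (online updates, counts that may be too high but never too low, and no storage of the k-mers themselves) can be seen in a small textbook version of the structure. This is a generic sketch under stated assumptions and is not khmer's implementation or API: the class name, the salted BLAKE2b hashing, and the `width` and `depth` parameters are all chosen for this example.

```python
import hashlib
from typing import Iterator


class CountMinSketch:
    """Textbook Count-Min Sketch: depth rows of width counters each."""

    def __init__(self, width: int, depth: int) -> None:
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item: str, row: int) -> int:
        # One hash function per row, derived by salting with the row number.
        salt = row.to_bytes(8, "little")
        digest = hashlib.blake2b(item.encode(), digest_size=8, salt=salt).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, item: str) -> None:
        """Increment one counter in every row; the item itself is not stored."""
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += 1

    def count(self, item: str) -> int:
        """Minimum over rows: collisions can only inflate the estimate."""
        return min(self.table[row][self._index(item, row)] for row in range(self.depth))


def kmers(seq: str, k: int) -> Iterator[str]:
    """Yield every length-k substring of seq, in order."""
    for i in range(len(seq) - k + 1):
        yield seq[i : i + k]
```

Feeding `kmers("ACGTACGTAC", 4)` into `add` one k-mer at a time gives an estimate of at least 2 for "ACGT", its true count. Shrinking `width` saves memory but raises the overcount: with a single counter, every query returns the total number of k-mers seen.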

arXiv.org e-Print Archive

Directory of Open Access Journals

PubMed Central

Doctor of Philosophy

Author: Daruki Samira
Publication venue: University of Utah
Publication date: 01/01/2018

The contributions of this dissertation are centered around designing new algorithms in the general area of sublinear algorithms such as streaming, core sets and sublinear verification, with a special interest in problems arising from data analysis including data summarization, clustering, matrix problems and massive graphs. In the first part, we focus on summaries and coresets, which are among the main techniques for designing sublinear algorithms for massive data sets. We initiate the study of coresets for uncertain data and study coresets for various types of range counting queries on uncertain data. We focus mainly on the indecisive model of locational uncertainty since it comes up frequently in real-world applications when multiple readings of the same object are made. In this model, each uncertain point has a probability density describing its location, defined as k distinct locations. Our goal is to construct a subset of the uncertain points, including their locational uncertainty, so that range counting queries can be answered by examining only this subset. For each type of query we provide coreset constructions with approximation-size trade-offs. We show that random sampling can be used to construct each type of coreset, and we also provide significantly improved bounds using discrepancy-based techniques on axis-aligned range queries. In the second part, we focus on designing sublinear-space algorithms for approximate computations on massive graphs. In particular, we consider graph MAXCUT and correlation clustering problems and develop sampling based approaches to construct truly sublinear (o(n)) sized coresets for graphs that have polynomial (i.e., n^{\delta} for any \delta > 0) average degree. Our technique is based on analyzing properties of random induced subprograms of the linear program formulations of the problems. We demonstrate this technique with two examples. Firstly, we present a sublinear sized core set to approximate the value of the MAX CUT in a graph to a (1+\epsilon) factor. To the best of our knowledge, all the known methods in this regime rely crucially on near-regularity assumptions. Secondly, we apply the same framework to construct a sublinear-sized coreset for correlation clustering. Our coreset construction also suggests 2-pass streaming algorithms for computing the MAX CUT and correlation clustering objective values which are left as future work at the time of writing this dissertation. Finally, we focus on streaming verification algorithms as another model for designing sublinear algorithms. We give the first polylog space and sublinear (in number of edges) communication protocols for any streaming verification problems in graphs. We present efficient streaming interactive proofs that can verify maximum matching exactly. Our results cover all flavors of matchings (bipartite/nonbipartite and weighted). In addition, we also present streaming verifiers for approximate metric TSP and exact triangle counting, as well as for graph primitives such as the number of connected components, bipartiteness, minimum spanning tree and connectivity. In particular, these are the first results for weighted matchings and for metric TSP in any streaming verification model. Our streaming verifiers use only polylogarithmic space while exchanging only polylogarithmic communication with the prover in addition to the output size of the relevant solution. We also initiate a study of streaming interactive proofs (SIPs) for problems in data analysis and present efficient SIPs for some fundamental problems. We present protocols for clustering and shape fitting including minimum enclosing ball (MEB), width of a point set, k-centers and k-slab problem. We also present protocols for fundamental matrix analysis problems: We provide an improved protocol for rectangular matrix problems, which in turn can be used to verify k (approximate) eigenvectors of an n \times n integer matrix A. In general our solutions use polylogarithmic rounds of communication and polylogarithmic total communication and verifier space

The University of Utah: J. Willard Marriott Digital Library

Sketching for Large-Scale Learning of Mixture Models

Author: Bourrier Anthony
Gribonval Rémi
Keriven Nicolas
Pérez Patrick
Publication date: 20/03/2016

Learning parameters from voluminous data can be prohibitive in terms of memory and computational requirements. We propose a "compressive learning" framework where we estimate model parameters from a sketch of the training data. This sketch is a collection of generalized moments of the underlying probability distribution of the data. It can be computed in a single pass on the training set, and is easily computable on streams or distributed datasets. The proposed framework shares similarities with compressive sensing, which aims at drastically reducing the dimension of high-dimensional signals while preserving the ability to reconstruct them. To perform the estimation task, we derive an iterative algorithm analogous to sparse reconstruction algorithms in the context of linear inverse problems. We exemplify our framework with the compressive estimation of a Gaussian Mixture Model (GMM), providing heuristics on the choice of the sketching procedure and theoretical guarantees of reconstruction. We experimentally show on synthetic data that the proposed algorithm yields results comparable to the classical Expectation-Maximization (EM) technique while requiring significantly less memory and fewer computations when the number of database elements is large. We further demonstrate the potential of the approach on real large-scale data (over 10^8 training samples) for the task of model-based speaker verification. Finally, we draw some connections between the proposed framework and approximate Hilbert space embedding of probability distributions using random features. We show that the proposed sketching operator can be seen as an innovative method to design translation-invariant kernels adapted to the analysis of GMMs. We also use this theoretical framework to derive information preservation guarantees, in the spirit of infinite-dimensional compressive sensing
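
The claim that such a sketch can be computed in a single pass and on distributed data follows from the sketch being an average: partial sums from separate chunks can simply be added. The example below illustrates this with one possible family of generalized moments, averaged complex exponentials at random frequencies. That choice, the Gaussian frequency draw, and all names here are assumptions for illustration. The parameter estimation step described in the abstract is not shown.

```python
import cmath
import random


def draw_frequencies(m: int, dim: int, scale: float = 1.0, seed: int = 0) -> list[list[float]]:
    """Draw m random frequency vectors (Gaussian draw is an assumption)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, scale) for _ in range(dim)] for _ in range(m)]


class MomentSketch:
    """Running average of exp(i * <w_j, x>) over the data, one entry per frequency w_j."""

    def __init__(self, freqs: list[list[float]]) -> None:
        self.freqs = freqs
        self.sums = [0j] * len(freqs)
        self.count = 0

    def update(self, x) -> None:
        """Fold one data point into the sketch; the point is not kept afterwards."""
        for j, w in enumerate(self.freqs):
            phase = sum(wi * xi for wi, xi in zip(w, x))
            self.sums[j] += cmath.exp(1j * phase)
        self.count += 1

    def merge(self, other: "MomentSketch") -> None:
        """Combine with a sketch built from another chunk using the same frequencies."""
        self.sums = [a + b for a, b in zip(self.sums, other.sums)]
        self.count += other.count

    def value(self) -> list[complex]:
        """The sketch itself: empirical averages of the chosen moments."""
        return [s / self.count for s in self.sums]
```

Memory use depends only on the number of frequencies, not on the number of data points, and merging two sketches built on separate halves of a dataset gives the same values as one pass over the whole set, up to floating-point rounding.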

arXiv.org e-Print Archive

HAL-CentraleSupelec

Crossref

Hal - Université Grenoble Alpes

INRIA a CCSD electronic archive server

HAL-Rennes 1