Graph Sample and Hold: A Framework for Big-Graph Analytics

Authors: Ahmed Nesreen K.; Duffield Nick; Kompella Ramana; Neville Jennifer
Publication date: 16/03/2014

Sampling is a standard approach in big-graph analytics; the goal is to efficiently estimate graph properties by consulting a sample of the whole population. A perfect sample is assumed to mirror every property of the whole population. Unfortunately, such a perfect sample is hard to collect in complex populations such as graphs (e.g., web graphs, social networks), where an underlying network connects the units of the population. Therefore, a good sample will be representative in the sense that graph properties of interest can be estimated with a known degree of accuracy. While previous work focused particularly on sampling schemes used to estimate certain graph properties (e.g., triangle count), much less is known for the case when we need to estimate various graph properties with the same sampling scheme. In this paper, we propose a generic stream sampling framework for big-graph analytics, called Graph Sample and Hold (gSH). To begin, the proposed framework samples from massive graphs sequentially in a single pass, one edge at a time, while maintaining a small state. We then show how to produce unbiased estimators for various graph properties from the sample. Given that the graph analysis algorithms will run on a sample instead of the whole population, the runtime complexity of these algorithms is kept under control. Moreover, given that the estimators of graph properties are unbiased, the approximation error is kept under control. Finally, we show the performance of the proposed framework (gSH) on various types of graphs, such as social graphs, among others.
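The single-pass, one-edge-at-a-time idea in this abstract can be illustrated with a minimal sketch. It assumes a simplified rule that the abstract itself does not spell out: an arriving edge is kept with probability `q` if it touches an already sampled edge and with probability `p` otherwise, and the edge count is estimated by inverse-probability weighting. The names `sample_and_hold` and `estimate_edge_count` are hypothetical, and this is not presented as the authors' implementation.

```python
import random


def sample_and_hold(edge_stream, p, q, rng=None):
    """One pass over a stream of (u, v) edges, assumed to contain no duplicates.

    Assumption for illustration: an edge adjacent to a previously sampled
    edge is kept with probability q, any other edge with probability p.
    Returns a dict mapping each kept edge to the probability it was kept with.
    """
    rng = rng or random.Random()
    sampled = {}     # edge -> selection probability at arrival time
    touched = set()  # nodes incident to at least one sampled edge
    for u, v in edge_stream:
        prob = q if (u in touched or v in touched) else p
        if rng.random() < prob:
            sampled[(u, v)] = prob
            touched.update((u, v))
    return sampled


def estimate_edge_count(sampled):
    """Inverse-probability (Horvitz-Thompson style) estimate of the edge count."""
    return sum(1.0 / prob for prob in sampled.values())
```

With `p = q = 1.0` every edge is kept and the estimate equals the true edge count; smaller values trade accuracy for a smaller state.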

arXiv.org e-Print Archive

CiteSeerX

Quality Assessment of Linked Datasets using Probabilistic Approximation

Authors: A Hogan; AZ Broder; BH Bloom; C Guéret; JS Vitter; P Hitzler
Publication date: 17/03/2015

With the increasing application of Linked Open Data, assessing the quality of datasets by computing quality metrics becomes an issue of crucial importance. For large and evolving datasets, an exact, deterministic computation of the quality metrics is too time-consuming or expensive. We employ probabilistic techniques such as Reservoir Sampling, Bloom Filters and Clustering Coefficient estimation for implementing a broad set of data quality metrics in an approximate but sufficiently accurate way. Our implementation is integrated in the comprehensive data quality assessment framework Luzzu. We evaluated its performance and accuracy on Linked Open Datasets of broad relevance. Comment: 15 pages, 2 figures, to appear in ESWC 2015 proceedings.
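Reservoir sampling, one of the techniques named in this abstract, keeps a uniform random sample of fixed size from a stream of unknown length. The following is a minimal sketch of the classic variant (often called Algorithm R); the function name `reservoir_sample` is chosen here for illustration, and nothing in it reflects how Luzzu implements its metrics.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return a uniform random sample of up to k items from an iterable."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir
```

A metric computed over the reservoir (for example, the share of sampled items that pass a check) then serves as an approximation of the same metric over the full dataset.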

arXiv.org e-Print Archive

Crossref

Fraunhofer-ePrints

Optimization-driven sampling for analyzing big data streams

Author: Shih Minghung
Publication venue: Iowa State University Digital Repository
Publication date: 01/01/2020

Real-time processing over data streams has become a popular trend for data analysis. With more business applications relying on real-time data analysis to make decisions, traditional batch data processing has become insufficient. As the demand for streaming analysis grows, analyzing big data streams quickly and accurately is a major challenge to overcome. Sampling is a good approach to provide quick analysis over big data streams. Analyzing the sample gives us an approximation of the exact answer we obtain when analyzing the original data. By avoiding analyzing the entire stream, the processing time can be greatly reduced. However, sampling over data streams leads to the following challenges: (1) given a limited budget size, how to build a sample such that the accuracy of approximation over the sample is good? And (2) recent data are usually more valuable to some streaming analysis applications, e.g., a real-time intrusion detection system will focus on recent event logs. How to build a sample that weighs recent data more heavily and eliminates old data from the sample is another challenge. In this research, we propose an optimization-driven sampling (ODS) framework as a solution that aims at (1) providing a more accurate analysis over streaming data and (2) eliminating older data using the sliding window model. Based on how the sample will be analyzed, we formulate the sampling process as an optimization problem and derive an optimal sampling algorithm that will be followed when constructing and maintaining a sample over a data stream. We study ODS with different sample usages over data streams and discuss how to construct an optimal sample in those settings. We also study lower bounds of accuracy of an ODS sample collected from data streams. Experiments and evaluations were also conducted to show that our optimal sample can yield better analysis estimation compared to other existing streaming sampling methods.
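The sliding window model mentioned in this abstract can be illustrated with a simple baseline. The sketch below is not the ODS algorithm, which the abstract does not specify; it is a standard priority-based scheme that maintains one uniform random sample over the most recent `window` items. The class name `SlidingWindowSampler` and its methods are hypothetical.

```python
import random
from collections import deque


class SlidingWindowSampler:
    """Keep one uniformly random item from the last `window` stream items."""

    def __init__(self, window, rng=None):
        self.window = window
        self.rng = rng or random.Random()
        # Candidates as (index, priority, item); priorities strictly
        # increase from front to back, so the front holds the minimum.
        self.candidates = deque()
        self.count = 0

    def add(self, item):
        priority = self.rng.random()
        # Older candidates with a priority at least as large can never
        # become the minimum again, because they expire first.
        while self.candidates and self.candidates[-1][1] >= priority:
            self.candidates.pop()
        self.candidates.append((self.count, priority, item))
        self.count += 1
        # Drop candidates that have fallen out of the window.
        while self.candidates[0][0] < self.count - self.window:
            self.candidates.popleft()

    def sample(self):
        """Return the in-window item with the smallest random priority."""
        return self.candidates[0][2] if self.candidates else None
```

Because every item receives an independent random priority, the item with the smallest priority inside the window is a uniform draw from that window, and expired items leave the sample automatically.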

Digital Repository @ Iowa State University (ISU)

Random Forests for Big Data

Authors: Genuer Robin; Poggi Jean-Michel; Tuleau-Malot Christine; Villa-Vialaneix Nathalie
Publication date: 19/11/2015

Big Data is one of the major challenges of statistical science and has numerous consequences from algorithmic and theoretical viewpoints. Big Data always involve massive data, but they also often include online data and data heterogeneity. Recently some statistical methods have been adapted to process Big Data, like linear regression models, clustering methods and bootstrapping schemes. Based on decision trees combined with aggregation and bootstrap ideas, random forests were introduced by Breiman in 2001. They are a powerful nonparametric statistical method that handles, in a single and versatile framework, regression problems as well as two-class and multi-class classification problems. Focusing on classification problems, this paper proposes a selective review of available proposals that deal with scaling random forests to Big Data problems. These proposals rely on parallel environments or on online adaptations of random forests. We also describe how related quantities, such as out-of-bag error and variable importance, are addressed in these methods. Then, we formulate various remarks for random forests in the Big Data context. Finally, we experiment with five variants on two massive datasets (15 and 120 million observations), a simulated one as well as real-world data. One variant relies on subsampling, while three others are related to parallel implementations of random forests and involve either various adaptations of the bootstrap to Big Data or "divide-and-conquer" approaches. The fifth variant relates to online learning of random forests. These numerical experiments highlight the relative performance of the different variants, as well as some of their limitations.

arXiv.org e-Print Archive

Crossref

INRIA a CCSD electronic archive server

HAL Descartes

ProdInra

Hal-Diderot

View Registration Using Interesting Segments of Planar Trajectories

Authors: Alon Jonathan; Del Bimbo Alberto; Nunziati Walter; Sclaroff Stan
Publication venue: Boston University Computer Science Department
Publication date: 19/05/2005

We introduce a method for recovering the spatial and temporal alignment between two or more views of objects moving over a ground plane. Existing approaches either assume that the streams are globally synchronized, so that only the spatial alignment needs to be solved, or that the temporal misalignment is small enough that an exhaustive search can be performed. In contrast, our approach can recover both the spatial and temporal alignment. We compute for each trajectory a number of interesting segments, and we use their description to form putative matches between trajectories. Each pair of corresponding interesting segments induces a temporal alignment, and defines an interval of common support across two views of an object that is used to recover the spatial alignment. Interesting segments and their descriptors are defined using algebraic projective invariants measured along the trajectories. Similarity between interesting segments is computed taking into account the statistics of such invariants. Candidate alignment parameters are verified by checking the consistency, in terms of the symmetric transfer error, of all the putative pairs of corresponding interesting segments. Experiments are conducted with two different sets of data, one with two views of an outdoor scene featuring moving people and cars, and one with four views of a laboratory sequence featuring moving radio-controlled cars.
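The symmetric transfer error used for verification in this abstract is commonly defined, for a planar homography `H` and corresponding points `x` and `x'`, as the squared distance between `x'` and `Hx` plus the squared distance between `x` and the inverse mapping of `x'`. The sketch below assumes that the spatial alignment is a 3x3 homography given as nested lists; the function names are hypothetical and the code does not reproduce the authors' pipeline.

```python
def apply_homography(H, point):
    """Map a 2D point through a 3x3 homography and dehomogenize."""
    x, y = point
    xh = H[0][0] * x + H[0][1] * y + H[0][2]
    yh = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return (xh / w, yh / w)


def invert_3x3(M):
    """Invert a 3x3 matrix via its adjugate."""
    (a, b, c), (d, e, f), (g, h, i) = M
    det = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
    if det == 0:
        raise ValueError("matrix is singular")
    return [
        [(e * i - f * h) / det, (c * h - b * i) / det, (b * f - c * e) / det],
        [(f * g - d * i) / det, (a * i - c * g) / det, (c * d - a * f) / det],
        [(d * h - e * g) / det, (b * g - a * h) / det, (a * e - b * d) / det],
    ]


def squared_distance(p, q):
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def symmetric_transfer_error(H, pairs):
    """Sum of forward and backward squared transfer distances.

    `pairs` is a list of (x, x_prime) point correspondences.
    """
    H_inv = invert_3x3(H)
    total = 0.0
    for x, x_prime in pairs:
        total += squared_distance(x_prime, apply_homography(H, x))
        total += squared_distance(x, apply_homography(H_inv, x_prime))
    return total
```

A candidate alignment whose error over all putative correspondences stays below a chosen threshold would be treated as consistent; the threshold itself is application-specific and not given in the abstract.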

CiteSeerX

Boston University Institutional Repository (OpenBU)