10 research outputs found

    Catching the head, tail, and everything in between: a streaming algorithm for the degree distribution

    The degree distribution is one of the most fundamental graph properties of interest for real-world graphs. It has been widely observed in numerous domains that graphs typically have a tailed or scale-free degree distribution. While the average degree is usually quite small, the variance is quite high and there are vertices with degrees at all scales. We focus on the problem of approximating the degree distribution of a large streaming graph with small storage. We design an algorithm, headtail, whose main novelty is a new estimator of infrequent degrees using truncated geometric random variables. We give a mathematical analysis of headtail and show that it has excellent behavior in practice. We can process streams with millions of edges using storage less than 1% and get extremely accurate approximations for all scales in the degree distribution. We also introduce a new notion of Relative Hausdorff distance between tailed histograms. Existing notions of distance between distributions are not suitable, since they ignore infrequent degrees in the tail. The Relative Hausdorff distance measures deviations at all scales and is a more suitable distance for comparing degree distributions. By tracking this new measure, we are able to give strong empirical evidence of the convergence of headtail.
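    For context, and not as code from the paper, the sketch below shows the exact full-memory baseline that a small-space algorithm such as headtail approximates: it keeps one counter per vertex, which is precisely the storage cost streaming estimation is meant to avoid. The function name and toy edge list are illustrative only.

```python
from collections import Counter

def exact_degree_distribution(edge_stream):
    # Baseline, NOT the headtail algorithm: keeps one counter per vertex,
    # i.e. the O(#vertices) storage that a small-space streaming estimator
    # is designed to avoid.  Returns a histogram {degree: #vertices}.
    degree = Counter()
    for u, v in edge_stream:
        degree[u] += 1
        degree[v] += 1
    return dict(Counter(degree.values()))

# Toy undirected edge stream: a star on vertices 0-3 plus the edge (1, 2).
edges = [(0, 1), (0, 2), (0, 3), (1, 2)]
print(exact_degree_distribution(edges))   # {3: 1, 2: 2, 1: 1}
```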

    Tight Lower Bound for Linear Sketches of Moments

    The problem of estimating frequency moments of a data stream has attracted a lot of attention since the onset of streaming algorithms [AMS99]. While the space complexity of approximately computing the $p^{\rm th}$ moment, for $p\in(0,2]$, has been settled [KNW10], for $p>2$ the exact complexity remains open. For $p>2$ the current best algorithm uses $O(n^{1-2/p}\log n)$ words of space [AKO11, BO10], whereas the lower bound is $\Omega(n^{1-2/p})$ [BJKS04]. In this paper, we show a tight lower bound of $\Omega(n^{1-2/p}\log n)$ words for the class of algorithms based on linear sketches, which store only a sketch $Ax$ of the input vector $x$ and some (possibly randomized) matrix $A$. We note that all known algorithms for this problem are linear sketches.
    Comment: In Proceedings of the 40th International Colloquium on Automata, Languages and Programming (ICALP), Riga, Latvia, July 2013
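    As a minimal illustration of the sketching model to which this lower bound applies (not an implementation of any moment estimator), a linear sketch stores only $y = Ax$ and updates it linearly as stream items arrive; the Gaussian matrix and the class below are illustrative assumptions, not code from the paper.

```python
import numpy as np

class LinearSketch:
    # Illustration of the model covered by the lower bound: the algorithm
    # retains only y = A x for a k x n (possibly random) matrix A, updated
    # linearly as turnstile updates (i, delta) arrive.  No moment estimator
    # (the post-processing of y) is shown here.
    def __init__(self, n, k, seed=0):
        rng = np.random.default_rng(seed)
        self.A = rng.standard_normal((k, n))   # illustrative choice of A
        self.y = np.zeros(k)                   # the sketch y = A x

    def update(self, i, delta):
        # x_i += delta  implies  y += delta * (i-th column of A)
        self.y += delta * self.A[:, i]

# Turnstile stream over an n = 8 dimensional vector, compressed into k = 3 numbers.
sk = LinearSketch(n=8, k=3)
for i, delta in [(0, 5), (3, -2), (0, 1), (7, 4)]:
    sk.update(i, delta)
print(sk.y)   # linear in x, so independent of the order of the updates
```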

    Finding heavy hitters from lossy or noisy data

    Abstract. Motivated by Dvir et al. and Wigderson and Yehudayoff [3

    Streaming Algorithms for High Throughput Massive Datasets

    The field of streaming algorithms has enjoyed a great deal of focus from the theoretical computer science community over the last 20 years. Many great algorithms and mathematical results have been developed in this time, allowing a broad class of functions to be computed and problems to be solved in the streaming model. In the same period, the amount of data being generated by practical computer systems has become simply staggering. In this thesis, we focus on solving problems in the streaming model with the unified goal of being relevant to practical problems outside of the theory community. The common technical thread throughout this work is an attention to runtime and to handling large datasets that are challenging not only in the memory available, but also in the throughput of the data and the speed at which it must be processed. We provide these solutions in the form of both theoretical algorithms and practical systems, and demonstrate that using practice to drive theory, and vice versa, can generate powerful new approaches for difficult problems in the streaming model.

    INVESTIGATIONS OF DARK MATTER USING COSMOLOGICAL SIMULATIONS

    In the successful concordance model of cosmology, dark matter is crucial for structures to form as we observe them in the universe. Despite the overwhelming observational evidence for its existence, it has not yet been directly detected, and its nature is largely unknown. Physicists have proposed various dark matter candidates, with masses ranging over dozens of orders of magnitude. However, both indirect and direct detection experiments for dark matter have reported no convincing results. Dark matter research therefore relies critically on computer simulations. Using supercomputer numerical simulations, we can test the correctness of the current cosmological model, as well as obtain guidance for future detection experiments. In this dissertation, we study dark matter from several perspectives using cosmological simulations: its possible radiation, its warmth, and other related issues.

    A commonly accepted candidate for dark matter is the weakly interacting massive particle (WIMP). WIMPs interact with normal matter only through the weak force (as well as gravity), so it is extremely challenging to detect these particles directly. However, depending on the type of dark matter, the particles can annihilate with other dark matter particles, or decay, into high-energy photons (i.e., $\gamma$-rays). We studied the spatial distribution of possible emission components from dark matter annihilation or decay in a large simulation of a galaxy like the Milky Way. The predicted emission components can be used as templates for observations, such as those from the {\it Fermi}/LAT $\gamma$-ray instrument, to constrain the physical properties of dark matter.

    Structure formation theory suggests that dark matter is ``cold'', i.e., moving non-relativistically during structure formation. However, cold dark matter predicts many more dark-matter satellites, or subhaloes, around galaxies such as the Milky Way than are observed. One well-established mechanism to bring the theory in line with observations is that many of these satellites are not visible because they are too small for baryons to form stars in them. Another way is to attenuate the small-scale structure directly, positing ``warm'' dark matter. Using simulations, we propose a method of testing this possibility in a complementary environment, by measuring the density profiles of cosmic voids. Our results suggest that there are sufficient differences between warm and cold dark matter to test with future observations.

    Furthermore, our data analysis methods are based on sophisticated data stream algorithms and newly developed Graphics Processing Unit (GPU) hardware. These tools led to other studies of dark matter as well. For example, we studied the spin alignment of dark matter haloes with their environment, and show that the spin alignments are strongly related to the hierarchical level of the cosmic web in which a halo is located. We also studied the responses of different density variables to ``ringing'' the initial density field at different spatial frequencies (i.e., putting spikes in the power spectrum at a particular scale). The conventional wisdom is that power generally migrates from large to small comoving scales from the initial to the final conditions. In this work, however, we found that this holds only for a density variable emphasizing dense regions, such as the usual overdensity field. In the log-density field, power stays at about the same scale but broadens, while in the reciprocal-density field, which emphasizes low-density regions, power moves to larger scales. This is an example of voids acting as ``cosmic magnifying glasses''. The GPU density-estimation technique was crucial for this study, allowing the density to be estimated accurately even when modestly sampled with particles. Our results provide guidance for designing future statistical analyses of dark matter and the large-scale structure of the Universe in general.
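    As a rough illustration of the three density variables mentioned above, the snippet below computes overdensity, log-density, and reciprocal-density transforms of a sampled density field using their standard definitions, which may differ in detail from those used in the dissertation; the function and toy field are hypothetical.

```python
import numpy as np

def density_variables(rho):
    # Standard transforms of a (positive) sampled density field rho; a simple
    # illustration, not the GPU density estimator used in the dissertation.
    rho_bar = rho.mean()
    overdensity = rho / rho_bar - 1.0      # delta: emphasizes dense regions
    log_density = np.log(rho / rho_bar)    # log-density: treats scales more evenly
    reciprocal = rho_bar / rho             # reciprocal density: emphasizes voids
    return overdensity, log_density, reciprocal

# Toy 1-D "field" with one dense peak and deep voids at the edges.
rho = np.array([0.2, 0.5, 1.0, 5.0, 1.0, 0.5, 0.2])
for name, field in zip(("delta", "log-density", "reciprocal"), density_variables(rho)):
    print(name, np.round(field, 2))
```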

    Taming Big Data By Streaming

    Data streams have emerged as a natural computational model for numerous applications of big data processing. In this model, algorithms are assumed to have access to a limited amount of memory and can only make a single pass (or a few passes) over the data, but need to produce sufficiently accurate answers for some objective functions on the dataset. This model captures various real-world applications and stimulates new scalable tools for solving important problems in the big data era. This dissertation focuses on the following two aspects of the streaming model. 1. Understanding the capability of the streaming model. For a vector aggregation stream, i.e., when the stream is a sequence of updates to an underlying $n$-dimensional vector $v$ (for very large $n$), we establish nearly tight space bounds on streaming algorithms approximating functions of the form $\sum_{i=1}^n g(v_i)$ for nearly all functions $g$ of one variable, and $l(v)$ for all symmetric norms $l$. These results provide a deeper understanding of the streaming computation model. 2. Tighter upper bounds. We provide better streaming $k$-median clustering algorithms for a dynamic stream of points, i.e., a stream of insertions and deletions of points in a discrete Euclidean space ($[\Delta]^d$ for sufficiently large $\Delta$ and $d$). Our algorithms use $k\cdot\mathrm{poly}(d\log\Delta)$ space/update time and maintain, with high probability, an approximate $k$-median solution to the streaming dataset. All previous algorithms for computing an approximation for the $k$-median problem over dynamic data streams required space and update time exponential in $d$.
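    To make the first problem setup concrete, the snippet below computes the target quantity $\sum_i g(v_i)$ exactly from a sequence of turnstile updates; it is a reference baseline, not one of the dissertation's streaming algorithms, and the helper name is illustrative.

```python
from collections import defaultdict

def g_sum_exact(updates, g):
    # Exact (non-streaming) value of sum_i g(v_i) for the vector v defined by
    # a sequence of turnstile updates (i, delta).  Assumes g(0) = 0, so
    # coordinates that are never touched contribute nothing.  Streaming
    # algorithms approximate this quantity using memory far smaller than
    # the dimension n.
    v = defaultdict(float)
    for i, delta in updates:
        v[i] += delta
    return sum(g(x) for x in v.values())

# Example: the second frequency moment F_2, i.e. g(x) = x**2.
updates = [(0, 3), (2, 1), (0, -1), (5, 4)]
print(g_sum_exact(updates, lambda x: x * x))   # v_0 = 2, v_2 = 1, v_5 = 4  ->  4 + 1 + 16 = 21
```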