17,003 research outputs found

    New Algorithms and Lower Bounds for Sequential-Access Data Compression

    This thesis concerns sequential-access data compression, i.e., compression by algorithms that read the input one or more times from beginning to end. In one chapter we consider adaptive prefix coding, for which we must read the input character by character, outputting each character's self-delimiting codeword before reading the next one. We show how to encode and decode each character in constant worst-case time while producing an encoding whose length is worst-case optimal. In another chapter we consider one-pass compression with memory bounded in terms of the alphabet size and context length, and prove a nearly tight tradeoff between the amount of memory we can use and the quality of the compression we can achieve. In a third chapter we consider compression in the read/write streams model, which allows both the number of passes and the memory to be polylogarithmic in the size of the input. We first show how to achieve universal compression using only one pass over one stream. We then show that one stream is not sufficient for achieving good grammar-based compression. Finally, we show that two streams are necessary and sufficient for achieving entropy-only bounds. Comment: draft of PhD thesis.

    Tight Lower Bound for Comparison-Based Quantile Summaries

    Quantiles, such as the median or percentiles, provide concise and useful information about the distribution of a collection of items drawn from a totally ordered universe. We study data structures, called quantile summaries, which keep track of all quantiles up to an error of at most ε. That is, an ε-approximate quantile summary first processes a stream of items and then, given any quantile query 0 ≤ φ ≤ 1, returns an item from the stream which is a φ'-quantile for some φ' = φ ± ε. We focus on comparison-based quantile summaries that can only compare two items and are otherwise completely oblivious of the universe. The best such deterministic quantile summary to date, due to Greenwald and Khanna (SIGMOD '01), stores at most O((1/ε) · log(εN)) items, where N is the number of items in the stream. We prove that this space bound is optimal by showing a matching lower bound. Our result thus rules out the possibility of constructing a deterministic comparison-based quantile summary in space f(ε) · o(log N), for any function f that does not depend on N. As a corollary, we improve the lower bound for biased quantiles, which provide a stronger, relative-error guarantee of (1 ± ε) · φ, and for other related computational tasks. Comment: 20 pages, 2 figures, major revision of the construction (Sec. 3) and some other parts of the paper.
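The ε-approximate guarantee can be made concrete with a minimal offline sketch: keep roughly every ⌊εN⌋-th item of the sorted input (O(1/ε) space) and answer a φ-quantile query with a stored item whose rank is within εN of the target. This only illustrates the query semantics; a real summary such as Greenwald–Khanna maintains such a set in one pass over the stream.

```python
# Offline illustration of an epsilon-approximate quantile summary.
def build_summary(items, eps):
    s = sorted(items)
    n = len(s)
    step = max(1, int(eps * n))
    ranks = list(range(0, n, step))      # every step-th rank...
    if ranks[-1] != n - 1:
        ranks.append(n - 1)              # ...plus the maximum
    return n, [(r, s[r]) for r in ranks]

def query(summary, phi):
    """Return a stored item whose rank is closest to phi * (n - 1)."""
    n, pairs = summary
    target = phi * (n - 1)
    return min(pairs, key=lambda p: abs(p[0] - target))[1]

summ = build_summary(range(1000), 0.01)  # 101 stored items for N = 1000
print(query(summ, 0.5))                  # within 0.01 * 1000 ranks of the median
```

The lower bound in the paper says that, for deterministic comparison-based summaries, the log N factor in the stored-item count cannot be removed no matter how the kept ranks are chosen.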

    Flood magnitude-frequency and lithologic control on bedrock river incision in post-orogenic terrain

    Mixed bedrock-alluvial rivers - bedrock channels lined with a discontinuous alluvial cover - are key agents in the shaping of mountain belt topography by bedrock fluvial incision. Whereas much research focuses upon the erosional dynamics of such rivers in the context of rapidly uplifting orogenic landscapes, the present study investigates river incision processes in a post-orogenic (cratonic) landscape undergoing extremely low rates of incision (< 5 m/Ma). River incision processes are examined as a function of substrate lithology and the magnitude and frequency of formative flows along Sandy Creek gorge, a mixed bedrock-alluvial stream in arid SE-central Australia. Incision is focused along a bedrock channel with a partial alluvial cover arranged into riffle-pool macrobedforms that reflect interactions between rock structure and large-flood hydraulics. Variations in channel width and gradient determine longitudinal trends in mean shear stress (τb) and therefore also patterns of sediment transport and deposition. A steep and narrow, non-propagating knickzone (with 5% alluvial cover) coincides with a resistant quartzite unit that subdivides the gorge into three reaches according to different rock erodibility and channel morphology. The three reaches also separate distinct erosional styles: bedrock plucking (i.e. detachment-limited erosion) prevails along the knickzone, whereas along the upper and lower gorge rock incision is dependent upon large formative floods exceeding critical erosion thresholds (τc) for coarse boulder deposits that line 70% of the channel thalweg (i.e. transport-limited erosion). The mobility of coarse bed materials (up to 2 m diameter) during late Holocene palaeofloods of known magnitude and age is evaluated using step-backwater flow modelling in conjunction with two selective entrainment equations.
A new approach for quantifying the formative flood magnitude in mixed bedrock-alluvial rivers is described here based on the mobility of a key coarse fraction of the bed materials; in this case the d84 size fraction. A 350 m³/s formative flood fully mobilises the coarse alluvial cover with τb ≈ 200–300 N/m² across the upper and lower gorge riffles, peaking over 500 N/m² in the knickzone. Such floods have an annual exceedance probability much less than 10⁻² and possibly as low as 10⁻³. The role of coarse alluvial cover in the gorge is discussed at two scales: (1) modulation of bedrock exposure at the reach-scale, coupled with adjustment to channel width and gradient, accommodates uniform incision across rocks of different erodibility in steady-state fashion; and (2) at the sub-reach scale where coarse boulder deposits (corresponding to τb minima) cap topographic convexities in the rock floor, thereby restricting bedrock incision to rare large floods. While recent studies postulate that decreasing uplift rates during post-orogenic topographic decay might drive a shift to transport-limited conditions in river networks, observations here and elsewhere in post-orogenic settings suggest, to the contrary, that extremely low erosion rates are maintained with substantial bedrock channel exposure. Although bed material mobility is known to be rate-limiting for bedrock river incision under low sediment flux conditions, exactly how a partial alluvial cover might be spatially distributed to either optimise or impede the rate of bedrock incision is open to speculation. Observations here suggest that the small volume of very stable bed materials lining Sandy Creek gorge is distributed so as to minimise the rate of bedrock fluvial incision over time.
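The shear-stress magnitudes quoted above follow from the standard depth-slope product, τb = ρ g R S, the quantity whose longitudinal pattern the study ties to channel width and gradient. A minimal sketch, where the hydraulic radius and slope values are illustrative assumptions rather than data from Sandy Creek gorge:

```python
# Depth-slope product for mean boundary shear stress, tau_b = rho * g * R * S.
RHO = 1000.0   # water density, kg/m^3
G = 9.81       # gravitational acceleration, m/s^2

def shear_stress(hydraulic_radius_m, slope):
    """Mean boundary shear stress in N/m^2 (Pa)."""
    return RHO * G * hydraulic_radius_m * slope

# A wide, gently sloping gorge reach vs. a steeper, narrower knickzone:
print(shear_stress(2.5, 0.010))   # ~245 N/m^2, within the reported riffle range
print(shear_stress(2.0, 0.027))   # ~530 N/m^2, comparable to the knickzone peak
```

Because τb scales linearly with both flow depth and gradient, narrowing (which deepens flow) and steepening both concentrate stress, which is why the knickzone peaks above the riffle values for the same flood discharge.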

    Approximate TF-IDF based on topic extraction from massive message stream using the GPU

    The Web is a constantly expanding global information space that includes disparate types of data and resources. Recent trends demonstrate the urgent need to manage large amounts of streaming data, especially in specific application domains such as critical infrastructure systems, sensor networks, log file analysis, search engines and, more recently, social networks. All of these applications involve large-scale data-intensive tasks, often subject to time constraints and space complexity. Algorithms and data management and retrieval techniques must be able to process data streams, i.e., process data as it becomes available and provide an accurate response based solely on the portion of the stream already seen. Data retrieval techniques often require a traditional storage-and-processing approach, i.e., all data must be available in the storage space in order to be processed. For instance, a widely used relevance measure is Term Frequency–Inverse Document Frequency (TF–IDF), which evaluates how important a word is in a collection of documents and requires a priori knowledge of the whole dataset. To address this problem, we propose an approximate version of the TF–IDF measure suitable for continuous data streams (such as exchanges of messages, tweets and sensor-based log files). The algorithm for calculating this measure makes two assumptions: a fast response is required, and memory is both limited and vastly smaller than the size of the data stream. In addition, to meet the great computational power required to process massive data streams, we also present a parallel implementation of the approximate TF–IDF calculation using Graphics Processing Units (GPUs). This implementation of the algorithm was tested on generated and real data streams and was able to capture the most frequent terms.
Our results demonstrate that the approximate version of the TF–IDF measure performs at a level comparable to that of the exact TF–IDF measure.
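For contrast with the paper's streaming approximation, the exact batch TF-IDF computation below makes the limitation explicit: the inverse document frequency term needs document counts over the whole corpus, which is exactly what a one-pass stream setting forbids. This is the textbook weighting, not the paper's approximate algorithm.

```python
# Exact (batch) TF-IDF: requires the full corpus to compute document frequencies.
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists -> list of {term: tf-idf weight} per document."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))                     # document frequency per term
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: (tf[t] / len(doc)) * math.log(n / df[t]) for t in tf})
    return weights

docs = [["stream", "data", "data"], ["stream", "gpu"], ["gpu", "gpu", "tfidf"]]
w = tf_idf(docs)
# "data" occurs only in the first document, so it outweighs "stream" there.
```

A streaming variant must approximate df (and the corpus size) from counts maintained incrementally, under the two assumptions stated in the abstract: bounded memory and fast per-item updates.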

    A Fast Algorithm for Approximate Quantiles in High Speed Data Streams

    We present a fast algorithm for computing approximate quantiles in high-speed data streams with deterministic error bounds. For data streams of size N where N is unknown in advance, our algorithm partitions the stream into sub-streams of exponentially increasing size as they arrive. For each sub-stream, which has a fixed size, we compute and maintain a multi-level summary structure using a novel algorithm. In order to achieve high-speed performance, the algorithm uses simple block-wise merge and sample operations. Overall, our algorithms for fixed-size streams and arbitrary-size streams have a computational cost of O(N log((1/ε) log N)) and an average per-element update cost of O(log log N) if ε is fixed.
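The block-wise merge-and-sample operation at the heart of such multi-level summaries can be sketched as follows: two sorted level summaries are merged, then every other element is kept, halving the size while roughly doubling the rank error each kept element carries. The error bookkeeping of the actual algorithm is omitted here for brevity.

```python
# Merge-and-sample: the basic compaction step of a multi-level quantile summary.
import heapq

def merge_and_sample(a, b):
    """a, b: sorted lists -> sorted list of half the combined length."""
    merged = list(heapq.merge(a, b))   # linear-time merge of two sorted lists
    return merged[1::2]                # keep every other element

level0 = [1, 4, 9, 12]
level1 = [2, 5, 7, 30]
print(merge_and_sample(level0, level1))   # [2, 5, 9, 30]
```

Applying this step level by level as fixed-size blocks fill up is what keeps the per-element update cost low: most elements participate only in cheap merges near the bottom of the hierarchy.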
