Universal lossless source coding with the Burrows-Wheeler transform
The Burrows-Wheeler transform (1994) is a reversible sequence transformation used in a variety of practical lossless source-coding algorithms. In each, the BWT is followed by a lossless source code that attempts to exploit the natural ordering of the BWT coefficients. BWT-based compression schemes are widely touted as low-complexity algorithms giving lossless coding rates better than those of the Ziv-Lempel codes (commonly known as LZ'77 and LZ'78) and almost as good as those achieved by prediction by partial matching (PPM) algorithms. To date, the coding performance claims have been made primarily on the basis of experimental results. This work gives a theoretical evaluation of BWT-based coding. The main results of this theoretical evaluation include: (1) statistical characterizations of the BWT output on both finite strings and sequences of length n → ∞, (2) a variety of very simple new techniques for BWT-based lossless source coding, and (3) proofs of the universality and bounds on the rates of convergence of both new and existing BWT-based codes for finite-memory and stationary ergodic sources. The end result is a theoretical justification and validation of the experimentally derived conclusions: BWT-based lossless source codes achieve universal lossless coding performance that converges to the optimal coding performance more quickly than the rate of convergence observed in Ziv-Lempel style codes and, for some BWT-based codes, within a constant factor of the optimal rate of convergence for finite-memory sources.
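The reversibility of the BWT mentioned above can be made concrete with a minimal sketch. This is a naive O(n² log n) textbook version using a sentinel character, not the suffix-array construction used in practical compressors:

```python
def bwt(s, sentinel="\0"):
    """Forward Burrows-Wheeler transform: sort all rotations of
    s + sentinel and return the last column."""
    s += sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rotations)

def inverse_bwt(last_col, sentinel="\0"):
    """Invert the BWT by repeatedly prepending the last column and
    sorting, which rebuilds the full sorted-rotation table."""
    n = len(last_col)
    table = [""] * n
    for _ in range(n):
        table = sorted(last_col[i] + table[i] for i in range(n))
    # The original string is the row that ends with the sentinel.
    row = next(r for r in table if r.endswith(sentinel))
    return row.rstrip(sentinel)

print(inverse_bwt(bwt("banana")))  # round-trips to "banana"
```

The "natural ordering" the abstract refers to is visible in the output: characters that precede similar contexts end up adjacent in the last column, producing runs that a simple downstream coder (e.g. move-to-front plus an entropy coder) can exploit.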
Estimating changes in temperature extremes from millennial scale climate simulations using generalized extreme value (GEV) distributions
Changes in extreme weather may produce some of the largest societal impacts
of anthropogenic climate change. However, it is intrinsically difficult to
estimate changes in extreme events from the short observational record. In this
work we use millennial runs from the CCSM3 in equilibrated pre-industrial and
possible future conditions to examine both how extremes change in this model
and how well these changes can be estimated as a function of run length. We
estimate changes to distributions of future temperature extremes (annual minima
and annual maxima) in the contiguous United States by fitting generalized
extreme value (GEV) distributions. Using 1000-year pre-industrial and future
time series, we show that the magnitude of warm extremes largely shifts in
accordance with mean shifts in summertime temperatures. In contrast, cold
extremes warm more than mean shifts in wintertime temperatures, but changes in
GEV location parameters are largely explainable by mean shifts combined with
reduced wintertime temperature variability. In addition, changes in the spread
and shape of the GEV distributions of cold extremes at inland locations can
lead to discernible changes in tail behavior. We then examine uncertainties
that result from using shorter model runs. In principle, the GEV distribution
provides theoretical justification to predict infrequent events using time
series shorter than the recurrence frequency of those events. To investigate
how well this approach works in practice, we estimate 20-, 50-, and 100-year
extreme events using segments of varying lengths. We find that even using GEV
distributions, time series that are of comparable or shorter length than the
return period of interest can lead to very poor estimates. These results
suggest caution when attempting to use short observational time series or model
runs to infer infrequent extremes.
Comment: 33 pages, 22 figures, 1 table
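The return-level estimation described above can be sketched for the Gumbel special case of the GEV (shape parameter ξ = 0), fitted by the method of moments; the paper's actual analysis fits the full three-parameter GEV, so this is only an illustrative simplification on synthetic annual maxima:

```python
import math
import random
import statistics

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant

def fit_gumbel(maxima):
    """Method-of-moments fit of a Gumbel (GEV with shape xi = 0):
    scale = sqrt(6) * std / pi, loc = mean - gamma * scale."""
    scale = math.sqrt(6) * statistics.stdev(maxima) / math.pi
    loc = statistics.mean(maxima) - EULER_GAMMA * scale
    return loc, scale

def return_level(loc, scale, T):
    """Level exceeded on average once every T years, i.e. the
    (1 - 1/T) quantile of the fitted annual-maximum distribution."""
    return loc - scale * math.log(-math.log(1.0 - 1.0 / T))

random.seed(0)
# Synthetic 1000-year record of annual maxima drawn from Gumbel(30, 2)
maxima = [30 - 2 * math.log(-math.log(random.random())) for _ in range(1000)]
loc, scale = fit_gumbel(maxima)
print(return_level(loc, scale, 100))  # estimated 100-year event
```

Shortening the `maxima` list in this sketch to, say, 50 values illustrates the abstract's warning: return-level estimates for periods comparable to or longer than the record become highly variable.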
Using compression to identify acronyms in text
Text mining is about looking for patterns in natural language text, and may
be defined as the process of analyzing text to extract information from it for
particular purposes. In previous work, we claimed that compression is a key
technology for text mining, and backed this up with a study that showed how
particular kinds of lexical tokens---names, dates, locations, etc.---can be
identified and located in running text, using compression models to provide the
leverage necessary to distinguish different token types (Witten et al., 1999).
Comment: 10 pages. A short form published in DCC200
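The idea of using compression models to distinguish token types can be illustrated with an off-the-shelf compressor standing in for the paper's per-class compression models: a token is assigned to the class whose training text compresses it most cheaply. The corpora and tokens below are invented for illustration, and zlib is only a crude proxy for the PPM-style models the authors actually use:

```python
import zlib

def compressed_len(text):
    """Length in bytes of the zlib-compressed text (level 9)."""
    return len(zlib.compress(text.encode("utf-8"), 9))

def classify(token, corpora):
    """Assign token to the class whose corpus compresses it best,
    measured as the extra bytes needed to append the token."""
    def cost(corpus):
        return compressed_len(corpus + " " + token) - compressed_len(corpus)
    return min(corpora, key=lambda label: cost(corpora[label]))

# Hypothetical tiny training corpora, one per token type
corpora = {
    "date": "1 January 1990 14 March 1995 22 July 2001 3 May 1987",
    "name": "John Smith Mary Jones Alan Turing Grace Hopper",
}
print(classify("14 March 1995", corpora))
```

A token that matches patterns already seen in a class's corpus adds almost nothing to the compressed length of that corpus, so it wins the comparison; this is the "leverage" compression provides for token identification.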