2,450 research outputs found

    The B-coder: an improved binary arithmetic coder and probability estimator

    Get PDF
    In this paper we present the B-coder, an efficient binary arithmetic coder that performs extremely well on a wide range of data. The B-coder should be classed as an 'approximate' arithmetic coder because it uses an approximation to multiplication. We show that the approximation used in the B-coder has an efficiency cost of 0.003 compared with the Shannon entropy. At the heart of the B-coder is an efficient state machine that adapts rapidly to the data being coded. The adaptation is achieved by allowing a fixed table of transitions and probabilities to change within a given tolerance. The combination of the two techniques gives a coder that outperforms the current state-of-the-art binary arithmetic coders.
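    The B-coder's transition tables and its multiplication approximation are not given in the abstract, so the sketch below shows only the conventional baseline such coders improve on: a minimal exact-multiplication binary arithmetic encoder, plus a small check of its output length against the Shannon entropy. The class name, the fixed probability and the test setup are illustrative assumptions, not part of the B-coder.

```python
# Minimal sketch of a conventional binary arithmetic encoder (exact
# multiplication, classic CACM-87 style renormalisation). This is NOT the
# B-coder; it only illustrates the baseline that approximate coders trade
# a small amount of coding efficiency against.
import math
import random

PRECISION = 32
TOP       = 1 << PRECISION
HALF      = TOP >> 1
QUARTER   = TOP >> 2
THREE_Q   = HALF + QUARTER

class BinaryArithmeticEncoder:
    def __init__(self):
        self.low, self.high = 0, TOP - 1
        self.pending = 0          # underflow bits waiting to be emitted
        self.output = []

    def _emit(self, bit):
        self.output.append(bit)
        self.output.extend([1 - bit] * self.pending)
        self.pending = 0

    def encode(self, bit, p0):
        """Encode one bit given p0 = P(bit == 0); keep p0 away from 0 and 1."""
        span = self.high - self.low + 1
        split = self.low + int(span * p0) - 1    # exact multiplication here
        if bit == 0:
            self.high = split
        else:
            self.low = split + 1
        while True:                              # renormalise the interval
            if self.high < HALF:
                self._emit(0)
            elif self.low >= HALF:
                self._emit(1)
                self.low -= HALF
                self.high -= HALF
            elif self.low >= QUARTER and self.high < THREE_Q:
                self.pending += 1                # underflow case
                self.low -= QUARTER
                self.high -= QUARTER
            else:
                break
            self.low = 2 * self.low
            self.high = 2 * self.high + 1

    def finish(self):
        self.pending += 1
        self._emit(0 if self.low < QUARTER else 1)
        return self.output

if __name__ == "__main__":
    p0, n = 0.9, 100_000
    random.seed(1)
    bits = [0 if random.random() < p0 else 1 for _ in range(n)]
    enc = BinaryArithmeticEncoder()
    for b in bits:
        enc.encode(b, p0)
    out = enc.finish()
    entropy = -(p0 * math.log2(p0) + (1 - p0) * math.log2(1 - p0))
    print(f"output: {len(out) / n:.4f} bits/symbol, entropy: {entropy:.4f}")
```

    The gap between the two printed numbers is the coder's redundancy relative to the Shannon entropy, which is the same yardstick the abstract uses for the B-coder's 0.003 efficiency cost.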

    Compression of Textual Column-Oriented Data

    Get PDF
    Column-oriented data are well suited for compression: since values of the same column are stored contiguously on disk, the information entropy is lower than in the physical data organization of conventional row-oriented databases. There are many useful lightweight compression techniques targeted at specific data types and domains, such as integers or columns with a small set of distinct values. However, compression of textual values formed by skewed, high-cardinality words is usually restricted to variations of the LZ compression algorithm. So far there have been no empirical evaluations of how other sophisticated compression methods handle columnar data that stores text. In this paper we shed light on this subject by revisiting the concepts behind those algorithms. We also analyse how they behave, in terms of compression ratio and speed, when dealing with textual columns where values appear in adjacent positions.
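    As a small illustration of the premise (not the paper's own benchmark), the sketch below compresses the same toy table twice with zlib, once serialised row by row and once column by column; the contiguous textual column typically compresses noticeably better. The data and layout choices are invented for this example, and zlib stands in for the LZ-family codecs discussed in the paper.

```python
# Row-oriented vs column-oriented layout of the same toy table, compressed
# with zlib. Contiguous storage of a repetitive textual column tends to
# yield a better compression ratio.
import random
import zlib

random.seed(0)
cities = ["Sao Paulo", "Curitiba", "Porto Alegre", "Recife", "Manaus"]
rows = [(i, random.choice(cities), f"{random.uniform(1, 999):.2f}")
        for i in range(50_000)]

# Row-oriented layout: the fields of each record are stored next to each other.
row_layout = "\n".join(f"{i}|{city}|{price}" for i, city, price in rows).encode()

# Column-oriented layout: each column is stored contiguously.
col_layout = ("\n".join(str(i) for i, _, _ in rows) + "\n" +
              "\n".join(city for _, city, _ in rows) + "\n" +
              "\n".join(price for _, _, price in rows)).encode()

for name, blob in [("row-oriented", row_layout), ("column-oriented", col_layout)]:
    packed = zlib.compress(blob, level=6)
    print(f"{name:>15}: {len(blob):>8} B raw -> {len(packed):>7} B "
          f"({len(packed) / len(blob):.1%})")
```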

    Signature file access methodologies for text retrieval: a literature review with additional test cases

    Get PDF
    Signature files are extremely compressed versions of text files which can be used as access or index files to facilitate searching documents for text strings. These access files, or signatures, are generated by storing hashed codes for individual words. Given the possible generation of similar codes in the hashing or storing process, the primary concern in researching signature files is to determine the accuracy of retrieving information. Inaccuracy always takes the form of a false drop: the false signaling of the presence of a text string. Two suggested ways to alter false drop rates are: 1) to determine whether either of the two methodologies for storing hashed codes, superimposing them or concatenating them, is more efficient; and 2) to determine whether the particular hashing algorithm has any impact. To assess these issues, the history of superimposed coding is traced from its development as a tool for compressing information onto punched cards in the 1950s to its incorporation into proposed signature file methodologies in the mid-1980s. Likewise, the concept of compressing individual words by various algorithms, or by hashing them, is traced through the research literature. Following this literature review, benchmark trials are performed using both superimposed and concatenated methodologies while varying the hashing algorithms. It is determined that while one combination of hashing algorithm and storage methodology is better, all signature file methods can be considered viable.
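    Below is a minimal sketch of the two storage methodologies under comparison, assuming one fixed hash function and toy parameters of my own choosing; it is not the benchmark configuration used in the review. Superimposed coding ORs a few hashed bit positions per word into one shared bit vector, so lookups can produce false drops; concatenation instead stores a short hashed code per word.

```python
# Toy signature-file sketch: superimposed coding vs concatenated hash codes.
import hashlib

SIG_BITS = 256      # signature width (illustrative)
BITS_PER_WORD = 3   # bits set per word in superimposed coding (illustrative)

def _hashes(word, k, modulus):
    """Derive k hash values for a word from a single SHA-1 digest."""
    digest = hashlib.sha1(word.encode()).digest()
    return [int.from_bytes(digest[4 * i: 4 * i + 4], "big") % modulus
            for i in range(k)]

def superimposed_signature(words):
    """OR a few hashed bit positions per word into one bit vector."""
    sig = 0
    for w in words:
        for pos in _hashes(w, BITS_PER_WORD, SIG_BITS):
            sig |= 1 << pos
    return sig

def maybe_contains(sig, word):
    """True if the block *may* contain the word; false drops are possible."""
    return all((sig >> pos) & 1 for pos in _hashes(word, BITS_PER_WORD, SIG_BITS))

def concatenated_signature(words, code_bits=12):
    """Concatenation: one truncated hash code per word, stored in sequence."""
    return [_hashes(w, 1, 1 << code_bits)[0] for w in words]

doc = "signature files are compressed index files for text retrieval".split()
sig = superimposed_signature(doc)
print(maybe_contains(sig, "retrieval"), maybe_contains(sig, "database"))
print(concatenated_signature(doc)[:5])
```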

    The Many Qualities of a New Directly Accessible Compression Scheme

    Full text link
    We present a new variable-length computation-friendly encoding scheme, named SFDC (Succinct Format with Direct aCcessibility), that supports direct and fast access to any element of the compressed sequence and achieves compression ratios often higher than those offered by other solutions in the literature. The SFDC scheme provides a flexible and simple representation geared towards either practical efficiency or compression ratio, as required. For a text of length $n$ over an alphabet of size $\sigma$ and a fixed parameter $\lambda$, the access time of the proposed encoding is proportional to the length of the character's code-word, plus an expected $\mathcal{O}((F_{\sigma - \lambda + 3} - 3)/F_{\sigma+1})$ overhead, where $F_j$ is the $j$-th Fibonacci number. Overall it uses $N + \mathcal{O}\big(n(\lambda - (F_{\sigma+3}-3)/F_{\sigma+1})\big) = N + \mathcal{O}(n)$ bits, where $N$ is the length of the encoded string. Experimental results show that the performance of our scheme is, in some respects, comparable with that of DACs and Wavelet Trees, which are among the most efficient schemes. In addition, our scheme is designed as a computation-friendly compression scheme, offering several features that make it very effective in text processing tasks. For the string matching problem, which we take as a case study, we show experimentally that the new scheme enables searches that are up to 29 times faster than standard string-matching techniques on plain texts.
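    The SFDC layout itself is not reproduced here. As background for the comparison mentioned above, the following is a toy version of the idea behind DACs (Directly Addressable Codes), one of the baselines: each value is split into fixed-width chunks spread across levels, with a continuation flag per chunk, so accessing element i touches at most one chunk per level. Class and parameter names are illustrative, and real DACs use compact rank structures rather than plain Python lists.

```python
# Toy directly-addressable codes: variable-length values, constant-ish access.
class ToyDAC:
    def __init__(self, values, chunk_bits=4):
        self.b = chunk_bits
        self.levels = []            # each level: (chunks, flags, rank prefix sums)
        current = list(values)
        while current:
            chunks, flags, nxt = [], [], []
            for v in current:
                chunks.append(v & ((1 << self.b) - 1))
                rest = v >> self.b
                flags.append(1 if rest else 0)
                if rest:
                    nxt.append(rest)
            rank = [0]
            for f in flags:          # rank[i] = how many values before i continue
                rank.append(rank[-1] + f)
            self.levels.append((chunks, flags, rank))
            current = nxt

    def access(self, i):
        """Return the i-th value without decoding its neighbours."""
        value, shift = 0, 0
        for chunks, flags, rank in self.levels:
            value |= chunks[i] << shift
            if not flags[i]:
                break
            i = rank[i]              # position of this value in the next level
            shift += self.b
        return value

data = [3, 17, 300, 5, 70_000, 42]
dac = ToyDAC(data)
assert [dac.access(i) for i in range(len(data))] == data
print(dac.access(4))                 # -> 70000
```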

    Text Augmentation: Inserting markup into natural language text with PPM Models

    Get PDF
    This thesis describes a new optimisation and new heuristics for automatically marking up XML documents. These are implemented in CEM, using PPM models. CEM is significantly more general than previous systems, marking up large numbers of hierarchical tags, using n-gram models for large n and a variety of escape methods. Four corpora are discussed, including a bibliography corpus of 14,682 bibliographies laid out in seven standard styles using the BibTeX system and marked up in XML with every field from the original BibTeX. The other corpora are the ROCLING Chinese text segmentation corpus, the Computists’ Communique corpus and the Reuters corpus. A detailed examination is presented of the methods of evaluating markup algorithms, including computational complexity measures and correctness measures from the fields of information retrieval, string processing, machine learning and information theory. A new taxonomy of markup complexities is established and the properties of each taxon are examined in relation to the complexity of marked-up documents. The performance of the new heuristics and optimisation is examined using the four corpora.
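    CEM itself is not reproduced here. As a loose illustration of the kind of character-level model with escape that PPM-based markup systems build on, the sketch below implements a toy PPM-style predictor (escape method C, no exclusion) and uses code length under two differently trained models to judge which one better explains a snippet. The class name, model order and training strings are my own assumptions.

```python
# Toy PPM-style character model with escape to shorter contexts (PPMC-like).
from collections import defaultdict
import math

class ToyPPM:
    def __init__(self, order=2):
        self.order = order
        # counts[k][context][symbol] -> frequency, for contexts of length k
        self.counts = [defaultdict(lambda: defaultdict(int))
                       for _ in range(order + 1)]
        self.alphabet = set()

    def train(self, text):
        for i, ch in enumerate(text):
            self.alphabet.add(ch)
            for k in range(self.order + 1):
                if i - k >= 0:
                    self.counts[k][text[i - k:i]][ch] += 1

    def prob(self, context, ch):
        """P(ch | context), escaping to shorter contexts when needed."""
        escape_mass = 1.0
        for k in range(min(self.order, len(context)), -1, -1):
            ctx = context[len(context) - k:]
            seen = self.counts[k].get(ctx)
            if not seen:
                continue
            total, distinct = sum(seen.values()), len(seen)
            if ch in seen:                                   # predicted here
                return escape_mass * seen[ch] / (total + distinct)
            escape_mass *= distinct / (total + distinct)     # PPMC escape
        return escape_mass / max(1, len(self.alphabet))      # order -1: uniform

    def bits(self, text):
        """Code length of `text` under the model, in bits."""
        return sum(-math.log2(self.prob(text[:i], ch))
                   for i, ch in enumerate(text))

names = ToyPPM()
names.train("john smith mary jones alice brown ")
prose = ToyPPM()
prose.train("the quick brown fox jumps over the lazy dog ")
snippet = "mary smith"
print(names.bits(snippet) < prose.bits(snippet))   # likely True: looks like a name
```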

    Digital Image Access & Retrieval

    Get PDF
    The 33rd Annual Clinic on Library Applications of Data Processing, held at the University of Illinois at Urbana-Champaign in March 1996, addressed the theme of "Digital Image Access & Retrieval." The papers from this conference cover a wide range of topics concerning digital imaging technology for visual resource collections. Papers covered three general areas: (1) systems, planning, and implementation; (2) automatic and semi-automatic indexing; and (3) preservation, with the bulk of the conference focusing on indexing and retrieval.