
    Acronym-Meaning Extraction from Corpora Using Multi-Tape Weighted Finite-State Machines

    The automatic extraction of acronyms and their meaning from corpora is an important sub-task of text mining. It can be seen as a special case of string alignment, where a text chunk is aligned with an acronym. Alternative alignments have different costs, and ideally the least costly one should give the correct meaning of the acronym. We show how this approach can be implemented by means of a 3-tape weighted finite-state machine (3-WFSM) which reads a text chunk on tape 1 and an acronym on tape 2, and generates all alternative alignments on tape 3. The 3-WFSM can be automatically generated from a simple regular expression. No additional algorithms are required at any stage. Our 3-WFSM has a size of 27 states and 64 transitions, and finds the best analysis of an acronym in a few milliseconds. Comment: 6 pages, LaTeX
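    The alignment view can be illustrated with a small dynamic-programming sketch. This is not the paper's method (which uses a 3-tape weighted finite-state machine and does not publish explicit costs); the cost values below are invented for illustration: matching an acronym letter to a word-initial character of the chunk is free, a mid-word match costs 2, and skipping a chunk character costs 1, so the least costly chunk is the most plausible expansion.

        # Illustrative dynamic-programming sketch of acronym/chunk alignment.
        # The cost values are assumptions, not the weights of the 3-WFSM.
        def align(acronym, chunk):
            """Minimal cost of aligning every letter of `acronym` against `chunk`."""
            a, c = acronym.lower(), chunk.lower()
            INF = float("inf")
            # dp[i][j] = cheapest alignment of a[:i] with c[:j]
            dp = [[INF] * (len(c) + 1) for _ in range(len(a) + 1)]
            dp[0][0] = 0
            for i in range(len(a) + 1):
                for j in range(len(c)):
                    if dp[i][j] == INF:
                        continue
                    # skip one chunk character
                    dp[i][j + 1] = min(dp[i][j + 1], dp[i][j] + 1)
                    # match the next acronym letter to this chunk character
                    if i < len(a) and a[i] == c[j]:
                        word_initial = j == 0 or not c[j - 1].isalnum()
                        step = 0 if word_initial else 2
                        dp[i + 1][j + 1] = min(dp[i + 1][j + 1], dp[i][j] + step)
            return dp[len(a)][len(c)]

        for chunk in ("weighted finite-state machine", "wireless sensor network"):
            print(chunk, align("WFSM", chunk))

    A candidate with no feasible alignment comes back as infinity, so ranking chunks by this cost mimics the "least costly alignment wins" criterion described above.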

    Compressing Proteomes: The Relevance of Medium Range Correlations

    We study the nonrandomness of proteome sequences by analysing the correlations that arise between amino acids at short and medium range, more specifically, between amino acids located 10 or 100 residues apart, respectively. We show that statistical models that consider these two types of correlation are more likely to capture the information contained in protein sequences and thus achieve good compression rates. Finally, we propose that the cause of this redundancy is related to the evolutionary origin of proteomes and protein sequences.
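    The kind of correlation being discussed can be quantified in a back-of-the-envelope way by estimating the mutual information between residues k positions apart; the sketch below is an illustrative companion under that assumption (with an invented toy fragment), not the statistical models used in the paper.

        # Rough sketch: mutual information (in bits) between amino acids
        # separated by k residues, as a proxy for the correlations above.
        from collections import Counter
        from math import log2

        def mutual_information(seq, k):
            pairs = [(seq[i], seq[i + k]) for i in range(len(seq) - k)]
            n = len(pairs)
            joint = Counter(pairs)
            left = Counter(x for x, _ in pairs)
            right = Counter(y for _, y in pairs)
            return sum(c / n * log2((c / n) / ((left[x] / n) * (right[y] / n)))
                       for (x, y), c in joint.items())

        # toy fragment; a real experiment would stream whole proteomes from FASTA files
        seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQFEVVHSLAKWKR"
        for k in (1, 10, 50):
            print(k, round(mutual_information(seq, k), 3))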

    Using semantic knowledge to improve compression on log files

    With the move towards global and multi-national companies, information technology infrastructure requirements are increasing. As the size of these computer networks increases, it becomes more and more difficult to monitor, control, and secure them. Networks consist of a number of diverse devices, sensors, and gateways which are often spread over large geographical areas. Each of these devices produces log files which need to be analysed and monitored to provide network security and satisfy regulations. Data compression programs such as gzip and bzip2 are commonly used to reduce the quantity of data for archival purposes after the log files have been rotated. However, many other compression programs exist, each with its own advantages and disadvantages. These programs differ in the amount of memory they use, in their compression and decompression times, and in the compression ratios they achieve. System log files also contain redundancy which is not necessarily exploited by standard compression programs. Log messages usually follow a similar format with a defined syntax. In the log files, not all ASCII characters are used, and the messages contain certain "phrases" which are often repeated. This thesis investigates the use of compression as a means of data reduction and how the use of semantic knowledge can improve data compression (also applying the results to different scenarios that can occur in a distributed computing environment). It presents the results of a series of tests performed on different log files. It also examines the semantic knowledge which exists in maillog files and how it can be exploited to improve the compression results. The results from a series of text preprocessors which exploit this knowledge are presented and evaluated. These preprocessors include one which replaces the timestamps and IP addresses with their binary equivalents and one which replaces words from a dictionary with unused ASCII characters. In this thesis, data compression is shown to be an effective method of data reduction, producing up to 98 percent reduction in file size on a corpus of log files. The use of preprocessors which exploit semantic knowledge results in up to 56 percent improvement in overall compression time and up to 32 percent reduction in compressed size.
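    One of the preprocessors described above, the replacement of dotted-quad IP addresses with their 4-byte binary equivalents before compression, might look roughly like the sketch below. The escape byte, the synthetic maillog lines, and the omission of the inverse (decompression-side) mapping are all simplifying assumptions, not the exact encoding used in the thesis.

        # Sketch: shrink IP addresses to an escape byte plus 4 raw bytes,
        # then compare gzip sizes with and without the preprocessing step.
        import gzip
        import re

        IP_RE = re.compile(rb"(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})")
        ESCAPE = b"\xff"  # assumed not to occur in the ASCII log data

        def pack_ips(data: bytes) -> bytes:
            def repl(m):
                octets = [int(g) for g in m.groups()]
                if all(o <= 255 for o in octets):
                    return ESCAPE + bytes(octets)
                return m.group(0)  # not a valid address, leave it untouched
            return IP_RE.sub(repl, data)

        log = b"".join(
            b"Oct  5 14:%02d:%02d mailhost postfix/smtpd[231]: connect from unknown[10.0.%d.%d]\n"
            % (i // 60 % 60, i % 60, i % 20, i % 254 + 1)
            for i in range(5000)
        )
        print(len(gzip.compress(log)), len(gzip.compress(pack_ips(log))))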

    Low-Complexity Approaches to Slepian–Wolf Near-Lossless Distributed Data Compression

    This paper discusses the Slepian–Wolf problem of distributed near-lossless compression of correlated sources. We introduce practical new tools for communicating at all rates in the achievable region. The technique employs a simple “source-splitting” strategy that does not require common sources of randomness at the encoders and decoders. This approach allows for pipelined encoding and decoding so that the system operates with the complexity of a single user encoder and decoder. Moreover, when this splitting approach is used in conjunction with iterative decoding methods, it produces a significant simplification of the decoding process. We demonstrate this approach for synthetically generated data. Finally, we consider the Slepian–Wolf problem when linear codes are used as syndrome-formers and consider a linear programming relaxation to maximum-likelihood (ML) sequence decoding. We note that the fractional vertices of the relaxed polytope compete with the optimal solution in a manner analogous to that observed when the “min-sum” iterative decoding algorithm is applied. This relaxation exhibits the ML-certificate property: if an integral solution is found, it is the ML solution. For symmetric binary joint distributions, we show that selecting easily constructible “expander”-style low-density parity-check codes (LDPCs) as syndrome-formers admits a positive error exponent and therefore provably good performance.
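    The syndrome-former idea admits a miniature illustration: the encoder sends only the syndrome Hx of a source block, and the decoder recovers x by combining that syndrome with correlated side information y. The sketch below substitutes a (7,4) Hamming parity-check matrix for the paper's LDPC codes and assumes x and y differ in at most one bit, so it shows the principle rather than the actual construction.

        # Toy syndrome-based compression: 7 source bits are conveyed as a
        # 3-bit syndrome; the decoder's side information fixes the rest.
        import numpy as np

        H = np.array([[1, 0, 1, 0, 1, 0, 1],
                      [0, 1, 1, 0, 0, 1, 1],
                      [0, 0, 0, 1, 1, 1, 1]])  # column j (0-based) encodes j+1 in binary, LSB in row 0

        def encode(x):
            return H @ x % 2                      # transmit only the syndrome

        def decode(syndrome, y):
            s = (syndrome + H @ y) % 2            # syndrome of the difference pattern x ^ y
            x_hat = y.copy()
            if s.any():
                pos = int("".join(map(str, s[::-1])), 2) - 1   # locate the single differing bit
                x_hat[pos] ^= 1
            return x_hat

        x = np.array([1, 0, 1, 1, 0, 0, 1])
        y = x.copy(); y[4] ^= 1                   # side information differs from x in one bit
        print(np.array_equal(decode(encode(x), y), x))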

    Cross-layer Perceptual ARQ for Video Communications over 802.11e Wireless Networks

    This work presents an application-level perceptual ARQ algorithm for video streaming over 802.11e wireless networks. A simple and effective formula is proposed to combine the perceptual and temporal importance of each packet into a single priority value, which is then used to drive the packet-selection process at each retransmission opportunity. Compared to the standard 802.11 MAC-layer ARQ scheme, the proposed technique delivers higher perceptual quality because it retransmits only the most perceptually important packets, reducing wasted retransmission bandwidth. Video streaming of H.264 test sequences has been simulated with ns in a realistic 802.11e home scenario, in which the various kinds of traffic flows have been assigned to different 802.11e access categories according to the Wi-Fi Alliance WMM specification. Extensive simulations show that the proposed method consistently outperforms the standard link-layer 802.11 retransmission scheme, delivering PSNR gains of up to 12 dB while achieving low transmission delay and limited impact on concurrent traffic. Moreover, comparisons with a MAC-level ARQ scheme which adapts the retry limit to the type of frame contained in packets, and with an application-level deadline-based priority retransmission scheme, show that the PSNR gain offered by the proposed algorithm is significant, up to 5 dB. Additional results obtained in a scenario in which the transmission relies on an intermediate node (i.e., the access point) further confirm the consistency of the perceptual ARQ performance. Finally, results obtained by varying network conditions such as congestion and channel noise levels show the consistency of the improvements achieved by the proposed algorithm.
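    The abstract does not spell out the priority formula, so the following sketch is purely hypothetical: it weights a packet's perceptual importance by how close the packet is to its playout deadline and retransmits the highest-priority candidate, which matches the general shape of the selection process described above. The Packet fields and the distortion-over-slack ratio are assumptions.

        # Hypothetical perceptual/temporal priority for retransmission selection.
        from dataclasses import dataclass

        @dataclass
        class Packet:
            seq: int
            distortion: float   # perceptual importance: quality loss if the packet is missing
            deadline: float     # playout deadline in seconds

        def priority(pkt, now):
            slack = pkt.deadline - now
            if slack <= 0:
                return float("-inf")        # already too late to be useful
            return pkt.distortion / slack   # important and urgent packets rank highest

        def select_for_retransmission(pending, now):
            best = max(pending, key=lambda p: priority(p, now), default=None)
            return best if best is not None and priority(best, now) > float("-inf") else None

        pending = [Packet(1, 4.0, 0.30), Packet(2, 9.5, 0.50), Packet(3, 1.2, 0.10)]
        print(select_for_retransmission(pending, now=0.05))   # packet 3: its small slack wins here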

    Motion and disparity estimation with self adapted evolutionary strategy in 3D video coding

    Real-world information obtained by humans is three-dimensional (3-D). In experimental user trials, subjective assessments have clearly demonstrated the increased impact of 3-D pictures compared to conventional flat-picture techniques. It is reasonable, therefore, that we humans want an imaging system that produces pictures that are as natural and real as the things we see and experience every day. Three-dimensional imaging and, hence, 3-D television (3DTV) are very promising approaches expected to satisfy these desires. Integral imaging, which can capture true 3D color images with only one camera, has been seen as the right technology to offer stress-free viewing to audiences of more than one person. In this paper, we propose a novel approach that uses an Evolutionary Strategy (ES) for joint motion and disparity estimation to compress 3D integral video sequences. We propose to decompose the integral video sequence into viewpoint video sequences and jointly exploit motion and disparity redundancies to maximize the compression using a self-adapted ES. A half-pixel refinement algorithm is then applied by interpolating macroblocks in the previous frame to further improve the video quality. Experimental results demonstrate that the proposed adaptable ES with half-pixel joint motion and disparity estimation can achieve up to 1.5 dB of objective quality gain without any additional computational cost over our previous algorithm. Furthermore, the proposed technique achieves similar objective quality to the full-search algorithm while reducing the computational cost by up to 90%.
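    The flavour of the search can be conveyed by a simple self-adapted (1+1) evolution strategy that looks for a motion vector minimising the sum of absolute differences (SAD) of a block; the paper's joint motion-and-disparity search over viewpoint sequences and its half-pixel refinement are considerably more elaborate, and the synthetic frames, step-size rule, and parameters below are assumptions made for illustration.

        # (1+1) ES sketch: mutate a candidate motion vector, keep it if the
        # block SAD improves, and self-adapt the mutation step size.
        import numpy as np

        rng = np.random.default_rng(0)
        yy, xx = np.mgrid[0:64, 0:64]
        ref = 128 + 50 * np.sin(xx / 7.0) + 50 * np.cos(yy / 11.0)   # smooth synthetic reference frame
        cur = np.roll(ref, (3, -5), axis=(0, 1))                     # current frame: shifted reference
        block = cur[16:24, 16:24]                                    # 8x8 block to predict

        def sad(mv):
            dy, dx = int(round(mv[0])), int(round(mv[1]))
            y, x = 16 + dy, 16 + dx
            if not (0 <= y <= 56 and 0 <= x <= 56):
                return np.inf                                        # candidate points outside the frame
            return np.abs(ref[y:y + 8, x:x + 8] - block).sum()

        mv, sigma = np.zeros(2), 4.0
        for _ in range(300):
            child = mv + rng.normal(0.0, sigma, 2)
            if sad(child) <= sad(mv):
                mv, sigma = child, sigma * 1.1                       # success: widen the search
            else:
                sigma = max(sigma * 0.95, 0.25)                      # failure: narrow it
        print(np.round(mv).astype(int), sad(mv))                     # the true displacement is (-3, 5)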

    A visualization tool to explore alphabet orderings for the Burrows-Wheeler Transform

    The Burrows-Wheeler Transform (BWT) is an efficient invertible text transformation algorithm with the properties of tending to group identical characters together in a run and enabling search of the text. This transformation has extensive uses, particularly in lossless compression algorithms, indexing, and within bioinformatics for sequence alignment tasks. There has been recent interest in minimizing the number of identical character runs (r) for a transform and in finding useful alphabet orderings for the sorting step of the matrix associated with the BWT construction. This motivates the inspection of many transforms while developing algorithms. However, the full Burrows-Wheeler matrix is O(n^2) space and therefore very difficult to display and inspect for large input sizes. In this paper we present a graphical user interface (GUI) for working with BWTs, which includes features for searching for matrix row prefixes, skipping over sections in the right-most column (the transform), and displaying BWTs while exploring alphabet orderings with the goal of minimizing the number of runs. Comment: 8 pages, 2 figures
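    What the tool lets a user inspect can be reproduced in miniature: build the BWT of a string under a chosen alphabet ordering and count the runs r of the resulting transform. The naive quadratic construction and the "$" sentinel below are simplifications for illustration; practical tools build the transform from a suffix array.

        # Naive BWT under a custom alphabet ordering, plus the run count r.
        def bwt(text, order):
            rank = {c: i for i, c in enumerate(order)}
            rank["$"] = -1                      # sentinel assumed to sort first
            s = text + "$"
            rotations = sorted((s[i:] + s[:i] for i in range(len(s))),
                               key=lambda rot: [rank[c] for c in rot])
            return "".join(rot[-1] for rot in rotations)

        def runs(s):
            return 1 + sum(1 for a, b in zip(s, s[1:]) if a != b)

        for order in ("abn", "nba"):            # two orderings of the alphabet of "banana"
            t = bwt("banana", order)
            print(order, t, "r =", runs(t))

    Even on this toy input the ordering matters: "abn" yields annb$aa with r = 5, while "nba" yields aaa$nnb with r = 4, which is exactly the kind of effect the GUI is designed to make visible.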