A Multi-task Learning Approach for Improving Product Title Compression with User Search Log Data
It is a challenging and practical research problem to obtain effective
compression of lengthy product titles for E-commerce. This is particularly
important as more and more users browse mobile E-commerce apps and more
merchants make the original product titles redundant and lengthy for Search
Engine Optimization. Traditional text summarization approaches often incur
substantial preprocessing costs and do not capture the important issue of
conversion rate in E-commerce. This paper proposes a novel multi-task learning
approach for improving product title compression with user search log data. In
particular, a pointer network-based sequence-to-sequence approach is utilized
for title compression with an attention mechanism as an extractive method, and
an attentive encoder-decoder approach is utilized for generating user search
queries. The encoding parameters (i.e., semantic embedding of original titles)
are shared among the two tasks and the attention distributions are jointly
optimized. An extensive set of experiments with both human-annotated data and
online deployment demonstrates the advantage of the proposed approach for both
compression quality and online business value.
Comment: 8 pages, accepted at AAAI 201
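The extractive branch can be pictured with a toy sketch: in place of a learned pointer network, word-level "attention" is approximated here by how often each title word appears in the user search log. All names below are illustrative, not the paper's implementation:

```python
from collections import Counter

def compress_title(title, query_log, budget):
    """Toy extractive title compression: keep the words that occur most
    often in user search queries (a crude stand-in for learned attention
    weights), preserving their original order in the title."""
    freq = Counter(w for q in query_log for w in q.split())
    words = title.split()
    # Rank word positions by query frequency; sorted() is stable, so ties
    # keep their original left-to-right order.
    keep = set(sorted(range(len(words)), key=lambda i: -freq[words[i]])[:budget])
    return " ".join(w for i, w in enumerate(words) if i in keep)
```

The real model learns the scoring jointly with query generation; this sketch only shows why search logs are a useful supervision signal for deciding which title words to keep.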
Optimizing Wirelessly Powered Crowd Sensing: Trading energy for data
To overcome the limited coverage in traditional wireless sensor networks,
\emph{mobile crowd sensing} (MCS) has emerged as a new sensing paradigm. To
extend the battery lives of user devices and incentivize human involvement,
this paper presents a novel approach that seamlessly integrates MCS with
wireless power transfer, called \emph{wirelessly powered crowd sensing} (WPCS),
which supports crowd sensing with energy compensation and offers rewards as
incentives. The optimization problem is formulated to simultaneously maximize
the data utility and minimize the energy consumption for the service operator, by
jointly controlling wireless-power allocation at the \emph{access point} (AP)
as well as sensing-data size, compression ratio, and sensor-transmission
duration at each \emph{mobile sensor} (MS). Given fixed compression ratios, the
optimal power allocation policy is shown to have a \emph{threshold}-based
structure with respect to a defined \emph{crowd-sensing priority} function for
each MS. Given fixed sensing-data utilities, the compression policy achieves
the optimal compression ratio. Extensive simulations are also presented to
verify the efficiency of the contributed mechanisms.
Comment: arXiv admin note: text overlap with arXiv:1711.0206
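A threshold policy of this shape can be sketched as follows. This is a toy illustration: the priority values, the threshold, and the proportional split are assumptions, not the paper's derived optimum:

```python
def threshold_allocation(priorities, threshold, total_power):
    """Toy threshold-based power allocation: MSs whose crowd-sensing
    priority exceeds the threshold share the AP's power budget in
    proportion to their priorities; all other MSs receive no power."""
    active = {i: p for i, p in enumerate(priorities) if p > threshold}
    if not active:
        return [0.0] * len(priorities)
    total = sum(active.values())
    return [total_power * active[i] / total if i in active else 0.0
            for i in range(len(priorities))]
```

The point of the threshold structure is that the operator never wastes power on low-priority sensors, no matter how the budget is split among the rest.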
A Universal Parallel Two-Pass MDL Context Tree Compression Algorithm
Computing problems that handle large amounts of data necessitate the use of
lossless data compression for efficient storage and transmission. We present a
novel lossless universal data compression algorithm that uses parallel
computational units to increase the throughput. The length-$N$ input sequence
is partitioned into $B$ blocks. Processing each block independently of the
other blocks can accelerate the computation by a factor of $B$, but degrades
the compression quality. Instead, our approach is to first estimate the minimum
description length (MDL) context tree source underlying the entire input, and
then encode each of the $B$ blocks in parallel based on the MDL source. With
this two-pass approach, the compression loss incurred by using more parallel
units is insignificant. Our algorithm is work-efficient, i.e., its
computational complexity is $O(N/B)$ per computational unit. Its redundancy is
approximately $B\log(N/B)$ bits above Rissanen's lower bound on universal
compression performance, with respect to any context tree source whose maximal
depth is at most $\log(N/B)$. We improve the compression by using different quantizers for
states of the context tree based on the number of symbols corresponding to
those states. Numerical results from a prototype implementation suggest that
our algorithm offers a better trade-off between compression and throughput than
competing universal data compression algorithms.
Comment: Accepted to Journal of Selected Topics in Signal Processing special
issue on Signal Processing for Big Data (expected publication date June
2015). 10 pages double column, 6 figures, and 2 tables. arXiv admin note:
substantial text overlap with arXiv:1405.6322. Version: Mar 2015: Corrected a
typo
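The two-pass idea, learn one model over the whole input and then compress blocks independently against it, can be imitated with zlib preset dictionaries standing in for the MDL context tree. This is an analogy, not the paper's algorithm:

```python
import zlib

def two_pass_compress(data, num_blocks):
    """Pass 1: derive a shared model from the whole input (here, a zlib
    preset dictionary -- a crude stand-in for the MDL context tree).
    Pass 2: compress each block independently against that shared model,
    so each iteration of the loop could run on a separate worker."""
    zdict = data[:32768]                # zlib dictionaries cap at 32 KiB
    size = -(-len(data) // num_blocks)  # ceil division into blocks
    blocks = [data[i:i + size] for i in range(0, len(data), size)]
    out = []
    for block in blocks:                # each iteration is independent
        c = zlib.compressobj(zdict=zdict)
        out.append(c.compress(block) + c.flush())
    return zdict, out

def two_pass_decompress(zdict, compressed_blocks):
    """Decode each block with the same shared dictionary and concatenate."""
    parts = []
    for cb in compressed_blocks:
        d = zlib.decompressobj(zdict=zdict)
        parts.append(d.decompress(cb))
    return b"".join(parts)
```

As in the paper, the shared first-pass model is what keeps per-block compression from degrading as the number of parallel units grows.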
Large-scale compression of genomic sequence databases with the Burrows-Wheeler transform
Motivation
The Burrows-Wheeler transform (BWT) is the foundation of many algorithms for
compression and indexing of text data, but the cost of computing the BWT of
very large string collections has prevented these techniques from being widely
applied to the large sets of sequences often encountered as the outcome of DNA
sequencing experiments. In previous work, we presented a novel algorithm that
allows the BWT of human genome scale data to be computed on very moderate
hardware, thus enabling us to investigate the BWT as a tool for the compression
of such datasets.
Results
We first used simulated reads to explore the relationship between the level
of compression and the error rate, the length of the reads, and the level of
sampling of the underlying genome, and compared choices of second-stage
compression algorithm.
We demonstrate that compression may be greatly improved by a particular
reordering of the sequences in the collection and give a novel `implicit
sorting' strategy that enables these benefits to be realised without the
overhead of sorting the reads. With these techniques, a 45x coverage of real
human genome sequence data compresses losslessly to under 0.5 bits per base,
allowing the 135.3Gbp of sequence to fit into only 8.2Gbytes of space (trimming
a small proportion of low-quality bases from the reads improves the compression
still further).
This is more than 4 times smaller than the size achieved by a standard
BWT-based compressor (bzip2) on the untrimmed reads, but an important further
advantage of our approach is that it facilitates the building of compressed
full text indexes such as the FM-index on large-scale DNA sequence collections.
Comment: Version here is as submitted to Bioinformatics and is the same as the
previously archived version. This submission registers the fact that the
advance access version is now available at
http://bioinformatics.oxfordjournals.org/content/early/2012/05/02/bioinformatics.bts173.abstract
. Bioinformatics should be considered as the original place of publication of
this article; please cite accordingly.
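For intuition, the BWT itself fits in a few lines when built naively from sorted rotations (collection-scale construction, the problem the paper solves, avoids materializing rotations entirely):

```python
def bwt(s, eos="$"):
    """Naive Burrows-Wheeler transform: append a unique end-of-string
    marker, sort all rotations, and take the last column. Quadratic in
    memory, so suitable only for short strings."""
    s += eos
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rotations)
```

The transform is a permutation that clusters identical characters into runs (`bwt("banana")` gives `"annb$aa"`), which is exactly what makes second-stage compressors so effective on sequencing reads.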
Approximating Human-Like Few-shot Learning with GPT-based Compression
In this work, we conceptualize the learning process as information
compression. We seek to equip generative pre-trained models with human-like
learning capabilities that enable data compression during inference. We present
a novel approach that utilizes the Generative Pre-trained Transformer (GPT) to
approximate Kolmogorov complexity, with the aim of estimating the optimal
Information Distance for few-shot learning. We first propose using GPT as a
prior for lossless text compression, achieving a noteworthy compression ratio.
Experiments with a LLAMA2-7B backbone achieve a compression ratio of 15.5 on
enwik9. We justify the pre-training objective of GPT models by demonstrating
its equivalence to the compression length, and, consequently, its ability to
approximate the information distance for texts. Leveraging the approximated
information distance, our method allows the direct application of GPT models in
quantitative text similarity measurements. Experiment results show that our
method overall achieves superior performance compared to embedding and prompt
baselines on challenging NLP tasks, including semantic similarity, zero and
one-shot text classification, and zero-shot text ranking.
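The information-distance idea can be tried with any off-the-shelf compressor standing in for GPT; a minimal normalized-compression-distance sketch with zlib (the choice of compressor `C` is the only assumption):

```python
import zlib

def ncd(x, y):
    """Normalized compression distance: C(.) approximates Kolmogorov
    complexity by compressed length in bytes. Here C is zlib; the paper
    uses a GPT model as the compressor instead."""
    C = lambda b: len(zlib.compress(b, 9))
    cx, cy, cxy = C(x), C(y), C(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)
```

Texts that share structure compress well jointly, so their NCD is small; a stronger compressor (such as the GPT prior in the paper) gives a correspondingly better similarity measure.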
A topic modeling based approach to novel document automatic summarization
© 2017 Elsevier Ltd. Most existing automatic text summarization algorithms target collections of relatively short documents and are therefore difficult to apply directly to novels, which are long and loosely structured. In this paper, aiming at novel documents, we propose a topic-modeling-based approach to extractive automatic summarization that achieves a good balance among compression ratio, summarization quality, and machine readability. First, based on topic modeling, we extract candidate sentences associated with topic words from a preprocessed novel document. Second, with the goals of compression ratio and topic diversity, we design an importance evaluation function to select the most important sentences from the candidates and thus generate an initial summary. Finally, we smooth the initial summary to overcome the semantic confusion caused by ambiguous or synonymous words, so as to improve its readability. We evaluate our proposed approach experimentally on a real novel dataset. The experimental results show that, compared with those from other candidate algorithms, each automatic summary generated by our approach has not only a higher compression ratio but also better summarization quality.
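The selection step is budgeted extraction; a toy sketch, with the topic words given explicitly rather than learned by a topic model, and a plain keyword count standing in for the paper's importance evaluation function:

```python
def summarize(sentences, topic_words, ratio):
    """Toy extractive summarizer: score each sentence by how many topic
    words it contains, keep the top fraction `ratio` of sentences, and
    preserve their original order (the machine-readability constraint)."""
    scores = [sum(w in s.lower() for w in topic_words) for s in sentences]
    budget = max(1, int(len(sentences) * ratio))
    keep = set(sorted(range(len(sentences)), key=lambda i: -scores[i])[:budget])
    return [s for i, s in enumerate(sentences) if i in keep]
```

The `ratio` parameter plays the role of the compression ratio in the paper's trade-off; the smoothing pass that repairs ambiguous wording is omitted here.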
Human-centered compression for efficient text input
Traditional methods for efficient text entry are based on prediction. Prediction requires a constant context shift between entering text and selecting or verifying the predictions. Previous research has shown that the advantages offered by prediction are usually eliminated by the cognitive load associated with such context switching. We present a novel approach that relies on compression. Users are required to compress text using a very simple abbreviation technique that yields an average keystroke reduction of 26.4%. Input text is automatically decoded using weighted finite-state transducers incorporating both word-based and letter-based n-gram language models. Decoding yields a residual error rate of 3.3%. User experiments show that this approach yields improved text input speeds.
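The 26.4% figure refers to the paper's abbreviation scheme; one simple rule in the same spirit, though not necessarily the paper's exact scheme, is dropping word-internal vowels:

```python
def abbreviate(text):
    """Drop word-internal vowels, keeping each word's first letter.
    A transducer-based decoder with n-gram language models (as in the
    paper) would recover the full text; decoding is omitted here."""
    out = []
    for w in text.split():
        out.append(w[0] + "".join(c for c in w[1:] if c.lower() not in "aeiou"))
    return " ".join(out)

def keystroke_reduction(text):
    """Fraction of keystrokes saved by the abbreviation rule."""
    return 1 - len(abbreviate(text)) / len(text)
```

Because the rule is deterministic and trivial to apply, it avoids the context switching that erodes the gains of predictive entry.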
Table Substitution Box Method for Increasing Security in Interval Splitting Arithmetic Coding
Amalgamation of compression and security is indispensable in the field of multimedia applications. A novel approach to enhancing security alongside compression is discussed in this research paper. In the secure arithmetic coder (SAC), security is provided by input and output permutation methods and compression is done by interval splitting arithmetic coding. The permutation in SAC is susceptible to attacks, and this research addresses the encryption issues associated with it. The aim of the proposed method is to encrypt the data first with a Table Substitution Box (T-box) and then compress it with an Interval Splitting Arithmetic Coder (ISAC). The method incorporates a dynamic T-box in order to provide better security. A T-box is a table whose elements are drawn from the random output of a Pseudo Random Number Generator (PRNG), which is seeded from a Secure Hash Algorithm-256 (SHA-256) message digest. The scheme is created based on a key known to both encoder and decoder, and further T-boxes are created by using the previous message digest as the key. The existing interval splitting arithmetic coding of SAC is applied for compression of the text data; interval splitting finds a relative position at which to split the intervals, and this in turn yields compression. The results reveal that replacing permutation with the T-box method provides stronger security than SAC: data is not revealed when permutation is replaced by the T-box method, and security analysis shows that the data remains secure against ciphertext attacks, known-plaintext attacks, and chosen-plaintext attacks. Additionally, the compression ratio is compared by passing the output of the T-box to traditional arithmetic coding; this comparison shows a minor reduction in compression ratio for ISAC relative to arithmetic coding. However, the security provided by ISAC outweighs this compression-ratio cost.
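The key-chained T-box construction can be sketched as follows. Python's `random.Random` stands in for the PRNG; the real scheme's PRNG and table layout may differ:

```python
import hashlib
import random

def make_tbox(key: bytes):
    """Sketch of a key-derived substitution table: the SHA-256 digest of
    the key seeds a PRNG (random.Random here is a toy stand-in, not
    cryptographically strong), which shuffles the 256 byte values into a
    T-box. The digest is returned to serve as the next round's key,
    mirroring the paper's chaining of message digests."""
    digest = hashlib.sha256(key).digest()
    rng = random.Random(digest)
    tbox = list(range(256))
    rng.shuffle(tbox)
    return tbox, digest

def substitute(data: bytes, tbox) -> bytes:
    """Apply the byte-substitution step (encryption before compression)."""
    return bytes(tbox[b] for b in data)
```

The substituted output would then be fed to the interval-splitting arithmetic coder; since the T-box is a permutation of byte values, the receiver inverts it with the same key-derived table.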