A Multi-task Learning Approach for Improving Product Title Compression with User Search Log Data
It is a challenging and practical research problem to obtain effective
compression of lengthy product titles for E-commerce. This is particularly
important as more and more users browse mobile E-commerce apps and more
merchants make the original product titles redundant and lengthy for Search
Engine Optimization. Traditional text summarization approaches often incur
substantial preprocessing costs and do not capture the important issue of
conversion rate in E-commerce. This paper proposes a novel multi-task learning
approach for improving product title compression with user search log data. In
particular, a pointer network-based sequence-to-sequence approach is utilized
for title compression with an attention mechanism as an extractive method, and
an attentive encoder-decoder approach is utilized for generating user search
queries. The encoding parameters (i.e., semantic embedding of original titles)
are shared among the two tasks and the attention distributions are jointly
optimized. An extensive set of experiments with both human-annotated data and
online deployment demonstrates the advantage of the proposed approach for both
compression quality and online business value.
Comment: 8 pages, accepted at AAAI 201
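The extractive branch can be pictured with a toy sketch: in place of a learned pointer network, word-level "attention" is approximated here by how often each title word appears in the user search log. All names below are illustrative, not the paper's implementation:

```python
from collections import Counter

def compress_title(title, query_log, budget):
    """Toy extractive title compression: keep the words that occur most
    often in user search queries (a crude stand-in for learned attention
    weights), preserving their original order in the title."""
    freq = Counter(w for q in query_log for w in q.split())
    words = title.split()
    # Rank word positions by query frequency; sorted() is stable, so ties
    # keep their original left-to-right order.
    keep = set(sorted(range(len(words)), key=lambda i: -freq[words[i]])[:budget])
    return " ".join(w for i, w in enumerate(words) if i in keep)
```

The real model learns the scoring jointly with query generation; this sketch only shows why search logs are a useful supervision signal for deciding which title words to keep.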
Optimizing Wirelessly Powered Crowd Sensing: Trading energy for data
To overcome the limited coverage in traditional wireless sensor networks,
\emph{mobile crowd sensing} (MCS) has emerged as a new sensing paradigm. To
extend the battery lives of user devices and incentivize human involvement,
this paper presents a novel approach that seamlessly integrates MCS with
wireless power transfer, called \emph{wirelessly powered crowd sensing} (WPCS),
which supports crowd sensing with energy compensation and offers rewards as
incentives. The optimization problem is formulated to simultaneously maximize
the data utility and minimize the energy consumption for the service operator, by
jointly controlling wireless-power allocation at the \emph{access point} (AP)
as well as sensing-data size, compression ratio, and sensor-transmission
duration at each \emph{mobile sensor} (MS). Given fixed compression ratios, the
optimal power allocation policy is shown to have a \emph{threshold}-based
structure with respect to a defined \emph{crowd-sensing priority} function for
each MS. Given fixed sensing-data utilities, the compression policy achieves
the optimal compression ratio. Extensive simulations are also presented to
verify the efficiency of the contributed mechanisms.
Comment: arXiv admin note: text overlap with arXiv:1711.0206
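A threshold policy of this shape can be sketched as follows. This is a toy illustration: the priority values, the threshold, and the proportional split are assumptions, not the paper's derived optimum:

```python
def threshold_allocation(priorities, threshold, total_power):
    """Toy threshold-based power allocation: MSs whose crowd-sensing
    priority exceeds the threshold share the AP's power budget in
    proportion to their priorities; all other MSs receive no power."""
    active = {i: p for i, p in enumerate(priorities) if p > threshold}
    if not active:
        return [0.0] * len(priorities)
    total = sum(active.values())
    return [total_power * active[i] / total if i in active else 0.0
            for i in range(len(priorities))]
```

The point of the threshold structure is that the operator never wastes power on low-priority sensors, no matter how the budget is split among the rest.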
A Universal Parallel Two-Pass MDL Context Tree Compression Algorithm
Computing problems that handle large amounts of data necessitate the use of
lossless data compression for efficient storage and transmission. We present a
novel lossless universal data compression algorithm that uses parallel
computational units to increase the throughput. The length-$N$ input sequence
is partitioned into $B$ blocks. Processing each block independently of the
other blocks can accelerate the computation by a factor of $B$, but degrades
the compression quality. Instead, our approach is to first estimate the minimum
description length (MDL) context tree source underlying the entire input, and
then encode each of the $B$ blocks in parallel based on the MDL source. With
this two-pass approach, the compression loss incurred by using more parallel
units is insignificant. Our algorithm is work-efficient, i.e., its
computational complexity is $O(N/B)$ per computational unit. Its redundancy is
approximately $B\log(N/B)$ bits above Rissanen's lower bound on universal
compression performance, with respect to any context tree source whose maximal
depth is at most $\log(N/B)$. We improve the compression by using different quantizers for
states of the context tree based on the number of symbols corresponding to
those states. Numerical results from a prototype implementation suggest that
our algorithm offers a better trade-off between compression and throughput than
competing universal data compression algorithms.
Comment: Accepted to Journal of Selected Topics in Signal Processing special
issue on Signal Processing for Big Data (expected publication date June
2015). 10 pages double column, 6 figures, and 2 tables. arXiv admin note:
substantial text overlap with arXiv:1405.6322. Version: Mar 2015: Corrected a
typo
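The two-pass idea, learn one model over the whole input and then compress blocks independently against it, can be imitated with zlib preset dictionaries standing in for the MDL context tree. This is an analogy, not the paper's algorithm:

```python
import zlib

def two_pass_compress(data, num_blocks):
    """Pass 1: derive a shared model from the whole input (here, a zlib
    preset dictionary -- a crude stand-in for the MDL context tree).
    Pass 2: compress each block independently against that shared model,
    so each iteration of the loop could run on a separate worker."""
    zdict = data[:32768]                # zlib dictionaries cap at 32 KiB
    size = -(-len(data) // num_blocks)  # ceil division into blocks
    blocks = [data[i:i + size] for i in range(0, len(data), size)]
    out = []
    for block in blocks:                # each iteration is independent
        c = zlib.compressobj(zdict=zdict)
        out.append(c.compress(block) + c.flush())
    return zdict, out

def two_pass_decompress(zdict, compressed_blocks):
    """Decode each block with the same shared dictionary and concatenate."""
    parts = []
    for cb in compressed_blocks:
        d = zlib.decompressobj(zdict=zdict)
        parts.append(d.decompress(cb))
    return b"".join(parts)
```

As in the paper, the shared first-pass model is what keeps per-block compression from degrading as the number of parallel units grows.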
Large-scale compression of genomic sequence databases with the Burrows-Wheeler transform
Motivation
The Burrows-Wheeler transform (BWT) is the foundation of many algorithms for
compression and indexing of text data, but the cost of computing the BWT of
very large string collections has prevented these techniques from being widely
applied to the large sets of sequences often encountered as the outcome of DNA
sequencing experiments. In previous work, we presented a novel algorithm that
allows the BWT of human genome scale data to be computed on very moderate
hardware, thus enabling us to investigate the BWT as a tool for the compression
of such datasets.
Results
We first used simulated reads to explore the relationship between the level
of compression and the error rate, the length of the reads, and the level of
sampling of the underlying genome, and compared choices of second-stage
compression algorithm.
We demonstrate that compression may be greatly improved by a particular
reordering of the sequences in the collection and give a novel `implicit
sorting' strategy that enables these benefits to be realised without the
overhead of sorting the reads. With these techniques, a 45x coverage of real
human genome sequence data compresses losslessly to under 0.5 bits per base,
allowing the 135.3Gbp of sequence to fit into only 8.2Gbytes of space (trimming
a small proportion of low-quality bases from the reads improves the compression
still further).
This is more than 4 times smaller than the size achieved by a standard
BWT-based compressor (bzip2) on the untrimmed reads, but an important further
advantage of our approach is that it facilitates the building of compressed
full text indexes such as the FM-index on large-scale DNA sequence collections.
Comment: Version here is as submitted to Bioinformatics and is the same as the
previously archived version. This submission registers the fact that the
advance access version is now available at
http://bioinformatics.oxfordjournals.org/content/early/2012/05/02/bioinformatics.bts173.abstract
. Bioinformatics should be considered as the original place of publication of
this article; please cite accordingly.
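For intuition, the BWT itself fits in a few lines when built naively from sorted rotations (collection-scale construction, the problem the paper solves, avoids materializing rotations entirely):

```python
def bwt(s, eos="$"):
    """Naive Burrows-Wheeler transform: append a unique end-of-string
    marker, sort all rotations, and take the last column. Quadratic in
    memory, so suitable only for short strings."""
    s += eos
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rotations)
```

The transform is a permutation that clusters identical characters into runs (`bwt("banana")` gives `"annb$aa"`), which is exactly what makes second-stage compressors so effective on sequencing reads.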
Approximating Human-Like Few-shot Learning with GPT-based Compression
In this work, we conceptualize the learning process as information
compression. We seek to equip generative pre-trained models with human-like
learning capabilities that enable data compression during inference. We present
a novel approach that utilizes the Generative Pre-trained Transformer (GPT) to
approximate Kolmogorov complexity, with the aim of estimating the optimal
Information Distance for few-shot learning. We first propose using GPT as a
prior for lossless text compression, achieving a noteworthy compression ratio.
Experiments with a LLAMA2-7B backbone achieve a compression ratio of 15.5 on
enwik9. We justify the pre-training objective of GPT models by demonstrating
its equivalence to the compression length, and, consequently, its ability to
approximate the information distance for texts. Leveraging the approximated
information distance, our method allows the direct application of GPT models in
quantitative text similarity measurements. Experiment results show that our
method overall achieves superior performance compared to embedding and prompt
baselines on challenging NLP tasks, including semantic similarity, zero and
one-shot text classification, and zero-shot text ranking.
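The information-distance idea can be tried with any off-the-shelf compressor standing in for GPT; a minimal normalized-compression-distance sketch with zlib (the choice of compressor `C` is the only assumption):

```python
import zlib

def ncd(x, y):
    """Normalized compression distance: C(.) approximates Kolmogorov
    complexity by compressed length in bytes. Here C is zlib; the paper
    uses a GPT model as the compressor instead."""
    C = lambda b: len(zlib.compress(b, 9))
    cx, cy, cxy = C(x), C(y), C(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)
```

Texts that share structure compress well jointly, so their NCD is small; a stronger compressor (such as the GPT prior in the paper) gives a correspondingly better similarity measure.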
A topic modeling based approach to novel document automatic summarization
© 2017 Elsevier Ltd. Most existing automatic text summarization algorithms target collections of relatively short documents and are therefore difficult to apply directly to novels, which are long and loosely structured. In this paper, aiming at novel documents, we propose a topic-modeling-based approach to extractive automatic summarization that achieves a good balance among compression ratio, summarization quality, and machine readability. First, based on topic modeling, we extract candidate sentences associated with topic words from a preprocessed novel document. Second, with the goals of compression ratio and topic diversity, we design an importance evaluation function to select the most important sentences from the candidates and thus generate an initial summary. Finally, we smooth the initial summary to overcome the semantic confusion caused by ambiguous or synonymous words, so as to improve its readability. We evaluate our proposed approach experimentally on a real novel dataset. The experimental results show that, compared with those from other candidate algorithms, each automatic summary generated by our approach has not only a higher compression ratio but also better summarization quality.
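The selection step is budgeted extraction; a toy sketch, with the topic words given explicitly rather than learned by a topic model, and a plain keyword count standing in for the paper's importance evaluation function:

```python
def summarize(sentences, topic_words, ratio):
    """Toy extractive summarizer: score each sentence by how many topic
    words it contains, keep the top fraction `ratio` of sentences, and
    preserve their original order (the machine-readability constraint)."""
    scores = [sum(w in s.lower() for w in topic_words) for s in sentences]
    budget = max(1, int(len(sentences) * ratio))
    keep = set(sorted(range(len(sentences)), key=lambda i: -scores[i])[:budget])
    return [s for i, s in enumerate(sentences) if i in keep]
```

The `ratio` parameter plays the role of the compression ratio in the paper's trade-off; the smoothing pass that repairs ambiguous wording is omitted here.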
Human-centered compression for efficient text input
Traditional methods for efficient text entry are based on prediction. Prediction requires a constant context shift between entering text and selecting or verifying the predictions. Previous research has shown that the advantages offered by prediction are usually eliminated by the cognitive load associated with such context switching. We present a novel approach that relies on compression. Users are required to compress text using a very simple abbreviation technique that yields an average keystroke reduction of 26.4%. Input text is automatically decoded using weighted finite-state transducers incorporating both word-based and letter-based n-gram language models. Decoding yields a residual error rate of 3.3%. User experiments show that this approach yields improved text input speeds.
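The 26.4% figure refers to the paper's abbreviation scheme; one simple rule in the same spirit, though not necessarily the paper's exact scheme, is dropping word-internal vowels:

```python
def abbreviate(text):
    """Drop word-internal vowels, keeping each word's first letter.
    A transducer-based decoder with n-gram language models (as in the
    paper) would recover the full text; decoding is omitted here."""
    out = []
    for w in text.split():
        out.append(w[0] + "".join(c for c in w[1:] if c.lower() not in "aeiou"))
    return " ".join(out)

def keystroke_reduction(text):
    """Fraction of keystrokes saved by the abbreviation rule."""
    return 1 - len(abbreviate(text)) / len(text)
```

Because the rule is deterministic and trivial to apply, it avoids the context switching that erodes the gains of predictive entry.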
Table Substitution Box Method for Increasing Security in Interval Splitting Arithmetic Coding
Amalgamation of compression and security is indispensable in the field of multimedia applications. A novel approach to enhancing security alongside compression is discussed in this research paper. In the secure arithmetic coder (SAC), security is provided by input and output permutation methods and compression is done by interval splitting arithmetic coding. The permutation in SAC is susceptible to attacks, and this research addresses the encryption issues associated with it. The aim of the proposed method is to encrypt the data first with a Table Substitution Box (T-box) and then compress it with an Interval Splitting Arithmetic Coder (ISAC). The method incorporates a dynamic T-box in order to provide better security. A T-box is a table whose elements are drawn from the random output of a Pseudo Random Number Generator (PRNG), which is seeded from a Secure Hash Algorithm-256 (SHA-256) message digest. The scheme is created based on a key known to both encoder and decoder, and further T-boxes are created by using the previous message digest as the key. The existing interval splitting arithmetic coding of SAC is applied for compression of the text data; interval splitting finds a relative position at which to split the intervals, and this in turn yields compression. The results reveal that replacing permutation with the T-box method provides stronger security than SAC: data is not revealed when permutation is replaced by the T-box method, and security analysis shows that the data remains secure against ciphertext attacks, known-plaintext attacks, and chosen-plaintext attacks. Additionally, the compression ratio is compared by passing the output of the T-box to traditional arithmetic coding; this comparison shows a minor reduction in compression ratio for ISAC relative to arithmetic coding. However, the security provided by ISAC outweighs this compression-ratio cost.
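The key-chained T-box construction can be sketched as follows. Python's `random.Random` stands in for the PRNG; the real scheme's PRNG and table layout may differ:

```python
import hashlib
import random

def make_tbox(key: bytes):
    """Sketch of a key-derived substitution table: the SHA-256 digest of
    the key seeds a PRNG (random.Random here is a toy stand-in, not
    cryptographically strong), which shuffles the 256 byte values into a
    T-box. The digest is returned to serve as the next round's key,
    mirroring the paper's chaining of message digests."""
    digest = hashlib.sha256(key).digest()
    rng = random.Random(digest)
    tbox = list(range(256))
    rng.shuffle(tbox)
    return tbox, digest

def substitute(data: bytes, tbox) -> bytes:
    """Apply the byte-substitution step (encryption before compression)."""
    return bytes(tbox[b] for b in data)
```

The substituted output would then be fed to the interval-splitting arithmetic coder; since the T-box is a permutation of byte values, the receiver inverts it with the same key-derived table.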