Evaluation and Implementation of n-Gram-Based Algorithm for Fast Text Comparison
This paper presents a study of an n-gram-based document comparison method intended for building a large-scale plagiarism detection system. The work focuses not only on the efficiency of text similarity extraction but also on the execution performance of the implemented algorithms. We examined the detection performance, storage requirements, and execution time of the proposed approach. The results show the trade-offs between detection quality and computational requirements. GPGPU and multi-CPU platforms were considered for implementing the algorithms and achieving good execution speed. The method consists of two main algorithms: document feature extraction and fast text comparison. The winnowing algorithm is used to generate a compressed representation of the analyzed documents. The authors designed and implemented a dedicated test framework for the algorithm, which allowed its parameters to be tuned, evaluated, and optimized. Well-known metrics (e.g. precision, recall) were used to evaluate detection performance. Tests were conducted to determine the performance of the winnowing algorithm on obfuscated and unobfuscated texts for different window and n-gram sizes. A simplified version of the text comparison algorithm was also proposed and evaluated to reduce the computational complexity of the comparison process. The paper also presents GPGPU and multi-CPU implementations of the algorithms for different data structures. The implementation speed was tested for different algorithm parameters and data sizes, and the scalability of the algorithm on multi-CPU platforms was verified. The authors provide a repository of the software tools and programs used in the experiments.
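The feature-extraction step named in this abstract (winnowing over n-gram hashes) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the MD5-based hash, the default `n` and `window` values, and the Jaccard comparison at the end are all assumptions chosen for illustration.

```python
import hashlib


def ngram_hashes(text, n):
    """Hash every character n-gram of the normalised text (assumed normalisation)."""
    cleaned = "".join(ch.lower() for ch in text if ch.isalnum())
    hashes = []
    for i in range(len(cleaned) - n + 1):
        digest = hashlib.md5(cleaned[i:i + n].encode("utf-8")).digest()
        hashes.append(int.from_bytes(digest[:4], "big"))
    return hashes


def winnow(text, n=5, window=4):
    """Return the (hash, position) fingerprints selected by winnowing."""
    hashes = ngram_hashes(text, n)
    fingerprints = set()
    for start in range(max(len(hashes) - window + 1, 0)):
        win = hashes[start:start + window]
        smallest = min(win)
        # Keep the rightmost occurrence of the minimum in this window.
        offset = window - 1 - win[::-1].index(smallest)
        fingerprints.add((smallest, start + offset))
    return fingerprints


def similarity(a, b, n=5, window=4):
    """Jaccard similarity of the two fingerprint hash sets (an assumed metric)."""
    fa = {h for h, _ in winnow(a, n, window)}
    fb = {h for h, _ in winnow(b, n, window)}
    if not fa or not fb:
        return 0.0
    return len(fa & fb) / len(fa | fb)
```

With these settings, any shared passage of at least `window + n - 1` normalised characters contributes at least one common fingerprint, which is the property that makes the compressed representation usable for comparison. Larger `window` values keep fewer fingerprints per document at the cost of missing shorter matches.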
Fast and Tiny Structural Self-Indexes for XML
XML document markup is highly repetitive and therefore well compressible
using dictionary-based methods such as DAGs or grammars. In the context of
selectivity estimation, grammar-compressed trees were used before as synopses
for structural XPath queries. Here a fully-fledged index over such grammars is
presented. The index allows arbitrary tree algorithms to be executed with a
slow-down that is comparable to the space improvement. More interestingly,
certain algorithms execute much faster over the index (because no decompression
occurs). E.g., for structural XPath count queries, evaluating over the index is
faster than previous XPath implementations, often by two orders of magnitude.
The index also serializes XML results (including texts) faster than
previous systems, by a factor of ca. 2-3. This is due to efficient copy
handling of grammar repetitions, and because materialization is totally
avoided. In order to compare with twig join implementations, we implemented a
materializer which writes out pre-order numbers of result nodes, and show its
competitiveness. Comment: 13 pages
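To illustrate why repetitive markup compresses well and why some queries can run without decompression, here is a minimal sketch of DAG sharing, the simpler of the two dictionary-based methods the abstract mentions. It is not the grammar-based index of the paper; the tuple tree encoding and the names `build_dag` and `count_label` are assumptions made for this example.

```python
def build_dag(tree, table):
    """Insert a (label, [children]) tree into `table`, sharing equal subtrees.

    `table` maps (label, child_ids) to a node id; the id of the root is returned.
    """
    label, children = tree
    key = (label, tuple(build_dag(child, table) for child in children))
    if key not in table:
        table[key] = len(table)
    return table[key]


def count_label(root_id, nodes, label):
    """Count tree nodes carrying `label` without expanding the DAG.

    `nodes` maps a node id to (label, child_ids). Each shared node is
    evaluated once and its count is reused for every occurrence.
    """
    memo = {}

    def visit(node_id):
        if node_id not in memo:
            node_label, kids = nodes[node_id]
            memo[node_id] = int(node_label == label) + sum(visit(k) for k in kids)
        return memo[node_id]

    return visit(root_id)


# Three identical <book> elements collapse into one shared DAG node.
book = ("book", [("title", []), ("author", [])])
library = ("library", [book, book, book])
table = {}
root = build_dag(library, table)
nodes = {node_id: key for key, node_id in table.items()}
```

Here the 10-node tree is stored as 4 DAG nodes, and a count query visits each of those 4 nodes once rather than all 10 tree nodes.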
AntiPlag: Plagiarism Detection on Electronic Submissions of Text Based Assignments
Plagiarism is one of the growing issues in academia and is always a concern
in universities and other academic institutions. The situation is becoming even
worse with the availability of ample resources on the web. This paper focuses
on creating an effective and fast tool for plagiarism detection for text-based
electronic assignments. Our plagiarism detection tool, named AntiPlag, is
developed using the tri-gram sequence matching technique. Three sets of text
based assignments were tested by AntiPlag and the results were compared against
an existing commercial plagiarism detection tool. AntiPlag showed better
results in terms of false positives compared to the commercial tool due to the
pre-processing steps performed in AntiPlag. In addition, to improve the
detection latency, AntiPlag applies a data clustering technique making it four
times faster than the commercial tool considered. AntiPlag could be used to
easily isolate plagiarized text-based assignments from non-plagiarized ones.
We therefore present AntiPlag as a fast and effective tool for
plagiarism detection on text-based electronic assignments.
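The abstract does not describe AntiPlag's pre-processing or clustering steps, so the following is only a minimal sketch of tri-gram matching in general. It assumes word-level tri-grams, a simple lowercase tokenizer, and a containment score; the function names are hypothetical.

```python
import re


def word_trigrams(text):
    """Return the set of consecutive word triples in the lowercased text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(words[i:i + 3]) for i in range(len(words) - 2)}


def trigram_overlap(submission, source):
    """Fraction of the submission's tri-grams that also occur in the source."""
    sub = word_trigrams(submission)
    if not sub:
        return 0.0
    return len(sub & word_trigrams(source)) / len(sub)
```

A score near 1.0 means almost every three-word sequence in the submission also appears in the source, while unrelated texts score near 0.0. Any threshold for flagging a submission would have to be tuned on real assignments.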
Learning Word Representations with Hierarchical Sparse Coding
We propose a new method for learning word representations using hierarchical
regularization in sparse coding inspired by the linguistic study of word
meanings. We show an efficient learning algorithm based on stochastic proximal
methods that is significantly faster than previous approaches, making it
possible to perform hierarchical sparse coding on a corpus of billions of word
tokens. Experiments on various benchmark tasks---word similarity ranking,
analogies, sentence completion, and sentiment analysis---demonstrate that the
method outperforms or is competitive with state-of-the-art methods. Our word
representations are available at
\url{http://www.ark.cs.cmu.edu/dyogatam/wordvecs/}
A Batch Noise Contrastive Estimation Approach for Training Large Vocabulary Language Models
Training large vocabulary Neural Network Language Models (NNLMs) is a
difficult task due to the explicit requirement of the output layer
normalization, which typically involves the evaluation of the full softmax
function over the complete vocabulary. This paper proposes a Batch Noise
Contrastive Estimation (B-NCE) approach to alleviate this problem. This is
achieved by reducing the vocabulary, at each time step, to the target words in
the batch and then replacing the softmax by the noise contrastive estimation
approach, where these words play the role of targets and noise samples at the
same time. In doing so, the proposed approach can be fully formulated and
implemented using optimal dense matrix operations. Applying B-NCE to train
different NNLMs on the Large Text Compression Benchmark (LTCB) and the One
Billion Word Benchmark (OBWB) shows a significant reduction of the training
time with no noticeable degradation of the models' performance. This paper also
presents a new baseline comparative study of different standard NNLMs on the
large OBWB on a single Titan-X GPU. Comment: Accepted for publication at INTERSPEECH'1
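One plausible reading of the batch idea in this abstract is sketched below: each example's hidden state is scored only against the output embeddings of the batch's target words, with its own target treated as data and the other targets as noise. This is not the paper's formulation. The sketch assumes that the targets in a batch are distinct, that there are `k = batch - 1` noise samples per example, and that a noise distribution `noise_prob` is supplied by the caller; all names are hypothetical. Plain loops are used to keep the example dependency-free, whereas in practice the score matrix would be a single dense matrix product.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(1 / (1 + exp(-x)))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def batch_nce_loss(hidden, out_emb, targets, noise_prob):
    """Average NCE loss where batch targets double as noise samples.

    hidden:     list of B hidden vectors, one per example
    out_emb:    list of V output embeddings, indexed by word id
    targets:    list of B distinct target word ids (assumption)
    noise_prob: list of V noise probabilities (assumption)
    """
    batch = len(targets)
    k = batch - 1  # every other target in the batch acts as a noise sample
    total = 0.0
    for i in range(batch):
        for j in range(batch):
            word = targets[j]
            score = sum(h * w for h, w in zip(hidden[i], out_emb[word]))
            delta = score - math.log(k * noise_prob[word])
            # Diagonal entries are true targets, off-diagonal entries are noise.
            total += log_sigmoid(delta) if i == j else log_sigmoid(-delta)
    return -total / batch
```

Because only a B-by-B block of scores is computed, the cost per step depends on the batch size rather than on the full vocabulary size, which is the source of the training-time reduction the abstract reports.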