An Empirical Analysis of NMT-Derived Interlingual Embeddings and their Use in Parallel Sentence Identification
End-to-end neural machine translation has overtaken statistical machine
translation in terms of translation quality for some language pairs, especially
those with large amounts of parallel data. Besides this palpable improvement,
neural networks provide several new properties. A single system can be trained
to translate between many languages at almost no additional cost other than
training time. Furthermore, internal representations learned by the network
serve as a new semantic representation of words (or sentences) which, unlike
standard word embeddings, are learned in an essentially bilingual or even
multilingual context. In view of these properties, the contribution of the
present work is two-fold. First, we systematically study the NMT context
vectors, i.e. the output of the encoder, and their power as an interlingua
representation of a sentence. We assess their quality and effectiveness by
measuring similarities across translations, as well as semantically related and
semantically unrelated sentence pairs. Second, as an extrinsic evaluation of the
first point, we identify parallel sentences in comparable corpora, obtaining
F1 = 98.2% on data from a shared task when using only NMT context vectors. Using
context vectors jointly with similarity measures, F1 reaches 98.9%.
Comment: 11 pages, 4 figures
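The abstract above describes scoring sentence pairs by the similarity of their context vectors. A minimal sketch of that idea follows, assuming each sentence has already been reduced to one fixed-size vector. The use of cosine similarity, the function names, and the 0.8 threshold are illustrative assumptions, not details taken from the paper.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # A zero vector has no direction; treat it as unrelated.
        return 0.0
    return dot / (norm_u * norm_v)


def is_parallel(src_vec: Sequence[float], tgt_vec: Sequence[float],
                threshold: float = 0.8) -> bool:
    """Flag a sentence pair as a candidate translation.

    The threshold is a hypothetical value; in practice it would be
    tuned on held-out data.
    """
    return cosine_similarity(src_vec, tgt_vec) >= threshold
```

In this sketch the sentence vectors are plain lists of floats. How the encoder outputs are pooled into a single vector is left open, since the abstract does not specify it.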
Contextual Information Retrieval based on Algorithmic Information Theory and Statistical Outlier Detection
The main contribution of this paper is to design an Information Retrieval
(IR) technique based on Algorithmic Information Theory (using the Normalized
Compression Distance, NCD), statistical techniques (outliers), and a novel
organization of the database structure. The paper shows how they can be integrated
to retrieve information from generic databases using long (text-based) queries.
Two important problems are analyzed in the paper. On the one hand, how to
detect "false positives" when the distance among the documents is very low and
there is actual similarity. On the other hand, we propose a way to structure a
document database whose similarity distance estimation depends on the length
of the selected text. Finally, we present the experimental evaluations carried
out to study these problems.
Comment: Submitted to 2008 IEEE Information Theory Workshop (6 pages, 6 figures)
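The Normalized Compression Distance named in the abstract above is commonly defined as NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed size. A minimal sketch of that standard definition follows. Using zlib as the compressor and the helper names are assumptions for illustration; the abstract does not say which compressor the paper uses.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (level 9)."""
    return len(zlib.compress(data, 9))


def ncd(x: str, y: str) -> float:
    """Normalized Compression Distance between two texts.

    Values near 0 suggest the texts share most of their content;
    values near 1 suggest they share little.
    """
    bx, by = x.encode("utf-8"), y.encode("utf-8")
    cx, cy = compressed_size(bx), compressed_size(by)
    cxy = compressed_size(bx + by)
    return (cxy - min(cx, cy)) / max(cx, cy)


def rank_documents(query: str, documents: list[str]) -> list[str]:
    """Return documents ordered from closest to farthest from the query."""
    return sorted(documents, key=lambda doc: ncd(query, doc))
```

A real compressor only approximates the ideal one, so the scores depend on text length. That matches the abstract's point that distance estimation depends on the length of the selected text.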
The System Kato: Detecting Cases of Plagiarism for Answer-Set Programs
Plagiarism detection is a growing need among educational institutions and
solutions for different purposes exist. An important field in this direction is
detecting cases of source-code plagiarism. In this paper, we present the tool
Kato for supporting the detection of this kind of plagiarism in the area of
answer-set programming (ASP). Currently, the tool is implemented for DLV
programs but it is designed to handle other logic-programming dialects as well.
We review the basic features of Kato, introduce its theoretical underpinnings,
and discuss an application of Kato for plagiarism detection in the context of
courses on logic programming at the Vienna University of Technology.
- …