Sequence to Sequence Mixture Model for Diverse Machine Translation
Sequence to sequence (SEQ2SEQ) models often lack diversity in their generated
translations. This can be attributed to their limited ability to capture the
lexical and syntactic variations in a parallel corpus that arise from different
styles, genres, topics, or the inherent ambiguity of the translation process. In
this paper, we develop a novel sequence to sequence mixture (S2SMIX) model that
improves both translation diversity and quality by adopting a committee of
specialized translation models rather than a single translation model. Each
mixture component selects its own training dataset via optimization of the
marginal log-likelihood, which leads to a soft clustering of the parallel
corpus. Experiments on four language pairs demonstrate the superiority of our
mixture model compared to a SEQ2SEQ baseline with standard or diversity-boosted
beam search. Our mixture model uses negligible additional parameters and incurs
no extra computation cost during decoding.
Comment: 11 pages, 5 figures, accepted to CoNLL 2018
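
A minimal sketch of the mixture idea, assuming per-component log-probabilities
log p(y|x,k) have already been computed by K specialized translation models;
the function names and the uniform prior are illustrative assumptions, not the
paper's exact formulation.

    import math

    def mixture_marginal_log_likelihood(component_log_probs, log_prior=None):
        """Marginal log-likelihood of one sentence pair under a mixture of
        K components: log sum_k p(k) * p(y | x, k)."""
        k = len(component_log_probs)
        if log_prior is None:
            log_prior = [-math.log(k)] * k  # uniform mixing prior (assumed)
        terms = [lp + pr for lp, pr in zip(component_log_probs, log_prior)]
        m = max(terms)  # log-sum-exp trick for numerical stability
        return m + math.log(sum(math.exp(t - m) for t in terms))

    def responsibilities(component_log_probs, log_prior=None):
        """Posterior p(k | x, y): the soft assignment of this sentence pair
        to each component, i.e. the soft clustering of the corpus."""
        total = mixture_marginal_log_likelihood(component_log_probs, log_prior)
        k = len(component_log_probs)
        if log_prior is None:
            log_prior = [-math.log(k)] * k
        return [math.exp(lp + pr - total)
                for lp, pr in zip(component_log_probs, log_prior)]

Optimizing the marginal log-likelihood raises the responsibility of whichever
component already explains a sentence pair best, which is what pulls each
component toward its own slice of the corpus.
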
Word Representation Models for Morphologically Rich Languages in Neural Machine Translation
Dealing with the complex word forms in morphologically rich languages is an
open problem in language processing, and is particularly important in
translation. In contrast to most modern neural translation systems, which
discard the identity of rare words, in this paper we propose several
architectures for learning word representations from character and morpheme
level word decompositions. We incorporate these representations in a novel
machine translation model which jointly learns word alignments and translations
via a hard attention mechanism. Evaluating on translating from several
morphologically rich languages into English, we show consistent improvements
over strong baseline methods of between 1 and 1.5 BLEU points.
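
A toy sketch of the compositional idea, assuming additive composition over
morpheme pieces; the paper's actual architectures learn character- and
morpheme-level compositions jointly with the translation model, so the names,
dimension, and additive combiner below are all illustrative stand-ins.

    import numpy as np

    def compose_word_embedding(morphemes, morph_emb, dim=64, rng=None):
        """Build a word vector from its morpheme decomposition by addition,
        a simple stand-in for the learned composition architectures."""
        if rng is None:
            rng = np.random.default_rng(0)
        vecs = []
        for m in morphemes:
            if m not in morph_emb:
                # Rare words still get a meaningful identity via their
                # pieces, instead of collapsing to a single <unk> token.
                morph_emb[m] = rng.normal(scale=0.1, size=dim)
            vecs.append(morph_emb[m])
        return np.sum(vecs, axis=0)

    emb_table = {}
    word_vec = compose_word_embedding(["un", "translat", "able"], emb_table)
    print(word_vec.shape)  # -> (64,)
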
Generative Models are Self-Watermarked: Declaring Model Authentication through Re-Generation
As machine- and AI-generated content proliferates, protecting the
intellectual property of generative models has become imperative, yet verifying
data ownership poses formidable challenges, particularly in cases of
unauthorized reuse of generated data. This challenge is further amplified by
the use of Machine Learning as a Service (MLaaS), which often functions as a
black-box system.
Our work is dedicated to detecting data reuse even from an individual sample.
Traditionally, watermarking has been leveraged to detect AI-generated content.
However, unlike watermarking techniques that embed additional information as
triggers into models or generated content, potentially compromising output
quality, our approach identifies latent fingerprints inherently present within
the outputs through re-generation. We propose an explainable verification
procedure that attributes data ownership through re-generation, and further
amplifies these fingerprints in the generative models through iterative data
re-generation. This methodology is theoretically grounded and demonstrates
viability and robustness using recent advanced text and image generative
models. Our methodology is significant as it goes beyond protecting the
intellectual property of APIs and addresses important issues such as the spread
of misinformation and academic misconduct. It provides a useful tool to ensure
the integrity of sources and authorship, expanding its application in different
scenarios where authenticity and ownership verification are essential.
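
A rough sketch of verification by re-generation for text, assuming a black-box
regenerate callable that wraps the model in question; the similarity measure
and threshold below are illustrative placeholders, not the paper's procedure.

    import difflib

    def regeneration_score(sample, regenerate, rounds=1):
        """Re-generate a sample and measure how much of it survives;
        outputs of a model tend to lie near fixed points of that same
        model, so high similarity is evidence the model produced it."""
        current = sample
        for _ in range(rounds):
            current = regenerate(current)  # assumed black-box model call
        return difflib.SequenceMatcher(None, sample, current).ratio()

    def attribute(sample, regenerate, threshold=0.9):
        """Declare ownership if the sample survives re-generation nearly
        unchanged (threshold is illustrative)."""
        return regeneration_score(sample, regenerate) >= threshold

Raising rounds mirrors the paper's iterative re-generation, which amplifies the
latent fingerprint rather than embedding an external trigger.
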
Use of LLMs for Illicit Purposes: Threats, Prevention Measures, and Vulnerabilities
Spurred by the recent rapid increase in the development and distribution of
large language models (LLMs) across industry and academia, much recent work has
drawn attention to safety- and security-related threats and vulnerabilities of
LLMs, including in the context of potentially criminal activities.
Specifically, it has been shown that LLMs can be misused for fraud,
impersonation, and the generation of malware, while other authors have
considered the more general problem of AI alignment. It is important that
developers and practitioners alike are aware of security-related problems with
such models. In this paper, we provide an overview of existing, predominantly
scientific, efforts on identifying and mitigating threats and vulnerabilities
arising from LLMs. We present a taxonomy describing the relationship between
threats caused by the generative capabilities of LLMs, prevention measures
intended to address such threats, and vulnerabilities arising from imperfect
prevention measures. With our work, we hope to raise awareness of the
limitations of LLMs in light of such security concerns, among both experienced
developers and novice users of such technologies.
Comment: Pre-print
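
The three-way relationship the taxonomy describes can be pictured as a small
typed schema; the concrete names below are illustrative examples drawn from
the abstract, not the paper's actual taxonomy entries.

    from dataclasses import dataclass, field

    @dataclass
    class Threat:
        """Enabled by LLM generative capabilities (e.g. fraud, malware)."""
        name: str

    @dataclass
    class Vulnerability:
        """Arises from an imperfect prevention measure."""
        name: str

    @dataclass
    class PreventionMeasure:
        """Addresses threats; its imperfections yield vulnerabilities."""
        name: str
        addresses: list = field(default_factory=list)
        gives_rise_to: list = field(default_factory=list)

    measure = PreventionMeasure(
        "safety fine-tuning",  # illustrative example, not from the paper
        addresses=[Threat("malware generation")],
        gives_rise_to=[Vulnerability("jailbreak prompts")])
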
Koala: An Index for Quantifying Overlaps with Pre-training Corpora
In recent years, increasing attention has been paid to probing the role of
pre-training data in the downstream behaviour of Large Language Models (LLMs).
Despite its importance, there is no public tool that supports such analysis of
pre-training corpora at large scale. To help research in this space, we launch
Koala, a searchable index over large pre-training corpora using compressed
suffix arrays, with a highly efficient compression rate and fast search
support. In its first release we index the public portion of the OPT 175B
pre-training data. Koala provides a framework for forensic analysis of current
and future benchmarks, as well as for assessing the degree of memorization in
the output of LLMs. Koala is available for public use at
https://koala-index.erc.monash.edu/.
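
To make the lookup concrete, here is a toy suffix-array search of the kind an
overlap probe performs; Koala itself uses compressed suffix arrays to make
this feasible at corpus scale, so the plain-list construction below is purely
illustrative.

    import bisect

    def build_suffix_array(text):
        """Plain (uncompressed) suffix array: suffix start positions
        sorted lexicographically by the suffix they begin."""
        return sorted(range(len(text)), key=lambda i: text[i:])

    def count_occurrences(text, sa, query):
        """Every suffix starting with `query` sits in one contiguous run
        of the suffix array; binary search finds that run's boundaries."""
        prefixes = [text[i:i + len(query)] for i in sa]  # still sorted
        lo = bisect.bisect_left(prefixes, query)
        hi = bisect.bisect_right(prefixes, query)
        return hi - lo

    corpus = "the cat sat on the mat"
    sa = build_suffix_array(corpus)
    print(count_occurrences(corpus, sa, "the"))  # -> 2

The same boundary search returns occurrence counts for any query n-gram, which
is the primitive both benchmark-contamination forensics and memorization
checks need.
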
Can Knowledge Graphs Simplify Text?
Knowledge Graph (KG)-to-Text Generation has seen recent improvements in
generating fluent and informative sentences that describe a given KG. KGs are
widespread across multiple domains and contain important entity-relation
information, and text simplification aims to reduce the complexity of a text
while preserving its meaning. We therefore propose KGSimple, a novel approach
to unsupervised text simplification that infuses KG-established techniques to
construct a simplified KG path and generate a concise text preserving the
original input's meaning. Through an iterative, sampling-based KG-first
approach, our model is capable of simplifying text when
starting from a KG by learning to keep important information while harnessing
KG-to-text generation to output fluent and descriptive sentences. We evaluate
various settings of the KGSimple model on currently-available KG-to-text
datasets, demonstrating its effectiveness compared to unsupervised text
simplification models which start with a given complex text. Our code is
available on GitHub.
Comment: Accepted as a Main Conference Long Paper at CIKM 2023
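
A toy sketch of a KG-first simplification loop, assuming stand-in `importance`
and `verbalize` callables in place of the paper's learned scoring and
KG-to-text generation components; the drop rule and probability below are
illustrative.

    import random

    def simplify_kg(triples, importance, verbalize, min_triples=2, seed=0):
        """Iteratively consider dropping the least important triple,
        sometimes stopping early (the sampling element), then verbalize
        the reduced KG path into concise text."""
        rng = random.Random(seed)
        kept = list(triples)
        while len(kept) > min_triples:
            candidate = min(kept, key=importance)
            if rng.random() < 0.5:
                break  # sampled decision: keep the current path as-is
            kept.remove(candidate)
        return verbalize(kept)

    triples = [("aspirin", "treats", "headache"),
               ("aspirin", "subclass_of", "NSAID"),
               ("NSAID", "studied_by", "pharmacology")]
    text = simplify_kg(
        triples,
        importance=lambda t: {"treats": 2, "subclass_of": 1}.get(t[1], 0),
        verbalize=lambda ts: ". ".join(" ".join(t) for t in ts) + ".")
    print(text)  # -> "aspirin treats headache. aspirin subclass_of NSAID."
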