Joint Dropout: Improving Generalizability in Low-Resource Neural Machine Translation through Phrase Pair Variables
Despite the tremendous success of Neural Machine Translation (NMT), its
performance on low-resource language pairs remains subpar, partly due to a
limited ability to handle previously unseen inputs, i.e., generalization.
In this paper, we propose a method called Joint Dropout, which addresses the
challenge of low-resource neural machine translation by substituting phrases
with variables, resulting in a significant enhancement of compositionality, a
key aspect of generalization. We observe a substantial improvement in
translation quality for language pairs with minimal resources, as measured by
BLEU and Direct Assessment scores. Furthermore, we conduct an error analysis
and find that Joint Dropout also enhances the generalizability of low-resource
NMT in terms of robustness and adaptability across different domains.
Comment: Accepted at MT Summit 2023
Production networks in the cultural and creative sector: case studies from the publishing industry
The CICERONE project investigates cultural and creative industries through case study research, with a focus on production networks. This report, part of WP2, examines the publishing industry within this framework. It aims to understand the industry’s hidden aspects, address statistical issues in measurement, and explore the industry’s transformation and integration of cultural and economic values. The report provides an overview of the production network, explores statistical challenges, and presents qualitative analyses of two case studies. It concludes by highlighting the potential of the Global Production Network (GPN) approach for analyzing, researching, policymaking, and intervening in the European publishing network.
Writing Facts
»Fact« is one of the most crucial inventions of modern times. Susanne Knaller discusses the functions of this powerful notion in the arts and the sciences, and its impact on aesthetic models and systems of knowledge. The practice of writing provides an effective procedure to realize and to understand facts. This concerns preparatory procedures, formal choices, models of argumentation, and narrative patterns. By considering »writing facts«, the volume shows why and how »facts« are a result of knowledge, rules, and norms as well as of description, argumentation, and narration. This approach allows new perspectives on »fact« and its impact on modernity.
Bilex Rx: Lexical Data Augmentation for Massively Multilingual Machine Translation
Neural machine translation (NMT) has progressed rapidly over the past several
years, and modern models are able to achieve relatively high quality using only
monolingual text data, an approach dubbed Unsupervised Machine Translation
(UNMT). However, these models still struggle in a variety of ways, including
aspects of translation that are easiest for humans, for instance
correctly translating common nouns. This work explores a cheap and abundant
resource to combat this problem: bilingual lexica. We test the efficacy of
bilingual lexica in a real-world set-up, on 200-language translation models
trained on web-crawled text. We present several findings: (1) using lexical
data augmentation, we demonstrate sizable performance gains for unsupervised
translation; (2) we compare several families of data augmentation,
demonstrating that they yield similar improvements, and can be combined for
even greater improvements; (3) we demonstrate the importance of carefully
curated lexica over larger, noisier ones, especially with larger models; and
(4) we compare the efficacy of multilingual lexicon data versus
human-translated parallel data. Finally, we open-source GATITOS (available at
https://github.com/google-research/url-nlp/tree/main/gatitos), a new
multilingual lexicon for 26 low-resource languages, which had the highest
performance among the lexica in our experiments.
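One common family of lexical data augmentation the abstract alludes to is codeswitched substitution: replacing a source-side word with one of its lexicon translations so the model sees lexicon entries in context. The sketch below is a hedged illustration of that general idea, not the paper's pipeline; the dictionary format (word mapped to a list of translations, loosely modeled on a GATITOS-style lexicon) and the substitution rate `p` are assumptions.

```python
import random

def lexical_augment(sentence, lexicon, p=0.2, rng=None):
    """Codeswitch-style augmentation of one source sentence.

    With probability p, swap each token that appears in the bilingual
    lexicon for one of its listed translations; leave other tokens as-is.
    lexicon: dict mapping source words to lists of translations
    (a hypothetical GATITOS-like resource for this sketch).
    """
    rng = rng or random.Random(0)
    out = []
    for tok in sentence.split():
        if tok in lexicon and rng.random() < p:
            out.append(rng.choice(lexicon[tok]))  # substitute a translation
        else:
            out.append(tok)  # keep the original token
    return " ".join(out)
```

Augmented sentences like "the gato sees the perro" would then be mixed into the web-crawled training data; the abstract's finding (3) suggests that the quality of the lexicon used here matters more than its size.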
Development of linguistic linked open data resources for collaborative data-intensive research in the language sciences
Making diverse data in linguistics and the language sciences open, distributed, and accessible: perspectives from language acquisition researchers and technical LOD (linked open data) researchers. This volume examines the challenges inherent in making diverse data in linguistics and the language sciences open, distributed, integrated, and accessible, thus fostering wide data sharing and collaboration. It is unique in integrating the perspectives of language researchers and technical LOD (linked open data) researchers. Reporting on both active research needs in the field of language acquisition and technical advances in the development of data interoperability, the book demonstrates the advantages of an international infrastructure for scholarship in the field of language sciences. With contributions by researchers who produce complex data content and scholars involved in both the technology and the conceptual foundations of LLOD (linguistics linked open data), the book focuses on the area of language acquisition because it involves complex and diverse data sets, cross-linguistic analyses, and urgent collaborative research. The contributors discuss a variety of research methods, resources, and infrastructures. Contributors: Isabelle Barrière, Nan Bernstein Ratner, Steven Bird, Maria Blume, Ted Caldwell, Christian Chiarcos, Cristina Dye, Suzanne Flynn, Claire Foley, Nancy Ide, Carissa Kang, D. Terence Langendoen, Barbara Lust, Brian MacWhinney, Jonathan Masci, Steven Moran, Antonio Pareja-Lora, Jim Reidy, Oya Y. Rieger, Gary F. Simons, Thorsten Trippel, Kara Warburton, Sue Ellen Wright, Claus Zinn.
Learning disentangled speech representations
A variety of informational factors are contained within the speech signal, and a single short recording of speech reveals much more than the spoken words. The best method to extract and represent informational factors from the speech signal ultimately depends on which informational factors are desired and how they will be used. In addition, some methods capture more than one informational factor at the same time, such as speaker identity, spoken content, and speaker prosody.
The goal of this dissertation is to explore different ways to deconstruct the speech signal into abstract representations that can be learned and later reused in various speech technology tasks. This task of deconstructing, also known as disentanglement, is a form of distributed representation learning. As a general approach to disentanglement, there are some guiding principles that elaborate what a learned representation should contain as well as how it should function. In particular, learned representations should contain all of the requisite information in a more compact manner, be interpretable, remove nuisance factors of irrelevant information, be useful in downstream tasks, and be independent of the task at hand. The learned representations should also be able to answer counter-factual questions.
In some cases, learned speech representations can be re-assembled in different ways according to the requirements of downstream applications. For example, in a voice conversion task, the speech content is retained while the speaker identity is changed. And in a content-privacy task, some targeted content may be concealed without affecting how surrounding words sound. While there is no single-best method to disentangle all types of factors, some end-to-end approaches demonstrate a promising degree of generalization to diverse speech tasks.
This thesis explores a variety of use-cases for disentangled representations, including phone recognition, speaker diarization, linguistic code-switching, voice conversion, and content-based privacy masking. Speech representations can also be utilised for automatically assessing the quality and authenticity of speech, such as automatic MOS ratings or detecting deep fakes. The meaning of the term "disentanglement" is not well defined in previous work, and it has acquired several meanings depending on the domain (e.g. image vs. speech). Sometimes the term "disentanglement" is used interchangeably with the term "factorization". This thesis proposes that disentanglement of speech is distinct, and offers a viewpoint of disentanglement that can be considered both theoretically and practically.
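The voice-conversion recipe described above, keeping the content of one utterance while swapping in another speaker's identity, can be expressed as recombining disentangled factors. The sketch below is purely schematic: the encoder and decoder are placeholder callables standing in for learned models, not any specific architecture from the thesis.

```python
def convert_voice(content_enc, speaker_enc, decoder, src_utt, tgt_utt):
    """Voice conversion by recombining disentangled factors (schematic).

    content_enc, speaker_enc, decoder: placeholder callables standing in
    for learned models that map an utterance to a content code, an
    utterance to a speaker code, and a combined code back to speech.
    """
    content = content_enc(src_utt)  # "what is said" from the source
    speaker = speaker_enc(tgt_utt)  # "who says it" from the target
    return decoder(content + speaker)  # recombine and decode
```

The same recombination view covers the content-privacy task in the abstract: there, a targeted piece of the content code would be masked while the speaker code and the surrounding content pass through unchanged.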
Publishing Sacrobosco’s De sphaera in Early Modern Europe
This open access volume focuses on the cultural background of the pivotal transformations of scientific knowledge in the early modern period. It investigates the rich edition history of Johannes de Sacrobosco’s Tractatus de sphaera, by far the most widely disseminated textbook on geocentric cosmology, from the unique standpoint of the many printers, publishers, and booksellers who steered this text from manuscript to print culture, and in doing so transformed it into an established platform of scientific learning. The corpus, constituted of 359 different editions featuring Sacrobosco’s treatise on cosmology and astronomy printed between 1472 and 1650, represents the shared European scientific knowledge concerned with the cosmological worldview of the early modern period until well after the publication of Copernicus’ De revolutionibus orbium coelestium in 1543. The contributions to this volume show how the academic book trade influenced the process of homogenization of scientific knowledge. They also describe the material infrastructure through which such knowledge was disseminated, and thus define the premises for the foundation of modern scientific communities.