52 research outputs found
Recommended from our members
The Near-Synonymous Classifiers in Mandarin Chinese: Etymology, Modern Usage, And Possible Problems in L2 Classroom
Many Chinese classifiers are nearly synonymic – they can be used with the same head nouns without changing the meaning of the sentence, in other words, such classifiers can be used interchangeably or almost interchangeably. This poses a challenge for Chinese language learners, especially those who lack such a grammatical category in their own native language. Another complication arises from the ambiguous English translations of many classifiers.
In this paper we investigate the collocation behavior of near-synonymous Chinese classifiers, focusing on their semantic nuances and interchangeability. Analyzing 6 pairs of classifiers — 栋 and 幢, 匹 and 头, 批 and 派, 颗 and 粒, 辆 and 台, and 根 and 支— drawn from the HSK exam glossary, the dataset for this study encompasses 1200 samples (100 per each variable) and 416 distinct head nouns.
Through a corpus-based approach we analyze collocation behavior of each classifier on its own and as a part of the pair. The results showcase that not all pairs exhibit complete interchangeability. The collocation behavior of 批 and 派 differ significantly, where 批 primarily quantifies batches with a \u27first\u27 connotation, while 派 is used more in artistic expressions. The interchangeability of 栋 and 幢 varies with context. 幢 emerges as the least fre¬¬quent morpheme in the corpus, emphasizing its specific contextual usage. While both are used in address lines, 栋 predominantly quantifies standalone buildings, whereas 幢 is more aligned with larger architectural complexes. The analysis of 匹 and 头 highlights their distinctiveness, with 匹 counting horses and wolves and 头 being more versatile with various animals. 颗 and 粒 appear partially interchangeable, particularly with 珠-related head nouns and items associated with plants, fruits, and trees. The research also underscores that 辆 is primarily linked to car-related nouns, while 台 is used more versatile as a classifier for machines and electronic devices, including computers, printers, phones, cameras. 根 and 支 only overlap in the head noun 笔, and their roles diverge, with 根 being a versatile classifier and 支 also appearing as part of medical terms
Mixed-Language Arabic- English Information Retrieval
Includes abstract.Includes bibliographical references.This thesis attempts to address the problem of mixed querying in CLIR. It proposes mixed-language (language-aware) approaches in which mixed queries are used to retrieve most relevant documents, regardless of their languages. To achieve this goal, however, it is essential firstly to suppress the impact of most problems that are caused by the mixed-language feature in both queries and documents and which result in biasing the final ranked list. Therefore, a cross-lingual re-weighting model was developed. In this cross-lingual model, term frequency, document frequency and document length components in mixed queries are estimated and adjusted, regardless of languages, while at the same time the model considers the unique mixed-language features in queries and documents, such as co-occurring terms in two different languages. Furthermore, in mixed queries, non-technical terms (mostly those in non-English language) would likely overweight and skew the impact of those technical terms (mostly those in English) due to high document frequencies (and thus low weights) of the latter terms in their corresponding collection (mostly the English collection). Such phenomenon is caused by the dominance of the English language in scientific domains. Accordingly, this thesis also proposes reasonable re-weighted Inverse Document Frequency (IDF) so as to moderate the effect of overweighted terms in mixed queries
The Circle of Meaning: From Translation to Paraphrasing and Back
The preservation of meaning between inputs and outputs is perhaps
the most ambitious and, often, the most elusive goal of systems
that attempt to process natural language. Nowhere is this goal of
more obvious importance than for the tasks of machine translation
and paraphrase generation. Preserving meaning between the input and
the output is paramount for both, the monolingual vs bilingual distinction
notwithstanding. In this thesis, I present a novel, symbiotic relationship
between these two tasks that I term the "circle of meaning''.
Today's statistical machine translation (SMT) systems require high
quality human translations for parameter tuning, in addition to
large bi-texts for learning the translation units. This parameter
tuning usually involves generating translations at different points
in the parameter space and obtaining feedback against human-authored
reference translations as to how good the translations. This feedback
then dictates what point in the parameter space should be explored
next. To measure this feedback, it is generally considered wise to have
multiple (usually 4) reference translations to avoid unfair penalization of translation
hypotheses which could easily happen given the large number of ways in which
a sentence can be translated from one language to another. However, this reliance on multiple reference translations
creates a problem since they are labor intensive and expensive to obtain.
Therefore, most current MT datasets only contain a single reference.
This leads to the problem of reference sparsity---the primary open problem
that I address in this dissertation---one that has a serious effect on the
SMT parameter tuning process.
Bannard and Callison-Burch (2005) were the first to provide a practical
connection between phrase-based statistical machine translation and paraphrase
generation. However, their technique is restricted to generating phrasal
paraphrases. I build upon their approach and augment a phrasal paraphrase
extractor into a sentential paraphraser with extremely broad coverage.
The novelty in this augmentation lies in the further strengthening of
the connection between statistical machine translation and paraphrase
generation; whereas Bannard and Callison-Burch only relied on SMT machinery
to extract phrasal paraphrase rules and stopped there, I take it a few
steps further and build a full English-to-English SMT system. This system
can, as expected, ``translate'' any English input sentence into a new English
sentence with the same degree of meaning preservation that exists in a bilingual
SMT system. In fact, being a state-of-the-art SMT system, it is able to generate
n-best "translations" for any given input sentence. This sentential
paraphraser, built almost entirely from existing SMT machinery, represents
the first 180 degrees of the circle of meaning.
To complete the circle, I describe a novel connection in the other direction.
I claim that the sentential paraphraser, once built in this fashion, can
provide a solution to the reference sparsity problem and, hence, be used
to improve the performance a bilingual SMT system. I discuss two different
instantiations of the sentential paraphraser and show several results that
provide empirical validation for this connection
Deep latent-variable models for neural text generation
Text generation aims to produce human-like natural language output for down-stream tasks. It covers a wide range of applications like machine translation, document summarization, dialogue generation and so on. Recently deep neural network-based end-to-end architectures are known to be data-hungry, and text generated from them usually suffer from low diversity, interpretability and controllability. As a result, it is difficult to trust the output from them in real-life applications. Deep latent-variable models, by specifying the probabilistic distribution over an intermediate latent process, provide a potential way of addressing these problems while maintaining the expressive power of deep neural networks. This presentation will explain how deep latent-variable models can improve over the standard encoder-decoder model for text generation. We will start from an introduction of encoder-decoder and deep latent-variable models, then go over popular optimization strategies, and finally elaborate on how latent variable models can help improve the diversity, interpretability and data efficiency in different applications of text generation tasks.Textgenerierung zielt darauf ab, eine menschenähnliche Textausgabe in natürlicher Sprache für Anwendungen zu erzeugen. Es deckt eine breite Palette von Anwendungen ab, wie maschinelle Übersetzung, Zusammenfassung von Dokumenten, Generierung von Dialogen usw. In letzter Zeit werden dafür hauptsächlich Endto- End-Architekturen auf der Basis von tiefen neuronalen Netzwerken verwendet. Der End-to-End-Ansatz fasst alle Submodule, die früher nach komplexen handgefertigten Regeln entworfen wurden, zu einer ganzheitlichen Codierungs- Decodierungs-Architektur zusammen. Bei ausreichenden Trainingsdaten kann eine Leistung auf dem neuesten Stand der Technik erzielt werden, ohne dass sprach- und domänenabhängiges Wissen erforderlich ist. Deep-Learning-Modelle sind jedoch als extrem datenhungrig bekannt und daraus generierter Text leidet normalerweise unter geringer Diversität, Interpretierbarkeit und Kontrollierbarkeit. Infolgedessen ist es schwierig, der Ausgabe von ihnen in realen Anwendungen zu vertrauen. Tiefe Modelle mit latenten Variablen bieten durch Angabe der Wahrscheinlichkeitsverteilung über einen latenten Zwischenprozess eine potenzielle Möglichkeit, diese Probleme zu lösen und gleichzeitig die Ausdruckskraft tiefer neuronaler Netze zu erhalten. Diese Dissertation zeigt, wie tiefe Modelle mit latenten Variablen Texterzeugung verbessern gegenüber dem üblichen Encoder-Decoder-Modell. Wir beginnen mit einer Einführung in Encoder-Decoder- und Deep Latent Variable-Modelle und gehen dann auf gängige Optimierungsstrategien wie Variationsinferenz, dynamische Programmierung, Soft Relaxation und Reinforcement Learning ein. Danach präsentieren wir Folgendes: 1. Wie latente Variablen Vielfalt der Texterzeugung verbessern können, indem ganzheitliche, latente Darstellungen auf Satzebene gelernt werden. Auf diese Weise kann zunächst eine latente Darstellung ausgewählt werden, aus der verschiedene Texte generiert werden können. Wir präsentieren effektive Algorithmen, um gleichzeitig das Lernen der Repräsentation und die Texterzeugung durch Variationsinferenz zu trainieren. Um die Einschränkungen der Variationsinferenz bezüglich Uni-Modalität und Inkonsistenz anzugehen, schlagen wir eine Wake-Sleep-Variation und ein auf Transinformation basierendes Trainingsziel vor. Experimente zeigen, dass sie sowohl die übliche Variationsinferenz als auch nicht-latente Variablenmodelle bei der Dialoggenerierung übertreffen. 2. Wie latente Variablen die Steuerbarkeit und Interpretierbarkeit der Texterzeugung verbessern können, indem feinkörnigere latente Spezifikationen zum Zwischengenerierungsprozess hinzugefügt werden. Wir veranschaulichen die Verwendung latenter Variablen für Wortausrichtung, Inhaltsauswahl, Textsegmentierung und Feldsegmentkorrespondenz. Wir leiten für sie effiziente Trainingsalgorithmen ab, damit die Texterzeugung explizit gesteuert werden kann, indem die latente Variable, die durch ihre Definition vom Menschen interpretiert werden kann, manipuliert wird. 3. Überwindung der Seltenheit von Trainingsmustern durch Behandlung von nicht parallelem Text als latente Variablen. Das Training kann wie beim Standard-EM-Algorithmus durchgeführt werden, der stabil konvergiert. Wir zeigen, dass es bei der Dialoggenerierung erfolgreich angewendet werden kann und den Generierungsraum durch die Verwendung von nicht-konversativem Text erheblich bereichert
Getting Past the Language Gap: Innovations in Machine Translation
In this chapter, we will be reviewing state of the art machine translation systems, and will discuss innovative methods for machine translation, highlighting the most promising techniques and applications. Machine translation (MT) has benefited from a revitalization in the last 10 years or so, after a period of relatively slow activity. In 2005 the field received a jumpstart when a powerful complete experimental package for building MT systems from scratch became freely available as a result of the unified efforts of the MOSES international consortium. Around the same time, hierarchical methods had been introduced by Chinese researchers, which allowed the introduction and use of syntactic information in translation modeling. Furthermore, the advances in the related field of computational linguistics, making off-the-shelf taggers and parsers readily available, helped give MT an additional boost. Yet there is still more progress to be made. For example, MT will be enhanced greatly when both syntax and semantics are on board: this still presents a major challenge though many advanced research groups are currently pursuing ways to meet this challenge head-on. The next generation of MT will consist of a collection of hybrid systems. It also augurs well for the mobile environment, as we look forward to more advanced and improved technologies that enable the working of Speech-To-Speech machine translation on hand-held devices, i.e. speech recognition and speech synthesis. We review all of these developments and point out in the final section some of the most promising research avenues for the future of MT
Word sense disambiguation : Scaling up, domain adaptation and application to machine translation
Ph.DDOCTOR OF PHILOSOPH
- …