Semi-automatic acquisition of domain-specific semantic structures.
Siu, Kai-Chung. Thesis (M.Phil.)--Chinese University of Hong Kong, 2000. Includes bibliographical references (leaves 99-106). Abstracts in English and Chinese.
Chapter 1 --- Introduction --- p.1
Chapter 1.1 --- Thesis Outline --- p.5
Chapter 2 --- Background --- p.6
Chapter 2.1 --- Natural Language Understanding --- p.6
Chapter 2.1.1 --- Rule-based Approaches --- p.7
Chapter 2.1.2 --- Stochastic Approaches --- p.8
Chapter 2.1.3 --- Phrase-Spotting Approaches --- p.9
Chapter 2.2 --- Grammar Induction --- p.10
Chapter 2.2.1 --- Semantic Classification Trees --- p.11
Chapter 2.2.2 --- Simulated Annealing --- p.12
Chapter 2.2.3 --- Bayesian Grammar Induction --- p.12
Chapter 2.2.4 --- Statistical Grammar Induction --- p.13
Chapter 2.3 --- Machine Translation --- p.14
Chapter 2.3.1 --- Rule-based Approach --- p.15
Chapter 2.3.2 --- Statistical Approach --- p.15
Chapter 2.3.3 --- Example-based Approach --- p.16
Chapter 2.3.4 --- Knowledge-based Approach --- p.16
Chapter 2.3.5 --- Evaluation Method --- p.19
Chapter 3 --- Semi-Automatic Grammar Induction --- p.20
Chapter 3.1 --- Agglomerative Clustering --- p.20
Chapter 3.1.1 --- Spatial Clustering --- p.21
Chapter 3.1.2 --- Temporal Clustering --- p.24
Chapter 3.1.3 --- Free Parameters --- p.26
Chapter 3.2 --- Post-processing --- p.27
Chapter 3.3 --- Chapter Summary --- p.29
Chapter 4 --- Application to the ATIS Domain --- p.30
Chapter 4.1 --- The ATIS Domain --- p.30
Chapter 4.2 --- Parameters Selection --- p.32
Chapter 4.3 --- Unsupervised Grammar Induction --- p.35
Chapter 4.4 --- Prior Knowledge Injection --- p.40
Chapter 4.5 --- Evaluation --- p.43
Chapter 4.5.1 --- Parse Coverage in Understanding --- p.45
Chapter 4.5.2 --- Parse Errors --- p.46
Chapter 4.5.3 --- Analysis --- p.47
Chapter 4.6 --- Chapter Summary --- p.49
Chapter 5 --- Portability to Chinese --- p.50
Chapter 5.1 --- Corpus Preparation --- p.50
Chapter 5.1.1 --- Tokenization --- p.51
Chapter 5.2 --- Experiments --- p.52
Chapter 5.2.1 --- Unsupervised Grammar Induction --- p.52
Chapter 5.2.2 --- Prior Knowledge Injection --- p.56
Chapter 5.3 --- Evaluation --- p.58
Chapter 5.3.1 --- Parse Coverage in Understanding --- p.59
Chapter 5.3.2 --- Parse Errors --- p.60
Chapter 5.4 --- Grammar Comparison Across Languages --- p.60
Chapter 5.5 --- Chapter Summary --- p.64
Chapter 6 --- Bi-directional Machine Translation --- p.65
Chapter 6.1 --- Bilingual Dictionary --- p.67
Chapter 6.2 --- Concept Alignments --- p.68
Chapter 6.3 --- Translation Procedures --- p.73
Chapter 6.3.1 --- The Matching Process --- p.74
Chapter 6.3.2 --- The Searching Process --- p.76
Chapter 6.3.3 --- Heuristics to Aid Translation --- p.81
Chapter 6.4 --- Evaluation --- p.82
Chapter 6.4.1 --- Coverage --- p.83
Chapter 6.4.2 --- Performance --- p.86
Chapter 6.5 --- Chapter Summary --- p.89
Chapter 7 --- Conclusions --- p.90
Chapter 7.1 --- Summary --- p.90
Chapter 7.2 --- Future Work --- p.92
Chapter 7.2.1 --- Suggested Improvements on Grammar Induction Process --- p.92
Chapter 7.2.2 --- Suggested Improvements on Bi-directional Machine Translation --- p.96
Chapter 7.2.3 --- Domain Portability --- p.97
Chapter 7.3 --- Contributions --- p.97
Bibliography --- p.99
Chapter A --- Original SQL Queries --- p.107
Chapter B --- Induced Grammar --- p.109
Chapter C --- Seeded Categories --- p.11
A Novel and Robust Approach for Pro-Drop Language Translation
A significant challenge for machine translation (MT) is the phenomenon of dropped pronouns (DPs), where certain classes of pronouns are frequently dropped in the source language but should be retained in the target language. In response to this common problem, we propose a semi-supervised approach with a universal framework to recall missing pronouns in translation. Firstly, we build training data for DP generation in which the DPs are automatically labelled according to the alignment information from a parallel corpus. Secondly, we build a deep learning-based DP generator for input sentences in decoding when no corresponding references exist. More specifically, the generation has two phases: (1) DP position detection, which is modeled as a sequential labelling task with recurrent neural networks; and (2) DP prediction, which employs a multilayer perceptron with rich features. Finally, we integrate the above outputs into our statistical MT (SMT) system to recall missing pronouns, both by extracting rules from the DP-labelled training data and by translating the DP-generated input sentences. To validate the robustness of our approach, we investigate it on both Chinese–English and Japanese–English corpora extracted from movie subtitles. Compared with an SMT baseline system, experimental results show that our approach achieves a significant improvement of +1.58 BLEU points in translation performance, with a 66% F-score for DP generation accuracy, for Chinese–English, and nearly +1 BLEU point, with a 58% F-score, for Japanese–English. We believe that this work could help both MT researchers and industry to boost the performance of MT systems between pro-drop and non-pro-drop languages.
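The first step above, automatically labelling DPs from word alignments, can be sketched in a few lines: a target-side pronoun that aligns to no source word is treated as dropped, and a placeholder is projected back into the source near its nearest aligned neighbour. This is an illustrative reconstruction, not the paper's implementation; the function name and the `<DP:…>` placeholder format are invented for the example.

```python
def label_dropped_pronouns(src_tokens, tgt_tokens, alignments, pronouns):
    # alignments: set of (src_idx, tgt_idx) links from a word aligner;
    # pronouns: target-side pronoun forms, lower-cased.
    aligned_tgt = {t for _, t in alignments}
    labelled = list(src_tokens)
    for t_idx, tok in enumerate(tgt_tokens):
        # A target pronoun with no source counterpart is a dropped pronoun.
        if tok.lower() in pronouns and t_idx not in aligned_tgt:
            # Project a placeholder at the source position of the
            # closest aligned target neighbour.
            links = sorted(alignments, key=lambda a: abs(a[1] - t_idx))
            s_pos = links[0][0] if links else 0
            labelled.insert(s_pos, "<DP:%s>" % tok)
    return labelled
```

Running this on a Chinese sentence with a dropped subject pronoun inserts `<DP:I>` before the verb, yielding training data for the position-detection phase.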
PersoNER: Persian named-entity recognition
Named-Entity Recognition (NER) is still a challenging task for languages with low digital resources. The main difficulties arise from the scarcity of annotated corpora and the consequent problematic training of an effective NER pipeline. To bridge this gap, in this paper we target the Persian language, which is spoken by a population of over a hundred million people worldwide. We first present and provide ArmanPersoNERCorpus, the first manually-annotated Persian NER corpus. Then, we introduce PersoNER, an NER pipeline for Persian that leverages a word embedding and a sequential max-margin classifier. The experimental results show that the proposed approach is capable of achieving interesting MUC7 and CoNLL scores while outperforming two alternatives based on a CRF and a recurrent neural network.
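The max-margin classification at the core of such a pipeline can be illustrated with a toy margin perceptron over sparse token features. This is a hedged stand-in, not the paper's model: the feature names, labels, and hyperparameters below are all invented for the example.

```python
import random

def train_max_margin(examples, labels, epochs=10, margin=1.0, seed=0):
    # One {feature: weight} dict per label.
    w = {y: {} for y in labels}
    rng = random.Random(seed)
    data = list(examples)
    for _ in range(epochs):
        rng.shuffle(data)
        for feats, gold in data:
            def score(y):
                return sum(w[y].get(f, 0.0) for f in feats)
            # Highest-scoring incorrect label.
            wrong = max((y for y in labels if y != gold), key=score)
            # Update only when the gold label fails to win by the margin.
            if score(gold) - score(wrong) < margin:
                for f in feats:
                    w[gold][f] = w[gold].get(f, 0.0) + 1.0
                    w[wrong][f] = w[wrong].get(f, 0.0) - 1.0
    return w

def predict(w, feats):
    return max(w, key=lambda y: sum(w[y].get(f, 0.0) for f in feats))
```

In the real pipeline the features would come from word embeddings and context windows rather than hand-written indicators, but the margin-driven update is the same idea.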
Frequency vs. Association for Constraint Selection in Usage-Based Construction Grammar
A usage-based Construction Grammar (CxG) posits that slot-constraints
generalize from common exemplar constructions. But what is the best model of
constraint generalization? This paper evaluates competing frequency-based and
association-based models across eight languages using a metric derived from the
Minimum Description Length paradigm. The experiments show that
association-based models produce better generalizations across all languages by
a significant margin.
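The contrast the abstract draws can be made concrete with a toy comparison: rank candidate slot-fillers once by raw co-occurrence frequency and once by an association measure. Here ΔP serves as an illustrative association measure common in usage-based work; the paper's actual models and its MDL-derived metric are not reproduced.

```python
from collections import Counter

def delta_p(pair_count, slot_count, filler_count, total):
    # Directional ΔP: P(filler | slot) - P(filler | not slot).
    return (pair_count / slot_count
            - (filler_count - pair_count) / (total - slot_count))

def rank_fillers(pairs, slot):
    # pairs: (slot, filler) observations from a toy corpus.
    pair_c = Counter(pairs)
    slot_c = Counter(s for s, _ in pairs)
    filler_c = Counter(f for _, f in pairs)
    total = len(pairs)
    cands = {f for s, f in pairs if s == slot}
    by_freq = sorted(cands, key=lambda f: -pair_c[(slot, f)])
    by_assoc = sorted(cands,
                      key=lambda f: -delta_p(pair_c[(slot, f)],
                                             slot_c[slot], filler_c[f], total))
    return by_freq, by_assoc
```

A filler that is frequent everywhere can top the frequency ranking while the association ranking prefers a filler that is distinctive for the slot, which is exactly the kind of divergence the paper's evaluation probes.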
Semi-automatic grammar induction for bidirectional machine translation.
Wong, Chin Chung. Thesis (M.Phil.)--Chinese University of Hong Kong, 2002. Includes bibliographical references (leaves 137-143). Abstracts in English and Chinese.
Chapter 1 --- Introduction --- p.1
Chapter 1.1 --- Objectives --- p.3
Chapter 1.2 --- Thesis Outline --- p.5
Chapter 2 --- Background in Natural Language Understanding --- p.6
Chapter 2.1 --- Rule-based Approaches --- p.7
Chapter 2.2 --- Corpus-based Approaches --- p.8
Chapter 2.2.1 --- Stochastic Approaches --- p.8
Chapter 2.2.2 --- Phrase-spotting Approaches --- p.9
Chapter 2.3 --- The ATIS Domain --- p.10
Chapter 2.3.1 --- Chinese Corpus Preparation --- p.11
Chapter 3 --- Semi-automatic Grammar Induction - Baseline Approach --- p.13
Chapter 3.1 --- Background in Grammar Induction --- p.13
Chapter 3.1.1 --- Simulated Annealing --- p.14
Chapter 3.1.2 --- Bayesian Grammar Induction --- p.14
Chapter 3.1.3 --- Probabilistic Grammar Acquisition --- p.15
Chapter 3.2 --- Semi-automatic Grammar Induction - Baseline Approach --- p.16
Chapter 3.2.1 --- Spatial Clustering --- p.16
Chapter 3.2.2 --- Temporal Clustering --- p.18
Chapter 3.2.3 --- Post-processing --- p.19
Chapter 3.2.4 --- Four Aspects for Enhancements --- p.20
Chapter 3.3 --- Chapter Summary --- p.22
Chapter 4 --- Semi-automatic Grammar Induction - Enhanced Approach --- p.23
Chapter 4.1 --- Evaluating Induced Grammars --- p.24
Chapter 4.2 --- Stopping Criterion --- p.26
Chapter 4.2.1 --- Cross-checking with Recall Values --- p.29
Chapter 4.3 --- Improvements on Temporal Clustering --- p.32
Chapter 4.3.1 --- Evaluation --- p.39
Chapter 4.4 --- Improvements on Spatial Clustering --- p.46
Chapter 4.4.1 --- Distance Measures --- p.48
Chapter 4.4.2 --- Evaluation --- p.57
Chapter 4.5 --- Enhancements based on Intelligent Selection --- p.62
Chapter 4.5.1 --- Informed Selection between Spatial Clustering and Temporal Clustering --- p.62
Chapter 4.5.2 --- Selecting the Number of Clusters Per Iteration --- p.64
Chapter 4.5.3 --- An Example for Intelligent Selection --- p.64
Chapter 4.5.4 --- Evaluation --- p.68
Chapter 4.6 --- Chapter Summary --- p.71
Chapter 5 --- Bidirectional Machine Translation using Induced Grammars - Baseline Approach --- p.73
Chapter 5.1 --- Background in Machine Translation --- p.75
Chapter 5.1.1 --- Rule-based Machine Translation --- p.75
Chapter 5.1.2 --- Statistical Machine Translation --- p.76
Chapter 5.1.3 --- Knowledge-based Machine Translation --- p.77
Chapter 5.1.4 --- Example-based Machine Translation --- p.78
Chapter 5.1.5 --- Evaluation --- p.79
Chapter 5.2 --- Baseline Configuration on Bidirectional Machine Translation System --- p.84
Chapter 5.2.1 --- Bilingual Dictionary --- p.84
Chapter 5.2.2 --- Concept Alignments --- p.85
Chapter 5.2.3 --- Translation Process --- p.89
Chapter 5.2.4 --- Two Aspects for Enhancements --- p.90
Chapter 5.3 --- Chapter Summary --- p.91
Chapter 6 --- Bidirectional Machine Translation - Enhanced Approach --- p.92
Chapter 6.1 --- Concept Alignments --- p.93
Chapter 6.1.1 --- Enhanced Alignment Scheme --- p.95
Chapter 6.1.2 --- Experiment --- p.97
Chapter 6.2 --- Grammar Checker --- p.100
Chapter 6.2.1 --- Components for Grammar Checking --- p.101
Chapter 6.3 --- Evaluation --- p.117
Chapter 6.3.1 --- Bleu Score Performance --- p.118
Chapter 6.3.2 --- Modified Bleu Score --- p.122
Chapter 6.4 --- Chapter Summary --- p.130
Chapter 7 --- Conclusions --- p.131
Chapter 7.1 --- Summary --- p.131
Chapter 7.2 --- Contributions --- p.134
Chapter 7.3 --- Future work --- p.136
Bibliography --- p.137
Chapter A --- Original SQL Queries --- p.144
Chapter B --- Seeded Categories --- p.146
Chapter C --- 3 Alignment Categories --- p.147
Chapter D --- Labels of Syntactic Structures in Grammar Checker --- p.14
Cross-Lingual Transfer of Natural Language Processing Systems
Accurate natural language processing systems rely heavily on annotated datasets. In the absence of such datasets, transfer methods can help to develop a model by transferring annotations from one or more rich-resource languages to the target language of interest. These methods are generally divided into two approaches: 1) annotation projection from translation data, aka parallel data, using supervised models in rich-resource languages, and 2) direct model transfer from annotated datasets in rich-resource languages.
In this thesis, we demonstrate different methods for transfer of dependency parsers and sentiment analysis systems. We propose an annotation projection method that performs well in the scenarios for which a large amount of in-domain parallel data is available. We also propose a method which is a combination of annotation projection and direct transfer that can leverage a minimal amount of information from a small out-of-domain parallel dataset to develop highly accurate transfer models. Furthermore, we propose an unsupervised syntactic reordering model to improve the accuracy of dependency parser transfer for non-European languages. Finally, we conduct a diverse set of experiments for the transfer of sentiment analysis systems in different data settings.
A summary of our contributions is as follows:
* We develop accurate dependency parsers using parallel text in an annotation projection framework. We make use of the fact that the density of word alignments is a valuable indicator of reliability in annotation projection.
* We develop accurate dependency parsers in the absence of a large amount of parallel data. We use the Bible data, which is orders of magnitude smaller than a conventional parallel dataset, to provide minimal cues for creating cross-lingual word representations. Our model is also capable of boosting the performance of annotation projection with a large amount of parallel data. Our model develops cross-lingual word representations for going beyond the traditional delexicalized direct transfer methods. Moreover, we propose a simple but effective word translation approach that brings in explicit lexical features from the target language in our direct transfer method.
* We develop different syntactic reordering models that can change the source treebanks in rich-resource languages, thus preventing learning a wrong model for a non-related language. Our experimental results show substantial improvements over non-European languages.
* We develop transfer methods for sentiment analysis in different data availability scenarios. We show that we can leverage cross-lingual word embeddings to create accurate sentiment analysis systems in the absence of annotated data in the target language of interest.
We believe that the novelties that we introduce in this thesis indicate the usefulness of transfer methods. This is appealing in practice, especially since we suggest eliminating the requirement of annotating new datasets for low-resource languages, which are expensive, if not impossible, to obtain.
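The alignment-density idea from the first contribution above can be sketched as a simple filter: sentence pairs whose tokens are densely covered by alignment links are treated as reliable sources for annotation projection. This is an illustrative formulation with invented names and an arbitrary threshold, not the thesis's exact definition.

```python
def alignment_density(src_len, tgt_len, alignments):
    # Fraction of tokens (on both sides) covered by at least one link.
    src_cov = {s for s, _ in alignments}
    tgt_cov = {t for _, t in alignments}
    return (len(src_cov) + len(tgt_cov)) / (src_len + tgt_len)

def filter_reliable(sent_pairs, threshold=0.8):
    # sent_pairs: (src_tokens, tgt_tokens, alignment_links) triples.
    # Keep only pairs dense enough to project annotations from.
    return [p for p in sent_pairs
            if alignment_density(len(p[0]), len(p[1]), p[2]) >= threshold]
```

Sparsely aligned pairs, where projected trees would be fragmentary, are discarded before training the target-language parser.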
Treebank-based acquisition of Chinese LFG resources for parsing and generation
This thesis describes a treebank-based approach to automatically acquire robust, wide-coverage Lexical-Functional Grammar (LFG) resources for Chinese parsing
and generation, which is part of a larger project on the rapid construction of deep, large-scale, constraint-based, multilingual grammatical resources. I present an application-oriented LFG analysis for Chinese core linguistic phenomena and (in cooperation with PARC) develop a gold-standard dependency-bank of Chinese f-structures for evaluation. Based on the Penn Chinese Treebank, I design and implement two architectures for inducing Chinese LFG resources, one annotation-based and the other dependency conversion-based. I then apply the f-structure acquisition algorithm together with external, state-of-the-art parsers to parsing new text into "proto" f-structures. In order to convert "proto" f-structures into "proper" f-structures or deep dependencies, I present a novel Non-Local Dependency (NLD) recovery algorithm using subcategorisation frames and f-structure paths linking antecedents and traces in NLDs extracted from the automatically-built LFG f-structure treebank. Based on the grammars extracted from the f-structure annotated treebank, I develop a PCFG-based chart generator and a new n-gram based pure dependency generator to realise Chinese sentences from LFG f-structures.
The work reported in this thesis is the first effort to scale treebank-based, probabilistic Chinese LFG resources from proof-of-concept research to unrestricted, real
text. Although this thesis concentrates on Chinese and LFG, many of the methodologies, e.g. the acquisition of predicate-argument structures, NLD resolution and
the PCFG- and dependency n-gram-based generation models, are largely language and formalism independent and should generalise to diverse languages as well as to labelled bilexical dependency representations other than LFG.
Construction Grammar and Language Models
Recent progress in deep learning and natural language processing has given
rise to powerful models that are primarily trained on a cloze-like task and
show some evidence of having access to substantial linguistic information,
including some constructional knowledge. This groundbreaking discovery presents
an exciting opportunity for a synergistic relationship between computational
methods and Construction Grammar research. In this chapter, we explore three
distinct approaches to the interplay between computational methods and
Construction Grammar: (i) computational methods for text analysis, (ii)
computational Construction Grammar, and (iii) deep learning models, with a
particular focus on language models. We touch upon the first two approaches as
a contextual foundation for the use of computational methods before providing
an accessible, yet comprehensive overview of deep learning models, which also
addresses reservations construction grammarians may have. Additionally, we
delve into experiments that explore the emergence of constructionally relevant
information within these models while also examining the aspects of
Construction Grammar that may pose challenges for these models. This chapter
aims to foster collaboration between researchers in the fields of natural
language processing and Construction Grammar. By doing so, we hope to pave the
way for new insights and advancements in both these fields.

Comment: Accepted for publication in The Cambridge Handbook of Construction Grammar, edited by Mirjam Fried and Kiki Nikiforidou. To appear in 202