501 research outputs found
Improving Japanese Zero Pronoun Resolution by Global Word Sense Disambiguation
This paper proposes unsupervised word sense disambiguation based on automatically constructed case frames and its incorporation into our zero pronoun resolution system. The word sense disambiguation is applied to verbs and nouns. We consider that case frames define verb senses and semantic features in a thesaurus define noun senses, respectively, and perform sense disambiguation by selecting them based on case analysis. In addition, according to the one sense per discourse heuristic, the word sense disambiguation results are cached and applied globally to the subsequent words. We integrated this global word sense disambiguation into our zero pronoun resolution system, and conducted experiments of zero pronoun resolution on two different domain corpora. Both of the experimental results indicated the effectiveness of our approach.
A Survey on Semantic Processing Techniques
Semantic processing is a fundamental research domain in computational
linguistics. In the era of powerful pre-trained language models and large
language models, the advancement of research in this domain appears to be
decelerating. However, the study of semantics is multi-dimensional in
linguistics. The research depth and breadth of computational semantic
processing can be largely improved with new technologies. In this survey, we
analyzed five semantic processing tasks, e.g., word sense disambiguation,
anaphora resolution, named entity recognition, concept extraction, and
subjectivity detection. We study relevant theoretical research in these fields,
advanced methods, and downstream applications. We connect the surveyed tasks
with downstream applications because this may inspire future scholars to fuse
these low-level semantic processing tasks with high-level natural language
processing tasks. The review of theoretical research may also inspire new tasks
and technologies in the semantic processing domain. Finally, we compare the
different semantic processing techniques and summarize their technical trends,
application trends, and future directions.Comment: Published at Information Fusion, Volume 101, 2024, 101988, ISSN
1566-2535. The equal contribution mark is missed in the published version due
to the publication policies. Please contact Prof. Erik Cambria for detail
Automatic Acquisition of Lexical-Functional Grammar Resources from a Japanese Dependency Corpus
PACLIC 21 / Seoul National University, Seoul, Korea / November 1-3, 200
Domain-Specific Knowledge Acquisition for Conceptual Sentence Analysis
The availability of on-line corpora is rapidly changing the field of natural language processing (NLP) from one dominated by theoretical models of often very specific linguistic phenomena to one guided by computational models that simultaneously account for a wide variety of phenomena that occur in real-world text. Thus far, among the best-performing and most robust systems for reading and summarizing large amounts of real-world text are knowledge-based natural language systems. These systems rely heavily on domain-specific, handcrafted knowledge to handle the myriad syntactic, semantic, and pragmatic ambiguities that pervade virtually all aspects of sentence analysis. Not surprisingly, however, generating this knowledge for new domains is time-consuming, difficult, and error-prone, and requires the expertise of computational linguists familiar with the underlying NLP system. This thesis presents Kenmore, a general framework for domain-specific knowledge acquisition for conceptual sentence analysis. To ease the acquisition of knowledge in new domains, Kenmore exploits an on-line corpus using symbolic machine learning techniques and robust sentence analysis while requiring only minimal human intervention. Unlike most approaches to knowledge acquisition for natural language systems, the framework uniformly addresses a range of subproblems in sentence analysis, each of which traditionally had required a separate computational mechanism. The thesis presents the results of using Kenmore with corpora from two real-world domains (1) to perform part-of-speech tagging, semantic feature tagging, and concept tagging of all open-class words in the corpus; (2) to acquire heuristics for part-ofspeech disambiguation, semantic feature disambiguation, and concept activation; and (3) to find the antecedents of relative pronouns
Linguistic Structure in Statistical Machine Translation
This thesis investigates the influence of linguistic structure in statistical machine translation. We develop a word reordering model based on syntactic parse trees and address the issues of pronouns and morphological agreement with a source discriminative word lexicon predicting the translation for individual words using structural features. When used in phrase-based machine translation, the models improve the translation for language pairs with different word order and morphological variation
Translation of Pronominal Anaphora between English and Spanish: Discrepancies and Evaluation
This paper evaluates the different tasks carried out in the translation of
pronominal anaphora in a machine translation (MT) system. The MT interlingua
approach named AGIR (Anaphora Generation with an Interlingua Representation)
improves upon other proposals presented to date because it is able to translate
intersentential anaphors, detect co-reference chains, and translate Spanish
zero pronouns into English---issues hardly considered by other systems. The
paper presents the resolution and evaluation of these anaphora problems in AGIR
with the use of different kinds of knowledge (lexical, morphological,
syntactic, and semantic). The translation of English and Spanish anaphoric
third-person personal pronouns (including Spanish zero pronouns) into the
target language has been evaluated on unrestricted corpora. We have obtained a
precision of 80.4% and 84.8% in the translation of Spanish and English
pronouns, respectively. Although we have only studied the Spanish and English
languages, our approach can be easily extended to other languages such as
Portuguese, Italian, or Japanese
A Hybrid Environment for Syntax-Semantic Tagging
The thesis describes the application of the relaxation labelling algorithm to
NLP disambiguation. Language is modelled through context constraint inspired on
Constraint Grammars. The constraints enable the use of a real value statind
"compatibility". The technique is applied to POS tagging, Shallow Parsing and
Word Sense Disambigation. Experiments and results are reported. The proposed
approach enables the use of multi-feature constraint models, the simultaneous
resolution of several NL disambiguation tasks, and the collaboration of
linguistic and statistical models.Comment: PhD Thesis. 120 page
Anaphora resolution for Arabic machine translation :a case study of nafs
PhD ThesisIn the age of the internet, email, and social media there is an increasing need for processing online information, for example, to support education and business. This has led to the rapid development of natural language processing technologies such as computational linguistics, information retrieval, and data mining. As a branch of computational linguistics, anaphora resolution has attracted much interest. This is reflected in the large number of papers on the topic published in journals such as Computational Linguistics. Mitkov (2002) and Ji et al. (2005) have argued that the overall quality of anaphora resolution systems remains low, despite practical advances in the area, and that major challenges include dealing with real-world knowledge and accurate parsing.
This thesis investigates the following research question: can an algorithm be found for the resolution of the anaphor nafs in Arabic text which is accurate to at least 90%, scales linearly with text size, and requires a minimum of knowledge resources? A resolution algorithm intended to satisfy these criteria is proposed. Testing on a corpus of contemporary Arabic shows that it does indeed satisfy the criteria.Egyptian Government
DUMB: A Benchmark for Smart Evaluation of Dutch Models
We introduce the Dutch Model Benchmark: DUMB. The benchmark includes a
diverse set of datasets for low-, medium- and high-resource tasks. The total
set of nine tasks includes four tasks that were previously not available in
Dutch. Instead of relying on a mean score across tasks, we propose Relative
Error Reduction (RER), which compares the DUMB performance of language models
to a strong baseline which can be referred to in the future even when assessing
different sets of language models. Through a comparison of 14 pre-trained
language models (mono- and multi-lingual, of varying sizes), we assess the
internal consistency of the benchmark tasks, as well as the factors that likely
enable high performance. Our results indicate that current Dutch monolingual
models under-perform and suggest training larger Dutch models with other
architectures and pre-training objectives. At present, the highest performance
is achieved by DeBERTaV3 (large), XLM-R (large) and mDeBERTaV3 (base). In
addition to highlighting best strategies for training larger Dutch models, DUMB
will foster further research on Dutch. A public leaderboard is available at
https://dumbench.nl.Comment: EMNLP 2023 camera-read
- …