529 research outputs found
Anaphora resolution for Arabic machine translation: a case study of nafs
PhD Thesis
In the age of the internet, email, and social media there is an increasing need for processing online information, for example, to support education and business. This has led to the rapid development of natural language processing technologies such as computational linguistics, information retrieval, and data mining. As a branch of computational linguistics, anaphora resolution has attracted much interest. This is reflected in the large number of papers on the topic published in journals such as Computational Linguistics. Mitkov (2002) and Ji et al. (2005) have argued that the overall quality of anaphora resolution systems remains low, despite practical advances in the area, and that major challenges include dealing with real-world knowledge and accurate parsing.
This thesis investigates the following research question: can an algorithm be found for the resolution of the anaphor nafs in Arabic text which is accurate to at least 90%, scales linearly with text size, and requires a minimum of knowledge resources? A resolution algorithm intended to satisfy these criteria is proposed. Testing on a corpus of contemporary Arabic shows that it does indeed satisfy the criteria.
Egyptian Government
Review of coreference resolution in English and Persian
Coreference resolution (CR) is one of the most challenging areas of natural
language processing. This task seeks to identify all textual references to the
same real-world entity. Research in this field is divided into coreference
resolution and anaphora resolution. Due to its application in textual
comprehension and its utility in other tasks such as information extraction
systems, document summarization, and machine translation, this field has
attracted considerable interest. Consequently, it has a significant effect on
the quality of these systems. This article reviews the existing corpora and
evaluation metrics in this field. Then, an overview of the coreference
algorithms, from rule-based methods to the latest deep learning techniques, is
provided. Finally, coreference resolution and pronoun resolution systems in
Persian are investigated.
Comment: 44 pages, 11 figures, 5 tables
A Survey on Semantic Processing Techniques
Semantic processing is a fundamental research domain in computational
linguistics. In the era of powerful pre-trained language models and large
language models, the advancement of research in this domain appears to be
decelerating. However, the study of semantics is multi-dimensional in
linguistics. The research depth and breadth of computational semantic
processing can be largely improved with new technologies. In this survey, we
analyze five semantic processing tasks, namely word sense disambiguation,
anaphora resolution, named entity recognition, concept extraction, and
subjectivity detection. We study relevant theoretical research in these fields,
advanced methods, and downstream applications. We connect the surveyed tasks
with downstream applications because this may inspire future scholars to fuse
these low-level semantic processing tasks with high-level natural language
processing tasks. The review of theoretical research may also inspire new tasks
and technologies in the semantic processing domain. Finally, we compare the
different semantic processing techniques and summarize their technical trends,
application trends, and future directions.
Comment: Published in Information Fusion, Volume 101, 2024, 101988, ISSN
1566-2535. The equal contribution mark is missing from the published version due
to the publication policies. Please contact Prof. Erik Cambria for details.
Aprendizagem à distância de anáfora em inglês e espanhol como línguas estrangeiras
This doctoral thesis investigates the distance learning of anaphora in English and Spanish as foreign languages. It analyses how native speakers of Portuguese, learners of English or Spanish, understand and produce anaphora with nominal antecedents in written texts and how different distance learning modalities can contribute to the learning of this discursive mechanism. In total, 11 articles were written and distributed in 4 sections. The first section focuses on investigating ambiguity resolution based on an online questionnaire distributed to learners and native speakers of Portuguese, English, and Spanish. While the first paper presents a pilot study conducted in Portugal, the second includes data from Brazil, and the third was written after the data collection was completed. In the questionnaires, it was possible to control several variables to analyse how speakers resolved anaphoric ambiguity. The second section reviews the literature on the teaching and learning of anaphora, the theories and methods focused on language teaching, and the different teaching modalities. These studies allowed the conceptual elaboration of the experiment carried out later. The third section of the thesis presents the experiment carried out, which consisted of offering a course on anaphora in synchronous and asynchronous distance learning modalities, with monitoring of learning over time. The first article explains how the course was planned; the second presents the groups’ results in the comprehension tests; and the third evaluates the course qualitatively. The fourth section presents the new learner corpora, BRANEN and BRANES, and the analysis of the anaphoric relations produced by the students over four tests (a pre-test, an intermediate test, an immediate final test, and a retention test after one month).
The thesis ends with a synopsis of the results obtained, their discussion, and a conclusion looking towards future lines of research.
Towards Multilingual Coreference Resolution
The current work investigates the problems that occur when coreference resolution is considered as a multilingual task. We assess the issues that arise when a framework using the mention-pair coreference resolution model and memory-based learning for the resolution process is used. Along the way, we revise three essential subtasks of coreference resolution: mention detection, mention head detection and feature selection. For each of these aspects we propose various multilingual solutions including heuristic, rule-based and machine learning methods. We carry out a detailed analysis that includes eight different languages (Arabic, Catalan, Chinese, Dutch, English, German, Italian and Spanish) for which datasets were provided by the only two multilingual shared tasks on coreference resolution held so far: SemEval-2 and CoNLL-2012. Our investigation shows that, although complex, the coreference resolution task can be targeted in a multilingual and even language independent way. We propose machine learning methods for each of the subtasks that are affected by the transition, and evaluate and compare them to the performance of rule-based and heuristic approaches. Our results confirm that machine learning provides the needed flexibility for the multilingual task and that the minimal requirement for a language independent system is a part-of-speech annotation layer provided for each of the approached languages. We also show that the performance of the system can be improved by introducing other layers of linguistic annotations, such as syntactic parses (in the form of either constituency or dependency parses), named entity information, predicate argument structure, etc. Additionally, we discuss the problems occurring in the proposed approaches and suggest possibilities for their improvement.
Linguistics parameters for zero anaphora resolution
Master's dissertation, Natural Language Processing and Human Language Technology, Univ. do Algarve, 2009
This dissertation describes and proposes a set of linguistically motivated rules for zero
anaphora resolution in the context of a natural language processing chain developed for
Portuguese. Some languages, like Portuguese, allow noun phrase (NP) deletion (or zeroing)
in several syntactic contexts in order to avoid the redundancy that would result from
repetition of previously mentioned words. The co-reference relation between the zeroed
element and its antecedent (or previous mention) in the discourse is here called zero
anaphora (Mitkov, 2002). In Computational Linguistics, zero anaphora resolution may be
viewed as a subtask of anaphora resolution and has an essential role in various Natural
Language Processing applications such as information extraction, automatic abstracting,
dialog systems, machine translation and question answering. The main goal of this
dissertation is to describe the grammatical rules imposing subject NP deletion and referential
constraints in Brazilian Portuguese, in order to allow correct identification of the
antecedent of the deleted subject NP. Some of these rules were then formalized into the
Xerox Incremental Parser or XIP (Ait-Mokhtar et al., 2002: 121-144) in order to constitute a
module of the Portuguese grammar (Mamede et al. 2010) developed at Spoken Language
Laboratory (L2F). Using this rule-based approach we expected to improve the performance
of the Portuguese grammar, namely by producing better dependency structures with
(reconstructed) zeroed NPs for the syntactic-semantic interface. Because of the complexity
of the task, the scope of this dissertation had to be limited: (a) subject NP deletion; (b) within
sentence boundaries and (c) with an explicit antecedent; besides, (d) rules were formalized
based solely on the results of the shallow parser (or chunks), that is, with minimal syntactic
(and no semantic) knowledge. A corpus of different text genres was manually annotated for
zero anaphors and other zero-shaped, usually indefinite, subjects. The rule-based
approach is evaluated and results are presented and discussed.
On L1 Attrition and Prosody in Pronominal Anaphora Resolution
This thesis is a collection of four studies on pronominal anaphora resolution with a focus on first language (L1) attrition and prosody. In Study I, we explored the temporariness of attrition effects on anaphora resolution in L1 Italian speakers who moved to Sweden after puberty (i.e., late bilinguals). An experimental group of 20 late Italian-Swedish bilinguals and a control group of 21 Italian monolinguals completed a self-paced interpretation task twice, and we measured response preferences and response times. In Study II, we investigated how L1 Italian and L1 Swedish speakers use pause features and prominence cues to resolve globally ambiguous anaphora sentences, and whether their patterns in the use of prosody mirror the divergent coreference patterns in the two languages. 28 L1 Italian speakers and 28 L1 Swedish speakers completed a speech production task, in which we analyzed the inter-clausal pause length and the pronoun’s degree of prosodic prominence, and a control interpretation task, in which we considered response preferences. Study III represents a continuation of Study II, since we examined a group of 18 late Italian-Swedish bilinguals, who completed the same experimental tasks as in Study II. Study IV is a theoretical investigation, in which we discussed previous inconsistent findings on anaphora resolution in light of the interplay between hierarchical structure and linear order of a sentence. The results of the four studies suggest, first, that anaphora resolution may also affect null pronouns, and that task-learning effects should be taken into account for further research on L1 re-immersion. Second, they suggest that inter-clausal pause and prosodic prominence of pronouns are likely to break the canonical coreference pattern, both in a null subject language and in a non-null subject language. Third, the findings also reveal that L1 attrition affects prominence patterns and pause features in pronoun resolution.
In particular, the longer the residence in the foreign language (FL) environment, the higher the probability that late bilinguals adapt to the FL patterns when they use prosody to resolve anaphora sentences. Fourth, both monolinguals and bilinguals are sensitive to the interplay between hierarchical structure and linear order of anaphora. However, they employ different strategies to interpret an anaphora sentence in which hierarchical structure and linear order favor different antecedents. The implications of the findings are discussed in light of the role of processing and cross-linguistic influence (CLI) in L1 attrition, as well as in light of the use of prosodic cues to resolve an anaphoric reference, both in relation to the Null Subject Parameter and in relation to L1 attrition.
Anaphoric resolution of zero pronouns in Chinese in translation and reading comprehension
The primary aim of the thesis is to investigate some of the processes of
reading Chinese text by means of comparing and analysing approximately
100 parallel translations of four texts from Chinese to English. The
translations are answers to A Level examination questions. The focus of the
investigation is interpretation of the zero pronoun, a common phenomenon in
Chinese, which often requires explicitation when translated into English. The
secondary aim is to show how translation gives evidence of comprehension,
as shown by the variation in interpretation of zero pronouns. The thesis
reviews relevant psycholinguistic research into reading, particularly reading
of Chinese text. This is followed by reviews of relevant research into
translation as a reading activity, and a discussion of its role in language
teaching and testing.
The core of the thesis is the discussion of the zero pronoun in Chinese,
including discussion of anaphoric choice - the writer's decision on when to
use zero in preference to an explicit anaphoric form - and of anaphoric
resolution - how a reader decides what a zero pronoun refers to. Anaphoric
resolution may be problematic for less experienced readers of Chinese owing
to its lack of rich morphological inflection which, in other languages, provides
the reader with information. Some of the key ideas on anaphoric choice and
resolution are then applied to the analysis of the data in the parallel
translations. It would appear that factors in Chinese texts which have an effect
on comprehending zero pronouns are antecedent distance, topic persistence,
abstraction, multiplicity of arguments and the meaning of the verb.
Characteristics of the reader which may affect comprehension of the zero
pronoun include personal schemata which may lead to elaborative inferences.
On the basis of the data I suggest that mark schemes could be devised on a
scalar system encompassing optimal solution, proximal solution and non-solution, which might help to solve the problem of variability in marking
translation.
A by-product of the thesis, and an avenue for further research, is the apparent
close relationship between idea units, clause length, punctuation breaks and
antecedent distance in Chinese texts and saccade length and working memory
capacity in the reader of Chinese.
Intelligent text processing to help readers with autism
© 2018, Springer International Publishing AG.
Autistic Spectrum Disorder (ASD) is a neurodevelopmental disorder which has a life-long impact on the lives of people diagnosed with the condition. In many cases, people with ASD are unable to derive the gist or meaning of written documents due to their inability to process complex sentences, understand non-literal text, and understand uncommon and technical terms. This paper presents FIRST, an innovative project which developed language technology (LT) to make documents more accessible to people with ASD. The project has produced a powerful editor which enables carers of people with ASD to prepare texts suitable for this population. Assessment of the texts generated using the editor showed that they are not less readable than those generated more slowly as a result of onerous unaided conversion and were significantly more readable than the originals. Evaluation of the tool shows that it can have a positive impact on the lives of people with ASD.
Published version