96 research outputs found
A Survey of Word Reordering in Statistical Machine Translation: Computational Models and Language Phenomena
Word reordering is one of the most difficult aspects of statistical machine
translation (SMT), and an important factor of its quality and efficiency.
Despite the vast amount of research published to date, the interest of the
community in this problem has not decreased, and no single method appears to be
strongly dominant across language pairs. Instead, the choice of the optimal
approach for a new translation task still seems to be mostly driven by
empirical trials. To orientate the reader in this vast and complex research
area, we present a comprehensive survey of word reordering viewed as a
statistical modeling challenge and as a natural language phenomenon. The survey
describes in detail how word reordering is modeled within different
string-based and tree-based SMT frameworks and as a stand-alone task, including
systematic overviews of the literature in advanced reordering modeling. We then
question why some approaches are more successful than others in different
language pairs. We argue that, besides measuring the amount of reordering, it
is important to understand which kinds of reordering occur in a given language
pair. To this end, we conduct a qualitative analysis of word reordering
phenomena in a diverse sample of language pairs, based on a large collection of
linguistic knowledge. Empirical results in the SMT literature are shown to
support the hypothesis that a few linguistic facts can be very useful to
anticipate the reordering characteristics of a language pair and to select the
SMT framework that best suits them.Comment: 44 pages, to appear in Computational Linguistic
Turkish lexicon expansion by using finite state automata
© 2019 The Authors. Published by The Scientific and Technological Research Council of Turkey. This is an open access article available under a Creative Commons licence. The published version can be accessed at the following link on the publisher’s website: https://journals.tubitak.gov.tr/elektrik/issues/elk-19-27-2/elk-27-2-25-1804-10.pdfTurkish is an agglutinative language with rich morphology. A Turkish verb can have thousands of different word forms. Therefore, sparsity becomes an issue in many Turkish natural language processing (NLP) applications. This article presents a model for Turkish lexicon expansion. We aimed to expand the lexicon by using a morphological segmentation system by reversing the segmentation task into a generation task. Our model uses finite-state automata (FSA) to incorporate orthographic features and morphotactic rules. We extracted orthographic features by capturing phonological operations that are applied to words whenever a suffix is added. Each FSA state corresponds to either a stem or a suffix category. Stems are clustered based on their parts-of-speech (i.e. noun, verb, or adjective) and suffixes are clustered based on their allomorphic features. We generated approximately 1 million word forms by using only a few thousand Turkish stems with an accuracy of 82.36%, which will help to reduce the out-of-vocabulary size in other NLP applications. Although our experiments are performed on Turkish language, the same model is also applicable to other agglutinative languages such as Hungarian and Finnish.Published versio
A rules based system for named entity recognition in modern standard Arabic
The amount of textual information available electronically has made it difficult for many users to find and access the right information within acceptable time. Research communities in the natural language processing (NLP) field are developing tools and techniques to alleviate these problems and help users in exploiting these vast resources. These techniques include Information Retrieval (IR) and Information Extraction (IE). The work described in this thesis concerns IE and more specifically, named entity extraction in Arabic. The Arabic language is of significant interest to the NLP community mainly due to its political and economic significance, but also due to its interesting characteristics. Text usually contains all kinds of names such as person names, company names, city and country names, sports teams, chemicals and lots of other names from specific domains. These names are called Named Entities (NE) and Named Entity Recognition (NER), one of the main tasks of IE systems, seeks to locate and classify automatically these names into predefined categories. NER systems are developed for different applications and can be beneficial to other information management technologies as it can be built over an IR system or can be used as the base module of a Data Mining application. In this thesis we propose an efficient and effective framework for extracting Arabic NEs from text using a rule based approach. Our approach makes use of Arabic contextual and morphological information to extract named entities. The context is represented by means of words that are used as clues for each named entity type. Morphological information is used to detect the part of speech of each word given to the morphological analyzer. Subsequently we developed and implemented our rules in order to recognise each position of the named entity. Finally, our system implementation, evaluation metrics and experimental results are presented.EThOS - Electronic Theses Online ServiceGBUnited Kingdo
Anaphora resolution for Arabic machine translation :a case study of nafs
PhD ThesisIn the age of the internet, email, and social media there is an increasing need for processing online information, for example, to support education and business. This has led to the rapid development of natural language processing technologies such as computational linguistics, information retrieval, and data mining. As a branch of computational linguistics, anaphora resolution has attracted much interest. This is reflected in the large number of papers on the topic published in journals such as Computational Linguistics. Mitkov (2002) and Ji et al. (2005) have argued that the overall quality of anaphora resolution systems remains low, despite practical advances in the area, and that major challenges include dealing with real-world knowledge and accurate parsing.
This thesis investigates the following research question: can an algorithm be found for the resolution of the anaphor nafs in Arabic text which is accurate to at least 90%, scales linearly with text size, and requires a minimum of knowledge resources? A resolution algorithm intended to satisfy these criteria is proposed. Testing on a corpus of contemporary Arabic shows that it does indeed satisfy the criteria.Egyptian Government
A Rules Based System for Named Entity Recognition in Modern Standard Arabic
The amount of textual information available electronically has made it difficult formany users to find and access the right information within acceptable time. Researchcommunities in the natural language processing (NLP) field are developing tools andtechniques to alleviate these problems and help users in exploiting these vast resources.These techniques include Information Retrieval (IR) and Information Extraction (IE). Thework described in this thesis concerns IE and more specifically, named entity extraction inArabic. The Arabic language is of significant interest to the NLP community mainly due toits political and economic significance, but also due to its interesting characteristics.Text usually contains all kinds of names such as person names, company names,city and country names, sports teams, chemicals and lots of other names from specificdomains. These names are called Named Entities (NE) and Named Entity Recognition(NER), one of the main tasks of IE systems, seeks to locate and classify automaticallythese names into predefined categories. NER systems are developed for differentapplications and can be beneficial to other information management technologies as it canbe built over an IR system or can be used as the base module of a Data Mining application.In this thesis we propose an efficient and effective framework for extracting Arabic NEsfrom text using a rule based approach. Our approach makes use of Arabic contextual andmorphological information to extract named entities. The context is represented by meansof words that are used as clues for each named entity type. Morphological information isused to detect the part of speech of each word given to the morphological analyzer.Subsequently we developed and implemented our rules in order to recognise each positionof the named entity. Finally, our system implementation, evaluation metrics andexperimental results are presented
A morphological-syntactical analysis approach for Arabic textual tagging
Part-of-Speech (POS) tagging is the process of labeling or classifying each word in
written text with its grammatical category or part-of-speech, i.e. noun, verb, preposition,
adjective, etc. It is the most common disambiguation process in the field of
Natural Language Processing (NLP). POS tagging systems are often preprocessors in
many NLP applications.
The Arabic language has a valuable and an important feature, called diacritics, which
are marks placed over and below the letters of the word. An Arabic text is partiallyvocalisedl
when the diacritical mark is assigned to one or maximum two letters in the
word.
Diacritics in Arabic texts are extremely important especially at the end of the word.
They help determining not only the correct POS tag for each word in the sentence,
but also in providing full information regarding the inflectional features, such as tense,
number, gender, etc. for the sentence words. They add semantic information to words
which helps with resolving ambiguity in the meaning of words. Furthermore, diacritics
ascribe grammatical functions to the words, differentiating the word from other words,
and determining the syntactic position of the word in the sentence.
1. Vocalisation (also referred as diacritisation or vowelisation).
This thesis presents a rule-based Part-of-Speech tagging system called AMT - short
for Arabic Morphosyntactic Tagger. The main function of the AMT system is to assign
the correct tag to each word in an untagged raw partially-vocalised Arabic corpus,
and to produce a POS tagged corpus without using a manually tagged or untagged
lexicon (dictionary) for training. Two different techniques were used in this work, the
pattem-based technique and the lexical and contextual technique.
The rules in the pattem-based technique technique are based on the pattern of the
testing word. A novel algorithm, Pattern-Matching Algorithm (PMA), has been designed
and introduced in this work. The aim of this algorithm is to match the testing
word with its correct pattern in pattern lexicon.
The lexical and contextual technique on the other hand is used to assist the pattembased
technique technique to assign the correct tag to those words not have a pattern to
follow. The rules in the lexical and contextual technique are based on the character(s),
the last diacritical mark, the word itself, and the tags of the surrounding words.
The importance of utilizing the diacritic feature of the Arabic language to reduce the
lexical ambiguity in POS tagging has been addressed. In addition, a new Arabic tag
set and a new partially-vocalised Arabic corpus to test AMT have been compiled and
presented in this work. The AMT system has achieved an average accuracy of 91 %
Recommended from our members
Arabic Language Processing for Text Classification. Contributions to Arabic Root Extraction Techniques, Building An Arabic Corpus, and to Arabic Text Classification Techniques.
The impact and dynamics of Internet-based resources for Arabic-speaking users is increasing in significance, depth and breadth at highest pace than ever, and thus requires updated mechanisms for computational processing of Arabic texts. Arabic is a complex language and as such requires in depth investigation for analysis and improvement of available automatic processing techniques such as root extraction methods or text classification techniques, and for developing text collections that are already labeled, whether with single or multiple labels.
This thesis proposes new ideas and methods to improve available automatic processing techniques for Arabic texts. Any automatic processing technique would require data in order to be used and critically reviewed and assessed, and here an attempt to develop a labeled Arabic corpus is also proposed. This thesis is composed of three parts: 1- Arabic corpus development, 2- proposing, improving and implementing root extraction techniques, and 3- proposing and investigating the effect of different pre-processing methods on single-labeled text classification methods for Arabic.
This thesis first develops an Arabic corpus that is prepared to be used here for testing root extraction methods as well as single-label text classification techniques. It also enhances a rule-based root extraction method by handling irregular cases (that appear in about 34% of texts). It proposes and implements two expanded algorithms as well as an adjustment for a weight-based method. It also includes the algorithm that handles irregular cases to all and compares the performances of these proposed methods with original ones. This thesis thus develops a root extraction system that handles foreign Arabized words by constructing a list of about 7,000 foreign words. The outcome of the technique with best accuracy results in extracting the correct stem and root for respective words in texts, which is an enhanced rule-based method, is used in the third part of this thesis. This thesis finally proposes and implements a variant term frequency inverse document frequency weighting method, and investigates the effect of using different choices of features in document representation on single-label text classification performance (words, stems or roots as well as including to these choices their respective phrases). This thesis applies forty seven classifiers on all proposed representations and compares their performances. One challenge for researchers in Arabic text processing is that reported root extraction techniques in literature are either not accessible or require a long time to be reproduced while labeled benchmark Arabic text corpus is not fully available online. Also, by now few machine learning techniques were investigated on Arabic where usual preprocessing steps before classification were chosen. Such challenges are addressed in this thesis by developing a new labeled Arabic text corpus for extended applications of computational techniques.
Results of investigated issues here show that proposing and implementing an algorithm that handles irregular words in Arabic did improve the performance of all implemented root extraction techniques. The performance of the algorithm that handles such irregular cases is evaluated in terms of accuracy improvement and execution time. Its efficiency is investigated with different document lengths and empirically is found to be linear in time for document lengths less than about 8,000. The rule-based technique is improved the highest among implemented root extraction methods when including the irregular cases handling algorithm. This thesis validates that choosing roots or stems instead of words in documents representations indeed improves single-label classification performance significantly for most used classifiers. However, the effect of extending such representations with their respective phrases on single-label text classification performance shows that it has no significant improvement. Many classifiers were not yet tested for Arabic such as the ripple-down rule classifier. The outcome of comparing the classifiers' performances concludes that the Bayesian network classifier performance is significantly the best in terms of accuracy, training time, and root mean square error values for all proposed and implemented representations.Petra University, Amman (Jordan
Arabic named entity recognition
En esta tesis doctoral se describen las investigaciones realizadas con el objetivo de determinar
las mejores tecnicas para construir un Reconocedor de Entidades Nombradas
en Arabe. Tal sistema tendria la habilidad de identificar y clasificar las entidades
nombradas que se encuentran en un texto arabe de dominio abierto.
La tarea de Reconocimiento de Entidades Nombradas (REN) ayuda a otras tareas de
Procesamiento del Lenguaje Natural (por ejemplo, la Recuperacion de Informacion, la
Busqueda de Respuestas, la Traduccion Automatica, etc.) a lograr mejores resultados
gracias al enriquecimiento que a~nade al texto. En la literatura existen diversos trabajos
que investigan la tarea de REN para un idioma especifico o desde una perspectiva
independiente del lenguaje. Sin embargo, hasta el momento, se han publicado muy
pocos trabajos que estudien dicha tarea para el arabe.
El arabe tiene una ortografia especial y una morfologia compleja, estos aspectos aportan
nuevos desafios para la investigacion en la tarea de REN. Una investigacion completa
del REN para elarabe no solo aportaria las tecnicas necesarias para conseguir
un alto rendimiento, sino que tambien proporcionara un analisis de los errores y una
discusion sobre los resultados que benefician a la comunidad de investigadores del
REN. El objetivo principal de esta tesis es satisfacer esa necesidad. Para ello hemos:
1. Elaborado un estudio de los diferentes aspectos del arabe relacionados con dicha
tarea;
2. Analizado el estado del arte del REN;
3. Llevado a cabo una comparativa de los resultados obtenidos por diferentes
tecnicas de aprendizaje automatico;
4. Desarrollado un metodo basado en la combinacion de diferentes clasificadores,
donde cada clasificador trata con una sola clase de entidades nombradas y emplea
el conjunto de caracteristicas y la tecnica de aprendizaje automatico mas
adecuados para la clase de entidades nombradas en cuestion.
Nuestros experimentos han sido evaluados sobre nueve conjuntos de test.Benajiba, Y. (2009). Arabic named entity recognition [Tesis doctoral no publicada]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/8318Palanci
- …