Genetic Algorithm (GA) in Feature Selection for CRF Based Manipuri Multiword Expression (MWE) Identification
This paper deals with the identification of Multiword Expressions (MWEs) in
Manipuri, a highly agglutinative Indian language. Manipuri is listed in the
Eighth Schedule of the Indian Constitution. MWEs play an important role in
Natural Language Processing (NLP) applications such as Machine Translation, Part
of Speech tagging, Information Retrieval and Question Answering. Feature
selection is an important factor in the recognition of Manipuri MWEs using
Conditional Random Fields (CRF). The drawbacks of manually selecting
appropriate features for running the CRF motivate the use of a
Genetic Algorithm (GA). Using the GA, we are able to find the optimal features to
run the CRF. We ran fifty generations of feature selection, with
three-fold cross-validation as the fitness function. This model achieved
a Recall (R) of 64.08%, Precision (P) of 86.84% and F-measure (F) of 73.74%,
showing an improvement over CRF-based Manipuri MWE identification without
the GA.
Comment: 14 pages, 6 figures, see
http://airccse.org/journal/jcsit/1011csit05.pd
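The GA-driven feature selection described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not name the genetic operators, so binary feature masks, tournament selection, one-point crossover, bit-flip mutation and elitism are all assumptions, as are every function and parameter name. The `toy_evaluate` scorer is a hypothetical stand-in for training a CRF on the training folds and scoring it on the held-out fold.

```python
import random

def k_fold_fitness(mask, folds, evaluate):
    """Mean score of one feature mask over k held-out folds."""
    scores = []
    for i, held_out in enumerate(folds):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        scores.append(evaluate(mask, train, held_out))
    return sum(scores) / len(scores)

def run_ga(n_features, fitness, generations=50, pop_size=20,
           mutation_rate=0.05, seed=0):
    """Evolve binary feature masks; return the best mask and its fitness."""
    rng = random.Random(seed)
    pop = [tuple(rng.randint(0, 1) for _ in range(n_features))
           for _ in range(pop_size)]
    cache = {}

    def fit(mask):
        # Cache fitness values: a real CRF evaluation would be expensive.
        if mask not in cache:
            cache[mask] = fitness(mask)
        return cache[mask]

    def pick():
        # Tournament selection of size 2 (an assumed operator).
        a, b = rng.sample(pop, 2)
        return a if fit(a) >= fit(b) else b

    best = max(pop, key=fit)
    for _ in range(generations):
        children = [best]  # elitism: the best mask always survives
        while len(children) < pop_size:
            p1, p2 = pick(), pick()
            cut = rng.randint(1, n_features - 1)  # one-point crossover
            child = p1[:cut] + p2[cut:]
            child = tuple(bit ^ 1 if rng.random() < mutation_rate else bit
                          for bit in child)       # bit-flip mutation
            children.append(child)
        pop = children
        best = max(pop, key=fit)
    return best, fit(best)

# Hypothetical stand-in for "train a CRF and return its F-measure":
# features 0, 2 and 5 help, every other selected feature costs 0.5.
USEFUL = {0, 2, 5}

def toy_evaluate(mask, train, held_out):
    chosen = {i for i, bit in enumerate(mask) if bit}
    return len(chosen & USEFUL) - 0.5 * len(chosen - USEFUL)

folds = [[0, 1], [2, 3], [4, 5]]  # three folds of placeholder sentence ids
best_mask, best_score = run_ga(
    n_features=8,
    fitness=lambda m: k_fold_fitness(m, folds, toy_evaluate),
)
print(best_mask, best_score)
```

In a real setting `evaluate` would wrap a CRF toolkit, and caching matters because each fitness call then costs three training runs.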
Automatic Extraction Of Malay Compound Nouns Using A Hybrid Of Statistical And Machine Learning Methods
Identifying compound nouns is important for a wide spectrum of natural language processing applications, such as machine translation and information retrieval. Extracting compound nouns requires deep or shallow syntactic preprocessing tools and large corpora. This paper investigates several methods for extracting noun compounds from Malay text corpora. First, we present the empirical results of sixteen statistical association measures for extracting Malay <N+N> compound nouns. Second, we introduce the possibility of integrating multiple association measures. Third, this work provides a standard data set intended as a common platform for evaluating research on the identification of compound nouns in Malay. The standard data set contains 7,235 unique N-N candidates, of which 2,970 are N-N compound noun collocations. The extraction algorithms are evaluated against this reference data set. The experimental results demonstrate that a group of association measures (t-test, Piatetsky-Shapiro (PS), C_value, FGM and the rank combination method) outperforms the other association measures for <N+N> collocations in the Malay corpus. Finally, we describe several classification methods for combining the scores of the basic association measures, followed by their evaluation. The evaluation shows that classification algorithms significantly outperform individual association measures. The experimental results are quite satisfactory in terms of Precision, Recall and F-score.
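To make the association measures concrete, the sketch below computes two of the measures named in this abstract, the t-test score and Piatetsky-Shapiro, from bigram counts and then combines their rankings. The formulas are the commonly cited textbook forms, which the abstract itself does not give. The mean-rank combination is only one plausible reading of "rank combination method", and the counts are invented for illustration rather than taken from the paper's data set.

```python
import math

def t_score(f_xy, f_x, f_y, n):
    """t-test score: (observed - expected) / sqrt(observed)."""
    expected = f_x * f_y / n
    return (f_xy - expected) / math.sqrt(f_xy)

def piatetsky_shapiro(f_xy, f_x, f_y, n):
    """Piatetsky-Shapiro: P(xy) - P(x) * P(y)."""
    return f_xy / n - (f_x / n) * (f_y / n)

def rank_by(measure, counts, n):
    """Candidates sorted from strongest to weakest association."""
    return sorted(counts, key=lambda pair: measure(*counts[pair], n),
                  reverse=True)

def mean_rank_combination(rankings):
    """Assumed combination scheme: order candidates by their mean rank."""
    pairs = rankings[0]
    mean_rank = {p: sum(r.index(p) for r in rankings) / len(rankings)
                 for p in pairs}
    return sorted(pairs, key=mean_rank.get)

# Invented toy counts: (f_xy, f_x, f_y) for each N-N candidate.
N = 10_000
counts = {
    ("kereta", "api"): (30, 50, 40),
    ("rumah", "sakit"): (20, 200, 60),
    ("rumah", "kereta"): (2, 200, 50),
}

ranked_t = rank_by(t_score, counts, N)
ranked_ps = rank_by(piatetsky_shapiro, counts, N)
combined = mean_rank_combination([ranked_t, ranked_ps])
print(combined)
```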
Examining the Tip of the Iceberg: A Data Set for Idiom Translation
Neural Machine Translation (NMT) has been widely used in recent years with
significant improvements for many language pairs. Although state-of-the-art NMT
systems are generating progressively better translations, idiom translation
remains one of the open challenges in this field. Idioms, a category of
multiword expressions, are an interesting language phenomenon where the overall
meaning of the expression cannot be composed from the meanings of its parts. A
first important challenge is the lack of dedicated data sets for learning and
evaluating idiom translation. In this paper we address this problem by creating
the first large-scale data set for idiom translation. Our data set is
automatically extracted from a widely used German-English translation corpus
and includes, for each language direction, a targeted evaluation set where all
sentences contain idioms and a regular training corpus where sentences
including idioms are marked. We release this data set and use it to perform
preliminary NMT experiments as the first step towards better idiom translation.
Comment: Accepted at LREC 201
Automatic extraction of Arabic multiword expressions
In this paper we investigate the automatic acquisition of Arabic Multiword Expressions (MWEs). We propose three complementary approaches to extract MWEs from available data resources. The first approach relies on the correspondence asymmetries between Arabic Wikipedia titles and titles in 21 different languages. The second approach collects English MWEs from Princeton WordNet 3.0, translates the collection into Arabic using Google Translate, and utilizes different search engines to validate the output. The third uses lexical association measures to extract MWEs from a large unannotated corpus. We experimentally explore the feasibility of each approach and measure the quality and coverage of the output against gold standards.
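One possible reading of the first approach, the title "correspondence asymmetries", is that a multiword Arabic title whose counterparts in other languages are mostly single words is a likely MWE. The sketch below illustrates that heuristic only. The function name, the 0.5 threshold and the whitespace tokenisation are assumptions, and the paper's actual criteria may differ.

```python
def is_mwe_candidate(arabic_title, translations, min_single_ratio=0.5):
    """Flag a multiword Arabic title whose translations are mostly one word.

    `translations` maps a language code to the corresponding title.
    The threshold is a hypothetical parameter, not taken from the paper.
    """
    if len(arabic_title.split()) < 2 or not translations:
        return False
    single = sum(1 for title in translations.values()
                 if len(title.split()) == 1)
    return single / len(translations) >= min_single_ratio

# Illustrative titles (three languages here; the abstract mentions 21).
sociology = {"en": "Sociology", "fr": "Sociologie", "de": "Soziologie"}
history = {"en": "History of Egypt", "fr": "Histoire de l'Égypte"}

print(is_mwe_candidate("علم الاجتماع", sociology))
print(is_mwe_candidate("تاريخ مصر", history))
```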
Normalized Google Distance for Collocation Extraction from Islamic Domain
This study investigates the properties of Arabic collocations and classifies them according to their structural patterns in the Islamic domain. Based on linguistic information, the patterns and variations of the collocations are identified. A system that extracts collocations from the Islamic domain based on statistical measures is then described. In candidate ranking, the normalized Google distance is adapted to measure the association between the words in the candidate set. Finally, an n-best evaluation, which selects the n-best list for each association measure, is used so that all candidates in these lists can be annotated manually. Four association measures (log-likelihood ratio, t-score, mutual information, and enhanced mutual information) are used in the candidate ranking step for comparison with the normalized Google distance in Arabic collocation extraction. In the experiment, the normalized Google distance achieved the highest precision, 93%, compared with the other association measures. This strengthens our motivation to use the normalized Google distance to measure the relatedness between the constituent words of collocations instead of the frequency-based association measures used in state-of-the-art methods. Keywords: normalized Google distance, collocation extraction, Islamic domain
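For reference, the normalized Google distance in its standard published form can be computed from occurrence counts as sketched below. The abstract says the measure was "adapted" without giving details, so treating the counts as corpus or search-hit frequencies and ranking candidates by ascending distance are assumptions. The example counts are invented.

```python
import math

def ngd(f_x, f_y, f_xy, n):
    """Normalized Google distance from counts.

    f_x, f_y: occurrences of each word; f_xy: co-occurrences;
    n: total number of indexed pages or corpus units.
    Smaller values mean the two words are more closely related.
    """
    if f_xy == 0:
        return math.inf  # the words never co-occur
    log_x, log_y, log_xy = math.log(f_x), math.log(f_y), math.log(f_xy)
    return (max(log_x, log_y) - log_xy) / (math.log(n) - min(log_x, log_y))

# Invented counts: (f_x, f_y, f_xy) for three hypothetical candidates.
N = 10_000
candidates = {
    "candidate_a": (100, 100, 100),
    "candidate_b": (1000, 100, 100),
    "candidate_c": (50, 40, 0),
}

# Rank candidates from most to least associated (ascending distance).
ranked = sorted(candidates, key=lambda c: ngd(*candidates[c], N))
print(ranked)
```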
A Computational Lexicon and Representational Model for Arabic Multiword Expressions
The phenomenon of multiword expressions (MWEs) is increasingly recognised as a serious and challenging issue that has attracted the attention of researchers in various language-related disciplines. Research in these many areas has emphasised the primary role of MWEs in the process of analysing and understanding language, particularly in the computational treatment of natural languages. Ignoring MWE knowledge in any NLP system reduces the possibility of achieving high precision outputs. However, despite the enormous wealth of MWE research and language resources available for English and some other languages, research on Arabic MWEs (AMWEs) still faces multiple challenges, particularly in key computational tasks such as extraction, identification, evaluation, language resource building, and lexical representations.
This research aims to remedy this deficiency by extending knowledge of AMWEs and making noteworthy contributions to the existing literature in three related research areas on the way towards building a computational lexicon of AMWEs. First, this study develops a general understanding of AMWEs by establishing a detailed conceptual framework that includes a description of an adopted AMWE concept and its distinctive properties at multiple linguistic levels. Second, in the use of AMWE extraction and discovery tasks, the study employs a hybrid approach that combines knowledge-based and data-driven computational methods for discovering multiple types of AMWEs. Third, this thesis presents a representative system for AMWEs which consists of multilayer encoding of extensive linguistic descriptions.
This project also paves the way for further in-depth AMWE-aware studies in NLP and linguistics to gain new insights into this complicated phenomenon in standard Arabic. The implications of this research relate to the vital role of the AMWE lexicon, as a new lexical resource, in improving various Arabic NLP (ANLP) tasks, and to the opportunities this lexicon provides for linguists to analyse and explore AMWE phenomena.
- …