276 research outputs found
External Lexical Information for Multilingual Part-of-Speech Tagging
Morphosyntactic lexicons and word vector representations have both proven
useful for improving the accuracy of statistical part-of-speech taggers. Here
we compare the performances of four systems on datasets covering 16 languages,
two of these systems being feature-based (MEMMs and CRFs) and two of them being
neural-based (bi-LSTMs). We show that, on average, all four approaches perform
similarly and reach state-of-the-art results. Yet better performances are
obtained with our feature-based models on lexically richer datasets (e.g. for
morphologically rich languages), whereas neural-based results are higher on
datasets with less lexical variability (e.g. for English). These conclusions
hold in particular for the MEMM models relying on our system MElt, which
benefited from newly designed features. This shows that, under certain
conditions, feature-based approaches enriched with morphosyntactic lexicons are
competitive with respect to neural methods
A Hybrid Algorithm for Recognizing the Position of Ezafe Constructions in Persian Texts
In the Persian language, an Ezafe construction is a linking element which joins the head of a phrase to its modifiers. The Ezafe in its simplest form is pronounced as –e, but generally not indicated in writing. Determining the position of an Ezafe is advantageous for disambiguating the boundary of the syntactic phrases which is a fundamental task in most natural language processing applications. This paper introduces a framework for combining genetic algorithms with rule-based models that brings the advantages of both approaches and overcomes their problems. This framework was used for recognizing the position of Ezafe constructions in Persian written texts. At the first stage, the rule-based model was applied to tag some tokens of an input sentence. Then, in the second stage, the search capabilities of the genetic algorithm were used to assign the Ezafe tag to untagged tokens using the previously captured training information. The proposed framework was evaluated on Peykareh corpus and it achieved 95.26 percent accuracy. Test results show that this proposed approach outperformed other approaches for recognizing the position of Ezafe constructions
Improving accuracy of Part-of-Speech (POS) tagging using hidden markov model and morphological analysis for Myanmar Language
In Natural Language Processing (NLP), Word segmentation and Part-of-Speech (POS) tagging are fundamental tasks. The POS information is also necessary in NLP’s preprocessing work applications such as machine translation (MT), information retrieval (IR), etc. Currently, there are many research efforts in word segmentation and POS tagging developed separately with different methods to get high performance and accuracy. For Myanmar Language, there are also separate word segmentors and POS taggers based on statistical approaches such as Neural Network (NN) and Hidden Markov Models (HMMs). But, as the Myanmar language's complex morphological structure, the OOV problem still exists. To keep away from error and improve segmentation by utilizing POS data, segmentation and labeling should be possible at the same time.The main goal of developing POS tagger for any Language is to improve accuracy of tagging and remove ambiguity in sentences due to language structure. This paper focuses on developing word segmentation and Part-of- Speech (POS) Tagger for Myanmar Language. This paper presented the comparison of separate word segmentation and POS tagging with joint word segmentation and POS tagging
Mimicking Word Embeddings using Subword RNNs
Word embeddings improve generalization over lexical features by placing each
word in a lower-dimensional space, using distributional information obtained
from unlabeled data. However, the effectiveness of word embeddings for
downstream NLP tasks is limited by out-of-vocabulary (OOV) words, for which
embeddings do not exist. In this paper, we present MIMICK, an approach to
generating OOV word embeddings compositionally, by learning a function from
spellings to distributional embeddings. Unlike prior work, MIMICK does not
require re-training on the original word embedding corpus; instead, learning is
performed at the type level. Intrinsic and extrinsic evaluations demonstrate
the power of this simple approach. On 23 languages, MIMICK improves performance
over a word-based baseline for tagging part-of-speech and morphosyntactic
attributes. It is competitive with (and complementary to) a supervised
character-based model in low-resource settings.Comment: EMNLP 201
Part-Of-Speech Tagging Of Urdu in Limited Resources Scenario
We address the problem of Part-of-Speech (POS) tagging of Urdu. POS tagging is the process of assigning a part-of-speech or lexical class marker to each word in the given text. Tagging for natural languages is similar to tokenization and lexical analysis for computer languages, except that we encounter ambiguities which are to be resolved. It plays a fundamental role in various Natural Language Processing (NLP) applications such as word sense disambiguation, parsing, name entity recognition and chunking. POS tagging, particularly plays very important role in processing free-word-order languages because such languages have relatively complex morphological structure. Urdu is a morphologically rich language. Forms of the verb, as well as case, gender, and number are expressed by the morphology. It shares its morphology, phonology and grammatical structures with Hindi. It shares its vocabulary with Arabic, Persian, Sanskrit, Turkish and Pashto languages. Urdu is written using the Perso-Arabic script. POS tagging of Urdu is a necessary component for most NLP applications of Urdu. Development of an Urdu POS tagger will influence several pipelined modules of natural language understanding system, including machine translation; partial parsing and word sense disambiguation. Our objective is to develop a robust POS tagger for Urdu. We have worked on the automatic annotation of part-of-speech for Urdu. We have defined a tag-set for Urdu. We manually annotated a corpus of 10,000 sentences. We have used different machine learning methods, namely Hidden Markov Model (HMM), Maximum Entropy Model (ME) and Conditional Random Field (CRF). Further, to deal with a small-annotated corpus, we explored the use of semi-supervised learning by using an additional un-annotated corpus. We also explored the use of a dictionary to provide to us all possible POS labeling for a given word. Since Urdu is morphologically productive. Hence we augmented Hidden Markov Model, Maximum Entropy Model and Conditional Random Field with morphological features, word suffixes and POS categories of words to develop robust POS tagger for Urdu in the limited resources scenario
Coupling an annotated corpus and a lexicon for state-of-the-art POS tagging
International audienceThis paper investigates how to best couple hand-annotated data with information extracted from an external lexical resource to improve POS tagging performance. Focusing on French tagging, we introduce a maximum entropy conditional sequence tagging system that is enriched with information extracted from a morphological resource. This system gives a 97.7% accuracy on the French Treebank, an error reduction of 23% (28% on unknown words) over the same tagger without lexical information. We also conduct experiments on datasets and lexicons of varying sizes in order to assess the best trade-off between annotating data vs. developing a lexicon. We find that the use of a lexicon improves the quality of the tagger at any stage of development of either resource, and that for fixed performance levels the availability of the full lexicon consistently reduces the need for supervised data by at least one half
Conference Program
Proceedings of the 18th Nordic Conference of Computational Linguistics
NODALIDA 2011.
Editors: Bolette Sandford Pedersen, Gunta Nešpore and Inguna Skadiņa.
NEALT Proceedings Series, Vol. 11 (2011), xii-xvii.
© 2011 The editors and contributors.
Published by
Northern European Association for Language
Technology (NEALT)
http://omilia.uio.no/nealt .
Electronically published at
Tartu University Library (Estonia)
http://hdl.handle.net/10062/16955
UNSUPERVISED PART OF SPEECH TAGGING FOR PERSIAN
Abstract In this paper we present a rather novel unsupervised method for part of speech (below POS) disambiguation which has been applied to Persian. This method known as Iterative Improved Feedback (IIF) Model, which is a heuristic one, uses only a raw corpus of Persian as well as all possible tags for every word in that corpus as input. During the process of tagging, the algorithm passes through several iterations corresponding to n-gram levels of analysis to disambiguate each word based on a previously defined threshold. The total accuracy of the program applying in Persian texts has been calculated as 93 percent, which seems very encouraging for POS tagging in this language
Contents
Proceedings of the 18th Nordic Conference of Computational Linguistics
NODALIDA 2011.
Editors: Bolette Sandford Pedersen, Gunta Nešpore and Inguna Skadiņa.
NEALT Proceedings Series, Vol. 11 (2011), iii-vii.
© 2011 The editors and contributors.
Published by
Northern European Association for Language
Technology (NEALT)
http://omilia.uio.no/nealt .
Electronically published at
Tartu University Library (Estonia)
http://hdl.handle.net/10062/16955
- …