
    EXPLOITING TAGGED AND UNTAGGED CORPORA FOR WORD SENSE DISAMBIGUATION

    Ph.D. (Doctor of Philosophy)

    Distinguishing Word Senses in Untagged Text

    This paper describes an experimental comparison of three unsupervised learning algorithms that distinguish the sense of an ambiguous word in untagged text. The methods described in this paper, McQuitty's similarity analysis, Ward's minimum-variance method, and the EM algorithm, assign each instance of an ambiguous word to a known sense definition based solely on the values of automatically identifiable features in text. These methods and feature sets are found to be more successful in disambiguating nouns than adjectives or verbs. Overall, the most accurate of these procedures is McQuitty's similarity analysis in combination with a high-dimensional feature set. (Comment: 11 pages, LaTeX, uses aclap.st)
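    The clustering step behind McQuitty's similarity analysis (also known as WPGMA) can be sketched as follows. The feature extraction and the similarity scores are illustrative stand-ins, not the paper's actual feature sets; only the merge-and-average update is the McQuitty part:

```python
def mcquitty_cluster(sim, k):
    """Merge singleton clusters until k remain.  `sim` is a complete,
    symmetric similarity matrix given as {(i, j): score} over item indices,
    e.g. instances of an ambiguous word described by context features."""
    s = {frozenset(p): v for p, v in sim.items()}
    groups = {i: (i,) for pair in sim for i in pair}
    while len(groups) > k:
        # pick the most similar pair of current clusters
        a, b = max(
            ((a, b) for a in groups for b in groups if a < b),
            key=lambda p: s[frozenset(p)],
        )
        merged = max(groups) + 1
        groups[merged] = groups.pop(a) + groups.pop(b)
        for c in groups:
            if c != merged:
                # McQuitty/WPGMA update: similarity to the new cluster is
                # the unweighted mean of the similarities to its two parts
                s[frozenset((merged, c))] = (
                    s[frozenset((a, c))] + s[frozenset((b, c))]
                ) / 2
    return sorted(groups.values())
```

Each resulting cluster is then mapped to a known sense definition, as the abstract describes.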

    The interaction of knowledge sources in word sense disambiguation

    Word sense disambiguation (WSD) is a computational linguistics task likely to benefit from the tradition of combining different knowledge sources in artificial intelligence research. An important step in the exploration of this hypothesis is to determine which linguistic knowledge sources are most useful and whether their combination leads to improved results. We present a sense tagger which uses several knowledge sources. Tested accuracy exceeds 94% on our evaluation corpus. Our system attempts to disambiguate all content words in running text rather than limiting itself to a restricted vocabulary of words. It is argued that this approach is more likely to assist the creation of practical systems.
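    The abstract does not say how the knowledge sources are combined; one simple stand-in (not the paper's actual combiner) is an additive confidence vote over the senses each source proposes:

```python
from collections import defaultdict

def combine_sources(votes):
    """Pick the sense with the highest total confidence.

    `votes` is a list of (sense, confidence) pairs, one or more per
    knowledge source; sense labels here are purely illustrative."""
    score = defaultdict(float)
    for sense, confidence in votes:
        score[sense] += confidence
    return max(score, key=score.get)
```

In practice such combiners are often themselves learned, which is closer in spirit to the hypothesis the abstract explores.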

    Using target-language information to train part-of-speech taggers for machine translation

    Although corpus-based approaches to machine translation (MT) are growing in interest, they are not applicable when the translation involves less-resourced language pairs for which there are no parallel corpora available; in those cases, the rule-based approach is the only applicable solution. Most rule-based MT systems make use of part-of-speech (PoS) taggers to solve the PoS ambiguities in the source-language texts to translate; those MT systems require accurate PoS taggers to produce reliable translations in the target language (TL). The standard statistical approach to PoS ambiguity resolution (or tagging) uses hidden Markov models (HMM) trained in a supervised way from hand-tagged corpora, an expensive resource not always available, or in an unsupervised way through the Baum-Welch expectation-maximization algorithm; both methods use information only from the language being tagged. However, when tagging is considered as an intermediate task for the translation procedure, that is, when the PoS tagger is to be embedded as a module within an MT system, information from the TL can be (unsupervisedly) used in the training phase to increase the translation quality of the whole MT system. This paper presents a method to train HMM-based PoS taggers to be used in MT; the new method uses not only information from the source language (SL), as general-purpose methods do, but also information from the TL and from the remaining modules of the MT system in which the PoS tagger is to be embedded. 
We find that the translation quality of the MT system embedding a PoS tagger trained in an unsupervised manner through this new method is clearly better than that of the same MT system embedding a PoS tagger trained through the Baum-Welch algorithm, and comparable to that obtained by embedding a PoS tagger trained in a supervised way from hand-tagged corpora. (Work funded by the Spanish Ministry of Science and Technology through project TIC2003-08601-C02-01 and by the Spanish Ministry of Education and Science and the European Social Fund through research grant BES-2004-4711 and project TIN2006-15071-C03-01.)
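    Whichever way the HMM parameters are trained (supervised counts, Baum-Welch, or the TL-informed method the paper proposes), tagging itself is Viterbi decoding. A minimal sketch with toy parameters (not the paper's trained models):

```python
def viterbi(words, tags, start, trans, emit):
    """Most probable tag sequence under an HMM.

    start[t]: initial probability of tag t; trans[s][t]: transition
    probability s -> t; emit[t]: {word: probability} for tag t."""
    delta = {t: start[t] * emit[t].get(words[0], 0.0) for t in tags}
    back = []
    for w in words[1:]:
        prev, delta, ptr = delta, {}, {}
        for t in tags:
            # best predecessor tag for reaching t at this position
            best = max(tags, key=lambda s: prev[s] * trans[s][t])
            delta[t] = prev[best] * trans[best][t] * emit[t].get(w, 0.0)
            ptr[t] = best
        back.append(ptr)
    last = max(delta, key=delta.get)
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]
```

A real implementation would work in log space and smooth the emission probabilities for unknown words.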

    A Hybrid Environment for Syntax-Semantic Tagging

    The thesis describes the application of the relaxation labelling algorithm to NLP disambiguation. Language is modelled through context constraints inspired by Constraint Grammars. The constraints enable the use of a real value stating "compatibility". The technique is applied to POS tagging, Shallow Parsing and Word Sense Disambiguation. Experiments and results are reported. The proposed approach enables the use of multi-feature constraint models, the simultaneous resolution of several NL disambiguation tasks, and the collaboration of linguistic and statistical models. (Comment: PhD thesis. 120 pages)
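    The core iteration of relaxation labelling can be sketched as follows: each word keeps a weight per candidate tag, and constraints with real-valued compatibilities iteratively redistribute those weights. The bigram-only constraints and values below are invented for illustration; the thesis's constraint models are far richer:

```python
def relax(candidates, compat, iters=20):
    """candidates: list of candidate-tag lists, one per word position.
    compat[(left_tag, right_tag)]: compatibility score for adjacent tags."""
    w = [{t: 1.0 / len(c) for t in c} for c in candidates]
    for _ in range(iters):
        new = []
        for i, dist in enumerate(w):
            support = {}
            for t in dist:
                s = 0.0
                if i > 0:  # support from the left neighbour's current weights
                    s += sum(w[i - 1][p] * compat.get((p, t), 0.0) for p in w[i - 1])
                if i + 1 < len(w):  # and from the right neighbour
                    s += sum(w[i + 1][n] * compat.get((t, n), 0.0) for n in w[i + 1])
                support[t] = dist[t] * (1.0 + s)
            z = sum(support.values()) or 1.0
            new.append({t: v / z for t, v in support.items()})
        w = new
    return [max(d, key=d.get) for d in w]
```

The update rewards tags that are compatible with the (weighted) labels of their neighbours, so several disambiguation tasks can in principle share one weight-propagation process, as the abstract notes.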

    Translation-based Word Sense Disambiguation

    This thesis investigates the use of the translation-based Mirrors method (Dyvik, 2005, inter alia) for Word Sense Disambiguation (WSD) for Norwegian. Word Sense Disambiguation is the process of determining the relevant sense of an ambiguous word in context automatically. Automated WSD is relevant for Natural Language Processing systems such as machine translation (MT), information retrieval, information extraction and content analysis. The most successful WSD approaches to date are so-called supervised machine learning (ML) techniques, in which the system ‘learns’ the contextual characteristics of each sense from a training corpus that contains concrete examples of contexts in which a word sense typically occurs. This approach suffers from a knowledge acquisition problem since word senses are not overtly available in corpus text. First, we therefore need a sense inventory which is computationally tractable. Subjectively defined sense distinctions have been the norm in WSD research (especially the Princeton WordNet, Fellbaum, 1998). But WSD studies increasingly show that the WordNet senses are too fine-grained for efficient WSD, which has made WordNet less attractive for machine-learned WSD. Ide and Wilks (2006) recommend instead to approximate word senses by way of cross-lingual sense definitions. Second, we need a method for sense-tagging context examples with the relevant sense given the context. Preparing such sense-tagged training corpora manually is costly and time-consuming, in particular because statistical methods require large amounts of training examples, and automated methods are therefore desirable. This thesis introduces an experimental lexical knowledge source which derives word senses and relations between word senses on the basis of translational correspondences in a parallel corpus, resulting in a structured semantic network (Dyvik, 2009). 
The Mirrors method is applicable for any language pair for which a parallel corpus and word alignment are available. The appeal of the Mirrors method and its translational basis for lexical semantics is that it offers an objective, consistent, and hence testable criterion, as opposed to the traditional subjective judgements in lexicon classification (cf. the Princeton WordNet). But due to the lack of intersubjective “gold standards” for lexical semantics, it is not an easy task to evaluate the Mirrors method. The main research question of this thesis may thus be formulated as follows: are the translation-based senses and semantic relations in the Mirrors method linguistically motivated from a monolingual point of view? To this end, this thesis proposes to use the monolingual task of WSD as a practical framework to evaluate the usefulness of the Mirrors method as a lexical knowledge source. This is motivated by the idea that a well-defined end-user application may provide a stable framework within which the benefits and drawbacks of a resource or a system can be demonstrated (e.g. Ng & Lee, 1996; Stevenson & Wilks, 2001; Yarowsky & Florian, 2002; Specia et al., 2009). The innovative aspect of applying the Mirrors method for WSD is two-fold: first, the Mirrors method is used to obtain sense-tagged data automatically (using cross-lingual data), providing a SemCor-like corpus which allows us to exploit semantically analysed context features in a subsequent WSD classifier. Second, we will test whether training on semantically analysed context features, based on information from the Mirrors method, means that the system resolves other instances than a ‘traditional’ classifier trained on words. In the absence of existing data sets for WSD for Norwegian, an automatically sense-tagged parallel corpus and a manually verified lexical sample of fifteen target words were developed for Norwegian as part of this thesis.
The proposed automatic sense-tagging method is based on the Mirrors sense inventory and on the translational correspondents of each word occurrence. The sense-tagger provides a partially semantically analysed context (partially, because the translation-based sense-tagger can only sense-tag tokens that were successfully word-aligned). The sense-tagged English-Norwegian Parallel Corpus (the ENPC) is comparable in size to the existing SemCor. The sense-tagged material formed the basis for a series of controlled experiments, in which the knowledge source is varied but where we maintain the same experimental framework in terms of the classification algorithm, data sets, lexical sample and sense inventory. First, a WSD classifier is trained on the actually co-occurring context WORDS. This knowledge source functions as a point of reference to indicate how well a traditional word-based classifier could be expected to perform, given our specific data sample and using the Mirrors sense inventory. Second, two Mirrors-derived knowledge sources were tentatively implemented, both of which attempt to generalise from the actually occurring context words as a means of alleviating the sparse data problem in WSD. For instance, if the noun phone was found to co-occur with the ambiguous noun bill in the ‘invoice’ sense, and if the classifier can generalise from this to include words that are semantically close to phone, such as telephone, this means that the presence of only one of them during learning could make both of them ‘known’ to the classifier at classification time. In other words, it might be desirable to study not only word co-occurrences, as unanalysed and isolated units, but also how words enter into relations with other words (classes of words) in the structured network that constitutes the vocabulary of a language.
In ML terms, it might be interesting to build a WSD model which learns, not how a word sense correlates with isolated words, but rather how a word sense correlates with certain classes of semantically related words. Such a tool for generalisation is clearly desirable in the face of sparse data and in view of the fact that most content words have a relatively low frequency even in larger text corpora. The first of the two Mirrors-based knowledge sources rests on so-called SEMANTIC-FEATURES that are shared between word senses in the Mirrors network. Since SEMANTIC-FEATURES may include a very high number of related words, a second knowledge source, RELATED-WORDS, was also developed, which attempts to select a stricter class of near-related word senses in the wordnet-like Mirrors network. The results indicated that the gain in abstracting from context words to classes of semantically related word senses was only marginal, in that the two Mirrors-based knowledge sources only knew marginally more of the context words at classification time compared to a traditional word-based classifier. Regarding classification accuracy, the Mirrors-based SEMANTIC-FEATURES seemed to suffer from including too broad semantic information and performed significantly worse than the other two knowledge sources. The Mirrors-based RELATED-WORDS, on the other hand, was as good as, and sometimes better than, the traditional word model, but the differences were not found to be statistically significant. Although unfortunate for the purpose of enriching a traditional WSD model with Mirrors-derived information, the lack of a difference between the traditional word model and RELATED-WORDS nevertheless provides promising indications with regard to the plausibility of the Mirrors method.
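    The generalisation idea can be illustrated with the abstract's own phone/telephone/bill example: context words are expanded with semantically related words (a stand-in for the Mirrors RELATED-WORDS knowledge source), so a word seen only via a relative at training time is still known at classification time. The tiny related-words map and sense profiles below are invented for illustration:

```python
def expand(context, related):
    """Add each context word's semantically related words to the context set."""
    out = set(context)
    for word in context:
        out |= related.get(word, set())
    return out

def classify(context, sense_profiles, related):
    """Pick the sense whose training-time context profile best overlaps
    the expanded context (a deliberately simplistic overlap classifier)."""
    ctx = expand(context, related)
    return max(sense_profiles, key=lambda s: len(ctx & sense_profiles[s]))
```

With `related = {"phone": {"telephone"}}`, a classifier that only ever saw `telephone` near the ‘invoice’ sense of `bill` during training can still resolve a test context containing only `phone`.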

    Part-Of-Speech Tagging Of Urdu in Limited Resources Scenario

    We address the problem of Part-of-Speech (POS) tagging of Urdu. POS tagging is the process of assigning a part-of-speech or lexical class marker to each word in the given text. Tagging for natural languages is similar to tokenization and lexical analysis for computer languages, except that we encounter ambiguities which are to be resolved. It plays a fundamental role in various Natural Language Processing (NLP) applications such as word sense disambiguation, parsing, named entity recognition and chunking. POS tagging plays a particularly important role in processing free-word-order languages because such languages have relatively complex morphological structure. Urdu is a morphologically rich language. Forms of the verb, as well as case, gender, and number are expressed by the morphology. It shares its morphology, phonology and grammatical structures with Hindi. It shares its vocabulary with Arabic, Persian, Sanskrit, Turkish and Pashto languages. Urdu is written using the Perso-Arabic script. POS tagging of Urdu is a necessary component for most NLP applications of Urdu. Development of an Urdu POS tagger will influence several pipelined modules of natural language understanding systems, including machine translation, partial parsing and word sense disambiguation. Our objective is to develop a robust POS tagger for Urdu. We have worked on the automatic annotation of part-of-speech for Urdu. We have defined a tag-set for Urdu. We manually annotated a corpus of 10,000 sentences. We have used different machine learning methods, namely Hidden Markov Model (HMM), Maximum Entropy Model (ME) and Conditional Random Field (CRF). Further, to deal with a small annotated corpus, we explored the use of semi-supervised learning by using an additional un-annotated corpus. We also explored the use of a dictionary to provide all possible POS labels for a given word. Since Urdu is morphologically productive, we augmented the Hidden Markov Model, Maximum Entropy Model and Conditional Random Field with morphological features, word suffixes and POS categories of words to develop a robust POS tagger for Urdu in the limited-resources scenario.
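    A hedged sketch of the kind of morphological features the abstract mentions (word suffixes plus simple shape cues) as they might be fed to an ME or CRF tagger; the feature names are illustrative, not the thesis's actual feature set:

```python
def word_features(word, max_suffix=3):
    """Feature dict for one token: the word itself, a digit-shape cue,
    and its suffixes up to `max_suffix` characters long."""
    feats = {"word": word.lower(), "is_digit": word.isdigit()}
    for k in range(1, max_suffix + 1):
        if len(word) > k:
            feats[f"suffix{k}"] = word[-k:].lower()
    return feats
```

Suffix features help precisely because a morphologically productive language yields many inflected forms never seen in a small 10,000-sentence training corpus.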

    UNSUPERVISED PART OF SPEECH TAGGING FOR PERSIAN

    In this paper we present a rather novel unsupervised method for part-of-speech (POS) disambiguation which has been applied to Persian. This method, known as the Iterative Improved Feedback (IIF) model, is a heuristic that uses only a raw corpus of Persian, together with all possible tags for every word in that corpus, as input. During the process of tagging, the algorithm passes through several iterations corresponding to n-gram levels of analysis to disambiguate each word based on a previously defined threshold. The total accuracy of the program applied to Persian texts is 93 percent, which seems very encouraging for POS tagging in this language.
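    The iterate-until-confident idea can be loosely reconstructed as follows; this is a simplified unigram-only sketch under assumed mechanics, not the actual IIF model, which works over several n-gram levels:

```python
from collections import Counter

def iif_tag(tokens, lexicon, threshold=0.8, iters=5):
    """tokens: word forms; lexicon: word -> set of possible tags.

    Unambiguous words seed the evidence; an ambiguous word is fixed once
    one of its candidate tags holds more than `threshold` of the evidence
    gathered so far, and the remaining words wait for later passes."""
    tags = [next(iter(lexicon[w])) if len(lexicon[w]) == 1 else None
            for w in tokens]
    for _ in range(iters):
        counts = Counter(t for t in tags if t is not None)
        for i, w in enumerate(tokens):
            if tags[i] is None:
                cand = {t: counts.get(t, 0) for t in lexicon[w]}
                total = sum(cand.values())
                best = max(cand, key=cand.get)
                if total and cand[best] / total > threshold:
                    tags[i] = best
    return tags
```

Each pass feeds newly fixed tags back into the evidence for the next pass, which is the "improved feedback" flavour of the approach; words whose evidence never clears the threshold remain undecided.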