12,081 research outputs found

    Word Sense Disambiguation on English Translation of Holy Quran

    This article proposes a word sense disambiguation system for the Quranic text translated into English. The system combines three traditional semantic similarity measures: Wu-Palmer (WUP), Lin (LIN), and Jiang-Conrath (JCN). Experiments were performed to obtain the best overall similarity score, and the empirical results demonstrate that the combination of the three measures achieves competitive results compared with using any individual measure.
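    The combination idea above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the sense labels and the per-measure scores standing in for WUP, LIN and JCN values are invented toy data.

```python
# Sketch: combine three similarity scores (hypothetical WUP, LIN, JCN values,
# assumed pre-normalised to [0, 1]) to pick the best sense of an ambiguous word.

def combined_score(scores):
    """Average the three similarity scores for one candidate sense."""
    return sum(scores) / len(scores)

def disambiguate(candidates):
    """candidates maps sense -> (wup, lin, jcn) scores against the context."""
    return max(candidates, key=lambda sense: combined_score(candidates[sense]))

# Toy example: two senses of "bank" scored against a river-related context.
candidates = {
    "bank.n.01 (financial)": (0.30, 0.25, 0.20),
    "bank.n.09 (riverside)": (0.85, 0.70, 0.60),
}
best = disambiguate(candidates)
```

    In practice each of the three scores would come from a WordNet-based similarity function rather than a hand-written table.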

    Similarity-Based Models of Word Cooccurrence Probabilities

    In many applications of natural language processing (NLP) it is necessary to determine the likelihood of a given word combination. For example, a speech recognizer may need to determine which of the two word combinations "eat a peach" and "eat a beach" is more likely. Statistical NLP methods determine the likelihood of a word combination from its frequency in a training corpus. However, the nature of language is such that many word combinations are infrequent and do not occur in any given corpus. In this work we propose a method for estimating the probability of such previously unseen word combinations using available information on "most similar" words. We describe probabilistic word association models based on distributional word similarity, and apply them to two tasks, language modeling and pseudo-word disambiguation. In the language modeling task, a similarity-based model is used to improve probability estimates for unseen bigrams in a back-off language model. The similarity-based method yields a 20% perplexity improvement in the prediction of unseen bigrams and statistically significant reductions in speech-recognition error. We also compare four similarity-based estimation methods against back-off and maximum-likelihood estimation methods on a pseudo-word sense disambiguation task in which we controlled for both unigram and bigram frequency, to avoid giving too much weight to easy-to-disambiguate high-frequency configurations. The similarity-based methods perform up to 40% better on this particular task. Comment: 26 pages, 5 figures
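    The core estimate can be sketched as a similarity-weighted average: for an unseen bigram (w1, w2), average the maximum-likelihood estimates P(w2 | w1') over words w1' most similar to w1. The counts, vocabulary, and similarity weights below are toy values for illustration only, not the paper's data or exact weighting scheme.

```python
# Sketch of similarity-based smoothing for an unseen bigram (w1, w2):
# estimate P(w2 | w1) as a similarity-weighted average of P(w2 | w1')
# over the nearest neighbours w1' of w1.

def mle(bigram_counts, unigram_counts, w1, w2):
    """Maximum-likelihood estimate P(w2 | w1) from corpus counts."""
    if unigram_counts.get(w1, 0) == 0:
        return 0.0
    return bigram_counts.get((w1, w2), 0) / unigram_counts[w1]

def similarity_estimate(bigram_counts, unigram_counts, sims, w1, w2):
    """sims maps w1 -> {similar_word: weight}; weights need not be normalised."""
    neighbours = sims[w1]
    total = sum(neighbours.values())
    return sum(weight * mle(bigram_counts, unigram_counts, v, w2)
               for v, weight in neighbours.items()) / total

# Toy corpus: "rare_word pie" never occurs, so we borrow from similar words.
unigram_counts = {"peach": 10, "apple": 20}
bigram_counts = {("peach", "pie"): 2, ("apple", "pie"): 10}
sims = {"rare_word": {"peach": 0.9, "apple": 0.1}}  # hypothetical neighbours
p = similarity_estimate(bigram_counts, unigram_counts, sims, "rare_word", "pie")
```

    Here the unseen bigram receives a nonzero probability (0.9 x 0.2 + 0.1 x 0.5 = 0.23) even though it never occurred in training, which is the point of the method.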

    Sentence similarity-based source context modelling in PBSMT

    Target phrase selection, a crucial component of the state-of-the-art phrase-based statistical machine translation (PBSMT) model, plays a key role in generating accurate translation hypotheses. Inspired by context-rich word-sense disambiguation techniques, machine translation (MT) researchers have successfully integrated various types of source language context into the PBSMT model to improve target phrase selection. Among the various types of lexical and syntactic features, lexical syntactic descriptions in the form of supertags, which preserve long-range word-to-word dependencies in a sentence, have proven to be effective. These rich contextual features are able to disambiguate a source phrase on the basis of the local syntactic behaviour of that phrase. In addition to local contextual information, global contextual information such as the grammatical structure of a sentence, sentence length and n-gram word sequences could provide additional important information to enhance this phrase-sense disambiguation. In this work, we explore various sentence similarity features by measuring similarity between a source sentence to be translated and the source side of the bilingual training sentences, and integrate them directly into the PBSMT model. We performed experiments on an English-to-Chinese translation task, applying sentence-similarity features both individually and in combination with supertag-based features. We evaluate the performance of our approach and report a statistically significant relative improvement of 5.25% BLEU score when adding a sentence-similarity feature together with a supertag-based feature.
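    A simple version of such a sentence-similarity feature can be sketched as bag-of-words cosine similarity between the input sentence and each training source sentence. The paper's actual features are richer; the sentences and the scoring below are an illustrative toy only.

```python
# Sketch: score an input sentence against the source side of the training
# corpus with bag-of-words cosine similarity, and return the closest match.
from collections import Counter
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two sentences as word-count vectors."""
    ca, cb = Counter(a.split()), Counter(b.split())
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = (sqrt(sum(v * v for v in ca.values()))
            * sqrt(sum(v * v for v in cb.values())))
    return dot / norm if norm else 0.0

def best_match(sentence, training_sources):
    """Most similar training source sentence to the input sentence."""
    return max(training_sources, key=lambda s: cosine(sentence, s))

train = ["the bank approved the loan", "we walked along the river bank"]
match = best_match("the bank rejected the loan", train)
```

    The similarity score of the best match (or the matched sentence's translation statistics) could then be exposed to the decoder as an additional feature.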

    Sentence Semantic Similarity based Complex Network approach for Word Sense Disambiguation

    Word Sense Disambiguation is a branch of Natural Language Processing (NLP) that deals with multi-sense words. Multi-sense words are referred to as polysemous words, and they introduce lexical ambiguity. Existing sense disambiguation modules work effectively for single sentences when context information is available, and word embeddings play a vital role in the disambiguation process; a context-dependent word embedding model is used for disambiguation. The main goal of this research paper is to disambiguate polysemous words by considering the available context information. The main identified challenge of disambiguation is the ambiguous word without context information. The discussed complex network approach disambiguates ambiguous sentences by considering semantic similarities: a sentence semantic similarity-based network is constructed for disambiguating ambiguous sentences. The proposed methodology is trained with the SemCor, Adaptive-Lex, and OMSTI standard lexical resources. The findings state that the discussed methodology works well for disambiguating large documents, where the sense of an ambiguous sentence depends on the adjacent sentences.
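    The network-construction step described above can be sketched as follows: link sentences whose pairwise semantic similarity exceeds a threshold, so that an ambiguous sentence can borrow context from its neighbours. The similarity values and threshold here are toy inputs, not the paper's trained scores.

```python
# Sketch: build a sentence-similarity network, linking sentence pairs whose
# similarity score is at or above a threshold.

def build_network(similarities, threshold=0.5):
    """similarities maps (i, j) sentence-index pairs to a score in [0, 1];
    returns an undirected adjacency map index -> set of neighbour indices."""
    edges = {}
    for (i, j), score in similarities.items():
        if score >= threshold:
            edges.setdefault(i, set()).add(j)
            edges.setdefault(j, set()).add(i)
    return edges

# Toy pairwise similarities among three sentences.
sims = {(0, 1): 0.8, (1, 2): 0.3, (0, 2): 0.6}
net = build_network(sims)
```

    In the full approach, the neighbours of a sentence in this network would supply the context used to disambiguate its polysemous words.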

    A Similarity Based Concordance Approach to Word Sense Disambiguation

    This study attempts to solve the problem of Word Sense Disambiguation using a combination of statistical, probabilistic and word-matching algorithms. These algorithms assume that words and sentences have hidden similarities, and that the polysemous words in any context should be assigned a sense after each execution of the algorithm. The algorithm was tested on sufficient sample data, and disambiguation performance proved to increase significantly after the inclusion of the concordance methodology.

    EBL-Hope: Multilingual Word Sense Disambiguation Using a Hybrid Knowledge-Based Technique

    We present a hybrid knowledge-based approach to multilingual word sense disambiguation using BabelNet. Our approach is based on a hybrid technique derived from a modified version of the Lesk algorithm and the Jiang & Conrath similarity measure. We present our system's runs for the word sense disambiguation subtask of the Multilingual Word Sense Disambiguation and Entity Linking task of SemEval 2015. Our system ranked 9th among the participating systems for English.
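    The Lesk half of such a hybrid can be sketched as gloss-overlap scoring: each candidate sense is scored by how many words its gloss shares with the context. The glosses below are abridged stand-ins, not BabelNet entries, and a real system would add the similarity-measure component on top.

```python
# Sketch of simplified Lesk: pick the sense whose gloss overlaps most
# with the context words of the ambiguous word.

def lesk_score(context, gloss):
    """Number of distinct words shared by the context and the gloss."""
    return len(set(context.lower().split()) & set(gloss.lower().split()))

def simplified_lesk(context, glosses):
    """glosses maps sense -> gloss text; returns the best-overlapping sense."""
    return max(glosses, key=lambda s: lesk_score(context, glosses[s]))

glosses = {
    "bass.n.01": "the lowest part of the musical range",
    "bass.n.07": "a freshwater fish of the perch family",
}
sense = simplified_lesk("he caught a fish in the freshwater lake", glosses)
```

    A hybrid system would typically interpolate this overlap score with a taxonomic similarity score such as Jiang & Conrath before choosing the sense.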

    Ensemble similarity measures for clustering terms

    Clustering semantically related terms is crucial for many applications, such as document categorization and word sense disambiguation. However, automatically identifying semantically similar terms is challenging. We present a novel approach for automatically determining the degree of relatedness between terms to facilitate their subsequent clustering. Using the analogy of ensemble classifiers in Machine Learning, we combine multiple techniques, such as contextual similarity and semantic relatedness, to boost the accuracy of our computations. A new method, based on Yarowsky's [9] word sense disambiguation approach, for generating high-quality topic signatures for contextual similarity computations is presented. A technique to measure semantic relatedness between multi-word terms, based on the work of Hirst and St. Onge [2], is also proposed. Experimental evaluation reveals that our method outperforms similar related works. We also investigate the effects of assigning different importance levels to the different similarity measures based on the corpus characteristics.
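    The ensemble idea, with corpus-dependent importance levels, can be sketched as a weighted combination of the individual measures. The measure names, component scores, and weights below are illustrative assumptions, not values from the paper.

```python
# Sketch: an ensemble relatedness score as a weighted combination of
# individual similarity measures (e.g. contextual vs. taxonomic),
# with weights chosen to suit the corpus.

def ensemble_relatedness(scores, weights):
    """scores and weights map measure name -> value; weights sum to 1."""
    return sum(weights[m] * scores[m] for m in scores)

# Toy scores for one term pair under two measures, with corpus-tuned weights.
scores = {"contextual": 0.6, "semantic": 0.9}
weights = {"contextual": 0.3, "semantic": 0.7}
r = ensemble_relatedness(scores, weights)
```

    Term pairs scored this way can then be fed to any standard clustering algorithm as a similarity (or distance) matrix.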

    Evaluating Feature Extraction Methods for Biomedical Word Sense Disambiguation

    Evaluating Feature Extraction Methods for Biomedical WSD. Clint Cuffy, Sam Henry and Bridget McInnes, PhD, Virginia Commonwealth University, Richmond, Virginia, USA.

    Introduction. Biomedical text processing is a highly active research area, but ambiguity is still a barrier to the processing and understanding of these documents. Many word sense disambiguation (WSD) approaches represent instances of an ambiguous word as a distributional context vector. One problem with using these vectors is noise: information that is overly general and does not contribute to the word's representation. Feature extraction approaches attempt to compensate for sparsity and reduce noise by transforming the data from a high-dimensional space to a space of fewer dimensions. Word embeddings [1] have become an increasingly popular method to reduce the dimensionality of vector representations. In this work, we evaluate word embeddings in a knowledge-based word sense disambiguation method.

    Methods. Context requiring disambiguation consists of an instance of an ambiguous word and multiple denotative senses. In our method, each word is replaced with its respective word embedding, and the embeddings are either summed or averaged to form a single instance vector representation. The same is done for each sense of an ambiguous word, using the sense's definition obtained from the Unified Medical Language System (UMLS). We calculate the cosine similarity between each sense vector and the instance vector, and assign the instance the sense with the highest value.

    Evaluation. We evaluate our method on three biomedical WSD datasets: NLM-WSD, MSH-WSD and Abbrev. The word embeddings were trained on the titles and abstracts from the 2016 Medline baseline. We compare two word embedding models, Skip-gram and Continuous Bag of Words (CBOW), vary the word vector lengths from one hundred to one thousand, and compare differences in accuracy.

    Results. The overall outcome of this method demonstrates fairly high accuracy at disambiguating biomedical instance context from groups of denotative senses. The results showed that the Skip-gram model obtained higher disambiguation accuracy than CBOW, but the increase was not significant for all of the datasets. Similarly, vector representations of differing lengths displayed minimal change in results, often differing by mere tenths of a percentage point. We also compared our results to current state-of-the-art knowledge-based WSD systems, including those that have used word embeddings, showing comparable or higher disambiguation accuracy.

    Conclusion. Although biomedical literature can be ambiguous, our knowledge-based feature extraction method using word embeddings demonstrates high accuracy in disambiguating biomedical text while eliminating variations of associated noise. In the future, we plan to explore additional dimensionality reduction methods and training data.

    [1] T. Mikolov, I. Sutskever, K. Chen, G. Corrado and J. Dean, Distributed representations of words and phrases and their compositionality, Advances in Neural Information Processing Systems, pp. 3111-3119, 2013.
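    The averaging-and-cosine step of the described method can be sketched as follows. Tiny two-dimensional vectors stand in for trained embeddings, and the sense labels are invented for illustration; a real run would use Medline-trained Skip-gram or CBOW vectors and UMLS definitions.

```python
# Sketch: average the word vectors of the instance context and of each sense
# definition, then assign the sense whose averaged vector is most
# cosine-similar to the instance vector.
from math import sqrt

def average(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na, nb = sqrt(sum(x * x for x in a)), sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def assign_sense(instance_vectors, sense_vectors):
    """sense_vectors maps sense -> list of definition word vectors."""
    inst = average(instance_vectors)
    return max(sense_vectors,
               key=lambda s: cosine(inst, average(sense_vectors[s])))

instance = [[0.9, 0.1], [0.8, 0.2]]  # toy context word embeddings
senses = {"cold.sense1 (temperature)": [[1.0, 0.0]],
          "cold.sense2 (illness)": [[0.0, 1.0]]}
sense = assign_sense(instance, senses)
```

    Summing instead of averaging changes only the vector magnitude, which cosine similarity ignores, so the two variants described in the Methods section coincide here.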