28 research outputs found

    An Urdu semantic tagger - lexicons, corpora, methods and tools

    Extracting and analysing meaning-related information from natural language data has attracted the attention of researchers in various fields, such as Natural Language Processing (NLP), corpus linguistics, and data science. An important aspect of such automatic information extraction and analysis is the semantic annotation of language data using a semantic annotation tool (also known as a semantic tagger). Different semantic annotation tools have been designed to carry out various levels of semantic annotation, for instance, sentiment analysis, word sense disambiguation, content analysis, and semantic role labelling. These tools identify or tag only part of the core semantic information in language data, and they tend to be applicable only to English and other European languages. A semantic annotation tool that can annotate the semantic senses of all lexical units (words) in Urdu text, based on the USAS (UCREL Semantic Analysis System) semantic taxonomy, is still needed in order to provide comprehensive semantic analysis of Urdu. This research reports on the development of such an Urdu semantic tagging tool and discusses the challenging issues faced in this Ph.D. research. Since standard NLP pipeline tools are not widely available for Urdu, a suite of new tools has been created alongside the Urdu semantic tagger: a sentence tokenizer, a word tokenizer and a part-of-speech (POS) tagger. Results for these tools are as follows: the word tokenizer achieves an F1 of 94.01% and an accuracy of 97.21%, the sentence tokenizer an F1 of 92.59% and an accuracy of 93.15%, and the POS tagger an accuracy of 95.14%. The Urdu semantic tagger incorporates semantic resources (a lexicon and corpora) as well as semantic field disambiguation methods. In terms of novelty, the NLP pre-processing tools are developed using rule-based, statistical, or hybrid techniques. Furthermore, all semantic lexicons have been developed using a novel combination of automatic and semi-automatic approaches: mapping, crowdsourcing, statistical machine translation, GIZA++, word embeddings, and named entity recognition. A large multi-target annotated corpus has also been constructed using a semi-automatic approach to test the accuracy of the Urdu semantic tagger; this corpus is also used to train and test supervised multi-target machine learning classifiers. The results show that the Random k-labEL Disjoint Pruned Sets and Classifier Chain multi-target classifiers outperform all other classifiers on the proposed corpus, with a Hamming Loss of 0.06 and an Accuracy of 0.94. Lexical coverage of 88.59%, 99.63%, 96.71% and 89.63% is obtained on several test corpora. The developed Urdu semantic tagger shows an encouraging precision of 79.47% on the proposed test corpus.
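    To give a flavour of the lexicon-plus-disambiguation design the abstract describes, the sketch below shows a minimal lexicon-lookup semantic tagger in the USAS style. The tiny English lexicon, the whitespace tokenizer, the example tag codes and the "first listed sense wins" rule are all simplifying assumptions made only for illustration; the actual Urdu tagger uses dedicated Urdu tokenizers, a POS tagger and statistical semantic field disambiguation methods.

```python
# Minimal, illustrative sketch of a lexicon-lookup semantic tagger in the
# USAS style. Lexicon, tokenizer and disambiguation rule are placeholders.

# Hypothetical semantic lexicon: surface form -> candidate USAS field tags,
# ordered by assumed frequency (most likely sense first).
SEM_LEXICON = {
    "school": ["P1"],           # Education
    "bank":   ["I1.1", "W3"],   # Money / Geographical terms
    "run":    ["M1", "A1.1.1"], # Movement / General actions
}

def tokenize(text):
    """Naive whitespace tokenizer (a stand-in for a real Urdu word tokenizer)."""
    return text.lower().split()

def tag(tokens, lexicon):
    """Assign every token its candidate tags plus one disambiguated tag."""
    tagged = []
    for tok in tokens:
        candidates = lexicon.get(tok, ["Z99"])  # Z99 = unmatched word, as in USAS
        tagged.append({"token": tok,
                       "candidates": candidates,
                       "chosen": candidates[0]})  # most-frequent-sense baseline
    return tagged

if __name__ == "__main__":
    for item in tag(tokenize("run to the bank"), SEM_LEXICON):
        print(item)
```

    In the actual system, the final "chosen" tag would come from the semantic field disambiguation methods mentioned in the abstract rather than from the first-sense baseline used here.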

    Reusing Stanford POS Tagger for Tagging Urdu Sentences


    Semantic Tagging for the Urdu Language: Annotated Corpus and Multi-Target Classification Methods

    Extracting and analysing meaning-related information from natural language data has attracted the attention of researchers in various fields, such as natural language processing, corpus linguistics, information retrieval, and data science. An important aspect of such automatic information extraction and analysis is the annotation of language data using semantic tagging tools. Different semantic tagging tools have been designed to carry out various levels of semantic analysis, for instance, named entity recognition and disambiguation, sentiment analysis, word sense disambiguation, content analysis, and semantic role labelling. Common to all of these tasks, in the supervised setting, is the requirement for a manually semantically annotated corpus, which acts as a knowledge base from which to train and test potential word and phrase-level sense annotations. Many benchmark corpora have been developed for various semantic tagging tasks, but most are for English and other European languages. There is a dearth of semantically annotated corpora for the Urdu language, which is widely spoken and used around the world. To fill this gap, this study presents a large benchmark corpus and methods for the semantic tagging task for the Urdu language. The proposed corpus contains 8,000 tokens in the following domains or genres: news, social media, Wikipedia, and historical text (each domain having 2,000 tokens). The corpus has been manually annotated with 21 major semantic fields and 232 sub-fields of the USAS (UCREL Semantic Analysis System) semantic taxonomy, which provides a comprehensive set of semantic fields for coarse-grained annotation. Each word in our proposed corpus has been annotated with at least one and up to nine semantic field tags to provide a detailed semantic analysis of the language data, which allowed us to treat semantic tagging as a supervised multi-target classification task. To demonstrate how our proposed corpus can be used for the development and evaluation of Urdu semantic tagging methods, we extracted local, topical and semantic features from the proposed corpus and applied seven different supervised multi-target classifiers to them. Results show an accuracy of 94% on our proposed corpus, which is freely and publicly available to download.
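    To make the multi-target formulation concrete, here is a minimal sketch of treating word-level semantic tagging as multi-label classification and scoring it with Hamming loss and subset accuracy. The toy context strings, the USAS-style tag sets and the one-vs-rest classifier are illustrative assumptions; the paper itself uses richer local, topical and semantic features and evaluates seven different multi-target classifiers.

```python
# Sketch: semantic tagging as supervised multi-target (multi-label) classification.
# Data and tags are invented; only the overall recipe mirrors the task above.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics import hamming_loss, accuracy_score

# Each instance: a target word plus a small context window, and its tag set.
contexts = [
    "bank loan money interest",
    "bank river water flood",
    "school teacher lesson class",
    "money bank account savings",
    "river bank fishing boat",
    "teacher school exam student",
]
tags = [
    {"I1.1"},          # Money
    {"W3", "M6"},      # Geography / Location
    {"P1"},            # Education
    {"I1.1"},
    {"W3", "M6"},
    {"P1"},
]

vec = CountVectorizer()
X = vec.fit_transform(contexts)
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(tags)

# One binary classifier per semantic field tag (a simple multi-label strategy;
# the paper also considers classifier chains and label-powerset style methods).
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, Y)
pred = clf.predict(X)

print("Hamming loss:", hamming_loss(Y, pred))
print("Subset accuracy:", accuracy_score(Y, pred))
```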

    An auxiliary Part-of-Speech tagger for blog and microblog cyber-slang

    The increasing impact of Web 2.0 involves a growing usage of slang, abbreviations, and emphasized words, which limits the performance of traditional natural language processing models. State-of-the-art Part-of-Speech (POS) taggers are often unable to assign a meaningful POS tag to every word in a Web 2.0 text. To address this limitation, we propose an auxiliary POS tagger that assigns a POS tag to a given token based on the information derived from a sequence of preceding and following POS tags. The main advantage of the proposed auxiliary POS tagger is its ability to overcome the need for token-level information, since it relies only on the sequences of existing POS tags. This tagger is called auxiliary because it requires an initial POS tagging procedure that might be performed using online dictionaries (e.g., Wikidictionary) or other POS tagging algorithms. The auxiliary POS tagger relies on a Bayesian network that uses information about preceding and following POS tags. It was evaluated on the Brown Corpus, which is a general linguistics corpus, on the modern ARK dataset composed of Twitter messages, and on a corpus of manually labeled Web 2.0 data.
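    The core idea, predicting a missing tag purely from the surrounding tags, can be sketched with a single conditional table P(tag | previous tag, following tag) estimated by counting. This is a deliberate simplification of the paper's Bayesian network, and the toy tag sequences and fallback tag below are invented for illustration.

```python
# Sketch of an auxiliary tagger: predict a token's POS tag from its neighbours'
# tags only, using a counted conditional probability table (a simplification of
# the Bayesian network used in the paper).
from collections import Counter, defaultdict

# Toy POS-tagged training sequences (only the tags matter for this model).
train_sequences = [
    ["DET", "NOUN", "VERB", "DET", "NOUN"],
    ["PRON", "VERB", "DET", "ADJ", "NOUN"],
    ["DET", "ADJ", "NOUN", "VERB", "ADV"],
]

# Count how often each tag occurs between a given (previous, following) pair.
table = defaultdict(Counter)
for seq in train_sequences:
    for i in range(1, len(seq) - 1):
        table[(seq[i - 1], seq[i + 1])][seq[i]] += 1

def predict_middle_tag(prev_tag, next_tag, fallback="NOUN"):
    """Most probable tag given its neighbours, with a fallback for unseen pairs."""
    counts = table.get((prev_tag, next_tag))
    return counts.most_common(1)[0][0] if counts else fallback

# E.g. a slang token whose own form is uninformative, seen between DET and NOUN:
print(predict_middle_tag("DET", "NOUN"))  # -> "ADJ" with these toy counts
```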

    A word sense disambiguation corpus for Urdu

    The aim of word sense disambiguation (WSD) is to correctly identify the meaning of a word in context. All natural languages exhibit word sense ambiguities and these are often hard to resolve automatically. Consequently, WSD is considered an important problem in natural language processing (NLP). Standard evaluation resources are needed to develop, evaluate and compare WSD methods. A range of initiatives has led to the development of benchmark WSD corpora for a wide range of languages from various language families. However, there is a lack of benchmark WSD corpora for South Asian languages including Urdu, despite there being over 300 million Urdu speakers and large amounts of Urdu digital text available online. To address that gap, this study describes a novel benchmark corpus for the Urdu Lexical Sample WSD task. This corpus contains 50 target words (30 nouns, 11 adjectives, and 9 verbs). A standard, manually crafted dictionary called Urdu Lughat is used as a sense inventory. Four baseline WSD approaches were applied to the corpus. The results show that the best performance was obtained using a simple Bag of Words approach. To encourage NLP research on the Urdu language, the corpus is freely available to the research community.
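    As an illustration of the Bag of Words baseline mentioned above, the sketch below classifies the sense of a single ambiguous target word from the words around it. The English contexts, the sense labels and the Naive Bayes classifier are placeholders chosen for illustration; they stand in for the Urdu data and the Urdu Lughat sense inventory.

```python
# Sketch of a Bag of Words lexical-sample WSD baseline: context words around
# one ambiguous target are vectorized and a standard classifier picks the sense.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Contexts for one target word ("bank") with manually assigned sense labels.
contexts = [
    "deposited money in the bank account",
    "the bank approved the loan quickly",
    "sat on the bank of the river",
    "the river bank was muddy after rain",
]
senses = ["finance", "finance", "riverside", "riverside"]

wsd = make_pipeline(CountVectorizer(), MultinomialNB())
wsd.fit(contexts, senses)

print(wsd.predict(["took a loan from the bank"]))             # -> ['finance']
print(wsd.predict(["walked along the bank of the stream"]))   # -> ['riverside']
```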

    A sense annotated corpus for All-Words Urdu Word Sense Disambiguation

    Word Sense Disambiguation (WSD) aims to automatically predict the correct sense of a word used in a given context. All human languages exhibit word sense ambiguity and resolving this ambiguity can be difficult. Standard benchmark resources are required to develop, compare and evaluate WSD techniques. These are available for many languages but not for Urdu, despite this being a language with more than 300 million speakers and large volumes of text available digitally. To fill this gap, this study proposes a novel benchmark corpus for the Urdu All-Words WSD task. The corpus contains 5,042 words of Urdu running text in which all ambiguous words (856 instances) are manually tagged with senses from the Urdu Lughat dictionary. A range of baseline WSD models based on n-grams are applied to the corpus and the best performance (accuracy of 57.71%) is achieved using word 4-grams. The corpus is freely available to the research community to encourage further WSD research in Urdu.
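    An n-gram sense baseline of the kind evaluated above can be sketched as follows: remember which sense each ambiguous word took inside the word n-grams observed in training, look the surrounding n-gram up at prediction time, and back off to the word's most frequent sense. The sense-tagged English tokens below are invented, and the exact back-off scheme of the paper's baselines may differ; only the general n-gram idea is illustrated.

```python
# Sketch of an n-gram baseline for all-words WSD with a most-frequent-sense back-off.
from collections import Counter, defaultdict

N = 4  # n-gram size (the paper's best baseline used word 4-grams)

# Training data: (token, sense) pairs; None marks unambiguous tokens.
train = [
    ("he", None), ("sat", None), ("by", None), ("the", None),
    ("bank", "bank_river"), ("of", None), ("the", None), ("river", None),
    ("she", None), ("paid", None), ("the", None),
    ("bank", "bank_finance"), ("a", None), ("large", None), ("fee", None),
]

ngram_sense = defaultdict(Counter)   # n-gram of word forms -> sense counts
word_sense = defaultdict(Counter)    # word form -> sense counts (for back-off)

words = [w for w, _ in train]
for i, (w, sense) in enumerate(train):
    if sense is None:
        continue
    word_sense[w][sense] += 1
    gram = tuple(words[max(0, i - (N - 1)):i + 1])  # n-gram ending at the word
    ngram_sense[gram][sense] += 1

def predict(context_words, i):
    """Predict the sense of context_words[i] from its preceding n-gram."""
    gram = tuple(context_words[max(0, i - (N - 1)):i + 1])
    if gram in ngram_sense:
        return ngram_sense[gram].most_common(1)[0][0]
    fallback = word_sense.get(context_words[i])
    return fallback.most_common(1)[0][0] if fallback else None

test = ["they", "walked", "by", "the", "bank", "of", "the", "stream"]
print(predict(test, 4))  # unseen 4-gram here, so the most frequent sense is returned
```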

    Semi-Semantic Annotation: A guideline for the URDU.KON-TB treebank POS annotation


    Building a Text Collection for Urdu Information Retrieval

    Urdu is a widely spoken language in the Indian subcontinent with over 300 million speakers worldwide. However, linguistic advancements in Urdu are rare compared to those in other European and Asian languages. Therefore, by following Text Retrieval Conference (TREC) standards, we attempted to construct an extensive text collection of 85,304 documents from diverse categories covering over 52 topics, with relevance judgment sets created at a pool depth of 100. We also present several applications to demonstrate the effectiveness of our collection. Although this collection is primarily intended for text retrieval, it can also be used for named entity recognition, text summarization, and other linguistic applications with suitable modifications. Ours is the most extensive existing collection for the Urdu language, and it will be freely available for future research and academic education.
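    To illustrate the TREC-style pooling behind the relevance judgment sets, the sketch below merges the top-k results of several hypothetical retrieval runs into an assessment pool and then scores a run with precision at k against invented judgments. The run data, document ids and qrels are all made up; only the pooling-and-evaluation recipe reflects the methodology described above.

```python
# Sketch of TREC-style pooling and evaluation: merge the top-k documents of
# each run into a pool for manual assessment, then score runs against the
# resulting relevance judgments (qrels). All data below is hypothetical.

POOL_DEPTH = 100  # the collection described above pools to depth 100

def build_pool(runs, depth=POOL_DEPTH):
    """Merge the top-`depth` documents of every run into one set per topic."""
    pool = {}
    for run in runs:                      # run: {topic_id: ranked list of doc ids}
        for topic, ranking in run.items():
            pool.setdefault(topic, set()).update(ranking[:depth])
    return pool

def precision_at_k(ranking, relevant, k=10):
    """Fraction of the top-k retrieved documents judged relevant."""
    return sum(1 for doc in ranking[:k] if doc in relevant) / k

if __name__ == "__main__":
    run_a = {"T01": ["d3", "d7", "d1", "d9"]}
    run_b = {"T01": ["d7", "d2", "d3", "d5"]}
    pool = build_pool([run_a, run_b], depth=3)
    print("pool for T01:", sorted(pool["T01"]))      # documents sent for assessment
    qrels = {"T01": {"d3", "d5"}}                     # hypothetical judgments
    print("P@4 of run_b:", precision_at_k(run_b["T01"], qrels["T01"], k=4))
```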