5 research outputs found

    Genetic Algorithm (GA) in Feature Selection for CRF Based Manipuri Multiword Expression (MWE) Identification

    Full text link
    This paper deals with the identification of Multiword Expressions (MWEs) in Manipuri, a highly agglutinative Indian Language. Manipuri is listed in the Eight Schedule of Indian Constitution. MWE plays an important role in the applications of Natural Language Processing(NLP) like Machine Translation, Part of Speech tagging, Information Retrieval, Question Answering etc. Feature selection is an important factor in the recognition of Manipuri MWEs using Conditional Random Field (CRF). The disadvantage of manual selection and choosing of the appropriate features for running CRF motivates us to think of Genetic Algorithm (GA). Using GA we are able to find the optimal features to run the CRF. We have tried with fifty generations in feature selection along with three fold cross validation as fitness function. This model demonstrated the Recall (R) of 64.08%, Precision (P) of 86.84% and F-measure (F) of 73.74%, showing an improvement over the CRF based Manipuri MWE identification without GA application.Comment: 14 pages, 6 figures, see http://airccse.org/journal/jcsit/1011csit05.pd

    Combination of Genetic Algorithm and Brill Tagger Algorithm for Part of Speech Tagging Bahasa Madura

    Get PDF
    Part of speech (POS) is commonly known as word types in a sentence such as verbs, adjectives, nouns, and so on. Part of Speech (POS) Tagging is a process of marking the word class or part of speech in every word in a sentence. Part of Speech Tagging has an important role to be used as a basis for research in Natural Language Processing. That is why research on Part of Speech Tagging for Bahasa Madura as an effort to preserve and develop the use of regional languages. In this research, POS Tagging is done using the Brill Tagger Algorithm which is combined with the Genetic Algorithm. Brill Tagger is a POS Tagging Algorithm that has the best level of accuracy when implemented in other languages. Genetic Algorithms used in the contextual learner process with consideration in previous studies can increase the speed of the training process so that it is more efficient. The results of this study are then compared with the results of the previous study so that we can find out suitable algorithms used for the development of text processing in Bahasa Madura. From a series of experiments, the average accuracy obtained by using Brill Tagger is 86.4% with the highest accuracy of 86.7%, while using GA Brill Tagger shows an average accuracy of 86.5% with the highest accuracy of 86.6%. Testing by observing OOV (Out of Vocabulary) achieves an average accuracy of 67.7% for Brill Taggers and 64.6% for GA Brill Taggers. Testing by considering multiple POS with Brill Tagger produces an average accuracy of 73.3% while testing using GA Brill Tagger produces an average accuracy of 90.9%. This shows that the accuracy with GA Brill Tagger is better than Brill Tagger, especially if considering multiple POS. This is because GA Brill Tagger can generate rules for handling the existence of multiple POS more than pure Brill Tagger.Part of speech (POS) is commonly known as word types in a sentence such as verbs, adjectives, nouns, and so on. Part of Speech (POS) Tagging is a process of marking the word class or part of speech in every word in a sentence. Part of Speech Tagging has an important role to be used as a basis for research in Natural Language Processing. That is why research on Part of Speech Tagging for Bahasa Madura as an effort to preserve and develop the use of regional languages. In this research, POS Tagging is done using the Brill Tagger Algorithm which is combined with the Genetic Algorithm. Brill Tagger is a POS Tagging Algorithm that has the best level of accuracy when implemented in other languages. Genetic Algorithms used in the contextual learner process with consideration in previous studies can increase the speed of the training process so that it is more efficient. The results of this study are then compared with the results of the previous study so that we can find out suitable algorithms used for the development of text processing in Bahasa Madura. From a series of experiments, the average accuracy obtained by using Brill Tagger is 86.4% with the highest accuracy of 86.7%, while using GA Brill Tagger shows an average accuracy of 86.5% with the highest accuracy of 86.6%. Testing by observing OOV (Out of Vocabulary) achieves an average accuracy of 67.7% for Brill Taggers and 64.6% for GA Brill Taggers. Testing by considering multiple POS with Brill Tagger produces an average accuracy of 73.3% while testing using GA Brill Tagger produces an average accuracy of 90.9%. This shows that the accuracy with GA Brill Tagger is better than Brill Tagger, especially if considering multiple POS. This is because GA Brill Tagger can generate rules for handling the existence of multiple POS more than pure Brill Tagge

    Design of an Offline Handwriting Recognition System Tested on the Bangla and Korean Scripts

    Get PDF
    This dissertation presents a flexible and robust offline handwriting recognition system which is tested on the Bangla and Korean scripts. Offline handwriting recognition is one of the most challenging and yet to be solved problems in machine learning. While a few popular scripts (like Latin) have received a lot of attention, many other widely used scripts (like Bangla) have seen very little progress. Features such as connectedness and vowels structured as diacritics make it a challenging script to recognize. A simple and robust design for offline recognition is presented which not only works reliably, but also can be used for almost any alphabetic writing system. The framework has been rigorously tested for Bangla and demonstrated how it can be transformed to apply to other scripts through experiments on the Korean script whose two-dimensional arrangement of characters makes it a challenge to recognize. The base of this design is a character spotting network which detects the location of different script elements (such as characters, diacritics) from an unsegmented word image. A transcript is formed from the detected classes based on their corresponding location information. This is the first reported lexicon-free offline recognition system for Bangla and achieves a Character Recognition Accuracy (CRA) of 94.8%. This is also one of the most flexible architectures ever presented. Recognition of Korean was achieved with a 91.2% CRA. Also, a powerful technique of autonomous tagging was developed which can drastically reduce the effort of preparing a dataset for any script. The combination of the character spotting method and the autonomous tagging brings the entire offline recognition problem very close to a singular solution. Additionally, a database named the Boise State Bangla Handwriting Dataset was developed. This is one of the richest offline datasets currently available for Bangla and this has been made publicly accessible to accelerate the research progress. Many other tools were developed and experiments were conducted to more rigorously validate this framework by evaluating the method against external datasets (CMATERdb 1.1.1, Indic Word Dataset and REID2019: Early Indian Printed Documents). Offline handwriting recognition is an extremely promising technology and the outcome of this research moves the field significantly ahead

    A Survey on Semantic Processing Techniques

    Full text link
    Semantic processing is a fundamental research domain in computational linguistics. In the era of powerful pre-trained language models and large language models, the advancement of research in this domain appears to be decelerating. However, the study of semantics is multi-dimensional in linguistics. The research depth and breadth of computational semantic processing can be largely improved with new technologies. In this survey, we analyzed five semantic processing tasks, e.g., word sense disambiguation, anaphora resolution, named entity recognition, concept extraction, and subjectivity detection. We study relevant theoretical research in these fields, advanced methods, and downstream applications. We connect the surveyed tasks with downstream applications because this may inspire future scholars to fuse these low-level semantic processing tasks with high-level natural language processing tasks. The review of theoretical research may also inspire new tasks and technologies in the semantic processing domain. Finally, we compare the different semantic processing techniques and summarize their technical trends, application trends, and future directions.Comment: Published at Information Fusion, Volume 101, 2024, 101988, ISSN 1566-2535. The equal contribution mark is missed in the published version due to the publication policies. Please contact Prof. Erik Cambria for detail

    Critical point of view: a Wikipedia reader

    Get PDF
    For millions of internet users around the globe, the search for new knowledge begins with Wikipedia. The encyclopedia’s rapid rise, novel organization, and freely offered content have been marveled at and denounced by a host of commentators. Critical Point of View moves beyond unflagging praise, well-worn facts, and questions about its reliability and accuracy, to unveil the complex, messy, and controversial realities of a distributed knowledge platform. The essays, interviews and artworks brought together in this reader form part of the overarching Critical Point of View research initiative, which began with a conference in Bangalore (January 2010), followed by events in Amsterdam (March 2010) and Leipzig (September 2010). With an emphasis on theoretical reflection, cultural difference and indeed, critique, contributions to this collection ask: What values are embedded in Wikipedia’s software? On what basis are Wikipedia’s claims to neutrality made? How can Wikipedia give voice to those outside the Western tradition of Enlightenment, or even its own administrative hierarchies? Critical Point of View collects original insights on the next generation of wiki-related research, from radical artistic interventions and the significant role of bots to hidden trajectories of encyclopedic knowledge and the politics of agency and exclusion
    corecore