
    Opinion mining on newspaper headlines using SVM and NLP

    Opinion Mining, also known as Sentiment Analysis, is a technique that uses Natural Language Processing (NLP) to classify the sentiment expressed in text. Various NLP tools are available for processing text data, and much research on opinion mining has been done for online blogs, Twitter, Facebook, etc. This paper proposes a new opinion mining technique that applies a Support Vector Machine (SVM) and NLP tools to newspaper headlines. Relevant words are generated using Stanford CoreNLP and passed to the SVM via a count vectorizer. Comparing three models using confusion matrices, the results indicate that Tf-idf with a linear SVM provides better accuracy for the smaller dataset, while for the larger dataset the SGD with linear SVM model outperforms the other models.
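
    As a concrete illustration, here is a minimal sketch of the kind of pipeline the abstract describes: TF-IDF features feeding a linear SVM for headline polarity classification. It assumes scikit-learn; the toy headlines, labels, and parameters are illustrative, not the paper's data or settings.

```python
# Hedged sketch: TF-IDF + linear SVM headline classifier (toy data).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

headlines = [
    "Stocks rally as markets recover",
    "Floods devastate coastal towns",
    "New vaccine shows promising results",
    "Unemployment rises for third month",
]
labels = ["positive", "negative", "positive", "negative"]

model = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),  # unigram + bigram features
    ("svm", LinearSVC()),                            # linear-kernel SVM
])
model.fit(headlines, labels)
print(model.predict(["Economy shows signs of growth"]))
```

    Replacing `LinearSVC` with scikit-learn's `SGDClassifier` gives an SGD-trained linear model, the variant the abstract reports as stronger on larger datasets.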

    Towards Detecting, Recognizing, and Parsing the Address Information from Bangla Signboard: A Deep Learning-based Approach

    Retrieving textual information from natural scene images is an active research area in the field of computer vision with numerous practical applications. Detecting text regions and extracting text from signboards is a challenging problem due to special characteristics such as reflected light, uneven illumination, or shadows found in real-life natural scene images. With the advent of deep learning-based methods, different sophisticated techniques have been proposed for text detection and text recognition in natural scenes. Though a significant amount of effort has been devoted to extracting natural scene text for resource-rich languages like English, little has been done for low-resource languages like Bangla. In this research work, we propose an end-to-end system with deep learning-based models for efficiently detecting, recognizing, correcting, and parsing address information from Bangla signboards. We created manually annotated and synthetic datasets to train the signboard detection, address text detection, address text recognition, address text correction, and address text parsing models. We conducted a comparative study among different CTC-based and encoder-decoder model architectures for Bangla address text recognition. Moreover, we designed a novel address text correction model using a sequence-to-sequence transformer-based network to improve the performance of the Bangla address text recognition model through post-correction. Finally, we developed a Bangla address text parser using a state-of-the-art transformer-based pre-trained language model.
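
    As a rough illustration of the post-correction stage, the following is a minimal character-level sequence-to-sequence corrector built on PyTorch's `nn.Transformer`. The dimensions, vocabulary size, and random tensors are placeholder assumptions; the authors' actual architecture, tokenization, and training setup are not reproduced here.

```python
# Hedged sketch of a transformer-based seq2seq post-corrector (toy setup).
import torch
import torch.nn as nn

class CharSeq2SeqCorrector(nn.Module):
    def __init__(self, vocab_size, d_model=128, nhead=4, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=num_layers, num_decoder_layers=num_layers,
            batch_first=True,
        )
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, noisy_ids, target_ids):
        # Causal mask so the decoder cannot peek at future characters.
        tgt_mask = self.transformer.generate_square_subsequent_mask(
            target_ids.size(1)
        )
        # Positional encodings are omitted for brevity.
        hidden = self.transformer(
            self.embed(noisy_ids), self.embed(target_ids), tgt_mask=tgt_mask
        )
        return self.out(hidden)  # per-position logits over the character vocab

vocab_size = 128                                # toy character inventory
model = CharSeq2SeqCorrector(vocab_size)
noisy = torch.randint(0, vocab_size, (2, 20))   # batch of noisy OCR outputs
target = torch.randint(0, vocab_size, (2, 20))  # gold corrected strings
print(model(noisy, target).shape)               # torch.Size([2, 20, 128])
```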

    DPCSpell: A Transformer-based Detector-Purificator-Corrector Framework for Spelling Error Correction of Bangla and Resource Scarce Indic Languages

    Spelling error correction is the task of identifying and rectifying misspelled words in texts. It is an active research topic in Natural Language Processing because of its numerous applications in human language understanding. Phonetically or visually similar yet semantically distinct characters make it an arduous task in any language. Earlier efforts on spelling error correction in Bangla and resource-scarce Indic languages focused on rule-based, statistical, and machine learning-based methods, which we found rather inefficient. In particular, machine learning-based approaches, which exhibit superior performance to rule-based and statistical methods, are ineffective because they correct each character regardless of its appropriateness. In this work, we propose a novel detector-purificator-corrector framework based on denoising transformers that addresses these previous issues. Moreover, we present a method for large-scale corpus creation from scratch, which in turn resolves the resource limitation problem for any left-to-right scripted language. The empirical outcomes demonstrate the effectiveness of our approach, which outperforms previous state-of-the-art methods by a significant margin for Bangla spelling error correction. The models and corpus are publicly available at https://tinyurl.com/DPCSpell (23 pages, 4 figures, and 7 tables).
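
    To make the three-stage idea concrete, the toy sketch below composes a detector, a purificator, and a corrector. Each stage here is a trivial rule-based stand-in written only to show the data flow; in the paper each stage is a denoising transformer, and the lexicon and heuristics below are invented for illustration.

```python
# Toy stand-in for the detector-purificator-corrector flow (not the paper's models).
MASK = "_"
LEXICON = {"spelling", "correction", "language"}

def detect(word):
    # Detector: flag character positions suspected to be wrong. Stand-in
    # heuristic: positions where the word disagrees with a same-length
    # lexicon entry in at most two places.
    for entry in LEXICON:
        if len(entry) == len(word):
            flagged = [i for i, (a, b) in enumerate(zip(word, entry)) if a != b]
            if 0 < len(flagged) <= 2:
                return flagged
    return []

def purify(word, positions):
    # Purificator: mask only the flagged characters, leaving the rest intact,
    # so the corrector does not rewrite characters that were already right.
    return "".join(MASK if i in positions else ch for i, ch in enumerate(word))

def correct(masked):
    # Corrector: fill the masked slots with a lexicon word consistent with
    # the unmasked characters.
    for entry in LEXICON:
        if len(entry) == len(masked) and all(
            m == MASK or m == c for m, c in zip(masked, entry)
        ):
            return entry
    return masked

word = "spelking"                      # substitution error for "spelling"
masked = purify(word, detect(word))
print(masked, "->", correct(masked))   # spel_ing -> spelling
```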

    An Evaluation of Sinhala Language NLP Tools and Neural Network Based POS Taggers

    Part-of-Speech (POS) tagging is a fundamental problem in the NLP domain, and POS taggers are used to address this challenge. Although rule-based, probabilistic, or deep learning approaches can be used to develop a POS tagger, deep learning based POS taggers have shown better results. All POS tagging research carried out so far for the Sinhala language has used rule-based and probabilistic approaches. This research focuses on developing and evaluating deep learning based POS taggers using LSTM networks for the Sinhala language. In this research we trained five deep learning based POS tagging models on two different datasets and evaluated the results of those models. The evaluation results show that deep learning based POS taggers can be used for the Sinhala language and that their performance is better than that of the existing rule-based or probabilistic POS taggers.
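
    For concreteness, the following is a minimal bidirectional LSTM tagger sketch in PyTorch, illustrating the class of model the thesis evaluates; the vocabulary size, tagset size, and random input are toy assumptions, not the Sinhala datasets used in the study.

```python
# Hedged sketch: a BiLSTM part-of-speech tagger (toy dimensions).
import torch
import torch.nn as nn

class LSTMTagger(nn.Module):
    def __init__(self, vocab_size, tagset_size, embed_dim=100, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim,
                            batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden_dim, tagset_size)  # 2x: both directions

    def forward(self, token_ids):
        embedded = self.embed(token_ids)   # (batch, seq, embed_dim)
        outputs, _ = self.lstm(embedded)   # (batch, seq, 2 * hidden_dim)
        return self.fc(outputs)            # per-token tag logits

tagger = LSTMTagger(vocab_size=5000, tagset_size=30)
sentence = torch.randint(0, 5000, (1, 8))  # one 8-token sentence of word ids
print(tagger(sentence).shape)              # torch.Size([1, 8, 30])
```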

    Comparative psychosyntax

    Every difference between languages is a “choice point” for the syntactician, psycholinguist, and language learner. The syntactician must describe the differences in representations that the grammars of different languages can assign. The psycholinguist must describe how the comprehension mechanisms search the space of the representations permitted by a grammar to quickly and effortlessly understand sentences in real time. The language learner must determine which representations are permitted in her grammar on the basis of her primary linguistic evidence. These investigations are largely pursued independently, and on the basis of qualitatively different data. In this dissertation, I show that these investigations can be pursued in a way that is mutually informative. Specifically, I show how learnability concerns and sentence processing data can constrain the space of possible analyses of language differences. In Chapter 2, I argue that “indirect learning”, or abstract, cross-construction syntactic inference, is necessary in order to explain how the learner determines which complementizers can co-occur with subject gaps in her target grammar. I show that adult speakers largely converge in the robustness of the that-trace effect, a constraint on the co-occurrence of complementizers and subject gaps observed in languages like English, but unobserved in languages like Spanish or Italian. I show that realistic child-directed speech has very few long-distance subject extractions in English, Spanish, and Italian, implying that learners must be able to distinguish these different hypotheses on the basis of other data. This is more consistent with more conservative approaches to these phenomena (Rizzi, 1982), which do not rely on abstract complementizer agreement like later analyses (Rizzi, 2006; Rizzi & Shlonsky, 2007). In Chapter 3, I show that resumptive pronoun dependencies inside islands in English are constructed in a non-active fashion, which contrasts with recent findings in Hebrew (Keshev & Meltzer-Asscher, ms). I propose that an expedient explanation of these facts is to suppose that resumptive pronouns in English are ungrammatical repair devices (Sells, 1984), whereas resumptive pronouns in island contexts are grammatical in Hebrew. This implies that learners must infer which analysis is appropriate for their grammars on the basis of some evidence in the linguistic environment. However, a corpus study reveals that resumptive pronouns in islands are exceedingly rare in both languages, implying that this difference must be indirectly learned. I argue that theories of resumptive dependencies which analyze resumptive pronouns as incidences of the same abstract construction (e.g., Hayon 1973; Chomsky 1977) license this indirect learning, as long as resumptive dependencies in English are treated as ungrammatical repair mechanisms. In Chapter 4, I compare active dependency formation processes in Japanese and Bangla. These findings suggest that filler-gap dependencies are preferentially resolved with the first position available. In Japanese, this is the most deeply embedded clause, since embedded clauses always precede the embedding verb (Aoshima et al., 2004; Yoshida, 2006; Omaki et al., 2014). Bangla allows a within-language comparison of the relationship between active dependency formation processes and word order, since embedded clauses may precede or follow the embedding verb (Bayer, 1996). However, the results from three experiments in Bangla are mixed, suggesting a weaker preference for a linearly local resolution of filler-gap dependencies, unlike in Japanese. I propose a number of possible explanations for these facts, and discuss how differences in processing profiles may be accounted for in a variety of ways. In Chapter 5, I conclude the dissertation.

    Knowledge Expansion of a Statistical Machine Translation System using Morphological Resources

    The translation capability of a Phrase-Based Statistical Machine Translation (PBSMT) system depends mostly on parallel data, and phrases that are not present in the training data are not correctly translated. This paper describes a method that efficiently expands the existing knowledge of a PBSMT system, not by adding more parallel data but by using external morphological resources. A set of new phrase associations is added to the translation and reordering models; each of them corresponds to a morphological variation of the source, target, or both phrases of an existing association. New associations are generated using a string similarity score based on morphosyntactic information. We tested our approach on En-Fr and Fr-En translation, and the results showed improved performance in terms of automatic scores (BLEU and Meteor) and a reduction in out-of-vocabulary (OOV) words. We believe that our knowledge expansion framework is generic and could be used to add different types of information to the model.
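
    The expansion step can be illustrated with a small sketch: given a phrase pair already in the model, admit a morphological variant of the source phrase when it is string-similar enough to the known phrase. The character-level `SequenceMatcher` score and the 0.6 threshold below are stand-ins for the paper's morphosyntactically informed similarity score, and the phrase pairs are invented examples.

```python
# Hedged sketch: admitting morphological variants by string similarity.
from difflib import SequenceMatcher

def similarity(a, b):
    # Character-level similarity; a stand-in for a score that would also
    # use morphosyntactic information (lemma, tense, number, ...).
    return SequenceMatcher(None, a, b).ratio()

# An existing phrase-table entry and candidate variants from a
# morphological lexicon (all examples invented for illustration).
known_src, known_tgt = "eaten", "mangé"
variants = {"eats": "mange", "eating": "mangeant", "oak": "chêne"}

new_pairs = [(src, tgt) for src, tgt in variants.items()
             if similarity(src, known_src) >= 0.6]  # keep close variants only
print(new_pairs)  # [('eats', 'mange'), ('eating', 'mangeant')]
```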

    Automatic Clause Boundary Annotation in the Hindi Treebank


    Exploration of Corpus Augmentation Approach for English-Hindi Bidirectional Statistical Machine Translation System

    Even though a lot of Statistical Machine Translation (SMT) research is happening for the English-Hindi language pair, no effort has been made to standardize the dataset. Each research work uses a different dataset, different parameters, and a different number of sentences during the various phases of translation, resulting in varied translation output. This makes it tedious to compare these models, interpret their results, gain insight into corpus behavior, and reproduce the reported results, which necessitates standardizing the dataset and identifying common parameters for model development. The main contribution of this paper is an approach to standardize the dataset and to identify the parameter combination that gives the best performance. It also investigates a novel corpus augmentation approach to improve the translation quality of an English-Hindi bidirectional statistical machine translation system. This model works well in scarce-resource settings without incorporating external parallel corpora for the underlying languages. The experiments are carried out using the open-source phrase-based toolkit Moses, with the Indian Languages Corpora Initiative (ILCI) Hindi-English tourism corpus. With a limited dataset, considerable improvement is achieved using the corpus augmentation approach for the English-Hindi bidirectional SMT system.
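
    Since the abstract does not spell out the augmentation procedure, the sketch below shows one generic way to enlarge a small parallel corpus without external data: adding naively aligned sub-sentential fragments as extra training pairs. The proportional-prefix alignment and the toy sentence pair are illustrative assumptions, not the paper's method.

```python
# Hypothetical corpus augmentation from internal data only (toy example).
parallel = [
    ("the hotel is near the lake", "hotel jheel ke paas hai"),
]

def augment(pairs, min_len=2):
    out = list(pairs)
    for src, tgt in pairs:
        s, t = src.split(), tgt.split()
        # Naive fragment pairing: match proportional prefixes of both sides.
        # Real word alignments (e.g. from GIZA++) would be needed for
        # language pairs with divergent word order such as English-Hindi.
        for k in range(min_len, len(s)):
            j = max(min_len, round(k * len(t) / len(s)))
            out.append((" ".join(s[:k]), " ".join(t[:j])))
    return out

for src, tgt in augment(parallel):
    print(f"{src} ||| {tgt}")
```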