
    Multi-label text classification of Indonesian customer reviews using bidirectional encoder representations from transformers language model

    Customer reviews are a critical resource for supporting decision-making in various industries. To understand how customers perceive each aspect of a product, we can first identify all aspects discussed in the reviews by performing multi-label text classification. In this work, we examine the effectiveness of two proposed strategies that use a bidirectional encoder representations from transformers (BERT) language model pre-trained on Indonesian, referred to as IndoBERT, for multi-label text classification. First, IndoBERT is used as the feature representation, combined with a convolutional neural network and extreme gradient boosting (CNN-XGBoost). Second, IndoBERT is used as both the feature representation and the classifier to solve the classification task directly. An additional analysis compares our results with those obtained using the multilingual BERT model. According to our experimental results, the first model, using IndoBERT as feature representation, significantly outperforms several baselines. The second model, using IndoBERT as both feature representation and classifier, further improves significantly on the first. In summary, the proposed models improve on the Word2Vec-CNN-XGBoost baseline by 19.19% in accuracy and 6.17% in F1-score.
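    As a rough illustration of the second strategy (the encoder serving as both feature representation and classifier), the sketch below fine-tunes a BERT-style model for multi-label aspect prediction with a per-label sigmoid. The checkpoint name and the aspect labels are assumptions for illustration, not the paper's exact setup.

```python
# Minimal sketch: a BERT-style encoder used directly as a multi-label classifier,
# in the spirit of the second proposed strategy. The checkpoint name and the
# aspect labels below are illustrative assumptions.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "indobenchmark/indobert-base-p1"          # assumed IndoBERT checkpoint
ASPECTS = ["price", "quality", "delivery", "packaging", "service"]  # hypothetical aspect labels

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME,
    num_labels=len(ASPECTS),
    problem_type="multi_label_classification",         # per-label sigmoid + BCE loss
)

review = "Pengiriman cepat tapi kemasannya rusak."     # example Indonesian review
inputs = tokenizer(review, return_tensors="pt", truncation=True)
with torch.no_grad():
    probs = torch.sigmoid(model(**inputs).logits)[0]
predicted = [a for a, p in zip(ASPECTS, probs) if p > 0.5]  # threshold each label independently
print(predicted)  # aspects mentioned in the review (meaningful only after fine-tuning)
```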

    Machine translation as an underrated ingredient? : solving classification tasks with large language models for comparative research

    While large language models have revolutionised computational text analysis methods, the field is still tilted towards English-language resources. Even though pre-trained models exist for some "smaller" languages, coverage is far from universal, and pre-training large language models is an expensive and complicated task. This uneven language coverage limits comparative social research in its geographical and linguistic scope. We propose a solution that sidesteps these issues by leveraging transfer learning and open-source machine translation. We use English as a bridge language between Hungarian and Polish bills and laws to solve a classification task based on the Comparative Agendas Project (CAP) coding scheme. Using the Hungarian corpus as training data for model fine-tuning, we categorise the Polish laws into 20 CAP categories. In doing so, we compare the performance of Transformer-based deep learning models (monolingual models such as BERT and multilingual models such as XLM-RoBERTa) and machine learning algorithms (e.g., SVM). Results show that the fine-tuned large language models outperform the traditional supervised learning benchmarks but are themselves surpassed by the machine translation approach. Overall, the proposed solution demonstrates a viable way to apply a transfer learning framework to low-resource languages and achieve state-of-the-art results without requiring expensive pre-training.
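    A minimal sketch of the translation-bridge idea is shown below: a Polish text is first translated to English with an open-source MT model and then classified by an English-language classifier fine-tuned on the (translated) Hungarian training data. The OPUS-MT checkpoint and the classifier path are assumptions for illustration.

```python
# Minimal sketch of the machine-translation bridge: translate Polish text to English,
# then classify it with an English model fine-tuned on translated Hungarian data.
# The OPUS-MT checkpoint and the local classifier path are illustrative assumptions.
from transformers import pipeline

translate_pl_en = pipeline("translation", model="Helsinki-NLP/opus-mt-pl-en")     # assumed MT model
classify_cap = pipeline("text-classification", model="./cap-english-classifier")  # hypothetical fine-tuned classifier

polish_text = "Ustawa o ochronie środowiska i gospodarce odpadami."  # example Polish bill title
english_text = translate_pl_en(polish_text)[0]["translation_text"]
print(classify_cap(english_text))  # e.g. [{'label': 'Environment', 'score': 0.91}] (illustrative)
```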

    Domain Classification for Marathi Blog Articles using Deep Learning

    With the exponential growth of online content, particularly blog articles, the need for effective techniques to automatically categorize articles into relevant domains has become increasingly important. Natural language processing (NLP), machine learning (ML), and deep learning (DL) provide the tools to address this challenge. The proposed system applies a long short-term memory (LSTM) classifier to domain classification and compares it with existing multiclass classification techniques, reaching accuracies of around 94% and 91% on two data sets: a Marathi news article data set and a financial article data set. The proposed model is compared with several other models, including naïve Bayes (NB), XGBoost, support vector machine (SVM), and random forest (RF). Overall, the LSTM model achieves the best results on both data sets.
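    A minimal Keras sketch of an LSTM domain classifier along these lines is given below; the vocabulary size, sequence length, embedding dimension, and number of domains are illustrative assumptions rather than the paper's actual settings.

```python
# Minimal sketch of an LSTM domain classifier; all hyperparameters and the number of
# target domains are illustrative assumptions.
import tensorflow as tf
from tensorflow.keras import layers

VOCAB_SIZE, MAX_LEN, EMBED_DIM, NUM_DOMAINS = 20000, 200, 128, 5

model = tf.keras.Sequential([
    tf.keras.Input(shape=(MAX_LEN,)),                    # padded token-id sequences
    layers.Embedding(VOCAB_SIZE, EMBED_DIM),
    layers.LSTM(64),                                     # sequence encoder
    layers.Dense(64, activation="relu"),
    layers.Dense(NUM_DOMAINS, activation="softmax"),     # one domain per article
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.summary()
# model.fit(X_train, y_train, validation_split=0.1, epochs=5)  # X_train: tokenized, padded articles
```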

    The processing of ambiguous sentences by first and second language learners of English

    This study compares the way English-speaking children and adult second language learners of English resolve relative clause attachment ambiguities in sentences such as The dean liked the secretary of the professor who was reading a letter. Two groups of advanced L2 learners of English with Greek or German as their L1 participated in a set of off-line and on-line tasks. While the participants' disambiguation preferences were influenced by lexical-semantic properties of the preposition linking the two potential antecedent NPs (of vs. with), there was no evidence that they were applying any structure-based ambiguity resolution strategies of the type that have been claimed to influence sentence processing in monolingual adults. These findings differ markedly from those obtained from 6- to 7-year-old monolingual English children in a parallel auditory study (Felser, Marinis, & Clahsen, submitted), in that the children's attachment preferences were not affected by the type of preposition at all. We argue that whereas children primarily rely on structure-based parsing principles during processing, adult L2 learners are guided mainly by non-structural information.

    Code Mixed Cross Script Factoid Question Classification - A Deep Learning Approach

    Before the advent of the Internet era, code-mixing was mainly used in spoken form. However, with popular informal networking platforms such as Facebook, Twitter, and Instagram, code-mixing is increasingly used in written form on social media. User-generated social media content is becoming an increasingly important resource in applied linguistics, and recent trends in social media usage have led to a proliferation of studies on such content. Multilingual social media users often write native-language content in a non-native script (cross-script). Recently, Banerjee et al. [9] introduced the code-mixed cross-script question answering research problem and reported that the ever-increasing social media content could serve as a potential digital resource for less-computerized languages to build question answering systems. Question classification is a core task in question answering in which questions are assigned a class or a number of classes that denote the expected answer type(s). In this work, we address the question classification task as part of the code-mixed cross-script question answering research problem. We combine a deep learning framework with feature engineering to address the question classification task and improve the state-of-the-art question classification accuracy by over 4% for code-mixed cross-script questions. The work of the third author was partially supported by the SomEMBED TIN2015-71147-C2-1-P MINECO research project. Banerjee, S.; Kumar Naskar, S.; Rosso, P.; Bandyopadhyay, S. (2018). Code Mixed Cross Script Factoid Question Classification - A Deep Learning Approach. Journal of Intelligent & Fuzzy Systems, 34(5), 2959-2969. https://doi.org/10.3233/JIFS-169481
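    The sketch below illustrates one way to combine a learned text representation with hand-engineered features for question classification, in the spirit of the approach described above; the dimensions, the feature set, and the number of answer-type classes are assumptions for illustration.

```python
# Minimal sketch: fuse a learned representation of the question with hand-crafted features
# (e.g. wh-word flags) before the answer-type classifier. All sizes are illustrative assumptions.
import tensorflow as tf
from tensorflow.keras import layers

VOCAB_SIZE, MAX_LEN, EMBED_DIM, N_HANDCRAFTED, NUM_CLASSES = 10000, 30, 64, 12, 9

tokens = layers.Input(shape=(MAX_LEN,), name="token_ids")
feats = layers.Input(shape=(N_HANDCRAFTED,), name="handcrafted_features")

x = layers.Embedding(VOCAB_SIZE, EMBED_DIM)(tokens)
x = layers.Conv1D(128, 3, activation="relu")(x)      # learned n-gram features
x = layers.GlobalMaxPooling1D()(x)
x = layers.Concatenate()([x, feats])                 # fuse learned and engineered features
out = layers.Dense(NUM_CLASSES, activation="softmax")(x)

model = tf.keras.Model(inputs=[tokens, feats], outputs=out)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
```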

    The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants

    We present Belebele, a multiple-choice machine reading comprehension (MRC) dataset spanning 122 language variants. Significantly expanding the language coverage of natural language understanding (NLU) benchmarks, this dataset enables the evaluation of text models in high-, medium-, and low-resource languages. Each question is based on a short passage from the Flores-200 dataset and has four multiple-choice answers. The questions were carefully curated to discriminate between models with different levels of general language comprehension. The English dataset on its own proves difficult enough to challenge state-of-the-art language models. Being fully parallel, the dataset enables direct comparison of model performance across all languages. We use it to evaluate the capabilities of multilingual masked language models (MLMs) and large language models (LLMs). We present extensive results and find that, despite significant cross-lingual transfer in English-centric LLMs, much smaller MLMs pretrained on balanced multilingual data still understand far more languages. We also observe that larger vocabulary size and conscious vocabulary construction correlate with better performance on low-resource languages. Overall, Belebele opens up new avenues for evaluating and analyzing the multilingual capabilities of NLP systems.
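    One common way to evaluate a causal LM on such multiple-choice items is to score each candidate answer by its log-likelihood given the passage and question, as in the sketch below; the checkpoint and the example item are placeholders, and the paper's exact prompting and evaluation setup may differ.

```python
# Minimal sketch: pick the multiple-choice answer whose tokens receive the highest
# log-likelihood from a causal LM given the passage and question. The checkpoint and the
# example item are illustrative placeholders, not the paper's exact setup.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

MODEL_NAME = "gpt2"  # stand-in checkpoint; replace with the LLM under evaluation
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()

def answer_logprob(passage, question, option):
    prompt = f"{passage}\nQuestion: {question}\nAnswer: "
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + option, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)      # predictions for tokens 1..n-1
    targets = full_ids[0, 1:]
    token_lp = log_probs[torch.arange(targets.shape[0]), targets]
    return token_lp[prompt_len - 1:].sum().item()              # log-prob of the answer tokens only

passage = "The cat sat quietly on the warm windowsill all afternoon."
question = "Where did the cat sit?"
options = ["On the floor", "On the windowsill", "In the garden", "Under the bed"]
print(max(options, key=lambda o: answer_logprob(passage, question, o)))
```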

    Challenging the oral-only narrative: Enhancing early signed input for deaf children with hearing parents

    Learning a language is, at its core, a process of noticing patterns in the language input surrounding the learner. Although many of these language patterns are complex and difficult for adult speakers/signers to recognize, infants are able to find and learn them from the youngest age, without explicit instruction. However, this impressive feat depends on children's early access to ample and well-formed input that displays the regular patterns of natural language. Such input is far from guaranteed for the great majority of deaf and hard of hearing (DHH) children, leading to well-documented difficulties and delays in linguistic development. Efforts to remedy this situation have focused disproportionately on amplifying DHH children's hearing levels, often through cochlear implants, as young as possible to facilitate early access to spoken language. Given the time required for cochlear implantation, its lack of guaranteed success, and the critical importance of exposing infants to quality language input as early as possible, a bimodal bilingual approach can optimize DHH infants' chances for on-time language development by providing them with both spoken and signed language input from the start. This paper addresses the common claim that signing with DHH children renders the task of learning spoken language more difficult, leading to delays and inferior language development compared to DHH children in oral-only environments. That viewpoint has most recently been articulated by Geers et al. (2017a), which I discuss as representative of the many studies promoting an oral-only approach. Contrary to their claims that signing degrades the language input available to DHH children, recent research has demonstrated that the formidable pattern-finding skills of newborn infants extend to linguistic cues in both the spoken and signed modalities, and that the additional challenge of simultaneously acquiring two languages is offset by important “bilingual advantages.” Of course, securing early access to high-quality signed input for DHH children from hearing families requires considerable effort, especially since most hearing parents are still novice signers. This paper closes with some suggestions for how to address this challenge through partnerships between linguistics researchers and early intervention programs to support family-centered bimodal bilingual development for DHH children.

    MIRACLE at NTCIR-7 MOAT: First experiments on multilingual opinion analysis

    This paper describes the participation of the MIRACLE research consortium in the NTCIR-7 Multilingual Opinion Analysis Task, our first attempt at sentiment analysis and our second at East Asian languages. We took part in the main mandatory opinionated sentence judgment subtask (deciding whether each sentence expresses an opinion or not) and the optional relevance and polarity judgment subtasks (deciding whether a given sentence is relevant to the given topic, and determining the polarity of the expressed opinion). Our approach combines a semantic, language-dependent tagging of the terms of the sentence and the topic with three different ad-hoc classifiers, run in cascade, that provide the specific annotation for each subtask. These models were trained on the corpus provided in the NTCIR-6 Opinion Analysis pilot task.
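    A minimal sketch of a cascaded setup like the one described, where a first classifier decides whether a sentence is opinionated and a second assigns polarity only to opinionated sentences, is given below; the TF-IDF features, SVM classifiers, and toy training data are stand-ins for the paper's semantic tagging and ad-hoc classifiers.

```python
# Minimal sketch of a two-stage cascade: opinionated-or-not first, polarity only for
# opinionated sentences. TF-IDF + linear SVM and the toy training data are illustrative
# stand-ins for the paper's semantic tagging and ad-hoc classifiers.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

opinion_clf = make_pipeline(TfidfVectorizer(), LinearSVC())    # opinionated vs. not
polarity_clf = make_pipeline(TfidfVectorizer(), LinearSVC())   # positive / negative

# Toy training data purely for illustration.
opinion_clf.fit(["I love this phone", "The meeting is at 3 pm"], ["opinionated", "not_opinionated"])
polarity_clf.fit(["I love this phone", "I hate the battery life"], ["positive", "negative"])

def annotate(sentence):
    """Run the cascade: only opinionated sentences reach the polarity classifier."""
    if opinion_clf.predict([sentence])[0] == "opinionated":
        return "opinionated", polarity_clf.predict([sentence])[0]
    return "not_opinionated", None

print(annotate("I really enjoy the new camera"))
```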