
    Automatic text summarization of Konkani texts using pre-trained word embeddings and deep learning

    Automatic text summarization has gained immense popularity in research. Previously, several methods have been explored for obtaining effective text summarization outcomes. However, most of this work pertains to the most widely spoken languages in the world. Through this paper, we explore extractive automatic text summarization using a deep learning approach and apply it to Konkani, a low-resource language with limited data, tools, speakers, and experts. In the proposed technique, Facebook's fastText pre-trained word embeddings are used to obtain vector representations for sentences. Thereafter, a deep multi-layer perceptron is employed for a supervised binary classification task that auto-generates summaries from the feature vectors. Using pre-trained fastText word embeddings eliminated the requirement of a large training set and reduced training time. The system-generated summaries were evaluated against the 'gold-standard' human-generated summaries with the Recall-Oriented Understudy for Gisting Evaluation (ROUGE) toolkit. The results showed that the performance of the proposed system closely matched that of the human annotators in generating summaries.
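
    A minimal Python sketch of the pipeline described above, assuming the fasttext package, its published Konkani vectors (the file name cc.gom.300.bin is an assumption here), and toy placeholder data; this illustrates the technique, not the authors' implementation.

        # Sentence vectors from pre-trained fastText embeddings, fed to an MLP
        # that decides, per sentence, whether it belongs in the summary.
        import fasttext
        import numpy as np
        from sklearn.neural_network import MLPClassifier

        ft = fasttext.load_model("cc.gom.300.bin")  # assumed pre-trained Konkani vectors

        def sentence_vector(sentence):
            # Average the fastText vectors of the sentence's tokens.
            tokens = sentence.split()
            vecs = [ft.get_word_vector(t) for t in tokens] or [np.zeros(ft.get_dimension())]
            return np.mean(vecs, axis=0)

        # Toy training pairs: (sentence, 1 if it belongs in the summary else 0);
        # stand-ins for the gold-standard labelled sentences.
        train = [("sentence one", 1), ("sentence two", 0)]
        X = np.stack([sentence_vector(s) for s, _ in train])
        y = [label for _, label in train]
        clf = MLPClassifier(hidden_layer_sizes=(128, 64), max_iter=500).fit(X, y)

        def summarize(sentences):
            # Keep the sentences the classifier predicts as summary-worthy.
            keep = clf.predict(np.stack([sentence_vector(s) for s in sentences]))
            return [s for s, k in zip(sentences, keep) if k == 1]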

    Measuring Enrichment Of Word Embeddings With Subword And Dictionary Information

    Word embedding models have been an important contribution to natural language processing. Following the distributional hypothesis (words used in similar contexts tend to have similar meanings), these models are trained on large corpora of natural text and use the contexts in which words appear to learn the mapping of a word to a vector. Some later models have tried to incorporate external information to enhance the vectors. The goal of this thesis is to evaluate three models of word embeddings and measure how these enhancements improve the word embeddings. These models, along with a hybrid variant, are evaluated on several tasks of word similarity and a task of word analogy. Results show that fine-tuning the vectors with semantic information improves performance in word similarity dramatically; conversely, enriching word vectors with syntactic information increases performance in word analogy tasks, with the hybrid approach finding a solid middle ground.
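
    As a concrete illustration of the word-similarity evaluation mentioned above, the sketch below computes the metric standard for such benchmarks: Spearman correlation between human similarity ratings and the cosine similarity of the corresponding vectors. The function names and data layout are illustrative, not taken from the thesis.

        import numpy as np
        from scipy.stats import spearmanr

        def cosine(u, v):
            return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

        def evaluate_similarity(vectors, pairs):
            # vectors: dict mapping word -> np.ndarray
            # pairs: list of (word1, word2, human_score) benchmark triples
            model_scores, human_scores = [], []
            for w1, w2, gold in pairs:
                if w1 in vectors and w2 in vectors:  # skip out-of-vocabulary pairs
                    model_scores.append(cosine(vectors[w1], vectors[w2]))
                    human_scores.append(gold)
            rho, _ = spearmanr(model_scores, human_scores)
            return rho  # higher rho = vectors better match human judgements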

    Different valuable tools for Arabic sentiment analysis: a comparative evaluation

    Arabic natural language processing (ANLP) is a subfield of artificial intelligence (AI) that builds various applications for the Arabic language, such as Arabic sentiment analysis (ASA): the task of classifying the feelings and emotions expressed in a text in order to determine the writer's attitude (neutral, negative, or positive). When working on ASA, researchers often use various tools in their projects without explaining the reason behind the choice, or they select a set of libraries according to their familiarity with a specific programming language. Because of the abundance of their libraries in the ANLP field, and especially in ASA, we rely on the Java and Python programming languages in our research work. This paper makes an in-depth comparative evaluation of different valuable Python and Java libraries to deduce the most useful ones for ASA. Based on a large variety of influential works in the domain, we deduce that the NLTK, Gensim, and TextBlob libraries are the most useful for ASA tasks in Python. As for Java ASA libraries, we conclude that the Weka and CoreNLP tools are the most used and yield strong results in this research domain.
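
    To make the comparison concrete, the snippet below shows the kind of one-line polarity call such libraries expose, using TextBlob. Note that TextBlob's default analyzer is trained on English, so a real Arabic pipeline would add translation or an Arabic-trained model; this is an illustrative sketch only, not drawn from the paper.

        from textblob import TextBlob

        def polarity_label(text):
            # polarity ranges from -1.0 (most negative) to 1.0 (most positive)
            score = TextBlob(text).sentiment.polarity
            if score > 0:
                return "positive"
            if score < 0:
                return "negative"
            return "neutral"

        print(polarity_label("I love this product"))  # -> "positive"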

    Character-level and syntax-level models for low-resource and multilingual natural language processing

    There are more than 7000 languages in the world, but only a small portion of them benefit from Natural Language Processing resources and models. Although languages generally present different characteristics, “cross-lingual bridges” can be exploited, such as transliteration signals and word alignment links. Such information, together with the availability of multiparallel corpora and the urge to overcome language barriers, motivates us to build models that represent more of the world’s languages. This thesis investigates cross-lingual links for improving the processing of low-resource languages with language-agnostic models at the character and syntax level. Specifically, we propose to (i) use orthographic similarities and transliteration between named entities and rare words in different languages to improve the construction of Bilingual Word Embeddings (BWEs) and named entity resources, and (ii) exploit multiparallel corpora for projecting labels from high- to low-resource languages, thereby gaining access to weakly supervised processing methods for the latter. In the first publication, we describe our approach for improving the translation of rare words and named entities for the Bilingual Dictionary Induction (BDI) task, using orthography and transliteration information. In our second work, we tackle BDI by enriching BWEs with orthography embeddings and a number of other features, using our classification-based system to overcome script differences among languages. The third publication describes cheap cross-lingual signals that should be considered when building mapping approaches for BWEs, since they are simple to extract, effective for bootstrapping the mapping of BWEs, and able to overcome the failure of unsupervised methods. The fourth paper presents our approach for extracting a named entity resource for 1340 languages, including very low-resource languages from all major areas of linguistic diversity; we exploit parallel corpus statistics and transliteration models and obtain improved performance over prior work. Lastly, the fifth work models annotation projection as a graph-based label propagation problem for the part-of-speech tagging task. Part-of-speech models trained on our labeled sets outperform prior work for low-resource languages like Bambara (an African language spoken in Mali), Erzya (a Uralic language spoken in Russia’s Republic of Mordovia), Manx (the Celtic language of the Isle of Man), and Yoruba (a Niger-Congo language spoken in Nigeria and surrounding countries).
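
    The projection idea in the fifth work can be illustrated compactly: source-language tokens carry gold POS tags, word-alignment edges link them to target-language tokens, and each target token takes the majority tag among its aligned neighbours. The sketch below shows this single propagation step; the thesis's graph-based method is more elaborate, so treat this as a schematic, not the actual algorithm.

        from collections import Counter

        def project_pos(source_tags, alignments):
            # source_tags: {source_token_id: tag}
            # alignments: [(source_token_id, target_token_id), ...]
            votes = {}
            for src, tgt in alignments:
                if src in source_tags:
                    votes.setdefault(tgt, Counter())[source_tags[src]] += 1
            # Majority vote per target token; unaligned tokens stay unlabeled.
            return {tgt: c.most_common(1)[0][0] for tgt, c in votes.items()}

        print(project_pos({0: "NOUN", 1: "VERB"}, [(0, 2), (1, 0), (1, 1)]))
        # -> {2: 'NOUN', 0: 'VERB', 1: 'VERB'}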

    SemEval-2020 Task 12: Multilingual Offensive Language Identification in Social Media (OffensEval 2020)

    We present the results and main findings of SemEval-2020 Task 12 on Multilingual Offensive Language Identification in Social Media (OffensEval 2020). The task involves three subtasks corresponding to the hierarchical taxonomy of the OLID schema (Zampieri et al., 2019a) from OffensEval 2019. The task featured five languages (English, Arabic, Danish, Greek, and Turkish) for Subtask A; in addition, English also featured Subtasks B and C. OffensEval 2020 was one of the most popular tasks at SemEval-2020, attracting a large number of participants across all subtasks and all languages. A total of 528 teams signed up to participate in the task, 145 teams submitted systems during the evaluation period, and 70 submitted system description papers.
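
    The hierarchical OLID taxonomy behind the three subtasks can be sketched as a cascade: Subtask A decides offensive vs. not (OFF/NOT), Subtask B decides targeted vs. untargeted (TIN/UNT), and Subtask C picks the target type (IND/GRP/OTH). The sketch below shows the cascade only; the three classifiers are stand-ins for whatever models a participating team trains.

        def classify_olid(text, clf_a, clf_b, clf_c):
            # Later levels only fire when the earlier level makes them relevant.
            if clf_a(text) == "NOT":            # Subtask A: offensive or not
                return ("NOT", None, None)
            if clf_b(text) == "UNT":            # Subtask B: targeted or untargeted
                return ("OFF", "UNT", None)
            return ("OFF", "TIN", clf_c(text))  # Subtask C: IND / GRP / OTH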

    Modified EDA and Backtranslation Augmentation in Deep Learning Models for Indonesian Aspect-Based Sentiment Analysis

    In the process of developing a business, aspect-based sentiment analysis (ABSA) can help extract customers' opinions on different aspects of the business from online reviews. Researchers have found great promise in deep learning approaches to solving ABSA tasks. Furthermore, studies have also explored the implementation of text augmentation, such as Easy Data Augmentation (EDA), to improve deep learning models' performance using only simple operations. However, when applying EDA to ABSA, there is a high chance that the augmented sentences will lose important aspect- or sentiment-related words (target words) critical for training. Correspondingly, another study made adjustments to EDA for English aspect-based sentiment data provided with target word tags. However, that solution still needs additional modifications in the case of non-tagged data. Hence, in this work, we focus on modifying EDA to integrate POS tagging and word similarity, not only to understand the context of the words but also to extract the target words directly from non-tagged sentences. Additionally, the modified EDA is combined with the backtranslation method, as the latter has also shown a significant contribution to model performance in several studies. The proposed method is then evaluated on a small Indonesian ABSA dataset using baseline deep learning models. Results show that the augmentation method can increase a model's performance on a limited dataset. In general, the best performance for aspect classification is achieved by implementing the proposed method, which increases macro-accuracy and F1, respectively, on Long Short-Term Memory (LSTM) and Bidirectional LSTM models compared with the original EDA. The proposed method also obtained the best performance for sentiment classification using a convolutional neural network, increasing overall accuracy by 2.2% and F1 by 3.2%. DOI: 10.28991/ESJ-2023-07-01-018
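
    The core modification can be sketched briefly: an EDA-style synonym replacement that first POS-tags the sentence and protects likely target words from being replaced. The sketch below uses English NLTK resources and treats nouns as the protected target words purely for illustration; the paper works on Indonesian data and adds word similarity on top of this idea.

        import random
        from nltk import pos_tag, word_tokenize
        from nltk.corpus import wordnet
        # requires: nltk.download("punkt"), ("averaged_perceptron_tagger"), ("wordnet")

        def protected_synonym_replacement(sentence, n=1):
            tokens = word_tokenize(sentence)
            # Treat nouns as stand-ins for aspect/sentiment target words.
            protected = {w for w, tag in pos_tag(tokens) if tag.startswith("NN")}
            candidates = [i for i, w in enumerate(tokens)
                          if w not in protected and wordnet.synsets(w)]
            for i in random.sample(candidates, min(n, len(candidates))):
                lemmas = wordnet.synsets(tokens[i])[0].lemma_names()
                tokens[i] = lemmas[0].replace("_", " ")  # may return the word itself
            return " ".join(tokens)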

    QMUL-SDS at EXIST: Leveraging pre-trained semantics and lexical features for multilingual sexism detection in social networks

    Online sexism is an increasing concern for those who experience gender-based abuse on social media platforms, as it has affected the healthy development of the Internet with negative impacts on society. The EXIST shared task proposes the first task on sEXism Identification in Social neTworks (EXIST) at IberLEF 2021 [30]. It provides a benchmark sexism dataset with Twitter and Gab posts in both English and Spanish, along with a task articulated in two subtasks covering sexism detection at different levels of granularity: Subtask 1, Sexism Identification, is a classical binary classification task to determine whether a given text is sexist or not, while Subtask 2, Sexism Categorisation, is a finer-grained classification task focused on distinguishing different types of sexism. In this paper, we describe the participation of the QMUL-SDS team in EXIST. We propose an architecture made of the last 4 hidden states of XLM-RoBERTa and a TextCNN with 3 kernels. Our model also exploits lexical features relying on new and existing lexicons of abusive words, with a special focus on sexist slurs and abusive words targeting women. Our team ranked 11th in Subtask 1 and 4th in Subtask 2 among all the teams on the leaderboard, clearly outperforming the baselines offered by EXIST.
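
    The described architecture can be sketched in PyTorch as follows: the last four hidden states of XLM-RoBERTa are concatenated per token and passed through a TextCNN with three kernel sizes. The kernel sizes, filter count, and the omission of the lexical features are assumptions made for illustration, not the team's exact configuration.

        import torch
        import torch.nn as nn
        from transformers import XLMRobertaModel

        class XlmrTextCNN(nn.Module):
            def __init__(self, n_classes=2, n_filters=100, kernel_sizes=(3, 4, 5)):
                super().__init__()
                self.encoder = XLMRobertaModel.from_pretrained(
                    "xlm-roberta-base", output_hidden_states=True)
                dim = self.encoder.config.hidden_size * 4  # last 4 layers, concatenated
                self.convs = nn.ModuleList(
                    [nn.Conv1d(dim, n_filters, k) for k in kernel_sizes])
                self.fc = nn.Linear(n_filters * len(kernel_sizes), n_classes)

            def forward(self, input_ids, attention_mask):
                out = self.encoder(input_ids, attention_mask=attention_mask)
                x = torch.cat(out.hidden_states[-4:], dim=-1)  # (batch, seq, 4*hidden)
                x = x.transpose(1, 2)                          # Conv1d wants channels first
                pooled = [torch.relu(c(x)).max(dim=2).values for c in self.convs]
                return self.fc(torch.cat(pooled, dim=1))       # class logits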