
    ParsBERT: Transformer-based Model for Persian Language Understanding

    The surge of pre-trained language models has ushered in a new era in Natural Language Processing (NLP) by allowing us to build powerful language models. Among these models, Transformer-based models such as BERT have become increasingly popular due to their state-of-the-art performance. However, these models are usually focused on English, leaving other languages to multilingual models with limited resources. This paper proposes a monolingual BERT for the Persian language (ParsBERT), which demonstrates state-of-the-art performance compared to other architectures and to multilingual models. Moreover, since the amount of data available for NLP tasks in Persian is very restricted, a massive dataset is composed for different NLP tasks as well as for pre-training the model. ParsBERT obtains higher scores on all datasets, both existing and newly composed ones, and improves the state of the art by outperforming both multilingual BERT and prior works on Sentiment Analysis, Text Classification, and Named Entity Recognition tasks.
    Comment: 10 pages, 5 figures, 7 tables, table 7 corrected and some refs related to table
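
    As a rough illustration of how such a monolingual model is typically consumed downstream, the sketch below runs a BERT encoder with a classification head for Persian text using the Hugging Face transformers API. The checkpoint name and label count are assumptions for illustration, not details taken from the abstract.

```python
# Minimal sketch: a monolingual Persian BERT with a (to-be-fine-tuned)
# classification head, using the transformers library. The checkpoint name
# below is an assumption, not a detail from the abstract.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_ID = "HooshvareLab/bert-base-parsbert-uncased"  # assumed checkpoint name

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID, num_labels=2)

# Score one Persian sentence ("this movie was great") for binary sentiment.
batch = tokenizer(["این فیلم عالی بود"], padding=True, truncation=True,
                  return_tensors="pt")
with torch.no_grad():
    logits = model(**batch).logits        # shape: (1, num_labels)
pred = logits.argmax(dim=-1).item()       # predicted class index
```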

    Persian Vowel recognition with MFCC and ANN on PCVC speech dataset

    In this paper, a new method for recognizing consonant-vowel phoneme combinations is proposed on a new Persian speech dataset, PCVC (Persian Consonant-Vowel Combination), which is used to recognize Persian phonemes. The PCVC dataset contains 20 sets of audio samples from 10 speakers, covering combinations of the 23 consonant and 6 vowel phonemes of the Persian language. Each sample combines one consonant and one vowel: the consonant phoneme is pronounced first, immediately followed by the vowel phoneme. Each sound sample is a 2-second frame of audio containing, on average, 0.5 seconds of speech, with the rest being silence. The proposed method computes MFCC (Mel Frequency Cepstrum Coefficients) features on every partitioned sound sample. Each training sample's MFCC vector is then given to a multilayer perceptron feed-forward ANN (Artificial Neural Network) for training. Finally, the test samples are evaluated on the ANN model for phoneme recognition. After training and testing, the results for vowel recognition are presented, and the average recognition rate for vowel phonemes is computed.
    Comment: The 5th International Conference of Electrical Engineering, Computer Science and Information Technology 201
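
    A minimal sketch of the described pipeline, assuming librosa for MFCC extraction and scikit-learn's multilayer perceptron as the feed-forward ANN; file paths, sampling rate, and MFCC/MLP settings are illustrative assumptions.

```python
# Sketch: average MFCC features from each 2-second PCVC sample, then train a
# feed-forward multilayer perceptron on them. All settings are assumptions.
import numpy as np
import librosa
from sklearn.neural_network import MLPClassifier

def mfcc_vector(path, sr=16000, n_mfcc=13):
    """Load one 2-second sample and return a fixed-length MFCC vector."""
    signal, sr = librosa.load(path, sr=sr, duration=2.0)
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)
    return mfcc.mean(axis=1)              # average over frames -> (n_mfcc,)

def train_vowel_classifier(train_paths, train_labels):
    """Fit the MLP on one MFCC vector per training sample (labels = vowels)."""
    X = np.stack([mfcc_vector(p) for p in train_paths])
    clf = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500)
    clf.fit(X, train_labels)
    return clf

def vowel_accuracy(clf, test_paths, test_labels):
    """Recognition rate of the trained model on held-out samples."""
    X = np.stack([mfcc_vector(p) for p in test_paths])
    return clf.score(X, test_labels)
```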

    Creating a New Persian Poet Based on Machine Learning

    In this article we describe an application of Machine Learning (ML) and linguistic modeling to generate Persian poems. In essence, we teach the machine by having it read and learn from Persian poems so that it can generate fake poems in the same style as the originals. We used the poems of two well-known poets, Hafez (1310-1390) and Saadi (1210-1292). First we feed the machine Hafez's poems to generate fake poems in the same style, and then we feed it both Hafez's and Saadi's poems to generate poems in a new style that combines the two poets' styles, with emotional (Hafez) and rational (Saadi) elements. This idea of combining different styles with ML opens new gates for extending the treasure of past literature of different cultures. Results show that, with enough memory, processing power, and time, it is possible to generate reasonably good poems.
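
    The abstract does not name a concrete model, so the sketch below assumes a character-level LSTM language model trained on the poems and sampled autoregressively; all names and sizes are illustrative.

```python
# Character-level LSTM language model for poem generation (assumed architecture;
# the abstract does not specify one). Sizes and names are illustrative.
import torch
import torch.nn as nn

class CharPoetLM(nn.Module):
    """Next-character predictor: given a character prefix, score the next character."""
    def __init__(self, vocab_size, embed_dim=64, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, char_ids):              # char_ids: (batch, seq_len)
        h, _ = self.lstm(self.embed(char_ids))
        return self.out(h)                    # (batch, seq_len, vocab_size) logits

@torch.no_grad()
def sample(model, seed_ids, length=200, temperature=0.8):
    """Autoregressively sample character ids after a seed sequence."""
    ids = list(seed_ids)
    for _ in range(length):
        logits = model(torch.tensor([ids]))[0, -1] / temperature
        probs = torch.softmax(logits, dim=-1)
        ids.append(int(torch.multinomial(probs, 1)))
    return ids
```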

    A Deep Learning Approach for Similar Languages, Varieties and Dialects

    Deep learning mechanisms are the prevailing approach these days for various tasks in natural language processing, speech recognition, image processing, and many other areas. We leverage deep learning, specifically Bidirectional Long Short-Term Memory (B-LSTM), for dialect identification in Arabic and German broadcast speech, and Long Short-Term Memory (LSTM) for discriminating between similar languages. Two B-LSTM models are created, one using Large-Vocabulary Continuous Speech Recognition (LVCSR) based lexical features and one using fixed-length (400 per utterance) bottleneck features generated by an i-vector framework. These models were evaluated on the VarDial 2017 datasets for Arabic and German dialect identification (with the dialects Egyptian, Gulf, Levantine, North African, and MSA for Arabic, and Basel, Bern, Lucerne, and Zurich for German) and for discriminating between similar languages such as Bosnian, Croatian, and Serbian. The B-LSTM model achieved an accuracy of 0.246 on the lexical features and 0.577 on the i-vector bottleneck features.
    Comment: 17 pages
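
    A minimal sketch, assuming the bottleneck-feature variant: a bidirectional LSTM reads a sequence of per-frame bottleneck features for one utterance and classifies its dialect. Feature dimensionality and layer sizes are assumptions.

```python
# Bidirectional LSTM dialect classifier over per-utterance bottleneck features.
# Feature dimension and layer sizes are assumptions.
import torch
import torch.nn as nn

NUM_DIALECTS = 5   # e.g. Egyptian, Gulf, Levantine, North African, MSA
FEAT_DIM = 80      # assumed dimensionality of one bottleneck feature frame

class DialectBLSTM(nn.Module):
    def __init__(self, feat_dim=FEAT_DIM, hidden=128, n_classes=NUM_DIALECTS):
        super().__init__()
        self.blstm = nn.LSTM(feat_dim, hidden, bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden, n_classes)

    def forward(self, feats):                  # feats: (batch, n_frames, feat_dim)
        _, (h, _) = self.blstm(feats)
        utt = torch.cat([h[0], h[1]], dim=-1)  # final forward/backward states
        return self.out(utt)                   # (batch, n_classes) dialect scores
```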

    SyntaxNet Models for the CoNLL 2017 Shared Task

    We describe a baseline dependency parsing system for the CoNLL 2017 Shared Task. This system, which we call "ParseySaurus," uses the DRAGNN framework [Kong et al., 2017] to combine transition-based recurrent parsing and tagging with character-based word representations. On the v1.3 Universal Dependencies treebanks, the new system outperforms the publicly available, state-of-the-art "Parsey's Cousins" models by 3.47% absolute Labeled Accuracy Score (LAS) across 52 treebanks.
    Comment: Tech report
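
    For reference, the LAS figure quoted above is the fraction of tokens whose predicted head and dependency label both match the gold annotation; a small illustrative helper:

```python
# Illustrative helper for the LAS metric reported above: count tokens whose
# predicted (head, label) pair matches the gold annotation.
def labeled_accuracy_score(gold, pred):
    """gold/pred: lists of (head_index, dependency_label) pairs, one per token."""
    assert len(gold) == len(pred)
    correct = sum(1 for g, p in zip(gold, pred) if g == p)
    return correct / len(gold) if gold else 0.0

# Example: 3 of 4 tokens get both head and label right -> LAS = 0.75
gold = [(2, "nsubj"), (0, "root"), (2, "obj"), (2, "punct")]
pred = [(2, "nsubj"), (0, "root"), (3, "obj"), (2, "punct")]
print(labeled_accuracy_score(gold, pred))
```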

    Character Eyes: Seeing Language through Character-Level Taggers

    Character-level models have been used extensively in recent years in NLP tasks, both as supplements to and as replacements for closed-vocabulary token-level word representations. In one popular architecture, character-level LSTMs are used to feed token representations into a sequence tagger predicting token-level annotations such as part-of-speech (POS) tags. In this work, we examine the behavior of POS taggers across languages from the perspective of individual hidden units within the character LSTM. We aggregate the behavior of these units into language-level metrics which quantify the challenges that taggers face on languages with different morphological properties, and identify links between synthesis and affixation preference and the emergent behavior of the hidden tagger layer. In a comparative experiment, we show how modifying the balance between forward and backward hidden units affects model arrangement and performance in these types of languages.
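
    A minimal sketch of the architecture under analysis, assuming PyTorch: a character-level bidirectional LSTM builds one vector per token, and a word-level BiLSTM tagger predicts POS tags from those vectors; all sizes are illustrative.

```python
# Character-to-token POS tagger sketch: char BiLSTM -> token vectors -> word
# BiLSTM -> tag scores. Dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class CharToTokenTagger(nn.Module):
    def __init__(self, n_chars, n_tags, char_dim=32, char_hidden=64, word_hidden=128):
        super().__init__()
        self.char_embed = nn.Embedding(n_chars, char_dim)
        # forward/backward character units whose balance the paper analyzes
        self.char_lstm = nn.LSTM(char_dim, char_hidden, bidirectional=True,
                                 batch_first=True)
        self.word_lstm = nn.LSTM(2 * char_hidden, word_hidden, bidirectional=True,
                                 batch_first=True)
        self.out = nn.Linear(2 * word_hidden, n_tags)

    def forward(self, char_ids):
        # char_ids: (n_tokens, max_token_len) character indices for one sentence
        _, (h, _) = self.char_lstm(self.char_embed(char_ids))
        token_vecs = torch.cat([h[0], h[1]], dim=-1)   # (n_tokens, 2*char_hidden)
        word_out, _ = self.word_lstm(token_vecs.unsqueeze(0))
        return self.out(word_out.squeeze(0))           # (n_tokens, n_tags) tag scores
```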

    Multi Task Deep Morphological Analyzer: Context Aware Joint Morphological Tagging and Lemma Prediction

    The ambiguities introduced by the recombination of morphemes, which yields several possible inflections for a word, make the prediction of syntactic traits in Morphologically Rich Languages (MRLs) a notoriously complicated task. We propose the Multi Task Deep Morphological Analyzer (MT-DMA), a character-level neural morphological analyzer based on multi-task learning of word-level tag markers for Hindi and Urdu. MT-DMA predicts a set of six morphological tags for words of Indo-Aryan languages: Part-of-Speech (POS), Gender (G), Number (N), Person (P), Case (C), and the Tense-Aspect-Modality (TAM) marker, as well as the Lemma (L), by jointly learning all of them in one trainable framework. We show the effectiveness of training such deep neural networks by simultaneously optimizing multiple loss functions and sharing initial parameters for context-aware morphological analysis. Exploiting character-level features in a phonological space optimized for each tag using a multi-objective genetic algorithm, our model establishes a new state-of-the-art accuracy on all seven tasks for both languages. MT-DMA is publicly accessible: code, models and data are available at https://github.com/Saurav0074/morph_analyzer.
    Comment: 28 pages, 8 figures, 11 tables
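
    A minimal sketch of the multi-task idea, assuming PyTorch: a shared character-level encoder with one classification head per morphological tag, trained by summing the per-task losses. The lemma-generation head is omitted for brevity, and tag-set sizes and dimensions are assumptions.

```python
# Multi-task morphological tagger sketch: shared char-level encoder, one head
# per tag, joint loss = sum of per-task losses. Sizes are assumptions.
import torch
import torch.nn as nn

TASKS = {"POS": 30, "Gender": 3, "Number": 3, "Person": 4, "Case": 5, "TAM": 20}

class MultiTaskMorphTagger(nn.Module):
    def __init__(self, n_chars, char_dim=64, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(n_chars, char_dim)
        self.encoder = nn.LSTM(char_dim, hidden, bidirectional=True, batch_first=True)
        self.heads = nn.ModuleDict({t: nn.Linear(2 * hidden, n) for t, n in TASKS.items()})

    def forward(self, char_ids):
        # char_ids: (batch, max_word_len) characters of each word
        _, (h, _) = self.encoder(self.embed(char_ids))
        word_vec = torch.cat([h[0], h[1]], dim=-1)       # shared representation
        return {t: head(word_vec) for t, head in self.heads.items()}

def joint_loss(outputs, targets, criterion=nn.CrossEntropyLoss()):
    """Sum the losses of all tasks so one backward pass updates the shared encoder."""
    return sum(criterion(outputs[t], targets[t]) for t in outputs)
```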

    A Deep Learning-Based Approach for Measuring the Domain Similarity of Persian Texts

    In this paper, we propose a novel approach for measuring the degree of similarity between the categories of two pieces of Persian text published as descriptions of two separate advertisements. We built a dataset for this work from a collection of advertisements posted on an e-commerce website: we generated a large number of text pairs from this collection and assigned each pair a score from 0 to 3 indicating the degree of similarity between the domains of the pair. We represent words with word-embedding vectors derived from word2vec, and deep neural network models are then used to represent the texts. Finally, we employ the concatenation of the absolute difference and the bit-wise multiplication of the two text representations, followed by a fully-connected neural network, to produce a probability distribution vector over the scores of a pair. Through a supervised learning approach, we trained our model on a GPU, and our best model achieved an F1 score of 0.9865.
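
    A minimal sketch of the matching layer, assuming PyTorch and interpreting the "bit-wise" multiplication as an element-wise product of the two text vectors; vector and layer sizes are assumptions.

```python
# Pair-matching head sketch: features = [|u - v| ; u * v], then a small
# fully-connected network outputs a distribution over the 0-3 similarity scores.
import torch
import torch.nn as nn

class DomainSimilarityScorer(nn.Module):
    def __init__(self, text_dim=300, hidden=128, n_scores=4):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * text_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_scores),
        )

    def forward(self, u, v):
        # u, v: (batch, text_dim) representations of the two ad descriptions
        features = torch.cat([torch.abs(u - v), u * v], dim=-1)
        return torch.softmax(self.mlp(features), dim=-1)   # P(score = 0..3)
```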

    Improving Lemmatization of Non-Standard Languages with Joint Learning

    Lemmatization of standard languages is concerned with (i) abstracting over morphological differences and (ii) resolving token-lemma ambiguities of inflected words in order to map them to a dictionary headword. In the present paper we aim to improve lemmatization performance on a set of non-standard historical languages in which the difficulty is increased by an additional aspect (iii): spelling variation due to lacking orthographic standards. We approach lemmatization as a string-transduction task with an encoder-decoder architecture which we enrich with sentence context information using a hierarchical sentence encoder. We show significant improvements over the state of the art when training the sentence encoder jointly for lemmatization and language modeling. Crucially, our architecture does not require POS or morphological annotations, which are not always available for historical corpora. Additionally, we test the proposed model on a set of typologically diverse standard languages, showing results on par with or better than a model without enhanced sentence representations and previous state-of-the-art systems. Finally, to encourage future work on the processing of non-standard varieties, we release the dataset of non-standard languages underlying the present study, based on openly accessible sources.
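
    A minimal sketch of lemmatization as string transduction, assuming PyTorch: a character-level encoder-decoder whose decoder also receives a sentence-context vector. Attention and the joint language-modeling objective are omitted, and all sizes are assumptions.

```python
# Encoder-decoder lemmatizer sketch: encode the word's characters, decode the
# lemma's characters, and condition the decoder on a sentence-context vector.
import torch
import torch.nn as nn

class Seq2SeqLemmatizer(nn.Module):
    def __init__(self, n_chars, char_dim=64, hidden=128, ctx_dim=128):
        super().__init__()
        self.embed = nn.Embedding(n_chars, char_dim)
        self.encoder = nn.GRU(char_dim, hidden, batch_first=True)
        self.decoder = nn.GRU(char_dim + ctx_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_chars)

    def forward(self, word_chars, lemma_chars, sent_ctx):
        # word_chars: (batch, n_in) input word; lemma_chars: (batch, n_out) shifted
        # gold lemma (teacher forcing); sent_ctx: (batch, ctx_dim) sentence context.
        _, h = self.encoder(self.embed(word_chars))
        ctx = sent_ctx.unsqueeze(1).expand(-1, lemma_chars.size(1), -1)
        dec_in = torch.cat([self.embed(lemma_chars), ctx], dim=-1)
        dec_out, _ = self.decoder(dec_in, h)
        return self.out(dec_out)               # (batch, n_out, n_chars) logits
```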

    Low-dimensional Query Projection based on Divergence Minimization Feedback Model for Ad-hoc Retrieval

    Low-dimensional word vectors have long been used in a wide range of natural language processing applications. In this paper we shed light on estimating query vectors in ad-hoc retrieval, where only limited information is available in the original query. Pseudo-relevance feedback (PRF) is a well-known technique for updating query language models and expanding queries with a number of relevant terms. We formulate the query update in low-dimensional spaces, first by rotating the query vector and then by scaling it; these consecutive steps are embedded in a query-specific projection matrix capturing both angle and scale. Based on this query-projection algorithm, we propose a new (though not necessarily the most effective) technique for PRF in language modeling. We learn an embedded coefficient matrix for each query, whose aim is to improve the vector representation of the query by transforming it into a more reliable space, and then update the query language model. The proposed embedded coefficient divergence minimization model (ECDMM) takes the top-ranked documents retrieved by the query and obtains positive and negative sample sets; these samples are used to learn the coefficient matrix, which is then used to project the query vector and update the query language model via a softmax function. Experimental results on several TREC and CLEF datasets in several languages demonstrate the effectiveness of ECDMM. They reveal that the new formulation works as well as state-of-the-art PRF techniques overall, and significantly outperforms them on a TREC collection in terms of MAP, P@5, and P@10.
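
    A minimal sketch of the projection-and-update step, assuming NumPy: a learned per-query matrix (capturing rotation and scaling) transforms the query embedding, and a softmax over similarities with candidate term embeddings yields the updated query language model. Learning the matrix from the positive/negative feedback samples is not shown; all names are illustrative.

```python
# Query projection + softmax update sketch. The matrix M stands in for the
# learned per-query coefficient matrix; here it is just a placeholder argument.
import numpy as np

def project_and_update(query_vec, term_embeddings, M, tau=1.0):
    """
    query_vec:       (d,) low-dimensional query embedding
    term_embeddings: (V, d) embeddings of candidate expansion terms
    M:               (d, d) learned projection (rotation + scaling) matrix
    Returns a probability distribution over the V terms (updated query LM).
    """
    projected = M @ query_vec                    # move the query to a better region
    scores = term_embeddings @ projected / tau   # similarity to each candidate term
    scores -= scores.max()                       # numerical stability
    probs = np.exp(scores)
    return probs / probs.sum()                   # softmax over vocabulary terms
```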