ParsBERT: Transformer-based Model for Persian Language Understanding
The surge of pre-trained language models has ushered in a new era in Natural Language Processing (NLP) by allowing us to build powerful language models. Among these, Transformer-based models such as BERT have become increasingly popular due to their state-of-the-art performance. However, these models are usually focused on English, leaving other languages to multilingual models with limited resources. This paper proposes a monolingual BERT for the Persian language (ParsBERT), which demonstrates state-of-the-art performance compared to other architectures and to multilingual models. Since the amount of data available for Persian NLP tasks is very limited, a large dataset covering several NLP tasks, as well as a corpus for pre-training the model, was also compiled. ParsBERT obtains higher scores on all datasets, both existing and newly composed, and improves the state of the art by outperforming both multilingual BERT and prior work on Sentiment Analysis, Text Classification, and Named Entity Recognition tasks.
Comment: 10 pages, 5 figures, 7 tables; table 7 corrected and some refs related to table
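As a practical illustration, ParsBERT can be loaded through the Hugging Face transformers library. This is a minimal sketch for extracting sentence features; the hub identifier "HooshvareLab/bert-base-parsbert-uncased" is the name commonly associated with this paper's release and should be verified before use.

```python
# Minimal sketch: loading ParsBERT as a feature extractor via Hugging Face
# transformers. The model id below is an assumption; verify it on the hub.
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("HooshvareLab/bert-base-parsbert-uncased")
model = AutoModel.from_pretrained("HooshvareLab/bert-base-parsbert-uncased")

inputs = tokenizer("این یک جمله فارسی است.", return_tensors="pt")
outputs = model(**inputs)
# Mean-pool the final hidden states into one sentence vector.
sentence_embedding = outputs.last_hidden_state.mean(dim=1)
print(sentence_embedding.shape)  # (1, hidden_size)
```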
Persian Vowel recognition with MFCC and ANN on PCVC speech dataset
In this paper a new method for recognition of consonant-vowel phonemes
combination on a new Persian speech dataset titled as PCVC (Persian
Consonant-Vowel Combination) is proposed which is used to recognize Persian
phonemes. In PCVC dataset, there are 20 sets of audio samples from 10 speakers
which are combinations of 23 consonant and 6 vowel phonemes of Persian
language. In each sample, there is a combination of one vowel and one
consonant. First, the consonant phoneme is pronounced and just after it, the
vowel phoneme is pronounced. Each sound sample is a frame of 2 seconds of
audio. In every 2 seconds, there is an average of 0.5 second speech and the
rest is silence. In this paper, the proposed method is the implementations of
the MFCC (Mel Frequency Cepstrum Coefficients) on every partitioned sound
sample. Then, every train sample of MFCC vector is given to a multilayer
perceptron feed-forward ANN (Artificial Neural Network) for training process.
At the end, the test samples are examined on ANN model for phoneme recognition.
After training and testing process, the results are presented in recognition of
vowels. Then, the average percent of recognition for vowel phonemes are
computed.Comment: The 5th International Conference of Electrical Engineering, Computer
Science and Information Technology 201
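The described pipeline (MFCC features from short clips fed to a feed-forward network) can be sketched as follows. File paths, label extraction, and all hyperparameters here are illustrative assumptions, not the authors' exact configuration.

```python
# Sketch of the MFCC + MLP vowel-recognition pipeline described above,
# using librosa for features and scikit-learn for the feed-forward ANN.
import numpy as np
import librosa
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

def mfcc_features(path, sr=22050, n_mfcc=13):
    y, sr = librosa.load(path, sr=sr, duration=2.0)   # 2-second clips, as in PCVC
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return mfcc.mean(axis=1)                          # average frames -> fixed vector

# Hypothetical dataset layout: (wav_path, vowel_label) pairs.
samples = [("pcvc/sample_001.wav", "a"),
           ("pcvc/sample_002.wav", "e")]              # ... many more in practice
X = np.stack([mfcc_features(p) for p, _ in samples])
y = np.array([label for _, label in samples])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500)
clf.fit(X_train, y_train)
print("vowel recognition accuracy:", clf.score(X_test, y_test))
```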
Creating a New Persian Poet Based on Machine Learning
In this article we describe an application of Machine Learning (ML) and linguistic modeling to generate Persian poems. In effect, we teach a machine by having it read and learn from Persian poems so that it can generate new poems in the same style as the originals. As two well-known poets, we used the poems of Hafez (1310-1390) and Saadi (1210-1292). First we feed the machine the poems of Hafez to generate poems in the same style, and then we feed it the poems of both Hafez and Saadi to generate poems in a new style that combines the two poets' styles, with emotional (Hafez) and rational (Saadi) elements. This idea of combining different styles with ML opens new gates for extending the treasure of past literature across cultures. Results show that, with enough memory, processing power, and time, it is possible to generate reasonably good poems.
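A common way to realize this kind of style learning is a character-level language model. Below is a minimal sketch of one such generator; the corpus file "poems.txt", the window length, and all hyperparameters are illustrative assumptions rather than the article's actual setup.

```python
# Minimal character-level language model for poem generation, assuming a
# plain-text corpus of Hafez (or Hafez + Saadi) poems in "poems.txt".
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

text = open("poems.txt", encoding="utf-8").read()      # hypothetical corpus file
chars = sorted(set(text))
idx = {c: i for i, c in enumerate(chars)}
seq_len = 40

# One-hot encode overlapping character windows and their next character.
X = np.zeros((len(text) - seq_len, seq_len, len(chars)), dtype=bool)
y = np.zeros((len(text) - seq_len, len(chars)), dtype=bool)
for i in range(len(text) - seq_len):
    for t, c in enumerate(text[i:i + seq_len]):
        X[i, t, idx[c]] = True
    y[i, idx[text[i + seq_len]]] = True

model = Sequential([LSTM(128, input_shape=(seq_len, len(chars))),
                    Dense(len(chars), activation="softmax")])
model.compile(loss="categorical_crossentropy", optimizer="adam")
model.fit(X, y, batch_size=128, epochs=20)

# Sample new "poems" character by character from a seed string.
seed = text[:seq_len]
for _ in range(200):
    x = np.zeros((1, seq_len, len(chars)), dtype=bool)
    for t, c in enumerate(seed[-seq_len:]):
        x[0, t, idx[c]] = True
    probs = model.predict(x, verbose=0)[0]
    seed += chars[int(np.random.choice(len(chars), p=probs))]
print(seed)
```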
A Deep Learning Approach for Similar Languages, Varieties and Dialects
Deep learning mechanisms are currently the prevailing approaches for various tasks in natural language processing, speech recognition, image processing, and many other areas. We leverage deep-learning-based mechanisms, specifically Bidirectional Long Short-Term Memory (B-LSTM), for dialect identification in Arabic and German broadcast speech, and Long Short-Term Memory (LSTM) for discriminating between similar languages. Two B-LSTM models are created: one using Large-Vocabulary Continuous Speech Recognition (LVCSR) based lexical features, and one using fixed-length (400 per utterance) bottleneck features generated by the i-vector framework. These models were evaluated on the VarDial 2017 datasets for Arabic and German dialect identification (with the dialects Egyptian, Gulf, Levantine, North African, and MSA for Arabic, and Basel, Bern, Lucerne, and Zurich for German), as well as for discriminating between similar languages such as Bosnian, Croatian, and Serbian. The B-LSTM model achieved an accuracy of 0.246 on the lexical features and 0.577 on the bottleneck features from the i-vector framework.
Comment: 17 pages
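A B-LSTM classifier over fixed-length per-utterance features can be sketched as follows. The layer sizes, pooling choice, and number of dialect classes are assumptions for illustration, not the paper's exact model.

```python
# Sketch of a bidirectional LSTM classifier over 400-dim per-utterance
# bottleneck features, in the spirit of the B-LSTM model described above.
import torch
import torch.nn as nn

class BLSTMDialectClassifier(nn.Module):
    def __init__(self, feat_dim=400, hidden=128, n_dialects=5):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True,
                            bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_dialects)  # forward + backward states

    def forward(self, x):               # x: (batch, seq_len, feat_dim)
        h, _ = self.lstm(x)
        return self.out(h.mean(dim=1))  # pool over time, then classify

model = BLSTMDialectClassifier()
dummy = torch.randn(8, 10, 400)        # 8 utterances, 10 feature frames each
logits = model(dummy)                  # (8, 5) dialect scores
print(logits.shape)
```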
SyntaxNet Models for the CoNLL 2017 Shared Task
We describe a baseline dependency parsing system for the CoNLL 2017 Shared Task. This system, which we call "ParseySaurus," uses the DRAGNN framework [Kong et al., 2017] to combine transition-based recurrent parsing and tagging with character-based word representations. On the v1.3 Universal Dependencies treebanks, the new system outperforms the publicly available, state-of-the-art "Parsey's Cousins" models by 3.47% absolute Labeled Accuracy Score (LAS) across 52 treebanks.
Comment: Tech report
Character Eyes: Seeing Language through Character-Level Taggers
Character-level models have been used extensively in recent years in NLP
tasks as both supplements and replacements for closed-vocabulary token-level
word representations. In one popular architecture, character-level LSTMs are
used to feed token representations into a sequence tagger predicting
token-level annotations such as part-of-speech (POS) tags. In this work, we
examine the behavior of POS taggers across languages from the perspective of
individual hidden units within the character LSTM. We aggregate the behavior of
these units into language-level metrics which quantify the challenges that
taggers face on languages with different morphological properties, and identify
links between synthesis and affixation preference and emergent behavior of the
hidden tagger layer. In a comparative experiment, we show how modifying the
balance between forward and backward hidden units affects model arrangement and
performance in these types of languages.
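The architecture under analysis, a per-token character LSTM whose hidden units feed a tagger, can be reconstructed schematically as below. This is an illustrative reconstruction with assumed dimensions, not the authors' code.

```python
# Sketch of the char-LSTM -> token-representation setup whose individual
# hidden units the paper analyzes.
import torch
import torch.nn as nn

class CharTokenEncoder(nn.Module):
    def __init__(self, n_chars=100, char_dim=32, hidden=64):
        super().__init__()
        self.embed = nn.Embedding(n_chars, char_dim)
        self.lstm = nn.LSTM(char_dim, hidden, batch_first=True,
                            bidirectional=True)

    def forward(self, char_ids):            # (n_tokens, max_chars)
        h, _ = self.lstm(self.embed(char_ids))
        # Final-step output as the per-token vector; each of its dimensions
        # is one forward or backward hidden "unit" that can be inspected.
        return h[:, -1, :]

enc = CharTokenEncoder()
tokens = torch.randint(0, 100, (5, 12))     # 5 tokens, 12 characters each
reps = enc(tokens)
print(reps.shape)                           # (5, 128): 64 forward + 64 backward units
```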
Multi Task Deep Morphological Analyzer: Context Aware Joint Morphological Tagging and Lemma Prediction
The ambiguities introduced by the recombination of morphemes, which yields several possible inflections for a word, make the prediction of syntactic traits in Morphologically Rich Languages (MRLs) a notoriously complicated task. We propose the Multi-Task Deep Morphological Analyzer (MT-DMA), a character-level neural morphological analyzer based on multi-task learning of word-level tag markers for Hindi and Urdu. MT-DMA predicts a set of six morphological tags for words of Indo-Aryan languages: part-of-speech (POS), gender (G), number (N), person (P), case (C), and the tense-aspect-modality (TAM) marker, as well as the lemma (L), by jointly learning all of these in one trainable framework. We show the effectiveness of training such deep neural networks by simultaneously optimizing multiple loss functions and sharing initial parameters for context-aware morphological analysis. Exploiting character-level features in a phonological space optimized for each tag using a multi-objective genetic algorithm, our model establishes new state-of-the-art accuracy scores on all seven tasks for both languages. MT-DMA is publicly accessible: code, models, and data are available at https://github.com/Saurav0074/morph_analyzer.
Comment: 28 pages, 8 figures, 11 tables
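The core multi-task idea, a shared encoder with one head per tag and a summed loss, can be sketched as follows. The label counts, encoder shape, and omission of the lemma decoder are simplifying assumptions; see the repository above for the actual model.

```python
# Sketch of multi-task tagging with a shared character encoder and per-tag
# heads, trained by summing the per-task cross-entropy losses.
import torch
import torch.nn as nn

TASKS = {"POS": 17, "G": 3, "N": 2, "P": 3, "C": 8, "TAM": 10}  # assumed label counts

class MultiTaskTagger(nn.Module):
    def __init__(self, n_chars=128, char_dim=32, hidden=64):
        super().__init__()
        self.embed = nn.Embedding(n_chars, char_dim)
        self.encoder = nn.LSTM(char_dim, hidden, batch_first=True)
        self.heads = nn.ModuleDict({t: nn.Linear(hidden, n) for t, n in TASKS.items()})

    def forward(self, char_ids):                 # (batch, max_chars)
        h, _ = self.encoder(self.embed(char_ids))
        word_vec = h[:, -1, :]                   # shared word representation
        return {t: head(word_vec) for t, head in self.heads.items()}

model = MultiTaskTagger()
x = torch.randint(0, 128, (4, 15))               # 4 words, 15 characters each
gold = {t: torch.randint(0, n, (4,)) for t, n in TASKS.items()}
loss = sum(nn.functional.cross_entropy(logits, gold[t])
           for t, logits in model(x).items())    # joint optimization of all tags
loss.backward()
```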
A Deep Learning-Based Approach for Measuring the Domain Similarity of Persian Texts
In this paper, we propose a novel approach for measuring the degree of similarity between the categories of two pieces of Persian text, each published as the description of a separate advertisement. We built a dataset for this work from advertisements posted on an e-commerce website: we generated a large number of text pairs from this source and assigned each pair a score from 0 to 3 indicating the degree of similarity between the domains of the pair. We represent words with word-embedding vectors derived from word2vec, and deep neural network models are then used to represent the texts. Finally, we concatenate the absolute difference and the element-wise product of the two text representations and feed the result to a fully-connected neural network, which produces a probability distribution over the pair's score. Using a supervised learning approach, we trained our model on a GPU; our best model achieved an F1 score of 0.9865.
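The pair-scoring head is concrete enough to sketch directly: combine the two text vectors via |u - v| concatenated with u * v, then classify into the four score levels. The text-encoder dimensions and hidden size below are assumptions.

```python
# Sketch of the described pair-scoring head: absolute difference and
# element-wise product, concatenated and fed to a fully-connected network
# that outputs a distribution over the similarity scores 0..3.
import torch
import torch.nn as nn

class PairScorer(nn.Module):
    def __init__(self, text_dim=300, hidden=128, n_scores=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(2 * text_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_scores))

    def forward(self, u, v):                      # u, v: (batch, text_dim)
        features = torch.cat([(u - v).abs(), u * v], dim=-1)
        return self.fc(features).softmax(dim=-1)  # distribution over scores 0..3

scorer = PairScorer()
u, v = torch.randn(2, 300), torch.randn(2, 300)   # e.g. word2vec-based text vectors
print(scorer(u, v))                               # (2, 4) probabilities
```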
Improving Lemmatization of Non-Standard Languages with Joint Learning
Lemmatization of standard languages is concerned with (i) abstracting over
morphological differences and (ii) resolving token-lemma ambiguities of
inflected words in order to map them to a dictionary headword. In the present paper we aim to improve lemmatization performance on a set of non-standard historical languages, in which the difficulty is increased by an additional aspect (iii): spelling variation due to the lack of orthographic standards. We approach lemmatization as a string-transduction task with an encoder-decoder architecture, which we enrich with sentence-context information using a hierarchical sentence encoder. We show significant improvements over the state of the art when training the sentence encoder jointly for lemmatization and language modeling. Crucially, our architecture does not require POS or morphological annotations, which are not always available for historical corpora. Additionally, we test the proposed model on a set of typologically diverse standard languages, showing results on par with or better than a model without enhanced sentence representations and previous state-of-the-art systems. Finally, to encourage future work on the processing of non-standard varieties, we release the dataset of non-standard languages underlying the present study, based on openly accessible sources.
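A heavily simplified version of this string-transduction setup is sketched below: a character-level encoder-decoder with a slot for an extra sentence-context vector. Attention, the hierarchical sentence encoder, and the joint language-modeling objective are omitted; all sizes and names are assumptions.

```python
# Simplified sketch of an encoder-decoder lemmatizer that conditions the
# decoder on a sentence-context vector, as in the architecture above.
import torch
import torch.nn as nn

class Seq2SeqLemmatizer(nn.Module):
    def __init__(self, n_chars=100, emb=32, hidden=64, ctx_dim=64):
        super().__init__()
        self.embed = nn.Embedding(n_chars, emb)
        self.encoder = nn.GRU(emb, hidden, batch_first=True)
        self.decoder = nn.GRU(emb, hidden + ctx_dim, batch_first=True)
        self.out = nn.Linear(hidden + ctx_dim, n_chars)

    def forward(self, form, lemma_in, sent_ctx):
        # form: (B, Tf) chars of the inflected token; sent_ctx: (B, ctx_dim)
        _, h = self.encoder(self.embed(form))
        h0 = torch.cat([h[-1], sent_ctx], dim=-1).unsqueeze(0)
        dec, _ = self.decoder(self.embed(lemma_in), h0)
        return self.out(dec)                      # (B, Tl, n_chars) char logits

model = Seq2SeqLemmatizer()
form = torch.randint(0, 100, (4, 10))
lemma_in = torch.randint(0, 100, (4, 8))          # teacher-forced lemma prefix
ctx = torch.randn(4, 64)                          # from a sentence encoder
print(model(form, lemma_in, ctx).shape)
```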
Low-dimensional Query Projection based on Divergence Minimization Feedback Model for Ad-hoc Retrieval
Low-dimensional word vectors have long been used in a wide range of applications in natural language processing. In this paper we shed light on estimating query vectors in ad-hoc retrieval, where only limited information is available in the original query. Pseudo-relevance feedback (PRF) is a well-known technique for updating query language models and expanding queries with relevant terms. We formulate the query update in low-dimensional spaces, first by rotating the query vector and then by scaling it; these consecutive steps are embedded in a query-specific projection matrix capturing both angle and scale. Based on this query projection algorithm, we propose a new (though not necessarily the most effective) technique for PRF in language modeling. We learn an embedded coefficient matrix for each query, whose aim is to improve the vector representation of the query by transforming it to a more reliable space, and then update the query language model. The proposed embedded coefficient divergence minimization model (ECDMM) takes the top-ranked documents retrieved by the query and derives positive and negative sample sets; these samples are used to learn the coefficient matrix, which is then used to project the query vector and update the query language model via a softmax function. Experimental results on several TREC and CLEF datasets in several languages demonstrate the effectiveness of ECDMM: the new query formulation works as well as state-of-the-art PRF techniques overall, and significantly outperforms them on a TREC collection in terms of MAP, P@5, and P@10.
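The core projection-and-softmax step can be illustrated numerically. In this sketch, the learned query-specific matrix W (which would come from the positive/negative feedback samples) is replaced by a stand-in; the shapes and names are assumptions.

```python
# Illustrative sketch of the ECDMM projection step: a query vector is mapped
# by a learned matrix W (capturing rotation and scaling), and the updated
# query language model is read off with a softmax over vocabulary similarities.
import numpy as np

def project_query_lm(q_vec, W, vocab_embs):
    """q_vec: (d,), W: (d, d), vocab_embs: (V, d) -> (V,) query language model."""
    q_proj = W @ q_vec                         # rotate + scale the query vector
    scores = vocab_embs @ q_proj               # similarity to each vocab term
    scores -= scores.max()                     # numerical stability
    probs = np.exp(scores)
    return probs / probs.sum()                 # softmax -> P(term | query)

d, V = 100, 5000
q = np.random.randn(d)
W = np.eye(d) + 0.01 * np.random.randn(d, d)   # stand-in for the learned matrix
E = np.random.randn(V, d)
p = project_query_lm(q, W, E)
print(p.sum(), p.argmax())                     # LM sums to 1; top expansion term
```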