884 research outputs found
Authorship Identification in Bengali Literature: a Comparative Analysis
Stylometry is the study of the unique linguistic styles and writing behaviors
of individuals. It belongs to the core tasks of text categorization, such as
authorship identification and plagiarism detection. Though a reasonable number
of studies have been conducted in English, no major work has been done
so far in Bengali. In this work, we present a demonstration of authorship
identification for documents written in Bengali. We adopt a set of
fine-grained stylistic features for the analysis of the text and use them to
develop two different models: a statistical similarity model consisting of three
measures and their combination, and a machine learning model with Decision Tree,
Neural Network and SVM. Experimental results show that SVM outperforms the other
state-of-the-art methods after 10-fold cross-validation. We also validate the
relative importance of each stylistic feature to show that some of them remain
consistently significant in every model used in this experiment.
Comment: 9 pages, 5 tables, 4 pictures
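As a rough illustration of the 10-fold cross-validation mentioned in this abstract, the sketch below splits sample indices into ten nearly equal folds and yields one train/test pair per fold. The function name `k_fold_indices` is hypothetical, and the abstract does not say how the folds were formed (for example, whether documents were shuffled or stratified by author), so this is only one plausible reading.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_indices, test_indices) pairs for k nearly equal folds.

    Assumption: folds are contiguous and unshuffled; the abstract does not
    describe the actual splitting procedure.
    """
    indices = list(range(n_samples))
    # The first (n_samples % k) folds receive one extra sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


if __name__ == "__main__":
    for fold, (train, test) in enumerate(k_fold_indices(25, k=10)):
        print(fold, len(train), len(test))
```

Each document is used for testing exactly once, and the reported score would typically be the average over the ten folds.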
Authorship Classification in a Resource Constraint Language Using Convolutional Neural Networks
Authorship classification is a method of automatically determining the appropriate author of an unknown linguistic text. Although research on authorship classification has progressed significantly in high-resource languages, it is at a primitive stage for resource-constrained languages like Bengali. This paper presents an authorship classification approach based on Convolutional Neural Networks (CNN) comprising four modules: embedding model generation, feature representation, classifier training and classifier testing. For this purpose, this work develops a new embedding corpus (named WEC) and a Bengali authorship classification corpus (called BACC-18), which are more robust in terms of authors’ classes and unique words. Using three text embedding techniques (Word2Vec, GloVe and FastText) and combinations of different hyperparameters, 90 embedding models are created in this study. All the embedding models are assessed by intrinsic evaluators, and the 9 best-performing models out of the 90 are selected for authorship classification. In total, 36 classification models, combining four classifiers (CNN, LSTM, SVM, SGD) with three embedding techniques at 100, 200 and 250 embedding dimensions, are trained with optimized hyperparameters and tested on three benchmark datasets (BACC-18, BAAD16 and LD). Among the models, the optimized CNN with GloVe achieved the highest classification accuracies of 93.45%, 95.02%, and 98.67% on the BACC-18, BAAD16, and LD datasets, respectively.
The Word2vec Graph Model for Author Attribution and Genre Detection in Literary Analysis
Analyzing the writing styles of authors and articles is key to supporting
various literary analyses such as author attribution and genre detection. Over
the years, rich sets of features that include stylometry, bag-of-words and n-grams
have been widely used to perform such analysis. However, the effectiveness of
these features largely depends on the linguistic aspects of a particular
language and on dataset-specific characteristics. Consequently, techniques based
on these feature sets cannot give the desired results across domains. In this
paper, we propose a novel Word2vec graph-based modeling of a document that can
capture both the context and the style of the document. Using these Word2vec
graph-based features, we train classifiers for the author attribution
and genre detection tasks. Our detailed experimental study with a comprehensive
set of literary writings shows the effectiveness of this method over
traditional feature-based approaches. Our code and data are publicly available
at https://cutt.ly/svLjSgk
Comment: 12 pages, 6 figures
Author Identification from Literary Articles with Visual Features: A Case Study with Bangla Documents
Author identification is an important aspect of literary analysis, studied in natural language processing (NLP). It helps identify the most probable author of articles, news texts, or social media comments and tweets, for example. It can be applied to other domains such as criminal and civil cases, cybersecurity, forensics, identification of plagiarists, and many more. An automated system in this context can thus be very beneficial for society. In this paper, we propose a convolutional neural network (CNN)-based system for identifying authors from literary articles. This system uses visual features along with a five-layer convolutional neural network for the identification of authors. The prime motivation behind this approach was the feasibility of identifying distinct writing styles through a visualization of the writing patterns. Experiments were performed on 1200 articles from 50 authors, achieving a maximum accuracy of 93.58%. Furthermore, to see how the system performed on different volumes of data, the experiments were repeated on partitions of the dataset. The system outperformed standard handcrafted feature-based techniques as well as established works on publicly available datasets.
Resolving the confusion of the authorship attribution of a Bengali book
The present paper aims to determine whether the Bengali book Londoner Naksa ebong France Bhraman (Wondrous Capers at London and Travelling in France) was written by the geologist Pramathanath Bose (P.N. Bose). To find out, two well-established style markers often used in authorship attribution studies, namely function words and punctuation marks, are used here. The results suggest that this book was possibly penned by the geologist P.N. Bose. As a corollary, this approach may also be used in future authorship attribution studies involving Bengali writings.
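To make the idea of function words and punctuation marks as style markers concrete, here is a minimal sketch that turns a text into a vector of relative marker frequencies and compares two such vectors by cosine similarity. The marker lists, the function names and the choice of cosine similarity are all assumptions made for illustration; the abstract does not state which markers or which comparison method the paper uses, and a Bengali study would substitute Bengali function words.

```python
import math
from collections import Counter

# Hypothetical marker lists (English placeholders plus the Bengali danda).
FUNCTION_WORDS = ("the", "and", "of")
PUNCTUATION = (",", ".", ";", "।")


def tokenize(text):
    """Split on whitespace and emit each listed punctuation mark as its own token."""
    tokens, word = [], []
    for ch in text.lower():
        if ch in PUNCTUATION or ch.isspace():
            if word:
                tokens.append("".join(word))
                word = []
            if ch in PUNCTUATION:
                tokens.append(ch)
        else:
            word.append(ch)
    if word:
        tokens.append("".join(word))
    return tokens


def style_profile(text):
    """Relative frequency of each marker, in FUNCTION_WORDS + PUNCTUATION order."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty text
    return [counts[m] / total for m in FUNCTION_WORDS + PUNCTUATION]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0
```

In use, one would compute `style_profile` for the disputed book and for known writings of each candidate author, then compare the profiles; a higher similarity is only suggestive evidence, not proof of authorship.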