884 research outputs found

    Authorship Identification in Bengali Literature: a Comparative Analysis

    Stylometry is the study of the unique linguistic styles and writing behaviors of individuals. It belongs to the core tasks of text categorization, such as authorship identification and plagiarism detection. Although a reasonable number of studies have been conducted for English, no major work has been done so far for Bengali. In this work, we present a demonstration of authorship identification for documents written in Bengali. We adopt a set of fine-grained stylistic features for the analysis of the text and use them to develop two different models: a statistical similarity model consisting of three measures and their combination, and a machine learning model with Decision Tree, Neural Network and SVM. Experimental results show that SVM outperforms other state-of-the-art methods after 10-fold cross-validation. We also validate the relative importance of each stylistic feature to show that some of them remain consistently significant in every model used in this experiment.
    Comment: 9 pages, 5 tables, 4 pictures
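
    The statistical similarity idea in this abstract can be illustrated with a minimal sketch. The abstract names neither its stylistic features nor its three similarity measures, so the three features below (average word length, average sentence length, punctuation rate), the use of cosine similarity, and the function names are all assumptions chosen for illustration, not the authors' method.

```python
import math
import re


def stylistic_features(text):
    """Return a small, illustrative stylistic feature vector (assumed features)."""
    # Whitespace tokenization; punctuation stays attached to the words.
    words = text.split()
    # Treat '.', '!', '?' and the Bengali danda as sentence terminators.
    sentences = [s for s in re.split(r"[.!?।]+", text) if s.strip()]
    punctuation = sum(1 for ch in text if ch in ",;:!?.।")
    n_words = max(len(words), 1)
    return [
        sum(len(w) for w in words) / n_words,  # average word length
        n_words / max(len(sentences), 1),      # average sentence length in words
        punctuation / n_words,                 # punctuation marks per word
    ]


def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def attribute(unknown_text, author_samples):
    """Return the author whose sample text is most similar to the unknown text."""
    target = stylistic_features(unknown_text)
    return max(
        author_samples,
        key=lambda name: cosine_similarity(
            target, stylistic_features(author_samples[name])
        ),
    )
```

    In practice the features would need to be scaled before comparison, and a real system would use far more features and text per author; the sketch only shows the shape of a profile-and-compare pipeline.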

    Authorship Classification in a Resource Constraint Language Using Convolutional Neural Networks

    Authorship classification is a method of automatically determining the appropriate author of an unknown linguistic text. Although research on authorship classification has progressed significantly in high-resource languages, it is at a primitive stage for resource-constrained languages like Bengali. This paper presents an authorship classification approach based on a Convolutional Neural Network (CNN) comprising four modules: embedding model generation, feature representation, classifier training and classifier testing. For this purpose, this work develops a new embedding corpus (named WEC) and a Bengali authorship classification corpus (called BACC-18), which are more robust in terms of authors' classes and unique words. Using three text embedding techniques (Word2Vec, GloVe and FastText) and combinations of different hyperparameters, 90 embedding models are created in this study. All the embedding models are assessed by intrinsic evaluators, and the 9 best-performing models out of 90 are selected for authorship classification. In total, 36 classification models, combining four classifiers (CNN, LSTM, SVM, SGD) with three embedding techniques at 100, 200 and 250 embedding dimensions, are trained with optimized hyperparameters and tested on three benchmark datasets (BACC-18, BAAD16 and LD). Among the models, the optimized CNN with GloVe achieved the highest classification accuracies of 93.45%, 95.02%, and 98.67% on the BACC-18, BAAD16, and LD datasets, respectively.

    The Word2vec Graph Model for Author Attribution and Genre Detection in Literary Analysis

    Analyzing the writing styles of authors and articles is key to supporting various literary analyses such as author attribution and genre detection. Over the years, rich sets of features, including stylometry, bag-of-words and n-grams, have been widely used to perform such analysis. However, the effectiveness of these features largely depends on the linguistic aspects of a particular language and on dataset-specific characteristics. Consequently, techniques based on these feature sets cannot give the desired results across domains. In this paper, we propose a novel Word2vec graph-based model of a document that can capture both the context and the style of the document. Using these Word2vec graph-based features, we perform classification for the author attribution and genre detection tasks. Our detailed experimental study with a comprehensive set of literary writings shows the effectiveness of this method over traditional feature-based approaches. Our code and data are publicly available at https://cutt.ly/svLjSgk
    Comment: 12 pages, 6 figures

    Author Identification from Literary Articles with Visual Features: A Case Study with Bangla Documents

    Author identification is an important aspect of literary analysis, studied in natural language processing (NLP). It helps identify the most probable author of articles, news texts, or social media comments and tweets, for example. It can be applied in other domains such as criminal and civil cases, cybersecurity, forensics, plagiarist identification, and many more. An automated system in this context can thus be very beneficial for society. In this paper, we propose a convolutional neural network (CNN)-based author identification system for literary articles. This system uses visual features along with a five-layer convolutional neural network to identify authors. The prime motivation behind this approach was the feasibility of identifying distinct writing styles through a visualization of the writing patterns. Experiments were performed on 1200 articles from 50 authors, achieving a maximum accuracy of 93.58%. Furthermore, to see how the system performed on different volumes of data, the experiments were repeated on partitions of the dataset. The system outperformed standard handcrafted feature-based techniques as well as established works on publicly available datasets.

    Resolving the confusion of the authorship attribution of a Bengali book

    The present paper aims to determine whether the Bengali book Londoner Naksa ebong France Bhraman (Wondrous Capers at London and Travelling in France) was written by the geologist Pramathanath Bose (P.N. Bose). To this end, two well-established style markers often used in authorship attribution studies, namely function words and punctuation marks, are used. The result suggests that the book was possibly penned by the geologist P.N. Bose. As a corollary, this approach may be used in future authorship attribution studies involving Bengali writings.
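
    A rough sketch of how function words and punctuation marks can serve as style markers follows. The marker lists, the per-token relative frequency, the Manhattan distance and the function names are all illustrative assumptions; the abstract does not state which markers or which comparison the paper actually used.

```python
from collections import Counter

# Illustrative marker lists only; the paper's actual lists are not given here.
FUNCTION_WORDS = ["এবং", "কিন্তু", "না", "ও"]
PUNCTUATION_MARKS = ["।", ",", ";", "?", "!"]


def marker_profile(text):
    """Relative frequency of each style marker per token of the text."""
    # Pad punctuation with spaces so marks are counted as separate tokens.
    for mark in PUNCTUATION_MARKS:
        text = text.replace(mark, f" {mark} ")
    tokens = text.split()
    counts = Counter(tokens)
    total = max(len(tokens), 1)
    return [counts[m] / total for m in FUNCTION_WORDS + PUNCTUATION_MARKS]


def profile_distance(p, q):
    """Manhattan distance between two marker profiles (smaller means more alike)."""
    return sum(abs(a - b) for a, b in zip(p, q))
```

    Under these assumptions, a disputed text would be profiled with `marker_profile` and compared against profiles built from each candidate author's undisputed writings; the candidate with the smallest `profile_distance` is the most plausible match.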
