
    Authorship Classification in a Resource Constraint Language Using Convolutional Neural Networks

    Authorship classification is the task of automatically determining the most likely author of an unknown text. Although research on authorship classification has progressed significantly in high-resource languages, it remains at an early stage for resource-constrained languages such as Bengali. This paper presents a Convolutional Neural Network (CNN)-based authorship classification approach comprising four modules: embedding model generation, feature representation, classifier training and classifier testing. For this purpose, this work develops a new embedding corpus (named WEC) and a Bengali authorship classification corpus (called BACC-18), which are more robust in terms of authors’ classes and unique words. Using three text embedding techniques (Word2Vec, GloVe and FastText) and combinations of different hyperparameters, 90 embedding models are created in this study. All the embedding models are assessed by intrinsic evaluators, and the 9 best-performing models out of the 90 are selected for authorship classification. In total, 36 classification models, combining four classifiers (CNN, LSTM, SVM, SGD) with three embedding techniques at 100, 200 and 250 embedding dimensions, are trained with optimized hyperparameters and tested on three benchmark datasets (BACC-18, BAAD16 and LD). Among the models, the optimized CNN with GloVe achieved the highest classification accuracies of 93.45%, 95.02%, and 98.67% on BACC-18, BAAD16, and LD, respectively.
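
The count of 36 classification models follows from the combinations named in the abstract: 4 classifiers × 3 embedding techniques × 3 embedding dimensions. The short Python sketch below only enumerates that grid. The variable and function names are illustrative, and reading the 9 selected embedding models as one per technique and dimension pair is an assumption that fits the numbers; the abstract does not state it explicitly.

```python
from itertools import product

# Values taken from the abstract; the names themselves are illustrative.
CLASSIFIERS = ["CNN", "LSTM", "SVM", "SGD"]
EMBEDDINGS = ["Word2Vec", "GloVe", "FastText"]
DIMENSIONS = [100, 200, 250]


def enumerate_configs():
    """Return every (classifier, embedding, dimension) combination."""
    return list(product(CLASSIFIERS, EMBEDDINGS, DIMENSIONS))


configs = enumerate_configs()

# Assumption: each selected embedding model is one (technique, dimension) pair.
embedding_variants = sorted({(emb, dim) for _, emb, dim in configs})

print(len(configs), "classification models")
print(len(embedding_variants), "embedding variants")
```

Each configuration would then be trained and evaluated on every benchmark dataset (BACC-18, BAAD16 and LD).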

    The Word2vec Graph Model for Author Attribution and Genre Detection in Literary Analysis

    Analyzing the writing styles of authors and articles is key to supporting literary analyses such as author attribution and genre detection. Over the years, rich sets of features, including stylometry, bag-of-words and n-grams, have been widely used to perform such analysis. However, the effectiveness of these features largely depends on the linguistic aspects of a particular language and on dataset-specific characteristics. Consequently, techniques based on these feature sets cannot give the desired results across domains. In this paper, we propose a novel Word2vec graph-based model of a document that captures both the context and the style of the document. Using these Word2vec graph-based features, we train classifiers for the author attribution and genre detection tasks. Our detailed experimental study with a comprehensive set of literary writings shows the effectiveness of this method over traditional feature-based approaches. Our code and data are publicly available at https://cutt.ly/svLjSgk. Comment: 12 pages, 6 figures.

    Author Identification from Literary Articles with Visual Features: A Case Study with Bangla Documents

    Author identification is an important aspect of literary analysis, studied in natural language processing (NLP). It helps identify the most probable author of articles, news texts, or social media comments and tweets, for example. It can be applied to other domains such as criminal and civil cases, cybersecurity, forensics, plagiarist identification, and many more. An automated system in this context can thus be very beneficial for society. In this paper, we propose a convolutional neural network (CNN)-based system for identifying authors of literary articles. This system uses visual features along with a five-layer convolutional neural network to identify authors. The prime motivation behind this approach was the feasibility of identifying distinct writing styles through a visualization of the writing patterns. Experiments were performed on 1200 articles from 50 authors, achieving a maximum accuracy of 93.58%. Furthermore, to see how the system performed on different volumes of data, the experiments were repeated on partitions of the dataset. The system outperformed standard handcrafted feature-based techniques as well as established works on publicly available datasets.

    Investigating features and techniques for Arabic authorship attribution

    Authorship attribution is the problem of identifying the true author of a disputed text. Throughout history, there have been many examples of this problem, concerned with revealing the genuine authors of works of literature that were published anonymously and, in some cases, where more than one author claimed authorship of the disputed text. There has been considerable research effort into trying to solve this problem. Initially these efforts were based on statistical patterns, and more recently they have centred on a range of techniques from artificial intelligence. An important early breakthrough was achieved by Mosteller and Wallace in 1964 [15], who pioneered the use of ‘function words’ (typically pronouns, conjunctions and prepositions) as the features on which to base the discovery of patterns of usage relevant to specific authors. The authorship attribution problem has been tackled in many languages, but predominantly in English. In this thesis the problem is addressed for the first time in the Arabic language. We therefore investigate whether the concept of function words in English can also be used in the same way for authorship attribution in Arabic. We also describe and evaluate a hybrid of evolutionary algorithms and linear discriminant analysis as an approach to learning a model that classifies the author of a text, based on features derived from Arabic function words. The main target of the hybrid algorithm is to find a subset of features that can robustly and accurately classify disputed texts in unseen data. The hybrid algorithm also aims to do this with relatively small subsets of features. A specialised dataset was produced for this work, based on a collection of 14 Arabic books of different natures, representing a collection of six authors. This dataset was processed into training and test partitions in a way that provides a diverse collection of challenges for any authorship attribution approach.
    The combination of the successful list of Arabic function words and the hybrid algorithm for classification led to satisfying levels of accuracy in determining the author of portions of the texts in test data. The work described here is the first (to our knowledge) that investigates authorship attribution in the Arabic language using computational methods. Among its contributions are: the first set of Arabic function words, the first specialised dataset aimed at testing Arabic authorship attribution methods, a new hybrid algorithm for classifying authors based on patterns derived from these function words, and, finally, a number of ideas and variants regarding how to use function words in association with character-level features, leading in some cases to more accurate results.
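
As an illustration of the function-word idea described above, the following Python sketch turns a text into a vector of relative function-word frequencies. It covers only the feature extraction step. The word list is a small English example chosen for readability, not the thesis's Arabic list, and the function name, tokenizer and normalisation are assumptions rather than the author's implementation.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny English list used only for illustration.
FUNCTION_WORDS = ["the", "of", "and", "to", "in", "but", "he", "she", "it", "on"]


def function_word_features(text, vocabulary=FUNCTION_WORDS):
    """Return the relative frequency of each function word in `text`."""
    # Simple tokenizer: runs of letters, lower-cased.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    if not tokens:
        return [0.0] * len(vocabulary)
    counts = Counter(tokens)
    return [counts[word] / len(tokens) for word in vocabulary]


sample = "The cat sat on the mat, and it looked at the dog."
features = function_word_features(sample)
print(dict(zip(FUNCTION_WORDS, features)))
```

Vectors of this kind, computed per text portion, could then be passed to a feature-selection step and a classifier such as linear discriminant analysis.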

    An Archaeology of Screenwriting in Indian Cinema, 1930s-1950s

    The Euro-American scholarship on screenwriting has produced elaborate histories of the development of the screenplay form with reference to extensive archival collections of film scenarios and scripts. On the other hand, the archival absence of early Indian film scripts has largely invisibilised practices of screenwriting and retroactively contributed to stereotypical descriptions of Indian film industries as unorganised. As a process of imagination and marker of industrialisation, screenwriting is often privileged as the most cerebral, analytical and rational process of film production. The stark absence of screenwriting histories in film cultures of the Global South makes it the absent technique of non-Western cinemas. In the thesis, I critically engage with the archival absence as a heuristic to rethink South Asian screenwriting practices beyond the prescriptive model of manuals as well as Euro-American notions of screenwriting. It is an archaeological as well as a media archaeological project. As an archaeological investigation in the Foucauldian mould, it studies the contradictions of screenwriting practice and discourse in order to understand how perceptions of archival lack and technical backwardness vis-à-vis screenwriting in India gained the currency of truth. As a media archaeological project, it collates a disparate and discontinuous cross-section of screenwriting artefacts, discourses and practices in the archival absence of a formally evolving pre-cinematic text. Despite its reliance on archival and ethnographic sources, my research does not attempt to construct a comprehensive, chronological history of screenwriting practices in Bengali and Bombay cinema. Instead, this critical history epistemically delinks screenwriting from universalist discourses, introduces regional specificities and cultural subjectivities, and presents an alternative non-linear model of film historiography beyond questions of archival absence and technical backwardness.