Authorship Classification in a Resource Constraint Language Using Convolutional Neural Networks
Authorship classification is the task of automatically determining the author of an unknown linguistic text. Although research on authorship classification has progressed significantly in high-resource languages, it is still at a primitive stage for resource-constrained languages like Bengali. This paper presents an authorship classification approach built on Convolutional Neural Networks (CNNs) comprising four modules: embedding model generation, feature representation, classifier training and classifier testing. For this purpose, the work develops a new embedding corpus (named WEC) and a Bengali authorship classification corpus (called BACC-18), which are more robust in terms of author classes and unique words. Using three text embedding techniques (Word2Vec, GloVe and FastText) and combinations of different hyperparameters, 90 embedding models are created in this study. All 90 embedding models are assessed by intrinsic evaluators, and the 9 best-performing models are selected for authorship classification. In total, 36 classification models (four classifiers: CNN, LSTM, SVM and SGD, combined with the three embedding techniques at embedding dimensions of 100, 200 and 250) are trained with optimized hyperparameters and tested on three benchmark datasets (BACC-18, BAAD16 and LD). Among these, the optimized CNN with the GloVe model achieved the highest classification accuracies of 93.45%, 95.02% and 98.67% on BACC-18, BAAD16 and LD, respectively.
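The feature-representation and convolution steps of such a pipeline can be sketched in plain NumPy. Everything below is an illustrative stand-in, not the paper's trained system: the toy vocabulary, random embeddings, kernel width and filter count are assumptions chosen only to show how a padded embedding matrix is convolved and max-pooled into a fixed-length feature vector for an author classifier.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary with a 100-dimensional embedding table (the paper trains
# Word2Vec/GloVe/FastText embeddings of 100/200/250 dims; random here).
vocab = {"ami": 0, "tumi": 1, "boi": 2, "lekha": 3, "golpo": 4}
emb_dim = 100
embeddings = rng.normal(size=(len(vocab), emb_dim))

def embed(tokens, max_len=8):
    """Feature representation: map tokens to a padded (max_len, emb_dim) matrix."""
    ids = [vocab[t] for t in tokens if t in vocab][:max_len]
    mat = np.zeros((max_len, emb_dim))
    mat[: len(ids)] = embeddings[ids]
    return mat

def conv1d_maxpool(x, filters):
    """One 1-D convolution (kernel width 3) with ReLU and global max-pooling."""
    n_filters, k, _ = filters.shape
    pooled = np.empty(n_filters)
    for j, f in enumerate(filters):
        # Slide the filter over token positions and take the best response.
        fmap = np.array([np.sum(x[i : i + k] * f) for i in range(len(x) - k + 1)])
        pooled[j] = np.maximum(fmap, 0.0).max()
    return pooled

filters = rng.normal(size=(32, 3, emb_dim))
features = conv1d_maxpool(embed(["ami", "boi", "lekha", "golpo"]), filters)
# `features` is the fixed-length vector that would feed a softmax layer
# over the author classes.
```

In the actual system the filters are learned by backpropagation; this sketch only shows the shape of the computation between the embedding module and the classifier.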
The Word2vec Graph Model for Author Attribution and Genre Detection in Literary Analysis
Analyzing the writing styles of authors and articles is key to supporting various literary analyses such as author attribution and genre detection. Over the years, rich feature sets including stylometry, bag-of-words and n-grams have been widely used to perform such analysis. However, the effectiveness of these features largely depends on the linguistic aspects of a particular language and dataset-specific characteristics. Consequently, techniques based on these feature sets cannot give the desired results across domains. In this paper, we propose a novel Word2vec graph-based modeling of a document that can rightly capture both the context and the style of the document. Using these Word2vec graph-based features, we perform classification for the author attribution and genre detection tasks. Our detailed experimental study with a comprehensive set of literary writings shows the effectiveness of this method over traditional feature-based approaches. Our code and data are publicly available at https://cutt.ly/svLjSgk (12 pages, 6 figures).
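One plausible reading of a Word2vec graph model is: take the words of a document as nodes and connect pairs whose embedding vectors are sufficiently similar, then feed simple structural statistics of that graph to a classifier. The sketch below follows that reading with random stand-in vectors; the word list, the 0.6 cosine threshold and the chosen graph statistics are all assumptions for illustration, not the paper's exact construction.

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in word vectors; a real run would use Word2vec vectors trained
# on the literary corpus.
words = ["river", "boat", "water", "king", "throne", "crown"]
vecs = {w: rng.normal(size=50) for w in words}

def word2vec_graph(tokens, threshold=0.6):
    """Connect document words whose embedding cosine similarity exceeds threshold."""
    nodes = sorted(set(t for t in tokens if t in vecs))
    n = len(nodes)
    adj = np.zeros((n, n), dtype=bool)
    for i in range(n):
        for j in range(i + 1, n):
            a, b = vecs[nodes[i]], vecs[nodes[j]]
            cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
            adj[i, j] = adj[j, i] = cos > threshold
    return nodes, adj

def graph_features(adj):
    """Simple structural features usable by a downstream classifier."""
    n = len(adj)
    degrees = adj.sum(axis=1)
    density = adj.sum() / (n * (n - 1)) if n > 1 else 0.0
    return {"density": float(density), "max_degree": int(degrees.max())}

nodes, adj = word2vec_graph(["river", "boat", "water", "king", "crown"])
feats = graph_features(adj)
```

The appeal of such a representation is that edge structure reflects the semantic neighbourhood a writer habitually draws on, which is less tied to surface n-gram counts than traditional features.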
Author Identification from Literary Articles with Visual Features: A Case Study with Bangla Documents
Author identification is an important aspect of literary analysis, studied in natural language processing (NLP). It helps identify the most probable author of, for example, articles, news texts, or social media comments and tweets. It can also be applied to other domains such as criminal and civil cases, cybersecurity, forensics, plagiarism identification, and many more. An automated system in this context can thus be very beneficial for society. In this paper, we propose a convolutional neural network (CNN)-based author identification system for literary articles. This system uses visual features along with a five-layer convolutional neural network for the identification of authors. The prime motivation behind this approach was the feasibility of identifying distinct writing styles through a visualization of writing patterns. Experiments were performed on 1200 articles from 50 authors, achieving a maximum accuracy of 93.58%. Furthermore, to see how the system performed on different volumes of data, the experiments were repeated on partitions of the dataset. The system outperformed standard handcrafted feature-based techniques as well as established works on publicly available datasets.
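The abstract does not specify how text is turned into the "visual" input of the CNN, so the snippet below is only one minimal way such an encoding could look: pack the byte values of a snippet into a fixed-size grayscale grid, so that a CNN sees the visual rhythm of characters rather than token counts. The grid size and byte-intensity scheme are hypothetical.

```python
import numpy as np

def text_to_image(text, width=32, height=32):
    """Map a text snippet to a fixed-size grayscale 'image' of byte
    intensities. This is an illustrative stand-in for the paper's
    (unspecified) visual rendering of writing patterns."""
    codes = np.frombuffer(text.encode("utf-8"), dtype=np.uint8)
    img = np.zeros(width * height, dtype=np.float32)
    n = min(len(codes), img.size)
    img[:n] = codes[:n] / 255.0          # normalize bytes into [0, 1]
    return img.reshape(height, width)

# A Bangla snippet becomes a 32x32 array ready for a convolutional layer.
img = text_to_image("আমার সোনার বাংলা, আমি তোমায় ভালোবাসি")
```

A stack of such grids, one per article fragment, would then be fed to the five-layer CNN as ordinary image batches.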
Investigating features and techniques for Arabic authorship attribution
Authorship attribution is the problem of identifying the true author of a disputed text. Throughout history, there have been many examples of this problem concerned with revealing genuine authors of works of literature that were published anonymously, and in some cases where more than one author claimed authorship of the disputed text. There has been considerable research effort into trying to solve this problem. Initially these efforts were based on statistical patterns, and more recently they have centred on a range of techniques from artificial intelligence. An important early breakthrough was achieved by Mosteller and Wallace in 1964 [15], who pioneered the use of ‘function words’ – typically pronouns, conjunctions and prepositions – as the features on which to base the discovery of patterns of usage relevant to specific authors.
The authorship attribution problem has been tackled in many languages, but predominantly in English. In this thesis the problem is addressed for the first time in the Arabic language. We therefore investigate whether the concept of function words in English can be used in the same way for authorship attribution in Arabic. We also describe and evaluate a hybrid of evolutionary algorithms and linear discriminant analysis as an approach to learning a model that classifies the author of a text, based on features derived from Arabic function words. The main target of the hybrid algorithm is to find a subset of features that can robustly and accurately classify disputed texts in unseen data. The hybrid algorithm also aims to do this with relatively small subsets of features. A specialised dataset was produced for this work, based on a collection of 14 Arabic books of different natures, representing a collection of six authors. This dataset was processed into training and test partitions in a way that provides a diverse collection of challenges for any authorship attribution approach.
The combination of the successful list of Arabic function words and the hybrid classification algorithm led to satisfactory levels of accuracy in determining the author of portions of the texts in test data. The work described here is the first (to our knowledge) that investigates authorship attribution in the Arabic language using computational methods. Among its contributions are: the first set of Arabic function words, the first specialised dataset aimed at testing Arabic authorship attribution methods, a new hybrid algorithm for classifying authors based on patterns derived from these function words, and, finally, a number of ideas and variants regarding how to use function words in association with character-level features, leading in some cases to more accurate results.
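The feature side of this approach, relative frequencies of function words as a stylometric profile, can be sketched briefly. The English word list below is a stand-in (the thesis builds the first such list for Arabic, which is not reproduced here), and nearest-centroid matching is a deliberately simple substitute for the evolutionary-algorithm + linear-discriminant hybrid.

```python
import numpy as np

# Hand-picked English function words as an illustrative stand-in for the
# thesis's Arabic function-word list.
FUNCTION_WORDS = ["the", "of", "and", "in", "to", "he", "she", "it", "that"]

def fw_profile(text):
    """Relative frequency of each function word: the stylometric feature vector."""
    tokens = text.lower().split()
    counts = np.array([tokens.count(w) for w in FUNCTION_WORDS], dtype=float)
    return counts / max(len(tokens), 1)

def nearest_author(texts_by_author, disputed_text):
    """Attribute a disputed text to the author with the closest mean profile
    (a simple stand-in for the hybrid EA + LDA classifier)."""
    v = fw_profile(disputed_text)
    centroids = {a: np.mean([fw_profile(t) for t in ts], axis=0)
                 for a, ts in texts_by_author.items()}
    return min(centroids, key=lambda a: np.linalg.norm(centroids[a] - v))

samples = {"A": ["the law of the land and the king"],
           "B": ["he said she said that he went"]}
guess = nearest_author(samples, "the court of the realm")
```

The hybrid algorithm's job, in this framing, is to search for the small subset of `FUNCTION_WORDS` whose profiles best separate the authors before the discriminant is fitted.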
An Archaeology of Screenwriting in Indian Cinema, 1930s-1950s
The Euro-American scholarship on screenwriting has produced elaborate histories of the development of the screenplay form with reference to extensive archival collections of film scenarios and scripts. On the other hand, the archival absence of early Indian film scripts has largely invisibilised practices of screenwriting and retroactively contributed to stereotypical descriptions of Indian film industries as unorganised. As a process of imagination and marker of industrialisation, screenwriting is often privileged as the most cerebral, analytical and rational process of film production. The stark absence of screenwriting histories in film cultures of the Global South makes it the absent technique of non-Western cinemas. In the thesis, I critically engage with the archival absence as a heuristic to rethink South Asian screenwriting practices beyond the prescriptive model of manuals as well as Euro-American notions of screenwriting. It is an archaeological as well as a media archaeological project. As an archaeological investigation in the Foucauldian mould, it studies the contradictions of screenwriting practice and discourse in order to understand how perceptions of archival lack and technical backwardness vis-à-vis screenwriting in India gained the currency of truth. As a media archaeological project, it collates a disparate and discontinuous cross-section of screenwriting artefacts, discourses and practices in the archival absence of a formally evolving pre-cinematic text. Despite its reliance on archival and ethnographic sources, my research does not attempt to construct a comprehensive, chronological history of screenwriting practices in Bengali and Bombay cinema. Instead, this critical history epistemically delinks screenwriting from universalist discourses, introduces regional specificities and cultural subjectivities, and presents an alternative non-linear model of film historiography beyond questions of archival absence and technical backwardness.
Term burstiness: evidence, model and applications
The present thesis looks at the phenomenon of term burstiness in text. Term burstiness is defined as the multiple re-occurrences, in short succession, of a particular term after it has occurred once in a certain text. Term burstiness is important as it helps provide structure and meaning to a document. Various kinds of term burstiness in text are studied and their effect on a dataset explored in a series of homogeneity experiments. A novel model of term burstiness is proposed, and evaluations based on the proposed model are performed on three different applications. The "bag-of-words" assumption is often used in statistical Natural Language Processing and Information Retrieval applications. Under this assumption, all structure and positional information of terms is lost and only frequency counts of the document are retained. As a result of counting frequencies only, the "bag-of-words" representation of text assumes that the probability of a word occurring remains constant throughout the text. This assumption is often used because of its simplicity and the ease it provides for the application of mathematical and statistical techniques to text. Though this assumption is known to be untrue [CG95b, CG95a, Chu00], applications based on it [SB97, Lew98, MN98, Seb02] appear not to be much hampered. A series of homogeneity-based experiments are carried out to study the presence and extent of term burstiness against the term-independence-based homogeneity assumption on the dataset. A null hypothesis stating the homogeneity of a dataset is formulated and rejected in a series of experiments based on the χ² test, which tests the equality between two partitions of a certain dataset. Various schemes of partitioning a dataset are adopted to illustrate the effect of term burstiness and structure in text.
This provided evidence of term burstiness in the dataset, and fine-grained information about the distribution of terms that might be used for characterizing or profiling a dataset. A model for term burstiness in a dataset is proposed based on the gaps between successive occurrences of a particular term. This model is not merely based on frequency counts like other existing models, but takes into account the structural and positional information about the term's occurrence in the document. The proposed term burstiness model looks at gaps between successive occurrences of the term. These gaps are modeled using a mixture of exponential distributions: the first exponential distribution provides the overall rate of occurrence of a term in a dataset, and the second determines the term's rate of re-occurrence within a burst, once it has already occurred. Since most terms occur in only a few documents, there are a large number of documents with no occurrences of a particular term. In the proposed model, non-occurrence of a term in a document is accounted for by the method of data censoring. It is not straightforward to obtain parameter estimates for such a complex model, so Bayesian statistics is used for flexibility and ease of fitting, and for obtaining parameter estimates. The model can be used for all kinds of terms, be they rare content words, medium-frequency terms or frequent function words. The term re-occurrence model is instantiated and verified against the background of different collections, in the context of three different applications. The applications include studying various terms within a dataset to identify behavioral differences between the terms, studying similar terms across different datasets to detect stylistic features based on the term's distribution, and studying the characteristics of very frequent terms across different datasets.
The model aids in the identification of term characteristics in a dataset. It helps distinguish between highly bursty content terms and less bursty function words. The model can differentiate between a frequent function word and a scattered one. It can be used to identify stylistic features in a term's distribution across texts of varying genres. The model also aids in understanding the behaviour of very frequent (usually function) words in a dataset.
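The core of the gap model can be illustrated with a small maximum-likelihood fit. The thesis fits its two-exponential mixture with Bayesian methods and data censoring; the sketch below instead uses plain EM on synthetic, uncensored gaps, so the synthetic rates, iteration count and initialisation are all assumptions made only to show how the between-burst and within-burst rates separate.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic gaps between successive occurrences of a bursty term:
# long between-burst gaps mixed with short within-burst gaps.
gaps = np.concatenate([rng.exponential(50.0, 200),   # between bursts
                       rng.exponential(2.0, 300)])   # inside a burst

def fit_exp_mixture(x, iters=200):
    """EM for a two-component exponential mixture over gap lengths.
    Returns (rate_between, rate_within, weight_within)."""
    lam1, lam2, pi = 1.0 / x.mean(), 2.0 / x.mean(), 0.5
    for _ in range(iters):
        # E-step: probability each gap belongs to the within-burst component.
        p1 = (1 - pi) * lam1 * np.exp(-lam1 * x)
        p2 = pi * lam2 * np.exp(-lam2 * x)
        r = p2 / (p1 + p2)
        # M-step: weighted maximum-likelihood rates and mixing weight.
        lam2 = r.sum() / (r * x).sum()
        lam1 = (1 - r).sum() / ((1 - r) * x).sum()
        pi = r.mean()
    return lam1, lam2, pi

lam_between, lam_within, w = fit_exp_mixture(gaps)
# A bursty term shows lam_within much larger than lam_between: once the
# term has appeared, it re-occurs at a far higher rate for a while.
```

The contrast between the two fitted rates is exactly the kind of per-term characteristic the thesis uses to separate bursty content words from evenly scattered function words.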