Ensemble methods for instance-based Arabic language authorship attribution
Authorship attribution (AA) is a subfield of authorship analysis, and it has become an important problem as the volume of anonymous information has increased with the rapid growth of internet usage worldwide. In other languages, such as English, Spanish and Chinese, the problem is well studied. In Arabic, however, AA has received less attention from the research community because of the complexity and nature of Arabic sentences. The paper presents an intensive review of previous studies on Arabic. Based on that review, the study employs the Technique for Order of Preference by Similarity to Ideal Solution (TOPSIS) to choose the base classifier for the ensemble methods. As attribution features, hundreds of stylometric features and distinct words were extracted using several tools. AdaBoost and Bagging ensemble methods were then applied to a dataset of Arabic enquiries (Fatwa). The findings show an improvement in the effectiveness of authorship attribution in Arabic.
Letter counting: a stem cell for Cryptology, Quantitative Linguistics, and Statistics
Counting letters in written texts is a very ancient practice. It has
accompanied the development of Cryptology, Quantitative Linguistics, and
Statistics. In Cryptology, counting frequencies of the different characters in
an encrypted message is the basis of the so-called frequency analysis method.
In Quantitative Linguistics, the proportion of vowels to consonants in
different languages was studied long before authorship attribution. In
Statistics, the alternation vowel-consonants was the only example that Markov
ever gave of his theory of chained events. A short history of letter counting
is presented. The three domains, Cryptology, Quantitative Linguistics, and
Statistics, are then examined, focusing on the interactions with the other two
fields through letter counting. In conclusion, the eclecticism of scholars of
past centuries, their background in the humanities, and their familiarity with
cryptograms are identified as contributing factors to the mutual enrichment
process described here.
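The two kinds of counting mentioned in this abstract (character frequencies, as used in frequency analysis, and the alternation of vowels and consonants) can be illustrated with a minimal sketch. It is not taken from the paper: the function names are hypothetical, the vowel set assumes English text with "y" treated as a consonant, and non-letter characters are simply skipped.

```python
from collections import Counter

# Assumption: English vowels only; 'y' is treated as a consonant.
VOWELS = set("aeiou")


def letter_frequencies(text):
    """Return the relative frequency of each letter, ignoring case and non-letters."""
    letters = [c for c in text.lower() if c.isalpha()]
    total = len(letters)
    if total == 0:
        return {}
    return {c: n / total for c, n in Counter(letters).items()}


def vowel_proportion(text):
    """Return the share of letters that are vowels (0.0 for text without letters)."""
    letters = [c for c in text.lower() if c.isalpha()]
    if not letters:
        return 0.0
    return sum(c in VOWELS for c in letters) / len(letters)


def vowel_consonant_transitions(text):
    """Count adjacent pairs of letter classes: ('V','C'), ('C','V'), ('V','V'), ('C','C')."""
    classes = ["V" if c in VOWELS else "C" for c in text.lower() if c.isalpha()]
    return Counter(zip(classes, classes[1:]))
```

Dividing each transition count by the number of pairs that start with the same class would give estimated transition probabilities for a two-state chain over vowels and consonants.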
The “PAThs” Project: An Effort to Represent the Physical Dimension of Coptic Literary Production (Third–Eleventh centuries)
PAThs – Tracking Papyrus and Parchment Paths: An Archaeological Atlas of Coptic Literature. Literary Texts in their Geographical Context. Production, Copying, Usage, Dissemination and Storage is an ambitious digital project based in Rome, working towards a new historical and archaeological geography of the Coptic literary tradition. This aim implies a number of auxiliary tasks and challenges, including classification of authors, works, titles, colophons, and codicological units, as well as the study and wherever possible exact mapping of the relevant geographical sites related to the production, circulation, and storage of manuscript
Conditional Complexity of Compression for Authorship Attribution
We introduce new stylometry tools based on the sliced conditional compression complexity of literary texts, inspired by the nearly optimal application of the incomputable Kolmogorov conditional complexity (which the statistic presumably approximates). Whereas other stylometry tools can occasionally give very close values for different authors, our statistic is apparently strictly minimal for the true author, provided that the query and training texts are sufficiently large, the compressor is sufficiently good, and sampling bias is avoided (as in poll sampling). We tune it and test its performance on attributing the Federalist Papers (Madison vs. Hamilton). Our results confirm the previous attribution of the Federalist Papers to Madison by Mosteller and Wallace (1964) using the Naive Bayes classifier, and the same attribution based on alternative classifiers such as SVM and the second-order Markov model of language. We then apply our method to the attribution of the early poems from the Shakespeare Canon and of the continuation of Marlowe’s poem “Hero and Leander” ascribed to G. Chapman. Keywords: compression complexity, authorship attribution.
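The general idea behind compression-based attribution can be sketched as follows. This is a simplified illustration, not the sliced statistic of the paper: it assumes the common approximation that the conditional complexity of a query given a training text is the extra compressed size the query adds when appended to that text, it uses `zlib` as a stand-in compressor, and the function names are hypothetical.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes of the zlib-compressed data at maximum compression level."""
    return len(zlib.compress(data, 9))


def conditional_complexity(query: str, training: str) -> int:
    """Approximate C(query | training) as the extra compressed bytes the query adds."""
    t = training.encode("utf-8")
    q = query.encode("utf-8")
    return compressed_size(t + q) - compressed_size(t)


def attribute(query: str, corpora: dict) -> str:
    """Return the candidate author whose training text yields the smallest value."""
    return min(corpora, key=lambda author: conditional_complexity(query, corpora[author]))
```

Note that `zlib` only looks back over a 32 KB window, so for long training texts a compressor with a larger context would be needed for this approximation to be meaningful.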
Investigating features and techniques for Arabic authorship attribution
Authorship attribution is the problem of identifying the true author of a disputed text. Throughout history, there have been many examples of this problem concerned with revealing genuine authors of works of literature that were published anonymously, and in some cases where more than one author claimed authorship of the disputed text. There has been considerable research effort into trying to solve this problem. Initially these efforts were based on statistical patterns, and more recently they have centred on a range of techniques from artificial intelligence. An important early breakthrough was achieved by Mosteller and Wallace in 1964 [15], who pioneered the use of “function words” (typically pronouns, conjunctions and prepositions) as the features on which to base the discovery of patterns of usage relevant to specific authors.
The authorship attribution problem has been tackled in many languages, but predominantly in English. In this thesis the problem is addressed for the first time in the Arabic language. We therefore investigate whether the concept of function words in English can also be used in the same way for authorship attribution in Arabic. We also describe and evaluate a hybrid of evolutionary algorithms and linear discriminant analysis as an approach to learning a model that classifies the author of a text, based on features derived from Arabic function words. The main target of the hybrid algorithm is to find a subset of features that can robustly and accurately classify disputed texts in unseen data. The hybrid algorithm also aims to do this with relatively small subsets of features. A specialised dataset was produced for this work, based on a collection of 14 Arabic books of different natures, representing a collection of six authors. This dataset was processed into training and test partitions in a way that provides a diverse collection of challenges for any authorship attribution approach.
The combination of the successful list of Arabic function words and the hybrid algorithm for classification led to satisfying levels of accuracy in determining the author of portions of the texts in test data. The work described here is the first (to our knowledge) that investigates authorship attribution in the Arabic language using computational methods. Among its contributions are: the first set of Arabic function words, the first specialised dataset aimed at testing Arabic authorship attribution methods, a new hybrid algorithm for classifying authors based on patterns derived from these function words, and, finally, a number of ideas and variants regarding how to use function words in association with character-level features, leading in some cases to more accurate results.
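As a rough illustration of function-word features, the sketch below turns a text into a vector of relative frequencies over a fixed word list. It is not the thesis's method or word list: the function name is hypothetical, the default list is a small English placeholder, and tokens are split with a simple Unicode word pattern, which would miss Arabic function words written as clitics attached to the following word.

```python
import re
from collections import Counter

# Hypothetical placeholder list; the thesis's Arabic function-word list is not given here.
FUNCTION_WORDS = ["the", "of", "and", "to", "in", "by", "on", "upon"]


def function_word_features(text, vocabulary=FUNCTION_WORDS):
    """Return the relative frequency of each vocabulary word among all tokens in text."""
    tokens = re.findall(r"\w+", text.lower())
    total = len(tokens)
    counts = Counter(tokens)
    return [counts[w] / total if total else 0.0 for w in vocabulary]
```

Vectors of this kind, one per text portion, could then be passed to a classifier or to a feature-selection step that searches for a small discriminating subset of words.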