Urdu AI: writeprints for Urdu authorship identification
This is an accepted manuscript of an article published by ACM in ACM Transactions on Asian and Low-Resource Language Information Processing on 31/10/2021, available online at: https://doi.org/10.1145/3476467
The accepted version of the publication may differ from the final published version.
The authorship identification task aims at identifying the original author of an anonymous text sample from
a set of candidate authors. It has several application domains such as digital text forensics and information
retrieval. These application domains are not limited to a specific language. However, most of the authorship
identification studies focus on English, and limited attention has been paid to Urdu. In addition,
existing Urdu authorship identification solutions lose accuracy as the number of training samples per
candidate author decreases and as the number of candidate authors increases. Consequently, these solutions
are inapplicable to real-world cases. To overcome these limitations, we formulate a stylometric feature space.
Based on this feature space, we use an authorship identification solution that transforms each text sample
into a point set, retrieves candidate text samples, and relies on a nearest neighbour classifier to predict the
original author of the anonymous text sample. To evaluate our method, we create a significantly larger corpus
than those used in existing studies and conduct several experiments, which show that our solution overcomes
the limitations of existing studies and achieves an accuracy of 94.03%, which is higher than all previous
authorship identification works.
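The pipeline summarised in the abstract (text sample to point set, then nearest neighbour prediction) can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration and is not taken from the article: the three features in `chunk_features`, the chunk size, and the set-to-set distance in `set_distance` are hypothetical stand-ins for the stylometric feature space and matching step the authors actually use. The sketch also scans every labelled sample instead of implementing a separate retrieval step, and it assumes non-empty texts.

```python
import math


def chunk_features(words):
    """Map one chunk of words to a small feature vector (illustrative features only)."""
    n = len(words)
    avg_len = sum(len(w) for w in words) / n              # mean word length
    type_token = len(set(words)) / n                      # vocabulary richness
    short_ratio = sum(1 for w in words if len(w) <= 3) / n  # share of short words
    return (avg_len, type_token, short_ratio)


def to_point_set(text, chunk_size=5):
    """Split a text into fixed-size word chunks and turn each chunk into a point."""
    words = text.split()
    chunks = [words[i:i + chunk_size] for i in range(0, len(words), chunk_size)]
    return [chunk_features(c) for c in chunks if c]


def set_distance(a, b):
    """Larger of the two directed mean nearest-point distances (assumed metric)."""
    def directed(p, q):
        return sum(min(math.dist(x, y) for y in q) for x in p) / len(p)
    return max(directed(a, b), directed(b, a))


def predict_author(anonymous_text, labelled_samples):
    """Return the author of the labelled sample closest to the anonymous text.

    labelled_samples is a list of (author, text) pairs.
    """
    query = to_point_set(anonymous_text)
    best = min(labelled_samples,
               key=lambda s: set_distance(query, to_point_set(s[1])))
    return best[0]
```

Calling `predict_author(text, [("author_a", sample_a), ("author_b", sample_b)])` returns the label of the closest sample. With real Urdu text, the whitespace tokenisation and the toy features would need to be replaced by language-appropriate ones.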