2,625 research outputs found
Authorship attribution in portuguese using character N-grams
For the Authorship Attribution (AA) task, character n-grams are considered among the best predictive features. In the English language, it has also been shown that some types of character n-grams perform better than others. This paper tackles the AA task in Portuguese by examining the performance of different types of character n-grams, and various combinations of them. The paper also experiments with different feature representations and machine-learning algorithms. Moreover, the paper demonstrates that the performance of the character n-gram approach can be improved by fine-tuning the feature set and by appropriately selecting the length and type of character n-grams. This relatively simple and language-independent approach to the AA task outperforms both a bag-of-words baseline and other approaches, using the same corpus.Mexican Government (Conacyt) [240844, 20161958]; Mexican Government (SIP-IPN) [20171813, 20171344, 20172008]; Mexican Government (SNI); Mexican Government (COFAA-IPN)
On the Feasibility of Malware Authorship Attribution
There are many occasions in which the security community is interested to
discover the authorship of malware binaries, either for digital forensics
analysis of malware corpora or for thwarting live threats of malware invasion.
Such a discovery of authorship might be possible due to stylistic features
inherent to software codes written by human programmers. Existing studies of
authorship attribution of general purpose software mainly focus on source code,
which is typically based on the style of programs and environment. However,
those features critically depend on the availability of the program source
code, which is usually not the case when dealing with malware binaries. Such
program binaries often do not retain many semantic or stylistic features due to
the compilation process. Therefore, authorship attribution in the domain of
malware binaries based on features and styles that will survive the compilation
process is challenging. This paper provides the state of the art in this
literature. Further, we analyze the features involved in those techniques. By
using a case study, we identify features that can survive the compilation
process. Finally, we analyze existing works on binary authorship attribution
and study their applicability to real malware binaries.Comment: FPS 201
Drawing Elena Ferrante's Profile. Workshop Proceedings, Padova, 7 September 2017
Elena Ferrante is an internationally acclaimed Italian novelist whose real identity has been kept secret by E/O publishing house for more than 25 years. Owing to her popularity, major Italian and foreign newspapers have long tried to discover her real identity. However, only a few attempts have been made to foster a scientific debate on her work.
In 2016, Arjuna Tuzzi and Michele Cortelazzo led an Italian research team that conducted a preliminary study and collected a well-founded, large corpus of Italian novels comprising 150 works published in the last 30 years by 40 different authors. Moreover, they shared their data with a select group of international experts on authorship attribution, profiling, and analysis of textual data: Maciej Eder and Jan Rybicki (Poland), Patrick Juola (United States), Vittorio Loreto and his research team, Margherita Lalli and Francesca Tria (Italy), George Mikros (Greece), Pierre Ratinaud (France), and Jacques Savoy (Switzerland).
The chapters of this volume report the results of this endeavour that were first presented during the international workshop Drawing Elena Ferrante's Profile in Padua on 7 September 2017 as part of the 3rd IQLA-GIAT Summer School in Quantitative Analysis of Textual Data. The fascinating research findings suggest that Elena Ferrante\u2019s work definitely deserves \u201cmany hands\u201d as well as an extensive effort to understand her distinct writing style and the reasons for her worldwide success
Recommended from our members
Identifying idiolect in forensic authorship attribution: an n-gram textbite approach
Forensic authorship attribution is concerned with identifying authors of disputed or anonymous documents, which are potentially evidential in legal cases, through the analysis of linguistic clues left behind by writers. The forensic linguist âapproaches this problem of questioned authorship from the theoretical position that every native speaker has their own distinct and individual version of the language [. . . ], their own idiolectâ (Coulthard, 2004: 31). However, given the diXculty in empirically substantiating a theory of idiolect, there is growing concern in the Veld that it remains too abstract to be of practical use (Kredens, 2002; Grant, 2010; Turell, 2010). Stylistic, corpus, and computational approaches to text, however, are able to identify repeated collocational patterns, or n-grams, two to six word chunks of language, similar to the popular notion of soundbites: small segments of no more than a few seconds of speech that journalists are able to recognise as having news value and which characterise the important moments of talk. The soundbite oUers an intriguing parallel for authorship attribution studies, with the following question arising: looking at any set of texts by any author, is it possible to identify ân-gram textbitesâ, small textual segments that characterise that authorâs writing, providing DNA-like chunks of identifying material
- âŠ