
    A Decade of Shared Tasks in Digital Text Forensics at PAN

    Full text link
    [EN] Digital text forensics aims at examining the originality and credibility of information in electronic documents and, in this regard, at extracting and analyzing information about the authors of these documents. The research field has developed substantially during the last decade. PAN is a series of shared tasks that started in 2009 and has significantly contributed to attracting the attention of the research community to well-defined digital text forensics tasks. Several benchmark datasets have been developed to assess state-of-the-art performance on a wide range of tasks. In this paper, we present the evolution of both the examined tasks and the developed datasets during the last decade. We also briefly introduce the upcoming PAN 2019 shared tasks.

    We are indebted to many colleagues and friends who contributed greatly to PAN's tasks: Maik Anderka, Shlomo Argamon, Alberto Barrón-Cedeño, Fabio Celli, Fabio Crestani, Walter Daelemans, Andreas Eiselt, Tim Gollub, Parth Gupta, Matthias Hagen, Teresa Holfeld, Patrick Juola, Giacomo Inches, Mike Kestemont, Moshe Koppel, Manuel Montes-y-Gómez, Aurelio Lopez-Lopez, Francisco Rangel, Miguel Angel Sánchez-Pérez, Günther Specht, Michael Tschuggnall, and Ben Verhoeven. Our special thanks go to PAN's sponsors throughout the years and not least to the hundreds of participants.

    Potthast, M.; Rosso, P.; Stamatatos, E.; Stein, B. (2019). A Decade of Shared Tasks in Digital Text Forensics at PAN. Lecture Notes in Computer Science, 11438:291-300. https://doi.org/10.1007/978-3-030-15719-7_39

    Enron Authorship Verification Corpus

    No full text
    Type of corpus:
    The "Enron Authorship Verification Corpus" is a derivative of the well-known "Enron Email Dataset" [1], which has been used across different research domains beyond authorship verification (AV). The intention behind this corpus is to give other researchers in the field of AV the opportunity to compare their results with each other.

    Language:
    All texts are written in English.

    Format of the corpus:
    The corpus was transformed to meet the same standardized format as the "PAN Authorship Identification corpora" [2]. It consists of 80 AV cases, evenly distributed regarding true (Y) and false (N) authorships, together with the ground truth (Y/N) for all AV cases. Each AV case comprises up to 5 documents (plain text files), where 2-4 documents stem from a known author, while the remaining document has an unknown authorship and is thus the subject of verification. Each document has been written by a single author X and is mostly aggregated from several mails of X, in order to provide a sufficient length that captures X's writing style.

    Preprocessing steps:
    All texts in the corpus were preprocessed by hand, which resulted in an overall processing time of more than 30 hours. The preprocessing includes de-duplication, normalization of UTF-8 symbols, as well as the removal of URLs, e-mail headers, signatures and other metadata. Beyond these, the texts themselves underwent a variety of cleaning procedures, including the removal of greeting/closing formulas, (telephone) numbers, named entities (names of people, companies, locations, etc.), quotes, and repetitions of identical characters/symbols and words. As a last preprocessing step, multiple successive blanks, newlines and tabs were substituted with a single blank.

    Basic statistics:
    The length of each preprocessed text ranges from 2,200 to 5,000 characters. More precisely, the average length of a known document is 3,976 characters, while the average length of an unknown document is 3,899 characters.

    Paper + citation:
    https://link.springer.com/chapter/10.1007/978-3-319-98932-7_4

    References:
    [1] https://www.cs.cmu.edu/~enron
    [2] http://pan.webis.de
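
    Since the corpus follows the PAN Authorship Identification format, the cases can be read in programmatically. The following minimal Python sketch assumes a directory layout with one sub-folder per AV case containing known*.txt and unknown.txt files, plus a truth.txt that maps case ids to Y/N; these file and folder names are assumptions for illustration, not taken from the corpus documentation.

```python
from pathlib import Path

def load_av_corpus(corpus_dir, truth_file="truth.txt"):
    """Load authorship-verification cases from an assumed PAN-style layout:
    one sub-folder per case with known*.txt and unknown.txt, plus a truth
    file that maps each case id to Y (same author) or N (different author)."""
    corpus_dir = Path(corpus_dir)

    # Ground truth: one "case_id Y/N" pair per line (assumed format).
    truth = {}
    with open(corpus_dir / truth_file, encoding="utf-8") as fh:
        for line in fh:
            parts = line.split()
            if len(parts) == 2:
                truth[parts[0]] = parts[1]

    cases = {}
    for case_dir in sorted(p for p in corpus_dir.iterdir() if p.is_dir()):
        known = [f.read_text(encoding="utf-8")
                 for f in sorted(case_dir.glob("known*.txt"))]
        unknown = (case_dir / "unknown.txt").read_text(encoding="utf-8")
        cases[case_dir.name] = {
            "known": known,                     # 2-4 documents of the known author
            "unknown": unknown,                 # the document under verification
            "label": truth.get(case_dir.name),  # Y / N, if present in the truth file
        }
    return cases
```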

    Reddit Cross-Topic Authorship Verification Corpus

    No full text
    The "Reddit Cross-Topic Authorship Verification Corpus" consists of comments written between 2010 to 2016 from 1,000 reddit users. Each problem includes 1 unknown and 4 known documents (~ 7 KByte per document), where each document represents an aggregation of reviews coined from the same so-called subreddit. More precisely, all documents within a problem are disjunct regarding the subreddits in order to enable a cross-topic corpus. All subreddits cover exactly 1,388 different topics such as books, news, gaming, music, movies, etc. The corpus follows excatly the same format as the well-known PAN Authorship Identification corpora (http://pan.webis.de/)

    Author clustering using compression-based dissimilarity scores: Notebook for PAN at CLEF 2017

    No full text
    The PAN 2017 Author Clustering task examines the two application scenarios "complete author clustering" and "authorship-link ranking". In the first scenario, one must identify the number (k) of different authors within a document collection and assign each document to exactly one of the k clusters, where each cluster corresponds to a different author. In the second scenario, one must establish authorship links between documents in a cluster and provide a list of document pairs, ranked according to a confidence score. We present a simple scheme to handle both scenarios. In order to group the documents by their authors, we use k-Medoids, where the optimal k is determined through the computation of silhouettes. To determine links between the documents in each cluster, we apply a predefined compressor as well as a dissimilarity measure. The resulting compression-based dissimilarity scores are then used to rank all document pairs. The proposed scheme does not require (text) preprocessing, feature engineering or hyperparameter optimization, which are often necessary in author clustering and other related fields. However, the achieved results indicate that there is room for improvement.
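
    The abstract does not name the predefined compressor or dissimilarity measure, so the following Python sketch uses zlib and the normalized compression distance (NCD) as one common instantiation of a compression-based dissimilarity, and ranks all document pairs within a cluster by a confidence score derived from it; treat it as an assumed, simplified stand-in rather than the notebook's exact method.

```python
import zlib
from itertools import combinations

def compressed_size(data: bytes) -> int:
    """Compressed size in bytes (zlib at maximum level is an assumption;
    the paper's predefined compressor may differ)."""
    return len(zlib.compress(data, 9))

def ncd(x: str, y: str) -> float:
    """Normalized compression distance:
    NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))."""
    bx, by = x.encode("utf-8"), y.encode("utf-8")
    cx, cy, cxy = compressed_size(bx), compressed_size(by), compressed_size(bx + by)
    return (cxy - min(cx, cy)) / max(cx, cy)

def rank_authorship_links(cluster_docs):
    """Rank all document pairs in a cluster, most likely same-author links first."""
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(cluster_docs), 2):
        score = 1.0 - ncd(a, b)      # turn dissimilarity into a confidence score
        pairs.append(((i, j), score))
    return sorted(pairs, key=lambda p: p[1], reverse=True)
```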

    Enron Authorship Verification Corpus

    No full text
    The "Enron Authorship Verification Corpus" is a derivate of the well-known "Enron Email Dataset", which was transformed in such a way to meet the same standardized format of the "PAN Authorship Identification corpora" (http://pan.webis.de). The corpus consists of 80 authorship verification cases, evenly distributed regarding true/false authorships. Each authorship verification case comprise exactly 5 documents (plain text files). Here, 4 documents represent samples from the known (true) author, while the remaining 1 document represents the text of the unknown author (the subject of verification). The corpus is ballanced, not only in terms of the same number of known documents per case, but also regarding the lenth of the texts, which is near-equal (3-4 kilobyte per text). It can be assumed that each document is aggregated from (short) mails of the same author, in order to have a sufficient length that captures the authors writing style. All texts in the corpus have undergone the same preprocessing-procedure: De-duplication, removing of URL's, newlines/tabs, normalization of utf-8 symbols and substitution of multiple successive blanks with a single blank. All e-mail headers and other metadata (including signatures) have been removed from each document such that it contains only pure natural language text fron a single author. The intention behind this corpus is to provide other researchers in the field of authorship verification the opportunity to compare their results to each other

    Natural language watermarking for German texts

    No full text
    In this paper we present four informed natural language watermark embedding methods, which operate on the lexical and syntactic layer of German texts. Our scheme provides several benefits in comparison to state-of-the-art approaches, for instance that it does not rely on complex NLP operations such as full sentence parsing, word sense disambiguation, named entity recognition or semantic role parsing. Even rich lexical resources (e.g. WordNet or the Collins thesaurus), which play an essential role in many previous approaches, are unnecessary for our system. Instead, our methods require only a part-of-speech tagger, simple wordlists that act as black- and whitelists, and a trained classifier, which automatically predicts the ability of potential lexical or syntactic patterns to carry portions of the watermark message. Besides this, some of the proposed methods can be easily adapted to other Indo-European languages, since the grammar rules the methods rely on are not restricted to the German language. Because the methods perform only lexical and minor syntactic transformations, the watermarked text is not affected by grammatical distortion, and the meaning of the text is preserved in 82.14% of the cases.
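
    To illustrate the basic idea behind lexical watermark embedding (not the authors' actual method, which relies on a POS tagger, curated black-/whitelists and a trained classifier, and targets German texts), the following toy Python sketch encodes message bits by choosing between interchangeable word variants; the variant table and the English example are purely hypothetical.

```python
# Minimal illustration of lexical watermark embedding: each occurrence of a
# "carrier" word can encode one bit of the message by choosing between two
# interchangeable variants. The table below is hypothetical; the paper's
# methods decide automatically which positions may safely carry bits.

CARRIERS = {                  # hypothetical variant pairs (bit 0 / bit 1)
    "car": ("car", "automobile"),
    "begin": ("begin", "start"),
}

def embed_bits(tokens, bits):
    """Replace carrier tokens so that the chosen variant encodes one bit each."""
    bit_iter = iter(bits)
    out = []
    for token in tokens:
        variants = CARRIERS.get(token.lower())
        if variants is None:
            out.append(token)
            continue
        try:
            bit = next(bit_iter)
        except StopIteration:
            out.append(token)          # message exhausted, leave text unchanged
            continue
        out.append(variants[bit])
    return out

# Example: embed the bits 1, 0 into a tokenized sentence.
print(" ".join(embed_bits("we begin with a car".split(), [1, 0])))
# -> "we start with a car"
```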

    Fake News Detection with the New German Dataset "GermanFakeNC"

    No full text
    The spread of misleading information and “alternative facts” on the internet has gained considerable importance worldwide in the last decade. In recent years, several attempts have been made to counteract fake news with automatic classification via machine learning models. These, however, require labeled data. The scarcity of available corpora for predictive modeling is a major stumbling block in this field, especially in languages other than English. Our contribution is twofold. First, we introduce a new publicly available German dataset, the “German Fake News Corpus” (GermanFakeNC), for the task of fake news detection, which consists of 490 manually fact-checked articles. Every false statement in the text is verified claim-by-claim against authoritative sources. Our ground truth for trustworthy news consists of 4,500 news articles from well-known mainstream news publishers. As the second contribution, we apply a Convolutional Neural Network (CNN) (κ = 0.89) and the widely used SVM (κ = 0.72) to detect fake news. We hope that our approach will stimulate progress in fake news detection and claim verification across languages.
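
    The abstract does not detail the feature set or SVM configuration used for the reported kappa scores, so the following Python sketch shows only a generic TF-IDF plus linear SVM baseline evaluated with Cohen's kappa; dataset loading is omitted, and `texts`/`labels` are placeholders standing in for the GermanFakeNC articles and their fake/trustworthy labels.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import cohen_kappa_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def svm_fake_news_baseline(texts, labels):
    """Train a TF-IDF + linear SVM classifier and report Cohen's kappa.

    Generic baseline sketch: the feature set and SVM variant used in the
    paper are not specified in the abstract."""
    x_train, x_test, y_train, y_test = train_test_split(
        texts, labels, test_size=0.2, random_state=0, stratify=labels)

    model = make_pipeline(
        TfidfVectorizer(lowercase=True, ngram_range=(1, 2), min_df=2),
        LinearSVC())
    model.fit(x_train, y_train)

    kappa = cohen_kappa_score(y_test, model.predict(x_test))
    print(f"Cohen's kappa on the held-out split: {kappa:.2f}")
    return model
```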