1,408 research outputs found

    On the Mono- and Cross-Language Detection of Text Re-Use and Plagiarism

    Full text link
    Barrón Cedeño, LA. (2012). On the Mono- and Cross-Language Detection of Text Re-Use and Plagiarism [Tesis doctoral no publicada]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/16012Palanci

    Composing Measures for Computing Text Similarity

    Get PDF
    We present a comprehensive study of computing similarity between texts. We start from the observation that while the concept of similarity is well grounded in psychology, text similarity is much less well-defined in the natural language processing community. We thus define the notion of text similarity and distinguish it from related tasks such as textual entailment and near-duplicate detection. We then identify multiple text dimensions, i.e. characteristics inherent to texts that can be used to judge text similarity, for which we provide empirical evidence. We discuss state-of-the-art text similarity measures previously proposed in the literature, before continuing with a thorough discussion of common evaluation metrics and datasets. Based on the analysis, we devise an architecture which combines text similarity measures in a unified classification framework. We apply our system in two evaluation settings, for which it consistently outperforms prior work and competing systems: (a) an intrinsic evaluation in the context of the Semantic Textual Similarity Task as part of the Semantic Evaluation (SemEval) exercises, and (b) an extrinsic evaluation for the detection of text reuse. As a basis for future work, we introduce DKPro Similarity, an open source software package which streamlines the development of text similarity measures and complete experimental setups

    Overview of the 1st international competition on plagiarism detection

    Get PDF
    The 1st International Competition on Plagiarism Detection, held in conjunction with the 3rd PAN workshop on Uncovering Plagiarism, Authorship, and Social Software Misuse, brought together researchers from many disciplines around the exciting retrieval task of automatic plagiarism detection. The competition was divided into the subtasks external plagiarism detection and intrinsic plagiarism detection, which were tackled by 13 participating groups. An important by-product of the competition is an evaluation framework for plagiarism detection, which consists of a large-scale plagiarism corpus and detection quality measures. The framework may serve as a unified test environment to compare future plagiarism detection research. In this paper we describe the corpus design and the quality measures, survey the detection approaches developed by the participants, and compile the achieved performance results of the competitors

    Paraphrase Plagiarism Identifcation with Character-level Features

    Full text link
    [EN] Several methods have been proposed for determining plagiarism between pairs of sentences, passages or even full documents. However, the majority of these methods fail to reliably detect paraphrase plagiarism due to the high complexity of the task, even for human beings. Paraphrase plagiarism identi cation consists in automatically recognizing document fragments that contain re-used text, which is intentionally hidden by means of some rewording practices such as semantic equivalences, discursive changes, and morphological or lexical substitutions. Our main hypothesis establishes that the original author's writing style ngerprint prevails in the plagiarized text even when paraphrases occur. Thus, in this paper we propose a novel text representation scheme that gathers both content and style characteristics of texts, represented by means of character-level features. As an additional contribution, we describe the methodology followed for the construction of an appropriate corpus for the task of paraphrase plagiarism identi cation, which represents a new valuable resource to the NLP community for future research work in this field.This work is the result of the collaboration in the framework of the CONACYT Thematic Networks program (RedTTL Language Technologies Network) and the WIQ-EI IRSES project (Grant No. 269180) within the FP7 Marie Curie action. The first author was supported by CONACYT (Scholarship 258345/224483). The second, third, and sixth authors were partially supported by CONACyT (Project Grants 258588 and 2410). The work of the fourth author was partially supported by the SomEMBED TIN2015-71147-C2-1-P MINECO research project and by the Generalitat Valenciana under the Grant ALMAMATER (PrometeoII/2014/030).Sánchez-Vega, F.; Villatoro-Tello, E.; Montes-Y-Gómez, M.; Rosso, P.; Stamatatos, E.; Villaseñor-Pineda, L. (2019). Paraphrase Plagiarism Identifcation with Character-level Features. Pattern Analysis and Applications. 22(2):669-681. https://doi.org/10.1007/s10044-017-0674-zS66968122

    Plagiarism detection for Indonesian texts

    Get PDF
    As plagiarism becomes an increasing concern for Indonesian universities and research centers, the need of using automatic plagiarism checker is becoming more real. However, researches on Plagiarism Detection Systems (PDS) in Indonesian documents have not been well developed, since most of them deal with detecting duplicate or near-duplicate documents, have not addressed the problem of retrieving source documents, or show tendency to measure document similarity globally. Therefore, systems resulted from these researches are incapable of referring to exact locations of ``similar passage'' pairs. Besides, there has been no public and standard corpora available to evaluate PDS in Indonesian texts. To address the weaknesses of former researches, this thesis develops a plagiarism detection system which executes various methods of plagiarism detection stages in a workflow system. In retrieval stage, a novel document feature coined as phraseword is introduced and executed along with word unigram and character n-grams to address the problem of retrieving source documents, whose contents are copied partially or obfuscated in a suspicious document. The detection stage, which exploits a two-step paragraph-based comparison, is aimed to address the problems of detecting and locating source-obfuscated passage pairs. The seeds for matching source-obfuscated passage pairs are based on locally-weighted significant terms to capture paraphrased and summarized passages. In addition to this system, an evaluation corpus was created through simulation by human writers, and by algorithmic random generation. Using this corpus, the performance evaluation of the proposed methods was performed in three scenarios. On the first scenario which evaluated source retrieval performance, some methods using phraseword and token features were able to achieve the optimum recall rate 1. On the second scenario which evaluated detection performance, our system was compared to Alvi's algorithm and evaluated in 4 levels of measures: character, passage, document, and cases. The experiment results showed that methods resulted from using token as seeds have higher scores than Alvi's algorithm in all 4 levels of measures both in artificial and simulated plagiarism cases. In case detection, our systems outperform Alvi's algorithm in recognizing copied, shaked, and paraphrased passages. However, Alvi's recognition rate on summarized passage is insignificantly higher than our system. The same tendency of experiment results were demonstrated on the third experiment scenario, only the precision rates of Alvi's algorithm in character and paragraph levels are higher than our system. The higher Plagdet scores produced by some methods in our system than Alvi's scores show that this study has fulfilled its objective in implementing a competitive state-of-the-art algorithm for detecting plagiarism in Indonesian texts. Being run at our test document corpus, Alvi's highest scores of recall, precision, Plagdet, and detection rate on no-plagiarism cases correspond to its scores when it was tested on PAN'14 corpus. Thus, this study has contributed in creating a standard evaluation corpus for assessing PDS for Indonesian documents. Besides, this study contributes in a source retrieval algorithm which introduces phrasewords as document features, and a paragraph-based text alignment algorithm which relies on two different strategies. One of them is to apply local-word weighting used in text summarization field to select seeds for both discriminating paragraph pair candidates and matching process. The proposed detection algorithm results in almost no multiple detection. This contributes to the strength of this algorithm

    Two-Step Cluster Based Feature Discretization of Naive Bayes for Outlier Detection in Intrinsic Plagiarism Detection

    Full text link
    Intrinsic plagiarism detection is the task of analyzing a document with respect to undeclared changes in writing style which treated as outliers. Naive Bayes is often used to outlier detection. However, Naive Bayes has assumption that the values of continuous feature are normally distributed where this condition is strongly violated that caused low classification performance. Discretization of continuous feature can improve the performance of NaĂŻve Bayes. In this study, feature discretization based on Two-Step Cluster for NaĂŻve Bayes has been proposed. The proposed method using tf-idf and query language model as feature creator and False Positive/False Negative (FP/FN) threshold which aims to improve the accuracy and evaluated using PAN PC 2009 dataset. The result indicated that the proposed method with discrete feature outperform the result from continuous feature for all evaluation, such as recall, precision, f-measure and accuracy. The using of FP/FN threshold affects the result as well since it can decrease FP and FN; thus, increase all evaluation

    Author Profiling and Plagiarism Detection

    Full text link
    The final publication is available at Springer via http://dx.doi.org/10.1007/978-3-319-25485-2_6In this chapter we introduce the topics that we will cover in the RuSSIR 2014 course on Author Profiling and Plagiarism Detection (APPD). Author profiling distinguishes between classes of authors studying how language is shared by classes of people. This task helps in identifying profiling aspects such as gender, age, native language, or even personality type. In case of the plagiarism detection task we are not interested in studying how language is shared. On the contrary, given a document we are interested in investigating if the writing style changes in order to unveil text inconsistencies, i.e., unexpected irregularities through the document such as changes in vocabulary, style and text complexity. In fact, when it is not possible to retrieve the source document(s) where plagiarism has been committed from, the intrinsic analysis of the suspicious document is the only way to find evidence of plagiarism. The difficulty in retrieving the source of plagiarism could be due to the fact that the documents are not available on the web or the plagiarised text fragments were obfuscated via paraphrasing or translation (in case the source document was in another language). In this overview, we also discuss the results of the shared tasks on author profiling (gender and age identification) and plagiarism detection that we help to organise at the PAN Lab on Uncovering Plagiarism, Authorship, and Social Software Misuse.The PAN shared tasks on author profil-ing and on plagiarism detection have been organised in the framework of the WIQ-EIIRSES project (Grant No. 269180) within the EC FP 7 Marie Curie People. The research work described in the paper was carried out in the framework of the DIANA-APPLICATIONS-Finding Hidden Knowledge in Texts: Applications (TIN2012-38603-C02-01) project, and the VLC/CAMPUS Microcluster on Multimodal Interaction inIntelligent Systems.Rosso, P. (2015). Author Profiling and Plagiarism Detection. En Information Retrieval. Springer. 229-250. https://doi.org/10.1007/978-3-319-25485-2_6S229250Argamon, S., Koppel, M., Fine, J., Shimoni, A.R.: Gender, genre, and writing style in formal written texts. TEXT 23, 321–346 (2003)Association of Teachers and Lecturers. School work plagued by plagiarism - ATL survey. Technical report, Association of Teachers and Lecturers, London, UK (2008). (Press release)Barrón-Cedeño, A.: On the mono- and cross-language detection of text re-use and plagiarism. Ph.D. thesis, Universitat Politènica de València (2012)Barrón-Cedeño, A., Rosso, P., Pinto, D., Juan, A.: On cross-lingual plagiarism analysis using a statistical model. In: Proceedings of the ECAI 2008 Workshop on Uncovering Plagiarism, Authorship and Social Software Misuse, PAN 2008 (2008)Barrón-Cedeño, A., Gupta, P., Rosso, P.: Methods for cross-language plagiarism detection. Knowl. Based Syst. 50, 11–17 (2013)Barrón-Cedeño, A., Vila, M., Martí, M., Rosso, P.: Plagiarism meets paraphrasing: insights for the next generation in automatic plagiarism detection. Comput. Linguist. 39(4), 917–947 (2013)Bogdanova, D., Rosso, P., Solorio, T.: Exploring high-level features for detecting cyberpedophilia. Comput. Speech Lang. 28(1), 108–120 (2014)Braschler, M., Harman, D.: Notebook papers of CLEF 2010 LABs and workshops. Padua, Italy (2010)Cappellato, L., Ferro, N., Halvey, M., Kraaij, W.: CLEF 2014 labs and workshops, notebook papers. In: CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613–0073 (2014). http://ceur-ws.org/Vol-1180/Comas, R., Sureda, J., Nava, C., Serrano, L.: Academic cyberplagiarism: a descriptive and comparative analysis of the prevalence amongst the undergraduate students at Tecmilenio University (Mexico) and Balearic Islands University (Spain). In: Proceedings of the International Conference on Education and New Learning Technologies (EDULEARN 2010), Barcelona (2010)Flesch, R.: A new readability yardstick. J. Appl. Psychol. 32(3), 221–233 (1948)Flores, E., Barrón-Cedeño, A., Rosso, P., Moreno, L.: Desocore: detecting source code re-use across programming languages. In: Proceedings of 12th International Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-2012, pp. 1–4, Montreal, Canada (2012)Flores, E., Barrón-Cedeño, A., Moreno, L., Rosso, P.: Uncovering source code re-use in large-scale programming environments. In: Computer Applications in Engineering and Education, Accepted (2014). doi: 10.1002/cae.21608Forner, P., Navigli, R., Tufis, D.: CLEF 2013 evaluation labs and workshop - working notes papers, 23–26 September. Valencia, Spain (2013)Franco-Salvador, M., Gupta, P., Rosso, P.: Cross-Language plagiarism detection using a multilingual semantic network. In: Braslavski, P., Kuznetsov, S.O., Kamps, J., Rüger, S., Agichtein, E., Segalovich, I., Yilmaz, E., Serdyukov, P. (eds.) ECIR 2013. LNCS, vol. 7814, pp. 710–713. Springer, Heidelberg (2013)Franco-Salvador, M., Gupta, P., Rosso, P.: Knowledge graphs as context models: improving the detection of cross-language plagiarism with paraphrasing. In: Ferro, N. (ed.) PROMISE Winter School 2013. LNCS, vol. 8173, pp. 227–236. Springer, Heidelberg (2014)Gollub, T., Stein, B., Burrows, S.: Ousting Ivory tower research: towards a web framework for providing experiments as a service. In: Hersh, B., Callan, J., Maarek, Y., Sanderson, M., (eds.) 35th International ACM Conference on Research and Development in Information Retrieval (SIGIR 2012), pp. 1125–1126. ACM, August 2012. ISBN 978-1-4503-1472-5. doi: 10.1145/2348283.2348501Gollub, T., Hagen, M., Michel, M., Stein, B.: From keywords to keyqueries: content descriptors for the web. In: Gurrin, C., Jones, G., Kelly, D., Kruschwitz, U., de Rijke, M., Sakai, T., Sheridan, P., (eds.) 36th International ACM Conference on Research and Development in Information Retrieval (SIGIR 2013), pp. 981–984. ACM (2013)Goswami, S., Sarkar, S., Rustagi, M.: Stylometric analysis of bloggers’ age and gender. In: Adar, E., Hurst, M., Finin, T., Glance, N.S., Nicolov, N., Tseng, B.L., (eds.) ICWSM. The AAAI Press (2009)Gressel, G., Hrudya, P., Surendran, K., Thara, S., Aravind, A., Prabaharan, P.: Ensemble Learning Approach for Author Profiling-Notebook for PAN at CLEF 2014. In: Cappellato, et al. [9]Grozea, C., Popescu, M.: ENCOPLOT - performance in the Second International Plagiarism Detection Challenge lab report for PAN at CLEF 2010. In: Braschler and Harman [8]Grozea, C., Gehl, C., Popescu, M.: ENCOPLOT: pairwise sequence matching in linear time applied to plagiarism detection. In: Stein et al., (ed.) Overview of the 1st International Competition on Plagiarism Detection, pp. 10–18 (2009)Gunning, R.: The Technique of Clear Writing. McGraw-Hill Int. Book Co, New York (1952)Gupta, P., Barrón-Cedeño, A., Rosso, P.: Cross-language high similarity search using a conceptual thesaurus. In: Catarci, T., Peñas, A., Santucci, G., Forner, P., Hiemstra, D. (eds.) CLEF 2012. LNCS, vol. 7488, pp. 67–75. Springer, Heidelberg (2012)Honore, A.: Some simple measures of richness of vocabulary. Assoc. Lit. Linguist. Comput. Bull. 7(2), 172–177 (1979)IEEE. A Plagiarism FAQ. http://www.ieee.org/publications_standards/publications/rights/plagiarism_FAQ.html (2008). Published: 2008; Last Accessed 25 November 2012Koppel, M., Argamon, S., Shimoni, A.R.: Automatically categorizing written texts by author gender. Lit. Linguist. Comput. 17(4), 401–412 (2002)Liau, Y., Vrizlynn, L.: Submission to the author profiling competition at pan-2014. In: Proceedings Recent Advances in Natural Language Processing III (2014). http://www.webis.de/research/events/pan-14Lopez-Monroy, A.P., Montes-Y-Gomez, M., Escalante, H.J., Villaseñor-Pineda, L., Villatoro-Tello, E.: INAOE’s participation at PAN 2013: author profiling task–notebook for PAN at CLEF 2013. In: Forner, et al. [14]Pastor López-Monroy, A., Montes y Gómez, M., Escalante, H.J., Villaseñor-Pineda, L.: Using Intra-profile information for author profiling-notebook for PAN at CLEF 2014. In: Cappellato, et al. [9]Maharjan, S., Shrestha, P., Solorio, T.: A simple approach to author profiling in MapReduce–notebook for PAN at CLEF 2014. In: Cappellato, et al. [9]Marquardt, J., Fanardi, G., Vasudevan, G., Moens, M.F., Davalos, S., Teredesai, A., De Cock, M.: Age and gender identification in social media-notebook for PAN at CLEF 2014. In: Cappellato, et al. [9]Martin, B.: Plagiarism: policy against cheating or policy for learning? Nexus (Newsl. Aust. Sociol. Assoc.) 16(2), 15–16 (2004)Mcnamee, P., Mayfield, J.: Character n-gram tokenization for european language text retrieval. Inf. Retr. 7(1), 73–97 (2004)Meina, M., Brodzinska, K., Celmer, B., Czokow, M., Patera, M., Pezacki, J., Wilk, M.: Ensemble-based classification for author profiling using various features-notebook for PAN at CLEF 2013. In: Forner, et al. [14]Eissen, S.M., Stein, B.: Intrinsic plagiarism detection. In: Tombros, A., Yavlinsky, A., Rüger, S.M., Tsikrika, T., Lalmas, M., MacFarlane, A. (eds.) ECIR 2006. LNCS, vol. 3936, pp. 565–569. Springer, Heidelberg (2006)Montes y Gómez, M., Gelbukh, A.F., López-López, A., Baeza-Yates, R.A.: Flexible comparison of conceptual graphs. In: Proceedings DEXA, pp. 102–111 (2001)Navigli, R., Ponzetto, S.P.: BabelNet: the automatic construction, evaluation and application of a wide-coverage multilingual semantic network. Artif. Intell. 193, 217–250 (2012)Nawab, R.M.A., Stevenson, M., Clough, P.: University of sheffield lab report for pan at clef 2010. In: Braschler and Harman [8]Nguyen, D., Gravel, R., Trieschnigg, D., Meder, T.: “how old do you think i am?”; a study of language and age in twitter. In: Proceedings of the Seventh International AAAI Conference on Weblogs and Social Media (2013)Oberreuter, G., Eiselt, A.: Submission to the 6th international competition on plagiarism detection, From Innovand.io, Chile (2014). http://www.webis.de/research/events/pan-14Och, F.J., Ney, H.: A systematic comparison of various statistical alignment models. Comput. Linguist. 29(1), 19–51 (2003)Palkovskii, Y., Belov, A.: Developing high-resolution universal multi-type N-Gram plagiarism detector-notebook for PAN at CLEF 2014. In: Cappellato, et al. [9]Pennebaker, J.W., Mehl, M.R., Niederhoffer, K.G.: Psychological aspects of natural language use: our words, our selves. Ann. Rev. Psychol. 54(1), 547–577 (2003)Potthast, M., Stein, B., Barrón-Cedeño, A., Rosso, P.: An evaluation framework for plagiarism detection. In: COLING 2010: Proceedings of the 23rd International Conference on Computational Linguistics, pp. 997–1005 (2010)Potthast, M., Stein, B., Anderka, M.: A wikipedia-based multilingual retrieval model. In: Plachouras, V., Macdonald, C., Ounis, I., White, R.W., Ruthven, I. (eds.) ECIR 2008. LNCS, vol. 4956, pp. 522–530. Springer, Heidelberg (2008)Potthast, M., Stein, B., Eiselt, A., Barrón-Cedeño, A., Rosso, P.:. Overview of the 1st international competition on plagiarism detection. In: Stein, B., Rosso, P., Stamatatos, E., Koppel, M., Agirre, E., (eds.) Proceedings of the SEPLN 2009 Workshop on Uncovering Plagiarism, Authorship, and Social Software Misuse (PAN 2009), pp. 1–9, 2009. CEUR-WS.org (September 2009). http://ceur-ws.org/Vol-502Potthast, M., Barrón-Cedeño, A., Eiselt, A., Stein, B., Rosso, P.: Overview of the 2nd International Competition on Plagiarism Detection. In: Braschler and Harman [8]Potthast, M., Barrón-Cedeño, A., Eiselt, A., Stein, B., Rosso, P.: Overview of the 2nd international competition on plagiarism detection. In: Braschler, M., Harman, D., Pianta, E., (eds.) Working Notes Papers of the CLEF 2010 Evaluation Labs (September 2010) 2010. http://www.clef-initiative.eu/publication/working-notesPotthast, M., Barrón-Cedeño, A., Stein, B., Rosso, P.: Cross-language plagiarism detection. Lang. Resour. Eval. 45(1), 45–62 (2011)Potthast, M., Eiselt, A., Barrón-Cedeño, A., Stein, B., Rosso, P.: Overview of the 3rd international competition on plagiarism detection. In: Petras, V., Forner, P., Clough, P., (eds.) Working Notes Papers of the CLEF 2011 Evaluation Labs (September 2011) (2011). http://www.clef-initiative.eu/publication/working-notesPotthast, M., Gollub, T., Hagen, M., Grabegger, J., Kiesel, J., Michel, M., Oberlander, A., Tippmann, M., Barrón-Cedeño, A., Gupta, P., Rosso, P., Stein, B.: Overview of the 4th international competition on plagiarism detection. In: Forner, P., Karlgren, J., Womser-Hacker, C., (eds.) Working Notes Papers of the CLEF 2012 Evaluation Labs (September 2012) (2012). http://www.clef-initiative.eu/publication/working-notesPotthast, M., Hagen, M., Stein, B., Grabegger, J., Michel, M., Tippmann, M., Welsch, C.: Chatnoir: a search engine for the clueweb09 corpus. In: Hersh, B., Callan, J., Maarek, Y., Sanderson, M., (eds.) 35th International ACM Conference on Research and Development in Information Retrieval (SIGIR 2012), p. 1004 (2012)Potthast, M., Gollub, T., Hagen, M., Tippmann, M., Kiesel, J., Rosso, P., Stamatatos, E., Stein, B.: Overview of the 5th international competition on plagiarism detection. In: Forner, et al. [14]Potthast, M., Hagen, M., Beyer, A., Busse, M., Tippmann, M., Rosso, P., Stein, B.: Overview of the 6th International Competition on Plagiarism Detection. In: Cappellato, et al. [9]Pouliquen, B., Steinberger, R., Ignat, C.: Automatic linking of similar texts across languages. In: Proceedings of Recent Advances in Natural Language Processing III, RANLP 2003, pp. 307–316 (2003)Prakash, A., Saha, S.: Experiments on document chunking and query formation for plagiarism source retrieval-notebook for PAN at CLEF 2014. In: Cappellato, et al. [9]Rangel, F., Rosso, P., Koppel, M., Stamatatos, E., Inches, G.: Overview of the author profiling task at PAN 2013–notebook for PAN at CLEF 2013. In: Forner, et al. [14]Rangel, F., Rosso, P., Chugur, I., Potthast, M., Trenkman, M., Stein, B., Verhoeven, B., Daelemans, W.: Overview of the 2nd author profiling task at PAN 2014–notebook for PAN at CLEF 2014. In: Cappellato, et al. [9]Sanchez-Perez, M., Sidorov, G., Gelbukh, A.: A winning approach to text alignment for text reuse detection at PAN 2014-notebook for PAN at CLEF 2014. In: Cappellato, et al. [9]Schler, J., Koppel, M., Argamon, S., Pennebaker, J.W.: Effects of age and gender on blogging. In: AAAI Spring Symposium: Computational Approaches to Analyzing Weblogs, pp. 199–205. AAAI (2006)Stamatatos, E.: Intrinsic plagiarism detection using character n-gram profiles. In: Stein, B., Rosso, P., Stamatatos, E., Koppel, M., Agirre, E., (eds.) Proceedings of the SEPLN09 Workshop on Uncovering Plagiarism, Authorship, and Social Software Misuse (PAN 2009), pp. 38–46, 2009. CEUR-WS.org, September 2009. http://ceur-ws.org/Vol-502Stein, B., Meyer zu Eissen, S., Potthast, M.: Strategies for retrieving plagiarized documents. In: Clarke, C., Fuhr, N., Kando, N., Kraaij, W., de Vries, A., (eds.) 30th International ACM Conference on Research and Development in Information Retrieval (SIGIR 2007), pp. 825–826. ACM (2007)Stein, B., Potthast, M., Rosso, P., Barrón-Cedeño, A., Stamatatos, E., Koppel, M.: Fourth international workshop on uncovering plagiarism, authorship, and social software misuse. ACM SIGIR Forum 45, 45–48 (2011)Steinberger, R., Pouliquen, B., Widiger, A., Ignat, C., Erjavec, T., Tufis, D., Varga, D.: The jrc-acquis: a multilingual aligned parallel corpus with +20 languages. In: Proceedings of 5th International Conference on language resources and evaluation LREC 2006 (2006)Suchomel, S., Brandejs, M.: Heterogeneous queries for synoptic and phrasal search-notebook for PAN at CLEF 2014. In: Cappellato, et al. [9]Villena-Román, J., González-Cristóbal, J.C.: DAEDALUS at PAN 2014: Guessing Tweet Author’s Gender and Age-Notebook for PAN at CLEF 2014. In: Cappellato, et al. [9]Vossen, P.: Eurowordnet: a multilingual database of autonomous and language-specific wordnets connected via an inter-lingual index. Int. J. Lexicography 17, 161–173 (2004)Wang, H., Lu, Y., Zhai, C.: Latent aspect rating analysis on review text data: a rating regression approach. In: Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 783–792 (2010)Weren, E.R.D., Moreira, V.P., de Oliveira, J.P.M.:. Exploring information retrieval features for author profiling-notebook for PAN at CLEF 2014. In: Cappellato, et al. [9]Williams, K., Chen, H.H., Giles, C.: Supervised ranking for plagiarism source retrieval-notebook for PAN at CLEF 2014. In: Cappellato, et al. [9]Yule, G.: The Statistical Study of Literary Vocabulary. Cambridge University press, Cambridge (1944)Zubarev, D., Sochenkov, I.: Using sentence similarity measure for plagiarism source retrieval-notebook for PAN at CLEF 2014. In: Cappellato, L., et al. [9

    On the Feasibility of Malware Authorship Attribution

    Full text link
    There are many occasions in which the security community is interested to discover the authorship of malware binaries, either for digital forensics analysis of malware corpora or for thwarting live threats of malware invasion. Such a discovery of authorship might be possible due to stylistic features inherent to software codes written by human programmers. Existing studies of authorship attribution of general purpose software mainly focus on source code, which is typically based on the style of programs and environment. However, those features critically depend on the availability of the program source code, which is usually not the case when dealing with malware binaries. Such program binaries often do not retain many semantic or stylistic features due to the compilation process. Therefore, authorship attribution in the domain of malware binaries based on features and styles that will survive the compilation process is challenging. This paper provides the state of the art in this literature. Further, we analyze the features involved in those techniques. By using a case study, we identify features that can survive the compilation process. Finally, we analyze existing works on binary authorship attribution and study their applicability to real malware binaries.Comment: FPS 201

    Overview of PAN 2018. Author identification, author profiling, and author obfuscation

    Full text link
    [EN] PAN 2018 explores several authorship analysis tasks enabling a systematic comparison of competitive approaches and advancing research in digital text forensics.More specifically, this edition of PAN introduces a shared task in cross-domain authorship attribution, where texts of known and unknown authorship belong to distinct domains, and another task in style change detection that distinguishes between single author and multi-author texts. In addition, a shared task in multimodal author profiling examines, for the first time, a combination of information from both texts and images posted by social media users to estimate their gender. Finally, the author obfuscation task studies how a text by a certain author can be paraphrased so that existing author identification tools are confused and cannot recognize the similarity with other texts of the same author. New corpora have been built to support these shared tasks. A relatively large number of software submissions (41 in total) was received and evaluated. Best paradigms are highlighted while baselines indicate the pros and cons of submitted approaches.The work at the Universitat Polit`ecnica de Val`encia was funded by the MINECO research project SomEMBED (TIN2015-71147-C2-1-P)Stamatatos, E.; Rangel-Pardo, FM.; Tschuggnall, M.; Stein, B.; Kestemont, M.; Rosso, P.; Potthast, M. (2018). Overview of PAN 2018. Author identification, author profiling, and author obfuscation. Lecture Notes in Computer Science. 11018:267-285. https://doi.org/10.1007/978-3-319-98932-7_25S26728511018Argamon, S., Juola, P.: Overview of the international authorship identification competition at PAN-2011. In: Petras, V., Forner, P., Clough, P. (eds.) Notebook Papers of CLEF 2011 Labs and Workshops, 19–22 September 2011, Amsterdam, Netherlands, September 2011. http://www.clef-initiative.eu/publication/working-notesBird, S., Klein, E., Loper, E.: Natural Language Processing with Python. O’Reilly Media, Sebastopol (2009)Bogdanova, D., Lazaridou, A.: Cross-language authorship attribution. In: Proceedings of the 9th International Conference on Language Resources and Evaluation, LREC 2014, pp. 2015–2020 (2014)Choi, F.Y.: Advances in domain independent linear text segmentation. In: Proceedings of the 1st North American Chapter of the Association for Computational Linguistics Conference (NAACL), pp. 26–33. Association for Computational Linguistics, Seattle, April 2000Custódio, J.E., Paraboni, I.: EACH-USP ensemble cross-domain authorship attribution. In: Working Notes Papers of the CLEF 2018 Evaluation Labs, September 2018, to be announcedDaneshvar, S.: Gender identification in Twitter using n-grams and LSA. In: Working Notes Papers of the CLEF 2018 Evaluation Labs, September 2018, to be announcedDaniel Karaś, M.S., Sobecki, P.: OPI-JSA at CLEF 2017: author clustering and style breach detection. In: Working Notes Papers of the CLEF 2017 Evaluation Labs. CEUR Workshop Proceedings. CLEF and CEUR-WS.org, September 2017Giannella, C.: An improved algorithm for unsupervised decomposition of a multi-author document. The MITRE Corporation. Technical Papers, February 2014Glover, A., Hirst, G.: Detecting stylistic inconsistencies in collaborative writing. In: Sharples, M., van der Geest, T. (eds.) The New Writing Environment, pp. 147–168. Springer, London (1996). https://doi.org/10.1007/978-1-4471-1482-6_12Hagen, M., Potthast, M., Stein, B.: Overview of the author obfuscation task at PAN 2017: safety evaluation revisited. In: Cappellato, L., Ferro, N., Goeuriot, L., Mandl, T. (eds.) Working Notes Papers of the CLEF 2017 Evaluation Labs. CEUR Workshop Proceedings, CLEF and CEUR-WS.org, September 2017Hagen, M., Potthast, M., Stein, B.: Overview of the author obfuscation task at PAN 2018. In: Working Notes Papers of the CLEF 2018 Evaluation Labs. CEUR Workshop Proceedings, CLEF and CEUR-WS.org (2018)Hellekson, K., Busse, K. (eds.): The Fan Fiction Studies Reader. University of Iowa Press, Iowa City (2014)Juola, P.: An overview of the traditional authorship attribution subtask. In: Forner, P., Karlgren, J., Womser-Hacker, C. (eds.) CLEF 2012 Evaluation Labs and Workshop - Working Notes Papers, 17–20 September 2012, Rome, Italy, September 2012. http://www.clef-initiative.eu/publication/working-notesJuola, P.: The rowling case: a proposed standard analytic protocol for authorship questions. Digital Sch. Humanit. 30(suppl–1), i100–i113 (2015)Kestemont, M., Luyckx, K., Daelemans, W., Crombez, T.: Cross-genre authorship verification using unmasking. Engl. Stud. 93(3), 340–356 (2012)Kestemont, M., et al.: Overview of the author identification task at PAN-2018: cross-domain authorship attribution and style change detection. In: Working Notes Papers of the CLEF 2018 Evaluation Labs. CEUR Workshop Proceedings, CLEF and CEUR-WS.org (2018)Koppel, M., Schler, J., Bonchek-Dokow, E.: Measuring differentiability: unmasking pseudonymous authors. J. Mach. Learn. Res. 8, 1261–1276 (2007)Overdorf, R., Greenstadt, R.: Blogs, Twitter feeds, and reddit comments: cross-domain authorship attribution. Proc. Priv. Enhanc. Technol. 2016(3), 155–171 (2016)Pedregosa, F., et al.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)Potthast, M., Eiselt, A., Barrón-Cedeño, A., Stein, B., Rosso, P.: Overview of the 3rd international competition on plagiarism detection. In: Notebook Papers of the 5th Evaluation Lab on Uncovering Plagiarism, Authorship and Social Software Misuse (PAN), Amsterdam, The Netherlands, September 2011Potthast, M., Hagen, M., Stein, B.: Author obfuscation: attacking the state of the art in authorship verification. In: Working Notes Papers of the CLEF 2016 Evaluation Labs. CEUR Workshop Proceedings, CLEF and CEUR-WS.org, September 2016. http://ceur-ws.org/Vol-1609/Potthast, M., Hagen, M., Völske, M., Stein, B.: Crowdsourcing interaction logs to understand text reuse from the web. In: Fung, P., Poesio, M. (eds.) Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (ACL 2013), pp. 1212–1221. Association for Computational Linguistics, August 2013. http://www.aclweb.org/anthology/P13-1119Rangel, F., Celli, F., Rosso, P., Potthast, M., Stein, B., Daelemans, W.: Overview of the 3rd author profiling task at PAN 2015. In: Cappellato, L., Ferro, N., Jones, G., San Juan, E. (eds.) CLEF 2015 Evaluation Labs and Workshop - Working Notes Papers, Toulouse, France, pp. 8–11. CEUR-WS.org, September 2015Rangel, F., et al.: Overview of the 2nd author profiling task at PAN 2014. In: Cappellato, L., Ferro, N., Halvey, M., Kraaij, W. (eds.) CLEF 2014 Evaluation Labs and Workshop - Working Notes Papers, Sheffield, UK, pp. 15–18. CEUR-WS.org, September 2014Rangel, F., Rosso, P., G’omez, M.M., Potthast, M., Stein, B.: Overview of the 6th author profiling task at pan 2018: multimodal gender identification in Twitter. In: CLEF 2018 Labs and Workshops, Notebook Papers. CEUR Workshop Proceedings. CEUR-WS.org (2017)Rangel, F., Rosso, P., Koppel, M., Stamatatos, E., Inches, G.: Overview of the author profiling task at PAN 2013. In: Forner, P., Navigli, R., Tufis, D. (eds.) CLEF 2013 Evaluation Labs and Workshop - Working Notes Papers, 23–26 September 2013, Valencia, Spain, September 2013Rangel, F., Rosso, P., Potthast, M., Stein, B.: Overview of the 5th author profiling task at PAN 2017: gender and language variety identification in Twitter. In: Cappellato, L., Ferro, N., Goeuriot, L., Mandl, T. (eds.) Working Notes Papers of the CLEF 2017 Evaluation Labs. CEUR Workshop Proceedings, CLEF and CEUR-WS.org, September 2017Rangel, F., Rosso, P., Verhoeven, B., Daelemans, W., Potthast, M., Stein, B.: Overview of the 4th author profiling task at PAN 2016: cross-genre evaluations. In: Balog, K., Cappellato, L., Ferro, N., Macdonald, C. (eds.) CLEF 2016 Labs and Workshops, Notebook Papers. CEUR Workshop Proceedings. CEUR-WS.org, September 2016Safin, K., Kuznetsova, R.: Style breach detection with neural sentence embeddings. In: Working Notes Papers of the CLEF 2017 Evaluation Labs. CEUR Workshop Proceedings, CLEF and CEUR-WS.org, September 2017Sapkota, U., Bethard, S., Montes, M., Solorio, T.: Not all character n-grams are created equal: a study in authorship attribution. In: Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 93–102 (2015)Sapkota, U., Solorio, T., Montes, M., Bethard, S., Rosso, P.: Cross-topic authorship attribution: will out-of-topic data help? In: Proceedings of the 25th International Conference on Computational Linguistics. Technical Papers, pp. 1228–1237 (2014)Stamatatos, E.: Intrinsic plagiarism detection using character nnn-gram Profiles. In: Stein, B., Rosso, P., Stamatatos, E., Koppel, M., Agirre, E. (eds.) SEPLN 2009 Workshop on Uncovering Plagiarism, Authorship, and Social Software Misuse (PAN 2009), pp. 38–46. Universidad Politécnica de Valencia and CEUR-WS.org, September 2009. http://ceur-ws.org/Vol-502Stamatatos, E.: On the robustness of authorship attribution based on character n-gram features. J. Law Policy 21, 421–439 (2013)Stamatatos, E.: Authorship attribution using text distortion. In: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, Long Papers, vol. 1, pp. 1138–1149. Association for Computational Linguistics (2017)Stamatatos, E., et al.: Overview of the author identification task at PAN 2015. In: Cappellato, L., Ferro, N., Jones, G., San Juan, E. (eds.) CLEF 2015 Evaluation Labs and Workshop - Working Notes Papers, 8–11 September 2015, Toulouse, France. CEUR-WS.org, September 2015Stamatatos, E., et al.: Clustering by authorship within and across documents. In: Working Notes Papers of the CLEF 2016 Evaluation Labs. CEUR Workshop Proceedings, CLEF and CEUR-WS.org, September 2016. http://ceur-ws.org/Vol-1609/Takahashi, T., Tahara, T., Nagatani, K., Miura, Y., Taniguchi, T., Ohkuma, T.: Text and image synergy with feature cross technique for gender identification. In: Working Notes Papers of the CLEF 2018 Evaluation Labs, September 2018, to be announcedTellez, E.S., Miranda-Jiménez, S., Moctezuma, D., Graff, M., Salgado, V., Ortiz-Bejar, J.: Gender identification through multi-modal tweet analysis using microtc and bag of visual words. In: Working Notes Papers of the CLEF 2018 Evaluation Labs, September 2018, to be announcedTschuggnall, M., Specht, G.: Automatic decomposition of multi-author documents using grammar analysis. In: Proceedings of the 26th GI-Workshop on Grundlagen von Datenbanken. CEUR-WS, Bozen, October 2014Tschuggnall, M., et al.: Overview of the author identification task at PAN-2017: style breach detection and author clustering. In: Cappellato, L., Ferro, N., Goeuriot, L., Mandl, T. (eds.) Working Notes Papers of the CLEF 2017 Evaluation Labs. CEUR Workshop Proceedings, vol. 1866. CLEF and CEUR-WS.org, September 2017. http://ceur-ws.org/Vol-1866
    • …
    corecore