51 research outputs found

    Recent trends in digital text forensics and its evaluation

The final publication is available at Springer via http://dx.doi.org/10.1007/978-3-642-40802-1_28
This paper outlines the concepts and achievements of our evaluation lab on digital text forensics, PAN 13, which called for original research and development on plagiarism detection, author identification, and author profiling. We present a standardized evaluation framework for each of the three tasks and discuss the evaluation results of the 58 contributions submitted in total. For the first time, instead of accepting the output of software runs, we collected the software itself and ran it on a computer cluster at our site. As the evaluation and experimentation platform we use TIRA, which is being developed at the Webis Group in Weimar. TIRA can handle large-scale software submissions by means of virtualization, sandboxed execution, tailored unit testing, and staged submission. Beyond the evaluation results themselves, a major achievement of our lab is that we now have at our disposal, for further analysis, the largest collection of state-of-the-art approaches to these tasks.
This work was partially supported by the WIQ-EI IRSES project (Grant No. 269180) within the FP7 Marie Curie action.
Gollub, T.; Potthast, M.; Beyer, A.; Busse, M.; Rangel Pardo, FM.; Rosso, P.; Stamatatos, E.... (2013). Recent trends in digital text forensics and its evaluation. In Information Access Evaluation. Multilinguality, Multimodality, and Visualization. Springer Verlag (Germany). 282-302. https://doi.org/10.1007/978-3-642-40802-1_28

    Overview of the PAN'2016 - New Challenges for Authorship Analysis: Cross-genre Profiling, Clustering, Diarization, and Obfuscation

The final publication is available at Springer via http://dx.doi.org/10.1007/978-3-319-44564-9_28
This paper presents an overview of the PAN/CLEF evaluation lab. During the last decade, PAN has been established as the main forum of digital text forensic research. PAN 2016 comprises three shared tasks: (i) author identification, addressing author clustering and diarization (or intrinsic plagiarism detection); (ii) author profiling, addressing age and gender prediction from a cross-genre perspective; and (iii) author obfuscation, addressing author masking and obfuscation evaluation. In total, 35 teams participated across the three shared tasks of PAN 2016 and, following the practice of previous editions, software submissions were required and evaluated within the TIRA experimentation framework.
The work of the first author was partially supported by the Som EMBED TIN2015-71147-C2-1-P MINECO research project and by the Generalitat Valenciana under the grant ALMA MATER (Prometeo II/2014/030). The work of the second author was partially supported by Autoritas Consulting and by Ministerio de Economía y Competitividad de España under grant ECOPORTUNITY IPT-2012-1220-430000.
Rosso, P.; Rangel-Pardo, FM.; Potthast, M.; Stamatatos, E.; Tschuggnall, M.; Stein, B. (2016). Overview of the PAN'2016 - New Challenges for Authorship Analysis: Cross-genre Profiling, Clustering, Diarization, and Obfuscation. In Experimental IR Meets Multilinguality, Multimodality, and Interaction. Springer Verlag (Germany). 332-350. https://doi.org/10.1007/978-3-319-44564-9_28

    Overview of the PAN/CLEF 2015 Evaluation Lab

The final publication is available at Springer via http://dx.doi.org/10.1007/978-3-319-24027-5_49
This paper presents an overview of the PAN/CLEF evaluation lab. During the last decade, PAN has been established as the main forum of text mining research focusing on the identification of personal traits of authors left behind in texts unintentionally. PAN 2015 comprises three tasks, plagiarism detection, author identification, and author profiling, each studying important variations of these problems. In plagiarism detection, community-driven corpus construction is introduced as a new way of developing evaluation resources with diversity. In author identification, cross-topic and cross-genre author verification (where the texts of known and unknown authorship do not match in topic and/or genre) is introduced. A new corpus covering four languages was built for this challenging, yet realistic, task. In author profiling, in addition to the usual author demographics, such as gender and age, five personality traits are introduced (openness, conscientiousness, extraversion, agreeableness, and neuroticism), and a new corpus of Twitter messages covering four languages was developed. In total, 53 teams participated across the three tasks of PAN 2015 and, following the practice of previous editions, software submissions were required and evaluated within the TIRA experimentation framework.
Stamatatos, E.; Potthast, M.; Rangel, F.; Rosso, P.; Stein, B. (2015). Overview of the PAN/CLEF 2015 Evaluation Lab. In Experimental IR Meets Multilinguality, Multimodality, and Interaction: 6th International Conference of the CLEF Association, CLEF'15, Toulouse, France, September 8-11, 2015, Proceedings. Springer International Publishing. 518-538. doi:10.1007/978-3-319-24027-5_49
CEUR (2015)Potthast, M., Hagen, M., Stein, B., Graßegger, J., Michel, M., Tippmann, M., Welsch, C.: ChatNoir: a search engine for the clueweb09 corpus. In: Proceedings of SIGIR 2012. ACM (2012)Potthast, M., Hagen, M., Völske, M., Stein, B.: Crowdsourcing interaction logs to understand text reuse from the web. In: Proceedings of ACL 2013. ACL (2013)Potthast, M., Stein, B., Barrón-Cedeño, A., Rosso, P.: An evaluation framework for plagiarism detection. In: Proceedings of COLING 2010. ACL (2010)Potthast, M., Stein, B., Eiselt, A., Barrón-Cedeño, A., Rosso, P.: Overview of the 1st international competition on plagiarism detection. In: Proceedings of PAN at SEPLN 2009. CEUR (2009)Quercia, D., Lambiotte, R., Stillwell, D., Kosinski, M., Crowcroft, J.: The personality of popular facebook users. In: Proceedings of CSCW 2012. ACM (2012)Rammstedt, B., John, O.: Measuring Personality in One Minute or Less: A 10 Item Short Version of the Big Five Inventory in English and German. Journal of Research in Personality (2007)Rangel, F., Rosso, P.: On the impact of emotions on author profiling. In: Information Processing & Management, Special Issue on Emotion and Sentiment in Social and Expressive Media (2014) (in press)Rangel, F., Rosso, P., Celli, F., Potthast, M., Stein, B., Daelemans, W.: Overview of the 3rd author profiling task at PAN 2015. In: CLEF 2015 Working Notes. CEUR (2015)Rangel, F., Rosso, P., Chugur, I., Potthast, M., Trenkmann, M., Stein, B., Verhoeven, B., Daelemans, W.: Overview of the 2nd author profiling task at PAN 2014. In: CLEF 2014 Working Notes. CEUR (2014)Rangel, F., Rosso, P., Koppel, M., Stamatatos, E., Inches, G.: Overview of the author profiling task at PAN 2013–notebook for PAN at CLEF 2013. In: CLEF 2013 Working Notes. CEUR (2013)Sapkota, U., Bethard, S., Montes-y-Gómez, M., Solorio, T.: Not all character N-grams are created equal: a study in authorship attribution. In: Proceedings of NAACL 2015. 
ACL (2015)Sapkota, U., Solorio, T., Montes-y-Gómez, M., Bethard, S., Rosso, P.: Cross-topic authorship attribution: will out-of-topic data help? In: Proceedings of COLING 2014 (2014)Schler, J., Koppel, M., Argamon, S., Pennebaker, J.W.: Effects of age and gender on blogging. In: AAAI Spring Symposium: Computational Approaches to Analyzing Weblogs. AAAI (2006)Schwartz, H.A., Eichstaedt, J.C., Kern, M.L., Dziurzynski, L., Ramones, S.M., Agrawal, M., Shah, A., Kosinski, M., Stillwell, D., Seligman, M.E., et al.: Personality, Gender, and Age in the Language of Social Media: The Open-Vocabulary Approach. PloS one 8(9), 773–791 (2013)Stamatatos, E.: A Survey of Modern Authorship Attribution Methods. Journal of the American Society for Information Science and Technology 60, 538–556 (2009)Stamatatos, E.: On the Robustness of Authorship Attribution Based on Character N-gram Features. Journal of Law and Policy 21, 421–439 (2013)Stamatatos, E., Daelemans, W., Verhoeven, B., Juola, P., López-López, A., Potthast, M., Stein, B.: Overview of the author identification task at PAN 2015. In: Working Notes Papers of the CLEF 2015 Evaluation Labs. CEUR (2015)Stamatatos, E., Daelemans, W., Verhoeven, B., Stein, B., Potthast, M., Juola, P., Sánchez-Pérez, M.A., Barrón-Cedeño, A.: Overview of the author identification task at PAN 2014. In: CLEF 2014 Working Notes. CEUR (2014)Stamatatos, E., Fakotakis, N., Kokkinakis, G.: Automatic Text Categorization in Terms of Genre and Author. Comput. Linguist. 26(4), 471–495 (2000)Stein, B., Lipka, N., Prettenhofer, P.: Intrinsic Plagiarism Analysis. Language Resources and Evaluation (LRE) 45, 63–82 (2011)Stein, B., Meyer zu Eißen, S.: Near similarity search and plagiarism analysis. In: Proceedings of GFKL 2005. Springer (2006)Sushant, S.A., Argamon, S., Dhawle, S., Pennebaker, J.W.: Lexical predictors of personality type. 
In: Proceedings of Joint Interface/CSNA 2005Verhoeven, B., Daelemans, W.: Clips stylometry investigation (CSI) corpus: a dutch corpus for the detection of age, gender, personality, sentiment and deception in text. In: Proceedings of LREC 2014. ACL (2014)Weren, E., Kauer, A., Mizusaki, L., Moreira, V., de Oliveira, P., Wives, L.: Examining Multiple Features for Author Profiling. Journal of Information and Data Management (2014)Zhang, C., Zhang, P.: Predicting gender from blog posts. Tech. rep., Technical Report. University of Massachusetts Amherst, USA (2010

    A Decade of Shared Tasks in Digital Text Forensics at PAN

    Digital text forensics aims at examining the originality and credibility of information in electronic documents and, in this regard, at extracting and analyzing information about the authors of these documents. The research field has developed substantially during the last decade. PAN is a series of shared tasks that started in 2009 and has contributed significantly to drawing the research community's attention to well-defined digital text forensics tasks. Several benchmark datasets have been developed to assess the state-of-the-art performance in a wide range of tasks. In this paper, we present the evolution of both the examined tasks and the developed datasets during the last decade. We also briefly introduce the upcoming PAN 2019 shared tasks.
    We are indebted to many colleagues and friends who contributed greatly to PAN's tasks: Maik Anderka, Shlomo Argamon, Alberto Barrón-Cedeño, Fabio Celli, Fabio Crestani, Walter Daelemans, Andreas Eiselt, Tim Gollub, Parth Gupta, Matthias Hagen, Teresa Holfeld, Patrick Juola, Giacomo Inches, Mike Kestemont, Moshe Koppel, Manuel Montes-y-Gómez, Aurelio Lopez-Lopez, Francisco Rangel, Miguel Angel Sánchez-Pérez, Günther Specht, Michael Tschuggnall, and Ben Verhoeven. Our special thanks go to PAN's sponsors throughout the years and not least to the hundreds of participants.
    Potthast, M.; Rosso, P.; Stamatatos, E.; Stein, B. (2019). A Decade of Shared Tasks in Digital Text Forensics at PAN. Lecture Notes in Computer Science. 11438:291-300. https://doi.org/10.1007/978-3-030-15719-7_39

    Overview of the 6th International Competition on Plagiarism Detection

    This paper overviews 17 plagiarism detectors that have been evaluated within the sixth international competition on plagiarism detection at PAN 2014. We report on their performance in the two tasks of external plagiarism detection: source retrieval and text alignment. For the third year in a row, we invite software submissions instead of run submissions for this task, which allows for cross-year evaluations. Moreover, we introduce new performance measures for text alignment to shed light on new aspects of detection performance.
    We thank the participating teams of this task for their devoted work. This paper was partially supported by the WIQ-EI IRSES project (Grant No. 269180) within the FP7 Marie Curie action.
    Potthast, M.; Hagen, M.; Beyer, A.; Busse, M.; Tippmann, M.; Rosso, P.; Stein, B. (2014). Overview of the 6th International Competition on Plagiarism Detection. CEUR Workshop Proceedings. 1180:845-876. http://hdl.handle.net/10251/61151

    Overview of the 5th International Competition on Plagiarism Detection

    This paper overviews 18 plagiarism detectors that have been evaluated within the fifth international competition on plagiarism detection at PAN 2013. We report on their performance in the two tasks of external plagiarism detection: source retrieval and text alignment. Furthermore, we continue last year's initiative to invite software submissions instead of run submissions, and we re-evaluate this year's submissions on last year's evaluation corpora and vice versa, thus demonstrating the benefits of software submissions in terms of reproducibility.
    Potthast, M.; Hagen, M.; Gollub, T.; Tippmann, M.; Kiesel, J.; Rosso, P.; Stamatatos, E.... (2013). Overview of the 5th International Competition on Plagiarism Detection. CLEF Conference on Multilingual and Multimodal Information Access Evaluation. 301-331. http://hdl.handle.net/10251/46635

    A Systematic Study of Knowledge Graph Analysis for Cross-language Plagiarism Detection

    This is the author's version of a work that was accepted for publication in Information Processing and Management. Changes resulting from the publishing process, such as peer review, editing, corrections, structural formatting, and other quality control mechanisms may not be reflected in this document. Changes may have been made to this work since it was submitted for publication. A definitive version was subsequently published in Information Processing and Management 52 (2016) 550–570. DOI 10.1016/j.ipm.2015.12.004
    Cross-language plagiarism detection aims to detect plagiarised fragments of text among documents in different languages. In this paper, we perform a systematic examination of Cross-language Knowledge Graph Analysis, an approach that represents text fragments using knowledge graphs as a language-independent content model. We analyse the contributions to cross-language plagiarism detection of the different aspects covered by knowledge graphs: word sense disambiguation, vocabulary expansion, and representation by similarities with a collection of concepts. In addition, we study both the relevance of concepts and their relations when detecting plagiarism. Finally, as a key component of the knowledge graph construction, we present a new weighting scheme of relations between concepts based on distributed representations of concepts. Experimental results in Spanish–English and German–English plagiarism detection show state-of-the-art performance and provide interesting insights on the use of knowledge graphs. © 2015 Elsevier Ltd. All rights reserved.
    This research has been carried out in the framework of the European Commission WIQ-EI IRSES (No. 269180) and DIANA APPLICATIONS - Finding Hidden Knowledge in Texts: Applications (TIN2012-38603-C02-01) projects. We would like to thank Tomas Mikolov, Martin Potthast, and Luis A. Leiva for their support and comments during this research.
    Franco-Salvador, M.; Rosso, P.; Montes Gomez, M. (2016). A Systematic Study of Knowledge Graph Analysis for Cross-language Plagiarism Detection. Information Processing and Management. 52(4):550-570. https://doi.org/10.1016/j.ipm.2015.12.004

    A Cross-domain and Cross-language Knowledge-based Representation of Text and its Meaning

    Thesis by compendium of publications.
    Natural Language Processing (NLP) is a field of computer science, artificial intelligence, and computational linguistics concerned with the interactions between computers and human languages. One of its most challenging aspects involves enabling computers to derive meaning from human natural language. To do so, several meaning or context representations have been proposed with competitive performance. However, these representations still have room for improvement when working in a cross-domain or cross-language scenario. In this thesis we study the use of knowledge graphs as a cross-domain and cross-language representation of text and its meaning. A knowledge graph is a graph that expands and relates the original concepts belonging to a set of words. We obtain its characteristics by using a wide-coverage multilingual semantic network as the knowledge base. This provides coverage of hundreds of languages and millions of general and specific concepts. As the starting point of our research we employ knowledge graph-based features, along with other traditional ones and meta-learning, for the NLP task of single- and cross-domain polarity classification. The analysis and conclusions of that work provide evidence that knowledge graphs capture meaning in a domain-independent way. The next part of our research takes advantage of the multilingual semantic network and focuses on cross-language Information Retrieval (IR) tasks. First, we propose a fully knowledge graph-based model of similarity analysis for cross-language plagiarism detection. Next, we improve that model to cover out-of-vocabulary words and verbal tenses and apply it to cross-language document retrieval, categorisation, and plagiarism detection. Finally, we study the use of knowledge graphs for the NLP tasks of community question answering, native language identification, and language variety identification. The contributions of this thesis demonstrate the potential of knowledge graphs as a cross-domain and cross-language representation of text and its meaning for NLP and IR tasks. These contributions have been published in several international conferences and journals.
    Franco Salvador, M. (2017). A Cross-domain and Cross-language Knowledge-based Representation of Text and its Meaning [Unpublished doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/84285