16,344 research outputs found

    Overview of the 2nd Author Profiling Task at PAN 2014

    Full text link
    [EN] This overview presents the framework and the results for the Author Profiling task at PAN 2014. Objective of this year is the analysis of the adaptability of the detection approaches when given different genres. For this purpose a corpus with four different parts (subcorpora) has been compiled: social media, Twitter, blogs, and hotel reviews. The construction of the Twitter subcorpus happened in cooperation with RepLab in order to investigate also a reputational perspective. Altogether, the approaches of 10 participants are evaluated.The PAN task on author profiling has been organised in the framework of the WIQ-EI IRSES project (Grant No. 269180) within the FP 7 Marie Curie People Framework of the European Commission. We would like to thank Atribus by Corex for sponsoring the award for the winner team. We thank Julio Gonzalo, Jorge Carrillo and Damiano Spina from UNED for helping with the Twitter subcorpus. The work of the first author was partially funded by Autoritas Consulting SA and by Ministerio de Economía y Competitividad de España under grant ECOPORTUNITY IPT-2012-1220-430000 and CSO2013-43054-R. The work of the second author was in the framework the DIANA-APPLICATIONS-Finding Hidden Knowledge in Texts: Applications (TIN2012-38603-C02-01) project, and the VLC/CAMPUS Microcluster on Multimodal Interaction in Intelligent Systems.Rangel, F.; Rosso, P.; Chrugur, I.; Potthast, M.; Trenkmann, M.; Stein, B.; Verhoeven, B.... (2014). Overview of the 2nd Author Profiling Task at PAN 2014. CEUR Workshop Proceedings. 1180:898-927. http://hdl.handle.net/10251/61150S898927118

    Overview of the PAN'2016 - New Challenges for Authorship Analysis: Cross-genre Profiling, Clustering, Diarization, and Obfuscation

    Full text link
    The final publication is available at Springer via http://dx.doi.org/10.1007/978-3-319-44564-9_28This paper presents an overview of the PAN/CLEF evaluation lab. During the last decade, PAN has been established as the main forum of digital text forensic research. PAN 2016 comprises three shared tasks: (i) author identification, addressing author clustering and diarization (or intrinsic plagiarism detection); (ii) author profiling, addressing age and gender prediction from a cross-genre perspective; and (iii) author obfuscation, addressing author masking and obfuscation evaluation. In total, 35 teams participated in all three shared tasks of PAN 2016 and, following the practice of previous editions, software submissions were required and evaluated within the TIRA experimentation framework.The work of the first author was partially supported by the Som EMBED TIN2015-71147-C2-1-P MINECO research project and by the Generalitat Valenciana under the grant ALMA MATER (Prometeo II/2014/030). The work of the second author was partially supported by Autoritas Consulting and by Ministerio de Economía y Competitividad de España under grant ECOPORTUNITY IPT-2012-1220-430000.Rosso, P.; Rangel-Pardo, FM.; Potthast, M.; Stamatatos, E.; Tschuggnall, M.; Stein, B. (2016). Overview of the PAN'2016 - New Challenges for Authorship Analysis: Cross-genre Profiling, Clustering, Diarization, and Obfuscation. En Experimental IR Meets Multilinguality, Multimodality, and Interaction. Springer Verlag (Germany). 332-350. https://doi.org/10.1007/978-3-319-44564-9_28S332350Almishari, M., Tsudik, G.: Exploring linkability of user reviews. In: Foresti, S., Yung, M., Martinelli, F. (eds.) ESORICS 2012. LNCS, vol. 7459, pp. 307–324. Springer, Heidelberg (2012)Álvarez-Carmona, M.A., López-Monroy, A.P., Montes-Y-Gómez, M., Villaseñor-Pineda, L., Jair-Escalante, H.: INAOE’s Participation at PAN’15: author profiling task–notebook for PAN at CLEF 2015. In: Working Notes Papers of the CLEF 2015 Evaluation Labs. CEUR-WS.org, vol. 1391 (2015)Amigó, E., Gonzalo, J., Artiles, J., Verdejo, F.: A comparison of extrinsic clustering evaluation metrics based on formal constraints. Inf. Retrieval 12(4), 461–486 (2009)Argamon, S., Juola, P.: Overview of the international authorship identification competition at PAN-2011. In: Working Notes Papers of the CLEF 2011 Evaluation Labs (2011)Argamon, S., Koppel, M., Fine, J., Shimoni, A.R.: Gender, genre, and writing style in formal written texts. TEXT 23, 321–346 (2003)Bagnall, D.: Author identification using multi-headed recurrent neural networks. In: Working Notes Papers of the CLEF 2015 Evaluation Labs. CEUR-WS.org, vol. 1391 (2015)Bensalem, I., Boukhalfa, I., Rosso, P., Abouenour, L., Darwish, K., Chikhi, S.: Overview of the AraPlagDet PAN@ FIRE2015 shared task on arabic plagiarism detection. In: Notebook Papers of FIRE 2015. CEUR-WS.org, vol. 1587 (2015)Burger, J.D., Henderson, J., Kim, G., Zarrella, G.: Discriminating gender on twitter. In: Proceedings of EMNLP 2011 (2011)Burrows, S., Potthast, M., Stein, B.: Paraphrase acquisition via crowdsourcing and machine learning. ACM TIST 4(3), 43:1–43:21 (2013)Castillo, E., Cervantes, O., Vilariño, D., Pinto, D., León, S.: Unsupervised method for the authorship identification task. In: CLEF 2014 Labs and Workshops, Notebook Papers. CEUR-WS.org, vol. 1180 (2014)Chaski, C.E.: Who’s at the keyboard: authorship attribution in digital evidence invesigations. Int. J. Digit. Evid. 4, 1–13 (2005)Clarke, C.L., Craswell, N., Soboroff, I., Voorhees, E.M.: Overview of the TREC 2009 web track. In: DTIC Document (2009)Flores, E., Rosso, P., Moreno, L., Villatoro, E.: On the detection of source code re-use. In: ACM FIRE 2014 Post Proceedings of the Forum for Information Retrieval Evaluation, pp. 21–30 (2015)Flores, E., Rosso, P., Villatoro, E., Moreno, L., Alcover, R., Chirivella, V.: PAN@FIRE: overview of CL-SOCO track on the detection of cross-language source code re-use. In: Notebook Papers of FIRE 2015. CEUR-WS.org, vol. 1587 (2015)Fréry, J., Largeron, C., Juganaru-Mathieu, M.: UJM at clef in author identification. In: CLEF 2014 Labs and Workshops, Notebook Papers. CEUR-WS.org, vol. 1180 (2014)Gollub, T., Potthast, M., Beyer, A., Busse, M., Rangel, F., Rosso, P., Stamatatos, E., Stein, B.: Recent trends in digital text forensics and its evaluation. In: Forner, P., Müller, H., Paredes, R., Rosso, P., Stein, B. (eds.) CLEF 2013. LNCS, vol. 8138, pp. 282–302. Springer, Heidelberg (2013)Gollub, T., Stein, B., Burrows, S.: Ousting Ivory tower research: towards a web framework for providing experiments as a service. In: Proceedings of SIGIR 12. ACM (2012)Hagen, M., Potthast, M., Stein, B.: Source retrieval for plagiarism detection from large web corpora: recent approaches. In: Working Notes Papers of the CLEF 2015 Evaluation Labs. CEUR-WS.org, vol. 1391 (2015)van Halteren, H.: Linguistic profiling for author recognition and verification. In: Proceedings of ACL 2004 (2004)Holmes, J., Meyerhoff, M.: The Handbook of Language and Gender. Blackwell Handbooks in Linguistics, Wiley (2003)Iqbal, F., Binsalleeh, H., Fung, B.C.M., Debbabi, M.: Mining writeprints from anonymous e-mails for forensic investigation. Digit. Investig. 7(1–2), 56–64 (2010)Jankowska, M., Keselj, V., Milios, E.: CNG text classification for authorship profiling task-notebook for PAN at CLEF 2013. In: Working Notes Papers of the CLEF 2013 Evaluation Labs. CEUR-WS.org, vol. 1179 (2013)Juola, P.: An overview of the traditional authorship attribution subtask. In: Working Notes Papers of the CLEF 2012 Evaluation Labs (2012)Juola, P.: Authorship attribution. Found. Trends Inf. Retrieval 1, 234–334 (2008)Juola, P.: How a computer program helped reveal J.K. rowling as author of a Cuckoo’s calling. In: Scientific American (2013)Juola, P., Stamatatos, E.: Overview of the author identification task at PAN-2013. In:Working Notes Papers of the CLEF 2013 Evaluation Labs. CEUR-WS.org vol. 1179 (2013)Keswani, Y., Trivedi, H., Mehta, P., Majumder, P.: Author masking through translation-notebook for PAN at CLEF 2016. In: Conference and Labs of the Evaluation Forum, CLEF (2016)Koppel, M., Argamon, S., Shimoni, A.R.: Automatically categorizing written texts by author gender. Literary Linguist. Comput. 17(4), 401–412 (2002)Koppel, M., Schler, J., Bonchek-Dokow, E.: Measuring differentiability: unmasking pseudonymous authors. J. Mach. Learn. Res. 8, 1261–1276 (2007)Koppel, M., Winter, Y.: Determining if two documents are written by the same author. J. Am. Soc. Inf. Sci. Technol. 65(1), 178–187 (2014)Layton, R., Watters, P., Dazeley, R.: Automated unsupervised authorship analysis using evidence accumulation clustering. Nat. Lang. Eng. 19(1), 95–120 (2013)López-Monroy, A.P., Montes-y Gómez, M., Jair-Escalante, H., Villasenor-Pineda, L.V.: Using intra-profile information for author profiling-notebook for PAN at CLEF 2014. In: Working Notes Papers of the CLEF 2014 Evaluation Labs. CEUR-WS.org, vol. 1180 (2014)López-Monroy, A.P., Montes-y Gómez, M., Jair-Escalante, H., Villasenor-Pineda, L., Villatoro-Tello, E.: INAOE’s participation at PAN’13: author profiling task-notebook for PAN at CLEF 2013. In: Working Notes Papers of the CLEF 2013 Evaluation Labs. CEUR-WS.org, vol. 1179 (2013)Luyckx, K., Daelemans, W.: Authorship attribution and verification with many authors and limited data. In: Proceedings of COLING (2008)Maharjan, S., Shrestha, P., Solorio, T., Hasan, R.: A straightforward author profiling approach in MapReduce. In: Bazzan, A.L.C., Pichara, K. (eds.) IBERAMIA 2014. LNCS, vol. 8864, pp. 95–107. Springer, Heidelberg (2014)Mansoorizadeh, M.: Submission to the author obfuscation task at PAN 2016. In: Conference and Labs of the Evaluation Forum, CLEF (2016)Eissen, S.M., Stein, B.: Intrinsic plagiarism detection. In: Lalmas, M., MacFarlane, A., Rüger, S.M., Tombros, A., Tsikrika, T., Yavlinsky, A. (eds.) ECIR 2006. LNCS, vol. 3936, pp. 565–569. Springer, Heidelberg (2006)Mihaylova, T., Karadjov, G., Nakov, P., Kiprov, Y., Georgiev, G., Koychev, I.: SU@PAN’2016: author obfuscation-notebook for PAN at CLEF 2016. In: Conference and Labs of the Evaluation Forum, CLEF (2016)Miro, X.A., Bozonnet, S., Evans, N., Fredouille, C., Friedland, G., Vinyals, O.: Speaker diarization: a review of recent research. Audio Speech Language Process. IEEE Trans. 20(2), 356–370 (2012)Moreau, E., Jayapal, A., Lynch, G., Vogel, C.: Author verification: basic stacked generalization applied to predictions from a set of heterogeneous learners. In: Working Notes Papers of the CLEF 2015 Evaluation Labs. CEUR-WS.org, vol. 1391 (2015)Nguyen, D., Gravel, R., Trieschnigg, D., Meder, T.: How old do you think I am? a study of language and age in twitter. In: Proceedings of ICWSM 13. AAAI (2013)Peñas, A., Rodrigo, A.: A Simple measure to assess non-response. In: Proceedings of HLT 2011 (2011)Pennebaker, J.W., Mehl, M.R., Niederhoffer, K.G.: Psychological aspects of natural language use: our words, our selves. Ann. Rev. Psychol. 54(1), 547–577 (2003)Potthast, M., Barrón-Cedeño, A., Eiselt, A., Stein, B., Rosso, P.: Overview of the 2nd international competition on plagiarism detection. In: Working Notes Papers of the CLEF 2010 Evaluation Labs (2010)Potthast, M., Barrón-Cedeño, A., Stein, B., Rosso, P.: Cross-language plagiarism detection. Lang. Resour. Eval. (LREC) 45, 45–62 (2011)Potthast, M., Eiselt, A., Barrón-Cedeño, A., Stein, B., Rosso, P.: Overview of the 3rd international competition on plagiarism detection. In: Working Notes Papers of the CLEF 2011 Evaluation Labs (2011)Potthast, M., Gollub, T., Hagen, M., Graßegger, J., Kiesel, J., Michel, M., Oberländer, A., Tippmann, M., Barrón-Cedeño, A., Gupta, P., Rosso, P., Stein, B.: Overview of the 4th international competition on plagiarism detection. In: Working Notes Papers of the CLEF 2012 Evaluation Labs (2012)Potthast, M., Gollub, T., Hagen, M., Tippmann, M., Kiesel, J., Rosso, P., Stamatatos, E., Stein, B.: Overview of the 5th international competition on plagiarism detection. In: Working Notes Papers of the CLEF 2013 Evaluation Labs. CEUR-WS.org, vol. 1179 (2013)Potthast, M., Gollub, T., Rangel, F., Rosso, P., Stamatatos, E., Stein, B.: Improving the reproducibility of PAN’s shared tasks: plagiarism detection, author identification, and author profiling. In: Kanoulas, E., Lupu, M., Clough, P., Sanderson, M., Hall, M., Hanbury, A., Toms, E. (eds.) CLEF 2014. LNCS, vol. 8685, pp. 268–299. Springer, Heidelberg (2014)Potthast, M., Hagen, M., Beyer, A., Busse, M., Tippmann, M., Rosso, P., Stein, B.: Overview of the 6th international competition on plagiarism detection. In: Working Notes Papers of the CLEF 2014 Evaluation Labs. CEUR-WS.org, vol. 1180 (2014)Potthast, M., Hagen, M., Stein, B.: Author obfuscation: attacking the state of the art in authorship verification. In: CLEF 2016 Working Notes. CEUR-WS.org (2016)Potthast, M., Göring, S., Rosso, P., Stein, B.: Towards data submissions for shared tasks: first experiences for the task of text alignment. In: Working Notes Papers of the CLEF 2015 Evaluation Labs. CEUR-WS.org, vol. 1391 (2015)Potthast, M., Hagen, M., Stein, B., Graßegger, J., Michel, M., Tippmann, M., Welsch, C.: ChatNoir: a search engine for the ClueWeb09 corpus. In: Proceedings of SIGIR 12. ACM (2012)Potthast, M., Hagen, M., Völske, M., Stein, B.: Crowdsourcing interaction logs to understand text reuse from the web. In: Proceedings of ACL 13. ACL (2013)Potthast, M., Stein, B., Barrón-Cedeño, A., Rosso, P.: An evaluation framework for plagiarism detection. In: Proceedings of COLING 10. ACL (2010)Potthast, M., Stein, B., Eiselt, A., Barrón-Cedeño, A., Rosso, P.: Overview of the 1st international competition on plagiarism detection. In: Proceedings of PAN at SEPLN 09. CEUR-WS.org 502 (2009)Rangel, F., Rosso, P.: On the impact of emotions on author profiling. Inf. Process. Manage. Spec. Issue Emot. Sentiment Soc. Expressive Media 52(1), 73–92 (2016)Rangel, F., Rosso, P.: On the multilingual and genre robustness of emographs for author profiling in social media. In: Mothe, J., et al. (eds.) CLEF 2015. LNCS, vol. 9283, pp. 274–280. Springer, Heidelberg (2015). doi: 10.1007/978-3-319-24027-5_28Rangel, F., Rosso, P., Celli, F., Potthast, M., Stein, B., Daelemans, W.: Overview of the 3rd author profiling task at PAN 2015. In: Working Notes Papers of the CLEF 2015 Evaluation Labs. CEUR-WS.org, vol. 1391 (2015)Rangel, F., Rosso, P., Chugur, I., Potthast, M., Trenkmann, M., Stein, B., Verhoeven, B., Daelemans, W.: Overview of the 2nd author profiling task at PAN 2014. In: Working Notes Papers of the CLEF 2014 Evaluation Labs. CEUR-WS.org, vol. 1180 (2014)Rangel, F., Rosso, P., Koppel, M., Stamatatos, E., Inches, G.: Overview of the author profiling task at PAN 2013–notebook for PAN at CLEF 2013. In: Working Notes Papers of the CLEF 2013 Evaluation Labs. CEUR-WS.org, vol. 1179 (2013)Rangel, F., Rosso, P., Verhoeven, B., Daelemans, W., Potthast, M., Stein, B.: Overview of the 4th author profiling task at PAN 2016: cross-genre evaluations. In: CLEF 2016 Working Notes. CEUR-WS.org (2016)Samdani, R., Chang, K., Roth, D.: A discriminative latent variable model for online clustering. In: Proceedings of The 31st International Conference on Machine Learning, pp. 1–9 (2014)Sapkota, U., Bethard, S., Montes-y-Gómez, M., Solorio, T.: Not all character N-grams are created equal: a study in authorship attribution. In: Proceedings of NAACL 15. ACL (2015)Sapkota, U., Solorio, T., Montes-y-Gómez, M., Bethard, S., Rosso, P.: Cross-topic authorship attribution: will out-of-topic data help? In: Proceedings of COLING 14 (2014)Schler, J., Koppel, M., Argamon, S., Pennebaker, J.W.: Effects of age and gender on blogging. In: AAAI Spring Symposium: Computational Approaches to Analyzing Weblogs. AAAI (2006)Schwartz, H.A., Eichstaedt, J.C., Kern, M.L., Dziurzynski, L., Ramones, S.M., Agrawal, M., Shah, A., Kosinski, M., Stillwell, D., Seligman, M.E., et al.: Personality, gender, and age in the language of social media: the open-vocabulary approach. PloS One 8(9), 773–791 (2013)Stamatatos, E.: A survey of modern authorship attribution methods. J. Am. Soc. Inf. Sci. Technol. 60, 538–556 (2009)Stamatatos, E.: On the robustness of authorship attribution based on character n-gram features. J. Law Policy 21, 421–439 (2013)Stamatatos, E., Tschuggnall, M., Verhoeven, B., Daelemans, W., Specht, G., Stein, B., Potthast, M.: Clustering by authorship within and across documents. In: CLEF 2016 Working Notes. CEUR-WS.org (2016)Stamatatos, E., Daelemans, W., Verhoeven, B., Juola, P., López-López, A., Potthast, M., Stein, B.: Overview of the author identification task at PAN-2015. In: Working Notes Papers of the CLEF 2015 Evaluation Labs. CEUR-WS.org, vol. 1391 (2015)Stamatatos, E., Daelemans, W., Verhoeven, B., Stein, B., Potthast, M., Juola, P., Sánchez-Pérez, M.A., Barrón-Cedeño, A.: Overview of the author identification task at PAN 2014. In: Working Notes Papers of the CLEF 2014 Evaluation Labs. CEUR-WS.org, vol. 1180 (2014)Stamatatos, E., Fakotakis, N., Kokkinakis, G.: Automatic text categorization in terms of genre and author. Comput. Linguist. 26(4), 471–495 (2000)Stein, B., Lipka, N., Prettenhofer, P.: Intrinsic plagiarism analysis. Lang. Resour. Eval. (LRE) 45, 63–82 (2011)Stein, B., Meyer zu Eißen, S.: Near Similarity Search and Plagiarism Analysis. In: Proceedings of GFKL 05. Springer, Heidelberg, pp. 430–437 (2006)Verhoeven, B., Daelemans, W.: Clips stylometry investigation (csi) corpus: a dutch corpus for the detection of age, gender, personality, sentiment and deception in text. In: Proceedings of LREC 2014 (2014)Verhoeven, B., Daelemans, W.: CLiPS stylometry investigation (CSI) corpus: a dutch corpus for the detection of age, gender, personality, sentiment and deception in text. In: Proceedings of the 9th International Conference on Language Resources and Evaluation, LREC (2014)Weren, E., Kauer, A., Mizusaki, L., Moreira, V., de Oliveira, P., Wives, L.: Examining multiple features for author profiling. J. Inf. Data Manage. 5(3), 266–280 (2014)Zhang, C., Zhang, P.: Predicting Gender from Blog Posts. Technical Report. University of Massachusetts Amherst, USA (2010

    Overview of the PAN/CLEF 2015 Evaluation Lab

    Full text link
    The final publication is available at Springer via http://dx.doi.org/10.1007/978-3-319-24027-5_49This paper presents an overview of the PAN/CLEF evaluation lab. During the last decade, PAN has been established as the main forum of text mining research focusing on the identification of personal traits of authors left behind in texts unintentionally. PAN 2015 comprises three tasks: plagiarism detection, author identification and author profiling studying important variations of these problems. In plagiarism detection, community-driven corpus construction is introduced as a new way of developing evaluation resources with diversity. In author identification, cross-topic and cross-genre author verification (where the texts of known and unknown authorship do not match in topic and/or genre) is introduced. A new corpus was built for this challenging, yet realistic, task covering four languages. In author profiling, in addition to usual author demographics, such as gender and age, five personality traits are introduced (openness, conscientiousness, extraversion, agreeableness, and neuroticism) and a new corpus of Twitter messages covering four languages was developed. In total, 53 teams participated in all three tasks of PAN 2015 and, following the practice of previous editions, software submissions were required and evaluated within the TIRA experimentation framework.Stamatatos, E.; Potthast, M.; Rangel, F.; Rosso, P.; Stein, B. (2015). Overview of the PAN/CLEF 2015 Evaluation Lab. En Experimental IR Meets Multilinguality, Multimodality, and Interaction: 6th International Conference of the CLEF Association, CLEF'15, Toulouse, France, September 8-11, 2015, Proceedings. Springer International Publishing. 518-538. doi:10.1007/978-3-319-24027-5_49S518538Álvarez-Carmona, M.A., López-Monroy, A.P., Montes-Y-Gómez, M., Villaseñor-Pineda, L., Jair-Escalante, H.: INAOE’s participation at PAN 2015: author profiling task–notebook for PAN at CLEF 2015. In: CLEF 2013 Working Notes. CEUR (2015)Argamon, S., Koppel, M., Fine, J., Shimoni, A.R.: Gender, Genre, and Writing Style in Formal Written Texts. TEXT 23, 321–346 (2003)Bagnall, D.: Author identification using multi-headed recurrent neural networks. In: CLEF 2015 Working Notes. CEUR (2015)Burger, J.D., Henderson, J., Kim, G., Zarrella, G.: Discriminating gender on twitter. In: Proceedings of EMNLP 2011. ACL (2011)Burrows, S., Potthast, M., Stein, B.: Paraphrase Acquisition via Crowdsourcing and Machine Learning. ACM TIST 4(3), 43:1–43:21 (2013)Castillo, E., Cervantes, O., Vilariño, D., Pinto, D., León, S.: Unsupervised method for the authorship identification task. In: CLEF 2014 Labs and Workshops, Notebook Papers. CEUR (2014)Celli, F., Lepri, B., Biel, J.I., Gatica-Perez, D., Riccardi, G., Pianesi, F.: The workshop on computational personality recognition 2014. In: Proceedings of ACM MM 2014 (2014)Celli, F., Pianesi, F., Stillwell, D., Kosinski, M.: Workshop on computational personality recognition: shared task. In: Proceedings of WCPR at ICWSM 2013 (2013)Celli, F., Polonio, L.: Relationships between personality and interactions in facebook. In: Social Networking: Recent Trends, Emerging Issues and Future Outlook. Nova Science Publishers, Inc. (2013)Chaski, C.E.: Who’s at the Keyboard: Authorship Attribution in Digital Evidence Invesigations. International Journal of Digital Evidence 4 (2005)Chittaranjan, G., Blom, J., Gatica-Perez, D.: Mining Large-scale Smartphone Data for Personality Studies. Personal and Ubiquitous Computing 17(3), 433–450 (2013)Fréry, J., Largeron, C., Juganaru-Mathieu, M.: UJM at clef in author identification. In: CLEF 2014 Labs and Workshops, Notebook Papers. CEUR (2014)Gollub, T., Potthast, M., Beyer, A., Busse, M., Rangel, F., Rosso, P., Stamatatos, E., Stein, B.: Recent trends in digital text forensics and its evaluation. In: Forner, P., Müller, H., Paredes, R., Rosso, P., Stein, B. (eds.) CLEF 2013. LNCS, vol. 8138, pp. 282–302. Springer, Heidelberg (2013)Gollub, T., Stein, B., Burrows, S.: Ousting ivory tower research: towards a web framework for providing experiments as a service. In: Proceedings of SIGIR 2012. ACM (2012)Hagen, M., Potthast, M., Stein, B.: Source retrieval for plagiarism detection from large web corpora: recent approaches. In: CLEF 2015 Working Notes. CEUR (2015)van Halteren, H.: Linguistic profiling for author recognition and verification. In: Proceedings of ACL 2004. ACL (2004)Holmes, J., Meyerhoff, M.: The Handbook of Language and Gender. Blackwell Handbooks in Linguistics. Wiley (2003)Jankowska, M., Keselj, V., Milios, E.: CNG text classification for authorship profiling task–notebook for PAN at CLEF 2013. In: CLEF 2013 Working Notes. CEUR (2013)Juola, P.: Authorship Attribution. Foundations and Trends in Information Retrieval 1, 234–334 (2008)Juola, P.: How a Computer Program Helped Reveal J.K. Rowling as Author of A Cuckoo’s Calling. Scientific American (2013)Juola, P., Stamatatos, E.: Overview of the author identification task at PAN-2013. In: CLEF 2013 Working Notes. CEUR (2013)Kalimeri, K., Lepri, B., Pianesi, F.: Going beyond traits: multimodal classification of personality states in the wild. In: Proceedings of ICMI 2013. ACM (2013)Koppel, M., Argamon, S., Shimoni, A.R.: Automatically Categorizing Written Texts by Author Gender. Literary and Linguistic Computing 17(4) (2002)Koppel, M., Schler, J., Bonchek-Dokow, E.: Measuring Differentiability: Unmasking Pseudonymous Authors. J. Mach. Learn. Res. 8, 1261–1276 (2007)Koppel, M., Winter, Y.: Determining if Two Documents are Written by the same Author. Journal of the American Society for Information Science and Technology 65(1), 178–187 (2014)Kosinski, M., Bachrach, Y., Kohli, P., Stillwell, D., Graepel, T.: Manifestations of User Personality in Website Choice and Behaviour on Online Social Networks. Machine Learning (2013)López-Monroy, A.P., y Gómez, M.M., Jair-Escalante, H., Villaseñor-Pineda, L.: Using intra-profile information for author profiling–notebook for PAN at CLEF 2014. In: CLEF 2014 Working Notes. CEUR (2014)Lopez-Monroy, A.P., Montes-Y-Gomez, M., Escalante, H.J., Villasenor-Pineda, L., Villatoro-Tello, E.: INAOE’s participation at PAN 2013: author profiling task-notebook for PAN at CLEF 2013. In: CLEF 2013 Working Notes. CEUR (2013)Luyckx, K., Daelemans, W.: Authorship attribution and verification with many authors and limited data. In: Proceedings of COLING 2008 (2008)Maharjan, S., Shrestha, P., Solorio, T., Hasan, R.: A straightforward author profiling approach in mapreduce. In: Bazzan, A.L.C., Pichara, K. (eds.) IBERAMIA 2014. LNCS, vol. 8864, pp. 95–107. Springer, Heidelberg (2014)Mairesse, F., Walker, M.A., Mehl, M.R., Moore, R.K.: Using Linguistic Cues for the Automatic Recognition of Personality in Conversation and Text. Journal of Artificial Intelligence Research 30(1), 457–500 (2007)Eissen, S.M., Stein, B.: Intrinsic plagiarism detection. In: Lalmas, M., MacFarlane, A., Rüger, S.M., Tombros, A., Tsikrika, T., Yavlinsky, A. (eds.) ECIR 2006. LNCS, vol. 3936, pp. 565–569. Springer, Heidelberg (2006)Mohammadi, G., Vinciarelli, A.: Automatic personality perception: Prediction of Trait Attribution Based on Prosodic Features. IEEE Transactions on Affective Computing 3(3), 273–284 (2012)Moreau, E., Jayapal, A., Lynch, G., Vogel, C.: Author verification: basic stacked generalization applied to predictions from a set of heterogeneous learners. In: CLEF 2015 Working Notes. CEUR (2015)Nguyen, D., Gravel, R., Trieschnigg, D., Meder, T.: “How old do you think I am?”; a study of language and age in twitter. In: Proceedings of ICWSM 2013. AAAI (2013)Oberlander, J., Nowson, S.: Whose thumb is it anyway?: classifying author personality from weblog text. In: Proceedings of COLING 2006. ACL (2006)Peñas, A., Rodrigo, A.: A simple measure to assess non-response. In: Proceedings of HLT 2011. ACL (2011)Pennebaker, J.W., Mehl, M.R., Niederhoffer, K.G.: Psychological Aspects of Natural Language Use: Our Words. Our Selves. Annual Review of Psychology 54(1), 547–577 (2003)Potthast, M., Barrón-Cedeño, A., Eiselt, A., Stein, B., Rosso, P.: Overview of the 2nd international competition on plagiarism detection. In: CLEF 2010 Working Notes. CEUR (2010)Potthast, M., Barrón-Cedeño, A., Stein, B., Rosso, P.: Cross-Language Plagiarism Detection. Language Resources and Evaluation (LRE) 45, 45–62 (2011)Potthast, M., Eiselt, A., Barrón-Cedeño, A., Stein, B., Rosso, P.: Overview of the 3rd international competition on plagiarism detection. In: CLEF 2011 Working Notes (2011)Potthast, M., Gollub, T., Hagen, M., Graßegger, J., Kiesel, J., Michel, M., Oberländer, A., Tippmann, M., Barrón-Cedeño, A., Gupta, P., Rosso, P., Stein, B.: Overview of the 4th international competition on plagiarism detection. In: CLEF 2012 Working Notes. CEUR (2012)Potthast, M., Gollub, T., Hagen, M., Tippmann, M., Kiesel, J., Rosso, P., Stamatatos, E., Stein, B.: Overview of the 5th international competition on plagiarism detection. In: CLEF 2013 Working Notes. CEUR (2013)Potthast, M., Gollub, T., Rangel, F., Rosso, P., Stamatatos, E., Stein, B.: Improving the reproducibility of PAN’s shared tasks: plagiarism detection, author identification, and author profiling. In: Kanoulas, E., Lupu, M., Clough, P., Sanderson, M., Hall, M., Hanbury, A., Toms, E. (eds.) CLEF 2014. LNCS, vol. 8685, pp. 268–299. Springer, Heidelberg (2014)Potthast, M., Hagen, M., Beyer, A., Busse, M., Tippmann, M., Rosso, P., Stein, B.: Overview of the 6th international competition on plagiarism detection. In: CLEF 2014 Working Notes. CEUR (2014)Potthast, M., Göring, S., Rosso, P., Stein, B.: Towards data submissions for shared tasks: first experiences for the task of text alignment. In: CLEF 2015 Working Notes. CEUR (2015)Potthast, M., Hagen, M., Stein, B., Graßegger, J., Michel, M., Tippmann, M., Welsch, C.: ChatNoir: a search engine for the clueweb09 corpus. In: Proceedings of SIGIR 2012. ACM (2012)Potthast, M., Hagen, M., Völske, M., Stein, B.: Crowdsourcing interaction logs to understand text reuse from the web. In: Proceedings of ACL 2013. ACL (2013)Potthast, M., Stein, B., Barrón-Cedeño, A., Rosso, P.: An evaluation framework for plagiarism detection. In: Proceedings of COLING 2010. ACL (2010)Potthast, M., Stein, B., Eiselt, A., Barrón-Cedeño, A., Rosso, P.: Overview of the 1st international competition on plagiarism detection. In: Proceedings of PAN at SEPLN 2009. CEUR (2009)Quercia, D., Lambiotte, R., Stillwell, D., Kosinski, M., Crowcroft, J.: The personality of popular facebook users. In: Proceedings of CSCW 2012. ACM (2012)Rammstedt, B., John, O.: Measuring Personality in One Minute or Less: A 10 Item Short Version of the Big Five Inventory in English and German. Journal of Research in Personality (2007)Rangel, F., Rosso, P.: On the impact of emotions on author profiling. In: Information Processing & Management, Special Issue on Emotion and Sentiment in Social and Expressive Media (2014) (in press)Rangel, F., Rosso, P., Celli, F., Potthast, M., Stein, B., Daelemans, W.: Overview of the 3rd author profiling task at PAN 2015. In: CLEF 2015 Working Notes. CEUR (2015)Rangel, F., Rosso, P., Chugur, I., Potthast, M., Trenkmann, M., Stein, B., Verhoeven, B., Daelemans, W.: Overview of the 2nd author profiling task at PAN 2014. In: CLEF 2014 Working Notes. CEUR (2014)Rangel, F., Rosso, P., Koppel, M., Stamatatos, E., Inches, G.: Overview of the author profiling task at PAN 2013–notebook for PAN at CLEF 2013. In: CLEF 2013 Working Notes. CEUR (2013)Sapkota, U., Bethard, S., Montes-y-Gómez, M., Solorio, T.: Not all character N-grams are created equal: a study in authorship attribution. In: Proceedings of NAACL 2015. ACL (2015)Sapkota, U., Solorio, T., Montes-y-Gómez, M., Bethard, S., Rosso, P.: Cross-topic authorship attribution: will out-of-topic data help? In: Proceedings of COLING 2014 (2014)Schler, J., Koppel, M., Argamon, S., Pennebaker, J.W.: Effects of age and gender on blogging. In: AAAI Spring Symposium: Computational Approaches to Analyzing Weblogs. AAAI (2006)Schwartz, H.A., Eichstaedt, J.C., Kern, M.L., Dziurzynski, L., Ramones, S.M., Agrawal, M., Shah, A., Kosinski, M., Stillwell, D., Seligman, M.E., et al.: Personality, Gender, and Age in the Language of Social Media: The Open-Vocabulary Approach. PloS one 8(9), 773–791 (2013)Stamatatos, E.: A Survey of Modern Authorship Attribution Methods. Journal of the American Society for Information Science and Technology 60, 538–556 (2009)Stamatatos, E.: On the Robustness of Authorship Attribution Based on Character N-gram Features. Journal of Law and Policy 21, 421–439 (2013)Stamatatos, E., Daelemans, W., Verhoeven, B., Juola, P., López-López, A., Potthast, M., Stein, B.: Overview of the author identification task at PAN 2015. In: Working Notes Papers of the CLEF 2015 Evaluation Labs. CEUR (2015)Stamatatos, E., Daelemans, W., Verhoeven, B., Stein, B., Potthast, M., Juola, P., Sánchez-Pérez, M.A., Barrón-Cedeño, A.: Overview of the author identification task at PAN 2014. In: CLEF 2014 Working Notes. CEUR (2014)Stamatatos, E., Fakotakis, N., Kokkinakis, G.: Automatic Text Categorization in Terms of Genre and Author. Comput. Linguist. 26(4), 471–495 (2000)Stein, B., Lipka, N., Prettenhofer, P.: Intrinsic Plagiarism Analysis. Language Resources and Evaluation (LRE) 45, 63–82 (2011)Stein, B., Meyer zu Eißen, S.: Near similarity search and plagiarism analysis. In: Proceedings of GFKL 2005. Springer (2006)Sushant, S.A., Argamon, S., Dhawle, S., Pennebaker, J.W.: Lexical predictors of personality type. In: Proceedings of Joint Interface/CSNA 2005Verhoeven, B., Daelemans, W.: Clips stylometry investigation (CSI) corpus: a dutch corpus for the detection of age, gender, personality, sentiment and deception in text. In: Proceedings of LREC 2014. ACL (2014)Weren, E., Kauer, A., Mizusaki, L., Moreira, V., de Oliveira, P., Wives, L.: Examining Multiple Features for Author Profiling. Journal of Information and Data Management (2014)Zhang, C., Zhang, P.: Predicting gender from blog posts. Tech. rep., Technical Report. University of Massachusetts Amherst, USA (2010

    Overview of PAN 2018. Author identification, author profiling, and author obfuscation

    Full text link
    [EN] PAN 2018 explores several authorship analysis tasks enabling a systematic comparison of competitive approaches and advancing research in digital text forensics.More specifically, this edition of PAN introduces a shared task in cross-domain authorship attribution, where texts of known and unknown authorship belong to distinct domains, and another task in style change detection that distinguishes between single author and multi-author texts. In addition, a shared task in multimodal author profiling examines, for the first time, a combination of information from both texts and images posted by social media users to estimate their gender. Finally, the author obfuscation task studies how a text by a certain author can be paraphrased so that existing author identification tools are confused and cannot recognize the similarity with other texts of the same author. New corpora have been built to support these shared tasks. A relatively large number of software submissions (41 in total) was received and evaluated. Best paradigms are highlighted while baselines indicate the pros and cons of submitted approaches.The work at the Universitat Polit`ecnica de Val`encia was funded by the MINECO research project SomEMBED (TIN2015-71147-C2-1-P)Stamatatos, E.; Rangel-Pardo, FM.; Tschuggnall, M.; Stein, B.; Kestemont, M.; Rosso, P.; Potthast, M. (2018). Overview of PAN 2018. Author identification, author profiling, and author obfuscation. Lecture Notes in Computer Science. 11018:267-285. https://doi.org/10.1007/978-3-319-98932-7_25S26728511018Argamon, S., Juola, P.: Overview of the international authorship identification competition at PAN-2011. In: Petras, V., Forner, P., Clough, P. (eds.) Notebook Papers of CLEF 2011 Labs and Workshops, 19–22 September 2011, Amsterdam, Netherlands, September 2011. http://www.clef-initiative.eu/publication/working-notesBird, S., Klein, E., Loper, E.: Natural Language Processing with Python. O’Reilly Media, Sebastopol (2009)Bogdanova, D., Lazaridou, A.: Cross-language authorship attribution. In: Proceedings of the 9th International Conference on Language Resources and Evaluation, LREC 2014, pp. 2015–2020 (2014)Choi, F.Y.: Advances in domain independent linear text segmentation. In: Proceedings of the 1st North American Chapter of the Association for Computational Linguistics Conference (NAACL), pp. 26–33. Association for Computational Linguistics, Seattle, April 2000Custódio, J.E., Paraboni, I.: EACH-USP ensemble cross-domain authorship attribution. In: Working Notes Papers of the CLEF 2018 Evaluation Labs, September 2018, to be announcedDaneshvar, S.: Gender identification in Twitter using n-grams and LSA. In: Working Notes Papers of the CLEF 2018 Evaluation Labs, September 2018, to be announcedDaniel Karaś, M.S., Sobecki, P.: OPI-JSA at CLEF 2017: author clustering and style breach detection. In: Working Notes Papers of the CLEF 2017 Evaluation Labs. CEUR Workshop Proceedings. CLEF and CEUR-WS.org, September 2017Giannella, C.: An improved algorithm for unsupervised decomposition of a multi-author document. The MITRE Corporation. Technical Papers, February 2014Glover, A., Hirst, G.: Detecting stylistic inconsistencies in collaborative writing. In: Sharples, M., van der Geest, T. (eds.) The New Writing Environment, pp. 147–168. Springer, London (1996). https://doi.org/10.1007/978-1-4471-1482-6_12Hagen, M., Potthast, M., Stein, B.: Overview of the author obfuscation task at PAN 2017: safety evaluation revisited. In: Cappellato, L., Ferro, N., Goeuriot, L., Mandl, T. (eds.) Working Notes Papers of the CLEF 2017 Evaluation Labs. CEUR Workshop Proceedings, CLEF and CEUR-WS.org, September 2017Hagen, M., Potthast, M., Stein, B.: Overview of the author obfuscation task at PAN 2018. In: Working Notes Papers of the CLEF 2018 Evaluation Labs. CEUR Workshop Proceedings, CLEF and CEUR-WS.org (2018)Hellekson, K., Busse, K. (eds.): The Fan Fiction Studies Reader. University of Iowa Press, Iowa City (2014)Juola, P.: An overview of the traditional authorship attribution subtask. In: Forner, P., Karlgren, J., Womser-Hacker, C. (eds.) CLEF 2012 Evaluation Labs and Workshop - Working Notes Papers, 17–20 September 2012, Rome, Italy, September 2012. http://www.clef-initiative.eu/publication/working-notesJuola, P.: The rowling case: a proposed standard analytic protocol for authorship questions. Digital Sch. Humanit. 30(suppl–1), i100–i113 (2015)Kestemont, M., Luyckx, K., Daelemans, W., Crombez, T.: Cross-genre authorship verification using unmasking. Engl. Stud. 93(3), 340–356 (2012)Kestemont, M., et al.: Overview of the author identification task at PAN-2018: cross-domain authorship attribution and style change detection. In: Working Notes Papers of the CLEF 2018 Evaluation Labs. CEUR Workshop Proceedings, CLEF and CEUR-WS.org (2018)Koppel, M., Schler, J., Bonchek-Dokow, E.: Measuring differentiability: unmasking pseudonymous authors. J. Mach. Learn. Res. 8, 1261–1276 (2007)Overdorf, R., Greenstadt, R.: Blogs, Twitter feeds, and reddit comments: cross-domain authorship attribution. Proc. Priv. Enhanc. Technol. 2016(3), 155–171 (2016)Pedregosa, F., et al.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)Potthast, M., Eiselt, A., Barrón-Cedeño, A., Stein, B., Rosso, P.: Overview of the 3rd international competition on plagiarism detection. In: Notebook Papers of the 5th Evaluation Lab on Uncovering Plagiarism, Authorship and Social Software Misuse (PAN), Amsterdam, The Netherlands, September 2011Potthast, M., Hagen, M., Stein, B.: Author obfuscation: attacking the state of the art in authorship verification. In: Working Notes Papers of the CLEF 2016 Evaluation Labs. CEUR Workshop Proceedings, CLEF and CEUR-WS.org, September 2016. http://ceur-ws.org/Vol-1609/Potthast, M., Hagen, M., Völske, M., Stein, B.: Crowdsourcing interaction logs to understand text reuse from the web. In: Fung, P., Poesio, M. (eds.) Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (ACL 2013), pp. 1212–1221. Association for Computational Linguistics, August 2013. http://www.aclweb.org/anthology/P13-1119Rangel, F., Celli, F., Rosso, P., Potthast, M., Stein, B., Daelemans, W.: Overview of the 3rd author profiling task at PAN 2015. In: Cappellato, L., Ferro, N., Jones, G., San Juan, E. (eds.) CLEF 2015 Evaluation Labs and Workshop - Working Notes Papers, Toulouse, France, pp. 8–11. CEUR-WS.org, September 2015Rangel, F., et al.: Overview of the 2nd author profiling task at PAN 2014. In: Cappellato, L., Ferro, N., Halvey, M., Kraaij, W. (eds.) CLEF 2014 Evaluation Labs and Workshop - Working Notes Papers, Sheffield, UK, pp. 15–18. CEUR-WS.org, September 2014Rangel, F., Rosso, P., G’omez, M.M., Potthast, M., Stein, B.: Overview of the 6th author profiling task at pan 2018: multimodal gender identification in Twitter. In: CLEF 2018 Labs and Workshops, Notebook Papers. CEUR Workshop Proceedings. CEUR-WS.org (2017)Rangel, F., Rosso, P., Koppel, M., Stamatatos, E., Inches, G.: Overview of the author profiling task at PAN 2013. In: Forner, P., Navigli, R., Tufis, D. (eds.) CLEF 2013 Evaluation Labs and Workshop - Working Notes Papers, 23–26 September 2013, Valencia, Spain, September 2013Rangel, F., Rosso, P., Potthast, M., Stein, B.: Overview of the 5th author profiling task at PAN 2017: gender and language variety identification in Twitter. In: Cappellato, L., Ferro, N., Goeuriot, L., Mandl, T. (eds.) Working Notes Papers of the CLEF 2017 Evaluation Labs. CEUR Workshop Proceedings, CLEF and CEUR-WS.org, September 2017Rangel, F., Rosso, P., Verhoeven, B., Daelemans, W., Potthast, M., Stein, B.: Overview of the 4th author profiling task at PAN 2016: cross-genre evaluations. In: Balog, K., Cappellato, L., Ferro, N., Macdonald, C. (eds.) CLEF 2016 Labs and Workshops, Notebook Papers. CEUR Workshop Proceedings. CEUR-WS.org, September 2016Safin, K., Kuznetsova, R.: Style breach detection with neural sentence embeddings. In: Working Notes Papers of the CLEF 2017 Evaluation Labs. CEUR Workshop Proceedings, CLEF and CEUR-WS.org, September 2017Sapkota, U., Bethard, S., Montes, M., Solorio, T.: Not all character n-grams are created equal: a study in authorship attribution. In: Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 93–102 (2015)Sapkota, U., Solorio, T., Montes, M., Bethard, S., Rosso, P.: Cross-topic authorship attribution: will out-of-topic data help? In: Proceedings of the 25th International Conference on Computational Linguistics. Technical Papers, pp. 1228–1237 (2014)Stamatatos, E.: Intrinsic plagiarism detection using character nnn-gram Profiles. In: Stein, B., Rosso, P., Stamatatos, E., Koppel, M., Agirre, E. (eds.) SEPLN 2009 Workshop on Uncovering Plagiarism, Authorship, and Social Software Misuse (PAN 2009), pp. 38–46. Universidad Politécnica de Valencia and CEUR-WS.org, September 2009. http://ceur-ws.org/Vol-502Stamatatos, E.: On the robustness of authorship attribution based on character n-gram features. J. Law Policy 21, 421–439 (2013)Stamatatos, E.: Authorship attribution using text distortion. In: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, Long Papers, vol. 1, pp. 1138–1149. Association for Computational Linguistics (2017)Stamatatos, E., et al.: Overview of the author identification task at PAN 2015. In: Cappellato, L., Ferro, N., Jones, G., San Juan, E. (eds.) CLEF 2015 Evaluation Labs and Workshop - Working Notes Papers, 8–11 September 2015, Toulouse, France. CEUR-WS.org, September 2015Stamatatos, E., et al.: Clustering by authorship within and across documents. In: Working Notes Papers of the CLEF 2016 Evaluation Labs. CEUR Workshop Proceedings, CLEF and CEUR-WS.org, September 2016. http://ceur-ws.org/Vol-1609/Takahashi, T., Tahara, T., Nagatani, K., Miura, Y., Taniguchi, T., Ohkuma, T.: Text and image synergy with feature cross technique for gender identification. In: Working Notes Papers of the CLEF 2018 Evaluation Labs, September 2018, to be announcedTellez, E.S., Miranda-Jiménez, S., Moctezuma, D., Graff, M., Salgado, V., Ortiz-Bejar, J.: Gender identification through multi-modal tweet analysis using microtc and bag of visual words. In: Working Notes Papers of the CLEF 2018 Evaluation Labs, September 2018, to be announcedTschuggnall, M., Specht, G.: Automatic decomposition of multi-author documents using grammar analysis. In: Proceedings of the 26th GI-Workshop on Grundlagen von Datenbanken. CEUR-WS, Bozen, October 2014Tschuggnall, M., et al.: Overview of the author identification task at PAN-2017: style breach detection and author clustering. In: Cappellato, L., Ferro, N., Goeuriot, L., Mandl, T. (eds.) Working Notes Papers of the CLEF 2017 Evaluation Labs. CEUR Workshop Proceedings, vol. 1866. CLEF and CEUR-WS.org, September 2017. http://ceur-ws.org/Vol-1866

    Overview of PAN'17: Author Identification, Author Profiling, and Author Obfuscation

    Full text link
    [EN] The PAN 2017 shared tasks on digital text forensics were held in conjunction with the annual CLEF conference. This paper gives a high-level overview of each of the three shared tasks organized this year, namely author identification, author profiling, and author obfuscation. For each task, we give a brief summary of the evaluation data, performance measures, and results obtained. Altogether, 29 participants submitted a total of 33 pieces of software for evaluation, whereas 4 participants submitted to more than one task. All submitted software has been deployed to the TIRA evaluation platform, where it remains hosted for reproducibility purposes.The work at the Universitat Politècnica de València was funded by the MINECO research project SomEMBED (TIN2015-71147-C2-1-P).Potthast, M.; Rangel-Pardo, FM.; Tschuggnall, M.; Stamatatos, E.; Rosso, P.; Stein, B. (2017). Overview of PAN'17: Author Identification, Author Profiling, and Author Obfuscation. Lecture Notes in Computer Science. 10456:275-290. https://doi.org/10.1007/978-3-319-65813-1_25S27529010456Amigó, E., Gonzalo, J., Artiles, J., Verdejo, F.: A comparison of extrinsic clustering evaluation metrics based on formal constraints. Inf. Retrieval 12(4), 461–486 (2009)Bagnall, D.: Authorship clustering using multi-headed recurrent neural networks—notebook for PAN at CLEF 2016. In: Balog et al. [3] (2016). http://ceur-ws.org/Vol-1609/Balog, K., Cappellato, L., Ferro, N., Macdonald, C. (eds.): CLEF 2016 Evaluation Labs and Workshop – Working Notes Papers, 5–8 September, Évora, Portugal. CEUR Workshop Proceedings. CEUR-WS.org (2016). http://www.clef-initiative.eu/publication/working-notesClarke, C.L., Craswell, N., Soboroff, I., Voorhees, E.M.: Overview of the TREC 2009 web track. Technical report, DTIC Document (2009)García, Y., Castro, D., Lavielle, V., Noz, R.M.: Discovering author groups using a β\beta β -compact graph-based clustering. In: Cappellato, L., Ferro, N., Goeuriot, L., Mandl, T. (eds.) CLEF 2017 Working Notes. CEUR Workshop Proceedings, CLEF and CEUR-WS.org, September 2017Glavaš, G., Nanni, F., Ponzetto, S.P.: Unsupervised text segmentation using semantic relatedness graphs. In: Association for Computational Linguistics (2016)Gollub, T., Stein, B., Burrows, S.: Ousting ivory tower research: towards a web framework for providing experiments as a service. In: Hersh, B., Callan, J., Maarek, Y., Sanderson, M. (eds.) 35th International ACM Conference on Research and Development in Information Retrieval (SIGIR 2012), pp. 1125–1126. ACM, August 2012Gómez-Adorno, H., Aleman, Y., no, D.V., Sanchez-Perez, M.A., Pinto, D., Sidorov, G.: Author clustering using hierarchical clustering analysis. In: Cappellato, L., Ferro, N., Goeuriot, L., Mandl, T. (eds.) CLEF 2017 Working Notes. CEUR Workshop Proceedings, CLEF and CEUR-WS.org, September 2017Hagen, M., Potthast, M., Stein, B.: Overview of the author obfuscation task at PAN 2017: safety evaluation revisited. In: Cappellato, L., Ferro, N., Goeuriot, L., Mandl, T. (eds.) Working Notes Papers of the CLEF 2017 Evaluation Labs. CEUR Workshop Proceedings, CLEF and CEUR-WS.org, September 2017Halvani, O., Graner, L.: Author clustering based on compression-based dissimilarity scores. In: Cappellato, L., Ferro, N., Goeuriot, L., Mandl, T. (eds.) CLEF 2017 Working Notes. CEUR Workshop Proceedings, CLEF and CEUR-WS.org, September 2017Hearst, M.A.: TextTiling: segmenting text into multi-paragraph subtopic passages. Comput. Linguist. 23(1), 33–64 (1997)Kiros, R., Zhu, Y., Salakhutdinov, R.R., Zemel, R., Urtasun, R., Torralba, A., Fidler, S.: Skip-thought vectors. In: Advances in Neural Information Processing Systems (NIPS), pp. 3294–3302 (2015)Kocher, M., Savoy, J.: UniNE at CLEF 2017: author clustering. In: Cappellato, L., Ferro, N., Goeuriot, L., Mandl, T. (eds.) CLEF 2017 Working Notes. CEUR Workshop Proceedings, CLEF and CEUR-WS.org, September 2017Koppel, M., Akiva, N., Dershowitz, I., Dershowitz, N.: Unsupervised decomposition of a document into authorial components. In: Lin, D., Matsumoto, Y., Mihalcea, R. (eds.) Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 1356–1364 (2011)Misra, H., Yvon, F., Jose, J.M., Cappe, O.: Text segmentation via topic modeling: an analytical study. In: Proceedings of CIKM 2009, pp. 1553–1556. ACM (2009)Pevzner, L., Hearst, M.A.: A critique and improvement of an evaluation metric for text segmentation. Comput. Linguis. 28(1), 19–36 (2002)Potthast, M., Eiselt, A., Barrón-Cedeño, A., Stein, B., Rosso, P.: Overview of the 3rd international competition on plagiarism detection. In: Notebook Papers of the 5th Evaluation Lab on Uncovering Plagiarism, Authorship and Social Software Misuse (PAN), Amsterdam, The Netherlands, September 2011Potthast, M., Gollub, T., Rangel, F., Rosso, P., Stamatatos, E., Stein, B.: Improving the reproducibility of PAN’s shared tasks: plagiarism detection, author identification, and author profiling. In: Kanoulas, E., Lupu, M., Clough, P., Sanderson, M., Hall, M., Hanbury, A., Toms, E. (eds.) CLEF 2014. LNCS, vol. 8685, pp. 268–299. Springer, Cham (2014). doi: 10.1007/978-3-319-11382-1_22Potthast, M., Hagen, M., Stein, B.: Author obfuscation: attacking the state of the art in authorship verification. In: Working Notes Papers of the CLEF 2016 Evaluation Labs. CEUR Workshop Proceedings, CLEF and CEUR-WS.org, September 2016. http://ceur-ws.org/Vol-1609/Potthast, M., Hagen, M., Völske, M., Stein, B.: Crowdsourcing interaction logs to understand text reuse from the web. In: Fung, P., Poesio, M. (eds.) Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (ACL 13), pp. 1212–1221. Association for Computational Linguistics (2013). http://www.aclweb.org/anthology/p13-1119Rangel, F., Celli, F., Rosso, P., Potthast, M., Stein, B., Daelemans, W.: Overview of the 3rd author profiling task at PAN 2015. In: Cappellato, L., Ferro, N., Jones, G., San Juan, E. (eds.) CLEF 2015 Evaluation Labs and Workshop – Working Notes Papers, 8–11 September, Toulouse, France. CEUR Workshop Proceedings, CEUR-WS.org, September 2015Rangel, F., Rosso, P., Chugur, I., Potthast, M., Trenkmann, M., Stein, B., Verhoeven, B., Daelemans, W.: Overview of the 2nd author profiling task at PAN 2014. In: Cappellato, L., Ferro, N., Halvey, M., Kraaij, W. (eds.) CLEF 2014 Evaluation Labs and Workshop – Working Notes Papers, 15–18 September, Sheffield, UK. CEUR Workshop Proceedings, CEUR-WS.org, September 2014Rangel, F., Rosso, P., Franco-Salvador, M.: A low dimensionality representation for language variety identification. In: 17th International Conference on Intelligent Text Processing and Computational Linguistics, CICLing. LNCS. Springer (2016). arXiv:1705.10754Rangel, F., Rosso, P., Koppel, M., Stamatatos, E., Inches, G.: Overview of the author profiling task at PAN 2013. In: Forner, P., Navigli, R., Tufis, D. (eds.) CLEF 2013 Evaluation Labs and Workshop – Working Notes Papers, 23–26 September, Valencia, Spain (2013)Rangel, F., Rosso, P., Potthast, M., Stein, B.: Overview of the 5th author profiling task at PAN 2017: gender and language variety identification in Twitter. In: Cappellato, L., Ferro, N., Goeuriot, L., Mandl, T. (eds.) Working Notes Papers of the CLEF 2017 Evaluation Labs. CEUR Workshop Proceedings, CLEF and CEUR-WS.org, September 2017Rangel, F., Rosso, P., Verhoeven, B., Daelemans, W., Potthast, M., Stein, B.: Overview of the 4th author profiling task at PAN 2016: cross-genre evaluations. In: Balog et al. [3]Riedl, M., Biemann, C.: TopicTiling: a text segmentation algorithm based on LDA. In: Proceedings of ACL 2012 Student Research Workshop, pp. 37–42. Association for Computational Linguistics (2012)Scaiano, M., Inkpen, D.: Getting more from segmentation evaluation. In: Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. pp. 362–366. Association for Computational Linguistics (2012)Stamatatos, E., Tschuggnall, M., Verhoeven, B., Daelemans, W., Specht, G., Stein, B., Potthast, M.: Clustering by authorship within and across documents. In: Working Notes Papers of the CLEF 2016 Evaluation Labs. CEUR Workshop Proceedings, CLEF and CEUR-WS.org. http://ceur-ws.org/Vol-1609/Stamatatos, E., Tschuggnall, M., Verhoeven, B., Daelemans, W., Specht, G., Stein, B., Potthast, M.: Clustering by authorship within and across documents. In: Working Notes Papers of the CLEF 2016 Evaluation Labs. CEUR Workshop Proceedings, CLEF and CEUR-WS.org, September 2016Tschuggnall, M., Stamatatos, E., Verhoeven, B., Daelemans, W., Specht, G., Stein, B., Potthast, M.: Overview of the author identification task at PAN-2017: style breach detection and author clustering. In: Cappellato, L., Ferro, N., Goeuriot, L., Mandl, T. (eds.) Working Notes Papers of the CLEF 2017 Evaluation Labs. CEUR Workshop Proceedings, CLEF and CEUR-WS.org, September 201

    Author Profiling and Plagiarism Detection

    Full text link
    The final publication is available at Springer via http://dx.doi.org/10.1007/978-3-319-25485-2_6In this chapter we introduce the topics that we will cover in the RuSSIR 2014 course on Author Profiling and Plagiarism Detection (APPD). Author profiling distinguishes between classes of authors studying how language is shared by classes of people. This task helps in identifying profiling aspects such as gender, age, native language, or even personality type. In case of the plagiarism detection task we are not interested in studying how language is shared. On the contrary, given a document we are interested in investigating if the writing style changes in order to unveil text inconsistencies, i.e., unexpected irregularities through the document such as changes in vocabulary, style and text complexity. In fact, when it is not possible to retrieve the source document(s) where plagiarism has been committed from, the intrinsic analysis of the suspicious document is the only way to find evidence of plagiarism. The difficulty in retrieving the source of plagiarism could be due to the fact that the documents are not available on the web or the plagiarised text fragments were obfuscated via paraphrasing or translation (in case the source document was in another language). In this overview, we also discuss the results of the shared tasks on author profiling (gender and age identification) and plagiarism detection that we help to organise at the PAN Lab on Uncovering Plagiarism, Authorship, and Social Software Misuse.The PAN shared tasks on author profil-ing and on plagiarism detection have been organised in the framework of the WIQ-EIIRSES project (Grant No. 269180) within the EC FP 7 Marie Curie People. The research work described in the paper was carried out in the framework of the DIANA-APPLICATIONS-Finding Hidden Knowledge in Texts: Applications (TIN2012-38603-C02-01) project, and the VLC/CAMPUS Microcluster on Multimodal Interaction inIntelligent Systems.Rosso, P. (2015). Author Profiling and Plagiarism Detection. En Information Retrieval. Springer. 229-250. https://doi.org/10.1007/978-3-319-25485-2_6S229250Argamon, S., Koppel, M., Fine, J., Shimoni, A.R.: Gender, genre, and writing style in formal written texts. TEXT 23, 321–346 (2003)Association of Teachers and Lecturers. School work plagued by plagiarism - ATL survey. Technical report, Association of Teachers and Lecturers, London, UK (2008). (Press release)Barrón-Cedeño, A.: On the mono- and cross-language detection of text re-use and plagiarism. Ph.D. thesis, Universitat Politènica de València (2012)Barrón-Cedeño, A., Rosso, P., Pinto, D., Juan, A.: On cross-lingual plagiarism analysis using a statistical model. In: Proceedings of the ECAI 2008 Workshop on Uncovering Plagiarism, Authorship and Social Software Misuse, PAN 2008 (2008)Barrón-Cedeño, A., Gupta, P., Rosso, P.: Methods for cross-language plagiarism detection. Knowl. Based Syst. 50, 11–17 (2013)Barrón-Cedeño, A., Vila, M., Martí, M., Rosso, P.: Plagiarism meets paraphrasing: insights for the next generation in automatic plagiarism detection. Comput. Linguist. 39(4), 917–947 (2013)Bogdanova, D., Rosso, P., Solorio, T.: Exploring high-level features for detecting cyberpedophilia. Comput. Speech Lang. 28(1), 108–120 (2014)Braschler, M., Harman, D.: Notebook papers of CLEF 2010 LABs and workshops. Padua, Italy (2010)Cappellato, L., Ferro, N., Halvey, M., Kraaij, W.: CLEF 2014 labs and workshops, notebook papers. In: CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613–0073 (2014). http://ceur-ws.org/Vol-1180/Comas, R., Sureda, J., Nava, C., Serrano, L.: Academic cyberplagiarism: a descriptive and comparative analysis of the prevalence amongst the undergraduate students at Tecmilenio University (Mexico) and Balearic Islands University (Spain). In: Proceedings of the International Conference on Education and New Learning Technologies (EDULEARN 2010), Barcelona (2010)Flesch, R.: A new readability yardstick. J. Appl. Psychol. 32(3), 221–233 (1948)Flores, E., Barrón-Cedeño, A., Rosso, P., Moreno, L.: Desocore: detecting source code re-use across programming languages. In: Proceedings of 12th International Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-2012, pp. 1–4, Montreal, Canada (2012)Flores, E., Barrón-Cedeño, A., Moreno, L., Rosso, P.: Uncovering source code re-use in large-scale programming environments. In: Computer Applications in Engineering and Education, Accepted (2014). doi: 10.1002/cae.21608Forner, P., Navigli, R., Tufis, D.: CLEF 2013 evaluation labs and workshop - working notes papers, 23–26 September. Valencia, Spain (2013)Franco-Salvador, M., Gupta, P., Rosso, P.: Cross-Language plagiarism detection using a multilingual semantic network. In: Braslavski, P., Kuznetsov, S.O., Kamps, J., Rüger, S., Agichtein, E., Segalovich, I., Yilmaz, E., Serdyukov, P. (eds.) ECIR 2013. LNCS, vol. 7814, pp. 710–713. Springer, Heidelberg (2013)Franco-Salvador, M., Gupta, P., Rosso, P.: Knowledge graphs as context models: improving the detection of cross-language plagiarism with paraphrasing. In: Ferro, N. (ed.) PROMISE Winter School 2013. LNCS, vol. 8173, pp. 227–236. Springer, Heidelberg (2014)Gollub, T., Stein, B., Burrows, S.: Ousting Ivory tower research: towards a web framework for providing experiments as a service. In: Hersh, B., Callan, J., Maarek, Y., Sanderson, M., (eds.) 35th International ACM Conference on Research and Development in Information Retrieval (SIGIR 2012), pp. 1125–1126. ACM, August 2012. ISBN 978-1-4503-1472-5. doi: 10.1145/2348283.2348501Gollub, T., Hagen, M., Michel, M., Stein, B.: From keywords to keyqueries: content descriptors for the web. In: Gurrin, C., Jones, G., Kelly, D., Kruschwitz, U., de Rijke, M., Sakai, T., Sheridan, P., (eds.) 36th International ACM Conference on Research and Development in Information Retrieval (SIGIR 2013), pp. 981–984. ACM (2013)Goswami, S., Sarkar, S., Rustagi, M.: Stylometric analysis of bloggers’ age and gender. In: Adar, E., Hurst, M., Finin, T., Glance, N.S., Nicolov, N., Tseng, B.L., (eds.) ICWSM. The AAAI Press (2009)Gressel, G., Hrudya, P., Surendran, K., Thara, S., Aravind, A., Prabaharan, P.: Ensemble Learning Approach for Author Profiling-Notebook for PAN at CLEF 2014. In: Cappellato, et al. [9]Grozea, C., Popescu, M.: ENCOPLOT - performance in the Second International Plagiarism Detection Challenge lab report for PAN at CLEF 2010. In: Braschler and Harman [8]Grozea, C., Gehl, C., Popescu, M.: ENCOPLOT: pairwise sequence matching in linear time applied to plagiarism detection. In: Stein et al., (ed.) Overview of the 1st International Competition on Plagiarism Detection, pp. 10–18 (2009)Gunning, R.: The Technique of Clear Writing. McGraw-Hill Int. Book Co, New York (1952)Gupta, P., Barrón-Cedeño, A., Rosso, P.: Cross-language high similarity search using a conceptual thesaurus. In: Catarci, T., Peñas, A., Santucci, G., Forner, P., Hiemstra, D. (eds.) CLEF 2012. LNCS, vol. 7488, pp. 67–75. Springer, Heidelberg (2012)Honore, A.: Some simple measures of richness of vocabulary. Assoc. Lit. Linguist. Comput. Bull. 7(2), 172–177 (1979)IEEE. A Plagiarism FAQ. http://www.ieee.org/publications_standards/publications/rights/plagiarism_FAQ.html (2008). Published: 2008; Last Accessed 25 November 2012Koppel, M., Argamon, S., Shimoni, A.R.: Automatically categorizing written texts by author gender. Lit. Linguist. Comput. 17(4), 401–412 (2002)Liau, Y., Vrizlynn, L.: Submission to the author profiling competition at pan-2014. In: Proceedings Recent Advances in Natural Language Processing III (2014). http://www.webis.de/research/events/pan-14Lopez-Monroy, A.P., Montes-Y-Gomez, M., Escalante, H.J., Villaseñor-Pineda, L., Villatoro-Tello, E.: INAOE’s participation at PAN 2013: author profiling task–notebook for PAN at CLEF 2013. In: Forner, et al. [14]Pastor López-Monroy, A., Montes y Gómez, M., Escalante, H.J., Villaseñor-Pineda, L.: Using Intra-profile information for author profiling-notebook for PAN at CLEF 2014. In: Cappellato, et al. [9]Maharjan, S., Shrestha, P., Solorio, T.: A simple approach to author profiling in MapReduce–notebook for PAN at CLEF 2014. In: Cappellato, et al. [9]Marquardt, J., Fanardi, G., Vasudevan, G., Moens, M.F., Davalos, S., Teredesai, A., De Cock, M.: Age and gender identification in social media-notebook for PAN at CLEF 2014. In: Cappellato, et al. [9]Martin, B.: Plagiarism: policy against cheating or policy for learning? Nexus (Newsl. Aust. Sociol. Assoc.) 16(2), 15–16 (2004)Mcnamee, P., Mayfield, J.: Character n-gram tokenization for european language text retrieval. Inf. Retr. 7(1), 73–97 (2004)Meina, M., Brodzinska, K., Celmer, B., Czokow, M., Patera, M., Pezacki, J., Wilk, M.: Ensemble-based classification for author profiling using various features-notebook for PAN at CLEF 2013. In: Forner, et al. [14]Eissen, S.M., Stein, B.: Intrinsic plagiarism detection. In: Tombros, A., Yavlinsky, A., Rüger, S.M., Tsikrika, T., Lalmas, M., MacFarlane, A. (eds.) ECIR 2006. LNCS, vol. 3936, pp. 565–569. Springer, Heidelberg (2006)Montes y Gómez, M., Gelbukh, A.F., López-López, A., Baeza-Yates, R.A.: Flexible comparison of conceptual graphs. In: Proceedings DEXA, pp. 102–111 (2001)Navigli, R., Ponzetto, S.P.: BabelNet: the automatic construction, evaluation and application of a wide-coverage multilingual semantic network. Artif. Intell. 193, 217–250 (2012)Nawab, R.M.A., Stevenson, M., Clough, P.: University of sheffield lab report for pan at clef 2010. In: Braschler and Harman [8]Nguyen, D., Gravel, R., Trieschnigg, D., Meder, T.: “how old do you think i am?”; a study of language and age in twitter. In: Proceedings of the Seventh International AAAI Conference on Weblogs and Social Media (2013)Oberreuter, G., Eiselt, A.: Submission to the 6th international competition on plagiarism detection, From Innovand.io, Chile (2014). http://www.webis.de/research/events/pan-14Och, F.J., Ney, H.: A systematic comparison of various statistical alignment models. Comput. Linguist. 29(1), 19–51 (2003)Palkovskii, Y., Belov, A.: Developing high-resolution universal multi-type N-Gram plagiarism detector-notebook for PAN at CLEF 2014. In: Cappellato, et al. [9]Pennebaker, J.W., Mehl, M.R., Niederhoffer, K.G.: Psychological aspects of natural language use: our words, our selves. Ann. Rev. Psychol. 54(1), 547–577 (2003)Potthast, M., Stein, B., Barrón-Cedeño, A., Rosso, P.: An evaluation framework for plagiarism detection. In: COLING 2010: Proceedings of the 23rd International Conference on Computational Linguistics, pp. 997–1005 (2010)Potthast, M., Stein, B., Anderka, M.: A wikipedia-based multilingual retrieval model. In: Plachouras, V., Macdonald, C., Ounis, I., White, R.W., Ruthven, I. (eds.) ECIR 2008. LNCS, vol. 4956, pp. 522–530. Springer, Heidelberg (2008)Potthast, M., Stein, B., Eiselt, A., Barrón-Cedeño, A., Rosso, P.:. Overview of the 1st international competition on plagiarism detection. In: Stein, B., Rosso, P., Stamatatos, E., Koppel, M., Agirre, E., (eds.) Proceedings of the SEPLN 2009 Workshop on Uncovering Plagiarism, Authorship, and Social Software Misuse (PAN 2009), pp. 1–9, 2009. CEUR-WS.org (September 2009). http://ceur-ws.org/Vol-502Potthast, M., Barrón-Cedeño, A., Eiselt, A., Stein, B., Rosso, P.: Overview of the 2nd International Competition on Plagiarism Detection. In: Braschler and Harman [8]Potthast, M., Barrón-Cedeño, A., Eiselt, A., Stein, B., Rosso, P.: Overview of the 2nd international competition on plagiarism detection. In: Braschler, M., Harman, D., Pianta, E., (eds.) Working Notes Papers of the CLEF 2010 Evaluation Labs (September 2010) 2010. http://www.clef-initiative.eu/publication/working-notesPotthast, M., Barrón-Cedeño, A., Stein, B., Rosso, P.: Cross-language plagiarism detection. Lang. Resour. Eval. 45(1), 45–62 (2011)Potthast, M., Eiselt, A., Barrón-Cedeño, A., Stein, B., Rosso, P.: Overview of the 3rd international competition on plagiarism detection. In: Petras, V., Forner, P., Clough, P., (eds.) Working Notes Papers of the CLEF 2011 Evaluation Labs (September 2011) (2011). http://www.clef-initiative.eu/publication/working-notesPotthast, M., Gollub, T., Hagen, M., Grabegger, J., Kiesel, J., Michel, M., Oberlander, A., Tippmann, M., Barrón-Cedeño, A., Gupta, P., Rosso, P., Stein, B.: Overview of the 4th international competition on plagiarism detection. In: Forner, P., Karlgren, J., Womser-Hacker, C., (eds.) Working Notes Papers of the CLEF 2012 Evaluation Labs (September 2012) (2012). http://www.clef-initiative.eu/publication/working-notesPotthast, M., Hagen, M., Stein, B., Grabegger, J., Michel, M., Tippmann, M., Welsch, C.: Chatnoir: a search engine for the clueweb09 corpus. In: Hersh, B., Callan, J., Maarek, Y., Sanderson, M., (eds.) 35th International ACM Conference on Research and Development in Information Retrieval (SIGIR 2012), p. 1004 (2012)Potthast, M., Gollub, T., Hagen, M., Tippmann, M., Kiesel, J., Rosso, P., Stamatatos, E., Stein, B.: Overview of the 5th international competition on plagiarism detection. In: Forner, et al. [14]Potthast, M., Hagen, M., Beyer, A., Busse, M., Tippmann, M., Rosso, P., Stein, B.: Overview of the 6th International Competition on Plagiarism Detection. In: Cappellato, et al. [9]Pouliquen, B., Steinberger, R., Ignat, C.: Automatic linking of similar texts across languages. In: Proceedings of Recent Advances in Natural Language Processing III, RANLP 2003, pp. 307–316 (2003)Prakash, A., Saha, S.: Experiments on document chunking and query formation for plagiarism source retrieval-notebook for PAN at CLEF 2014. In: Cappellato, et al. [9]Rangel, F., Rosso, P., Koppel, M., Stamatatos, E., Inches, G.: Overview of the author profiling task at PAN 2013–notebook for PAN at CLEF 2013. In: Forner, et al. [14]Rangel, F., Rosso, P., Chugur, I., Potthast, M., Trenkman, M., Stein, B., Verhoeven, B., Daelemans, W.: Overview of the 2nd author profiling task at PAN 2014–notebook for PAN at CLEF 2014. In: Cappellato, et al. [9]Sanchez-Perez, M., Sidorov, G., Gelbukh, A.: A winning approach to text alignment for text reuse detection at PAN 2014-notebook for PAN at CLEF 2014. In: Cappellato, et al. [9]Schler, J., Koppel, M., Argamon, S., Pennebaker, J.W.: Effects of age and gender on blogging. In: AAAI Spring Symposium: Computational Approaches to Analyzing Weblogs, pp. 199–205. AAAI (2006)Stamatatos, E.: Intrinsic plagiarism detection using character n-gram profiles. In: Stein, B., Rosso, P., Stamatatos, E., Koppel, M., Agirre, E., (eds.) Proceedings of the SEPLN09 Workshop on Uncovering Plagiarism, Authorship, and Social Software Misuse (PAN 2009), pp. 38–46, 2009. CEUR-WS.org, September 2009. http://ceur-ws.org/Vol-502Stein, B., Meyer zu Eissen, S., Potthast, M.: Strategies for retrieving plagiarized documents. In: Clarke, C., Fuhr, N., Kando, N., Kraaij, W., de Vries, A., (eds.) 30th International ACM Conference on Research and Development in Information Retrieval (SIGIR 2007), pp. 825–826. ACM (2007)Stein, B., Potthast, M., Rosso, P., Barrón-Cedeño, A., Stamatatos, E., Koppel, M.: Fourth international workshop on uncovering plagiarism, authorship, and social software misuse. ACM SIGIR Forum 45, 45–48 (2011)Steinberger, R., Pouliquen, B., Widiger, A., Ignat, C., Erjavec, T., Tufis, D., Varga, D.: The jrc-acquis: a multilingual aligned parallel corpus with +20 languages. In: Proceedings of 5th International Conference on language resources and evaluation LREC 2006 (2006)Suchomel, S., Brandejs, M.: Heterogeneous queries for synoptic and phrasal search-notebook for PAN at CLEF 2014. In: Cappellato, et al. [9]Villena-Román, J., González-Cristóbal, J.C.: DAEDALUS at PAN 2014: Guessing Tweet Author’s Gender and Age-Notebook for PAN at CLEF 2014. In: Cappellato, et al. [9]Vossen, P.: Eurowordnet: a multilingual database of autonomous and language-specific wordnets connected via an inter-lingual index. Int. J. Lexicography 17, 161–173 (2004)Wang, H., Lu, Y., Zhai, C.: Latent aspect rating analysis on review text data: a rating regression approach. In: Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 783–792 (2010)Weren, E.R.D., Moreira, V.P., de Oliveira, J.P.M.:. Exploring information retrieval features for author profiling-notebook for PAN at CLEF 2014. In: Cappellato, et al. [9]Williams, K., Chen, H.H., Giles, C.: Supervised ranking for plagiarism source retrieval-notebook for PAN at CLEF 2014. In: Cappellato, et al. [9]Yule, G.: The Statistical Study of Literary Vocabulary. Cambridge University press, Cambridge (1944)Zubarev, D., Sochenkov, I.: Using sentence similarity measure for plagiarism source retrieval-notebook for PAN at CLEF 2014. In: Cappellato, L., et al. [9

    On the multilingual and genre robustness of EmoGraphs for author profiling in social media

    Full text link
    The final publication is available at Springer via http://dx.doi.org/10.1007/978-3-319-24027-5_28Author profiling aims at identifying different traits such as age and gender of an author on the basis of her writings. We propose the novel EmoGraph graph-based approach where morphosyntactic categories are enriched with semantic and affective information. In this work we focus on testing the robustness of EmoGraphs when applied to age and gender identification. Results with PAN-AP-14 corpus show the competitiveness of the representation over genres and languages. Finally, some interesting insights are shown, for example with topic and emotion bounded genres such as hotel reviews.The research has been carried out in the framework of the European Commission WIQ-EI IRSES (no. 269180) and DIANA - Finding Hidden Knowledge in Texts (TIN2012-38603-C02) projects. The work of the first author was partially funded by Autoritas Consulting SA and by Spanish Ministry of Economics under grant ECOPORTUNITY IPT-2012-1220-430000.Rangel, F.; Rosso, P. (2015). On the multilingual and genre robustness of EmoGraphs for author profiling in social media. En Experimental IR Meets Multilinguality, Multimodality, and Interaction: 6th International Conference of the CLEF Association, CLEF'15, Toulouse, France, September 8-11, 2015, Proceedings. Springer International Publishing. 274-280. https://doi.org/10.1007/978-3-319-24027-5_28S274280Argamon, S., Koppel, M., Fine, J., Shimoni, A.: Gender, genre, and writing style informal written texts. TEXT 23, 321–346 (2003)Levin, B.: English Verb Classes and Alternations. University of Chicago Press, Chicago (1993)Mohammad, S.M., Yang, T.: Tracking sentiment in mail: how gender differ on emotional axes. In: Proceedings of the 2nd Workshop on Computational Approaches to Subjectivity and Sentiment Analysis (2011)Pennebaker, J.W.: The Secret Life of Pronouns: What Our Words Say About Us. Bloomsbury Press (2011)Rangel, F., Rosso, P.: On the impact of emotions on author profiling. Information Processing & Management, Special Issue on Emotion and Sentiment in Social and Expressive Media (in press, 2015)Rangel, F., Rosso, P., Chugur, I., Potthast, M., Trenkmann, M., Stein, B., Verhoeven, B., Daelemans, W.: Overview of the 2nd author profiling task at pan 2014. In: Cappellato L., Ferro N., Halvey M., Kraaij, W. (eds.) CLEF 2014 Labs and Workshops, Notebook Papers. CEUR-WS.org, vol. 1180 (2014)Rangel, F., Rosso, P., Koppel, M., Stamatatos, E., Inches, G.: Overview of the author profiling task at pan 2013. In: Forner, P., Navigli, R., Tufis, D. (eds.) Notebook Papers of CLEF 2013 LABs and Workshops. CEUR-WS.org, vol. 1179 (2013)Sidorov, G., Miranda-Jimnez, S., Viveros-Jimnez, F., Gelbukh, F., Castro-Snchez, N., Velsquez, F., Daz-Rangel, I., Surez-Guerra, S., Trevio, A., Gordon-Miranda, J.: Empirical study of opinion mining in spanish tweets. In: 11th Mexican International Conference on Artificial Intelligence, MICAI, pp. 1–4 (2012)Strapparava, C., Valitutti, A.: Wordnet-affect: an affective extension of wordnet. In: Proceedings of the 4th International Conference on Language Resources and Evaluation, Lisbon (2004

    Fine-Grained Analysis of Language Varieties and Demographics

    Full text link
    [EN] The rise of social media empowers people to interact and communicate with anyone anywhere in the world. The possibility of being anonymous avoids censorship and enables freedom of expression. Nevertheless, this anonymity might lead to cybersecurity issues, such as opinion spam, sexual harassment, incitement to hatred or even terrorism propaganda. In such cases, there is a need to know more about the anonymous users and this could be useful in several domains beyond security and forensics such as marketing, for example. In this paper, we focus on a fine-grained analysis of language varieties while considering also the authors¿ demographics. We present a Low-Dimensionality Statistical Embedding method to represent text documents. We compared the performance of this method with the best performing teams in the Author Profiling task at PAN 2017. We obtained an average accuracy of 92.08% versus 91.84% for the best performing team at PAN 2017. We also analyse the relationship of the language variety identification with the authors¿ gender. Furthermore, we applied our proposed method to a more fine-grained annotated corpus of Arabic varieties covering 22 Arab countries and obtained an overall accuracy of 88.89%. We have also investigated the effect of the authors¿ age and gender on the identification of the different Arabic varieties, as well as the effect of the corpus size on the performance of our method.This publication was made possible by NPRP grant 9-175-1-033 from the Qatar National Research Fund (a member of Qatar Foundation). The statements made herein are solely the responsibility of the authors.Rangel, F.; Rosso, P.; Zaghouani, W.; Charfi, A. (2020). Fine-Grained Analysis of Language Varieties and Demographics. Natural Language Engineering. 26(6):641-661. https://doi.org/10.1017/S1351324920000108S641661266Kestemont, M. , Tschuggnall, M. , Stamatatos, E. , Daelemans, W. , Specht, G. , Stein, B. and Potthast, M. (2018). Overview of the Author Identification Task at PAN-2018: Cross-domain Authorship Attribution and Style Change Detection. CLEF 2018 Labs and Workshops, Notebook Papers. CEUR Workshop Proceedings. CEUR-WS.org.McNemar, Q. (1947). Note on the sampling error of the difference between correlated proportions or percentages. Psychometrika, 12(2), 153-157. doi:10.1007/bf02295996Lui, M. and Cook, P. (2013). Classifying english documents by national dialect. In Proceedings of the Australasian Language Technology Association Workshop, Citeseer pp. 5–15.Basile, A. , Dwyer, G. , Medvedeva, M. , Rawee, J. , Haagsma, H. and Nissim, M. (2017). Is there life beyond n-grams? A simple SVM-based author profiling system. In Cappellato L., Ferro N., Goeuriot L. and Mandl T. (eds), CLEF 2017 Working Notes. CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073, http://ceur-ws.org/Vol-/. CLEF and CEUR-WS.org.Elfardy, H. and Diab, M.T. (2013). Sentence level dialect identification in arabic. In Association for Computational Linguistics (ACL), pp. 456–461.Salton, G., & Buckley, C. (1988). Term-weighting approaches in automatic text retrieval. Information Processing & Management, 24(5), 513-523. doi:10.1016/0306-4573(88)90021-0Zaghouani, W. and Charfi, A. (2018a). ArapTweet: A large MultiDialect Twitter corpus for gender, age and language variety identification. In Proceedings of the 11th International Conference on Language Resources and Evaluation (LREC), Miyazaki, Japan.Zampieri, M. , Tan, L. , Ljubešić, N. , Tiedemann, J. and Nakov, P. (2015). Overview of the DSL shared task 2015. In Proceedings of the Joint Workshop on Language Technology for Closely Related Languages, Varieties and Dialects, pp. 1–9.Huang, C.-R. and Lee, L.-H. (2008). Contrastive approach towards text source classification based on top-bag-of-word similarity. In PACLIC, pp. 404–410.Zaidan, O. F., & Callison-Burch, C. (2014). Arabic Dialect Identification. Computational Linguistics, 40(1), 171-202. doi:10.1162/coli_a_00169Grouin, C. , Forest, D. , Paroubek, P. and Zweigenbaum, P. (2011). Présentation et résultats du défi fouille de texte DEFT2011 Quand un article de presse a t-il été écrit? À quel article scientifique correspond ce résumé? Actes du septième Défi Fouille de Textes, p. 3.Martinc, M. , Skrjanec, I. , Zupan, K. and Pollak, S. Pan (2017). Author profiling – gender and language variety prediction. In Cappellato L., Ferro N., Goeuriot L. and Mandl T. (eds), CLEF 2017 Working Notes. CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073, http://ceur-ws.org/Vol-/. CLEF and CEUR-WS.org.Rangel, F. , Rosso, P. and Franco-Salvador, M. (2016b). A low dimensionality representation for language variety identification. In 17th International Conference on Intelligent Text Processing and Computational Linguistics, CICLing, LNCS. Springer-Verlag, arxiv:1705.10754.Hagen, M. , Potthast, M. and Stein, B. (2018). Overview of the Author Obfuscation Task at PAN 2018. CLEF 2018 Labs and Workshops, Notebook Papers. CEUR Workshop Proceedings. CEUR-WS.org.Zampieri, M. and Gebre, B.G. (2012). Automatic identification of language varieties: The case of portuguese. In The 11th Conference on Natural Language Processing (KONVENS), pp. 233–237 (2012)Rangel, F. , Rosso, P. , Montes-y-Gómez, M. , Potthast, M. and Stein, B. (2018). Overview of the 6th Author Profiling Task at PAN 2018: Multimodal Gender Identification in Twitter. In CLEF 2018 Labs and Workshops, Notebook Papers. CEUR Workshop Proceedings. CEUR-WS.org.Heitele, D. (1975). An epistemological view on fundamental stochastic ideas. Educational Studies in Mathematics, 6(2), 187-205. doi:10.1007/bf00302543Inches, G. and Crestani, F. (2012). Overview of the International Sexual Predator Identification Competition at PAN-2012. CLEF Online working notes/labs/workshop, vol. 30.Rosso, P. , Rangel Pardo, F.M. , Ghanem, B. and Charfi, A. (2018b). ARAP: Arabic Author Profiling Project for Cyber-Security. Sociedad Española para el Procesamiento del Lenguaje Natural (SEPLN).Agić, Ž. , Tiedemann, J. , Dobrovoljc, K. , Krek, S. , Merkler, D. , Može, S. , Nakov, P. , Osenova, P. and Vertan, C. (2014). Proceedings of the EMNLP 2014 Workshop on Language Technology for Closely Related Languages and Language Variants. Association for Computational Linguistics.Sadat, F., Kazemi, F., & Farzindar, A. (2014). Automatic Identification of Arabic Language Varieties and Dialects in Social Media. Proceedings of the Second Workshop on Natural Language Processing for Social Media (SocialNLP). doi:10.3115/v1/w14-5904Franco-Salvador, M., Rangel, F., Rosso, P., Taulé, M., & Antònia Martít, M. (2015). Language Variety Identification Using Distributed Representations of Words and Documents. Experimental IR Meets Multilinguality, Multimodality, and Interaction, 28-40. doi:10.1007/978-3-319-24027-5_3Rosso, P., Rangel, F., Farías, I. H., Cagnina, L., Zaghouani, W., & Charfi, A. (2018). A survey on author profiling, deception, and irony detection for the Arabic language. Language and Linguistics Compass, 12(4), e12275. doi:10.1111/lnc3.12275Malmasi, S. , Zampieri, M. , Ljubešić, N. , Nakov, P. , Ali, A. and Tiedemann, J. (2016). Discriminating between similar languages and arabic dialect identification: A report on the third DSL shared task. In Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3), pp. 1–14.Rangel, F. , Rosso, P. , Potthast, M. and Stein, B. (2017). Overview of the 5th Author Profiling Task at PAN 2017: Gender and Language Variety Identification in Twitter. In Cappellato L., Ferro N., Goeuriot, L. and Mandl T. (eds), Working Notes Papers of the CLEF 2017 Evaluation Labs, p. 1613–0073, CLEF and CEUR-WS.org.Zampieri, M. , Malmasi, S. , Ljubešić, N. , Nakov, P. , Ali, A. , Tiedemann, J. , Scherrer, Y. , Aepli, N. (2017). Findings of the vardial evaluation campaign 2017. In Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects, pp. 1–15.Bogdanova, D., Rosso, P., & Solorio, T. (2014). Exploring high-level features for detecting cyberpedophilia. Computer Speech & Language, 28(1), 108-120. doi:10.1016/j.csl.2013.04.007Maier, W. and Gómez-Rodríguez, C. (2014). Language Variety Identification in Spanish Tweets. LT4CloseLang.Castro, D. , Souza, E. , de Oliveira, A.L.I. (2016). Discriminating between Brazilian and European Portuguese national varieties on Twitter texts. In 5th Brazilian Conference on Intelligent Systems (BRACIS), pp. 265–270.Zaghouani, W. and Charfi, A. (2018b). Guidelines and annotation framework for Arabic author profiling. In Proceedings of the 3rd Workshop on Open-Source Arabic Corpora and Processing Tools, 11th International Conference on Language Resources and Evaluation (LREC), Miyazaki, Japan.Hernández Fusilier, D., Montes-y-Gómez, M., Rosso, P., & Guzmán Cabrera, R. (2015). Detecting positive and negative deceptive opinions using PU-learning. Information Processing & Management, 51(4), 433-443. doi:10.1016/j.ipm.2014.11.001Tellez, E.S. , Miranda-Jiménez, S. , Graff, M. and Moctezuma, D. (2017). Gender and language variety identification with microtc. In Cappellato L., Ferro N., Goeuriot L. and Mandl T. (eds). CLEF 2017 Working Notes. CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073, http://ceur-ws.org/Vol-/. CLEF and CEUR-WS.org.Kandias, M., Stavrou, V., Bozovic, N., & Gritzalis, D. (2013). Proactive insider threat detection through social media. Proceedings of the 12th ACM workshop on Workshop on privacy in the electronic society. doi:10.1145/2517840.251786

    Author Profiling in Social Media: The Impact of Emotions on Discourse Analysis

    Full text link
    [EN] In this paper we summarise the content of the keynote that will be given at the 5th International Conference on Statistical Language and Speech Processing (SLSP) in Le Mans, France in October 23¿25, 2017. In the keynote we will address the importance of inferring demographic information for marketing and security reasons. The aim is to model how language is shared in gender and age groups taking into account its statistical usage. We will see how a shallow discourse analysis can be done on the basis of a graph-based representation in order to extract information such as how complicated the discourse is (i.e., how connected the graph is), how much interconnected grammatical categories are, how far a grammatical category is from others, how different grammatical categories are related to each other, how the discourse is modelled in different structural or stylistic units, what are the grammatical categories with the most central use in the discourse of a demographic group, what are the most common connectors in the linguistic structures used, etc. Moreover, we will see also the importance to consider emotions in the shallow discourse analysis and the impact that this has. We carried out some experiments for identifying gender and age, both in Spanish and in English, using PAN-AP-13 and PAN-PC-14 corpora, obtaining comparable results to the best performing systems of the PAN Lab at CLEF.The research work described in this paper was partially carried out in the framework of the SomEMBED project (TIN2015-71147-C2-1-P), funded by the Spanish Ministry of Economy, Industry and Competitiveness (MINECO).Rosso, P.; Rangel-Pardo, FM. (2017). Author Profiling in Social Media: The Impact of Emotions on Discourse Analysis. Lecture Notes in Computer Science. 10583:3-18. https://doi.org/10.1007/978-3-319-68456-7_1S31810583Blondel, V.D., Guillaume, J.L., Lambiotte, R., Lefebvre, E.: Fast unfolding of communities in large networks. J. Stat. Mech.: Theory Exp. 2008(10), 10008 (2008)Bonacich, P.: Factoring and weighting approaches to clique identification. J. Math. Soc. 2(1), 113–120 (1972)Brandes, U.: A faster algorithm for betweenness centrality. J. Math. Soc. 25(2), 163–177 (2001)Carreras, X., Chao, I., Padró, L., Padró, M.: FreeLing : an open-source suite of language analyzers. In: Proceedings of the 4th International Conference on Language Resources and Evaluation (LREC 2004) (2004)Díaz Rangel, I., Sidorov, G., Suárez-Guerra, S.: Creación y evaluación de un diccionario marcado con emociones y ponderado para el español. Onomazein 29, 23 (2014). (in Spanish)Ekman, P.: Universals and cultural differences in facial expressions of emotion. In: Symposium on Motivation, Nebraska, pp. 207–283 (1972)Forner, P., Navigli, R., Tufis, D. (eds.): CLEF 2013 Evaluation Labs and Workshop, Working Notes Papers, September 2013, Valencia, Spain, vol. 1179, pp. 23–26. CEUR-WS.org (2013)Koppel, M., Argamon, S., Shimoni, A.: Automatically categorizing written texts by author gender. Literay Linguist. Comput. 17(4), 401–412 (2003)Latapy, M.: Main-memory triangle computations for very large (sparse (power-law)) graphs. Theor. Comput. Sci. (TCS) 407(1–3), 458–473 (2008)Levin, B.: English Verb Classes and Alternations. University of Chicago Press, Chicago (1993)Mann, W.C., Thompson, S.A.: Rhetorical structure theory: toward a functional theory of text organization. Text-Interdiscip. J. Study Discourse 8(3), 243–281 (1988)Meina, M., Brodzinska, K., Celmer, B., Czokow, M., Patera, M., Pezacki, J., Wilk, M.: Ensemble-based classification for author profiling using various features notebook for PAN at CLEF 2013. In: Forner et al. [7]Padró, L., Stanilovsky, E.: FreeLing 3.0: towards wider multilinguality. In: Proceedings of the Language Resources and Evaluation Conference (LREC 2012) (2012)Lopez-Monroy, A.P., Montes-Gomez, M., Jair Escalante, H., Villasenor-Pineda, L., Villatoro-Tello, E.: INAOEs participation at PAN13: author profiling task. Notebook for PAN at CLEF 2013. In: Forner et al. [7]Pennebaker, J.W., Mehl, M.R., Niederhoffer, K.: Psychological aspects of natural language use: our words, our selves. Annu. Rev. Psychol. 54, 547–577 (2003)Pennebaker, J.W.: The Secret Life of Pronouns: What Our Words Say About Us. Bloomsbury Press, London (2011)Rangel, F., Hernández, I., Rosso, P., Reyes, A.: Emotions and irony per gender in Facebook. In: Proceedings of the Workshop on Emotion, Social Signals, Sentiment & Linked Open Data (ES3LOD), LREC-2014, Reykjavik, Iceland, 26–31 May 2014, pp. 68–73 (2014)Rangel, F., Rosso, P., Koppel, M., Stamatatos, E., Inches, G.: Overview of the author profiling task at PAN 2013. In: Forner et al. [7]Rangel, F., Rosso, P., Chugur, I., Potthast, M., Trenkmann, M., Stein, B., Verhoeven, B., Daelemans, W.: Overview of the 2nd author profiling task at PAN 2014. In: Cappellato, L., Ferro, N., Halvey, M., Kraaij, W. (eds.) Notebook Papers of CLEF 2014 LABs and Workshops, vol. 1180, pp. 951–957. CEUR-WS.org (2014)Rangel, F., Rosso, P.: On the multilingual and genre robustness of EmoGraphs for author profiling in social media. In: Mothe, J., Savoy, J., Kamps, J., Pinel-Sauvagnat, K., Jones, G.J.F., SanJuan, E., Cappellato, L., Ferro, N. (eds.) CLEF 2015. LNCS, vol. 9283, pp. 274–280. Springer, Cham (2015). doi: 10.1007/978-3-319-24027-5_28Rangel, F., Rosso, P.: On the impact of emotions on author profiling. Inf. Process. Manag. 52(1), 73–92 (2016)Schler, J., Koppel, M., Argamon, S., Pennebaker, J.W.: Effects of age and gender on blogging. In: AAAI Spring Symposium: Computational Approaches to Analyzing Weblogs, AAAI, pp. 199–205 (2006)Soler-Company, J. Wanner, L.: Use of discourse and syntactic features for gender identification. In: The Eighth Starting Artificial Intelligence Research Symposium. Collocated with the 22nd European Conference on Artificial Intelligence, pp. 215–220 (2016)Soler-Company, J., Wanner, L.: On the relevance of syntactic and discourse features for author profiling and identification. In: 15th Conference of the European Chapter of the Association for Computational Linguistics (EACL), Valencia, Spain, pp. 681–687 (2017)Strapparava, C., Valitutti, A.: WordNet affect: an affective extension of WordNet. In: Proceedings of the 4th International Conference on Language Resources and Evaluation, Lisboa, pp. 1083–1086 (2004)Watts, D.J., Strogatz, S.H.: Collective dynamics of ‘small-world’ networks. Nature 393(6684), 409–410 (1998)Yang, Y., Pedersen, J.O.: A comparative study on feature selection in text categorization. In: Proceedings of the Fourteenth International Conference on Machine Learning, ICML, pp. 412–420 (1997
    corecore