14 research outputs found

    Overview of the PAN'2016 - New Challenges for Authorship Analysis: Cross-genre Profiling, Clustering, Diarization, and Obfuscation

    Full text link
    The final publication is available at Springer via http://dx.doi.org/10.1007/978-3-319-44564-9_28

    This paper presents an overview of the PAN/CLEF evaluation lab. During the last decade, PAN has established itself as the main forum for digital text forensics research. PAN 2016 comprises three shared tasks: (i) author identification, addressing author clustering and diarization (or intrinsic plagiarism detection); (ii) author profiling, addressing age and gender prediction from a cross-genre perspective; and (iii) author obfuscation, addressing author masking and obfuscation evaluation. In total, 35 teams participated across the three shared tasks of PAN 2016 and, following the practice of previous editions, software submissions were required and evaluated within the TIRA experimentation framework.

    The work of the first author was partially supported by the SomEMBED TIN2015-71147-C2-1-P MINECO research project and by the Generalitat Valenciana under the grant ALMA MATER (Prometeo II/2014/030). The work of the second author was partially supported by Autoritas Consulting and by the Ministerio de Economía y Competitividad de España under grant ECOPORTUNITY IPT-2012-1220-430000.

    Rosso, P.; Rangel-Pardo, F.M.; Potthast, M.; Stamatatos, E.; Tschuggnall, M.; Stein, B. (2016). Overview of the PAN'2016 - New Challenges for Authorship Analysis: Cross-genre Profiling, Clustering, Diarization, and Obfuscation. In: Experimental IR Meets Multilinguality, Multimodality, and Interaction. Springer Verlag (Germany), pp. 332-350. https://doi.org/10.1007/978-3-319-44564-9_28

    Intrinsic plagiarism detection and author analysis by utilizing grammar

    Get PDF
    With the advent of the World Wide Web, the number of freely available text documents has increased considerably in recent years. As one immediate consequence, it has become easier to find sources that can serve as the basis for plagiarism. On the other side, it has become harder for detection tools to automatically expose plagiarism due to the huge number of possible origins. Moreover, sources may not even be available in digital form, an unsolvable problem for tools that rely on comparisons with known documents, whereas experienced human readers can often spot suspicious passages through an intuitive style analysis. In this thesis, intrinsic plagiarism detection algorithms are proposed which operate on the suspicious document only and thus circumvent the problem of having to incorporate external data. The main idea is to analyze the writing style of authors in terms of the grammar they use to formulate sentences, represented by grammar trees, and to expose text fragments that are syntactically conspicuous. Using a similar style analysis, the idea is also applied to the problem of automatically assigning authors to unseen text documents. Moreover, it is shown that grammar also serves as a distinguishing feature for profiling an author, namely for estimating his or her gender and age. Finally, building on all previous analyses and results, the approach is adapted to automatically detect the contributions of different authors in a collaboratively written document.

    Michael Tschuggnall. Summary in German. Innsbruck, Univ., Diss., 2014. OeBB (VLID) 19876
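    To make the grammar-based idea concrete, here is a minimal sketch of syntax-driven outlier detection; it is an illustration, not the thesis' actual algorithm. It approximates grammar trees with POS-tag trigram profiles and assumes spaCy with its en_core_web_sm model is installed.

        # Flag sentences whose syntactic profile deviates from the document-wide
        # profile; a rough stand-in for the thesis' grammar-tree comparison.
        from collections import Counter
        import spacy

        nlp = spacy.load("en_core_web_sm")

        def pos_profile(sent):
            # Frequency profile of POS-tag trigrams for one sentence.
            tags = [t.pos_ for t in sent if not t.is_space]
            return Counter(zip(tags, tags[1:], tags[2:]))

        def outlier_sentences(text, z=2.0):
            sents = list(nlp(text).sents)
            if not sents:
                return []
            profiles = [pos_profile(s) for s in sents]
            total = Counter()
            for p in profiles:
                total.update(p)

            def dist(p):
                # L1 distance between sentence profile and document profile.
                n, m = sum(p.values()) or 1, sum(total.values()) or 1
                return sum(abs(p[k] / n - total[k] / m) for k in set(p) | set(total))

            scores = [dist(p) for p in profiles]
            mean = sum(scores) / len(scores)
            sd = (sum((s - mean) ** 2 for s in scores) / len(scores)) ** 0.5 or 1.0
            return [(s.text, sc) for s, sc in zip(sents, scores) if (sc - mean) / sd > z]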

    Machine learning approaches to predict rehabilitation success based on clinical and patient-reported outcome measures

    No full text
    A common way to treat hip, knee, or foot injuries is a physician-guided rehabilitation programme over several weeks or even months. While health professionals are often able to estimate the treatment success beforehand to a certain extent based on their experience, it is scientifically still unclear to what extent relevant factors and circumstances explain or predict rehab outcomes. To this end, we apply modern machine learning techniques to a real-life dataset of more than a thousand rehab patients (N = 1,047) and build models that predict the rehab success of a patient at treatment start. Utilizing clinical measures and patient-reported outcome measures (PROMs) from questionnaires, we compute clinical outcome measures (CROMs) for different targets, such as the range of motion of a knee, and subsequently use these indicators to learn prediction models. We first apply regression algorithms to estimate rehab success in terms of the percentage difference between admission and discharge values, and then also utilize classification models to make predictions based on a three-class grading scheme. Extensive evaluations for different treatment groups and targets show promising results, with F-scores exceeding 65% that substantially outperform baselines (by up to 40%), showing that machine learning can indeed be applied for better medical controlling and optimized treatment paths in rehab practice. Future developments should include further relevant critical success criteria in the rehabilitation routine to further optimize the prognosis models for clinical practice.
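    The two modelling routes described in the abstract can be sketched as follows; the data below is synthetic and every feature and threshold choice is a hypothetical stand-in, since the clinical dataset is not public.

        # Sketch of (1) regression on the percentage admission-discharge
        # difference and (2) classification into a three-class grading scheme,
        # evaluated with a macro F-score. All data here is synthetic.
        import numpy as np
        from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
        from sklearn.metrics import f1_score
        from sklearn.model_selection import train_test_split

        rng = np.random.default_rng(0)
        X = rng.normal(size=(1047, 10))  # admission measurements (stand-in)
        improvement = X @ rng.normal(size=10) + rng.normal(scale=0.5, size=1047)

        # Route 1: regression on the continuous improvement score.
        Xtr, Xte, ytr, yte = train_test_split(X, improvement, random_state=0)
        reg = RandomForestRegressor(random_state=0).fit(Xtr, ytr)
        print("R^2:", reg.score(Xte, yte))

        # Route 2: three-class grading (e.g., poor / moderate / good).
        grades = np.digitize(improvement, np.quantile(improvement, [1 / 3, 2 / 3]))
        Xtr, Xte, gtr, gte = train_test_split(X, grades, random_state=0)
        clf = RandomForestClassifier(random_state=0).fit(Xtr, gtr)
        print("macro F1:", f1_score(gte, clf.predict(Xte), average="macro"))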

    PAN22 Authorship Analysis: Style Change Detection

    No full text
    <p>This is the dataset for the <a href="https://pan.webis.de/clef22/pan22-web/style-change-detection.html">Style Change Detection</a> task of PAN 2022.</p> <p><strong>Task</strong></p> <p>The goal of the style change detection task is to identify text positions within a given multi-author document at which the author switches. Hence, a fundamental question is the following: If multiple authors have written a text together, can we find evidence for this fact; i.e., do we have a means to detect variations in the writing style? Answering this question belongs to the most difficult and most interesting challenges in author identification: Style change detection is the only means to detect plagiarism in a document if no comparison texts are given; likewise, style change detection can help to uncover gift authorships, to verify a claimed authorship, or to develop new technology for writing support.</p> <p>Previous editions of the Style Change Detection task aim at e.g., detecting whether a document is single- or multi-authored (<a href="https://pan.webis.de/clef18/pan18-web/style-change-detection.html">2018</a>), the actual number of authors within a document (<a href="https://pan.webis.de/clef19/pan19-web/style-change-detection.html">2019</a>), whether there was a style change between two consecutive paragraphs (<a href="https://pan.webis.de/clef20/pan20-web/style-change-detection.html">2020</a>, <a href="https://pan.webis.de/clef21/pan21-web/style-change-detection.html">2021</a>) and where the actual style changes were located (<a href="https://pan.webis.de/clef21/pan21-web/style-change-detection.html">2021</a>). Based on the progress made towards this goal in previous years, we again extend the set of challenges to likewise entice novices and experts:</p> <p>Given a document, we ask participants to solve the following three tasks:</p> <ul> <li><strong>[Task1] Style Change Basic:</strong> for a text written by two authors that contains a single style change only, find the position of this change (i.e., cut the text into the two authors’ texts on the paragraph-level),</li> <li><strong>[Task2] Style Change Advanced:</strong> for a text written by two or more authors, find all positions of writing style change (i.e., assign all paragraphs of the text uniquely to some author out of the number of authors assumed for the multi-author document)</li> <li><strong>[Task3] Style Change Real-World:</strong> for a text written by two or more authors, find all positions of writing style change, where style changes now not only occur between paragraphs, but at the sentence level.</li> </ul> <p>All documents are provided in English and may contain an arbitrary number of style changes, resulting from at most five different authors.</p> <p><strong>Data</strong></p> <p>To develop and then test your algorithms, three datasets including ground truth information are provided (<em>dataset1</em> for task 1, <em>dataset2</em> for task 2, and <em>dataset3</em> for task 3).</p> <p>Each dataset is split into three parts:</p> <ol> <li><em>training set:</em> Contains 70% of the whole dataset and includes ground truth data. Use this set to develop and train your models.</li> <li><em>validation set:</em> Contains 15% of the whole dataset and includes ground truth data. Use this set to evaluate and optimize your models.</li> <li><em>test set:</em> Contains 15% of the whole dataset, no ground truth data is given. 
This set is used for evaluation (see later).</li> </ol> <p>You are free to use additional external data for training your models. However, we ask you to make the additional data utilized freely available under a suitable license.</p> <p><strong>Input Format</strong></p> <p>The datasets are based on user posts from various sites of the StackExchange network, covering different topics. We refer to each input problem (i.e., the document for which to detect style changes) by an ID, which is subsequently also used to identify the submitted solution to this input problem. We provide one folder for train, validation, and test data for each dataset, respectively.</p> <p>For each problem instance <code>X</code> (i.e., each input document), two files are provided:</p> <ol> <li><code>problem-X.txt</code> contains the actual text, where paragraphs are denoted by <code>\n</code> for tasks 1 and 2. For task 3, we provide one sentence per paragraph (again, split by <code>\n</code>).</li> <li><code>truth-problem-X.json</code> contains the ground truth, i.e., the correct solution in JSON format. An example file is listed in the following (note that we list keys for the three tasks here): <pre><code>{ "authors": NUMBER_OF_AUTHORS, "site": SOURCE_SITE, "changes": RESULT_ARRAY_TASK1 or RESULT_ARRAY_TASK3, "paragraph-authors": RESULT_ARRAY_TASK2 }</code></pre> <p>The result for task 1 (key "changes") is represented as an array, holding a binary for each pair of consecutive paragraphs within the document (0 if there was no style change, 1 if there was a style change). For task 2 (key "paragraph-authors"), the result is the order of authors contained in the document (e.g., <code>[1, 2, 1]</code> for a two-author document), where the first author is "1", the second author appearing in the document is referred to as "2", etc. Furthermore, we provide the total number of authors and the Stackoverflow site the texts were extracted from (i.e., topic). The result for task 3 (key "changes") is similarly structured as the results array for task 1. However, for task 3, the <code>changes</code> array holds a binary for each pair of consecutive <em>sentences</em> and they may be multiple style changes in the document.</p> <p>An example of a multi-author document with a style change between the third and fourth paragraph (or sentence for task 3) could be described as follows (we only list the relevant key/value pairs here):</p> <pre><code>{ "changes": [0,0,1,...], "paragraph-authors": [1,1,1,2,...] }</code></pre> <p> </p> </li> </ol> <p><strong>Output Format</strong></p> <p>To evaluate the solutions for the tasks, the results have to be stored in a single file for each of the input documents and each of the datasets. Please note that we require a solution file to be generated for each input problem for each dataset. The data structure during the evaluation phase will be similar to that in the training phase, with the exception that the ground truth files are missing.</p> <p>For each given problem <code>problem-X.txt</code>, your software should output the missing solution file <code>solution-problem-X.json</code>, containing a JSON object holding the solution to the respective task. The solution for tasks 1 and 3 is an array containing a binary value for each pair of consecutive paragraphs (task 1) or sentences (task 3). 
For task 2, the solution is an array containing the order of authors contained in the document (as in the truth files).</p> <p>An example solution file for tasks 1 and 3 is featured in the following (note again that for task 1, changes are captured on the paragraph level, whereas for task 3, changes are captured on the sentence level):</p> <pre><code>{ "changes": [0,0,1,0,0,...] }</code></pre> <p>For task 2, the solution file looks as follows:</p> <pre><code>{ "paragraph-authors": [1,1,2,2,3,2,...] }</code></pre> <p> </p&gt
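    As a concrete illustration of this output format, the sketch below implements a trivial all-zero baseline, assuming the directory layout matches the description above: for each problem file it counts the paragraph (or sentence) units and writes a solution file predicting "no style change" everywhere. For task 2 one would analogously write a "paragraph-authors" array with one entry per paragraph.

        # Minimal all-zero baseline for the solution-file format described above.
        import json
        from pathlib import Path

        def solve_dataset(in_dir, out_dir, key="changes"):
            out = Path(out_dir)
            out.mkdir(parents=True, exist_ok=True)
            for problem in sorted(Path(in_dir).glob("problem-*.txt")):
                # Paragraphs (tasks 1/2) or sentences (task 3) are split by "\n".
                text = problem.read_text(encoding="utf-8")
                units = [u for u in text.split("\n") if u.strip()]
                # One binary value per pair of consecutive units; 0 = no change.
                solution = {key: [0] * max(len(units) - 1, 0)}
                pid = problem.stem.split("-", 1)[1]
                (out / f"solution-problem-{pid}.json").write_text(json.dumps(solution))

        # Hypothetical usage: solve_dataset("dataset1/validation", "solutions/dataset1")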

    PAN18 Multi-Author Analysis: Style-Change-Detection

    No full text
    Dataset for binary style change detection, i.e., deciding whether a given document is single- or multi-authored. More information about the task: https://pan.webis.de/clef18/pan18-web/style-change-detection.html

    PAN17 Author Identification: Clustering

    No full text
    Given a collection of (up to 50) short documents (paragraphs extracted from larger documents), the task is to identify authorship links and groups of documents by the same author. All documents are single-authored, in the same language, and belong to the same genre; however, the topic or text length of the documents may vary. The number of distinct authors whose documents are included in the collection is not given.
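    One common way to approach such a setting (an illustration, not the official baseline) is to represent each document by character n-gram tf-idf features and group documents with agglomerative clustering; since the number of authors is unknown, a distance threshold replaces a fixed cluster count.

        # Group documents by presumed author via character n-gram similarity.
        # The threshold value is a free parameter chosen here for illustration.
        from sklearn.cluster import AgglomerativeClustering
        from sklearn.feature_extraction.text import TfidfVectorizer

        def cluster_by_author(docs, threshold=1.2):
            X = TfidfVectorizer(analyzer="char", ngram_range=(3, 4)).fit_transform(docs)
            labels = AgglomerativeClustering(
                n_clusters=None, distance_threshold=threshold, linkage="average"
            ).fit_predict(X.toarray())
            groups = {}
            for doc, lab in zip(docs, labels):
                groups.setdefault(lab, []).append(doc)
            return list(groups.values())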

    PAN18 Author Identification: Attribution

    No full text
    We provide a corpus comprising a set of cross-domain authorship attribution problems in each of the following five languages: English, French, Italian, Polish, and Spanish. Note that we deliberately avoid the term 'training corpus' because the sets of candidate authors of the development and evaluation corpora do not overlap; therefore, your approach should not be designed to specifically handle the candidate authors of the development corpus. Each problem consists of a set of known fanfics by each candidate author and a set of unknown fanfics, located in separate folders. The file problem-info.json, found in the main folder of each problem, gives the name of the folder of unknown documents and the list of names of the candidate-author folders. The true author of each unknown document is given in the file ground-truth.json, also found in the main folder of each problem. In addition, to handle a collection of such problems, the file collection-info.json includes all relevant information: for each problem, it lists its main folder, the language (either "en", "fr", "it", "pl", or "sp"), and the encoding (always UTF-8) of its documents. New version: removed passwords inside the package.
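    The file layout above suggests a straightforward per-problem loop; the sketch below trains a character n-gram classifier on the known fanfics of each candidate and labels the unknown ones. The file names follow the description, but the exact JSON keys inside problem-info.json ("unknown-folder", "candidate-authors", "author-name") are assumptions that may need adjusting to the released data.

        # Per-problem closed-set attribution with char 3-gram tf-idf features.
        import json
        from pathlib import Path
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.linear_model import LogisticRegression

        def attribute(problem_dir):
            problem = Path(problem_dir)
            info = json.loads((problem / "problem-info.json").read_text(encoding="utf-8"))
            texts, authors = [], []
            for cand in info["candidate-authors"]:   # assumed key
                name = cand["author-name"]           # assumed key
                for f in (problem / name).glob("*.txt"):
                    texts.append(f.read_text(encoding="utf-8"))
                    authors.append(name)
            vec = TfidfVectorizer(analyzer="char", ngram_range=(3, 3))
            clf = LogisticRegression(max_iter=1000).fit(vec.fit_transform(texts), authors)
            unknown = sorted((problem / info["unknown-folder"]).glob("*.txt"))  # assumed key
            return {f.name: clf.predict(vec.transform([f.read_text(encoding="utf-8")]))[0]
                    for f in unknown}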