304 research outputs found

    Recent trends in digital text forensics and its evaluation

    The final publication is available at Springer via http://dx.doi.org/10.1007/978-3-642-40802-1_28

    This paper outlines the concepts and achievements of our evaluation lab on digital text forensics, PAN 13, which called for original research and development on plagiarism detection, author identification, and author profiling. We present a standardized evaluation framework for each of the three tasks and discuss the evaluation results of the 58 submitted contributions. For the first time, instead of accepting the output of software runs, we collected the software itself and ran it on a computer cluster at our site. As the evaluation and experimentation platform we use TIRA, which is being developed at the Webis Group in Weimar. TIRA can handle large-scale software submissions by means of virtualization, sandboxed execution, tailored unit testing, and staged submission. Beyond the evaluation results themselves, a major achievement of our lab is that we now have at our disposal the largest collection of state-of-the-art approaches to these tasks for further analysis.

    This work was partially supported by the WIQ-EI IRSES project (Grant No. 269180) within the FP7 Marie Curie action.

    Gollub, T.; Potthast, M.; Beyer, A.; Busse, M.; Rangel Pardo, FM.; Rosso, P.; Stamatatos, E.... (2013). Recent trends in digital text forensics and its evaluation. In Information Access Evaluation. Multilinguality, Multimodality, and Visualization. Springer Verlag (Germany). 282-302. https://doi.org/10.1007/978-3-642-40802-1_28

    Neural and Non-Neural Approaches to Authorship Attribution

    Identifying Authorship Style in Malicious Binaries: Techniques, Challenges & Datasets

    Attributing a piece of malware to its creator typically requires threat intelligence. Binary attribution is harder still, as it mostly relies on the ability to disassemble binaries to identify authorship style. Our survey explores malicious author style and the adversarial techniques these authors use to remain anonymous. We examine the adversarial impact on state-of-the-art methods, identify key findings, and explore the open research challenges. To mitigate the lack of ground-truth datasets in this domain, we publish alongside this survey the largest and most diverse meta-information dataset of 15,660 malware samples labeled to 164 threat actor groups.