4 research outputs found

    Heterogeneous Queries for Synoptic and Phrasal Search

    This paper describes our approach to the Plagiarism Detection – Source Retrieval task of PAN 2014. We combined and improved the methodology used at PAN 2012 and PAN 2013. Our system combines three types of queries: keyword-based, paragraph-based, and header-based. The queries are further distinguished by other properties, such as whether they are phrase queries or positional queries. Depending on these properties, the queries are submitted to one of two search engines, Chatnoir and Indri. A query's position is used to control the search, because minimizing the total number of executed queries is the system's priority. Downloaded documents are textually compared with the suspicious document, and if a similarity is found, the downloaded document is reported.
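
    The retrieval loop outlined in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it covers only a keyword-based query, and the names `keyword_query`, `is_similar`, `retrieve_sources` and the `search` callable are hypothetical. The word 4-gram overlap test and the "skip paragraphs that are already covered" rule are assumptions standing in for the textual comparison and the position-based search control, which the abstract does not specify in detail.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def shingles(tokens, n=4):
    """Return the set of word n-grams of a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def keyword_query(text, k=5):
    """Assumed keyword query: the k most frequent tokens longer than 3 characters."""
    counts = Counter(t for t in tokenize(text) if len(t) > 3)
    return " ".join(word for word, _ in counts.most_common(k))


def is_similar(paragraph, candidate, n=4):
    """Assumed similarity test: at least one shared word n-gram."""
    return bool(shingles(tokenize(paragraph), n) & shingles(tokenize(candidate), n))


def retrieve_sources(paragraphs, search):
    """Query paragraph by paragraph, skipping positions already explained.

    `search` is a hypothetical callable mapping a query string to a list of
    downloaded document texts; a real system would call a search engine here.
    """
    reported = []
    covered = set()  # paragraph positions already matched by a reported source
    for position, paragraph in enumerate(paragraphs):
        if position in covered:
            continue  # no query needed: keeps the number of executed queries low
        for document in search(keyword_query(paragraph)):
            hits = [i for i, p in enumerate(paragraphs) if is_similar(p, document)]
            if hits:
                covered.update(hits)
                if document not in reported:
                    reported.append(document)
    return reported
```

    In this sketch a single downloaded document that matches several paragraphs removes the need to query for the later ones, which is one simple way to trade recall against the number of queries.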

    Improving Synoptic Querying for Source Retrieval

    Source retrieval is a part of the plagiarism discovery process in which only a selected set of candidate documents is retrieved from a large corpus of potential source documents and passed on for detailed document comparison in order to highlight potential plagiarism. This paper describes the methodology and architecture of the source retrieval system developed for the PAN 2015 lab on uncovering plagiarism, authorship, and social software misuse. The system is based on our previous systems used at PAN since 2012. The majority of features were adopted, with some improvements described in this paper. The paper analyzes the methodology used and discusses query performance. It also explains many implementation settings in the source retrieval process. The source retrieval subsystem forms an integral part of a modern system for plagiarism discovery.

    Source Retrieval for Plagiarism Detection

    Plagiarism has become a serious problem, mainly because of the availability of electronic documents. Online document retrieval is an important part of a modern anti-plagiarism tool. This paper describes the architecture and concepts of a real-world document retrieval system that is part of general anti-plagiarism software. Up-to-date systems for plagiarism detection are discussed from the source retrieval perspective, and the key approaches to source retrieval are compared. The system recommendations stem from the design, implementation, and several years of operational experience with a nationwide plagiarism solution at Masaryk University in the Czech Republic. The design can be adapted to many situations. Proper use of such systems contributes to the gradual improvement of the quality of student theses.

    Author Profiling and Plagiarism Detection

    The final publication is available at Springer via http://dx.doi.org/10.1007/978-3-319-25485-2_6

    In this chapter we introduce the topics that we will cover in the RuSSIR 2014 course on Author Profiling and Plagiarism Detection (APPD). Author profiling distinguishes between classes of authors by studying how language is shared by groups of people. This task helps in identifying profiling aspects such as gender, age, native language, or even personality type. In the plagiarism detection task we are not interested in studying how language is shared. On the contrary, given a document, we are interested in investigating whether the writing style changes, in order to unveil text inconsistencies, i.e., unexpected irregularities throughout the document such as changes in vocabulary, style, and text complexity. In fact, when it is not possible to retrieve the source document(s) from which text has been plagiarised, the intrinsic analysis of the suspicious document is the only way to find evidence of plagiarism. The difficulty in retrieving the source of plagiarism could be due to the fact that the documents are not available on the web, or that the plagiarised text fragments were obfuscated via paraphrasing or translation (in case the source document was in another language). In this overview, we also discuss the results of the shared tasks on author profiling (gender and age identification) and plagiarism detection that we help to organise at the PAN Lab on Uncovering Plagiarism, Authorship, and Social Software Misuse.

    The PAN shared tasks on author profiling and on plagiarism detection have been organised in the framework of the WIQ-EI IRSES project (Grant No. 269180) within the EC FP 7 Marie Curie People. The research work described in the paper was carried out in the framework of the DIANA-APPLICATIONS-Finding Hidden Knowledge in Texts: Applications (TIN2012-38603-C02-01) project, and the VLC/CAMPUS Microcluster on Multimodal Interaction in Intelligent Systems.

    Rosso, P. (2015). Author Profiling and Plagiarism Detection. In Information Retrieval. Springer. 229-250. https://doi.org/10.1007/978-3-319-25485-2_6
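
    The intrinsic analysis mentioned in the chapter abstract, looking for unexpected changes in vocabulary and style within a single document, can be illustrated with a small sketch. It is not taken from the chapter: the two features (average word length and type-token ratio), the z-score rule, the threshold of 1.5, and the function names are all assumptions chosen for brevity.

```python
import re
from statistics import mean, pstdev


def style_features(text):
    """Return (average word length, type-token ratio) for a chunk of text."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return (0.0, 0.0)
    average_length = mean(len(w) for w in words)
    type_token_ratio = len(set(words)) / len(words)
    return (average_length, type_token_ratio)


def flag_style_outliers(chunks, z_threshold=1.5):
    """Return indexes of chunks whose style deviates from the document average.

    A chunk is flagged when any feature lies more than z_threshold population
    standard deviations from the mean of that feature over all chunks.
    """
    features = [style_features(chunk) for chunk in chunks]
    flagged = set()
    for dimension in range(2):
        values = [f[dimension] for f in features]
        mu, sigma = mean(values), pstdev(values)
        if sigma == 0:
            continue  # every chunk is identical on this feature
        for index, value in enumerate(values):
            if abs(value - mu) / sigma > z_threshold:
                flagged.add(index)
    return sorted(flagged)
```

    A flagged chunk is only a hint that the style changed; it is not evidence of plagiarism on its own, and very short chunks make both features unreliable.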