5 research outputs found

    Sistem Deteksi Kemiripan Antar Dokumen Teks Menggunakan Model Bayesian pada Term Latent Semantic Analysis (LSA)

    Get PDF
    Metode Latent Semantic Analysis(LSA) adalah suatu metode yang mampu merepresentasikan hubungan antar dokumen teks melalui term serta dapat menilai kemiripan antar dokumen teks tersebut. Namun, metode LSA hanya menilai kemiripan antar dokumen teks melalui frekuensi term yang ada pada masing-masing dokumen teks sehingga mempunyai kelemahan yaitu tidak memperhatikan urutan atau tata letak term tersebut yang secara tidak langsung berpengaruh pada makna yang terkandung pada masing-masing dokumen. Oleh karena itu, digunakan model Bayesian pada term yang dihasilkan oleh LSA tersebut untuk menjaga dan memperhatikan urutan termdalam mendeteksi kemiripan antar dokumen teks sehingga struktur kalimat tetap terjaga dan mendapat hasil penilaian kemiripan antar dokumen teks yang lebih baik.Jika terdapat dua dokumen yang saling salin (copy) namun struktur kalimatnya diubah dan dibandingkan pada LSA dengan menggunakan cosine similarity maka akan didapat hasil yang sama seperti kedua dokumen ini dibandingkan tanpa Perubahan struktur kalimat, sedangkan jika dibandingkan dengan menggunakan model Bayesian pada term, dokumen-dokumen yang mempunyai perbedaan struktur kalimat akan diperlakukan berbeda

    On the detection of SOurce COde re-use

    Full text link
    © {Owner/Author | ACM} {2014}. This is the author's version of the work. It is posted here for your personal use. Not for redistribution. The definitive Version of Record was published in FIRE '14 Proceedings of the Forum for Information Retrieval Evaluation, http://dx.doi.org/10.1145/2824864.2824878"This paper summarizes the goals, organization and results of the first SOCO competitive evaluation campaign for systems that automatically detect the source code re-use phenomenon. The detection of source code re-use is an important research field for both software industry and academia fields. Accordingly, PAN@FIRE track, named SOurce COde Re-use (SOCO) focused on the detection of re-used source codes in C/C++ and Java programming languages. Participant systems were asked to annotate several source codes whether or not they represent cases of source code re-use. In total five teams submitted 17 runs. The training set consisted of annotations made by several experts, a feature which turns the SOCO 2014 collection in a useful data set for future evaluations and, at the same time, it establishes a standard evaluation framework for future research works on the posed shared task.PAN@FIRE (SOCO) has been organised in the framework of WIQ-EI (EC IRSES grantn. 269180) and DIANA-APPLICATIONS (TIN2012-38603-C02- 01) research projects. The work of the last author was supported by CONACyT Mexico Project Grant CB-2010/153315, and SEP-PROMEP UAM-PTC-380/48510349.Flores Sáez, E.; Rosso, P.; Moreno Boronat, LA.; Villatoro-Tello, E. (2014). On the detection of SOurce COde re-use. En FIRE '14 Proceedings of the Forum for Information Retrieval Evaluation. ACM. 21-30. https://doi.org/10.1145/2824864.2824878S2130C. Arwin and S. Tahaghoghi. Plagiarism detection across programming languages. Proceedings of the 29th Australian Computer Science Conference, Australian Computer Society, 48:277--286, 2006.N. Baer and R. Zeidman. Measuring whitespace pattern sequence as an indication of plagiarism. Journal of Software Engineering and Applications, 5(4):249--254, 2012.M. Chilowicz, E. Duris, and G. Roussel. Syntax tree fingerprinting for source code similarity detection. In Program Comprehension, 2009. ICPC '09. IEEE 17th International Conference on, pages 243--247, 2009.D. Chuda, P. Navrat, B. Kovacova, and P. Humay. The issue of (software) plagiarism: A student view. Education, IEEE Transactions on, 55(1):22--28, 2012.G. Cosma and M. Joy. Evaluating the performance of lsa for source-code plagiarism detection. Informatica, 36(4):409--424, 2013.B. Cui, J. Li, T. Guo, J. Wang, and D. Ma. Code comparison system based on abstract syntax tree. In Broadband Network and Multimedia Technology (IC-BNMT), 3rd IEEE International Conference on, pages 668--673, Oct 2010.J. A. W. Faidhi and S. K. Robinson. An empirical approach for detecting program similarity and plagiarism within a university programming environment. Comput. Educ., 11(1):11--19, Jan. 1987.Fire, editor. FIRE 2014 Working Notes. Sixth International Workshop of the Forum for Information Retrieval Evaluation, Bangalore, India, 5--7 December, 2014.J. L. Fleiss. Measuring nominal scale agreement among many raters. Psychological bulletin, 76(5):378, 1971.E. Flores, A. Barrón-Cedeño, L. Moreno, and P. Rosso. Uncovering source code reuse in large-scale academic environments. Computer Applications in Engineering Education, pages n/a--n/a, 2014.E. Flores, A. Barrón-Cedeño, P. Rosso, and L. Moreno. DeSoCoRe: Detecting source code re-use across programming languages. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Demonstration Session, NAACL-HLT, pages 1--4. Association for Computational Linguistics, 2012.E. Flores, A. Barrón-Cedeño, P. Rosso, and L. Moreno. Towards the Detection of Cross-Language Source Code Reuse. Proceedings of 16th International Conference on Applications of Natural Language to Information Systems, NLDB-2011, Springer-Verlag, LNCS(6716), pages 250--253, 2011.E. Flores, M. Ibarra-Romero, L. Moreno, G. Sidorov, and P. Rosso. Modelos de recuperación de información basados en n-gramas aplicados a la reutilización de código fuente. In Proc. 3rd Spanish Conf. on Information Retrieval, pages 185--188, 2014.D. Ganguly and G. J. Jones. Dcu@ fire-2014: an information retrieval approach for source code plagiarism detection. In Fire [8].R. García-Hernández and Y. Lendeneva. Identification of similar source codes based on longest common substrings. In Fire [8].M. Joy and M. Luck. Plagiarism in programming assignments. Education, IEEE Transactions on, 42(2):129--133, May 1999.A. Marcus, A. Sergeyev, V. Rajlich, and J. Maletic. An information retrieval approach to concept location in source code. In Reverse Engineering, 2004. Proceedings. 11th Working Conference on, pages 214--223, Nov 2004.S. Narayanan and S. Simi. Source code plagiarism detection and performance analysis using fingerprint based distance measure method. In Proc. of 7th International Conference on Computer Science Education, ICCSE '12, pages 1065--1068, July 2012.M. Potthast, M. Hagen, A. Beyer, M. Busse, M. Tippmann, P. Rosso, and B. Stein. Overview of the 6th international competition on plagiarism detection. In L. Cappellato, N. Ferro, M. Halvey, and W. Kraaij, editors, Working Notes for CLEF 2014 Conference, Sheffield, UK, September 15-18, 2014., volume 1180 of CEUR Workshop Proceedings, pages 845--876. CEUR-WS.org, 2014.L. Prechelt, G. Malpohl, and M. Philippsen. Finding plagiarisms among a set of programs with JPlag. Journal of Universal Computer Science, 8(11):1016--1038, 2002.I. Rahal and C. Wielga. Source code plagiarism detection using biological string similarity algorithms. Journal of Information & Knowledge Management, 13(3), 2014.A. Ramírez-de-la Cruz, G. Ramírez-de-la Rosa, C. Sánchez-Sánchez, W. A. Luna-Ramírez, H. Jiménez-Salazar, and C. Rodríguez-Lucatero. Uam@soco 2014: Detection of source code reuse by means of combining different types of representations. In Fire [8].F. Rosales, A. García, S. Rodríguez, J. L. Pedraza, R. Méndez, and M. M. Nieto. Detection of plagiarism in programming assignments. IEEE Transactions on Education, 51(2):174--183, 2008.K. Sparck and C. van Rijsbergen. Report on the need for and provision of an "ideal" information retrieval test collection. British Library Research and Development Report, 5266, University of Cambridge, 1975.G. Whale. Software metrics and plagiarism detection. Journal of Systems and Software, 13(2):131--138, 1990

    Detecting Source Code Plagiarism on .NET Programming Languages using Low-level Representation and Adaptive Local Alignment

    Get PDF
    Even though there are various source code plagiarism detection approaches, only a few works which are focused on low-level representation for deducting similarity. Most of them are only focused on lexical token sequence extracted from source code. In our point of view, low-level representation is more beneficial than lexical token since its form is more compact than the source code itself. It only considers semantic-preserving instructions and ignores many source code delimiter tokens. This paper proposes a source code plagiarism detection which rely on low-level representation. For a case study, we focus our work on .NET programming languages with Common Intermediate Language as its low-level representation. In addition, we also incorporate Adaptive Local Alignment for detecting similarity. According to Lim et al, this algorithm outperforms code similarity state-of-the-art algorithm (i.e. Greedy String Tiling) in term of effectiveness. According to our evaluation which involves various plagiarism attacks, our approach is more effective and efficient when compared with standard lexical-token approach
    corecore