    A System for Detecting Similarity Between Text Documents Using a Bayesian Model on Latent Semantic Analysis (LSA) Terms

    Latent Semantic Analysis (LSA) is a method that represents the relationships between text documents through their terms and can assess the similarity between those documents. However, LSA assesses similarity only through the frequency of the terms in each document. Its weakness is therefore that it ignores the order or position of those terms, which indirectly affects the meaning contained in each document. For this reason, a Bayesian model is applied to the terms produced by LSA so that term order is preserved and taken into account when detecting similarity between text documents. Sentence structure is thereby maintained and a better similarity assessment is obtained. If two documents are copies of each other but the sentence structure of one has been altered, comparing them in LSA using cosine similarity gives the same result as comparing the two documents without the change in sentence structure. When they are compared using the Bayesian model on terms, documents that differ in sentence structure are treated differently.
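
    The order-insensitivity described above can be illustrated with a minimal sketch. It uses raw term-frequency vectors rather than a full LSA pipeline (no SVD step), and the bigram comparison is only an assumed stand-in for an order-aware measure, not the Bayesian model of the paper. The function names and the two example sentences are hypothetical.

```python
import math
from collections import Counter


def term_frequencies(text):
    """Bag-of-words counts: word order is discarded."""
    return Counter(text.lower().split())


def bigram_frequencies(text):
    """Counts of adjacent word pairs: a simple order-aware feature (assumption)."""
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:]))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


original = "the cat chased the dog"
reordered = "the dog chased the cat"  # same terms, different structure

# Term frequencies are identical, so the cosine similarity is (numerically) 1.
tf_similarity = cosine_similarity(
    term_frequencies(original), term_frequencies(reordered)
)

# Adjacent-pair counts differ, so the order-aware similarity drops below 1.
bigram_similarity = cosine_similarity(
    bigram_frequencies(original), bigram_frequencies(reordered)
)

print(tf_similarity, bigram_similarity)
```

    The two sentences share every term with the same frequency, so a frequency-only comparison cannot tell them apart, while any feature that depends on term order can.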

    On the detection of SOurce COde re-use

    This paper summarizes the goals, organization and results of the first SOCO competitive evaluation campaign for systems that automatically detect source code re-use. The detection of source code re-use is an important research field for both the software industry and academia. Accordingly, the PAN@FIRE track named SOurce COde Re-use (SOCO) focused on the detection of re-used source code in the C/C++ and Java programming languages. Participant systems were asked to annotate several pieces of source code according to whether or not they represent cases of source code re-use. In total, five teams submitted 17 runs. The training set consisted of annotations made by several experts, a feature which makes the SOCO 2014 collection a useful data set for future evaluations and, at the same time, establishes a standard evaluation framework for future research on the posed shared task.

    Detecting Source Code Plagiarism on .NET Programming Languages using Low-level Representation and Adaptive Local Alignment

    Even though there are various source code plagiarism detection approaches, only a few works focus on low-level representation for deducing similarity. Most of them focus only on lexical token sequences extracted from source code. In our view, low-level representation is more beneficial than lexical tokens since its form is more compact than the source code itself. It only considers semantic-preserving instructions and ignores many source code delimiter tokens. This paper proposes a source code plagiarism detection approach which relies on low-level representation. As a case study, we focus our work on .NET programming languages, with Common Intermediate Language as the low-level representation. In addition, we incorporate Adaptive Local Alignment for detecting similarity. According to Lim et al., this algorithm outperforms the state-of-the-art code similarity algorithm (i.e. Greedy String Tiling) in terms of effectiveness. According to our evaluation, which involves various plagiarism attacks, our approach is more effective and efficient than the standard lexical-token approach.
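
    To give a rough idea of alignment-based comparison over instruction sequences, the sketch below scores two token sequences with plain Smith-Waterman local alignment. This is a swapped-in, simpler technique, not the Adaptive Local Alignment of Lim et al., whose scoring details are not given in the abstract. The scoring parameters and the short CIL-like opcode lists are hypothetical examples, not output from a real compiler.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best Smith-Waterman local alignment score between token lists a and b.

    Only the previous row of the dynamic-programming table is kept,
    since the score (not the alignment itself) is all that is returned.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            diagonal = prev[j - 1] + (match if x == y else mismatch)
            score = max(0, diagonal, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


# Hypothetical low-level instruction sequences (CIL-like opcodes).
submission_a = ["ldarg.0", "ldarg.1", "add", "stloc.0", "ldloc.0", "ret"]
# Same instructions with one extra leading instruction inserted.
submission_b = ["nop", "ldarg.0", "ldarg.1", "add", "stloc.0", "ldloc.0", "ret"]
# An unrelated sequence that shares only the final instruction.
submission_c = ["ldstr", "call", "pop", "ret"]

score_ab = local_alignment_score(submission_a, submission_b)
score_ac = local_alignment_score(submission_a, submission_c)
print(score_ab, score_ac)
```

    Because the alignment is local, the inserted leading instruction in the second sequence does not lower the score of the shared region, whereas the unrelated sequence scores only for the single instruction it has in common.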