Crosslingual Document Embedding as Reduced-Rank Ridge Regression
There has recently been much interest in extending vector-based word
representations to multiple languages, such that words can be compared across
languages. In this paper, we shift the focus from words to documents and
introduce a method for embedding documents written in any language into a
single, language-independent vector space. For training, our approach leverages
a multilingual corpus where the same concept is covered in multiple languages
(but not necessarily via exact translations), such as Wikipedia. Our method,
Cr5 (Crosslingual reduced-rank ridge regression), starts by training a
ridge-regression-based classifier that uses language-specific bag-of-word
features in order to predict the concept that a given document is about. We
show that, when constraining the learned weight matrix to be of low rank, it
can be factored to obtain the desired mappings from language-specific
bags-of-words to language-independent embeddings. As opposed to most prior
methods, which use pretrained monolingual word vectors, postprocess them to
make them crosslingual, and finally average word vectors to obtain document
vectors, Cr5 is trained end-to-end and is thus natively crosslingual as well as
document-level. Moreover, since our algorithm uses the singular value
decomposition as its core operation, it is highly scalable. Experiments show
that our method achieves state-of-the-art performance on a crosslingual
document retrieval task. Finally, although not trained for embedding sentences
and words, it also achieves competitive performance on crosslingual sentence
and word retrieval tasks.
Comment: In The Twelfth ACM International Conference on Web Search and Data Mining (WSDM '19).
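The core construction behind Cr5, a ridge-regression classifier whose weight matrix is constrained to low rank and factored via SVD, can be sketched generically as follows (a minimal illustration on synthetic data; the dimensions, ridge penalty, and factorization choice are invented here and not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: n documents, d bag-of-words features, k concept labels (one-hot).
n, d, k, rank = 200, 50, 10, 5
X = rng.random((n, d))
Y = np.eye(k)[rng.integers(0, k, size=n)]

lam = 1.0  # ridge penalty

# Full ridge solution: W = (X^T X + lam*I)^{-1} X^T Y, mapping features to concepts.
W = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

# Low-rank factorization via SVD: W is approximated by A @ B,
# with A (d x rank) and B (rank x k).
U, s, Vt = np.linalg.svd(W, full_matrices=False)
A = U[:, :rank] * s[:rank]   # maps bag-of-words -> low-dimensional embedding
B = Vt[:rank]                # maps embedding -> concept scores

# Rows of `embeddings` play the role of language-independent document vectors.
embeddings = X @ A
```

Because the heavy lifting is a single SVD of the weight matrix, the factorization step scales well, which is the scalability property the abstract highlights.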
Prototype of Automatic Essay Assessment and Plagiarism Detection on Mobile Learning "Molearn" Application Using GLSA Method
In evaluating students' learning outcomes, teachers commonly use essay exams to measure how well students understand the learning material. However, grading essay answers is difficult in practice because it involves the teacher's subjectivity and requires a long correction time. In addition, detecting similarity between students' essay answers demands extra effort from teachers. In previous studies, a prototype for essay-answer assessment and plagiarism detection had been successfully created. However, its display still needed improvement according to the evaluation given by biology teachers in East Java Province, the application's users. The previous prototype also relied on the Latent Semantic Analysis (LSA) method, which has several weaknesses. This study therefore aimed to produce a prototype with a better display and a better text-similarity method. The Generalized Latent Semantic Analysis (GLSA) method was chosen because it covers the weaknesses of LSA: it can handle sentences that contain syntactic errors or are missing common words. Based on the evaluation results, this study produced a prototype with a better display rating; the level of user satisfaction increased by 6.12%. In addition, the study succeeded in substituting GLSA for LSA, producing a better prototype for essay assessment and automatic plagiarism detection.
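The GLSA idea of generalizing LSA's terms from single words to n-grams can be sketched as follows (a minimal toy illustration assuming bigram terms and cosine similarity; the paper's exact feature construction and weighting may differ):

```python
import numpy as np

def bigrams(text):
    """Split text into word bigrams, the 'generalized' terms used here."""
    toks = text.lower().split()
    return list(zip(toks, toks[1:]))

# Toy essay answers: the first two paraphrase the same idea.
docs = [
    "the cell membrane controls what enters the cell",
    "the membrane of the cell controls what enters it",
    "photosynthesis converts light energy into chemical energy",
]

# Build a term-document matrix over bigrams instead of single words.
vocab = sorted({bg for d in docs for bg in bigrams(d)})
idx = {bg: i for i, bg in enumerate(vocab)}
M = np.zeros((len(vocab), len(docs)))
for j, d in enumerate(docs):
    for bg in bigrams(d):
        M[idx[bg], j] += 1

# Truncated SVD projects each document into a low-dimensional semantic space.
U, s, Vt = np.linalg.svd(M, full_matrices=False)
k = 2
doc_vecs = Vt[:k].T * s[:k]   # each row: one document in semantic space

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

sim_same_topic = cosine(doc_vecs[0], doc_vecs[1])
sim_diff_topic = cosine(doc_vecs[0], doc_vecs[2])
```

In this sketch the two paraphrased answers end up far more similar than the unrelated one, which is the property an essay grader or plagiarism detector exploits.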
Hunting for Pirated Software Using Metamorphic Analysis
In this paper, we consider the problem of detecting software that has been pirated and modified. We analyze a variety of detection techniques that have been previously studied in the context of malware detection. For each technique, we empirically determine the detection rate as a function of the degree of modification of the original code. We show that the code must be greatly modified before we fail to reliably distinguish it, and we show that our results offer a significant improvement over previous related work. Our approach can be applied retroactively to any existing software and hence, it is both practical and effective
A Machine Learning Approach for Plagiarism Detection
Plagiarism detection is gaining increasing importance due to requirements for integrity in education. Existing research has investigated the problem of plagiarism detection with varying degrees of success. The literature reveals two main methods for detecting plagiarism, namely extrinsic and intrinsic.
This thesis has developed two novel approaches to address both of these methods. Firstly, a novel extrinsic method for detecting plagiarism is proposed. The method is based on four well-known techniques, namely Bag of Words (BOW), Latent Semantic Analysis (LSA), Stylometry, and Support Vector Machines (SVM). The LSA application was fine-tuned to take in stylometric features (most common words) in order to characterise document authorship, as described in chapter 4. The results revealed that LSA-based stylometry outperformed the traditional LSA application. SVM-based algorithms were used to perform the classification procedure in order to predict which author wrote a particular book being tested. The proposed method successfully addressed the limitations of semantic characteristics and identified the document source by assigning the book being tested to the right author in most cases.
Secondly, the intrinsic detection method has relied on the use of the statistical properties of the most common words. LSA was applied in this method to a group of most common words (MCWs) to extract their usage patterns based on the transitivity property of LSA. The feature sets of the intrinsic model were based on the frequency of the most common words, their relative frequencies in series, and the deviation of these frequencies across all books for a particular author.
The intrinsic method aims to generate a model of author "style" by revealing a set of certain features of authorship. The model's generation procedure focuses on just one author as an attempt to summarise aspects of an author's style in a definitive and clear-cut manner.
The thesis has also proposed a novel experimental methodology for testing the performance of both extrinsic and intrinsic methods for plagiarism detection. This methodology relies upon the CEN (Corpus of English Novels) dataset, but divides it into training and test datasets in a novel manner. Both approaches were evaluated using the well-known leave-one-out cross-validation method. Results indicated that by integrating deep analysis (LSA) and stylometric analysis, hidden changes can be identified whether or not a reference collection exists.
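The most-common-words profile underlying the stylometric side of this work can be sketched as follows (a toy nearest-centroid attribution over function-word frequencies; the word list, texts, and classifier here are invented stand-ins, not the thesis's actual model):

```python
import numpy as np
from collections import Counter

# A small fixed list of function words serving as the "most common words".
FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "it"]

def mcw_profile(text):
    """Relative frequency of each function word in the text."""
    counts = Counter(text.lower().split())
    total = sum(counts.values()) or 1
    return np.array([counts[w] / total for w in FUNCTION_WORDS])

# Toy "books": two per author, differing in function-word habits.
author_a = ["the cat sat on the mat and the dog barked at the cat",
            "the rain fell and the wind blew and the trees bent"]
author_b = ["it is said that a man of learning is a man of worth",
            "a matter of taste is a matter that divides a room"]

# Each author is summarised by the mean of their books' profiles.
centroid_a = np.mean([mcw_profile(t) for t in author_a], axis=0)
centroid_b = np.mean([mcw_profile(t) for t in author_b], axis=0)

def attribute(text):
    """Assign a text to the author with the nearest style centroid."""
    p = mcw_profile(text)
    dist_a = np.linalg.norm(p - centroid_a)
    dist_b = np.linalg.norm(p - centroid_b)
    return "A" if dist_a < dist_b else "B"

test_book = "the fox ran and the hound followed over the hill"
```

A real intrinsic model would add the in-series relative frequencies and cross-book deviations the abstract mentions; this sketch shows only the basic frequency-profile idea.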
Fragile watermarking for image authentication using dyadic walsh ordering
Digital images are the media most subject to manipulation, driven by the ease of editing them with rapidly evolving image-editing software. This problem can be addressed through watermarking as an active authentication system for the image. One of the most popular methods is Singular Value Decomposition (SVD), which offers good imperceptibility and detection capability. Nevertheless, SVD has high computational complexity, and it uses only the singular matrix S while ignoring the two orthogonal matrices. This paper proposes using the Walsh matrix with dyadic ordering to generate a new S matrix without the orthogonal matrices. The experimental results showed that the proposed method reduced computation time by 22% and 13% compared to the SVD-based method and similar Hadamard-matrix-based methods, respectively. This research can serve as a reference for speeding up the computation time of watermarking methods without compromising imperceptibility or authentication capability.
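The dyadic-ordered Walsh construction can be sketched as follows (one common convention builds the dyadic/Paley order by bit-reversing the natural Hadamard row index, though orderings differ across texts; the diagonal-of-transform step at the end is only an illustrative surrogate for the paper's S-matrix generation, not its exact scheme):

```python
import numpy as np

def hadamard(n):
    """Sylvester construction of the natural-order Hadamard matrix; n a power of two."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

def bit_reverse(i, bits):
    """Reverse the lowest `bits` bits of integer i."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r

def walsh_dyadic(n):
    """Walsh matrix in dyadic (Paley) order via bit-reversed row permutation."""
    H = hadamard(n)
    bits = n.bit_length() - 1
    perm = [bit_reverse(i, bits) for i in range(n)]
    return H[perm]

n = 8
W = walsh_dyadic(n)

# Illustrative fast surrogate for an SVD-style singular matrix S on an image
# block: keep only the diagonal of the 2-D Walsh transform (normalized).
block = np.arange(n * n, dtype=float).reshape(n, n)
S = np.diag(np.diag(W @ block @ W.T) / n)
```

The appeal over SVD is that the transform uses only additions and sign flips on a fixed orthogonal matrix, avoiding the iterative decomposition entirely.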
CroLSSim: Cross-language software similarity detector using hybrid approach of LSA-based AST-MDrep features and CNN-LSTM model
Software similarity in different programming codes is a rapidly evolving field because of its numerous applications in software development, software cloning, software plagiarism, and software forensics. Currently, software researchers and developers search cross-language open-source repositories for similar applications for a variety of reasons, such as reusing programming code, analyzing different implementations, and looking for a better application. However, this is a challenging task because each programming language has a unique syntax and semantic structure. In this paper, a novel tool called Cross-Language Software Similarity (CroLSSim) is designed to detect similar software applications written in different programming languages. First, Abstract Syntax Tree (AST) features are collected from the different programming codes; these are high-quality features that capture the abstract view of each program. Methods Description (MDrep), in combination with the AST, is then used to examine the relationships among different method calls. Second, the Term Frequency-Inverse Document Frequency approach is used to compute local and global weights for the AST-MDrep features. Third, a Latent Semantic Analysis-based feature extraction and selection method is proposed to extract semantic anchors in a reduced-dimensional space. Fourth, a Convolutional Neural Network (CNN)-based feature extraction method is proposed to mine deep features. Finally, a hybrid deep learning model combining CNN with Long Short-Term Memory is designed to detect semantically similar software applications from these latent variables. The data set contains approximately 9.5K Java, 8.8K C#, and 7.4K C++ software applications obtained from GitHub. The proposed approach outperforms the state-of-the-art methods.
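The TF-IDF-plus-LSA stage of such a pipeline can be sketched as follows (a toy illustration over invented AST-like token documents; using raw term frequency as the local weight and inverse document frequency as the global weight is an assumption, not necessarily the paper's exact weighting):

```python
import numpy as np
from collections import Counter

# Toy "AST token" documents standing in for AST-MDrep features of three programs.
docs = [
    ["MethodDecl", "Param", "IfStmt", "Call:print", "Return"],
    ["MethodDecl", "Param", "ForStmt", "Call:print", "Return"],
    ["ClassDecl", "FieldDecl", "Ctor", "Assign"],
]

vocab = sorted({t for d in docs for t in d})

# Local weight: raw term frequency per document.
tf = np.array([[Counter(d)[t] for t in vocab] for d in docs], dtype=float)

# Global weight: inverse document frequency across the corpus.
df = (tf > 0).sum(axis=0)
idf = np.log(len(docs) / df) + 1.0
tfidf = tf * idf

# LSA step: truncated SVD extracts "semantic anchors" in a reduced space.
U, s, Vt = np.linalg.svd(tfidf, full_matrices=False)
k = 2
latent = U[:, :k] * s[:k]   # each row: one program in the latent space

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
```

The first two programs, which share most of their AST tokens, land close together in the latent space, while the structurally different third program lands far away; the CNN-LSTM model in the paper then operates on such latent representations.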