
    AntiPlag: Plagiarism Detection on Electronic Submissions of Text Based Assignments

    Plagiarism is a growing issue in academia and a constant concern for universities and other academic institutions. The situation is worsening with the availability of ample resources on the web. This paper focuses on creating an effective and fast plagiarism detection tool for text-based electronic assignments. Our tool, named AntiPlag, is built on the tri-gram sequence matching technique. Three sets of text-based assignments were tested with AntiPlag and the results were compared against an existing commercial plagiarism detection tool. AntiPlag produced fewer false positives than the commercial tool, owing to the pre-processing steps it performs. In addition, to reduce detection latency, AntiPlag applies a data clustering technique, making it four times faster than the commercial tool considered. AntiPlag can easily separate plagiarised text-based assignments from non-plagiarised ones. We therefore present AntiPlag as a fast and effective plagiarism detection tool for text-based electronic assignments.
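    The core idea of tri-gram sequence matching can be sketched briefly. The abstract does not give AntiPlag's exact scoring function or pre-processing pipeline, so the lowercasing step and the Jaccard similarity below are illustrative assumptions:

    ```python
    def trigrams(text):
        """Set of word tri-grams in a text (lowercasing as an assumed pre-processing step)."""
        words = text.lower().split()
        return {tuple(words[i:i + 3]) for i in range(len(words) - 2)}

    def trigram_similarity(doc_a, doc_b):
        """Jaccard similarity of two documents' tri-gram sets (assumed scoring function)."""
        a, b = trigrams(doc_a), trigrams(doc_b)
        if not a or not b:
            return 0.0
        return len(a & b) / len(a | b)
    ```

    A pair of submissions scoring close to 1.0 shares nearly all word tri-grams and would be flagged for review; scores near 0.0 indicate little overlap.
    
    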

    Performance analysis of supervised learning classifiers for the prediction of child birth weight

    Although technological advances have helped improve the medical and health sectors, a high infant mortality rate is still a serious problem in developing countries. Low birth weight (LBW) plays a major role in infant mortality, and a child may have LBW for several reasons. Within Sri Lanka there is wide geographical variation in LBW; in particular, the districts dominated by plantation workers, Monaragala, Ampara and Polonnaruwa, show the highest percentages of newborns in the LBW category, with statistics for the Ampara district showing 17% LBW. The objective of this work is to find a suitable way to predict child birth weight using the existing pattern of low birth weight in the Ampara region. This requires identifying, among several existing supervised learning classifiers, an algorithm with good performance for constructing a decision model. The study covered 2,700 pregnant mothers across the MOH offices in the Ampara district. The existing data were first manually classified into three classes: Normal Birth Weight (NB), Low Birth Weight (LB) and High Birth Weight (HB). Several important parameters were captured from the data set. The C4.5, CART and ID3 supervised learning classifiers in the Weka data mining and machine learning tool were used in this experiment. The data were handled through three major processes: pre-processing, attribute selection, and construction of decision trees using the classifiers. Missing values in the large data set were handled during pre-processing, and the most significant parameters were selected and ranked using a feature selection process. The three decision tree algorithms were then used to construct decision trees.
    The accuracy and time complexity of tree construction were measured in Weka using 10-fold cross validation. From the experimental results, C4.5 produced the highest accuracy, 86.15%, with a construction time of under one minute. Considering both time complexity and accuracy, C4.5 performed most effectively compared to the others, and it was therefore selected as the best classifier for constructing the decision tree model for predicting child birth weight in the Ampara district.
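    The evaluation procedure above (train a decision tree, score it with 10-fold cross validation) can be sketched outside Weka. Note the caveats: scikit-learn's `DecisionTreeClassifier` implements a CART-style tree rather than C4.5 (Weka's J48), and the data below are a synthetic stand-in for the Ampara maternal records, so this illustrates the cross-validation workflow only, not the paper's results:

    ```python
    import numpy as np
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.model_selection import cross_val_score

    # Synthetic stand-in for the maternal-health data set (hypothetical features).
    rng = np.random.default_rng(0)
    X = rng.normal(size=(300, 5))             # e.g. maternal age, BMI, gestation, ...
    y = (X[:, 0] + X[:, 1] > 0).astype(int)   # 0 = normal weight, 1 = low weight

    # CART-style tree; the paper used Weka's J48 implementation of C4.5.
    clf = DecisionTreeClassifier(max_depth=4, random_state=0)
    scores = cross_val_score(clf, X, y, cv=10)  # 10-fold cross validation
    print(f"mean accuracy: {scores.mean():.3f}")
    ```

    Each of the 10 folds serves once as a held-out test set, and the mean of the fold accuracies is the figure reported for a classifier.
    
    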

    Fast, Linear Time, m-Adic Hierarchical Clustering for Search and Retrieval using the Baire Metric, with linkages to Generalized Ultrametrics, Hashing, Formal Concept Analysis, and Precision of Data Measurement

    We describe many vantage points on the Baire metric and its use in clustering data, or its use in preprocessing and structuring data in order to support search and retrieval operations. In some cases, we proceed directly to clusters and do not directly determine the distances. We show how a hierarchical clustering can be read directly from one pass through the data. We offer insights also on practical implications of precision of data measurement. As a mechanism for treating multidimensional data, including very high dimensional data, we use random projections.
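    The longest-common-prefix idea behind the Baire metric can be sketched as follows: two base-m digit strings that agree on their first k digits are at distance m^(-k), and a single linear pass that hashes each value on its prefix reads off one level of the hierarchy directly, without computing pairwise distances. The function names are illustrative, not from the paper:

    ```python
    from collections import defaultdict

    def baire_distance(x, y, m=10):
        """Baire (longest-common-prefix) distance between two digit strings."""
        k = 0
        for a, b in zip(x, y):
            if a != b:
                break
            k += 1
        return 0.0 if x == y else m ** (-k)

    def baire_clusters(codes, depth):
        """One linear pass: group codes that share a prefix of the given length."""
        buckets = defaultdict(list)
        for c in codes:
            buckets[c[:depth]].append(c)
        return dict(buckets)
    ```

    Running `baire_clusters` at successive depths yields nested partitions, which is the hierarchical clustering read off in one pass.
    
    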

    Classification of Microarrays with kNN: Comparison of Dimensionality Reduction Methods

    Dimensionality reduction can often improve the performance of the k-nearest neighbor classifier (kNN) for high-dimensional data sets, such as microarrays. The effect of the choice of dimensionality reduction method on the predictive performance of kNN for classifying microarray data is an open issue, and four common dimensionality reduction methods, Principal Component Analysis (PCA), Random Projection (RP), Partial Least Squares (PLS) and Information Gain (IG), are compared on eight microarray data sets. It is observed that all dimensionality reduction methods result in more accurate classifiers than what is obtained from using the raw attributes. Furthermore, it is observed that both PCA and PLS reach their best accuracies with fewer components than the other two methods, and that RP needs far more components than the others to outperform kNN on the non-reduced data set. None of the dimensionality reduction methods can be concluded to generally outperform the others, although PLS is shown to be superior on all four binary classification tasks, but the main conclusion from the study is that the choice of dimensionality reduction method can be of major importance when classifying microarrays using kNN.
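    The comparison performed in the study can be sketched with one of the four methods, PCA, using scikit-learn: score kNN on the raw attributes and on PCA-reduced components under the same cross-validation protocol. The data here are synthetic (the paper used eight real microarray sets and also evaluated RP, PLS and IG), so the numbers illustrate the procedure, not the paper's findings:

    ```python
    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.model_selection import cross_val_score

    # Synthetic high-dimensional stand-in for a microarray set: many attributes,
    # few samples, with the class signal carried by a handful of high-variance features.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 500))
    X[:, :10] *= 5.0
    y = (X[:, :10].sum(axis=1) > 0).astype(int)

    raw = KNeighborsClassifier(n_neighbors=5)
    reduced = make_pipeline(PCA(n_components=10), KNeighborsClassifier(n_neighbors=5))

    raw_score = cross_val_score(raw, X, y, cv=5).mean()
    pca_score = cross_val_score(reduced, X, y, cv=5).mean()
    print(f"raw kNN:   {raw_score:.3f}")
    print(f"PCA + kNN: {pca_score:.3f}")
    ```

    Sweeping `n_components` over a grid, as the study does for each method, reveals how many components each reduction technique needs to reach its best accuracy.
    
    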