Source-code plagiarism: an academic perspective
In computing courses, students are often required to complete tutorial and laboratory exercises asking them to produce source-code. Academics may require students to submit source-code produced as part of such exercises in order to monitor their students' understanding of the material taught on that module, and submitted source-code may be checked for similarities in order to identify instances of plagiarism. In exercises that require students to work individually, source-code plagiarism can occur between students, or students may plagiarise by copying material from a book or from other sources. We have conducted a survey of UK academics who teach programming on computing courses, in order to establish what is understood to constitute source-code plagiarism in an undergraduate context. In this report, we analyse the responses received from 59 academics and present a detailed description of what can constitute source-code plagiarism from the perspective of academics who teach programming on computing courses.
Source-code plagiarism: a UK academic perspective
In computing courses, students are often required to complete tutorial and laboratory exercises asking them to produce source-code. Academics may require students to submit source-code produced as part of such exercises in order to monitor their students' understanding of the material taught on that module, and submitted source-code may be checked for similarities in order to identify instances of plagiarism. In exercises that require students to work individually, source-code plagiarism can occur between students, or students may plagiarise by copying material from a book or from other sources. We have conducted a survey of UK academics who teach programming on computing courses, in order to establish what is understood to constitute source-code plagiarism in an undergraduate context. In this report, we analyse the responses received from 59 academics and present a detailed description of what can constitute source-code plagiarism from the perspective of academics who teach programming on computing courses.
An approach to source-code plagiarism detection investigation using latent semantic analysis
This thesis looks at three aspects of source-code plagiarism. The first aspect of the
thesis is concerned with creating a definition of source-code plagiarism; the second aspect
is concerned with describing the findings gathered from investigating the Latent Semantic
Analysis information retrieval algorithm for source-code similarity detection; and the final
aspect of the thesis is concerned with the proposal and evaluation of a new algorithm that
combines Latent Semantic Analysis with plagiarism detection tools.
A recent review of the literature revealed that there is no commonly agreed definition of
what constitutes source-code plagiarism in the context of student assignments. This thesis
first analyses the findings from a survey carried out to gather an insight into the perspectives
of UK Higher Education academics who teach programming on computing courses. Based
on the survey findings, a detailed definition of source-code plagiarism is proposed.
Secondly, the thesis investigates the application of an information retrieval technique,
Latent Semantic Analysis, to derive semantic information from source-code files. The
effectiveness of Latent Semantic Analysis is driven by several parameters, and its
performance in retrieving similar source-code files is evaluated across a range of
parameter settings in order to identify those that work best.
Finally, an algorithm for combining Latent Semantic Analysis with plagiarism detection
tools is proposed and a tool is created and evaluated. The proposed tool, PlaGate, is
a hybrid model that allows for the integration of Latent Semantic Analysis with plagiarism
detection tools in order to enhance plagiarism detection. In addition, PlaGate has a facility
for investigating the importance of source-code fragments with regard to their contribution
towards proving plagiarism. PlaGate provides graphical output that indicates the clusters of
suspicious files and source-code fragments.
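The LSA pipeline the thesis investigates can be sketched in a few lines: build a term-by-document matrix over tokenised source files, truncate its singular value decomposition, and compare documents by cosine similarity in the latent space. The toy corpus, whitespace tokenisation, and dimensionality below are illustrative assumptions, not the thesis's actual configuration.

```python
import numpy as np

# Three tiny "source files"; a.c and b.c are near-duplicates, c.c is unrelated.
files = {
    "a.c": "int sum ( int a , int b ) { return a + b ; }",
    "b.c": "int add ( int x , int y ) { return x + y ; }",
    "c.c": "void print_list ( node * head ) { while ( head ) head = head->next ; }",
}

# Build the term-by-document matrix (rows: tokens, columns: files).
docs = [text.split() for text in files.values()]
vocab = sorted({tok for doc in docs for tok in doc})
A = np.array([[doc.count(tok) for doc in docs] for tok in vocab], dtype=float)

# Truncated SVD projects each document into a k-dimensional latent space.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
doc_vecs = (np.diag(s[:k]) @ Vt[:k]).T  # one latent vector per document

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

sim_ab = cosine(doc_vecs[0], doc_vecs[1])  # a.c vs b.c (near-duplicates)
sim_ac = cosine(doc_vecs[0], doc_vecs[2])  # a.c vs c.c (unrelated)
```

Because `a.c` and `b.c` share almost all of their token distribution, their latent vectors end up far closer than those of the unrelated pair, which is the property a similarity detector exploits.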
VITR: Augmenting Vision Transformers with Relation-Focused Learning for Cross-Modal Information Retrieval
Relation-focused cross-modal information retrieval focuses on retrieving
information based on relations expressed in user queries, and it is
particularly important in information retrieval applications and
next-generation search engines. While pre-trained networks like Contrastive
Language-Image Pre-training (CLIP) have achieved state-of-the-art performance
in cross-modal learning tasks, the Vision Transformer (ViT) used in these
networks is limited in its ability to focus on image region relations.
Specifically, ViT is trained to match images with relevant descriptions at the
global level, without considering the alignment between image regions and
descriptions. This paper introduces VITR, a novel network that enhances ViT by
extracting and reasoning about image region relations based on a Local encoder.
VITR comprises two main components: (1) extending the capabilities of ViT-based
cross-modal networks to extract and reason with region relations in images; and
(2) aggregating the reasoned results with the global knowledge to predict the
similarity scores between images and descriptions. Experiments were carried out
by applying the proposed network to relation-focused cross-modal information
retrieval tasks on the Flickr30K, RefCOCOg, and CLEVR datasets. The results
revealed that the proposed VITR network outperformed various other
state-of-the-art networks including CLIP, VSE, and VSRN++ on both
image-to-text and text-to-image cross-modal information retrieval tasks.
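Component (2) above, fusing reasoned region-relation scores with the global image-description similarity, can be illustrated with a minimal sketch. The fusion weight, embedding sizes, and region scores below are illustrative assumptions, not VITR's actual architecture or hyperparameters.

```python
import numpy as np

rng = np.random.default_rng(0)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Randomly generated stand-ins for a ViT [CLS]-style global image embedding
# and a description embedding from a real encoder.
global_img = rng.normal(size=64)
global_txt = rng.normal(size=64)

# Similarities already "reasoned" over image region relations (assumed values).
region_scores = np.array([0.8, 0.3, 0.6])

# One simple aggregation: a convex combination of global and regional evidence.
alpha = 0.5  # fusion weight (an assumption, not a VITR parameter)
score = alpha * cosine(global_img, global_txt) + (1 - alpha) * region_scores.mean()
```

At retrieval time, a score like this would be computed for every candidate and the candidates ranked by it.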
Improving Visual-Semantic Embeddings by Learning Semantically-Enhanced Hard Negatives for Cross-modal Information Retrieval
Visual Semantic Embedding (VSE) aims to extract the semantics of images and
their descriptions, and embed them into the same latent space for cross-modal
information retrieval. Most existing VSE networks are trained by adopting a
hard negatives loss function which learns an objective margin between the
similarity of relevant and irrelevant image-description embedding pairs.
However, the objective margin in the hard negatives loss function is set as a
fixed hyperparameter that ignores the semantic differences of the irrelevant
image-description pairs. To address the challenge of measuring the optimal
similarities between image-description pairs before obtaining the trained VSE
networks, this paper presents a novel approach that comprises two main parts:
(1) finds the underlying semantics of image descriptions; and (2) proposes a
novel semantically enhanced hard negatives loss function, where the learning
objective is dynamically determined based on the optimal similarity scores
between irrelevant image-description pairs. Extensive experiments were carried
out by integrating the proposed methods into five state-of-the-art VSE networks
that were applied to three benchmark datasets for cross-modal information
retrieval tasks. The results revealed that the proposed methods achieved the
best performance and can also be adopted by existing and future VSE networks.
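The fixed-margin hard negatives loss that the paper starts from (a VSE++-style max-of-hinges objective) can be sketched as follows; the margin value and the toy similarity matrices are illustrative. The proposed method replaces the fixed `margin` with a learning objective derived per pair from the semantics of the irrelevant descriptions.

```python
import numpy as np

def hard_negative_loss(S, margin=0.2):
    """Max-of-hinges triplet loss over a similarity matrix.

    S[i, j] is the similarity of image i and description j; the diagonal
    holds the relevant (matching) pairs.
    """
    n = S.shape[0]
    pos = np.diag(S)
    diag = np.eye(n, dtype=bool)
    # Hinge cost of every irrelevant pair against its relevant pair.
    cost_cap = np.clip(margin + S - pos[:, None], 0, None)  # wrong captions per image
    cost_img = np.clip(margin + S - pos[None, :], 0, None)  # wrong images per caption
    cost_cap[diag] = 0
    cost_img[diag] = 0
    # Keep only the single hardest negative in each row / column.
    return float(cost_cap.max(axis=1).sum() + cost_img.max(axis=0).sum())

# Well-separated pairs incur no loss; a near-miss (hard) caption does.
S_easy = np.array([[0.9, 0.2], [0.1, 0.8]])
S_hard = np.array([[0.9, 0.85], [0.1, 0.8]])
```

The fixed hyperparameter `margin` is exactly what the abstract criticises: it penalises every irrelevant pair by the same amount regardless of how semantically close it actually is.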
Advancing continual lifelong learning in neural information retrieval: definition, dataset, framework, and empirical evaluation
Continual learning refers to the capability of a machine learning model to
learn and adapt to new information, without compromising its performance on
previously learned tasks. Although several studies have investigated continual
learning methods for information retrieval tasks, a well-defined task
formulation is still lacking, and it is unclear how typical learning strategies
perform in this context. To address this challenge, a systematic task
formulation of continual neural information retrieval is presented, along with
a multiple-topic dataset that simulates continuous information retrieval. A
comprehensive continual neural information retrieval framework consisting of
typical retrieval models and continual learning strategies is then proposed.
Empirical evaluations illustrate that the proposed framework can successfully
prevent catastrophic forgetting in neural information retrieval and enhance
performance on previously learned tasks. The results indicate that
embedding-based retrieval models experience a decline in their continual
learning performance as the topic shift distance and dataset volume of new
tasks increase. In contrast, pretraining-based models do not show any such
correlation. Adopting suitable learning strategies can mitigate the effects of
topic shift and data augmentation.
Comment: Submitted to Information Science
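One typical continual-learning strategy such a framework can include is experience replay: a small buffer of earlier-task examples is mixed into each new task's training batches to limit catastrophic forgetting. The sketch below is a generic illustration under assumed names and sizes, not the paper's actual framework.

```python
import numpy as np

rng = np.random.default_rng(0)
buffer_x, buffer_y = [], []  # replay memory shared across tasks

def train_batches(task_x, task_y, batch_size=8, replay=4):
    """Yield batches that mix new-task samples with replayed old-task samples."""
    for start in range(0, len(task_x), batch_size):
        bx = task_x[start:start + batch_size]
        by = task_y[start:start + batch_size]
        if buffer_x:  # augment with samples from previously seen tasks
            idx = rng.integers(0, len(buffer_x), size=replay)
            bx = np.concatenate([bx, np.asarray(buffer_x)[idx]])
            by = np.concatenate([by, np.asarray(buffer_y)[idx]])
        yield bx, by
    # After the task, retain a sample of it for future replay.
    keep = rng.integers(0, len(task_x), size=min(16, len(task_x)))
    buffer_x.extend(task_x[keep])
    buffer_y.extend(task_y[keep])

task1 = (rng.normal(size=(32, 4)), rng.integers(0, 2, size=32))
task2 = (rng.normal(size=(32, 4)), rng.integers(0, 2, size=32))
n1 = sum(1 for _ in train_batches(*task1))  # buffer is empty during task 1
batches2 = list(train_batches(*task2))      # task-2 batches include replayed rows
```

Each task-2 batch carries 8 new rows plus 4 replayed task-1 rows, so the model keeps seeing earlier topics while adapting to the new one.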
Classifying Imbalanced Multi-modal Sensor Data for Human Activity Recognition in a Smart Home using Deep Learning
In smart homes, data generated from real-time
sensors for human activity recognition is complex, noisy and
imbalanced. It is a significant challenge to create machine
learning models that can classify activities which are not as
commonly occurring as other activities. Machine learning
models designed to classify imbalanced data are biased
towards learning the more commonly occurring classes. Such
learning bias occurs naturally, since the models better learn
classes which contain more records. This paper examines
whether fusing real-world imbalanced multi-modal sensor data
improves classification results as opposed to using unimodal
data; and compares deep learning approaches to dealing with
imbalanced multi-modal sensor data when using various
resampling methods and deep learning models. Experiments
were carried out using a large multi-modal sensor dataset
generated from the Sensor Platform for HEalthcare in a
Residential Environment (SPHERE). The data comprises
16104 samples, where each sample contains 5608 features and
belongs to one of 20 activities (classes). Experimental results
using SPHERE demonstrate the challenges of dealing with
imbalanced multi-modal data and highlight the importance of
having a suitable number of samples within each class for
sufficiently training and testing deep learning models.
Furthermore, the results revealed that when fusing the data
and using the Synthetic Minority Oversampling Technique
(SMOTE) to correct class imbalance, CNN-LSTM achieved the
highest classification accuracy of 93.67%, followed by CNN
(93.55%) and LSTM (92.98%).
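SMOTE, used above to correct the class imbalance, synthesises new minority-class samples by interpolating between a minority sample and one of its nearest minority-class neighbours. The simplified sketch below illustrates only this core idea; the real experiments operate on the 5608-feature SPHERE samples, and all sizes here are illustrative.

```python
import numpy as np

def smote(X, n_new, k=3, rng=np.random.default_rng(0)):
    """Return n_new synthetic samples interpolated within minority class X."""
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X))
        # k nearest minority neighbours of X[i] (index 0 is X[i] itself).
        dists = np.linalg.norm(X - X[i], axis=1)
        neighbours = np.argsort(dists)[1:k + 1]
        j = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(X[i] + gap * (X[j] - X[i]))
    return np.array(synthetic)

minority = np.random.default_rng(1).normal(size=(10, 5))
new = smote(minority, n_new=15)  # oversample the minority class
```

Because each synthetic point lies on the line segment between two real minority samples, the new samples stay within the minority class's region of the feature space rather than being arbitrary noise.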
ForestMonkey: Toolkit for Reasoning with AI-based Defect Detection and Classification Models
Artificial intelligence (AI) reasoning and explainable AI (XAI) tasks have
gained popularity recently, enabling users to explain the predictions or
decision processes of AI models. This paper introduces Forest Monkey (FM), a
toolkit designed to reason about the outputs of any AI-based defect detection and/or
classification model with data explainability. Implemented as a Python package,
FM takes input in the form of dataset folder paths (including original images,
ground truth labels, and predicted labels) and provides a set of charts and a
text file to illustrate the reasoning results and suggest possible
improvements. The FM toolkit consists of processes such as feature extraction
from predictions to reasoning targets, feature extraction from images to defect
characteristics, and a decision tree-based AI-Reasoner. Additionally, this
paper investigates the time performance of the FM toolkit when applied to four
AI models with different datasets. Lastly, a tutorial is provided to guide
users in performing reasoning tasks using the FM toolkit.
Comment: 6 pages, 5 figures, accepted in the 2023 IEEE Symposium Series on
Computational Intelligence (SSCI).