
    Source code authorship attribution

    To attribute authorship means to identify the true author among many candidates for samples of work of unknown or contentious authorship. Authorship attribution is a prolific research area for natural language, but much less so for source code, with eight other research groups having published empirical results concerning the accuracy of their approaches to date. Authorship attribution of source code is the focus of this thesis. We first review, reimplement, and benchmark all existing published methods to establish a consistent set of accuracy scores. This is done using four newly constructed and significant source code collections comprising samples from academic sources, freelance sources, and multiple programming languages. The collections developed are the most comprehensive to date in the field. We then propose a novel information retrieval method for source code authorship attribution. In this method, source code features from the collection samples are tokenised, converted into n-grams, and indexed for stylistic comparison to query samples using the Okapi BM25 similarity measure. Authorship of the top ranked sample is used to classify authorship of each query, and the proportion of times that this is correct determines overall accuracy. The results show that this approach is more accurate than the best approach from the previous work for three of the four collections. The accuracy of the new method is then explored in the context of author style evolving over time, by experimenting with a collection of student programming assignments that spans three semesters with established relative timestamps. We find that it takes one full semester for individual coding styles to stabilise, which is essential knowledge for ongoing authorship attribution studies and quality control in general. We conclude the research by extending both the new information retrieval method and previous methods to provide a complete set of benchmarks for advancing the field. 
In the final evaluation, we show that the n-gram approaches lead the field, with accuracy scores for some collections around 90% for a one-in-ten classification problem.
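The retrieval pipeline described above (tokenise the collection samples, form n-grams, index them, rank by Okapi BM25 against each query, attribute the query to the top-ranked sample's author) can be sketched as follows. This is a minimal illustration rather than the thesis implementation: the whitespace tokeniser, the trigram size and the BM25 parameters (k1 = 1.2, b = 0.75) are all assumptions.

```python
import math
from collections import Counter

def ngrams(tokens, n=3):
    """Overlapping token n-grams used as stylistic features."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

class BM25Index:
    """Tiny in-memory BM25 index over (author, n-gram list) samples."""

    def __init__(self, samples, k1=1.2, b=0.75):
        self.samples = samples
        self.k1, self.b = k1, b
        self.N = len(samples)
        self.doc_tf = [Counter(grams) for _, grams in samples]
        self.df = Counter(g for tf in self.doc_tf for g in tf)
        self.avgdl = sum(len(grams) for _, grams in samples) / self.N

    def score(self, query_grams, i):
        """Okapi BM25 score of sample i against the query n-grams."""
        tf, dl, s = self.doc_tf[i], sum(self.doc_tf[i].values()), 0.0
        for g in set(query_grams):
            if g not in tf:
                continue
            idf = math.log(1 + (self.N - self.df[g] + 0.5) / (self.df[g] + 0.5))
            norm = tf[g] + self.k1 * (1 - self.b + self.b * dl / self.avgdl)
            s += idf * tf[g] * (self.k1 + 1) / norm
        return s

    def attribute(self, query_tokens, n=3):
        """Authorship of the top-ranked sample classifies the query."""
        q = ngrams(query_tokens, n)
        return self.samples[max(range(self.N), key=lambda i: self.score(q, i))][0]

# Toy corpus: one token stream per candidate author.
index = BM25Index([
    ("alice", ngrams("for i in range ( n ) : total += i".split())),
    ("bob", ngrams("while x > 0 : x = x - 1".split())),
])
```

On this toy corpus, a query that reuses Alice's looping idiom is attributed to her because it shares trigrams such as ('for', 'i', 'in') with her sample only.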

    Neural and Non-Neural Approaches to Authorship Attribution


    A study on plagiarism detection and plagiarism direction identification using natural language processing techniques

    Ever since we entered the digital communication era, the ease of information sharing through the internet has encouraged online literature searching. With this comes the potential risk of a rise in academic misconduct and intellectual property theft. As concerns over plagiarism grow, more attention has been directed towards automatic plagiarism detection. This is a computational approach which assists humans in judging whether pieces of text are plagiarised. However, most existing plagiarism detection approaches are limited to superficial, brute-force string-matching techniques. If the text has undergone substantial semantic and syntactic changes, string-matching approaches do not perform well. In order to identify such changes, linguistic techniques which are able to perform a deeper analysis of the text are needed. To date, very limited research has been conducted on utilising linguistic techniques in plagiarism detection. This thesis provides novel perspectives on the plagiarism detection and plagiarism direction identification tasks. The hypothesis is that original texts and rewritten texts exhibit significant but measurable differences, and that these differences can be captured through statistical and linguistic indicators. To investigate this hypothesis, four main research objectives are defined. First, a novel framework for plagiarism detection is proposed. It involves the use of Natural Language Processing techniques, rather than relying only on traditional string-matching approaches. The objective is to investigate and evaluate the influence of text pre-processing, and of statistical, shallow and deep linguistic techniques, using a corpus-based approach. This is achieved by evaluating the techniques in two main experimental settings. Second, the role of machine learning in this novel framework is investigated. The objective is to determine whether the application of machine learning in the plagiarism detection task is helpful.
This is achieved by comparing a threshold-setting approach against a supervised machine learning classifier. Third, the prospect of applying the proposed framework in a large-scale scenario is explored. The objective is to investigate the scalability of the proposed framework and algorithms. This is achieved by experimenting with a large-scale corpus in three stages. The first two stages are based on longer text lengths and the final stage is based on segments of texts. Finally, the plagiarism direction identification problem is explored as supervised machine learning classification and ranking tasks. Statistical and linguistic features are investigated individually and in various combinations. The objective is to introduce a new perspective on the traditional brute-force pair-wise comparison of texts. Instead of comparing original texts against rewritten texts, features are drawn from traits of the texts to build a pattern for original and rewritten texts. The classification or ranking task is then to fit a piece of text into a pattern. The framework is tested by empirical experiments, and the results from initial experiments show that deep linguistic analysis contributes to solving the problems we address in this thesis. Further experiments show that combining shallow and deep techniques helps improve the classification of plagiarised texts by reducing the number of false negatives. In addition, the experiment on plagiarism direction detection shows that rewritten texts can be identified by statistical and linguistic traits. The conclusions of this study offer ideas for further research directions and potential applications to tackle the challenges that lie ahead in detecting text reuse.
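The threshold-setting approach mentioned above can be illustrated with a minimal sketch: score each source/suspicious pair by word-trigram Jaccard overlap and flag the pair when the score crosses a tuned threshold. The feature (trigram Jaccard) and the threshold value of 0.2 are illustrative assumptions, not the feature set or operating point used in the thesis.

```python
def word_ngrams(text, n=3):
    """Set of word n-grams from a lowercased, whitespace-tokenised text."""
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def jaccard(a, b):
    """Jaccard overlap of two n-gram sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def is_plagiarised(source, suspicious, n=3, threshold=0.2):
    """Threshold-setting decision: flag the pair if overlap reaches the threshold."""
    return jaccard(word_ngrams(source, n), word_ngrams(suspicious, n)) >= threshold
```

A supervised classifier would instead learn the decision boundary from labelled pairs, using this overlap score as one feature among many; the comparison between the two regimes is exactly the thesis's second objective.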

    Code similarity and clone search in large-scale source code data

    Software development benefits tremendously from the Internet: online code corpora enable instant sharing of source code, and online developer guides and documentation are a click away. Nowadays, duplicated code (i.e., code clones) exists not only within or across software projects but also between online code repositories and websites. We call these "online code clones." Like classic code clones between software systems, they can lead to license violations, bug propagation, and reuse of outdated code. Unfortunately, they are difficult to locate and fix, since the search space in online code corpora is large and no longer confined to a local repository. This thesis presents a combined study of code similarity and online code clones. We empirically show that many code snippets on Stack Overflow are cloned from open source projects. Several of them have become outdated or violate their original license, and are possibly harmful to reuse. To develop a solution for finding online code clones, we study various code similarity techniques to gain insights into their strengths and weaknesses. A framework, called OCD, for evaluating code similarity and clone search tools is introduced and used to compare 34 state-of-the-art techniques on pervasively modified code and boiler-plate code. We also find that clone detection techniques can be enhanced by compilation and decompilation. Using the knowledge from the comparison of code similarity analysers, we create and evaluate Siamese, a scalable token-based clone search technique that uses multiple code representations. Our evaluation shows that Siamese scales to large-scale source code data of 365 million lines of code and offers high search precision and recall. Its clone search precision is comparable to that of seven state-of-the-art clone detection tools on the OCD framework. Finally, we demonstrate the usefulness of Siamese by applying the tool to find online code clones, automatically analyse clone licenses, and recommend tests for reuse.
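The idea of token-based clone search via multiple code representations can be shown in miniature: compare two code fragments both as raw tokens and as tokens with identifiers and literals abstracted away, so that type-2 clones (renamed variables) still match. The tokeniser, the keyword list and the averaged trigram overlap below are illustrative assumptions and stand in for Siamese's real normalisation rules and inverted index.

```python
import re

# Illustrative keyword subset; a real normaliser is language-aware.
KEYWORDS = {"for", "while", "if", "return", "int"}

def tokenize(code):
    """Split code into identifiers, integer literals and punctuation tokens."""
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", code)

def normalize(tokens):
    """Second representation: abstract identifiers to ID and literals to LIT."""
    return ["LIT" if t.isdigit()
            else "ID" if t[0].isalpha() and t not in KEYWORDS
            else t
            for t in tokens]

def trigrams(tokens):
    return {tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)}

def similarity(a, b):
    """Average trigram Jaccard overlap across the raw and normalised views."""
    scores = []
    for ra, rb in ((a, b), (normalize(a), normalize(b))):
        ga, gb = trigrams(ra), trigrams(rb)
        union = ga | gb
        scores.append(len(ga & gb) / len(union) if union else 0.0)
    return sum(scores) / len(scores)
```

On a pair of loops that differ only in variable names, the normalised views coincide exactly, so the clone scores far higher than an unrelated fragment.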

    The Future of Information Sciences : INFuture2011 : Information Sciences and e-Society


    Conversion of hydrocarbons to biosurfactants : an insight into the bioprocess optimisation of biosurfactant production using alkanes as inducers

    Surfactants are chemical compounds that are able to alter interfacial properties, particularly surface tension. When they are biologically produced, the term biosurfactant is used. One of the most important groups of biosurfactants is a family of chemical compounds known as glycolipids, whose structure consists of a sugar group and a lipid tail. Glycolipids are subdivided into three main groups: rhamnolipids, sophorolipids and trehalolipids, named following their sugar moieties, respectively rhamnose, trehalose and sophorose. Biosurfactants exhibit attractive advantages over chemical surfactants. Examples of these are biodegradability, low toxicity, and effectiveness at extreme temperature, pH and salinity. The objective of the present research project was, first, to investigate the potential of liquid aliphatic hydrocarbons to induce biosurfactant production by the bacterium Ps. aeruginosa 2Bf isolated based on its ability to metabolise alkanes. The second objective was to optimise biosurfactant production using alkanes as sole carbon and energy source, through optimising the mixing & aeration conditions, media conditions as well as provision of alkane, in a stirred tank batch reactor system. The final objective was to describe the biosurfactant formed. Experiments were organised in three major series: the exploratory shake flask based experiments, the bioreactor-based experiments to optimise biosurfactant production and characterise biokinetics and performance, and the biosurfactant characterisation experiments. Following review of a number of methods, microbial cell counts were selected as the most reproducible measure of biomass formation in the presence of alkanes. The presence of biosurfactant was quantified functionally in terms of the emulsification index and alteration of surface tension. Using a shake flask-based study, nitrogen source was investigated in terms of biomass and biosurfactant synthesis. 
Four pre-selected nitrogen sources were tested in order to select the best for the bioreactor-based study. These nitrogen sources consisted of specific combinations of three nitrogen compounds: NH4NO3, NaNO3 and (NH4)2SO4. During the study, long-chain liquid n-alkanes were used as sole carbon source and the C/N ratio was maintained at 18.6 in mass terms. Results confirmed that both a combination of NO3− and NH4+ ions and a nitrogen source composed solely of NH4+ ions were suitable for biomass growth and biosurfactant production. (NH4)2SO4 was used as the N-source of choice in the remainder of the study. While the C14-C17 alkane cut was the carbon source of interest in the study, two pure alkanes, n-C12 and n-C16, were tested and compared to the C14-C17 blend. The C14-C17 fraction, sourced as an industrial by-product, compared favourably as a carbon source with respect to hexadecane and dodecane. Biosurfactant production was not observed in Ps. aeruginosa 2Bf cultures where glucose was the sole carbon source and the bacteria had not previously been exposed to linear alkanes. Using a mixed carbon source of glucose and alkane, or on pre-exposure of the bacteria to alkane, biosurfactant production was induced. Induction was optimised where alkane was the sole carbon source over a period of four sub-culture steps. In the quantitative optimisation of biosurfactant production through the bioreactor-based study, mixing and aeration were optimised; agitation and aeration proved to be equally important, the first at intermediate rates, the second at lower rates. Their interaction, when maximum biomass was used as the response variable, was found to be important for agitation rates up to 500 rpm. Beyond this range of agitation speed, the interaction between aeration and agitation became negligible. With the emulsification index as the response variable, similar results were obtained with regard to the impact of the interaction between aeration and agitation on the process.
It was significant from lower to intermediate agitation rates, and negligible from intermediate to higher rates of agitation. A lower aeration rate was found to enhance the oxygen utilisation rate, while mass transfer was relatively favoured by a high aeration rate. Regarding the emulsification power of the product, quantitative tests were carried out on the culture suspension, on supernatant prepared by centrifugation, and on supernatant prepared by centrifugation and filtration through 0.22 μm pore-size filters. Results showed that some emulsification effect was lost through centrifugation and filtration. This loss of emulsification effect was more pronounced in the filtration case, showing that some biosurfactant was removed, along with some other material or substance, by sticking to the filter. Foam control was required, and two mechanical foam breakers were compared to an anti-foam reagent. It was experimentally established that mechanical foam breakers are preferable to chemical anti-foam reagents. On comparing the two mechanical foam breakers, the modified two-blade paddle with three slits, FB-2, performed better than the simple two-blade paddle foam breaker, FB-1. Further investigations showed that the interaction between the type of foam control and the agitation rate was negligible throughout the process. The biosurfactant was characterised at the structural level and the antibiotic potential of Ps. aeruginosa 2Bf's biosurfactant was analysed. In addition to thin-layer chromatography, three different spectroscopic methods (mass, infrared and nuclear magnetic resonance) were used to study the chemical structure of the biosurfactant produced. Up to six rhamnolipid structures were tentatively identified with spectrometric analysis, whereas only four to five structures could be detected with thin-layer chromatography. Possession of anti-microbial activity by the rhamnolipids produced was confirmed with the B. subtilis inhibition test.

    Tuberculosis : how different synthetic analogues of pathogen associated mycolates affect lipid homeostasis of murine host macrophages


    Anomaly detection & object classification using multi-spectral LiDAR and sonar

    In this thesis, we present the theory of high-dimensional signal approximation of multi-frequency signals. We also present both linear and non-linear compressive sensing (CS) algorithms that generate encoded representations of time-correlated single photon counting (TCSPC) light detection and ranging (LiDAR) data, side-scan sonar (SSS) and synthetic aperture sonar (SAS). The main contributions of this thesis are summarised as follows:
    1. Research is carried out studying full-waveform (FW) LiDARs, in particular TCSPC data capture, storage and processing.
    2. FW-LiDARs are capable of capturing large quantities of photon-counting data in real time. However, real-time processing of the raw LiDAR waveforms has not been widely exploited. This thesis answers some of the fundamental questions:
    • can semantic information be extracted and encoded from raw multi-spectral FW-LiDAR signals?
    • can these encoded representations then be used for object segmentation and classification?
    3. Research is carried out into signal approximation and compressive sensing techniques, their limitations and their application domains.
    4. Research is also carried out in 3D point cloud processing, combining geometric features with material spectra (a spectral-depth representation) for object segmentation and classification.
    5. Extensive experiments have been carried out with publicly available datasets, e.g. the Washington RGB Image and Depth (RGB-D) dataset [108], the YaleB face dataset [110] (http://vision.ucsd.edu/~leekc/ExtYaleDatabase/), real-world multi-frequency aerial laser scans (ALS) and an underwater multi-frequency (16 wavelengths) TCSPC dataset collected using custom-built targets especially for this thesis. The ALS dataset was captured in collaboration with Carbomap Ltd., Edinburgh, UK; the data was collected during one of the trials in Austria using commercial-off-the-shelf (COTS) sensors.
    6. The multi-spectral measurements were made underwater on targets with different shapes and materials. A novel spectral-depth representation is presented with strong discrimination characteristics on target signatures. Several custom-made and realistically scaled exemplars with known and unknown targets have been investigated using a multi-spectral single photon counting LiDAR system.
    7. In this work, we also present a new approach to peak modelling and classification for waveform-enabled LiDAR systems. Not all existing approaches perform peak modelling and classification simultaneously in real time. This was tested on both simulated waveform-enabled LiDAR data and real ALS data.
    This PhD also led to an industrial secondment at Carbomap, Edinburgh, where some of the waveform modelling algorithms were implemented in C++ and CUDA for Nvidia TX1 boards for real-time performance.
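The thesis's peak models run in C++/CUDA for real-time use; the basic step of locating candidate returns in a photon-count histogram can be sketched as a local-maximum scan with three-point sub-bin refinement. The count threshold and the simulated waveform below are illustrative assumptions, not TCSPC data from the thesis.

```python
def find_peaks(hist, min_height=5):
    """Indices of local maxima at or above min_height in a count histogram."""
    return [i for i in range(1, len(hist) - 1)
            if hist[i] >= min_height
            and hist[i] > hist[i - 1]
            and hist[i] >= hist[i + 1]]

def refine(hist, i):
    """Parabolic (three-point) refinement of a peak position to sub-bin accuracy."""
    left, centre, right = hist[i - 1], hist[i], hist[i + 1]
    denom = left - 2 * centre + right
    return i if denom == 0 else i + 0.5 * (left - right) / denom

# Simulated two-return waveform: photon counts per time bin.
waveform = [0, 1, 2, 9, 3, 1, 0, 6, 2, 0]
```

Here find_peaks(waveform) reports the two returns at bins 3 and 7, and refine nudges each position towards its heavier neighbouring bin; a full system would fit a peak model (e.g. a Gaussian) to each return and classify it.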

    Tracking the Temporal-Evolution of Supernova Bubbles in Numerical Simulations

    The study of low-dimensional, noisy manifolds embedded in a higher-dimensional space has been extremely useful in many applications, from the chemical analysis of multi-phase flows to simulations of galactic mergers. Building a probabilistic model of the manifolds has helped in describing their essential properties and how they vary in space. However, when the manifold is evolving through time, joint spatio-temporal modelling is needed in order to fully comprehend its nature. We propose a first-order Markovian process that propagates the spatial probabilistic model of a manifold at a fixed time to its adjacent temporal stages. The proposed methodology is demonstrated using a particle simulation of an interacting dwarf galaxy to describe the evolution of a cavity generated by a Supernova.
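A deliberately simplified one-dimensional sketch of the first-order Markovian idea: fit a Gaussian to the particle positions at each snapshot, blending it with the previous stage's model as a prior, so that the model at time t depends only on the model at time t-1. The blending weight and the single-coordinate Gaussian are illustrative assumptions; the thesis propagates a full spatial probabilistic model of the manifold.

```python
def fit_gaussian(samples):
    """Moment-match a 1-D Gaussian (mean, variance) to particle positions."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((x - mean) ** 2 for x in samples) / n
    return mean, var

def propagate(prev_model, samples, alpha=0.3):
    """First-order Markov update: blend the new fit with the previous stage only."""
    mean, var = fit_gaussian(samples)
    if prev_model is None:
        return mean, var
    prev_mean, prev_var = prev_model
    return (alpha * prev_mean + (1 - alpha) * mean,
            alpha * prev_var + (1 - alpha) * var)

# Three particle snapshots of a drifting structure; the model tracks the drift
# while each stage depends only on its immediate predecessor.
model = None
for snapshot in ([0.0, 1.0, 2.0], [0.5, 1.5, 2.5], [1.0, 2.0, 3.0]):
    model = propagate(model, snapshot)
```

The smoothed mean lags slightly behind the latest snapshot, which is the intended effect of using the previous stage as a prior.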