167 research outputs found

    Tracking the Temporal-Evolution of Supernova Bubbles in Numerical Simulations

    The study of low-dimensional, noisy manifolds embedded in a higher dimensional space has been extremely useful in many applications, from the chemical analysis of multi-phase flows to simulations of galactic mergers. Building a probabilistic model of the manifolds has helped in describing their essential properties and how they vary in space. However, when the manifold is evolving through time, a joint spatio-temporal modelling is needed, in order to fully comprehend its nature. We propose a first-order Markovian process that propagates the spatial probabilistic model of a manifold at fixed time, to its adjacent temporal stages. The proposed methodology is demonstrated using a particle simulation of an interacting dwarf galaxy to describe the evolution of a cavity generated by a Supernov

    A Machine Learning Approach for Plagiarism Detection

    Plagiarism detection is gaining increasing importance due to requirements for integrity in education. The existing research has investigated the problem of plagiarism detection with a varying degree of success. The literature revealed that there are two main methods for detecting plagiarism, namely extrinsic and intrinsic. This thesis has developed two novel approaches to address both of these methods. Firstly, a novel extrinsic method for detecting plagiarism is proposed. The method is based on four well-known techniques, namely Bag of Words (BOW), Latent Semantic Analysis (LSA), Stylometry and Support Vector Machines (SVM). The LSA application was fine-tuned to take in the stylometric features (most common words) in order to characterise the document authorship as described in chapter 4. The results revealed that LSA-based stylometry has outperformed the traditional LSA application. Support vector machine based algorithms were used to perform the classification procedure in order to predict which author has written a particular book being tested. The proposed method has successfully addressed the limitations of semantic characteristics and identified the document source by assigning the book being tested to the right author in most cases. Secondly, the intrinsic detection method has relied on the use of the statistical properties of the most common words. LSA was applied in this method to a group of most common words (MCWs) to extract their usage patterns based on the transitivity property of LSA. The feature sets of the intrinsic model were based on the frequency of the most common words, their relative frequencies in series, and the deviation of these frequencies across all books for a particular author. The intrinsic method aims to generate a model of author “style” by revealing a set of certain features of authorship. 
The model’s generation procedure focuses on just one author as an attempt to summarise aspects of an author’s style in a definitive and clear-cut manner. The thesis has also proposed a novel experimental methodology for testing the performance of both extrinsic and intrinsic methods for plagiarism detection. This methodology relies upon the CEN (Corpus of English Novels) training dataset, but divides that dataset up into training and test datasets in a novel manner. Both approaches have been evaluated using the well-known leave-one-out-cross-validation method. Results indicated that by integrating deep analysis (LSA) and Stylometric analysis, hidden changes can be identified whether or not a reference collection exists

    Modeling Non-Standard Text Classification Tasks

    Text classification deals with discovering knowledge in texts and is used for extracting, filtering, or retrieving information in streams and collections. The discovery of knowledge is operationalized by modeling text classification tasks, which is mainly a human-driven engineering process. The outcome of this process, a text classification model, is used to inductively learn a text classification solution from a priori classified examples. The building blocks of modeling text classification tasks cover four aspects: (1) the way examples are represented, (2) the way examples are selected, (3) the way classifiers learn from examples, and (4) the way models are selected. This thesis proposes methods that improve the prediction quality of text classification solutions for unseen examples, especially for non-standard tasks where standard models do not fit. The original contributions are related to the aforementioned building blocks: (1) Several topic-orthogonal text representations are studied in the context of non-standard tasks and a new representation, namely co-stems, is introduced. (2) A new active learning strategy that goes beyond standard sampling is examined. (3) A new one-class ensemble for improving the effectiveness of one-class classification is proposed. (4) A new model selection framework to cope with subclass distribution shifts that occur in dynamic environments is introduced

    Improved techniques for phishing email detection based on random forest and firefly-based support vector machine learning algorithms.

    Master of Science in Computer Science. University of KwaZulu-Natal, Durban, 2014. Electronic fraud is one of the major challenges faced by the vast majority of online internet users today. Curbing this menace is not an easy task, primarily because of the rapid rate at which fraudsters change their mode of attack. Many techniques have been proposed in the academic literature to handle e-fraud. Some of them include: blacklist, whitelist, and machine learning (ML) based techniques. Among all these techniques, ML-based techniques have proven to be the most efficient, because of their ability to detect new fraudulent attacks as they appear. There are three commonly perpetrated electronic frauds, namely: email spam, phishing and network intrusion. Among these three, more financial loss has been incurred owing to phishing attacks. This research investigates and reports the use of ML and nature-inspired techniques in the domain of phishing detection, with the foremost objective of developing a dynamic and robust phishing email classifier with improved classification accuracy and reduced processing time. Two approaches to phishing email detection are proposed, and two email classifiers are developed based on the proposed approaches. In the first approach, a random forest algorithm is used to construct decision trees, which are, in turn, used for email classification. The second approach introduces a novel ML method that hybridizes the firefly algorithm (FFA) and support vector machine (SVM). The hybridized method consists of three major stages: feature extraction phase, hyper-parameter selection phase and email classification phase. In the feature extraction phase, the feature vectors of all the features described in Section 3.6 are extracted and saved in a file for easy access. In the second stage, a novel hyper-parameter search algorithm, developed in this research, is used to generate an exponentially growing sequence of paired C and gamma (γ) values. 
FFA is then used to optimize the generated SVM hyper-parameters and to also find the best hyper-parameter pair. Finally, in the third phase, SVM is used to carry out the classification. This new approach addresses the problem of hyper-parameter optimization in SVM, and in turn, improves the classification speed and accuracy of SVM. Using two publicly available email datasets, some experiments are performed to evaluate the performance of the two proposed phishing email detection techniques. During the evaluation of each approach, a set of features (well suited for phishing detection) is extracted from the training dataset and used to construct the classifiers. Thereafter, the trained classifiers are evaluated on the test dataset. The evaluations produced very good results. The RF-based classifier yielded a classification accuracy of 99.70%, a FP rate of 0.06% and a FN rate of 2.50%. Also, the hybridized classifier (known as FFA_SVM) produced a classification accuracy of 99.99%, a FP rate of 0.01% and a FN rate of 0.00%
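
To make the idea of an exponentially growing sequence of paired C and gamma values concrete, here is a minimal sketch. It is not the thesis's search algorithm, which the abstract does not specify. The function name `exponential_grid`, the base of 2, and the exponent ranges are assumptions chosen for illustration (the ranges follow a commonly used coarse SVM grid), and the firefly optimisation step is not shown.

```python
from itertools import product


def exponential_grid(c_exponents, gamma_exponents, base=2.0):
    """Return every (C, gamma) pair where each value is base ** exponent.

    Hypothetical helper: the thesis's own generator is not described in
    the abstract, so this only illustrates exponential spacing.
    """
    c_values = [base ** e for e in c_exponents]
    gamma_values = [base ** e for e in gamma_exponents]
    return list(product(c_values, gamma_values))


# Assumed ranges: C in 2^-5 .. 2^15 and gamma in 2^-15 .. 2^3, step 2.
pairs = exponential_grid(range(-5, 16, 2), range(-15, 4, 2))

for c, gamma in pairs[:3]:
    print(f"C={c:g}, gamma={gamma:g}")
```

Each pair could then be scored by cross-validated SVM accuracy, with a metaheuristic such as FFA refining the candidates instead of evaluating the whole grid.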

    Automatic text summarisation using linguistic knowledge-based semantics

    Text summarisation is reducing a text document to a short substitute summary. Since the commencement of the field, almost all summarisation research works implemented to this date involve identification and extraction of the most important document/cluster segments, called extraction. This typically involves scoring each document sentence according to a composite scoring function consisting of surface level and semantic features. Enabling machines to analyse text features and understand their meaning potentially requires both text semantic analysis and equipping computers with an external semantic knowledge. This thesis addresses extractive text summarisation by proposing a number of semantic and knowledge-based approaches. The work combines the high-quality semantic information in WordNet, the crowdsourced encyclopaedic knowledge in Wikipedia, and the manually crafted categorial variation in CatVar, to improve the summary quality. Such improvements are accomplished through sentence level morphological analysis and the incorporation of Wikipedia-based named-entity semantic relatedness while using heuristic algorithms. The study also investigates how sentence-level semantic analysis based on semantic role labelling (SRL), leveraged with a background world knowledge, influences sentence textual similarity and text summarisation. The proposed sentence similarity and summarisation methods were evaluated on standard publicly available datasets such as the Microsoft Research Paraphrase Corpus (MSRPC), TREC-9 Question Variants, and the Document Understanding Conference 2002, 2005, 2006 (DUC 2002, DUC 2005, DUC 2006) Corpora. The project also uses Recall-Oriented Understudy for Gisting Evaluation (ROUGE) for the quantitative assessment of the proposed summarisers’ performances. Results of our systems showed their effectiveness as compared to related state-of-the-art summarisation methods and baselines. 
Of the proposed summarisers, the SRL Wikipedia-based system demonstrated the best performance
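
For readers unfamiliar with ROUGE, the sketch below computes ROUGE-N recall in its simplest form: the fraction of reference n-grams that also occur in the candidate summary, with counts clipped. It assumes a single reference and whitespace tokenisation and omits stemming and stop-word handling, so it is an illustration and not the official ROUGE toolkit used in the evaluations. The name `rouge_n_recall` is hypothetical.

```python
from collections import Counter


def rouge_n_recall(candidate, reference, n=1):
    """Clipped n-gram recall of a candidate against one reference.

    Both arguments are lists of tokens. Simplified illustration only.
    """
    def counts(tokens):
        return Counter(tuple(tokens[i:i + n])
                       for i in range(len(tokens) - n + 1))

    cand, ref = counts(candidate), counts(reference)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    # Each reference n-gram is matched at most as often as it
    # appears in the candidate.
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total


reference = "the cat sat on the mat".split()
candidate = "the cat lay on the mat".split()
print(rouge_n_recall(candidate, reference, n=1))
print(rouge_n_recall(candidate, reference, n=2))
```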

    Software similarity and classification

    This thesis analyses software programs in the context of their similarity to other software programs. Applications proposed and implemented include detecting malicious software and discovering security vulnerabilities

    Source code authorship attribution

    To attribute authorship means to identify the true author among many candidates for samples of work of unknown or contentious authorship. Authorship attribution is a prolific research area for natural language, but much less so for source code, with eight other research groups having published empirical results concerning the accuracy of their approaches to date. Authorship attribution of source code is the focus of this thesis. We first review, reimplement, and benchmark all existing published methods to establish a consistent set of accuracy scores. This is done using four newly constructed and significant source code collections comprising samples from academic sources, freelance sources, and multiple programming languages. The collections developed are the most comprehensive to date in the field. We then propose a novel information retrieval method for source code authorship attribution. In this method, source code features from the collection samples are tokenised, converted into n-grams, and indexed for stylistic comparison to query samples using the Okapi BM25 similarity measure. Authorship of the top ranked sample is used to classify authorship of each query, and the proportion of times that this is correct determines overall accuracy. The results show that this approach is more accurate than the best approach from the previous work for three of the four collections. The accuracy of the new method is then explored in the context of author style evolving over time, by experimenting with a collection of student programming assignments that spans three semesters with established relative timestamps. We find that it takes one full semester for individual coding styles to stabilise, which is essential knowledge for ongoing authorship attribution studies and quality control in general. We conclude the research by extending both the new information retrieval method and previous methods to provide a complete set of benchmarks for advancing the field. 
In the final evaluation, we show that the n-gram approaches are leading the field, with accuracy scores for some collections around 90% for a one-in-ten classification problem
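
The retrieval-style attribution described above (tokenise, form n-grams, rank indexed samples with Okapi BM25, take the author of the top hit) can be sketched as follows. This is a simplified illustration, not the thesis's implementation: the tokens are plain whitespace splits, the n-gram size, the `k1` and `b` values, and the non-negative IDF variant are assumptions, query term frequency is ignored, and the names `ngrams`, `bm25_scores` and `attribute` are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the list of contiguous token n-grams as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document (a list of terms) against the query terms."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs or 1.0
    doc_freq = Counter()
    for d in docs:
        doc_freq.update(set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in set(query):
            if term not in tf:
                continue
            # Non-negative IDF variant (an assumption of this sketch).
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5)
                           / (doc_freq[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


def attribute(query_tokens, samples, n=3):
    """Return the author of the top-ranked sample.

    samples is a list of (author, tokens) pairs.
    """
    docs = [ngrams(tokens, n) for _, tokens in samples]
    scores = bm25_scores(ngrams(query_tokens, n), docs)
    best = max(range(len(samples)), key=scores.__getitem__)
    return samples[best][0]


samples = [
    ("alice", "for ( int i = 0 ; i < n ; i ++ ) { sum += i ; }".split()),
    ("bob", "while ( k < n ) { total = total + k ; k = k + 1 ; }".split()),
]
query = "for ( int j = 0 ; j < m ; j ++ ) { acc += j ; }".split()
print(attribute(query, samples))
```

Accuracy in the sense used above would be the proportion of queries for which `attribute` returns the true author.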

    Intelligent instance selection techniques for support vector machine speed optimization with application to e-fraud detection.

    Doctor of Philosophy in Computer Science. University of KwaZulu-Natal, Durban 2017. Decision-making is a very important aspect of many businesses. There are grievous penalties involved in wrong decisions, including financial loss, damage of company reputation and reduction in company productivity. Hence, it is of dire importance that managers make the right decisions. Machine Learning (ML) simplifies the process of decision making: it helps to discover useful patterns from historical data, which can be used for meaningful decision-making. The ability to make strategic and meaningful decisions is dependent on the reliability of data. Currently, many organizations are overwhelmed with vast amounts of data, and unfortunately, ML algorithms cannot effectively handle large datasets. This thesis therefore proposes seven filter-based and five wrapper-based intelligent instance selection techniques for optimizing the speed and predictive accuracy of ML algorithms, with a particular focus on Support Vector Machine (SVM). Also, this thesis proposes a novel fitness function for instance selection. The primary difference between the filter-based and wrapper-based techniques is in their method of selection. The filter-based techniques utilize the proposed fitness function for selection, while the wrapper-based techniques utilize the SVM algorithm for selection. The proposed techniques are obtained by fusing the SVM algorithm with the following nature-inspired algorithms: flower pollination algorithm, social spider algorithm, firefly algorithm, cuckoo search algorithm and bat algorithm. Also, two of the filter-based techniques are boundary detection algorithms, inspired by edge detection in image processing and edge selection in ant colony optimization. Two different sets of experiments were performed in order to evaluate the performance of the proposed techniques (wrapper-based and filter-based). 
All experiments were performed on four datasets containing three popular e-fraud types: credit card fraud, email spam and phishing email. In addition, experiments were performed on 20 datasets provided by the well-known UCI data repository. The results show that the proposed filter-based techniques excellently improved SVM training speed in 100% (24 out of 24) of the datasets used for evaluation, without significantly affecting SVM classification quality. Moreover, experimental results also show that the wrapper-based techniques consistently improved SVM predictive accuracy in 78% (18 out of 23) of the datasets used for evaluation and simultaneously improved SVM training speed in all cases. Furthermore, two different statistical tests were conducted to further validate the credibility of the results: Friedman’s test and Holm’s post-hoc test. The statistical test results reveal that the proposed filter-based and wrapper-based techniques are significantly faster, compared to standard SVM and some existing instance selection techniques, in all cases. Moreover, statistical test results also reveal that the Cuckoo Search Instance Selection Algorithm outperforms all the other proposed techniques in terms of speed. Overall, the proposed techniques have proven to be fast and accurate ML-based e-fraud detection techniques, with improved training speed, predictive accuracy and storage reduction. In real-life applications, such as video surveillance and intrusion detection systems, that require a classifier to be trained very quickly for speedy classification of new target concepts, the filter-based techniques provide the best solutions; while the wrapper-based techniques are better suited for applications, such as email filters, that are very sensitive to slight changes in predictive accuracy

    Machine Learning and Security of Non-Executable Files

    Computer malware is a well-known threat in security which, despite the enormous time and effort invested in fighting it, is today more prevalent than ever. Recent years have brought a surge in one particular type: malware embedded in non-executable file formats, e.g., PDF, SWF and various office file formats. The result has been a massive number of infections, owed primarily to the trust that ordinary computer users have in these file formats. In addition, their feature-richness and implementation complexity have created enormous attack surfaces in widely deployed client software, resulting in regular discoveries of new vulnerabilities. The traditional approach to malware detection – signature matching, heuristics and behavioral profiling – has from its inception been a labor-intensive manual task, always lagging one step behind the attacker. With the exponential growth of computers and networks, malware has become more diverse, wide-spread and adaptive than ever, scaling much faster than the available talent pool of human malware analysts. An automated and scalable approach is needed to fill the gap between automated malware adaptation and manual malware detection, and machine learning is emerging as a viable solution. Its branch called adversarial machine learning studies the security of machine learning algorithms and the special conditions that arise when machine learning is applied for security. This thesis is a study of adversarial machine learning in the context of static detection of malware in non-executable file formats. It evaluates the effectiveness, efficiency and security of machine learning applications in this context. To this end, it introduces 3 data-driven detection methods developed using very large, high quality datasets. PJScan detects malicious PDF files based on lexical properties of embedded JavaScript code and is the fastest method published to date. 
SL2013 extends its coverage to all PDF files, regardless of JavaScript presence, by analyzing the hierarchical structure of PDF logical building blocks and demonstrates excellent performance in a novel long-term realistic experiment. Finally, Hidost generalizes the hierarchical-structure-based feature set to become the first machine-learning-based malware detector operating on multiple file formats. In a comprehensive experimental evaluation on PDF and SWF, it outperforms other academic methods and commercial antivirus systems in detection effectiveness. Furthermore, the thesis presents a framework for security evaluation of machine learning classifiers in a case study performed on an independent PDF malware detector. The results show that the ability to manipulate a part of the classifier’s feature set allows a malicious adversary to disguise malware so that it appears benign to the classifier with a high success rate. The presented methods are released as open-source software.

    A Corpus Driven Computational Intelligence Framework for Deception Detection in Financial Text

    Financial fraud rampages onwards seemingly uncontained. The annual cost of fraud in the UK is estimated to be as high as £193bn a year [1]. Taking a data science perspective that has hitherto been less explored, this thesis demonstrates how the use of linguistic features to drive data mining algorithms can aid in unravelling fraud. To this end, the spotlight is turned on Financial Statement Fraud (FSF), known to be the costliest type of fraud [2]. A new corpus of 6.3 million words is composed of 102 annual reports/10-K (narrative sections) from firms formally indicted for FSF juxtaposed with 306 non-fraud firms of similar size and industrial grouping. Unlike other similar studies, this thesis takes a wide-angled view and extracts a range of features of different categories from the corpus. These linguistic correlates of deception are uncovered using a variety of techniques and tools. Corpus linguistics methodology is applied to extract keywords and to examine linguistic structure. N-grams are extracted to draw out collocations. Readability measurement in financial text is advanced through the extraction of new indices that probe the text at a deeper level. Cognitive and perceptual processes are also picked out. Tone, intention and liquidity are gauged using customised word lists. Linguistic ratios are derived from grammatical constructs and word categories. An attempt is also made to determine ‘what’ was said as opposed to ‘how’. Further, a new module is developed to condense synonyms into concepts. Lastly, frequency counts from keywords unearthed from a previous content analysis study on financial narrative are also used. These features are then used to drive machine learning based classification and clustering algorithms to determine if they aid in discriminating a fraud from a non-fraud firm. The results derived from the battery of models built typically exceed classification accuracy of 70%. The above process is amalgamated into a framework. 
The process outlined, driven by empirical data, demonstrates in a practical way how linguistic analysis could aid in fraud detection and also constitutes a unique contribution made to deception detection studies