1,361 research outputs found

    funcGNN: A Graph Neural Network Approach to Program Similarity

    Full text link
    Program similarity is a fundamental concept, central to the solution of software engineering tasks such as software plagiarism, clone identification, code refactoring and code search. Accurate similarity estimation between programs requires an in-depth understanding of their structure, semantics and flow. A control flow graph (CFG), is a graphical representation of a program which captures its logical control flow and hence its semantics. A common approach is to estimate program similarity by analysing CFGs using graph similarity measures, e.g. graph edit distance (GED). However, graph edit distance is an NP-hard problem and computationally expensive, making the application of graph similarity techniques to complex software programs impractical. This study intends to examine the effectiveness of graph neural networks to estimate program similarity, by analysing the associated control flow graphs. We introduce funcGNN, which is a graph neural network trained on labeled CFG pairs to predict the GED between unseen program pairs by utilizing an effective embedding vector. To our knowledge, this is the first time graph neural networks have been applied on labeled CFGs for estimating the similarity between high-level language programs. Results: We demonstrate the effectiveness of funcGNN to estimate the GED between programs and our experimental analysis demonstrates how it achieves a lower error rate (0.00194), with faster (23 times faster than the quickest traditional GED approximation method) and better scalability compared with the state of the art methods. funcGNN posses the inductive learning ability to infer program structure and generalise to unseen programs. The graph embedding of a program proposed by our methodology could be applied to several related software engineering problems (such as code plagiarism and clone identification) thus opening multiple research directions.Comment: 11 pages, 8 figures, 3 table

    FLASH: Randomized Algorithms Accelerated over CPU-GPU for Ultra-High Dimensional Similarity Search

    Full text link
    We present FLASH (\textbf{F}ast \textbf{L}SH \textbf{A}lgorithm for \textbf{S}imilarity search accelerated with \textbf{H}PC), a similarity search system for ultra-high dimensional datasets on a single machine, that does not require similarity computations and is tailored for high-performance computing platforms. By leveraging a LSH style randomized indexing procedure and combining it with several principled techniques, such as reservoir sampling, recent advances in one-pass minwise hashing, and count based estimations, we reduce the computational and parallelization costs of similarity search, while retaining sound theoretical guarantees. We evaluate FLASH on several real, high-dimensional datasets from different domains, including text, malicious URL, click-through prediction, social networks, etc. Our experiments shed new light on the difficulties associated with datasets having several million dimensions. Current state-of-the-art implementations either fail on the presented scale or are orders of magnitude slower than FLASH. FLASH is capable of computing an approximate k-NN graph, from scratch, over the full webspam dataset (1.3 billion nonzeros) in less than 10 seconds. Computing a full k-NN graph in less than 10 seconds on the webspam dataset, using brute-force (n2Dn^2D), will require at least 20 teraflops. We provide CPU and GPU implementations of FLASH for replicability of our results

    Metaheuristic-Based Neural Network Training And Feature Selector For Intrusion Detection

    Get PDF
    Intrusion Detection (ID) in the context of computer networks is an essential technique in modern defense-in-depth security strategies. As such, Intrusion Detection Systems (IDSs) have received tremendous attention from security researchers and professionals. An important concept in ID is anomaly detection, which amounts to the isolation of normal behavior of network traffic from abnormal (anomaly) events. This isolation is essentially a classification task, which led researchers to attempt the application of well-known classifiers from the area of machine learning to intrusion detection. Neural Networks (NNs) are one of the most popular techniques to perform non-linear classification, and have been extensively used in the literature to perform intrusion detection. However, the training datasets usually compose feature sets of irrelevant or redundant information, which impacts the performance of classification, and traditional learning algorithms such as backpropagation suffer from known issues, including slow convergence and the trap of local minimum. Those problems lend themselves to the realm of optimization. Considering the wide success of swarm intelligence methods in optimization problems, the main objective of this thesis is to contribute to the improvement of intrusion detection technology through the application of swarm-based optimization techniques to the basic problems of selecting optimal packet features, and optimal training of neural networks on classifying those features into normal and attack instances. To realize these objectives, the research in this thesis follows three basic stages, succeeded by extensive evaluations

    Intelligent Match Merging to Prevent Obfuscation Attacks on Software Plagiarism Detectors

    Get PDF
    Aufgrund der steigenden Anzahl der Informatikstudierenden verlassen sich Dozenten auf aktuelle Werkzeuge zur Erkennung von Quelltextplagiaten, um zu verhindern, dass Studierende plagiierte Programmieraufgaben einreichen. Während diese auf Token basierenden Plagiatsdetektoren inhärent resilient gegen einfache Verschleierungen sind, ermöglichen kürzlich veröffentlichte Verschleierungswerkzeuge den Studierenden, ihre Abgaben mühelos zu ändern, um die Erkennung zu umgehen. Der Vormarsch von ChatGPT hat zusätzliche Bedenken hinsichtlich seiner Verschleierungsfähigkeiten und der Notwendigkeit wirksamer Gegenstrategien aufgeworfen. Bestehende Verteidigungsmechanismen gegen Verschleierung sind oft durch ihre Spezifität für bestimmte Angriffe oder ihre Abhängigkeit von Programmiersprachen begrenzt, was eine mühsame und fehleranfällige Neuimplementierung erfordert. Als Antwort auf diese Herausforderung führt diese Arbeit einen neuartigen Verteidigungsmechanismus gegen automatische Verschleierungsangriffe namens Match-Zusammenführung ein. Er macht sich die Tatsache zunutze, dass Verschleierungsangriffe die Token-Sequenz ändern, um Übereinstimmungen zwischen zwei Abgaben aufzuspalten, sodass die gebrochenen Übereinstimmungen vom Plagiatsdetektor verworfen werden. Match-Zusammenführung macht die Auswirkungen dieser Angriffe rückgängig, indem benachbarte Übereinstimmungen auf der Grundlage einer Heuristik intelligent zusammengeführt werden, um falsch positive Ergebnisse zu minimieren. Die Widerstandsfähigkeit unserer Methode gegen klassische Verschleierungsangriffe wird durch Evaluationen anhand verschiedener realer Datensätze, einschließlich Studienarbeiten und Programmierwettbewerbe, in sechs verschiedenen Angriffsszenarien demonstriert. Darüber hinaus verbessert sie die Erkennungsleistung gegen KI-basierte Verschleierung signifikant. Was diesen Mechanismus auszeichnet, ist seine Unabhängigkeit von Sprache und Angriff, während sein minimaler Laufzeit-Aufwand ihn nahtlos mit anderen Verteidigungsmechanismen kompatibel macht

    A lightweight, graph-theoretic model of class-based similarity to support object-oriented code reuse.

    Get PDF
    The work presented in this thesis is principally concerned with the development of a method and set of tools designed to support the identification of class-based similarity in collections of object-oriented code. Attention is focused on enhancing the potential for software reuse in situations where a reuse process is either absent or informal, and the characteristics of the organisation are unsuitable, or resources unavailable, to promote and sustain a systematic approach to reuse. The approach builds on the definition of a formal, attributed, relational model that captures the inherent structure of class-based, object-oriented code. Based on code-level analysis, it relies solely on the structural characteristics of the code and the peculiarly object-oriented features of the class as an organising principle: classes, those entities comprising a class, and the intra and inter-class relationships existing between them, are significant factors in defining a two-phase similarity measure as a basis for the comparison process. Established graph-theoretic techniques are adapted and applied via this model to the problem of determining similarity between classes. This thesis illustrates a successful transfer of techniques from the domains of molecular chemistry and computer vision. Both domains provide an existing template for the analysis and comparison of structures as graphs. The inspiration for representing classes as attributed relational graphs, and the application of graph-theoretic techniques and algorithms to their comparison, arose out of a well-founded intuition that a common basis in graph-theory was sufficient to enable a reasonable transfer of these techniques to the problem of determining similarity in object-oriented code. The practical application of this work relates to the identification and indexing of instances of recurring, class-based, common structure present in established and evolving collections of object-oriented code. A classification so generated additionally provides a framework for class-based matching over an existing code-base, both from the perspective of newly introduced classes, and search "templates" provided by those incomplete, iteratively constructed and refined classes associated with current and on-going development. The tools and techniques developed here provide support for enabling and improving shared awareness of reuse opportunity, based on analysing structural similarity in past and ongoing development, tools and techniques that can in turn be seen as part of a process of domain analysis, capable of stimulating the evolution of a systematic reuse ethic

    Tracking the Temporal-Evolution of Supernova Bubbles in Numerical Simulations

    Get PDF
    The study of low-dimensional, noisy manifolds embedded in a higher dimensional space has been extremely useful in many applications, from the chemical analysis of multi-phase flows to simulations of galactic mergers. Building a probabilistic model of the manifolds has helped in describing their essential properties and how they vary in space. However, when the manifold is evolving through time, a joint spatio-temporal modelling is needed, in order to fully comprehend its nature. We propose a first-order Markovian process that propagates the spatial probabilistic model of a manifold at fixed time, to its adjacent temporal stages. The proposed methodology is demonstrated using a particle simulation of an interacting dwarf galaxy to describe the evolution of a cavity generated by a Supernov

    Scalable and Language-Independent Embedding-based Approach for Plagiarism Detection Considering Obfuscation Type: No Training Phase

    Full text link
    [EN] The efficiency and scalability of plagiarism detection systems have become a major challenge due to the vast amount of available textual data in several languages over the Internet. Plagiarism occurs in different levels of obfuscation, ranging from the exact copy of original materials to text summarization. Consequently, designed algorithms to detect plagiarism should be robust to the diverse languages and different type of obfuscation in plagiarism cases. In this paper, we employ text embedding vectors to compare similarity among documents to detect plagiarism. Word vectors are combined by a simple aggregation function to represent a text document. This representation comprises semantic and syntactic information of the text and leads to efficient text alignment among suspicious and original documents. By comparing representations of sentences in source and suspicious documents, pair sentences with the highest similarity are considered as the candidates or seeds of plagiarism cases. To filter and merge these seeds, a set of parameters, including Jaccard similarity and merging threshold, are tuned by two different approaches: offline tuning and online tuning. The offline method, which is used as the benchmark, regulates a unique set of parameters for all types of plagiarism by several trials on the training corpus. Experiments show improvements in performance by considering obfuscation type during threshold tuning. In this regard, our proposed online approach uses two statistical methods to filter outlier candidates automatically by their scale of obfuscation. By employing the online tuning approach, no distinct training dataset is required to train the system. We applied our proposed method on available datasets in English, Persian and Arabic languages on the text alignment task to evaluate the robustness of the proposed methods from the language perspective as well. As our experimental results confirm, our efficient approach can achieve considerable performance on the different datasets in various languages. Our online threshold tuning approach without any training datasets works as well as, or even in some cases better than, the training-base method.The work of Paolo Rosso was partially funded by the Spanish MICINN under the research Project MISMIS-FAKEn-HATE on Misinformation and Miscommunication in social media: FAKE news and HATE speech (PGC2018-096212-B-C31).Gharavi, E.; Veisi, H.; Rosso, P. (2020). Scalable and Language-Independent Embedding-based Approach for Plagiarism Detection Considering Obfuscation Type: No Training Phase. Neural Computing and Applications. 32(14):10593-10607. https://doi.org/10.1007/s00521-019-04594-yS1059310607321

    Code clone detection in obfuscated Android apps

    Get PDF
    The Android operating system has long become one of the main global smartphone operating systems. Both developers and malware authors often reuse code to expedite the process of creating new apps and malware samples. Code cloning is the most common way of reusing code in the process of developing Android apps. Finding code clones through the analysis of Android binary code is a challenging task that becomes more sophisticated when instances of code reuse are non-contiguous, reordered, or intertwined with other code. We introduce an approach for detecting cloned methods as well as small and non-contiguous code clones in obfuscated Android applications by simulating the execution of Android apps and then analyzing the subsequent execution traces. We first validate our approach’s ability on finding different types of code clones on 20 injected clones. Next we validate the resistance of our approach against obfuscation by comparing its results on a set of 1085 apps before and after code obfuscation. We obtain 78-87% similarity between the finding from non-obfuscated applications and four sets of obfuscated applications. We also investigated the presence of code clones among 1603 Android applications. We were able to find 44,776 code clones where 34% of code clones were seen from different applications and the rest are among different versions of an application. We also performed a comparative analysis between the clones found by our approach and the clones detected by Nicad on the source code of applications. Finally, we show a practical application of our approach for detecting variants of Android banking malware. Among 60,057 code clone clusters that are found among a dataset of banking malware, 92.9% of them were unique to one malware family or benign applications
    corecore