904 research outputs found

    Document Fingerprinting Using Graph Grammar Induction

    The purpose of this study was to detect the similarity between documents when the relationships between textures are considered. We focus on C-language documents as our domain. Our algorithm starts by converting a document into graph format. Next, a graph grammar is extracted from the graph by SubdueGL, a graph grammar induction algorithm. Finally, the similarity between documents is evaluated by comparing the graph grammars. We also study graph characteristics, graph grammars, and graph isomorphism. In the conversion module, documents are translated into graph format, which can be defined differently in various domains. For C-language documents, we found that a conceptual graph, which is the most expressive because it considers the relationships between textures, performs best in detecting similarity. Thus, our algorithm generates this conceptual graph. The evaluation shows that our algorithm detects the similarity between documents well. However, it cannot indicate whether the similarity found is texture similarity or structure similarity, because our process combines the two in its final result. Nevertheless, compared to other algorithms, our approach works well when relationships between textures are considered. Computer Science Departmen

    Comparative Study of Existing Plagiarism Detection Systems and Design of a New Web Based System

    The purpose of this study was to compare and contrast existing plagiarism detection systems. In addition, a new web-based plagiarism detection system was developed. The new system uses a graph-based data-mining algorithm, SubdueGL, and a graph grammar matching algorithm, SubMatch. The aim of the new system was to see whether the graph grammar data-mining technique can produce a more reliable system that overcomes some of the weaknesses discovered in the existing systems. The existing systems were found to have some limitations due to the algorithms used and their system design. The new system was found to give reliable results when compared with them. The results were superior in some cases because it can detect similar patterns between documents even if there is some structural dissimilarity. The interface provided is web-based and easy to use. Computer Science Departmen

    Software Plagiarism Detection Using Abstract Syntax Tree and Graph-based Data Mining

    This study uses a graph-based data mining technique to discover cases of software plagiarism. We hypothesize that repetitive patterns found in the abstract syntax tree (AST) representation of source code will only match such patterns in other source code if the author of both is the same. A graph-based data mining technique was used to analyze the AST and extract the patterns. The results from the data miner were compared using a graph matching algorithm, which provided the measure of similarity. We used artificial test sets and actual student assignments for evaluation. The experiments identified plagiarism behaviors in both artificial and real-world data. These findings showed the system to be feasible. The system can be applied to any programming language that uses abstract syntax trees for compilation, and these ASTs can easily be extracted using the compiler. An advantage of this system over other plagiarism detectors is that it can deal with partial source code plagiarism, which others currently do not. Disadvantages of our approach include slow speed, caused by the graph-based data mining system used, and dependence on compilers to provide the AST. Also, if source code cannot be compiled, the compiler will not provide a full AST, and the results will be inaccurate. Computer Science Departmen
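As a rough illustration of comparing programs through their ASTs, the sketch below counts parent-to-child node-type pairs and compares the two multisets. It is not the SubdueGL-based method described in the abstract: the edge-count comparison is a much simpler stand-in, Python's standard `ast` module replaces a compiler-provided AST, and the function names `ast_edge_counts` and `ast_similarity` are hypothetical.

```python
import ast
from collections import Counter


def ast_edge_counts(source: str) -> Counter:
    """Count (parent type, child type) pairs in the AST of `source`."""
    tree = ast.parse(source)
    counts = Counter()
    for parent in ast.walk(tree):
        for child in ast.iter_child_nodes(parent):
            counts[(type(parent).__name__, type(child).__name__)] += 1
    return counts


def ast_similarity(src_a: str, src_b: str) -> float:
    """Multiset Jaccard similarity of AST edge counts, in [0, 1]."""
    a, b = ast_edge_counts(src_a), ast_edge_counts(src_b)
    union = sum((a | b).values())  # element-wise maximum of counts
    if union == 0:
        return 1.0  # two empty programs are treated as identical
    return sum((a & b).values()) / union  # element-wise minimum of counts


original = "def f(x):\n    return x + 1\n"
renamed = "def g(y):\n    return y + 1\n"
different = "for i in range(3):\n    print(i)\n"

print(ast_similarity(original, renamed))
print(ast_similarity(original, different))
```

Because only node types are counted, renaming identifiers does not change the score, which is the property that makes AST-based comparison attractive for plagiarism detection. As the abstract notes, source that does not parse yields no usable AST; here `ast.parse` would raise `SyntaxError`.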

    Topics in combinatorial pattern matching


    From Frequency to Meaning: Vector Space Models of Semantics

    Computers understand very little of the meaning of human language. This profoundly limits our ability to give instructions to computers, the ability of computers to explain their actions to us, and the ability of computers to analyse and process text. Vector space models (VSMs) of semantics are beginning to address these limits. This paper surveys the use of VSMs for semantic processing of text. We organize the literature on VSMs according to the structure of the matrix in a VSM. There are currently three broad classes of VSMs, based on term-document, word-context, and pair-pattern matrices, yielding three classes of applications. We survey a broad range of applications in these three categories and we take a detailed look at a specific open source project in each category. Our goal in this survey is to show the breadth of applications of VSMs for semantics, to provide a new perspective on VSMs for those who are already familiar with the area, and to provide pointers into the literature for those who are less familiar with the field
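A minimal sketch of the first class of VSM mentioned above, the term-document matrix, is given here. It assumes whitespace tokenization and raw term counts with no weighting, and the helper names `term_document_matrix`, `cosine`, and `document_similarity` are hypothetical rather than taken from the survey.

```python
import math
from collections import Counter


def term_document_matrix(docs):
    """Return (vocab, matrix), where matrix[i][j] counts term i in document j."""
    counts = [Counter(doc.lower().split()) for doc in docs]
    vocab = sorted(set().union(*counts))
    matrix = [[c[term] for c in counts] for term in vocab]
    return vocab, matrix


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def document_similarity(matrix, j, k):
    """Compare documents j and k through their column vectors."""
    col_j = [row[j] for row in matrix]
    col_k = [row[k] for row in matrix]
    return cosine(col_j, col_k)


docs = [
    "graph grammar induction",
    "graph grammar matching",
    "network traffic analysis",
]
vocab, matrix = term_document_matrix(docs)
print(document_similarity(matrix, 0, 1))
print(document_similarity(matrix, 0, 2))
```

Documents are compared through the columns of the matrix. Comparing rows instead would measure how similarly two terms are distributed across documents.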

    Network Analysis with Stochastic Grammars

    Digital forensics requires significant manual effort to identify items of evidentiary interest from the ever-increasing volume of data in modern computing systems. One of the tasks digital forensic examiners conduct is mentally extracting and constructing insights from unstructured sequences of events. This research assists examiners with the association and individualization analysis processes that make up this task with the development of a Stochastic Context-Free Grammar (SCFG) knowledge representation for digital forensics analysis of computer network traffic. SCFG is leveraged to provide context to the low-level data collected as evidence and to build behavior profiles. Upon discovering patterns, the analyst can begin the association or individualization process to answer criminal investigative questions. Three contributions resulted from this research. First, domain characteristics suitable for SCFG representation were identified and a step-by-step approach to adapt SCFG to novel domains was developed. Second, a novel iterative graph-based method of identifying similarities in context-free grammars was developed to compare behavior patterns represented as grammars. Finally, the SCFG capabilities were demonstrated in performing association and individualization in reducing the suspect pool and reducing the volume of evidence to examine in a computer network traffic analysis use case.

    Cyber Security

    This open access book constitutes the refereed proceedings of the 17th International Annual Conference on Cyber Security, CNCERT 2021, held in Beijing, China, in July 2021. The 14 papers presented were carefully reviewed and selected from 51 submissions. The papers are organized according to the following topical sections: data security; privacy protection; anomaly detection; traffic analysis; social network security; vulnerability detection; text classification.

    Advanced Threat Intelligence: Interpretation of Anomalous Behavior in Ubiquitous Kernel Processes

    Targeted attacks on digital infrastructures are a rising threat against the confidentiality, integrity, and availability of both IT systems and sensitive data. With the emergence of advanced persistent threats (APTs), identifying and understanding such attacks has become an increasingly difficult task. Current signature-based systems are heavily reliant on fixed patterns that struggle with unknown or evasive applications, while behavior-based solutions usually leave most of the interpretative work to a human analyst. This thesis presents a multi-stage system able to detect and classify anomalous behavior within a user session by observing and analyzing ubiquitous kernel processes. Application candidates suitable for monitoring are initially selected through an adapted sentiment mining process using a score based on the log likelihood ratio (LLR). For transparent anomaly detection within a corpus of associated events, the author utilizes star structures, a bipartite representation designed to approximate the edit distance between graphs. Templates describing nominal behavior are generated automatically and are used for the computation of both an anomaly score and a report containing all deviating events. The extracted anomalies are classified using the Random Forest (RF) and Support Vector Machine (SVM) algorithms. Ultimately, the newly labeled patterns are mapped to a dedicated APT attacker–defender model that considers objectives, actions, actors, as well as assets, thereby bridging the gap between attack indicators and detailed threat semantics. This enables both risk assessment and decision support for mitigating targeted attacks. Results show that the prototype system is capable of identifying 99.8% of all star structure anomalies as benign or malicious. In multi-class scenarios that seek to associate each anomaly with a distinct attack pattern belonging to a particular APT stage we achieve a solid accuracy of 95.7%. 
    Furthermore, we demonstrate that 88.3% of observed attacks could be identified by analyzing and classifying a single ubiquitous Windows process for a mere 10 seconds, thereby eliminating the necessity to monitor each and every (unknown) application running on a system. With its semantic take on threat detection and classification, the proposed system offers a formal as well as technical solution to an information security challenge of great significance. The financial support by the Christian Doppler Research Association, the Austrian Federal Ministry for Digital and Economic Affairs, and the National Foundation for Research, Technology and Development is gratefully acknowledged.

    Acta Cybernetica : Volume 22. Number 3.
