
    Malware Classification based on Call Graph Clustering

    Each day, anti-virus companies receive tens of thousands of samples of potentially harmful executables. Many of the malicious samples are variations of previously encountered malware, created by their authors to evade pattern-based detection. Dealing with these large amounts of data requires robust, automatic detection approaches. This paper studies malware classification based on call graph clustering. By representing malware samples as call graphs, it is possible to abstract certain variations away and enable the detection of structural similarities between samples. The ability to cluster similar samples together will make more generic detection techniques possible, thereby targeting the commonalities of the samples within a cluster. To compare call graphs mutually, we compute pairwise graph similarity scores via graph matchings which approximately minimize the graph edit distance. Next, to facilitate the discovery of similar malware samples, we employ several clustering algorithms, including k-medoids and DBSCAN. Clustering experiments are conducted on a collection of real malware samples, and the results are evaluated against manual classifications provided by human malware analysts. Experiments show that it is indeed possible to accurately detect malware families via call graph clustering. We anticipate that in the future, call graphs can be used to analyse the emergence of new malware families, and ultimately to automate the implementation of generic detection schemes. Comment: This research has been supported by TEKES - the Finnish Funding Agency for Technology and Innovation as part of its ICT SHOK Future Internet research programme, grant 40212/0
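
The pipeline described above can be illustrated with a small sketch: pairwise call-graph distances approximated via graph edit distance, followed by density-based clustering. This is not the paper's implementation; it assumes call graphs are already available as networkx digraphs, uses networkx's edit-distance upper bound, and feeds a precomputed distance matrix to scikit-learn's DBSCAN (the eps threshold is arbitrary).

```python
import networkx as nx
import numpy as np
from sklearn.cluster import DBSCAN

def approx_ged(g1, g2):
    # The first value yielded by optimize_graph_edit_distance is a cheap
    # upper bound on the true graph edit distance.
    return next(nx.optimize_graph_edit_distance(g1, g2))

def cluster_call_graphs(graphs, eps=3.0, min_samples=2):
    # Pairwise (approximate) edit distances between all call graphs.
    n = len(graphs)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            dist[i, j] = dist[j, i] = approx_ged(graphs[i], graphs[j])
    # DBSCAN over the precomputed distances; eps plays the role of the
    # similarity threshold separating malware families.
    return DBSCAN(eps=eps, min_samples=min_samples, metric="precomputed").fit_predict(dist)

# Toy usage: two near-identical call graphs and one structurally different one.
g_a = nx.DiGraph([("main", "decrypt"), ("decrypt", "send")])
g_b = nx.DiGraph([("main", "decrypt"), ("decrypt", "send"), ("main", "sleep")])
g_c = nx.DiGraph([("entry", "a"), ("a", "b"), ("b", "c"), ("c", "d")])
# Samples in the same family share a cluster id; outliers are labelled -1.
print(cluster_call_graphs([g_a, g_b, g_c]))
```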

    THE SCALABLE AND ACCOUNTABLE BINARY CODE SEARCH AND ITS APPLICATIONS

    The past decade has witnessed an explosion of applications and devices. This big-data era challenges existing security technologies: new analysis techniques must scale to handle a "big data"-scale codebase; they must become smart and proactive, using the data to understand what the vulnerable points are and where they are located; and effective protection must be provided for the dissemination and analysis of data involving sensitive information on an unprecedented scale. In this dissertation, I argue that code search techniques can boost existing security analysis techniques (vulnerability identification and memory analysis) in terms of scalability and accuracy. To demonstrate these benefits, I address two issues of code search using code analysis: scalability and accountability. I further demonstrate the benefit of code search by applying it to the scalable vulnerability identification [57] and cross-version memory analysis [55, 56] problems. Firstly, I address the scalability problem of code search by learning "higher-level" semantic features from code [57]. Rather than conducting fine-grained testing on a single device or program, it becomes much more crucial to achieve quick vulnerability scanning across devices or programs at a "big data" scale. However, discovering vulnerabilities in "big code" is like finding a needle in a haystack, even when dealing with known vulnerabilities. This new challenge demands a scalable code search approach. To this end, I leverage successful image-search techniques from the computer vision community and propose a novel code encoding method for scalable vulnerability search in binary code. The evaluation results show that this approach can achieve comparable or even better accuracy and efficiency than the baseline techniques. Secondly, I tackle the accountability issues left in the vulnerability search problem by designing vulnerability-oriented raw features [58]. Similar code does not always represent the same vulnerability, so feature engineering for code search must focus on semantic-level features rather than syntactic ones. I propose to extract conditional formulas as higher-level semantic features from the raw binary code to conduct the code search. A conditional formula explicitly captures two cardinal factors of a vulnerability: 1) erroneous data dependencies and 2) missing or invalid condition checks. As a result, binary code search on conditional formulas produces significantly higher accuracy and provides meaningful evidence for human analysts to further examine the search results. The evaluation results show that this approach can further improve the search accuracy of existing bug search techniques with very reasonable performance overhead. Finally, I demonstrate the potential of the code search technique in the memory analysis field and apply it to address the cross-version issue in memory forensics [55, 56]. Memory analysis techniques for COTS software usually rely on so-called "data structure profiles" for their binaries. Constructing such profiles requires expert knowledge about the internal workings of a specific software version, and it remains a cumbersome manual effort most of the time. I propose to leverage the code search technique to enable a notion named "cross-version memory analysis", which updates a profile for a new version of a piece of software by transferring knowledge from a model already trained on an old version. The evaluation results show that the code search based approach advances existing memory analysis methods by reducing manual effort while maintaining reasonable accuracy. With the help of collaborators, I further developed two plugins for the Volatility memory forensics framework [2] and show that each of the two plugins can construct a localized profile to perform specific memory forensic tasks on the same memory dump, without manual effort to create the corresponding profile.
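
The "encode then index" pattern behind scalable code search can be sketched as below. This is not the dissertation's system: the feature names and corpus entries are hypothetical, and a simple nearest-neighbour index stands in for the real encoding and search infrastructure.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def encode_function(stats):
    # 'stats' holds cheap structural statistics per binary function; real
    # systems use richer semantic features (e.g. learned encodings or the
    # conditional formulas described above).
    keys = ["num_blocks", "num_calls", "num_cmps", "num_consts", "cyclomatic"]
    return np.array([float(stats.get(k, 0)) for k in keys])

# Hypothetical corpus of already-indexed firmware functions.
corpus = {
    "firmware_a!parse_hdr": {"num_blocks": 12, "num_calls": 3, "num_cmps": 7,
                             "num_consts": 5, "cyclomatic": 6},
    "firmware_b!do_auth":   {"num_blocks": 40, "num_calls": 9, "num_cmps": 20,
                             "num_consts": 2, "cyclomatic": 15},
}
names = list(corpus)
index = NearestNeighbors(n_neighbors=1, metric="cosine").fit(
    np.stack([encode_function(v) for v in corpus.values()]))

# Query with the encoding of a known-vulnerable function; the nearest
# neighbour is a candidate occurrence of the same vulnerability.
query = encode_function({"num_blocks": 11, "num_calls": 3, "num_cmps": 8,
                         "num_consts": 5, "cyclomatic": 6})
dist, idx = index.kneighbors(query.reshape(1, -1))
print(names[idx[0][0]], float(dist[0][0]))
```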

    Fuzzing Deep-Learning Libraries via Automated Relational API Inference

    A growing body of research has been dedicated to DL model testing. However, there is still limited work on testing DL libraries, which serve as the foundations for building, training, and running DL models. Prior work on fuzzing DL libraries can only generate tests for APIs which have been invoked by documentation examples, developer tests, or DL models, leaving a large number of APIs untested. In this paper, we propose DeepREL, the first approach to automatically inferring relational APIs for more effective DL library fuzzing. Our basic hypothesis is that for a DL library under test, there may exist a number of APIs sharing similar input parameters and outputs; in this way, we can easily "borrow" test inputs from invoked APIs to test other relational APIs. Furthermore, we formalize the notion of value equivalence and status equivalence for relational APIs to serve as the oracle for effective bug finding. We have implemented DeepREL as a fully automated end-to-end relational API inference and fuzzing technique for DL libraries, which 1) automatically infers potential API relations based on API syntactic or semantic information, 2) synthesizes concrete test programs for invoking relational APIs, 3) validates the inferred relational APIs via representative test inputs, and finally 4) performs fuzzing on the verified relational APIs to find potential inconsistencies. Our evaluation on two of the most popular DL libraries, PyTorch and TensorFlow, demonstrates that DeepREL can cover 157% more APIs than the state-of-the-art FreeFuzz. To date, DeepREL has detected 162 bugs in total, with 106 already confirmed by the developers as previously unknown bugs. Surprisingly, DeepREL has detected 13.5% of the high-priority bugs for the entire PyTorch issue-tracking system in a three-month period. Besides the 162 code bugs, we have also detected 14 documentation bugs (all confirmed). Comment: Accepted at ESEC/FSE 202
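
The core relational-testing idea can be sketched as follows. This is not DeepREL itself: the harness simply feeds the same inputs to two APIs hypothesised to be equivalent and flags value or status divergences, and the numpy pair used here is only an illustrative stand-in for DL library APIs.

```python
import numpy as np

def run(api, x):
    # Capture both the status (ok / error) and the output of an API call.
    try:
        return ("ok", api(x))
    except Exception as exc:
        return ("error", type(exc).__name__)

def check_pair(api_a, api_b, inputs, atol=1e-6):
    # Report inputs on which two supposedly-equivalent APIs disagree, either
    # in value (different outputs) or in status (one succeeds, one fails).
    divergences = []
    for x in inputs:
        status_a, out_a = run(api_a, x)
        status_b, out_b = run(api_b, x)
        if status_a != status_b:
            divergences.append(("status mismatch", x.shape, out_a, out_b))
        elif status_a == "ok" and not np.allclose(out_a, out_b, atol=atol, equal_nan=True):
            divergences.append(("value mismatch", x.shape))
    return divergences

# A candidate relational pair: two ways of summing all elements of an array.
pair = (lambda x: np.sum(x), lambda x: np.add.reduce(x, axis=None))
rng = np.random.default_rng(0)
inputs = [rng.standard_normal(shape) for shape in [(3,), (2, 4), (0,), (5, 5)]]
print(check_pair(*pair, inputs))     # an empty list means no inconsistency was found
```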

    Safe data structure visualisation


    Regional feature learning using attribute structural analysis in bipartite attention framework for vehicle re-identification

    Vehicle re-identification identifies target vehicles using images obtained by numerous non-overlapping real-time surveillance cameras. Re-identification is further challenged by illumination changes, pose differences in the captured images, and resolution. To overcome these challenges, fine-grained appearance changes in vehicles are recognized in addition to coarse-grained characteristics such as vehicle color and model, along with custom features such as logo stickers, annual service signs, and hangings. To prove the efficiency of our proposed bipartite attention framework, a novel dataset called Attributes27, which has 27 labelled attributes for each class, is created. Our framework contains three major sections: first, the overall and semantic characteristics of every individual vehicle image are extracted by a double-branch convolutional neural network (CNN) layer; second, each branch has a self-attention block linked to it to identify the regions of interest (ROIs); and last, a partition-alignment block is deployed to extract the regional features from the obtained ROIs. The evaluation of our proposed system on the Attributes27 and VeRi-776 datasets has highlighted significant regional attributes of each vehicle and improved the accuracy. The Attributes27 and VeRi-776 datasets exhibit 98.5% and 84.3% accuracy respectively, which is higher than the 78.6% accuracy of existing methods.
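
A minimal PyTorch sketch of the double-branch-with-attention idea is shown below. It is an assumed simplification, not the paper's architecture: the branch backbones, the attention block, and all layer sizes are placeholders chosen only to make the two-branch / self-attention / feature-concatenation structure concrete.

```python
import torch
import torch.nn as nn

class SpatialSelfAttention(nn.Module):
    """1x1-conv attention map that re-weights spatial regions of interest."""
    def __init__(self, channels):
        super().__init__()
        self.score = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, x):
        attn = torch.sigmoid(self.score(x))      # (B, 1, H, W) attention map
        return x * attn                          # emphasise salient regions

def branch(out_channels):
    # A tiny stand-in backbone: two conv layers, attention, global pooling.
    return nn.Sequential(
        nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
        nn.Conv2d(32, out_channels, 3, stride=2, padding=1), nn.ReLU(),
        SpatialSelfAttention(out_channels),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    )

class TwoBranchAttentionNet(nn.Module):
    def __init__(self, num_ids, feat=64):
        super().__init__()
        self.global_branch = branch(feat)        # coarse-grained appearance
        self.attr_branch = branch(feat)          # fine-grained attributes
        self.classifier = nn.Linear(2 * feat, num_ids)

    def forward(self, x):
        f = torch.cat([self.global_branch(x), self.attr_branch(x)], dim=1)
        return self.classifier(f), f             # logits and re-ID embedding

model = TwoBranchAttentionNet(num_ids=27)
logits, embedding = model(torch.randn(2, 3, 128, 128))
print(logits.shape, embedding.shape)             # (2, 27) and (2, 128)
```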

    A graph based process model measurement framework using scheduling theory

    Software development processes, as a means of ensuring software quality and productivity, have been widely accepted within the software development community; software process modeling, on the other hand, continues to be a subject of interest in the research community. Even in organizations that have achieved higher SEI maturity levels, processes are by and large described in documents and reinforced as guidelines or laws governing software development activities. The lack of industry-wide adoption of software process modeling as part of development activities can be attributed to two major reasons: the lack of forecasting power in (software) process modeling and the lack of an integration mechanism for the described process to seamlessly interact with daily development activities. This dissertation describes research through which a framework has been established where processes can be manipulated, measured, and dynamically modified by interacting with project management techniques and activities in an integrated process modeling environment, thus closing the gap between process modeling and software development. In this research, processes are described using directed graphs, similar to the techniques used in CPM. This way, the graphs can be manipulated visually while the properties of the graphs can be used to check their validity. The partial ordering and the precedence relationship of the tasks in the graphs are similar to those studied in other research [Delcambre94] [Mills96]. Measurements of the effectiveness of the processes are added in this research; these measurements provide a basis for judgment when manipulating the graphs to produce or modify a process. Software development can be considered as activities related to three sets: a set of tasks (τ), a set of resources (ρ), and a set of constraints (γ). The process, P, is then a function of all the sets interacting with each other: P = {τ, ρ, γ}. The interactions of these sets can be described in terms of different machine models using scheduling theory. While trying to produce an optimal solution satisfying a set of prescribed conditions with the analytical method leads to a practically infeasible formulation, many heuristic algorithms in scheduling theory, combined with manual manipulation of the tasks, can help to produce a reasonably good process, the effectiveness of which is reflected through a set of measurement criteria, in particular the make-span, the float, and the bottlenecks. Through an integrated process modeling environment, these measurements can be obtained in real time, thus providing a feedback loop during process execution. This feedback loop is essential for risk management and control.
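
The make-span and float measurements named above are standard critical-path quantities over a directed task graph; a small sketch with a hypothetical process graph (not data from the dissertation) is given below, using Python's standard-library topological sort.

```python
from graphlib import TopologicalSorter

# Hypothetical process graph: task durations (days) and precedence constraints.
durations = {"design": 3, "code": 5, "test": 2, "docs": 1, "release": 1}
preds = {"design": [], "code": ["design"], "docs": ["design"],
         "test": ["code"], "release": ["test", "docs"]}

order = list(TopologicalSorter(preds).static_order())
succs = {t: [s for s in order if t in preds[s]] for t in order}

# Forward pass: earliest start of each task, then the make-span.
earliest = {}
for t in order:
    earliest[t] = max((earliest[p] + durations[p] for p in preds[t]), default=0)
makespan = max(earliest[t] + durations[t] for t in order)

# Backward pass: latest start, then float (slack); zero-float tasks form the
# critical path and are the bottleneck candidates.
latest = {}
for t in reversed(order):
    latest[t] = min((latest[s] for s in succs[t]), default=makespan) - durations[t]
slack = {t: latest[t] - earliest[t] for t in order}

print("make-span:", makespan)                       # 11 for this toy graph
print("critical tasks:", [t for t in order if slack[t] == 0])
```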

    Verifying Monadic Second-Order Properties of Graph Programs

    The core challenge in a Hoare- or Dijkstra-style proof system for graph programs is in defining a weakest liberal precondition construction with respect to a rule and a postcondition. Previous work addressing this has focused on assertion languages for first-order properties, which are unable to express important global properties of graphs such as acyclicity, connectedness, or existence of paths. In this paper, we extend the nested graph conditions of Habel, Pennemann, and Rensink to make them equivalently expressive to monadic second-order logic on graphs. We present a weakest liberal precondition construction for these assertions, and demonstrate its use in verifying non-local correctness specifications of graph programs in the sense of Habel et al. Comment: Extended version of a paper to appear at ICGT 201
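
As a concrete illustration of the gap first-order assertions leave, acyclicity of a finite directed graph has a standard monadic second-order formulation (a textbook example, not the paper's assertion syntax): a finite graph contains a cycle exactly when there is a nonempty node set $X$ in which every node has an outgoing edge back into $X$, so

\[
\mathrm{acyclic} \;\equiv\; \neg\, \exists X \, \Big( \exists x\, (x \in X) \;\wedge\; \forall x \, \big( x \in X \rightarrow \exists y\, ( y \in X \wedge \mathrm{edge}(x, y) ) \big) \Big).
\]

The quantification over the node set $X$ is precisely what first-order logic lacks.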

    Exploiting Heterogeneity in Networks of Aerial and Ground Robotic Agents

    By taking advantage of complementary communication technologies, distinct sensing functionalities, and varied motion dynamics present in a heterogeneous multi-robotic network, it is possible to accomplish a main mission objective by assigning specialized sub-tasks to specific members of a robotic team. An adequate selection of the team members and effective coordination are some of the challenges to fully exploiting the unique capabilities that these types of systems can offer. Motivated by real-world applications, we focus on a multi-robotic network consisting of aerial and ground agents which has the potential to provide critical support to humans in complex settings. For instance, aerial robotic relays are capable of transporting small ground mobile sensors to expand the communication range and the situational awareness of first responders in hazardous environments. In the first part of this dissertation, we extend work on manipulation of cable-suspended loads using aerial robots by solving the problem of lifting the cable-suspended load from the ground before proceeding to transport it. Since the suspended load-quadrotor system experiences switching conditions during this critical maneuver, we define a hybrid system and show that it is differentially flat. This property facilitates the design of a nonlinear controller which tracks a waypoint-based trajectory associated with the discrete states of the hybrid system. In addition, we address the case of unknown payload mass by combining a least-squares estimation method with the designed controller. Second, we focus on the coordination of a heterogeneous team formed by a group of ground mobile sensors and a flying communication router which is deployed to sense areas of interest in a cluttered environment. Using potential field methods, we propose a controller for the coordinated mobility of the team to guarantee inter-robot and obstacle collision avoidance as well as connectivity maintenance among the ground agents while the main goal of sensing is carried out. For the aerial communication relays, we combine antenna diversity with reinforcement learning to dynamically relocate these relays so that the received signal strength is maintained above a desired threshold. Motivated by the recent interest in combining radio-frequency and optical wireless communications, we envision the implementation of an optical link between micro-scale aerial and ground robots. This type of link requires maintaining a suitable relative transmitter-receiver position for reliable communications. In the third part of this thesis, we tackle this problem. Based on the link model, we define a connectivity cone where a minimum transmission rate is guaranteed; the aerial robot has to track the ground vehicle to stay inside this cone. The control must be robust to noisy measurements, so we use particle filters to obtain a better estimate of the receiver position and design a control algorithm for the flying robot to enhance the transmission rate. We also consider the problem of pairing a ground sensor with an aerial vehicle, both equipped with a hybrid radio-frequency/optical wireless communication system. A challenge is positioning the flying robot within optical range when the sensor location is unknown. Thus, we take advantage of the hybrid communication scheme by developing a control strategy that uses the radio signal to guide the aerial platform to the ground sensor.
Once the optical signal strength has reached a certain threshold, the robot hovers within optical range. Finally, we investigate the problem of building an alliance of agents with different skills in order to satisfy the requirements imposed by a given task. We find this alliance, also known as a coalition, by using a bipartite graph in which edges represent the relation between agent capabilities and the resources required for task execution. Using this graph, we build a coalition whose total capability resources can satisfy the task's resource requirements. We also study the heterogeneity of the formed coalition to analyze how it is affected, for instance, by the amount of capability resources present in the agents.
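
The coalition-formation step can be sketched as follows. This is a hedged illustration with made-up agents, resources, and a simple greedy cover over the bipartite capability-resource relation; it is not the dissertation's algorithm.

```python
# A task's resource requirements and each agent's capability resources.
task_needs = {"camera": 2, "lidar": 1, "radio_relay": 1}
agents = {
    "uav_1": {"camera": 1, "radio_relay": 1},
    "ugv_1": {"camera": 1, "lidar": 1},
    "ugv_2": {"camera": 1},
}

# Bipartite relation: an edge (agent, resource) when the agent can contribute it.
edges = [(a, r) for a, caps in agents.items() for r in caps if r in task_needs]

def contribution(agent, remaining):
    # How much of the still-unmet requirement this agent would cover.
    return sum(min(agents[agent].get(r, 0), need) for r, need in remaining.items())

coalition, remaining = [], dict(task_needs)
while any(remaining.values()):
    candidates = [a for a in agents if a not in coalition]
    best = max(candidates, key=lambda a: contribution(a, remaining))
    if contribution(best, remaining) == 0:
        raise ValueError("available agents cannot satisfy the task requirements")
    coalition.append(best)
    for r in remaining:
        remaining[r] = max(0, remaining[r] - agents[best].get(r, 0))

print("bipartite edges:", edges)
print("coalition:", coalition)       # ['uav_1', 'ugv_1'] for this toy data
```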