Classifying malicious Windows executables using anomaly-based detection
A malicious executable is broadly defined as any program or piece of code designed to damage a system or the information it contains, or to prevent the system from being used in a normal manner. The generic term for any kind of malicious software is malware, which includes viruses, worms, Trojans, backdoors, rootkits, spyware and exploits. Anomaly detection is a technique which builds statistical profiles of normal and malicious data and classifies unseen data against these two profiles.
A detection system is presented here which is anomaly based and focuses on the Windows® platform. Several file infection techniques were studied to understand which features of an executable binary are most susceptible to being used for malicious code propagation. A framework is presented for collecting data for both static (non-execution-based) and dynamic (execution-based) analysis of malicious executables. Two specific features extracted using static analysis are explained in detail: Windows API calls (from the Import Address Table of the Portable Executable header) and the hex byte frequency count (collected using the hexdump utility). The dynamic analysis features that were extracted are briefly mentioned, and the major challenges faced with this data are explained. Classification results using Support Vector Machines for anomaly detection are shown for the two static analysis features. Experiments achieved classification accuracy of up to 94% on new, previously unseen executables.
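The byte-frequency feature can be illustrated with a short sketch. This is not the paper's implementation: the histogram below mirrors the hex byte frequency feature, but classification uses a simple nearest-centroid rule rather than the Support Vector Machines the paper trains, purely to keep the example dependency-free.

```python
from collections import Counter

def byte_histogram(data: bytes) -> list[float]:
    """Normalized frequency of each byte value 0x00-0xFF (the hex byte
    frequency feature, collected in the paper with a hexdump utility)."""
    counts = Counter(data)
    total = len(data) or 1
    return [counts.get(b, 0) / total for b in range(256)]

def profile(samples: list[bytes]) -> list[float]:
    """Mean histogram over a set of samples: a statistical profile."""
    hists = [byte_histogram(s) for s in samples]
    return [sum(col) / len(hists) for col in zip(*hists)]

def classify(data: bytes, benign: list[float], malicious: list[float]) -> str:
    """Assign the label of the nearer profile (squared Euclidean distance).
    Illustrative only -- the paper uses SVMs over these features."""
    h = byte_histogram(data)
    d_b = sum((a - b) ** 2 for a, b in zip(h, benign))
    d_m = sum((a - b) ** 2 for a, b in zip(h, malicious))
    return "benign" if d_b <= d_m else "malicious"
```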
A complex situation in data recovery
This thesis was submitted for the degree of Doctor of Philosophy and awarded by Brunel University. The research considers an unusual situation in data recovery. Data recovery is the process of recovering data from recording media that is not accessible by normal means. Provided that the data has not been overwritten or the recording medium physically damaged, this is usually a relatively simple process of either repairing the file system so that the file(s) may be accessed as usual, or finding the data on the medium and copying it directly into normal file(s). The data in this recovery situation is recorded by specialist call centre recording equipment and is stored on the recording medium in a proprietary format whereby simultaneous conversations are multiplexed together and can only be accessed by using associated metadata records. The value of the recorded data may be very high, especially in the financial sector, where it may be considered a legal audit of business transactions. When a failure occurs and data needs to be recovered, both the data and the metadata must be recreated before a single call can be replayed. A key component in accessing this information is the location metadata that identifies where the required components reside on the medium. If the metadata is corrupted, incomplete or wrong, a repair cannot proceed until it is corrected. This research focuses on the problem of verifying this location metadata. Initially it was believed that only a small set of errors would exist, and work centred on detecting these errors by presenting the information to engineers in an at-a-glance image. When the extent of the possible errors was realised, an attempt was made to deduce location metadata by exploring the content of the recorded medium. Although successful in one instance, the process was not able to distinguish between current and previous uses. Eventually, insights gained from exploring the recording application's source code permitted an intelligent trial-and-error process which deduced the underlying medium-apportioning formula. It was then possible to incorporate this formula into the heuristics generating the at-a-glance image, creating an artefact that could verify the location metadata for any given repair. After discovering the formula, the research returned to media exploration and produced the disk fingerprinting technique, which gave valuable insights into error states in call centre recording and provided a new way of seeing the contents of a hard drive. This research provided the following contributions:
1. It has provided a means by which the recording systems' location metadata can be verified and repaired.
2. As a result of this verification, greater automation of the recovery process is now possible before human verification is required.
3. The disk fingerprinting process, which has already given insights into the recording system's problems and provides a new way of seeing the contents of recording media.
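The at-a-glance imaging and disk fingerprinting ideas can be sketched in miniature: reduce each block of a disk image to a single character so a whole medium can be scanned by eye. The block size and hash-to-character mapping below are illustrative assumptions, not the thesis's technique, which targets proprietary call-recording formats.

```python
import hashlib

BLOCK = 4096  # assumed block size; real recording systems use a proprietary layout

def fingerprint(image_path: str, block: int = BLOCK) -> str:
    """One character per block: '.' for an all-zero (unused) block,
    otherwise a hex digit derived from the block's hash, so repeated
    content shows up as repeated characters in the image."""
    out = []
    with open(image_path, "rb") as f:
        while chunk := f.read(block):
            if not any(chunk):
                out.append(".")
            else:
                out.append(hashlib.sha256(chunk).hexdigest()[0])
    return "".join(out)
```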
Automatic Generation of Input Grammars Using Symbolic Execution
Invalid input often leads to unexpected behavior in a program and is behind a plethora of known and unknown vulnerabilities. To prevent improper input from being processed, the input needs to be validated before the rest of the program executes. Formal language theory facilitates the definition and recognition of proper inputs. We focus on the problem of defining valid input after the program has already been written. We construct a parser that infers the structure of inputs which avoid vulnerabilities, whereas existing work focuses on inferring the structure of input the program anticipates. We present a tool that constructs an input language, given the program as input, using symbolic execution on symbolic arguments. This differs from existing work, which tracks the execution of concrete inputs to infer a grammar. We test our tool on programs with known vulnerabilities, including programs in the GNU Coreutils library, and we demonstrate how the parser catches known invalid inputs. We conclude that the synthesis of the complete parser cannot be entirely automated due to limitations of symbolic execution tools and issues of computability. A more comprehensive parser must additionally be informed by examples and counterexamples of the input language.
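The validate-before-execute discipline the abstract describes can be sketched with a toy input grammar. The grammar and wrapped program below are invented for illustration; the paper's tool infers such grammars automatically via symbolic execution rather than having a developer write them by hand.

```python
import re

# A toy input language, assumed for illustration: the kind of grammar
# one might infer for a program that accepts comma-separated decimal fields.
GRAMMAR = re.compile(r"\d+(,\d+)*")

def guarded(program):
    """Wrap `program` so it only ever sees inputs the grammar accepts --
    validation happens before any of the program's own logic runs."""
    def wrapper(raw: str):
        if not GRAMMAR.fullmatch(raw):
            raise ValueError(f"rejected by input grammar: {raw!r}")
        return program(raw)
    return wrapper

@guarded
def sum_fields(raw: str) -> int:
    # Safe to split unconditionally: the grammar guarantees well-formed fields.
    return sum(int(x) for x in raw.split(","))
```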
PolyFS Visualizer
File systems, one of the most important operating system topics, control how we store and access data and form a key part of a computer scientist's understanding of the underlying mechanisms of a computer. However, file systems, with their abstract concepts and lack of concrete learning aids, are a confusing subject for students. Historically at Cal Poly, CPE 453 Introduction to Operating Systems has been one of the most-failed classes in the computing majors, pointing to the need for better teaching and learning tools. Tools that give students concrete examples of abstract concepts could better prepare them for industry.
The PolyFS Visualizer is a block-level file system visualization service built for the PolyFS and TinyFS file system design specifications currently used by some of the professors teaching CPE 453. The service allows students to easily view the blocks of their file system and see metadata, the blocks' binary content and the interlinked structure. Students can either compile their file system code with a provided block emulation library to build their disk on a remote server and use a visualization website, or load the file backing their file system directly into the visualization service to view it locally. This allows students to easily view, debug and explore their implementation of a file system and to understand how different design decisions affect its operation.
The implementation includes three main components: a disk emulation library in C for compilation with students' code, a Node.js back-end to handle students' file systems and block operations, and a read-only visualization service. We conducted two surveys of students to determine the usefulness of the PolyFS Visualizer. Students responded that the PolyFS Visualizer helps with the PolyFS file system design project, and offered several ideas for future features and expansions.
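A minimal sketch of the block-level view such a visualizer renders, assuming a hypothetical 256-byte block size (the real PolyFS/TinyFS specifications define their own block sizes, superblock layout and metadata):

```python
# Assumed 256-byte blocks for illustration; the PolyFS/TinyFS specs
# used in CPE 453 define their own block size and on-disk layout.
BLOCK_SIZE = 256

def dump_block(image: bytes, n: int, width: int = 16) -> str:
    """Hex-dump block n of a disk image: offset, hex bytes, and a
    printable-ASCII column -- the raw view a block visualizer builds on."""
    block = image[n * BLOCK_SIZE:(n + 1) * BLOCK_SIZE]
    lines = []
    for off in range(0, len(block), width):
        chunk = block[off:off + width]
        hexpart = " ".join(f"{b:02x}" for b in chunk)
        text = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append(f"{n * BLOCK_SIZE + off:06x}  {hexpart:<{width * 3}} {text}")
    return "\n".join(lines)
```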
Malware Target Recognition via Static Heuristics
Organizations increasingly rely on the confidentiality, integrity and availability of their information and communications technologies to conduct effective business operations while maintaining their competitive edge. Exploitation of these networks via the introduction of undetected malware ultimately degrades their competitive edge, taking advantage of limited network visibility and the high cost of analyzing massive numbers of programs. This article introduces the novel Malware Target Recognition (MaTR) system, which combines the decision tree machine learning algorithm with static heuristic features for malware detection. By focusing on contextually important static heuristic features, this research demonstrates superior detection results. Experimental results on large sample datasets demonstrate near-ideal malware detection performance (99.9+% accuracy) with low false positive (8.73e-4) and false negative rates (8.03e-4) at the same point on the performance curve. Test results against a set of publicly unknown malware, including potential advanced competitor tools, show MaTR's superior detection rate (99%) versus the union of detections from three commercial antivirus products (60%). The resulting model is a fine-granularity sensor with potential to dramatically augment cyberspace situation awareness.
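How a decision tree classifies a binary from static heuristics can be sketched with a toy, hand-written tree. The features and thresholds below are invented for illustration; MaTR's actual feature set and tree are learned from training data, not hard-coded like this.

```python
# Hypothetical static heuristic features of a PE file; MaTR's real
# features and decision boundaries come from the trained model.
def classify_pe(features: dict) -> str:
    """Walk a toy decision tree over static heuristics, top to bottom."""
    if features["entropy"] > 7.0:          # high entropy suggests packing/encryption
        return "malware"
    if features["imports"] < 5:            # suspiciously tiny import table
        return "malware"
    if not features["has_version_info"]:   # missing version resource
        return "suspect"
    return "benign"
```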
Logging and Analysis of Internet of Things (IoT) Device Network Traffic and Power Consumption
An increasing number of devices, from coffee makers to electric kettles, are becoming connected to the Internet. These are all part of the Internet of Things, or IoT. Each device generates unique network traffic and power consumption patterns. Until now, there has not been a comprehensive set of data that captures these traffic and power patterns. This thesis documents how we collected 10 to 15 weeks of network traffic and power consumption data from 15 different IoT devices and provides an analysis of a subset of 6 devices. Devices including an Amazon Echo Dot, Google Home Mini, and Google Chromecast were used on a regular basis, and all of their network traffic and power consumption was logged to a MySQL database. The database currently contains 64 million packets and 71 gigabytes of data and is still growing in size as more data is collected 24/7 from each device. We show that it is possible to see when users are asking their smart speaker a question, or whether the lights in their home are on or off, based on power consumption and network traffic from the devices. These trends can be seen even if the data being sent is encrypted.
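The power-consumption inference described above can be sketched as a simple threshold detector over a stream of wattage samples. The baseline statistics and threshold are illustrative assumptions, not values from the thesis's dataset:

```python
from statistics import mean, stdev

def activity_windows(samples: list[float], rate_hz: float = 1.0,
                     k: float = 3.0) -> list[float]:
    """Return timestamps (seconds) where power draw exceeds the baseline
    by k standard deviations -- the kind of spike that can reveal a smart
    speaker answering a question even when its traffic is encrypted.
    The k=3 threshold is an illustrative choice, not from the thesis."""
    mu, sigma = mean(samples), stdev(samples)
    threshold = mu + k * sigma
    return [i / rate_hz for i, watts in enumerate(samples) if watts > threshold]
```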