865 research outputs found
Identifying Authorship Style in Malicious Binaries: Techniques, Challenges & Datasets
Attributing a piece of malware to its creator typically requires threat intelligence. Binary attribution is harder still, as it relies largely on the ability to disassemble binaries to identify authorship style. Our survey explores malicious authorship style and the adversarial techniques authors use to remain anonymous. We examine the adversarial impact on state-of-the-art methods, identify key findings, and explore the open research challenges. To mitigate the lack of ground-truth datasets in this domain, we publish alongside this survey the largest and most diverse meta-information dataset, comprising 15,660 malware samples labeled with 164 threat actor groups.
A multi-input deep learning model for C/C++ source code attribution
Code stylometry is the application of analysis techniques to a collection of source code or binaries to determine variations in style. The extracted variations are often used to identify the author of the text or to differentiate one piece from another.
In this research, we created a multi-input deep learning model that could accurately categorize and group code from multiple projects. The model took as input word-based tokenization of code comments, character-based tokenization of the source code text, and the metadata features described by A. Caliskan-Islam et al. Using these three inputs, we achieved 90% validation accuracy with a loss value of 0.1203 on 12 projects comprising 5,877 files. Finally, we analyzed the Bitcoin source code with our model, which showed a high-probability match to the OpenSSL project.
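The three model inputs the abstract describes can be illustrated with a small, self-contained sketch. The function name `extract_inputs` and the particular layout metadata features below are illustrative stand-ins, not the authors' actual feature set:

```python
import re

def extract_inputs(source: str):
    """Split one C/C++ file into the three inputs described above."""
    # 1) word-based tokens from comments (// ... and /* ... */)
    comments = re.findall(r"//[^\n]*|/\*.*?\*/", source, flags=re.S)
    comment_words = re.findall(r"[A-Za-z]+", " ".join(comments))

    # 2) character-based tokens from the remaining code text
    code = re.sub(r"//[^\n]*|/\*.*?\*/", "", source, flags=re.S)
    char_tokens = list(code)

    # 3) simple layout features, standing in for the stylometric
    #    metadata of Caliskan-Islam et al. (illustrative only)
    lines = code.splitlines()
    n = max(len(lines), 1)
    metadata = {
        "avg_line_len": sum(len(l) for l in lines) / n,
        "tab_indent_ratio": sum(l.startswith("\t") for l in lines) / n,
        "brace_on_own_line": sum(l.strip() == "{" for l in lines) / n,
    }
    return comment_words, char_tokens, metadata

src = "// fast square\nint sq(int x) {\n    return x * x; /* no overflow check */\n}\n"
words, chars, meta = extract_inputs(src)
print(words)  # ['fast', 'square', 'no', 'overflow', 'check']
```

In the real model, each of these three representations would feed a separate input branch of the network before being merged for classification.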
Dos and Don'ts of Machine Learning in Computer Security
With the growing processing power of computing systems and the increasing
availability of massive datasets, machine learning algorithms have led to major
breakthroughs in many different areas. This development has influenced computer
security, spawning a series of work on learning-based security systems, such as
for malware detection, vulnerability discovery, and binary code analysis.
Despite great potential, machine learning in security is prone to subtle
pitfalls that undermine its performance and render learning-based systems
potentially unsuitable for security tasks and practical deployment. In this
paper, we look at this problem with critical eyes. First, we identify common
pitfalls in the design, implementation, and evaluation of learning-based
security systems. We conduct a study of 30 papers from top-tier security
conferences within the past 10 years, confirming that these pitfalls are
widespread in the current security literature. In an empirical analysis, we
further demonstrate how individual pitfalls can lead to unrealistic performance
and interpretations, obstructing the understanding of the security problem at
hand. As a remedy, we propose actionable recommendations to support researchers
in avoiding or mitigating the pitfalls where possible. Furthermore, we identify
open problems when applying machine learning in security and provide directions
for further research.
Comment: to appear at USENIX Security Symposium 202
Cracking the Code: Unraveling Gender Disparities in Open-Source Contributions
Within the world of open-source software (OSS) development, previous research has shown that the success rate of pull requests (PRs) may exhibit gender-related imbalances. In this work, we seek to examine which factors may contribute to this imbalance; we do so by performing a comprehensive study on a corpus of over 50,000 accepted PRs taken from a set of well-known Python projects. We perform both stylometric and quality-based analyses of the PRs submitted by female and male developers, and we find that the results vary by gender. For example, we found that code written by male developers is more prone to both bugs and blocker issues. Based on our findings, we propose a set of actionable recommendations aimed at fostering diversity and equal opportunities within the OSS ecosystem.
Data quality measures for identity resolution
The explosion in popularity of online social networks has led to increased interest in identity resolution from security practitioners. Being able to connect together the multiple online accounts of a user can be of use in verifying identity attributes and in tracking the activity of malicious users. At the same time, privacy researchers are exploring the same phenomenon with interest in identifying privacy risks caused by re-identification attacks. Existing literature has explored how particular components of an online identity may be used to connect profiles, but few if any studies have attempted to assess the comparative value of information attributes. In addition, few of the methods being reported are easily comparable, due to difficulties with obtaining and sharing ground-truth data. Attempts to gain a comprehensive understanding of the identifiability of profile attributes are hindered by these issues. With a focus on overcoming these hurdles to effective research, this thesis first develops a methodology for sampling ground-truth data from online social networks. Building on this with reference to both existing literature and samples of real profile data, this thesis describes and grounds a comprehensive matching schema of profile attributes. The work then defines data quality measures which are important for identity resolution, and measures the availability, consistency and uniqueness of the schema's contents. The developed measurements are then applied in a feature selection scheme to reduce the impact of missing data issues common in identity resolution. Finally, this thesis addresses the purposes to which identity resolution may be applied, defining the further application-oriented data quality measurements of novelty, veracity and relevance, and demonstrating their calculation and application for a particular use case: evaluating the social engineering vulnerability of an organisation.
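The availability, consistency and uniqueness measures described above might be computed roughly as follows. This is a minimal sketch assuming dictionary-shaped profiles; the formulas are illustrative and not the thesis's exact definitions:

```python
def availability(profiles, attr):
    """Fraction of profiles in which the attribute is present and non-empty."""
    return sum(1 for p in profiles if p.get(attr)) / len(profiles)

def uniqueness(profiles, attr):
    """Fraction of distinct values among profiles that carry the attribute.
    High uniqueness means the attribute is more discriminative for matching."""
    values = [p[attr] for p in profiles if p.get(attr)]
    return len(set(values)) / len(values) if values else 0.0

def consistency(pairs, attr):
    """Fraction of ground-truth same-user account pairs whose values agree."""
    known = [(a, b) for a, b in pairs if a.get(attr) and b.get(attr)]
    if not known:
        return 0.0
    return sum(1 for a, b in known if a[attr] == b[attr]) / len(known)

profiles = [
    {"username": "alice99", "location": "UK"},
    {"username": "bob_r", "location": "UK"},
    {"username": "carol", "location": ""},  # empty value counts as missing
]
pairs = [(profiles[0], {"username": "alice99"})]
print(availability(profiles, "location"), uniqueness(profiles, "username"))
```

An attribute that scores high on all three measures is a strong candidate for an identity resolution feature; low availability or consistency is exactly the missing-data problem the feature selection scheme above addresses.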
Advanced Machine Learning Techniques and Meta-Heuristic Optimization for the Detection of Masquerading Attacks in Social Networks
According to the report published by the online protection firm Iovation in 2012,
cyber fraud ranged from 1 percent of Internet transactions in North America
to 7 percent in Africa, most of it involving credit card fraud, identity
theft, and account takeover or hacking attempts. This kind of crime is still growing
due to the advantages offered by a non-face-to-face channel in which an increasing
number of unsuspecting victims divulge sensitive information. Interpol classifies
these illegal activities into 3 types:
• Attacks against computer hardware and software.
• Financial crimes and corruption.
• Abuse, in the form of grooming or “sexploitation”.
Most research efforts have focused on the target of the crime, developing different
strategies depending on the case. Thus, for the well-known phishing attacks, stored
blacklists or crime signals in the text are employed, eventually yielding ad-hoc
detectors that are hardly transferable to other scenarios even when the background is widely
shared. Identity theft, or masquerading, can be described as a criminal activity oriented
towards the misuse of stolen credentials to obtain goods or services by
deception. On March 4, 2005, a million pieces of personal and sensitive information, such
as credit card and social security numbers, were collected by white-hat hackers at
Seattle University who surfed the Web for less than 60 minutes by means of
the Google search engine. They thereby demonstrated the vulnerability and lack
of protection exposed by a mere group of sophisticated search terms typed into an engine
whose large data warehouse still allowed company or government website
data to be shown from a temporary cache.
As mentioned above, platforms that connect distant people, in which interaction is
undirected, offer an easy entry point for unauthorized third parties who impersonate the legitimate
user in an attempt to go unnoticed while pursuing malicious, not necessarily economic,
interests. In fact, the last point in the list above, regarding abuse, has become a
major and terrible risk, along with bullying; both, by means of threats,
harassment or even self-incrimination, are liable to drive someone to suicide, depression
or helplessness. California Penal Code Section 528.5 states:
“Notwithstanding any other provision of law, any person who knowingly
and without consent credibly impersonates another actual person through
or on an Internet Web site or by other electronic means for purposes of
harming, intimidating, threatening, or defrauding another person is guilty
of a public offense punishable pursuant to subdivision [...]”.
Impersonation therefore consists of any criminal activity in which someone assumes
a false identity and acts in that assumed character with intent to obtain
a pecuniary benefit or cause some harm. User profiling, in turn, is the process of
harvesting user information in order to construct a rich template of all the
attributes advantageous in the field at hand and for specific purposes. User profiling is
often employed as a mechanism for recommending items or useful information
that the client has not yet considered. Nevertheless, derived user tendencies or
preferences can also be exploited to define the user's inherent behavior and address the
problem of impersonation by detecting outliers or strange deviations likely to signal
a potential attack.
This dissertation elaborates on impersonation attacks from a profiling
perspective, eventually developing a 2-stage environment that embraces
2 levels of privacy intrusion, thus providing the following contributions:
• The inference of behavioral patterns from connection-time traces, aiming to
avoid the usurpation of more confidential information. Compared to
previous approaches, this procedure abstains from impinging on user privacy
by taking over message content, since it relies only on the time statistics
of user sessions rather than on their content.
• The application and subsequent discussion of two algorithms selected to
resolve the previous point:
– A commonly employed supervised algorithm, executed as a binary classifier,
which in turn forced us to devise a method for coping with the
absence of labeled instances representing an identity theft.
– A meta-heuristic algorithm that searches for the most convenient parameters
for arranging the instances in a high-dimensional space into properly
delimited clusters, so that an unsupervised clustering algorithm
can finally be applied.
• The analysis of message content, encroaching on more private information but
easing user identification by mining discriminative features with Natural
Language Processing (NLP) techniques. As a consequence, the development of
a new feature extraction algorithm based on linguistic theories, motivated by
the massive quantity of features typically gathered when working with text.
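The first contribution above, flagging possible identity theft from session time statistics alone, can be sketched as a simple deviation test. This minimal illustration assumes only session start hours are available; the dissertation's actual features and algorithms are richer:

```python
from statistics import mean, stdev

def hour_profile(session_hours):
    """Summarize a user's habitual connection times (hour of day, 0-23)."""
    return mean(session_hours), stdev(session_hours)

def is_suspicious(profile, new_hours, z_threshold=3.0):
    """Flag a batch of new sessions whose mean start hour deviates strongly
    from the user's historical profile: a crude stand-in for the
    behavioral outlier detection described above."""
    mu, sigma = profile
    z = abs(mean(new_hours) - mu) / max(sigma, 1e-9)
    return z > z_threshold

history = [20, 21, 22, 21, 20, 22, 21, 23, 20, 21]  # habitual evening user
profile = hour_profile(history)
print(is_suspicious(profile, [21, 22, 20]))  # typical evening sessions
print(is_suspicious(profile, [3, 4, 5]))     # sudden small-hours activity
```

Note that this needs no message content at all, which is precisely the privacy advantage the contribution claims: only when the time-based signal is insufficient does the second stage escalate to content analysis.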
In summary, this dissertation goes beyond the typical ad-hoc approaches
adopted by previous identity theft and authorship attribution research. Specifically,
it proposes tailored solutions to this particular and extensively studied paradigm,
with the aim of introducing a generic approach from a profiling view that is not tightly
bound to a single application field. In addition, technical contributions have been
made in the course of formulating the solutions, intended to optimize familiar methods
for better versatility towards the problem at hand. Overall, this thesis
establishes an encouraging research basis towards unveiling subtle impersonation
attacks in social networks by means of intelligent learning techniques.
StyleCounsel: Seeing the (Random) Forest for the Trees in Adversarial Code Stylometry
Authorship attribution has piqued the interest of scholars for centuries, but had historically remained a matter of subjective opinion, based upon examination of handwriting and the physical document. Midway through the 20th Century, a technique known as stylometry was developed, in which the content of a document is analyzed to extract the author's grammar use, preferred vocabulary, and other elements of compositional style. In parallel to this, programmers, and particularly those involved in education, were writing and testing systems designed to automate the analysis of good coding style and best practice, in order to assist with grading assignments. In the aftermath of the Morris Worm incident in 1988, researchers began to consider whether this automated analysis of program style could be combined with stylometry techniques and applied to source code, to identify the author of a program.
The results of recent experiments have suggested this code stylometry can successfully identify the author of short programs from among hundreds of candidates with up to 98% precision. This potential ability to discern the programmer of a sample of code from a large group of possible authors could have concerning consequences for the open-source community at large, particularly those contributors that may wish to remain anonymous. Recent international events have suggested the developers of certain anti-censorship and anti-surveillance tools are being targeted by their governments and forced to delete their repositories or face prosecution.
In light of this threat to the freedom and privacy of individual programmers around the world, and due to a dearth of published research into practical code stylometry at scale and its feasibility, we carried out a number of investigations looking into the difficulties of applying this technique in the real world, and how one might effect a robust defence against it. To this end, we devised a system to aid programmers in obfuscating their inherent style and imitating another, overt, author's style in order to protect their anonymity from this forensic technique. Our system utilizes the implicit rules encoded in the decision points of a random forest ensemble in order to derive a set of recommendations to present to the user detailing how to achieve this obfuscation and mimicry attack. In order to best test this system, and simultaneously assess the difficulties of performing practical stylometry at scale, we also gathered a large corpus of real open-source software and devised our own feature set including both novel attributes and those inspired or borrowed from other sources.
Our results indicate that attempting a mass analysis of publicly available source code is fraught with difficulties in ensuring the integrity of the data. Furthermore, we found that ours and most other published feature sets do not sufficiently capture an author's style independently of the content to be very effective at scale, although their accuracy is significantly greater than a random guess. Evaluations of our tool indicate it can successfully extract a set of changes that, if implemented, would result in a misclassification as another user. More importantly, this extraction was independent of the specifics of the feature set, and therefore would still work even with a more accurate model of style. We ran a limited user study to assess the usability of the tool, and found that overall it was beneficial to our participants, and could be even more beneficial if the valuable feedback we received were implemented in future work.
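The core idea of deriving obfuscation recommendations from the decision points of a tree ensemble can be shown with a toy sketch. The single-split "trees", feature names, and thresholds below are all invented for illustration and are not the paper's actual model:

```python
# Each "tree" is a single decision stump: (feature, threshold).
# A sample with feature > threshold votes "author_B", else "author_A".
# Real random forests have deep trees; stumps keep the walk-through short.
FOREST = [
    ("avg_line_len",  40.0),
    ("tabs_per_line",  0.5),
    ("comment_ratio",  0.2),
]

def predict(features):
    votes = ["author_B" if features[f] > t else "author_A" for f, t in FOREST]
    return max(set(votes), key=votes.count)  # majority vote

def recommendations(features, target="author_B"):
    """Read thresholds off the trees that currently vote against the target
    author, and suggest the feature change that would flip each vote."""
    recs = []
    for f, t in FOREST:
        vote = "author_B" if features[f] > t else "author_A"
        if vote != target:
            direction = "raise above" if target == "author_B" else "lower below"
            recs.append(f"{f}: {direction} {t}")
    return recs

mine = {"avg_line_len": 35.0, "tabs_per_line": 0.9, "comment_ratio": 0.1}
print(predict(mine))           # currently attributed to author_A
for r in recommendations(mine):
    print(r)
```

Following all of the suggested changes flips the majority vote to the target author, which is the mimicry attack in miniature; the feature-set independence noted above corresponds to the fact that this procedure only reads decision points, never the semantics of the features.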
Leveraging Longitudinal Data for Personalized Prediction and Word Representations
This thesis focuses on personalization, word representations, and longitudinal dialog. We first look at users' expressions of individual preferences. In this targeted sentiment task, we find that we can improve entity extraction and sentiment classification using domain lexicons and linear term weighting. This task is important to personalization and dialog systems, as targets need to be identified in conversation and personal preferences affect how the system should react. Then we examine individuals with large amounts of personal conversational data in order to better predict what people will say. We consider extra-linguistic features that can be used to predict behavior and to predict the relationship between interlocutors. We show that these features improve over just using message content and that training on personal data leads to much better performance than training on a sample from all other users. We look not just at using personal data for these end tasks, but also at constructing personalized word representations. When we have a lot of data for an individual, we create personalized word embeddings that improve performance on language modeling and authorship attribution. When we have limited data but do have user demographics, we can instead construct demographic word embeddings. We show that these representations improve language modeling and word association performance. When we do not have demographic information, we show that, using a small amount of data from an individual, we can calculate similarity to existing users and interpolate or leverage data from these users to improve language modeling performance. Using these types of personalized word representations, we are able to provide insight into which words vary more across users and demographics. The kinds of personalized representations that we introduce in this work allow for applications such as predictive typing, style transfer, and dialog systems.
Importantly, they also have the potential to enable more equitable language models, with improved performance for those demographic groups that have little representation in the data.
PhD thesis, Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/167971/1/cfwelch_1.pd
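The similarity-based interpolation described above (leveraging data from similar users when an individual has little) can be sketched with unigram language models. The mixing scheme below is an illustrative stand-in for the thesis's method:

```python
import math
from collections import Counter

def unigram_lm(tokens):
    """Maximum-likelihood unigram model: word -> relative frequency."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def cosine(a, b):
    """Cosine similarity between two sparse word-probability vectors."""
    dot = sum(a[w] * b.get(w, 0.0) for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def interpolated_prob(word, target_lm, other_lms, lam=0.5):
    """Mix the sparse target-user model with a similarity-weighted
    average of other users' models."""
    weights = [cosine(target_lm, lm) for lm in other_lms]
    z = sum(weights) or 1.0
    backoff = sum(w * lm.get(word, 0.0) for w, lm in zip(weights, other_lms)) / z
    return lam * target_lm.get(word, 0.0) + (1 - lam) * backoff

alice = unigram_lm(["hi", "there"])  # user with very little data
others = [unigram_lm(["hi", "there", "hello", "hi"]),  # similar user
          unigram_lm(["foo", "bar"])]                  # dissimilar user
# "hello" is unseen for alice but likely under her nearest neighbour's model
print(interpolated_prob("hello", alice, others))
```

Because the backoff is weighted by similarity, the dissimilar user contributes almost nothing, so the low-data user inherits probability mass mainly from users who talk the way she does.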
Music 2025 : The Music Data Dilemma: issues facing the music industry in improving data management
© Crown Copyright 2019. 'Music 2025' investigates the infrastructure issues around the management of digital data in an increasingly stream-driven industry. The findings are the culmination of over 50 interviews with high-profile music industry representatives across the sector and reflect key issues as well as areas of consensus and contrasting views. The findings reveal that, whilst there are great examples of data initiatives across the value chain, there are opportunities to improve efficiency and interoperability.