Search CORE

8 research outputs found

Идентификация автора исходного кода методами машинного обучения

Author: Alexander Sergeevich Romanov
Anna Vladimirovna Kurtukova
Publication venue: Russian Academy of Sciences, St. Petersburg Federal Research Center
Publication date: 01/06/2019
Field of study

Статья посвящена анализу проблемы определения автора исходного кода, которая представляет интерес для исследователей в области информационной безопасности, компьютерной криминалистики, оценки качества образовательного процесса, защиты интеллектуальной собственности. Представлен подробный анализ современных решений проблемы. Предлагаются две новые методики идентификации на основе алгоритмов машинного обучения: машины опорных векторов, фильтра быстрой корреляции и информативных признаков; гибридной сверточно-рекуррентной нейронной сети. Эксперименты проводились на базе исходных кодов, написанных на наиболее популярных языках программирования. В экспериментальную базу вошли экземпляры исходных кодов, написанных на Java, C++, Python, PHP, JavaScript, C, C# и Ruby. Данные были получены с веб-сервиса для хостинга IT-проектов Github. Общее количество исходных кодов превышает 150 тысяч образцов, средняя длина каждого из которых составляет 850 символов. Размер корпуса — 542 автора. С помощью перекрестной проверки по 10 блокам оценена точность разработанных методик для различного количества авторов. Для наиболее популярного языка программирования Java проведен дополнительный ряд экспериментов с количеством авторов от 2 до 50 и приведены графики зависимости точности идентификации от размера корпуса. Анализ результатов показал, что методика на основе гибридной нейронной сети способна достигать точности 97%, что является наилучшим результатом на сегодняшний день. Методика на основе машины опорных векторов позволила добиться точности 96%. Гибридная нейронная сеть оказалась точнее машины опорных векторов в среднем на 5%

Directory of Open Access Journals

Идентификация автора исходного кода методами машинного обучения

Author: Куртукова Анна Владимировна
Романов Александр Сергеевич
Publication venue: СПб ФИЦ РАН
Publication date: 07/06/2019
Field of study

The paper is devoted to the analysis of the problem of determining the source code author , which is of interest to researchers in the field of information security, computer forensics, assessment of the quality of the educational process, protection of intellectual property. The paper presents a detailed analysis of modern solutions to the problem. The authors suggest two new identification techniques based on machine learning algorithms: support vector machine, fast correlation filter and informative features; the technique based on hybrid convolutional recurrent neural network. The experimental database includes samples of source codes written in Java, C ++, Python, PHP, JavaScript, C, C # and Ruby. The data was obtained using a web service for hosting IT-projects – Github. The total number of source codes exceeds 150 thousand samples. The average length of each of them is 850 characters. The case size is 542 authors. The experiments were conducted with source codes written in the most popular programming languages. Accuracy of the developed techniques for different numbers of authors was assessed using 10-fold cross-validation. An additional series of experiments was conducted with the number of authors from 2 to 50 for the most popular Java programming language. The graphs of the relationship between identification accuracy and case size are plotted. The analysis of result showed that the method based on hybrid neural network gives 97% accuracy, and it’s at the present time the best-known result. The technique based on the support vector machine made it possible to achieve 96% accuracy. The difference between the results of the hybrid neural network and the support vector machine was approximately 5%.Статья посвящена анализу проблемы определения автора исходного кода, которая представляет интерес для исследователей в области информационной безопасности, компьютерной криминалистики, оценки качества образовательного процесса, защиты интеллектуальной собственности. Представлен подробный анализ современных решений проблемы. Предлагаются две новые методики идентификации на основе алгоритмов машинного обучения: машины опорных векторов, фильтра быстрой корреляции и информативных признаков; гибридной сверточно-рекуррентной нейронной сети. Эксперименты проводились на базе исходных кодов, написанных на наиболее популярных языках программирования. В экспериментальную базу вошли экземпляры исходных кодов, написанных на Java, C++, Python, PHP, JavaScript, C, C# и Ruby. Данные были получены с веб-сервиса для хостинга IT-проектов Github. Общее количество исходных кодов превышает 150 тысяч образцов, средняя длина каждого из которых составляет 850 символов. Размер корпуса — 542 автора. С помощью перекрестной проверки по 10 блокам оценена точность разработанных методик для различного количества авторов. Для наиболее популярного языка программирования Java проведен дополнительный ряд экспериментов с количеством авторов от 2 до 50 и приведены графики зависимости точности идентификации от размера корпуса. Анализ результатов показал, что методика на основе гибридной нейронной сети способна достигать точности 97%, что является наилучшим результатом на сегодняшний день. Методика на основе машины опорных векторов позволила добиться точности 96%. Гибридная нейронная сеть оказалась точнее машины опорных векторов в среднем на 5%

Информатика и автоматизация

Git Blame Who?: Stylistic Authorship Attribution of Small, Incomplete Source Code Fragments

Author: Caliskan Aylin
Dauber Edwin
Greenstadt Rachel
Harang Richard
Nelson Frederica
Shearer Gregory
Weisman Michael
Publication venue: 'Walter de Gruyter GmbH'
Publication date: 01/07/2019
Field of study

Program authorship attribution has implications for the privacy of programmers who wish to contribute code anonymously. While previous work has shown that individually authored complete files can be attributed, these efforts have focused on such ideal data sets as contest submissions and student assignments. We explore the problem of authorship attribution “in the wild,” examining source code obtained from open-source version control systems, and investigate how contributions can be attributed to their authors, either on an individual or a per-account basis. In this work, we present a study of attribution of code collected from collaborative environments and identify factors which make attribution of code fragments more or less successful. For individual contributions, we show that previous methods (adapted to be applied to short code fragments) yield an accuracy of approximately 50% or 60%, depending on whether we average by sample or by author, at identifying the correct author out of a set of 104 programmers. By ensembling the classification probabilities of a sufficiently large set of samples belonging to the same author we achieve much higher accuracy for assigning the set of samples to the correct author from a known suspect set. Additionally, we propose the use of calibration curves to identify which samples are by unknown and previously unencountered authors

Directory of Open Access Journals

Git Blame Who?: Stylistic Authorship Attribution of Small, Incomplete Source Code Fragments

Author
Publication venue: 'Walter de Gruyter GmbH'
Publication date
Field of study

Crossref

Identifying Authorship Style in Malicious Binaries: Techniques, Challenges & Datasets

Author: Cavallaro L
Gray J
Sgandurra D
Publication venue: 'Center for Open Science'
Publication date: 18/01/2021
Field of study

Attributing a piece of malware to its creator typically requires threat intelligence. Binary attribution increases the level of difficulty as it mostly relies upon the ability to disassemble binaries to identify authorship style. Our survey explores malicious author style and the adversarial techniques used by them to remain anonymous. We examine the adversarial impact on the state-of-the-art methods. We identify key findings and explore the open research challenges. To mitigate the lack of ground truth datasets in this domain, we publish alongside this survey the largest and most diverse meta-information dataset of 15,660 malware labeled to 164 threat actor groups

arXiv.org e-Print Archive

UCL Discovery

Source Code Stylometry and Authorship Attribution for Open Source

Author: Watson Daniel
Publication venue: 'University of Waterloo'
Publication date: 18/09/2019
Field of study

Public software repositories such as GitHub make transparent the development history of an open source software system. Source code commits, discussions about new features and bugs, and code reviews are stored and carefully attributed to the appropriate developers. However, sometimes governments may seek to analyze these repositories, to identify citizens who contribute to projects they disapprove of, such as those involving cryptography or social media. While developers who seek anonymity may contribute under assumed identities, their body of public work may be characteristic enough to betray who they really are. The ability to contribute anonymously to public bodies of knowledge is extremely important to the future of technological and intellectual freedoms. Just as in security hacking, the only way to protect vulnerable individuals is by demonstrating the means and strength of available attacks so that those concerned may know of the need and develop the means to protect themselves. In this work, we present a method to de-anonymize source code contributors based on the authors' intrinsic programming style. First, we present a partial replication study wherein we attempt to de-anonymize a large number of entries into the Google Code Jam competition. We base our approach on Caliskan-Islam et al. 2015, but with modifications to the feature set and modelling strategy for scalability and feature-selection robustness. We did not achieve 0.98 F1 achieved in this prior work, but managed a still reasonable 0.71 F1 under identical experimental conditions, and a 0.88 F1 given more data from the same set. Second, we present an exploratory study focused on de-anonymizing programmers who have contributed to a repository, using other commits from the same repository as training data. We train random-forest classifiers using programmer data collected from 37 medium to large open-source repositories. Given a choice between active developers in a project, we were able to correctly determine authorship of a given function about 75% of the time, without the use of identifying meta-data or comments. We were also able to correctly validate a contributor as the author of a questioned function with 80\% recall and 65\% precision. This exploratory study provides empirical support for our approach. Finally, we present the results of a similar, but more difficult study wherein we attempt de-anonymize a repository in the same manner, but without using the target repository as training data. To do this, we gather as much training data as possible from the repository's contributors through the Github API. We evaluate our technique over 3 repositories: Bitcoin, Ethereum (crypto-currencies) and TrinityCore (a game engine). Our results in this experiment starkly contrast our results in the intra-repository study showing accuracies of 35% for Bitcoin, 22% for Ethereum, and 21% for TrinityCore which had candidate set sizes of 6, 5, and 7 respectively. Our results indicate that we can do somewhat better than random guessing, even under difficult experimental conditions, but they also indicate some fundamental issues with the state of the art of Code Stylometry. In this work we present our methodology, results, and some comments on past empirical studies, the difficulties we faced, and likely hurdles for future work in the area

University of Waterloo's Institutional Repository

The Effect of Code Obfuscation on Authorship Attribution of Binary Computer Files

Author: Hendrikse Steven
Publication venue: NSUWorks
Publication date: 01/01/2017
Field of study

In many forensic investigations, questions linger regarding the identity of the authors of the software specimen. Research has identified methods for the attribution of binary files that have not been obfuscated, but a significant percentage of malicious software has been obfuscated in an effort to hide both the details of its origin and its true intent. Little research has been done around analyzing obfuscated code for attribution. In part, the reason for this gap in the research is that deobfuscation of an unknown program is a challenging task. Further, the additional transformation of the executable file introduced by the obfuscator modifies or removes features from the original executable that would have been used in the author attribution process. Existing research has demonstrated good success in attributing the authorship of an executable file of unknown provenance using methods based on static analysis of the specimen file. With the addition of file obfuscation, static analysis of files becomes difficult, time consuming, and in some cases, may lead to inaccurate findings. This paper presents a novel process for authorship attribution using dynamic analysis methods. A software emulated system was fully instrumented to become a test harness for a specimen of unknown provenance, allowing for supervised control, monitoring, and trace data collection during execution. This trace data was used as input into a supervised machine learning algorithm trained to identify stylometric differences in the specimen under test and provide predictions on who wrote the specimen. The specimen files were also analyzed for authorship using static analysis methods to compare prediction accuracies with prediction accuracies gathered from this new, dynamic analysis based method. Experiments indicate that this new method can provide better accuracy of author attribution for files of unknown provenance, especially in the case where the specimen file has been obfuscated

NSU Works