8 research outputs found
Source code author identification using machine learning methods
The paper analyzes the problem of identifying the author of source code, which is of interest to researchers in information security, computer forensics, assessment of the quality of the educational process, and protection of intellectual property.
A detailed analysis of current solutions to the problem is presented. Two new identification techniques based on machine learning algorithms are proposed: one using a support vector machine, a fast correlation filter and informative features; the other using a hybrid convolutional-recurrent neural network.
The experiments were run on a collection of source codes written in the most popular programming languages. The experimental collection includes source code samples written in Java, C++, Python, PHP, JavaScript, C, C# and Ruby. The data were obtained from GitHub, a web service for hosting IT projects. The total number of source codes exceeds 150 thousand samples, with an average length of 850 characters each. The corpus size is 542 authors.
The accuracy of the developed techniques was assessed for different numbers of authors using 10-fold cross-validation. For Java, the most popular programming language, an additional series of experiments was run with the number of authors ranging from 2 to 50, and plots of identification accuracy against corpus size are provided.
Analysis of the results showed that the technique based on the hybrid neural network can reach 97% accuracy, which is the best result to date. The technique based on the support vector machine reached 96% accuracy. On average, the hybrid neural network was 5% more accurate than the support vector machine.
Source code author identification using machine learning methods
The paper analyzes the problem of determining the author of source code, which is of interest to researchers in the fields of information security, computer forensics, assessment of the quality of the educational process, and protection of intellectual property.
The paper presents a detailed analysis of modern solutions to the problem. The authors propose two new identification techniques based on machine learning algorithms: one based on a support vector machine, a fast correlation filter and informative features, and one based on a hybrid convolutional-recurrent neural network.
The experimental database includes samples of source code written in Java, C++, Python, PHP, JavaScript, C, C# and Ruby. The data were obtained from GitHub, a web service for hosting IT projects. The total number of source codes exceeds 150 thousand samples, with an average length of 850 characters each. The corpus size is 542 authors.
The experiments were conducted with source codes written in the most popular programming languages. The accuracy of the developed techniques for different numbers of authors was assessed using 10-fold cross-validation. An additional series of experiments was conducted with the number of authors ranging from 2 to 50 for Java, the most popular programming language. Graphs of the relationship between identification accuracy and corpus size are plotted. The analysis of the results showed that the method based on the hybrid neural network gives 97% accuracy, which is currently the best-known result. The technique based on the support vector machine achieved 96% accuracy. On average, the hybrid neural network was approximately 5% more accurate than the support vector machine.
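The 10-fold cross-validation mentioned in the abstract above can be sketched in plain Python as follows. This is only an illustrative sketch, not the authors' evaluation code: the function names, the shuffling seed, and the toy indentation-based classifier are all assumptions introduced here, standing in for the support vector machine and neural network models the paper describes.

```python
import random
from collections import Counter


def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle sample indices and split them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validated_accuracy(samples, labels, train_fn, predict_fn, k=10, seed=0):
    """Average held-out accuracy over k folds (assumes at least k samples)."""
    accuracies = []
    for held_out in k_fold_indices(len(samples), k, seed):
        held_out_set = set(held_out)
        train_idx = [i for i in range(len(samples)) if i not in held_out_set]
        # Fit on the other k-1 folds, then score on the held-out fold only.
        model = train_fn([samples[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        correct = sum(predict_fn(model, samples[i]) == labels[i] for i in held_out)
        accuracies.append(correct / len(held_out))
    return sum(accuracies) / k


def train_indent_model(train_samples, train_labels):
    """Hypothetical toy classifier: most common author per indentation style."""
    votes = {}
    for code, author in zip(train_samples, train_labels):
        votes.setdefault(code.startswith("\t"), Counter())[author] += 1
    return {style: counts.most_common(1)[0][0] for style, counts in votes.items()}


def predict_indent_model(model, code):
    """Predict the author remembered for this sample's indentation style."""
    return model.get(code.startswith("\t"))


if __name__ == "__main__":
    # Made-up data: one author indents with tabs, the other with spaces.
    samples = ["\tx = 1"] * 20 + ["    x = 1"] * 20
    labels = ["alice"] * 20 + ["bob"] * 20
    print(cross_validated_accuracy(samples, labels,
                                   train_indent_model, predict_indent_model))
```

Each sample is held out exactly once, so the averaged figure reflects accuracy on code the model did not see during training.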
Git Blame Who?: Stylistic Authorship Attribution of Small, Incomplete Source Code Fragments
Program authorship attribution has implications for the privacy of programmers who wish to contribute code anonymously. While previous work has shown that individually authored complete files can be attributed, these efforts have focused on such ideal data sets as contest submissions and student assignments. We explore the problem of authorship attribution "in the wild," examining source code obtained from open-source version control systems, and investigate how contributions can be attributed to their authors, either on an individual or a per-account basis. In this work, we present a study of attribution of code collected from collaborative environments and identify factors which make attribution of code fragments more or less successful. For individual contributions, we show that previous methods (adapted to be applied to short code fragments) yield an accuracy of approximately 50% or 60%, depending on whether we average by sample or by author, at identifying the correct author out of a set of 104 programmers. By ensembling the classification probabilities of a sufficiently large set of samples belonging to the same author, we achieve much higher accuracy for assigning the set of samples to the correct author from a known suspect set. Additionally, we propose the use of calibration curves to identify which samples are by unknown and previously unencountered authors.
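The idea of ensembling per-sample classification probabilities for a set of fragments by the same author can be illustrated with a short sketch. The abstract does not say how the probabilities are combined, so the simple averaging below is an assumption, and the function name, author names, and probability values are hypothetical.

```python
def ensemble_author(sample_probabilities):
    """Average per-sample author probabilities and return the top author.

    sample_probabilities: list of dicts mapping author -> probability,
    one dict per code fragment assumed to come from the same author.
    """
    if not sample_probabilities:
        raise ValueError("need at least one sample")
    totals = {}
    for probs in sample_probabilities:
        for author, p in probs.items():
            totals[author] = totals.get(author, 0.0) + p
    n = len(sample_probabilities)
    # Authors missing from a sample's dict contribute zero for that sample.
    averaged = {author: total / n for author, total in totals.items()}
    best = max(averaged, key=averaged.get)
    return best, averaged


if __name__ == "__main__":
    # Made-up classifier outputs for three fragments from one account.
    fragments = [
        {"alice": 0.4, "bob": 0.6},  # taken alone, this fragment points to bob
        {"alice": 0.7, "bob": 0.3},
        {"alice": 0.6, "bob": 0.4},
    ]
    best, averaged = ensemble_author(fragments)
    print(best, averaged)
```

In this toy input the first fragment would be misattributed on its own, while the averaged probabilities favor the author supported by the set as a whole.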
Identifying Authorship Style in Malicious Binaries: Techniques, Challenges & Datasets
Attributing a piece of malware to its creator typically requires threat intelligence. Binary attribution increases the level of difficulty, as it mostly relies upon the ability to disassemble binaries to identify authorship style. Our survey explores malicious author style and the adversarial techniques malware authors use to remain anonymous. We examine the adversarial impact on the state-of-the-art methods. We identify key findings and explore the open research challenges. To mitigate the lack of ground truth datasets in this domain, we publish alongside this survey the largest and most diverse meta-information dataset of 15,660 malware labeled to 164 threat actor groups.
Source Code Stylometry and Authorship Attribution for Open Source
Public software repositories such as GitHub make transparent the development history of an open source software system. Source code commits, discussions about new features and bugs, and code reviews are stored and carefully attributed to the appropriate developers. However, sometimes governments may seek to analyze these repositories, to identify citizens who contribute to projects they disapprove of, such as those involving cryptography or social media. While developers who seek anonymity may contribute under assumed identities, their body of public work may be characteristic enough to betray who they really are. The ability to contribute anonymously to public bodies of knowledge is extremely important to the future of technological and intellectual freedoms. Just as in security hacking, the only way to protect vulnerable individuals is by demonstrating the means and strength of available attacks so that those concerned may know of the need and develop the means to protect themselves.
In this work, we present a method to de-anonymize source code contributors based on the authors' intrinsic programming style. First, we present a partial replication study wherein we attempt to de-anonymize a large number of entries into the Google Code Jam competition. We base our approach on Caliskan-Islam et al. 2015, but with modifications to the feature set and modelling strategy for scalability and feature-selection robustness. We did not reach the 0.98 F1 reported in that prior work, but managed a still reasonable 0.71 F1 under identical experimental conditions, and a 0.88 F1 given more data from the same set.
Second, we present an exploratory study focused on de-anonymizing programmers who have contributed to a repository, using other commits from the same repository as training data. We train random-forest classifiers using programmer data collected from 37 medium to large open-source repositories. Given a choice between active developers in a project, we were able to correctly determine authorship of a given function about 75% of the time, without the use of identifying meta-data or comments. We were also able to correctly validate a contributor as the author of a questioned function with 80% recall and 65% precision. This exploratory study provides empirical support for our approach.
Finally, we present the results of a similar, but more difficult study wherein we attempt to de-anonymize a repository in the same manner, but without using the target repository as training data. To do this, we gather as much training data as possible from the repository's contributors through the GitHub API. We evaluate our technique over 3 repositories: Bitcoin, Ethereum (crypto-currencies) and TrinityCore (a game engine). Our results in this experiment starkly contrast with our results in the intra-repository study, showing accuracies of 35% for Bitcoin, 22% for Ethereum, and 21% for TrinityCore, which had candidate set sizes of 6, 5, and 7 respectively.
Our results indicate that we can do somewhat better than random guessing, even under difficult experimental conditions, but they also indicate some fundamental issues with the state of the art of Code Stylometry. In this work we present our methodology, results, and some comments on past empirical studies, the difficulties we faced, and likely hurdles for future work in the area.
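The abstract above reports results as precision, recall, and F1. As a reminder of how these relate, here is a minimal sketch using the standard definitions. The confusion counts in the usage example are hypothetical, chosen only so that recall comes out at 80% and precision at 65%; they are not taken from the study.

```python
def precision_recall_f1(true_positives, false_positives, false_negatives):
    """Standard precision, recall, and F1 from confusion counts."""
    predicted_positive = true_positives + false_positives
    actual_positive = true_positives + false_negatives
    precision = true_positives / predicted_positive if predicted_positive else 0.0
    recall = true_positives / actual_positive if actual_positive else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical counts: 65 functions truly by the contributor, 52 of them
    # confirmed (recall 52/65), plus 28 wrongly confirmed (precision 52/80).
    precision, recall, f1 = precision_recall_f1(52, 28, 13)
    print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Because F1 is a harmonic mean, it sits closer to the lower of the two values than a plain average would.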
The Effect of Code Obfuscation on Authorship Attribution of Binary Computer Files
In many forensic investigations, questions linger regarding the identity of the authors of the software specimen. Research has identified methods for the attribution of binary files that have not been obfuscated, but a significant percentage of malicious software has been obfuscated in an effort to hide both the details of its origin and its true intent. Little research has been done around analyzing obfuscated code for attribution. In part, the reason for this gap in the research is that deobfuscation of an unknown program is a challenging task. Further, the additional transformation of the executable file introduced by the obfuscator modifies or removes features from the original executable that would have been used in the author attribution process. Existing research has demonstrated good success in attributing the authorship of an executable file of unknown provenance using methods based on static analysis of the specimen file. With the addition of file obfuscation, static analysis of files becomes difficult, time-consuming, and in some cases, may lead to inaccurate findings. This paper presents a novel process for authorship attribution using dynamic analysis methods. A software-emulated system was fully instrumented to become a test harness for a specimen of unknown provenance, allowing for supervised control, monitoring, and trace data collection during execution. This trace data was used as input into a supervised machine learning algorithm trained to identify stylometric differences in the specimen under test and provide predictions on who wrote the specimen. The specimen files were also analyzed for authorship using static analysis methods to compare prediction accuracies with those obtained from this new, dynamic-analysis-based method. Experiments indicate that this new method can provide better accuracy of author attribution for files of unknown provenance, especially in the case where the specimen file has been obfuscated.