11 research outputs found

    Unsupervised authorship analysis of phishing webpages

    Authorship analysis on phishing websites enables the investigation of phishing attacks, beyond basic analysis. In authorship analysis, salient features from documents are used to determine properties about the author, such as which of a set of candidate authors wrote a given document. In unsupervised authorship analysis, the aim is to group documents such that all documents by one author are grouped together. Applying this to cyber-attacks shows the size and scope of attacks from specific groups. This in turn allows investigators to focus their attention on specific attacking groups rather than trying to profile multiple independent attackers. In this paper, we analyse phishing websites using the current state of the art unsupervised authorship analysis method, called NUANCE. The results indicate that the application produces clusters which correlate strongly to authorship, evaluated using expert knowledge and external information as well as showing an improvement over a previous approach with known flaws. © 2012 IEEE

    Methodologies for decision-making based on qualitative information (Metodologias para tomada de decisão a partir de informações qualitativas)

    Master's dissertation, Universidade Federal de Santa Catarina, Centro Tecnológico, Programa de Pós-Graduação em Ciência da Computação (Graduate Program in Computer Science)

    A fuzzy logic approach to computer software source code authorship analysis

    No full text
    Software source code authorship analysis has become an important area in recent years with promising applications in both the legal sector (such as proof of ownership and software forensics) and the education sector (such as plagiarism detection and assessing style). Authorship analysis encompasses the sub-areas of author discrimination, author characterization, and similarity detection (also referred to as plagiarism detection). While a large number of metrics have been proposed for this task, many borrowed or adapted from the area of computational linguistics, there is a difficulty with capturing certain types of information in terms of quantitative measurement. Here it is proposed that existing numerical metrics should be supplemented with fuzzy-logic linguistic variables to capture more subjective elements of authorship, such as the degree to which comments match the actual source code’s behavior. These variables avoid the need for complex and subjective rules, replacing these with an expert’s judgement. Fuzzy-logic models may also help to overcome problems with small data sets for calibrating such models. Using authorship discrimination as a test case, the utility of objective and fuzzy measures, singularly and in combination, is assessed as well as the consistency of the measures between counters

    The Effect of Code Obfuscation on Authorship Attribution of Binary Computer Files

    In many forensic investigations, questions linger regarding the identity of the authors of the software specimen. Research has identified methods for the attribution of binary files that have not been obfuscated, but a significant percentage of malicious software has been obfuscated in an effort to hide both the details of its origin and its true intent. Little research has been done around analyzing obfuscated code for attribution. In part, the reason for this gap in the research is that deobfuscation of an unknown program is a challenging task. Further, the additional transformation of the executable file introduced by the obfuscator modifies or removes features from the original executable that would have been used in the author attribution process. Existing research has demonstrated good success in attributing the authorship of an executable file of unknown provenance using methods based on static analysis of the specimen file. With the addition of file obfuscation, static analysis of files becomes difficult, time consuming, and in some cases, may lead to inaccurate findings. This paper presents a novel process for authorship attribution using dynamic analysis methods. A software emulated system was fully instrumented to become a test harness for a specimen of unknown provenance, allowing for supervised control, monitoring, and trace data collection during execution. This trace data was used as input into a supervised machine learning algorithm trained to identify stylometric differences in the specimen under test and provide predictions on who wrote the specimen. The specimen files were also analyzed for authorship using static analysis methods to compare prediction accuracies with prediction accuracies gathered from this new, dynamic analysis based method. Experiments indicate that this new method can provide better accuracy of author attribution for files of unknown provenance, especially in the case where the specimen file has been obfuscated

    Source code authorship attribution

    To attribute authorship means to identify the true author among many candidates for samples of work of unknown or contentious authorship. Authorship attribution is a prolific research area for natural language, but much less so for source code, with eight other research groups having published empirical results concerning the accuracy of their approaches to date. Authorship attribution of source code is the focus of this thesis. We first review, reimplement, and benchmark all existing published methods to establish a consistent set of accuracy scores. This is done using four newly constructed and significant source code collections comprising samples from academic sources, freelance sources, and multiple programming languages. The collections developed are the most comprehensive to date in the field. We then propose a novel information retrieval method for source code authorship attribution. In this method, source code features from the collection samples are tokenised, converted into n-grams, and indexed for stylistic comparison to query samples using the Okapi BM25 similarity measure. Authorship of the top ranked sample is used to classify authorship of each query, and the proportion of times that this is correct determines overall accuracy. The results show that this approach is more accurate than the best approach from the previous work for three of the four collections. The accuracy of the new method is then explored in the context of author style evolving over time, by experimenting with a collection of student programming assignments that spans three semesters with established relative timestamps. We find that it takes one full semester for individual coding styles to stabilise, which is essential knowledge for ongoing authorship attribution studies and quality control in general. We conclude the research by extending both the new information retrieval method and previous methods to provide a complete set of benchmarks for advancing the field. 
    In the final evaluation, we show that the n-gram approaches are leading the field, with accuracy scores for some collections around 90% for a one-in-ten classification problem.
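
    The information retrieval approach summarised in this abstract (tokenise, form n-grams, index, rank with Okapi BM25, take the author of the top-ranked sample) can be illustrated with a minimal sketch. Everything specific below is an assumption made for illustration and is not taken from the thesis: the regex tokeniser, the n-gram length of 3, the parameter values k1 = 1.2 and b = 0.75, the non-negative IDF variant, and the names `tokenise`, `ngrams`, and `BM25Index`.

```python
import math
import re
from collections import Counter


def tokenise(source):
    # Hypothetical tokeniser: identifiers/keywords, integers, single punctuation marks.
    return re.findall(r"[A-Za-z_]\w*|\d+|[^\s\w]", source)


def ngrams(tokens, n=3):
    # Overlapping token n-grams, each joined into one index term.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


class BM25Index:
    """Toy in-memory index over (author, source_text) samples."""

    def __init__(self, samples, n=3, k1=1.2, b=0.75):
        # n, k1 and b are illustrative defaults, not values from the thesis.
        self.n, self.k1, self.b = n, k1, b
        self.authors = [author for author, _ in samples]
        self.docs = [Counter(ngrams(tokenise(text), n)) for _, text in samples]
        self.lengths = [sum(doc.values()) for doc in self.docs]
        self.avg_len = sum(self.lengths) / len(self.docs)
        # Document frequency: number of samples containing each n-gram.
        self.df = Counter(term for doc in self.docs for term in doc)

    def idf(self, term):
        # Non-negative IDF variant (an assumption; several variants exist).
        n_docs = len(self.docs)
        df = self.df.get(term, 0)
        return math.log(1 + (n_docs - df + 0.5) / (df + 0.5))

    def score(self, query_terms, i):
        # BM25 score of indexed sample i against the query's distinct n-grams.
        doc, length = self.docs[i], self.lengths[i]
        total = 0.0
        for term in set(query_terms):
            tf = doc.get(term, 0)
            if tf == 0:
                continue
            norm = tf + self.k1 * (1 - self.b + self.b * length / self.avg_len)
            total += self.idf(term) * tf * (self.k1 + 1) / norm
        return total

    def attribute(self, query_source):
        # The author of the top-ranked sample is the predicted author.
        query_terms = ngrams(tokenise(query_source), self.n)
        scores = [self.score(query_terms, i) for i in range(len(self.docs))]
        best = max(range(len(scores)), key=scores.__getitem__)
        return self.authors[best]
```

    Calling `BM25Index(samples).attribute(query)` with `samples` given as a list of `(author, source_text)` pairs returns the author label of the best-matching sample. Overall accuracy, as the abstract describes it, would be the proportion of queries for which that label is correct.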

    Software forensics: extending authorship analysis techniques to computer programs

    No full text
    Please note that this is a searchable PDF derived via optical character recognition (OCR) from the original source document. As the OCR process is never 100% perfect, there may be some discrepancies between the document image and the underlying text. The number of occurrences and severity of computer-based attacks such as viruses and worms, logic bombs, trojan horses, computer fraud, and plagiarism of code have become of increasing concern. In an attempt to better deal with these problems it is proposed that methods for examining the authorship of computer programs are necessary. This field is referred to here as software forensics. This involves the areas of author discrimination, identification, and characterisation, as well as intent analysis. Borrowing extensively from the existing fields of linguistics and software metrics, this can be seen as a new and exciting area for forensics to extend into. Unpublished.

    Computer-mediated communication: experiments with e-mail readability

    Please note that this is a searchable PDF derived via optical character recognition (OCR) from the original source document. As the OCR process is never 100% perfect, there may be some discrepancies between the document image and the underlying text. [No abstract.] Unpublished.
