2 research outputs found

    A MACHINE LEARNING APPROACH FOR CLASSIFYING JAVASCRIPT USING STATIC CODE ANALYSIS

    This thesis develops a machine learning approach to classify normal and anomalous JavaScript based on a static analysis of select features derived from the top 30 000 webpages on the internet. A dataset of 136 features was extracted from 100 000 raw JavaScript files. Nine test groups were created and tested using 10 subsets of features. K-means clustering was used to group the data, and the resulting clusters were manually translated into a binary classification. The K-means results show moderate performance, with distortions less than 1.0 from elbow plot analysis and average silhouette scores between 0.3 and 0.8. The classification of each JavaScript file was then examined using a naïve Bayes algorithm to re-create and examine the performance of the highest performing classifiers using a less processing-intensive method. Naïve Bayes was not a good model to re-create the K-means classifier. The best performing classifiers had a Matthews correlation coefficient of 0.75 when examining small JavaScript, and less than 0.38 when examining the medium or large JavaScript. The results show that most JavaScript files were small in file size, and file size was the only defining feature. No feature tested other than file size effectively categorizes the vast majority of JavaScript. Further research is needed to find features that more accurately encompass the majority of JavaScript to define normal JavaScript.
    National Security Agency, Ft. Meade, MD, 20755
    Lieutenant, United States Navy
    Approved for public release. Distribution is unlimited.
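
    The abstract above reports classifier quality as a Matthews correlation coefficient (MCC). As a reading aid, here is a minimal sketch of how MCC is computed from binary confusion-matrix counts. The function name and the example counts are hypothetical and are not taken from the thesis; returning 0.0 when the denominator is zero is a common convention, assumed here.

```python
import math


def matthews_corrcoef(tp: int, fp: int, tn: int, fn: int) -> float:
    """Return the MCC for a binary confusion matrix.

    MCC ranges from -1 (total disagreement) through 0 (no better than
    chance) to +1 (perfect agreement). If any marginal total is zero the
    ratio is undefined; this sketch returns 0.0 in that case (assumption).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical counts, chosen only to show the extremes of the scale.
perfect = matthews_corrcoef(tp=50, fp=0, tn=50, fn=0)
chance = matthews_corrcoef(tp=25, fp=25, tn=25, fn=25)
inverted = matthews_corrcoef(tp=0, fp=50, tn=0, fn=50)
```

    Unlike plain accuracy, MCC uses all four cells of the confusion matrix, so it remains informative when one class (here, small files) dominates the data.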

    Challenging machine learning algorithms in predicting vulnerable javascript functions

    The rapid rise of cyber-crime activities and the growing number of devices threatened by them place software security issues in the spotlight. As around 90% of all attacks exploit known types of security issues, finding vulnerable components and applying existing mitigation techniques is a viable practical approach for fighting against cyber-crime. In this paper, we investigate how state-of-the-art machine learning techniques, including a popular deep learning algorithm, perform in predicting functions with possible security vulnerabilities in JavaScript programs. We applied 8 machine learning algorithms to build prediction models using a new dataset constructed for this research from the vulnerability information in public databases of the Node Security Project and the Snyk platform, and code fixing patches from GitHub. We used static source code metrics as predictors and an extensive grid-search algorithm to find the best performing models. We also examined the effect of various re-sampling strategies to handle the imbalanced nature of the dataset. The best performing algorithm was KNN, which created a model for the prediction of vulnerable functions with an F-measure of 0.76 (0.91 precision and 0.66 recall). Moreover, deep learning, tree- and forest-based classifiers, and SVM were competitive, with F-measures over 0.70. Although the F-measures did not vary significantly with the re-sampling strategies, the distribution of precision and recall did change. Omitting re-sampling seemed to produce models preferring high precision, while re-sampling strategies balanced the IR measures. © 2019 IEEE
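
    The reported F-measure can be related to the stated precision and recall with a short calculation. The sketch below assumes that "F-measure" means the balanced F1 score (the harmonic mean of precision and recall); the abstract does not state the weighting, and the function name is hypothetical.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """Return the F-beta score; beta=1.0 gives the harmonic mean (F1).

    Returns 0.0 when both inputs are zero, where the ratio is undefined.
    """
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Precision and recall as quoted for the KNN model in the abstract.
f1 = f_measure(precision=0.91, recall=0.66)
print(f"F1 from rounded inputs: {f1:.3f}")
```

    With the two-decimal inputs quoted above, the result is about 0.765, which agrees with the reported 0.76 to within the rounding of the published figures. The gap between 0.91 precision and 0.66 recall also illustrates the abstract's point that models trained without re-sampling favoured precision.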