Study on log event noise reduction by using Naive Bayes supervised machine learning

Abstract

This research examines which Naive Bayes model best predicts Windows log events that can be considered noise, in other words events that contain no information about malicious activity. With the exploding amount of log data generated by servers, large corporations and organizations are having an increasingly difficult time analyzing these logs to find evidence of malicious activity in their environments. Fortune 200 and larger corporations today produce terabytes of log events daily, and this volume is growing at a rate that will soon reach petabytes. It is estimated that 80 to 90 percent of these log events could be classified as noise or as purely informational; they are not needed for finding evidence of malicious activity. Given a process that predicts, with a reasonable degree of accuracy, whether a log event is noise or non-noise, the tools used to analyze log events for malicious activity could filter out noise events and reduce the amount of data that must be processed. This research compares the Naive Bayes Bag of Words Multinomial, Multinomial TF-IDF and Multi-Variate Bernoulli models, using feature word sets of different sizes, in predicting Windows noise log events.
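The comparison described above can be illustrated with a minimal sketch in plain Python. It is not the study's implementation: the toy messages, the labels, the smoothing constant alpha, the smoothed IDF formula, and every function name are assumptions made for illustration. The feature word set is assumed here to be the k most frequent words in the training messages, and each of the three models is trained on its own representation of those words (raw counts, TF-IDF weights, or binary presence).

```python
import math
from collections import Counter

# Hypothetical toy data, not taken from the study: (log message, label).
TRAIN = [
    ("the windows update service entered the running state", "noise"),
    ("the time service is synchronizing the system time", "noise"),
    ("an account failed to log on", "non-noise"),
    ("a user account was created", "non-noise"),
]

def tokenize(text):
    return text.lower().split()

def top_k_vocab(docs, k):
    """Feature word set: the k most frequent words (an assumed selection rule)."""
    counts = Counter(w for d in docs for w in tokenize(d))
    return [w for w, _ in counts.most_common(k)]

def bow(doc, vocab):
    """Bag of words: how often each feature word occurs in the message."""
    c = Counter(tokenize(doc))
    return [c[w] for w in vocab]

def binary(doc, vocab):
    """Bernoulli features: 1 if the feature word occurs at all, else 0."""
    present = set(tokenize(doc))
    return [1 if w in present else 0 for w in vocab]

def idf_weights(docs, vocab):
    """Smoothed inverse document frequency (one common variant, assumed here)."""
    n = len(docs)
    doc_sets = [set(tokenize(d)) for d in docs]
    return [math.log((1 + n) / (1 + sum(w in s for s in doc_sets))) + 1
            for w in vocab]

def tfidf(doc, vocab, idf):
    return [tf * w for tf, w in zip(bow(doc, vocab), idf)]

def train_multinomial(X, y, alpha=1.0):
    """Multinomial Naive Bayes with additive smoothing on (weighted) counts."""
    model = {}
    for label in sorted(set(y)):
        rows = [x for x, lab in zip(X, y) if lab == label]
        totals = [sum(col) for col in zip(*rows)]
        denom = sum(totals) + alpha * len(totals)
        model[label] = (math.log(len(rows) / len(y)),
                        [math.log((t + alpha) / denom) for t in totals])
    return model

def predict_multinomial(model, x):
    def score(label):
        prior, log_probs = model[label]
        return prior + sum(v * lp for v, lp in zip(x, log_probs))
    return max(model, key=score)

def train_bernoulli(X, y, alpha=1.0):
    """Multi-variate Bernoulli Naive Bayes: per-word presence probabilities."""
    model = {}
    for label in sorted(set(y)):
        rows = [x for x, lab in zip(X, y) if lab == label]
        probs = [(sum(col) + alpha) / (len(rows) + 2 * alpha)
                 for col in zip(*rows)]
        model[label] = (math.log(len(rows) / len(y)), probs)
    return model

def predict_bernoulli(model, x):
    def score(label):
        prior, probs = model[label]
        # Absent words also contribute, via log(1 - p).
        return prior + sum(math.log(p if v else 1 - p)
                           for v, p in zip(x, probs))
    return max(model, key=score)

def classify(message, k):
    """Train all three models on the top-k feature words and classify one message."""
    docs = [d for d, _ in TRAIN]
    y = [lab for _, lab in TRAIN]
    vocab = top_k_vocab(docs, k)
    idf = idf_weights(docs, vocab)
    m_bow = train_multinomial([bow(d, vocab) for d in docs], y)
    m_tfidf = train_multinomial([tfidf(d, vocab, idf) for d in docs], y)
    m_bern = train_bernoulli([binary(d, vocab) for d in docs], y)
    return {
        "multinomial_bow": predict_multinomial(m_bow, bow(message, vocab)),
        "multinomial_tfidf": predict_multinomial(m_tfidf, tfidf(message, vocab, idf)),
        "bernoulli": predict_bernoulli(m_bern, binary(message, vocab)),
    }
```

Calling classify(message, k) for several values of k shows how the size of the feature word set changes what each model can see. A real comparison would need a labeled corpus of Windows log events and held-out test data, neither of which this sketch provides.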
