AbstractData cleaning is an important step performed in the preprocessing stage of web usage mining, and is widely used in many data mining systems. Despite many efforts on data cleaning for web server logs, it is still an open question for enterprise proxy logs. With unlimited accesses to websites, enterprise proxy logs trace web requests from multiple clients to multiple web servers,which make them quite different from web sever logs on both location and content. Therefore, many irrelevant items such as software updating requests cannot be filtered out by traditional data cleaning methods. In this paper, we propose the first method named EPLogCleaner that can filter out plenty of irrelevant items based on the common prefix of their URLs. We make an evaluation of EPLogCleaner with a real network traffic trace captured from one enterprise proxy. Experimental results show that EPLogCleaner can improve data quality of enterprise proxy logs by further filtering out more than 30% URL requests comparing with traditional data cleaning methods
Is data on this page outdated, violates copyrights or anything else? Report the problem now and we will take corresponding actions after reviewing your request.