Abstract-As more software moves to Data Analytics as a Service (DAaaS), web applications have become ubiquitous and log file analysis has become a necessary task for understanding client behaviour. Log files are generated very fast, at a rate of 1-10 MB/s per server, and a single data centre can generate tens of terabytes of log data in a day. Analyzing such large datasets requires a parallel processing system and a reliable data storage mechanism. A virtual database system is an effective solution for integrating data, but it becomes inefficient for large datasets. Because log files arrive as continuous streams from distributed servers, an efficient way to collect such data and deliver it to storage servers is needed. Our system uses Apache Flume to gather streams of log data from various servers and store them in HDFS, the Hadoop Distributed File System. MapReduce, a parallel processing strategy, splits the input data and sends fractions of it to several machines in the Hadoop cluster. This mechanism processes huge amounts of log data in parallel, using all the machines in the cluster, and computes results efficiently, reducing computation time, response time, and the load on the end system. This paper proposes server log analysis that uses Apache Flume for continuous streaming and Apache Hadoop, through MapReduce and Pig, for analyzing such logs, thereby providing accurate results for clients.
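The split/map/shuffle/reduce strategy described in the abstract can be sketched in plain Python (illustrative only, not the Hadoop Java API or Pig; the log format and field positions are assumed). Here a map phase emits a (client_ip, 1) pair for each line in a partition of log lines, and a reduce phase sums the counts per key:

```python
from collections import defaultdict
from itertools import chain

def map_phase(partition):
    """Map: emit (client_ip, 1) for each log line in one partition.
    Assumes the client IP is the first whitespace-separated field."""
    return [(line.split()[0], 1) for line in partition if line.strip()]

def reduce_phase(pairs):
    """Reduce: sum the counts per key after the shuffle."""
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)

def run_job(log_lines, n_partitions=3):
    # Split the input into partitions, as input data is split
    # across machines in a Hadoop cluster.
    partitions = [log_lines[i::n_partitions] for i in range(n_partitions)]
    mapped = chain.from_iterable(map_phase(p) for p in partitions)
    return reduce_phase(mapped)

logs = [
    "10.0.0.1 GET /index.html 200",
    "10.0.0.2 GET /login 200",
    "10.0.0.1 POST /login 302",
]
print(run_job(logs))  # requests per client IP
```

In a real deployment each partition would be an HDFS block processed on a separate cluster node, and a framework-managed shuffle would route all pairs with the same key to one reducer; the sketch only shows the logical dataflow.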