Log data is pivotal in activities like anomaly detection and failure
diagnosis in the automated maintenance of software systems. Because logs are
unstructured, log parsing is often required to transform them into a
structured format for automated analysis. A variety of log parsers exist,
making it vital to benchmark these tools to comprehend their features and
performance. However, existing datasets for log parsing are limited in terms of
scale and representativeness, posing challenges for studies that aim to
evaluate or develop log parsers. This problem becomes more pronounced when
these parsers are evaluated for production use. To address these issues, we
introduce a new collection of large-scale annotated log datasets, named LogPub,
which more accurately mirrors log data observed in real-world software systems.
LogPub comprises 14 datasets with an average of 3.6 million log lines each. Using
LogPub, we re-evaluate 15 log parsers in a more rigorous and practical setting.
We also propose a new evaluation metric to lessen the sensitivity of current
metrics to imbalanced data distribution. Furthermore, we are the first to
scrutinize the detailed performance of log parsers on logs that represent rare
system events and that offer comprehensive information for system troubleshooting.
Parsing such logs accurately is vital yet challenging. We believe that our work
could shed light on the design and evaluation of log parsers in more realistic
settings, thereby facilitating their implementation in production systems.