Harvest -- An Open Source Toolkit for Extracting Posts and Post Metadata from Web Forums
Automatic extraction of forum posts and metadata is a crucial but challenging
task since forums do not expose their content in a standardized structure.
Content extraction methods, therefore, often need customizations such as
adaptations to page templates and improvements of their extraction code before
they can be deployed to new forums. Most of the current solutions are also
built for the more general case of content extraction from web pages and lack
key features important for understanding forum content such as the
identification of author metadata and information on the thread structure.
This paper, therefore, presents a method that determines the XPath of forum
posts, eliminating incorrect mergers and splits of the extracted posts that
were common in systems from the previous generation. Based on the individual
posts, further metadata such as authors, forum URL and structure are extracted.
We also introduce Harvest, a new open source toolkit that implements the
presented methods, and create a gold standard extracted from 52 different Web
forums for evaluating our approach. A comprehensive evaluation reveals that
Harvest clearly outperforms competing systems.
Comment: IEEE/WIC/ACM International Joint Conference on Web Intelligence and
Intelligent Agent Technology (WI-IAT 2020), Accepted 27 October 202
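
The idea of locating forum posts through a shared XPath can be illustrated with a small sketch. This is not Harvest's algorithm or API; it is a simplified heuristic written for illustration, assuming well-formed markup and using only Python's standard library. The function names (`guess_post_xpath`, `extract_posts`), the sample page, and the `author` and `text` class names are all hypothetical. The heuristic groups elements by their class-qualified path and picks the repeated path that covers the most text, then reads per-post metadata relative to each matched post.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical, well-formed sample page; real forum HTML would first need
# to be parsed by a lenient HTML parser.
SAMPLE = """<html><body>
<div class="nav"><a>Home</a><a>Forum</a></div>
<div class="thread">
<div class="post"><span class="author">alice</span><p class="text">How do I parse forum pages?</p></div>
<div class="post"><span class="author">bob</span><p class="text">Try locating the repeated post container.</p></div>
</div>
</body></html>"""


def candidate_paths(root):
    """Yield (path, element) pairs; each path is relative to the root and
    qualifies every step with its class attribute when one is present."""
    def step(el):
        cls = el.get("class")
        return f"{el.tag}[@class='{cls}']" if cls else el.tag

    def walk(el, path):
        yield path, el
        for child in el:
            yield from walk(child, f"{path}/{step(child)}")

    yield from walk(root, ".")


def guess_post_xpath(root, min_repeats=2):
    """Return the repeated path whose elements contain the most text.

    Assumption: posts are the repeated containers holding most of the
    page's text. This is a toy heuristic, not the method from the paper.
    """
    groups = defaultdict(list)
    for path, el in candidate_paths(root):
        groups[path].append(el)

    best_path, best_score = None, 0
    for path, elements in groups.items():
        if len(elements) < min_repeats:
            continue  # a single element cannot be a list of posts
        score = sum(len("".join(el.itertext()).strip()) for el in elements)
        if score > best_score:
            best_path, best_score = path, score
    return best_path


def extract_posts(root, post_xpath):
    """Extract author and text from each post matched by post_xpath."""
    return [
        {
            "author": post.findtext(".//*[@class='author']"),
            "text": post.findtext(".//*[@class='text']"),
        }
        for post in root.findall(post_xpath)
    ]


if __name__ == "__main__":
    page = ET.fromstring(SAMPLE)
    xpath = guess_post_xpath(page)
    print(xpath)
    for post in extract_posts(page, xpath):
        print(post)
```

Because every post is selected by the same path, each matched element corresponds to exactly one post, which is the property that avoids merged or split posts. Metadata lookups are then scoped to the individual post element rather than to the whole page.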