With the widespread use of the internet, it has become increasingly crucial
to efficiently extract specific information from the vast number of academic
articles available. Data mining techniques are generally employed for this task.
However, data mining for academic articles is challenging since it requires
automatically extracting specific patterns from documents with complex,
unstructured layouts. Current data mining methods for academic articles employ
rule-based (RB) or machine learning (ML) approaches. However, rule-based
methods incur a high coding cost for articles with complex typesetting. On the
other hand, relying on machine learning alone requires costly annotation of
the complex content types within a paper. Furthermore, using only machine
learning can lead to cases where patterns easily recognized by rule-based
methods are extracted incorrectly. To overcome these issues, we analyze the
standard layout and typesetting used in the specified publication and
implement specific methods for specific characteristics of academic articles.
We have developed a novel Text Block Refinement Framework (TBRF), a hybrid of
machine learning and rule-based schemes.
We validated the framework on articles from the well-known ACL proceedings.
The experiments show that our approach achieved over 95%
classification accuracy and 90% detection accuracy for tables and figures.

Comment: This paper has been accepted at 'The International Symposium on
Innovations in Intelligent Systems and Applications 2023 (INISTA 2023)'.