Automatic Document Layout Analysis through Relational Machine Learning
The current spread of digital documents has raised the need for effective
content-based retrieval techniques. Since manual indexing is infeasible and subjective,
automatic techniques are the obvious solution. In particular, the ability to
properly identify and understand a document’s structure is crucial, in order
to focus on the most significant components only. At a geometrical level, this task
is known as Layout Analysis, and has been thoroughly studied in the literature. Given suitable
descriptions of the document layout, Machine Learning techniques can be applied
to automatically infer models of classes of documents and of their components. Indeed,
organizing the documents on the grounds of the knowledge they contain is
fundamental for being able to correctly access them according to the user’s needs.
Thus, the quality of the layout analysis outcome affects the subsequent understanding
steps. Unfortunately, due to the variety of document styles and formats, the automatically
found structure often needs to be manually adjusted. We propose the application
of supervised Machine Learning techniques to infer correction rules to be
applied to forthcoming documents. A first-order logic representation is suggested,
because corrections often depend on the relationships between the wrongly identified components and
the surrounding ones. Moreover, as a consequence of the continuous flow of documents,
the learned models often need to be updated and refined, which calls for
incremental abilities. The proposed technique, embedded in a prototype version
of the document processing system DOMINUS and using the incremental first-order
logic learner INTHELEX, showed good performance in real-world experiments.
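
To illustrate why a relational description suits layout corrections, the following minimal Python sketch describes layout blocks by their bounding boxes and applies one correction rule that depends on how two blocks relate to each other. All names (Block, directly_above, left_aligned, merge_candidates) and the thresholds are hypothetical. The rule is hand-written for illustration only: it is not a rule learned by INTHELEX, nor the representation actually used in DOMINUS.

```python
from dataclasses import dataclass
from itertools import permutations


@dataclass(frozen=True)
class Block:
    """A rectangular layout block; y grows downward (hypothetical convention)."""
    name: str
    x0: float
    y0: float
    x1: float
    y1: float


def overlaps_horizontally(a: Block, b: Block) -> bool:
    # True when the horizontal extents of the two blocks intersect.
    return a.x0 < b.x1 and b.x0 < a.x1


def directly_above(a: Block, b: Block, max_gap: float) -> bool:
    # Relational predicate: a ends at most max_gap above the top of b,
    # and the two blocks share some horizontal extent.
    gap = b.y0 - a.y1
    return overlaps_horizontally(a, b) and 0 <= gap <= max_gap


def left_aligned(a: Block, b: Block, tolerance: float) -> bool:
    # Relational predicate: the left edges differ by at most tolerance.
    return abs(a.x0 - b.x0) <= tolerance


def merge_candidates(blocks, max_gap=5.0, tolerance=2.0):
    # Assumed example rule, written in a first-order style:
    #   merge(A, B) :- directly_above(A, B), left_aligned(A, B).
    # The decision depends on the relationship between A and B,
    # not on the attributes of either block alone.
    return [
        (a.name, b.name)
        for a, b in permutations(blocks, 2)
        if directly_above(a, b, max_gap) and left_aligned(a, b, tolerance)
    ]
```

Calling merge_candidates on a list of Block instances returns the pairs of block names that the example rule would propose to merge. A learner in the spirit described above would induce such rules from manually corrected examples instead of having them written by hand.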