A novel Markov logic rule induction strategy for characterizing sports video footage
The grounding of high-level semantic concepts is a key requirement of video annotation systems. Rule induction can thus constitute an invaluable intermediate step in characterizing protocol-governed domains, such as broadcast sports footage. We here set out a novel “clause grammar template” approach to the problem of rule induction in video footage of court games that employs a second-order meta grammar for Markov Logic Network construction. The aim is to build an adaptive system for sports video annotation capable, in principle, both of learning ab initio and of adaptively transferring learning between distinct rule domains. The method is tested with respect to both a simulated game predicate generator and real data derived from tennis footage via computer-vision-based approaches, including HOG3D-based player-action classification, Hough-transform-based court detection, and graph-theoretic ball-tracking. Experiments demonstrate that the method exhibits both error resilience and learning transfer in the court domain context. Moreover, the clause template approach naturally generalizes to any suitably constrained, protocol-governed video domain characterized by feature noise or detector error.
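As an illustration of the court-detection step mentioned in the abstract, the following is a minimal sketch of Hough-transform line detection with OpenCV. It is not the authors' implementation: the input file name, edge thresholds, and angle filter are assumptions made purely for the example.

```python
# Illustrative sketch only: a generic probabilistic Hough-transform line
# detector of the kind the abstract alludes to for court detection.
# File name, thresholds, and angle tolerances are assumptions.
import cv2
import numpy as np

frame = cv2.imread("frame.png")                       # hypothetical input frame
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
edges = cv2.Canny(gray, 50, 150)                      # edge map for the Hough stage

# Probabilistic Hough transform: returns line segments as (x1, y1, x2, y2)
lines = cv2.HoughLinesP(edges, rho=1, theta=np.pi / 180,
                        threshold=80, minLineLength=100, maxLineGap=10)

# Keep near-horizontal and near-vertical segments as court-line candidates
candidates = []
if lines is not None:
    for x1, y1, x2, y2 in lines[:, 0]:
        angle = abs(np.degrees(np.arctan2(y2 - y1, x2 - x1)))
        if angle < 10 or abs(angle - 90) < 10:
            candidates.append((x1, y1, x2, y2))
```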
Sequence-aware multimodal page classification of Brazilian legal documents
The Brazilian Supreme Court receives tens of thousands of cases each
semester. Court employees spend thousands of hours to execute the initial
analysis and classification of those cases -- which takes effort away from
posterior, more complex stages of the case management workflow. In this paper,
we explore multimodal classification of documents from Brazil's Supreme Court.
We train and evaluate our methods on a novel multimodal dataset of 6,510
lawsuits (339,478 pages) with manual annotation assigning each page to one of
six classes. Each lawsuit is an ordered sequence of pages, which are stored
both as an image and as a corresponding text extracted through optical
character recognition. We first train two unimodal classifiers: a ResNet
pre-trained on ImageNet is fine-tuned on the images, and a convolutional
network with filters of multiple kernel sizes is trained from scratch on
document texts. We use them as extractors of visual and textual features, which
are then combined through our proposed Fusion Module. Our Fusion Module can
handle missing textual or visual input by using learned embeddings for missing
data. Moreover, we experiment with bi-directional Long Short-Term Memory
(biLSTM) networks and linear-chain conditional random fields to model the
sequential nature of the pages. The multimodal approaches outperform both
textual and visual classifiers, especially when leveraging the sequential
nature of the pages.

Comment: 11 pages, 6 figures. This preprint, which was originally written on 8 April 2021, has not undergone peer review or any post-submission improvements or corrections. The Version of Record of this article is published in the International Journal on Document Analysis and Recognition, and is available online at https://doi.org/10.1007/s10032-022-00406-7 and https://rdcu.be/cRvv
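The Fusion Module described in the abstract combines visual and textual feature vectors and substitutes learned embeddings when a modality is missing. The PyTorch sketch below is a hypothetical rendering of that idea, not the authors' code; the feature dimensions and the concatenation-plus-MLP design are assumptions made for illustration, with six output classes as stated in the abstract.

```python
# Hypothetical sketch of the fusion idea described in the abstract: concatenate
# visual and textual features per page, substituting a learned embedding
# whenever a modality is missing. Dimensions and layer choices are assumptions.
import torch
import torch.nn as nn

class FusionModule(nn.Module):
    def __init__(self, visual_dim=512, text_dim=256, hidden_dim=512, num_classes=6):
        super().__init__()
        # Learned placeholders used when a page lacks an image or OCR text
        self.missing_visual = nn.Parameter(torch.zeros(visual_dim))
        self.missing_text = nn.Parameter(torch.zeros(text_dim))
        self.classifier = nn.Sequential(
            nn.Linear(visual_dim + text_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, visual_feats, text_feats):
        # visual_feats / text_feats: (batch, dim) tensors, or None when the
        # corresponding modality is absent
        if visual_feats is None:
            visual_feats = self.missing_visual.expand(text_feats.size(0), -1)
        if text_feats is None:
            text_feats = self.missing_text.expand(visual_feats.size(0), -1)
        return self.classifier(torch.cat([visual_feats, text_feats], dim=-1))

# Usage with hypothetical pre-extracted features for a batch of 4 pages
fusion = FusionModule()
logits = fusion(torch.randn(4, 512), None)   # OCR text missing for this batch
```

In the paper, the fused per-page representations are additionally fed to biLSTM or linear-chain CRF layers to exploit page order within a lawsuit; this sketch stops at independent per-page classification.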