Towards Visual Syntactical Understanding
Syntax is usually studied in the realm of linguistics and refers to the
arrangement of words in a sentence. Similarly, an image can be considered as a
visual 'sentence', with the semantic parts of the image acting as 'words'.
While visual syntactic understanding occurs naturally to humans, it is
interesting to explore whether deep neural networks (DNNs) are equipped with
such reasoning. To that end, we alter the syntax of natural images (e.g.,
swapping the eye and nose of a face) to create 'incorrect' images, which we use to
investigate the sensitivity of DNNs to such syntactic anomalies. Our
experiments reveal an intriguing property of DNNs:
state-of-the-art convolutional neural networks, as well as vision transformers,
fail to discriminate between syntactically correct and incorrect images when
trained on only correct ones. To counter this issue and enable visual syntactic
understanding with DNNs, we propose a three-stage framework: (i) the 'words'
(or the sub-features) in the image are detected, (ii) the detected words are
sequentially masked and reconstructed using an autoencoder, (iii) the original
and reconstructed parts are compared at each location to determine syntactic
correctness. The reconstruction module is trained with BERT-like masked
autoencoding for images, with the motivation to leverage language-model-inspired
training to better capture the syntax. Note that our proposed approach is
unsupervised in the sense that the incorrect images are only used during
testing and the correct versus incorrect labels are never used for training. We
perform experiments on the CelebA and AFHQ datasets and obtain classification
accuracies of 92.10% and 90.89%, respectively. Notably, the approach generalizes
well to ImageNet samples that share classes with CelebA and AFHQ,
without explicitly training on them.
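
The three-stage framework described in the abstract (detect the 'words', mask and reconstruct each one, compare original and reconstruction) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and does not come from the paper: the part detector is replaced by precomputed boxes, the masked autoencoder is replaced by a toy reconstructor that pastes pixels from a fixed template, and the per-part mean squared error with a fixed threshold is only one possible way to compare the parts.

```python
from typing import Callable, List, Sequence, Tuple

Image = List[List[float]]        # grayscale image as rows of pixel values
Box = Tuple[int, int, int, int]  # (top, left, height, width) of one 'word'
Reconstructor = Callable[[Image, Box], Image]


def mask_box(image: Image, box: Box, fill: float = 0.0) -> Image:
    """Return a copy of the image with the pixels inside `box` set to `fill`."""
    top, left, h, w = box
    masked = [row[:] for row in image]
    for r in range(top, top + h):
        for c in range(left, left + w):
            masked[r][c] = fill
    return masked


def box_error(a: Image, b: Image, box: Box) -> float:
    """Mean squared error between two images, restricted to one box."""
    top, left, h, w = box
    total = 0.0
    for r in range(top, top + h):
        for c in range(left, left + w):
            total += (a[r][c] - b[r][c]) ** 2
    return total / (h * w)


def make_template_reconstructor(template: Image) -> Reconstructor:
    """Toy stand-in for a trained masked autoencoder (hypothetical helper).

    It fills the masked box with the pixels of a fixed template image.
    """
    def reconstruct(masked: Image, box: Box) -> Image:
        top, left, h, w = box
        out = [row[:] for row in masked]
        for r in range(top, top + h):
            for c in range(left, left + w):
                out[r][c] = template[r][c]
        return out
    return reconstruct


def is_syntactically_correct(image: Image,
                             boxes: Sequence[Box],
                             reconstruct: Reconstructor,
                             threshold: float) -> bool:
    """Mask each part in turn, reconstruct it, and compare with the original.

    The image is flagged as incorrect as soon as one part's reconstruction
    error exceeds `threshold` (an assumed decision rule).
    """
    for box in boxes:
        recon = reconstruct(mask_box(image, box), box)
        if box_error(image, recon, box) > threshold:
            return False
    return True
```

For example, with a 4x4 template whose left half is dark and right half is bright, an image identical to the template passes the check, while an image with the two halves swapped is flagged as incorrect. A realistic version would replace the boxes with the output of a part detector and the template reconstructor with a model trained on correct images only.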
Application of stochastic grammars to understanding action
Thesis (M.S.)--Massachusetts Institute of Technology, Program in Media Arts & Sciences, 1998. By Yuri A. Ivanov. Includes bibliographical references (leaves 69-72).