As a step towards finding a middle ground between manual annotation of images and fully automatic approaches that rely only on visual properties of the image, researchers have proposed systems that leverage human speech for annotating images. The efficacy of such systems depends on the accuracy of the speech recognizer. Alternatively, to reduce the effect of recognition errors, some systems store the speech recognizer's alternatives (N-best lists) for every utterance as part of the annotation for the image. These lists are later used for query expansion during retrieval. However, such annotation systems do not use these N-best lists to improve interpretation of the recognizer's output. In this work, we show how semantic knowledge, acquired through an understanding of relationships between image tags, can be used to improve interpretation of a speech recognizer's output in the context of image annotation. We pose the problem of using speech for annotation as one of disambiguating between alternatives across the several N-best lists associated with the tags spoken for an image. This approach eliminates the need to store all the items across such lists as part of the annotation. Towards this end, we propose different models that leverage semantics derived from a heterogeneous corpus of image annotations. We capture semantics in the form of correlations between annotations, as well as through probabilistic generative models based on the assumption that tags for an image are generated from semantic themes and that words in the same theme tend to have a semantic affinity for each other. We evaluate our models in the context of image tags that are popularly used common/proper nouns, as made publicly available through an online photo sharing service.
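The core disambiguation idea can be sketched as follows: choose one candidate from each tag's N-best list such that the chosen tags are maximally semantically related. This is a minimal illustration only, assuming precomputed pairwise co-occurrence scores; the function and variable names (`disambiguate`, `cooccur`) are hypothetical, and the paper's correlation and generative models are considerably richer than this toy scoring.

```python
from itertools import product

def disambiguate(nbest_lists, cooccur):
    """Pick one candidate from each N-best list so that the selected
    tags have the highest total pairwise semantic affinity.
    `cooccur` maps unordered tag pairs to a co-occurrence score
    learned from an annotation corpus (hypothetical data here)."""
    def score(combo):
        # Sum affinity over all pairs of chosen tags; unseen pairs score 0.
        return sum(cooccur.get((a, b), cooccur.get((b, a), 0))
                   for i, a in enumerate(combo)
                   for b in combo[i + 1:])
    # Exhaustive search over one choice per list (fine for short lists).
    return max(product(*nbest_lists), key=score)

# Toy example: two spoken tags, each with acoustically confusable alternatives.
nbest = [["beach", "peach"], ["see", "sea"]]
scores = {("beach", "sea"): 5, ("peach", "see"): 1}
print(disambiguate(nbest, scores))  # ('beach', 'sea')
```

In this sketch, "beach" and "sea" win because they co-occur often in the corpus, even though each competes with an acoustically similar alternative; only the winning pair, not the full N-best lists, would then be stored as the annotation.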