HIERARCHICAL LEARNING OF DISCRIMINATIVE FEATURES AND CLASSIFIERS FOR LARGE-SCALE VISUAL RECOGNITION

Abstract

Enabling computers to recognize objects present in images has been a long standing but tremendously challenging problem in the field of computer vision for decades. Beyond the difficulties resulting from huge appearance variations, large-scale visual recognition poses unprecedented challenges when the number of visual categories being considered becomes thousands, and the amount of images increases to millions. This dissertation contributes to addressing a number of the challenging issues in large-scale visual recognition. First, we develop an automatic image-text alignment method to collect massive amounts of labeled images from the Web for training visual concept classifiers. Specif- ically, we first crawl a large number of cross-media Web pages containing Web images and their auxiliary texts, and then segment them into a collection of image-text pairs. We then show that near-duplicate image clustering according to visual similarity can significantly reduce the uncertainty on the relatedness of Web images’ semantics to their auxiliary text terms or phrases. Finally, we empirically demonstrate that ran- dom walk over a newly proposed phrase correlation network can help to achieve more precise image-text alignment by refining the relevance scores between Web images and their auxiliary text terms. Second, we propose a visual tree model to reduce the computational complexity of a large-scale visual recognition system by hierarchically organizing and learning the classifiers for a large number of visual categories in a tree structure. Compared to previous tree models, such as the label tree, our visual tree model does not require training a huge amount of classifiers in advance which is computationally expensive. However, we experimentally show that the proposed visual tree achieves results that are comparable or even better to other tree models in terms of recognition accuracy and efficiency. Third, we present a joint dictionary learning (JDL) algorithm which exploits the inter-category visual correlations to learn more discriminative dictionaries for image content representation. Given a group of visually correlated categories, JDL simul- taneously learns one common dictionary and multiple category-specific dictionaries to explicitly separate the shared visual atoms from the category-specific ones. We accordingly develop three classification schemes to make full use of the dictionaries learned by JDL for visual content representation in the task of image categoriza- tion. Experiments on two image data sets which respectively contain 17 and 1,000 categories demonstrate the effectiveness of the proposed algorithm. In the last part of the dissertation, we develop a novel data-driven algorithm to quantitatively characterize the semantic gaps of different visual concepts for learning complexity estimation and inference model selection. The semantic gaps are estimated directly in the visual feature space since the visual feature space is the common space for concept classifier training and automatic concept detection. We show that the quantitative characterization of the semantic gaps helps to automatically select more effective inference models for classifier training, which further improves the recognition accuracy rates

    Similar works