30 research outputs found
String representations and distances in deep Convolutional Neural Networks for image classification
Recent advances in image classification mostly rely on the use of powerful local features combined with an adapted image representation. Although Convolutional Neural Network (CNN) features learned from ImageNet have been shown to be generic and very efficient, they still lack the flexibility to account for variations in the spatial layout of visual elements. In this paper, we investigate the use of structural representations on top of pre-trained CNN features to improve image classification. Images are represented as strings of CNN features. Similarities between such representations are computed using two new edit distance variants adapted to the image classification domain. Our algorithms have been implemented and tested on several challenging datasets: 15 Scenes, Caltech 101, Pascal VOC 2007 and MIT Indoor. The results show that our idea of using structural string representations and distances clearly improves the classification performance over standard approaches based on CNN features and a linear-kernel SVM, as well as other recognized methods from the literature.
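As a rough illustration of the string view (not the paper's actual distance variants, which the abstract does not spell out), the sketch below computes a plain Levenshtein-style edit distance between two sequences of CNN descriptors, assuming L2-normalized descriptors, a cosine substitution cost and a fixed gap penalty:

import numpy as np

def feature_edit_distance(s1, s2, gap=1.0):
    """Edit distance between two strings of feature vectors.

    s1, s2: arrays of shape (len, dim), each row an L2-normalized
    CNN descriptor of one image subregion (an assumed setup).
    Substitution cost is the cosine distance between descriptors;
    insertions and deletions cost a fixed gap penalty.
    """
    n, m = len(s1), len(s2)
    D = np.zeros((n + 1, m + 1))
    D[:, 0] = np.arange(n + 1) * gap
    D[0, :] = np.arange(m + 1) * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 1.0 - float(s1[i - 1] @ s2[j - 1])  # cosine distance
            D[i, j] = min(D[i - 1, j] + gap,          # deletion
                          D[i, j - 1] + gap,          # insertion
                          D[i - 1, j - 1] + sub)      # substitution
    return D[n, m]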
Fusion of tf.idf Weighted Bag of Visual Features for Image Classification
Image representation using the bag-of-visual-words approach is commonly used in image classification. Features are extracted from images and clustered into a visual vocabulary. Images can then be represented as a normalized histogram of visual words, similarly to textual documents represented as weighted vectors of terms. As a result, text categorization techniques are applicable to image classification. In this paper, our contribution is twofold. First, we propose a suitable Term Frequency-Inverse Document Frequency (tf.idf) weighting scheme to characterize the importance of visual words. Second, we present a method to fuse different bags-of-words obtained with different vocabularies. We show that using our tf.idf normalization and this fusion leads to better classification rates than other normalization methods, other fusion schemes and other approaches evaluated on the SIMPLIcity collection.
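The abstract does not reproduce the proposed weighting scheme itself; as a hedged baseline, the sketch below applies the standard textual tf.idf formulation to visual-word counts, which the paper's scheme presumably refines:

import numpy as np

def tfidf_bovw(counts):
    """tf.idf weighting of bag-of-visual-words histograms.

    counts: (n_images, vocab_size) matrix of raw visual-word counts.
    This is the standard textual tf.idf; the paper proposes its own
    weighting scheme, which may differ from this formulation.
    """
    tf = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1)
    df = np.count_nonzero(counts, axis=0)              # document frequency
    idf = np.log(counts.shape[0] / np.maximum(df, 1))  # inverse doc. freq.
    w = tf * idf
    norms = np.linalg.norm(w, axis=1, keepdims=True)
    return w / np.maximum(norms, 1e-12)                # L2-normalized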
Approximate Image Matching using Strings of Bag-of-Visual Words Representation
The Spatial Pyramid Matching approach has become very popular for modeling images as sets of local bags-of-words. Images are then compared region by region with an intersection kernel. Despite its success, this model has some limitations: the grid partitioning is predefined and identical for all images, and the matching is sensitive to intra- and inter-class variations. In this paper, we propose a novel approach based on approximate string matching to overcome these limitations and improve the results. First, we introduce a new image representation as strings of ordered bags-of-words. Second, we present a new edit distance specifically adapted to strings of histograms in the context of image comparison. This distance identifies local alignments between subregions and allows sequences of similar subregions to be removed so that two images match better. Experiments on 15 Scenes and Caltech 101 show that the proposed approach outperforms the classical spatial pyramid representation and most competing classification methods presented in recent years.
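A minimal sketch of the "string of ordered bags-of-words" representation, assuming visual-word occurrences given as relative (x, y, word_id) triples and a raster-order grid (the grid size and input format are assumptions); the adapted edit distance then aligns such strings symbol by symbol:

import numpy as np

def image_to_string(word_map, rows=4, cols=4, vocab_size=1000):
    """Turn an image into a string of ordered bag-of-words histograms.

    word_map: list of (x, y, word_id) visual-word occurrences, with
    x, y in [0, 1) relative coordinates (an assumed input format).
    The image is cut into a rows x cols grid read in raster order;
    each cell yields one L1-normalized histogram, i.e. one symbol
    of the string.
    """
    string = np.zeros((rows * cols, vocab_size))
    for x, y, w in word_map:
        r, c = int(y * rows), int(x * cols)
        string[r * cols + c, w] += 1
    sums = string.sum(axis=1, keepdims=True)
    return string / np.maximum(sums, 1)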
Spatial orientations of visual word pairs to improve Bag-of-Visual-Words model
This paper presents a novel approach to incorporating spatial information into the bag-of-visual-words model for category-level and scene classification. In the traditional bag-of-visual-words model, feature vectors are histograms of visual words. This representation is appearance-based and does not contain any information about the arrangement of the visual words in the 2D image space. Within this framework, we present a simple and efficient way to infuse spatial information. In particular, we are interested in explicit global relationships among the spatial positions of visual words. We therefore exploit the orientations of the segments formed by Pairs of Identical visual Words (PIW). A normalized histogram of the angles of PIW, with evenly spaced bins, is computed. The histograms produced by each word type constitute a powerful description of intra-type visual word relationships. Experiments on challenging datasets demonstrate that our method is competitive with competing methods. We also show that our method provides important complementary information to spatial pyramid matching and can improve the overall performance.
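A minimal sketch of the PIW descriptor as the abstract describes it: for each word type, every pair of occurrences defines a segment whose undirected angle is binned into a normalized histogram (the bin count and input format are assumptions):

import numpy as np
from collections import defaultdict
from itertools import combinations

def piw_angle_histograms(word_map, n_bins=8):
    """Histograms of orientations of Pairs of Identical visual Words.

    word_map: list of (x, y, word_id) occurrences (assumed format).
    For each word type, every pair of occurrences defines a segment;
    its undirected angle (mod pi) is binned into a normalized
    histogram. The bin count n_bins is an assumption.
    """
    positions = defaultdict(list)
    for x, y, w in word_map:
        positions[w].append((x, y))
    hists = {}
    for w, pts in positions.items():
        h = np.zeros(n_bins)
        for (x1, y1), (x2, y2) in combinations(pts, 2):
            angle = np.arctan2(y2 - y1, x2 - x1) % np.pi  # undirected
            h[min(int(angle / np.pi * n_bins), n_bins - 1)] += 1
        if h.sum() > 0:
            h /= h.sum()
        hists[w] = h
    return hists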
Combining visual and textual information for multimedia information retrieval
In this article, we present a multimedia document representation model combining textual information and visual descriptors. The text and the image composing a document are each described by a vector of weights following a bag-of-words approach. The model makes it possible to run multimedia queries for information retrieval. Our method is evaluated on the ImageCLEF'08 collection, for which we have the ground truth. Several experiments were conducted with different descriptors and several combinations of modalities. The analysis of the results shows that a multimedia document model improves the performance of a retrieval system based on a single modality, whether textual or visual.
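The abstract does not detail the combination itself; as a minimal sketch of the kind of late fusion such a model enables, the code below combines cosine scores of tf.idf-weighted bag-of-words vectors for the two modalities, with an arbitrary mixing weight (alpha is an assumption, not a value from the paper):

import numpy as np

def multimedia_score(q_text, d_text, q_vis, d_vis, alpha=0.5):
    """Score of a multimedia document against a multimedia query.

    q_text/d_text and q_vis/d_vis are tf.idf-weighted bag-of-words
    vectors for the textual and visual modalities. Cosine scores of
    the two modalities are linearly combined; alpha is an assumed
    mixing weight.
    """
    def cosine(a, b):
        na, nb = np.linalg.norm(a), np.linalg.norm(b)
        return float(a @ b) / (na * nb) if na > 0 and nb > 0 else 0.0
    return alpha * cosine(q_text, d_text) + (1 - alpha) * cosine(q_vis, d_vis)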
Scheimpflug Self-Calibration Based on Tangency Points
Stereoscopic particle image velocimetry (SPIV) self-calibration strongly depends on the accuracy with which the projections of the control points are detected. A new family of control points and an image detection algorithm are proposed to overcome the bias associated with using dot centers as control points in SPIV self-calibration.
Fisher Linear Discriminant Analysis for Text-Image Combination in Multimedia Information Retrieval
In multimedia information retrieval, combining different modalities (text, image, audio or video) provides additional information and generally improves the overall system performance. For this purpose, the linear combination method is simple, flexible and effective. However, it requires choosing the weight assigned to each modality; this issue is still an open problem and is addressed in this paper. Our approach, based on Fisher Linear Discriminant Analysis, learns these weights for multimedia documents composed of text and images. Both text and images are represented with the classical bag-of-words model. Our method was tested on the ImageCLEF 2008 and 2009 datasets. The results demonstrate that our combination approach not only outperforms the single textual modality but also learns nearly optimal weights with an efficient computation. Moreover, the method can combine more than two modalities without increasing the complexity, and thus the computing time.
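A minimal sketch of weight learning with two-class Fisher LDA over per-modality scores, in the spirit the abstract suggests; the score layout, regularization and normalization are assumptions:

import numpy as np

def fisher_weights(scores_rel, scores_irr):
    """Learn linear-combination weights with Fisher's discriminant.

    scores_rel: (n_rel, n_modalities) per-modality scores of documents
    judged relevant; scores_irr: same for non-relevant documents
    (n_modalities >= 2, e.g. text and image). Returns the direction w
    maximizing between-class over within-class scatter:
    w = Sw^{-1} (mu_rel - mu_irr).
    """
    mu1, mu0 = scores_rel.mean(axis=0), scores_irr.mean(axis=0)
    Sw = np.cov(scores_rel, rowvar=False) * (len(scores_rel) - 1) \
       + np.cov(scores_irr, rowvar=False) * (len(scores_irr) - 1)
    # Small ridge term keeps the solve stable if Sw is near-singular.
    w = np.linalg.solve(Sw + 1e-9 * np.eye(len(mu1)), mu1 - mu0)
    return w / np.abs(w).sum()  # absolute weights sum to 1

The combined score of a document is then the dot product of w with its per-modality score vector, which reduces to the usual weighted linear combination.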
Global Bilateral Symmetry Detection Using Multiscale Mirror Histograms
In recent years, there has been renewed interest in bilateral symmetry detection in images, i.e., detecting the main bilateral symmetry axis inside artificial or natural images. State-of-the-art methods combine feature point detection, pairwise comparison and voting in a Hough-like space. In spite of their good performance, they fail to give reliable results on challenging real-world and artistic images. In this paper, we propose a novel symmetry detection method using multiscale edge features combined with local orientation histograms. An experimental evaluation is conducted on public datasets plus a new aesthetics-oriented dataset. The results show that our approach outperforms all competing methods.
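For context, the sketch below implements the classical pairwise-voting baseline the abstract describes, not the paper's multiscale mirror-histogram method: each feature pair votes for its perpendicular bisector as a candidate axis in a Hough-like (theta, rho) space (coordinate conventions are assumptions):

import numpy as np

def vote_symmetry_axes(points, n_theta=180, n_rho=200, diag=np.sqrt(2.0)):
    """Hough-style voting for a global bilateral symmetry axis.

    points: (n, 2) feature locations, assumed in [0, 1]^2. Each pair
    votes for its perpendicular bisector, parameterized as the line
    {q : q . (cos t, sin t) = rho}. Returns the (theta, rho) of the
    most-voted axis.
    """
    acc = np.zeros((n_theta, n_rho))
    n = len(points)
    for i in range(n):
        for j in range(i + 1, n):
            mid = (points[i] + points[j]) / 2.0
            d = points[j] - points[i]
            theta = np.arctan2(d[1], d[0]) % np.pi  # axis normal angle
            rho = mid @ np.array([np.cos(theta), np.sin(theta)])
            ti = min(int(theta / np.pi * n_theta), n_theta - 1)
            ri = max(0, min(int((rho / diag + 1) / 2 * n_rho), n_rho - 1))
            acc[ti, ri] += 1.0
    ti, ri = np.unravel_index(acc.argmax(), acc.shape)
    return ti * np.pi / n_theta, (2 * ri / n_rho - 1) * diag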
Gray-level pattern matching by morphological probing
In this communication, we present two new transforms for pattern matching in gray-level images. They are based on the principle of mechanical probing and are defined within the framework of mathematical morphology. The first transform, named the SOMP transform (Single Object Matching using Probing), locates all instances of a single pattern in an image. It has all the properties of a metric and consequently returns a similarity measure between the image and the searched model. Further properties relating to noise and computation time are also discussed. The second transform, called the MOMP transform (Multiple Objects Matching using Probing), can locate all occurrences of several patterns of different shapes. It is particularly well suited to detecting objects of different sizes or corrupted by noise. Results are presented for both transforms.
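The exact morphological definition of SOMP is not reproduced in this abstract; as a toy stand-in, the sketch below slides a pattern over a gray-level image and records the sup-norm deviation at each position, a cost which, like SOMP, is a metric between window and model:

import numpy as np

def probe_match(image, pattern):
    """Gray-level pattern matching by probing, sup-norm variant.

    Slides the pattern over the image and, at each position, records
    the largest gray-level deviation between the probed window and
    the pattern. The result is 0 where the pattern fits exactly and
    grows with the mismatch; local minima locate pattern instances.
    This is a stand-in, not the paper's SOMP transform.
    """
    H, W = image.shape
    h, w = pattern.shape
    out = np.empty((H - h + 1, W - w + 1))
    for i in range(H - h + 1):
        for j in range(W - w + 1):
            window = image[i:i + h, j:j + w].astype(float)
            out[i, j] = np.abs(window - pattern).max()  # Chebyshev metric
    return out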
Combining text/image in WikipediaMM task 2009
This paper reports the multimedia information retrieval experiments we carried out for the ImageCLEF 2009 track. In 2008, we proposed a multimedia document model defined as a vector of textual and visual terms weighted using a tf.idf approach [5]. For our second participation, our goal was to improve on this previous model in the following ways: 1) use of additional information for the textual part (caption and text surrounding the image, extracted from the original documents); 2) use of different image detectors and descriptors; 3) a new text/image combination approach. The results allow us to evaluate the benefits of these different improvements.