
    Easy Transfer Learning By Exploiting Intra-domain Structures

    Transfer learning aims at transferring knowledge from a well-labeled domain to a similar but different domain with limited or no labels. Unfortunately, existing learning-based methods often involve intensive model selection and hyperparameter tuning to obtain good results. Moreover, cross-validation is not possible for tuning hyperparameters, since there are often no labels in the target domain. This restricts the wide applicability of transfer learning, especially on computationally constrained devices such as wearables. In this paper, we propose Easy Transfer Learning (EasyTL), a practical approach that requires neither model selection nor hyperparameter tuning while achieving competitive performance. By exploiting intra-domain structures, EasyTL is able to learn both non-parametric transfer features and classifiers. Extensive experiments demonstrate that, compared to state-of-the-art traditional and deep methods, EasyTL satisfies Occam's Razor: it is extremely easy to implement and use while achieving comparable or better classification accuracy and much better computational efficiency. Additionally, it is shown that EasyTL can increase the performance of existing transfer feature learning methods.
    Comment: Camera-ready version of IEEE International Conference on Multimedia and Expo (ICME) 2019; code available at http://transferlearning.xyz/code/traditional/EasyT
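
    As a rough illustration of the non-parametric, tuning-free flavor of such an approach (an assumption-laden sketch, not the authors' intra-domain-structure formulation), each target sample can be classified by its distance to the source class centroids, with no hyperparameters to select:

```python
# Hypothetical sketch: tuning-free, non-parametric transfer by nearest
# source-class centroid. Not the full EasyTL algorithm.
import numpy as np

def centroid_transfer_predict(Xs, ys, Xt):
    """Xs: (n_s, d) labeled source features; ys: (n_s,) source labels;
    Xt: (n_t, d) unlabeled target features."""
    classes = np.unique(ys)
    centroids = np.stack([Xs[ys == c].mean(axis=0) for c in classes])
    # Squared Euclidean distance from every target sample to every centroid.
    d2 = ((Xt[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return classes[d2.argmin(axis=1)]
```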

    Snap and Find: Deep Discrete Cross-domain Garment Image Retrieval

    With the increasing number of online stores, there is a pressing need for intelligent search systems that understand item photos snapped by customers and search large-scale product databases to find the desired items. However, it is challenging for conventional retrieval systems to match up item photos captured by customers with the ones officially released by stores, especially for garment images. To bridge customer- and store-provided garment photos, existing studies have widely exploited clothing attributes (e.g., black) and landmarks (e.g., collar) to learn a common embedding space for garment representations. Unfortunately, they omit the sequential correlation of attributes and consume a large amount of human labor to label the landmarks. In this paper, we propose a deep multi-task cross-domain hashing method, termed DMCH, in which cross-domain embedding and sequential attribute learning are modeled simultaneously. Sequential attribute learning not only provides semantic guidance for the embedding, but also generates rich attention on discriminative local details (e.g., black buttons) of clothing items without requiring extra landmark labels. This leads to promising performance and a 306× boost in efficiency compared with state-of-the-art models, as demonstrated through rigorous experiments on two public fashion datasets.
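
    As a hedged illustration of the multi-task idea (the backbone, layer sizes, and decoder conditioning are assumptions, not the DMCH architecture), a shared image representation can feed both a tanh-relaxed hashing head and an LSTM that predicts the attribute sequence:

```python
# Sketch of joint hashing + sequential attribute learning (illustrative only).
import torch
import torch.nn as nn

class MultiTaskHashNet(nn.Module):
    def __init__(self, feat_dim=512, hash_bits=64, n_attrs=100, emb_dim=128):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(2048, feat_dim), nn.ReLU())
        self.hash_head = nn.Linear(feat_dim, hash_bits)   # tanh-relaxed binary codes
        self.attr_emb = nn.Embedding(n_attrs, emb_dim)
        self.attr_rnn = nn.LSTM(emb_dim, feat_dim, batch_first=True)
        self.attr_out = nn.Linear(feat_dim, n_attrs)

    def forward(self, img_feat, attr_seq):
        h = self.backbone(img_feat)                       # shared representation
        codes = torch.tanh(self.hash_head(h))             # relax sign() for training
        # Condition the sequential attribute decoder on the image representation.
        h0 = h.unsqueeze(0)
        out, _ = self.attr_rnn(self.attr_emb(attr_seq), (h0, torch.zeros_like(h0)))
        return codes, self.attr_out(out)                  # per-step attribute logits
```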

    Fine-grained Apparel Classification and Retrieval without rich annotations

    The ability to correctly classify and retrieve apparel images has a variety of applications important to e-commerce, online advertising and internet search. In this work, we propose a robust framework for fine-grained apparel classification and for in-shop and cross-domain retrieval, which eliminates the need for rich annotations such as bounding boxes, human joints, or clothing landmarks, as well as the training of a dedicated bounding-box or key-landmark detector. Factors such as subtle appearance differences, variations in human poses, different shooting angles, apparel deformations, and self-occlusion add to the challenges of classifying and retrieving apparel items. Cross-domain retrieval is even harder due to the large variation between online shopping images, usually taken in ideal lighting, pose and angle against a clean background, and street photos captured by users in complicated conditions with poor lighting and cluttered scenes. Our framework uses a compact bilinear CNN with the tensor sketch algorithm to generate embeddings that capture local pairwise feature interactions in a translationally invariant manner. For apparel classification, we pass the feature embeddings through a softmax classifier, while the in-shop and cross-domain retrieval pipelines use a triplet-loss based optimization approach, such that the squared Euclidean distance between embeddings measures the dissimilarity between images. Unlike previous works that relied on bounding-box, key-clothing-landmark or human-joint detectors to assist the final deep classifier, the proposed framework can be trained directly on the provided category labels, or on generated triplets for triplet-loss optimization. Lastly, experimental results on the DeepFashion fine-grained categorization, in-shop, and consumer-to-shop retrieval datasets provide a comparative analysis with previous work in the domain.
    Comment: 14 pages, 6 figures, 3 tables. Submitted to the Springer Journal of Applied Intelligence
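
    The two distinctive ingredients named above, compact bilinear pooling via the tensor sketch algorithm and a triplet loss over squared Euclidean distances, can be sketched as follows (dimensions, hash vectors, and the margin are assumptions for illustration):

```python
# Compact bilinear pooling (tensor sketch) and a squared-Euclidean triplet loss.
import torch
import torch.nn.functional as F

def count_sketch(x, h, s, d):
    # Project x (B, c) to d dims with random hash indices h (c,)
    # and random signs s (c,) taking values in {-1.0, +1.0}.
    out = x.new_zeros(x.size(0), d)
    out.index_add_(1, h, x * s)
    return out

def tensor_sketch(x, h1, s1, h2, s2, d):
    # Bilinear pooling of x with itself, approximated by FFT convolution
    # of two count sketches instead of the explicit outer product.
    p1 = torch.fft.rfft(count_sketch(x, h1, s1, d))
    p2 = torch.fft.rfft(count_sketch(x, h2, s2, d))
    return torch.fft.irfft(p1 * p2, n=d)

def triplet_loss(anchor, positive, negative, margin=0.3):
    d_pos = (anchor - positive).pow(2).sum(dim=1)   # squared distance, same item
    d_neg = (anchor - negative).pow(2).sum(dim=1)   # squared distance, different item
    return F.relu(d_pos - d_neg + margin).mean()
```

    The random hashes h1, h2 (indices in [0, d)) and signs s1, s2 would be drawn once, e.g. via torch.randint, and kept fixed throughout training.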

    Modeling Text with Graph Convolutional Network for Cross-Modal Information Retrieval

    Cross-modal information retrieval aims to find heterogeneous data of various modalities from a query of one modality. The main challenge is to map different modalities into a common semantic space in which the distance between concepts from different modalities can be modeled well. For cross-modal retrieval between images and texts, existing work mostly uses off-the-shelf Convolutional Neural Networks (CNNs) for image feature extraction. For texts, word-level features such as bag-of-words or word2vec are employed to build deep learning models. Beyond word-level semantics, the semantic relations between words are also informative but less explored. In this paper, we model texts as graphs using a similarity measure based on word2vec. A dual-path neural network is proposed for coupled feature learning in cross-modal information retrieval. One path uses a Graph Convolutional Network (GCN) to model texts from their graph representations. The other path uses a neural network with layers of nonlinearities to model images from off-the-shelf features. The model is trained with a pairwise similarity loss that maximizes the similarity of relevant text-image pairs and minimizes the similarity of irrelevant pairs. Experimental results show that the proposed model significantly outperforms state-of-the-art methods, with a 17% accuracy improvement in the best case.
    Comment: 7 pages, 11 figures
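
    A minimal sketch of the two key pieces, a graph-convolution layer for the text path and the pairwise similarity loss, might look like the following (the normalized adjacency, cosine similarity, and margin are assumptions):

```python
# One GCN layer and a pairwise similarity loss (illustrative sketch).
import torch
import torch.nn as nn
import torch.nn.functional as F

class GCNLayer(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.lin = nn.Linear(in_dim, out_dim)

    def forward(self, H, A_hat):
        # H: (n_nodes, in_dim) word features; A_hat: normalized adjacency
        # built from word2vec similarities, with self-loops.
        return torch.relu(self.lin(A_hat @ H))

def pairwise_similarity_loss(text_emb, img_emb, match, margin=0.2):
    """match[i] = 1 if (text_emb[i], img_emb[i]) is a relevant pair, else 0."""
    sim = F.cosine_similarity(text_emb, img_emb, dim=1)
    pos = (1.0 - sim) * match                    # pull relevant pairs together
    neg = F.relu(sim - margin) * (1.0 - match)   # push irrelevant pairs apart
    return (pos + neg).mean()
```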

    Cross Domain Knowledge Learning with Dual-branch Adversarial Network for Vehicle Re-identification

    The widespread popularization of vehicles has made everyday life more convenient over the last decades, but the resulting large number of vehicles poses the critical and challenging problem of vehicle re-identification (reID). To date, for most vehicle reID algorithms, both training and testing are conducted on the same annotated dataset under supervision. However, even a well-trained model suffers a drastic performance drop due to the severe domain bias between the training dataset and real-world scenes. To address this problem, this paper proposes a domain adaptation framework for vehicle reID (DAVR), which narrows the cross-domain bias by fully exploiting labeled data from the source domain to adapt to the target domain. DAVR develops an image-to-image translation network, named Dual-branch Adversarial Network (DAN), which translates images from the well-labeled source domain into the style of the unlabeled target domain without any annotation, while preserving identity information from the source domain. The generated images are then used to train the vehicle reID model through a proposed attention-based feature learning model. Through the proposed framework, the trained reID model adapts better to the various scenes encountered in real-world situations. Comprehensive experimental results demonstrate that DAVR achieves excellent performance on both the VehicleID and VeRi-776 datasets.
    Comment: arXiv admin note: substantial text overlap with arXiv:1903.0786
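
    A hedged sketch of the generator objective implied by this description, adversarial style transfer plus identity preservation, is given below (the loss form and weight are assumptions, not the actual DAN losses):

```python
# Generator loss: make translated images look target-styled while a reID
# feature extractor keeps the vehicle identity stable (illustrative sketch).
import torch
import torch.nn.functional as F

def generator_loss(G, D, id_net, src_img, lam_id=5.0):
    fake_tgt = G(src_img)              # source image rendered in target style
    logits = D(fake_tgt)
    adv = F.binary_cross_entropy_with_logits(logits, torch.ones_like(logits))
    ident = F.l1_loss(id_net(fake_tgt), id_net(src_img))  # identity preservation
    return adv + lam_id * ident
```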

    Cross-media Similarity Metric Learning with Unified Deep Networks

    As a prominent research topic in the multimedia area, cross-media retrieval aims to capture the complex correlations among multiple media types. Learning a better shared representation and distance metric for multimedia data is important for boosting cross-media retrieval. Motivated by the strong ability of deep neural networks to learn feature representations and comparison functions, we propose the Unified Network for Cross-media Similarity Metric (UNCSM), which associates cross-media shared representation learning with distance metric learning in a unified framework. First, we design a two-pathway deep network pretrained with a contrastive loss, and employ a double triplet similarity loss for fine-tuning, to learn the shared representation for each media type by modeling relative semantic similarity. Second, a metric network is designed to effectively calculate the cross-media similarity of the shared representations by modeling pairwise similar and dissimilar constraints. Compared to existing methods, which mostly ignore the dissimilar constraints and separately use a simple distance metric such as the Euclidean distance, our UNCSM approach unifies representation learning and the distance metric to preserve relative similarity and to embrace more complex similarity functions, further improving cross-media retrieval accuracy. Experimental results show that UNCSM outperforms 8 state-of-the-art methods on 4 widely used cross-media datasets.
    Comment: 19 pages. Submitted to Multimedia Tools and Applications
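
    A small sketch of the learned-metric idea follows (sizes and the binary pair objective are assumptions): rather than a fixed Euclidean distance, an MLP scores a pair of shared representations and is trained with similar/dissimilar pair constraints.

```python
# Learned cross-media similarity: an MLP over a pair of shared representations.
import torch
import torch.nn as nn

class MetricNet(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(),
            nn.Linear(dim, 1))            # similarity logit for the pair

    def forward(self, a, b):
        return self.net(torch.cat([a, b], dim=1)).squeeze(1)

# Trained with pairwise similar (1) / dissimilar (0) constraints, e.g.:
# loss = nn.BCEWithLogitsLoss()(metric(img_repr, txt_repr), pair_labels)
```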

    PM-GANs: Discriminative Representation Learning for Action Recognition Using Partial-modalities

    Data of different modalities generally convey complementary but heterogeneous information, and a more discriminative representation is often obtained by combining multiple data modalities such as RGB and infrared features. In reality, however, obtaining both data channels is challenging due to many limitations. For example, RGB surveillance cameras are often restricted from private spaces, which conflicts with the need for abnormal activity detection for personal security. Using partial data channels to build a full multi-modal representation is therefore clearly desirable. In this paper, we propose novel Partial-modal Generative Adversarial Networks (PM-GANs) that learn a full-modal representation using data from only partial modalities. The full representation is achieved by a generated representation that stands in for the missing data channel. Extensive experiments verify the performance of the proposed method on action recognition against four state-of-the-art methods. In addition, a new Infrared-Visible Dataset for action recognition is introduced; it will be the first publicly available action dataset containing paired infrared and visible-spectrum data.
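
    The core idea, generating a stand-in representation for the missing channel and training it adversarially, could be sketched as follows (feature dimensions and the simple MLP generator/discriminator are assumptions, not the PM-GANs architecture):

```python
# One adversarial step: visible features -> generated infrared features.
import torch
import torch.nn as nn
import torch.nn.functional as F

gen = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512))
disc = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 1))

def pm_gan_step(vis_feat, ir_feat):
    fake_ir = gen(vis_feat)            # generated stand-in for the missing channel
    real_t = torch.ones(ir_feat.size(0), 1)
    fake_t = torch.zeros(vis_feat.size(0), 1)
    d_loss = F.binary_cross_entropy_with_logits(disc(ir_feat), real_t) \
           + F.binary_cross_entropy_with_logits(disc(fake_ir.detach()), fake_t)
    g_loss = F.binary_cross_entropy_with_logits(disc(fake_ir), real_t)
    # The "full" representation would concatenate vis_feat with fake_ir.
    return d_loss, g_loss
```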

    Improving noise robustness of automatic speech recognition via parallel data and teacher-student learning

    For real-world speech recognition applications, noise robustness remains a challenge. In this work, we adopt the teacher-student (T/S) learning technique, using a parallel clean and noisy corpus, to improve automatic speech recognition (ASR) performance under multimedia noise. On top of that, we apply a logits selection method that preserves only the k highest values, to prevent wrong emphasis of knowledge from the teacher and to reduce the bandwidth needed for transferring data. We incorporate up to 8,000 hours of untranscribed data for training and present results for sequence-trained models in addition to cross-entropy trained ones. The best sequence-trained student model yields relative word error rate (WER) reductions of approximately 10.1%, 28.7% and 19.6% on our clean, simulated-noisy and real test sets, respectively, compared to a sequence-trained teacher.
    Comment: To appear in ICASSP 2019
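
    The logits selection step described above can be sketched directly (k and the renormalization over the kept classes are assumptions consistent with the text):

```python
# Top-k teacher-student distillation: keep only the teacher's k highest
# logits, renormalize them, and train the student on those soft targets.
import torch
import torch.nn.functional as F

def topk_distill_loss(student_logits, teacher_logits, k=20):
    top_vals, top_idx = teacher_logits.topk(k, dim=-1)        # k highest values
    soft_targets = F.softmax(top_vals, dim=-1)                # renormalize over k
    student_logp = F.log_softmax(student_logits, dim=-1).gather(-1, top_idx)
    return -(soft_targets * student_logp).sum(dim=-1).mean()  # CE on kept classes
```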

    Current Challenges and Visions in Music Recommender Systems Research

    Music recommender systems (MRS) have experienced a boom in recent years, thanks to the emergence and success of online streaming services, which nowadays make almost all of the world's music available at the user's fingertips. While today's MRS considerably help users find interesting music in these huge catalogs, MRS research still faces substantial challenges. In particular, when it comes to building, incorporating, and evaluating recommendation strategies that integrate information beyond simple user-item interactions or content-based descriptors, and instead dig deep into the very essence of listener needs, preferences, and intentions, MRS research becomes a major endeavor, and related publications are quite sparse. The purpose of this trends-and-survey article is twofold. We first identify and shed light on what we believe are the most pressing challenges MRS research is facing, from both academic and industry perspectives. We review the state of the art towards solving these challenges and discuss its limitations. Second, we detail possible future directions and visions we contemplate for the further evolution of the field. The article should therefore serve two purposes: giving the interested reader an overview of current challenges in MRS research, and providing guidance for young researchers by identifying interesting yet under-researched directions in the field.

    Cross-Domain Complementary Learning Using Pose for Multi-Person Part Segmentation

    Supervised deep learning with pixel-wise training labels has achieved great success on multi-person part segmentation. However, data labeling at the pixel level is very expensive. To address this, researchers have explored using synthetic data to avoid manual labeling. Although it is easy to generate labels for synthetic data, the results are much worse than those obtained with real data and manual labeling. The performance degradation is mainly due to the domain gap, i.e., the discrepancy in pixel-value statistics between real and synthetic data. In this paper, we observe that real and synthetic humans both have a skeleton (pose) representation, and we find that skeletons can effectively bridge the synthetic and real domains during training. Our approach takes advantage of the rich and realistic variations of real data and the easily obtainable labels of synthetic data to learn multi-person part segmentation on real images without any human-annotated labels. Experiments show that, without any human labeling, our method performs comparably to several state-of-the-art approaches that require human labeling, on the Pascal-Person-Parts and COCO-DensePose datasets. If part labels are also available for the real images during training, our method outperforms the supervised state-of-the-art methods by a large margin. We further demonstrate the generalizability of our method by predicting novel keypoints in real images where no real labels are available for the novel keypoints. Code and pre-trained models are available at https://github.com/kevinlin311tw/CDCL-human-part-segmentation
    Comment: To appear in IEEE Transactions on Circuits and Systems for Video Technology; presented at the ICCV 2019 Demonstration
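
    A hedged sketch of the training signal described above (heads, tensor shapes, and the loss weight are illustrative assumptions): pose supervision is applied in both domains, which ties their features together, while part-segmentation supervision comes only from the synthetic data.

```python
# One training step mixing synthetic and real batches (illustrative sketch).
import torch.nn.functional as F

def train_step(backbone, pose_head, part_head, syn, real, lam=1.0):
    f_syn, f_real = backbone(syn["img"]), backbone(real["img"])
    # Pose (skeleton) labels exist for both domains and bridge them.
    pose_loss = F.mse_loss(pose_head(f_syn), syn["pose"]) \
              + F.mse_loss(pose_head(f_real), real["pose"])
    # Pixel-wise part labels come only from the synthetic domain.
    seg_loss = F.cross_entropy(part_head(f_syn), syn["parts"])
    return pose_loss + lam * seg_loss
```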