566 research outputs found

    Deep Captioning with Multimodal Recurrent Neural Networks (m-RNN)

    Get PDF
    In this paper, we present a multimodal Recurrent Neural Network (m-RNN) model for generating novel image captions. It directly models the probability distribution of generating a word given previous words and an image. Image captions are generated by sampling from this distribution. The model consists of two sub-networks: a deep recurrent neural network for sentences and a deep convolutional network for images. These two sub-networks interact with each other in a multimodal layer to form the whole m-RNN model. The effectiveness of our model is validated on four benchmark datasets (IAPR TC-12, Flickr 8K, Flickr 30K and MS COCO). Our model outperforms the state-of-the-art methods. In addition, we apply the m-RNN model to retrieval tasks for retrieving images or sentences, and achieves significant performance improvement over the state-of-the-art methods which directly optimize the ranking objective function for retrieval. The project page of this work is: www.stat.ucla.edu/~junhua.mao/m-RNN.html .Comment: Add a simple strategy to boost the performance of image captioning task significantly. More details are shown in Section 8 of the paper. The code and related data are available at https://github.com/mjhucla/mRNN-CR ;. arXiv admin note: substantial text overlap with arXiv:1410.109

    Direction-of-Arrival Estimation Based on Joint Sparsity

    Get PDF
    We present a DOA estimation algorithm, called Joint-Sparse DOA to address the problem of Direction-of-Arrival (DOA) estimation using sensor arrays. Firstly, DOA estimation is cast as the joint-sparse recovery problem. Then, norm is approximated by an arctan function to represent joint sparsity and DOA estimation can be obtained by minimizing the approximate norm. Finally, the minimization problem is solved by a quasi-Newton method to estimate DOA. Simulation results show that our algorithm has some advantages over most existing methods: it needs a small number of snapshots to estimate DOA, while the number of sources need not be known a priori. Besides, it improves the resolution, and it can also handle the coherent sources well

    Deep Captioning with Multimodal Recurrent Neural Networks (m-RNN)

    Get PDF
    In this paper, we present a multimodal Recurrent Neural Network (m-RNN) model for generating novel image captions. It directly models the probability distribution of generating a word given previous words and an image. Image captions are generated according to this distribution. The model consists of two sub-networks: a deep recurrent neural network for sentences and a deep convolutional network for images. These two sub-networks interact with each other in a multimodal layer to form the whole m-RNN model. The effectiveness of our model is validated on four benchmark datasets (IAPR TC-12, Flickr 8K, Flickr 30K and MS COCO). Our model outperforms the state-of-the-art methods. In addition, the m-RNN model can be applied to retrieval tasks for retrieving images or sentences, and achieves significant performance improvement over the state-of-the-art methods which directly optimize the ranking objective function for retrieval.This work was supported by the Center for Brains, Minds and Machines (CBMM), funded by NSF STC award CCF - 1231216

    SVDinsTN: An Integrated Method for Tensor Network Representation with Efficient Structure Search

    Full text link
    Tensor network (TN) representation is a powerful technique for data analysis and machine learning. It practically involves a challenging TN structure search (TN-SS) problem, which aims to search for the optimal structure to achieve a compact representation. Existing TN-SS methods mainly adopt a bi-level optimization method that leads to excessive computational costs due to repeated structure evaluations. To address this issue, we propose an efficient integrated (single-level) method named SVD-inspired TN decomposition (SVDinsTN), eliminating the need for repeated tedious structure evaluation. By inserting a diagonal factor for each edge of the fully-connected TN, we calculate TN cores and diagonal factors simultaneously, with factor sparsity revealing the most compact TN structure. Experimental results on real-world data demonstrate that SVDinsTN achieves approximately 102∼10310^2\sim{}10^3 times acceleration in runtime compared to the existing TN-SS methods while maintaining a comparable level of representation ability
    • …
    corecore