7,069 research outputs found

    Self-Attention Networks for Connectionist Temporal Classification in Speech Recognition

    Full text link
    The success of self-attention in NLP has led to recent applications in end-to-end encoder-decoder architectures for speech recognition. Separately, connectionist temporal classification (CTC) has matured as an alignment-free, non-autoregressive approach to sequence transduction, either by itself or in various multitask and decoding frameworks. We propose SAN-CTC, a deep, fully self-attentional network for CTC, and show it is tractable and competitive for end-to-end speech recognition. SAN-CTC trains quickly and outperforms existing CTC models and most encoder-decoder models, with character error rates (CERs) of 4.7% in 1 day on WSJ eval92 and 2.8% in 1 week on LibriSpeech test-clean, with a fixed architecture and one GPU. Similar improvements hold for WERs after LM decoding. We motivate the architecture for speech, evaluate position and downsampling approaches, and explore how label alphabets (character, phoneme, subword) affect attention heads and performance.Comment: Accepted to ICASSP 201

    AON: Towards Arbitrarily-Oriented Text Recognition

    Full text link
    Recognizing text from natural images is a hot research topic in computer vision due to its various applications. Despite the enduring research of several decades on optical character recognition (OCR), recognizing texts from natural images is still a challenging task. This is because scene texts are often in irregular (e.g. curved, arbitrarily-oriented or seriously distorted) arrangements, which have not yet been well addressed in the literature. Existing methods on text recognition mainly work with regular (horizontal and frontal) texts and cannot be trivially generalized to handle irregular texts. In this paper, we develop the arbitrary orientation network (AON) to directly capture the deep features of irregular texts, which are combined into an attention-based decoder to generate character sequence. The whole network can be trained end-to-end by using only images and word-level annotations. Extensive experiments on various benchmarks, including the CUTE80, SVT-Perspective, IIIT5k, SVT and ICDAR datasets, show that the proposed AON-based method achieves the-state-of-the-art performance in irregular datasets, and is comparable to major existing methods in regular datasets.Comment: Accepted by CVPR201

    Building Machines That Learn and Think Like People

    Get PDF
    Recent progress in artificial intelligence (AI) has renewed interest in building systems that learn and think like people. Many advances have come from using deep neural networks trained end-to-end in tasks such as object recognition, video games, and board games, achieving performance that equals or even beats humans in some respects. Despite their biological inspiration and performance achievements, these systems differ from human intelligence in crucial ways. We review progress in cognitive science suggesting that truly human-like learning and thinking machines will have to reach beyond current engineering trends in both what they learn, and how they learn it. Specifically, we argue that these machines should (a) build causal models of the world that support explanation and understanding, rather than merely solving pattern recognition problems; (b) ground learning in intuitive theories of physics and psychology, to support and enrich the knowledge that is learned; and (c) harness compositionality and learning-to-learn to rapidly acquire and generalize knowledge to new tasks and situations. We suggest concrete challenges and promising routes towards these goals that can combine the strengths of recent neural network advances with more structured cognitive models.Comment: In press at Behavioral and Brain Sciences. Open call for commentary proposals (until Nov. 22, 2016). https://www.cambridge.org/core/journals/behavioral-and-brain-sciences/information/calls-for-commentary/open-calls-for-commentar

    Transcribing Content from Structural Images with Spotlight Mechanism

    Full text link
    Transcribing content from structural images, e.g., writing notes from music scores, is a challenging task as not only the content objects should be recognized, but the internal structure should also be preserved. Existing image recognition methods mainly work on images with simple content (e.g., text lines with characters), but are not capable to identify ones with more complex content (e.g., structured symbols), which often follow a fine-grained grammar. To this end, in this paper, we propose a hierarchical Spotlight Transcribing Network (STN) framework followed by a two-stage "where-to-what" solution. Specifically, we first decide "where-to-look" through a novel spotlight mechanism to focus on different areas of the original image following its structure. Then, we decide "what-to-write" by developing a GRU based network with the spotlight areas for transcribing the content accordingly. Moreover, we propose two implementations on the basis of STN, i.e., STNM and STNR, where the spotlight movement follows the Markov property and Recurrent modeling, respectively. We also design a reinforcement method to refine the framework by self-improving the spotlight mechanism. We conduct extensive experiments on many structural image datasets, where the results clearly demonstrate the effectiveness of STN framework.Comment: Accepted by KDD2018 Research Track. In proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'18
    • …
    corecore