
    DART: A Lightweight Quality-Suggestive Data-to-Text Annotation Tool

    We present a lightweight annotation tool, the Data AnnotatoR Tool (DART), for the general task of labeling structured data with textual descriptions. The tool is implemented as an interactive application that reduces the human effort of annotating large quantities of structured data, e.g. in the format of a table or tree structure. Using a backend sequence-to-sequence model, our system iteratively analyzes the annotated labels in order to better sample unlabeled data. In a simulation experiment on annotating large quantities of structured data, DART reduces the total number of annotations needed by combining active learning with automatic suggestion of relevant labels. (Comment: Accepted to COLING 2020; selected as outstanding paper.)
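    The iterate-analyze-resample loop described above is easy to picture in code. Below is a minimal sketch, assuming a hypothetical Seq2SeqModel stub and a confidence-based sampling rule; it illustrates the general active-learning idea, not DART's actual implementation.

```python
# Minimal sketch of an active-learning annotation loop (an assumption,
# not DART's code). Seq2SeqModel and its methods are illustrative stubs.
import random

class Seq2SeqModel:
    def suggest(self, record):
        return f"description of {record}"   # stand-in for seq2seq generation
    def confidence(self, record, text):
        return random.random()              # stand-in for model likelihood
    def finetune(self, pairs):
        pass                                # stand-in for iterative re-training

def annotate(records, budget, batch_size=4):
    model, labeled, pool = Seq2SeqModel(), [], list(records)
    while pool and len(labeled) < budget:
        # Surface the records the model is least confident about first,
        # so each round of human effort goes where it helps most.
        pool.sort(key=lambda r: model.confidence(r, model.suggest(r)))
        batch, pool = pool[:batch_size], pool[batch_size:]
        for r in batch:
            labeled.append((r, model.suggest(r)))  # annotator would edit this
        model.finetune(labeled)                    # re-analyze annotated labels
    return labeled

print(len(annotate(range(20), budget=8)))
```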

    FoleyGen: Visually-Guided Audio Generation

    Recent advancements in audio generation have been spurred by the evolution of large-scale deep learning models and expansive datasets. However, the task of video-to-audio (V2A) generation continues to be a challenge, principally because of the intricate relationship between high-dimensional visual and auditory data and the challenges associated with temporal synchronization. In this study, we introduce FoleyGen, an open-domain V2A generation system built on a language modeling paradigm. FoleyGen leverages an off-the-shelf neural audio codec for bidirectional conversion between waveforms and discrete tokens. The generation of audio tokens is facilitated by a single Transformer model, which is conditioned on visual features extracted from a visual encoder. A prevalent problem in V2A generation is the misalignment of generated audio with the visible actions in the video. To address this, we explore three novel visual attention mechanisms. We further undertake an exhaustive evaluation of multiple visual encoders, each pretrained on either single-modal or multi-modal tasks. Experimental results on the VGGSound dataset show that our proposed FoleyGen outperforms previous systems across all objective metrics and human evaluations.
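    To make the pipeline shape concrete, here is a minimal PyTorch sketch of the language-modeling setup the abstract describes: discrete audio tokens predicted by a Transformer that cross-attends to visual features. All dimensions, module choices, and names are illustrative assumptions, not the released FoleyGen system.

```python
# Sketch of a V2A language-modeling step: visual features condition a
# causal Transformer over codec tokens. Shapes/modules are assumptions.
import torch
import torch.nn as nn

B, T_vid, T_aud, d, vocab = 2, 16, 100, 256, 1024

visual_feats = torch.randn(B, T_vid, d)            # from a frozen visual encoder
audio_tokens = torch.randint(0, vocab, (B, T_aud)) # from a neural audio codec

embed = nn.Embedding(vocab, d)
layer = nn.TransformerDecoderLayer(d_model=d, nhead=8, batch_first=True)
decoder = nn.TransformerDecoder(layer, num_layers=4)
lm_head = nn.Linear(d, vocab)

# Causal mask: each audio token attends only to earlier tokens; visual
# features enter via cross-attention (one plausible conditioning scheme
# among the several attention mechanisms the paper compares).
causal = nn.Transformer.generate_square_subsequent_mask(T_aud)
h = decoder(embed(audio_tokens), memory=visual_feats, tgt_mask=causal)
logits = lm_head(h)
print(logits.shape)  # (B, T_aud, vocab): next-token distribution per step
```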

    Folding Attention: Memory and Power Optimization for On-Device Transformer-based Streaming Speech Recognition

    Transformer-based models excel in speech recognition. Existing efforts to optimize Transformer inference, typically for long-context applications, center on simplifying attention score calculations. However, streaming speech recognition models usually process a limited number of tokens each time, making attention score calculation less of a bottleneck. Instead, the bottleneck lies in the linear projection layers of multi-head attention and feedforward networks, which constitute a substantial portion of the model size and contribute significantly to computation, memory, and power usage. To address this bottleneck, we propose folding attention, a technique targeting these linear layers that significantly reduces model size and improves memory and power efficiency. Experiments on on-device Transformer-based streaming speech recognition models show that folding attention reduces model size (and corresponding memory consumption) by up to 24% and power consumption by up to 23%, all without compromising model accuracy or adding computation overhead.
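    The claim that linear projections dominate for short token counts follows from a quick FLOP count, sketched below with assumed dimensions (not the paper's model): projection cost scales as T·d² while score cost scales as T²·d, so for small T the projections win by a factor of roughly 2d/T.

```python
# Back-of-the-envelope FLOP comparison for one multi-head attention
# block; d and the token counts are assumed, illustrative values.
d, T_long, T_stream = 512, 4096, 32

def mha_flops(T, d):
    proj   = 4 * T * d * d   # Q, K, V, and output linear projections
    scores = 2 * T * T * d   # QK^T plus attention-weighted sum over V
    return proj, scores

for T in (T_long, T_stream):
    proj, scores = mha_flops(T, d)
    print(f"T={T:5d}: projections cost {proj / scores:5.1f}x the score math")
# For T=32 the projections cost ~32x the score computation, so shrinking
# the linear layers (the target of folding attention) is what pays off.
```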

    Stack-and-Delay: a new codebook pattern for music generation

    In language-model-based music generation, a generated waveform is represented by a sequence of hierarchical token stacks that can be decoded either auto-regressively or in parallel, depending on the codebook pattern. In particular, flattening the codebooks yields the highest-quality decoding strategy but is notoriously slow. To this end, we propose a novel stack-and-delay decoding strategy that improves on flat-pattern decoding, generating four times faster than vanilla flat decoding. This brings the inference time close to that of the delay decoding strategy and allows for faster inference on GPU for small batch sizes. For the same inference-efficiency budget as the delay pattern, we show that the proposed approach performs better in objective evaluations, almost closing the quality gap with the flat pattern. The results are corroborated by subjective evaluations, which show that samples generated by the new model are slightly more often preferred over samples generated by the competing model given the same text prompts.
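    For readers unfamiliar with codebook patterns, the sketch below contrasts the two baselines the abstract mentions, flat and delay decoding, for K codebooks over T frames; the stack-and-delay pattern itself is the paper's contribution and is not reproduced here.

```python
# Illustrative comparison of flat vs. delay codebook patterns.
# Each step lists the (codebook, frame) tokens decoded at that step.
K, T = 4, 6  # e.g. 4 residual codebooks over 6 audio frames (assumed sizes)

# Flat pattern: fully sequential, one token per decoding step -> K*T steps.
flat_steps = [[(k, t)] for t in range(T) for k in range(K)]

# Delay pattern: codebook k is shifted right by k steps, so up to K
# codebooks are sampled in parallel each step -> T + K - 1 steps.
delay_steps = [[(k, s - k) for k in range(K) if 0 <= s - k < T]
               for s in range(T + K - 1)]

print(len(flat_steps), "flat steps vs", len(delay_steps), "delay steps")
# 24 flat steps vs 9 delay steps: the speed gap the proposed
# stack-and-delay pattern aims to close while keeping flat-like quality.
```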

    Enhance audio generation controllability through representation similarity regularization

    This paper presents an innovative approach to enhancing control over audio generation by emphasizing the alignment between audio and text representations during model training. In the context of language-model-based audio generation, the model leverages input from both textual and audio token representations to predict subsequent audio tokens. However, the current configuration lacks explicit regularization to ensure alignment between the chosen text representation and the language model's predictions. Our proposal incorporates audio and text representation regularization, particularly during the classifier-free guidance (CFG) phase, where the text condition is excluded from cross-attention during language model training. The aim of this regularization is to minimize the discrepancy between audio and text similarities relative to other samples within the same training batch. Experimental results on both music and audio generation tasks demonstrate that our proposed methods improve objective metrics for both audio and music generation, as well as human perception of audio generation quality. (Comment: 5 pages.)
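    One plausible form of such a regularizer, sketched below as an assumption rather than the paper's exact loss, matches pairwise audio-audio similarities to the corresponding text-text similarities within a training batch.

```python
# Sketch of a batch-level representation similarity regularizer
# (an illustrative assumption, not the paper's published loss).
import torch
import torch.nn.functional as F

def similarity_regularization(audio_emb, text_emb):
    # audio_emb, text_emb: (batch, dim) pooled representations
    a = F.normalize(audio_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    sim_a = a @ a.T                  # audio-audio cosine similarities
    sim_t = t @ t.T                  # text-text cosine similarities
    return F.mse_loss(sim_a, sim_t)  # penalize cross-modal discrepancy

loss = similarity_regularization(torch.randn(8, 256), torch.randn(8, 256))
print(loss.item())  # would be added to the LM loss with some weight
```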

    Electron transfer kinetics on natural crystals of MoS2 and graphite

    Here, we evaluate the electrochemical performance of sparsely studied natural crystals of molybdenite and graphite, which have increasingly been used for the fabrication of next-generation monolayer molybdenum disulphide and graphene energy storage devices. Heterogeneous electron transfer kinetics of several redox mediators, including Fe(CN)₆³⁻/⁴⁻, Ru(NH₃)₆³⁺/²⁺ and IrCl₆²⁻/³⁻, are determined using voltammetry in a micro-droplet cell. The kinetics on both materials are studied as a function of surface defectiveness, surface ageing, applied potential and illumination. We find that the basal planes of both natural MoS2 and graphite show significant electroactivity, but a large decrease in electron transfer kinetics is observed on atmosphere-aged surfaces in comparison to freshly cleaved surfaces of both materials. This is attributed to surface oxidation and adsorption of airborne contaminants at the surface exposed to an ambient environment. In contrast to semimetallic graphite, the electrode kinetics on semiconducting MoS2 are strongly dependent on surface illumination and applied potential. Furthermore, while visibly present defects/cracks do not significantly affect the response of graphite, the kinetics on MoS2 systematically accelerate with a small increase in disorder. These findings have direct implications for the use of MoS2 and graphene/graphite as electrode materials in electrochemistry-related applications.