DART: A Lightweight Quality-Suggestive Data-to-Text Annotation Tool
We present a lightweight annotation tool, the Data AnnotatoR Tool (DART), for
the general task of labeling structured data with textual descriptions. The
tool is implemented as an interactive application that reduces human efforts in
annotating large quantities of structured data, e.g. in the format of a table
or tree structure. By using a backend sequence-to-sequence model, our system
iteratively analyzes the annotated labels in order to better sample unlabeled
data. In a simulation experiment on annotating large quantities of structured
data, DART is shown to reduce the total number of annotations needed by
combining active learning with automatic suggestion of relevant labels.
Comment: Accepted to COLING 2020 (selected as an outstanding paper).
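The abstract does not include the sampling loop itself; as a rough illustration, a minimal sketch of the kind of uncertainty-driven annotation loop it describes could look like the following, where the backend model, the confidence heuristic, and the human-in-the-loop callback are illustrative assumptions rather than DART's actual API.

```python
# Hypothetical sketch of an active-learning annotation loop in the spirit of
# DART: a seq2seq model scores unlabeled records, the least-confident ones are
# sent to a human annotator together with a suggested label, and the model is
# retrained on the growing labeled set. All names here are placeholders.

import heapq
import random


class Seq2SeqModel:
    """Placeholder for the backend sequence-to-sequence model."""

    def train(self, pairs):                      # pairs: [(record, text), ...]
        pass

    def confidence(self, record):
        """Pseudo-confidence that the model can describe `record` (stub)."""
        return random.random()

    def suggest(self, record):
        """Suggested textual description for `record` (stub)."""
        return f"description of {record}"


def ask_human(record, suggestion):
    """Stand-in for the interactive UI: the annotator edits or accepts the suggestion."""
    return suggestion


def annotate(unlabeled, rounds=5, batch_size=10):
    model, labeled = Seq2SeqModel(), []
    for _ in range(rounds):
        # Sample the records the model is least confident about.
        batch = heapq.nsmallest(batch_size, unlabeled, key=model.confidence)
        for record in batch:
            labeled.append((record, ask_human(record, model.suggest(record))))
            unlabeled.remove(record)
        model.train(labeled)                     # retrain before the next round
    return labeled
```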
FoleyGen: Visually-Guided Audio Generation
Recent advancements in audio generation have been spurred by the evolution of
large-scale deep learning models and expansive datasets. However, the task of
video-to-audio (V2A) generation remains a challenge, principally because of
the intricate relationship between high-dimensional visual and auditory data
and the difficulty of temporal synchronization. In
this study, we introduce FoleyGen, an open-domain V2A generation system built
on a language modeling paradigm. FoleyGen leverages an off-the-shelf neural
audio codec for bidirectional conversion between waveforms and discrete tokens.
The generation of audio tokens is facilitated by a single Transformer model,
which is conditioned on visual features extracted from a visual encoder. A
prevalent problem in V2A generation is the misalignment of generated audio with
the visible actions in the video. To address this, we explore three novel
visual attention mechanisms. We further undertake an exhaustive evaluation of
multiple visual encoders, each pretrained on either single-modal or multi-modal
tasks. Experimental results on the VGGSound dataset show that our proposed
FoleyGen outperforms previous systems across all objective metrics and human
evaluations.
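A minimal PyTorch-style sketch of the pipeline the abstract describes is given below: a decoder-only Transformer predicts discrete codec tokens conditioned on visual features, here by simple prefixing. The module names, dimensions, and the prefix-conditioning choice are assumptions for illustration, not the authors' implementation.

```python
# Illustrative sketch of a FoleyGen-style video-to-audio language model:
# visual features condition an autoregressive Transformer over audio-codec
# tokens; a neural codec (not shown) maps tokens back to a waveform.

import torch
import torch.nn as nn


class VideoToAudioLM(nn.Module):
    def __init__(self, vocab_size=1024, d_model=512, n_layers=8, n_heads=8, visual_dim=768):
        super().__init__()
        self.visual_proj = nn.Linear(visual_dim, d_model)  # map visual features into the LM space
        self.token_emb = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, visual_feats, audio_tokens):
        # visual_feats: (B, T_v, visual_dim) from a visual encoder
        # audio_tokens: (B, T_a) discrete indices from a neural audio codec
        prefix = self.visual_proj(visual_feats)            # condition by prefixing visual features
        x = torch.cat([prefix, self.token_emb(audio_tokens)], dim=1)
        causal = torch.triu(torch.full((x.size(1), x.size(1)), float("-inf")), diagonal=1)
        h = self.transformer(x, mask=causal)
        return self.head(h[:, prefix.size(1):])            # next-token logits for audio positions


model = VideoToAudioLM()
logits = model(torch.randn(2, 16, 768), torch.randint(0, 1024, (2, 100)))
print(logits.shape)  # torch.Size([2, 100, 1024])
```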
Folding Attention: Memory and Power Optimization for On-Device Transformer-based Streaming Speech Recognition
Transformer-based models excel in speech recognition. Existing efforts to
optimize Transformer inference, typically for long-context applications, center
on simplifying attention score calculations. However, streaming speech
recognition models usually process a limited number of tokens each time, making
attention score calculation less of a bottleneck. Instead, the bottleneck lies
in the linear projection layers of multi-head attention and feedforward
networks, constituting a substantial portion of the model size and contributing
significantly to computation, memory, and power usage.
To address this bottleneck, we propose folding attention, a technique
targeting these linear layers, significantly reducing model size and improving
memory and power efficiency. Experiments on on-device Transformer-based
streaming speech recognition models show that folding attention reduces model
size (and corresponding memory consumption) by up to 24% and power consumption
by up to 23%, all without compromising model accuracy or adding computation overhead.
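As a back-of-the-envelope illustration of why the linear projections dominate in the streaming setting, the sketch below counts multiply-accumulates for one Transformer layer; the dimensions, chunk size, and cache length are assumptions chosen only to convey scale, not measurements from the paper.

```python
# Rough MAC count for one Transformer layer when only a few tokens are
# processed per streaming step: the QKVO projections and the feed-forward
# network dwarf the attention-score computation.

d_model, d_ff = 512, 2048
tokens_per_step, cache_len = 4, 64   # small chunk plus a short left-context cache

proj_macs = tokens_per_step * 4 * d_model * d_model                  # Q, K, V and output projections
ffn_macs = tokens_per_step * 2 * d_model * d_ff                      # two feed-forward matmuls
score_macs = 2 * tokens_per_step * (tokens_per_step + cache_len) * d_model  # QK^T and weighted sum of V

print(f"projections:      {proj_macs:,} MACs")    # 4,194,304
print(f"feed-forward:     {ffn_macs:,} MACs")     # 8,388,608
print(f"attention scores: {score_macs:,} MACs")   # 278,528
```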
Economic and Patrimonial Violence in Relation to the Omission of Family Maintenance Payments
The objective of this research is to show how economic and patrimonial violence
bears on the omission of family maintenance payments. A basic-type methodology
is used, since the aim is to obtain new knowledge, with a qualitative approach
and a phenomenological design, because the facts are studied within a context
and according to the experiences of the participants. The interview-guide
technique was applied and, from the answers to the questions, we obtained our
results, which indicate that the figure of economic and patrimonial violence is
entirely separate from the crime of omission of family maintenance, since the
former depends on a context of subordination and power, while the latter is
merely the breach of a judicial ruling. We conclude that, within the crime of
omission of family maintenance, the gender perspective is not relevant and does
not bear on the crime as such, and that criminal liability for economic and
patrimonial violence is still far from being easily identified by justice
operators, since Law N° 30364 only conceptualizes the different forms of
aggression in general terms.
Not All Weights Are Created Equal: Enhancing Energy Efficiency in On-Device Streaming Speech Recognition
Power consumption plays an important role in on-device streaming speech
recognition, as it has a direct impact on the user experience. This study
delves into how weight parameters in speech recognition models influence the
overall power consumption of these models. We discovered that the impact of
weight parameters on power consumption varies, influenced by factors including
how often they are invoked and their placement in memory. Armed with this
insight, we developed design guidelines aimed at optimizing on-device speech
recognition models. These guidelines focus on minimizing power use without
substantially affecting accuracy. Our method, which employs targeted
compression based on the varying sensitivities of weight parameters,
demonstrates superior performance compared to state-of-the-art compression
methods. It achieves a reduction in energy usage of up to 47% while maintaining
similar model accuracy and improving the real-time factor.
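The abstract does not spell out the compression recipe; a hedged sketch of one sensitivity-guided scheme in this spirit is shown below, where each weight tensor's sensitivity is estimated from the loss increase when it alone is quantized, and the less sensitive tensors receive the more aggressive bit-width. The quantizer, bit-widths, and budget are illustrative assumptions, not the paper's method.

```python
# Sketch: per-tensor sensitivity probing followed by mixed-bit-width
# fake-quantization. `eval_loss` is assumed to return a scalar validation loss.

def fake_quantize(w, bits):
    """Uniform symmetric fake-quantization of a weight tensor."""
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max() / qmax
    return (w / scale).round().clamp(-qmax - 1, qmax) * scale


def compress_by_sensitivity(model, eval_loss, low_bits=4, high_bits=8, budget=0.5):
    """Quantize the least-sensitive fraction `budget` of weight tensors to `low_bits`."""
    baseline = eval_loss(model)
    sensitivity = {}
    for name, param in model.named_parameters():
        if param.dim() < 2:                      # skip biases and norm parameters
            continue
        original = param.data.clone()
        param.data = fake_quantize(param.data, low_bits)
        sensitivity[name] = eval_loss(model) - baseline   # loss increase when only this tensor is quantized
        param.data = original
    ranked = sorted(sensitivity, key=sensitivity.get)     # least sensitive first
    aggressive = set(ranked[: int(len(ranked) * budget)])
    for name, param in model.named_parameters():
        if name in sensitivity:
            param.data = fake_quantize(param.data, low_bits if name in aggressive else high_bits)
    return model
```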
Stack-and-Delay: a new codebook pattern for music generation
In language modeling based music generation, a generated waveform is
represented by a sequence of hierarchical token stacks that can be decoded
either in an auto-regressive manner or in parallel, depending on the codebook
patterns. In particular, flattening the codebooks represents the highest
quality decoding strategy, while being notoriously slow. To this end, we
propose a novel stack-and-delay decoding strategy that improves upon flat
pattern decoding, with generation four times faster than vanilla flat
decoding. This brings the inference time close to that of the
delay decoding strategy, and allows for faster inference on GPU for small batch
sizes. For the same inference efficiency budget as the delay pattern, we show
that the proposed approach performs better in objective evaluations, almost
closing the gap with the flat pattern in terms of quality. The results are
corroborated by subjective evaluations which show that samples generated by the
new model are slightly more often preferred to samples generated by the
competing model given the same text prompts.
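For readers unfamiliar with codebook patterns, the short sketch below contrasts the two baselines the abstract refers to, the flat and delay patterns, for K codebook levels over T frames (MusicGen-style notation); the exact stack-and-delay schedule proposed in the paper is not reproduced here.

```python
# Flat decoding serializes every codebook level of every frame into T*K
# autoregressive steps, while the delay pattern offsets level k by k steps so
# that up to K tokens (one per level) can be emitted at each of T+K-1 steps.

K, T = 4, 6  # codebook levels, audio frames

flat = [[(t, k)] for t in range(T) for k in range(K)]           # T*K steps, 1 token per step
delay = [[(s - k, k) for k in range(K) if 0 <= s - k < T]       # T+K-1 steps, up to K tokens per step
         for s in range(T + K - 1)]

print(len(flat), "flat steps vs.", len(delay), "delay steps")   # 24 vs. 9
for step, tokens in enumerate(delay[:4]):
    print(f"delay step {step}: (frame, level) pairs decoded -> {tokens}")
```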
Low-Resource Cross-Lingual Adaptive Training for Nigerian Pidgin
Developing effective spoken language processing systems for low-resource languages poses several challenges due to the lack of parallel data and limited resources for fine-tuning models. In this work, we aim to improve both text classification and translation for Nigerian Pidgin (Naija) by collecting a large-scale parallel English-Pidgin corpus, and we further propose a cross-lingual adaptive training framework that includes both continual and task-adaptive training to adapt a base pre-trained model to low-resource languages. Our studies show that English pre-trained language models serve as a stronger prior than multilingual language models on English-Pidgin tasks, with up to 2.38 BLEU improvement, and demonstrate that augmenting orthographic data and using task-adaptive training with back-translation can have a significant impact on model performance.
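As a small illustration of the back-translation step mentioned above, the sketch below augments an English-Pidgin parallel corpus with synthetic pairs produced by a reverse (Pidgin-to-English) translator; the translator callable and the toy data are stand-ins, not the paper's models or corpus.

```python
# Back-translation augmentation: monolingual Pidgin sentences are translated
# into English by a reverse model, and the synthetic (english, pidgin) pairs
# are appended to the parallel training data for the English -> Pidgin model.

from typing import Callable, List, Tuple


def back_translate(
    monolingual_pidgin: List[str],
    pidgin_to_english: Callable[[str], str],
    parallel: List[Tuple[str, str]],
) -> List[Tuple[str, str]]:
    """Return the parallel corpus extended with synthetic (english, pidgin) pairs."""
    synthetic = [(pidgin_to_english(pg), pg) for pg in monolingual_pidgin]
    return parallel + synthetic


# Toy usage with a stub translator; a real system would call a trained model here.
augmented = back_translate(
    ["How you dey?"],
    pidgin_to_english=lambda s: "<english translation of: " + s + ">",
    parallel=[("Good morning", "Gud morin")],
)
print(augmented)
```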