Quootstrap: Scalable Unsupervised Extraction of Quotation-Speaker Pairs from Large News Corpora via Bootstrapping
We propose Quootstrap, a method for extracting quotations, as well as the
names of the speakers who uttered them, from large news corpora. Whereas prior
work has addressed this problem primarily with supervised machine learning, our
approach follows a fully unsupervised bootstrapping paradigm. It leverages the
redundancy present in large news corpora, more precisely, the fact that the
same quotation often appears across multiple news articles in slightly
different contexts. Starting from a few seed patterns, such as ["Q", said S.],
our method extracts a set of quotation-speaker pairs (Q, S), which are in turn
used for discovering new patterns expressing the same quotations; the process
is then repeated with the larger pattern set. Our algorithm is highly scalable,
which we demonstrate by running it on the large ICWSM 2011 Spinn3r corpus.
Validating our results against a crowdsourced ground truth, we obtain 90%
precision at 40% recall using a single seed pattern, with significantly higher
recall values for more frequently reported (and thus likely more interesting)
quotations. Finally, we showcase the usefulness of our algorithm's output for
computational social science by analyzing the sentiment expressed in our
extracted quotations.
Comment: Accepted at the 12th International Conference on Web and Social Media
(ICWSM), 2018.
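To make the bootstrapping loop concrete, here is a minimal, self-contained sketch of the idea: known patterns extract (quotation, speaker) pairs, and known pairs are matched back against the corpus to derive new patterns, which are then applied in the next iteration. The toy articles, the seed regex, and the helper names (extract_pairs, derive_patterns, quootstrap) are illustrative assumptions, not the authors' implementation, which is built to scale to the full Spinn3r corpus.

```python
import re

ARTICLES = [
    '"We will win", said Jane Doe.',
    'Jane Doe told reporters: "We will win".',
    '"Taxes must fall", said John Roe.',
    'John Roe told reporters: "Spending will rise".',
]

# Seed pattern: a quotation in double quotes followed by ', said SPEAKER.'
SEED_PATTERNS = {re.compile(r'"(?P<q>[^"]+)", said (?P<s>[A-Z]\w+ [A-Z]\w+)\.')}


def extract_pairs(patterns, articles):
    """Apply every known pattern to every article, collecting (quote, speaker) pairs."""
    pairs = set()
    for pattern in patterns:
        for text in articles:
            for m in pattern.finditer(text):
                pairs.add((m.group("q"), m.group("s")))
    return pairs


def derive_patterns(pairs, articles):
    """Turn every other context in which a known pair appears into a new pattern."""
    patterns = set()
    for quote, speaker in pairs:
        for text in articles:
            if quote in text and speaker in text:
                template = re.escape(text)
                template = template.replace(re.escape(quote), '(?P<q>[^"]+)')
                template = template.replace(re.escape(speaker), r'(?P<s>[A-Z]\w+ [A-Z]\w+)')
                patterns.add(re.compile(template))
    return patterns


def quootstrap(seed_patterns, articles, iterations=3):
    """Alternate pair extraction and pattern discovery, growing both sets."""
    patterns, pairs = set(seed_patterns), set()
    for _ in range(iterations):
        pairs |= extract_pairs(patterns, articles)
        patterns |= derive_patterns(pairs, articles)
    return pairs


# The seed only matches the '"Q", said S.' articles; the derived
# 'S told reporters: "Q".' pattern then recovers the remaining pair.
print(quootstrap(SEED_PATTERNS, ARTICLES))
```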
Real-time gesture recognition using machine learning techniques
Gesture recognition is a research topic that has been gaining increasing popularity, especially in recent years, thanks to technological advances in embedded devices and sensors. The goal of this thesis is to apply machine learning techniques to build a system capable of recognizing and classifying hand gestures in real time from the myoelectric (EMG) signals produced by the muscles. In addition, to enable the recognition of complex spatial movements, inertial signals from an Inertial Measurement Unit (IMU) equipped with an accelerometer, gyroscope, and magnetometer are also processed.
The first part of the thesis, besides offering an overview of wearable devices and sensors, analyzes several techniques for classifying temporal sequences, highlighting their advantages and disadvantages. In particular, it considers approaches based on Dynamic Time Warping (DTW), Hidden Markov Models (HMM), and Long Short-Term Memory (LSTM) recurrent neural networks (RNN), one of the latest developments in deep learning.
The second part concerns the project itself. The Myo wearable device by Thalmic Labs is used as a case study, and the DTW- and HMM-based techniques are applied in detail to design and build a framework capable of real-time gesture recognition. The final chapter presents the results obtained (including a comparison of the analyzed techniques), both for the classification of isolated gestures and for real-time recognition.
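As a concrete illustration of one of the techniques compared in the thesis, the sketch below classifies a gesture by nearest-template Dynamic Time Warping (DTW) distance. The toy one-dimensional "EMG envelopes" and gesture labels are invented for the example; the actual system works on multi-channel EMG and IMU streams from the Myo armband and also covers HMM-based recognition.

```python
import numpy as np


def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) DTW between two 1-D sequences."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    return cost[n, m]


def classify(signal, templates):
    """Return the label of the template with the smallest DTW distance."""
    return min(templates, key=lambda label: dtw_distance(signal, templates[label]))


# Toy 1-D "EMG envelopes": a short bump vs. a rising ramp.
templates = {
    "fist": np.array([0, 1, 3, 3, 1, 0], dtype=float),
    "wave_out": np.array([0, 1, 2, 3, 4, 5], dtype=float),
}
incoming = np.array([0, 2, 3, 2, 0], dtype=float)
print(classify(incoming, templates))  # -> "fist"
```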
Shape, Pose, and Appearance from a Single Image via Bootstrapped Radiance Field Inversion
Neural Radiance Fields (NeRF) coupled with GANs represent a promising
direction in the area of 3D reconstruction from a single view, owing to their
ability to efficiently model arbitrary topologies. Recent work in this area,
however, has mostly focused on synthetic datasets where exact ground-truth
poses are known, and has overlooked pose estimation, which is important for
certain downstream applications such as augmented reality (AR) and robotics. We
introduce a principled end-to-end reconstruction framework for natural images,
where accurate ground-truth poses are not available. Our approach recovers an
SDF-parameterized 3D shape, pose, and appearance from a single image of an
object, without exploiting multiple views during training. More specifically,
we leverage an unconditional 3D-aware generator, to which we apply a hybrid
inversion scheme where a model produces a first guess of the solution which is
then refined via optimization. Our framework can de-render an image in as few
as 10 steps, enabling its use in practical scenarios. We demonstrate
state-of-the-art results on a variety of real and synthetic benchmarks.
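A minimal sketch of the hybrid inversion scheme mentioned above: an encoder produces a first guess of the latent code, which a few optimization steps then refine against a reconstruction loss. The tiny MLP "generator", the linear "encoder", and all tensor shapes are placeholders; the paper's model is a 3D-aware, SDF-parameterized generator applied to real images.

```python
import torch

torch.manual_seed(0)

# Stand-in pre-trained generator: latent (16-D) -> "image" (64-D).
generator = torch.nn.Sequential(torch.nn.Linear(16, 64), torch.nn.Tanh())
# Stand-in encoder producing the initial guess: image -> latent.
encoder = torch.nn.Linear(64, 16)

target = generator(torch.randn(1, 16)).detach()  # toy "observed image"

# Step 1: amortized first guess from the encoder.
z = encoder(target).detach().requires_grad_(True)

# Step 2: refine the guess with a handful of optimization steps
# (the abstract reports that around 10 steps suffice).
opt = torch.optim.Adam([z], lr=0.05)
for _ in range(10):
    opt.zero_grad()
    loss = torch.nn.functional.mse_loss(generator(z), target)
    loss.backward()
    opt.step()

print(f"final reconstruction loss: {loss.item():.4f}")
```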
Analyzing Input and Output Representations for Speech-Driven Gesture Generation
This paper presents a novel framework for automatic speech-driven gesture
generation, applicable to human-agent interaction including both virtual agents
and robots. Specifically, we extend recent deep-learning-based, data-driven
methods for speech-driven gesture generation by incorporating representation
learning. Our model takes speech as input and produces gestures as output, in
the form of a sequence of 3D coordinates. Our approach consists of two steps.
First, we learn a lower-dimensional representation of human motion using a
denoising autoencoder neural network, consisting of a motion encoder MotionE
and a motion decoder MotionD. The learned representation preserves the most
important aspects of the human pose variation while removing less relevant
variation. Second, we train a novel encoder network SpeechE to map from speech
to a corresponding motion representation with reduced dimensionality. At test
time, the speech encoder and the motion decoder networks are combined: SpeechE
predicts motion representations based on a given speech signal and MotionD then
decodes these representations to produce motion sequences. We evaluate
different representation sizes in order to find the most effective
dimensionality for the representation. We also evaluate the effects of using
different speech features as input to the model. We find that mel-frequency
cepstral coefficients (MFCCs), alone or combined with prosodic features,
perform the best. The results of a subsequent user study confirm the benefits
of the representation learning.
Comment: Accepted at IVA '19. Shorter version published at AAMAS '19. The code
is available at
https://github.com/GestureGeneration/Speech_driven_gesture_generation_with_autoencode
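The sketch below mirrors the two-step pipeline described in the abstract, with random tensors standing in for MFCC features and 3D pose sequences. The module names follow the paper's MotionE / MotionD / SpeechE, but the linear layers, feature sizes, and training loops are illustrative simplifications of the paper's denoising autoencoder setup.

```python
import torch

torch.manual_seed(0)
POSE_DIM, SPEECH_DIM, LATENT_DIM = 45, 26, 8  # assumed feature sizes

motion_e = torch.nn.Linear(POSE_DIM, LATENT_DIM)    # MotionE: pose -> representation
motion_d = torch.nn.Linear(LATENT_DIM, POSE_DIM)    # MotionD: representation -> pose
speech_e = torch.nn.Linear(SPEECH_DIM, LATENT_DIM)  # SpeechE: speech -> representation

poses = torch.randn(256, POSE_DIM)     # toy motion-capture frames
speech = torch.randn(256, SPEECH_DIM)  # toy frame-aligned speech features

# Step 1: train the denoising motion autoencoder (MotionE + MotionD).
opt = torch.optim.Adam([*motion_e.parameters(), *motion_d.parameters()], lr=1e-3)
for _ in range(200):
    opt.zero_grad()
    noisy = poses + 0.1 * torch.randn_like(poses)  # denoising objective
    loss = torch.nn.functional.mse_loss(motion_d(motion_e(noisy)), poses)
    loss.backward()
    opt.step()

# Step 2: train SpeechE to predict the frozen motion representation.
targets = motion_e(poses).detach()
opt = torch.optim.Adam(speech_e.parameters(), lr=1e-3)
for _ in range(200):
    opt.zero_grad()
    loss = torch.nn.functional.mse_loss(speech_e(speech), targets)
    loss.backward()
    opt.step()

# Test time: chain SpeechE and MotionD to go from speech to motion.
with torch.no_grad():
    generated_poses = motion_d(speech_e(speech[:10]))
print(generated_poses.shape)  # torch.Size([10, 45])
```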
Dynamic Context Pruning for Efficient and Interpretable Autoregressive Transformers
Autoregressive Transformers adopted in Large Language Models (LLMs) are hard
to scale to long sequences. Despite several works trying to reduce their
computational cost, most LLMs still adopt attention layers between all pairs
of tokens in the sequence, thus incurring a quadratic cost. In this study, we
present a novel approach that dynamically prunes contextual information while
preserving the model's expressiveness, resulting in reduced memory and
computational requirements during inference. Our method employs a learnable
mechanism that determines which uninformative tokens can be dropped from the
context at any point across the generation process. By doing so, our approach
not only addresses performance concerns but also enhances interpretability,
providing valuable insight into the model's decision-making process. Our
technique can be applied to existing pre-trained models through a
straightforward fine-tuning process, and the pruning strength can be specified
by a sparsity parameter. Notably, our empirical findings demonstrate that we
can effectively prune up to 80% of the context without significant performance
degradation on downstream tasks, offering a valuable tool for mitigating
inference costs. Our reference implementation achieves a substantial increase
in inference throughput and even greater memory savings.
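A deliberately simplified sketch of the idea: during decoding, a learnable scorer assigns each cached token a keep-probability, and tokens that fall below a sparsity threshold are dropped before the next attention step. The scorer, the threshold, and the toy single-head attention are assumptions for illustration, not the paper's exact mechanism.

```python
import torch

torch.manual_seed(0)

D_MODEL = 32
scorer = torch.nn.Linear(D_MODEL, 1)  # learnable "keep this token?" score
THRESHOLD = 0.4                       # plays the role of the sparsity knob

cache = []  # (key, value) pairs for the context tokens kept so far


def step(new_token: torch.Tensor) -> torch.Tensor:
    """One decoding step: attend over the pruned cache, then prune it again."""
    global cache
    cache.append((new_token, new_token))  # toy cache with K = V = embedding

    # Plain dot-product attention of the new token over the kept context.
    keys = torch.stack([k for k, _ in cache])
    values = torch.stack([v for _, v in cache])
    attn = torch.softmax(keys @ new_token / D_MODEL ** 0.5, dim=0)
    output = attn @ values

    # Dynamic pruning: drop cached tokens whose keep-probability is too low,
    # but never the most recent token.
    keep_prob = torch.sigmoid(scorer(keys)).squeeze(-1)
    kept = [kv for kv, p in zip(cache[:-1], keep_prob[:-1]) if p >= THRESHOLD]
    cache = kept + [cache[-1]]
    return output


with torch.no_grad():
    for _ in range(20):
        step(torch.randn(D_MODEL))
print(f"context tokens kept after 20 steps: {len(cache)} / 20")
```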
ChoreoNet: Towards Music to Dance Synthesis with Choreographic Action Unit
Dance and music are two highly correlated artistic forms. Synthesizing dance
motions has attracted much attention recently. Most previous works conduct
music-to-dance synthesis by directly mapping music to human skeleton keypoints.
In contrast, human choreographers design dance motions from music in a
two-stage manner: they first devise multiple choreographic action units
(CAUs), each comprising a series of dance motions, and then arrange the CAU sequence
according to the rhythm, melody and emotion of the music. Inspired by this, we
systematically study this two-stage choreography approach and construct a
dataset to incorporate such choreography knowledge. Based on the constructed
dataset, we design a two-stage music-to-dance synthesis framework ChoreoNet to
imitate the human choreography procedure. Our framework first devises a CAU
prediction model to learn the mapping relationship between music and CAU
sequences. Afterwards, we devise a spatial-temporal inpainting model to convert
the CAU sequence into continuous dance motions. Experimental results
demonstrate that the proposed ChoreoNet outperforms baseline methods (0.622 in
terms of CAU BLEU score and 1.59 in terms of user study score).
Comment: 10 pages, 5 figures, Accepted by ACM MM 2020.
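A toy sketch of the two-stage idea: stage 1 maps coarse music features to a CAU sequence, and stage 2 stitches per-CAU motion clips into a continuous dance, linearly blending the seams as a crude stand-in for the paper's spatial-temporal inpainting model. The CAU library, beat features, and both "models" are fabricated for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
CAU_LIBRARY = {            # each CAU: 16 frames x 45 pose values
    "spin": rng.normal(size=(16, 45)),
    "step": rng.normal(size=(16, 45)),
    "wave": rng.normal(size=(16, 45)),
}


def predict_caus(music_beats: np.ndarray) -> list[str]:
    """Stage 1 (stand-in): pick one CAU per beat from its coarse energy."""
    names = sorted(CAU_LIBRARY)
    return [names[int(b) % len(names)] for b in music_beats]


def caus_to_motion(caus: list[str], blend: int = 4) -> np.ndarray:
    """Stage 2 (stand-in): concatenate CAU clips and linearly blend the seams."""
    motion = CAU_LIBRARY[caus[0]].copy()
    for name in caus[1:]:
        clip = CAU_LIBRARY[name].copy()
        w = np.linspace(0, 1, blend)[:, None]
        clip[:blend] = (1 - w) * motion[-blend:] + w * clip[:blend]
        motion = np.concatenate([motion[:-blend], clip])
    return motion


beats = np.array([0.2, 1.7, 2.9, 0.4])
caus = predict_caus(beats)
motion = caus_to_motion(caus)
print(caus, motion.shape)  # e.g. ['spin', 'step', 'wave', 'spin'] (52, 45)
```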
Structured Generative Models for Controllable Scene and 3D Content Synthesis
Deep learning has fundamentally transformed the field of image synthesis, facilitated by the emergence of generative models that demonstrate remarkable ability to generate photorealistic imagery and intricate graphics. These models have advanced a wide range of industries, including art, gaming, movies, augmented & virtual reality (AR/VR), and advertising. While realism is undoubtedly a major contributor to their success, the ability to control these models is equally important in ensuring their practical viability and making them more useful for downstream applications. For instance, it is natural to describe an image through natural language, sketches, or attributes controlling the style of specific objects. Therefore, it is convenient to devise generative frameworks that follow a workflow similar to that of an artist. Furthermore, for interactive applications, the generated content needs to be visualized from various viewpoints while making sure that the identity of the scene is preserved and is consistent across multiple views. Addressing this issue is interesting not only from an application-oriented standpoint, but also from an image understanding perspective. Our visual system perceives 2D projections of 3D scenes, but the convolutional architectures commonly used in generative models ignore the concept of image formation and attempt to learn this structure from the data. Generative models that explicitly reason about 3D representations can provide disentangled control over shape, pose, appearance, can better handle spatial phenomena such as occlusions, and can generalize with less data. These practical requirements motivate the need for generative models driven by structured representations that are efficient, easily interpretable, and more aligned with human perception.
In this dissertation, we initially focus on the research question of controlling generative adversarial networks (GANs) for complex scene synthesis. We observe that, while existing approaches exhibit some degree of control over simple domains such as faces or centered objects, they fall short when it comes to complex scenes consisting of multiple objects. We therefore propose a weakly-supervised approach where generated images are described by a sparse scene layout (i.e. a sketch), and in which the style of individual objects can be refined through textual descriptions or attributes. We then show that this paradigm can effectively be used to generate complex images without trading off realism for control.
Next, we address the aforementioned issue of view consistency. Following recent advances in differentiable rendering, we introduce a convolutional mesh generation paradigm that can be used to generate textured 3D meshes using GANs. This model can natively reason using 3D representations, and can therefore be used to generate 3D content for computer graphics applications. We also demonstrate that our 3D generator can be controlled using standard techniques that can also be applied to 2D GANs, and successfully condition our model on class labels, attributes, and textual descriptions. We then observe that methods for 3D content generation typically require ground-truth poses, restricting their applicability to simple datasets where these are available. We therefore propose a follow-up approach to relax this requirement, demonstrating our method on a larger set of classes from ImageNet.
Finally, we draw inspiration from the literature on Neural Radiance Fields (NeRF) and incorporate this recently-proposed representation into our work on 3D generative modelling. We show how these models can be used to solve a series of downstream tasks such as single-view 3D reconstruction. To this end, we propose an approach that bridges NeRFs and GANs to reconstruct the 3D shape, appearance, and pose of an object from a single 2D image. Our approach adopts a bootstrapped GAN inversion strategy where an encoder produces a first guess of the solution, which is then refined through optimization by inverting a pre-trained 3D generator.