Backbone Can Not be Trained at Once: Rolling Back to Pre-trained Network for Person Re-Identification
In the person re-identification (ReID) task, because of the shortage of training data, it is common to fine-tune a classification network pre-trained on a large dataset. However, it is difficult to sufficiently fine-tune the low-level layers of the network due to the gradient vanishing problem. In this work, we propose a novel fine-tuning strategy that allows low-level layers to be sufficiently trained by rolling back the weights of high-level layers to their initial pre-trained values. Our strategy alleviates gradient vanishing in the low-level layers and robustly trains them to fit the ReID dataset, thereby increasing the performance of ReID tasks. The improved performance of the proposed strategy is validated via several experiments. Furthermore, without any add-ons such as pose estimation or segmentation, our strategy exhibits state-of-the-art performance using only a vanilla deep convolutional neural network architecture.
Comment: Accepted to AAAI 2019
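The rolling-back step is easy to picture in code: after a round of fine-tuning, the high-level layers are restored to their pre-trained weights while the low-level layers keep what they learned, and fine-tuning then resumes. Below is a minimal PyTorch sketch of that idea, not the authors' implementation; the ResNet-50 backbone, the split point between low- and high-level layers, and the two-phase schedule are all assumptions.

```python
import copy
from torchvision.models import resnet50, ResNet50_Weights

# Load an ImageNet pre-trained backbone and keep a copy of its weights.
model = resnet50(weights=ResNet50_Weights.IMAGENET1K_V1)
pretrained_state = copy.deepcopy(model.state_dict())

# High-level blocks to roll back (the split point is an assumption; the
# paper may partition the network differently).
HIGH_LEVEL_PREFIXES = ("layer3", "layer4", "fc")

def roll_back_high_level(model, pretrained_state):
    """Restore high-level layers to their pre-trained weights while the
    fine-tuned low-level layers keep what they have learned."""
    state = model.state_dict()
    for name, tensor in pretrained_state.items():
        if name.startswith(HIGH_LEVEL_PREFIXES):
            state[name] = tensor.clone()
    model.load_state_dict(state)

def fine_tune(model, loader, epochs):
    """Placeholder for an ordinary supervised fine-tuning loop."""
    ...

# Two-phase schedule: fine-tune everything, roll the top back, then
# fine-tune again so gradients concentrate on the low-level layers.
# fine_tune(model, reid_loader, epochs=20)
# roll_back_high_level(model, pretrained_state)
# fine_tune(model, reid_loader, epochs=20)
```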
An Empirical Study Of Hospitality Management Student Attitudes Toward Group Projects: Instructional Factors And Team Problems
The development of positive attitudes toward team-based work is important in management education. This study investigates hospitality students' attitudes toward group projects by examining instructional factors and team problems. Specifically, we examine how students' perceptions of project appropriateness, instructors' support, and evaluation fairness influence their attitudes toward group projects. The effect of students' team problems on their attitudes toward group projects is also examined. This study highlights the criticality of the instructor's role in group project management for achieving a high level of positive attitudes toward group projects among hospitality management students.
AutoLR: Layer-wise Pruning and Auto-tuning of Learning Rates in Fine-tuning of Deep Networks
Existing fine-tuning methods use a single learning rate over all layers. In this paper, we first show that the trends of layer-wise weight variations under fine-tuning with a single learning rate do not match the well-known notion that lower-level layers extract general features and higher-level layers extract specific features. Based on this observation, we propose an algorithm that improves fine-tuning performance and reduces network complexity through layer-wise pruning and auto-tuning of layer-wise learning rates. The effectiveness of the proposed algorithm is verified by state-of-the-art performance on the image retrieval benchmark datasets (CUB-200, Cars-196, Stanford Online Products, and In-shop). Code is available at https://github.com/youngminPIL/AutoLR.
Comment: Accepted to AAAI 2021
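As a concrete picture of layer-wise learning rates, the sketch below builds one optimizer parameter group per block of a ResNet-18 and tracks the per-block weight variation that an AutoLR-style controller would monitor. The geometric learning-rate scaling, the block list, and the variation metric are illustrative assumptions, and the pruning step is omitted.

```python
import torch
from torchvision.models import resnet18

model = resnet18()

# One optimizer parameter group per top-level block, with a learning rate
# that grows with depth: lower layers move less, higher layers adapt more.
# The geometric scaling is an illustrative assumption, not the paper's rule.
base_lr, scale = 1e-4, 2.0
blocks = ["conv1", "bn1", "layer1", "layer2", "layer3", "layer4", "fc"]
param_groups = [
    {"params": [p for n, p in model.named_parameters() if n.startswith(b)],
     "lr": base_lr * scale ** depth}
    for depth, b in enumerate(blocks)
]
optimizer = torch.optim.SGD(param_groups, lr=base_lr, momentum=0.9)

# Snapshot taken before an epoch, used to measure how much each block moved.
snapshot = {n: p.detach().clone() for n, p in model.named_parameters()}

def weight_variation(model, snapshot):
    """Mean absolute weight change per block since `snapshot`; an
    AutoLR-style controller would adjust the per-block learning rates to
    keep these variations ordered from low (general) to high (specific)."""
    variation = {}
    for b in blocks:
        diffs = [(p.detach() - snapshot[n]).abs().mean()
                 for n, p in model.named_parameters() if n.startswith(b)]
        variation[b] = torch.stack(diffs).mean().item()
    return variation
```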
DiffV2S: Diffusion-based Video-to-Speech Synthesis with Vision-guided Speaker Embedding
Recent research has demonstrated impressive results in video-to-speech synthesis, which involves reconstructing speech solely from visual input. However, previous works have struggled to accurately synthesize speech due to a lack of sufficient guidance for the model to infer the correct content with the appropriate sound. To resolve this issue, they have adopted an extra speaker embedding, extracted from reference audio, as speaking-style guidance. Nevertheless, it is not always possible to obtain audio from the corresponding video input, especially at inference time. In this paper, we present a novel vision-guided speaker embedding extractor using a self-supervised pre-trained model and a prompt tuning technique. In doing so, rich speaker embedding information can be produced solely from the input visual information, and no extra audio is necessary at inference time. Using the extracted vision-guided speaker embedding representations, we further develop a diffusion-based video-to-speech synthesis model, called DiffV2S, conditioned on those speaker embeddings and the visual representation extracted from the input video. The proposed DiffV2S not only maintains the phoneme details contained in the input video frames, but also creates a highly intelligible mel-spectrogram in which the speaker identities of multiple speakers are all preserved. Our experimental results show that DiffV2S achieves state-of-the-art performance compared to previous video-to-speech synthesis techniques.
Comment: ICCV 2023
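To make the conditioning concrete, here is a toy PyTorch sketch of a diffusion denoiser that predicts noise on a mel-spectrogram while being conditioned on a visual representation and a vision-derived speaker embedding. It is a schematic stand-in for DiffV2S, not the paper's architecture: the MLP denoiser, all dimensions, and the DDPM-style noise-regression step are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConditionalDenoiser(nn.Module):
    """Toy noise predictor for a mel-spectrogram, conditioned on a visual
    representation and a vision-derived speaker embedding. All sizes and
    the MLP structure are illustrative assumptions."""
    def __init__(self, mel_dim=80, vis_dim=512, spk_dim=256, hidden=512):
        super().__init__()
        self.cond = nn.Linear(vis_dim + spk_dim, hidden)
        self.time = nn.Embedding(1000, hidden)  # discrete diffusion steps
        self.net = nn.Sequential(
            nn.Linear(mel_dim + hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, mel_dim),
        )

    def forward(self, noisy_mel, t, vis_feat, spk_emb):
        # Fuse the two conditions with the timestep, broadcast over frames.
        c = self.cond(torch.cat([vis_feat, spk_emb], dim=-1)) + self.time(t)
        c = c.unsqueeze(1).expand(-1, noisy_mel.size(1), -1)
        return self.net(torch.cat([noisy_mel, c], dim=-1))

def training_step(model, mel, vis_feat, spk_emb, alphas_cumprod):
    """One DDPM-style step: noise the mel at a random timestep and regress
    the noise. `alphas_cumprod` is a (1000,) noise-schedule tensor."""
    t = torch.randint(0, 1000, (mel.size(0),))
    noise = torch.randn_like(mel)
    a = alphas_cumprod[t].view(-1, 1, 1)
    noisy = a.sqrt() * mel + (1 - a).sqrt() * noise
    return F.mse_loss(model(noisy, t, vis_feat, spk_emb), noise)
```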
Exploring Phonetic Context-Aware Lip-Sync For Talking Face Generation
Talking face generation is the challenging task of synthesizing a natural and realistic face that is accurately synchronized with given audio. Due to co-articulation, where an isolated phone is influenced by the preceding or following phones, the articulation of a phone varies with its phonetic context. Therefore, modeling lip motion with the phonetic context can generate more spatio-temporally aligned lip movement. In this respect, we investigate the phonetic context in generating lip motion for talking face generation. We propose the Context-Aware Lip-Sync framework (CALS), which explicitly leverages phonetic context to generate lip movement of the target face. CALS comprises an Audio-to-Lip module and a Lip-to-Face module. The former is pre-trained with masked learning to map each phone to a contextualized lip motion unit. The contextualized lip motion unit then guides the latter in synthesizing a target identity with context-aware lip motion. Through extensive experiments, we verify that simply exploiting the phonetic context in the proposed CALS framework effectively enhances spatio-temporal alignment. We also demonstrate the extent to which the phonetic context assists in lip synchronization and find the effective window size for lip generation to be approximately 1.2 seconds.
Comment: Accepted at ICASSP 202
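The masked pre-training of the Audio-to-Lip module can be sketched in a few lines: a transformer encoder sees a sequence of per-frame phone features, some frames are hidden behind a learned mask token, and the model must regress the lip-motion units of the hidden frames from the surrounding phonetic context. The PyTorch toy below illustrates that recipe; the dimensions, masking ratio, and loss are assumptions rather than the paper's settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AudioToLip(nn.Module):
    """Toy Audio-to-Lip module: a transformer encoder reads per-frame phone
    features and emits a contextualized lip-motion unit per frame.
    Dimensions and depths are illustrative assumptions."""
    def __init__(self, phone_dim=80, lip_dim=128, hidden=256):
        super().__init__()
        self.embed = nn.Linear(phone_dim, hidden)
        self.mask_token = nn.Parameter(torch.zeros(hidden))
        layer = nn.TransformerEncoderLayer(hidden, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(hidden, lip_dim)

    def forward(self, phones, mask=None):
        x = self.embed(phones)                 # (batch, frames, hidden)
        if mask is not None:
            x[mask] = self.mask_token          # hide the masked frames
        return self.head(self.encoder(x))

def pretrain_step(model, phones, lip_units, mask_ratio=0.15):
    """Masked pre-training: regress lip units only at the hidden frames,
    so the prediction must come from the surrounding phonetic context."""
    mask = torch.rand(phones.shape[:2]) < mask_ratio
    pred = model(phones, mask)
    return F.mse_loss(pred[mask], lip_units[mask])
```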
Reprogramming Audio-driven Talking Face Synthesis into Text-driven
In this paper, we propose a method to reprogram pre-trained audio-driven talking face synthesis models to operate with text inputs. Since an audio-driven talking face synthesis model takes speech audio as input, generating a talking avatar with the desired speech content requires recording that speech in advance. However, recording audio for every video to be generated is burdensome. To alleviate this problem, we propose a novel method that embeds input text into the learned audio latent space of the pre-trained audio-driven model. To this end, we design a Text-to-Audio Embedding Module (TAEM) which is guided to learn to map a given text input to the audio latent features. Moreover, to model the speaker characteristics contained in the audio features, we propose to inject a visual speaker embedding, obtained from a single face image, into the TAEM. After training, we can synthesize talking face videos with either text or speech audio.
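A minimal sketch of a TAEM-like module follows, under the assumption that it is a transformer over text tokens whose output is regressed onto the frozen audio encoder's latents; the visual speaker embedding is injected by adding its projection to every text position. Vocabulary size, dimensions, and the injection scheme are illustrative, and the text-to-audio length alignment is glossed over.

```python
import torch
import torch.nn as nn

class TAEM(nn.Module):
    """Toy Text-to-Audio Embedding Module: encodes text tokens into the
    audio latent space of a frozen audio-driven model, with a visual
    speaker embedding added to every position. Sizes are assumptions."""
    def __init__(self, vocab=10000, audio_dim=512, spk_dim=256, hidden=512):
        super().__init__()
        self.text_embed = nn.Embedding(vocab, hidden)
        self.spk_proj = nn.Linear(spk_dim, hidden)
        layer = nn.TransformerEncoderLayer(hidden, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.to_audio = nn.Linear(hidden, audio_dim)

    def forward(self, tokens, spk_emb):
        # Inject speaker identity by adding its projection to each token.
        x = self.text_embed(tokens) + self.spk_proj(spk_emb).unsqueeze(1)
        return self.to_audio(self.encoder(x))

# Training would pull TAEM outputs toward the frozen audio encoder's
# latents for the same utterance, e.g. with an L1 loss:
#   loss = (taem(tokens, spk_emb) - audio_latents).abs().mean()
```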
Lip Reading for Low-resource Languages by Learning and Combining General Speech Knowledge and Language-specific Knowledge
This paper proposes a novel lip reading framework, especially for low-resource languages, which have not been well addressed in the previous literature. Since low-resource languages do not have enough video-text paired data to train a model powerful enough to capture both lip movements and language, developing lip reading models for such languages is regarded as challenging. To mitigate this challenge, we first learn general speech knowledge, the ability to model lip movements, from a high-resource language through the prediction of speech units. Since different languages partially share common phonemes, general speech knowledge learned from one language can be extended to other languages. We then learn language-specific knowledge, the ability to model language, by proposing a Language-specific Memory-augmented Decoder (LMDecoder). The LMDecoder saves language-specific audio features into memory banks and can be trained on audio-text paired data, which is more easily accessible than video-text paired data. Therefore, with the LMDecoder, we can transform the input speech units into language-specific audio features and translate them into text by utilizing the learned rich language knowledge. Finally, by combining the general speech knowledge and the language-specific knowledge, we can efficiently develop lip reading models even for low-resource languages. The effectiveness of the proposed method is evaluated through extensive experiments on five languages: English, Spanish, French, Italian, and Portuguese.
Comment: Accepted at ICCV 2023
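The memory-bank idea behind the LMDecoder can be illustrated with a small cross-attention block: speech-unit features query a learned bank of language-specific audio features, so a decoder pre-trained on plentiful audio-text pairs can be reused when the inputs come from lips. The sketch below is a toy stand-in, with bank size and dimensions as assumptions and the downstream text decoder omitted.

```python
import torch
import torch.nn as nn

class MemoryAugmentedDecoder(nn.Module):
    """Toy LMDecoder-style block: speech-unit features retrieve
    language-specific audio features from a learned memory bank via
    cross-attention. Bank size and dimensions are assumptions, and the
    text decoder that would follow is omitted."""
    def __init__(self, dim=256, bank_size=512):
        super().__init__()
        self.memory = nn.Parameter(torch.randn(bank_size, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, unit_feats):                  # (batch, frames, dim)
        bank = self.memory.unsqueeze(0).expand(unit_feats.size(0), -1, -1)
        out, _ = self.attn(unit_feats, bank, bank)  # query the bank
        return out

# Because the bank is indexed by speech units, it can be pre-trained on
# plentiful audio-text pairs and reused when the units come from lips,
# where video-text pairs are scarce.
```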
Distinct Roles of Outer Membrane Porins in Antibiotic Resistance and Membrane Integrity in Escherichia coli
A defining characteristic of Gram-negative bacteria is the presence of an outer membrane, which functions as an additional barrier inhibiting the penetration of toxic chemicals such as antibiotics. Porins are outer membrane proteins associated with the modulation of cellular permeability and antibiotic resistance. Although there are numerous studies of porins, a systematic analysis of the roles of porins in bacterial physiology and antibiotic resistance has been lacking. In this study, we constructed mutants of all porins in Escherichia coli and examined the effect of porins on antibiotic resistance and membrane integrity. The OmpF-defective mutant was resistant to several antibiotics, including β-lactams, suggesting that OmpF functions as the main route of outer membrane penetration for many antibiotics. In contrast, OmpA was strongly associated with the maintenance of membrane integrity, which resulted in the increased susceptibility of the ompA mutant to many antibiotics. Notably, OmpC was involved in both roles. Additionally, our systematic analyses revealed that the other porins were not involved in the maintenance of membrane integrity, although several played a major or minor role in outer membrane penetration for a few antibiotics. Collectively, these results show that each porin plays a distinct role in antibiotic resistance and membrane integrity, which could improve our understanding of the physiological function and clinical importance of porins.
- …