Audio Visual Speaker Localization from EgoCentric Views
The use of audio and visual modalities for speaker localization has been well
studied in the literature by exploiting their complementary characteristics.
However, most previous works employ the setting of static sensors mounted at
fixed positions. Unlike them, in this work, we explore the ego-centric setting,
where the heterogeneous sensors are embodied and could be moving with a human
to facilitate speaker localization. Compared to the static scenario, the
ego-centric setting is more realistic for smart-home applications, e.g., a
service robot. However, this also brings new challenges such as blurred images,
frequent speaker disappearance from the field of view of the wearer, and
occlusions. In this paper, we study egocentric audio-visual speaker DOA
estimation and deal with the challenges mentioned above. Specifically, we
propose a transformer-based audio-visual fusion method to estimate the relative
DOA of the speaker to the wearer, and design a training strategy to mitigate
the problem of the speaker disappearing from the camera's view. We also develop
a new dataset for simulating the out-of-view scenarios, by creating a scene
with a camera wearer walking around while a speaker is moving at the same time.
The experimental results show that our proposed method offers promising
performance on this new dataset in terms of tracking accuracy. Finally, we
adapt the proposed method for the multi-speaker scenario. Experiments on
EasyCom show the effectiveness of the proposed model for multiple speakers in real scenarios; it achieves state-of-the-art results in the sphere active speaker detection task and the wearer activity prediction task. The simulated
dataset and related code are available at
https://github.com/KawhiZhao/Egocentric-Audio-Visual-Speaker-Localization
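The abstract does not spell out the fusion architecture; as a rough illustration of what a transformer-based audio-visual fusion module for DOA estimation can look like, here is a minimal PyTorch sketch. All dimensions, the azimuth-bin classification head, and the module names are illustrative assumptions, not the paper's actual design.

```python
import torch
import torch.nn as nn

class AVFusionDOA(nn.Module):
    """Toy transformer fusion of audio and visual features for DOA estimation.

    Hypothetical layout: audio features and visual features are projected to
    a shared width, concatenated along the time axis, and fed to a transformer
    encoder; a linear head scores discrete azimuth bins.
    """

    def __init__(self, audio_dim=64, visual_dim=128, d_model=256, n_bins=360):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, d_model)
        self.visual_proj = nn.Linear(visual_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, n_bins)  # one logit per azimuth bin

    def forward(self, audio_feats, visual_feats):
        # audio_feats: (B, Ta, audio_dim), visual_feats: (B, Tv, visual_dim)
        tokens = torch.cat([self.audio_proj(audio_feats),
                            self.visual_proj(visual_feats)], dim=1)
        fused = self.encoder(tokens)            # (B, Ta+Tv, d_model)
        return self.head(fused.mean(dim=1))     # (B, n_bins) DOA logits

model = AVFusionDOA()
logits = model(torch.randn(2, 50, 64), torch.randn(2, 10, 128))
print(logits.shape)  # torch.Size([2, 360])
```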
Implicit Factorization with Shared Any Bits
At PKC 2009, May and Ritzenhofen proposed the implicit factorization problem (IFP). They showed that it is easy to factor two h-bit RSA moduli N1 = p1q1, N2 = p2q2, where q1, q2 are both αh-bit, if p1, p2 share uh > 2αh of their least significant bits (LSBs). Subsequent works mainly focused on extending the IFP to the cases where p1, p2 share some of the most significant bits (MSBs) or the middle bits (MBs). In this paper, we propose a novel generalized IFP in which p1 and p2 share an arbitrary number of bit blocks, with each block having a consistent displacement in its position between p1 and p2, and we solve it successfully based on Coppersmith's method. Specifically, we generate a new set of shift polynomials to construct the lattice and optimize the structure of the lattice by introducing a new variable z = p1. We derive that we can factor the two moduli in polynomial time when u > 2(n+1)α(1 − α^(1/(n+1))), where p1, p2 share n blocks. Further, no matter how many blocks are shared, we can theoretically factor the two moduli as long as u > 2α·ln(1/α). In addition, we consider two other cases, where the positions of the shared blocks are arbitrary or there are k > 2 known moduli, and we provide the corresponding solutions for both cases. Our work is verified by experiments.
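To make the two bounds from the abstract concrete, the following sketch (plain Python, no external dependencies) evaluates the per-n bound u > 2(n+1)α(1 − α^(1/(n+1))) for a sample α and shows that it approaches the block-count-independent bound 2α·ln(1/α) from below as the number of shared blocks n grows. The choice α = 0.25 is only an illustrative value, not taken from the paper.

```python
import math

def required_share(alpha, n):
    """Required shared fraction u (of the h bits) when p1, p2 share n blocks,
    per the bound quoted in the abstract."""
    return 2 * (n + 1) * alpha * (1 - alpha ** (1 / (n + 1)))

alpha = 0.25  # q1, q2 are alpha*h bits; illustrative value only
for n in (1, 2, 4, 8, 16):
    print(f"n={n:2d}: u > {required_share(alpha, n):.4f}")
print(f"limit: u > {2 * alpha * math.log(1 / alpha):.4f}  # 2*alpha*ln(1/alpha)")
```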
Clarifying the mechanisms of the light-induced color formation of apple peel under dark conditions through metabolomics and transcriptomic analyses
Many studies have demonstrated that anthocyanin synthesis in apple peel is induced by light, but how the color of bagged apple peel continues to change under dark conditions after light induction has not been characterized. Here, transcriptional and metabolic changes associated with changes in apple peel coloration in the dark after different light induction treatments were studied. Apple pericarp can achieve a normal color under complete darkness after light induction. Metabolomics analysis indicated that the levels of cyanidin-3-O-galactoside and cyanidin-3-O-glucoside were high, which might be associated with the red color development of apple peel. Transcriptome analysis revealed high expression levels of MdUFGTs, MdMYBs, and MdNACs, which might play a key role in light-induced anthocyanin accumulation under dark conditions. Thirteen key genes related to dark coloration after light induction were screened. The results of this study provide new insights into the mechanism of anthocyanin synthesis under dark conditions.
Audio-Visual Speaker Tracking: Progress, Challenges, and Future Directions
Audio-visual speaker tracking has drawn increasing attention over the past
few years due to its academic values and wide application. Audio and visual
modalities can provide complementary information for localization and tracking.
With audio and visual information, Bayesian-based filters can address the problems of data association, audio-visual fusion, and track management. In this
paper, we conduct a comprehensive overview of audio-visual speaker tracking. To
our knowledge, this is the first extensive survey over the past five years. We
introduce the family of Bayesian filters and summarize the methods for
obtaining audio-visual measurements. In addition, the existing trackers and
their performance on the AV16.3 dataset are summarized. In the past few years, deep learning techniques have thrived, which has also boosted the development of audio-visual speaker tracking. The influence of deep learning techniques on measurement extraction and state estimation is also discussed. Finally, we discuss the connections between audio-visual speaker tracking and other areas such as speech separation and distributed speaker tracking.
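The survey covers a family of Bayesian filters rather than a single algorithm; as a generic reminder of the predict/update cycle such audio-visual trackers build on, here is a minimal one-dimensional Kalman-style sketch. The random-walk motion model, noise values, and variable names are illustrative assumptions, not taken from any surveyed tracker.

```python
def kalman_step(x, P, z, q=0.01, r=0.5):
    """One Bayesian predict/update step for a scalar state
    (e.g. a speaker's azimuth) given a measurement z
    (e.g. an audio DOA estimate or a visual detection)."""
    # Predict: random-walk motion model with process noise q.
    x_pred, P_pred = x, P + q
    # Update: fuse the prediction with the measurement (noise r).
    K = P_pred / (P_pred + r)           # Kalman gain
    x_new = x_pred + K * (z - x_pred)   # corrected state
    P_new = (1 - K) * P_pred            # corrected uncertainty
    return x_new, P_new

# Toy run: noisy azimuth measurements around 30 degrees, vague prior.
x, P = 0.0, 100.0
for z in (28.0, 31.5, 29.0, 30.5):
    x, P = kalman_step(x, P, z)
print(round(x, 2), round(P, 3))
```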
AV-TranSpeech: Audio-Visual Robust Speech-to-Speech Translation
Direct speech-to-speech translation (S2ST) aims to convert speech from one
language into another, and has demonstrated significant progress to date.
Despite the recent success, current S2ST models still suffer from distinct
degradation in noisy environments and fail to translate visual speech (i.e.,
the movement of lips and teeth). In this work, we present AV-TranSpeech, the
first audio-visual speech-to-speech (AV-S2ST) translation model without relying
on intermediate text. AV-TranSpeech complements the audio stream with visual
information to promote system robustness and opens up a host of practical
applications such as dictation or dubbing archival films. To mitigate the scarcity of parallel AV-S2ST data, we 1) explore self-supervised
pre-training with unlabeled audio-visual data to learn contextual
representation, and 2) introduce cross-modal distillation with S2ST models
trained on the audio-only corpus to further reduce the requirements of visual
data. Experimental results on two language pairs demonstrate that AV-TranSpeech
outperforms audio-only models under all settings regardless of the type of
noise. With low-resource audio-visual data (10h, 30h), cross-modal distillation
yields an improvement of 7.6 BLEU on average compared with baselines. Audio
samples are available at https://AV-TranSpeech.github.io (accepted to ACL 2023).
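The abstract does not detail the distillation objective; a common way to realize cross-modal distillation from an audio-only S2ST teacher to an audio-visual student is to match their output unit distributions with a KL term, as in the hedged PyTorch sketch below. The temperature, loss weighting, and tensor shapes are assumptions for illustration, not AV-TranSpeech's actual training recipe.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets,
                      temperature=2.0, alpha=0.5):
    """Blend the target cross-entropy with a KL term that pulls the
    audio-visual student's output distribution toward the audio-only teacher's.

    student_logits, teacher_logits: (batch, num_units)
    targets: (batch,) ground-truth unit indices
    """
    ce = F.cross_entropy(student_logits, targets)
    kd = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    return alpha * ce + (1 - alpha) * kd

# Toy shapes: batch of 4, 100 discrete output units.
loss = distillation_loss(torch.randn(4, 100), torch.randn(4, 100),
                         torch.randint(0, 100, (4,)))
print(loss.item())
```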
Mega-TTS 2: Zero-Shot Text-to-Speech with Arbitrary Length Speech Prompts
Zero-shot text-to-speech aims at synthesizing voices with unseen speech
prompts. Previous large-scale multispeaker TTS models have successfully
achieved this goal with an enrolled recording within 10 seconds. However, most
of them are designed to utilize only short speech prompts. The limited
information in short speech prompts significantly hinders the performance of
fine-grained identity imitation. In this paper, we introduce Mega-TTS 2, a
generic zero-shot multispeaker TTS model that is capable of synthesizing speech
for unseen speakers with arbitrary-length prompts. Specifically, we 1) design a
multi-reference timbre encoder to extract timbre information from multiple
reference speeches; and 2) train a prosody language model with arbitrary-length speech prompts. With these designs, our model is suitable for prompts of different lengths, which extends the upper bound of speech quality for zero-shot text-to-speech. Besides arbitrary-length prompts, we introduce arbitrary-source prompts, which leverage the probabilities derived from multiple P-LLM outputs to produce expressive and controlled prosody.
Furthermore, we propose a phoneme-level auto-regressive duration model to
introduce in-context learning capabilities to duration modeling. Experiments
demonstrate that our method can not only synthesize identity-preserving speech with a short prompt from an unseen speaker but also achieve improved performance with longer speech prompts. Audio samples can be found at
https://mega-tts.github.io/mega2_demo/
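Mega-TTS 2's exact architecture is not given in the abstract; as a hedged sketch of the general idea behind a multi-reference timbre encoder, the snippet below encodes each reference utterance separately and attention-pools the resulting vectors into one timbre embedding, so that an arbitrary number of prompts can be used. The GRU encoder, dimensions, and pooling scheme are illustrative assumptions, not the paper's design.

```python
import torch
import torch.nn as nn

class MultiRefTimbreEncoder(nn.Module):
    """Toy multi-reference timbre encoder: one vector per reference
    utterance, attention-pooled into a single timbre embedding."""

    def __init__(self, n_mels=80, d_model=256):
        super().__init__()
        self.ref_encoder = nn.GRU(n_mels, d_model, batch_first=True)
        self.attn_score = nn.Linear(d_model, 1)

    def forward(self, refs):
        # refs: list of (T_i, n_mels) mel-spectrograms, any number and length.
        vecs = []
        for mel in refs:
            _, h = self.ref_encoder(mel.unsqueeze(0))   # h: (1, 1, d_model)
            vecs.append(h[-1, 0])
        vecs = torch.stack(vecs)                         # (R, d_model)
        weights = torch.softmax(self.attn_score(vecs), dim=0)  # (R, 1)
        return (weights * vecs).sum(dim=0)               # (d_model,) timbre vector

enc = MultiRefTimbreEncoder()
timbre = enc([torch.randn(120, 80), torch.randn(300, 80), torch.randn(60, 80)])
print(timbre.shape)  # torch.Size([256])
```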
Lender Trust on the P2P Lending: Analysis Based on Sentiment Analysis of Comment Text
Lender trust is important to ensure the sustainability of P2P lending. This paper uses web crawling to collect more than 240,000 unique pieces of comment text data. Based on the mapping relationship between emotion and trust, we use a lexicon-based method and deep learning to assess lender trust in P2P lending. Further, we use the Latent Dirichlet Allocation (LDA) topic model to mine the topics that lenders are concerned with. The results show that lenders are positive about P2P lending, though this tendency fluctuates downward over time. The security, rate of return, and compliance of P2P lending are the issues of greatest concern to lenders. This study reveals the core subject areas that influence lenders' emotions and trust, and provides a theoretical basis and empirical reference for relevant platforms to improve their operations while enhancing competitiveness. This analytical approach also offers insights for researchers seeking to understand the hidden content behind text data.
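The abstract names the LDA topic model but not an implementation; below is a minimal sketch of the kind of pipeline involved, using scikit-learn on a few placeholder comments. The comment strings, topic count, and vocabulary settings are assumptions, not the paper's data or configuration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

comments = [  # placeholder comments, not the crawled data
    "the platform repaid my loan on time, feeling safe",
    "interest rate is attractive but I worry about compliance",
    "withdrawal was delayed, security of funds is my biggest concern",
    "returns are stable and the platform follows the regulations",
]

# Bag-of-words counts, then a small LDA model over them.
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(comments)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(counts)

# Print the top words per topic (e.g. security vs. returns/compliance themes).
terms = vectorizer.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top = [terms[i] for i in topic.argsort()[-5:][::-1]]
    print(f"topic {k}: {', '.join(top)}")
```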
Partial Arithmetic Consensus Based Distributed Intensity Particle Flow SMC-PHD Filter for Multi-Target Tracking
The Intensity Particle Flow (IPF) SMC-PHD filter has been proposed recently for multi-target tracking. In this paper, we extend the IPF-SMC-PHD filter to the distributed setting and develop a novel consensus method for fusing the estimates from individual sensors, based on Arithmetic Average (AA) fusion. Different from the conventional AA method, which may degrade when unreliable estimates are present, we develop a novel arithmetic consensus method to fuse estimates from each individual IPF-SMC-PHD filter with partial consensus. The proposed method contains a scheme for evaluating the reliability of the sensor nodes and preventing unreliable sensor information from being used in fusion and communication in the sensor network, which helps improve fusion accuracy and reduce sensor communication costs. Numerical simulations are performed to demonstrate the advantages of the proposed algorithm over the uncooperative IPF-SMC-PHD filter and the distributed particle-PHD filter with AA fusion.
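The abstract describes AA fusion with a reliability check but gives no formulas; as a hedged numerical illustration, the sketch below fuses per-sensor PHD intensities evaluated on a common grid with normalized weights, after dropping nodes whose estimates disagree too strongly with the preliminary consensus. The grid, threshold, and weighting rule are illustrative assumptions, not the paper's partial-consensus scheme.

```python
import numpy as np

def aa_fuse(intensities, weights=None, threshold=2.0):
    """Arithmetic-average fusion of local PHD intensities.

    intensities: (S, G) array, one intensity per sensor on a shared grid.
    Sensors far from the preliminary average (in L1 distance) are treated
    as unreliable and excluded before the final weighted average.
    """
    intensities = np.asarray(intensities, dtype=float)
    prelim = intensities.mean(axis=0)
    dists = np.abs(intensities - prelim).sum(axis=1)
    keep = dists <= threshold * np.median(dists)   # crude reliability gate
    if weights is None:
        weights = np.ones(intensities.shape[0])
    w = weights * keep
    w = w / w.sum()
    return (w[:, None] * intensities).sum(axis=0)  # fused intensity on the grid

# Toy example: three agreeing sensors and one faulty node.
grid_intensities = [
    [0.1, 0.8, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.9, 0.0],
    [0.9, 0.0, 0.1],  # outlier node, gated out before fusion
]
print(aa_fuse(grid_intensities))
```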