    Audio Visual Speaker Localization from EgoCentric Views

    The use of audio and visual modalities for speaker localization has been well studied in the literature by exploiting their complementary characteristics. However, most previous works employ the setting of static sensors mounted at fixed positions. Unlike them, in this work, we explore the ego-centric setting, where the heterogeneous sensors are embodied and could be moving with a human to facilitate speaker localization. Compared to the static scenario, the ego-centric setting is more realistic for smart-home applications, e.g., a service robot. However, this also brings new challenges such as blurred images, frequent speaker disappearance from the field of view of the wearer, and occlusions. In this paper, we study egocentric audio-visual speaker DOA estimation and deal with the challenges mentioned above. Specifically, we propose a transformer-based audio-visual fusion method to estimate the relative DOA of the speaker to the wearer, and design a training strategy to mitigate the problem of the speaker disappearing from the camera's view. We also develop a new dataset for simulating the out-of-view scenarios, by creating a scene with a camera wearer walking around while a speaker is moving at the same time. The experimental results show that our proposed method offers promising performance on this new dataset in terms of tracking accuracy. Finally, we adapt the proposed method for the multi-speaker scenario. Experiments on EasyCom show the effectiveness of the proposed model for multiple speakers in real scenarios, which achieves state-of-the-art results in the sphere active speaker detection task and the wearer activity prediction task. The simulated dataset and related code are available at https://github.com/KawhiZhao/Egocentric-Audio-Visual-Speaker-Localization

    Implicit Factorization with Shared Any Bits

    At PKC 2009, May and Ritzenhofen proposed the implicit factorization problem (IFP). They showed that it is undemanding to factor two h-bit RSA moduli N1=p1q1, N2=p2q2 where q1, q2 are both αh-bit, and p1, p2 share uh > 2αh of the least significant bits (LSBs). Subsequent works mainly focused on extending the IFP to the cases where p1, p2 share some of the most significant bits (MSBs) or the middle bits (MBs). In this paper, we propose a novel generalized IFP where p1 and p2 share an arbitrary number of bit blocks, with each block having a consistent displacement in its position between p1 and p2, and we solve it successfully based on Coppersmith’s method. Specifically, we generate a new set of shift polynomials to construct the lattice and optimize the structure of the lattice by introducing a new variable z=p1. We derive that we can factor the two moduli in polynomial time when u > 2(n+1)α(1−α^(1/(n+1))) with p1, p2 sharing n blocks. Further, no matter how many blocks are shared, we can theoretically factor the two moduli as long as u > 2α ln(1/α). In addition, we consider two other cases where the positions of the shared blocks are arbitrary or there are k > 2 known moduli, and we provide the corresponding solutions for both cases. Our work is verified by experiments.
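
The two bounds quoted in this abstract are related: as the number of shared blocks n grows, the n-block bound rises toward the block-independent bound 2α ln(1/α). The following is a minimal numerical sketch of that relationship only, not the authors' lattice attack; the function names `shared_bits_bound` and `limiting_bound` are hypothetical and chosen for illustration.

```python
import math


def shared_bits_bound(alpha: float, n: int) -> float:
    """Threshold on u for n shared blocks: 2(n+1) * alpha * (1 - alpha^(1/(n+1)))."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    if n < 1:
        raise ValueError("n must be a positive integer")
    return 2 * (n + 1) * alpha * (1 - alpha ** (1 / (n + 1)))


def limiting_bound(alpha: float) -> float:
    """Block-independent threshold on u: 2 * alpha * ln(1/alpha)."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    return 2 * alpha * math.log(1 / alpha)


if __name__ == "__main__":
    alpha = 0.1  # illustrative value: q1, q2 take 10% of the modulus bit length
    for n in (1, 2, 5, 50):
        print(f"n={n:>2}: u > {shared_bits_bound(alpha, n):.4f}")
    print(f"limit: u > {limiting_bound(alpha):.4f}")
```

Here u is the shared fraction of the modulus bit length and α the relative size of q1, q2, as in the abstract. Because the n-block threshold stays below the limiting value for every n, sharing a fraction u > 2α ln(1/α) is sufficient regardless of how many blocks the shared bits are split into.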

    Clarifying the mechanisms of the light-induced color formation of apple peel under dark conditions through metabolomics and transcriptomic analyses

    Many studies have demonstrated that anthocyanin synthesis in apple peel is induced by light, but the continued color change of bagged apple peel under dark conditions after light induction has not been characterized. Here, transcriptional and metabolic changes associated with changes in apple peel coloration in the dark after different light induction treatments were studied. Apple pericarp can achieve a normal color in complete darkness after light induction. Metabolomics analysis indicated that the levels of cyanidin-3-O-galactoside and cyanidin-3-O-glucoside were high, which might be associated with the red color development of apple peel. Transcriptome analysis revealed high expression levels of MdUFGTs, MdMYBs, and MdNACs, which might play a key role in light-induced anthocyanin accumulation under dark conditions. Thirteen key genes related to dark coloring after light induction were screened. The results of this study provide new insights into the mechanism of anthocyanin synthesis under dark conditions.

    Audio-Visual Speaker Tracking: Progress, Challenges, and Future Directions

    Audio-visual speaker tracking has drawn increasing attention over the past few years due to its academic value and wide range of applications. Audio and visual modalities can provide complementary information for localization and tracking. With audio and visual information, Bayesian filters can solve the problems of data association, audio-visual fusion, and track management. In this paper, we conduct a comprehensive overview of audio-visual speaker tracking. To our knowledge, this is the first extensive survey over the past five years. We introduce the family of Bayesian filters and summarize the methods for obtaining audio-visual measurements. In addition, the existing trackers and their performance on the AV16.3 dataset are summarized. In the past few years, deep learning techniques have thrived, which has also boosted the development of audio-visual speaker tracking. The influence of deep learning techniques in terms of measurement extraction and state estimation is also discussed. Finally, we discuss the connections between audio-visual speaker tracking and other areas such as speech separation and distributed speaker tracking.

    AV-TranSpeech: Audio-Visual Robust Speech-to-Speech Translation

    Direct speech-to-speech translation (S2ST) aims to convert speech from one language into another, and has demonstrated significant progress to date. Despite the recent success, current S2ST models still suffer from distinct degradation in noisy environments and fail to translate visual speech (i.e., the movement of lips and teeth). In this work, we present AV-TranSpeech, the first audio-visual speech-to-speech (AV-S2ST) translation model without relying on intermediate text. AV-TranSpeech complements the audio stream with visual information to promote system robustness and opens up a host of practical applications: dictation or dubbing archival films. To mitigate the data scarcity with limited parallel AV-S2ST data, we 1) explore self-supervised pre-training with unlabeled audio-visual data to learn contextual representation, and 2) introduce cross-modal distillation with S2ST models trained on the audio-only corpus to further reduce the requirements of visual data. Experimental results on two language pairs demonstrate that AV-TranSpeech outperforms audio-only models under all settings regardless of the type of noise. With low-resource audio-visual data (10h, 30h), cross-modal distillation yields an improvement of 7.6 BLEU on average compared with baselines. Audio samples are available at https://AV-TranSpeech.github.io
    Comment: Accepted to ACL 202

    Mega-TTS 2: Zero-Shot Text-to-Speech with Arbitrary Length Speech Prompts

    Zero-shot text-to-speech aims at synthesizing voices with unseen speech prompts. Previous large-scale multispeaker TTS models have successfully achieved this goal with an enrolled recording within 10 seconds. However, most of them are designed to utilize only short speech prompts. The limited information in short speech prompts significantly hinders the performance of fine-grained identity imitation. In this paper, we introduce Mega-TTS 2, a generic zero-shot multispeaker TTS model that is capable of synthesizing speech for unseen speakers with arbitrary-length prompts. Specifically, we 1) design a multi-reference timbre encoder to extract timbre information from multiple reference speeches, and 2) train a prosody language model with arbitrary-length speech prompts. With these designs, our model is suitable for prompts of different lengths, which extends the upper bound of speech quality for zero-shot text-to-speech. Besides arbitrary-length prompts, we introduce arbitrary-source prompts, which leverage the probabilities derived from multiple P-LLM outputs to produce expressive and controlled prosody. Furthermore, we propose a phoneme-level auto-regressive duration model to introduce in-context learning capabilities to duration modeling. Experiments demonstrate that our method could not only synthesize identity-preserving speech with a short prompt of an unseen speaker but also achieve improved performance with longer speech prompts. Audio samples can be found at https://mega-tts.github.io/mega2_demo/

    Lender Trust on the P2P Lending: Analysis Based on Sentiment Analysis of Comment Text

    Lender trust is important to ensure the sustainability of P2P lending. This paper uses web crawling to collect more than 240,000 unique pieces of comment text data. Based on the mapping relationship between emotion and trust, we use the lexicon-based method and deep learning to check the trust of a given lender in P2P lending. Further, we use the Latent Dirichlet Allocation (LDA) topic model to mine topics concerned with this research. The results show that lenders are positive about P2P lending, though this tendency fluctuates downward with time. The security, rate of return, and compliance of P2P lending are the issues of greatest concern to lenders. This study reveals the core subject areas that influence a lender’s emotions and trust and provides a theoretical basis and empirical reference for relevant platforms to improve their operational level while enhancing competitiveness. This analytical approach offers insights for researchers to understand the hidden content behind the text data.

    PARTIAL ARITHMETIC CONSENSUS BASED DISTRIBUTED INTENSITY PARTICLE FLOW SMC-PHD FILTER FOR MULTI-TARGET TRACKING

    Intensity Particle Flow (IPF) SMC-PHD has been proposed recently for multi-target tracking. In this paper, we extend the IPF-SMC-PHD filter to the distributed setting, and develop a novel consensus method for fusing the estimates from individual sensors, based on Arithmetic Average (AA) fusion. Unlike the conventional AA method, which may be degraded when unreliable estimates are present, we develop a novel arithmetic consensus method to fuse estimates from each individual IPF-SMC-PHD filter with partial consensus. The proposed method contains a scheme for evaluating the reliability of the sensor nodes and preventing unreliable sensor information from being used in fusion and communication in the sensor network, which helps improve fusion accuracy and reduce sensor communication costs. Numerical simulations are performed to demonstrate the advantages of the proposed algorithm over the uncooperative IPF-SMC-PHD and distributed particle-PHD with AA fusion.
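
The idea of arithmetic average fusion restricted to reliable sensors can be illustrated with a toy sketch. This is not the paper's algorithm: the abstract does not specify how reliability is evaluated, so the per-sensor reliability scores, the fixed `threshold`, and the function name `partial_arithmetic_average` are all assumptions made for illustration. Each sensor's estimate is represented as an intensity sampled on a shared grid.

```python
from typing import List, Sequence


def partial_arithmetic_average(
    intensities: Sequence[Sequence[float]],
    reliabilities: Sequence[float],
    threshold: float = 0.5,  # hypothetical cut-off, not taken from the paper
) -> List[float]:
    """Average per-sensor intensity grids, using only sensors deemed reliable."""
    if len(intensities) != len(reliabilities):
        raise ValueError("one reliability score is required per sensor")

    # Keep only the sensors whose (assumed) reliability score passes the threshold.
    kept = [grid for grid, r in zip(intensities, reliabilities) if r >= threshold]
    if not kept:
        raise ValueError("no sensor passed the reliability threshold")

    size = len(kept[0])
    if any(len(grid) != size for grid in kept):
        raise ValueError("all intensity grids must share the same size")

    # Plain arithmetic average, cell by cell, over the retained sensors.
    return [sum(grid[i] for grid in kept) / len(kept) for i in range(size)]


if __name__ == "__main__":
    grids = [[1.0, 0.0], [3.0, 2.0], [100.0, 100.0]]  # third sensor is an outlier
    scores = [0.9, 0.8, 0.1]
    print(partial_arithmetic_average(grids, scores))
```

Excluding low-reliability sensors before averaging keeps a single bad estimate from skewing the fused intensity, and in a networked setting it also means those estimates need not be transmitted, which is the communication saving the abstract refers to.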