
    Speaker segmentation and clustering

    This survey focuses on two challenging speech processing topics: speaker segmentation and speaker clustering. Speaker segmentation aims at finding speaker change points in an audio stream, whereas speaker clustering aims at grouping speech segments by speaker characteristics. Model-based, metric-based, and hybrid speaker segmentation algorithms are reviewed. For speaker clustering, deterministic and probabilistic algorithms are examined. A comparative assessment of the reviewed algorithms is undertaken, their advantages and disadvantages are indicated, insight into the algorithms is offered, and deductions as well as recommendations are given. Rich transcription and movie analysis are candidate applications that benefit from combined speaker segmentation and clustering. © 2007 Elsevier B.V. All rights reserved.
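Metric-based segmentation of the kind this survey reviews is commonly illustrated with the Bayesian Information Criterion (BIC) distance between adjacent analysis windows. The sketch below is a minimal illustration of that family of methods, not an implementation from the survey; the penalty weight and the choice of full-covariance Gaussians are assumptions.

```python
import numpy as np

def delta_bic(X, Y, penalty_weight=1.0):
    """Metric-based change detection via the Bayesian Information
    Criterion: compare modelling two feature segments X and Y
    (frames x dims, e.g. MFCCs) with one Gaussian versus two.
    A positive value suggests a speaker change between them."""
    Z = np.vstack([X, Y])
    n_x, n_y, n_z = len(X), len(Y), len(Z)
    d = Z.shape[1]
    # log-determinant of the full sample covariance of a segment
    ld = lambda M: np.linalg.slogdet(np.cov(M, rowvar=False))[1]
    # complexity penalty for the extra mean vector and covariance matrix
    penalty = 0.5 * penalty_weight * (d + 0.5 * d * (d + 1)) * np.log(n_z)
    return 0.5 * (n_z * ld(Z) - n_x * ld(X) - n_y * ld(Y)) - penalty
```

In practice a sliding window evaluates this statistic at each candidate point and marks local maxima above zero as speaker changes.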

    Public service, the information society, and the Internet (Service public, société de l'information et Internet)


    Investigating the Effects of Training Set Synthesis for Audio Segmentation of Radio Broadcast

    Special Issue "Machine Learning Applied to Music/Audio Signal Processing". Music and speech detection provides valuable information about the nature of content in broadcast audio: it identifies acoustic regions that contain speech, voice over music, music only, or silence. In recent years, machine learning algorithms have been developed for this task. However, broadcast audio is generally well mixed and copyrighted, which makes it difficult to share across research groups. In this study, we address the challenges of automatically synthesising data that resembles a radio broadcast. Firstly, we compare state-of-the-art neural network architectures such as CNN, GRU, LSTM, TCN, and CRNN. Secondly, we investigate how audio ducking of background music affects the precision and recall of the machine learning algorithm. Thirdly, we examine how the quantity of synthetic training data affects the results. Finally, we evaluate the effectiveness of synthesised, real-world, and combined training sets, to understand whether the synthetic data adds any value. Among the network architectures, CRNN performed best. The results also show that the minimum level of audio ducking preferred by the machine learning algorithm was similar to that preferred by human listeners. After testing our model on in-house and public datasets, we observe that the proposed synthesis technique outperforms real-world data in some cases and serves as a promising alternative.
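Audio ducking, as investigated here, attenuates background music before it is summed with speech. The sketch below shows one minimal way such a synthetic training mix might be produced; the function and parameter names are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def mix_with_ducking(speech, music, duck_db=-12.0):
    """Synthesise a radio-style training example: attenuate ('duck')
    the background music by `duck_db` decibels, then sum it with the
    speech signal. `duck_db` is a hypothetical parameter name."""
    gain = 10.0 ** (duck_db / 20.0)           # dB -> linear amplitude
    n = min(len(speech), len(music))
    mix = speech[:n] + gain * music[:n]
    peak = np.max(np.abs(mix))
    return mix / peak if peak > 1.0 else mix  # avoid clipping
```

Sweeping `duck_db` over a training corpus is one way to study how the ducking level affects a detector's precision and recall.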

    Investigating speaker roles and their interactions for the structuring of audiovisual documents

    We present a system for the automatic structuring of audiovisual documents, based on speaker role recognition and the detection of speech interaction zones. The first stage of our system detects and characterises speech interaction zones: temporal sequences of a document that are likely to contain conversations between speakers. The second stage recognises speaker roles: anchor, journalist, and other. Our contribution to this domain rests on the hypothesis that cues to speaker roles are available through low-level features extracted from the temporal organisation of turn-taking, from the acoustic environments in which speakers appear, and from prosodic features (speech rate and pitch). In the final stage, we combine speaker roles with speech interaction zones to produce two descriptive layers of the document content. The first layer segments recordings into four types of zone: news, interviews, transitions, and interludes. The second layer classifies speech interaction zones into four categories: debate, interview, chronicle, and relay. Each stage of the system was validated through a large number of experiments on the EPAC project corpus and the ESTER evaluation campaign corpus.
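The low-level turn-taking cues described above can be sketched as simple descriptors computed from a diarised timeline. The feature names below are illustrative, not the system's exact feature set.

```python
from collections import defaultdict

def turn_taking_features(turns):
    """Per-speaker turn-taking descriptors from a diarised timeline
    given as [(speaker, start_s, end_s), ...]: number of turns,
    mean turn duration, and share of total speaking time."""
    stats = defaultdict(lambda: {"n_turns": 0, "total_s": 0.0})
    for spk, start, end in turns:
        stats[spk]["n_turns"] += 1
        stats[spk]["total_s"] += end - start
    total = sum(s["total_s"] for s in stats.values()) or 1.0
    return {spk: {"n_turns": s["n_turns"],
                  "mean_turn_s": s["total_s"] / s["n_turns"],
                  "speaking_share": s["total_s"] / total}
            for spk, s in stats.items()}
```

Descriptors of this kind, together with prosodic measurements, could then feed any standard classifier to separate anchors from journalists and other speakers.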

    SPARC 2016 Salford postgraduate annual research conference book of abstracts


    Perspectives for promoting the RTS sound and audiovisual archive collections to young audiences on social networks

    This Bachelor's thesis proposes solutions for promoting the audiovisual and sound archive collections of the RTS on social networks. The objective is to reach a target audience aged 18 to 25, that is, members of generations Y and Z. The Data and Archives (D+A) service of the RTS has been promoting its archive collections for several years, but still struggles to reach this audience, even via social media and networks. The project is organised in three main parts. The first presents a review of the archive promotion practices of the D+A service and a comparison with the promotion methods of other institutions with similar missions. The second part examines how the target audience, young people aged 18 to 25, use social media and networks. Finally, in the last part, drawing on the information gathered earlier, we turn to the promotion of audiovisual and sound archives and propose relevant promotion solutions for each type of archive. Our research determined that, to promote archives optimally, whether audiovisual or sound, it is necessary to choose the right content to publish and to know the target audience. While some archives manage to reach several generations, others are clearly aimed at a particular age bracket. For the 18-25 group, we identified a number of themes likely to appeal to them, as well as a time period, the 1990s to the 2000s, during which the archives to be promoted should have been produced; this period corresponds to our target audience's childhood.
    Finally, we focused more specifically on sound archives on the one hand and audiovisual archives on the other, in order to propose promotion solutions suited to each format. We determined that sound archives need to be illustrated so that they fit better into the dynamics of social networks, while audiovisual archives should be presented as short clips combining one or more archive excerpts dating from after 1990.

    Deep Learning for Audio Segmentation and Intelligent Remixing

    Audio segmentation divides an audio signal into homogeneous sections such as music and speech. It is useful as a preprocessing step to index, store, and modify audio recordings, radio broadcasts, and TV programmes. Machine learning models for audio segmentation are generally trained on copyrighted material, which cannot be shared across research groups. Furthermore, annotating these datasets is a time-consuming and expensive task. In this thesis, we present a novel approach that artificially synthesises data resembling radio signals. We replicate the workflow of a radio DJ in mixing audio and investigate parameters such as fade curves and audio ducking. Using this approach, we obtained state-of-the-art performance for music-speech detection on in-house and public datasets. After demonstrating the efficacy of training set synthesis, we investigate how audio ducking of background music affects the precision and recall of the machine learning algorithm. Interestingly, the minimum level of audio ducking preferred by the machine learning algorithm was similar to that preferred by human listeners. Furthermore, we observe that our proposed synthesis technique outperforms real-world data in some cases and serves as a promising alternative. This project also proposes a novel deep learning system called You Only Hear Once (YOHO), inspired by the YOLO algorithm widely adopted in computer vision. We convert the detection of acoustic boundaries into a regression problem instead of frame-based classification. The relative improvement in F-measure of YOHO over the state-of-the-art Convolutional Recurrent Neural Network ranged from 1% to 6% across multiple datasets. As YOHO predicts acoustic boundaries directly, inference and post-processing are six times faster than frame-based classification. Furthermore, we investigate domain generalisation methods such as transfer learning and adversarial training, and demonstrate that these methods helped our algorithm perform better in unseen domains. In addition to audio segmentation, another objective of this project is to explore real-time radio remixing. This is a step towards building a customised radio experience that integrates with the listener's schedule. The system would remix music from the user's personal playlist and play snippets of diary reminders at appropriate transition points. The intelligent remixing is governed by the underlying audio segmentation and other deep learning methods. We also explore how individuals can communicate with intelligent mixing systems through non-technical language, and demonstrate that word embeddings help in understanding representations of semantic descriptors.
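YOHO's reformulation of boundary detection as regression can be sketched as a target-encoding step: each time cell predicts, per class, a presence flag plus normalised start and end offsets. The grid size and encoding below are illustrative assumptions loosely following the YOLO-style formulation, not the paper's exact specification.

```python
def events_to_targets(events, duration, n_cells=9, classes=("speech", "music")):
    """Encode labelled events [(class, start_s, end_s), ...] as a
    YOLO-style regression grid: for each time cell and class, store
    [presence, start, end] with times normalised within the cell."""
    cell = duration / n_cells
    grid = [[[0.0, 0.0, 0.0] for _ in classes] for _ in range(n_cells)]
    for label, start, end in events:
        c = classes.index(label)
        first = int(start // cell)
        last = min(int(end // cell), n_cells - 1)
        for i in range(first, last + 1):
            s = max(start, i * cell) - i * cell        # offset within cell
            e = min(end, (i + 1) * cell) - i * cell
            grid[i][c] = [1.0, s / cell, e / cell]     # normalised to [0, 1]
    return grid
```

Because the network regresses boundaries directly in this representation, decoding a prediction is a single pass over the cells rather than the smoothing and merging that frame-wise classification requires, which is consistent with the speedup reported above.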

    CHORUS Deliverable 2.1: State of the Art on Multimedia Search Engines

    Based on the information provided by European projects and national initiatives related to multimedia search, as well as by domain experts who participated in the CHORUS Think-Tanks and workshops, this document reports on the state of the art in multimedia content search from a technical and socio-economic perspective. The technical perspective includes an up-to-date view of content-based indexing and retrieval technologies, multimedia search in the context of mobile devices and peer-to-peer networks, and an overview of current evaluation and benchmark initiatives that measure the performance of multimedia search engines. From a socio-economic perspective, we inventory the impact and legal consequences of these technical advances and point out future directions of research.
