
    Recent advances in LVCSR: A benchmark comparison of performances

    Large Vocabulary Continuous Speech Recognition (LVCSR), which is characterized by high variability in the speech signal, is the most challenging task in automatic speech recognition (ASR). Believing that the evaluation of ASR systems on relevant and common speech corpora is one of the key factors that helps accelerate research, we present, in this paper, a benchmark comparison of the performance of current state-of-the-art LVCSR systems across different speech recognition tasks. Furthermore, we objectively identify the best-performing technologies and the best accuracy achieved so far in each task. The benchmarks show that Deep Neural Networks and Convolutional Neural Networks have proven effective on several LVCSR tasks, outperforming the traditional Hidden Markov Models and Gaussian Mixture Models. They also show that, despite satisfying performance on some LVCSR tasks, the problem of large-vocabulary speech recognition is far from solved in others, where more research effort is still needed.
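    For context, LVCSR benchmarks of this kind are scored by word error rate (WER): the word-level edit distance between reference and hypothesis transcripts, normalized by reference length. A minimal sketch of the standard computation (the function is ours, for illustration; the paper does not provide code):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length,
    computed via Levenshtein distance over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between first i reference words and first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# Example: one substitution in a four-word reference -> WER = 0.25
print(word_error_rate("the cat sat down", "the cat sat dawn"))
```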

    No-audio speaking status detection in crowded settings via visual pose-based filtering and wearable acceleration

    Recognizing who is speaking in a crowded scene is a key challenge toward understanding the social interactions taking place within it. Detecting speaking status from body movement alone opens the door to the analysis of social scenes in which personal audio is not obtainable. Video and wearable sensors make it possible to recognize speaking in an unobtrusive, privacy-preserving way. When considering the video modality in action recognition problems, a bounding box is traditionally used to localize and segment out the target subject, and the action taking place within it is then recognized. However, cross-contamination, occlusion, and the articulated nature of the human body make this approach challenging in a crowded scene. Here, we leverage articulated body poses both for subject localization and in the subsequent speech detection stage. We show that selecting local features around pose keypoints has a positive effect on generalization performance while also significantly reducing the number of local features considered, making for a more efficient method. Using two in-the-wild datasets with different viewpoints of the subjects, we investigate the role of cross-contamination in this effect. We additionally make use of acceleration measured through wearable sensors for the same task, and present a multimodal approach combining both methods.
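    To make the keypoint-based feature selection concrete, here is a hedged sketch of cropping local patches around detected pose keypoints instead of using a whole bounding box; the patch size and keypoint format are our assumptions, not details taken from the paper:

```python
import numpy as np

def crop_keypoint_patches(frame: np.ndarray, keypoints: np.ndarray,
                          patch_size: int = 16) -> np.ndarray:
    """Extract square patches centred on each pose keypoint.

    frame:     H x W x C image (or feature map).
    keypoints: K x 2 array of (x, y) pixel coordinates.
    Returns a K x patch_size x patch_size x C array; windows near the
    border are clamped so they stay inside the frame.
    """
    h, w = frame.shape[:2]
    half = patch_size // 2
    patches = []
    for x, y in keypoints.astype(int):
        x0 = min(max(x - half, 0), w - patch_size)
        y0 = min(max(y - half, 0), h - patch_size)
        patches.append(frame[y0:y0 + patch_size, x0:x0 + patch_size])
    return np.stack(patches)

# Toy usage: 3 keypoints on a random 128x128 RGB frame.
frame = np.random.rand(128, 128, 3)
kps = np.array([[20, 30], [64, 64], [120, 5]])
print(crop_keypoint_patches(frame, kps).shape)  # (3, 16, 16, 3)
```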

    Recognition of Dialogue Acts in Multiparty Meetings using a Switching DBN

    This paper is concerned with the automatic recognition of dialogue acts (DAs) in multiparty conversational speech. We present a joint generative model for DA recognition in which segmentation and classification of DAs are carried out in parallel. Our approach to DA recognition is based on a switching dynamic Bayesian network (DBN) architecture. This generative approach models a set of features related to lexical content and prosody, and incorporates a weighted interpolated factored language model. The switching DBN coordinates the recognition process by integrating the component models. The factored language model, which is estimated from multiple conversational data corpora, is used in conjunction with additional task-specific language models. In conjunction with this joint generative model, we have also investigated the use of a discriminative approach, based on conditional random fields, to perform a reclassification of the segmented DAs. We have carried out experiments on the AMI corpus of multimodal meeting recordings, using both manually transcribed speech and the output of an automatic speech recognizer, and different configurations of the generative model. Our results indicate that the system performs well on both reference and fully automatic transcriptions. A further significant improvement in recognition accuracy is obtained by applying the discriminative reranking approach based on conditional random fields.
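    The weighted interpolation mentioned above combines probability estimates from several corpora into one language model. A minimal sketch of linear interpolation over toy bigram models (the component models and weights here are illustrative assumptions, not the paper's actual factored configuration):

```python
from collections import Counter

class CountLM:
    """Toy maximum-likelihood bigram model built from a token list."""
    def __init__(self, tokens):
        self.bigrams = Counter(zip(tokens, tokens[1:]))
        self.unigrams = Counter(tokens)

    def prob(self, word, prev):
        total = self.unigrams[prev]
        return self.bigrams[(prev, word)] / total if total else 0.0

def interpolated_prob(word, prev, models, weights):
    # P(w | h) = sum_k lambda_k * P_k(w | h), with sum_k lambda_k = 1
    return sum(lam * m.prob(word, prev) for lam, m in zip(weights, models))

# Two tiny "corpora": a task-specific one and a general one.
task_lm = CountLM("yes we agree yes we do".split())
general_lm = CountLM("we do not know if we agree".split())
print(interpolated_prob("agree", "we", [task_lm, general_lm], [0.7, 0.3]))
```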

    Acoustic-based Smart Tactile Sensing in Social Robots

    International Doctorate mention (Mención Internacional en el título de doctor). The sense of touch is a crucial component of human social interaction and is unique among the five senses. As the only proximal sense, touch requires close or direct physical contact to register information. This makes touch an interaction modality full of possibilities for social communication. Through touch, we are able to ascertain another person's intention and communicate emotions. From this idea emerges the concept of social touch: the act of touching another person in a social context. It can serve various purposes, such as greeting, showing affection, persuasion, and regulating emotional and physical well-being. Recently, the number of people interacting with artificial systems and agents has increased, mainly due to the rise of technological devices such as smartphones and smart speakers. Still, these devices are limited in their interaction capabilities. To deal with this issue, recent developments in social robotics have improved interaction possibilities so that agents are more seamless and useful. In this sense, social robots are designed to facilitate natural interactions between humans and artificial agents. In this context, the sense of touch is revealed as a natural interaction vehicle that can improve Human-Robot Interaction (HRI) due to its communicative relevance in social settings. Moreover, for a social robot, the relationship between social touch and its embodiment is direct, since it has a physical body with which to apply or receive touches. From a technical standpoint, tactile sensing systems have recently been the subject of further research, mostly devoted to comprehending this sense in order to create intelligent systems that can improve people's lives. Currently, social robots are popular devices that include technologies for touch sensing. This is motivated by the fact that robots may encounter expected or unexpected physical contact with humans, which can either enhance or interfere with the execution of their behaviours. There is, therefore, a need to detect human touch in robot applications. Some methods even include touch-gesture recognition, although they often require significant hardware deployments involving multiple sensors. Additionally, the dependability of these sensing technologies is constrained, because the majority of them still struggle with issues such as false positives or poor recognition rates. Acoustic sensing, in this sense, can provide a set of features capable of alleviating the aforementioned shortcomings. Even though it is a technology that has been utilised in various research fields, it has yet to be integrated into human-robot touch interaction. Therefore, in this work, we propose the Acoustic Touch Recognition (ATR) system, a smart tactile sensing system based on acoustic sensing and designed to improve human-robot social interaction. Our system is developed to classify touch gestures and locate their source. It has also been integrated into real social robotic platforms and tested successfully in real-world applications.
Our proposal is approached from two standpoints, one technical and the other related to social touch. On the one hand, the technical motivation of this work is centred on achieving a cost-efficient, modular and portable tactile system; to that end, we explore the fields of touch sensing technologies, smart tactile sensing systems and their application in HRI. On the other hand, part of the research is centred on the affective impact of social touch during human-robot interaction, resulting in two studies exploring this idea. Programa de Doctorado en Ingeniería Eléctrica, Electrónica y Automática, Universidad Carlos III de Madrid. Committee: President: Pedro Manuel Urbano de Almeida Lima; Secretary: María Dolores Blanco Rojas; Member: Antonio Fernández Caballero.
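    To illustrate what an acoustic touch-gesture classifier of this kind might look like, here is a hedged sketch of one plausible pipeline (MFCC summary features plus an off-the-shelf SVM); the feature set, classifier, and sample rate are our assumptions, not necessarily the ATR system's actual design:

```python
import numpy as np
import librosa
from sklearn.svm import SVC

def acoustic_features(signal: np.ndarray, sr: int) -> np.ndarray:
    """Summarise a touch event's audio as the mean and std of its MFCCs."""
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

def train_gesture_classifier(snippets, labels, sr=16000):
    """Fit a classifier on labelled audio snippets of touch events
    (e.g. labels such as "tap", "stroke", "slap"); data loading is
    outside the scope of this sketch."""
    X = np.stack([acoustic_features(s, sr) for s in snippets])
    return SVC(kernel="rbf").fit(X, labels)
```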

    Robust speaker diarization for meetings

    This thesis presents research on speaker diarization for meeting rooms. It covers the algorithms and the implementation of an offline speaker segmentation and clustering system for meeting recordings where more than one microphone is usually available. The main research and system implementation was done during a two-year stay at the International Computer Science Institute (ICSI, Berkeley, California). Speaker diarization is a well-studied topic in the broadcast news domain. Most of the proposed systems involve some sort of hierarchical clustering of the data into clusters, where the optimum number of speakers and their identities are unknown a priori. A very commonly used method is called bottom-up clustering, in which multiple initial clusters are iteratively merged until the optimum number of clusters is reached, according to some stopping criterion. Such systems are based on a single input channel, which does not allow direct application to the meetings domain. Although some efforts had been made to adapt such systems to multichannel data, at the start of this thesis no effective implementation had been proposed. Furthermore, many of these speaker diarization algorithms involve some sort of model training or parameter tuning using external data, which impedes their use with data different from what they were adapted to. The implementation proposed in this thesis works towards solving the aforementioned problems. Taking the existing hierarchical bottom-up mono-channel speaker diarization system from ICSI as a starting point, it first uses flexible acoustic beamforming to extract speaker location information and obtain a single enhanced signal from all available microphones. It then applies train-free speech/non-speech detection to this signal and processes the resulting speech segments with an improved version of the mono-channel speaker diarization system. That system has been modified to use speaker location information (when available), and several algorithms have been adapted or newly created to tune the system's behaviour to each particular recording by obtaining information directly from the acoustics, making it less dependent on development data. The resulting system is flexible with respect to the meeting-room layout, regardless of the number of microphones and their placement. It is train-free, making it easy to adapt to different sorts of data and domains of application. Finally, it takes a step forward in the use of parameters that are more robust to changes in the acoustic data. Two versions of the system were submitted, with excellent results, to the RT05s and RT06s NIST Rich Transcription evaluations for meetings, where data from two different subdomains (lectures and conferences) was evaluated. In addition, experiments using the RT datasets from all meetings evaluations were used to test the different proposed algorithms, proving their suitability to the task.
Postprint (published version)
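    For readers unfamiliar with bottom-up clustering, the following is a simplified sketch of the core merge loop with a BIC-style stopping criterion, using single full-covariance Gaussians per cluster as a stand-in for the HMM/GMM models the actual system employs:

```python
import numpy as np

def gauss_loglik(X):
    """Log-likelihood of frames X under a single full-covariance Gaussian."""
    n, d = X.shape
    cov = np.cov(X, rowvar=False) + 1e-6 * np.eye(d)
    diff = X - X.mean(axis=0)
    _, logdet = np.linalg.slogdet(cov)
    mahal = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(cov), diff)
    return -0.5 * (n * d * np.log(2 * np.pi) + n * logdet + mahal.sum())

def delta_bic(X, Y, lam=1.0):
    """BIC gain for merging clusters X and Y; positive favours merging."""
    d = X.shape[1]
    n_params = d + 0.5 * d * (d + 1)          # mean + covariance parameters
    penalty = 0.5 * lam * n_params * np.log(len(X) + len(Y))
    return (gauss_loglik(np.vstack([X, Y]))
            - gauss_loglik(X) - gauss_loglik(Y) + penalty)

def bottom_up_cluster(clusters):
    """Iteratively merge the best pair until no merge improves the BIC."""
    clusters = list(clusters)
    while len(clusters) > 1:
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        scores = [delta_bic(clusters[i], clusters[j]) for i, j in pairs]
        best = int(np.argmax(scores))
        if scores[best] <= 0:                 # stopping criterion reached
            break
        i, j = pairs[best]
        merged = np.vstack([clusters[i], clusters[j]])
        clusters = [c for k, c in enumerate(clusters)
                    if k not in (i, j)] + [merged]
    return clusters

# Toy usage: three initial clusters of 2-D feature frames; expect the
# two clusters drawn from the same distribution to merge, leaving 2.
rng = np.random.default_rng(0)
parts = [rng.normal(0, 1, (50, 2)), rng.normal(0, 1, (50, 2)),
         rng.normal(5, 1, (50, 2))]
print(len(bottom_up_cluster(parts)))
```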