3,842 research outputs found

    Towards responsive Sensitive Artificial Listeners

    This paper describes work in the recently started SEMAINE project, which aims to build a set of Sensitive Artificial Listeners: conversational agents designed to sustain an interaction with a human user despite limited verbal skills, through robust real-time recognition and generation of non-verbal behaviour, both while the agent is speaking and while it is listening. We report on data collection and on the design of a system architecture geared towards real-time responsiveness.

    Object Tracking from Audio and Video data using Linear Prediction method

    Microphone arrays and camera-based video surveillance are widely used for detecting and tracking a moving speaker. In this project, object tracking was performed using multimodal fusion, i.e., audio-visual perception. Source localisation can be done with GCC-PHAT or GCC-ML for time delay estimation. These methods are based on the spectral content of the speech signals, which can be affected by noise and reverberation. Video tracking can be done using a Kalman filter or a particle filter. Therefore, the linear prediction method is used for both audio and video tracking. Linear prediction in source localisation uses features related to the excitation source information of speech, which are less affected by noise. Using this excitation source information, time delays are estimated and the results are compared with the GCC-PHAT method. The dataset obtained from [20] is used for video tracking of a single moving object captured by a stationary camera. For object detection, projection histograms are computed, followed by linear prediction for tracking, and the corresponding results are compared with the Kalman filter method.
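
    As a rough illustration of the GCC-PHAT baseline this abstract compares against, here is a minimal NumPy sketch of time delay estimation between two microphone channels; the function name, parameters and numerical tolerance are illustrative choices, not taken from the paper.

```python
import numpy as np

def gcc_phat(sig, ref, fs, max_tau=None):
    """Estimate the time delay of `sig` relative to `ref` via GCC-PHAT.
    tau > 0 means `sig` lags `ref`. Returns the delay in seconds."""
    n = len(sig) + len(ref)                  # zero-pad to avoid circular wrap
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    R = SIG * np.conj(REF)                   # cross-power spectrum
    R /= np.abs(R) + 1e-12                   # PHAT weighting: keep phase only
    cc = np.fft.irfft(R, n=n)                # generalized cross-correlation
    max_shift = n // 2
    if max_tau is not None:                  # optionally limit the search range
        max_shift = min(int(fs * max_tau), max_shift)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (np.argmax(np.abs(cc)) - max_shift) / fs
```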

    Visual units and confusion modelling for automatic lip-reading

    Automatic lip-reading (ALR) is a challenging task because the visual speech signal is known to be missing some important information, such as voicing. We propose an approach to ALR that acknowledges that this information is missing but assumes that it is substituted or deleted in a systematic way that can be modelled. We describe a system that learns such a model and then incorporates it into decoding, which is realised as a cascade of weighted finite-state transducers. Our results show a small but statistically significant improvement in recognition accuracy. We also investigate the issue of suitable visual units for ALR, and show that visemes are sub-optimal, not because they introduce lexical ambiguity, but because the reduction in modelling units entailed by their use reduces accuracy.
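
    To make the idea of confusion modelling concrete, here is a toy Python sketch that rescores lexicon entries through a viseme substitution matrix. Everything below (the viseme inventory, the probabilities, the lexicon) is invented for illustration, and it covers substitutions only; the system described in the paper instead learns a cascade of weighted finite-state transducers that also models deletions.

```python
import numpy as np

# Toy viseme inventory and substitution matrix P(observed | intended).
VISEMES = ["p", "f", "t", "ah", "iy"]
IDX = {v: i for i, v in enumerate(VISEMES)}
n = len(VISEMES)
CONF = np.full((n, n), 0.05)             # small mass on every confusion
np.fill_diagonal(CONF, 1.0)              # most mass on the correct viseme
CONF /= CONF.sum(axis=1, keepdims=True)  # rows are proper distributions

def log_score(observed, intended):
    """Log P(observed viseme string | intended string), substitutions only."""
    if len(observed) != len(intended):
        return -np.inf
    return float(sum(np.log(CONF[IDX[i], IDX[o]])
                     for o, i in zip(observed, intended)))

# Decode by picking the lexicon entry that best explains the observation.
lexicon = {"pat": ["p", "ah", "t"], "fat": ["f", "ah", "t"]}
observed = ["f", "ah", "t"]
best = max(lexicon, key=lambda w: log_score(observed, lexicon[w]))
print(best)  # "fat"
```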

    Idealized computational models for auditory receptive fields

    This paper presents a theory by which idealized models of auditory receptive fields can be derived in a principled axiomatic manner from a set of structural properties that enable invariance of receptive field responses under natural sound transformations and ensure internal consistency between spectro-temporal receptive fields at different temporal and spectral scales. For defining a time-frequency transformation of a purely temporal sound signal, it is shown that the framework allows for a new way of deriving the Gabor and Gammatone filters, as well as a novel family of generalized Gammatone filters with additional degrees of freedom to obtain different trade-offs between the spectral selectivity and the temporal delay of time-causal temporal window functions. When applied to the definition of a second layer of receptive fields computed from a spectrogram, it is shown that the framework leads to two canonical families of spectro-temporal receptive fields, in terms of spectro-temporal derivatives of either spectro-temporal Gaussian kernels for non-causal time, or the combination of a time-causal generalized Gammatone filter over the temporal domain and a Gaussian filter over the log-spectral domain. For each filter family, the spectro-temporal receptive fields can be either separable over the time-frequency domain or adapted to local glissando transformations that represent variations in logarithmic frequencies over time. Within each domain of either non-causal or time-causal time, these receptive field families are derived by uniqueness from the assumptions. It is demonstrated how the presented framework allows for the computation of basic auditory features for audio processing, and that it leads to predictions about auditory receptive fields with good qualitative similarity to biological receptive fields measured in the inferior colliculus (ICC) and primary auditory cortex (A1) of mammals. Comment: 55 pages, 22 figures, 3 tables.
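
    For reference, the standard order-n gammatone impulse response, one of the filter families the axiomatic framework re-derives, has the form g(t) = t^(n-1) exp(-2*pi*b*t) cos(2*pi*f*t). The following minimal NumPy sketch generates it; the parameter values are illustrative defaults, not taken from the paper.

```python
import numpy as np

def gammatone_ir(f_c, n=4, b=100.0, fs=16000, dur=0.05):
    """Impulse response of an order-n gammatone filter:
    g(t) = t**(n-1) * exp(-2*pi*b*t) * cos(2*pi*f_c*t),
    with centre frequency f_c (Hz) and bandwidth parameter b (Hz)."""
    t = np.arange(int(dur * fs)) / fs
    g = t ** (n - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * f_c * t)
    return g / np.max(np.abs(g))  # normalize peak amplitude

# A signal can then be filtered by convolving with the impulse response:
# sig_out = np.convolve(sig_in, gammatone_ir(1000.0), mode="same")
```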

    The influence of external and internal motor processes on human auditory rhythm perception

    Musical rhythm is composed of organized temporal patterns, and the processes underlying rhythm perception are found to engage both auditory and motor systems. Despite behavioural and neuroscientific evidence converging on this audio-motor interaction, relatively little is known about the effect of specific motor processes on auditory rhythm perception. This doctoral thesis investigated the influence of both external and internal motor processes on the way we perceive an auditory rhythm. The first half of the thesis aimed to establish whether overt body movement had a facilitatory effect on our ability to perceive the auditory rhythmic structure, and whether this effect was modulated by musical training. To this end, musicians and non-musicians performed a pulse-finding task either using natural body movement or through listening only, and produced their identified pulse by finger tapping. The results showed that overt movement benefited rhythm (pulse) perception, especially for non-musicians, confirming the facilitatory role of external motor activities in hearing the rhythm, as well as its interaction with musical training. The second half of the thesis tested the idea that indirect, covert motor input, such as that transformed from visual stimuli, could influence our perceived structure of an auditory rhythm. Three experiments examined the subjectively perceived tempo of an auditory sequence under different visual motion stimulations, while the auditory and visual streams were presented independently of each other. The results revealed that the perceived auditory tempo was influenced by the concurrent visual motion conditions, and the effect was related to the increment or decrement of visual motion speed. This supported the hypothesis that internal motor information extracted from the visuomotor stimulation could be incorporated into the percept of an auditory rhythm. Taken together, the present thesis concludes that, rather than merely reacting to the given auditory input, our motor system plays an important role in the perceptual processing of auditory rhythm. This can occur via both external and internal motor activities, and may not only influence how we hear a rhythm but also, under some circumstances, improve our ability to hear the rhythm.

    A Study of Accommodation of Prosodic and Temporal Features in Spoken Dialogues in View of Speech Technology Applications

    Inter-speaker accommodation is a well-known property of human speech and human interaction in general. Broadly, it refers to the behavioural patterns of two (or more) interactants and the effect of the (verbal and non-verbal) behaviour of each on that of the other(s). Implementation of this behaviour in spoken dialogue systems is desirable as an improvement on the naturalness of human-machine interaction. However, traditional qualitative descriptions of accommodation phenomena do not provide sufficient information for such an implementation; a quantitative description of inter-speaker accommodation is therefore required. This thesis proposes a methodology for monitoring accommodation during a human-human or human-computer dialogue, which applies a moving average filter over sequential frames for each speaker. These frames are time-aligned across the speakers, hence the name Time Aligned Moving Average (TAMA). Analysis of spontaneous human dialogue recordings by means of the TAMA methodology reveals ubiquitous accommodation of prosodic features (pitch, intensity and speech rate) across interlocutors, and allows for statistical (time series) modelling of the behaviour in a way that is meaningful for implementation in spoken dialogue system (SDS) environments. In addition, a novel dialogue representation is proposed that provides an additional point of view to that of TAMA in monitoring accommodation of temporal features (inter-speaker pause length and overlap frequency). This representation is a percentage turn distribution of individual speaker contributions in a dialogue frame, which circumvents strict attribution of speaker turns by considering both interlocutors as synchronously active. Both the TAMA and turn distribution metrics indicate that correlation of average pause length and overlap frequency between speakers can be attributed to accommodation (a debated issue), and point to possible improvements in SDS turn-taking behaviour. Although the findings of the prosodic and temporal analyses can directly inform SDS implementations, further work is required in order to describe inter-speaker accommodation sufficiently, as well as to develop an adequate testing platform for evaluating the magnitude of perceived improvement in human-machine interaction. This thesis therefore constitutes a first step towards a convincingly useful implementation of accommodation in spoken dialogue systems.
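
    The following is a minimal sketch of the TAMA idea: averaging a prosodic feature over fixed, overlapping time frames so that both speakers' series share one time axis. The frame parameters and variable names are illustrative, not the thesis's settings, and feature extraction (e.g. pitch tracking) is assumed to have happened upstream.

```python
import numpy as np

def tama(times, values, frame_len=20.0, step=10.0):
    """Time Aligned Moving Average (sketch): average a per-utterance
    feature (e.g. pitch in Hz) over overlapping frames on a fixed grid.
    `times` and `values` are 1-D arrays; frames with no data yield NaN."""
    starts = np.arange(0.0, times.max(), step)
    out = np.full(len(starts), np.nan)
    for k, s in enumerate(starts):
        mask = (times >= s) & (times < s + frame_len)
        if mask.any():
            out[k] = values[mask].mean()
    return starts, out

# Because both speakers are framed on the same grid, accommodation can be
# probed by correlating their series, e.g.:
# _, a = tama(times_a, pitch_a)
# _, b = tama(times_b, pitch_b)
# ok = ~np.isnan(a) & ~np.isnan(b)
# r = np.corrcoef(a[ok], b[ok])[0, 1]
```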