    Multimodal feature extraction and fusion for audio-visual speech recognition

    Multimodal signal processing analyzes a physical phenomenon through several types of measures, or modalities. This leads to the extraction of higher-quality and more reliable information than that obtained from single-modality signals. The advantage is two-fold. First, as the modalities are usually complementary, the end result of multimodal processing is more informative than that of each modality taken individually. This is true in all application domains: human-machine interaction, multimodal identification or multimodal image processing. Second, as modalities are not always reliable, it is possible, when one modality becomes corrupted, to extract the missing information from the other one. There are two essential challenges in multimodal signal processing. First, the features used from each modality need to be as relevant and as few as possible. Because multimodal systems have to process more than one modality, they run into errors caused by the curse of dimensionality much more easily than mono-modal ones. The curse of dimensionality refers to the fact that the number of uniformly distributed samples required to cover a region of space grows exponentially with the dimensionality of the space. This has important implications in the classification domain, since accurate models can only be obtained if an adequate number of samples is available, and this required number of samples grows with the dimensionality of the features. Dimensionality reduction is thus a necessary step in any application dealing with complex signals, and it is achieved through selection, transforms or a combination of the two. The second essential challenge is multimodal integration. Since the signals involved do not necessarily have the same data rate, range or even dimensionality, combining information coming from such different sources is not straightforward. This can be done at different levels, starting from the basic signal level, by combining the signals themselves if they are compatible, up to the highest decision level, where only the individual decisions taken based on the signals are combined. Ideally, the fusion method should allow temporal variations in the relative importance of the two streams, to account for possible changes in their quality. However, this can only be done with methods operating at a high decision level. The aim of this thesis is to offer solutions to both these challenges, in the context of audio-visual speech recognition and speaker localization. Both applications are from the field of human-machine interaction. Audio-visual speech recognition aims to improve the accuracy of speech recognizers by augmenting the audio with information extracted from the video, in particular the movement of the speaker's lips. This works especially well when the audio is corrupted, leading in that case to significant gains in accuracy. Speaker localization means detecting the active speaker in an audio-video sequence containing several persons, which is useful for videoconferencing and the automated annotation of meetings. These two applications are the context in which we present our solutions to both feature selection and multimodal integration. First, we show how informative features can be extracted from the visual modality, using an information-theoretic framework which gives us a quantitative measure of the relevance of individual features. We also prove that reducing redundancy between these features is important for avoiding the curse of dimensionality and improving recognition results. The methods that we present are novel in the field of audio-visual speech recognition, and we found that their use leads to significant improvements compared to the state of the art. Second, we present a method of multimodal fusion at the level of intermediate decisions using a weight for each of the streams. The weights are adaptive, changing according to the estimated reliability of each stream. This makes the system tolerant to changes in the quality of either stream, and even to the temporary interruption of one of the streams. The reliability estimate is based on the entropy of the posterior probability distributions of each stream at the intermediate decision level. Our results are superior to those obtained with a state-of-the-art method based on maximizing the same posteriors. Moreover, we analyze the effect of a constraint typically imposed on stream weights in the literature, namely that they should sum to one. Our results show that removing this constraint can lead to improvements in recognition accuracy. Finally, we develop a method for audio-visual speaker localization based on the correlation between the audio energy and the movement of the speaker's lips. Our method is based on a joint probability model of the audio and video, which is used to build a likelihood map showing the likely positions of the speaker's mouth. We show that our novel method performs better than a similar method from the literature. In conclusion, we analyze two different challenges of multimodal signal processing for two audio-visual problems, and offer innovative approaches for solving them.
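
    For readers who want a concrete picture of information-theoretic feature selection of the kind described above, the following is a minimal, hypothetical Python sketch of an mRMR-style criterion: features are ranked by their mutual information with the class labels (relevance) minus their average mutual information with the features already chosen (redundancy). The function name, bin count and use of scikit-learn estimators are illustrative assumptions, not the thesis's actual implementation.

```python
# Minimal mRMR-style sketch: rank features by mutual information with the class
# labels (relevance) while penalizing mutual information with already-selected
# features (redundancy). Illustrative only; not the thesis's exact criterion.
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.metrics import mutual_info_score

def select_features(X, y, n_select=10, n_bins=16):
    """Greedy relevance-minus-redundancy selection on an (n_samples, n_features) matrix."""
    relevance = mutual_info_classif(X, y)              # I(feature; class) per feature
    # Discretize each feature so pairwise MI between features can be estimated.
    Xd = np.stack([np.digitize(col, np.histogram_bin_edges(col, bins=n_bins))
                   for col in X.T], axis=1)
    selected, remaining = [], list(range(X.shape[1]))
    while remaining and len(selected) < n_select:
        def score(j):
            if not selected:
                return relevance[j]
            redundancy = np.mean([mutual_info_score(Xd[:, j], Xd[:, k]) for k in selected])
            return relevance[j] - redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```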
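
    Similarly, entropy-based adaptive stream weighting can be illustrated with a small, hypothetical sketch: each stream's per-frame posterior entropy is mapped to a reliability weight (with the sum-to-one constraint left optional, as discussed above), and the weighted log-posteriors are combined at an intermediate decision level. The entropy-to-weight mapping below is an assumption for illustration only.

```python
# Hedged sketch of entropy-driven stream weighting at an intermediate decision
# level: lower posterior entropy is taken to indicate a more reliable stream.
import numpy as np

def entropy(p, eps=1e-12):
    p = np.clip(p, eps, 1.0)
    return -np.sum(p * np.log(p), axis=-1)

def fuse_posteriors(p_audio, p_video, constrain_sum_to_one=False):
    """p_audio, p_video: (n_frames, n_classes) posterior matrices from each stream."""
    h_max = np.log(p_audio.shape[-1])          # entropy of a uniform distribution
    w_a = 1.0 - entropy(p_audio) / h_max       # reliability weight in [0, 1]
    w_v = 1.0 - entropy(p_video) / h_max
    if constrain_sum_to_one:                   # optional constraint analyzed in the thesis
        total = w_a + w_v + 1e-12
        w_a, w_v = w_a / total, w_v / total
    log_p = (w_a[:, None] * np.log(np.clip(p_audio, 1e-12, 1.0))
             + w_v[:, None] * np.log(np.clip(p_video, 1e-12, 1.0)))
    return np.argmax(log_p, axis=-1)           # fused per-frame decisions
```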

    Multiresolution models in image restoration and reconstruction with medical and other applications

    Cerebellar and sensory contributions to the optomotor response in larval zebrafish

    Discrete Wavelet Transforms

    Discrete wavelet transform (DWT) algorithms have a firm position in signal processing across several areas of research and industry. As the DWT provides both octave-scale frequency and spatial timing information about the analyzed signal, it is increasingly used to solve ever more advanced problems. The present book, Discrete Wavelet Transforms: Algorithms and Applications, reviews the recent progress in discrete wavelet transform algorithms and applications. The book covers a wide range of methods (e.g. lifting, shift invariance, multi-scale analysis) for constructing DWTs. The book chapters are organized into four major parts. Part I describes the progress in hardware implementations of the DWT algorithms. Applications include multitone modulation for ADSL and equalization techniques, a scalable architecture for FPGA implementation, a lifting-based algorithm for VLSI implementation, a comparison between DWT- and FFT-based OFDM, and a modified SPIHT codec. Part II addresses image processing algorithms such as a multiresolution approach for edge detection, low bit-rate image compression, low-complexity implementation of CQF wavelets and compression of multi-component images. Part III focuses on watermarking DWT algorithms. Finally, Part IV describes shift-invariant DWTs, the DC lossless property, DWT-based analysis and estimation of colored noise and an application of the wavelet Galerkin method. The chapters of the present book consist of both tutorial and highly advanced material. Therefore, the book is intended to be a reference text for graduate students and researchers to obtain state-of-the-art knowledge on specific applications.
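
    As an illustration of the lifting construction the book discusses, the following minimal Python sketch implements one level of the Haar DWT as predict and update lifting steps and cascades it into an octave-band decomposition. It is a generic textbook example, not code taken from any of the chapters.

```python
# One level of the Haar DWT written as lifting steps (predict, then update),
# then normalised; cascading on the approximation band gives the usual
# octave-band decomposition.
import numpy as np

def haar_lifting_step(x):
    """x: 1-D signal of even length -> (approximation, detail) coefficients."""
    even, odd = x[0::2].astype(float), x[1::2].astype(float)
    detail = odd - even                  # predict: odd samples predicted by even neighbours
    approx = even + detail / 2.0         # update: recover the pairwise average
    return approx * np.sqrt(2.0), detail / np.sqrt(2.0)   # orthonormal Haar scaling

def haar_dwt(x, levels=3):
    """Multi-level decomposition; len(x) must be divisible by 2**levels.
    Returns [cA_n, cD_n, ..., cD_1] (coarsest approximation first)."""
    coeffs, approx = [], np.asarray(x, dtype=float)
    for _ in range(levels):
        approx, detail = haar_lifting_step(approx)
        coeffs.append(detail)
    return [approx] + coeffs[::-1]
```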

    Multiresolution image models and estimation techniques

    Recent Advances in Signal Processing

    Signal processing is a critical task in the majority of new technological inventions and challenges, across a variety of applications in both science and engineering. Classical signal processing techniques have largely worked with mathematical models that are linear, local, stationary, and Gaussian, and have always favored closed-form tractability over real-world accuracy. These constraints were imposed by the lack of powerful computing tools. During the last few decades, signal processing theories, developments, and applications have matured rapidly and now include tools from many areas of mathematics, computer science, physics, and engineering. This book is targeted primarily toward students and researchers who want to be exposed to a wide variety of signal processing techniques and algorithms. It includes 27 chapters that can be categorized into five different areas depending on the application at hand. These five categories address image processing, speech processing, communication systems, time-series analysis, and educational packages, respectively. The book has the advantage of providing a collection of applications that are completely independent and self-contained; thus, the interested reader can choose any chapter and skip to another without losing continuity.

    Inverse problems in astronomical and general imaging

    The resolution and the quality of an imaged object are limited by four contributing factors. Firstly, the primary resolution limit of a system is imposed by the aperture of the instrument, due to the effects of diffraction. Secondly, the finite sampling frequency, the finite measurement time and the mechanical limitations of the equipment also affect the resolution of the captured images. Thirdly, the images are corrupted by noise, a process inherent to all imaging systems. Finally, a turbulent imaging medium introduces random degradations to the signals before they are measured. In astronomical imaging, it is the atmosphere which distorts the wavefronts of the objects, severely limiting the resolution of the images captured by ground-based telescopes. These four factors affect all real imaging systems to varying degrees. All the limitations imposed on an imaging system result in the need to deduce or reconstruct the underlying object distribution from the distorted measured data; this class of problems is called inverse problems. The key to successfully solving an inverse problem is the correct modelling of the physical processes which give rise to the corresponding forward problem. However, the physical processes carry an infinite amount of information, while only a finite number of parameters can be used in the model; information loss is therefore inevitable. As a result, the solution to many inverse problems requires additional information or prior knowledge. The application of prior information to inverse problems is a recurrent theme throughout this thesis. An inverse problem that has been an active research area for many years is interpolation, and there exist numerous techniques for solving this problem. However, many of these techniques neither account for the sampling process of the instrument nor include prior information in the reconstruction. These factors are taken into account in the proposed optimal Bayesian interpolator. The process of interpolation is also examined from the point of view of superresolution, as these processes can be viewed as complementary. Since the principal effect of atmospheric turbulence on an incoming wavefront is a phase distortion, most of the inverse problem techniques devised for this seek to either estimate or compensate for this phase component. These techniques are classified into computer post-processing methods, adaptive optics (AO) and hybrid techniques. Blind deconvolution is a post-processing technique which uses the speckle images to estimate both the object distribution and the point spread function (PSF), the latter of which is directly related to the phase. The most successful approaches are based on characterising the PSF as the aberrations over the aperture. Since the PSF also depends on the atmosphere, it is possible to constrain the solution using the statistics of the atmosphere; an investigation shows the feasibility of this approach. The bispectrum is also a post-processing method, one which reconstructs the spectrum of the object. The key component for phase preservation is the property of phase closure, and its application as prior information for blind deconvolution is examined. Blind deconvolution techniques utilise only information in the image channel to estimate the phase, which is difficult. An alternative method for phase estimation uses a Shack-Hartmann (SH) wavefront sensing channel. However, since phase information is present in both the wavefront sensing and the image channels simultaneously, both of these approaches suffer from the problem that phase information from only one channel is used. An improved estimate of the phase is achieved by a combination of these methods, ensuring that the phase estimation is made jointly from the data in both the image and the wavefront sensing measurements. This formulation, posed as a blind deconvolution framework, is investigated in this thesis. An additional advantage of this approach is that, since speckle images are imaged in a narrowband while wavefront sensing images are captured by a charge-coupled device (CCD) camera at all wavelengths, the splitting of the light does not compromise the light level for either channel. This provides a further incentive for using simultaneous data sets. The effectiveness of using Shack-Hartmann wavefront sensing data for phase estimation relies on the accuracy of locating the data spots. The commonly used method, which calculates the centre of gravity of the image, is in fact prone to noise and suboptimal. An improved method for spot location based on blind deconvolution is demonstrated. Ground-based adaptive optics technologies aim to correct for atmospheric turbulence in real time. Although much success has been achieved, the space- and time-varying nature of the atmosphere renders the accurate measurement of atmospheric properties difficult. It is therefore usual to perform additional post-processing on the AO data. As a result, some of the techniques developed in this thesis are applicable to adaptive optics. One of the methods which utilise elements of both adaptive optics and post-processing is the hybrid technique of deconvolution from wavefront sensing (DWFS). Here, both the speckle images and the SH wavefront sensing data are used. The original proposal of DWFS is simple to implement but suffers from the problem that the magnitude of the object spectrum cannot be reconstructed accurately. The solution proposed for overcoming this is to use an additional set of reference star measurements. This, however, does not completely remove the original problem; in addition, it introduces other difficulties associated with reference star measurements, such as anisoplanatism and the reduction of valuable observing time. In this thesis a parameterised solution is examined which removes the need for a reference star, as well as offering the potential to overcome the problem of estimating the magnitude of the object spectrum.
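
    For context on the deconvolution problem discussed above, the sketch below shows the classical non-blind Richardson-Lucy iteration for a known PSF. The thesis itself addresses the harder blind case in which the PSF (and hence the phase) must also be estimated, so this is only a baseline illustration, not the method investigated in the thesis.

```python
# Classical (non-blind) Richardson-Lucy deconvolution for a known PSF,
# included only as a baseline illustration of the image-restoration problem.
import numpy as np
from scipy.signal import fftconvolve

def richardson_lucy(image, psf, n_iter=30, eps=1e-12):
    """Iteratively estimate the object distribution given a measured image and its PSF."""
    psf = psf / psf.sum()
    psf_mirror = psf[::-1, ::-1]
    estimate = np.full_like(image, image.mean(), dtype=float)
    for _ in range(n_iter):
        blurred = fftconvolve(estimate, psf, mode="same")
        ratio = image / np.maximum(blurred, eps)       # data / current re-blurred estimate
        estimate *= fftconvolve(ratio, psf_mirror, mode="same")
    return estimate
```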
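
    The centre-of-gravity spot estimator mentioned above as the common but noise-prone baseline can be written in a few lines; the background-subtraction step and function name below are illustrative assumptions, and the improved blind-deconvolution spot locator is not shown.

```python
# Baseline centre-of-gravity (centroid) estimate of the spot position within
# one Shack-Hartmann subaperture image, with a simple background offset removed.
import numpy as np

def centre_of_gravity(subaperture, background=0.0):
    """Return the (row, col) centroid of a 2-D subaperture intensity image."""
    img = np.clip(subaperture.astype(float) - background, 0.0, None)
    total = img.sum()
    if total <= 0:
        return np.nan, np.nan          # no usable signal in this subaperture
    rows, cols = np.indices(img.shape)
    return (rows * img).sum() / total, (cols * img).sum() / total
```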

    Towards Computational Efficiency of Next Generation Multimedia Systems

    To address the throughput demands of complex applications (such as multimedia), a next-generation system designer needs to co-design and co-optimize the hardware and software layers. Hardware/software knobs must be tuned in synergy to increase throughput efficiency. This thesis provides such algorithmic and architectural solutions while taking into account the new technology challenges of power caps and memory aging. The goal is to maximize throughput efficiency under timing and hardware constraints.

    Video-based analysis of Gait pathologies

    Gait analysis has recently emerged as one of the most important medical fields due to its wide range of applications. Marker-based systems are the most favoured methods of human motion assessment and gait analysis; however, these systems require specific equipment and expertise, and are cumbersome, costly and difficult to use. Many recent computer-vision-based approaches have been developed to reduce the cost of expensive motion capture systems while ensuring high-accuracy results. In this thesis, we introduce our new low-cost gait analysis system, which is composed of two low-cost monocular cameras (camcorders) placed on the left and right sides of a treadmill. Each 2D left or right human skeleton model is reconstructed from each view based on dynamic color segmentation, and the gait analysis is then performed on these two models. Validation against a state-of-the-art vision-based motion capture system (the Microsoft Kinect v.1) and against ground truth (with markers) was done to demonstrate the robustness and efficiency of our system. The average errors in human skeleton model estimation compared to ground truth for our method vs. the Kinect are very promising: the joint angles of the upper legs (6.29° vs. 9.68°), lower legs (7.68° vs. 11.47°), feet (6.14° vs. 13.63°) and the stride lengths (6.14 cm vs. 13.63 cm) were better and more stable than those from the Kinect, while the system could maintain an accuracy reasonably close to the Kinect for the upper arms (7.29° vs. 6.12°), lower arms (8.33° vs. 8.04°), and torso (8.69° vs. 6.47°). Based on the skeleton model obtained by each method, we performed a symmetry study on various joints (elbow, knee and ankle) using each method on two different subjects, to see which method can distinguish more efficiently the symmetry/asymmetry characteristic of gaits. In our test, our system reported a maximum knee angle of 8.97° and 13.86° for normal and asymmetric walks respectively, while the Kinect gave 10.58° and 11.94°. Compared to the ground truth of 7.64° and 14.34°, our system showed greater accuracy and discriminative power between the two cases.
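
    Joint angles such as those reported above are conventionally derived from three joint positions of the reconstructed 2D skeleton; the following hypothetical sketch computes, for example, a knee angle from hip, knee and ankle image coordinates. The exact joint and angle definitions used in the thesis are not reproduced here.

```python
# Hedged sketch: angle at a joint from three 2-D joint positions, e.g. the knee
# angle from the hip, knee and ankle points of a reconstructed skeleton model.
import numpy as np

def joint_angle(p_prox, p_joint, p_dist):
    """Angle (degrees) at p_joint between segments p_joint->p_prox and p_joint->p_dist."""
    u = np.asarray(p_prox, float) - np.asarray(p_joint, float)
    v = np.asarray(p_dist, float) - np.asarray(p_joint, float)
    cos_a = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)
    return np.degrees(np.arccos(np.clip(cos_a, -1.0, 1.0)))

# Example: knee angle for one frame (hip, knee, ankle image coordinates in pixels).
print(joint_angle((320, 240), (330, 330), (325, 420)))
```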