7 research outputs found

    Subband beamforming with higher order statistics for distant speech recognition

    Get PDF
    This dissertation presents novel beamforming methods for distant speech recognition (DSR). Such techniques can relieve users from the necessity of putting on close talking microphones. DSR systems are useful in many applications such as humanoid robots, voice control systems for automobiles, automatic meeting transcription systems and so on. A main problem in DSR is that recognition performance is seriously degraded when a speaker is far from the microphones. In order to avoid the degradation, noise and reverberation should be removed from signals received with the microphones. Acoustic beamforming techniques have a potential to enhance speech from the far field with little distortion since they can maintain a distortionless constraint for a look direction. In beamforming, multiple signals propagating from a position are captured with multiple microphones. Typical conventional beamformers then adjust their weights so as to minimize the variance of their own outputs subject to a distortionless constraint in a look direction. The variance is the average of the second power (square) of the beamformer\u27s outputs. Accordingly, it is considered that the conventional beamformer uses second orderstatistics (SOS) of the beamformer\u27s outputs. The conventional beamforming techniques can effectively place a null on any source of interference. However, the desired signal is also canceled in reverberant environments, which is known as the signal cancellation problem. To avoid that problem, many algorithms have been developed. However, none of the algorithms can essentially solve the signal cancellation problem in reverberant environments. While many efforts have been made in order to overcome the signal cancellation problem in the field of acoustic beamforming, researchers have addressed another research issue with the microphone array, that is, blind source separation (BSS) [1]. The BSS techniques aim at separating sources from the mixture of signals without information about the geometry of the microphone array and positions of sources. It is achieved by multiplying an un-mixing matrix with input signals. The un-mixing matrix is constructed so that the outputs are stochastically independent. Measuring the stochastic independence of the signals is based on the theory of the independent component analysis (ICA) [1]. The field of ICA is based on the fact that distributions of information-bearing signals are not Gaussian and distributions of sums of various signals are close to Gaussian. There are two popular criteria for measuring the degree of the non-Gaussianity, namely, kurtosis and negentropy. As described in detail in this thesis, both criteria use more than the second moment. Accordingly, it is referred to as higher order statistics (HOS) in contrast to SOS. HOS is not considered in the field of acoustic beamforming well although Arai et al. showed the similarity between acoustic beamforming and BSS [2]. This thesis investigates new beamforming algorithms which take into consideration higher-order statistics (HOS). The new beamforming methods adjust the beamformer\u27s weights based on one of the following criteria: • minimum mutual information of the two beamformer\u27s outputs, • maximum negentropy of the beamformer\u27s outputs and • maximum kurtosis of the beamformer\u27s outputs. Those algorithms do not suffer from the signal cancellation, which is shown in this thesis. Notice that the new beamforming techniques can keep the distortionless constraint for the direction of interest in contrast to the BSS algorithms. The effectiveness of the new techniques is finally demonstrated through a series of distant automatic speech recognition experiments on real data recorded with real sensors unlike other work where signals artificially convolved with measured impulse responses are considered. Significant improvements are achieved by the beamforming algorithms proposed here.Diese Dissertation präsentiert neue Methoden zur Spracherkennung auf Entfernung. Mit diesen Methoden ist es möglich auf Nahbesprechungsmikrofone zu verzichten. Spracherkennungssysteme, die auf Nahbesprechungsmikrofone verzichten, sind in vielen Anwendungen nützlich, wie zum Beispiel bei Humanoiden-Robotern, in Voice Control Systemen für Autos oder bei automatischen Transcriptionssystemen von Meetings. Ein Hauptproblem in der Spracherkennung auf Entfernung ist, dass mit zunehmendem Abstand zwischen Sprecher und Mikrofon, die Genauigkeit der Spracherkennung stark abnimmt. Aus diesem Grund ist es elementar die Störungen, nämlich Hintergrundgeräusche, Hall und Echo, aus den Mikrofonsignalen herauszurechnen. Durch den Einsatz von mehreren Mikrofonen ist eine räumliche Trennung des Nutzsignals von den Störungen möglich. Diese Methode wird als akustisches Beamformen bezeichnet. Konventionelle akustische Beamformer passen ihre Gewichte so an, dass die Varianz des Ausgangssignals minimiert wird, wobei das Signal in "Blickrichtung" die Bedingung der Verzerrungsfreiheit erfüllen muss. Die Varianz ist definiert als das quadratische Mittel des Ausgangssignals.Somit werden bei konventionellen Beamformingmethoden Second-Order Statistics (SOS) des Ausgangssignals verwendet. Konventionelle Beamformer können Störquellen effizient unterdrücken, aber leider auch das Nutzsignal. Diese unerwünschte Unterdrückung des Nutzsignals wird im Englischen signal cancellation genannt und es wurden bereits viele Algorithmen entwickelt um dies zu vermeiden. Keiner dieser Algorithmen, jedoch, funktioniert effektiv in verhallter Umgebung. Eine weitere Methode das Nutzsignal von den Störungen zu trennen, diesesmal jedoch ohne die geometrische Information zu nutzen, wird Blind Source Separation (BSS) [1] genannt. Hierbei wird eine Matrixmultiplikation mit dem Eingangssignal durchgeführt. Die Matrix muss so konstruiert werden, dass die Ausgangssignale statistisch unabhängig voneinander sind. Die statistische Unabhängigkeit wird mit der Theorie der Independent Component Analysis (ICA) gemessen [1]. Die ICA nimmt an, dass informationstragende Signale, wie z.B. Sprache, nicht gaußverteilt sind, wohingegen die Summe der Signale, z.B. das Hintergrundrauschen, gaußverteilt sind. Es gibt zwei gängige Arten um den Grad der Nichtgaußverteilung zu bestimmen, Kurtosis und Negentropy. Wie in dieser Arbeit beschrieben, werden hierbei höhere Momente als das zweite verwendet und somit werden diese Methoden als Higher-Order Statistics (HOS) bezeichnet. Obwohl Arai et al. zeigten, dass sich Beamforming und BSS ähnlich sind, werden HOS beim akustischen Beamforming bisher nicht verwendet [2] und beruhen weiterhin auf SOS. In der hier vorliegenden Dissertation werden neue Beamformingalgorithmen entwickelt und evaluiert, die auf HOS basieren. Die neuen Beamformingmethoden passen ihre Gewichte anhand eines der folgenden Kriterien an: • Minimum Mutual Information zweier Beamformer Ausgangssignale • Maximum Negentropy der Beamformer Ausgangssignale und • Maximum Kurtosis der Beamformer Ausgangssignale. Es wird anhand von Spracherkennerexperimenten (gemessen in Wortfehlerrate) gezeigt, dass die hier entwickelten Beamformingtechniken auch erfolgreich Störquellen in verhallten Umgebungen unterdrücken, was ein klarer Vorteil gegenüber den herkömmlichen Methoden ist

    Classification of Sound Scenes and Events in Real-World Scenarios with Deep Learning Techniques

    Get PDF
    La clasificación de los eventos sonoros es un campo de la audición por computador que se está volviendo cada vez más interesante debido al gran número de aplicaciones que podrían beneficiarse de esta tecnología. A diferencia de otros campos de la audición por computador relacionados con la recuperación de información musical o el reconocimiento del habla, la clasificación de eventos sonoros tiene una serie de problemas intrínsecos. Estos problemas son la naturaleza polifónica de la mayoría de las grabaciones de sonido ambiental, la diferencia en la naturaleza de cada sonido, la falta de estructura temporal y la adición de ruido de fondo y reverberación en el proceso de grabación. Estos problemas son campos de estudio para la comunidad científica a día de hoy. Sin embargo, cabe señalar que cuando se despliega una solución de audición por computador en entornos reales, pueden surgir una serie de problemas adicionales. Estos problemas son el Reconocimiento de Conjunto Abierto (OSR), el Aprendizaje de Pocos Disparos (FSL) y la consideración del tiempo de ejecución del sistema (baja complejidad). El OSR se define como el problema que aparece cuando un sistema de inteligencia artificial tiene que enfrentarse a una situación desconocida en la que clases no vistas durante la etapa de entrenamiento están presentes en una etapa de inferencia. El FSL corresponde al problema que se produce cuando hay muy pocas muestras disponibles para cada clase considerada. Por último, dado que estos sistemas se despliegan normalmente en dispositivos de borde, hay que tener en cuenta el tiempo de ejecución, ya que cuanto menos tiempo tarde el sistema en dar una respuesta, mejor será la experiencia percibida por los usuarios. Las soluciones basadas en las técnicas de aprendizaje en profundidad para problemas similares en el dominio de la imagen han mostrado resultados prometedores. Las soluciones más difundidas son las que implementan Redes Neuronales Convolucionales (CNN). Por lo tanto, muchos sistemas de audio de última generación proponen convertir las señales de audio en una representación bidimensional que puede ser tratada como una imagen. La generación de mapas internos se realiza a menudo por las capas convolucionales de las CNN. Sin embargo, estas capas tienen una serie de limitaciones que deben ser estudiadas para poder proponer técnicas para mejorar los mapas de características resultantes. Con este fin, se han propuesto novedosas redes que fusionan dos métodos diferentes, como el aprendizaje residual y las técnicas de excitación y compresión. Los resultados muestran una mejora de la precisión del sistema con la adición de un número reducido de parámetros adicionales. Por otra parte, estas soluciones basadas en entradas bidimensionales pueden mostrar un cierto sesgo, ya que la elección de la representación de audio puede ser específica para una tarea concreta. Por lo tanto, se ha realizado un estudio comparativo de diferentes redes residuales alimentadas directamente por la señal de audio en bruto. Estas soluciones se conocen como de extremo a extremo. Si bien se han realizado estudios similares en la literatura en el dominio de la imagen, los resultados sugieren que los bloques residuales de mejor rendimiento para las tareas de visión artificial pueden no ser los mismos que los de la clasificación de audio. En cuanto a los problemas de FSL y OSR, se propone un marco basado en un autoencoder capaz de mitigar ambos problemas juntos. Esta solución es capaz de crear representaciones robustas de estos patrones de audio a partir de sólo unas pocas muestras, al tiempo que es capaz de rechazar las clases de audio no deseadas.The classification of sound events is a field of machine listening that is becoming increasingly interesting due to the large number of applications that could benefit from this technology. Unlike other fields of machine listening related to music information retrieval or speech recognition, sound event classification has a number of intrinsic problems. These problems are the polyphonic nature of most environmental sound recordings, the difference in the nature of each sound, the lack of temporal structure and the addition of background noise and reverberation in the recording process. These problems are fields of study for the scientific community today. However, it should be noted that when a machine listening solution is deployed in real environments, a number of extra problems may arise. These problems are Open-Set Recognition (OSR), Few-Shot Learning (FSL) and consideration of system runtime (low-complexity). OSR is defined as the problem that appears when an artificial intelligence system has to face an unknown situation where classes unseen during the training stage are present at a usage stage. FSL corresponds to the problem that occurs when there are very few samples available for each considered class. Finally, since these systems are normally deployed in edge devices, the consideration of the execution time must be taken into account, as the less time the system takes to give a response, the better the experience perceived by the users. Solutions based on Deep Learning techniques for similar problems in the image domain have shown promising results. The most widespread solutions are those that implement Convolutional Neural Networks (CNNs). Therefore, many state-of-the-art audio systems propose to convert audio signals into a two-dimensional representation that can be treated as an image. The generation of internal maps is often done by the convolutional layers of the CNNs. However, these layers have a series of limitations that must be studied in order to be able to propose techniques for improving the resulting feature maps. To this end, novel networks have been proposed that merge two different methods such as residual learning and squeeze-excitation techniques. The results show an improvement in the accuracy of the system with the addition of few number of extra parameters. On the other hand, these solutions based on two-dimensional inputs can show a certain bias since the choice of audio representation can be specific to a particular task. Therefore, a comparative study of different residual networks directly fed by the raw audio signal has been carried out. These solutions are known as end-to-end. While similar studies have been carried out in the literature in the image domain, the results suggest that the best performing residual blocks for computer vision tasks may not be the same as those for audio classification. Regarding the FSL and OSR problems, an autoencoder-based framework capable of mitigating both problems together is proposed. This solution is capable of creating robust representations of these audio patterns from just a few samples while being able to reject unwanted audio classes

    Particle Swarm Optimization

    Get PDF
    Particle swarm optimization (PSO) is a population based stochastic optimization technique influenced by the social behavior of bird flocking or fish schooling.PSO shares many similarities with evolutionary computation techniques such as Genetic Algorithms (GA). The system is initialized with a population of random solutions and searches for optima by updating generations. However, unlike GA, PSO has no evolution operators such as crossover and mutation. In PSO, the potential solutions, called particles, fly through the problem space by following the current optimum particles. This book represents the contributions of the top researchers in this field and will serve as a valuable tool for professionals in this interdisciplinary field

    Target recognition techniques for multifunction phased array radar

    Get PDF
    This thesis, submitted for the degree of Doctor of Philosophy at University College London, is a discussion and analysis of combined stepped-frequency and pulse-Doppler target recognition methods which enable a multifunction phased array radar designed for automatic surveillance and multi-target tracking to offer a Non Cooperative Target Recognition (NCTR) capability. The primary challenge is to investigate the feasibility of NCTR via the use of high range resolution profiles. Given stepped frequency waveforms effectively trade time for enhanced bandwidth, and thus resolution, attention is paid to the design of a compromise between resolution and dwell time. A secondary challenge is to investigate the additional benefits to overall target classification when the number of coherent pulses within an NCTR wavefrom is expanded to enable the extraction of spectral features which can help to differentiate particular classes of target. As with increased range resolution, the price for this extra information is a further increase in dwell time. The response to the primary and secondary challenges described above has involved the development of a number of novel techniques, which are summarized below: • Design and execution of a series of experiments to further the understanding of multifunction phased array Radar NCTR techniques • Development of a ‘Hybrid’ stepped frequency technique which enables a significant extension of range profiles without the proportional trade in resolution as experienced with ‘Classical’ techniques • Development of an ‘end to end’ NCTR processing and visualization pipeline • Use of ‘Doppler fraction’ spectral features to enable aircraft target classification via propulsion mechanism. Combination of Doppler fraction and physical length features to enable broad aircraft type classification. • Optimization of NCTR method classification performance as a function of feature and waveform parameters. • Generic waveform design tools to enable delivery of time costly NCTR waveforms within operational constraints. The thesis is largely based upon an analysis of experimental results obtained using the multifunction phased array radar MESAR2, based at BAE Systems on the Isle of Wight. The NCTR mode of MESAR2 consists of the transmission and reception of successive multi-pulse coherent bursts upon each target being tracked. Each burst is stepped in frequency resulting in an overall bandwidth sufficient to provide sub-metre range resolution. A sequence of experiments, (static trials, moving point target trials and full aircraft trials) are described and an analysis of the robustness of target length and Doppler spectra feature measurements from NCTR mode data recordings is presented. A recorded data archive of 1498 NCTR looks upon 17 different trials aircraft using five different varieties of stepped frequency waveform is used to determine classification performance as a function of various signal processing parameters and extent (numbers of pulses) of the data used. From analysis of the trials data, recommendations are made with regards to the design of an NCTR mode for an operational system that uses stepped frequency techniques by design choice

    Flight Mechanics/Estimation Theory Symposium, 1992

    Get PDF
    This conference publication includes 40 papers and abstracts presented at the Flight Mechanics/Estimation Theory Symposium on May 5-7, 1992. Sponsored by the Flight Dynamics Division of Goddard Space Flight Center, this symposium featured technical papers on a wide range of issues related to orbit-attitude prediction, determination, and control; attitude sensor calibration; attitude determination error analysis; attitude dynamics; and orbit decay and maneuver strategy. Government, industry, and the academic community participated in the preparation and presentation of these papers

    Remote Sensing of Earth Resources: A literature survey with indexes (1970 - 1973 supplement). Section 1: Abstracts

    Get PDF
    Abstracts of reports, articles, and other documents introduced into the NASA scientific and technical information system between March 1970 and December 1973 are presented in the following areas: agriculture and forestry, environmental changes and cultural resources, geodesy and cartography, geology and mineral resources, oceanography and marine resources, hydrology and water management, data processing and distribution systems, instrumentation and sensors, and economic analysis

    Book of abstracts

    Get PDF
    corecore