
Jonathon Chambers (1251609)

Saeid Sanei (7207403)

Syed M.R. Naqvi (7200659)

Yulia Hicks (7208804)

English

Loughborough University Institutional Repository

This item was submitted to Loughborough's Institutional Repository (https://dspace.lboro.ac.uk/) by the author and is made available under the following Creative Commons Licence conditions. For the full text of this licence, please go to: http://creativecommons.org/licenses/by-nc-nd/2.5/

A GEOMETRICALLY CONSTRAINED MULTIMODAL APPROACH FOR CONVOLUTIVE BLIND SOURCE SEPARATION

S. Sanei, S. M. Naqvi, J. A. Chambers and Y. Hicks

Centre of Digital Signal Processing, Cardiff University, Cardiff, CF24 3AA, U.K.
Email: {saneis, naqvisr, chambersj, hicksya}@cardiff.ac.uk

ABSTRACT

A novel constrained multimodal approach for convolutive blind source separation is presented which incorporates video information related to the geometrical positions of both the speakers and the microphones, and the directionality of the speakers, into the separation algorithm. The separation is performed in the frequency domain and the constraints are incorporated through a penalty function-based formulation. The separation results show a considerable improvement over traditional frequency domain convolutive BSS systems such as that developed by Parra and Spence. Importantly, the inherent permutation problem in the frequency domain BSS is potentially solved.

Index Terms: Frequency domain BSS, geometrical constraints and multimodal separation.

1. INTRODUCTION

Convolutive blind source separation (CBSS) has been a subject of considerable research recently since it attempts to address the inherent characteristics of a real (echoic) mixing environment. Generally, the main objective of BSS is to decompose the measurement signals into their constituent independent components as an estimate of the true sources, which are assumed a priori to be independent.

CBSS has been conventionally developed in either the time [1] or frequency [2] [3] [4] domains.
Frequency domain convolutive blind source separation (FDCBSS), however, has been more popular because the convolutive mixing is converted into a number of instantaneous mixing operations. The permutation problem inherent to FDCBSS is more severe and destructive than for time domain schemes [5]. In such systems there are no a priori assumptions on the source statistics or the mixing system. On the other hand, in a multimodal approach the video system can capture the positions of the speakers and the directions they face [6]. The video information can thereby help to estimate the mixing matrix more accurately and ultimately increase the separation performance. Following this idea, the objective of this paper is to use such information efficiently to enhance the separation results.

The CBSS system can be described as follows: assume $m$ statistically independent sources $\mathbf{s}(t) = [s_1(t), \ldots, s_m(t)]^T$, where $[\cdot]^T$ denotes the transpose operation. The sources are convolved with a linear model of the physical medium (mixing matrix), which can be represented in the form of a multichannel FIR filter $\mathbf{H}$, to produce $n$ sensor signals $\mathbf{x}(t) = [x_1(t), \ldots, x_n(t)]^T$ as

$$\mathbf{x}(t) = \sum_{\tau=0}^{P} \mathbf{H}(\tau)\,\mathbf{s}(t-\tau) + \mathbf{v}(t) \tag{1}$$

where $\mathbf{v}(t) = [v_1(t), \ldots, v_n(t)]^T$ is the noise vector at discrete time sample $t$ and $\mathbf{H} = [\mathbf{H}(0), \mathbf{H}(1), \ldots, \mathbf{H}(P)]$. Using time domain CBSS, the sources are estimated using a set of unmixing filters $\mathbf{W}(\tau)$, $\tau = 0, \ldots, Q$, such that

$$\mathbf{y}(t) = \sum_{\tau=0}^{Q} \mathbf{W}(\tau)\,\mathbf{x}(t-\tau) \tag{2}$$

where $\mathbf{y}(t) = [y_1(t), \ldots, y_m(t)]^T$ are the estimated sources. $P$ and $Q$ are respectively the lengths of the mixing and unmixing filters. The length of the signals is $T$. In FDBSS the problem is transferred into the frequency domain using the STFT. Equations (1) and (2) then change respectively to:

$$\mathbf{X}(\omega, t) \approx \mathbf{H}(\omega)\,\mathbf{S}(\omega, t) + \mathbf{V}(\omega, t) \tag{3}$$

$$\mathbf{Y}(\omega, t) \approx \mathbf{W}(\omega)\,\mathbf{X}(\omega, t) \tag{4}$$

where $\omega$ denotes discrete normalized frequency.

Work supported by the Engineering and Physical Sciences Research Council (EPSRC) of the UK.
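As an illustration of the mixing model (1) and of the per-bin matrices $\mathbf{H}(\omega)$ that appear in (3), the following minimal NumPy sketch mixes sources through an FIR filter and evaluates the filter on a DFT grid. The function names, the array layouts and the random example data are assumptions made for this example; they are not part of the original formulation.

```python
import numpy as np


def convolutive_mix(H, s, v=None):
    """Mix sources as in (1): x(t) = sum_tau H(tau) s(t - tau) + v(t).

    H: (P + 1, n, m) FIR mixing taps, s: (m, T) sources,
    v: optional (n, T) noise. Sources are assumed zero for t < 0.
    """
    taps, n, _ = H.shape
    T = s.shape[1]
    x = np.zeros((n, T), dtype=np.result_type(H, s))
    for tau in range(min(taps, T)):
        # Delay the sources by tau samples and apply the tau-th tap.
        x[:, tau:] += H[tau] @ s[:, :T - tau]
    return x if v is None else x + v


def frequency_response(H, nfft):
    """H(omega_i) on an nfft-point DFT grid: one n x m matrix per bin."""
    return np.fft.fft(H, n=nfft, axis=0)


# Hypothetical two-source, two-microphone example with three taps.
rng = np.random.default_rng(0)
H = rng.standard_normal((3, 2, 2))
s = rng.standard_normal((2, 1000))
x = convolutive_mix(H, s)               # sensor signals, shape (2, 1000)
H_w = frequency_response(H, nfft=256)   # shape (256, 2, 2)
```

The approximation in (3) is generally reasonable only when the STFT frame is long compared with the mixing filter, which is why the frame length is left as a free parameter here.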
An inverse STFT is then used to find the estimated sources $\hat{\mathbf{s}}(t) = \mathbf{y}(t)$; however, this will certainly be affected by permutation effects due to the variation of $\mathbf{W}(\omega_i)$ with $\omega_i$. Parra's algorithm finds the unmixing matrices for all the frequency bins by jointly diagonalizing the covariance matrices of the estimated sources, minimising the squared error (measured by the off-diagonal elements of the covariance matrix of the estimated sources) using the constrained gradient descent algorithm [7]. Considering the FDCBSS system developed by Parra and Spence, the main cost function $J_m$ is expressed in the form of

$$J_m = \sum_{\omega=0}^{T} \sum_{k=1}^{K} \|\mathbf{E}(\omega, k)\|_F^2 \tag{5}$$

where

$$\mathbf{E}(\omega, k) = \mathbf{W}(\omega)\,[\mathbf{R}_x(\omega, k) - \mathbf{\Lambda}_v(\omega, k)]\,\mathbf{W}^H(\omega) - \mathbf{\Lambda}_s(\omega, k) \tag{6}$$

and $\mathbf{R}_x$, $\mathbf{\Lambda}_v$ and $\mathbf{\Lambda}_s$ are respectively the covariance matrices of the signal, noise and source spectra, and $\|\cdot\|_F^2$ denotes the squared Frobenius norm. Ignoring the noise for simplicity, the main update equation for estimation of $\mathbf{W}(\omega_i)$ for the $i$-th FFT frequency is given as [2]

$$\mathbf{W}_{j+1}(\omega_i) = \mathbf{W}_j(\omega_i) - \mu \sum_{k=1}^{K} \mathbf{E}(\omega_i, k)\,\mathbf{W}_j(\omega_i)\,\mathbf{R}_x(\omega_i, k) \tag{7}$$

where $j$, $K$ and $\mu$ are the iteration index, the number of FFT points and the learning rate respectively. $\mathbf{W}$ is updated for all the frequency bins $\omega_i$ and each time is initialized to the identity matrix. In the following section we use the spatial information indicating the positions and directions of the sources, using the "data" acquired instantaneously by a number of video cameras. The separation process is then constrained by such information. The comparison between the original Parra and Spence algorithm and the proposed multimodal constrained FDCBSS algorithm is presented at the end.

III-969    1-4244-0728-1/07/$20.00 ©2007 IEEE    ICASSP 2007
Authorized licensed use limited to: LOUGHBOROUGH UNIVERSITY. Downloaded on December 10, 2009 at 04:58 from IEEE Xplore. Restrictions apply.

2. THE CONSTRAINED PROBLEM

Given the positions of the speakers and the microphones, the distances $d_{ij}$ between the $i$th microphone and the $j$th speaker, and also the corresponding propagation times $\tau_{ij}$, can be calculated (see Figure 1 for a simple two-speaker two-microphone case).
Accordingly, in a homogeneous medium such as air, the attenuation is related to the distances via

$$\alpha_{ij} = \frac{\kappa}{d_{ij}^2} \tag{8}$$

where $\kappa$ is a constant representing the attenuation per unit length in a homogeneous medium. Similarly, $\tau_{ij}$, in terms of the number of samples, is proportional to the sampling frequency $f_s$ and the distance $d_{ij}$, and inversely proportional to the sound velocity $C$:

$$\tau_{ij} = \frac{f_s}{C}\, d_{ij} \tag{9}$$

which is independent of the directionality. Both $f_s$ and $C$ are considered constant within each observation block for a block-based BSS system, or slowly varying in a real-time BSS process. However, in practical situations the speakers' directions introduce another variable into the attenuation measurement. In the case of electronic loudspeakers (not humans) the directionality pattern depends on the type of loudspeaker. Here, we approximate this pattern as $\cos(\theta_{ij}/r)$ where $r > 2$, and $r$ has a smaller value for highly directional speakers and vice versa (an accurate profile can easily be measured using an SPL meter). Therefore, the attenuation parameters become

$$\alpha_{ij} = \frac{\kappa}{d_{ij}^2} \cos(\theta_{ij}/r) \tag{10}$$

If, for simplicity, only the direct path is considered, the mixing filter is expected to have the form

$$\hat{\mathbf{H}}(t) = \begin{bmatrix} \alpha_{11}\,\delta(t-\tau_{11}) & \alpha_{12}\,\delta(t-\tau_{12}) \\ \alpha_{21}\,\delta(t-\tau_{21}) & \alpha_{22}\,\delta(t-\tau_{22}) \end{bmatrix} \tag{11}$$

which in the frequency domain has the form

$$\hat{\mathbf{H}}(\omega) = \begin{bmatrix} \alpha_{11} e^{-j\omega\tau_{11}} & \alpha_{12} e^{-j\omega\tau_{12}} \\ \alpha_{21} e^{-j\omega\tau_{21}} & \alpha_{22} e^{-j\omega\tau_{22}} \end{bmatrix} = \begin{bmatrix} \alpha_{11} z^{-\tau_{11}} & \alpha_{12} z^{-\tau_{12}} \\ \alpha_{21} z^{-\tau_{21}} & \alpha_{22} z^{-\tau_{22}} \end{bmatrix} \tag{12}$$

Although the actual mixing matrix includes the reverberation terms related to the reflection of sounds by obstacles and walls, in such a room environment it will always contain the direct path components as in the above equations. Therefore, we can consider $\hat{\mathbf{H}}$ as a biased estimate of the mixing filter and set the following constraint, which minimizes the Frobenius norm distance between the unmixing filter $\mathbf{W}$ and the permuted inverse of the estimated mixing filter $\hat{\mathbf{H}}$, i.e.

$$J_c = \|\mathbf{W} - \mathbf{P}\hat{\mathbf{H}}^{-1}\|_F^2 = \|\mathrm{vec}(\mathbf{W} - \mathbf{P}\hat{\mathbf{H}}^{-1})\|_2^2 \tag{13}$$

where $\|\cdot\|_2^2$ denotes the squared Euclidean norm, $\mathrm{vec}(\cdot)$ converts a matrix argument column-wise into a column vector, and $\mathbf{P}$ is the permutation matrix. Ultimately, the cost function $J_c$ has to be minimized with respect to both $\mathbf{W}$ and $\mathbf{P}$.

Fig. 1. A two-speaker two-microphone setup for recording within a reverberating (room) environment; only distances and angles between sources and microphones are shown.

3. THE OVERALL CONSTRAINED BSS

In order to achieve the above goal, we need to minimize $J_m$ and $J_c$ jointly with respect to $\mathbf{W}$, and also minimise $J_c$ with respect to the permutation matrix $\mathbf{P}$. The constrained optimisation problem can be changed to an unconstrained one using a Lagrangian approach or by means of a penalty function as in [8]. In this case

$$J(\mathbf{W}(\omega)) = J_m(\mathbf{W}(\omega)) + \lambda J_c(\mathbf{W}(\omega)) \tag{14}$$

where $\lambda$ is the Lagrange multiplier. $\mathbf{W}$ and $\mathbf{P}$ are then found by minimizing $J$ and $J_c$ respectively, i.e.

$$\mathbf{W}_{opt}(\omega) = \arg\min_{\mathbf{W}} \{J_m(\mathbf{W}(\omega)) + \lambda J_c(\mathbf{W}(\omega))\} \tag{15}$$

and

$$\mathbf{P}_{opt}(\omega) = \arg\min_{\mathbf{P}} \{J_c(\mathbf{W}(\omega))\} \tag{16}$$

Therefore, at each frequency bin $\omega_i$ the estimated sources will be aligned with the input source signals; a major advantage of this algorithm is that, in general, no permutation problem remains. Consequently, the update equations are obtained as:

$$\mathbf{W}_{j+1}(\omega) = \mathbf{W}_j(\omega) - \mu \nabla_{\mathbf{W}}(J(\mathbf{W}_j(\omega))) \tag{17}$$

$$\mathbf{P}_{j+1}(\omega) = \mathbf{P}_j(\omega) - \eta \nabla_{\mathbf{P}}(J_c(\mathbf{W}_j(\omega))) \tag{18}$$

where $j$ is the iteration index, $\mu$ and $\eta$ are the learning rates, and

$$\nabla_{\mathbf{W}^*}(J(\mathbf{W})) = \nabla_{\mathbf{W}^*}(J_m(\mathbf{W})) + \lambda \nabla_{\mathbf{W}^*}(J_c(\mathbf{W})) = 2\sum_{k=1}^{K} \mathbf{E}(\omega, k)\,\mathbf{W}(\omega)\,\mathbf{R}_x(\omega, k) + 2\lambda\,[\mathbf{W}(\omega) - \mathbf{P}(\omega)\hat{\mathbf{H}}^{-1}(\omega)] \tag{19}$$

and

$$\nabla_{\mathbf{P}}(J_c(\mathbf{W})) = -2\,\hat{\mathbf{H}}^{-1}(\omega)\,[\mathbf{W}(\omega) - \mathbf{P}(\omega)\hat{\mathbf{H}}^{-1}(\omega)] \tag{20}$$

Before starting the update process, $\hat{\mathbf{H}}^{-1}(\omega)$ is normalised once using $\hat{\mathbf{H}}^{-1}(\omega) \leftarrow \hat{\mathbf{H}}^{-1}(\omega)/\|\hat{\mathbf{H}}^{-1}(\omega)\|_F$, where $\|\cdot\|_F$ denotes the Frobenius norm, and after each iteration $\mathbf{W}(\omega_i)$ is also normalised.
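The geometric model and the constrained update can be sketched for a single frequency bin as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the function names, the sound velocity of 343 m/s, $\kappa = 1$ and the 8 kHz sampling frequency are hypothetical choices; $\mathbf{\Lambda}_s$ is assumed to be the diagonal of $\mathbf{W}\mathbf{R}_x\mathbf{W}^H$ (so that $\mathbf{E}$ holds only off-diagonal terms); $\mathbf{R}_x$ is passed as a list of matrices indexed by $k$; and the gradient update (18) for $\mathbf{P}$ is replaced by an exhaustive search over permutation matrices, which evaluates (16) directly and is practical only for a small number of sources.

```python
import itertools
import numpy as np


def direct_path_mixing(d, theta, omega, fs, c=343.0, kappa=1.0, r=4.0):
    """Direct-path estimate H_hat(omega) from (9), (10) and (12).

    d[i, j]: distance in metres between microphone i and speaker j,
    theta[i, j]: angle in radians, omega: normalised frequency (rad/sample).
    """
    alpha = kappa / d**2 * np.cos(theta / r)       # attenuation, (10)
    tau = fs / c * d                               # delay in samples, (9)
    return alpha * np.exp(-1j * omega * tau)       # (12)


def normalised_inverse(H_hat):
    """Inverse of H_hat scaled to unit Frobenius norm."""
    A = np.linalg.inv(H_hat)
    return A / np.linalg.norm(A, "fro")


def best_permutation(W, A):
    """Permutation matrix minimising J_c = ||W - P A||_F^2, cf. (13), (16)."""
    m = W.shape[0]
    candidates = (np.eye(m)[list(p)] for p in itertools.permutations(range(m)))
    return min(candidates, key=lambda P: np.linalg.norm(W - P @ A, "fro") ** 2)


def constrained_step(W, Rx, A, P, mu, lam):
    """One update (17) with the gradient (19), noise ignored."""
    W = np.asarray(W, dtype=complex)
    grad = 2 * lam * (W - P @ A)                   # penalty term of (19)
    for Rk in Rx:                                  # Rx: list of R_x(omega, k)
        C = W @ Rk @ W.conj().T
        E = C - np.diag(np.diag(C))                # (6), Lambda_s = diag(C) assumed
        grad = grad + 2 * E @ W @ Rk
    W_new = W - mu * grad
    return W_new / np.linalg.norm(W_new, "fro")    # renormalise W after the step


# Distances and angles as reported for the room recording in Section 4;
# fs = 8 kHz and omega = 0.1 * pi are assumed values for illustration.
d = np.array([[0.24, 0.50], [0.40, 0.18]])
theta = np.deg2rad([[60.0, 5.0], [45.0, 45.0]])
A = normalised_inverse(direct_path_mixing(d, theta, omega=0.1 * np.pi, fs=8000.0))
W = np.eye(2, dtype=complex)
P = best_permutation(W, A)
```

In a full system these steps would be repeated for every frequency bin $\omega_i$, with $\mathbf{R}_x(\omega_i, k)$ estimated from the STFT of the microphone signals.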
In the case of fractional filters, where the distances between the speakers and the microphones are not integer multiples of the sampling interval, a previously developed algorithm that first estimates the fractional delay and then performs the BSS process [9] [10] can be used.

4. EXPERIMENTAL RESULTS

Two experiments were carried out. In the first experiment two single tones were used. The mixing matrix $\mathbf{H}$ was carefully chosen to model the room environment and $\hat{\mathbf{H}}$ was selected to include only the direct path and the angles of departure $\theta$ (with $r = 4$). Both the Parra and Spence algorithm and the proposed constrained FDCBSS were employed, and the signal-to-interference ratio (SIR) was calculated as [2]

$$\mathrm{SIR} = \frac{\sum_i \sum_\omega |H_{ii}(\omega)|^2 \langle |S_i(\omega)|^2 \rangle}{\sum_i \sum_{i \neq j} \sum_\omega |H_{ij}(\omega)|^2 \langle |S_j(\omega)|^2 \rangle} \tag{21}$$

Using the Parra and Spence algorithm the SIR was 6.1 dB, and using the proposed constrained CBSS the SIR achieved was 9.25 dB. The improvement of approximately 3 dB results not only from the application of the geometrical constraints but also from solving the permutation problem. Two major drawbacks of the system are the slight increase in complexity and a potentially slower rate of convergence.

In the second experiment, the Parra and Spence algorithm and the proposed CBSS were tested on a real room recording. The variables were selected as: $d_{11} = 24$ cm, $d_{12} = 50$ cm, $d_{21} = 40$ cm, $d_{22} = 18$ cm, $r = 4$, $\theta_{11} = 60^\circ$, $\theta_{12} = 5^\circ$, $\theta_{21} = 45^\circ$, and $\theta_{22} = 45^\circ$. $\lambda$ is chosen empirically (here $\lambda = 0.15$) and the learning rates $\mu$ and $\eta$ are gradually decreased with respect to the iteration index $j$:

$$\mu_j = \eta_j = \gamma\,\frac{0.02}{1 - (0.98)^j} \tag{22}$$

where $\gamma$ is a constant equal to 0.01. Figure 2 shows the original signals recorded by a pair of microphones placed very close to the speakers' mouths, the mixed signals, the separated signals using the Parra and Spence algorithm, and the estimated signals using our proposed CBSS method. SIRs for $P = Q = 1024$ have been calculated according to [2].
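The SIR measure (21) can be sketched as below. This is only an illustration: the response in (21) is interpreted here, following the usual convention of [2], as the overall mixing-unmixing response per frequency bin (called `G` in the code), which is an assumption, as are the function name, the array layout and the conversion to decibels.

```python
import numpy as np


def sir_db(G, S_power):
    """SIR of (21), returned in dB.

    G: (F, m, m) response per frequency bin,
    S_power: (F, m) array holding <|S_j(omega)|^2> for each source j.
    """
    # power[w, i, j] = |G_ij(w)|^2 <|S_j(w)|^2>
    power = np.abs(G) ** 2 * S_power[:, None, :]
    signal = np.trace(power, axis1=1, axis2=2).sum()   # i == j terms
    interference = power.sum() - signal                # i != j terms
    return 10.0 * np.log10(signal / interference)


# Hypothetical response with 10 % amplitude leakage between the two channels.
G = np.tile(np.array([[1.0, 0.1], [0.1, 1.0]]), (4, 1, 1))
print(sir_db(G, np.ones((4, 2))))
```

For this toy response the power ratio is 100, so the printed value is about 20 dB.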
In this experiment we achieved $\mathrm{SIR}_{Parra's} = 6.8$ dB and $\mathrm{SIR}_{CBSS} = 9.4$ dB, which shows a marked improvement. In addition, the filter length (number of DFT points) may be changed according to the room geometry to obtain even better results. Table 1 shows the SIR values for both experiments. Figure 3 illustrates the convergence graph of the cost function within the last frequency bin for both the Parra and Spence algorithm and the proposed CBSS method. As expected, with the constraint term the convergence is slightly slower and the complexity of the system is higher.

Fig. 2. (a) The original signals recorded by very close microphones, (b) the mixed signals, (c) the separated sources using the Parra and Spence algorithm, and (d) the estimated sources using the constrained FDCBSS.

Table 1. Comparison between the Parra and Spence algorithm and the proposed method for different sets of mixtures.

Signal               Parra's method SIR (dB)    Constrained FDCBSS SIR (dB)
Sinusoidal signal    6.1                        9.25
Speech signal        6.8                        9.4

Fig. 3. The convergence graphs for both the Parra and Spence algorithm and the proposed constrained FDCBSS algorithm, for the last frequency bin only.

5. SUMMARY AND CONCLUSIONS

In this paper the conventional FDCBSS algorithm has been modified by accommodating the geometrical information about the sources in a multimodal BSS approach. The location and direction information have been obtained using a number of cameras equipped with a speaker tracking algorithm. The constrained problem has been partially changed to an unconstrained problem using Lagrange multipliers. The results show that the modified CBSS system enhances the performance of the traditional FDBSS system both objectively and subjectively. The outcome of this approach paves the way for establishing a multimodal audio-video system for separation of speech and music signals.

6. REFERENCES

[1] A. S. Bregman, Auditory scene analysis, MIT Press, Cambridge, MA, 1990.
[2] L. Parra and C. Spence, "Convolutive blind separation of non-stationary sources," IEEE Trans. on Speech and Audio Processing, vol. 8, no. 3, pp. 320–327, 2000.
[3] A. Cichocki and S. Amari, Adaptive Blind Signal and Image Processing: Learning Algorithms and Applications, John Wiley, 2002.
[4] P. Smaragdis, "Blind separation of convolved mixtures in the frequency domain," Neurocomputing, vol. 22, pp. 21–34, 1998.
[5] W. Wang, S. Sanei, and J. A. Chambers, "A joint diagonalization method for convolutive blind separation of nonstationary sources in the frequency domain," Proc. ICA, Nara, Japan, April 2003.
[6] W. Wang, D. Cosker, Y. Hicks, S. Sanei, and J. A. Chambers, "Video assisted speech source separation," Proc. IEEE ICASSP 2005, Pennsylvania, USA, March 19-23.
[7] S. Haykin, Adaptive filters, John Wiley, 1994.
[8] W. Wang, S. Sanei, and J. A. Chambers, "Penalty function based joint diagonalization approach for convolutive blind separation of nonstationary sources," IEEE Trans. Signal Processing, vol. 53, no. 5, pp. 1654–1669, 2005.
[9] C. Cheong Took, K. Nazarpour, S. Sanei, and J. A. Chambers, "Fractional differential delay estimation in local sparse component analysis of temporomandibular joint sounds," submitted to IEEE Transactions on Biomedical Engineering, June 2006.
[10] C. Cheong Took, K. Nazarpour, S. Sanei, and J. Chambers, "Blind separation of temporomandibular joint sounds by incorporating fractional delay estimation," Proc. of IEE IMA 2006, Cirencester, UK.

A geometrically constrained multimodal approach for convolutive blind source separation

https://repository.lboro.ac.uk/articles/A_geometrically_constrained_multimodal_approach_for_convolutive_blind_source_separation/9556751/files/17188649.pdf
