
    Improving speaker recognition by biometric voice deconstruction

    Person identification, especially in critical environments, has always been a subject of great interest. However, it has gained a new dimension in a world threatened by a new kind of terrorism that uses social networks (e.g., YouTube) to broadcast its message. In this new scenario, classical identification methods (such as fingerprints or face recognition) have had to be replaced by alternative biometric characteristics such as voice, as sometimes this is the only feature available. The present study benefits from the advances achieved in recent years in understanding and modeling voice production. The paper hypothesizes that a gender-dependent characterization of speakers, combined with a set of features derived from the components resulting from the deconstruction of the voice into its glottal-source and vocal-tract estimates, will enhance recognition rates compared to classical approaches. A general description of the main hypothesis and the methodology followed to extract the gender-dependent extended biometric parameters is given. Experimental validation is carried out both on a database recorded under highly controlled acoustic conditions and on one recorded over a mobile phone network under non-controlled acoustic conditions.
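The abstract does not give the paper's exact signal model, but deconstructing voice into vocal-tract and glottal-source estimates is commonly approximated by linear-prediction (LPC) inverse filtering: fit an all-pole vocal-tract model to a speech frame, then pass the frame through the inverse filter to obtain a glottal residual. A minimal pure-Python sketch under that assumption (function names and the toy AR signal are illustrative, not from the paper):

```python
def autocorr(x, order):
    """Autocorrelation lags r[0..order] of a frame x."""
    n = len(x)
    return [sum(x[i] * x[i + k] for i in range(n - k)) for k in range(order + 1)]

def levinson_durbin(r, order):
    """Solve for LPC coefficients of A(z) = 1 + a[1]z^-1 + ... from autocorrelation r."""
    a = [0.0] * (order + 1)
    a[0] = 1.0
    e = r[0]                      # prediction-error energy
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / e              # reflection coefficient
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        e *= (1.0 - k * k)
    return a, e

def inverse_filter(x, a):
    """Filter x through A(z) to estimate the glottal-source residual."""
    order = len(a) - 1
    y = []
    for n in range(len(x)):
        s = x[n]
        for k in range(1, order + 1):
            if n - k >= 0:
                s += a[k] * x[n - k]
        y.append(s)
    return y
```

On a frame that fits the all-pole model well, the residual carries much less energy than the frame itself, which is what makes the source/tract split usable as a feature extractor.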

    Efficient speaker recognition for mobile devices


    Comparison GMM and SVM Classifier for Automatic Speaker Verification

    The objective of this thesis is to develop automatic text-independent speaker verification systems using unconstrained telephone conversational speech. We began by performing a Gaussian Mixture Model likelihood-ratio verification task in a speaker-independent system as described by MIT Lincoln Laboratory. We next introduced a speaker-dependent verification system based on speaker-dependent thresholds. We then implemented the same system using Support Vector Machines, comparing the performance of polynomial kernels and radial basis function kernels. For training and testing the systems, we used low-level spectral features. Finally, we provided a performance assessment of these systems using the National Institute of Standards and Technology (NIST) 2008 speaker recognition evaluation telephone corpora.
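The GMM likelihood-ratio test described above scores a test utterance by the difference between its log-likelihood under a speaker model and under a background model, averaged over frames. A minimal diagonal-covariance sketch with fixed, illustrative parameters (the thesis trains its models on spectral features; the numbers below are toy values, not from the thesis):

```python
import math

def gauss_logpdf(x, mean, var):
    """Log-density of a diagonal Gaussian at feature vector x."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

def gmm_loglik(x, weights, means, vars_):
    """log p(x | GMM) with diagonal covariances, via log-sum-exp."""
    logs = [math.log(w) + gauss_logpdf(x, m, v)
            for w, m, v in zip(weights, means, vars_)]
    mx = max(logs)
    return mx + math.log(sum(math.exp(l - mx) for l in logs))

def llr_score(frames, spk, ubm):
    """Average per-frame log-likelihood ratio, as in a GMM-UBM verifier."""
    return sum(gmm_loglik(f, *spk) - gmm_loglik(f, *ubm) for f in frames) / len(frames)
```

A claimed identity is accepted when the score exceeds a decision threshold; the thesis's speaker-dependent variant simply learns that threshold per speaker.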

    Robust speaker identification using artificial neural networks

    This research focuses on recognizing speakers from their speech samples. Numerous text-dependent and text-independent algorithms have been developed to recognize a speaker from his/her speech. In this thesis, we concentrate on recognizing the speaker from fixed text, i.e., the text-dependent case; the possibility of extending this method to variable text, i.e., the text-independent case, is also analyzed. Different feature extraction algorithms are employed, and their performance with artificial neural networks as a data classifier on a fixed training set is analyzed. We find a way to combine these individual feature extraction algorithms by incorporating their interdependence. The efficiency of these algorithms is determined after the input speech is classified using the back-propagation algorithm of artificial neural networks. A special case of the back-propagation algorithm which improves the efficiency of the classification is also discussed.
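The thesis trains multi-layer networks by back-propagation; as a self-contained illustration of the underlying gradient-descent update, here is its single-neuron special case (logistic regression trained on toy two-speaker feature clusters — the data, dimensions, and hyperparameters are illustrative assumptions, not the thesis's configuration):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(samples, labels, lr=0.5, epochs=200):
    """Gradient descent on cross-entropy loss: the single-neuron case of backprop."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            g = p - y                                   # d(loss)/d(pre-activation)
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def predict(x, w, b):
    return 1 if sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) >= 0.5 else 0
```

In the full network the same `g`-style error term is propagated backward through each layer's weights, which is where the algorithm gets its name.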

    Automated Testing of Speech-to-Speech Machine Translation in Telecom Networks

    In the globalizing world, the ability to communicate over language barriers is increasingly important. Learning languages is laborious, which is why there is a strong desire to develop automatic machine translation applications. Ericsson has developed a speech-to-speech translation prototype called the Real-Time Interpretation System (RTIS).
The service runs in a mobile network and translates travel phrases between two languages in speech form. State-of-the-art machine translation systems suffer from relatively poor performance, so evaluation plays a large role in machine translation development. The purpose of evaluation is to ensure that the system preserves translational equivalence and, in the case of a speech-to-speech system, speech quality. Evaluation is most reliably done by human judges; however, human-conducted evaluation is costly and subjective. In this thesis, a test environment for the Ericsson Real-Time Interpretation System prototype is designed and analyzed. The goals are to investigate whether RTIS verification can be conducted automatically and whether the test environment can truthfully measure the end-to-end performance of the system. The results show that methods used for end-to-end speech quality verification in mobile networks cannot be optimally adapted to machine translation evaluation. With current knowledge, human-conducted evaluation is the only method that can truthfully measure translational equivalence and speech intelligibility. Automating machine translation evaluation needs further research, and until then human-conducted evaluation should remain the preferred method in RTIS verification.
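The thesis concludes that automatic metrics fall short of human judgment for translation equivalence, but the automatic measures it weighs against are typically built on clipped n-gram precision, the core of BLEU-style scoring. A minimal sketch of that building block (not the thesis's evaluation pipeline):

```python
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, reference, n):
    """Clipped n-gram precision: candidate n-gram counts are capped
    by their counts in the reference, so repetition is not rewarded."""
    cand = Counter(ngrams(candidate, n))
    ref = Counter(ngrams(reference, n))
    if not cand:
        return 0.0
    clipped = sum(min(c, ref[g]) for g, c in cand.items())
    return clipped / sum(cand.values())
```

Full BLEU combines these precisions over several n with a brevity penalty; even then, as the thesis notes, such surface-overlap scores can miss meaning-level equivalence that human judges catch.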

    Speaker Recognition in Unconstrained Environments

    Speaker recognition is applied in smart home devices, interactive voice response systems, call centers, online banking and payment solutions, as well as in forensic scenarios. This dissertation is concerned with speaker recognition systems in unconstrained environments, where research on making better decisions had previously been insufficient. Aside from decision making, unconstrained environments imply two other subjects: security and privacy. Within the scope of this dissertation, these are treated as security against short-term replay attacks and as privacy preservation within state-of-the-art biometric voice comparators in the light of a potential leak of biometric data. These research subjects are united here to sustain good decision making under uncertainty from varying signal quality, to strengthen security, and to preserve privacy. Conventionally, biometric comparators are trained to classify between mated and non-mated reference–probe pairs under idealistic conditions but are expected to operate well in the real world. However, the more the voice signal quality degrades, the more erroneous decisions are made; the severity of their impact depends on the requirements of the biometric application. In this dissertation, quality estimates are proposed and employed to make better decisions on average in a formalized, quantitative way, while the decision requirements of a particular biometric application remain unknown. Using the Bayesian decision framework, the specification of application-dependent decision requirements is formalized, outlining the operating points: the decision thresholds. The assessed quality conditions combine ambient and biometric noise, both of which occur in commercial as well as forensic application scenarios. Dual-use (civil and governmental) technology is investigated.
As it seems unfeasible to train systems for every possible signal degradation, a small number of quality conditions is used. After examining the impact of degrading signal quality on biometric feature extraction, the extraction is assumed ideal in order to conduct a fair benchmark. This dissertation proposes and investigates methods for propagating quality information to decision making. By employing quality estimates, a biometric system's output (comparison scores) is normalized so that each score encodes the least-favorable decision trade-off in its value. Application development is thus segregated from requirement specification, and both class discrimination and score calibration performance are improved over all decision requirements for real-world applications.

In contrast to the ISO/IEC 19795-1:2006 standard on biometric performance (error rates), this dissertation is based on biometric inference for probabilistic decision making (subject to prior probabilities and cost terms). It elaborates on the paradigm shift from requirements stated as error rates to requirements stated as beliefs in priors and costs. Binary decision error trade-off plots are proposed, interrelating error rates with prior and cost beliefs, i.e., formalized decision requirements. Verbal tags are introduced to summarize categories of least-favorable decisions; the plot's canvas follows from Bayesian decision theory, and empirical error rates are plotted with categories of decision trade-offs encoded by line styles. Performance is visualized in the latent decision subspace for evaluating empirical performance with respect to changes in prior- and cost-based decision requirements. Security against short-term audio replay attacks (a collage of sound units such as phonemes and syllables) is strengthened. The unit-selection attack is posed by the ASVspoof 2015 challenge (English speech data), representing the most difficult to detect voice presentation attack of that challenge.
In this dissertation, unit-selection attacks are created for German speech data, and support vector machine and Gaussian mixture model classifiers are trained to detect collage edges in speech representations based on wavelet and Fourier analyses. Competitive results are reached compared to the challenge submissions.

Homomorphic encryption is proposed to preserve the privacy of biometric information in the case of database leakage. Log-likelihood ratio scores, representing biometric evidence objectively, are computed in the latent biometric subspace. Whereas conventional comparators rely on the feature extraction to ideally represent biometric information, latent-subspace comparators are trained to find ideal representations of the biometric information in the voice reference and probe samples being compared. Two protocols are proposed for the two-covariance comparison model, a special case of probabilistic linear discriminant analysis. Log-likelihood ratio scores are computed in the encrypted domain based on encrypted representations of the biometric reference and probe. As a consequence, the biometric information conveyed in voice samples is, in contrast to many existing protection schemes, stored protected and without information loss. The first protocol preserves the privacy of end-users, requiring one public/private key pair per biometric application; the second protocol preserves the privacy of end-users and comparator vendors with two key pairs. Since comparators estimate the biometric evidence in the latent subspace, the subspace model requires data protection as well. In both protocols, log-likelihood-ratio-based decision making meets the requirements of the ISO/IEC 24745:2011 biometric information protection standard in terms of the unlinkability, irreversibility, and renewability properties of the protected voice data.
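The Bayesian decision framework this dissertation builds on turns prior and cost beliefs into an explicit log-likelihood-ratio threshold: accept when the LLR exceeds log(C_fa·P_non / (C_miss·P_tar)). A minimal sketch (parameter names are illustrative; the dissertation's own formalization is more elaborate):

```python
import math

def bayes_threshold(p_target, c_miss, c_fa):
    """LLR decision threshold minimizing expected cost for given prior and costs."""
    p_non = 1.0 - p_target
    return math.log((c_fa * p_non) / (c_miss * p_target))

def decide(llr, p_target, c_miss, c_fa):
    """Accept the mated hypothesis iff the evidence outweighs the prior/cost odds."""
    return "accept" if llr > bayes_threshold(p_target, c_miss, c_fa) else "reject"
```

Equal priors and costs give a threshold of zero; a rare target or an expensive false accept pushes the threshold up, which is exactly the "requirements by beliefs in priors and costs" view the abstract describes.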

    INFLUENCE OF SPECIFIC VOIP TRANSMISSION CONDITIONS ON SPEAKER RECOGNITION PROBLEM

    The paper presents the problem of signal degradation in packet-based voice transmission and its influence on voice recognition correctness. The Internet is evolving into a universal communication network that carries all types of traffic, including data, video, and voice. Among these, Internet telephony (VoIP) is becoming an application of great importance, which is why it is important to assess how specific conditions and distortions of Internet transmission (speech coding and, above all, packet loss and delay) can influence the speaker recognition problem. Gaussian Mixture Model classification, feature extraction, the Internet speech transmission standards, and the signal degradation methodology applied in the tested system are overviewed. The experiments, carried out for the two most commonly applied encoders (G.711 and G.723) and three network conditions (poor, average, and no packet loss), revealed a minor significance of the packet loss problem in the tested text-independent system.
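The paper's exact degradation methodology is only overviewed in the abstract, but bursty IP packet loss of the kind studied is commonly simulated with a two-state Gilbert model. A sketch under that assumption (the transition probabilities and frame handling are illustrative, not the paper's settings):

```python
import random

def gilbert_loss(n_frames, p_loss_to_good=0.3, p_good_to_loss=0.05, seed=None):
    """Two-state Gilbert model producing bursty packet-loss patterns:
    once in the loss state, consecutive frames tend to be lost together."""
    rng = random.Random(seed)
    lost = []
    state_loss = False
    for _ in range(n_frames):
        if state_loss:
            if rng.random() < p_loss_to_good:
                state_loss = False
        else:
            if rng.random() < p_good_to_loss:
                state_loss = True
        lost.append(state_loss)
    return lost

def apply_loss(frames, lost):
    """Drop lost frames (a crude stand-in for a codec's concealment strategy)."""
    return [f for f, l in zip(frames, lost) if not l]
```

Feeding the surviving frames of a test utterance into a GMM-based recognizer lets one measure how much the loss pattern, rather than the codec alone, degrades recognition, which is the comparison the paper runs across its poor/average/lossless conditions.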
