503 research outputs found
Subspace Gaussian Mixture Models for Language Identification and Dysarthric Speech Intelligibility Assessment
En esta Tesis se ha investigado la aplicación de técnicas de modelado de subespacios de mezclas de Gaussianas en dos problemas relacionados con las tecnologías del habla, como son la identificación automática de idioma (LID, por sus siglas en inglés) y la evaluación automática de inteligibilidad en el habla de personas con disartria. Una de las técnicas más importantes estudiadas es el análisis factorial conjunto (JFA, por sus siglas en inglés). JFA es, en esencia, un modelo de mezclas de Gaussianas en el que la media de cada componente se expresa como una suma de factores de dimensión reducida, y donde cada factor representa una contribución diferente a la señal de audio. Esta factorización nos permite compensar nuestros modelos frente a contribuciones indeseadas presentes en la señal, como la información de canal. JFA se ha investigado como clasficador y como extractor de parámetros. En esta última aproximación se modela un solo factor que representa todas las contribuciones presentes en la señal. Los puntos en este subespacio se denominan i-Vectors. Así, un i-Vector es un vector de baja dimensión que representa una grabación de audio. Los i-Vectors han resultado ser muy útiles como vector de características para representar señales en diferentes problemas relacionados con el aprendizaje de máquinas. En relación al problema de LID, se han investigado dos sistemas diferentes de acuerdo al tipo de información extraída de la señal. En el primero, la señal se parametriza en vectores acústicos con información espectral a corto plazo. En este caso, observamos mejoras de hasta un 50% con el sistema basado en i-Vectors respecto al sistema que utilizaba JFA como clasificador. Se comprobó que el subespacio de canal del modelo JFA también contenía información del idioma, mientras que con los i-Vectors no se descarta ningún tipo de información, y además, son útiles para mitigar diferencias entre los datos de entrenamiento y de evaluación. En la fase de clasificación, los i-Vectors de cada idioma se modelaron con una distribución Gaussiana en la que la matriz de covarianza era común para todos. Este método es simple y rápido, y no requiere de ningún post-procesado de los i-Vectors. En el segundo sistema, se introdujo el uso de información prosódica y formántica en un sistema de LID basado en i-Vectors. La precisión de éste estaba por debajo de la del sistema acústico. Sin embargo, los dos sistemas son complementarios, y se obtuvo hasta un 20% de mejora con la fusión de los dos respecto al sistema acústico solo. Tras los buenos resultados obtenidos para LID, y dado que, teóricamente, los i-Vectors capturan toda la información presente en la señal, decidimos usarlos para la evaluar de manera automática la inteligibilidad en el habla de personas con disartria. Los logopedas están muy interesados en esta tecnología porque permitiría evaluar a sus pacientes de una manera objetiva y consistente. En este caso, los i-Vectors se obtuvieron a partir de información espectral a corto plazo de la señal, y la inteligibilidad se calculó a partir de los i-Vectors obtenidos para un conjunto de palabras dichas por el locutor evaluado. Comprobamos que los resultados eran mucho mejores si en el entrenamiento del sistema se incorporaban datos de la persona que iba a ser evaluada. No obstante, esta limitación podría aliviarse utilizando una mayor cantidad de datos para entrenar el sistema.In this Thesis, we investigated how to effciently apply subspace Gaussian mixture modeling techniques onto two speech technology problems, namely automatic spoken language identification (LID) and automatic intelligibility assessment of dysarthric speech. One of the most important of such techniques in this Thesis was joint factor analysis (JFA). JFA is essentially a Gaussian mixture model where the mean of the components is expressed as a sum of low-dimension factors that represent different contributions to the speech signal. This factorization makes it possible to compensate for undesired sources of variability, like the channel. JFA was investigated as final classiffer and as feature extractor. In the latter approach, a single subspace including all sources of variability is trained, and points in this subspace are known as i-Vectors. Thus, one i-Vector is defined as a low-dimension representation of a single utterance, and they are a very powerful feature for different machine learning problems. We have investigated two different LID systems according to the type of features extracted from speech. First, we extracted acoustic features representing short-time spectral information. In this case, we observed relative improvements with i-Vectors with respect to JFA of up to 50%. We realized that the channel subspace in a JFA model also contains language information whereas i-Vectors do not discard any language information, and moreover, they help to reduce mismatches between training and testing data. For classification, we modeled the i-Vectors of each language with a Gaussian distribution with covariance matrix shared among languages. This method is simple and fast, and it worked well without any post-processing. Second, we introduced the use of prosodic and formant information with the i-Vectors system. The performance was below the acoustic system but both were found to be complementary and we obtained up to a 20% relative improvement with the fusion with respect to the acoustic system alone. Given the success in LID and the fact that i-Vectors capture all the information that is present in the data, we decided to use i-Vectors for other tasks, specifically, the assessment of speech intelligibility in speakers with different types of dysarthria. Speech therapists are very interested in this technology because it would allow them to objectively and consistently rate the intelligibility of their patients. In this case, the input features were extracted from short-term spectral information, and the intelligibility was assessed from the i-Vectors calculated from a set of words uttered by the tested speaker. We found that the performance was clearly much better if we had available data for training of the person that would use the application. We think that this limitation could be relaxed if we had larger databases for training. However, the recording process is not easy for people with disabilities, and it is difficult to obtain large datasets of dysarthric speakers open to the research community. Finally, the same system architecture for intelligibility assessment based on i-Vectors was used for predicting the accuracy that an automatic speech recognizer (ASR) system would obtain with dysarthric speakers. The only difference between both was the ground truth label set used for training. Predicting the performance response of an ASR system would increase the confidence of speech therapists in these systems and would diminish health related costs. The results were not as satisfactory as in the previous case, probably because an ASR is a complex system whose accuracy can be very difficult to be predicted only with acoustic information. Nonetheless, we think that we opened a door to an interesting research direction for the two problems
Métodos discriminativos para la optimización de modelos en la Verificación del Hablante
La creciente necesidad de sistemas de autenticación seguros ha motivado el interés de algoritmos efectivos de Verificación de Hablante (VH). Dicha necesidad de algoritmos de alto rendimiento, capaces de obtener tasas de error bajas, ha abierto varias ramas de investigación. En este trabajo proponemos investigar, desde un punto de vista discriminativo, un conjunto de metodologías para mejorar el desempeño del estado del arte de los sistemas de VH. En un primer enfoque investigamos la optimización de los hiper-parámetros para explícitamente considerar el compromiso entre los errores de falsa aceptación y falso rechazo. El objetivo de la optimización se puede lograr maximizando el área bajo la curva conocida como ROC (Receiver Operating Characteristic) por sus siglas en inglés. Creemos que esta optimización de los parámetros no debe de estar limitada solo a un punto de operación y una estrategia más robusta es optimizar los parámetros para incrementar el área bajo la curva, AUC (Area Under the Curve por sus siglas en inglés) de modo que todos los puntos sean maximizados. Estudiaremos cómo optimizar los parámetros utilizando la representación matemática del área bajo la curva ROC basada en la estadística de Wilcoxon Mann Whitney (WMW) y el cálculo adecuado empleando el algoritmo de descendente probabilístico generalizado. Además, analizamos el efecto y mejoras en métricas como la curva detection error tradeoff (DET), el error conocido como Equal Error Rate (EER) y el valor mínimo de la función de detección de costo, minimum value of the detection cost function (minDCF) todos ellos por sue siglas en inglés. En un segundo enfoque, investigamos la señal de voz como una combinación de atributos que contienen información del hablante, del canal y el ruido. Los sistemas de verificación convencionales entrenan modelos únicos genéricos para todos los casos, y manejan las variaciones de estos atributos ya sea usando análisis de factores o no considerando esas variaciones de manera explícita. Proponemos una nueva metodología para particionar el espacio de los datos de acuerdo a estas carcterísticas y entrenar modelos por separado para cada partición. Las particiones se pueden obtener de acuerdo a cada atributo. En esta investigación mostraremos como entrenar efectivamente los modelos de manera discriminativa para maximizar la separación entre ellos. Además, el diseño de algoritimos robustos a las condiciones de ruido juegan un papel clave que permite a los sistemas de VH operar en condiciones reales. Proponemos extender nuestras metodologías para mitigar los efectos del ruido en esas condiciones. Para nuestro primer enfoque, en una situación donde el ruido se encuentre presente, el punto de operación puede no ser solo un punto, o puede existir un corrimiento de forma impredecible. Mostraremos como nuestra metodología de maximización del área bajo la curva ROC es más robusta que la usada por clasificadores convencionales incluso cuando el ruido no está explícitamente considerado. Además, podemos encontrar ruido a diferentes relación señal a ruido (SNR) que puede degradar el desempeño del sistema. Así, es factible considerar una descomposición eficiente de las señales de voz que tome en cuenta los diferentes atributos como son SNR, el ruido y el tipo de canal. Consideramos que en lugar de abordar el problema con un modelo unificado, una descomposición en particiones del espacio de características basado en atributos especiales puede proporcionar mejores resultados. Esos atributos pueden representar diferentes canales y condiciones de ruido. Hemos analizado el potencial de estas metodologías que permiten mejorar el desempeño del estado del arte de los sistemas reduciendo el error, y por otra parte controlar los puntos de operación y mitigar los efectos del ruido
Speech Recognition
Chapters in the first part of the book cover all the essential speech processing techniques for building robust, automatic speech recognition systems: the representation for speech signals and the methods for speech-features extraction, acoustic and language modeling, efficient algorithms for searching the hypothesis space, and multimodal approaches to speech recognition. The last part of the book is devoted to other speech processing applications that can use the information from automatic speech recognition for speaker identification and tracking, for prosody modeling in emotion-detection systems and in other speech processing applications that are able to operate in real-world environments, like mobile communication services and smart homes
Metaheuristic Optimization Techniques for Articulated Human Tracking
Four adaptive metaheuristic optimization algorithms are proposed and demonstrated: Adaptive Parameter Particle Swarm Optimization (AP-PSO), Modified Artificial Bat (MAB), Differential Mutated Artificial Immune System (DM-AIS) and hybrid Particle Swarm Accelerated Artificial Immune System (PSO-AIS). The algorithms adapt their search parameters on the basis of the fitness of obtained solutions such that a good fitness value favors local search, while a poor fitness value favors global search. This efficient feedback of the solution quality, imparts excellent global and local search characteristic to the proposed algorithms.
The algorithms are tested on the challenging Articulated Human Tracking (AHT) problem whose objective is to infer human pose, expressed in terms of joint angles, from a continuous video stream. The Particle Filter (PF) algorithms, widely applied in generative model based AHT, suffer from the 'curse of dimensionality' and 'degeneracy' challenges. The four proposed algorithms show stable performance throughout the course of numerical experiments. DM-AIS performs best among the proposed algorithms followed in order by PSO-AIS, AP-PSO, and MBA in terms of Most Appropriate Pose (MAP) tracking error.
The MAP tracking error of the proposed algorithms is compared with four heuristic approaches: generic PF, Annealed Particle Filter (APF), Partitioned Sampled Annealed Particle Filter (PSAPF) and Hierarchical Particle Swarm Optimization (HPSO). They are found to outperform generic PF with a confidence level of 95%, PSAPF and HPSO with a confidence level of 85%. While DM-AIS and PSO-AIS outperform APF with a confidence level of 80%. Further, it is noted that the proposed algorithms outperform PSAPF and HPSO using a significantly lower number of function evaluations, 2500 versus 7200.
The proposed algorithms demonstrate reduced particle requirements, hence improving computational efficiency and helping to alleviate the 'curse of dimensionality'. The adaptive nature of the algorithms is found to guide the whole swarm towards the optimal solution by sharing information and exploring a wider solution space which resolves the 'degeneracy' challenge. Furthermore, the decentralized structure of the algorithms renders them insensitive to accumulation of error and allows them to recover from catastrophic failures due to loss of image data, sudden change in motion pattern or discrete instances of algorithmic failure. The performance enhancements demonstrated by the proposed algorithms, attributed to their balanced local and global search capabilities, makes real-time AHT applications feasible.
Finally, the utility of the proposed algorithms in low-dimensional system identification problems as well as high-dimensional AHT problems demonstrates their applicability in various problem domains
General methods for fine-grained morphological and syntactic disambiguation
We present methods for improved handling of morphologically
rich languages (MRLS) where we define
MRLS as languages that
are morphologically more complex than English. Standard
algorithms for language modeling, tagging and parsing have
problems with the productive nature of such
languages. Consider for example the possible forms of a
typical English verb like work that generally has four
four different
forms: work, works, working
and worked. Its Spanish counterpart trabajar
has 6 different forms in present
tense: trabajo, trabajas, trabaja, trabajamos, trabajáis
and trabajan and more than 50 different forms when
including the different tenses, moods (indicative,
subjunctive and imperative) and participles. Such a high
number of forms leads to sparsity issues: In a recent
Wikipedia dump of more than 400 million tokens we find that
20 of these forms occur only twice or less and that 10 forms
do not occur at all. This means that even if we only need
unlabeled data to estimate a model and even when looking at
a relatively common and frequent verb, we do not have enough
data to make reasonable estimates for some of its
forms. However, if we decompose an unseen form such
as trabajaréis `you will work', we find that it
is trabajar in future tense and second person
plural. This allows us to make the predictions that are
needed to decide on the grammaticality (language modeling)
or syntax (tagging and parsing) of a sentence.
In the first part of this thesis, we develop
a morphological language model. A language model
estimates the grammaticality and coherence of a
sentence. Most language models used today are word-based
n-gram models, which means that they estimate the
transitional probability of a word following a history, the
sequence of the (n - 1) preceding words. The probabilities
are estimated from the frequencies of the history and the
history followed by the target word in a huge text
corpus. If either of the sequences is unseen, the length of
the history has to be reduced. This leads to a less accurate
estimate as less context is taken into account.
Our morphological language model estimates an additional
probability from the morphological classes of the
words. These classes are built automatically by extracting
morphological features from the word forms. To this end, we
use unsupervised segmentation algorithms to find the
suffixes of word forms. Such an algorithm might for example
segment trabajaréis into trabaja
and réis and we can then estimate the properties
of trabajaréis from other word forms with the same or
similar morphological properties. The data-driven nature of
the segmentation algorithms allows them to not only find
inflectional suffixes (such as -réis), but also more
derivational phenomena such as the head nouns of compounds
or even endings such as -tec, which identify
technology oriented companies such
as Vortec, Memotec and Portec and would
not be regarded as a morphological suffix by traditional
linguistics. Additionally, we extract shape features such as
if a form contains digits or capital characters. This is
important because many rare or unseen forms are proper
names or numbers and often do not have meaningful
suffixes. Our class-based morphological model is then
interpolated with a word-based model to combine the
generalization capabilities of the first and the high
accuracy in case of sufficient data of the second.
We evaluate our model across 21 European languages and find
improvements between 3% and 11% in perplexity, a standard
language modeling evaluation measure. Improvements are
highest for languages with more productive and complex
morphology such as Finnish and Estonian, but also visible
for languages with a relatively simple morphology such as
English and Dutch. We conclude that a morphological
component yields consistent improvements for all the tested
languages and argue that it should be part of every language
model.
Dependency trees represent the syntactic structure of a
sentence by attaching each word to its syntactic head, the
word it is directly modifying. Dependency parsing
is usually tackled using heavily lexicalized (word-based)
models and a thorough morphological preprocessing is
important for optimal performance, especially for MRLS. We
investigate if the lack of morphological features can be
compensated by features induced using hidden Markov
models with latent annotations (HMM-LAs)
and find this to be the case for German. HMM-LAs were
proposed as a method to increase part-of-speech tagging
accuracy. The model splits the observed part-of-speech tags
(such as verb and noun) into subtags. An expectation
maximization algorithm is then used to fit the subtags to
different roles. A verb tag for example might be split into
an auxiliary verb and a full verb subtag. Such a split is
usually beneficial because these two verb classes have
different contexts. That is, a full verb might follow an
auxiliary verb, but usually not another full verb.
For German and English, we find that our model leads to
consistent improvements over a parser
not using subtag features. Looking at the labeled attachment
score (LAS), the number of words correctly attached to their head,
we observe an improvement from 90.34 to 90.75 for English
and from 87.92 to 88.24 for German. For German, we
additionally find that our model achieves almost the same
performance (88.24) as a model using tags annotated by a
supervised morphological tagger (LAS of 88.35). We also find
that the German latent tags correlate with
morphology. Articles for example are split by their
grammatical case.
We also investigate the part-of-speech tagging accuracies of
models using the traditional treebank tagset and models
using induced tagsets of the same size and find that the
latter outperform the former, but are in turn outperformed
by a discriminative tagger.
Furthermore, we present a method for fast and
accurate morphological tagging. While
part-of-speech tagging annotates tokens in context with
their respective word categories, morphological tagging
produces a complete annotation containing all the relevant
inflectional features such as case, gender and tense. A
complete reading is represented as a single tag. As a
reading might consist of several morphological features the
resulting tagset usually contains hundreds or even thousands
of tags. This is an issue for many decoding algorithms such
as Viterbi which have runtimes depending quadratically on
the number of tags. In the case of morphological tagging,
the problem can be avoided by using a morphological
analyzer. A morphological analyzer is a manually created
finite-state transducer that produces the possible
morphological readings of a word form. This analyzer can be
used to prune the tagging lattice and to allow for the
application of standard sequence labeling algorithms. The
downside of this approach is that such an analyzer is not
available for every language or might not have the coverage
required for the task. Additionally, the output tags of some
analyzers are not compatible with the annotations of the
treebanks, which might require some manual mapping of the
different annotations or even to reduce the complexity of
the annotation.
To avoid this problem we propose to use the posterior
probabilities of a conditional random field (CRF)
lattice to prune the space of possible
taggings. At the zero-order level the posterior
probabilities of a token can be calculated independently
from the other tokens of a sentence. The necessary
computations can thus be performed in linear time. The
features available to the model at this time are similar to
the features used by a morphological analyzer (essentially
the word form and features based on it), but also include
the immediate lexical context. As the ambiguity of word
types varies substantially, we just fix the average number of
readings after pruning by dynamically estimating a
probability threshold. Once we obtain the pruned lattice, we
can add tag transitions and convert it into a first-order
lattice. The quadratic forward-backward computations are now
executed on the remaining plausible readings and thus
efficient. We can now continue pruning and extending the
lattice order at a relatively low additional runtime cost
(depending on the pruning thresholds). The training of the
model can be implemented efficiently by applying stochastic
gradient descent (SGD). The CRF gradient can be calculated
from a lattice of any order as long as the correct reading
is still in the lattice. During training, we thus run the
lattice pruning until we either reach the maximal order or
until the correct reading is pruned. If the reading is
pruned we perform the gradient update with the highest order
lattice still containing the reading. This approach is
similar to early updating in the structured perceptron
literature and forces the model to learn how to keep the
correct readings in the lower order lattices. In practice,
we observe a high number of lower updates during the first
training epoch and almost exclusively higher order updates
during later epochs.
We evaluate our CRF tagger on six languages with different
morphological properties. We find that for languages with a
high word form ambiguity such as German, the pruning results
in a moderate drop in tagging accuracy while for languages
with less ambiguity such as Spanish and Hungarian the loss
due to pruning is negligible. However, our pruning strategy
allows us to train higher order models (order > 1), which give
substantial improvements for all languages and also
outperform unpruned first-order models. That is, the model
might lose some of the correct readings during pruning, but
is also able to solve more of the harder cases that require
more context. We also find our model to substantially and
significantly outperform a number of frequently used taggers
such as Morfette and SVMTool.
Based on our morphological tagger we develop a simple method
to increase the performance of a state-of-the-art
constituency parser. A constituency tree
describes the syntactic properties of a sentence by
assigning spans of text to a hierarchical bracket
structure. developed a
language-independent approach for the automatic annotation
of accurate and compact grammars. Their implementation --
known as the Berkeley parser -- gives state-of-the-art results
for many languages such as English and German. For some MRLS
such as Basque and Korean, however, the parser gives
unsatisfactory results because of its simple unknown word
model. This model maps unknown words to a small number of
signatures (similar to our morphological classes). These
signatures do not seem expressive enough for many of the
subtle distinctions made during parsing. We propose to
replace rare words by the morphological reading generated by
our tagger instead. The motivation is twofold. First, our
tagger has access to a number of lexical and sublexical
features not available during parsing. Second, we expect
the morphological readings to contain most of the
information required to make the correct parsing decision
even though we know that things such as the correct
attachment of prepositional phrases might require some
notion of lexical semantics.
In experiments on the SPMRL 2013 dataset
of nine MRLS we find our method to give improvements for all
languages except French for which we observe a minor drop in
the Parseval score of 0.06. For Hebrew, Hungarian and
Basque we find substantial absolute improvements of 5.65,
11.87 and 15.16, respectively.
We also performed an extensive evaluation on the utility of
word representations for morphological tagging. Our goal was
to reduce the drop in performance that is caused when a
model trained on a specific domain is applied to some other
domain. This problem is usually addressed by domain adaption
(DA). DA adapts a model towards a specific domain using a
small amount of labeled or a huge amount of unlabeled data
from that domain. However, this procedure requires us to
train a model for every target domain. Instead we are trying
to build a robust system that is trained on domain-specific
labeled and domain-independent or general unlabeled data. We
believe word representations to be key in the development of
such models because they allow us to leverage unlabeled
data efficiently. We compare data-driven representations to
manually created morphological analyzers. We understand
data-driven representations as models that cluster word
forms or map them to a vectorial representation. Examples
heavily used in the literature include Brown clusters,
Singular Value Decompositions of count
vectors and neural-network-based
embeddings. We create a test suite of
six languages consisting of in-domain and out-of-domain test
sets. To this end we converted annotations for Spanish and
Czech and annotated the German part of the Smultron
treebank with a morphological layer. In
our experiments on these data sets we find Brown clusters to
outperform the other data-driven representations. Regarding
the comparison with morphological analyzers, we find Brown
clusters to give slightly better performance in
part-of-speech tagging, but to be substantially outperformed
in morphological tagging
Specialized IoT systems: Models, Structures, Algorithms, Hardware, Software Tools
Монография включает анализ проблем, модели, алгоритмы и программно-
аппаратные средства специализированных сетей интернета вещей.
Рассмотрены результаты проектирования и моделирования сети интернета вещей,
мониторинга качества продукции, анализа звуковой информации окружающей среды, а
также технология выявления заболеваний легких на базе нейронных сетей.
Монография предназначена для специалистов в области инфокоммуникаций,
может быть полезна студентам соответствующих специальностей, слушателям
факультетов повышения квалификации, магистрантам и аспирантам
Multibiometric security in wireless communication systems
This thesis was submitted for the degree of Doctor of Philosophy and awarded by Brunel University, 05/08/2010.This thesis has aimed to explore an application of Multibiometrics to secured wireless communications. The medium of study for this purpose included Wi-Fi, 3G, and
WiMAX, over which simulations and experimental studies were carried out to assess the performance. In specific, restriction of access to authorized users only is provided by a technique referred to hereafter as multibiometric cryptosystem. In brief, the system is built upon a complete challenge/response methodology in order to obtain a high level of security on the basis of user identification by fingerprint and further confirmation by verification of the user through text-dependent speaker recognition.
First is the enrolment phase by which the database of watermarked fingerprints with
memorable texts along with the voice features, based on the same texts, is created by sending them to the server through wireless channel.
Later is the verification stage at which claimed users, ones who claim are genuine, are verified against the database, and it consists of five steps. Initially faced by the identification level, one is asked to first present one’s fingerprint and a memorable word, former is watermarked into latter, in order for system to authenticate the fingerprint and verify the validity of it by retrieving the challenge for accepted user.
The following three steps then involve speaker recognition including the user
responding to the challenge by text-dependent voice, server authenticating the response, and finally server accepting/rejecting the user.
In order to implement fingerprint watermarking, i.e. incorporating the memorable word as a watermark message into the fingerprint image, an algorithm of five steps has been developed. The first three novel steps having to do with the fingerprint
image enhancement (CLAHE with 'Clip Limit', standard deviation analysis and
sliding neighborhood) have been followed with further two steps for embedding, and
extracting the watermark into the enhanced fingerprint image utilising Discrete
Wavelet Transform (DWT).
In the speaker recognition stage, the limitations of this technique in wireless
communication have been addressed by sending voice feature (cepstral coefficients)
instead of raw sample. This scheme is to reap the advantages of reducing the
transmission time and dependency of the data on communication channel, together
with no loss of packet. Finally, the obtained results have verified the claims
- …