30 research outputs found
Proceedings: Voice Technology for Interactive Real-Time Command/Control Systems Application
Speech understanding among researchers and managers, current developments in voice technology, and an exchange of information concerning government voice technology efforts are discussed
An acoustic-phonetic approach in automatic Arabic speech recognition
In a large vocabulary speech recognition system the broad phonetic classification
technique is used instead of detailed phonetic analysis to overcome the variability in the
acoustic realisation of utterances. The broad phonetic description of a word is used as a
means of lexical access, where the lexicon is structured into sets of words sharing the
same broad phonetic labelling.
This approach has been applied to a large vocabulary isolated word Arabic speech
recognition system. Statistical studies have been carried out on 10,000 Arabic words
(converted to phonemic form) involving different combinations of broad phonetic
classes. Some particular features of the Arabic language have been exploited. The results
show that vowels represent about 43% of the total number of phonemes. They also show
that about 38% of the words can uniquely be represented at this level by using eight
broad phonetic classes. When introducing detailed vowel identification the percentage of
uniquely specified words rises to 83%. These results suggest that a fully detailed
phonetic analysis of the speech signal is perhaps unnecessary.
In the adopted word recognition model, the consonants are classified into four broad
phonetic classes, while the vowels are described by their phonemic form. A set of 100
words uttered by several speakers has been used to test the performance of the
implemented approach.
In the implemented recognition model, three procedures have been developed, namely
voiced-unvoiced-silence segmentation, vowel detection and identification, and automatic
spectral transition detection between phonemes within a word. The accuracy of both the
V-UV-S and vowel recognition procedures is almost perfect. A broad phonetic
segmentation procedure has been implemented, which exploits information from the
above mentioned three procedures. Simple phonological constraints have been used to
improve the accuracy of the segmentation process. The resultant sequence of labels are
used for lexical access to retrieve the word or a small set of words sharing the same broad
phonetic labelling. For the case of having more than one word-candidates, a verification
procedure is used to choose the most likely one
Speech Recognition
Chapters in the first part of the book cover all the essential speech processing techniques for building robust, automatic speech recognition systems: the representation for speech signals and the methods for speech-features extraction, acoustic and language modeling, efficient algorithms for searching the hypothesis space, and multimodal approaches to speech recognition. The last part of the book is devoted to other speech processing applications that can use the information from automatic speech recognition for speaker identification and tracking, for prosody modeling in emotion-detection systems and in other speech processing applications that are able to operate in real-world environments, like mobile communication services and smart homes
Accessing spoken interaction through dialogue processing [online]
Zusammenfassung
Unser Leben, unsere Leistungen und unsere Umgebung, alles wird
derzeit durch Schriftsprache dokumentiert. Die rasante
Fortentwicklung der technischen Möglichkeiten Audio, Bilder und
Video aufzunehmen, abzuspeichern und wiederzugeben kann genutzt
werden um die schriftliche Dokumentation von menschlicher
Kommunikation, zum Beispiel Meetings, zu unterstützen, zu
ergänzen oder gar zu ersetzen. Diese neuen Technologien können
uns in die Lage versetzen Information aufzunehmen, die
anderweitig verloren gehen, die Kosten der Dokumentation zu
senken und hochwertige Dokumente mit audiovisuellem Material
anzureichern. Die Indizierung solcher Aufnahmen stellt die
Kerntechnologie dar um dieses Potential auszuschöpfen. Diese
Arbeit stellt effektive Alternativen zu schlüsselwortbasierten
Indizes vor, die Suchraumeinschränkungen bewirken und teilweise
mit einfachen Mitteln zu berechnen sind.
Die Indizierung von Sprachdokumenten kann auf verschiedenen
Ebenen erfolgen: Ein Dokument gehört stilistisch einer
bestimmten Datenbasis an, welche durch sehr einfache Merkmale
bei hoher Genauigkeit automatisch bestimmt werden kann.
Durch diese Art von Klassifikation kann eine Reduktion des
Suchraumes um einen Faktor der Größenordnung 410 erfolgen. Die
Anwendung von thematischen Merkmalen zur Textklassifikation
bei einer Nachrichtendatenbank resultiert in einer Reduktion um
einen Faktor 18. Da Sprachdokumente sehr lang sein können müssen
sie in thematische Segmente unterteilt werden. Ein neuer
probabilistischer Ansatz sowie neue Merkmale (Sprecherinitia
tive und Stil) liefern vergleichbare oder bessere Resultate als
traditionelle schlüsselwortbasierte Ansätze. Diese thematische
Segmente können durch die vorherrschende Aktivität
charakterisiert werden (erzählen, diskutieren, planen, ...),
die durch ein neuronales Netz detektiert werden kann. Die
Detektionsraten sind allerdings begrenzt da auch Menschen
diese Aktivitäten nur ungenau bestimmen. Eine maximale
Reduktion des Suchraumes um den Faktor 6 ist bei den verwendeten
Daten theoretisch möglich. Eine thematische Klassifikation
dieser Segmente wurde ebenfalls auf einer Datenbasis
durchgeführt, die Detektionsraten für diesen Index sind jedoch
gering.
Auf der Ebene der einzelnen Äußerungen können Dialogakte wie
Aussagen, Fragen, Rückmeldungen (aha, ach ja, echt?, ...) usw.
mit einem diskriminativ trainierten Hidden Markov Model erkannt
werden. Dieses Verfahren kann um die Erkennung von kurzen Folgen
wie Frage/AntwortSpielen erweitert werden (Dialogspiele).
Dialogakte und spiele können eingesetzt werden um
Klassifikatoren für globale Sprechstile zu bauen. Ebenso
könnte ein Benutzer sich an eine bestimmte Dialogaktsequenz
erinnern und versuchen, diese in einer grafischen
Repräsentation wiederzufinden.
In einer Studie mit sehr pessimistischen Annahmen konnten
Benutzer eines aus vier ähnlichen und gleichwahrscheinlichen
Gesprächen mit einer Genauigkeit von ~ 43% durch eine graphische
Repräsentation von Aktivität bestimmt.
Dialogakte könnte in diesem Szenario ebenso nützlich sein, die
Benutzerstudie konnte aufgrund der geringen Datenmenge darüber
keinen endgültigen Aufschluß geben. Die Studie konnte allerdings
für detailierte Basismerkmale wie Formalität und
Sprecheridentität keinen Effekt zeigen.
Abstract
Written language is one of our primary means for documenting our
lives, achievements, and environment. Our capabilities to
record, store and retrieve audio, still pictures, and video are
undergoing a revolution and may support, supplement or even
replace written documentation. This technology enables us to
record information that would otherwise be lost, lower the cost
of documentation and enhance highquality documents with
original audiovisual material.
The indexing of the audio material is the key technology to
realize those benefits. This work presents effective
alternatives to keyword based indices which restrict the search
space and may in part be calculated with very limited resources.
Indexing speech documents can be done at a various levels:
Stylistically a document belongs to a certain database which can
be determined automatically with high accuracy using very simple
features. The resulting factor in search space reduction is in
the order of 410 while topic classification yielded a factor
of 18 in a news domain.
Since documents can be very long they need to be segmented into
topical regions. A new probabilistic segmentation framework as
well as new features (speaker initiative and style) prove to be
very effective compared to traditional keyword based methods. At
the topical segment level activities (storytelling, discussing,
planning, ...) can be detected using a machine learning approach
with limited accuracy; however even human annotators do not
annotate them very reliably. A maximum search space reduction
factor of 6 is theoretically possible on the databases used. A
topical classification of these regions has been attempted
on one database, the detection accuracy for that index, however,
was very low.
At the utterance level dialogue acts such as statements,
questions, backchannels (aha, yeah, ...), etc. are being
recognized using a novel discriminatively trained HMM procedure.
The procedure can be extended to recognize short sequences such
as question/answer pairs, so called dialogue games.
Dialog acts and games are useful for building classifiers for
speaking style. Similarily a user may remember a certain dialog
act sequence and may search for it in a graphical
representation.
In a study with very pessimistic assumptions users are able to
pick one out of four similar and equiprobable meetings correctly
with an accuracy ~ 43% using graphical activity information.
Dialogue acts may be useful in this situation as well but the
sample size did not allow to draw final conclusions. However the
user study fails to show any effect for detailed basic features
such as formality or speaker identity
Products and Services
Today’s global economy offers more opportunities, but is also more complex and competitive than ever before. This fact leads to a wide range of research activity in different fields of interest, especially in the so-called high-tech sectors. This book is a result of widespread research and development activity from many researchers worldwide, covering the aspects of development activities in general, as well as various aspects of the practical application of knowledge