4,480 research outputs found
The listening talker: A review of human and algorithmic context-induced modifications of speech
International audienceSpeech output technology is finding widespread application, including in scenarios where intelligibility might be compromised - at least for some listeners - by adverse conditions. Unlike most current algorithms, talkers continually adapt their speech patterns as a response to the immediate context of spoken communication, where the type of interlocutor and the environment are the dominant situational factors influencing speech production. Observations of talker behaviour can motivate the design of more robust speech output algorithms. Starting with a listener-oriented categorisation of possible goals for speech modification, this review article summarises the extensive set of behavioural findings related to human speech modification, identifies which factors appear to be beneficial, and goes on to examine previous computational attempts to improve intelligibility in noise. The review concludes by tabulating 46 speech modifications, many of which have yet to be perceptually or algorithmically evaluated. Consequently, the review provides a roadmap for future work in improving the robustness of speech output
Towards responsive Sensitive Artificial Listeners
This paper describes work in the recently started project SEMAINE, which aims to build a set of Sensitive Artificial Listeners – conversational agents designed to sustain an interaction with a human user despite limited verbal skills, through robust recognition and generation of non-verbal behaviour in real-time, both when the agent is speaking and listening. We report on data collection and on the design of a system architecture in view of real-time responsiveness
Recognition of Dialogue Acts in Multiparty Meetings using a Switching DBN
This paper is concerned with the automatic recognition of dialogue acts (DAs) in multiparty conversational speech. We present a joint generative model for DA recognition in which segmentation and classification of DAs are carried out in parallel. Our approach to DA recognition is based on a switching dynamic Bayesian network (DBN) architecture. This generative approach models a set of features, related to lexical content and prosody, and incorporates a weighted interpolated factored language model. The switching DBN coordinates the recognition process by integrating the component models. The factored language model, which is estimated from multiple conversational data corpora, is used in conjunction with additional task-specific language models. In conjunction with this joint generative model, we have also investigated the use of a discriminative approach, based on conditional random fields, to perform a reclassification of the segmented DAs. We have carried out experiments on the AMI corpus of multimodal meeting recordings, using both manually transcribed speech, and the output of an automatic speech recognizer, and using different configurations of the generative model. Our results indicate that the system performs well both on reference and fully automatic transcriptions. A further significant improvement in recognition accuracy is obtained by the application of the discriminative reranking approach based on conditional random fields
An Analysis of Students’ Difficulties in Learning Listening (A Study at Sman 11 Banda Aceh)
Listening has been one of neglected skills in language teaching and learning among high school learners in Banda Aceh. There were many various causes that many students complained during the comprehending stage of the listening texts. This study was aimed find out the difficulties that mostly faced by students in learning listening at SMAN 11 Banda Aceh. Using simple random sampling, Ten XI-grade students out of 188 students and one English teacher were selected as the respondents. They were interviewed to collect the data for this qualitative study, later interview result was analysed based on thematic analysis procedures. Interview was done by asking five prepared questions and additional questions were also asked to clarify the answers given. Based on the result of this study, the students of SMAN 11 Banda Aceh had four major problems in learning listening. Respondents admitted that listening was frequently neglected in their school, it was only taught twice or three times in one semester. Also, the unavailability of adequate facilities in the school added their adversity. Two more problems faced by students in SMAN 11 Banda Aceh were limited vocabulary and paralinguistic issues. Students sometimes encounter unfamiliar words which caused them to stop and think about the meaning of those words and they missed the next part of the speech. In the same way, paralinguistic features such as the accent, noise, rate of delivery, pronunciation, and intonation caused students difficulties while doing listening exercises
Masked Morphological Priming and Sensitivity to the Statistical Structure of form-to-Meaning mapping in L2
In one’s native language, visual word identification is based on early morphological analysis and is sensitive to the statistical structure of the mapping between form and meaning (Orthography–to–Semantic Consistency, OSC). How these mechanisms apply to a second language is much less clear. We recruited L1 Italian-L2 English speakers for a masked priming task where the relationship between prime and target was morphologically transparent, e.g., employer–EMPLOY, morphologically opaque, e.g., corner-CORN, or merely orthographic, e.g., brothel–BROTH. Critically, participants underwent thorough testing of their lexical, morphological, phonological, spelling, and semantic proficiency in their second language. By exploring a wide spectrum of L2 proficiency, we showed that this factor critically qualifies L2 priming. Genuine morphological facilitation only arises as proficiency grows, while orthographic priming shrinks as L2 competence increases. OSC was also found to modulate priming and interact with proficiency, providing an alternative way of describing the transparency continuum in derivational morphology. Overall, these data illustrate the trajectory towards a fully consolidated L2 lexicon and show that masked priming and sensitivity to OSC are key trackers of this process
Learning to recognize novel words and novel objects
Reading seems as easy and natural as listening. It is still not clear how we acquire this skill, and how visual word identification mechanisms are refined through reading experience. Theoretical models of word recognition describe general principles of skilled reading behaviour. However, these models have been based on averaged data from relatively small samples of skilled readers, mainly English native speakers, and are based on the assumption that skilled reading involves a specialized system of word identification. In this thesis it is proposed that expert reading requires the development and refinement of basic visual processing mechanisms originally employed to identify everyday objects, and then adapted to reading. To test this hypothesis, I carried out three experiments investigating: (i) how L2 visual word recognition changes with growing proficiency; (ii) how novel lexical memories are integrated into the lexicon, i.e., how they interact with previously existing words; and (iii) how sensitivity to the lexicon statistics plays out in the process of learning a novel set of visual stimuli, either in the language and non--language domain
Accessing spoken interaction through dialogue processing [online]
Zusammenfassung
Unser Leben, unsere Leistungen und unsere Umgebung, alles wird
derzeit durch Schriftsprache dokumentiert. Die rasante
Fortentwicklung der technischen Möglichkeiten Audio, Bilder und
Video aufzunehmen, abzuspeichern und wiederzugeben kann genutzt
werden um die schriftliche Dokumentation von menschlicher
Kommunikation, zum Beispiel Meetings, zu unterstützen, zu
ergänzen oder gar zu ersetzen. Diese neuen Technologien können
uns in die Lage versetzen Information aufzunehmen, die
anderweitig verloren gehen, die Kosten der Dokumentation zu
senken und hochwertige Dokumente mit audiovisuellem Material
anzureichern. Die Indizierung solcher Aufnahmen stellt die
Kerntechnologie dar um dieses Potential auszuschöpfen. Diese
Arbeit stellt effektive Alternativen zu schlüsselwortbasierten
Indizes vor, die Suchraumeinschränkungen bewirken und teilweise
mit einfachen Mitteln zu berechnen sind.
Die Indizierung von Sprachdokumenten kann auf verschiedenen
Ebenen erfolgen: Ein Dokument gehört stilistisch einer
bestimmten Datenbasis an, welche durch sehr einfache Merkmale
bei hoher Genauigkeit automatisch bestimmt werden kann.
Durch diese Art von Klassifikation kann eine Reduktion des
Suchraumes um einen Faktor der Größenordnung 410 erfolgen. Die
Anwendung von thematischen Merkmalen zur Textklassifikation
bei einer Nachrichtendatenbank resultiert in einer Reduktion um
einen Faktor 18. Da Sprachdokumente sehr lang sein können müssen
sie in thematische Segmente unterteilt werden. Ein neuer
probabilistischer Ansatz sowie neue Merkmale (Sprecherinitia
tive und Stil) liefern vergleichbare oder bessere Resultate als
traditionelle schlüsselwortbasierte Ansätze. Diese thematische
Segmente können durch die vorherrschende Aktivität
charakterisiert werden (erzählen, diskutieren, planen, ...),
die durch ein neuronales Netz detektiert werden kann. Die
Detektionsraten sind allerdings begrenzt da auch Menschen
diese Aktivitäten nur ungenau bestimmen. Eine maximale
Reduktion des Suchraumes um den Faktor 6 ist bei den verwendeten
Daten theoretisch möglich. Eine thematische Klassifikation
dieser Segmente wurde ebenfalls auf einer Datenbasis
durchgeführt, die Detektionsraten für diesen Index sind jedoch
gering.
Auf der Ebene der einzelnen Äußerungen können Dialogakte wie
Aussagen, Fragen, Rückmeldungen (aha, ach ja, echt?, ...) usw.
mit einem diskriminativ trainierten Hidden Markov Model erkannt
werden. Dieses Verfahren kann um die Erkennung von kurzen Folgen
wie Frage/AntwortSpielen erweitert werden (Dialogspiele).
Dialogakte und spiele können eingesetzt werden um
Klassifikatoren für globale Sprechstile zu bauen. Ebenso
könnte ein Benutzer sich an eine bestimmte Dialogaktsequenz
erinnern und versuchen, diese in einer grafischen
Repräsentation wiederzufinden.
In einer Studie mit sehr pessimistischen Annahmen konnten
Benutzer eines aus vier ähnlichen und gleichwahrscheinlichen
Gesprächen mit einer Genauigkeit von ~ 43% durch eine graphische
Repräsentation von Aktivität bestimmt.
Dialogakte könnte in diesem Szenario ebenso nützlich sein, die
Benutzerstudie konnte aufgrund der geringen Datenmenge darüber
keinen endgültigen Aufschluß geben. Die Studie konnte allerdings
für detailierte Basismerkmale wie Formalität und
Sprecheridentität keinen Effekt zeigen.
Abstract
Written language is one of our primary means for documenting our
lives, achievements, and environment. Our capabilities to
record, store and retrieve audio, still pictures, and video are
undergoing a revolution and may support, supplement or even
replace written documentation. This technology enables us to
record information that would otherwise be lost, lower the cost
of documentation and enhance highquality documents with
original audiovisual material.
The indexing of the audio material is the key technology to
realize those benefits. This work presents effective
alternatives to keyword based indices which restrict the search
space and may in part be calculated with very limited resources.
Indexing speech documents can be done at a various levels:
Stylistically a document belongs to a certain database which can
be determined automatically with high accuracy using very simple
features. The resulting factor in search space reduction is in
the order of 410 while topic classification yielded a factor
of 18 in a news domain.
Since documents can be very long they need to be segmented into
topical regions. A new probabilistic segmentation framework as
well as new features (speaker initiative and style) prove to be
very effective compared to traditional keyword based methods. At
the topical segment level activities (storytelling, discussing,
planning, ...) can be detected using a machine learning approach
with limited accuracy; however even human annotators do not
annotate them very reliably. A maximum search space reduction
factor of 6 is theoretically possible on the databases used. A
topical classification of these regions has been attempted
on one database, the detection accuracy for that index, however,
was very low.
At the utterance level dialogue acts such as statements,
questions, backchannels (aha, yeah, ...), etc. are being
recognized using a novel discriminatively trained HMM procedure.
The procedure can be extended to recognize short sequences such
as question/answer pairs, so called dialogue games.
Dialog acts and games are useful for building classifiers for
speaking style. Similarily a user may remember a certain dialog
act sequence and may search for it in a graphical
representation.
In a study with very pessimistic assumptions users are able to
pick one out of four similar and equiprobable meetings correctly
with an accuracy ~ 43% using graphical activity information.
Dialogue acts may be useful in this situation as well but the
sample size did not allow to draw final conclusions. However the
user study fails to show any effect for detailed basic features
such as formality or speaker identity
- …