    Human-Computer Interaction

    In this book the reader will find a collection of 31 papers presenting different facets of Human-Computer Interaction: the results of research projects and experiments, as well as new approaches to designing user interfaces. The book is organized sequentially around the following main topics: new interaction paradigms, multimodality, usability studies of several interaction mechanisms, human factors, universal design, and development methodologies and tools.

    Enhancing Expressivity of Document-Centered Collaboration with Multimodal Annotations

    As knowledge work moves online, digital documents have become a staple of human collaboration. To communicate beyond the constraints of time and space, remote and asynchronous collaborators create digital annotations over documents, substituting face-to-face meetings with online conversations. However, existing document annotation interfaces depend primarily on text commenting, which is not as expressive or nuanced as in-person communication, where interlocutors can speak and gesture over physical documents. To expand the communicative capacity of digital documents, we need to enrich annotation interfaces with face-to-face-like multimodal expressions (e.g., talking and pointing over texts). This thesis makes three major contributions toward multimodal annotation interfaces for enriching collaboration around digital documents. The first contribution is a set of design requirements for multimodal annotations, drawn from our user studies and exploratory literature surveys. We found that the major challenges were to support lightweight access to recorded voice, to control visual occlusion by graphically rich audio interfaces, and to reduce speech anxiety in voice comment production. Second, to address these challenges, we present RichReview, a novel multimodal annotation system. RichReview is designed to capture the natural communicative expressions of face-to-face document descriptions as a combination of multimodal user inputs (e.g., speech, pen-writing, and deictic pen-hovering). To balance the consumption and production of speech comments, the system employs (1) cross-modal indexing interfaces for faster audio navigation, (2) a fluid document-annotation layout for reduced visual clutter, and (3) voice-synthesis-based speech editing for reduced speech anxiety. The third contribution is a series of evaluations that examine the effectiveness of our design solutions. Results of our lab studies show that RichReview successfully addresses the above-mentioned interface problems of multimodal annotations. A subsequent series of field deployment studies tested the real-world efficacy of RichReview by deploying the system for document-centered conversation activities in classrooms, such as instructor feedback on student assignments and peer discussions of course material. The results suggest that rich annotations help students better understand the instructor's comments and make them feel more valued as individuals. From the peer-discussion study, we learned that retaining the richness of the original speech is key to the success of speech commenting. The thesis concludes with a discussion of the benefits, challenges, and future of multimodal annotation interfaces, and of the technical innovations required to realize this vision.
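    To make the cross-modal indexing idea concrete, here is a minimal Python sketch of how a transcript can be linked to audio timestamps so that clicking a word seeks playback, and playback highlights the current word. This is an illustration only, not RichReview's implementation; the word timings are assumed to come from ASR or forced alignment, and all names are invented.

```python
# Hypothetical sketch of a cross-modal index linking a transcript to audio
# timestamps, in the spirit of the audio-navigation idea described above.
from bisect import bisect_right
from dataclasses import dataclass

@dataclass
class AlignedWord:
    text: str
    start: float  # seconds into the recording
    end: float

class CrossModalIndex:
    """Maps transcript positions to audio time and back."""

    def __init__(self, words: list[AlignedWord]):
        self.words = words
        self._starts = [w.start for w in words]

    def seek_time(self, word_index: int) -> float:
        """Audio position to jump to when a transcript word is clicked."""
        return self.words[word_index].start

    def word_at(self, t: float) -> AlignedWord:
        """Word being spoken at playback time t (for caret highlighting)."""
        i = bisect_right(self._starts, t) - 1
        return self.words[max(i, 0)]

words = [AlignedWord("see", 0.0, 0.3), AlignedWord("figure", 0.3, 0.8),
         AlignedWord("two", 0.8, 1.1)]
index = CrossModalIndex(words)
print(index.seek_time(1))       # 0.3 -> jump playback to "figure"
print(index.word_at(0.9).text)  # "two" -> highlight during playback
```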

    Qualitative investigation of the display of speech recognition results for communication with deaf people

    Speech technologies can help people with hearing loss by improving their autonomy. This study focuses on a French-language application developed in the collaborative project RAPSODIE to improve communication between a hearing person and a deaf or hard-of-hearing person. Our goal is to investigate different ways of displaying speech recognition results that also take into account the reliability of the recognized items. In this qualitative study, 10 people were interviewed to find the best way of displaying the speech transcription results. All participants are deaf, with different levels of hearing loss and various modes of communication.
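    As an illustration of the design space this kind of study explores, here is a small Python sketch of two strategies for displaying ASR output that carries per-word confidence scores: flagging unreliable words versus masking them. This is a hypothetical mock-up, not the RAPSODIE interface; the threshold and markup are arbitrary choices.

```python
# Two illustrative display strategies for confidence-scored ASR output.
def mark_unreliable(words, threshold=0.7):
    """Keep every word but visually flag low-confidence ones."""
    return " ".join(w if c >= threshold else f"[{w}?]" for w, c in words)

def mask_unreliable(words, threshold=0.7):
    """Show only reliable words; replace the rest with a placeholder."""
    return " ".join(w if c >= threshold else "..." for w, c in words)

hyp = [("rendez-vous", 0.95), ("demain", 0.92), ("deux", 0.41),
       ("heures", 0.90)]
print(mark_unreliable(hyp))  # rendez-vous demain [deux?] heures
print(mask_unreliable(hyp))  # rendez-vous demain ... heures
```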

    Accessing spoken interaction through dialogue processing [online]

    Written language is one of our primary means of documenting our lives, achievements, and environment. Our capabilities to record, store, and retrieve audio, still pictures, and video are undergoing a revolution and may support, supplement, or even replace written documentation. This technology enables us to record information that would otherwise be lost, lower the cost of documentation, and enhance high-quality documents with original audiovisual material. Indexing the audio material is the key technology for realizing those benefits. This work presents effective alternatives to keyword-based indices, which restrict the search space and can in part be computed with very limited resources. Speech documents can be indexed at various levels. Stylistically, a document belongs to a certain database, which can be determined automatically with high accuracy using very simple features; the resulting search-space reduction is on the order of a factor of 4 to 10, while topic classification yielded a factor of 18 in a news domain. Since documents can be very long, they need to be segmented into topical regions. A new probabilistic segmentation framework as well as new features (speaker initiative and style) prove to be very effective compared to traditional keyword-based methods. At the topical-segment level, activities (storytelling, discussing, planning, ...) can be detected using a machine learning approach with limited accuracy; however, even human annotators do not annotate them very reliably. A maximum search-space reduction by a factor of 6 is theoretically possible on the databases used. A topical classification of these regions was also attempted on one database, but the detection accuracy for that index was very low. At the utterance level, dialogue acts such as statements, questions, and backchannels (aha, yeah, ...) are recognized using a novel discriminatively trained HMM procedure. The procedure can be extended to recognize short sequences such as question/answer pairs, so-called dialogue games. Dialogue acts and games are useful for building classifiers for speaking style. Similarly, a user may remember a certain dialogue act sequence and search for it in a graphical representation. In a study with very pessimistic assumptions, users were able to pick one out of four similar and equiprobable meetings correctly with an accuracy of ~43% using graphical activity information. Dialogue acts may be useful in this situation as well, but the sample size did not allow final conclusions to be drawn. The user study failed, however, to show any effect for detailed basic features such as formality or speaker identity.
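    For readers unfamiliar with HMM-based dialogue act tagging, the sketch below shows the generic decoding step: Viterbi over act-to-act transitions with simple per-act unigram emission models. It is a toy stand-in, not the thesis's method; the thesis's contribution is a discriminatively trained HMM, whereas all probabilities here are invented for illustration.

```python
# Generic sketch of dialogue act tagging with an HMM and Viterbi decoding.
import math

ACTS = ["statement", "question", "backchannel"]
TRANS = {  # P(next act | current act), toy numbers
    "statement":   {"statement": 0.5, "question": 0.3, "backchannel": 0.2},
    "question":    {"statement": 0.7, "question": 0.1, "backchannel": 0.2},
    "backchannel": {"statement": 0.6, "question": 0.3, "backchannel": 0.1},
}
START = {"statement": 0.6, "question": 0.3, "backchannel": 0.1}
UNIGRAMS = {  # P(word | act), with a small floor for unseen words
    "statement":   {"we": 0.1, "think": 0.1, "the": 0.1},
    "question":    {"what": 0.2, "why": 0.1, "do": 0.1},
    "backchannel": {"yeah": 0.4, "aha": 0.3, "right": 0.2},
}

def emit_logp(act, utterance):
    return sum(math.log(UNIGRAMS[act].get(w, 1e-3)) for w in utterance.split())

def viterbi(utterances):
    """Most likely act sequence for a list of utterances."""
    path = {a: [a] for a in ACTS}
    score = {a: math.log(START[a]) + emit_logp(a, utterances[0]) for a in ACTS}
    for utt in utterances[1:]:
        new_path, new_score = {}, {}
        for a in ACTS:
            prev = max(ACTS, key=lambda p: score[p] + math.log(TRANS[p][a]))
            new_score[a] = score[prev] + math.log(TRANS[prev][a]) + emit_logp(a, utt)
            new_path[a] = path[prev] + [a]
        path, score = new_path, new_score
    best = max(ACTS, key=lambda a: score[a])
    return path[best]

print(viterbi(["what do we do", "we think the plan works", "yeah right"]))
# e.g. ['question', 'statement', 'backchannel']
```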

    A Contextual Study of Semantic Speech Editing in Radio Production

    Radio production involves editing speech-based audio using tools that represent sound as simple waveforms. Semantic speech editing systems allow users to edit audio using an automatically generated transcript, which has the potential to improve the production workflow. To investigate this, we developed a semantic audio editor based on a pilot study. Through a contextual qualitative study of five professional radio producers at the BBC, we examined the existing radio production process and evaluated our semantic editor by using it to create programmes that were later broadcast. We observed that the participants in our study wrote detailed notes about their recordings and used annotation to mark which parts they wanted to use. They collaborated closely with the presenter of their programme to structure the contents and write narrative elements. Participants reported that they often work away from the office to avoid distractions, and print transcripts so they can work away from screens. They also emphasised that listening is an important part of production, to ensure high sound quality. We found that semantic speech editing with automatic speech recognition can be used to improve the radio production workflow, but that annotation, collaboration, portability and listening were not well supported by current semantic speech editing systems. In this paper, we make recommendations on how future semantic speech editing systems can better support the requirements of radio production.
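    The core mechanic of semantic speech editing, cutting audio by editing its transcript, can be sketched briefly: deletions in the text are mapped through word timings to a list of audio regions to keep. The following Python fragment is illustrative, not the authors' editor; the data layout and padding value are assumptions.

```python
# Minimal sketch: map transcript deletions to audio regions to keep.
def keep_regions(words, deleted, pad=0.05):
    """words: [(text, start, end)]; deleted: indices removed in the text
    editor. Returns merged (start, end) spans of audio to keep."""
    spans = []
    for i, (_, start, end) in enumerate(words):
        if i in deleted:
            continue
        if spans and start - spans[-1][1] <= pad:
            spans[-1][1] = end          # extend the previous span
        else:
            spans.append([start, end])  # begin a new span
    return [tuple(s) for s in spans]

words = [("so", 0.0, 0.2), ("um", 0.2, 0.5), ("welcome", 0.5, 1.0),
         ("back", 1.0, 1.4)]
print(keep_regions(words, deleted={1}))  # [(0.0, 0.2), (0.5, 1.4)]
```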

    Error Correction of Voicemail Transcripts in SCANMail

    Despite its widespread use, voicemail presents numerous usability challenges: people must listen to messages in their entirety, they cannot search by keywords, and audio files do not naturally support visual skimming. SCANMail overcomes these flaws by automatically generating text transcripts of voicemail messages and presenting them in an email-like interface. Transcripts facilitate quick browsing and permanent archiving. However, errors from automatic speech recognition (ASR) hinder the usefulness of the transcripts. The work presented here addresses these problems by evaluating user-initiated error correction of transcripts. User studies of two editor interfaces, a grammar-assisted menu and simple replacement by typing, reveal reduced audio playback times and an emphasis on editing important words with the menu, suggesting its value in mobile environments where limited input capabilities are the norm and user privacy is essential. The study also adds to the scarce body of work on ASR confidence shading, suggesting that shading may be more helpful than previously reported. Author Keywords: voicemail, error correction, speech recognition, editor.
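    Confidence shading, as studied here, can be illustrated with a short sketch that renders each transcript word with a background tint proportional to the recognizer's doubt. The HTML markup and colour choice below are invented for illustration and are not SCANMail's actual rendering.

```python
# Illustrative confidence shading: more recognizer doubt, stronger tint.
import html

def shade(words):
    """words: [(token, confidence in [0,1])] -> HTML with per-word tint."""
    parts = []
    for token, conf in words:
        alpha = round(1.0 - conf, 2)  # more doubt, more tint
        parts.append(
            f'<span style="background: rgba(255,0,0,{alpha})" '
            f'title="conf={conf:.2f}">{html.escape(token)}</span>'
        )
    return " ".join(parts)

msg = [("please", 0.97), ("call", 0.95), ("extension", 0.60), ("4312", 0.35)]
print(shade(msg))  # "4312" gets the heaviest shading, flagging it for review
```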