1,935 research outputs found

    Access to recorded interviews: A research agenda

    Recorded interviews form a rich basis for scholarly inquiry. Examples include oral histories, community memory projects, and interviews conducted for broadcast media. Emerging technologies offer the potential to radically transform the way in which recorded interviews are made accessible, but this vision will demand substantial investments from a broad range of research communities. This article reviews the present state of practice for making recorded interviews available and the state-of-the-art for key component technologies. A large number of important research issues are identified, and from that set of issues, a coherent research agenda is proposed

    DCU at the NTCIR-11 SpokenQuery&Doc task

    We describe DCU's participation in the NTCIR-11 Spoken-Query&Document task. We participated in the spoken query spoken content retrieval (SQ-SCR) subtask by using the slide group segments as basic indexing and retrieval units. Our approach integrates normalised prosodic features into a standard BM25 weighting function to increase weights for terms that are prominent in speech. Text queries and relevance assessment data from the NTCIR-10 SpokenDoc-2 passage retrieval task were used to train the prosodic-based models. Evaluation results indicate that our prosodic-based retrieval models do not provide significant improvements over a text-based BM25 model, but suggest that they can be useful for certain queries
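The abstract above does not say how the prosodic features enter the BM25 function, so the following is only a minimal sketch under an assumed scheme: each term occurrence contributes `1 + alpha * prominence` to the term frequency, where `prominence` is a normalised score in [0, 1]. The function name `bm25_prosodic`, the parameter `alpha`, and the toy corpus are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def bm25_prosodic(query, doc, corpus, alpha=0.5, k1=1.2, b=0.75):
    """Score `doc` against `query` with BM25 over prominence-weighted term counts.

    `doc` and every entry of `corpus` are lists of (term, prominence) pairs,
    with prominence assumed to be normalised to [0, 1]. With alpha=0 this
    reduces to plain BM25 (using the log(1 + ...) idf variant).
    """
    n_docs = len(corpus)
    avg_len = sum(len(d) for d in corpus) / n_docs

    # Assumed weighting: a prominent occurrence counts for more than 1.
    tf = Counter()
    for term, prominence in doc:
        tf[term] += 1.0 + alpha * prominence

    score = 0.0
    for term in query:
        df = sum(1 for d in corpus if any(t == term for t, _ in d))
        if df == 0 or tf[term] == 0:
            continue  # term absent from the corpus or from this document
        idf = math.log(1.0 + (n_docs - df + 0.5) / (df + 0.5))
        length_norm = k1 * (1.0 - b + b * len(doc) / avg_len)
        score += idf * tf[term] * (k1 + 1.0) / (tf[term] + length_norm)
    return score


# Toy segments: (term, prominence) pairs, invented for illustration only.
corpus = [
    [("budget", 0.9), ("meeting", 0.1)],
    [("meeting", 0.2), ("room", 0.3)],
    [("lunch", 0.1), ("menu", 0.1)],
]
plain = bm25_prosodic(["budget"], corpus[0], corpus, alpha=0.0)
boosted = bm25_prosodic(["budget"], corpus[0], corpus, alpha=1.0)
```

Setting `alpha=0` gives a text-only baseline, which makes it easy to compare the two rankings query by query.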

    Memory as discrimination: what distraction reveals

    Recalling information involves the process of discriminating between relevant and irrelevant information stored in memory. Not infrequently, the relevant information needs to be selected from amongst a series of related possibilities. This is likely to be particularly problematic when the irrelevant possibilities are not only temporally or contextually appropriate but also overlap semantically with the target or targets. Here, we investigate the extent to which purely perceptual features which discriminate between irrelevant and target material can be used to overcome the negative impact of contextual and semantic relatedness. Adopting a distraction paradigm, it is demonstrated that when distracters are interleaved with targets presented either visually (Experiment 1) or auditorily (Experiment 2), a within-modality semantic distraction effect occurs; semantically-related distracters impact upon recall more than unrelated distracters. In the semantically-related condition, the number of intrusions in recall is reduced whilst the number of correctly recalled targets is simultaneously increased by the presence of perceptual cues to relevance (color features in Experiment 1 or speaker’s gender in Experiment 2). However, as demonstrated in Experiment 3, even presenting semantically-related distracters in a language and a sensory modality (spoken Welsh) distinct from that of the targets (visual English) is insufficient to eliminate false recalls completely, or to restore correct recall to levels seen with unrelated distracters. Together, the study shows how semantic and non-semantic discriminability shape patterns of both erroneous and correct recall

    Word Importance Modeling to Enhance Captions Generated by Automatic Speech Recognition for Deaf and Hard of Hearing Users

    People who are deaf or hard-of-hearing (DHH) benefit from sign-language interpreting or live-captioning (with a human transcriptionist) to access spoken information. However, such services are not legally required, affordable, or available in many settings, e.g., impromptu small-group meetings in the workplace or online video content that has not been professionally captioned. As Automatic Speech Recognition (ASR) systems improve in accuracy and speed, it is natural to investigate the use of these systems to assist DHH users in a variety of tasks. However, ASR systems are still not perfect, especially in realistic conversational settings, leading to the issue of trust and acceptance of these systems from the DHH community. To overcome these challenges, our work focuses on: (1) building metrics for accurately evaluating the quality of automatic captioning systems, and (2) designing interventions for improving the usability of captions for DHH users. The first part of this dissertation describes our research on methods for identifying words that are important for understanding the meaning of a conversational turn within transcripts of spoken dialogue. Such knowledge about the relative importance of words in spoken messages can be used in evaluating ASR systems (in part 2 of this dissertation) or creating new applications for DHH users of captioned video (in part 3 of this dissertation). We found that models which consider both the acoustic properties of spoken words and text-based features (e.g., pre-trained word embeddings) are more effective at predicting the semantic importance of a word than models that utilize only one of these types of features. The second part of this dissertation describes studies to understand DHH users' perception of the quality of ASR-generated captions; the goal of this work was to validate the design of automatic metrics for evaluating captions in real-time applications for these users. 
Such a metric could facilitate comparison of various ASR systems, for determining the suitability of specific ASR systems for supporting communication for DHH users. We designed experimental studies to elicit feedback on the quality of captions from DHH users, and we developed and evaluated automatic metrics for predicting the usability of automatically generated captions for these users. We found that metrics that consider the importance of each word in a text are more effective at predicting the usability of imperfect text captions than the traditional Word Error Rate (WER) metric. The final part of this dissertation describes research on importance-based highlighting of words in captions, as a way to enhance the usability of captions for DHH users. Similar to highlighting in static texts (e.g., textbooks or electronic documents), highlighting in captions involves changing the appearance of some text in a caption to enable readers to attend to the most important bits of information quickly. Despite the known benefits of highlighting in static texts, the usefulness of highlighting in captions for DHH users is largely unexplored. For this reason, we conducted experimental studies with DHH participants to understand the benefits of importance-based highlighting in captions, and their preferences among different design configurations for highlighting in captions. We found that DHH users subjectively preferred highlighting in captions, and they reported higher readability and understandability scores and lower task-load scores when viewing videos with captions containing highlighting compared to the videos without highlighting. Further, in partial contrast to recommendations in prior research on highlighting in static texts (which had not been based on experimental studies with DHH users), we found that DHH participants preferred boldface, word-level, non-repeating highlighting in captions
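To make the contrast between WER and an importance-aware metric concrete, here is a small sketch. The uniform-weight case is standard WER (word-level edit distance divided by reference length). The weighted variant is only an illustrative assumption, not the metric developed in the dissertation: substituting or deleting a reference word costs that word's importance, and the function name `weighted_error_rate`, the `insertion_cost` parameter, and the importance values are all hypothetical.

```python
def weighted_error_rate(ref, hyp, weight=None, insertion_cost=1.0):
    """Word-level edit distance between `ref` and `hyp`, normalised by total reference weight.

    `weight` maps a reference word to its importance. With the default
    (every word weighs 1.0) the result is the ordinary Word Error Rate.
    """
    if not ref:
        raise ValueError("reference transcript must not be empty")
    if weight is None:
        weight = lambda word: 1.0
    costs = [weight(w) for w in ref]

    # dp[i][j] = minimum cost of turning ref[:i] into hyp[:j]
    dp = [[0.0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        dp[i][0] = dp[i - 1][0] + costs[i - 1]          # delete all of ref[:i]
    for j in range(1, len(hyp) + 1):
        dp[0][j] = dp[0][j - 1] + insertion_cost        # insert all of hyp[:j]
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitute = 0.0 if ref[i - 1] == hyp[j - 1] else costs[i - 1]
            dp[i][j] = min(
                dp[i - 1][j - 1] + substitute,          # match or substitution
                dp[i - 1][j] + costs[i - 1],            # deletion of a reference word
                dp[i][j - 1] + insertion_cost,          # insertion of a spurious word
            )
    return dp[-1][-1] / sum(costs)


# Invented importance scores, for illustration only.
importance = {"the": 0.1, "meeting": 1.0, "starts": 0.8, "at": 0.1, "noon": 1.0}
ref = "the meeting starts at noon".split()
hyp_minor = "a meeting starts at noon".split()    # error on a function word
hyp_major = "the meeting starts at moon".split()  # error on a content word

wer_minor = weighted_error_rate(ref, hyp_minor)
wer_major = weighted_error_rate(ref, hyp_major)
imp_minor = weighted_error_rate(ref, hyp_minor, weight=importance.get)
imp_major = weighted_error_rate(ref, hyp_major, weight=importance.get)
```

Both hypotheses contain one substitution in five words, so plain WER cannot tell them apart, while the weighted score penalises the error on the content word more heavily.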

    Chinese Tones: Can You Listen With Your Eyes? The Influence of Visual Information on Auditory Perception of Chinese Tones

    Yueqiao Han. Summary: Considering the fact that more than half of the languages spoken in the world (60%-70%) are so-called tone languages (Yip, 2002), and tone is notoriously difficult to learn for westerners, this dissertation focused on tone perception in Mandarin Chinese by tone-naïve speakers. Moreover, it has been shown that speech perception is more than just an auditory phenomenon, especially in situations when the speaker’s face is visible. Therefore, the aim of this dissertation is to also study the value of visual information (over and above that of acoustic information) in Mandarin tone perception for tone-naïve perceivers, in combination with other contextual (such as speaking style) and individual factors (such as musical background). Consequently, this dissertation assesses the relative strength of acoustic and visual information in tone perception and tone classification. In the first two empirical and exploratory studies, in Chapters 2 and 3, we set out to investigate to what extent tone-naïve perceivers are able to identify Mandarin Chinese tones in isolated words, whether or not they can benefit from seeing the speaker’s face, and what the contribution is of a hyperarticulated speaking style and/or their own musical experience. In Chapter 2, we investigated the effect of visual cues (comparing audio-only with audio-visual presentations) and speaking style (comparing a natural speaking style with a teaching speaking style) on the perception of Mandarin tones by tone-naïve listeners, looking both at the relative strength of these two factors and their possible interactions; Chapter 3 was concerned with the effects of musicality of the participants (combined with modality) on Mandarin tone perception. 
In both of these studies, a Mandarin Chinese tone identification experiment was conducted: native speakers of a non-tonal language were asked to distinguish Mandarin Chinese tones based on audio(-only) or video (audio-visual) materials. In order to include variations, the experimental stimuli were recorded using four different speakers in imagined natural and teaching speaking scenarios. The proportion of correct responses (and average reaction times) of the participants were reported. The tone identification experiment presented in Chapter 2 showed that the video conditions (audio-visual natural and audio-visual teaching) resulted in an overall higher accuracy in tone perception than the auditory-only conditions (audio-only natural and audio-only teaching), but no better performance was observed in the audio-visual conditions in terms of reaction time, compared to the auditory-only conditions. Teaching style turned out to make no difference to the speed or accuracy of Mandarin tone perception (as compared to a natural speaking style). Further on, we presented the same experimental materials and procedure in Chapter 3, but now with musicians and non-musicians as participants. The Goldsmiths Musical Sophistication Index (Gold-MSI) was used to assess the musical aptitude of the participants. The data showed that overall, musicians outperformed non-musicians in the tone identification task in both auditory-visual and auditory-only conditions. Both groups identified tones more accurately in the auditory-visual conditions than in the auditory-only conditions. These results provided further evidence for the view that the availability of visual cues along with auditory information is useful for people who have no knowledge of Mandarin Chinese tones when they need to learn to identify these tones. Out of all the musical skills measured by Gold-MSI, the amount of musical training was the only predictor that had an impact on the accuracy of Mandarin tone perception. 
These findings suggest that learning to perceive Mandarin tones benefits from musical expertise, and visual information can facilitate Mandarin tone identification, but mainly for tone-naïve non-musicians. In addition, performance differed by tone: although musicality improved accuracy for every tone, some tones were easier to identify than others. In particular, tone 3 (a low-falling-rising tone) proved to be the easiest to identify, while tone 4 (a high-falling tone) was the most difficult for all participants. The results of the first two experiments presented in Chapters 2 and 3 showed that adding visual cues to clear auditory information facilitated the tone identification for tone-naïve perceivers (there is a significantly higher accuracy in audio-visual condition(s) than in auditory-only condition(s)). This visual facilitation was unaffected by the presence of (hyperarticulated) speaking style or the musical skill of the participants. Moreover, variations in speakers and tones had effects on the accurate identification of Mandarin tones by tone-naïve perceivers. In Chapter 4, we compared the relative contribution of auditory and visual information during Mandarin Chinese tone perception. More specifically, we aimed to answer two questions: firstly, whether or not there is audio-visual integration at the tone level (i.e., we explored perceptual fusion between auditory and visual information). Secondly, we studied how visual information affects tone perception for native speakers and non-native (tone-naïve) speakers. To do this, we constructed various tone combinations of congruent (e.g., an auditory tone 1 paired with a visual tone 1, written as AxVx) and incongruent (e.g., an auditory tone 1 paired with a visual tone 2, written as AxVy) auditory-visual materials and presented them to native speakers of Mandarin Chinese and speakers of tone-naïve languages. 
Accuracy, defined as the percentage correct identification of a tone based on its auditory realization, was reported. When comparing the relative contribution of auditory and visual information during Mandarin Chinese tone perception with congruent and incongruent auditory and visual Chinese material for native speakers of Chinese and non-tonal languages, we found that visual information did not significantly contribute to the tone identification for native speakers of Mandarin Chinese. When there is a discrepancy between visual cues and acoustic information, (native and tone-naïve) participants tend to rely more on the auditory input than on the visual cues. Unlike the native speakers of Mandarin Chinese, tone-naïve participants were significantly influenced by the visual information during their auditory-visual integration, and they identified tones more accurately in congruent stimuli than in incongruent stimuli. In line with our previous work, the tone confusion matrix showed that tone identification varies with individual tones, with tone 3 (the low-dipping tone) being the easiest one to identify, whereas tone 4 (the high-falling tone) was the most difficult one. The results did not show evidence for auditory-visual integration among native participants, while visual information was helpful for tone-naïve participants. However, even for this group, visual information only marginally increased the accuracy in the tone identification task, and this increase depended on the tone in question. Chapter 5 is another chapter that zooms in on the relative strength of auditory and visual information for tone-naïve perceivers, but from the aspect of tone classification. In this chapter, we studied the acoustic and visual features of the tones produced by native speakers of Mandarin Chinese. Computational models based on acoustic features, visual features and acoustic-visual features were constructed to automatically classify Mandarin tones. 
Moreover, this study examined what perceivers pick up (perception) from what a speaker does (production, facial expression) by studying both production and perception. To be more specific, this chapter set out to answer: (1) which acoustic and visual features of tones produced by native speakers could be used to automatically classify Mandarin tones. Furthermore, (2) whether or not the features used in tone production are similar to or different from the ones that have cue value for tone-naïve perceivers when they categorize tones; and (3) whether and how visual information (i.e., facial expression and facial pose) contributes to the classification of Mandarin tones over and above the information provided by the acoustic signal. To address these questions, the stimuli that had been recorded (and described in chapter 2) and the response data that had been collected (and reported on in chapter 3) were used. Basic acoustic and visual features were extracted. Based on them, we used Random Forest classification to identify the most important acoustic and visual features for classifying the tones. The classifiers were trained on produced tone classification (given a set of auditory and visual features, predict the produced tone) and on perceived/responded tone classification (given a set of features, predict the corresponding tone as identified by the participant). The results showed that acoustic features outperformed visual features for tone classification, both for the classification of the produced and the perceived tone. However, tone-naïve perceivers did revert to the use of visual information in certain cases (when they gave wrong responses). So, visual information does not seem to play a significant role in native speakers’ tone production, but tone-naïve perceivers do sometimes consider visual information in their tone identification. 
These findings provided additional evidence that auditory information is more important than visual information in Mandarin tone perception and tone classification. Notably, visual features contributed to the participants’ erroneous performance. This suggests that visual information actually misled tone-naïve perceivers in their task of tone identification. To some extent, this is consistent with our claim that visual cues do influence tone perception. In addition, the ranking of the auditory features and visual features in tone perception showed that the factor perceiver (i.e., the participant) was responsible for the largest amount of variance explained in the responses by our tone-naïve participants, indicating the importance of individual differences in tone perception. To sum up, perceivers who do not have tone in their language background tend to make use of visual cues from the speakers’ faces for their perception of unknown tones (Mandarin Chinese in this dissertation), in addition to the auditory information they clearly also use. However, auditory cues are still the primary source they rely on. There is a consistent finding across the studies that the variations between tones, speakers and participants have an effect on the accuracy of tone identification for tone-naïve speaker
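The accuracy measure used in the Chapter 4 study summarised above (percentage of responses that match the auditory tone, compared across congruent AxVx and incongruent AxVy stimuli) and the tone confusion matrix can be sketched as follows. The trial tuples are invented toy data, not results from the dissertation, and the function names are hypothetical.

```python
from collections import Counter


def accuracy(trials):
    """Percentage of trials whose response equals the auditory tone."""
    correct = sum(1 for auditory, _visual, response in trials if response == auditory)
    return 100.0 * correct / len(trials)


def split_by_congruence(trials):
    """Separate congruent (AxVx) trials from incongruent (AxVy) ones."""
    congruent = [t for t in trials if t[0] == t[1]]
    incongruent = [t for t in trials if t[0] != t[1]]
    return congruent, incongruent


def confusion_counts(trials):
    """Count (auditory tone, responded tone) pairs, i.e. a sparse confusion matrix."""
    return Counter((auditory, response) for auditory, _visual, response in trials)


# Toy trials as (auditory tone, visual tone, response); values are made up.
trials = [
    (1, 1, 1), (3, 3, 3), (4, 4, 2),   # congruent stimuli
    (1, 2, 1), (4, 1, 1), (3, 4, 3),   # incongruent stimuli
]
congruent, incongruent = split_by_congruence(trials)
confusions = confusion_counts(trials)
```

Scoring against the auditory tone, rather than the visual one, is what lets incongruent trials show whether participants follow the audio or the face.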