
    The Impact of Degraded Speech and Stimulus Familiarity in a Dichotic Listening Task

    It has been previously established that when engaged in a difficult, attention-intensive task that involves repeating one stream of information while blocking out another (the dichotic listening task), participants are often able to report hearing their own names in the unattended audio channel (Moray, 1959). This phenomenon, called the cocktail party effect, arises because words that are personally important have a lower recognition threshold, so less attention is needed to process them (Treisman, 1960). The current studies examined the ability of a person engaged in an attention-demanding task to hear and recall low-threshold words from a fictional story. These low-threshold words included a traditional alert word, "fire", and fictional character names from a popular franchise, Harry Potter. Further, the role of stimulus degradation was examined by including synthetic and accented speech in the task to determine how it would affect attention and performance. In Study 1, participants repeated passages from a novel that was largely unfamiliar to them, The Secret Garden, while blocking out a passage from a much more familiar source, Harry Potter and the Deathly Hallows. Each unattended Harry Potter passage was edited to include four names from the series and the word "fire" twice. The type of speech present in the attended and unattended ears (natural or synthetic) was varied to examine how processing degraded speech would affect performance. The speech that participants shadowed did not affect unattended recall, but it did affect shadowing accuracy. The speech type in the unattended ear did affect the ability to recall low-threshold Harry Potter information: when the unattended speech was synthetic, significantly less Harry Potter information was recalled. Interestingly, while Harry Potter information was recalled by participants with both high and low Harry Potter experience, the traditional low-threshold word "fire" was not noticed. Study 2 was designed to determine whether synthetic speech impeded the reporting of low-threshold Harry Potter names because it was degraded or simply because it differed from natural speech. In Study 2 the attended (shadowed) speech was held constant as American natural speech, and the unattended ear was manipulated. An accent different from the participants' native accent was included as a mild form of degradation. There were four experimental stimuli, each containing one of the following in the unattended ear: American natural, British natural, American synthetic, or British synthetic speech. Overall, more unattended information was reported when the unattended channel was natural rather than synthetic, implying that synthetic speech demands more working-memory resources than even accented natural speech. Further, experience with the Harry Potter franchise played a role in the ability to report unattended Harry Potter information: those with high levels of Harry Potter experience, particularly with the audiobooks, were able to process and report Harry Potter information from the unattended stimulus when it was British natural speech, whereas those with low Harry Potter experience were not able to report unattended Harry Potter information from this slightly degraded stimulus.
    It is therefore believed that the prior audiobook experience of the high-experience group acted as training, so that less working memory was necessary to encode the unattended Harry Potter information. A pilot study examined the impact of story familiarity in the attended and unattended channels of a dichotic listening task. In one condition, participants shadowed a Harry Potter passage (familiar) while a passage from The Secret Garden (unfamiliar) played in the unattended ear; in a second condition, participants shadowed The Secret Garden (unfamiliar) with a Harry Potter passage (familiar) in the unattended ear. There was no significant difference in the number of unattended names recalled. Those with low Harry Potter experience reported significantly less attended information when they shadowed Harry Potter than when they shadowed The Secret Garden, and there was a trend for those with high Harry Potter experience to report more attended information when shadowing Harry Potter than The Secret Garden. This implies that experience with a franchise and its characters may make it easier to recall information about a passage, while lack of experience provides no such assistance. Overall, the results indicate that we treat fictional characters in a way similar to how we treat ourselves: names and information about fictional characters were able to break through into attention during a highly demanding task, and experience with the characters assisted working memory in processing the information under degraded conditions. These results have important implications for training, the design of alerts, and the use of popular media in the classroom.
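
    A dichotic stimulus of the kind used in these studies can be illustrated with a short sketch: one passage is routed to the attended ear and the other to the unattended ear of a stereo file. This is a minimal illustration, not the authors' materials; the file names and the numpy/soundfile tooling are assumptions.

        # Build a stereo dichotic trial: attended passage in the left ear,
        # unattended passage in the right ear. Assumes two mono WAV files
        # at the same sample rate (hypothetical file names).
        import numpy as np
        import soundfile as sf

        attended, fs = sf.read("secret_garden_passage.wav")    # shadowed channel
        unattended, fs2 = sf.read("harry_potter_passage.wav")  # ignored channel
        assert fs == fs2, "passages must share a sample rate"

        n = min(len(attended), len(unattended))       # trim to the common length
        stereo = np.column_stack([attended[:n], unattended[:n]])
        sf.write("dichotic_trial.wav", stereo, fs)    # column 0 = left, 1 = right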

    Electrophysiological signatures of second language multimodal comprehension

    Language is multimodal: non-linguistic cues, such as prosody, gestures and mouth movements, are always present in face-to-face communication and interact to support processing. In this paper, we ask whether and how multimodal cues affect L2 processing by recording EEG from highly proficient bilinguals while they watched naturalistic materials. For each word, we quantified surprisal and the informativeness of prosody, gestures, and mouth movements. We found that each cue modulates the N400: prosodic accentuation, meaningful gestures, and informative mouth movements all reduce N400 amplitude. Further, the effects of meaningful gestures, but not of mouth informativeness, are enhanced by prosodic accentuation, whereas the effects of mouth movements are enhanced by meaningful gestures but reduced by beat gestures. Compared with L1 comprehenders, L2 participants benefit less from cues and their interactions, except for meaningful gestures and mouth movements. Thus, in real-world language comprehension, L2 comprehenders use multimodal cues just as L1 speakers do, albeit to a lesser extent.
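
    Per-word surprisal of the sort used here is typically estimated from a language model's next-word probabilities. The paper does not specify its model or toolkit, so the following sketch, using GPT-2 through the Hugging Face transformers library, is only an illustrative assumption.

        # Estimate per-token surprisal (in bits) from GPT-2; sub-word pieces
        # belonging to one word would then be summed to get word surprisal.
        import math
        import torch
        from transformers import GPT2LMHeadModel, GPT2TokenizerFast

        tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
        model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

        def token_surprisals(text):
            enc = tokenizer(text, return_tensors="pt")
            with torch.no_grad():
                logits = model(input_ids=enc.input_ids).logits
            # Position t predicts token t+1, so align logits with the next token.
            log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
            targets = enc.input_ids[0, 1:]
            nll = -log_probs[torch.arange(targets.numel()), targets]
            bits = nll / math.log(2)
            return list(zip(tokenizer.convert_ids_to_tokens(targets.tolist()),
                            bits.tolist()))

        for tok, s in token_surprisals("The cat sat on the mat"):
            print(f"{tok!r:>10}  {s:6.2f} bits")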

    Access to recorded interviews: A research agenda

    Recorded interviews form a rich basis for scholarly inquiry. Examples include oral histories, community memory projects, and interviews conducted for broadcast media. Emerging technologies offer the potential to radically transform the way in which recorded interviews are made accessible, but this vision will demand substantial investments from a broad range of research communities. This article reviews the present state of practice for making recorded interviews available and the state of the art for key component technologies. A large number of important research issues are identified, and from that set a coherent research agenda is proposed.

    Comprehension in-situ: how multimodal information shapes language processing

    The human brain supports communication in dynamic face-to-face environments where spoken words are embedded in linguistic discourse and accompanied by multimodal cues, such as prosody, gestures and mouth movements. However, we have only limited knowledge of how these multimodal cues jointly modulate language comprehension. In a series of behavioural and EEG studies, we investigated the joint impact of these cues when processing naturalistic-style materials. First, we built a mouth informativeness corpus of English words to quantify the mouth informativeness of the large number of words used in the subsequent experiments. Then, across two EEG studies, we found and replicated that native English speakers use multimodal cues and that their interactions dynamically modulate the N400 amplitude elicited by words that are less predictable in the discourse context (indexed by per-word surprisal values). We then extended the findings to second language comprehenders, finding that multimodal cues modulate L2 comprehension just as in L1, but to a lesser extent, although L2 comprehenders benefit more from meaningful gestures and mouth movements. Finally, in two behavioural experiments investigating whether multimodal cues jointly modulate the learning of new concepts, we found some evidence that the presence of iconic gestures improves memory, and that the effect may be larger when the information is also presented with prosodic accentuation. Overall, these findings suggest that real-world comprehension uses all available cues and weights them differently in a dynamic manner. Multimodal cues should therefore not be neglected in language studies: investigating communication in naturalistic contexts containing more than one cue can provide new insight into language comprehension in the real world.

    Developmental changes in the weighting of prosodic cues

    Previous research has shown that the weighting of, or attention to, acoustic cues at the level of the segment changes over the course of development (Nittrouer & Miller, 1997; Nittrouer, Manning & Meyer, 1993). In this paper we examined developmental changes in the weighting of acoustic cues at the suprasegmental level. Specifically, we tested English-learning 4-month-olds' performance on a clause segmentation task when each of three acoustic cues to clausal units was neutralized, and contrasted it with performance in a baseline condition where no cues were manipulated. Comparison with the reported performance of 6-month-olds on the same task (Seidl, 2007) reveals that 4-month-olds weight prosodic cues to clausal boundaries differently: they rely more heavily on all three correlates of clausal boundaries (pause, pitch and vowel duration), whereas 6-month-olds rely primarily on pitch. We interpret this as evidence that 4-month-olds use a holistic processing strategy, while 6-month-olds may already be able to attend separately to isolated cues in the input stream and may, furthermore, be able to exploit a language-specific cue weighting. Thus, as in other cognitive domains, infants begin as holistic auditory scene processors and only later become able to process individual auditory cues.
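
    One way to see what "neutralizing" a cue involves is a small signal-processing sketch: below, every silent gap in a recording is forced to a single fixed duration, so pause length can no longer signal a clause boundary. This is not a reconstruction of the original stimuli; the silence threshold, frame size, and numpy approach are all illustrative assumptions.

        # Crude pause neutralisation: detect low-energy frames and replace
        # every pause with a fixed 200 ms of silence (hypothetical mono file).
        import numpy as np
        import soundfile as sf

        audio, fs = sf.read("clause_stimulus.wav")
        frame = int(0.010 * fs)                         # 10 ms analysis frames
        rms = np.array([np.sqrt(np.mean(audio[i:i + frame] ** 2))
                        for i in range(0, len(audio) - frame, frame)])
        silent = rms < 0.01 * rms.max()                 # crude silence detector

        fixed_pause = np.zeros(int(0.200 * fs))         # every pause becomes 200 ms
        pieces, in_pause = [], False
        for i, s in enumerate(silent):
            if s and not in_pause:                      # entering a pause
                pieces.append(fixed_pause)
                in_pause = True
            elif not s:                                 # speech frame: keep it
                pieces.append(audio[i * frame:(i + 1) * frame])
                in_pause = False
        sf.write("pause_neutralised.wav", np.concatenate(pieces), fs)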

    Oesophageal speech: enrichment and evaluations

    After a laryngectomy (i.e. removal of the larynx), a patient can no longer speak with a healthy laryngeal voice and must adopt alternative methods of speaking, such as oesophageal speech. In this method, speech is produced using swallowed air and the vibrations of the pharyngo-oesophageal segment, which introduces several undesired artefacts and an abnormal fundamental frequency. This makes oesophageal speech more difficult to process than healthy speech, both for listeners and for signal processing. The aim of this thesis is to find solutions that make oesophageal speech signals easier to process, and to evaluate these solutions using a wide range of evaluation metrics.

    First, preliminary studies compared oesophageal speech with healthy speech. These revealed significantly lower intelligibility and higher listening effort for oesophageal speech. Intelligibility scores were comparable for listeners familiar and unfamiliar with oesophageal speech, but familiar listeners reported less effort. In another experiment, oesophageal speech was reported to require more listening effort than healthy speech even when its intelligibility was comparable. On investigating a neural correlate of listening effort (alpha power) using electroencephalography, higher alpha power was observed for oesophageal speech than for healthy speech, indicating higher listening effort. Additionally, participants with poorer cognitive abilities (i.e. working memory capacity) showed higher alpha power.

    Next, using several algorithms (pre-existing as well as novel approaches), oesophageal speech was transformed with the aim of making it more intelligible and less effortful. The novel approach was a deep neural network based voice conversion system in which the source was oesophageal speech and the target was synthetic speech matched in duration to the source. This eliminated the source-target alignment process, which is particularly error-prone for disordered speech such as oesophageal speech. Both speaker-dependent and speaker-independent versions of this system were implemented. The outputs of the speaker-dependent system had better short-term objective intelligibility scores, automatic speech recognition performance, and listener preference scores than unprocessed oesophageal speech. The speaker-independent system improved short-term objective intelligibility scores but not automatic speech recognition performance. Other signal transformations to enhance oesophageal speech were also performed, including removal of undesired artefacts and methods to improve the fundamental frequency. Of these, only the removal of undesired silences had some success (a 1.44 percentage point improvement in automatic speech recognition performance), and only for low-intelligibility oesophageal speech.

    Lastly, the outputs of these transformations were evaluated and compared with previous systems using an ensemble of evaluation metrics: short-term objective intelligibility, automatic speech recognition, subjective listening tests, and neural measures obtained using electroencephalography. The results reveal that the proposed neural network based system outperformed previous systems in improving the objective intelligibility and automatic speech recognition performance of oesophageal speech. In the subjective evaluations the results were mixed: preference scores improved somewhat, but speech intelligibility and listening effort scores did not. Overall, the results demonstrate several possibilities and new paths for enriching oesophageal speech with modern machine learning algorithms. The outcomes should benefit the disordered-speech community.
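
    The objective part of such an evaluation ensemble is straightforward to sketch. The thesis's exact pipeline and toolkits are not stated, so the short-time objective intelligibility (pystoi) and word error rate (jiwer) calls below are assumptions about one plausible setup.

        # Objective evaluation sketch: STOI against a time-aligned healthy
        # reference, plus ASR word error rate (hypothetical file names).
        import soundfile as sf
        from pystoi import stoi
        import jiwer

        reference, fs = sf.read("healthy_reference.wav")
        converted, fs2 = sf.read("converted_oesophageal.wav")
        assert fs == fs2, "signals must share a sample rate"

        n = min(len(reference), len(converted))   # STOI needs equal lengths
        score = stoi(reference[:n], converted[:n], fs, extended=False)
        print(f"STOI: {score:.3f}")               # higher = more intelligible

        # Word error rate of an ASR transcript against the prompt text.
        wer = jiwer.wer("the quick brown fox", "the quick brown box")
        print(f"WER: {wer:.2%}")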

    Assessing the quality of audio and video components in desktop multimedia conferencing

    This thesis addresses the HCI (Human-Computer Interaction) research problem of how to establish the level of audio and video quality that end users require to perform tasks successfully via networked desktop videoconferencing. There are currently no established HCI methods for assessing the perceived quality of audio and video delivered in desktop videoconferencing. The transport of real-time speech and video across new digital networks causes degradations, problems and issues that are novel and different from those common in the traditional telecommunications areas (telephone and television). Traditional assessment methods use very short test samples, are conducted outside a task-based environment, and focus on whether a degradation is noticed or not. These methods cannot establish what audio-visual quality users require to perform tasks successfully, with minimum user cost, in interactive conferencing environments. This thesis addresses this research gap by investigating and developing a battery of assessment methods for networked videoconferencing, suitable for use in both field trials and laboratory-based studies. The development and use of these new methods help identify the most critical variables (and levels of those variables) that affect perceived quality, and means by which network designers and HCI practitioners can address these problems are suggested. The thesis therefore contributes both methodological knowledge (new rating scales and data-gathering methods) and substantive knowledge (explicit knowledge about quality requirements for certain tasks) to the HCI and networking research communities on the subjective quality requirements of real-time interaction in networked videoconferencing environments. Exploratory research was carried out through an interleaved series of field trials and controlled studies, advancing substantive and methodological knowledge incrementally. Initial studies used the ITU-recommended assessment methods, but these proved unsuitable for assessing networked speech and video quality for a number of reasons. Later studies therefore investigated and established a novel polar rating scale, which can be used both as a static rating scale and as a dynamic continuous slider. These methods, and their further development in future lab-based and real conferencing environments, will enable subjective quality requirements and guidelines for different videoconferencing tasks to be established.
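
    A dynamic continuous rating slider of the kind described can be mocked up in a few lines: the participant adjusts the slider during the session, and each adjustment is logged with a timestamp for later analysis. The widget toolkit (tkinter) and the -3 to +3 polar range are assumptions for illustration, not the thesis's implementation.

        # Continuous quality slider that logs (time, rating) pairs.
        import time
        import tkinter as tk

        ratings = []                               # (seconds since start, rating)

        def log_rating(value):
            ratings.append((time.time() - start, float(value)))

        root = tk.Tk()
        root.title("Perceived audio/video quality")
        start = time.time()
        tk.Scale(root, from_=3, to=-3, resolution=0.1, length=300,
                 label="good (top) to bad (bottom)",
                 command=log_rating).pack(padx=20, pady=20)
        root.mainloop()                            # close the window to finish
        print(ratings)                             # time series for analysis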

    Facilitatory stimulation of the pre-SMA in healthy aging has distinct effects on task-based activity and connectivity

    Semantic cognition is central to communication and our understanding of the world, and it is usually well preserved in healthy aging. However, semantic control processes, which guide semantic access and retrieval, decline with age. The present study explored the potential of intermittent theta burst stimulation (iTBS) to enhance semantic cognition in healthy middle-aged to older adults. Using an individualized stimulation approach, we applied iTBS to the pre-supplementary motor area (pre-SMA) and assessed task-specific effects on semantic judgments with functional neuroimaging. We found increased activation after effective relative to sham stimulation only for the semantic task, in visual and dorsal attention networks. Further, iTBS increased functional connectivity in domain-general executive networks. Notably, stimulation-induced changes in activation and connectivity related differently to behavior: while increased activation of the parietal dorsal attention network was linked to poorer semantic performance, its enhanced coupling with the pre-SMA was associated with more efficient semantic processing. Our findings indicate differential effects of iTBS on activity and connectivity. We show that iTBS modulates networks in a task-dependent manner and generates remote network effects. Stimulating the pre-SMA was linked to more efficient, but not better, performance, indicating a role in domain-general semantic control processes distinct from domain-specific semantic control.