491 research outputs found

    ์Œ์„ฑ ์˜๋ฏธ ์ง€๊ฐ์‹œ์˜ ๊ณ ๋“ฑ ์–ธ์–ด ์„ฑ๋ถ„ ์ฒ˜๋ฆฌ ๋””์ฝ”๋”ฉ

    Get PDF
    ํ•™์œ„๋…ผ๋ฌธ(์„์‚ฌ) -- ์„œ์šธ๋Œ€ํ•™๊ต๋Œ€ํ•™์› : ์ž์—ฐ๊ณผํ•™๋Œ€ํ•™ ํ˜‘๋™๊ณผ์ • ๋‡Œ๊ณผํ•™์ „๊ณต, 2022. 8. ์ •์ฒœ๊ธฐ.High-level linguistic processing in the human brain remains incompletely understood and constitutes a challenging topic in speech neuroscience. While most studies focused on decoding low-level phonetic components using intracranial recordings of the human brain during speech perception, few studies have attempted to decode high-level syntactic or semantic features. If any, most of the research targeting semantic decoding is conducted with picture naming tasks, which only deal with visual language rather than spoken language. The presenting study is focused on better characterizing the neural representations of processing spoken language perception, namely speech perception. Especially not on the lower-level language components such as phonemes or phonetics, but the higher-level components such as syntax and semantics. Since it is widely accepted that the tripartite nature of language processing consists of phonology, syntax, and semantics, a strategical method for analyzing speech perception tasks that can reject the intervention of phonetic factors was mandatory. Therefore, we conducted a question-and-answer speech task containing four questions revolving around two semantic categories (alive, body parts) with phonetically controlled words. Intracranial neural signals were recorded during the question-and-answer speech task using electrocorticography (ECoG) electrodes for 14 epilepsy patients. Post hoc brain activity analysis was conducted for three subjects who answered correctly to every trial (144 trials in total) to ensure the analyzed data contained only brain signals collected during the correct semantic processing. The decoding results suggest that absolute and relative spectral neural feature trends occur across all participants in particular time windows. Furthermore, the spatial aspect of the neural features that yield the best decoding accuracy verifies the current biophysiological brain language model explaining the circular nature of word meaning comprehension in the left hemisphere language network.์ธ๊ฐ„์˜ ๊ณ ๋“ฑ ์„ฑ๋ถ„ ์–ธ์–ด ์ฒ˜๋ฆฌ์™€ ๊ด€๋ จํ•œ ๋‘๋‡Œ ํ™œ๋™์„ ํ•ด๋…ํ•˜๋Š” ์—ฐ๊ตฌ๋Š” ์‹ ๊ฒฝ์–ธ์–ดํ•™ ๋ถ„์•ผ์—์„œ๋„ ์•„์ง ๊นŠ์ด ์—ฐ๊ตฌ๋˜์ง€ ์•Š์€ ๋ถ„์•ผ ์ค‘ ํ•˜๋‚˜์ด๋‹ค. ์นจ์Šต์  ์ „๊ทน์„ ํ†ตํ•ด ์–ป์€ ๋‡Œํ”ผ์งˆ ๋‡ŒํŒŒ๋ฅผ ์ด์šฉํ•œ ๋Œ€๋ถ€๋ถ„์˜ ์–ธ์–ด ๋””์ฝ”๋”ฉ ์—ฐ๊ตฌ๋Š” ์Œ์†Œ๋‚˜ ์Œ์ ˆ ์ˆ˜์ค€์˜ ํ•˜์œ„ ์–ธ์–ด ์„ฑ๋ถ„์—์„œ ์ง„ํ–‰๋˜์–ด ์™”๊ณ , ํ†ต์‚ฌ๋‚˜ ์˜๋ฏธ์™€ ๊ฐ™์€ ๊ณ ๋“ฑ ์–ธ์–ด ์„ฑ๋ถ„์— ๋Œ€ํ•œ ๋””์ฝ”๋”ฉ ์—ฐ๊ตฌ๋Š” ๋“œ๋ฌผ๋‹ค. ๋“œ๋ฌผ๊ฒŒ ์ง„ํ–‰๋œ ๊ณ ๋“ฑ ์–ธ์–ด ์„ฑ๋ถ„ ๋””์ฝ”๋”ฉ ์—ฐ๊ตฌ ๋˜ํ•œ ๋Œ€๋‹ค์ˆ˜๊ฐ€ ์‹œ๊ฐ์  ์–ธ์–ด ์ฒ˜๋ฆฌ๋ฅผ ์—ฐ๊ตฌํ•œ ๊ฒฐ๊ณผ๋“ค์ด๋ฉฐ, ์†Œ๋ฆฌ ์–ธ์–ด ๋””์ฝ”๋”ฉ ์—ฐ๊ตฌ๋Š” ํƒœ๋™ ๋‹จ๊ณ„์— ๋จธ๋ฌด๋ฅด๊ณ  ์žˆ๋‹ค. ๋ณธ ์—ฐ๊ตฌ๋Š” ์†Œ๋ฆฌ ์–ธ์–ด ์ง€๊ฐ์‹œ์˜ ๋‘๋‡Œ ํ™œ์„ฑ์„ ๋ถ„์„ํ•˜์—ฌ ๊ทธ ์ฒ˜๋ฆฌ ๊ณผ์ •์˜ ๋‡ŒํŒŒ ์‹ ํ˜ธ ํŠน์„ฑ์„ ๊ทœ๋ช…ํ•˜๊ณ ์ž ํ•œ๋‹ค. ํŠนํžˆ ์ธ๊ฐ„ ์Œ์„ฑ ์–ธ์–ด์˜ ํ•˜์œ„ ๊ตฌ์„ฑ ์„ฑ๋ถ„๋ณด๋‹ค๋Š” ํ†ต์‚ฌ์™€ ์˜๋ฏธ ์œ„์ฃผ์˜ ๊ณ ๋“ฑ ๊ตฌ์„ฑ ์„ฑ๋ถ„์„ ์ฒ˜๋ฆฌํ•˜๋Š” ๋ฐ์— ๊ด€์—ฌํ•˜๋Š” ๋‡ŒํŒŒ์˜ ์‹œ๊ฐ„์ , ์ฃผํŒŒ์ˆ˜์ , ๊ณต๊ฐ„์  ํŠน์„ฑ์— ์ง‘์ค‘ํ•˜์—ฌ ๋ถ„์„์„ ์ง„ํ–‰ํ•˜์˜€๋‹ค. 
์–ธ์–ด ์ฒ˜๋ฆฌ์˜ ์ฃผ๋œ ์„ธ ๊ฐ€์ง€ ์š”์†Œ๋Š” โ€˜์Œ์†Œ (phonetics)โ€™, โ€˜ํ†ต์‚ฌ (syntactics)โ€™, โ€˜์˜๋ฏธ (semantics)โ€™๋ผ๋Š” ์ ์„ ๊ณ ๋ คํ•˜์—ฌ, ์Œ์†Œ ์†Œ์ค€์˜ ๋‡ŒํŒŒ ํ™œ๋™์„ ํ†ต์ œํ•  ์ˆ˜ ์žˆ๋Š” ์‹คํ—˜ ํŒจ๋Ÿฌ๋‹ค์ž„์„ ๊ตฌ์ƒํ•˜์˜€์œผ๋ฉฐ, ๊ตฌ์ฒด์ ์œผ๋กœ๋Š” ๋‘๊ฐœ์˜ ๋‹ค๋ฅธ ์˜๋ฏธ ๋ฒ”์ฃผ (์ƒ๋ช…, ์‹ ์ฒด)์— ๋Œ€ํ•ด์„œ ๋ฌป๋Š” ์Œ์†Œ์ ์œผ๋กœ ๋™๋“ฑํ•œ ๋‹จ์–ด๊ฐ€ ํฌํ•จ๋œ ์งˆ๋ฌธ์„ ๋“ค๋ ค์ค€ ํ›„ ์˜๋ฏธ๋ฅผ ํŒŒ์•…ํ•ด ๋Œ€๋‹ตํ•˜๋Š” ๊ณผ์ •์˜ ๋‡ŒํŒŒ๋ฅผ ๊ธฐ๋กํ•˜๋Š” ์‹คํ—˜์„ ์ง„ํ–‰ํ•˜์˜€๋‹ค. ๋‡ŒํŒŒ ์‹ ํ˜ธ๋Š” ๊ฒฝ๋ง‰ํ•˜ ์ „๊ทน ์‚ฝ์ž…์ˆ  (Electrocorticography, ECoG)์„ ํ†ตํ•ด 14๋ช…์˜ ๋‡Œ์ „์ฆ ํ™˜์ž๋กœ๋ถ€ํ„ฐ ์นจ์Šต์  ๋ฐฉ์‹์œผ๋กœ ์ธก์ •๋˜์—ˆ๋‹ค. ๋‡ŒํŒŒ ๋””์ฝ”๋”ฉ ๋ถ„์„์—๋Š” ํ”ผํ—˜์ž์˜ ๋‘๋‡Œ๊ฐ€ ์˜ณ์€ ๋ฐฉ์‹์œผ๋กœ ์ฒ˜๋ฆฌํ•œ ๊ณ ๋“ฑ ์–ธ์–ด ์„ฑ๋ถ„์ด ๋ฐ˜์˜๋œ ์‹คํ—˜๋งŒ์„ ํฌํ•จํ•˜๊ธฐ ์œ„ํ•ด์„œ ๋ชจ๋“  ์‹คํ—˜์—์„œ ์˜ณ์€ ๋Œ€๋‹ต์„ ํ•œ ์„ธ ๋ช…์˜ ํ™˜์ž๋งŒ์„ ๋Œ€์ƒ์œผ๋กœ ํ•˜์—ฌ ๋ถ„์„์„ ์ง„ํ–‰ํ•˜์˜€๋‹ค. ๋””์ฝ”๋”ฉ ๋ถ„์„ ๊ฒฐ๊ณผ ์„ธ ๋ช…์˜ ํ™˜์ž์— ๊ฑธ์ณ ํ•ต์‹ฌ ๋‹จ์–ด (โ€˜๊ฒƒ์€โ€™, โ€˜๋ฌด์—‡์ž…๋‹ˆ๊นŒโ€™) ์Œ์„ฑ ์ง€๊ฐ ์ดํ›„ ํŠน์ • ์‹œ๊ฐ„๋Œ€์—์„œ ํŠน์ • ์ฃผํŒŒ์ˆ˜๋Œ€์˜ ๋‡ŒํŒŒ๊ฐ€ ์–‘ ๊ทน๋‹จ์˜ ์˜๋ฏธ๋ฅผ ๋†’์€ ์ˆ˜์ค€์˜ ์ •ํ™•๋„(%)๋กœ ๋ถ„๋ฅ˜ํ•˜๋Š” ๋ฐ์— ์‚ฌ์šฉ๋  ์ˆ˜ ์žˆ๋‹ค๋Š” ๊ฒƒ์„ ๋ฐํ˜”๋‹ค. ๋˜ํ•œ ์ด๋Ÿฌํ•œ ๋†’์€ ์ •ํ™•๋„๋ฅผ ๊ธฐ๋กํ•œ ๋‡ŒํŒŒ์˜ ํŠน์„ฑ์—๋Š” ๋ชจ๋“  ํ™˜์ž์— ๊ฑธ์ณ ์ ˆ๋Œ€์  ํ˜น์€ ์ƒ๋Œ€์  ํŠธ๋ Œ๋“œ๊ฐ€ ๊ด€์ฐฐ๋˜๋ฉฐ, ๊ด€์ฐฐ๋˜๋Š” ๋‡ŒํŒŒ์˜ ๊ณต๊ฐ„์  ํŠน์„ฑ์€ ํ˜„์žฌ ํ†ต์šฉ๋˜๋Š” ์‹ ๊ฒฝ์–ธ์–ดํ•™์  ์–ธ์–ด ์ฒ˜๋ฆฌ ๋ชจ๋ธ์ด ์„ค๋ช…ํ•˜๋Š” ์Œ์„ฑ ์–ธ์–ด ์ฒ˜๋ฆฌ ๋ฐฉ์‹๊ณผ ์ผ๋งฅ์ƒํ†ตํ•จ์„ ๋ฐํ˜”๋‹ค.Abstract โ…ฐ 1. Introduction 1 2. Materials and Methods 4 3. Results 8 4. Discussion 12 References 15 List of Figures 20 Supplementary information 28 Abstract in Korean 36์„

    Telephone speech recognition via the combination of knowledge sources in a segmental speech model

    Get PDF
    The currently dominant speech recognition methodology, Hidden Markov Modeling, treats speech as a stochastic process with very simple mathematical properties. The simplistic assumptions of the model, and especially that of the independence of the observation vectors, have been criticized by many in the literature, and alternative solutions have been proposed. One such alternative is segmental modeling, and the OASIS recognizer we have been working on in recent years belongs to this category. In this paper we go one step further and suggest that speech recognition should be considered a knowledge source combination problem. We offer a generalized algorithmic framework for this approach and show that both hidden Markov and segmental modeling are special cases of this decoding scheme. In the second part of the paper we describe the current components of the OASIS system and evaluate its performance on a very difficult recognition task, the phonetically balanced sentences of the MTBA Hungarian Telephone Speech Database. Our results show that OASIS outperforms a traditional HMM system in phoneme classification and achieves practically the same recognition scores at the sentence level.
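    The paper's framing of recognition as knowledge source combination, with hidden Markov and segmental decoding as special cases, can be made concrete with a small dynamic-programming sketch. Everything below is an assumed toy formulation, not the OASIS implementation: each source is a callable returning a log-score for a candidate segment, the scores are combined log-linearly with fixed weights, and label-transition scores are omitted for brevity. With `max_len=1` the search reduces to frame-synchronous (HMM-style) decoding; larger `max_len` gives segmental decoding.

```python
import math

def decode(n_frames, labels, sources, weights, max_len=10):
    """Best-scoring labeled segmentation of frames [0, n_frames)."""
    # best[t] = (score, path) for the best segmentation of frames [0, t)
    best = [(0.0, [])] + [(-math.inf, None)] * n_frames
    for t in range(1, n_frames + 1):
        for d in range(1, min(max_len, t) + 1):       # candidate segment length
            prev_score, prev_path = best[t - d]
            if prev_path is None:
                continue
            for lab in labels:
                s = prev_score + sum(
                    w * src(t - d, t, lab)            # log-score per source
                    for w, src in zip(weights, sources))
                if s > best[t][0]:
                    best[t] = (s, prev_path + [(t - d, t, lab)])
    return best[n_frames]

# Example (hypothetical sources): one acoustic and one duration model.
# score, segmentation = decode(50, ["a", "b"],
#                              sources=[acoustic_score, duration_score],
#                              weights=[1.0, 1.0])
```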

    Spoonerisms: An Analysis of Language Processing in Light of Neurobiology

    Get PDF
    Spoonerisms are the category of speech errors in which sounds or words are jumbled. The author examines language, the brain, and the correlation between spoonerisms and the neural structures involved in language processing.

    Transformation of a temporal speech cue to a spatial neural code in human auditory cortex

    Get PDF
    In speech, listeners extract continuously-varying spectrotemporal cues from the acoustic signal to perceive discrete phonetic categories. Spectral cues are spatially encoded in the amplitude of responses in phonetically-tuned neural populations in auditory cortex. It remains unknown whether similar neurophysiological mechanisms encode temporal cues like voice-onset time (VOT), which distinguishes sounds like /b/ and /p/. We used direct brain recordings in humans to investigate the neural encoding of temporal speech cues with a VOT continuum from /ba/ to /pa/. We found that distinct neural populations respond preferentially to VOTs from one phonetic category, and are also sensitive to sub-phonetic VOT differences within a population's preferred category. In a simple neural network model, simulated populations tuned to detect either temporal gaps or coincidences between spectral cues captured encoding patterns observed in real neural data. These results demonstrate that a spatial/amplitude neural code underlies the cortical representation of both spectral and temporal speech cues.
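    A toy version of the modeling idea in this abstract, simulated populations acting as coincidence or gap detectors on the burst and voicing onsets, might look like the following. The functional forms, time constants, and the 30 ms crossover are illustrative assumptions, not the paper's fitted model.

```python
import numpy as np

def population_response(vot_ms, kind, sigma_ms=10.0):
    """Response of a simulated population to a given voice-onset time."""
    if kind == "coincidence":   # prefers burst and voicing together -> /b/-like
        return np.exp(-(vot_ms ** 2) / (2 * sigma_ms ** 2))
    elif kind == "gap":         # prefers a long burst-to-voicing gap -> /p/-like
        return 1.0 / (1.0 + np.exp(-(vot_ms - 30.0) / sigma_ms))
    raise ValueError(kind)

vots = np.arange(0, 55, 5)      # /ba/ ... /pa/ continuum in ms
b_like = [population_response(v, "coincidence") for v in vots]
p_like = [population_response(v, "gap") for v in vots]
# Each population responds preferentially to one category, while its
# amplitude still grades with sub-phonetic VOT differences within it.
```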

    The Mason-Alberta Phonetic Segmenter: A forced alignment system based on deep neural networks and interpolation

    Full text link
    Forced alignment systems automatically determine boundaries between segments in speech data, given an orthographic transcription. These tools are commonplace in phonetics to facilitate the use of speech data that would be infeasible to segment manually. In the present paper, we describe a new neural network-based forced alignment system, the Mason-Alberta Phonetic Segmenter (MAPS). The MAPS aligner serves as a testbed for two possible improvements we pursue for forced alignment systems. The first is treating the acoustic model in a forced aligner as a tagging task, rather than a classification task, motivated by the common understanding that segments in speech are not truly discrete and commonly overlap. The second is an interpolation technique that allows boundaries more precise than the common 10 ms limit in modern forced alignment systems. We compare configurations of our system to a state-of-the-art system, the Montreal Forced Aligner. The tagging approach did not generally yield improved results over the Montreal Forced Aligner. However, a system with the interpolation technique achieved a 27.92% relative increase over the Montreal Forced Aligner in the number of boundaries placed within 10 ms of the target on the test set. We also reflect on the task and training process for acoustic modeling in forced alignment, highlighting how the output targets for these models do not match phoneticians' conception of similarity between phones, and how reconciling this tension may require rethinking the task and output targets, or how speech itself should be segmented.
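    The interpolation improvement can be illustrated with a generic sub-frame refinement trick: fit a parabola through the per-frame boundary scores around the best-scoring frame and place the boundary at its vertex, between frames. This is a standard technique shown here only to make the idea concrete; MAPS's actual interpolation scheme may differ.

```python
import numpy as np

def refine_boundary(scores, frame_ms=10.0):
    """scores: per-frame boundary scores; returns a boundary time in ms."""
    i = int(np.argmax(scores))
    if 0 < i < len(scores) - 1:
        left, mid, right = scores[i - 1], scores[i], scores[i + 1]
        denom = left - 2 * mid + right
        # Vertex of the parabola through the three points, in frame units.
        offset = 0.5 * (left - right) / denom if denom != 0 else 0.0
    else:
        offset = 0.0                 # edge frame: no neighbors to fit
    return (i + offset) * frame_ms

# Lands between frames 1 and 2, i.e., finer than the 10 ms frame grid:
print(refine_boundary(np.array([0.1, 0.6, 0.9, 0.5, 0.2])))
```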

    Better Evaluation of ASR in Speech Translation Context Using Word Embeddings

    No full text
    This paper investigates the evaluation of ASR in a spoken language translation context. More precisely, we propose a simple extension of the WER metric that penalizes substitution errors differently according to their context, using word embeddings. For instance, the proposed metric should catch near matches (mainly morphological variants) and penalize less heavily this kind of error, which has a more limited impact on translation performance. Our experiments show that the correlation of the proposed metric with SLT performance is better than that of WER. Oracle experiments are also conducted and show the ability of our metric to find better hypotheses (to be translated) in the ASR N-best. Finally, a preliminary experiment in which ASR tuning is based on our new metric shows encouraging results. For reproducible experiments, the code allowing to call our modified WER and the corpora used are made available to the research community.
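    The proposed metric can be sketched as a weighted edit distance in which the substitution cost shrinks with embedding similarity, so near matches such as morphological variants are penalized less. The scaling below (1 minus clipped cosine similarity) and the `emb` lookup table are assumptions for illustration; the paper's exact formulation may differ.

```python
import numpy as np

def soft_wer(ref, hyp, emb):
    """WER variant: ref/hyp are word lists, emb maps words to vectors."""
    def sub_cost(r, h):
        if r == h:
            return 0.0
        vr, vh = emb.get(r), emb.get(h)
        if vr is None or vh is None:
            return 1.0               # out-of-vocabulary: full penalty
        cos = np.dot(vr, vh) / (np.linalg.norm(vr) * np.linalg.norm(vh))
        return 1.0 - max(cos, 0.0)   # similar words cost less than 1
    n, m = len(ref), len(hyp)
    d = np.zeros((n + 1, m + 1))
    d[:, 0] = np.arange(n + 1)
    d[0, :] = np.arange(m + 1)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i, j] = min(d[i - 1, j] + 1,        # deletion
                          d[i, j - 1] + 1,        # insertion
                          d[i - 1, j - 1] + sub_cost(ref[i - 1], hyp[j - 1]))
    return d[n, m] / max(n, 1)
```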

    MULTIVARIATE ANALYSIS FOR UNDERSTANDING COGNITIVE SPEECH PROCESSING

    Get PDF
    • โ€ฆ
    corecore