Decoding of high-level linguistic component processing during semantic perception of speech
Thesis (M.S.) -- Seoul National University Graduate School: College of Natural Sciences, Interdisciplinary Program in Brain Science, August 2022. Advisor: Chun Kee Chung.
High-level linguistic processing in the human brain remains incompletely understood and constitutes a challenging topic in speech neuroscience. While most studies have focused on decoding low-level phonetic components from intracranial recordings during speech perception, few have attempted to decode high-level syntactic or semantic features, and the studies that do target semantic decoding are mostly conducted with picture-naming tasks, which probe visual rather than spoken language.
The present study aims to better characterize the neural representations underlying spoken language perception, focusing not on lower-level language components such as phonemes or phonetics but on higher-level components such as syntax and semantics. Since language processing is widely held to be tripartite, comprising phonology, syntax, and semantics, an analysis strategy that excludes the influence of phonetic factors was essential. We therefore conducted a question-and-answer speech task containing four questions revolving around two semantic categories (alive, body parts), using phonetically controlled words.
Intracranial neural signals were recorded with electrocorticography (ECoG) electrodes from 14 epilepsy patients during the question-and-answer speech task. Post hoc brain activity analysis was restricted to the three subjects who answered every trial correctly (144 trials in total), ensuring that the analyzed data contained only brain signals collected during correct semantic processing. The decoding results suggest that absolute and relative spectral neural feature trends occur across all participants within particular time windows. Furthermore, the spatial distribution of the neural features yielding the best decoding accuracy is consistent with the current biophysiological model of brain language processing, which describes the circular nature of word-meaning comprehension in the left-hemisphere language network.
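The time-windowed spectral decoding summarized above can be sketched minimally as a nearest-centroid classifier over band-power features (the feature values, electrode count, and classifier choice here are all hypothetical illustrations, not the thesis's actual pipeline):

```python
import math

def centroid(vectors):
    """Mean feature vector of one semantic class."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def classify(x, centroids):
    """Nearest-centroid decoding of the semantic category (e.g. alive
    vs. body part) from spectral features in one time window."""
    def dist(u, v):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
    return min(centroids, key=lambda label: dist(x, centroids[label]))

# Hypothetical band-power features per trial (two electrodes):
train = {
    "alive":     [[1.0, 0.2], [0.9, 0.3], [1.1, 0.1]],
    "body part": [[0.2, 1.0], [0.3, 0.9], [0.1, 1.2]],
}
centroids = {label: centroid(trials) for label, trials in train.items()}
```

Held-out trials are then assigned to whichever class centroid their window features fall closest to.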
Telephone speech recognition via the combination of knowledge sources in a segmental speech model
The currently dominant speech recognition methodology, hidden Markov modeling, treats speech as a stochastic process with very simple mathematical properties. The simplistic assumptions of the model, especially the independence of the observation vectors, have been criticized by many in the literature, and alternative solutions have been proposed. One such alternative is segmental modeling, and the OASIS recognizer we have been developing in recent years belongs to this category. In this paper we go one step further and suggest that speech recognition be considered a knowledge source combination problem. We offer a generalized algorithmic framework for this approach and show that both hidden Markov and segmental modeling are special cases of this decoding scheme. In the second part of the paper we describe the current components of the OASIS system and evaluate its performance on a very difficult recognition task: the phonetically balanced sentences of the MTBA Hungarian Telephone Speech Database. Our results show that OASIS outperforms a traditional HMM system in phoneme classification and achieves practically the same recognition scores at the sentence level.
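The knowledge-source-combination view can be sketched as follows (toy scorers and weights, not the actual OASIS models): each source assigns a log-domain score to a candidate segment, the decoder combines them log-linearly, and a dynamic program searches over segmentations. Restricting segments to one frame recovers HMM-style frame-synchronous decoding; allowing longer segments gives segmental decoding.

```python
import math

# Hypothetical knowledge sources: each assigns a log-domain score to a
# (segment, label) pair. These toy scorers stand in for real models.
def acoustic_score(segment, label):
    return -0.5 * abs(len(segment) - 3)   # toy: prefers ~3-frame segments

def duration_score(segment, label):
    return -0.1 * len(segment)            # toy: mild length penalty

def combine(scores, weights):
    """Weighted log-linear combination of the knowledge sources."""
    return sum(w * s for w, s in zip(weights, scores))

def decode(frames, labels, sources, weights, max_len=5):
    """Best-scoring segmentation and labeling via dynamic programming.
    With max_len=1 this collapses to frame-synchronous (HMM-like)
    decoding; max_len > 1 gives segmental decoding."""
    best = [(-math.inf, None)] * (len(frames) + 1)
    best[0] = (0.0, [])
    for end in range(1, len(frames) + 1):
        for start in range(max(0, end - max_len), end):
            seg = frames[start:end]
            for lab in labels:
                total = best[start][0] + combine(
                    [src(seg, lab) for src in sources], weights)
                if total > best[end][0]:
                    best[end] = (total, best[start][1] + [(start, end, lab)])
    return best[len(frames)]
```

With the toy scorers above, six frames decode into two three-frame segments, while forcing max_len=1 yields six one-frame segments, illustrating how both modeling styles fall out of one scheme.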
Spoonerisms: An Analysis of Language Processing in Light of Neurobiology
Spoonerisms are the category of speech errors involving jumbled-up words. The author examines language, the brain, and the correlation between spoonerisms and the neural structures involved in language processing.
Transformation of a temporal speech cue to a spatial neural code in human auditory cortex
In speech, listeners extract continuously varying spectrotemporal cues from the acoustic signal to perceive discrete phonetic categories. Spectral cues are spatially encoded in the amplitude of responses in phonetically tuned neural populations in auditory cortex. It remains unknown whether similar neurophysiological mechanisms encode temporal cues like voice-onset time (VOT), which distinguishes sounds like /b/ and /p/. We used direct brain recordings in humans to investigate the neural encoding of temporal speech cues with a VOT continuum from /ba/ to /pa/. We found that distinct neural populations respond preferentially to VOTs from one phonetic category, and are also sensitive to sub-phonetic VOT differences within a population's preferred category. In a simple neural network model, simulated populations tuned to detect either temporal gaps or coincidences between spectral cues captured the encoding patterns observed in real neural data. These results demonstrate that a spatial/amplitude neural code underlies the cortical representation of both spectral and temporal speech cues.
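A toy illustration of such gap/coincidence tuning (illustrative only; the paper's actual network model is not reproduced here) treats each population as a Gaussian tuning curve over the burst-to-voicing lag:

```python
import math

def population_response(burst_t, voicing_t, preferred_vot, width=10.0):
    """Toy tuning curve: a neural population responds maximally when the
    burst-to-voicing lag (the VOT, in ms) is near its preferred value."""
    vot = voicing_t - burst_t
    return math.exp(-((vot - preferred_vot) ** 2) / (2 * width ** 2))

def coincidence_pop(burst_t, voicing_t):
    """'Coincidence' population: prefers near-simultaneous cues (/ba/-like)."""
    return population_response(burst_t, voicing_t, preferred_vot=0.0)

def gap_pop(burst_t, voicing_t):
    """'Gap' population: prefers a long lag (/pa/-like)."""
    return population_response(burst_t, voicing_t, preferred_vot=50.0)
```

Which population responds with the larger amplitude then carries the phonetic category, i.e. a spatial/amplitude code for a temporal cue.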
The Mason-Alberta Phonetic Segmenter: A forced alignment system based on deep neural networks and interpolation
Forced alignment systems automatically determine boundaries between segments
in speech data, given an orthographic transcription. These tools are
commonplace in phonetics to facilitate the use of speech data that would be
infeasible to manually transcribe and segment. In the present paper, we
describe a new neural network-based forced alignment system, the Mason-Alberta
Phonetic Segmenter (MAPS). The MAPS aligner serves as a testbed for two
possible improvements we pursue for forced alignment systems. The first is
treating the acoustic model in a forced aligner as a tagging task, rather than
a classification task, motivated by the common understanding that segments in
speech are not truly discrete and commonly overlap. The second is an
interpolation technique to allow boundaries more precise than the common 10 ms
limit in modern forced alignment systems. We compare configurations of our
system to a state-of-the-art system, the Montreal Forced Aligner. The tagging
approach did not generally yield improved results over the Montreal Forced
Aligner. However, a system with the interpolation technique had a 27.92%
increase relative to the Montreal Forced Aligner in the number of boundaries
within 10 ms of the target on the test set. We also reflect on the task and
training process for acoustic modeling in forced alignment, highlighting how
the output targets for these models do not match phoneticians' conception of
similarity between phones and that reconciliation of this tension may require
rethinking the task and output targets or how speech itself should be
segmented.
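Sub-frame boundary interpolation of the kind described above can be sketched with quadratic interpolation around the peak per-frame boundary score (a hedged sketch; the exact interpolation scheme in MAPS may differ):

```python
def refine_boundary(scores, frame_ms=10.0):
    """Refine a boundary estimate below the frame step by fitting a
    parabola through the peak frame score and its two neighbours
    (quadratic interpolation).

    scores: per-frame boundary scores at frame_ms spacing.
    Returns the interpolated boundary time in ms."""
    i = max(range(len(scores)), key=scores.__getitem__)
    if i == 0 or i == len(scores) - 1:
        return i * frame_ms            # no neighbours to interpolate with
    a, b, c = scores[i - 1], scores[i], scores[i + 1]
    denom = a - 2.0 * b + c
    offset = 0.0 if denom == 0.0 else 0.5 * (a - c) / denom
    return (i + offset) * frame_ms
```

A symmetric peak stays on the 10 ms grid, while an asymmetric one is nudged toward the stronger neighbour, yielding boundaries finer than the frame step.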
Better Evaluation of ASR in Speech Translation Context Using Word Embeddings
This paper investigates the evaluation of ASR in a spoken language translation context. More precisely, we propose a simple extension of the WER metric that penalizes substitution errors differently according to their context, using word embeddings. For instance, the proposed metric should catch near matches (mainly morphological variants) and penalize this kind of error less, since it has a more limited impact on translation performance. Our experiments show that the proposed metric correlates better with SLT performance than WER does. Oracle experiments are also conducted and show the ability of our metric to find better hypotheses (to be translated) in the ASR N-best lists. Finally, a preliminary experiment in which ASR tuning is based on our new metric shows encouraging results. For reproducible experiments, the code implementing our modified WER and the corpora used are made available to the research community.
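The proposed idea can be sketched as an edit-distance WER whose substitution cost shrinks with embedding similarity (the toy two-dimensional embeddings and the exact cost form below are illustrative assumptions, not the paper's definitions):

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def soft_wer(ref, hyp, emb):
    """Edit-distance WER where a substitution costs
    1 - cosine(emb[ref_word], emb[hyp_word]) instead of a flat 1,
    so near matches (e.g. morphological variants) are penalized less."""
    n, m = len(ref), len(hyp)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = float(i)
    for j in range(1, m + 1):
        d[0][j] = float(j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if ref[i - 1] == hyp[j - 1]:
                sub = 0.0
            else:
                sub = 1.0 - cosine(emb[ref[i - 1]], emb[hyp[j - 1]])
            d[i][j] = min(d[i - 1][j] + 1.0,      # deletion
                          d[i][j - 1] + 1.0,      # insertion
                          d[i - 1][j - 1] + sub)  # (soft) substitution
    return d[n][m] / n

# Toy embeddings: "cats" is close to "cat", "dog" is not.
emb = {"cat": [1.0, 0.0], "cats": [0.9, 0.1], "dog": [0.0, 1.0]}
```

Under this metric, substituting "cats" for "cat" costs almost nothing, while substituting "dog" still costs a full error, which is the behaviour the paper argues tracks translation impact.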
MULTIVARIATE ANALYSIS FOR UNDERSTANDING COGNITIVE SPEECH PROCESSING
Cortical encoding and decoding models of speech production
To speak is to dynamically orchestrate the movements of the articulators (jaw, tongue, lips, and larynx), which in turn generate speech sounds. It is an amazing mental and motor feat that is controlled by the brain and is fundamental to communication. Technology that could translate brain signals into speech would be transformative for people who are unable to communicate as a result of neurological impairments. This work first investigates how the articulator movements that underlie natural speech production are represented in the brain. Building upon this, it also presents a neural decoder that can synthesize audible speech from brain signals. The supporting data were direct cortical recordings of the human sensorimotor cortex while participants spoke natural sentences. Neural activity at individual electrodes encoded a diversity of articulatory kinematic trajectories (AKTs), each revealing coordinated articulator movements toward specific vocal tract shapes. The neural decoder was designed to leverage the kinematic trajectories encoded in the sensorimotor cortex, which enhanced performance even with limited data. In closed-vocabulary tests, listeners could readily identify and transcribe speech synthesized from cortical activity. These findings advance the clinical viability of using speech neuroprosthetic technology to restore spoken communication.
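The two-stage design (neural activity to articulatory kinematics to acoustics) can be sketched with placeholder linear maps (all dimensions and weights below are hypothetical stand-ins for the trained decoder, whose architecture is not specified here):

```python
import numpy as np

rng = np.random.default_rng(0)
n_elec, n_kin, n_acoustic, T = 64, 12, 32, 100   # hypothetical dimensions

# Placeholder linear stages standing in for the trained networks:
W_kin = rng.standard_normal((n_kin, n_elec)) * 0.1      # neural -> kinematics
W_ac = rng.standard_normal((n_acoustic, n_kin)) * 0.1   # kinematics -> acoustics

neural = rng.standard_normal((n_elec, T))   # cortical features over time
kinematics = W_kin @ neural                 # intermediate articulatory (AKT) space
acoustics = W_ac @ kinematics               # spectral features for synthesis
```

The intermediate kinematic stage is the design choice highlighted in the abstract: routing the decoder through an articulatory representation is what reportedly helped performance even with limited data.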
- โฆ