Transformer-based encoder-encoder architecture for Spoken Term Detection
The paper presents a method for spoken term detection based on the
Transformer architecture. We propose the encoder-encoder architecture employing
two BERT-like encoders with additional modifications, including convolutional
and upsampling layers, attention masking, and shared parameters. The encoders
project a recognized hypothesis and a searched term into a shared embedding
space, where the score of the putative hit is computed using the calibrated dot
product. In the experiments, we used the Wav2Vec 2.0 speech recognizer, and the
proposed system outperformed a baseline method based on deep LSTMs on the
English and Czech STD datasets based on the USC Shoah Foundation Visual History
Archive (MALACH). Comment: Submitted to ICASSP 202
Air Traffic Control Communication
The corpus contains recordings of communication between air traffic controllers and pilots. The speech is manually transcribed and labeled with speaker information (pilot or controller, not the full identity of the person). The corpus is currently small (20 hours), but we plan to search for additional data next year. The audio data format is 8 kHz, 16-bit PCM, mono.
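A minimal sketch of how a downloaded recording could be checked against the stated audio format (8 kHz, 16-bit PCM, mono), assuming the files are standard WAV containers. The function names and the in-memory test file are illustrative assumptions, not part of the corpus distribution.

```python
import io
import wave


def describe_wav(source):
    """Return (sample_rate_hz, bits_per_sample, channels) of a PCM WAV file.

    `source` may be a path or a binary file-like object.
    """
    with wave.open(source, "rb") as wav:
        return wav.getframerate(), wav.getsampwidth() * 8, wav.getnchannels()


def matches_atcc_format(source):
    """True if the file is 8 kHz, 16-bit, mono, as stated in the description."""
    return describe_wav(source) == (8000, 16, 1)


def make_silent_wav(rate, seconds=1):
    """Build an in-memory 16-bit mono WAV of silence (hypothetical demo data)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16 bits
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * rate * seconds)
    buf.seek(0)
    return buf


print(matches_atcc_format(make_silent_wav(8000)))   # True
print(matches_atcc_format(make_silent_wav(48000)))  # False
```

The same check, with a different expected tuple, would apply to the other corpora in this listing that state their sampling rate and bit depth.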
ATCC: Pronunciation lexicon and n-gram counts for ASR module
The corpus contains a pronunciation lexicon and n-gram counts (unigrams, bigrams and trigrams) that can be used to construct a language model for the air traffic control communication domain. It can be used together with the Air Traffic Control Communication corpus (http://hdl.handle.net/11858/00-097C-0000-0001-CCA1-0).
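For readers unfamiliar with n-gram counts, the following sketch shows how unigram, bigram and trigram counts can be collected from tokenized transcripts. It is an illustration only: the example sentences are invented, whitespace tokenization is assumed, no sentence-boundary markers are added, and nothing here reflects the actual file format of the distributed counts.

```python
from collections import Counter


def ngram_counts(sentences, max_order=3):
    """Count all n-grams of order 1..max_order in whitespace-tokenized sentences."""
    counts = {n: Counter() for n in range(1, max_order + 1)}
    for sentence in sentences:
        tokens = sentence.split()
        for n in range(1, max_order + 1):
            # Slide a window of length n over the token sequence.
            for i in range(len(tokens) - n + 1):
                counts[n][tuple(tokens[i:i + n])] += 1
    return counts


# Hypothetical transcripts, invented for illustration.
transcripts = [
    "cleared to land runway two four",
    "cleared for takeoff runway two four",
]
counts = ngram_counts(transcripts)
print(counts[1][("cleared",)])                # 2
print(counts[2][("runway", "two")])           # 2
print(counts[3][("runway", "two", "four")])   # 2
```

Counts of this kind are the usual input to a smoothed n-gram language model estimator.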
OVM – Otázky Václava Moravce
The corpus consists of transcribed recordings from the Czech political discussion broadcast “Otázky Václava Moravce”. It contains 35 hours of speech and corresponding word-by-word transcriptions, including the transcription of some non-speech events. Speakers’ names are also assigned to corresponding segments. The resulting corpus is suitable for both acoustic model training for ASR purposes and training of speaker identification and/or verification systems. The archive contains 16 sound files (WAV PCM, 16-bit, 48 kHz, mono) and transcriptions in XML-based standard Transcriber format (http://trans.sourceforge.net).
Real-time large-vocabulary continuous spontaneous speech recognition for dialog systems (Rozpoznání spojité spontánní řeči s velkým slovníkem a v reálném čase pro dialogové systémy)
This paper describes a method for modifying a baseline speech recognition system to make it suitable for use in a spoken dialog system with mixed initiative and natural user input. We present three approaches for extending the recognition vocabulary to ensure that the spoken dialog system is able to recognize all entities in the given domain. A method for normalizing colloquial text is also proposed. Experiments performed on a spontaneous speech corpus suggest that the proposed method is very important for languages in which the formal written language and common colloquial speech differ substantially. The overall word error rate was reduced by 16.7%.
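As background for the reported figure, word error rate is conventionally computed as the word-level edit distance (substitutions, deletions and insertions) between the reference and the recognized hypothesis, divided by the number of reference words. The sketch below illustrates that standard definition under the assumption of whitespace tokenization; it is not the evaluation code used in the paper, and the example strings are invented.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference prefix seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing in hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substitution and one deletion against four reference words.
print(word_error_rate("a b c d", "a x c"))  # 0.5
```

The abstract does not state whether the 16.7% reduction is relative or absolute, so the sketch makes no attempt to reproduce that number.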
Czech Parliament Meetings
The corpus consists of recordings from the Chamber of Deputies of the Parliament of the Czech Republic. It currently consists of 88 hours of speech data, which corresponds roughly to 0.5 million tokens. The annotation process is semi-automatic, as we are able to perform the speech recognition on the data with high accuracy (over 90%) and consequently align the resulting automatic transcripts with the speech. The annotator’s task is then to check the transcripts, correct errors, add proper punctuation and label speech sections with information about the speaker. The resulting corpus is therefore suitable for both acoustic model training for ASR purposes and training of speaker identification and/or verification systems. The archive contains 18 sound files (WAV PCM, 16-bit, 44.1 kHz, mono) and corresponding transcriptions in XML-based standard Transcriber format (http://trans.sourceforge.net)
The airing date of each recording is encoded in the filename in the form SOUND_YYMMDD_*. Note that the recordings are usually aired in the early morning on the day following the actual Parliament session. If a recording is too long to fit in the broadcasting schedule, it is divided into several parts that are aired on consecutive days.
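The filename convention can be decoded with a few lines of code. The sketch below assumes the six digits after SOUND_ are a two-digit year, month and day, and that two-digit years follow Python's `%y` rule (00 to 68 map to 2000 to 2068). The example filename, including its suffix and extension, is hypothetical. Because recordings are only usually aired the day after the session, and long sessions are split across consecutive days, the derived session date is a guess rather than a certainty.

```python
import re
from datetime import date, datetime, timedelta

# SOUND_ followed by six digits (YYMMDD) and an underscore; the rest is not interpreted.
_FILENAME_PATTERN = re.compile(r"^SOUND_(\d{6})_")


def airing_date(filename):
    """Extract the airing date encoded in a SOUND_YYMMDD_* filename."""
    match = _FILENAME_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"unexpected filename: {filename!r}")
    return datetime.strptime(match.group(1), "%y%m%d").date()


def likely_session_date(filename):
    """Assume the usual case: the session took place the day before airing."""
    return airing_date(filename) - timedelta(days=1)


name = "SOUND_120305_001.wav"  # hypothetical filename
print(airing_date(name))          # 2012-03-05
print(likely_session_date(name))  # 2012-03-04
```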