    This paper surveys basic methods for developing acoustic and language models based on artificial neural networks for automatic speech recognition systems. It presents the hybrid and tandem approaches to combining hidden Markov models with artificial neural networks for acoustic modelling, and describes the construction of language models using feedforward and recurrent neural networks. The surveyed studies show that applying artificial neural networks at both the acoustic and the language modelling stage reduces the word error rate.

    The digits-in-noise (DIN) test is a practical speech audiometry tool for hearing screening of populations of varying ages and hearing status. The test is usually conducted either by a human supervisor (e.g., a clinician), who scores the responses spoken by the listener, or online, where software scores the responses entered by the listener. The test presents 24 digit triplets in an adaptive staircase procedure, resulting in a speech reception threshold (SRT). We propose an alternative automated DIN test setup that evaluates spoken responses without a human supervisor, using the open-source automatic speech recognition toolkit Kaldi-NL. Thirty self-reported normal-hearing Dutch adults (19-64 years) each completed one DIN+Kaldi-NL test. Their spoken responses were recorded and used to evaluate the transcripts decoded by Kaldi-NL. Study 1 evaluated Kaldi-NL performance through its word error rate (WER), defined here as the summed decoding errors, counting only digits found in the transcript, as a percentage of the total number of digits present in the spoken responses. The average WER across participants was 5.0% (range 0-48%, SD = 8.8%), with decoding errors in three triplets per participant on average. Study 2 used bootstrapping simulations to analyse the effect of triplets with Kaldi-NL decoding errors on the DIN test output (SRT). Previous research indicated 0.70 dB as the typical within-subject SRT variability for normal-hearing adults. Study 2 showed that up to four triplets with decoding errors produce SRT variations within this range, suggesting that the proposed setup could be feasible for clinical applications.
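
    The digit-level WER used in Study 1 can be illustrated with a minimal sketch. It is not the authors' scoring code. It assumes that transcripts are already tokenised and normalised so that digits appear as numeral strings, and that errors are counted with a standard Levenshtein alignment (substitutions, insertions, deletions). The function names `digits_only`, `edit_distance` and `digit_wer` are hypothetical.

```python
def digits_only(tokens):
    """Keep only tokens that are numerals; other words in the transcript are ignored."""
    return [t for t in tokens if t.isdigit()]


def edit_distance(ref, hyp):
    """Levenshtein distance between two token lists (assumed error count)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def digit_wer(references, hypotheses):
    """Summed digit errors as a percentage of all spoken (reference) digits."""
    errors = 0
    total = 0
    for ref, hyp in zip(references, hypotheses):
        ref_digits = digits_only(ref)
        errors += edit_distance(ref_digits, digits_only(hyp))
        total += len(ref_digits)
    if total == 0:
        raise ValueError("no reference digits to score")
    return 100.0 * errors / total


# Illustrative data only: two spoken triplets and their decoded transcripts.
spoken = [["1", "2", "3"], ["4", "5", "6"]]
decoded = [["1", "2", "3"], ["4", "uh", "5", "9"]]
wer = digit_wer(spoken, decoded)
```

    In this made-up example the filler word is ignored and one of six digits is substituted, so the score is one error over six reference digits.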

    Automatic speech recognition is a technology that allows computers to convert spoken words into text. It can be applied in many modern systems that involve communication between humans and machines. This thesis describes in detail one of the two main components of a speech recognition system, the language model, which specifies the vocabulary of the system as well as the rules by which individual words can be linked into sentences. The Serbian language belongs to the group of highly inflective and morphologically rich languages, which means that it uses a large number of different word endings to express the desired grammatical, syntactic, or semantic function of a given word. This often leads to a significant number of speech recognition errors in which, because of a good acoustic match, the recognizer correctly guesses the base form of the word but gets its ending wrong. That ending may indicate a different morphological category, for example case, grammatical gender, or grammatical number. The thesis presents a new language modeling tool which, along with the word identity, can also use additional lexical and morphological features of the word, and uses it to test the hypothesis that this additional information can help overcome a significant number of recognition errors that result from the high inflectivity of the Serbian language.