Pauses and the temporal structure of speech
Natural-sounding speech synthesis requires close control over the temporal structure of the speech flow. This includes a full predictive scheme for the durational structure, in particular the prolongation of the final syllables of lexemes, as well as for the pausal structure of the utterance. In this chapter, the temporal structure of speech is described and the numerous factors that modify it are summarized. In the second part, predictive schemes for the temporal structure of speech ("performance structures") are introduced, and their potential for characterising the overall prosodic structure of speech is demonstrated.
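A predictive duration scheme of the kind described above can be illustrated with a toy rule-based model. The factors and durations below are invented placeholders, not values from the chapter: base syllable durations are scaled by a final-syllable lengthening factor, and a pause is inserted at prosodic boundaries.

```python
# Minimal rule-based duration model: each syllable gets a base duration,
# lengthened when it is lexeme-final, with a pause inserted at phrase
# boundaries. All numeric factors are illustrative assumptions.

BASE_MS = 180            # assumed base syllable duration (ms)
FINAL_LENGTHENING = 1.4  # assumed prolongation factor for lexeme-final syllables
BOUNDARY_PAUSE_MS = 250  # assumed pause duration at a prosodic boundary (ms)

def syllable_durations(syllables, final_indices, boundary_after):
    """Return a timeline of (label, duration_ms) pairs, including pauses."""
    timeline = []
    for i, syl in enumerate(syllables):
        dur = BASE_MS * (FINAL_LENGTHENING if i in final_indices else 1.0)
        timeline.append((syl, dur))
        if i in boundary_after:
            timeline.append(("<pause>", BOUNDARY_PAUSE_MS))
    return timeline

timeline = syllable_durations(
    ["nat", "ural", "speech"],
    final_indices={1, 2},   # "ural" and "speech" are lexeme-final
    boundary_after={2},     # phrase boundary after "speech"
)
```

A full scheme would of course condition the factors on many of the modifying variables the chapter surveys; this sketch only shows the shape of such a rule system.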
Sperry Univac speech communications technology
Technology and systems for effective verbal communication with computers were developed. A continuous speech recognition system for verbal input, a word-spotting system to locate keywords in conversational speech, prosodic tools to aid speech analysis, and a prerecorded voice-response system for speech output are described.
ORCA-SPOT: An Automatic Killer Whale Sound Detection Toolkit Using Deep Learning
Large bioacoustic archives of wild animals are an important source for identifying reappearing communication patterns, which can then be related to recurring behavioral patterns to advance the current understanding of intra-specific communication in non-human animals. A main challenge remains that most large-scale bioacoustic archives contain only a small percentage of animal vocalizations and a large amount of environmental noise, which makes it extremely difficult to manually retrieve sufficient vocalizations for further analysis – particularly important for species with advanced social systems and complex vocalizations. In this study, deep neural networks were trained on 11,509 killer whale (Orcinus orca) signals and 34,848 noise segments. The resulting toolkit, ORCA-SPOT, was tested on a large-scale bioacoustic repository – the Orchive – comprising roughly 19,000 hours of killer whale underwater recordings. An automated segmentation of the entire Orchive (about 2.2 years of audio) took approximately 8 days. It achieved a time-based precision, or positive predictive value (PPV), of 93.2% and an area under the curve (AUC) of 0.9523. This approach enables an automated annotation procedure for large bioacoustic databases to extract killer whale sounds, which are essential for the subsequent identification of significant communication patterns. The code will be publicly available in October 2019 to support the application of deep learning to bioacoustic research. ORCA-SPOT can be adapted to other animal species.
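One plausible reading of the time-based precision (PPV) reported above is the fraction of detected time that overlaps annotated vocalizations. The sketch below implements that formulation with invented interval values; it is an illustration, not ORCA-SPOT's actual evaluation code.

```python
# Time-based precision (PPV) over detected segments, read as:
# PPV = (detected time overlapping true vocalizations) / (total detected time).
# Intervals are (start, end) pairs in seconds; all values here are made up.

def overlap(a, b):
    """Length of the overlap between two (start, end) intervals, in seconds."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

def time_based_ppv(detections, ground_truth):
    """Assumes the ground-truth intervals do not overlap one another."""
    detected_time = sum(end - start for start, end in detections)
    true_positive_time = sum(
        overlap(d, g) for d in detections for g in ground_truth
    )
    return true_positive_time / detected_time if detected_time else 0.0

# Two detections totaling 10 s, of which 9 s fall inside annotated calls:
ppv = time_based_ppv(
    detections=[(0.0, 6.0), (10.0, 14.0)],
    ground_truth=[(1.0, 6.0), (10.0, 14.0)],
)
```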
Complex sequencing rules of birdsong can be explained by simple hidden Markov processes
Complex sequencing rules observed in birdsongs provide an opportunity to investigate the neural mechanisms for generating complex sequential behaviors. To relate findings from birdsong studies to other sequential behaviors, it is crucial to characterize the statistical properties of the sequencing rules in birdsongs; however, these properties have not yet been fully addressed. In this study, we investigate the statistical properties of the complex song of the Bengalese finch (Lonchura striata var. domestica). Based on manually annotated syllable sequences, we first show that there are significant higher-order context dependencies in Bengalese finch songs, that is, which syllable appears next depends on more than one previous syllable. This property is shared with other complex sequential behaviors. We then analyze acoustic features of the song and show that the higher-order context dependencies can be explained by first-order hidden-state transition dynamics with redundant hidden states. This model corresponds to a hidden Markov model (HMM), a well-known statistical model widely applied to time-series modeling. Song annotation with these first-order hidden-state models agreed well with manual annotation: the score was comparable to that of a second-order HMM and surpassed that of the zeroth-order model (a Gaussian mixture model, GMM), which does not use context information. Our results imply that a hierarchical representation with hidden-state dynamics may underlie the neural implementation for generating complex sequences with higher-order dependencies.
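The key idea, that first-order hidden-state dynamics with redundant hidden states can produce higher-order dependencies at the syllable level, can be shown with a deterministic toy chain (not the paper's fitted HMM): two hidden states A1 and A2 both emit syllable "a" but lead to different successors, so the syllable that follows "a" depends on the syllable that preceded it.

```python
# Toy first-order hidden chain with redundant states. A1 and A2 both emit
# syllable "a", yet A1 is always followed by C (emitting "c") and A2 by B
# (emitting "b"). At the surface, "a" after "b" is followed by "c", while
# "a" after "c" is followed by "b": a second-order dependency in the
# syllable sequence produced by purely first-order hidden dynamics.

EMIT = {"A1": "a", "A2": "a", "B": "b", "C": "c"}
NEXT = {"B": "A1", "A1": "C", "C": "A2", "A2": "B"}  # deterministic for clarity

def generate(start_state, n_syllables):
    """Emit n syllables by walking the first-order hidden-state chain."""
    state, out = start_state, []
    for _ in range(n_syllables):
        out.append(EMIT[state])
        state = NEXT[state]
    return "".join(out)

song = generate("B", 8)
```

A real HMM would use stochastic transitions and acoustic emission densities, but the mechanism for higher-order surface statistics is the same: several hidden states sharing one syllable label.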
Polish learners' attitudes toward learning English pronunciation: a re-analysis
It is widely agreed that the acquisition of the sound system of a second language always presents a great challenge for L2 learners (e.g. Rojczyk, 2010). Numerous studies (e.g. Nowacka, 2010; Flege, 1991) show that L2 learners whose first language has a small number of sounds encounter difficulties in distinguishing L2 sound categories and tend to apply their L1 segments to new contexts. There is an abundance of studies examining L2 learners' successes and failures in the production of L1 and L2 sounds, especially vowels (e.g. Flege, 1992; Nowacka, 2010; Rojczyk, 2010). However, the situation becomes more complicated when we consider third-language production. While in the case of L2 segmental production the number of factors affecting L2 sounds is rather limited (either interference from the learners' L1 or some kind of L2 intralingual influence), in the case of L3 segmental production we may encounter L1→L3, L2→L3, L1+L2→L3 or L3 intralingual interference. This makes the separation of L3 sounds a much more complex process.
The aim of this paper is to examine whether speakers of L1 Polish, L2 English and L3 German are able to separate new, L3 vowel categories from their native and L2 categories. The research presented in this article is part of a larger project assessing the production of L3 segments; this time the focus is on German /y/. This vowel was chosen because it is regarded as especially difficult for Polish learners of German and is frequently substituted with other sounds. A group of English philology students (Polish-English-German translation and interpretation programme) was chosen to participate in this study. They were native speakers of Polish, advanced speakers of English and upper-intermediate users of German. They had taken both English and German pronunciation courses during their studies at the University of Silesia. The subjects were asked to produce words containing the analysed vowels, namely P /u/, P /i/, E /uː/, E /iː/, E /ɪ/ and G /y/. All examined vowels were embedded in a /bVt/ context. The target /bVt/ words were then embedded in carrier sentences, in a non-final position: I said /bVt/ this time in English, Ich sag' /bVt/ diesmal in German and Mówię /bVt/ teraz in Polish. The sentences were presented to the subjects on a computer screen, and the produced chunks were stored on a notebook computer as .wav files ready for inspection. The Praat 5.3.12 speech-analysis software package (Boersma, 2001) was used to measure and analyse the recordings. The obtained results suggest that L2 affects L3 segmental production to a significant extent. Learners find it difficult to separate all "new" and "old" vowel categories, especially if they are perceived as "similar"
to one another and when learners strive to sound "foreign".
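Separation of vowel categories of the kind examined above is commonly assessed by comparing category means in F1/F2 formant space. The sketch below uses invented, textbook-style formant values rather than the study's measurements, to show one such comparison.

```python
# Compare vowel categories by the Euclidean distance between their mean
# positions in F1/F2 space. The (F1, F2) values below are rough illustrative
# placeholders in Hz, not data from the study.
import math

def distance(v1, v2):
    """Euclidean distance between two (F1, F2) vowel means, in Hz."""
    return math.hypot(v1[0] - v2[0], v1[1] - v2[1])

# Hypothetical mean (F1, F2) values for three of the vowels in question:
vowels = {
    "Polish /u/": (300, 700),     # back rounded: low F2
    "English /u:/": (320, 1100),  # often somewhat fronted
    "German /y/": (280, 1600),    # front rounded: low F1, high F2
}

d_y_polish_u = distance(vowels["German /y/"], vowels["Polish /u/"])
d_y_english_u = distance(vowels["German /y/"], vowels["English /u:/"])
```

With values like these, German /y/ sits closer to a fronted English /uː/ than to Polish /u/, which is one way the "similar"-category confusions described above could surface in the acoustic data.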
Integrated speech and morphological processing in a connectionist continuous speech understanding for Korean
A new tightly coupled speech and natural language integration model is presented for a TDNN-based continuous, possibly large-vocabulary speech recognition system for Korean. Unlike the popular n-best techniques developed for integrating mainly HMM-based speech recognition and natural language processing at the {\em word level}, which are obviously inadequate for morphologically complex agglutinative languages, our model constructs a spoken language system based on a {\em morpheme-level} integration of speech and language. With this integration scheme, the spoken Korean processing engine (SKOPE) is designed and implemented using a TDNN-based diphone recognition module integrated with Viterbi-based lexical decoding and symbolic phonological/morphological co-analysis. Our experimental results show that speaker-dependent continuous {\em eojeol} (Korean word) recognition and integrated morphological analysis can be achieved with a success rate of over 80.6% directly from speech inputs for middle-level vocabularies.
Comment: LaTeX source with a4 style, 15 pages, to be published in the Computer Processing of Oriental Languages journal