Automatic syllabification using segmental conditional random fields
In this paper we present a statistical approach for the automatic syllabification of phonetic word transcriptions. A syllable bigram language model forms the core of the system. Given the large number of syllables in non-syllabic languages, sparsity is the main issue, especially since the available syllabified corpora tend to be small. Traditional back-off mechanisms give only a partial solution to the sparsity problem. In this work we use a set of features for back-off purposes: on the one hand probabilities such as consonant cluster probabilities, and on the other hand a set of rules based on generic syllabification principles such as legality, sonority and maximal onset. To combine these highly correlated features with the baseline bigram feature, we employ segmental conditional random fields (SCRFs) as the statistical framework. The resulting method is versatile and can be applied to any amount of data in any language.
The method was tested on various datasets in English and Dutch, with dictionary sizes varying between 1,000 and 60,000 words. For English we obtained a 97.96% word accuracy for supervised syllabification and a 91.22% word accuracy for unsupervised syllabification. When the top-2 generated syllabifications are included for a small fraction of the words, virtually perfect syllabification is obtained in supervised mode.
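The legality and maximal-onset principles mentioned in the abstract can be made concrete with a small rule-based sketch. This is an illustrative toy, not the paper's system: the `LEGAL_ONSETS` inventory is an invented subset for English, and phones are single characters for simplicity.

```python
# Rule-based syllabification sketch: maximal onset with a legality check.
# The legal-onset set below is a small illustrative subset, not a full inventory.
LEGAL_ONSETS = {"", "p", "t", "k", "b", "d", "g", "s", "m", "n", "l", "r",
                "pr", "tr", "kr", "br", "dr", "gr", "pl", "kl", "bl", "gl",
                "st", "sp", "sk", "str", "spr", "skr"}
VOWELS = set("aeiou")

def syllabify(phones):
    """Split a list of phone symbols into syllables.

    Each consonant run between two vowels is split so that the longest
    legal onset goes to the following syllable (maximal onset principle).
    """
    # Indices of vowels, i.e. the syllable nuclei.
    nuclei = [i for i, p in enumerate(phones) if p in VOWELS]
    if not nuclei:
        return [phones]  # no nucleus: treat the whole string as one unit
    syllables, start = [], 0
    for v, next_v in zip(nuclei, nuclei[1:]):
        cluster = phones[v + 1:next_v]  # consonants between two nuclei
        # Choose the split leaving the longest legal onset on the right.
        split = 0
        for k in range(len(cluster) + 1):
            if "".join(cluster[k:]) in LEGAL_ONSETS:
                split = k
                break
        boundary = v + 1 + split
        syllables.append(phones[start:boundary])
        start = boundary
    syllables.append(phones[start:])
    return syllables
```

For instance, `syllabify(list("astra"))` keeps the whole cluster as an onset ("a.stra") because "str" is legal, while `syllabify(list("alto"))` splits "al.to" because "lt" is not. The paper's SCRF combines such rule outputs with bigram and cluster probabilities rather than applying them deterministically.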
ShefCE: A Cantonese-English Bilingual Speech Corpus for Pronunciation Assessment
This paper introduces the development of ShefCE, a Cantonese-English bilingual speech corpus from L2 English speakers in Hong Kong. Bilingual parallel recording materials were chosen from TED online lectures. Script selection was carried out according to bilingual consistency (evaluated using a machine translation system) and the distributional balance of phonemes. 31 undergraduate and postgraduate students in Hong Kong aged 20-30 were recruited, yielding a 25-hour speech corpus (12 hours in Cantonese and 13 hours in English). Baseline phoneme/syllable recognition systems were trained on background data with and without the ShefCE training data. The final syllable error rate (SER) for Cantonese is 17.3% and the final phoneme error rate (PER) for English is 34.5%. The automatic speech recognition performance on English showed a significant mismatch when applying L1 models to L2 data, suggesting the need for explicit accent adaptation. ShefCE and the corresponding baseline models will be made openly available for academic research.
Exploiting Contextual Information for Prosodic Event Detection Using Auto-Context
Prosody and prosodic boundaries carry significant linguistic and paralinguistic information and are important aspects of speech. In the field of prosodic event detection, many local acoustic features have been investigated; however, contextual information has not yet been thoroughly exploited. The most difficult aspect lies in learning long-distance contextual dependencies effectively and efficiently. To address this problem, we introduce the use of an algorithm called auto-context. In this algorithm, a classifier is first trained on a set of local acoustic features, after which the generated probabilities are used along with the local features as contextual information to train new classifiers. By iteratively using the updated probabilities as contextual information, the algorithm can accurately model contextual dependencies and improve classification ability. The advantages of this method include its flexible structure and its ability to capture contextual relationships. Using the auto-context algorithm with a support vector machine as the base classifier, we improve detection accuracy by about 3% and F-score by more than 7% on both two-way and four-way pitch accent detection in combination with the acoustic context. For boundary detection, the accuracy improvement is about 1% and the F-score improvement reaches 12%. The new algorithm outperforms conditional random fields, especially on boundary detection in terms of F-score. It also outperforms an n-gram language model on the task of pitch accent detection.
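The iterative structure of auto-context can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: a tiny hand-rolled logistic regression stands in for the support vector machine, and the context window size and iteration count are arbitrary choices.

```python
import numpy as np

def train_logreg(X, y, steps=500, lr=0.1):
    """Tiny gradient-descent logistic regression (stand-in for the paper's SVM)."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        g = p - y                       # gradient of the log loss
        w -= lr * (X.T @ g) / len(y)
        b -= lr * g.mean()
    return w, b

def predict_proba(X, w, b):
    return 1.0 / (1.0 + np.exp(-(X @ w + b)))

def context_features(probs, window=2):
    """Stack each frame's probability with those of its neighbours."""
    padded = np.pad(probs, window, mode="edge")
    return np.stack([padded[i:i + len(probs)]
                     for i in range(2 * window + 1)], axis=1)

def auto_context(X, y, iterations=3, window=2):
    """Auto-context loop: train on local features, then retrain on local
    features plus the previous iteration's probabilities as context."""
    w, b = train_logreg(X, y)
    probs = predict_proba(X, w, b)
    models = [(w, b)]
    for _ in range(iterations):
        X_aug = np.hstack([X, context_features(probs, window)])
        w, b = train_logreg(X_aug, y)
        probs = predict_proba(X_aug, w, b)
        models.append((w, b))
    return models, probs
```

At test time the same cascade is applied: each stage's probabilities feed the next stage's context features, which is how long-distance dependencies accumulate without an explicit sequence model.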
Word learning in the first year of life
In the first part of this thesis, we ask whether 4-month-old infants can represent objects
and movements after a short exposure in such a way that they recognize either a repeated
object or a repeated movement when they are presented simultaneously with a new object
or a new movement. If they do, we ask whether the way they observe the visual input is
modified when auditory input is presented. We investigate whether infants react to the
familiarization labels and to novel labels in the same manner. If the labels as well as the
referents are matched for saliency, any difference should be due to processes that are not
limited to sensorial perception. We hypothesize that infants will, if they map words to the
objects or movements, change their looking behavior whenever they hear a familiar label,
a novel label, or no label at all.
In the second part of this thesis, we assess the problem of word learning from a different
perspective. If infants reason about possible label-referent pairs and are able to make
inferences about novel pairs, are the same processes involved in all intermodal learning?
We compared the task of learning to associate auditory regularities to visual stimuli
(reinforcers), and the word-learning task. We hypothesized that even if infants succeed in
learning more than one label during a single event, learning the intermodal connection
between auditory and visual regularities might present a more demanding task for them.
The third part of this thesis addresses the role of associative learning in word learning. In
recent decades, it has repeatedly been suggested that co-occurrence probabilities can play an
important role in word segmentation. However, the vast majority of studies test infants
with artificial streams that do not resemble natural input: most studies use words of
equal length and with unambiguous syllable sequences within words, where the only point
of variability is at the word boundaries (Aslin et al., 1998; Saffran, Johnson, Aslin, & Newport, 1999; Saffran et al., 1996; Thiessen et al., 2005; Thiessen & Saffran, 2003).
Even if the input is modified to resemble the natural input more faithfully, the words with
which infants are tested are always unambiguous – within words, each syllable predicts
its adjacent syllable with the probability of 1.0 (Pelucchi, Hay, & Saffran, 2009; Thiessen
et al., 2005). We therefore tested 6-month-old infants with such statistically ambiguous
words. Before doing that, we also verified on a large sample of languages whether
statistical information in the natural input, where the majority of the words are
statistically ambiguous, is indeed useful for segmenting words. Our motivation was partly
due to the fact that studies that modeled the segmentation process with a natural language
input often yielded ambivalent results about the usefulness of such computation
(Batchelder, 2002; Gambell & Yang, 2006; Swingley, 2005).
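The co-occurrence computation at issue in these studies is usually formalized as the forward transitional probability, TP(ab) = freq(ab) / freq(a), with word boundaries posited at local TP minima. A minimal sketch of that computation follows; the two-word artificial stream in the usage example is invented in the style of the cited studies, not a stimulus from any of them.

```python
from collections import Counter

def transitional_probs(syllables):
    """Forward transitional probability TP(a->b) = freq(ab) / freq(a)."""
    unigrams = Counter(syllables)
    bigrams = Counter(zip(syllables, syllables[1:]))
    return {pair: n / unigrams[pair[0]] for pair, n in bigrams.items()}

def segment(syllables, tps):
    """Posit a word boundary wherever TP dips to a local minimum."""
    vals = [tps[(a, b)] for a, b in zip(syllables, syllables[1:])]
    words, start = [], 0
    for i in range(1, len(vals) - 1):
        if vals[i] < vals[i - 1] and vals[i] < vals[i + 1]:
            words.append(syllables[start:i + 1])  # boundary after syllable i
            start = i + 1
    words.append(syllables[start:])
    return words
```

On a stream built by randomly concatenating two invented trisyllabic "words" (e.g. tu-pi-ro and go-la-bu), within-word TPs are 1.0 and across-word TPs hover near 0.5, so every word boundary is a local minimum. The statistically ambiguous words discussed in this part are precisely those where the within-word TPs drop below 1.0, blurring this contrast.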
We conclude this introduction with a small remark about the term word. It will be used
throughout this thesis without questioning its descriptive value: the common-sense
meaning of the term word is unambiguous enough, since all people know what we are
referring to when we say or think of the term word. However, the term word is not
unambiguous at all (Di Sciullo & Williams, 1987). To mention only some of the classical
examples: (1) Do jump and jumped, or go and went, count as one word or as two? This
example might seem trivial, especially in languages with weak overt morphology
such as English, but in some languages each basic form of a word has dozens of inflected
variants. (2) A similar question arises with all the words that are morphological
derivations of other words, such as evict and eviction, examine and reexamine, unhappy
and happily, and so on. (3) And finally, each language contains many phrases and idioms:
Do air conditioner and give up count as one word or two? Statistical word
segmentation studies in general neglect the issue of the definition of words, assuming that
phrases and idioms have strong internal statistics and will therefore be selected as one
word (Cutler, 2012). But because compounds or phrases are usually composed of smaller
meaningful chunks, it is unclear how infants would extract these smaller units of speech
if they were using predominantly statistical information. We will address the problem of
over-segmentation briefly in the third part of the thesis.