
    Automatic syllabification using segmental conditional random fields

    In this paper we present a statistical approach to the automatic syllabification of phonetic word transcriptions. A syllable bigram language model forms the core of the system. Given the large number of syllables in non-syllabic languages, sparsity is the main issue, especially since the available syllabified corpora tend to be small. Traditional back-off mechanisms offer only a partial solution to the sparsity problem. In this work we use a set of features for back-off purposes: on the one hand, probabilities such as consonant cluster probabilities; on the other, a set of rules based on generic syllabification principles such as legality, sonority and maximal onset. To combine these highly correlated features with the baseline bigram feature, we employ segmental conditional random fields (SCRFs) as the statistical framework. The resulting method is very versatile and can be used for any amount of data in any language. The method was tested on various datasets in English and Dutch, with dictionary sizes varying between 1,000 and 60,000 words. We obtained a word accuracy of 97.96% for supervised syllabification and 91.22% for unsupervised syllabification in English. When the top two generated syllabifications are included for a small fraction of the words, virtually perfect syllabification is obtained in supervised mode.
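
    The SCRF system itself is not reproduced here, but two of the generic back-off principles the abstract names, onset legality and maximal onset, are easy to illustrate. Below is a minimal Python sketch of maximal-onset syllabification; the phone inventory and the legal-onset set are toy placeholders, not the paper's.

        # Illustrative sketch only: VOWELS and LEGAL_ONSETS are toy sets,
        # not the inventories used in the paper.
        VOWELS = {"AA", "AE", "AH", "EH", "IY", "UW"}
        LEGAL_ONSETS = {(), ("S",), ("T",), ("D",), ("R",), ("S", "T"),
                        ("T", "R"), ("S", "T", "R")}

        def syllabify(phones):
            """Maximal-onset principle: every consonant run between two vowels
            is assigned to the following syllable's onset, as long as that
            onset remains legal; the remainder stays in the coda."""
            nuclei = [i for i, p in enumerate(phones) if p in VOWELS]
            syllables, start = [], 0
            for this_v, next_v in zip(nuclei, nuclei[1:]):
                cluster = phones[this_v + 1:next_v]    # consonants between nuclei
                split = len(cluster)                   # fallback: all to the coda
                for k in range(len(cluster) + 1):      # smallest k = largest onset
                    if tuple(cluster[k:]) in LEGAL_ONSETS:
                        split = k
                        break
                boundary = this_v + 1 + split
                syllables.append(phones[start:boundary])
                start = boundary
            syllables.append(phones[start:])
            return syllables

        print(syllabify(["D", "IY", "S", "T", "R", "IY"]))
        # [['D', 'IY'], ['S', 'T', 'R', 'IY']]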

    ShefCE: A Cantonese-English Bilingual Speech Corpus for Pronunciation Assessment

    This paper introduces the development of ShefCE, a Cantonese-English bilingual speech corpus from L2 English speakers in Hong Kong. Bilingual parallel recording materials were chosen from TED online lectures. Script selection was carried out according to bilingual consistency (evaluated using a machine translation system) and the distributional balance of phonemes. Thirty-one undergraduate and postgraduate students in Hong Kong aged 20-30 were recruited, yielding a 25-hour speech corpus (12 hours in Cantonese and 13 hours in English). Baseline phoneme/syllable recognition systems were trained on background data with and without the ShefCE training data. The final syllable error rate (SER) for Cantonese is 17.3% and the final phoneme error rate (PER) for English is 34.5%. The automatic speech recognition performance on English showed a significant mismatch when applying L1 models to L2 data, suggesting the need for explicit accent adaptation. ShefCE and the corresponding baseline models will be made openly available for academic research.
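
    For reference, error rates like the SER and PER quoted above are conventionally computed as the token-level Levenshtein distance between hypothesis and reference sequences, normalized by total reference length. A minimal Python sketch on toy data (not the ShefCE pipeline):

        def edit_distance(ref, hyp):
            """Token-level Levenshtein distance using a rolling 1-D table."""
            d = list(range(len(hyp) + 1))
            for i, r in enumerate(ref, 1):
                prev, d[0] = d[0], i
                for j, h in enumerate(hyp, 1):
                    prev, d[j] = d[j], min(d[j] + 1,          # deletion
                                           d[j - 1] + 1,      # insertion
                                           prev + (r != h))   # substitution/match
            return d[-1]

        def error_rate(refs, hyps):
            """PER/SER: summed edit distance over summed reference tokens."""
            edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
            return edits / sum(len(r) for r in refs)

        refs = [["f", "ow", "n", "iy", "m"]]
        hyps = [["f", "ow", "m", "iy"]]              # one substitution, one deletion
        print(f"PER = {error_rate(refs, hyps):.2%}")  # PER = 40.00%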

    On the automatic segmentation of transcribed words


    Exploiting Contextual Information for Prosodic Event Detection Using Auto-Context

    Prosody and prosodic boundaries carry significant linguistic and paralinguistic information and are important aspects of speech. In the field of prosodic event detection, many local acoustic features have been investigated; however, contextual information has not yet been thoroughly exploited. The most difficult aspect lies in learning long-distance contextual dependencies effectively and efficiently. To address this problem, we introduce an algorithm called auto-context. In this algorithm, a classifier is first trained on a set of local acoustic features, after which the generated probabilities are used along with the local features as contextual information to train new classifiers. By iteratively using the updated probabilities as contextual information, the algorithm can accurately model contextual dependencies and improve classification ability. The advantages of this method include its flexible structure and its ability to capture contextual relationships. Using the auto-context algorithm with a support vector machine, we improve detection accuracy by about 3% and F-score by more than 7% on both two-way and four-way pitch accent detection in combination with the acoustic context. For boundary detection, the accuracy improvement is about 1% and the F-score improvement reaches 12%. The new algorithm outperforms conditional random fields, especially on boundary detection in terms of F-score. It also outperforms an n-gram language model on the task of pitch accent detection.
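
    The auto-context loop described above is simple to sketch. The outline below, in Python with scikit-learn, assumes framewise local feature vectors X and labels y; the window size, iteration count, and SVM settings are illustrative assumptions, not the paper's configuration.

        import numpy as np
        from sklearn.svm import SVC

        def context_features(probs, window=2):
            """Stack each frame's class probabilities with those of its
            neighbors (edge-padded) to expose contextual information."""
            padded = np.pad(probs, ((window, window), (0, 0)), mode="edge")
            return np.hstack([padded[i:i + len(probs)]
                              for i in range(2 * window + 1)])

        def auto_context_train(X, y, iterations=3, window=2):
            """Each round trains an SVM on the local features plus the
            previous round's probabilities, iteratively refining context."""
            models, feats = [], X
            for _ in range(iterations):
                clf = SVC(probability=True).fit(feats, y)
                models.append(clf)
                probs = clf.predict_proba(feats)
                feats = np.hstack([X, context_features(probs, window)])
            return models

    At test time the same chain would be applied in order, feeding each classifier's probabilities into the next.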

    Word learning in the first year of life

    In the first part of this thesis, we ask whether 4-month-old infants can represent objects and movements after a short exposure in such a way that they recognize either a repeated object or a repeated movement when it is presented simultaneously with a new object or a new movement. If they do, we ask whether the way they observe the visual input changes when auditory input is presented. We investigate whether infants react to the familiarization labels and to novel labels in the same manner. If the labels as well as the referents are matched for saliency, any difference should be due to processes that are not limited to sensory perception. We hypothesize that if infants map words to the objects or movements, they will change their looking behavior whenever they hear a familiar label, a novel label, or no label at all.

    In the second part of this thesis, we approach the problem of word learning from a different perspective. If infants reason about possible label-referent pairs and are able to make inferences about novel pairs, are the same processes involved in all intermodal learning? We compared the task of learning to associate auditory regularities with visual stimuli (reinforcers) to the word-learning task. We hypothesized that even if infants succeed in learning more than one label during a single event, learning the intermodal connection between auditory and visual regularities might present a more demanding task for them.

    The third part of this thesis addresses the role of associative learning in word learning. In recent decades, it has repeatedly been suggested that co-occurrence probabilities can play an important role in word segmentation. However, the vast majority of studies test infants with artificial streams that do not resemble natural input: most studies use words of equal length with unambiguous syllable sequences within words, where the only point of variability is at the word boundaries (Aslin et al., 1998; Saffran, Johnson, Aslin, & Newport, 1999; Saffran et al., 1996; Thiessen et al., 2005; Thiessen & Saffran, 2003). Even when the input is modified to resemble natural input more faithfully, the words with which infants are tested are always unambiguous – within words, each syllable predicts its adjacent syllable with a probability of 1.0 (Pelucchi, Hay, & Saffran, 2009; Thiessen et al., 2005). We therefore tested 6-month-old infants with statistically ambiguous words. Before doing so, we also verified on a large sample of languages whether statistical information in natural input, where the majority of words are statistically ambiguous, is indeed useful for segmenting words. Our motivation was partly due to the fact that studies modeling the segmentation process with natural language input have often yielded ambivalent results about the usefulness of such computations (Batchelder, 2002; Gambell & Yang, 2006; Swingley, 2005).

    We conclude this introduction with a small remark about the term word. It will be used throughout this thesis without questioning its descriptive value: the common-sense meaning of the term word is unambiguous enough, since all people know what we are referring to when we say or think of the term word. However, the term word is not unambiguous at all (Di Sciullo & Williams, 1987). To mention only some of the classical examples: (1) Do jump and jumped, or go and went, count as one word or as two? This example might seem trivial, especially in languages with weak overt morphology such as English, but in some languages each basic form of a word has tens of inflected variants. (2) A similar question arises with all the words that are morphological derivations of other words, such as evict and eviction, examine and reexamine, unhappy and happily, and so on. (3) Finally, each language contains many phrases and idioms: do air conditioner and give up count as one word, or two? Statistical word segmentation studies generally neglect the issue of the definition of words, assuming that phrases and idioms have strong internal statistics and will therefore be selected as single words (Cutler, 2012). But because compounds and phrases are usually composed of smaller meaningful chunks, it is unclear how infants would extract these smaller units of speech if they were relying predominantly on statistical information. We will address the problem of over-segmentation briefly in the third part of the thesis.
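
    The co-occurrence computation these segmentation studies build on is the forward transitional probability, TP(A→B) = count(AB) / count(A), with word boundaries posited at local TP minima. A minimal Python sketch on a toy syllable stream (the corpus and the minimum-based boundary heuristic are illustrative assumptions):

        from collections import Counter

        def transitional_probs(sylls):
            """Forward transitional probability for each adjacent pair:
            TP(A -> B) = count(AB) / count(A)."""
            pairs = Counter(zip(sylls, sylls[1:]))
            firsts = Counter(sylls[:-1])
            return {pair: n / firsts[pair[0]] for pair, n in pairs.items()}

        def segment(sylls, tps):
            """Posit a word boundary at every local TP minimum."""
            tp_seq = [tps[(a, b)] for a, b in zip(sylls, sylls[1:])]
            words, start = [], 0
            for i in range(1, len(tp_seq) - 1):
                if tp_seq[i] < tp_seq[i - 1] and tp_seq[i] < tp_seq[i + 1]:
                    words.append("".join(sylls[start:i + 1]))
                    start = i + 1
            words.append("".join(sylls[start:]))
            return words

        stream = "badatu pigola tibudo badatu tibudo pigola badatu".split()
        sylls = [w[i:i + 2] for w in stream for i in range(0, 6, 2)]
        print(segment(sylls, transitional_probs(sylls)))
        # ['badatu', 'pigola', 'tibudo', 'badatu', 'tibudo', 'pigola', 'badatu']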

    Syllable Based Speech Recognition
