Word learning in the first year of life
In the first part of this thesis, we ask whether 4-month-old infants can represent objects
and movements after a short exposure in such a way that they recognize either a repeated
object or a repeated movement when they are presented simultaneously with a new object
or a new movement. If they do, we ask whether the way they observe the visual input is
modified when auditory input is presented. We investigate whether infants react to the
familiarization labels and to novel labels in the same manner. If the labels as well as the
referents are matched for saliency, any difference should be due to processes that are not
limited to sensorial perception. We hypothesize that if infants map words to objects or
movements, their looking behavior will change depending on whether they hear a familiar
label, a novel label, or no label at all.
In the second part of this thesis, we assess the problem of word learning from a different
perspective. If infants reason about possible label-referent pairs and are able to make
inferences about novel pairs, are the same processes involved in all intermodal learning?
We compared the task of learning to associate auditory regularities with visual stimuli
(reinforcers) to the word-learning task. We hypothesized that even if infants succeed in
learning more than one label during one single event, learning the intermodal connection
between auditory and visual regularities might present a more demanding task for them.
The third part of this thesis addresses the role of associative learning in word learning. In
recent decades, it has repeatedly been suggested that co-occurrence probabilities can play an
important role in word segmentation. However, the vast majority of studies test infants
with artificial streams that do not resemble natural input: most studies use words of
equal length with unambiguous syllable sequences within words, where the only point
of variability is at the word boundaries (Aslin et al., 1998; Saffran, Johnson, Aslin, & Newport, 1999; Saffran et al., 1996; Thiessen et al., 2005; Thiessen & Saffran, 2003).
Even if the input is modified to resemble the natural input more faithfully, the words with
which infants are tested are always unambiguous – within words, each syllable predicts
its adjacent syllable with a probability of 1.0 (Pelucchi, Hay, & Saffran, 2009; Thiessen
et al., 2005). We therefore tested 6-month-old infants with such statistically ambiguous
words. Before doing so, we verified, on a large sample of languages, whether statistical
information in natural input, where the majority of words are statistically ambiguous, is
indeed useful for segmenting words. Our motivation stemmed partly from the fact that
studies modeling the segmentation process with natural language input have often yielded
ambivalent results about the usefulness of such computations (Batchelder, 2002; Gambell
& Yang, 2006; Swingley, 2005).
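To make the contrast concrete, here is a minimal sketch of segmentation by transitional probabilities of the kind used in the studies cited above. The toy syllable stream and the dip-based boundary rule are illustrative assumptions, not any particular study's procedure; note that in a stream like this one, every within-word transition has a probability of 1.0, which is exactly the unambiguity our experiments depart from.

```python
from collections import defaultdict

def transitional_probabilities(syllables):
    """Estimate forward transitional probabilities TP(B|A) = count(AB) / count(A)."""
    pair_counts = defaultdict(int)
    first_counts = defaultdict(int)
    for a, b in zip(syllables, syllables[1:]):
        pair_counts[(a, b)] += 1
        first_counts[a] += 1
    return {pair: n / first_counts[pair[0]] for pair, n in pair_counts.items()}

def segment_at_dips(syllables, tps):
    """Posit a word boundary wherever the TP dips below both neighbors."""
    tp_seq = [tps[(a, b)] for a, b in zip(syllables, syllables[1:])]
    words, start = [], 0
    for i in range(1, len(tp_seq) - 1):
        if tp_seq[i] < tp_seq[i - 1] and tp_seq[i] < tp_seq[i + 1]:
            words.append("".join(syllables[start:i + 1]))
            start = i + 1
    words.append("".join(syllables[start:]))
    return words

# Hypothetical Saffran-style stream: three trisyllabic "words" in varied order,
# so boundary TPs fall to 0.5 while within-word TPs stay at 1.0.
stream = "tokibu gikoba tokibu gopila gikoba gopila tokibu".replace(" ", "")
sylls = [stream[i:i + 2] for i in range(0, len(stream), 2)]  # all syllables are CV pairs
print(segment_at_dips(sylls, transitional_probabilities(sylls)))
# -> ['tokibu', 'gikoba', 'tokibu', 'gopila', 'gikoba', 'gopila', 'tokibu']
```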
We conclude this introduction with a small remark about the term word. It will be used
throughout this thesis without questioning its descriptive value: the common-sense
meaning of the term word is unambiguous enough, since everyone knows what we are
referring to when we say or think of the term. However, the term word is not
unambiguous at all (Di Sciullo & Williams, 1987). To mention only some of the classical
examples: (1) Do jump and jumped, or go and went, count as one word or as two? This
example might seem all too trivial, especially in languages with weak overt morphology,
such as English, but in some languages each basic word form has dozens of inflected
variants. (2) A similar question arises with all the words that are morphological
derivations of other words, such as evict and eviction, examine and reexamine, unhappy
and happily, and so on. (3) And finally, each language contains many phrases and idioms:
Do air conditioner and give up each count as one word or two? Statistical word
segmentation studies in general neglect the issue of the definition of words, assuming that
phrases and idioms have strong internal statistics and will therefore be selected as one
word (Cutler, 2012). But because compounds or phrases are usually composed of smaller
meaningful chunks, it is unclear how infants would extract these smaller units of speech
if they were using predominantly statistical information. We will briefly address the
problem of over-segmentation in the third part of the thesis.
Emerging Linguistic Functions in Early Infancy
This paper presents results from experimental studies on early language acquisition in infants and attempts to interpret the experimental results within the framework of the Ecological Theory of Language Acquisition (ETLA) recently proposed by Lacerda et al. (2004a). From this perspective, the infant's first steps in the acquisition of the ambient language are seen as a consequence of the infant's general capacity to represent sensory input and of the infant's interaction with other actors in its immediate ecological environment. On the basis of available experimental evidence, it will be argued that ETLA offers a productive alternative to traditional descriptive views of the language acquisition process by presenting an operative model of how early linguistic function may emerge through interaction.
CLASS - A Study of methods for coarse phonetic classification
The objective of this thesis was to examine computer techniques for classifying speech signals into four coarse phonetic classes: vowel-like, strong fricative, weak fricative, and silence. The study compared classification results from the K-means clustering algorithm using Euclidean distance measurements with classification using a multivariate maximum-likelihood distance measure. In addition to the comparison of statistical methods, this study compared classification using several tree-structured decision-making processes. The system was trained on ten speakers using 98 utterances and tested with both known and unknown speakers. Results showed very little difference between the Euclidean distance and maximum-likelihood measures; however, the introduction of the tree structure had a positive influence on the performance of both systems.
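As a rough illustration of the two statistical routes compared in this thesis, the sketch below contrasts nearest-centroid labeling (K-means with Euclidean distance) with per-class multivariate Gaussian maximum likelihood, here stood in for by scikit-learn's quadratic discriminant analysis. The two-dimensional synthetic features are placeholders for real acoustic measurements, not the thesis's actual feature set.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

# Hypothetical frame-level features (e.g., an energy-like and a zero-crossing-like
# measure); in practice these would be computed from real utterances.
rng = np.random.default_rng(0)
classes = ["vowel-like", "strong fricative", "weak fricative", "silence"]
class_means = np.array([[5.0, 1.0], [3.0, 6.0], [1.5, 3.0], [0.2, 0.5]])
X_train = np.vstack([rng.normal(m, 0.6, size=(200, 2)) for m in class_means])
y_train = np.repeat(np.arange(4), 200)

# Euclidean route: one K-means centroid per class; frames take the nearest centroid.
# (Cluster indices are arbitrary and would need mapping back to class labels.)
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X_train)

# Maximum-likelihood route: a full-covariance multivariate Gaussian per class.
qda = QuadraticDiscriminantAnalysis().fit(X_train, y_train)

frame = np.array([[4.8, 1.2]])  # one unseen feature frame
print("nearest-centroid cluster:", km.predict(frame)[0])
print("max-likelihood class:", classes[qda.predict(frame)[0]])
```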
A speech envelope landmark for syllable encoding in human superior temporal gyrus.
The most salient acoustic features in speech are the modulations in its intensity, captured by the amplitude envelope. Perceptually, the envelope is necessary for speech comprehension. Yet, the neural computations that represent the envelope and their linguistic implications are heavily debated. We used high-density intracranial recordings, while participants listened to speech, to determine how the envelope is represented in human speech cortical areas on the superior temporal gyrus (STG). We found that a well-defined zone in middle STG detects acoustic onset edges (local maxima in the envelope rate of change). Acoustic analyses demonstrated that the timing of acoustic onset edges cues syllabic nucleus onsets, while their slope cues syllabic stress. Synthesized amplitude-modulated tone stimuli showed that steeper slopes elicited greater responses, confirming cortical encoding of amplitude change, not absolute amplitude. Overall, STG encoding of the timing and magnitude of acoustic onset edges underlies the perception of speech temporal structure.
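The landmark computation described here can be approximated in a few lines: extract the amplitude envelope, low-pass it to the slow modulation range, differentiate, and pick local maxima of the rate of change. This is a simplified sketch, not the authors' analysis code; the 10 Hz cutoff and the toy amplitude-modulated stimulus are arbitrary choices.

```python
import numpy as np
from scipy.signal import hilbert, butter, filtfilt, find_peaks

def onset_edges(audio, sr, env_cutoff_hz=10.0):
    """Detect acoustic onset edges: local maxima in the rate of change
    of the low-pass-filtered amplitude envelope."""
    envelope = np.abs(hilbert(audio))            # analytic amplitude
    b, a = butter(2, env_cutoff_hz / (sr / 2))   # keep only slow modulations
    envelope = filtfilt(b, a, envelope)
    rate = np.gradient(envelope) * sr            # d(envelope)/dt
    peaks, _ = find_peaks(rate, height=0)        # rising-edge maxima only
    # Per the paper, edge timing cues syllable onsets and edge slope cues stress.
    return peaks / sr, rate[peaks]

# Toy amplitude-modulated tone: a 4 Hz "syllable rate" on a 150 Hz carrier.
sr = 16000
t = np.arange(0, 2.0, 1 / sr)
audio = (0.5 + 0.5 * np.sin(2 * np.pi * 4 * t)) * np.sin(2 * np.pi * 150 * t)
times, slopes = onset_edges(audio, sr)
print(np.round(times, 3))  # roughly one detected edge per modulation cycle
```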
Speech data analysis for semantic indexing of video of simulated medical crises.
The Simulation for Pediatric Assessment, Resuscitation, and Communication (SPARC) group within the Department of Pediatrics at the University of Louisville was established to enhance the care of children by using simulation-based educational methodologies to improve patient safety and strengthen clinician-patient interactions. After each simulation session, the physician must manually review and annotate the recordings and then debrief the trainees. The physician responsible for the simulation has recorded hundreds of videos and is seeking solutions that can automate the process. This dissertation introduces our system for efficient segmentation and semantic indexing of videos of medical simulations using machine learning methods. It provides the physician with automated tools to review important sections of the simulation by identifying who spoke, when, and with what emotion. Only audio information is extracted and analyzed because the quality of the image recording is low and the visual environment is static for most parts. Our proposed system includes four main components: preprocessing, speaker segmentation, speaker identification, and emotion recognition. The preprocessing consists of first extracting the audio component from the video recording and then extracting various low-level audio features to detect and remove silence segments. We investigate and compare two different approaches for this task: the first is threshold-based and the second is classification-based. The second main component of the proposed system consists of detecting speaker change points for the purpose of segmenting the audio stream. We propose two fusion methods for this task. The speaker identification and emotion recognition components of our system are designed to provide users with the capability to browse the video and retrieve shots that identify "who spoke, when, and with what emotion" for further analysis. For this component, we propose two feature representation methods that map audio segments of arbitrary length to a feature vector with fixed dimensions. The first is based on soft bag-of-words (BoW) feature representations. In particular, we define three types of BoW that are based on crisp, fuzzy, and possibilistic voting. The second feature representation is a generalization of the BoW and is based on the Fisher Vector (FV). The FV uses the Fisher kernel principle and combines the benefits of generative and discriminative approaches. The proposed feature representations are used within two learning frameworks. The first is supervised learning and assumes that a large collection of labeled training data is available. Within this framework, we use standard classifiers including K-nearest neighbor (K-NN), support vector machine (SVM), and naive Bayes. The second framework is based on semi-supervised learning, where only a limited amount of labeled training samples is available. We use an approach that is based on label propagation. Our proposed algorithms were evaluated using 15 medical simulation sessions. The results were analyzed and compared to those obtained using state-of-the-art algorithms. We show that our proposed speech segmentation fusion algorithms and feature mappings outperform existing methods. We also integrated all proposed algorithms and developed a GUI prototype system for subjective evaluation. This prototype processes medical simulation video and provides the user with a visual summary of the different speech segments.
It also allows the user to browse videos and retrieve scenes that provide answers to semantic queries such as: who spoke and when? Who interrupted whom? And what was the emotion of the speaker? The GUI prototype can also provide summary statistics for each simulation video. Examples include: for how long did each person speak? What is the longest uninterrupted speech segment? Is there an unusually large number of pauses within the speech segment of a given speaker?
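As an illustration of the fixed-length mappings discussed above, the sketch below computes crisp and fuzzy (soft) bag-of-words histograms over a learned codebook. The possibilistic variant and the Fisher Vector are omitted, and the Gaussian kernel width, codebook size, and 13-dimensional stand-in features are assumptions rather than the dissertation's exact formulation.

```python
import numpy as np
from sklearn.cluster import KMeans

def bow_histograms(frames, codebook, sigma=1.0):
    """Map a variable-length sequence of audio frames to fixed-length
    bag-of-words histograms using crisp and fuzzy (soft) voting."""
    # Distances from every frame to every codeword: shape (n_frames, n_words).
    d = np.linalg.norm(frames[:, None, :] - codebook[None, :, :], axis=2)

    # Crisp voting: each frame votes only for its nearest codeword.
    crisp = np.zeros(len(codebook))
    np.add.at(crisp, d.argmin(axis=1), 1.0)

    # Fuzzy voting: each frame spreads its vote over all codewords,
    # weighted by a Gaussian kernel on the distance.
    w = np.exp(-d ** 2 / (2 * sigma ** 2))
    fuzzy = (w / w.sum(axis=1, keepdims=True)).sum(axis=0)

    return crisp / crisp.sum(), fuzzy / fuzzy.sum()

rng = np.random.default_rng(0)
training_frames = rng.normal(size=(500, 13))   # stand-in for MFCC-like frames
codebook = KMeans(n_clusters=8, n_init=10, random_state=0).fit(
    training_frames).cluster_centers_
segment = rng.normal(size=(47, 13))            # one variable-length audio segment
crisp, fuzzy = bow_histograms(segment, codebook)
print(crisp.round(3))   # both histograms have fixed length 8,
print(fuzzy.round(3))   # regardless of the segment's number of frames
```

Either histogram can then be fed to a fixed-input classifier such as K-NN or an SVM, which is the role these representations play in the pipeline described above.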