116 research outputs found
The Effect of Narrow-Band Transmission on Recognition of Paralinguistic Information From Human Vocalizations
Virtually nothing is known about the effects of speech coding for narrow-band transmission of speech signals within certain frequency ranges, especially on the recognition of paralinguistic cues in speech. We therefore investigated the impact of narrow-band standard speech coders on the machine-based classification of affective vocalizations and clinical vocal recordings. In addition, we analyzed the effect of low-pass filtering speech at a set of different cut-off frequencies, either chosen as static values in the 0.5-5 kHz range or given dynamically by different upper limits from the first five speech formants (F1-F5). Speech coding and recognition were tested, first, with respect to short-term speaker states, using affective vocalizations from the Geneva Multimodal Emotion Portrayals. Second, with respect to long-term speaker traits, we tested vocal recordings from clinical populations with speech impairments, as found in the Child Pathological Speech Database. We employed a large acoustic feature space derived from the Interspeech Computational Paralinguistics Challenge. Besides analyzing the effect of the signal corruption itself, we analyzed the potential of matched and multicondition training as opposed to mismatched conditions. The results show, first, that multicondition and matched-condition training significantly increase performance compared with mismatched conditions. Second, classification accuracy degrades, but only at comparably severe levels of low-pass filtering. The degradation appears especially for multi-categorical rather than binary decisions, and can be handled reasonably well by the training strategies mentioned above.
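To make the static low-pass condition concrete, here is a minimal sketch in pure Python. The abstract does not say which filter design was used, so the Hamming-windowed sinc FIR filter, the 16 kHz sample rate, the 1 kHz cut-off and the function names `lowpass_fir`, `apply_fir` and `rms` are all assumptions chosen for illustration.

```python
import math

def lowpass_fir(cutoff_hz, sample_rate_hz, num_taps=101):
    """Hamming-windowed sinc low-pass taps (an assumed, illustrative design)."""
    fc = cutoff_hz / sample_rate_hz          # cut-off in cycles per sample
    mid = (num_taps - 1) / 2
    taps = []
    for n in range(num_taps):
        x = n - mid
        # Ideal low-pass impulse response, with the x == 0 limit handled explicitly.
        ideal = 2 * fc if x == 0 else math.sin(2 * math.pi * fc * x) / (math.pi * x)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    total = sum(taps)
    return [t / total for t in taps]         # normalise to unity gain at 0 Hz

def apply_fir(signal, taps):
    """Centred convolution with zero padding; output has the input's length."""
    half = len(taps) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for k, t in enumerate(taps):
            j = i + half - k
            if 0 <= j < len(signal):
                acc += t * signal[j]
        out.append(acc)
    return out

def rms(xs):
    """Root-mean-square level of a sequence of samples."""
    return math.sqrt(sum(x * x for x in xs) / len(xs))

# Toy signals (assumed values): a 300 Hz tone and a 4 kHz tone at 16 kHz sampling.
sample_rate = 16000
low_tone = [math.sin(2 * math.pi * 300 * t / sample_rate) for t in range(800)]
high_tone = [math.sin(2 * math.pi * 4000 * t / sample_rate) for t in range(800)]

taps = lowpass_fir(1000, sample_rate)        # 1 kHz static cut-off
low_out = apply_fir(low_tone, taps)
high_out = apply_fir(high_tone, taps)
```

Comparing `rms(low_out[100:700])` with `rms(high_out[100:700])` shows the tone below the cut-off passing almost unchanged while the tone above it is strongly attenuated. The slices skip the zero-padded edges.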
Contributions on Automatic Recognition of Faces using Local Texture Features
One of the most prominent topics in computer vision stems from automatic facial analysis. In particular, the accurate detection of human faces and their biometric analysis have attracted special interest because of the large number of applications that currently rely on these mechanisms.
This doctoral thesis analyzes separately the problems of accurate face detection based on eye localization and of face recognition based on the extraction of local texture features. The algorithms developed address the problem of extracting identity from a face image (in frontal or semi-frontal view) in partially controlled scenarios. The goal is to develop robust algorithms that can be easily incorporated into real applications, such as advanced security in banking or the definition of commercial strategies for the retail sector.
Regarding the extraction of local textures, an exhaustive analysis of the most widely used descriptors was carried out, with special emphasis on Histograms of Oriented Gradients (HOG features). On normalized representations of the face, these descriptors provide discriminative information about facial elements (eyes, mouth, etc.) and are robust to variations in illumination and to small displacements.
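As an illustration of what a HOG descriptor measures, the following sketch computes the orientation histogram of a single grayscale cell in pure Python. It is a simplified version written for this listing, not the thesis implementation: it uses central differences, hard binning of unsigned gradients into 9 bins, and per-cell L2 normalisation, whereas full HOG pipelines typically interpolate between bins and normalise over blocks of cells. The function name `hog_cell_histogram` and the toy cells are assumptions.

```python
import math

def hog_cell_histogram(cell, num_bins=9):
    """Magnitude-weighted histogram of unsigned gradient orientations for one cell.

    `cell` is a list of rows of grayscale values. Border pixels are skipped
    because central differences need both neighbours.
    """
    height, width = len(cell), len(cell[0])
    hist = [0.0] * num_bins
    bin_width = 180.0 / num_bins
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = cell[y][x + 1] - cell[y][x - 1]
            gy = cell[y + 1][x] - cell[y - 1][x]
            magnitude = math.hypot(gx, gy)
            # Fold the angle into [0, 180) so opposite directions share a bin.
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            hist[int(angle // bin_width) % num_bins] += magnitude
    norm = math.sqrt(sum(v * v for v in hist)) + 1e-9
    return [v / norm for v in hist]

# Toy 8x8 cells (assumed data): a vertical edge and a horizontal edge.
vertical_edge = [[0, 0, 0, 0, 1, 1, 1, 1] for _ in range(8)]
horizontal_edge = [[0] * 8 for _ in range(4)] + [[1] * 8 for _ in range(4)]

vertical_hist = hog_cell_histogram(vertical_edge)
horizontal_hist = hog_cell_histogram(horizontal_edge)
```

The vertical edge puts all of its energy in the bin around 0 degrees and the horizontal edge in the bin around 90 degrees. Because each histogram is normalised, multiplying every pixel value by a constant leaves the descriptor essentially unchanged, which is one source of the robustness to illumination mentioned above.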
Different classification algorithms were chosen for face detection and recognition, all based on a supervised learning strategy. In particular, boosting classifiers and Support Vector Machines (SVM) on HOG descriptors were used for eye localization. For face recognition, a new algorithm was developed, HOG-EBGM (HOG on Elastic Bunch Graph Matching). Given a face image, the scheme followed by this algorithm can be summarized in a few steps: in a first stage, …
Monzó Ferrer, D. (2012). Contributions on Automatic Recognition of Faces using Local Texture Features [Unpublished doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/16698
Latent Dependency Mining for Solving Regression Problems in Computer Vision
PhD
Regression-based frameworks, which learn a direct mapping between low-level imagery features and continuous vector or scalar labels, have been widely exploited in computer vision, e.g. in crowd counting, age estimation and human pose estimation. Over the last decade, computer vision researchers have devoted considerable effort to better regression fitting. Nevertheless, solving these computer vision problems with regression frameworks remains a formidable challenge due to 1) feature variation and 2) imbalanced and sparse data. On the one hand, large feature variation can be caused by changes in extrinsic conditions (i.e. images taken under different lighting conditions and viewing angles) and in intrinsic conditions (e.g. the different aging processes of different persons in age estimation, and inter-object occlusion in crowd density estimation). On the other hand, imbalanced and sparse data distributions can also have an important effect on regression performance. These two challenges in regression learning are evidently related: the feature inconsistency problem is compounded by sparse and imbalanced training data and vice versa, so they need to be tackled jointly in modelling and explicitly in representation. First, this thesis mines an intermediary feature representation that concatenates spatially localised features so that information is shared among neighbouring localised cells in the frames. Second, it introduces the cumulative attribute concept, constructed for learning a regression model by exploiting the latent cumulative dependent nature of the label space in regression, and applies it to facial age and crowd density estimation. Third, it demonstrates the effectiveness of a discriminative structured-output regression framework that learns the inherent latent correlation between the elements of the output variables, applied to 2D human upper body pose estimation. The effectiveness of the proposed regression frameworks for crowd counting, age estimation and human pose estimation is validated on public benchmarks.
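The cumulative attribute idea can be sketched with a small encoding example. The abstract does not define the encoding, so the version below is an assumed reading: a scalar label such as an age or a crowd count is mapped to a binary vector whose i-th entry is 1 when the label is at least i. The function names and the maximum label of 5 are hypothetical.

```python
def cumulative_attributes(label, max_label):
    """Encode an integer label as a cumulative binary vector (assumed encoding)."""
    return [1 if i <= label else 0 for i in range(1, max_label + 1)]

def decode_label(attributes):
    """Recover the label from a clean cumulative vector by counting the ones."""
    return sum(attributes)

def hamming_distance(a, b):
    """Number of positions at which two attribute vectors differ."""
    return sum(x != y for x, y in zip(a, b))

code_3 = cumulative_attributes(3, 5)   # [1, 1, 1, 0, 0]
code_4 = cumulative_attributes(4, 5)   # [1, 1, 1, 1, 0]
code_5 = cumulative_attributes(5, 5)   # [1, 1, 1, 1, 1]
```

Under this encoding, neighbouring labels share most of their attributes: `code_3` and `code_4` differ in one position, `code_3` and `code_5` in two. That shared structure is one way samples with nearby labels could support each other when training data are sparse or imbalanced.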
Improving Deep Representation Learning with Complex and Multimodal Data.
Representation learning has emerged as a way to learn meaningful representations from data and has led to breakthroughs in many applications, including visual object recognition, speech recognition, and text understanding. However, learning representations from complex high-dimensional sensory data is challenging because there are many irrelevant factors of variation (e.g., data transformation, random noise). On the other hand, to build an end-to-end prediction system for structured output variables, one needs to incorporate probabilistic inference to properly model a mapping from a single input to possible configurations of output variables. This thesis addresses limitations of current representation learning in two parts.
The first part discusses efficient algorithms for learning invariant representations based on restricted Boltzmann machines (RBMs). Pointing out the difficulty of learning, we develop an efficient initialization method for sparse and convolutional RBMs. On top of that, we develop variants of the RBM that learn representations invariant to data transformations such as translation, rotation, or scale variation, by pooling the filter responses of input data after a transformation, or invariant to irrelevant patterns such as random or structured noise, by jointly performing feature selection and feature learning. We demonstrate improved performance on visual object recognition and weakly supervised foreground object segmentation.
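The pooling idea in this paragraph can be illustrated outside an RBM with a toy example. The sketch below is not the thesis model: it applies one fixed linear filter to every circular shift of a 1D input and keeps the maximum response, which makes the pooled value invariant to circular translation. The function names, the filter and the input values are assumptions.

```python
def filter_response(filt, x):
    """Dot product of a filter with an input of the same length."""
    return sum(f * v for f, v in zip(filt, x))

def circular_shifts(x):
    """All circular translations of a 1D input (the assumed transformation set)."""
    return [x[k:] + x[:k] for k in range(len(x))]

def pooled_response(filt, x):
    """Max-pool the filter response over every transformed copy of the input."""
    return max(filter_response(filt, shifted) for shifted in circular_shifts(x))

filt = [1, 2, 0, 0]
pattern = [0, 1, 2, 0]
shifted_pattern = [2, 0, 0, 1]           # the same pattern, circularly shifted

raw_a = filter_response(filt, pattern)
raw_b = filter_response(filt, shifted_pattern)
pooled_a = pooled_response(filt, pattern)
pooled_b = pooled_response(filt, shifted_pattern)
```

The raw responses to a pattern and its shifted copy can differ, but the pooled responses are identical because both inputs generate the same set of shifted copies.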
The second part discusses conditional graphical models and learning frameworks for structured output variables using deep generative models as priors. For example, we combine the best properties of the CRF and the RBM to enforce both local and global (e.g., object shape) consistencies for visual object segmentation. Furthermore, we develop a deep conditional generative model of structured output variables, which is an end-to-end system trainable by backpropagation. We demonstrate the importance of a global prior and probabilistic inference for visual object segmentation. Second, we develop a novel multimodal learning framework by casting the problem as a set of structured output representation learning problems, where the output is one data modality to be predicted from the other modalities, and vice versa. We explain how our method could be more effective than maximum likelihood learning and demonstrate state-of-the-art performance on visual-text and visual-only recognition tasks.
PhD, Electrical Engineering: Systems, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/113549/1/kihyuks_1.pd
Large-scale Functional Connectivity in the Human Brain Reveals Fundamental Mechanisms of Cognitive, Sensory and Emotion Processing in Health and Psychiatric Disorders
Functional connectivity networks that integrate remote areas of the brain as working functional units are thought to underlie fundamental mechanisms of perception and cognition, and have emerged as an active area of investigation. However, traditional approaches to measuring functional connectivity are limited in that they rely on a priori specification of one or a few brain regions. Therefore, data-driven and exploratory approaches that assess functional connectivity on a large scale need to be developed in order to further understand the functional network organization of these processes in both health and disease.
In this thesis project, I investigate the roles of functional connectivity in visual search (Chapter 2, (Pantazatos, Yanagihara et al., 2012)) and bistable perception (Chapter 3, (Karten et al., 2013)) using traditional functional connectivity approaches, and develop and apply new approaches to characterize the large-scale networks underlying the processing of supraliminal (Chapter 4, (Pantazatos et al., 2012a)) and subliminal (Chapter 5, (Pantazatos, Talati et al., 2012b)) emotional threat signals, speech and song processing in autism (Chapter 6, (Lai et al., 2012)), and face processing in social anxiety disorder (Chapter 7, (Pantazatos et al., 2013)). Finally, I complement the latter study with an investigation of structural morphological abnormalities in social anxiety disorder (Chapter 8, (Talati et al., 2013)). Each of these chapters has been or is about to be published in a peer-reviewed journal, and this thesis provides an overview of the entire body of investigation, based on advances in understanding the role of large-scale neural processes as fundamental organizational units that underlie behavior.
In Chapter 2, Independent Components Analysis (ICA), Psychophysiological Interactions (PPI) and Dynamic Causal Modeling (DCM) analyses were used to investigate the hypothesis that expectation and attention-related interactions between ventral and medial prefrontal cortex and association visual cortex underlie visual search for an object. Results extend previous models of visual search processes to include specific frontal-occipital neuronal interactions during a natural and complex search task. In Chapter 3, PPI analyses revealed percept-dependent changes in connectivity between visual cortex, frontoparietal attention and default mode networks during bistable image perception. These findings advance neural models of bistable perception by implicating the default mode and frontoparietal networks during image segmentation.
In Chapters 4 and 5, an exploratory approach based on multivariate pattern analysis of large-scale, condition-dependent functional connectivity was developed and applied in order to further understand the neural mechanisms of threat-related emotion processing. This approach was successful in extracting sufficient information to "brain-read" both unattended supraliminal (Chapter 4) and subliminal (Chapter 5) fear perception in healthy subjects. Informative features for supraliminal fear perception included functional connections between thalamus and superior temporal gyrus, angular gyrus and hippocampus, and fusiform and amygdala, while informative features for subliminal fear perception included middle temporal gyrus, cerebellum and angular gyrus.
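As a rough illustration of how large-scale functional connectivity can be turned into features for pattern analysis, the sketch below computes one correlation value per pair of regions from their time series. The abstract does not state which connectivity measure or classifier was used, so Pearson correlation, the function names and the toy time series are assumptions. The region labels merely echo regions named in the text.

```python
import math
from itertools import combinations

def pearson(a, b):
    """Pearson correlation between two equal-length time series."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def connectivity_features(region_series):
    """One feature per unordered region pair: the correlation of their signals."""
    names = sorted(region_series)
    return {
        (r1, r2): pearson(region_series[r1], region_series[r2])
        for r1, r2 in combinations(names, 2)
    }

# Toy time series for one subject and one condition (assumed values).
toy_series = {
    "thalamus": [1.0, 2.0, 3.0, 4.0, 5.0],
    "superior_temporal_gyrus": [2.0, 4.0, 6.0, 8.0, 10.0],
    "amygdala": [5.0, 4.0, 3.0, 2.0, 1.0],
}

features = connectivity_features(toy_series)
```

With R regions this yields R(R-1)/2 features per subject and condition. In a pattern-analysis setting these feature vectors would be passed to a classifier, and the connections that contribute most to the classification would be reported as informative.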
In psychiatric disorders, large-scale functional connectivity is typically assessed during the resting state (i.e. with no task or stimulus). However, disorder-dependent alterations in functional network architecture may be more or less prominent during a stimulus or task that is behaviorally relevant to the disorder, as exemplified by enhanced long-range, frontal-posterior connectivity during song (vs. speech) perception in autism (Chapter 6). In the case of social anxiety disorder (SAD), pattern analysis of large-scale functional connectivity during neutral face perception was sensitive enough to discriminate individual subjects with SAD from both healthy controls and subjects with panic disorder (Chapter 7). The most informative feature was functional connectivity between left hippocampus and left temporal pole, which was reduced in medication-free SAD subjects and increased following 8 weeks of SSRI treatment, with greater increases correlating with greater decreases in symptom severity. This finding parallels observed neuroanatomical abnormalities in SAD, which include reduced grey matter volume in the temporal pole, in addition to increased grey matter volume in cerebellum and fusiform (Chapter 8). The above findings suggest promise for emerging functional connectivity and structure-based neurobiomarkers for SAD diagnosis and treatment effects.