Speaker-independent emotion recognition exploiting a psychologically-inspired binary cascade classification schema
In this paper, a psychologically-inspired binary cascade classification schema is proposed for speech emotion recognition. Performance is enhanced because commonly confused pairs of emotions are distinguishable from one another. Extracted features are related to statistics of pitch, formants, and energy contours, as well as spectrum, cepstrum, perceptual and temporal features, autocorrelation, MPEG-7 descriptors, Fujisaki's model parameters, voice quality, jitter, and shimmer. Selected features are fed as input to a k-nearest neighbor classifier and to support vector machines. Two kernels are tested for the latter: linear and Gaussian radial basis function. The recently proposed speaker-independent experimental protocol is tested on the Berlin emotional speech database for each gender separately. The best emotion recognition accuracy, achieved by support vector machines with linear kernel, equals 87.7%, outperforming state-of-the-art approaches. Statistical analysis is first carried out with respect to the classifiers' error rates and then to evaluate the information expressed by the classifiers' confusion matrices. © Springer Science+Business Media, LLC 2011
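The idea of a binary cascade, where each stage only has to separate two groups of emotions, can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the two features, the thresholds, the four emotion labels, and the arousal-first ordering are hypothetical, and a simple threshold rule stands in for the k-nearest neighbor and support vector machine classifiers named in the abstract.

```python
def make_stage(feature_index, threshold, low, high):
    """Build one binary stage (hypothetical threshold rule).

    `low` and `high` are either a final label (str) or another stage.
    """
    def stage(x):
        return high if x[feature_index] > threshold else low
    return stage


# Assumed features: x[0] = energy proxy, x[1] = pitch-variability proxy.
# Second-level stages each separate one pair of emotions.
high_arousal = make_stage(1, 0.5, low="anger", high="joy")
low_arousal = make_stage(1, 0.5, low="sadness", high="neutral")

# First-level stage: a single binary decision between the two groups.
root = make_stage(0, 0.5, low=low_arousal, high=high_arousal)


def classify(x, node=root):
    """Follow binary decisions until a label (a str) is reached."""
    while callable(node):
        node = node(x)
    return node


prediction = classify((0.9, 0.2))  # high energy, low pitch variability
```

In a fuller version, each stage would be a trained binary classifier rather than a fixed threshold; the routing logic in `classify` would stay the same.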
Robust Head-Pose Estimation Based on Partially-Latent Mixture of Linear Regressions
Head-pose estimation has many applications, such as social event analysis,
human-robot and human-computer interaction, driving assistance, and so forth.
Head-pose estimation is challenging because it must cope with changing
illumination conditions, variabilities in face orientation and in appearance,
partial occlusions of facial landmarks, as well as bounding-box-to-face
alignment errors. We propose to use a mixture of linear regressions with
partially-latent output. This regression method learns to map high-dimensional
feature vectors (extracted from bounding boxes of faces) onto the joint space
of head-pose angles and bounding-box shifts, such that they are robustly
predicted in the presence of unobservable phenomena. We describe in detail the
mapping method that combines the merits of unsupervised manifold learning
techniques and of mixtures of regressions. We validate our method with three
publicly available datasets and we thoroughly benchmark four variants of the
proposed algorithm with several state-of-the-art head-pose estimation methods.
Comment: 12 pages, 5 figures, 3 table
Speaker segmentation and clustering
This survey focuses on two challenging speech processing topics, namely speaker segmentation and speaker clustering. Speaker segmentation aims at finding speaker change points in an audio stream, whereas speaker clustering aims at grouping speech segments based on speaker characteristics. Model-based, metric-based, and hybrid speaker segmentation algorithms are reviewed. Concerning speaker clustering, deterministic and probabilistic algorithms are examined. A comparative assessment of the reviewed algorithms is undertaken, the algorithm advantages and disadvantages are indicated, insight into the algorithms is offered, and deductions as well as recommendations are given. Rich transcription and movie analysis are candidate applications that benefit from combined speaker segmentation and clustering. © 2007 Elsevier B.V. All rights reserved
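As an illustration of what a metric-based segmentation algorithm can look like, the sketch below scores every candidate change point with a Bayesian Information Criterion (BIC) difference and keeps the best one. The choice of BIC, the one-dimensional Gaussian model, the synthetic frame values, and the names `delta_bic` and `find_change_point` are all assumptions made for this example; the abstract itself does not name a specific metric.

```python
import math


def log_var(xs):
    """Log of the maximum-likelihood variance, floored to avoid log(0)."""
    m = sum(xs) / len(xs)
    v = sum((x - m) ** 2 for x in xs) / len(xs)
    return math.log(max(v, 1e-12))


def delta_bic(xs, i, lam=1.0):
    """BIC gain of modelling xs[:i] and xs[i:] with two 1-D Gaussians
    instead of a single one. Positive values favour a change at i."""
    n = len(xs)
    left, right = xs[:i], xs[i:]
    gain = 0.5 * (n * log_var(xs)
                  - len(left) * log_var(left)
                  - len(right) * log_var(right))
    # The extra Gaussian adds 2 free parameters (mean, variance) in 1-D.
    penalty = lam * 0.5 * 2 * math.log(n)
    return gain - penalty


def find_change_point(xs, margin=5, lam=1.0):
    """Return the index with the largest positive delta-BIC, else None."""
    best_i, best = None, 0.0
    for i in range(margin, len(xs) - margin + 1):
        d = delta_bic(xs, i, lam)
        if d > best:
            best_i, best = i, d
    return best_i


# Synthetic 1-D "features": 20 frames near 0.0, then 20 frames near 3.0.
speaker_a = [0.1 * ((k % 5) - 2) for k in range(20)]
speaker_b = [3.0 + 0.1 * ((k % 5) - 2) for k in range(20)]
change = find_change_point(speaker_a + speaker_b)
```

Real systems would apply the same test to multi-dimensional cepstral features inside a sliding window; the penalty weight `lam` then controls the trade-off between missed and false change points.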
Fast and Reliable Autonomous Surgical Debridement with Cable-Driven Robots Using a Two-Phase Calibration Procedure
Automating precision subtasks such as debridement (removing dead or diseased
tissue fragments) with Robotic Surgical Assistants (RSAs) such as the da Vinci
Research Kit (dVRK) is challenging due to inherent non-linearities in
cable-driven systems. We propose and evaluate a novel two-phase coarse-to-fine
calibration method. In Phase I (coarse), we place a red calibration marker on
the end effector and let it randomly move through a set of open-loop
trajectories to obtain a large sample set of camera pixels and internal robot
end-effector configurations. This coarse data is then used to train a Deep
Neural Network (DNN) to learn the coarse transformation bias. In Phase II
(fine), the bias from Phase I is applied to move the end-effector toward a
small set of specific target points on a printed sheet. For each target, a
human operator manually adjusts the end-effector position by direct contact
(not through teleoperation) and the residual compensation bias is recorded.
This fine data is then used to train a Random Forest (RF) to learn the fine
transformation bias. Subsequent experiments suggest that without calibration,
position errors average 4.55mm. Phase I can reduce average error to 2.14mm and
the combination of Phase I and Phase II can reduce average error to 1.08mm. We
apply these results to debridement of raisins and pumpkin seeds as fragment
phantoms. Using an endoscopic stereo camera with standard edge detection,
experiments with 120 trials achieved average success rates of 94.5%, exceeding
prior results with much larger fragments (89.4%) and achieving a speedup of
2.1x, decreasing time per fragment from 15.8 seconds to 7.3 seconds. Source
code, data, and videos are available at
https://sites.google.com/view/calib-icra/.
Comment: Code, data, and videos are available at
https://sites.google.com/view/calib-icra/. Final version for ICRA 201
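The coarse-to-fine structure described in this abstract can be sketched as a residual correction: a first model is fitted on many automatically collected samples, and a second model learns only what the first one leaves over, from a few manually corrected targets. This is a toy one-dimensional illustration under stated assumptions, not the authors' implementation: a least-squares line stands in for the Deep Neural Network, a nearest-neighbor lookup stands in for the Random Forest, and `true_bias` is an invented positioning bias.

```python
import math


def true_bias(x):
    # Hypothetical nonlinear positioning bias (mm); invented for this sketch.
    return 0.8 * x + 0.3 * math.sin(3.0 * x)


def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + c (stand-in for the Phase I DNN)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx


# Phase I (coarse): many samples gathered automatically over the workspace.
coarse_x = [i / 50.0 for i in range(101)]  # 0.0 to 2.0
slope, intercept = fit_line(coarse_x, [true_bias(x) for x in coarse_x])


def coarse_bias(x):
    return slope * x + intercept


# Phase II (fine): a few manually corrected targets; store what Phase I missed.
fine_x = [0.25, 0.75, 1.25, 1.75]
fine_residual = [true_bias(x) - coarse_bias(x) for x in fine_x]


def fine_bias(x):
    """Nearest-neighbor residual lookup (stand-in for the Phase II RF)."""
    j = min(range(len(fine_x)), key=lambda k: abs(fine_x[k] - x))
    return fine_residual[j]


def mean_abs_error(predict, points):
    return sum(abs(true_bias(x) - predict(x)) for x in points) / len(points)


held_out = [0.2, 0.7, 1.3, 1.8]
err_none = mean_abs_error(lambda x: 0.0, held_out)
err_coarse = mean_abs_error(coarse_bias, held_out)
err_both = mean_abs_error(lambda x: coarse_bias(x) + fine_bias(x), held_out)
```

On this toy problem the ordering of the three errors mirrors the pattern reported in the abstract (uncalibrated, Phase I only, Phase I plus Phase II), although the numbers themselves are unrelated to the paper's measurements.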
Prosody-Based Automatic Segmentation of Speech into Sentences and Topics
A crucial step in processing speech audio data for information extraction,
topic detection, or browsing/playback is to segment the input into sentence and
topic units. Speech segmentation is challenging, since the cues typically
present for segmenting text (headers, paragraphs, punctuation) are absent in
spoken language. We investigate the use of prosody (information gleaned from
the timing and melody of speech) for these tasks. Using decision tree and
hidden Markov modeling techniques, we combine prosodic cues with word-based
approaches, and evaluate performance on two speech corpora, Broadcast News and
Switchboard. Results show that the prosodic model alone performs on par with,
or better than, word-based statistical language models -- for both true and
automatically recognized words in news speech. The prosodic model achieves
comparable performance with significantly less training data, and requires no
hand-labeling of prosodic events. Across tasks and corpora, we obtain a
significant improvement over word-only models using a probabilistic combination
of prosodic and lexical information. Inspection reveals that the prosodic
models capture language-independent boundary indicators described in the
literature. Finally, cue usage is task and corpus dependent. For example, pause
and pitch features are highly informative for segmenting news speech, whereas
pause, duration and word-based cues dominate for natural conversation.
Comment: 30 pages, 9 figures. To appear in Speech Communication 32(1-2),
Special Issue on Accessing Information in Spoken Audio, September 200