Continuous Multimodal Emotion Recognition Approach for AVEC 2017
This paper reports an analysis of audio and visual features for predicting the
continuous emotion dimensions in the seventh Audio/Visual Emotion Challenge
(AVEC 2017); the work was carried out as part of a second-year B.Tech.
internship project. For visual features we used HOG (Histogram of Oriented
Gradients) features, Fisher encodings of SIFT (Scale-Invariant Feature
Transform) features based on a Gaussian mixture model (GMM), and activations of
pretrained convolutional neural network layers, all extracted for each video
clip. For audio features we used the bag-of-audio-words (BoAW) representation
of the low-level descriptors (LLDs), generated with the openXBOW tool provided
by the organisers of the challenge. We then trained a fully connected neural
network regression model for each of these modalities and applied multimodal
fusion to the model outputs, reporting the concordance correlation coefficient
(CCC) on both the development and test sets.
Comment: 4 pages, 3 figures, arXiv:1605.06778, arXiv:1512.0338
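The challenge metric is standard and easy to state; as a reference, here is a minimal NumPy sketch of the concordance correlation coefficient (the function name is ours):

```python
import numpy as np

def concordance_cc(y_true, y_pred):
    """Concordance correlation coefficient (CCC), the AVEC 2017 metric:
    2*cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))**2)."""
    mu_t, mu_p = y_true.mean(), y_pred.mean()
    cov = ((y_true - mu_t) * (y_pred - mu_p)).mean()
    return 2 * cov / (y_true.var() + y_pred.var() + (mu_t - mu_p) ** 2)
```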
Occlusion-Aware Human Pose Estimation with Mixtures of Sub-Trees
In this paper, we study the problem of learning a model for human pose
estimation as mixtures of compositional sub-trees in two layers of prediction.
This involves estimating the pose of a sub-tree followed by identifying the
relationships between sub-trees that are used to handle occlusions between
different parts. The mixtures of the sub-trees are learnt utilising both
geometric and appearance distances. The Chow-Liu (CL) algorithm is recursively
applied to determine the inter-relations between the nodes and to build the
structure of the sub-trees. These structures are used to learn the latent
parameters of the sub-trees and the inference is done using a standard belief
propagation technique. The proposed method handles occlusions during the
inference process by identifying overlapping regions between different
sub-trees and introducing a penalty term for overlapping parts. Experiments are
performed on three different datasets: the Leeds Sports, Image Parse and UIUC
People datasets. The results show that the proposed method is more robust to
occlusions than state-of-the-art approaches.
Comment: 12 pages, 5 figures and 3 tables
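For readers unfamiliar with the Chow-Liu step, its core is simple: given pairwise mutual information between part nodes, the tree structure is the maximum-weight spanning tree over those affinities. The SciPy-based sketch below is our illustration, not the paper's implementation:

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

def chow_liu_edges(mi):
    """Edges of the Chow-Liu tree for a symmetric matrix `mi` of pairwise
    mutual information. Negating the weights turns SciPy's minimum
    spanning tree into a maximum spanning tree; assumes all off-diagonal
    entries of `mi` are positive (zeros are treated as missing edges)."""
    mst = minimum_spanning_tree(-np.asarray(mi, dtype=float))
    rows, cols = mst.nonzero()
    return list(zip(rows.tolist(), cols.tolist()))
```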
Depression Scale Recognition from Audio, Visual and Text Analysis
Depression is a major mental health disorder that is rapidly affecting lives
worldwide. Depression not only impacts the emotional but also the physical and
psychological state of a person. Its symptoms include lack of interest in
daily activities, feeling low, anxiety, frustration, weight loss, and even
feelings of self-hatred. This report describes our work for the Audio/Visual
Emotion Challenge (AVEC) 2017, carried out during a second-year B.Tech. summer
internship. With the growing demand for detecting depression automatically
with the help of machine learning algorithms, we present our multimodal
feature extraction and decision-level fusion approach for this task. Features
are extracted from the provided Distress Analysis Interview Corpus - Wizard of
Oz (DAIC-WOZ) database. Gaussian mixture model (GMM) clustering and a Fisher
vector approach were applied to the visual data; statistical descriptors of
gaze and pose, low-level audio features, head-pose features, and text features
were also extracted. Classification is performed on fused as well as
individual features using support vector machines (SVMs) and neural networks.
The results surpass the provided baseline on the validation set by 17% on
audio features and 24.5% on video features.
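To illustrate the GMM/Fisher-vector step, here is a minimal first-order Fisher vector encoder; the diagonal-covariance choice and helper name are our assumptions, and full encodings usually add second-order (variance) terms:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fisher_vector_means(X, gmm):
    """First-order Fisher vector: responsibility-weighted gradients of the
    log-likelihood w.r.t. the GMM means, for a GMM fitted with
    covariance_type='diag'. X is an (N, D) array of local descriptors."""
    q = gmm.predict_proba(X)                           # (N, K) responsibilities
    diff = (X[:, None, :] - gmm.means_[None]) / np.sqrt(gmm.covariances_)[None]
    fv = (q[:, :, None] * diff).sum(axis=0)            # (K, D) gradient block
    fv /= X.shape[0] * np.sqrt(gmm.weights_)[:, None]  # Fisher normalisation
    return fv.ravel()

# usage sketch: gmm = GaussianMixture(64, covariance_type='diag').fit(descriptors)
```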
Real-time 3D Traffic Cone Detection for Autonomous Driving
Considerable progress has been made in semantic scene understanding of road
scenes with monocular cameras. It is, however, mainly related to certain
classes such as cars and pedestrians. This work investigates traffic cones, an
object class crucial for traffic control in the context of autonomous vehicles.
3D object detection using images from a monocular camera is intrinsically an
ill-posed problem. In this work, we leverage the unique structure of traffic
cones and propose a pipelined approach to the problem. Specifically, we first
detect cones in images by a tailored 2D object detector; then, the spatial
arrangement of keypoints on a traffic cone is detected by our deep structural
regression network, where the fact that the cross-ratio is projection invariant
is leveraged for network regularization; finally, the 3D position of cones is
recovered by the classical Perspective n-Point algorithm. Extensive experiments
show that our approach can accurately detect traffic cones and estimate their
position in the 3D world in real time. The proposed method is also deployed on
a real-time, critical system. It runs efficiently on the low-power Jetson TX2,
providing accurate 3D position estimates, allowing a race-car to map and drive
autonomously on an unseen track indicated by traffic cones. With the help of
robust and accurate perception, our race-car won both Formula Student
Competitions held in Italy and Germany in 2018, cruising at a top speed of 54
km/h. A visualization of the complete pipeline, mapping, and navigation can be
found on our project page.
Comment: IEEE Intelligent Vehicles Symposium (IV'19). arXiv admin note: text
overlap with arXiv:1809.1054
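The final stage is classical Perspective-n-Point, for which OpenCV's solvePnP is the usual tool. In this hedged sketch, the keypoint layout, intrinsics, and 2D detections are all made-up stand-ins for the paper's actual values:

```python
import numpy as np
import cv2

# Hypothetical keypoints on a canonical cone (metres); the 2D points
# stand in for the keypoint network's output for one detected cone.
model_points = np.array([[ 0.0,   0.0, 0.325],   # tip
                         [-0.114, 0.0, 0.0  ],   # base, left
                         [ 0.114, 0.0, 0.0  ],   # base, right
                         [ 0.0,   0.0, 0.162]])  # mid band
image_points = np.array([[412., 203.], [385., 288.],
                         [441., 288.], [413., 245.]])
K = np.array([[700., 0., 640.], [0., 700., 360.], [0., 0., 1.]])  # intrinsics

ok, rvec, tvec = cv2.solvePnP(model_points, image_points, K, None)
if ok:
    print("cone position in the camera frame (m):", tvec.ravel())
```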
Parallel Matrix Condensation for Calculating Log-Determinant of Large Matrix
Calculating the log-determinant of a matrix is useful for statistical
computations used in machine learning, such as generative learning which uses
the log-determinant of the covariance matrix to calculate the log-likelihood of
model mixtures. The log-determinant calculation becomes challenging as the
number of variables becomes large. Therefore, finding a practical speedup for
this computation can be useful. In this study, we present a parallel matrix
condensation algorithm for calculating the log-determinant of a large matrix.
We demonstrate that in a distributed environment, Parallel Matrix Condensation
has several advantages over the well-known Parallel Gaussian Elimination. The
advantages include higher data-distribution efficiency and fewer
data-communication operations. We test our Parallel Matrix Condensation
against a self-implemented Parallel Gaussian Elimination as well as ScaLAPACK
(Scalable Linear Algebra Package) on matrices from 1000x1000 to 8000x8000 for
1, 2, 4, 8, 16, 32, 64, and 128 processors. The results show that Matrix
Condensation yields the best speed-up among all tested algorithms. The code is
available at
https://github.com/vbvg2008/MatrixCondensatio
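As a serial reference for the quantity being computed, here is a sketch of one well-known condensation scheme (Chio's) applied in log space; the paper's contribution is parallelising condensation steps of this kind, and pivoting and sign tracking are omitted here for brevity:

```python
import numpy as np

def logdet_condensation(A):
    """log|det(A)| by repeated condensation: each step maps an n x n
    matrix to (n-1) x (n-1) via B[i, j] = a11*A[i, j] - A[i, 0]*A[0, j],
    with det(A) = det(B) / a11**(n-2). Assumes nonzero leading pivots;
    this naive form suits only small, well-scaled matrices."""
    A = np.array(A, dtype=float)
    log_det = 0.0
    while A.shape[0] > 1:
        n, a11 = A.shape[0], A[0, 0]
        log_det -= (n - 2) * np.log(abs(a11))
        A = a11 * A[1:, 1:] - np.outer(A[1:, 0], A[0, 1:])
    return log_det + np.log(abs(A[0, 0]))

# sanity check against NumPy
M = np.random.rand(6, 6) + 6 * np.eye(6)
print(logdet_condensation(M), np.linalg.slogdet(M)[1])  # should match
```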
Clustering and Learning from Imbalanced Data
A learning classifier must outperform a trivial solution; with imbalanced
data, this condition usually does not hold. To overcome this problem, we
propose a novel data-level resampling method, Clustering Based Oversampling,
for improved learning from class-imbalanced datasets. The
essential idea behind the proposed method is to use the distance between a
minority class sample and its respective cluster centroid to infer the number
of new sample points to be generated for that minority class sample. The
proposed algorithm depends only weakly on the technique used to find cluster
centroids and does not affect majority-class learning in any way. It also
improves learning from imbalanced data by incorporating the distribution
structure of minority-class samples into the generation of new data samples.
The newly generated minority-class data is handled so as to prevent outlier
production and overfitting. Implementation analysis on
different datasets using deep neural networks as the learning classifier shows
the effectiveness of this method as compared to other synthetic data resampling
techniques across several evaluation metrics.
Comment: 9 pages, to appear at NIPS 2018 Workshop
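A rough sketch of the idea under our own assumptions (the exact distance-to-count rule and interpolation scheme are not specified in the abstract):

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_oversample(X_min, n_clusters=5, budget=500, seed=0):
    """Clustering-based oversampling sketch: give each minority sample a
    share of `budget` synthetic points proportional to its distance from
    its cluster centroid, then interpolate between sample and centroid so
    new points stay inside the cluster (limiting outlier production)."""
    rng = np.random.default_rng(seed)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(X_min)
    centers = km.cluster_centers_[km.labels_]
    d = np.linalg.norm(X_min - centers, axis=1)
    counts = np.rint(budget * d / d.sum()).astype(int)
    new = [c + rng.uniform(size=(k, 1)) * (x - c)
           for x, c, k in zip(X_min, centers, counts) if k > 0]
    return np.vstack(new) if new else np.empty((0, X_min.shape[1]))
```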
LiDAR-Camera Calibration using 3D-3D Point correspondences
With the advent of autonomous vehicles, LiDAR and cameras have become an
indispensable combination of sensors. They both provide rich and complementary
data which can be used by various algorithms and machine learning to sense and
make vital inferences about the surroundings. We propose a novel pipeline and
experimental setup to find accurate rigid-body transformation for extrinsically
calibrating a LiDAR and a camera. The pipeline uses 3D-3D point
correspondences in the LiDAR and camera frames and gives a closed-form
solution. We further validate the estimate by fusing point clouds from two
stereo cameras, which align perfectly under the rotation and translation
estimated by our method, confirming its accuracy both mathematically and
visually. Taking our idea of extrinsic LiDAR-camera calibration forward, we
demonstrate how two cameras with no overlapping field-of-view can also be
calibrated extrinsically using 3D point correspondences. The code has been
made available as an open-source ROS package; more information can be found
at:
https://github.com/ankitdhall/lidar_camera_calibration
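The closed-form solution for 3D-3D correspondences is typically the SVD-based Kabsch/Horn construction; a minimal sketch of that standard method follows (variable names ours, not necessarily the package's exact code):

```python
import numpy as np

def rigid_transform_3d(P, Q):
    """Closed-form rigid transform between matched (N, 3) point sets:
    returns R, t with Q ≈ P @ R.T + t (Kabsch / Horn, via SVD)."""
    cp, cq = P.mean(axis=0), Q.mean(axis=0)
    H = (P - cp).T @ (Q - cq)                 # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))    # guard against reflection
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = cq - R @ cp
    return R, t
```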
DisguiseNet: A Contrastive Approach for Disguised Face Verification in the Wild
This paper describes our approach for the Disguised Faces in the Wild (DFW)
2018 challenge. The task is to verify the identity of a person among disguised
and impostor images. Given the importance of face verification, it is
essential to compare methods on a common platform. Our approach is based on
the VGG-face architecture paired with a contrastive loss based on a cosine
distance metric. To augment the dataset, we sourced additional data from the
internet. The experiments show the effectiveness of the approach on the DFW
data. We show that adding extra data with noisy labels to the DFW dataset also
helps to increase the generalization performance of the network. The proposed
network achieves a 27.13% absolute increase in accuracy over the DFW baseline.
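A hedged sketch of the loss as described; the PyTorch form, margin value, and squared hinge are our assumptions:

```python
import torch
import torch.nn.functional as F

def cosine_contrastive_loss(emb_a, emb_b, is_genuine, margin=0.5):
    """Contrastive loss on cosine distance between embedding pairs (e.g.
    VGG-face features): genuine pairs (is_genuine = 1) are pulled
    together, impostor pairs pushed beyond the margin."""
    d = 1.0 - F.cosine_similarity(emb_a, emb_b)   # cosine distance per pair
    loss = is_genuine * d.pow(2) + (1 - is_genuine) * F.relu(margin - d).pow(2)
    return loss.mean()
```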
Unsupervised Learning of Eye Gaze Representation from the Web
Automatic eye gaze estimation has interested researchers for a while now. In
this paper, we propose an unsupervised learning-based method for estimating
the eye gaze region. To train the proposed network "Ize-Net" in a
self-supervised manner, we collect a large `in the wild' dataset containing
154,251 images from the web. For the images in the database, we divide the
gaze into three regions using an automatic technique based on pupil-center
localization, and then use a feature-based technique to determine the gaze
region. The performance is evaluated on the Tablet Gaze and CAVE datasets by
fine-tuning Ize-Net for the task of eye gaze estimation. The learned feature
representation is also used to train traditional machine learning algorithms
for eye gaze estimation. The results demonstrate that the proposed method
learns a rich data representation, which can be efficiently fine-tuned for any
eye gaze estimation dataset.
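For intuition only, here is a toy version of the three-way gaze labelling; the thresholds and the left/centre/right split are purely illustrative, whereas the paper's actual rule is feature-based:

```python
def gaze_region(pupil_x_norm, left=0.4, right=0.6):
    """Map a pupil centre's normalised horizontal position within the eye
    region to one of three coarse gaze regions. Illustrative only."""
    if pupil_x_norm < left:
        return "left"
    if pupil_x_norm > right:
        return "right"
    return "centre"

# e.g. gaze_region(0.72) -> 'right'
```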
Scanning gate microscopy of ultra clean carbon nanotube quantum dots
We perform scanning gate microscopy on individual suspended carbon nanotube
quantum dots. The size and position of the quantum dots can be visually
identified from the concentric high conductance rings. For the ultra clean
devices used in this study, two new effects are clearly identified.
Electrostatic screening creates multiple non-overlapping sets of Coulomb rings
from a single quantum dot. In double quantum dots, by changing the tip voltage,
the interactions between the quantum dots can be tuned from the weak to strong
coupling regime.
Comment: 5 pages, 4 figures
