1,519 research outputs found
Deep Multiple Instance Learning for Zero-shot Image Tagging
In line with the success of deep learning on traditional recognition problems,
several end-to-end deep models for zero-shot recognition have been proposed in
the literature. These models successfully predict a single unseen label
given an input image, but do not scale to cases where multiple unseen objects
are present. In this paper, we model this problem within the framework of
Multiple Instance Learning (MIL). To the best of our knowledge, we propose the
first end-to-end trainable deep MIL framework for the multi-label zero-shot
tagging problem. Owing to its novel design, the proposed framework has several
interesting features: (1) Unlike previous deep MIL models, it does not use any
off-line procedure (e.g., Selective Search or EdgeBoxes) for bag generation.
(2) At test time, it can process any number of unseen labels given their
semantic embedding vectors. (3) Using only seen labels per image as weak
annotation, it can produce a bounding box for each predicted label. We
experiment with the NUS-WIDE dataset and achieve superior performance across
conventional, zero-shot, and generalized zero-shot tagging tasks.
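The bag-level scoring idea behind such a MIL tagging framework can be sketched as follows. This is a minimal illustration, not the paper's exact architecture: the projection matrix `W`, the max-pooling choice, and all shapes are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def mil_zero_shot_scores(instance_feats, label_embeddings, W):
    """Score a bag of instance features against arbitrary label embeddings.

    instance_feats:   (n_instances, d_vis)  visual features for one image
    label_embeddings: (n_labels, d_sem)     semantic (word) vectors, seen or unseen
    W:                (d_vis, d_sem)        learned projection into semantic space

    Each instance is projected into the semantic space and scored against
    every label; the bag-level score for a label is the max over instances
    (standard MIL max pooling), so unseen labels only need an embedding vector.
    """
    projected = instance_feats @ W                    # (n_instances, d_sem)
    instance_scores = projected @ label_embeddings.T  # (n_instances, n_labels)
    return instance_scores.max(axis=0)                # (n_labels,)

# Toy example: 5 instances, 512-d visual features, 3 unseen 300-d label vectors.
feats = rng.normal(size=(5, 512))
labels = rng.normal(size=(3, 300))
W = rng.normal(size=(512, 300)) * 0.01
scores = mil_zero_shot_scores(feats, labels, W)  # shape (3,)
```

Because the label set only enters through `label_embeddings`, any number of unseen labels can be scored at test time without retraining.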
cvpaper.challenge in 2016: Futuristic Computer Vision through 1,600 Papers Survey
The paper presents futuristic challenges discussed in the cvpaper.challenge. In
2015 and 2016, we thoroughly studied 1,600+ papers from several
conferences/journals such as CVPR/ICCV/ECCV/NIPS/PAMI/IJCV.
Attention-based Multi-instance Neural Network for Medical Diagnosis from Incomplete and Low Quality Data
One way to extract patterns from clinical records is to consider each patient
record as a bag with a variable number of instances in the form of symptoms.
Medical diagnosis then amounts to discovering the informative instances and
mapping them to one or more diseases. In many cases, patients are represented
as vectors in some feature space and a classifier is then applied to generate
diagnosis results. However, in many real-world cases, data are often of low
quality for a variety of reasons, such as poor data consistency, integrity,
completeness, and accuracy. In this paper, we propose a novel approach, the
attention-based multi-instance neural network (AMI-Net), which performs
single-disease classification based only on the existing and valid information
in real-world outpatient records. For each patient, it takes a bag of
instances as input and outputs the bag label directly in an end-to-end way.
An embedding layer is adopted at the beginning, mapping instances into an
embedding space that represents the individual patient's condition. The
correlations among instances and their importance for the final classification
are captured by a multi-head attention transformer, instance-level
multi-instance pooling, and bag-level multi-instance pooling. The proposed
approach was tested on two non-standardized and highly imbalanced datasets,
one in the Traditional Chinese Medicine (TCM) domain and the other in the
Western Medicine (WM) domain. Our preliminary results show that the proposed
approach outperforms all baselines by a significant margin.
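Attention-based MIL pooling of the kind this abstract relies on is commonly formulated as a small two-layer scoring network over instance embeddings. The sketch below is a generic version of that idea, not AMI-Net's exact layers; the parameters `V` and `w` are illustrative.

```python
import numpy as np

def attention_mil_pool(H, w, V):
    """Attention pooling over a bag of instance embeddings.

    H: (n, d)  instance embeddings (e.g., embedded symptoms of one patient)
    V: (d, k)  first attention layer
    w: (k,)    second attention layer
    Attention weight a_i is proportional to exp(w^T tanh(V^T h_i)); the bag
    embedding is the weighted sum sum_i a_i h_i, which a classifier head can
    then map to a disease label. Returns (bag_embedding (d,), weights (n,)).
    """
    scores = np.tanh(H @ V) @ w     # (n,) unnormalized attention scores
    scores = scores - scores.max()  # subtract max for numerical stability
    a = np.exp(scores) / np.exp(scores).sum()
    return a @ H, a

H = np.arange(12, dtype=float).reshape(4, 3)  # toy bag: 4 instances, 3-d
bag, a = attention_mil_pool(H, np.array([0.5, -0.5]), np.ones((3, 2)) * 0.1)
```

Because the bag embedding is a weighted sum, the learned weights `a` double as an interpretability signal: they indicate which instances (symptoms) drove the prediction.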
A Review on Deep Learning Techniques Applied to Semantic Segmentation
Image semantic segmentation is of increasing interest to computer vision and
machine learning researchers. Many applications on the rise need
accurate and efficient segmentation mechanisms: autonomous driving, indoor
navigation, and even virtual or augmented reality systems to name a few. This
demand coincides with the rise of deep learning approaches in almost every
field or application target related to computer vision, including semantic
segmentation or scene understanding. This paper provides a review on deep
learning methods for semantic segmentation applied to various application
areas. Firstly, we describe the terminology of this field as well as mandatory
background concepts. Next, the main datasets and challenges are exposed to help
researchers decide which are the ones that best suit their needs and their
targets. Then, existing methods are reviewed, highlighting their contributions
and their significance in the field. Finally, quantitative results are given
for the described methods and the datasets in which they were evaluated,
following up with a discussion of the results. At last, we point out a set of
promising future works and draw our own conclusions about the state of the art
of semantic segmentation using deep learning techniques.
Comment: Submitted to TPAMI on Apr. 22, 201
Deep CNN Framework for Audio Event Recognition using Weakly Labeled Web Data
The development of audio event recognition models requires labeled training
data, which are generally hard to obtain. One promising source of recordings of
audio events is the large amount of multimedia data on the web. In particular,
if the audio content analysis must itself be performed on web audio, it is
important to train the recognizers themselves from such data. Training from
these web data, however, poses several challenges, the most important being
the availability of labels: labels, if any, that may be obtained for the data
are generally weak, and not of the kind conventionally required for training
detectors or classifiers. We propose that learning algorithms that can exploit
weak labels offer an effective method to learn from web data. We then propose a
robust and efficient deep convolutional neural network (CNN) based framework to
learn audio event recognizers from weakly labeled data. The proposed method can
train from and analyze recordings of variable length in an efficient manner and
outperforms a network trained with strongly labeled web data by a
considerable margin.
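A minimal form of training from weak clip-level labels is to pool per-segment predictions with a max and apply binary cross-entropy against the weak label. The sketch below illustrates that objective for one recording; the max-pooling choice is an assumption, not necessarily the paper's exact loss.

```python
import numpy as np

def clip_loss(segment_probs, clip_label):
    """Weak-label MIL objective for one recording.

    segment_probs: (n_segments,) per-segment event probabilities from the CNN
    clip_label:    0/1 weak label for the whole recording (event present or not)

    The clip-level probability is the max over segments (the event occurs
    somewhere in the clip if any segment fires), and the loss is the binary
    cross-entropy against the weak clip label.
    """
    p_clip = segment_probs.max()
    eps = 1e-7  # clamp away from 0/1 to keep the logs finite
    p_clip = np.clip(p_clip, eps, 1 - eps)
    return -(clip_label * np.log(p_clip) + (1 - clip_label) * np.log(1 - p_clip))
```

Note how recordings of variable length are handled for free: the pooling collapses any number of segments into one clip-level probability.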
Video-based Sign Language Recognition without Temporal Segmentation
Millions of hearing-impaired people around the world routinely use some
variants of sign languages to communicate, thus the automatic translation of a
sign language is meaningful and important. Currently, there are two
sub-problems in Sign Language Recognition (SLR), i.e., isolated SLR that
recognizes word by word and continuous SLR that translates entire sentences.
Existing continuous SLR methods typically utilize isolated SLRs as building
blocks, with an extra layer of preprocessing (temporal segmentation) and
another layer of post-processing (sentence synthesis). Unfortunately, temporal
segmentation itself is non-trivial and inevitably propagates errors into
subsequent steps. Worse still, isolated SLR methods typically require strenuous
labeling of each word separately in a sentence, severely limiting the amount of
attainable training data. To address these challenges, we propose a novel
continuous sign recognition framework, the Hierarchical Attention Network with
Latent Space (LS-HAN), which eliminates the preprocessing of temporal
segmentation. The proposed LS-HAN consists of three components: a two-stream
Convolutional Neural Network (CNN) for video feature representation generation,
a Latent Space (LS) for semantic gap bridging, and a Hierarchical Attention
Network (HAN) for latent space based recognition. Experiments are carried out
on two large-scale datasets. Experimental results demonstrate the effectiveness
of the proposed framework.
Comment: 32nd AAAI Conference on Artificial Intelligence (AAAI-18), Feb. 2-7, 2018, New Orleans, Louisiana, US
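The latent-space component of such a design can be illustrated as projecting video and sentence features into a shared space and scoring their relevance there. The projections `Wv` and `Ws` below are hypothetical stand-ins for the learned mappings, and cosine similarity is one simple choice of relevance score.

```python
import numpy as np

def latent_relevance(video_feat, sentence_feat, Wv, Ws):
    """Map a video feature and a sentence feature into a shared latent space
    and score their relevance by cosine similarity, bridging the semantic gap
    between the two modalities. Wv/Ws are illustrative learned projections."""
    v = video_feat @ Wv       # video representation in the latent space
    s = sentence_feat @ Ws    # sentence representation in the latent space
    return float(v @ s / (np.linalg.norm(v) * np.linalg.norm(s)))

rng = np.random.default_rng(1)
r = latent_relevance(rng.normal(size=8), rng.normal(size=6),
                     rng.normal(size=(8, 4)), rng.normal(size=(6, 4)))
```

Scoring whole sentences against whole videos in this shared space is what lets the framework skip temporal segmentation of the sign sequence.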
Weakly Supervised Few-shot Object Segmentation using Co-Attention with Visual and Semantic Embeddings
Significant progress has been made recently in developing few-shot object
segmentation methods. Learning is shown to be successful in few-shot
segmentation settings using pixel-level, scribble, and bounding box
supervision. This paper takes another approach, i.e., requiring only
image-level labels for few-shot object segmentation. We propose a novel
multi-modal interaction module for few-shot object segmentation that utilizes a
co-attention mechanism using both visual and word embedding. Our model using
image-level labels achieves 4.8% improvement over previously proposed
image-level few-shot object segmentation. It also outperforms state-of-the-art
methods that use weak bounding box supervision on PASCAL-5i. Our results show
that few-shot segmentation benefits from utilizing word embeddings, and that we
are able to perform few-shot segmentation using stacked joint visual semantic
processing with weak image-level labels. We further propose a novel setup,
Temporal Object Segmentation for Few-shot Learning (TOSFL) for videos. TOSFL
can be used on a variety of public video data such as Youtube-VOS, as
demonstrated in both instance-level and category-level TOSFL experiments.
Comment: Accepted to IJCAI'20. The first three authors listed contributed equally.
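A co-attention mechanism between visual features and a word embedding can be sketched, in a simplified one-directional form, as a bilinear affinity followed by a spatial softmax. The bilinear map `Wc` is an illustrative assumption, not the paper's exact multi-modal interaction module.

```python
import numpy as np

def co_attention(F, q, Wc):
    """Word-conditioned spatial attention over visual features.

    F:  (hw, d_v)   visual features, one row per spatial location
    q:  (d_w,)      word embedding of the class label
    Wc: (d_v, d_w)  bilinear affinity map (illustrative learned parameter)

    Computes an affinity between every location and the word embedding,
    normalizes it into spatial attention with a softmax, and pools a
    word-conditioned visual vector. Returns (pooled (d_v,), attention (hw,)).
    """
    affinity = F @ Wc @ q            # (hw,) location-word affinities
    affinity = affinity - affinity.max()  # stability before the softmax
    att = np.exp(affinity) / np.exp(affinity).sum()
    return att @ F, att

rng = np.random.default_rng(2)
pooled, att = co_attention(rng.normal(size=(9, 5)),   # 3x3 grid, 5-d features
                           rng.normal(size=4),        # word embedding
                           rng.normal(size=(5, 4)))
```

The spatial attention map itself is what supports segmentation from image-level labels: locations most affine to the word embedding are the candidate object pixels.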
Weakly Supervised Medical Diagnosis and Localization from Multiple Resolutions
Diagnostic imaging often requires the simultaneous identification of a
multitude of findings of varied size and appearance. Beyond global indication
of said findings, the prediction and display of localization information
improves trust in and understanding of results when augmenting clinical
workflow. Medical training data rarely includes more than global image-level
labels as segmentations are time-consuming and expensive to collect. We
introduce an approach to managing these practical constraints by applying a
novel architecture which learns at multiple resolutions while generating
saliency maps with weak supervision. Further, we parameterize the Log-Sum-Exp
pooling function with a learnable lower-bounded adaptation (LSE-LBA) to build
in a sharpness prior and better handle localizing abnormalities of different
sizes using only image-level labels. Applying this approach to interpreting
chest x-rays, we set the state of the art on 9 abnormalities in the NIH's CXR14
dataset while generating saliency maps with the highest resolution to date.
Comment: submitted to ECCV 201
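The LSE-LBA pooling described above has a simple closed form: with sharpness r = r0 + exp(β), it computes (1/r)·log of the mean of exp(r·s_j) over the saliency scores s_j, so r stays bounded below by r0 while β is learned, and the pool interpolates between mean pooling (r → 0) and max pooling (r → ∞). A numerically stable sketch, with parameter names assumed from that description:

```python
import numpy as np

def lse_lba_pool(saliency, r0=0.0, beta=0.0):
    """Log-Sum-Exp pooling with a lower-bounded adaptation (LSE-LBA).

    saliency: flattened saliency-map scores for one class
    r0:       non-negative lower bound on the sharpness (the sharpness prior)
    beta:     learnable parameter; effective sharpness r = r0 + exp(beta) > r0

    Small r behaves like mean pooling (robust for large abnormalities),
    large r approaches max pooling (sensitive to small abnormalities).
    """
    r = r0 + np.exp(beta)
    m = saliency.max()  # subtract the max before exponentiating for stability
    return m + np.log(np.mean(np.exp(r * (saliency - m)))) / r
```

Learning β instead of fixing r lets the network choose, per task, how sharply the saliency map is collapsed into the image-level prediction.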
Near Perfect Protein Multi-Label Classification with Deep Neural Networks
Artificial neural networks (ANNs) have gained a well-deserved popularity
among machine learning tools upon their recent successful applications in
image- and sound processing and classification problems. ANNs have also been
applied for predicting the family or function of a protein, knowing its residue
sequence. Here we present two new ANNs with multi-label classification ability,
showing impressive accuracy when classifying protein sequences into 698 UniProt
families (AUC=99.99%) and 983 Gene Ontology classes (AUC=99.45%).
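Multi-label classification of this kind is typically realized with one independent sigmoid output per class, thresholded separately, so a sequence can belong to several families and GO classes at once. A minimal sketch of that prediction step; the 0.5 threshold is an assumption.

```python
import numpy as np

def multilabel_predict(logits, threshold=0.5):
    """Independent sigmoid per class for multi-label prediction.

    logits: (n_classes,) raw network outputs for one protein sequence
    Unlike a softmax, each class probability is computed and thresholded
    independently, so any subset of labels can be predicted simultaneously.
    """
    probs = 1.0 / (1.0 + np.exp(-logits))  # element-wise sigmoid
    return (probs >= threshold).astype(int)

pred = multilabel_predict(np.array([2.0, -2.0, 0.1]))  # -> [1, 0, 1]
```

The per-class AUC figures quoted in the abstract are the natural evaluation for exactly this kind of independent per-class scoring.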
Deep Learning for Generic Object Detection: A Survey
Object detection, one of the most fundamental and challenging problems in
computer vision, seeks to locate object instances from a large number of
predefined categories in natural images. Deep learning techniques have emerged
as a powerful strategy for learning feature representations directly from data
and have led to remarkable breakthroughs in the field of generic object
detection. Given this period of rapid evolution, the goal of this paper is to
provide a comprehensive survey of the recent achievements in this field brought
about by deep learning techniques. More than 300 research contributions are
included in this survey, covering many aspects of generic object detection:
detection frameworks, object feature representation, object proposal
generation, context modeling, training strategies, and evaluation metrics. We
finish the survey by identifying promising directions for future research.
Comment: IJCV Mino