Early Prediction for Physical Human Robot Collaboration in the Operating Room
To enable a natural and fluent human robot collaboration flow, it is critical
for a robot to comprehend its human peers' ongoing actions, predict their
behaviors in the near future, and plan its own actions correspondingly.
Specifically, the capability of making early predictions is important, so that
the robot can foresee the precise timing of a turn-taking event and start
motion planning and execution early enough to smooth the turn-taking
transition. Such proactive behavior would reduce the human's waiting time,
increase efficiency, and enhance naturalness in collaborative tasks. To that
end, this paper presents the design and implementation of an early turn-taking
prediction algorithm tailored to physical human robot collaboration scenarios.
Specifically, a Robotic Scrub Nurse (RSN) system that can comprehend the
surgeon's multimodal communication cues and perform turn-taking prediction is
presented.
The developed algorithm was tested on a dataset of simulated surgical
procedures collected from surgeon-nurse pairs. The proposed turn-taking
prediction algorithm is found to be significantly superior to its algorithmic
counterparts, and is more accurate than the human baseline when only partial
input is given (less than 30% of the full action). After observing more
information, the algorithm achieves performance comparable to humans, with an
F1 score of 0.90.
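The abstract does not give the paper's feature set or classifier, so the following is only a minimal sketch of early prediction from partial input, assuming a recurrent classifier over a hypothetical 64-dimensional multimodal cue stream: a turn-taking probability is emitted at every timestep, so a prediction exists after seeing only a prefix of the action.

```python
# Hypothetical sketch: early turn-taking prediction from a partial
# observation of a multimodal cue sequence (feature_dim is assumed).
import torch
import torch.nn as nn

class EarlyTurnTakingPredictor(nn.Module):
    def __init__(self, feature_dim=64, hidden_dim=128):
        super().__init__()
        self.lstm = nn.LSTM(feature_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, 1)

    def forward(self, x):
        # x: (batch, time, feature_dim) multimodal cue features
        h, _ = self.lstm(x)
        # One turn-taking probability per timestep, so predictions are
        # available even when only a small prefix has been observed.
        return torch.sigmoid(self.head(h)).squeeze(-1)

model = EarlyTurnTakingPredictor()
full_action = torch.randn(1, 100, 64)   # simulated full observation
prefix = full_action[:, :30]            # first 30% of the action
print(model(prefix)[:, -1])             # early prediction from partial input
```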
Modeling Multi-turn Conversation with Deep Utterance Aggregation
Multi-turn conversation understanding is a major challenge for building
intelligent dialogue systems. This work focuses on retrieval-based response
matching for multi-turn conversation, where prior work simply concatenates the
conversation utterances and ignores the interactions among previous utterances
in context modeling. In this paper, we formulate previous utterances into
context using a proposed deep utterance aggregation model to form a
fine-grained context representation. In detail, a self-matching attention is
first introduced to route the vital information in each utterance. Then the
model matches a response with each refined utterance, and the final matching
score is obtained through attentive turn aggregation. Experimental results show
our model outperforms the state-of-the-art methods on three multi-turn
conversation benchmarks, including a newly introduced e-commerce dialogue
corpus.
Comment: Proceedings of the 27th International Conference on Computational
Linguistics (COLING 2018).
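As an illustration of the aggregation idea (not the paper's exact architecture; the dimensions, mean pooling, and cosine matching below are assumptions), this sketch refines each utterance with self-matching attention, matches the response against each refined utterance, and aggregates the per-turn scores attentively.

```python
# Illustrative sketch of deep utterance aggregation: self-matching
# attention per utterance, per-turn response matching, attentive pooling.
import torch
import torch.nn.functional as F

def self_matching(u):
    # u: (turns, words, dim); route salient words within each utterance
    att = torch.softmax(u @ u.transpose(1, 2), dim=-1)
    return att @ u

def match_score(context, response):
    # context: (turns, words, dim), response: (words, dim)
    refined = self_matching(context)
    turn_vec = refined.mean(dim=1)                   # (turns, dim)
    resp_vec = response.mean(dim=0)                  # (dim,)
    per_turn = F.cosine_similarity(turn_vec, resp_vec.expand_as(turn_vec))
    weights = torch.softmax(per_turn, dim=0)         # attentive aggregation
    return (weights * per_turn).sum()

ctx = torch.randn(5, 12, 32)    # 5 prior utterances, 12 words, dim 32
resp = torch.randn(12, 32)      # candidate response
print(match_score(ctx, resp))
```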
Learning Models for Actions and Person-Object Interactions with Transfer to Question Answering
This paper proposes deep convolutional network models that utilize local and
global context to make human activity label predictions in still images,
achieving state-of-the-art performance on two recent datasets with hundreds of
labels each. We use multiple instance learning to handle the lack of
supervision on the level of individual person instances, and weighted loss to
handle unbalanced training data. Further, we show how specialized features
trained on these datasets can be used to improve accuracy on the Visual
Question Answering (VQA) task, in the form of multiple choice fill-in-the-blank
questions (Visual Madlibs). Specifically, we tackle two types of questions, on
person activity and person-object relationships, and show improvements over
generic features trained on the ImageNet classification task.
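A minimal sketch of the two training devices mentioned above, with assumed shapes and an assumed inverse-frequency weighting scheme: multiple instance learning pools per-person scores into an image-level prediction (only image-level labels exist), and a class-weighted loss counters label imbalance.

```python
# Sketch: MIL over person boxes plus a class-weighted loss. The weighting
# scheme and shapes are illustrative assumptions, not the paper's exact setup.
import torch
import torch.nn.functional as F

def mil_image_logits(instance_logits):
    # instance_logits: (num_persons, num_classes) scores per person box.
    # Max over instances: the image is positive for a label if at least
    # one person instance supports it.
    return instance_logits.max(dim=0).values

# Inverse-frequency class weights to counter unbalanced training data.
label_freq = torch.tensor([0.50, 0.30, 0.15, 0.05])
pos_weight = 1.0 / label_freq

instance_logits = torch.randn(3, 4)          # 3 detected persons, 4 labels
image_labels = torch.tensor([1., 0., 0., 1.])
loss = F.binary_cross_entropy_with_logits(
    mil_image_logits(instance_logits), image_labels, pos_weight=pos_weight)
print(loss)
```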
FusionNet: Fusing via Fully-Aware Attention with Application to Machine Comprehension
This paper introduces a new neural structure called FusionNet, which extends
existing attention approaches from three perspectives. First, it puts forward a
novel concept of "history of word" to characterize attention information from
the lowest word-level embedding up to the highest semantic-level
representation. Second, it introduces an improved attention scoring function
that better utilizes the "history of word" concept. Third, it proposes a
fully-aware multi-level attention mechanism to capture the complete information
in one text (such as a question) and exploit it in its counterpart (such as
context or passage) layer by layer. We apply FusionNet to the Stanford Question
Answering Dataset (SQuAD) and it achieves the first position for both single
and ensemble models on the official SQuAD leaderboard at the time of writing
(Oct. 4th, 2017). Meanwhile, we verify the generalization of FusionNet with two
adversarial SQuAD datasets and it sets a new state of the art on both
datasets: on AddSent, FusionNet increases the best F1 metric from 46.6% to
51.4%; on AddOneSent, FusionNet boosts the best F1 metric from 56.0% to 60.7%.
Comment: Published in the Sixth International Conference on Learning
Representations (ICLR), 2018.
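The "history of word" concept can be sketched as follows (layer sizes and the bilinear scoring form are illustrative assumptions, not the paper's exact design): each word keeps the concatenation of all its representations, from the input embedding up to the deepest layer, and attention between question and context is scored over that full history.

```python
# Sketch of "history of word" plus fully-aware attention scoring.
import torch

def history_of_word(levels):
    # levels: list of (words, dim_i) tensors, lowest (embedding) first
    return torch.cat(levels, dim=-1)           # (words, sum of dims)

def fully_aware_attention(hist_q, hist_c, W):
    # Score every context word against every question word using the
    # complete multi-level history, not just the top layer.
    scores = hist_c @ W @ hist_q.T              # (ctx_words, q_words)
    return torch.softmax(scores, dim=-1)

q_levels = [torch.randn(6, 300), torch.randn(6, 128)]   # embedding, RNN layer
c_levels = [torch.randn(20, 300), torch.randn(20, 128)]
hq, hc = history_of_word(q_levels), history_of_word(c_levels)
W = torch.randn(hc.shape[-1], hq.shape[-1])
print(fully_aware_attention(hq, hc, W).shape)           # (20, 6)
```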
VD-BERT: A Unified Vision and Dialog Transformer with BERT
Visual dialog is a challenging vision-language task, where a dialog agent
needs to answer a series of questions through reasoning on the image content
and dialog history. Prior work has mostly focused on various attention
mechanisms to model such intricate interactions. By contrast, in this work, we
propose VD-BERT, a simple yet effective framework of unified vision-dialog
Transformer that leverages the pretrained BERT language models for Visual
Dialog tasks. The model is unified in that (1) it captures all the interactions
between the image and the multi-turn dialog using a single-stream Transformer
encoder, and (2) it supports both answer ranking and answer generation
seamlessly through the same architecture. More crucially, we adapt BERT for the
effective fusion of vision and dialog contents via visually grounded training.
Without the need for pretraining on external vision-language data, our model
yields a new state of the art, achieving the top position in both single-model
and ensemble settings (74.54 and 75.35 NDCG scores) on the visual dialog
leaderboard. Our code and pretrained models are released at
https://github.com/salesforce/VD-BERT.
Comment: EMNLP 2020 (14 pages).
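A minimal single-stream sketch in the spirit of the description above, assuming detector features of size 2048 and a small randomly initialized encoder (the real model initializes from pretrained BERT and also supports answer generation): image region features and dialog token embeddings are projected into one sequence and encoded jointly.

```python
# Sketch: one Transformer stream over concatenated image regions and
# multi-turn dialog embeddings; dimensions and heads are assumptions.
import torch
import torch.nn as nn

dim = 256
img_proj = nn.Linear(2048, dim)                 # detector features -> model dim
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
    num_layers=2)
rank_head = nn.Linear(dim, 1)                   # answer-ranking score

regions = torch.randn(1, 36, 2048)              # 36 detected image regions
dialog = torch.randn(1, 60, dim)                # embedded history + Q + answer
stream = torch.cat([img_proj(regions), dialog], dim=1)   # single input stream
score = rank_head(encoder(stream)[:, 0])        # pooled score for ranking
print(score)
```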
Machine Learning for Integrating Data in Biology and Medicine: Principles, Practice, and Opportunities
New technologies have enabled the investigation of biology and human health
at an unprecedented scale and in multiple dimensions. These dimensions include
a myriad of properties describing genome, epigenome, transcriptome, microbiome,
phenotype, and lifestyle. No single data type, however, can capture the
complexity of all the factors relevant to understanding a phenomenon such as a
disease. Integrative methods that combine data from multiple technologies have
thus emerged as critical statistical and computational approaches. The key
challenge in developing such approaches is the identification of effective
models to provide a comprehensive and relevant systems view. An ideal method
can answer a biological or medical question, identifying important features and
predicting outcomes, by harnessing heterogeneous data across several dimensions
of biological variation. In this Review, we describe the principles of data
integration and discuss current methods and available implementations. We
provide examples of successful data integration in biology and medicine.
Finally, we discuss current challenges in biomedical integrative methods and
our perspective on the future development of the field.
Putting Question-Answering Systems into Practice: Transfer Learning for Efficient Domain Customization
Traditional information retrieval (such as that offered by web search
engines) burdens users with information overload, stemming from extensive
result pages and the need to manually locate the desired information therein.
Conversely,
question-answering systems change how humans interact with information systems:
users can now ask specific questions and obtain a tailored answer - both
conveniently in natural language. Despite obvious benefits, their use is often
limited to an academic context, largely because of expensive domain
customizations, which means that the performance in domain-specific
applications often fails to meet expectations. This paper proposes
cost-efficient remedies: (i) we leverage metadata through a filtering
mechanism, which increases the precision of document retrieval, and (ii) we
develop a novel fuse-and-oversample approach for transfer learning in order to
improve the performance of answer extraction. Here, knowledge is inductively
transferred from a related, yet different, task to the domain-specific
application, while accounting for potential differences in the sample sizes
across both tasks. The resulting performance is demonstrated with actual use
cases from a finance company and the film industry, where fewer than 400
question-answer pairs had to be annotated in order to yield significant
performance gains. As a direct implication for management, this presents a
promising path to better leverage the knowledge stored in information systems.
Comment: Accepted by ACM TMIS.
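The fuse-and-oversample idea can be illustrated with a small sketch (the oversampling ratio below is a naive size-balancing assumption; the paper's exact scheme is not given in the abstract): source-task examples are fused with the small domain-specific set, which is oversampled to offset the difference in sample sizes.

```python
# Sketch: fuse a large source-task corpus with a small annotated
# domain-specific set, oversampling the latter. Data here is synthetic.
import random

source = [("src_q%d" % i, "src_a%d" % i) for i in range(10000)]
target = [("dom_q%d" % i, "dom_a%d" % i) for i in range(400)]  # ~400 pairs

factor = len(source) // len(target)      # naive size-balancing assumption
fused = source + target * factor         # oversampled target, fused corpus
random.shuffle(fused)
print(len(fused), fused[0])
```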
Weakly Supervised Representation Learning for Unsynchronized Audio-Visual Events
Audio-visual representation learning is an important task from the
perspective of designing machines with the ability to understand complex
events. To this end, we propose a novel multimodal framework that instantiates
multiple instance learning. We show that the learnt representations are useful
for classifying events and localizing their characteristic audio-visual
elements. The system is trained using only video-level event labels without any
timing information. An important feature of our method is its capacity to learn
from unsynchronized audio-visual events. We achieve state-of-the-art results on
a large-scale dataset of weakly-labeled audio event videos. Visualizations of
localized visual regions and audio segments substantiate our system's efficacy,
especially when dealing with noisy situations where modality-specific cues
appear asynchronously.
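A minimal sketch of the multiple-instance-learning setup, with assumed feature sizes and max pooling as the bag operator: each video is a bag of audio and visual segments, only the video-level event label supervises training, and pooling within each modality's own bag means the two streams need not be synchronized.

```python
# Sketch: weakly supervised audio-visual MIL with per-modality bags.
import torch
import torch.nn as nn

class AVMIL(nn.Module):
    def __init__(self, a_dim=128, v_dim=512, n_events=10):
        super().__init__()
        self.audio_head = nn.Linear(a_dim, n_events)
        self.video_head = nn.Linear(v_dim, n_events)

    def forward(self, audio_segs, visual_segs):
        # audio_segs: (n_a, a_dim), visual_segs: (n_v, v_dim)
        a = self.audio_head(audio_segs).max(dim=0).values   # pool audio bag
        v = self.video_head(visual_segs).max(dim=0).values  # pool visual bag
        # Late fusion of the two bag scores; supervised only by the
        # video-level event labels, never by timing information.
        return torch.sigmoid(a + v)

model = AVMIL()
print(model(torch.randn(8, 128), torch.randn(12, 512)))
```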
Reproducibility Evaluation of SLANT Whole Brain Segmentation Across Clinical Magnetic Resonance Imaging Protocols
Whole brain segmentation on structural magnetic resonance imaging (MRI) is
essential for understanding neuroanatomical-functional relationships.
Traditionally, multi-atlas segmentation has been regarded as the standard
method for whole brain segmentation. In the past few years, deep convolutional
neural network (DCNN) segmentation methods have demonstrated their advantages
in both accuracy and computational efficiency. Recently, we proposed the
spatially localized atlas network tiles (SLANT) method, which is able to
segment a 3D MRI brain scan into 132 anatomical regions. Commonly, DCNN
segmentation methods yield inferior performance under external validation,
especially when the testing patterns are not represented in the training
cohorts. Recently, we obtained a multi-sequence MRI brain cohort of 1480
clinically acquired, de-identified brain MRI scans from 395 patients, imaged
under seven different MRI protocols. Moreover, each subject has at
least two scans from different MRI protocols. Herein, we assess the SLANT
method's intra- and inter-protocol reproducibility. SLANT achieved a
coefficient of variation (CV) of less than 0.05 in intra-protocol experiments
and less than 0.15 in inter-protocol experiments. The results show that the
SLANT method achieves high intra- and inter-protocol reproducibility.
Comment: To appear in SPIE Medical Imaging 2019.
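The reproducibility metric itself is simple to state; below is a sketch (the volume numbers are made up for illustration) of the coefficient of variation of a region's segmented volume across a subject's repeated scans, CV = std / mean.

```python
# Sketch: per-region reproducibility as coefficient of variation (CV).
import numpy as np

def coefficient_of_variation(volumes):
    v = np.asarray(volumes, dtype=float)
    return v.std(ddof=1) / v.mean()

# e.g. a hippocampus volume (mm^3) from two scans under the same protocol:
intra = coefficient_of_variation([3512.0, 3498.0])
# the same subject segmented from scans under two different protocols:
inter = coefficient_of_variation([3512.0, 3775.0])
print(f"intra-protocol CV={intra:.3f}, inter-protocol CV={inter:.3f}")
```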
Learning to Measure Change: Fully Convolutional Siamese Metric Networks for Scene Change Detection
A critical challenge in scene change detection is that noisy changes
generated by varying illumination, shadows, and camera viewpoint make the
variations of a scene difficult to define and measure, since such noisy
changes are entangled with semantic ones. Following the intuitive idea of
detecting changes by directly comparing the dissimilarity between a pair of
features, we propose a novel fully convolutional Siamese metric network
(CosimNet) that measures changes by learning customized implicit metrics.
by customizing implicit metrics. To learn more discriminative metrics, we
utilize contrastive loss to reduce the distance between the unchanged feature
pairs and to enlarge the distance between the changed feature pairs.
Specifically, to address the issue of large viewpoint differences, we propose
Thresholded Contrastive Loss (TCL) with a more tolerant strategy to penalize
noisy changes. We demonstrate the effectiveness of the proposed approach with
experiments on three challenging datasets: CDnet, PCD2015, and VL-CMU-CD. Our
approach is robust to many challenging conditions, such as illumination
changes and large viewpoint differences caused by camera motion and zooming. In
addition, we incorporate the distance metric into the segmentation framework
and validate the effectiveness through visualization of change maps and feature
distribution. The source code is available at
https://github.com/gmayday1997/ChangeDet.
Comment: 10 pages, 12 figures.
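A sketch of the contrastive objective with the thresholded variant described above (the margin and tolerance values are assumptions): unchanged pairs are pulled together but tolerated up to a threshold, so noisy changes such as large viewpoint differences are not over-penalized, while changed pairs are pushed beyond a margin.

```python
# Sketch: contrastive loss with a tolerance threshold (TCL-style).
import torch
import torch.nn.functional as F

def thresholded_contrastive_loss(f1, f2, changed, margin=2.0, tau=0.3):
    # f1, f2: (batch, dim) features of an image pair; changed: (batch,) {0,1}
    d = F.pairwise_distance(f1, f2)
    # Unchanged pairs: only penalize distance above the tolerance tau,
    # so feature drift from viewpoint noise is not forced to zero.
    pull = (1 - changed) * torch.clamp(d - tau, min=0).pow(2)
    # Changed pairs: push distance beyond the margin.
    push = changed * torch.clamp(margin - d, min=0).pow(2)
    return (pull + push).mean()

f1, f2 = torch.randn(4, 64), torch.randn(4, 64)
changed = torch.tensor([0., 1., 0., 1.])
print(thresholded_contrastive_loss(f1, f2, changed))
```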