
    Early Prediction for Physical Human Robot Collaboration in the Operating Room

    To enable a natural and fluent human-robot collaboration flow, it is critical for a robot to comprehend its human peers' ongoing actions, predict their behavior in the near future, and plan its own actions accordingly. The capability of making early predictions is particularly important, so that the robot can foresee the precise timing of a turn-taking event and start motion planning and execution early enough to smooth the turn-taking transition. Such proactive behavior would reduce the human's waiting time, increase efficiency, and enhance naturalness in collaborative tasks. To that end, this paper presents the design and implementation of an early turn-taking prediction algorithm tailored to physical human-robot collaboration scenarios. Specifically, a Robotic Scrub Nurse (RSN) system is presented that can comprehend a surgeon's multimodal communication cues and perform turn-taking prediction. The developed algorithm was tested on a data set of simulated surgical procedures collected from a surgeon-nurse tandem. The proposed turn-taking prediction algorithm is found to be significantly superior to its algorithmic counterparts and more accurate than the human baseline when only a small portion of the input is observed (less than 30% of the full action). After observing more information, the algorithm achieves performance comparable to humans, with an F1 score of 0.90.
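    A minimal sketch of how such early-prediction performance can be evaluated: a classifier is scored on progressively longer prefixes of each action's multimodal feature sequence, reporting F1 at each observation ratio. This is only an illustration of the evaluation protocol implied by the abstract; `predict_fn`, the feature shapes, and the ratios are assumptions, not the paper's actual pipeline.

```python
import numpy as np
from sklearn.metrics import f1_score

def evaluate_early_prediction(sequences, labels, predict_fn, ratios=(0.1, 0.3, 0.5, 1.0)):
    """Score a turn-taking classifier on truncated observation prefixes.

    sequences : list of (T_i, D) arrays of multimodal features, one per trial
    labels    : binary ground-truth turn-taking labels, one per trial
    predict_fn: callable mapping a partial (t, D) array to a 0/1 prediction
    """
    results = {}
    for r in ratios:
        preds = []
        for seq in sequences:
            cutoff = max(1, int(len(seq) * r))      # observe only the first fraction r of the action
            preds.append(predict_fn(seq[:cutoff]))
        results[r] = f1_score(labels, preds, zero_division=0)
    return results

# Toy usage with random features and a trivial stand-in "classifier".
rng = np.random.default_rng(0)
seqs = [rng.normal(size=(50, 8)) for _ in range(20)]
ys = rng.integers(0, 2, size=20)
print(evaluate_early_prediction(seqs, ys, lambda x: int(x.mean() > 0)))
```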

    Modeling Multi-turn Conversation with Deep Utterance Aggregation

    Multi-turn conversation understanding is a major challenge for building intelligent dialogue systems. This work focuses on retrieval-based response matching for multi-turn conversation, where related work simply concatenates the conversation utterances, ignoring the interactions among previous utterances when modeling context. In this paper, we formulate previous utterances into context using a proposed deep utterance aggregation model to form a fine-grained context representation. In detail, a self-matching attention is first introduced to route the vital information in each utterance. Then the model matches a response with each refined utterance, and the final matching score is obtained after attentive aggregation over turns. Experimental results show our model outperforms the state-of-the-art methods on three multi-turn conversation benchmarks, including a newly introduced e-commerce dialogue corpus.
    Comment: Proceedings of the 27th International Conference on Computational Linguistics (COLING 2018).
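    A rough numpy sketch of the matching flow described above, under simplifying assumptions (random bag-of-vector utterance encodings and plain dot-product attention); it is not the authors' architecture, only the shape of the computation: self-matching attention refines each utterance, each refined utterance is matched against the response, and the per-turn scores are aggregated attentively.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_match(utt):
    """Self-matching attention: re-weight each token by its affinity to the rest of the utterance."""
    att = softmax(utt @ utt.T)          # (T, T) token-to-token affinities
    return att @ utt                    # refined utterance representation

def match_score(utt, response):
    """Match one refined utterance against the refined response (mean-pooled dot product)."""
    return float(self_match(utt).mean(0) @ self_match(response).mean(0))

def aggregate(context, response):
    """Attentive aggregation of per-turn matching scores into one ranking score."""
    scores = np.array([match_score(u, response) for u in context])
    return float(softmax(scores) @ scores)   # more relevant turns weigh more

rng = np.random.default_rng(0)
context = [rng.normal(size=(rng.integers(5, 12), 16)) for _ in range(4)]  # 4 previous turns
response = rng.normal(size=(8, 16))                                       # candidate response
print(aggregate(context, response))
```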

    Learning Models for Actions and Person-Object Interactions with Transfer to Question Answering

    This paper proposes deep convolutional network models that utilize local and global context to make human activity label predictions in still images, achieving state-of-the-art performance on two recent datasets with hundreds of labels each. We use multiple instance learning to handle the lack of supervision on the level of individual person instances, and a weighted loss to handle unbalanced training data. Further, we show how specialized features trained on these datasets can be used to improve accuracy on the Visual Question Answering (VQA) task, in the form of multiple-choice fill-in-the-blank questions (Visual Madlibs). Specifically, we tackle two types of questions, on person activity and person-object relationships, and show improvements over generic features trained on the ImageNet classification task.
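    The two training ingredients named above can be sketched compactly. The PyTorch snippet below shows one plausible reading: image-level logits are obtained by max-pooling over per-person instance logits (multiple instance learning), and the classification loss is weighted by inverse class frequency to counter label imbalance. The pooling and weighting choices are assumptions, not necessarily those of the paper.

```python
import torch
import torch.nn.functional as F

# Multiple instance learning over person crops: the image-level label supervises the
# best-scoring instance, since per-person labels are unavailable.
def mil_image_logits(instance_logits):
    """instance_logits: (num_persons, num_classes) -> (num_classes,) image-level logits."""
    return instance_logits.max(dim=0).values

# Class-weighted loss to counter a long-tailed activity label distribution.
def weighted_loss(image_logits, target, class_counts):
    weights = 1.0 / class_counts.clamp(min=1).float()      # inverse-frequency weighting (one simple choice)
    weights = weights / weights.sum() * len(class_counts)  # normalize so weights average to 1
    return F.cross_entropy(image_logits.unsqueeze(0), target.unsqueeze(0), weight=weights)

logits = torch.randn(3, 5)                  # 3 detected persons, 5 activity classes (toy values)
target = torch.tensor(2)                    # image-level activity label
counts = torch.tensor([500, 50, 10, 300, 5])
print(weighted_loss(mil_image_logits(logits), target, counts))
```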

    FusionNet: Fusing via Fully-Aware Attention with Application to Machine Comprehension

    This paper introduces a new neural structure called FusionNet, which extends existing attention approaches from three perspectives. First, it puts forward a novel concept of "history of word" to characterize attention information from the lowest word-level embedding up to the highest semantic-level representation. Second, it introduces an improved attention scoring function that better utilizes the "history of word" concept. Third, it proposes a fully-aware multi-level attention mechanism to capture the complete information in one text (such as a question) and exploit it in its counterpart (such as the context or passage) layer by layer. We apply FusionNet to the Stanford Question Answering Dataset (SQuAD), where it achieves the first position for both the single and ensemble models on the official SQuAD leaderboard at the time of writing (Oct. 4th, 2017). Meanwhile, we verify the generalization of FusionNet with two adversarial SQuAD datasets and set a new state of the art on both: on AddSent, FusionNet increases the best F1 metric from 46.6% to 51.4%; on AddOneSent, FusionNet boosts the best F1 metric from 56.0% to 60.7%.
    Comment: Published in the Sixth International Conference on Learning Representations (ICLR), 2018.
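    A toy numpy illustration of the "history of word" idea: each word's representation is the concatenation of every level computed so far (word embedding, low-level and high-level encodings), and attention between context and question words is scored over that full history rather than only the top layer. The bilinear scoring matrix here is a random stand-in, not FusionNet's actual fully-aware scoring function.

```python
import numpy as np

rng = np.random.default_rng(0)

def history_of_word(levels):
    """Concatenate every level of representation for each word: (T, d1 + d2 + ...)."""
    return np.concatenate(levels, axis=-1)

# Two context words and three question words, each with three levels of representation
# (word embedding, low-level encoding, high-level encoding) of sizes 4, 6, 6 (toy sizes).
ctx_how = history_of_word([rng.normal(size=(2, d)) for d in (4, 6, 6)])   # (2, 16)
q_how   = history_of_word([rng.normal(size=(3, d)) for d in (4, 6, 6)])   # (3, 16)

# Fully-aware attention: score every context/question pair on the full history of word.
U = rng.normal(size=(16, 16)) * 0.1          # learned projection (random stand-in here)
scores = ctx_how @ U @ q_how.T               # (2, 3) attention scores
weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
fused_q_for_ctx = weights @ q_how            # question information fused into each context word
print(fused_q_for_ctx.shape)                 # (2, 16)
```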

    VD-BERT: A Unified Vision and Dialog Transformer with BERT

    Visual dialog is a challenging vision-language task, where a dialog agent needs to answer a series of questions through reasoning on the image content and dialog history. Prior work has mostly focused on various attention mechanisms to model such intricate interactions. By contrast, in this work, we propose VD-BERT, a simple yet effective framework of a unified vision-dialog Transformer that leverages pretrained BERT language models for Visual Dialog tasks. The model is unified in that (1) it captures all the interactions between the image and the multi-turn dialog using a single-stream Transformer encoder, and (2) it supports both answer ranking and answer generation seamlessly through the same architecture. More crucially, we adapt BERT for the effective fusion of vision and dialog contents via visually grounded training. Without the need for pretraining on external vision-language data, our model yields a new state of the art, achieving the top position in both single-model and ensemble settings (74.54 and 75.35 NDCG scores) on the visual dialog leaderboard. Our code and pretrained models are released at https://github.com/salesforce/VD-BERT.
    Comment: EMNLP 2020 (14 pages).
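    A purely illustrative sketch of the single-stream idea: image region placeholders and the multi-turn dialog are packed into one token sequence with segment ids, so that a single Transformer encoder can attend across both modalities at once. The token names, helper function, and packing order are assumptions for illustration, not VD-BERT's actual preprocessing.

```python
# Hypothetical helper; the real tokenization and visual feature extraction differ.
CLS, SEP = "[CLS]", "[SEP]"

def build_single_stream_input(num_regions, dialog_turns, candidate_answer):
    """Pack image regions and the multi-turn dialog into one token sequence.

    num_regions      : number of detected image region features (placeholders here)
    dialog_turns     : list of (question, answer) strings for previous turns
    candidate_answer : answer string to be ranked for the current question
    """
    tokens, segments = [CLS], [0]
    tokens += [f"<region_{i}>" for i in range(num_regions)] + [SEP]   # visual stream, segment 0
    segments += [0] * (num_regions + 1)
    for q, a in dialog_turns:                                         # textual stream, segment 1
        tokens += q.split() + a.split()
        segments += [1] * (len(q.split()) + len(a.split()))
    tokens += candidate_answer.split() + [SEP]
    segments += [1] * (len(candidate_answer.split()) + 1)
    return tokens, segments

toks, segs = build_single_stream_input(
    num_regions=3,
    dialog_turns=[("is there a dog ?", "yes a brown one"), ("is it running ?", "")],
    candidate_answer="no it is sitting",
)
print(list(zip(toks, segs))[:8])
```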

    Machine Learning for Integrating Data in Biology and Medicine: Principles, Practice, and Opportunities

    New technologies have enabled the investigation of biology and human health at an unprecedented scale and in multiple dimensions. These dimensions include a myriad of properties describing the genome, epigenome, transcriptome, microbiome, phenotype, and lifestyle. No single data type, however, can capture the complexity of all the factors relevant to understanding a phenomenon such as a disease. Integrative methods that combine data from multiple technologies have thus emerged as critical statistical and computational approaches. The key challenge in developing such approaches is the identification of effective models that provide a comprehensive and relevant systems view. An ideal method can answer a biological or medical question, identifying important features and predicting outcomes, by harnessing heterogeneous data across several dimensions of biological variation. In this Review, we describe the principles of data integration and discuss current methods and available implementations. We provide examples of successful data integration in biology and medicine. Finally, we discuss current challenges in biomedical integrative methods and our perspective on the future development of the field.
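    As a concrete, much simplified illustration of two integration strategies commonly contrasted in such reviews, the sketch below compares early integration (concatenating feature matrices before fitting one model) with late integration (combining per-modality predictions); the modality names, shapes, and synthetic data are invented for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 200
genomics = rng.normal(size=(n, 50))     # e.g. expression-like features (synthetic)
clinical = rng.normal(size=(n, 10))     # e.g. clinical covariates (synthetic)
y = (genomics[:, 0] + clinical[:, 0] + rng.normal(size=n) > 0).astype(int)

# Early integration: concatenate modalities, then fit a single model.
early = LogisticRegression(max_iter=1000).fit(np.hstack([genomics, clinical]), y)

# Late integration: fit one model per modality, then average predicted probabilities.
m_gen = LogisticRegression(max_iter=1000).fit(genomics, y)
m_cli = LogisticRegression(max_iter=1000).fit(clinical, y)
late_prob = (m_gen.predict_proba(genomics)[:, 1] + m_cli.predict_proba(clinical)[:, 1]) / 2

print("early accuracy:", early.score(np.hstack([genomics, clinical]), y))
print("late accuracy: ", ((late_prob > 0.5).astype(int) == y).mean())
```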

    Putting Question-Answering Systems into Practice: Transfer Learning for Efficient Domain Customization

    Traditional information retrieval (such as that offered by web search engines) impedes users with information overload from extensive result pages and the need to manually locate the desired information therein. Conversely, question-answering systems change how humans interact with information systems: users can now ask specific questions and obtain a tailored answer, both conveniently in natural language. Despite obvious benefits, their use is often limited to an academic context, largely because of expensive domain customizations, which means that the performance in domain-specific applications often fails to meet expectations. This paper proposes cost-efficient remedies: (i) we leverage metadata through a filtering mechanism, which increases the precision of document retrieval, and (ii) we develop a novel fuse-and-oversample approach for transfer learning in order to improve the performance of answer extraction. Here knowledge is inductively transferred from a related, yet different, task to the domain-specific application, while accounting for potential differences in the sample sizes across both tasks. The resulting performance is demonstrated with actual use cases from a finance company and the film industry, where fewer than 400 question-answer pairs had to be annotated in order to yield significant performance gains. As a direct implication for management, this presents a promising path toward better leveraging the knowledge stored in information systems.
    Comment: Accepted by ACM TMI
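    The fuse-and-oversample idea can be read schematically as follows: the small set of in-domain question-answer pairs is oversampled so that it is not drowned out when fused with the much larger source-domain data, and the model is then trained on the combined set. The sketch below is a hedged interpretation of the abstract; the `target_share` knob and the sampling scheme are assumptions, not the authors' exact procedure.

```python
import random

def fuse_and_oversample(source_pairs, target_pairs, target_share=0.5, seed=0):
    """Fuse source- and target-domain QA pairs, oversampling the target domain.

    target_share: desired fraction of target-domain examples in the fused set
                  (an assumed knob; the paper's balancing scheme may differ).
    """
    rng = random.Random(seed)
    n_target = int(len(source_pairs) * target_share / (1 - target_share))
    oversampled = [rng.choice(target_pairs) for _ in range(n_target)]   # sample with replacement
    fused = list(source_pairs) + oversampled
    rng.shuffle(fused)
    return fused

source = [("q_src_%d" % i, "a_src_%d" % i) for i in range(10000)]   # large open-domain QA set
target = [("q_dom_%d" % i, "a_dom_%d" % i) for i in range(400)]     # few annotated in-domain pairs
train_set = fuse_and_oversample(source, target)
print(len(train_set), sum(p[0].startswith("q_dom") for p in train_set))
```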

    Weakly Supervised Representation Learning for Unsynchronized Audio-Visual Events

    Audio-visual representation learning is an important task from the perspective of designing machines with the ability to understand complex events. To this end, we propose a novel multimodal framework that instantiates multiple instance learning. We show that the learnt representations are useful for classifying events and localizing their characteristic audio-visual elements. The system is trained using only video-level event labels, without any timing information. An important feature of our method is its capacity to learn from unsynchronized audio-visual events. We achieve state-of-the-art results on a large-scale dataset of weakly labeled audio event videos. Visualizations of localized visual regions and audio segments substantiate our system's efficacy, especially when dealing with noisy situations where modality-specific cues appear asynchronously.
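    A bare-bones sketch of the multiple-instance-learning formulation the abstract suggests: per-segment scores from each modality are pooled into a single video-level prediction that is trained against the video-level event label alone, which is also what allows the most responsive audio and visual segments to be localized afterwards. The pooling and fusion choices below are assumptions, not the paper's exact design.

```python
import torch
import torch.nn.functional as F

def video_level_logits(audio_seg_logits, visual_seg_logits):
    """Pool per-segment logits (S_a, C) and (S_v, C) into video-level logits (C,)."""
    pooled_audio = audio_seg_logits.max(dim=0).values    # most responsive audio segment per class
    pooled_visual = visual_seg_logits.max(dim=0).values  # most responsive visual segment per class
    return (pooled_audio + pooled_visual) / 2            # simple late fusion of the two streams

audio = torch.randn(10, 4, requires_grad=True)   # 10 audio segments, 4 event classes (toy values)
visual = torch.randn(25, 4, requires_grad=True)  # 25 visual segments
video_label = torch.tensor([0., 1., 0., 0.])     # weak, video-level multi-label target
loss = F.binary_cross_entropy_with_logits(video_level_logits(audio, visual), video_label)
loss.backward()                                   # gradients flow only from the video-level label
print(float(loss))
```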

    Reproducibility Evaluation of SLANT Whole Brain Segmentation Across Clinical Magnetic Resonance Imaging Protocols

    Whole brain segmentation on structural magnetic resonance imaging (MRI) is essential for understanding neuroanatomical-functional relationships. Traditionally, multi-atlas segmentation has been regarded as the standard method for whole brain segmentation. In the past few years, deep convolutional neural network (DCNN) segmentation methods have demonstrated their advantages in both accuracy and computational efficiency. Recently, we proposed the spatially localized atlas network tiles (SLANT) method, which is able to segment a 3D MRI brain scan into 132 anatomical regions. Commonly, DCNN segmentation methods yield inferior performance under external validation, especially when the testing patterns were not present in the training cohorts. Recently, we obtained a clinically acquired, multi-sequence MRI brain cohort of 1480 de-identified brain MRI scans from 395 patients, acquired using seven different MRI protocols. Moreover, each subject has at least two scans from different MRI protocols. Herein, we assess the SLANT method's intra- and inter-protocol reproducibility. SLANT achieved a coefficient of variation (CV) of less than 0.05 for intra-protocol experiments and less than 0.15 for inter-protocol experiments. The results show that the SLANT method achieved high intra- and inter-protocol reproducibility.
    Comment: To appear in SPIE Medical Imaging 201
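    For context on the reported numbers, the coefficient of variation of a regional volume across a subject's repeated scans is simply the standard deviation divided by the mean. Below is a small sketch of how a per-region intra-protocol CV could be computed; the array layout and toy volumes are assumptions, not the study's actual data handling.

```python
import numpy as np

def coefficient_of_variation(volumes):
    """volumes: (num_repeat_scans, num_regions) segmented volumes for one subject.

    Returns the per-region CV = std / mean across the repeated scans.
    """
    volumes = np.asarray(volumes, dtype=float)
    return volumes.std(axis=0, ddof=1) / volumes.mean(axis=0)

# Two scans of the same subject under the same protocol, 132 regions (toy values).
rng = np.random.default_rng(0)
base = rng.uniform(500, 5000, size=132)                                  # nominal regional volumes
scans = np.stack([base * rng.normal(1.0, 0.02, size=132) for _ in range(2)])
cv = coefficient_of_variation(scans)
print(cv.mean() < 0.05)   # analogous to the "CV below 0.05" intra-protocol result
```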

    Learning to Measure Change: Fully Convolutional Siamese Metric Networks for Scene Change Detection

    A critical challenge in scene change detection is that noisy changes generated by varying illumination, shadows, and camera viewpoint make the variations of a scene difficult to define and measure, since the noisy changes and the semantic ones are entangled. Following the intuitive idea of detecting changes by directly comparing the dissimilarity between a pair of features, we propose a novel fully Convolutional siamese metric Network (CosimNet) to measure changes by customizing implicit metrics. To learn more discriminative metrics, we utilize a contrastive loss to reduce the distance between unchanged feature pairs and to enlarge the distance between changed feature pairs. Specifically, to address the issue of large viewpoint differences, we propose the Thresholded Contrastive Loss (TCL), which uses a more tolerant strategy to penalize noisy changes. We demonstrate the effectiveness of the proposed approach with experiments on three challenging datasets: CDnet, PCD2015, and VL-CMU-CD. Our approach is robust to many challenging conditions, such as illumination changes and large viewpoint differences caused by camera motion and zooming. In addition, we incorporate the distance metric into the segmentation framework and validate its effectiveness through visualizations of change maps and feature distributions. The source code is available at https://github.com/gmayday1997/ChangeDet.
    Comment: 10 pages, 12 figures.
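    The distance-metric idea maps naturally onto a contrastive objective. Below is a hedged PyTorch sketch in which unchanged pixel-pair distances are penalized only above a tolerance threshold (the "more tolerant" behavior the abstract attributes to TCL), while changed pairs are pushed beyond a margin; the exact loss form and hyperparameters in the paper may differ.

```python
import torch
import torch.nn.functional as F

def thresholded_contrastive_loss(feat_t0, feat_t1, changed, margin=2.0, tolerance=0.3):
    """feat_t0, feat_t1: (N, D) per-location features from the two time steps.
    changed: (N,) binary mask, 1 where the location truly changed.
    """
    d = F.pairwise_distance(feat_t0, feat_t1)                    # implicit change metric
    # Unchanged pairs: only pull together when the distance exceeds the tolerance,
    # so noisy changes (illumination, small viewpoint shifts) are not over-penalized.
    loss_unchanged = (1 - changed) * F.relu(d - tolerance) ** 2
    # Changed pairs: push apart until they are at least `margin` away.
    loss_changed = changed * F.relu(margin - d) ** 2
    return (loss_unchanged + loss_changed).mean()

f0, f1 = torch.randn(8, 16), torch.randn(8, 16)    # toy features for 8 locations
changed = torch.randint(0, 2, (8,)).float()        # toy change labels
print(thresholded_contrastive_loss(f0, f1, changed))
```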