
    Visual context for verb sense disambiguation and multilingual representation learning

    Every day, billions of images are uploaded to the web. Processing images at such a scale requires automatic image understanding systems. An important step towards understanding the content of an image is recognizing the objects, scenes, and actions it depicts. Such systems should also integrate with natural language, so that humans can query and interact with them for tasks such as image retrieval. Verbs play a key role in the understanding of sentences and scenes: they express the semantics of actions as well as the interactions between objects participating in an event. Understanding verbs is therefore central to both language and image understanding. However, verbs are known for the variability of their meaning with context. Many studies in psychology have shown that contextual information plays an important role in semantic understanding and processing in the human visual system. We use this as intuition and study the role of textual and visual context in tasks that combine language and vision. The research presented in this thesis focuses on integrating visual and textual context for: (i) automatically identifying verbs that denote actions depicted in images; (ii) fine-grained analysis of how visual context can help disambiguate different meanings of verbs within a language or across languages; (iii) the role played by visual and multilingual context in learning representations that allow us to query information across modalities and languages.
    First, we propose the task of visual sense disambiguation, an alternative way of addressing action recognition. Instead of identifying actions directly, we develop a two-step process: identifying the verb that denotes the action depicted in an image, and then disambiguating the meaning of that verb based on the visual and textual context associated with the image. We first build an image-verb classifier from the weak signal in image description data and analyse the image regions the model focuses on while predicting the verb. We then disambiguate the meaning of the verb shown in the image using image features and sense inventories, testing the hypothesis that the visual and textual context associated with the image contribute to the disambiguation task.
    Second, we ask whether the predictions made by such models correspond to human intuitions about visual verbs or actions. We analyse whether the image regions a verb prediction model identifies as salient for a given verb correlate with the regions fixated by human observers performing an action classification task. We also compare the correlation of human fixations against visual saliency and center bias models.
    Third, we propose the crosslingual verb disambiguation task: identifying the correct translation of a verb in a target language based on visual context. This task has the potential to resolve lexical ambiguity in machine translation when visual context is available. We propose a series of models and show that multimodal models that fuse textual information with visual features have an edge over text-only or visual-only models. We then demonstrate how visual sense disambiguation can be combined with lexical constraint decoding to improve the performance of a standard unimodal machine translation system on image descriptions.
    Finally, we learn joint representations for images and text in multiple languages, testing the hypothesis that context provided as visual information or as text in another language contributes to better representation learning. We propose models that map text from multiple languages and images into a common space, and evaluate the usefulness of the second language in multimodal search and of the image in crosslingual search. Our experiments suggest that exploiting multilingual and multimodal resources helps in learning better semantic representations that are useful for various multimodal natural language understanding tasks. Our experiments on visual sense disambiguation, sense disambiguation across languages, and multimodal and crosslingual search demonstrate that visual context, alone or combined with textual context, is useful for enhancing multimodal and crosslingual applications.
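
    As a concrete illustration of the disambiguation step described above, the following is a minimal Python/PyTorch sketch, not the thesis's actual model: image features and textual-context features are fused into a query vector and scored against an inventory of sense embeddings, with the highest-scoring sense taken as the prediction. The class name, layer sizes, and feature dimensions are all illustrative assumptions.

        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        class MultimodalSenseDisambiguator(nn.Module):
            """Hypothetical sketch: score verb senses by fusing image and
            textual-context features (all dimensions are assumptions)."""

            def __init__(self, img_dim=2048, txt_dim=300, sense_dim=300, hidden=512):
                super().__init__()
                self.fuse = nn.Sequential(
                    nn.Linear(img_dim + txt_dim, hidden),
                    nn.ReLU(),
                    nn.Linear(hidden, sense_dim),
                )

            def forward(self, img_feat, txt_feat, sense_embs):
                # img_feat: (B, img_dim); txt_feat: (B, txt_dim)
                # sense_embs: (S, sense_dim), one row per sense in the inventory
                query = self.fuse(torch.cat([img_feat, txt_feat], dim=-1))
                query = F.normalize(query, dim=-1)
                senses = F.normalize(sense_embs, dim=-1)
                return query @ senses.t()  # cosine scores (B, S); argmax = predicted sense

    Ablating either input (e.g., zeroing img_feat or txt_feat) yields the text-only and image-only baselines that multimodal models of this kind are compared against.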

    Lidar-based Obstacle Detection and Recognition for Autonomous Agricultural Vehicles

    Today, agricultural vehicles are available that can drive autonomously and follow exact route plans more precisely than human operators. Combined with advancements in precision agriculture, autonomous agricultural robots can reduce manual labor, improve workflow, and optimize yield. However, as of today, human operators are still required to monitor the environment and act upon potential obstacles in front of the vehicle. To eliminate this need, safety must be ensured by accurate and reliable obstacle detection and avoidance systems.
    In this thesis, lidar-based obstacle detection and recognition in agricultural environments has been investigated. A rotating multi-beam lidar generating 3D point clouds was used for point-wise classification of agricultural scenes, while multi-modal fusion with cameras and radar was used to increase performance and robustness. Two research perception platforms were presented and used for data acquisition. The proposed methods were all evaluated on recorded datasets that represented a wide range of realistic agricultural environments and included both static and dynamic obstacles.
    For 3D point cloud classification, two methods were proposed for handling density variations during feature extraction. One method outperformed a frequently used generic 3D feature descriptor, whereas the other showed promising preliminary results using deep learning on 2D range images. For multi-modal fusion, four methods were proposed for combining lidar with color camera, thermal camera, and radar. Gradual improvements in classification accuracy were seen as spatial, temporal, and multi-modal relationships were introduced in the models. Finally, occupancy grid mapping was used to fuse and map detections globally, and runtime obstacle detection was applied to mapped detections along the vehicle path, thus simulating an actual traversal.
    The proposed methods serve as a first step towards full autonomy for agricultural vehicles. The study has thus shown that recent advancements in autonomous driving can be transferred to the agricultural domain when accurate distinctions are made between obstacles and processable vegetation. Future research in the domain has further been facilitated by the release of the multi-modal obstacle dataset, FieldSAFE.
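
    As a rough sketch of one component mentioned above, the function below projects a rotating multi-beam lidar point cloud onto the kind of 2D range image used for deep-learning-based classification. It is a generic spherical projection in Python/NumPy; the vertical field of view, image resolution, and function name are assumptions, not parameters taken from the thesis or the FieldSAFE dataset.

        import numpy as np

        def pointcloud_to_range_image(points, h=64, w=1024, fov_up=15.0, fov_down=-15.0):
            """Project an (N, 3) point cloud to an (h, w) range image.
            Sketch only: sensor geometry (fov_up/fov_down, h, w) is assumed."""
            x, y, z = points[:, 0], points[:, 1], points[:, 2]
            r = np.linalg.norm(points, axis=1)
            yaw = np.arctan2(y, x)  # azimuth angle in [-pi, pi]
            pitch = np.arcsin(np.clip(z / np.maximum(r, 1e-8), -1.0, 1.0))
            up, down = np.radians(fov_up), np.radians(fov_down)
            u = ((1.0 - (yaw + np.pi) / (2 * np.pi)) * w).astype(int) % w
            v = np.clip(((up - pitch) / (up - down) * h).astype(int), 0, h - 1)
            img = np.zeros((h, w), dtype=np.float32)
            order = np.argsort(-r)  # write far-to-near so the nearest return wins
            img[v[order], u[order]] = r[order]
            return img

    Per-pixel class predictions made on such an image can then be mapped back to the originating 3D points, recovering the point-wise labels used for obstacle detection.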

    Behavior quantification as the missing link between fields: Tools for digital psychiatry and their role in the future of neurobiology

    The great behavioral heterogeneity observed between individuals with the same psychiatric disorder, and even within one individual over time, complicates both clinical practice and biomedical research. However, modern technologies present an exciting opportunity to improve behavioral characterization. Data from existing psychiatry methods that are qualitative or unscalable, such as patient surveys or clinical interviews, can now be collected at greater capacity and analyzed to produce new quantitative measures. Furthermore, recent capabilities for continuous collection of passive sensor streams, such as phone GPS or smartwatch accelerometer data, open avenues of novel questioning that were previously entirely unrealistic. Their temporally dense nature enables a cohesive study of real-time neural and behavioral signals. To develop comprehensive neurobiological models of psychiatric disease, it will be critical to first develop strong methods for behavioral quantification. There is huge potential in what can theoretically be captured by current technologies, but this in itself presents a large computational challenge -- one that will necessitate new data processing tools, new machine learning techniques, and ultimately a shift in how interdisciplinary work is conducted. In my thesis, I detail research projects that take different perspectives on digital psychiatry, subsequently tying the ideas together with a concluding discussion on the future of the field. I also provide software infrastructure where relevant, with extensive documentation. Major contributions include scientific arguments and proof-of-concept results for daily free-form audio journals as an underappreciated psychiatry research datatype, as well as novel stability theorems and pilot empirical success for a proposed multi-area recurrent neural network architecture.
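
    To make the idea of behavioral quantification from passive sensor streams concrete, here is a minimal, hypothetical Python/pandas sketch, not the thesis's pipeline, that reduces a raw smartwatch accelerometer stream to simple daily activity features. The column names, the gravity handling, and the movement threshold are all illustrative assumptions.

        import numpy as np
        import pandas as pd

        def daily_activity_features(df, threshold=0.5):
            """Hypothetical sketch: summarize an accelerometer stream with
            columns ['t', 'x', 'y', 'z'] (m/s^2) into daily features."""
            df = df.copy()
            df["t"] = pd.to_datetime(df["t"])
            # Acceleration magnitude minus gravity roughly isolates movement.
            mag = (np.sqrt(df["x"]**2 + df["y"]**2 + df["z"]**2) - 9.81).abs()
            daily = mag.set_axis(df["t"]).resample("1D")
            return pd.DataFrame({
                "mean_intensity": daily.mean(),   # average movement level
                "peak_intensity": daily.max(),    # most vigorous burst
                "active_fraction": daily.apply(lambda s: (s > threshold).mean()),
            })

    Daily features of this kind can then be aligned with clinical scales or other data streams collected over the same window.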

    Proceedings of the 10th international conference on disability, virtual reality and associated technologies (ICDVRAT 2014)

    Get PDF
    The proceedings of the conference.