    Learning visual attributes from contextual explanations

    In computer vision, attributes are mid-level concepts shared across categories. They provide a natural means of communication between humans and machines for image retrieval, they convey detailed information about objects, and they can describe properties of unfamiliar objects. These are very appealing properties, but learning attributes is a challenging task. Because attributes are less well-defined than object categories, capturing them with computational models poses a different set of challenges. There is also a risk of miscommunication between humans and machines, since a machine may not understand what a human has in mind when referring to a particular attribute. Humans usually only label whether an object or attribute is present, without any explanation; attributes, however, are more complex and may require explanations to be understood properly. This Ph.D. thesis aims to tackle these challenges in learning automatic attribute prediction models. In particular, it focuses on enhancing attribute predictive power with contextual explanations. These explanations aim to enhance data quality with human knowledge, which can be expressed in the form of interactions and may be affected by our personality. First, we emulate the human ability to make sense of unfamiliar situations: humans try to infer properties from what they already know (background knowledge). Hence, we study attribute learning in data-scarce and unrelated domains, emulating human understanding skills, and discover transferable knowledge for learning attributes across different domains. This first project inspires us to request contextual explanations to improve attribute learning. We therefore enhance attribute learning with context in the form of gaze, captioning, and sketches. Human gaze captures subconscious intuition and associates particular image regions with the meaning of an attribute; for example, gaze links the tip of a shoe to the attribute pointy. To complement this gaze representation, captioning reflects conscious reasoning after prior analysis: an annotator may inspect an image and provide a description such as “This shoe is pointy because of the sharp form at its tip”. Finally, in image search, sketches provide a holistic view of an image query, complementing the specific details encapsulated by attribute comparisons. Overall, our methods with contextual explanations outperform many baselines in both quantitative and qualitative evaluations.
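    As a rough illustration of the gaze-as-context idea, the hypothetical sketch below fuses a CNN image descriptor with a coarse gaze histogram before training a binary attribute classifier; all names, shapes, and the random data are illustrative stand-ins, not the thesis pipeline.

```python
# Minimal sketch: fusing image features with a gaze-derived context
# descriptor for binary attribute prediction (e.g. "pointy").
# Feature extraction is assumed to happen elsewhere; all array shapes
# and helper names below are illustrative, not from the thesis.
import numpy as np
from sklearn.linear_model import LogisticRegression

def gaze_histogram(fixations, grid=(4, 4)):
    """Pool fixation points (x, y in [0, 1]) into a coarse spatial histogram."""
    hist, _, _ = np.histogram2d(
        fixations[:, 0], fixations[:, 1],
        bins=grid, range=[[0, 1], [0, 1]])
    return (hist / max(hist.sum(), 1)).ravel()

rng = np.random.default_rng(0)
n = 200
image_feats = rng.normal(size=(n, 128))            # e.g. CNN descriptors
gaze_feats = np.stack([gaze_histogram(rng.random((30, 2))) for _ in range(n)])
labels = rng.integers(0, 2, size=n)                # attribute present / absent

# Late fusion by concatenation: the gaze histogram acts as the
# "contextual explanation" that complements the raw image descriptor.
fused = np.hstack([image_feats, gaze_feats])
clf = LogisticRegression(max_iter=1000).fit(fused, labels)
print("train accuracy:", clf.score(fused, labels))
```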

    Gaze Embeddings for Zero-Shot Image Classification

    Zero-shot image classification using auxiliary information, such as attributes describing discriminative object properties, requires time-consuming annotation by domain experts. We instead propose a method that relies on human gaze as auxiliary information, exploiting the fact that even non-expert users have a natural ability to judge class membership. We present a data collection paradigm that involves a discrimination task to increase the information content obtained from gaze data. Our method extracts discriminative descriptors from the data and learns a compatibility function between image and gaze using three novel gaze embeddings: Gaze Histograms (GH), Gaze Features with Grid (GFG), and Gaze Features with Sequence (GFS). We introduce two new gaze-annotated datasets for fine-grained image classification and show that human gaze data is indeed class discriminative, provides a competitive alternative to expert-annotated attributes, and outperforms other baselines for zero-shot image classification.
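    The sketch below illustrates one plausible form of such a compatibility function: a bilinear score F(x, c) = xᵀ W g_c between an image feature x and a per-class gaze embedding g_c, trained with a simple margin ranking rule. The random features stand in for the paper's GH/GFG/GFS embeddings; this is an assumption-laden toy, not the authors' implementation.

```python
# Hedged sketch of a bilinear compatibility function F(x, c) = x^T W g_c
# between an image feature x and a per-class gaze embedding g_c, trained
# with a margin ranking objective. The gaze embeddings here are random
# stand-ins for the paper's GH/GFG/GFS descriptors.
import numpy as np

rng = np.random.default_rng(1)
d_img, d_gaze, n_classes, n_train = 64, 16, 10, 500

X = rng.normal(size=(n_train, d_img))                  # image features
y = rng.integers(0, n_classes, size=n_train)           # class labels
G = rng.normal(size=(n_classes, d_gaze))               # class gaze embeddings

W = np.zeros((d_img, d_gaze))
lr, margin = 0.01, 1.0

for epoch in range(5):
    for xi, yi in zip(X, y):
        scores = xi @ W @ G.T                          # compatibility per class
        # Find the most violating class under the margin
        scores_with_margin = scores + margin
        scores_with_margin[yi] -= margin
        j = int(np.argmax(scores_with_margin))
        if j != yi:
            # SGD step: raise F(x, true class), lower F(x, violating class)
            W += lr * np.outer(xi, G[yi] - G[j])

# Zero-shot style prediction: pick the class whose gaze embedding is most
# compatible with the test image feature.
x_test = rng.normal(size=d_img)
pred = int(np.argmax(x_test @ W @ G.T))
print("predicted class:", pred)
```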

    Visual Decoding of Targets During Visual Search From Human Eye Fixations

    What does human gaze reveal about a user's intents, and to what extent can these intents be inferred or even visualized? Gaze has been proposed as an implicit source of information to predict the target of visual search and, more recently, to predict the object class and attributes of the search target. In this work, we go one step further and investigate the feasibility of combining recent advances in encoding human gaze information using deep convolutional neural networks with the power of generative image models to visually decode, i.e. create a visual representation of, the search target. Such visual decoding is challenging for two reasons: 1) the search target only resides in the user's mind as a subjective visual pattern, and can most often not even be described verbally by the person, and 2) it is, as of yet, unclear whether gaze fixations contain sufficient information for this task at all. We show, for the first time, that visual representations of search targets can indeed be decoded from human gaze fixations alone. We propose to first encode fixations into a semantic representation and then decode this representation into an image. We evaluate our method on a recent gaze dataset of 14 participants searching for clothing in image collages and validate the model's predictions using two human studies. Our results show that participants selected the correct category of the decoded image 62% of the time (chance level: 10%). In our second study, we show the importance of a local gaze encoding for decoding users' visual search targets.
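    A minimal sketch of the encode-then-decode idea, assuming PyTorch: a small network maps a fixation sequence to a semantic vector, and a deconvolutional generator maps that vector to an image. All module names and sizes are hypothetical placeholders, not the paper's networks.

```python
# Rough sketch of the encode-then-decode idea: fixations are first mapped
# to a compact semantic vector, which a small deconvolutional generator
# turns into an image. Sizes and module names are placeholders.
import torch
import torch.nn as nn

class FixationEncoder(nn.Module):
    """Encode a fixed-length sequence of (x, y, duration) fixations."""
    def __init__(self, n_fix=20, dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_fix * 3, 128), nn.ReLU(),
            nn.Linear(128, dim))

    def forward(self, fixations):          # (B, n_fix, 3)
        return self.net(fixations.flatten(1))

class ImageDecoder(nn.Module):
    """Decode the semantic vector into a small RGB image."""
    def __init__(self, dim=64):
        super().__init__()
        self.fc = nn.Linear(dim, 128 * 4 * 4)
        self.deconv = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Tanh())

    def forward(self, z):
        return self.deconv(self.fc(z).view(-1, 128, 4, 4))   # (B, 3, 32, 32)

fixations = torch.randn(8, 20, 3)          # a batch of fixation sequences
decoded = ImageDecoder()(FixationEncoder()(fixations))
print(decoded.shape)                       # torch.Size([8, 3, 32, 32])
```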

    Going Deeper into First-Person Activity Recognition

    We bring together ideas from recent work on feature design for egocentric action recognition under one framework by exploring the use of deep convolutional neural networks (CNNs). Recent work has shown that features such as hand appearance, object attributes, local hand motion and camera ego-motion are important for characterizing first-person actions. To integrate these ideas under one framework, we propose a twin stream network architecture, where one stream analyzes appearance information and the other stream analyzes motion information. Our appearance stream encodes prior knowledge of the egocentric paradigm by explicitly training the network to segment hands and localize objects. By visualizing certain neuron activations of our network, we show that our proposed architecture naturally learns features that capture object attributes and hand-object configurations. Our extensive experiments on benchmark egocentric action datasets show that our deep architecture enables recognition rates that significantly outperform state-of-the-art techniques -- an average 6.6% increase in accuracy over all datasets. Furthermore, by learning to recognize objects, actions and activities jointly, the performance of the individual recognition tasks also increases, by 30% (actions) and 14% (objects). We also include the results of an extensive ablative analysis to highlight the importance of network design decisions.
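    The toy PyTorch sketch below shows the general shape of such a twin stream design: one small CNN over an RGB frame, one over stacked optical-flow channels, fused by concatenation before the action classifier. The layer sizes and fusion scheme are assumptions for illustration, not the paper's architecture.

```python
# Hedged sketch of a two-stream ("twin stream") design: one CNN over RGB
# appearance, one over stacked optical-flow frames, fused before the
# action classifier. Layer sizes and the fusion scheme are illustrative.
import torch
import torch.nn as nn

def small_cnn(in_channels):
    return nn.Sequential(
        nn.Conv2d(in_channels, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
        nn.Flatten())

class TwinStreamNet(nn.Module):
    def __init__(self, n_actions=20, flow_stack=10):
        super().__init__()
        self.appearance = small_cnn(3)                  # RGB frame
        self.motion = small_cnn(2 * flow_stack)         # stacked x/y flow
        self.classifier = nn.Linear(64 + 64, n_actions)

    def forward(self, rgb, flow):
        # Late fusion by concatenating the two stream descriptors
        fused = torch.cat([self.appearance(rgb), self.motion(flow)], dim=1)
        return self.classifier(fused)

net = TwinStreamNet()
logits = net(torch.randn(4, 3, 112, 112), torch.randn(4, 20, 112, 112))
print(logits.shape)                                     # torch.Size([4, 20])
```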

    Speech-Gesture Mapping and Engagement Evaluation in Human Robot Interaction

    A robot needs contextual awareness, effective speech production, and complementary non-verbal gestures for successful communication in society. In this paper, we present our end-to-end system that tries to enhance the effectiveness of non-verbal gestures. To achieve this, we identified prominently used gestures in performances by TED speakers, mapped them to their corresponding speech context, and modulated speech based upon the attention of the listener. The proposed method utilized a Convolutional Pose Machine [4] to detect human gestures. Dominant gestures of TED speakers were used for learning the gesture-to-speech mapping, and their speeches were used for training the model. We also evaluated the engagement of the robot with people by conducting a social survey. The effectiveness of the performance was monitored by the robot, which self-improvised its speech pattern on the basis of the attention level of the audience, calculated using visual feedback from the camera. The effectiveness of the interaction, as well as the decisions made during improvisation, was further evaluated based on head-pose detection and an interaction survey.
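    As a loose illustration of the attention-driven improvisation loop, the sketch below estimates audience attention from head-pose angles and picks a speech modulation; the thresholds, modulation values, and helper names are invented for illustration, and head-pose estimation is assumed to come from an external vision module.

```python
# Toy sketch of attention-driven speech improvisation: estimate audience
# attention from head poses and adjust the speech pattern accordingly.
# Thresholds and adjustments are made up; head-pose angles are assumed to
# come from an external pose detector running on camera frames.
def attention_level(head_poses, yaw_limit=30.0, pitch_limit=20.0):
    """Fraction of audience members roughly facing the robot."""
    attentive = [abs(yaw) < yaw_limit and abs(pitch) < pitch_limit
                 for yaw, pitch in head_poses]
    return sum(attentive) / max(len(attentive), 1)

def improvise_speech(level):
    """Pick a speech modulation based on the estimated attention level."""
    if level < 0.3:
        return {"rate": "slower", "pitch": "higher", "gesture": "broad"}
    if level < 0.7:
        return {"rate": "normal", "pitch": "normal", "gesture": "moderate"}
    return {"rate": "normal", "pitch": "normal", "gesture": "subtle"}

poses = [(5.0, 2.0), (45.0, 10.0), (-10.0, -5.0)]   # (yaw, pitch) in degrees
print(improvise_speech(attention_level(poses)))
```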

    Towards a human eye behavior model by applying Data Mining Techniques on Gaze Information from IEC

    In this paper, we first present what Interactive Evolutionary Computation (IEC) is and briefly describe how we have combined this artificial intelligence technique with an eye tracker for visual optimization. Next, in order to correctly parameterize our application, we present results from applying data mining techniques to gaze information collected in experiments conducted with about 80 human participants.
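    A minimal sketch of how gaze can drive an IEC loop, under the assumption that longer fixation on a candidate signals preference: fixation time acts as the fitness signal for selection and mutation. The gaze source is simulated here, and all function names are hypothetical.

```python
# Minimal sketch of Interactive Evolutionary Computation where the user's
# gaze replaces explicit ratings: candidates that collect more fixation
# time are treated as fitter. The gaze source is simulated; in the real
# setup it would come from the eye tracker.
import random

def random_candidate():
    return [random.random() for _ in range(5)]       # e.g. visual parameters

def mutate(candidate, sigma=0.1):
    return [min(1.0, max(0.0, g + random.gauss(0, sigma))) for g in candidate]

def gaze_fitness(population):
    """Stand-in for per-candidate fixation durations from the eye tracker."""
    return [random.random() for _ in population]

population = [random_candidate() for _ in range(8)]
for generation in range(10):
    fitness = gaze_fitness(population)               # longer look = fitter
    ranked = [c for _, c in sorted(zip(fitness, population),
                                   key=lambda fp: fp[0], reverse=True)]
    parents = ranked[:4]                             # keep the most-viewed half
    population = parents + [mutate(random.choice(parents)) for _ in range(4)]
print("best candidate:", population[0])
```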