Continually improving grounded natural language understanding through human-robot dialog
As robots become ubiquitous in homes and workplaces such as hospitals and factories, they must be able to communicate with humans. Several kinds of knowledge are required to understand and respond to a human's natural language commands and questions. If a person asks an assistant robot to "take me to Alice's office," the robot must know that Alice is a person who owns some unique office, and that "take me" means it should navigate there. Similarly, if a person requests "bring me the heavy, green mug," the robot must have accurate mental models of the physical concepts "heavy," "green," and "mug." To avoid forcing humans to use key phrases or words robots already know, this thesis focuses on helping robots understand new language constructs through interactions with humans and with the world around them.

To understand a command in natural language, a robot must first convert that command into an internal representation that it can reason with. Semantic parsing is a method for performing this conversion, and the target representation is often a semantic form expressed as predicate logic with lambda calculus. Traditional semantic parsing relies on hand-crafted resources from a human expert: an ontology of concepts, a lexicon connecting language to those concepts, and training examples of language paired with abstract meanings. One thrust of this thesis is to perform semantic parsing with sparse initial data. We use the conversations between a robot and human users to induce pairs of natural language utterances and the target semantic forms a robot discovers through its questions, reducing the annotation effort of creating training examples for parsing. We use this data to build more dialog-capable robots in new domains with much less expert human effort (Thomason et al., 2015; Padmakumar et al., 2017). Meanings of many language concepts are bound to the physical world.
Understanding object properties and categories such as "heavy," "green," and "mug" requires interacting with and perceiving the physical world. Embodied robots can use manipulation capabilities, such as pushing, picking up, and dropping objects, to gather sensory data about them. This data can be used to understand non-visual concepts like "heavy" and "empty" (e.g., "get the empty carton of milk from the fridge"), and to assist with concepts that have both visual and non-visual expression (e.g., tall things look big and also exert force sooner than short things when pressed down on). A second thrust of this thesis focuses on strategies for learning these concepts using multi-modal sensory information. We use human-in-the-loop learning to get labels between concept words and actual objects in the environment (Thomason et al., 2016, 2017). We also explore ways to tease out polysemy and synonymy in concept words (Thomason and Mooney, 2017), such as "light," which can refer to a weight or a color, the latter sense being synonymous with "pale." Additionally, pushing, picking up, and dropping objects to gather sensory information is prohibitively time-consuming, so we investigate strategies for using linguistic information and human input to expedite exploration when learning a new concept (Thomason et al., 2018).

Finally, we build an integrated agent with both parsing and perception capabilities that learns from conversations with users to improve both components over time. We demonstrate that parser learning from conversations (Thomason et al., 2015) can be combined with multi-modal perception (Thomason et al., 2016), using predicate-object labels gathered through opportunistic active learning (Thomason et al., 2017) during those conversations, to improve performance for understanding natural language commands from humans. Human users also qualitatively rate this integrated learning agent as more usable after it has improved through conversation-based learning.
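The lexicon-based semantic parsing described in this abstract can be illustrated with a toy sketch. This is not the thesis's actual parser; the lexicon entry, function names, and the target logical form are all invented for illustration, showing only how a hand-crafted lexicon might map a command onto a lambda-calculus-style semantic form.

```python
# Toy illustration of lexicon-driven semantic parsing: a hand-written lexicon
# maps a command pattern onto a logical semantic form. All names and forms
# here are hypothetical, not the thesis's implementation.

LEXICON = {
    # "take me to X's office" -> navigate to the unique office owned by X,
    # written with the definite-description operator ι ("the unique x").
    "take me to": lambda owner: (
        f"navigate(ιx.(office(x) ∧ owns({owner}, x)))"
    ),
}

def parse_command(command: str) -> str:
    """Return a predicate-logic semantic form for a narrow set of commands."""
    for phrase, build_form in LEXICON.items():
        if command.startswith(phrase):
            # Treat the word before the possessive as the owner's name,
            # e.g. "alice's office" -> "alice".
            remainder = command[len(phrase):].strip()
            owner = remainder.split("'")[0]
            return build_form(owner)
    raise ValueError(f"no lexical entry covers: {command!r}")

print(parse_command("take me to alice's office"))
# prints navigate(ιx.(office(x) ∧ owns(alice, x)))
```

The thesis's contribution is precisely to avoid hand-writing such entries, instead inducing utterance-to-form training pairs from clarification dialogs with users.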
STARSS23: An Audio-Visual Dataset of Spatial Recordings of Real Scenes with Spatiotemporal Annotations of Sound Events
While direction of arrival (DOA) of sound events is generally estimated from
multichannel audio data recorded in a microphone array, sound events usually
derive from visually perceptible source objects, e.g., sounds of footsteps come
from the feet of a walker. This paper proposes an audio-visual sound event
localization and detection (SELD) task, which uses multichannel audio and video
information to estimate the temporal activation and DOA of target sound events.
Audio-visual SELD systems can detect and localize sound events using signals
from a microphone array and audio-visual correspondence. We also introduce an
audio-visual dataset, Sony-TAu Realistic Spatial Soundscapes 2023 (STARSS23),
which consists of multichannel audio data recorded with a microphone array,
video data, and spatiotemporal annotation of sound events. Sound scenes in
STARSS23 are recorded under instructions that guide participants to
ensure adequate activity and occurrences of sound events. STARSS23 also provides
human-annotated temporal activation labels and human-confirmed DOA labels,
which are based on tracking results of a motion capture system. Our benchmark
results demonstrate the benefits of using visual object positions in
audio-visual SELD tasks. The data is available at
https://zenodo.org/record/7880637.
Comment: 27 pages, 9 figures, accepted for publication in the NeurIPS 2023 Track on Datasets and Benchmarks
Enabling Seamless Access to Digital Graphical Contents for Visually Impaired Individuals via Semantic-Aware Processing
Vision is one of the main sources through which people obtain information from the world, but unfortunately, visually impaired people are partially or completely deprived of this type of information. With the help of computer technologies, people with visual impairment can independently access digital textual information by using text-to-speech and text-to-Braille software. In general, however, there still exists a major barrier for people who are blind to access graphical information independently in real time without the help of sighted people. In this paper, we propose a novel multi-level and multi-modal approach aiming at addressing this challenging and practical problem, with the key idea being semantic-aware visual-to-tactile conversion through semantic image categorization and segmentation, and semantic-driven image simplification. An end-to-end prototype system was built based on the approach. We present the details of the approach and the system, report sample experimental results with realistic data, and compare our approach with current typical practice.
BlogForever D2.6: Data Extraction Methodology
This report outlines an inquiry into the area of web data extraction, conducted within the context of blog preservation. The report reviews theoretical advances and practical developments for implementing data extraction. The inquiry is extended through an experiment that demonstrates the effectiveness and feasibility of implementing some of the suggested approaches. More specifically, the report discusses an approach based on unsupervised machine learning that employs the RSS feeds and HTML representations of blogs. It outlines the possibilities of extracting semantics available in blogs and demonstrates the benefits of exploiting available standards such as microformats and microdata. The report proceeds to propose a methodology for extracting and processing blog data to further inform the design and development of the BlogForever platform.
Celebration Schedule 2015 (Friday)
Full presentation schedule for Celebration, Friday, May 1, 2015.
Enhancing camera surveillance using computer vision: a research note
- The growth of police-operated surveillance cameras has outpaced the ability
of humans to monitor them effectively. Computer vision is a possible solution.
An ongoing research project on the application of computer vision within a
municipal police department is described. The paper aims to discuss these
issues.
- Following a demystification of computer vision technology, its potential for
police agencies is developed with a focus on computer vision as a solution for
two common surveillance camera tasks (live monitoring of multiple surveillance
cameras and summarizing archived video files). Three unaddressed research
questions are considered: can specialized computer vision applications for law
enforcement be developed at this time; how will computer vision be utilized
within existing public safety camera monitoring rooms; and what are the
system-wide impacts of a computer vision capability on local criminal justice
systems?
- Despite computer vision becoming accessible to law enforcement agencies, its
impact has not been discussed or adequately researched. There is little
knowledge of computer vision or its potential in the field.
- This paper introduces and discusses computer vision from a law enforcement
perspective and will be valuable to police personnel tasked with monitoring
large camera networks and considering computer vision as a system upgrade.