10,339 research outputs found
Semantic Visual Localization
Robust visual localization under a wide range of viewing conditions is a
fundamental problem in computer vision. Handling the difficult cases of this
problem is not only very challenging but also of high practical relevance,
e.g., in the context of life-long localization for augmented reality or
autonomous robots. In this paper, we propose a novel approach based on a joint
3D geometric and semantic understanding of the world, enabling it to succeed
under conditions where previous approaches failed. Our method leverages a novel
generative model for descriptor learning, trained on semantic scene completion
as an auxiliary task. The resulting 3D descriptors are robust to missing
observations by encoding high-level 3D geometric and semantic information.
Experiments on several challenging large-scale localization datasets
demonstrate reliable localization under extreme viewpoint, illumination, and
geometry changes
Systems Biology Graphical Notation: Activity Flow language Level 1
Standard graphical representations have played a crucial role in science and engineering throughout the last century. Without electrical symbolism, it is very likely that our industrial society would not have evolved at the same pace. Similarly, specialized notations such as the Feynmann notation or the process flow diagrams did a lot for the adoption of concepts in their own fields. With the advent of Systems Biology, and more recently of Synthetic Biology, the need for precise and unambiguous descriptions of biochemical interactions has become more pressing. While some ideas have been advanced over the last decade, with a few detailed proposals, no actual community standard has emerged. The Systems Biology Graphical Notation (SBGN) is a graphical representation crafted over several years by a community of biochemists, modellers and computer scientists. Three orthogonal and complementary languages have been created, the Process Descriptions, the Entity Relationships and the Activity Flows. Using these three idioms a scientist can represent any network of biochemical interactions, which can then be interpreted in an unambiguous way. The set of symbols used is limited, and the grammar quite simple, to allow its usage ranging from textbooks and teaching in high schools to peer reviewed articles in scientific journals. The first level of the SBGN Activity Flow language has been publicly released. Shared by the communities of biochemists, genomic scientists, theoreticians and computational biologists, SBGN languages will foster efficient storage, exchange and reuse of information on signaling pathways, metabolic networks and gene regulatory maps
Visual Question Answering: A Survey of Methods and Datasets
Visual Question Answering (VQA) is a challenging task that has received
increasing attention from both the computer vision and the natural language
processing communities. Given an image and a question in natural language, it
requires reasoning over visual elements of the image and general knowledge to
infer the correct answer. In the first part of this survey, we examine the
state of the art by comparing modern approaches to the problem. We classify
methods by their mechanism to connect the visual and textual modalities. In
particular, we examine the common approach of combining convolutional and
recurrent neural networks to map images and questions to a common feature
space. We also discuss memory-augmented and modular architectures that
interface with structured knowledge bases. In the second part of this survey,
we review the datasets available for training and evaluating VQA systems. The
various datatsets contain questions at different levels of complexity, which
require different capabilities and types of reasoning. We examine in depth the
question/answer pairs from the Visual Genome project, and evaluate the
relevance of the structured annotations of images with scene graphs for VQA.
Finally, we discuss promising future directions for the field, in particular
the connection to structured knowledge bases and the use of natural language
processing models.Comment: 25 page
Language Embedded Radiance Fields for Zero-Shot Task-Oriented Grasping
Grasping objects by a specific part is often crucial for safety and for
executing downstream tasks. Yet, learning-based grasp planners lack this
behavior unless they are trained on specific object part data, making it a
significant challenge to scale object diversity. Instead, we propose LERF-TOGO,
Language Embedded Radiance Fields for Task-Oriented Grasping of Objects, which
uses vision-language models zero-shot to output a grasp distribution over an
object given a natural language query. To accomplish this, we first reconstruct
a LERF of the scene, which distills CLIP embeddings into a multi-scale 3D
language field queryable with text. However, LERF has no sense of objectness,
meaning its relevancy outputs often return incomplete activations over an
object which are insufficient for subsequent part queries. LERF-TOGO mitigates
this lack of spatial grouping by extracting a 3D object mask via DINO features
and then conditionally querying LERF on this mask to obtain a semantic
distribution over the object with which to rank grasps from an off-the-shelf
grasp planner. We evaluate LERF-TOGO's ability to grasp task-oriented object
parts on 31 different physical objects, and find it selects grasps on the
correct part in 81% of all trials and grasps successfully in 69%. See the
project website at: lerftogo.github.ioComment: See the project website at: lerftogo.github.i
Knowledge-rich Image Gist Understanding Beyond Literal Meaning
We investigate the problem of understanding the message (gist) conveyed by
images and their captions as found, for instance, on websites or news articles.
To this end, we propose a methodology to capture the meaning of image-caption
pairs on the basis of large amounts of machine-readable knowledge that has
previously been shown to be highly effective for text understanding. Our method
identifies the connotation of objects beyond their denotation: where most
approaches to image understanding focus on the denotation of objects, i.e.,
their literal meaning, our work addresses the identification of connotations,
i.e., iconic meanings of objects, to understand the message of images. We view
image understanding as the task of representing an image-caption pair on the
basis of a wide-coverage vocabulary of concepts such as the one provided by
Wikipedia, and cast gist detection as a concept-ranking problem with
image-caption pairs as queries. To enable a thorough investigation of the
problem of gist understanding, we produce a gold standard of over 300
image-caption pairs and over 8,000 gist annotations covering a wide variety of
topics at different levels of abstraction. We use this dataset to
experimentally benchmark the contribution of signals from heterogeneous
sources, namely image and text. The best result with a Mean Average Precision
(MAP) of 0.69 indicate that by combining both dimensions we are able to better
understand the meaning of our image-caption pairs than when using language or
vision information alone. We test the robustness of our gist detection approach
when receiving automatically generated input, i.e., using automatically
generated image tags or generated captions, and prove the feasibility of an
end-to-end automated process
Seeking meaning: Examining a cross-situational solution to learn action verbs using human simulation paradigm
To acquire the meaning of a verb, language learners not only need to find the correct mapping between a specific verb and an action or event in the world, but also infer the underlying relational meaning that the verb encodes. Most verb naming instances in naturalistic contexts are highly ambiguous as many possible actions can be embedded in the same scenario and many possible verbs can be used to describe those actions. To understand whether learners can find the correct verb meaning from referentially ambiguous learning situations, we conducted three experiments using the Human Simulation Paradigm with adult learners. Our results suggest that although finding the right verb meaning from one learning instance is hard, there is a statistical solution to this problem. When provided with multiple verb learning instances all referring to the same verb, learners are able to aggregate information across situations and gradually converge to the correct semantic space. Even in cases where they may not guess the exact target verb, they can still discover the right meaning by guessing a similar verb that is semantically close to the ground truth
Recommended from our members
Gesture production and comprehension in children with specific language impairment
Children with specific language impairment (SLI) have difficulties with spoken language. However, some recent research suggests that these impairments reflect underlying cognitive limitations. Studying gesture may inform us clinically and theoretically about the nature of the association between language and cognition. A total of 20 children with SLI and 19 typically developing (TD) peers were assessed on a novel measure of gesture production. Children were also assessed for sentence comprehension errors in a speech-gesture integration task. Children with SLI performed equally to peers on gesture production but performed less well when comprehending integrated speech and gesture. Error patterns revealed a significant group interaction: children with SLI made more gesture-based errors, whilst TD children made semantically based ones. Children with SLI accessed and produced lexically encoded gestures despite having impaired spoken vocabulary and this group also showed stronger associations between gesture and language than TD children. When SLI comprehension breaks down, gesture may be relied on over speech, whilst TD children have a preference for spoken cues. The findings suggest that for children with SLI, gesture scaffolds are still more related to language development than for TD peers who have out-grown earlier reliance on gestures. Future clinical implications may include standardized assessment of symbolic gesture and classroom based gesture support for clinical groups
- …