Multimodal Representations for Teacher-Guided Compositional Visual Reasoning
Neural Module Networks (NMN) are a compelling method for visual question
answering, enabling the translation of a question into a program consisting of
a series of reasoning sub-tasks that are sequentially executed on the image to
produce an answer. NMNs provide enhanced explainability compared to integrated
models, allowing for a better understanding of the underlying reasoning
process. To improve the effectiveness of NMNs, we propose to exploit features obtained by a large-scale cross-modal encoder. In addition, the current training approach of NMNs relies on the propagation of module outputs to subsequent modules, leading to the accumulation of prediction errors and the generation of false answers. To mitigate this, we introduce an NMN learning strategy involving scheduled teacher guidance. Initially, the model is fully guided by the ground-truth intermediate outputs, but it gradually transitions to autonomous behavior as training progresses. This reduces error accumulation, thereby improving training efficiency and final performance. We demonstrate that by incorporating cross-modal features and employing more effective training techniques for NMN, we achieve a favorable balance between performance and transparency in the reasoning process.
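The scheduled teacher guidance described above can be sketched as follows. The linear decay schedule and the helper names are illustrative assumptions, not the paper's exact formulation: at each reasoning step, the next module receives either the ground-truth intermediate output or the model's own prediction, with the guidance probability shrinking over training.

```python
import random

def teacher_guidance_prob(epoch, total_epochs):
    """Probability of feeding the ground-truth intermediate output:
    fully guided at epoch 0, fully autonomous at the final epoch.
    A linear decay is assumed here purely for illustration."""
    return max(0.0, 1.0 - epoch / total_epochs)

def choose_module_input(gt_output, predicted_output, epoch, total_epochs, rng=random):
    """At each reasoning sub-task, feed either the ground-truth or the
    model-predicted intermediate output to the next module."""
    if rng.random() < teacher_guidance_prob(epoch, total_epochs):
        return gt_output      # teacher-guided step
    return predicted_output   # autonomous step
```

Because early modules are fed correct intermediate results at the start of training, later modules learn from clean inputs; the gradual hand-over then exposes them to their own prediction errors before evaluation.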
Curriculum Learning for Compositional Visual Reasoning
Visual Question Answering (VQA) is a complex task requiring large datasets
and expensive training. Neural Module Networks (NMN) first translate the
question to a reasoning path, then follow that path to analyze the image and
provide an answer. We propose an NMN method that relies on predefined cross-modal embeddings to "warm start" learning on the GQA dataset, then focus on Curriculum Learning (CL) as a way to improve training and make better use of the data. Several difficulty criteria are employed for defining CL methods. We show that with an appropriate selection of the CL method, the cost of training and the amount of training data can be greatly reduced, with a limited impact on the final VQA accuracy. Furthermore, we introduce intermediate losses during training and find that this allows us to simplify the CL strategy.
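A minimal sketch of the curriculum idea, under assumptions not taken from the paper: difficulty is approximated by the length of the reasoning program, and the data is split into stages introduced from easy to hard.

```python
def question_difficulty(sample):
    """An illustrative difficulty criterion: the number of reasoning
    sub-tasks in the sample's program. Other plausible criteria include
    question length or answer rarity."""
    return len(sample["program"])

def curriculum_order(dataset, n_stages=3):
    """Sort samples from easy to hard and split them into stages that
    are introduced progressively during training."""
    ranked = sorted(dataset, key=question_difficulty)
    stage_size = -(-len(ranked) // n_stages)  # ceiling division
    return [ranked[i:i + stage_size] for i in range(0, len(ranked), stage_size)]
```

Training would then iterate over the stages, enlarging the active training set at each stage, so that early epochs see only the easiest questions.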
A Statistical Framework for Image Category Search from a Mental Picture
Keywords: Image Retrieval; Relevance Feedback; Page Zero Problem; Mental Matching; Bayesian System; Statistical Learning
Starting from a member of an image database designated the “query image,” traditional image retrieval techniques, for example search by visual similarity, allow one to locate additional instances of a target category residing in the database. However, in many cases, the query image or, more generally, the target category, resides only in the mind of the user as a set of subjective visual patterns, psychological impressions or “mental pictures.” Consequently, since image databases available today are often unstructured and lack reliable semantic annotations, it is often not obvious how to initiate a search session; this is the “page zero problem.” We propose a new statistical framework based on relevance feedback to locate an instance of a semantic category in an unstructured image database with no semantic annotations. A search session is initiated from a random sample of images. At each retrieval round the user is asked to select one image from among a set of displayed images – the one that is closest in his opinion to the target class. The matching is then “mental.” Performance is measured by the number of iterations necessary to display an image which satisfies the user, at which point standard techniques can be employed to display other instances. Our core contribution is a Bayesian formulation which scales to large databases. The two key components are a response model which accounts for the user's subjective perception of similarity and a display algorithm which seeks to maximize the flow of information. Experiments with real users and two databases of 20,000 and 60,000 images demonstrate the efficiency of the search process.
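One feedback round of such a Bayesian formulation can be sketched as follows. The softmax response model, the `beta` parameter, and the distance function are illustrative assumptions, not the paper's exact model: the posterior over candidate targets is reweighted by the likelihood that the user would have picked the selected image if a given candidate were the true target.

```python
import math

def update_posterior(posterior, displayed, selected, dist, beta=1.0):
    """One round of Bayesian relevance feedback. `posterior` maps each
    database image to its probability of being the target. The response
    model assumes the user tends to pick the displayed image closest to
    the target, with softmax noise controlled by `beta`."""
    new_post = {}
    for img, p in posterior.items():
        # Likelihood that `selected` beats the other displayed images
        # when the true target is `img`.
        scores = {d: math.exp(-beta * dist(img, d)) for d in displayed}
        lik = scores[selected] / sum(scores.values())
        new_post[img] = p * lik
    z = sum(new_post.values())
    return {img: p / z for img, p in new_post.items()}
```

A display algorithm maximizing the flow of information would then choose the next set of images so that the expected posterior entropy after the user's answer is minimized.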
Active SVM-based Relevance Feedback with Hybrid Visual and Conceptual Content Representation
Most of the available image databases have keyword annotations associated with the images, related to the image context or to the semantic interpretation of image content. Keywords and visual features provide complementary information, so using these sources of information together is an advantage in many applications. We address here the challenge of semantic gap reduction, through an active SVM-based relevance feedback method, jointly with a hybrid visual and conceptual content representation and retrieval. We first introduce a new feature vector, based on the keyword annotations available for the images, which makes use of conceptual information extracted from an external ontology and represented by "core concepts". We then present two improvements of the SVM-based relevance feedback mechanism: a new active learning selection criterion and the use of specific kernel functions that reduce the sensitivity of the SVM to scale. We evaluate the use of the proposed hybrid feature vector composed of keyword representations and the low-level visual features in our SVM-based relevance feedback setting. Experiments show that the use of the keyword-based feature vectors provides a significant improvement in the quality of the results.
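The two ingredients above can be sketched minimally. The weighted concatenation and the choice of the Laplace (L1) kernel as the scale-robust kernel are illustrative assumptions; the paper's exact feature weighting and kernel family may differ.

```python
import math

def hybrid_feature(visual, keyword, w=0.5):
    """Concatenate low-level visual features with the keyword-based
    conceptual features; `w` balances the two modalities (the weighting
    scheme here is an illustrative assumption)."""
    return [w * v for v in visual] + [(1 - w) * c for c in keyword]

def laplace_kernel(x, y, gamma=1.0):
    """Laplace (L1-distance) kernel, an example of a kernel that is
    less sensitive to the scale of the data than the Gaussian RBF,
    since the exponent grows linearly rather than quadratically."""
    return math.exp(-gamma * sum(abs(a - b) for a, b in zip(x, y)))
```

An SVM trained on `hybrid_feature` vectors with such a kernel can then rank the database by decision value at each feedback round.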
Reducing the Redundancy in the Selection of Samples for SVM-based Relevance Feedback
In image retrieval with relevance feedback, the strategy employed by the system for selecting the images presented to the user at every feedback round has a strong effect on the transfer of information between the user and the system. Using SVMs, we put forward a new active learning selection strategy that minimizes redundancy between the images presented to the user and takes into account assumptions that are specific to the retrieval setting. Experiments on several image databases confirm the attractiveness of this selection strategy. We also find that insensitivity to the scale of the data is a desirable property for the SVMs employed as learners in relevance feedback and we show how to obtain such insensitivity by the use of specific kernel functions.
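A redundancy-reducing selection of this kind can be sketched greedily. The greedy scheme, the cosine-similarity redundancy measure, and the `max_sim` threshold are illustrative assumptions, not the paper's exact criterion: candidates near the SVM decision boundary are taken in order, skipping any that are too similar to an already chosen one.

```python
import math

def cosine(a, b):
    """Cosine similarity, used here as the redundancy measure."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den

def select_batch(candidates, margin_dist, k=4, max_sim=0.9):
    """Greedy active-learning selection: repeatedly take the candidate
    closest to the SVM decision boundary (smallest |f(x)|, given in
    `margin_dist`), skipping candidates too similar to those already
    chosen, so the displayed batch is both uncertain and diverse."""
    order = sorted(range(len(candidates)), key=lambda i: margin_dist[i])
    chosen = []
    for i in order:
        if all(cosine(candidates[i], candidates[j]) < max_sim for j in chosen):
            chosen.append(i)
        if len(chosen) == k:
            break
    return chosen
```

Labeling a diverse batch conveys more information per round than labeling several near-duplicates that the SVM would treat almost identically.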
An exploration of diversified user strategies for image retrieval with relevance feedback
Given the difficulty of setting up large-scale experiments with real users, the comparison of content-based image retrieval methods using relevance feedback usually relies on the emulation of the user, following a single, well-prescribed strategy. Since the behavior of real users cannot be expected to comply with strict specifications, it is very important to evaluate the sensitivity of the retrieval results to likely variations in user behavior. It is also important to find out whether some strategies help the system perform consistently better, so as to promote their use. Two selection algorithms for relevance feedback based on support vector machines are compared here. In these experiments, the user is emulated according to eight significantly different strategies on four ground-truth databases of different complexity. It is first found that the ranking of the two algorithms does not depend much on the selected strategy. Also, the ranking of the strategies appears to be relatively independent of the complexity of the ground-truth databases, which makes it possible to identify desirable characteristics in the behavior of the user.
On the beneficial effect of noise in vertex localization
A theoretical and experimental analysis of the effect of noise on the task of vertex identification in unknown shapes is presented. Shapes are seen as real functions of their closed boundary. An alternative global perspective of curvature is examined, providing insight into the process of noise-enabled vertex localization. The analysis reveals that noise facilitates the localization of certain vertices. The concept of noising is thus considered and a relevant global method for localizing Global Vertices is investigated in relation to local methods in the presence of increasing noise. Theoretical analysis reveals that induced noise can indeed help localize certain vertices if combined with global descriptors. Experiments with noise and a comparison to local methods validate the theoretical results.
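The "noising" idea can be sketched as a voting scheme, which is an illustrative assumption rather than the paper's exact method: the boundary is perturbed several times, a local curvature proxy (the turning angle) is computed on each noisy copy, and points that remain the strongest under noise accumulate votes as stable vertex candidates.

```python
import math
import random

def turning_angle(p_prev, p, p_next):
    """Exterior turning angle at p, a simple local curvature proxy."""
    a1 = math.atan2(p[1] - p_prev[1], p[0] - p_prev[0])
    a2 = math.atan2(p_next[1] - p[1], p_next[0] - p[0])
    d = a2 - a1
    return math.atan2(math.sin(d), math.cos(d))  # wrap to (-pi, pi]

def noise_vote_vertices(boundary, trials=50, sigma=0.05, rng=random):
    """Accumulate, over several noisy copies of the closed boundary,
    votes for the point with the largest |turning angle|. Points that
    stay dominant under induced noise are stable vertex candidates."""
    n = len(boundary)
    votes = [0] * n
    for _ in range(trials):
        noisy = [(x + rng.gauss(0, sigma), y + rng.gauss(0, sigma))
                 for x, y in boundary]
        angles = [abs(turning_angle(noisy[i - 1], noisy[i], noisy[(i + 1) % n]))
                  for i in range(n)]
        votes[angles.index(max(angles))] += 1
    return votes
```

On a square boundary sampled with edge midpoints, the votes concentrate on the four corners, while the flat midpoints receive none, illustrating how induced noise leaves genuine vertices identifiable.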
Image retrieval with active relevance feedback using both visual and keyword-based descriptors