Dublin City University at CLEF 2006: Experiments for the ImageCLEF Photo Collection Standard Ad Hoc Task
We provide a technical description of our submission to the CLEF 2006 Cross Language Image Retrieval (ImageCLEF) Photo Collection Standard Ad Hoc task. We performed monolingual and cross-language retrieval of photo images using photo annotations, with and without feedback, as well as a combined visual and text retrieval approach. Topics were translated into English using the Babelfish online machine translation
system. Our text runs used the BM25 algorithm, while our visual approach used simple low-level features with matching based on the Jeffrey divergence measure. Our results consistently indicate that the fusion of text and visual features performs best for this task, and that applying feedback for text consistently improves on the baseline
non-feedback BM25 text runs for all language pairs.
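The visual matching step above compares low-level feature histograms with the Jeffrey divergence. A minimal sketch of that measure, using the symmetric bin-wise-mean formulation common in content-based image retrieval (the paper's exact variant and smoothing constant are assumptions here):

```python
import numpy as np

def jeffrey_divergence(h, k, eps=1e-12):
    """Jeffrey divergence between two feature histograms h and k.
    Each bin is compared against the bin-wise mean m = (h + k) / 2,
    which keeps the measure symmetric and numerically stable.
    eps is an assumed smoothing constant to avoid log(0)."""
    h = np.asarray(h, dtype=float) + eps
    k = np.asarray(k, dtype=float) + eps
    h /= h.sum()  # normalize to probability distributions
    k /= k.sum()
    m = (h + k) / 2.0
    return float(np.sum(h * np.log(h / m) + k * np.log(k / m)))
```

Identical histograms score 0, and larger values indicate greater visual dissimilarity, so ranking candidate images by ascending divergence yields a retrieval ordering.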
ConTEXTual Net: A Multimodal Vision-Language Model for Segmentation of Pneumothorax
Radiology narrative reports often describe characteristics of a patient's
disease, including its location, size, and shape. Motivated by the recent
success of multimodal learning, we hypothesized that this descriptive text
could guide medical image analysis algorithms. We proposed a novel
vision-language model, ConTEXTual Net, for the task of pneumothorax
segmentation on chest radiographs. ConTEXTual Net utilizes language features
extracted from corresponding free-form radiology reports using a pre-trained
language model. Cross-attention modules are designed to combine the
intermediate output of each vision encoder layer and the text embeddings
generated by the language model. ConTEXTual Net was trained on the CANDID-PTX
dataset consisting of 3,196 positive cases of pneumothorax with segmentation
annotations from 6 different physicians as well as clinical radiology reports.
Using cross-validation, ConTEXTual Net achieved a Dice score of
0.716±0.016, which was similar to the degree of inter-reader variability
(0.712±0.044) computed on a subset of the data. It outperformed both
vision-only models (ResNet50 U-Net: 0.677±0.015 and GLoRIA:
0.686±0.014) and a competing vision-language model (LAVT: 0.706±0.009).
Ablation studies confirmed that it was the text information that led to the
performance gains. Additionally, we show that certain augmentation methods
degraded ConTEXTual Net's segmentation performance by breaking the image-text
concordance. We also evaluated the effects of using different language models
and activation functions in the cross-attention module, highlighting the
efficacy of our chosen architectural design.
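The cross-attention modules described above let each spatial position in the vision encoder's feature map attend over the report's token embeddings. A single-head sketch of that fusion step (the projection weights are random placeholders standing in for the model's learned parameters, and the residual combination is an assumption):

```python
import numpy as np

def softmax(z, axis=-1):
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(vision_feats, text_embeds, d_k=64, seed=0):
    """Single-head cross-attention in the spirit of ConTEXTual Net:
    vision features act as queries, text embeddings supply keys and
    values, so every spatial location attends over the report tokens.
    Weights are random stand-ins for the learned projections."""
    rng = np.random.default_rng(seed)
    n, dv = vision_feats.shape   # n spatial positions, dv channels
    m, dt = text_embeds.shape    # m report tokens, dt embedding dim
    Wq = rng.standard_normal((dv, d_k)) / np.sqrt(dv)
    Wk = rng.standard_normal((dt, d_k)) / np.sqrt(dt)
    Wv = rng.standard_normal((dt, dv)) / np.sqrt(dt)
    Q, K, V = vision_feats @ Wq, text_embeds @ Wk, text_embeds @ Wv
    attn = softmax(Q @ K.T / np.sqrt(d_k))   # (n, m) attention map
    return vision_feats + attn @ V           # residual text-guided update
```

In the full model this block would sit between consecutive vision-encoder layers, so text guidance is injected at several scales rather than only at the final feature map.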
Fine-grained Image Classification via Combining Vision and Language
Fine-grained image classification is a challenging task due to the large
intra-class variance and small inter-class variance, aiming at recognizing
hundreds of sub-categories belonging to the same basic-level category. Most
existing fine-grained image classification methods generally learn part
detection models to obtain the semantic parts for better classification
accuracy. Despite achieving promising results, these methods mainly have two
limitations: (1) not all the parts obtained through the part detection
models are beneficial or indispensable for classification, and (2)
fine-grained image classification requires more detailed visual descriptions
which could not be provided by the part locations or attribute annotations. For
addressing the above two limitations, this paper proposes the two-stream model
combining vision and language (CVL) for learning latent semantic
representations. The vision stream learns deep representations from the
original visual information via deep convolutional neural network. The language
stream utilizes the natural language descriptions which could point out the
discriminative parts or characteristics for each image, and provides a flexible
and compact way of encoding the salient visual aspects for distinguishing
sub-categories. Since the two streams are complementary, combining them
can further improve classification accuracy. Compared with 12
state-of-the-art methods on the widely used CUB-200-2011 dataset for
fine-grained image classification, the experimental results demonstrate that our CVL
approach achieves the best performance. Comment: 9 pages, to appear in CVPR 201
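The complementarity of the two streams can be exploited at prediction time. A minimal sketch, assuming simple score-level averaging (the paper's exact fusion scheme may differ), where each stream produces per-class logits and the final label comes from the blended probabilities:

```python
import numpy as np

def softmax(z):
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()

def two_stream_predict(vision_logits, language_logits, alpha=0.5):
    """Late fusion of a vision stream and a language stream (sketch):
    blend per-class probabilities with mixing weight alpha, then take
    the argmax. alpha is a hypothetical hyperparameter, not a value
    from the paper."""
    p = alpha * softmax(vision_logits) + (1 - alpha) * softmax(language_logits)
    return int(np.argmax(p)), p
```

When one stream is confident and the other is uncertain, the confident stream dominates the blend, which is the usual motivation for score-level fusion of complementary modalities.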
Multilingual interactive experiments with Flickr
This paper presents a proposal for iCLEF 2006, the interactive track
of the CLEF cross-language evaluation campaign. In the past, iCLEF has
addressed applications such as information retrieval and question answering. However, for 2006 the focus
has turned to text-based image retrieval from Flickr. We describe
Flickr, the challenges this kind of collection presents to
cross-language researchers, and suggest initial iCLEF tasks.