
    Deep Unsupervised Embedding for Remote Sensing Image Retrieval Using Textual Cues

    Compared to image-image retrieval, text-image retrieval has been less investigated in the remote sensing community, possibly because of the complexity of appropriately tying textual data to the corresponding visual representations. Moreover, a single image may be described by multiple sentences, depending on the perception of the human labeler and the structure of the language they use, which magnifies the complexity even further. In this paper, we propose an unsupervised method for text-image retrieval in remote sensing imagery. In this method, image representations are obtained via visual Big Transfer (BiT) models, while textual descriptions are encoded via a bidirectional Long Short-Term Memory (Bi-LSTM) network. The training of the proposed retrieval architecture is optimized using an unsupervised embedding loss, which aims to make the features of an image closest to those of its corresponding textual description and different from the features of other images, and vice versa. To demonstrate the performance of the proposed architecture, experiments are performed on two datasets, obtaining plausible text/image retrieval outcomes.
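
    The architecture pairs a visual encoder with a Bi-LSTM text branch trained by an embedding loss that pulls each image toward its own caption and away from the others. The PyTorch sketch below only illustrates that idea; the ResNet-50 stand-in for BiT, the dimensions, the margin, and the exact hinge formulation are assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet50

class ImageEncoder(nn.Module):
    """Image branch. The paper uses Big Transfer (BiT) backbones; a torchvision
    ResNet-50 stands in here purely for illustration."""
    def __init__(self, embed_dim=256):
        super().__init__()
        backbone = resnet50(weights=None)          # swap in BiT weights in practice
        backbone.fc = nn.Identity()                # keep the 2048-d pooled feature
        self.backbone = backbone
        self.proj = nn.Linear(2048, embed_dim)

    def forward(self, images):                     # images: (B, 3, H, W)
        return F.normalize(self.proj(self.backbone(images)), dim=-1)

class TextEncoder(nn.Module):
    """Bi-LSTM text branch: the final forward and backward states are concatenated."""
    def __init__(self, vocab_size=10000, embed_dim=256, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, 300)
        self.lstm = nn.LSTM(300, hidden, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, embed_dim)

    def forward(self, tokens):                     # tokens: (B, T) word indices
        _, (h, _) = self.lstm(self.embed(tokens))
        h = torch.cat([h[-2], h[-1]], dim=-1)      # forward + backward states
        return F.normalize(self.proj(h), dim=-1)

def embedding_loss(img_emb, txt_emb, margin=0.2):
    """Hinge-style bidirectional loss with in-batch negatives: each image should be
    closer to its own caption than to any other caption, and vice versa."""
    sim = img_emb @ txt_emb.t()                       # (B, B) cosine similarities
    pos = sim.diag().unsqueeze(1)
    cost_txt = (margin + sim - pos).clamp(min=0)      # image vs. wrong captions
    cost_img = (margin + sim - pos.t()).clamp(min=0)  # caption vs. wrong images
    mask = torch.eye(sim.size(0), dtype=torch.bool)
    return cost_txt.masked_fill(mask, 0).mean() + cost_img.masked_fill(mask, 0).mean()

if __name__ == "__main__":
    imgs, caps = torch.randn(4, 3, 224, 224), torch.randint(0, 10000, (4, 20))
    loss = embedding_loss(ImageEncoder()(imgs), TextEncoder()(caps))
    print(loss.item())
```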

    Vision Transformers for Remote Sensing Image Classification

    In this paper, we propose a remote-sensing scene-classification method based on vision transformers. These networks, which are now recognized as state-of-the-art models in natural language processing, do not rely on convolution layers as standard convolutional neural networks (CNNs) do. Instead, they use multihead attention mechanisms as the main building block to derive long-range contextual relations between pixels in images. In a first step, the images under analysis are divided into patches, which are then flattened and embedded to form a sequence. To retain positional information, position embeddings are added to these patch embeddings. The resulting sequence is then fed to several multihead attention layers to generate the final representation. At the classification stage, the first token of the sequence is fed to a softmax classification layer. To boost the classification performance, we explore several data augmentation strategies to generate additional training data. Moreover, we show experimentally that the network can be compressed by pruning half of its layers while keeping competitive classification accuracy. Experimental results on different remote-sensing image datasets demonstrate the promising capability of the model compared to state-of-the-art methods. Specifically, the Vision Transformer obtains average classification accuracies of 98.49%, 95.86%, 95.56%, and 93.83% on the Merced, AID, Optimal31, and NWPU datasets, respectively, while the compressed version obtained by removing half of the multihead attention layers yields 97.90%, 94.27%, 95.30%, and 93.05%, respectively.
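
    The steps listed in the abstract (patchify, flatten and embed, add position embeddings, stack multihead attention layers, classify from the first token) map directly onto a compact PyTorch module. The sketch below is a minimal illustration with assumed sizes, not the configuration used in the paper; the reported compression would roughly correspond to halving `depth`.

```python
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    """Minimal vision-transformer classifier following the steps in the abstract:
    patchify -> flatten/embed -> add position embeddings -> multihead attention
    layers -> classify from the first (class) token. Sizes are illustrative."""
    def __init__(self, image_size=224, patch=16, dim=256, depth=6, heads=8, classes=21):
        super().__init__()
        n_patches = (image_size // patch) ** 2
        # patch embedding via a strided convolution (equivalent to flatten + linear)
        self.to_patches = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, n_patches + 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=4 * dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, classes)        # softmax applied in the loss

    def forward(self, x):                          # x: (B, 3, H, W)
        p = self.to_patches(x).flatten(2).transpose(1, 2)    # (B, N, dim)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        seq = torch.cat([cls, p], dim=1) + self.pos_embed    # add positions
        seq = self.encoder(seq)
        return self.head(seq[:, 0])                # logits from the class token

if __name__ == "__main__":
    logits = TinyViT()(torch.randn(2, 3, 224, 224))
    print(logits.shape)                            # torch.Size([2, 21])
```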

    Contrasting EfficientNet, ViT, and gMLP for COVID-19 Detection in Ultrasound Imagery

    A timely diagnosis of coronavirus is critical in order to control the spread of the virus. To aid in this, we propose in this paper a deep learning-based approach for detecting coronavirus patients using ultrasound imagery. We propose to exploit transfer learning of an EfficientNet model pre-trained on the ImageNet dataset for the classification of ultrasound images of suspected patients. In particular, we contrast the results of EfficientNet-B2 with the results of ViT and gMLP. We then report the results of the three models trained from scratch, i.e., without transfer learning. We view the detection problem from a multiclass classification perspective by classifying images as COVID-19, pneumonia, or normal. In the experiments, we evaluated the models on a publicly available ultrasound dataset. This dataset consists of 261 recordings (202 videos + 59 images) belonging to 216 distinct patients. The best results were obtained using EfficientNet-B2 with transfer learning. In particular, we obtained precision, recall, and F1 scores of 95.84%, 99.88%, and 97.41%, respectively, for detecting the COVID-19 class. EfficientNet-B2 with transfer learning achieved an overall accuracy of 96.79%, outperforming gMLP and ViT, which achieved accuracies of 93.03% and 92.82%, respectively.
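
    A minimal sketch of the transfer-learning setup described above: start from ImageNet weights for EfficientNet-B2 and swap the classifier head for a three-way COVID-19/pneumonia/normal layer. Freezing the backbone, the optimizer choice, and the input resolution are illustrative assumptions, not the paper's training recipe.

```python
import torch
import torch.nn as nn
from torchvision.models import efficientnet_b2, EfficientNet_B2_Weights

# Transfer learning along the lines of the abstract: ImageNet weights, new 3-way head.
NUM_CLASSES = 3  # COVID-19 / pneumonia / normal

model = efficientnet_b2(weights=EfficientNet_B2_Weights.IMAGENET1K_V1)
for p in model.features.parameters():               # optionally freeze the backbone
    p.requires_grad = False
in_features = model.classifier[1].in_features       # 1408 for EfficientNet-B2
model.classifier[1] = nn.Linear(in_features, NUM_CLASSES)

optimizer = torch.optim.Adam(filter(lambda p: p.requires_grad, model.parameters()), lr=1e-4)
criterion = nn.CrossEntropyLoss()

# One illustrative training step on dummy ultrasound frames
# (the original B2 recipe uses 260 px inputs; the model accepts other sizes too).
frames = torch.randn(8, 3, 260, 260)
labels = torch.randint(0, NUM_CLASSES, (8,))
optimizer.zero_grad()
loss = criterion(model(frames), labels)
loss.backward()
optimizer.step()
print(float(loss))
```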

    Contrasting Dual Transformer Architectures for Multi-Modal Remote Sensing Image Retrieval

    Remote sensing technology has advanced rapidly in recent years. Owing to the deployment of quantitative and qualitative sensors, as well as the evolution of powerful hardware and software platforms, it powers a wide range of civilian and military applications. This in turn leads to the availability of large data volumes suitable for a broad range of applications such as monitoring climate change. Yet processing, retrieving, and mining such large data are challenging. Usually, content-based remote sensing (RS) image retrieval approaches rely on a query image to retrieve relevant images from the dataset. To increase the flexibility of the retrieval experience, cross-modal representations based on text–image pairs are gaining popularity. Indeed, combining the text and image domains is regarded as one of the next frontiers in RS image retrieval. Yet aligning text to the content of RS images is particularly challenging due to the visual-semantic discrepancy between the language and vision worlds. In this work, we propose different architectures based on vision and language transformers for text-to-image and image-to-text retrieval. Extensive experimental results on four different datasets, namely TextRS, Merced, Sydney, and RSICD, are reported and discussed.
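
    Independently of the specific dual-transformer variants, cross-modal retrieval with two encoders boils down to ranking one modality's embeddings against the other's and scoring Recall@K in both directions. The snippet below sketches that evaluation step on stand-in embeddings; the embedding dimension and noise model are arbitrary assumptions.

```python
import torch
import torch.nn.functional as F

def recall_at_k(sim, ks=(1, 5, 10)):
    """Retrieval evaluation for a dual-encoder setup: sim[i, j] is the similarity
    between query i (e.g. a caption embedding from the language transformer) and
    gallery item j (an image embedding from the vision transformer); item i is the
    ground-truth match for query i."""
    ranks = sim.argsort(dim=1, descending=True)            # best match first
    target = torch.arange(sim.size(0)).unsqueeze(1)
    hit_pos = (ranks == target).float().argmax(dim=1)      # rank of the true match
    return {k: (hit_pos < k).float().mean().item() for k in ks}

if __name__ == "__main__":
    # Stand-in embeddings; in the paper these come from paired vision/language
    # transformer branches projected into a shared space and L2-normalized.
    torch.manual_seed(0)
    text_emb = F.normalize(torch.randn(100, 256), dim=-1)
    img_emb = F.normalize(text_emb + 0.5 * torch.randn(100, 256), dim=-1)
    print("text-to-image:", recall_at_k(text_emb @ img_emb.t()))
    print("image-to-text:", recall_at_k(img_emb @ text_emb.t()))
```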

    Visual Question Generation From Remote Sensing Images

    Visual question generation (VQG) is a fundamental task in vision-language understanding that aims to generate relevant questions about a given input image. In this article, we propose a paragraph-based VQG approach for generating intelligent natural-language questions about remote sensing (RS) images. Specifically, our proposed framework consists of two transformer-based vision and language models. First, we employ a Swin Transformer encoder to generate a multiscale representative visual feature from the image. Then, this feature is used as a prefix to guide a generative pretrained transformer-2 (GPT-2) decoder in generating multiple questions in the form of a paragraph that covers the abundant visual information contained in the RS scene. To train the model, the language decoder is fine-tuned on an RS dataset to generate a set of relevant questions from the RS image. We evaluate our model on two visual question-answering (VQA) datasets in RS. In addition, we construct a new dataset, termed TextRS-VQA, for better evaluation of our VQG model. This dataset consists of questions fully annotated by humans, which addresses the high redundancy of the questions in prior VQA datasets. Extensive experiments using several accuracy and diversity metrics demonstrate the effectiveness of our proposed VQG model in generating meaningful, valid, and diverse questions from RS images.
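
    The prefix mechanism described above, in which a projected visual feature steers a GPT-2 decoder, can be pictured roughly as follows. The prefix length, the single pooled Swin feature, the linear projection, and the frozen backbones are assumptions for illustration; the paper's multiscale feature handling and fine-tuning schedule are not reproduced.

```python
import torch
import torch.nn as nn
from torchvision.models import swin_t
from transformers import GPT2LMHeadModel, GPT2Tokenizer

class PrefixVQG(nn.Module):
    """Rough sketch of prefix conditioning: Swin features are projected into
    GPT-2's embedding space and prepended to the token embeddings, so the decoder
    generates questions conditioned on the image."""
    def __init__(self, prefix_len=10):
        super().__init__()
        self.vision = swin_t(weights=None)                  # load RS or ImageNet weights in practice
        self.vision.head = nn.Identity()                    # keep the 768-d pooled feature
        self.gpt2 = GPT2LMHeadModel.from_pretrained("gpt2")
        d = self.gpt2.config.n_embd
        self.prefix_len = prefix_len
        self.project = nn.Linear(768, prefix_len * d)       # feature -> prefix tokens

    def forward(self, images, input_ids):
        b = images.size(0)
        prefix = self.project(self.vision(images)).view(b, self.prefix_len, -1)
        tok_emb = self.gpt2.transformer.wte(input_ids)      # token embeddings
        inputs = torch.cat([prefix, tok_emb], dim=1)
        return self.gpt2(inputs_embeds=inputs).logits       # (B, prefix+T, vocab)

if __name__ == "__main__":
    tok = GPT2Tokenizer.from_pretrained("gpt2")
    ids = tok("What objects are present in the scene?", return_tensors="pt").input_ids
    model = PrefixVQG()
    print(model(torch.randn(1, 3, 224, 224), ids).shape)
```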

    Classification of AAMI heartbeat classes with an interactive ELM ensemble learning approach

    In recent years, the recommendations of the Association for the Advancement of Medical Instrumentation (AAMI) for class labeling and results presentation have been closely followed as a possible path toward standardization. In addition to class normalization, this standard recommends adopting inter-patient scenarios for performance evaluation, which renders the classification task very challenging due to the strong variability of ECG signals. To deal with this issue, we propose in this paper a novel interactive ensemble learning approach based on the extreme learning machine (ELM) classifier and induced ordered weighted averaging (IOWA) operators. While ELM is adopted for ensemble generation, the IOWA operators are used to aggregate the obtained predictions in a nonlinear way. During the iterative learning process, the approach allows the expert to label the most relevant and uncertain ECG heartbeats in the data under analysis, which are then added to the original training set for retraining. The experimental results obtained on the widely used MIT-BIH arrhythmia database show that the proposed approach significantly outperforms state-of-the-art methods after labeling on average 100 ECG beats per record. In addition, the results obtained on four other ECG databases, starting with the same initial training set from MIT-BIH, confirm its promising generalization capability.
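
    The two core ingredients, an ELM base classifier and IOWA aggregation of the ensemble outputs, can be illustrated compactly. The interactive expert-labeling loop is not reproduced here, and the hidden-layer size, regularization, and OWA weights below are assumptions for illustration only.

```python
import numpy as np

class ELM:
    """Basic extreme learning machine: random hidden layer, output weights solved
    by regularized least squares. Hyperparameters are illustrative."""
    def __init__(self, n_hidden=200, reg=1e-3, seed=0):
        self.n_hidden, self.reg = n_hidden, reg
        self.rng = np.random.default_rng(seed)

    def _hidden(self, X):
        return np.tanh(X @ self.W + self.b)

    def fit(self, X, y_onehot):
        self.W = self.rng.normal(size=(X.shape[1], self.n_hidden))
        self.b = self.rng.normal(size=self.n_hidden)
        H = self._hidden(X)
        self.beta = np.linalg.solve(H.T @ H + self.reg * np.eye(self.n_hidden), H.T @ y_onehot)
        return self

    def predict_proba(self, X):
        out = self._hidden(X) @ self.beta
        e = np.exp(out - out.max(axis=1, keepdims=True))     # softmax for probabilities
        return e / e.sum(axis=1, keepdims=True)

def iowa_aggregate(probs, weights):
    """Induced OWA over an ensemble: for each sample, member predictions are
    reordered by an inducing variable (here, each member's confidence) before the
    OWA weights are applied. probs: (n_members, n_samples, n_classes)."""
    conf = probs.max(axis=2)                                  # inducing variable
    order = np.argsort(-conf, axis=0)                         # most confident member first
    ordered = np.take_along_axis(probs, order[:, :, None], axis=0)
    return np.tensordot(weights, ordered, axes=(0, 0))        # (n_samples, n_classes)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.normal(size=(300, 16))
    y = rng.integers(0, 5, 300)                               # 5 AAMI-style classes
    Y = np.eye(5)[y]
    ensemble = [ELM(seed=s).fit(X, Y) for s in range(5)]
    probs = np.stack([m.predict_proba(X) for m in ensemble])  # (5, 300, 5)
    weights = np.array([0.4, 0.25, 0.15, 0.12, 0.08])         # OWA weights, sum to 1
    fused = iowa_aggregate(probs, weights)
    print("ensemble accuracy:", (fused.argmax(1) == y).mean())
```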

    TextRS: Deep Bidirectional Triplet Network for Matching Text to Remote Sensing Images

    Exploring the relevance between images and their respective natural language descriptions is, owing to its paramount importance, regarded as the next frontier in the general computer vision literature. Thus, several recent works have attempted to map visual attributes onto their corresponding textual tenor with some success. However, this line of research has not been widespread in the remote sensing community. On this point, our contribution is three-pronged. First, we construct a new dataset for text-image matching tasks, termed TextRS, by collecting images from four well-known scene datasets, namely the AID, Merced, PatternNet, and NWPU datasets. Each image is annotated with five different sentences, each written by a different person to ensure diversity. Second, we put forth a novel Deep Bidirectional Triplet Network (DBTN) for text-to-image matching. Unlike traditional remote sensing image-to-image retrieval, our paradigm seeks to carry out the retrieval by matching text to image representations. To achieve this, we propose to learn a bidirectional triplet network composed of a Long Short-Term Memory (LSTM) network and pre-trained Convolutional Neural Networks (CNNs) based on EfficientNet-B2, ResNet-50, Inception-v3, and VGG16. Third, we top the proposed architecture with an average fusion strategy to fuse the features pertaining to the five sentences of each image, which enables learning a more robust embedding. The performance of the method, expressed in terms of Recall@K (the presence of the relevant image among the top K images retrieved for the query text), is promising, yielding 17.20%, 51.39%, and 73.02% for K = 1, 5, and 10, respectively.
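
    A minimal sketch of the bidirectional triplet objective and the average fusion of the five sentence embeddings described above, assuming in-batch shifted negatives and a margin of 0.2; the actual negative mining and the CNN/LSTM branch architectures of DBTN are not reproduced here.

```python
import torch
import torch.nn.functional as F

def fuse_sentence_embeddings(sent_emb):
    """Average-fusion step from the abstract: the five sentence embeddings of each
    image are averaged into a single text representation.
    sent_emb: (B, 5, D) -> (B, D)."""
    return F.normalize(sent_emb.mean(dim=1), dim=-1)

def bidirectional_triplet_loss(img_emb, txt_emb, margin=0.2):
    """Triplet loss applied in both directions: an image is the anchor against
    matching/non-matching texts, and vice versa. The shifted in-batch negatives
    and the margin are illustrative choices."""
    B = img_emb.size(0)
    neg_idx = (torch.arange(B) + 1) % B                        # simple shifted negatives
    triplet = torch.nn.TripletMarginLoss(margin=margin)
    loss_i2t = triplet(img_emb, txt_emb, txt_emb[neg_idx])     # anchor = image
    loss_t2i = triplet(txt_emb, img_emb, img_emb[neg_idx])     # anchor = text
    return loss_i2t + loss_t2i

if __name__ == "__main__":
    torch.manual_seed(0)
    img_emb = F.normalize(torch.randn(8, 256), dim=-1)         # CNN branch output
    sent_emb = torch.randn(8, 5, 256)                          # LSTM branch, 5 captions per image
    txt_emb = fuse_sentence_embeddings(sent_emb)
    print(bidirectional_triplet_loss(img_emb, txt_emb).item())
```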

    Open-ended remote sensing visual question answering with transformers

    Visual question answering (VQA) has recently been attracting attention in remote sensing. However, the proposed solutions remain rather limited in the sense that the existing VQA datasets address closed-ended question-answer queries, which may not necessarily reflect real open-ended scenarios. In this paper, we propose a new dataset named VQA-TextRS, which was built manually with human annotations and considers various forms of open-ended question-answer pairs. Moreover, we propose an encoder-decoder architecture based on transformers, on account of their self-attention property that allows relational learning between different positions of the same sequence without the typical recurrence operations. We thus employ vision and natural language processing (NLP) transformers to draw visual and textual cues from the image and the respective question. Afterwards, we apply a transformer decoder, whose cross-attention mechanism fuses the two modalities. The fused representations then drive the answer-generation process to produce the final output. We demonstrate that plausible results can be obtained in open-ended VQA. For instance, the proposed architecture scores an accuracy of 84.01% on questions related to the presence of objects in the query images.
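
    The fusion stage can be pictured as a standard transformer decoder whose cross-attention attends jointly to the visual and question features while generating the answer tokens. The sketch below assumes both feature sets are already projected to a common dimension; the vocabulary size, depth, and head count are illustrative, not the paper's settings.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Sketch of the fusion stage: a transformer decoder whose cross-attention lets
    the answer sequence attend to the visual and textual features produced by the
    two encoders."""
    def __init__(self, dim=256, heads=8, layers=4, vocab=5000):
        super().__init__()
        self.answer_embed = nn.Embedding(vocab, dim)
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=layers)
        self.lm_head = nn.Linear(dim, vocab)

    def forward(self, img_feats, txt_feats, answer_ids):
        # memory = visual tokens + question tokens; cross-attention fuses them
        memory = torch.cat([img_feats, txt_feats], dim=1)          # (B, Nv+Nt, dim)
        tgt = self.answer_embed(answer_ids)                        # (B, T, dim)
        T = tgt.size(1)
        causal = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        out = self.decoder(tgt, memory, tgt_mask=causal)
        return self.lm_head(out)                                   # next-token logits

if __name__ == "__main__":
    B, dim = 2, 256
    img_feats = torch.randn(B, 49, dim)     # e.g. vision-transformer patch features
    txt_feats = torch.randn(B, 12, dim)     # question features from the NLP transformer
    answer_ids = torch.randint(0, 5000, (B, 6))
    print(CrossAttentionFusion()(img_feats, txt_feats, answer_ids).shape)
```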