N-Gram in Swin Transformers for Efficient Lightweight Image Super-Resolution
While some studies have proven that Swin Transformer (SwinT) with window
self-attention (WSA) is suitable for single image super-resolution (SR), SwinT
ignores the broad regions for reconstructing high-resolution images due to
window and shift size. In addition, many deep learning SR methods suffer from
intensive computations. To address these problems, we introduce the N-Gram
context to the image domain for the first time. We define N-Gram as
neighboring local windows in SwinT, which differs from text analysis that views
N-Gram as consecutive characters or words. N-Grams interact with each other by
sliding-WSA, expanding the regions seen to restore degraded pixels. Using the
N-Gram context, we propose NGswin, an efficient SR network with SCDP bottleneck
taking all outputs of the hierarchical encoder. Experimental results show that
NGswin achieves competitive performance while keeping an efficient structure,
compared with previous leading methods. Moreover, we also improve other
SwinT-based SR methods with the N-Gram context, thereby building an enhanced
model: SwinIR-NG. Our improved SwinIR-NG outperforms the current best
lightweight SR approaches and establishes state-of-the-art results. Codes will
be available soon.
Comment: 8 pages (main content) + 14 pages (supplementary content)
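The core idea, neighbouring local windows aggregated into a shared context before attention, can be sketched independently of any framework. The following NumPy snippet is our own simplification, not the paper's sliding-WSA implementation; `window_partition` and `ngram_context` are hypothetical names:

```python
import numpy as np

def window_partition(x, ws):
    """Split an (H, W) feature map into a grid of non-overlapping ws x ws windows."""
    H, W = x.shape
    return x.reshape(H // ws, ws, W // ws, ws).transpose(0, 2, 1, 3)  # (nH, nW, ws, ws)

def ngram_context(windows, n=2):
    """Aggregate each window with its right/bottom neighbours, sliding an
    n x n neighbourhood over the window grid (borders reuse what is available).
    This widens the region each window 'sees', in the spirit of the N-Gram context."""
    nH, nW, ws, _ = windows.shape
    ctx = np.zeros_like(windows)
    for i in range(nH):
        for j in range(nW):
            patch = windows[i:min(i + n, nH), j:min(j + n, nW)]
            ctx[i, j] = patch.mean(axis=(0, 1))  # simple mean in place of attention
    return ctx

x = np.arange(64, dtype=float).reshape(8, 8)
w = window_partition(x, ws=4)   # 2 x 2 grid of 4 x 4 windows
c = ngram_context(w, n=2)       # each window now carries its bi-gram neighbourhood
```

In the actual method the neighbouring windows interact through sliding window self-attention rather than a plain mean; the sketch only shows how the receptive region grows beyond a single window.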
Blind Image Super-resolution with Rich Texture-Aware Codebooks
Blind super-resolution (BSR) methods based on high-resolution (HR)
reconstruction codebooks have achieved promising results in recent years.
However, we find that a codebook based on HR reconstruction may not effectively
capture the complex correlations between low-resolution (LR) and HR images. In
detail, multiple HR images may produce similar LR versions due to complex blind
degradations, causing HR-dependent-only codebooks to have limited texture
diversity when faced with confusing LR inputs. To alleviate this problem, we
propose the Rich Texture-aware Codebook-based Network (RTCNet), which consists
of the Degradation-robust Texture Prior Module (DTPM) and the Patch-aware
Texture Prior Module (PTPM). DTPM mines the cross-resolution correspondence of
textures between LR and HR images. PTPM uses patch-wise semantic
pre-training to correct the misperception of texture similarity in the
high-level semantic regularization. By taking advantage of this, RTCNet
effectively gets rid of the misalignment of confusing textures between HR and
LR in the BSR scenarios. Experiments show that RTCNet outperforms
state-of-the-art methods on various benchmarks by 0.16 to 0.46 dB.
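The lookup at the heart of such codebook-based methods is a standard vector-quantisation step. A minimal sketch of generic VQ (our own illustration, not RTCNet's actual modules; `codebook_lookup` is a hypothetical name):

```python
import numpy as np

def codebook_lookup(features, codebook):
    """Map each feature vector to its nearest codebook entry: the plain
    vector-quantisation lookup shared by codebook-based SR approaches.
    features: (N, D) array, codebook: (K, D) array."""
    # Pairwise squared distances, shape (N, K)
    dist = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    idx = dist.argmin(axis=1)
    return codebook[idx], idx
```

The paper's contribution lies in how the codebook is learned (degradation-robust and patch-aware texture priors), so that ambiguous LR inputs do not collapse onto the same limited set of texture entries; the lookup itself stays this simple.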
Super-resolution assessment and detection
Super Resolution (SR) techniques are powerful digital manipulation tools that have significantly impacted various industries due to their ability to enhance the resolution of lower quality images and videos. Yet, the real-world adaptation of SR models poses numerous challenges, which blind SR models aim to overcome by emulating complex real-world degradations. In this thesis, we investigate these SR techniques, with a particular focus on comparing the performance of blind models to their non-blind counterparts under various conditions. Despite recent progress, the proliferation of SR techniques raises concerns about their potential misuse. These methods can easily manipulate real digital content and create misrepresentations, which highlights the need for robust SR detection mechanisms. In our study, we analyze the limitations of current SR detection techniques and propose a new detection system that exhibits higher performance in discerning real and upscaled videos. Moreover, we conduct several experiments to gain insights into the strengths and weaknesses of the detection models, providing a better understanding of their behavior and limitations. Particularly, we target 4K videos, which are rapidly becoming the standard resolution in various fields such as streaming services, gaming, and content creation. As part of our research, we have created and utilized a unique dataset in 4K resolution, specifically designed to facilitate the investigation of SR techniques and their detection.
Revisiting the Encoding of Satellite Image Time Series
Satellite Image Time Series (SITS) representation learning is complex due to
high spatiotemporal resolutions, irregular acquisition times, and intricate
spatiotemporal interactions. These challenges result in specialized neural
network architectures tailored for SITS analysis. The field has witnessed
promising results achieved by pioneering researchers, but transferring the
latest advances or established paradigms from Computer Vision (CV) to SITS is
still highly challenging due to the existing suboptimal representation learning
framework. In this paper, we develop a novel perspective of SITS processing as
a direct set prediction problem, inspired by the recent trend in adopting
query-based transformer decoders to streamline the object detection or image
segmentation pipeline. We further propose to decompose the representation
learning process of SITS into three explicit steps: collect-update-distribute,
which is computationally efficient and well suited to irregularly sampled and
asynchronous temporal satellite observations. Facilitated by the unique
reformulation, our proposed temporal learning backbone of SITS, initially
pre-trained on the resource-efficient pixel-set format and then fine-tuned on
the downstream dense prediction tasks, has attained new state-of-the-art (SOTA)
results on the PASTIS benchmark dataset. Specifically, the clear separation
between temporal and spatial components in the semantic/panoptic segmentation
pipeline of SITS makes us leverage the latest advances in CV, such as the
universal image segmentation architecture, resulting in noticeable increases of
2.5 points in mIoU and 8.8 points in PQ over the best scores reported so far.
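The collect-update-distribute pattern can be illustrated with plain dot-product cross-attention between a small set of queries and an irregularly sampled series. This is our own schematic (single head, no learned projections, hypothetical function names), not the paper's backbone:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # numerically stable softmax
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def collect_update_distribute(obs, queries):
    """One round of the collect-update-distribute pattern.
    obs: (T, D) irregularly sampled observations; queries: (Q, D) learned slots.
    The pattern needs no fixed sampling grid, which is why it suits
    asynchronous satellite time series."""
    attn = softmax(queries @ obs.T)   # collect: each query attends over all time steps
    updated = attn @ obs              # update: queries summarise the series
    back = softmax(obs @ updated.T)   # distribute: each observation reads the queries
    return updated, back @ updated
```

Because the queries attend over whatever time steps happen to exist, adding or removing observations changes only the attention matrix shape, not the architecture.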
Neural architecture search: A contemporary literature review for computer vision applications
Deep Neural Networks have received considerable attention in recent years. As the complexity of network architecture increases in relation to the task complexity, it becomes harder to manually craft an optimal neural network architecture and train it to convergence. As such, Neural Architecture Search (NAS) is becoming far more prevalent within computer vision research, especially as the construction of efficient, smaller network architectures becomes an increasingly important area of research, for which NAS is well suited. However, despite their promise, contemporary end-to-end NAS pipelines require vast computational training resources. In this paper, we present a comprehensive overview of contemporary NAS approaches with respect to image classification, object detection, and image segmentation. We adopt consistent terminology to overcome contradictions common within existing NAS literature. Furthermore, we identify and compare current performance limitations in addition to highlighting directions for future NAS research.
A Review on Skin Disease Classification and Detection Using Deep Learning Techniques
Skin cancer ranks among the most dangerous cancers. Skin cancers are commonly referred to as Melanoma. Melanoma is brought on by genetic faults or mutations on the skin, which are caused by Unrepaired Deoxyribonucleic Acid (DNA) in skin cells. It is essential to detect skin cancer in its infancy phase, since it is far more curable in its early stages. Skin cancer typically progresses to other regions of the body. Owing to the disease's increased frequency, high mortality rate, and prohibitively high cost of medical treatments, early diagnosis of skin cancer signs is crucial. Because these disorders are so hazardous, scholars have developed a number of early-detection techniques for melanoma. Lesion characteristics such as symmetry, colour, size, shape, and others are often utilised to detect skin cancer and distinguish benign skin cancer from melanoma. An in-depth investigation of deep learning techniques for melanoma's early detection is provided in this study. This study discusses the traditional feature extraction-based machine learning approaches for the segmentation and classification of skin lesions. Comparison-oriented research has been conducted to demonstrate the significance of various deep learning-based segmentation and classification approaches.
Generalized Differentiable Neural Architecture Search with Performance and Stability Improvements
This work introduces improvements to the stability and generalizability of Cyclic DARTS (CDARTS). CDARTS is a Differentiable Architecture Search (DARTS)-based approach to neural architecture search (NAS) that uses a cyclic feedback mechanism to train search and evaluation networks concurrently, thereby optimizing the search process by enforcing that the networks produce similar outputs. However, the dissimilarity between the loss functions used by the evaluation networks during the search and retraining phases results in a search-phase evaluation network that is a sub-optimal proxy for the final evaluation network utilized during retraining. ICDARTS, a revised algorithm that reformulates the search phase loss functions to ensure the criteria for training the networks are consistent across both phases, is presented along with a modified process for discretizing the search network's zero operations that allows the retention of these operations in the final evaluation networks. We pair the results of these changes with ablation studies of ICDARTS' algorithm and network template. Multiple methods were then explored for expanding the search space of ICDARTS, including extending its operation set and implementing methods for discretizing its continuous search cells, further improving its discovered networks' performance. In order to balance the flexibility of expanded search spaces with minimal compute costs, both a novel algorithm for incorporating efficient dynamic search spaces into ICDARTS and a multi-objective version of ICDARTS that incorporates an expected latency penalty term into its loss function are introduced. All enhancements to the original search algorithm are verified on two challenging scientific datasets. This work concludes by proposing a hierarchical version of ICDARTS that optimizes cell structures and network templates, and examining its preliminary results.
Image fusion for the novel rotating synthetic aperture system based on a vision transformer
Rotating synthetic aperture (RSA) technology offers a promising solution for achieving large-aperture and lightweight designs in optical remote-sensing systems. It employs a rectangular primary mirror, resulting in noncircular spatial symmetry in the point-spread function, which changes over time as the mirror rotates. Consequently, it is crucial to employ an appropriate image-fusion method to merge high-resolution information intermittently captured from different directions in the image sequence owing to the rotation of the mirror. However, existing image-fusion methods have struggled to address the unique imaging mechanism of this system and the characteristics of the geostationary orbit in which the system operates. To address this challenge, we model the imaging process of a noncircular rotating pupil and analyse its on-orbit imaging characteristics. Based on this analysis, we propose an image-fusion network based on a vision transformer. This network incorporates inter-frame mutual attention and intra-frame self-attention mechanisms, facilitating more effective extraction of temporal and spatial information from the image sequence. Specifically, mutual attention was used to model the correlation between pixels that were close to each other in the spatial and temporal dimensions, whereas long-range spatial dependencies were captured using intra-frame self-attention in the rotated variable-size attention block. We subsequently enhanced the fusion of spatiotemporal information using video swin transformer blocks. Extensive digital simulations and semi-physical imaging experiments on remote-sensing images obtained from the WorldView-3 satellite demonstrated that our method outperformed both image-fusion methods designed for the RSA system and state-of-the-art general deep learning-based methods
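The inter-frame mutual-attention idea, where each frame's tokens attend to another frame's tokens so that direction-dependent high-resolution detail can flow between frames, can be sketched as symmetric cross-attention. This is our own single-head simplification with no learned projections, not the paper's network; `mutual_attention` is a hypothetical name:

```python
import numpy as np

def mutual_attention(frame_a, frame_b):
    """Symmetric inter-frame attention between two frames' token sets.
    frame_a: (Na, D) tokens, frame_b: (Nb, D) tokens. Each frame's tokens
    attend to the other frame's tokens, so detail captured in one
    rotation angle can enrich the representation of the other."""
    def softmax(z):
        z = z - z.max(axis=-1, keepdims=True)  # numerically stable softmax
        e = np.exp(z)
        return e / e.sum(axis=-1, keepdims=True)
    a_enriched = softmax(frame_a @ frame_b.T) @ frame_b
    b_enriched = softmax(frame_b @ frame_a.T) @ frame_a
    return a_enriched, b_enriched
```

In the paper this mutual attention handles spatiotemporally nearby pixels across frames, while intra-frame self-attention captures long-range dependencies within each frame; the sketch only shows the bidirectional cross-frame exchange.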