A Layer-Wise Tokens-to-Token Transformer Network for Improved Historical Document Image Enhancement
Document image enhancement is a fundamental and important stage for attaining
the best performance in any document analysis task, because many degradation
situations can harm document images and make them more difficult to recognize
and analyze. In this paper, we propose T2T-BinFormer, a novel encoder-decoder
architecture for document binarization based on a Tokens-to-Token vision
transformer. In the standard ViT model, each image is divided into a set of
tokens of a defined length, and the transformer is then applied several times
to model the global relationships between the tokens. However, this
conventional tokenization of the input data does not adequately reflect the
crucial local structure between adjacent pixels of the input image, which
results in low efficiency. Instead of using a simple ViT with hard splitting
of images for the document image enhancement task, we employ a progressive
tokenization technique that captures this local information from an image to
achieve more effective results. Experiments on various DIBCO and
H-DIBCO benchmarks demonstrate that the proposed model outperforms the existing
CNN and ViT-based state-of-the-art methods. In this research, the primary area
of examination is the application of the proposed architecture to the task of
document binarization. The source code will be made available at
https://github.com/RisabBiswas/T2T-BinFormer.
Comment: arXiv admin note: text overlap with arXiv:2312.0356
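The contrast between hard splitting and the soft split used in tokens-to-token style tokenization can be sketched in a few lines. This is a minimal NumPy illustration under our own assumptions (function names, window size, and stride are ours, not from the paper): a hard split produces disjoint patches, while a soft split with a stride smaller than the window lets adjacent tokens share pixels, preserving local structure.

```python
import numpy as np

def hard_split(img, k):
    """ViT-style hard split: non-overlapping k x k patches -> flat tokens."""
    h, w = img.shape
    return np.stack([img[i:i + k, j:j + k].ravel()
                     for i in range(0, h - k + 1, k)
                     for j in range(0, w - k + 1, k)])

def soft_split(img, k, s):
    """T2T-style soft split: overlapping k x k patches with stride s < k,
    so neighbouring tokens share pixels and local structure is kept."""
    h, w = img.shape
    return np.stack([img[i:i + k, j:j + k].ravel()
                     for i in range(0, h - k + 1, s)
                     for j in range(0, w - k + 1, s)])

img = np.arange(64.0).reshape(8, 8)
hard = hard_split(img, 4)   # 4 disjoint tokens of dimension 16
soft = soft_split(img, 4, 2)  # 9 overlapping tokens of dimension 16
```

In the full model this soft split is applied progressively, re-assembling tokens into an image-like grid and splitting again, which this sketch omits.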
Statistical Analysis of Fractal Image Coding and Fixed Size Partitioning Scheme
Fractal Image Compression (FIC) is a state-of-the-art technique for achieving high compression ratios, but it suffers from long encoding times. In this method an image is divided into non-overlapping range blocks and overlapping domain blocks. The number of domain blocks is larger than the number of range blocks, and each domain block is twice the size of a range block. Together, all domain blocks form a domain pool. A range block is compared with every candidate domain block using a similarity measure, so each domain block must be decimated (downsampled) for a proper domain-range comparison; this exhaustive search is what makes the process very time consuming. In this paper a novel domain pool decimation and reduction technique is developed, which uses the median as the measure of central tendency of the domain pixel values instead of the mean (average).
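The partitioning and median-based domain selection can be sketched as follows. This is a minimal NumPy illustration under our own assumptions (block size, stride, and function names are hypothetical), not the paper's implementation:

```python
import numpy as np

def range_blocks(img, bs):
    """Split an image into non-overlapping bs x bs range blocks."""
    h, w = img.shape
    return [img[i:i + bs, j:j + bs]
            for i in range(0, h - bs + 1, bs)
            for j in range(0, w - bs + 1, bs)]

def domain_pool(img, bs, step):
    """Overlapping 2bs x 2bs domain blocks, decimated to bs x bs
    by 2x2 averaging so they match the range-block size."""
    h, w = img.shape
    pool = []
    for i in range(0, h - 2 * bs + 1, step):
        for j in range(0, w - 2 * bs + 1, step):
            d = img[i:i + 2 * bs, j:j + 2 * bs]
            pool.append(d.reshape(bs, 2, bs, 2).mean(axis=(1, 3)))
    return pool

def reduce_pool_by_median(pool, rng_block, keep):
    """Keep only the `keep` domains whose median pixel value is closest
    to the range block's median -- the median-based decimation idea."""
    rm = np.median(rng_block)
    return sorted(pool, key=lambda d: abs(np.median(d) - rm))[:keep]

img = np.arange(64.0).reshape(8, 8)
ranges = range_blocks(img, 2)          # 16 range blocks
pool = domain_pool(img, 2, 2)          # 9 overlapping domain blocks
best = reduce_pool_by_median(pool, ranges[0], 3)
```

Restricting the search to the median-matched subset is what reduces the cost of the otherwise exhaustive domain-range comparison.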
LiSHT: Non-Parametric Linearly Scaled Hyperbolic Tangent Activation Function for Neural Networks
The activation function in a neural network is one of the key components that
facilitates deep training by introducing non-linearity into the learning
process. However, because of zero-hard rectification, some existing activation
functions such as ReLU and Swish fail to utilize large negative input values
and may suffer from the dying gradient problem. It is therefore important to
look for a better activation function that is free from such problems. As a
remedy, this paper proposes a new non-parametric function,
called Linearly Scaled Hyperbolic Tangent (LiSHT) for Neural Networks (NNs).
The proposed LiSHT activation function is an attempt to scale the non-linear
Hyperbolic Tangent (Tanh) function by a linear function and tackle the dying
gradient problem. Training and classification experiments on the benchmark
Iris, MNIST, CIFAR10, CIFAR100 and twitter140 datasets show that the proposed
activation achieves faster convergence and higher performance. A very
promising performance improvement is observed on three different types of
neural networks: the Multi-layer Perceptron (MLP), the Convolutional Neural
Network (CNN), and a recurrent network, the Long Short-Term Memory (LSTM).
The advantages of the proposed activation function are also visualized in
terms of feature activation maps, weight distributions and the loss landscape.
The code is available at https://github.com/swalpa/lisht.
Comment: Submitted to IET Image Processing
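The function itself is simple to state: LiSHT scales Tanh by the identity, i.e. LiSHT(x) = x * tanh(x), which is non-negative everywhere, so large negative inputs still produce large activations instead of being zeroed out. A minimal NumPy sketch (function names are ours):

```python
import numpy as np

def lisht(x):
    """LiSHT activation: scales tanh(x) by the linear function x.
    x * tanh(x) >= 0 for all x, so large negative inputs are not discarded
    the way zero-hard rectifiers like ReLU discard them."""
    return x * np.tanh(x)

def lisht_grad(x):
    """Derivative of LiSHT: tanh(x) + x * (1 - tanh(x)^2)."""
    t = np.tanh(x)
    return t + x * (1.0 - t * t)
```

Note the symmetry lisht(-x) == lisht(x): unlike ReLU, the negative half of the input range contributes to the forward signal and to the gradient.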
Deep Hyperspectral Unmixing using Transformer Network
Transformers have intrigued
the vision research community with their state-of-the-art performance in
natural language processing. With their superior performance, transformers have
found their way in the field of hyperspectral image classification and achieved
promising results. In this article, we harness the power of transformers to
conquer the task of hyperspectral unmixing and propose a novel deep unmixing
model with transformers. We aim to utilize the ability of transformers to
better capture the global feature dependencies in order to enhance the quality
of the endmember spectra and the abundance maps. The proposed model is a
combination of a convolutional autoencoder and a transformer. The hyperspectral
data is encoded by the convolutional encoder. The transformer captures
long-range dependencies between the representations derived from the encoder.
The data are reconstructed using a convolutional decoder. We applied the
proposed unmixing model to three widely used unmixing datasets, i.e., Samson,
Apex, and Washington DC Mall, and compared it with the state-of-the-art in terms
of root mean squared error and spectral angle distance. The source code for the
proposed model will be made publicly available at
\url{https://github.com/preetam22n/DeepTrans-HSU}.
Comment: Currently, this paper is under review in IEEE.
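The two evaluation metrics named above are easy to state. A minimal NumPy sketch of root mean squared error (for abundance maps) and spectral angle distance (for endmember spectra); function names are ours:

```python
import numpy as np

def rmse(a_true, a_pred):
    """Root mean squared error between estimated and reference abundance maps."""
    return float(np.sqrt(np.mean((a_true - a_pred) ** 2)))

def sad(s_true, s_pred):
    """Spectral angle distance (radians) between two endmember spectra:
    the angle between the spectra viewed as vectors."""
    cos = np.dot(s_true, s_pred) / (np.linalg.norm(s_true) * np.linalg.norm(s_pred))
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))
```

SAD is scale-invariant (parallel spectra give zero distance regardless of magnitude), which is why it is preferred over a Euclidean distance when comparing endmember shapes.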
Multimodal Fusion Transformer for Remote Sensing Image Classification
Vision transformer (ViT) has been trending in image classification tasks due
to its promising performance when compared to convolutional neural networks
(CNNs). As a result, many researchers have tried to incorporate ViT models in
hyperspectral image (HSI) classification tasks, but without achieving
satisfactory performance. In this paper, we introduce a new multimodal fusion
transformer (MFT) network for HSI land-cover classification, which utilizes
other sources of multimodal data in addition to HSI. Instead of using
conventional feature fusion techniques, the other multimodal data are used as
an external classification (CLS) token in the transformer encoder, which helps
achieve better generalization. ViT and similar transformer models use a
randomly initialized external classification token and consequently fail to
generalize well. The use of a feature embedding derived from another source of
multimodal data, such as light detection and ranging (LiDAR), offers the
potential to improve those models by means of this CLS token. The concept of
tokenization is used in our work to generate CLS and HSI patch tokens, helping
to learn key features in a reduced feature space. We also introduce a new
attention mechanism for improving the exchange of information between HSI
tokens and the CLS (e.g., LiDAR) token. Extensive experiments are carried out
on widely used benchmark datasets, i.e., the University of Houston, Trento,
University of Southern Mississippi Gulfpark (MUUFL), and Augsburg. In the
results section, we compare the proposed MFT model with other state-of-the-art
transformer models, classical CNN models, as well as conventional classifiers.
The superior performance achieved by the proposed model is due to the use of
multimodal information as external classification tokens.
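The external-CLS idea can be sketched independently of the full architecture: embed a LiDAR patch into a single token and prepend it to the HSI patch tokens in place of a randomly initialised CLS. This is a minimal NumPy sketch with hypothetical shapes, and a random projection standing in for the learned embedding; none of the names below come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 64  # hypothetical token dimension

def embed_lidar_cls(lidar_patch, w):
    """Project a flattened LiDAR patch to one d_model-dimensional CLS token.
    In the real model w would be a learned projection; here it is random."""
    return lidar_patch.ravel() @ w

def prepend_cls(hsi_tokens, cls_token):
    """Place the LiDAR-derived CLS token in front of the HSI patch tokens,
    instead of a randomly initialised CLS token."""
    return np.vstack([cls_token[None, :], hsi_tokens])

# hypothetical inputs: an 11x11 LiDAR patch and 16 HSI patch tokens
w = rng.normal(size=(11 * 11, d_model))
lidar = rng.normal(size=(11, 11))
hsi_tokens = rng.normal(size=(16, d_model))
seq = prepend_cls(hsi_tokens, embed_lidar_cls(lidar, w))  # (17, d_model)
```

The encoder then lets the data-derived CLS token attend to all HSI patch tokens, which is where the cross-modal exchange described in the abstract takes place.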
Hybrid Dense Network With Attention Mechanism for Hyperspectral Image Classification
The nonlinear relation between spectral information and the corresponding objects (complex physiognomies) makes pixelwise classification challenging for conventional methods. Convolutional neural networks (CNNs) are better suited to dealing with such nonlinearity in hyperspectral image classification (HSIC). However, fixed kernel sizes make traditional CNNs too specific, neither flexible nor conducive to feature learning, thus impacting classification accuracy. Combining convolutions with different kernel sizes may overcome this problem by capturing more discriminating and relevant information. In light of this, the proposed solution combines the core idea of 3-D and 2-D Inception nets with an attention mechanism to boost HSIC CNN performance in a hybrid scenario. The resulting attention-fused hybrid network (AfNet) is based on three attention-fused parallel hybrid subnets with different kernels in each block, repeatedly using high-level features to enhance the final ground-truth maps. In short, AfNet is able to selectively filter out the discriminative features critical for classification. Several tests on HSI datasets provided competitive results for AfNet compared to state-of-the-art models.
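The attention-fused combination of parallel branches can be illustrated with a toy weighting scheme. This NumPy sketch is our own drastic simplification, not the paper's attention module: it weights each branch's feature map (one branch per kernel size) by a softmax over per-branch global responses, then sums them.

```python
import numpy as np

def softmax(x, axis=0):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_fuse(feature_maps):
    """Fuse parallel feature maps with attention weights derived from
    each map's global average response (a toy stand-in for the
    attention mechanism described in the abstract)."""
    stack = np.stack(feature_maps)            # (branches, H, W)
    scores = stack.mean(axis=(1, 2))          # one scalar score per branch
    weights = softmax(scores)                 # attention over branches
    return np.tensordot(weights, stack, axes=1)  # weighted sum -> (H, W)
```

With equal branches the fusion reduces to an average; when one branch responds more strongly, its map dominates the fused output, which is the selective-filtering behaviour the abstract attributes to AfNet.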