A Vision-Based Quality Inspection System for Fabric Defect Detection and Classification
Published Thesis
Quality inspection of textile products is an important issue for fabric manufacturers. It is desirable to produce the highest quality goods in the shortest amount of time possible. Fabric faults or defects are responsible for nearly 85% of the defects found by the garment industry. Manufacturers recover only 45 to 65% of their profits from second or off-quality goods. There is a need for reliable automated woven fabric inspection methods in the textile industry.
Numerous methods have been proposed for detecting defects in textiles. These methods are generally grouped into three main categories according to the technique they use for texture feature extraction, namely statistical approaches, spectral approaches and model-based approaches.
In this thesis, we study one method from each category and propose their combinations in order to get improved fabric defect detection and classification accuracy. The three chosen methods are the grey level co-occurrence matrix (GLCM) from the statistical category, the wavelet transform from the spectral category and the Markov random field (MRF) from the model-based category. We identify the most effective texture features for each of those methods and for different fabric types in order to combine them.
Using GLCM, we identify the optimal number of features, the optimal quantisation level of the original image and the optimal intersample distance to use. We identify the optimal GLCM features for different types of fabrics and for three different classifiers.
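The GLCM step above can be illustrated with a minimal numpy sketch. The quantisation level and intersample distance are exactly the parameters the thesis tunes; the three Haralick-style features shown (contrast, energy, homogeneity) are standard choices, not necessarily the optimal subset the thesis identifies.

```python
import numpy as np

def glcm_features(img, levels=8, distance=1):
    """Grey level co-occurrence matrix (GLCM) features for a single
    horizontal offset. `levels` is the quantisation level and `distance`
    the intersample distance."""
    # Quantise the image to `levels` grey levels.
    q = (img.astype(float) / (img.max() + 1e-9) * (levels - 1)).astype(int)
    glcm = np.zeros((levels, levels), dtype=float)
    # Count horizontal co-occurrences at the given distance.
    for i, j in zip(q[:, :-distance].ravel(), q[:, distance:].ravel()):
        glcm[i, j] += 1
    glcm /= glcm.sum()          # normalise to a joint probability
    idx = np.arange(levels)
    ii, jj = np.meshgrid(idx, idx, indexing="ij")
    return {
        "contrast": float(((ii - jj) ** 2 * glcm).sum()),
        "energy": float((glcm ** 2).sum()),
        "homogeneity": float((glcm / (1.0 + np.abs(ii - jj))).sum()),
    }
```

A perfectly uniform texture gives zero contrast and maximal energy, which is why these features separate plain weaves from defective regions.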
Using the wavelet transform, we compare the defect detection and classification performance of features derived from the undecimated discrete wavelet and those derived from the dual-tree complex wavelet transform. We identify the best features for different types of fabrics.
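Wavelet texture features of the kind compared above are typically subband energies. The thesis uses the undecimated DWT and the dual-tree complex wavelet transform; the dependency-free sketch below substitutes a one-level decimated Haar transform to show the shape of the feature vector.

```python
import numpy as np

def haar_subband_energies(img):
    """One level of a 2-D Haar wavelet transform, returning the mean
    energy of each subband (LL, LH, HL, HH) as texture features."""
    a = img.astype(float)
    # Row transform: averages and differences of horizontal pixel pairs.
    lo = (a[:, ::2] + a[:, 1::2]) / 2.0
    hi = (a[:, ::2] - a[:, 1::2]) / 2.0
    # Column transform applied to both row-transform outputs.
    ll = (lo[::2] + lo[1::2]) / 2.0
    lh = (lo[::2] - lo[1::2]) / 2.0
    hl = (hi[::2] + hi[1::2]) / 2.0
    hh = (hi[::2] - hi[1::2]) / 2.0
    return {name: float((band ** 2).mean())
            for name, band in [("LL", ll), ("LH", lh), ("HL", hl), ("HH", hh)]}
```

The dual-tree complex wavelet transform would replace the four real subbands with six oriented complex subbands, whose magnitudes are near shift-invariant; that is the property the comparison in the thesis exploits.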
Using the Markov random field, we study the performance for fabric defect detection and classification of features derived from different models of Gaussian Markov random fields of order from 1 through 9. For each fabric type we identify the best model order.
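The GMRF features referred to above are the estimated neighbour-interaction parameters of the model. A minimal first-order sketch, estimated by least squares, looks as follows; higher model orders (up to 9 in the thesis) simply add further symmetric neighbour pairs as regression columns.

```python
import numpy as np

def gmrf_order1_params(img):
    """Least-squares estimate of first-order Gaussian Markov random
    field interaction parameters; the estimates serve as texture
    features.  Each centre pixel is regressed on the sums of its
    vertical and horizontal neighbour pairs."""
    x = img.astype(float)
    x = x - x.mean()                            # zero-mean field
    c = x[1:-1, 1:-1].ravel()                   # centre pixels
    v = (x[:-2, 1:-1] + x[2:, 1:-1]).ravel()    # vertical neighbour sums
    h = (x[1:-1, :-2] + x[1:-1, 2:]).ravel()    # horizontal neighbour sums
    A = np.stack([v, h], axis=1)
    theta, *_ = np.linalg.lstsq(A, c, rcond=None)
    return theta                                # [theta_v, theta_h]
```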
Finally, we propose three schemes for combining the best features identified from the three methods and study their fabric defect detection and classification performance. They generally lead to improved performance compared with the individual methods, although two of them need further improvement.
Scattering Vision Transformer: Spectral Mixing Matters
Vision transformers have gained significant attention and achieved state-of-the-art performance in various computer vision tasks, including image classification, instance segmentation, and object detection. However, challenges remain in addressing attention complexity and effectively capturing fine-grained information within images. Existing solutions often resort to down-sampling operations, such as pooling, to reduce computational cost. Unfortunately, such operations are non-invertible and can result in information loss. In this paper, we present a novel approach called the Scattering Vision Transformer (SVT) to tackle these challenges. SVT incorporates a spectral scattering network that enables the capture of intricate image details. SVT overcomes the invertibility issue associated with down-sampling operations by separating low-frequency and high-frequency components. Furthermore, SVT introduces a unique spectral gating network utilizing Einstein multiplication for token and channel mixing, effectively reducing complexity. We show that SVT achieves state-of-the-art performance on the ImageNet dataset with a significant reduction in the number of parameters and FLOPS. SVT shows a 2% improvement over LiTv2 and iFormer. SVT-H-S reaches 84.2% top-1 accuracy, while SVT-H-B reaches 85.2% (state-of-the-art for base versions) and SVT-H-L reaches 85.7% (state-of-the-art for large versions). SVT also shows comparable results in other vision tasks such as instance segmentation, and outperforms other transformers in transfer learning on standard datasets such as CIFAR10, CIFAR100, Oxford Flower, and Stanford Car. The project page is available at https://badripatro.github.io/svt/.
Comment: Accepted @NeurIPS 202
Computer lipreading via hybrid deep neural network hidden Markov models
Constructing a viable lipreading system is a challenge because it is claimed that only 30% of the information in speech production is visible on the lips. Nevertheless, in small vocabulary tasks, there have been several reports of high accuracies. However, investigation of larger vocabulary tasks is rare. This work examines constructing a large vocabulary lipreading system using an approach based on Deep Neural Network Hidden Markov Models (DNN-HMMs). We present the historical development of computer lipreading technology and the state-of-the-art results in small and large vocabulary tasks. In preliminary experiments, we evaluate the performance of lipreading and audiovisual speech recognition on small vocabulary data sets. We then concentrate on the improvement of lipreading systems at a more substantial vocabulary size with a multi-speaker data set. We tackle the problem of lipreading an unseen speaker. We investigate the effect of employing several steps to pre-process visual features. Moreover, we examine the contribution of language modelling in a lipreading system, where we use longer n-grams to recognise visual speech. Our lipreading system is constructed on the 6000-word vocabulary TCD-TIMIT audiovisual speech corpus. The results show that visual-only speech recognition can definitely reach about 60% word accuracy on large vocabularies. We achieved a mean of 59.42%, measured via three-fold cross-validation on the speaker-independent setting of the TCD-TIMIT corpus, using deep autoencoder features and DNN-HMM models. This is the best word accuracy of a lipreading system in a large vocabulary task reported on the TCD-TIMIT corpus. In the final part of the thesis, we examine how the DNN-HMM model improves lipreading performance. We also give an insight into lipreading by providing a feature visualisation. Finally, we present an analysis of lipreading results and suggestions for future development.
Uses of Complex Wavelets in Deep Convolutional Neural Networks
Image understanding has long been a goal for computer vision. It has proved to be an exceptionally difficult task due to the large amounts of variability that are inherent to objects in a scene. Recent advances in supervised learning methods, particularly convolutional neural networks (CNNs), have pushed forth the frontier of what we have been able to train computers to do.
Despite their successes, the mechanics of how these networks are able to recognize objects are little understood, and the networks themselves are often very difficult and time-consuming to train. It is very important that we improve our current approaches in every way possible.
A CNN is built from connecting many learned convolutional layers in series. These convolutional layers are fairly crude in terms of signal processing - they are arbitrary taps of a finite impulse response filter, learned through stochastic gradient descent from random initial conditions. We believe that if we reformulate the problem, we may achieve many insights and benefits in training CNNs. Noting that modern CNNs are mostly viewed from and analyzed in the spatial domain, this thesis aims to view the convolutional layers in the frequency domain (viewing things in the frequency domain has proved useful in the past for denoising, filter design, compression and many other tasks). In particular, we use complex wavelets (rather than the Fourier transform or the discrete wavelet transform) as basis functions to reformulate image understanding with deep networks.
In this thesis, we explore the most popular and well-developed form of using complex wavelets in deep learning, the ScatterNet from Stephane Mallat. We explore its current limitations by building a DeScatterNet and find that while it has many nice properties, it may not be sensitive to the most appropriate shapes for understanding natural images.
We then develop a locally invariant convolutional layer, a combination of a complex wavelet transform, a modulus operation, and a learned mixing. To do this, we derive backpropagation equations and allow gradients to flow back through the (previously fixed) ScatterNet front end. Connecting several such locally invariant layers allows us to build learnable ScatterNet, a more flexible and general form of the ScatterNet (while still maintaining its desired properties).
We show that the learnable ScatterNet can provide significant improvements over the regular ScatterNet when used as a front end for a learning system. Additionally, we show that the locally invariant convolutional layer can directly replace convolutional layers in a deep CNN (and not just at the front end). The locally invariant convolutional layers naturally downsample the input (because of the complex modulus) while increasing the channel dimension (because of the multiple wavelet orientations used). This is an operation that often happens in a CNN through a combination of a pooling and a convolutional layer. It was at these locations in a CNN that the learnable ScatterNet performed best, implying it may be useful as a learnable pooling layer.
Finally, we develop a system to learn complex weights that act directly on the wavelet coefficients of signals, in place of a convolutional layer. We call this layer the wavelet gain layer and show it can be used alongside convolutional layers. The network designer may then choose to learn in the pixel or wavelet domains. This layer shows a lot of promise and affords more control over which regions of the frequency space we want our layer to learn from. Our experiments show that it can improve on learning in the pixel domain for early layers of a CNN.
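The wavelet gain layer idea can be sketched in a few lines: transform, scale each subband by a learned gain, invert. The sketch below uses a one-level real Haar transform with scalar gains; the thesis works with complex dual-tree wavelet coefficients and learns the gains by backpropagation, neither of which is reproduced here.

```python
import numpy as np

def wavelet_gain_layer(img, gains):
    """Toy 'wavelet gain layer': one-level Haar forward transform,
    per-subband multiplication by a gain, then the inverse transform.
    `gains` maps subband names ('LL','LH','HL','HH') to scalars."""
    a = img.astype(float)
    # Forward Haar: rows, then columns.
    lo = (a[:, ::2] + a[:, 1::2]) / 2.0
    hi = (a[:, ::2] - a[:, 1::2]) / 2.0
    ll = (lo[::2] + lo[1::2]) / 2.0 * gains["LL"]
    lh = (lo[::2] - lo[1::2]) / 2.0 * gains["LH"]
    hl = (hi[::2] + hi[1::2]) / 2.0 * gains["HL"]
    hh = (hi[::2] - hi[1::2]) / 2.0 * gains["HH"]
    # Inverse Haar: undo the column transform, then the row transform.
    lo2 = np.empty_like(lo)
    hi2 = np.empty_like(hi)
    lo2[::2], lo2[1::2] = ll + lh, ll - lh
    hi2[::2], hi2[1::2] = hl + hh, hl - hh
    out = np.empty_like(a)
    out[:, ::2], out[:, 1::2] = lo2 + hi2, lo2 - hi2
    return out
```

With all gains set to 1 the layer is the identity, which is the sense in which learning in the wavelet domain is a reparametrisation rather than a restriction.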
Classification of Gastric Lesions Using Gabor Block Local Binary Patterns
The identification of cancer tissues in gastroenterology imaging poses novel challenges to the computer vision community in designing generic decision support systems. This generic nature demands that the image descriptors be invariant to illumination gradients, scaling, homogeneous illumination, and rotation. In this article, we devise a novel feature extraction methodology, which explores the effectiveness of Gabor filters coupled with Block Local Binary Patterns in designing such descriptors. We effectively exploit the illumination invariance properties of Block Local Binary Patterns and the inherent capability of convolutional neural networks to construct novel rotation, scale and illumination invariant features. The invariance characteristics of the proposed Gabor Block Local Binary Patterns (GBLBP) are demonstrated using a publicly available texture dataset. We use the proposed feature extraction methodology to extract texture features from Chromoendoscopy (CH) images for the classification of cancer lesions. The proposed feature set is later used in conjunction with convolutional neural networks to classify the CH images. The proposed convolutional neural network is a shallow network comprising fewer parameters, in contrast to other state-of-the-art networks exhibiting millions of parameters required for effective training. The obtained results reveal that the proposed GBLBP performs favorably compared with several other state-of-the-art methods, including both hand-crafted and convolutional neural network-based features.
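The LBP half of the GBLBP descriptor is simple to sketch: each pixel is encoded by an 8-bit string of comparisons against its neighbours. The minimal numpy version below omits the Gabor filter bank that precedes it in the paper and the block-histogram pooling that follows; the neighbour ordering is an illustrative choice.

```python
import numpy as np

def lbp_codes(img):
    """Basic 8-neighbour local binary pattern codes.  Each interior
    pixel gets one bit per neighbour: 1 if the neighbour is >= the
    centre value, giving a code in [0, 255] that is invariant to
    monotonic illumination changes."""
    x = img.astype(float)
    c = x[1:-1, 1:-1]                      # centre pixels
    # Eight neighbour offsets, ordered clockwise from the top-left.
    offs = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
            (1, 1), (1, 0), (1, -1), (0, -1)]
    code = np.zeros_like(c, dtype=int)
    for bit, (di, dj) in enumerate(offs):
        nb = x[1 + di:x.shape[0] - 1 + di, 1 + dj:x.shape[1] - 1 + dj]
        code |= (nb >= c).astype(int) << bit
    return code
```

The "block" variant then histograms these codes over fixed image blocks and concatenates the histograms, which is what makes the descriptor robust to where in the image a texture occurs.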
Colorization of Multispectral Image Fusion using Convolutional Neural Network approach
The proposed technique offers a significant advantage in enhancing multiband nighttime imagery for surveillance and navigation purposes. The multi-band image data set comprises visual and infrared motion sequences covering various military and civilian surveillance scenarios, including people who are stationary, walking or running, vehicles, and buildings or other man-made structures. The colorization method provides superior discrimination, identification of objects, faster reaction times and increased scene understanding compared with a monochrome fused image. The guided filtering approach is used to decompose the source images into two parts, an approximation part and a detail content part, and a weighted-averaging method is used to fuse the approximation part. Multi-layer features are extracted from the detail content part using the VGG-19 network. Finally, the approximation part and the detail content part are combined to reconstruct the fused image. The proposed approach offers better outcomes than prevailing state-of-the-art techniques in terms of quantitative and qualitative parameters. In the future, the proposed technique will help battlefield monitoring, defence situation awareness, surveillance, target tracking and person authentication.
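The two-scale fusion pipeline described above can be sketched compactly. The sketch substitutes a box filter for the guided filter and a max-absolute rule for the VGG-19 feature-based detail fusion, so it shows only the decompose/fuse/reconstruct structure, not the paper's actual components.

```python
import numpy as np

def box_blur(img, k=3):
    """Box filter standing in for the guided filter in the paper."""
    pad = k // 2
    p = np.pad(img.astype(float), pad, mode="edge")
    out = np.zeros(img.shape, dtype=float)
    for di in range(k):
        for dj in range(k):
            out += p[di:di + img.shape[0], dj:dj + img.shape[1]]
    return out / (k * k)

def fuse(visible, infrared, w=0.5):
    """Two-scale fusion: decompose each source into approximation and
    detail, weighted-average the approximations, keep the larger-
    magnitude detail, and recombine."""
    a1, a2 = box_blur(visible), box_blur(infrared)
    d1, d2 = visible - a1, infrared - a2          # detail content parts
    approx = w * a1 + (1 - w) * a2                # weighted averaging
    detail = np.where(np.abs(d1) >= np.abs(d2), d1, d2)
    return approx + detail
```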
An Approach for Object Tracking in Video Sequences
In the recent past there has been a significant increase in the number of applications effectively utilizing digital videos because of less costly but superior devices. This upsurge in video acquisition has led to a huge augmentation of data, which is quite impossible to handle manually. Therefore, an automated means of processing these videos is indispensable. In this thesis one such attempt has been made to track objects in videos. Object tracking comprises two closely related processes: object detection followed by tracking of the detected objects. Algorithms for these two processes are proposed in this thesis.
Simple object detection algorithms compare a static background frame at pixel level with the current frame in a video. Existing methods in this domain first try to detect objects and then remove shadows associated with them, which is a two-stage process. The proposed approach combines both stages into a single stage. Two different algorithms are proposed for object detection: the first to model the background and the second to extract the objects and remove shadows from them. Initially, from the first few frames, the nature of each pixel is determined as stationary or non-stationary, and considering only the stationary pixels a background model is developed. Subsequently, a local thresholding technique is used to extract objects and discard shadows.
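The background-modelling step can be sketched as a per-pixel stationarity test over the first few frames. The variance and difference thresholds below are illustrative, and a single global threshold stands in for the local thresholding that the thesis uses to separate objects from shadows.

```python
import numpy as np

def build_background(frames, var_thresh=4.0):
    """Classify each pixel as stationary or non-stationary from its
    temporal variance over the first few frames, and take the per-pixel
    median as the background model."""
    stack = np.stack([f.astype(float) for f in frames])
    stationary = stack.var(axis=0) < var_thresh
    bg = np.median(stack, axis=0)
    return bg, stationary

def foreground_mask(frame, bg, thresh=20.0):
    """Global-threshold stand-in for the local thresholding step that
    extracts foreground objects."""
    return np.abs(frame.astype(float) - bg) > thresh
```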
After successfully detecting all the foreground objects, two different algorithms are proposed for tracking the objects and updating the background model. The first algorithm suggests a centroid searching technique, where the centroid in the current frame is estimated from the previous frame. Its accuracy is verified by comparing the entropy of the dual-tree complex wavelet coefficients in the bounding boxes of both frames. If the estimation becomes inaccurate, a dynamic window is utilized to search for the accurate centroid. The second algorithm updates the background using a randomized updating scheme.
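The centroid-estimation step can be sketched as a constant-velocity prediction from the previous two centroids. The entropy-based verification over dual-tree complex wavelet coefficients and the dynamic fallback window are not reproduced here; this shows only the prediction the thesis then verifies.

```python
import numpy as np

def centroid(mask):
    """Centroid (row, col) of a binary object mask."""
    rows, cols = np.nonzero(mask)
    return rows.mean(), cols.mean()

def predict_centroid(prev, curr):
    """Constant-velocity estimate of the next-frame centroid from the
    centroids in the two most recent frames."""
    return (2 * curr[0] - prev[0], 2 * curr[1] - prev[1])
```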
Both stages of the proposed tracking model are simulated with various recorded videos. Simulation results are compared with recent schemes to show the superiority of the model.