Generalizable Industrial Visual Anomaly Detection with Self-Induction Vision Transformer
Industrial visual anomaly detection plays a critical role in advanced
intelligent manufacturing, but several limitations still need to be
addressed in this context. First, existing reconstruction-based methods
struggle with the identity mapping of trivial shortcuts, where the
reconstruction error gap between normal and abnormal samples is small,
leading to inferior detection capabilities. Second, previous studies mainly
concentrated on convolutional neural network (CNN) models that capture the
local semantics of objects but neglect the global context, which also results in
inferior performance. Moreover, existing studies follow an individual learning
fashion in which each detection model handles only one product
category, while generalizable detection for multiple categories has not been
explored. To tackle the above limitations, we propose a self-induction vision
Transformer (SIVT) for unsupervised, generalizable multi-category industrial
visual anomaly detection and localization. The proposed SIVT first extracts
discriminative features from a pre-trained CNN as property descriptors. Then, the
self-induction vision Transformer reconstructs the extracted
features in a self-supervised fashion, with auxiliary induction tokens
additionally introduced to induce the semantics of the original signal.
Finally, abnormal properties are detected using the semantic feature
residual difference. We evaluated SIVT on existing MVTec AD
benchmarks; the results show that the proposed method advances
state-of-the-art detection performance, with improvements of 2.8-6.3 in AUROC
and 3.3-7.6 in AP.
Comment: 8 pages, 6 figures
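The final step of the abstract above, detecting anomalies from the feature residual, can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not say which norm or pooling SIVT uses, so the channel-wise L2 norm, the max pooling for an image-level score, and the function names `anomaly_map` and `image_score` are all hypothetical choices. The reconstructed features are simulated here rather than produced by a Transformer.

```python
import numpy as np

def anomaly_map(features, reconstructed):
    """Per-location anomaly score from a feature residual.

    Both inputs have shape (C, H, W). The L2 norm over channels is an
    assumed choice; the abstract only mentions a "residual difference".
    """
    residual = features - reconstructed
    return np.sqrt((residual ** 2).sum(axis=0))  # shape (H, W)

def image_score(score_map):
    """Image-level score taken as the maximum of the map (assumed pooling)."""
    return float(score_map.max())

# Toy data: random "CNN features" and a reconstruction that is perfect
# everywhere except one spatial location, which mimics a defect.
rng = np.random.default_rng(0)
features = rng.normal(size=(8, 4, 4))
reconstructed = features.copy()
reconstructed[:, 2, 1] += 3.0  # poor reconstruction at row 2, column 1

amap = anomaly_map(features, reconstructed)
peak = np.unravel_index(amap.argmax(), amap.shape)
print("most anomalous location:", peak)
print("image-level score:", image_score(amap))
```

In this toy setup the map is zero wherever the reconstruction matches the input and peaks at the single perturbed location, which is the behaviour a localization map is expected to show.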
Latent Class Model with Application to Speaker Diarization
In this paper, we apply a latent class model (LCM) to the task of speaker
diarization. LCM is similar to Patrick Kenny's variational Bayes (VB) method in
that it uses soft information and avoids premature hard decisions in its
iterations. In contrast to the VB method, which is based on a generative model,
LCM provides a framework allowing both generative and discriminative models.
The discriminative property is realized through the use of i-vector (Ivec),
probabilistic linear discriminant analysis (PLDA), and a support vector
machine (SVM) in this work. Systems denoted as LCM-Ivec-PLDA, LCM-Ivec-SVM, and
LCM-Ivec-Hybrid are introduced. In addition, three further improvements are
applied to enhance performance: 1) adding neighbor windows to extract more
speaker information for each short segment; 2) using a hidden Markov model to
avoid frequent speaker change points; and 3) using agglomerative hierarchical
clustering for initialization and to provide hard and soft priors, in order to
overcome sensitivity to initialization. Experiments on the National
Institute of Standards and Technology Rich Transcription 2009 speaker
diarization database, under the condition of a single distant microphone, show
that the diarization error rate (DER) of the proposed methods shows substantial
relative improvements over mainstream systems. Compared to the VB
method, the relative improvements of LCM-Ivec-PLDA, LCM-Ivec-SVM, and
LCM-Ivec-Hybrid systems are 23.5%, 27.1%, and 43.0%, respectively. Experiments
on our collected database, CALLHOME97, CALLHOME00 and SRE08 short2-summed trial
conditions also show that the proposed LCM-Ivec-Hybrid system has the best
overall performance.
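The second improvement listed in the abstract above, using a hidden Markov model to avoid frequent speaker change points, can be illustrated with a small Viterbi smoother. This is a minimal sketch under assumptions, not the paper's system: the "sticky" transition matrix, the `stay_prob` value, the uniform initial distribution, and the function name `viterbi_smooth` are hypothetical, and the per-segment speaker posteriors are made-up toy numbers.

```python
import numpy as np

def viterbi_smooth(log_post, stay_prob=0.9):
    """Most likely speaker sequence under a sticky HMM.

    log_post: array of shape (T, S) holding per-segment log scores for
    S >= 2 speakers. stay_prob is an assumed self-transition probability;
    the remaining mass is split evenly among the other speakers.
    """
    T, S = log_post.shape
    log_trans = np.full((S, S), np.log((1.0 - stay_prob) / (S - 1)))
    np.fill_diagonal(log_trans, np.log(stay_prob))

    delta = log_post[0].copy()          # uniform prior, constant dropped
    back = np.zeros((T, S), dtype=int)  # best predecessor for each state
    for t in range(1, T):
        cand = delta[:, None] + log_trans  # cand[i, j]: move from i to j
        back[t] = cand.argmax(axis=0)
        delta = cand.max(axis=0) + log_post[t]

    path = np.empty(T, dtype=int)
    path[-1] = delta.argmax()
    for t in range(T - 1, 0, -1):
        path[t - 1] = back[t, path[t]]
    return path

# Toy posteriors for 2 speakers over 6 short segments. Segment 2 weakly
# favours speaker 1, which would create two spurious change points.
post = np.array([[0.9, 0.1], [0.8, 0.2], [0.4, 0.6],
                 [0.9, 0.1], [0.8, 0.2], [0.9, 0.1]])
raw = post.argmax(axis=1)
smoothed = viterbi_smooth(np.log(post))
print("per-segment decisions:", raw.tolist())
print("HMM-smoothed decisions:", smoothed.tolist())
```

With these toy numbers, the isolated weak vote for speaker 1 is overridden because two speaker switches cost more under the sticky transitions than the small gain in segment score.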