Discriminative Triad Matching and Reconstruction for Weakly Referring Expression Grounding
In this paper, we tackle the weakly supervised referring expression grounding
task, which localizes a referent object in an image according to a query
sentence when the mapping between image regions and queries is not available
during training. Traditional methods pick out the object region that best
matches the referring expression and then reconstruct the query sentence from
the selected region, using the reconstruction difference as the loss for
back-propagation. Existing methods, however, conduct both the matching and the
reconstruction only approximately, as they ignore the fact that the matching
correctness is unknown. To overcome this limitation, we design a discriminative
triad as the basis of our solution, through which a query can be converted into
one or multiple discriminative triads in a highly scalable way. Building on the
discriminative triad, we further propose triad-level matching and
reconstruction modules that are lightweight yet effective for weakly supervised
training, making our method three times lighter and faster than previous
state-of-the-art methods. One important merit of our work is its superior
performance despite the simple and neat design. Specifically, the proposed
method achieves new state-of-the-art accuracy on the RefCOCO (39.21%),
RefCOCO+ (39.18%), and RefCOCOg (43.24%) datasets, which is 4.17%, 4.08%, and
7.8% higher than the previous best, respectively.
Comment: TPAM
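As background, the generic match-then-reconstruct training signal that this line of work builds on can be sketched in a few lines of numpy. Everything below is an illustrative stand-in (the feature dimensions, the cosine scoring, and the linear decoder are invented for the sketch), not the paper's triad-level modules:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: 5 candidate region features and one query embedding,
# both 8-dimensional (shapes are illustrative only).
regions = rng.standard_normal((5, 8))
query = rng.standard_normal(8)

# Matching: score each region against the query (cosine similarity here),
# softened with a softmax so the selection stays differentiable.
scores = regions @ query / (np.linalg.norm(regions, axis=1) * np.linalg.norm(query))
weights = np.exp(scores) / np.exp(scores).sum()

# Reconstruction: rebuild the query from the attended region feature via a
# hypothetical linear decoder; the reconstruction error is the training loss,
# so no region-query ground-truth mapping is ever needed.
decoder = rng.standard_normal((8, 8)) * 0.1
attended = weights @ regions
reconstruction = attended @ decoder
loss = float(np.mean((reconstruction - query) ** 2))
print(loss)
```

The point of the sketch is the weak-supervision loop itself: the only learning signal is how well the selected region can regenerate the query.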
EfficientTrain: Exploring Generalized Curriculum Learning for Training Visual Backbones
The superior performance of modern deep networks usually comes with a costly
training procedure. This paper presents a new curriculum learning approach for
the efficient training of visual backbones (e.g., vision Transformers). Our
work is inspired by the inherent learning dynamics of deep networks: we
experimentally show that at an earlier training stage, the model mainly learns
to recognize some 'easier-to-learn' discriminative patterns within each
example, e.g., the lower-frequency components of images and the original
information before data augmentation. Driven by this phenomenon, we propose a
curriculum in which the model always leverages all the training data at each
epoch, but starts by exposing only the 'easier-to-learn' patterns of each
example and gradually introduces more difficult patterns. To
implement this idea, we 1) introduce a cropping operation in the Fourier
spectrum of the inputs, which enables the model to learn from only the
lower-frequency components efficiently, 2) demonstrate that exposing the
features of original images amounts to adopting weaker data augmentation, and
3) integrate 1) and 2) and design a curriculum learning schedule with a
greedy-search algorithm. The resulting approach, EfficientTrain, is simple,
general, yet surprisingly effective. As an off-the-shelf method, it reduces the
wall-time training cost of a wide variety of popular models (e.g., ResNet,
ConvNeXt, DeiT, PVT, Swin, and CSWin) by >1.5x on ImageNet-1K/22K without
sacrificing accuracy. It is also effective for self-supervised learning (e.g.,
MAE). Code is available at https://github.com/LeapLabTHU/EfficientTrain.
Comment: ICCV 202
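The low-frequency cropping of step 1) can be sketched as a plain FFT round-trip: crop the centered spectrum and invert, yielding a smaller image that keeps only low-frequency content. The centered crop and the rescaling constant below are our assumptions for the sketch, not the paper's exact implementation:

```python
import numpy as np

def low_freq_crop(image: np.ndarray, bandwidth: int) -> np.ndarray:
    """Crop the centered 2-D Fourier spectrum to bandwidth x bandwidth and
    invert, returning a smaller image with only low-frequency components."""
    spectrum = np.fft.fftshift(np.fft.fft2(image))
    h, w = spectrum.shape
    top = (h - bandwidth) // 2
    left = (w - bandwidth) // 2
    cropped = spectrum[top:top + bandwidth, left:left + bandwidth]
    # Rescale so pixel intensities keep a comparable range after downsizing.
    cropped = cropped * (bandwidth * bandwidth) / (h * w)
    return np.real(np.fft.ifft2(np.fft.ifftshift(cropped)))

# Example: a 32x32 image reduced to its 16x16 low-frequency core; feeding
# such smaller inputs early in training is what saves wall-time.
img = np.random.default_rng(0).standard_normal((32, 32))
small = low_freq_crop(img, 16)
print(small.shape)  # (16, 16)
```

Because the output is genuinely smaller, early epochs process fewer pixels per example, which is where the reported >1.5x wall-time saving comes from.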
Understanding of Visual Domains via the Lens of Natural Language
A joint understanding of vision and language can enable intelligent systems to perceive, act, and communicate with humans for a wide range of applications. For example, they can assist a human to navigate in an environment, edit the content of an image through natural language commands, or search through image collections using natural language queries. In this thesis, we aim to improve our understanding of visual domains through the lens of natural language. We specifically look into (1) images of categories within a fine-grained taxonomy such as species of birds or variants of aircraft, (2) images of textures that describe local color, shape, and patterns, and (3) regions in images that correspond to objects, materials, and textures.
In one line of work, we investigate ways to discover a domain-specific language by asking annotators to describe visual differences between instances within a fine-grained taxonomy. We show that a system trained to describe these differences leads to an accurate and interpretable basis for categorization. In another line of work, we investigate the effectiveness of language and vision models for describing textures, a problem that, despite the ubiquity of textures, has not been sufficiently studied in the literature. Textures are diverse, yet their local nature allows them to describe the appearance of a wide range of visual categories. This locality also allows us to systematically generate synthetic variations to investigate how disentangled visual representations are with respect to properties such as shape, color, and figure-ground segmentation. Finally, instead of modeling an image as a whole, we design a system that allows descriptions of regions within an image. A challenge is to handle the long-tail distribution of names and appearances of concepts within natural scenes. We design a modular framework that integrates object detection, semantic segmentation, and contextual reasoning with language, leading to better performance. In addition to methods and analysis, we contribute datasets and benchmarks to evaluate the performance of models in each of these domains.
The availability of large-scale pre-trained models for vision (e.g., ResNet) and language (e.g., BERT) has catalyzed improvements and novel applications in computer vision and natural language processing, but until recently similar models that could jointly reason about language and vision were not available. This has changed with the availability of models such as CLIP, which have been trained on a massive number of images with associated texts. We therefore analyze the effectiveness of CLIP-based representations for the tasks posed in our earlier work. By comparing and contrasting these with the domain-specific representations presented in the earlier chapters, we shed some light on the nature of the learned representations and the biases they encode.
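As a concrete reference point for the CLIP-based analysis above, CLIP-style zero-shot scoring reduces to cosine similarity between L2-normalized image and text embeddings, followed by a temperature-scaled softmax. The toy 4-dimensional embeddings and texture-prompt names below are invented for illustration:

```python
import numpy as np

def zero_shot_probs(image_emb: np.ndarray, text_embs: np.ndarray,
                    temperature: float = 100.0) -> np.ndarray:
    """Score one image embedding against K text-prompt embeddings the way
    CLIP-style models do: cosine similarity, scaled, then softmax."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = temperature * (txt @ img)
    exp = np.exp(logits - logits.max())  # subtract max for stability
    return exp / exp.sum()

# Toy embeddings for three hypothetical texture prompts.
texts = np.array([[1.0, 0.0, 0.0, 0.0],   # "striped"
                  [0.0, 1.0, 0.0, 0.0],   # "dotted"
                  [0.0, 0.0, 1.0, 0.0]])  # "woven"
image = np.array([0.9, 0.1, 0.0, 0.0])    # closest to "striped"
probs = zero_shot_probs(image, texts)
print(int(np.argmax(probs)))  # 0
```

With real CLIP embeddings, the same scoring applies unchanged; the analysis then asks how such generic representations compare with the domain-specific ones trained in the earlier chapters.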
A New Deep Architecture for Image Segmentation in a Photolithography Inspection System
Ph.D. dissertation, Graduate School of Convergence Science and Technology (Intelligent Convergence Systems major), Seoul National University, August 2021.
In semiconductor manufacturing, defect detection is critical for maintaining high yield. Typically, defects on a semiconductor wafer are generated during the manufacturing process. Most computer vision systems used in semiconductor photolithography process inspection still rely on image processing algorithms, which often produce inspection faults due to their sensitivity to changes in the external environment. We therefore aim to tackle this problem by combining the advantages of image processing algorithms and deep learning.
In this dissertation, we propose the Image Segmentation Detector (ISD) to extract enhanced feature maps in situations where the training dataset is limited to a specific industrial domain, such as semiconductor photolithography inspection. ISD serves as a novel backbone network for the state-of-the-art Mask R-CNN image segmentation framework. ISD consists of four dense blocks and four transition layers. In particular, each dense block in ISD has shortcut connections and concatenates the feature maps produced in each layer, with a dynamic growth rate for greater compactness. ISD is trained from scratch, without the recently popular transfer learning approach. Additionally, ISD is trained on an image dataset pre-processed with our custom-designed image filter so that the Convolutional Neural Network (CNN) extracts better enhanced feature maps. One of ISD's key design principles, compactness, plays a critical role in meeting real-time requirements and in deployment on resource-bounded devices.
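The channel bookkeeping implied by the dense blocks above (each layer's output concatenated onto the running feature stack) can be sketched in a few lines; the layer counts and growth rates below are invented for illustration, and the "dynamic growth rate" is modeled simply as a per-layer rate:

```python
# Sketch of DenseNet-style channel growth inside one dense block, which the
# ISD dense blocks appear to follow: every layer's output is concatenated
# with all preceding feature maps.
def dense_block_channels(in_channels: int, growth_rates: list[int]) -> list[int]:
    """Return the channel count seen at the input of each successive layer
    when every layer's output is concatenated onto the running stack."""
    seen = [in_channels]
    for k in growth_rates:
        seen.append(seen[-1] + k)
    return seen

# A fixed growth rate (as in vanilla DenseNet) vs. a shrinking, per-layer
# ("dynamic") rate, which keeps the final channel count smaller.
print(dense_block_channels(16, [12, 12, 12, 12]))  # [16, 28, 40, 52, 64]
print(dense_block_channels(16, [12, 8, 6, 4]))     # [16, 28, 36, 42, 46]
```

The shrinking-rate variant illustrates how a dynamic growth rate can trade a modest amount of width for the compactness the abstract emphasizes.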
To empirically validate the model, this dissertation uses real images obtained from the computer vision system embedded in currently operating semiconductor manufacturing equipment. ISD consistently achieves better results than state-of-the-art methods on standard mean average precision, the most common metric for measuring instance detection accuracy. Notably, ISD outperforms the baseline DenseNet while requiring only 1/4 of its parameters. We also observe that ISD achieves comparable or better performance than ResNet with only 1/268 of the parameters, using no extra data or pre-trained models. Our experimental results show that ISD can be useful to many future image segmentation research efforts across the diverse fields of the semiconductor industry, which require real-time, high-quality performance from only a limited training dataset.
Chapter 1. Introduction
1.1. Background and Motivation
Chapter 2. Related Work
2.1. Inspection Method
2.2. Instance Segmentation
2.3. Backbone Structure
2.4. Enhanced Feature Map
2.5. Detection Performance Evaluation
2.6. Learning Network Model from Scratch
Chapter 3. Proposed Method
3.1. ISD Architecture
3.2. Pre-processing
3.3. Model Training
3.4. Training Objective
3.5. Settings and Configurations
Chapter 4. Experimental Evaluation
4.1. Classification Results on ISD
4.2. Comparison with Pre-processing
4.3. Image Segmentation Results on ISD
4.3.1. Results on Suck-back State
4.3.2. Results on Dispensing State
4.4. Comparison with State-of-the-art Methods
Chapter 5. Conclusion
Bibliography
Abstract (in Korean)
Knowledge Augmented Machine Learning with Applications in Autonomous Driving: A Survey
The existence of representative datasets is a prerequisite for many successful artificial intelligence and machine learning models. However, the subsequent application of these models often involves scenarios that are inadequately represented in the data used for training. The reasons for this are manifold and range from time and cost constraints to ethical considerations. As a consequence, the reliable use of these models, especially in safety-critical applications, is a huge challenge. Leveraging additional, already existing sources of knowledge is key to overcoming the limitations of purely data-driven approaches, and eventually to increasing the generalization capability of these models. Furthermore, predictions that conform with knowledge are crucial for making trustworthy and safe decisions, even in underrepresented scenarios. This work provides an overview of existing techniques and methods in the literature that combine data-based models with existing knowledge. The identified approaches are structured according to the categories of integration, extraction, and conformity. Special attention is given to applications in the field of autonomous driving.
Learning with delayed reinforcement in an exploratory probabilistic logic neural network