Toward the Generalization of Vision Models
In the field of computer vision, the ability to generalize is essential for ensuring robust and reliable performance across diverse datasets and real-world scenarios. Generalization refers to a computer vision model's capability to accurately interpret and process new, unseen data beyond the specific examples it was trained on. Real-world visual data can vary widely in lighting conditions, backgrounds, viewpoints, occlusions, and other factors, making generalization crucial for effective performance. A computer vision system with strong generalization ability can identify objects, recognize patterns, and make accurate predictions even when presented with variations it has not encountered during training. Achieving robust generalization requires careful algorithm design, training data that captures real-world variability, regularization techniques to prevent overfitting, and rigorous evaluation methodologies. Without it, computer vision models may perform well on training data yet fail on new, unseen data, limiting their real-world applicability.
In this dissertation, I present my research efforts to enhance the generalization ability of vision models from several perspectives. First, I focus on designing vision algorithms that generalize to novel viewpoints and occlusions, tackling challenges commonly encountered in real-world scenarios. Second, I investigate the architectural differences between vision transformers and convolutional neural networks (CNNs) to better understand their impact on generalization, particularly with respect to out-of-distribution data and adversarial attacks. Third, I propose a novel self-supervised learning algorithm, Point-Level Region Contrast for object detection pre-training, aimed at providing strong backbone features that generalize across various downstream tasks. Additionally, I introduce efficient masked autoencoding techniques for enhanced representation learning. Finally, I propose Sequential Modeling Enables Scalable Learning for Large Vision Models, a methodology designed to effectively leverage large volumes of real-world visual data, thereby further enhancing generalization. Through these contributions, I aim to advance the field toward more generalized and robust vision models with broad applicability across diverse real-world scenarios.
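As a rough illustration of the masked autoencoding idea mentioned above, the sketch below masks a large fraction of image patches, encodes only the visible ones, and reconstructs the rest. It is a minimal PyTorch toy under assumed shapes (196 flattened 16x16x3 patches); the class name TinyMAE and all hyperparameters are hypothetical, not the dissertation's implementation.

import torch
import torch.nn as nn

class TinyMAE(nn.Module):
    """Toy masked autoencoder: encode visible patches, reconstruct masked ones."""
    def __init__(self, dim=128, mask_ratio=0.75):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.patch_embed = nn.Linear(768, dim)           # 16x16x3 patch -> token
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.decoder = nn.Linear(dim, 768)               # token -> pixel patch

    def forward(self, patches):                          # patches: (B, N, 768)
        B, N, _ = patches.shape
        keep = int(N * (1 - self.mask_ratio))
        vis_idx = torch.rand(B, N).argsort(1)[:, :keep]  # random visible subset
        tokens = self.patch_embed(patches)
        idx = vis_idx[..., None].expand(-1, -1, tokens.size(-1))
        encoded = self.encoder(torch.gather(tokens, 1, idx))
        full = self.mask_token.expand(B, N, -1).clone()  # mask token everywhere,
        full.scatter_(1, idx, encoded)                   # then restore visible ones
        recon = self.decoder(full)
        masked = torch.ones(B, N, dtype=torch.bool)
        masked.scatter_(1, vis_idx, False)
        return ((recon - patches) ** 2)[masked].mean()   # loss on masked patches only

loss = TinyMAE()(torch.randn(2, 196, 768))               # toy batch of 2 images
loss.backward()

The efficiency in this family of methods comes from running the expensive encoder on only the small visible subset of patches.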
A combinatorial characterization of the annihilator varieties of highest weight modules for classical Lie algebras
Let $\mathfrak{g}$ be a classical Lie algebra. Let $L(\lambda)$ be a highest weight module of $\mathfrak{g}$ with highest weight $\lambda - \rho$, where $\rho$ is half the sum of positive roots. In 1985, Joseph proved that the associated variety of a primitive ideal is the Zariski closure of a nilpotent orbit in $\mathfrak{g}^*$. In this paper, we give combinatorial characterizations of the annihilator varieties of highest weight modules for classical Lie algebras. In fact, we give two algorithms: the bipartition algorithm and the partition algorithm.
Comment: 40 pages
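For orientation, the result attributed to Joseph above can be written as follows; the notation is a standard reconstruction (the math symbols were stripped from this listing), not a quotation from the paper:

\[
  V\bigl(\operatorname{Ann}_{U(\mathfrak{g})} L(\lambda)\bigr)
    \;=\; \overline{\mathcal{O}}
    \quad\text{for some nilpotent orbit } \mathcal{O} \subseteq \mathfrak{g}^{*},
\]

where $U(\mathfrak{g})$ is the universal enveloping algebra, $\operatorname{Ann}_{U(\mathfrak{g})} L(\lambda)$ is the (primitive) annihilator ideal of the simple highest weight module, and $V(\cdot)$ denotes the associated variety. Since nilpotent orbits of classical Lie algebras are labelled by partitions, algorithms that output a partition or bipartition are a natural form for such a characterization to take.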
Intriguing Properties of Text-guided Diffusion Models
Text-guided diffusion models (TDMs) are widely applied but can fail
unexpectedly. Common failures include: (i) natural-looking text prompts
generating images with the wrong content, or (ii) different random samples of
the latent variables generating vastly different, even unrelated, outputs
despite being conditioned on the same text prompt. In this work, we aim
to study and understand the failure modes of TDMs in more detail. To achieve
this, we propose SAGE, an adversarial attack on TDMs that uses image
classifiers as surrogate loss functions, to search over the discrete prompt
space and the high-dimensional latent space of TDMs to automatically discover
unexpected behaviors and failure cases in the image generation. We make several
technical contributions to ensure that SAGE finds failure cases of the
diffusion model, rather than the classifier, and verify this in a human study.
Our study reveals four intriguing properties of TDMs that have not been
systematically studied before: (1) We find a variety of natural text prompts
producing images that fail to capture the semantics of input texts. We
categorize these failures into ten distinct types based on the underlying
causes. (2) We find samples in the latent space (which are not outliers) that
lead to distorted images independent of the text prompt, suggesting that parts
of the latent space are not well-structured. (3) We also find latent samples
that lead to natural-looking images which are unrelated to the text prompt,
implying a potential misalignment between the latent and prompt spaces. (4) By
appending a single adversarial token embedding to an input prompt we can
generate a variety of specified target objects, while only minimally affecting
the CLIP score. This demonstrates the fragility of language representations and
raises potential safety concerns.
Comment: Code will be available at: https://github.com/qihao067/SAGE
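To make the latent-space search concrete, here is a heavily simplified sketch (our own PyTorch illustration, not the released code) of gradient-based optimization of the latent against a surrogate image classifier. It assumes a differentiable sampler and omits prompt conditioning and the discrete prompt-space search; the paper's technical contributions address precisely the approximations a real diffusion sampler requires.

import torch
import torch.nn.functional as F

def find_latent_failure(sampler, classifier, target_class, steps=50, lr=0.05):
    # Search the latent space for a sample whose generated image the
    # surrogate classifier no longer assigns to the expected class.
    z = torch.randn(1, 4, 64, 64, requires_grad=True)    # assumed latent shape
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        img = sampler(z)                                  # differentiable sampler assumed
        loss = -F.cross_entropy(classifier(img),          # maximize the classifier's
                                torch.tensor([target_class]))  # loss on the target class
        opt.zero_grad()
        loss.backward()
        opt.step()
    return z.detach()

# Toy stand-ins so the sketch runs end to end; a real attack would plug in
# the diffusion sampler and a pretrained classifier here.
toy_sampler = lambda z: torch.sigmoid(F.interpolate(z[:, :3], size=(224, 224)))
toy_classifier = torch.nn.Sequential(torch.nn.Flatten(),
                                     torch.nn.Linear(3 * 224 * 224, 10))
z_fail = find_latent_failure(toy_sampler, toy_classifier, target_class=3)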
CoKe: Localized Contrastive Learning for Robust Keypoint Detection
Today's most popular approaches to keypoint detection involve very complex
network architectures that aim to learn holistic representations of all
keypoints. In this work, we take a step back and ask: Can we simply learn a
local keypoint representation from the output of a standard backbone
architecture? This will help make the network simpler and more robust,
particularly if large parts of the object are occluded. We demonstrate that
this is possible by looking at the problem from the perspective of
representation learning. Specifically, the keypoint kernels must be chosen to
optimize three types of distances in the feature space: features of the same
keypoint should be similar to each other, differ from those of other keypoints,
and be distinct from features of the background clutter. We formulate this
optimization within a supervised contrastive learning framework, which we call
CoKe. CoKe makes several approximations to enable representation learning on
large datasets: in particular, we introduce a clutter bank to approximate
non-keypoint features, and a momentum update to compute the keypoint
representations while the feature extractor is being trained. Our experiments
show that CoKe achieves state-of-the-art
results compared to approaches that jointly represent all keypoints
holistically (Stacked Hourglass Networks, MSS-Net) as well as to approaches
that are supervised by detailed 3D object geometry (StarMap). Moreover, CoKe is
robust and performs exceptionally well when objects are partially occluded and
significantly outperforms related work on a range of diverse datasets
(PASCAL3D+, MPII, ObjectNet3D).
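The three-distance objective with a clutter bank and momentum update can be sketched as follows (a PyTorch simplification with hypothetical names, not the authors' implementation):

import torch
import torch.nn.functional as F

def coke_loss(feat, kp_bank, clutter_bank, kp_ids, tau=0.07):
    # feat:         (B, D) features at annotated keypoint locations
    # kp_bank:      (K, D) momentum-averaged representation per keypoint
    # clutter_bank: (M, D) features sampled from background clutter
    # kp_ids:       (B,)   which keypoint each feature belongs to
    feat = F.normalize(feat, dim=1)
    bank = F.normalize(torch.cat([kp_bank, clutter_bank]), dim=1)
    logits = feat @ bank.T / tau
    # Positive: the feature's own keypoint entry (first K columns);
    # negatives: every other keypoint plus every clutter entry.
    return F.cross_entropy(logits, kp_ids)

@torch.no_grad()
def momentum_update(kp_bank, feat, kp_ids, m=0.99):
    # A moving average keeps keypoint representations stable while the
    # feature extractor itself is still being trained.
    for i, k in enumerate(kp_ids.tolist()):
        kp_bank[k] = m * kp_bank[k] + (1 - m) * feat[i]

# Toy usage: 10 keypoints, 64-dim features, 32 clutter entries.
kp_bank, clutter = torch.randn(10, 64), torch.randn(32, 64)
feat = torch.randn(8, 64, requires_grad=True)
ids = torch.randint(0, 10, (8,))
coke_loss(feat, kp_bank, clutter, ids).backward()
momentum_update(kp_bank, feat.detach(), ids)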
Giant Nonreciprocity of Surface Acoustic Waves induced by a positive-negative magnetostrictive heterostructure
Lack of nonreciprocity is one of the major drawbacks of solid-state acoustic
devices, which has hindered the development of microwave-frequency acoustic
isolators and circulators. Here we report giant nonreciprocal transmission of
shear-horizontal surface acoustic waves (SH-SAWs) on a LiTaO3 substrate coated
with a negative-positive magnetostrictive bilayer structure of Ni/Ti/FeCoSiB.
Although the static magnetic moments of the two layers are parallel, SH-SAWs can
excite optical-mode spin waves much more strongly than acoustic-mode ones at
relatively low frequencies via magnetoelastic coupling. The measured magnitude
nonreciprocity exceeds 40 dB (or 80 dB/mm) at 2.333 GHz. In addition, maximum
nonreciprocal phase accumulation reaches 188° (376°/mm), which is desired for an
effective SAW circulator. Our theoretical model and calculations provide insight
into the observed phenomena and demonstrate a pathway for further improvement of
nonreciprocal acoustic devices.
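As a consistency check on the quoted figures, both pairs imply an acoustic path of about 0.5 mm (our inference; the abstract does not state the device length):

\[
  \frac{40\ \mathrm{dB}}{80\ \mathrm{dB/mm}} = 0.5\ \mathrm{mm},
  \qquad
  \frac{188^{\circ}}{376^{\circ}/\mathrm{mm}} = 0.5\ \mathrm{mm}.
\]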
Supply chain finance: what are the challenges in the adoption of blockchain technology?
As an emerging information technology, blockchain has aroused extensive discussion around the world and has been suggested as a solution to current issues in supply chain finance (SCF). The Chinese government also attaches great importance to this technology, and many Chinese state-owned enterprises have invested in establishing their own blockchain research and development centres. However, there is a lack of studies identifying the challenges of deploying this technology, and theoretical frameworks and conceptual expositions are also scarce. Therefore, the aim of this study is to investigate the challenges and obstacles in the adoption of blockchain technology in SCF. An exploratory case study of a Chinese state-owned enterprise was conducted to build an initial conceptual framework. Semi-structured interviews were used to collect data from the case firm's employees, top management, and technical specialists. The results of the analysis indicate that the adoption of blockchain technology faces technological, operational, and other challenges. From a technological perspective, framework identification, cross-chain interoperability, and data governance are major barriers, whereas from an operational perspective, new business processes and transformation across the entire supply chain are identified as challenges. In addition, other obstacles, such as the elimination of jobs and regulatory issues, are not negligible. This study contributes to research on blockchain and supply chains by shedding light on the challenges of blockchain adoption through an exploratory case study of a Chinese state-owned enterprise. A conceptual framework was generated as a basis for future research, and the findings also provide insights for companies that are considering or planning to adopt blockchain technology.