One for All, All for One: Learning and Transferring User Embeddings for Cross-Domain Recommendation
Cross-domain recommendation is an important method to improve recommender
system performance, especially when observations in target domains are sparse.
However, most existing techniques focus on single-target or dual-target
cross-domain recommendation (CDR) and are hard to generalize to CDR with
multiple target domains. In addition, the negative transfer problem is
prevalent in CDR, where the recommendation performance in a target domain may
not always be enhanced by knowledge learned from a source domain, especially
when the source domain has sparse data. In this study, we propose CAT-ART, a
multi-target CDR method that learns to improve recommendations in all
participating domains through representation learning and embedding transfer.
Our method consists of two parts: a self-supervised Contrastive AuToencoder
(CAT) framework to generate global user embeddings based on information from
all participating domains, and an Attention-based Representation Transfer (ART)
framework which transfers domain-specific user embeddings from other domains to
assist with target domain recommendation. CAT-ART boosts the recommendation
performance in any target domain through the combined use of the learned global
user representation and knowledge transferred from other domains, in addition
to the original user embedding in the target domain. We conducted extensive
experiments on a collected real-world CDR dataset spanning 5 domains and
involving a million users. Experimental results demonstrate the superiority of
the proposed method over a range of prior arts. We further conducted ablation
studies to verify the effectiveness of the proposed components. Our collected
dataset will be open-sourced to facilitate future research in the field of
multi-domain recommender systems and user modeling. Comment: 9 pages, accepted by WSDM 202
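The attention-based transfer step described above can be illustrated with a minimal sketch. It assumes scaled dot-product attention and simple concatenation of the three embeddings; the function names `attention_transfer` and `combine` are hypothetical, and the abstract does not specify the actual scoring function or fusion used by CAT-ART.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention_transfer(target_emb, other_domain_embs):
    """Weight other domains' user embeddings by their scaled dot-product
    similarity to the target-domain embedding (an assumed scoring rule)
    and return the weights plus the weighted sum."""
    scale = math.sqrt(len(target_emb))
    weights = softmax([dot(target_emb, e) / scale for e in other_domain_embs])
    transferred = [
        sum(w * e[i] for w, e in zip(weights, other_domain_embs))
        for i in range(len(target_emb))
    ]
    return weights, transferred

def combine(target_emb, global_emb, transferred_emb):
    # Assumed fusion: concatenate the original target-domain embedding,
    # the global user embedding, and the transferred embedding.
    return target_emb + global_emb + transferred_emb

target = [1.0, 0.0]
others = [[1.0, 0.0], [0.0, 1.0]]   # embeddings of the same user in two other domains
weights, transferred = attention_transfer(target, others)
fused = combine(target, [0.5, 0.5], transferred)
```

In this toy case the first source domain, whose embedding matches the target, receives the larger attention weight.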
ShapeClipper: Scalable 3D Shape Learning from Single-View Images via Geometric and CLIP-based Consistency
We present ShapeClipper, a novel method that reconstructs 3D object shapes
from real-world single-view RGB images. Instead of relying on laborious 3D,
multi-view or camera pose annotation, ShapeClipper learns shape reconstruction
from a set of single-view segmented images. The key idea is to facilitate shape
learning via CLIP-based shape consistency, where we encourage objects with
similar CLIP encodings to share similar shapes. We also leverage off-the-shelf
normals as an additional geometric constraint so the model can learn better
bottom-up reasoning of detailed surface geometry. These two novel consistency
constraints, when used to regularize our model, improve its ability to learn
both global shape structure and local geometric details. We evaluate our method
over three challenging real-world datasets, Pix3D, Pascal3D+, and OpenImages,
where we achieve superior performance over state-of-the-art methods. Comment: Accepted to CVPR 2023, project website at
https://zixuanh.com/projects/shapeclipper.htm
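The idea that objects with similar CLIP encodings should share similar shapes can be sketched as a pairwise penalty. This is only an illustration under stated assumptions: shapes are represented by hypothetical latent vectors (`shape_codes`), similarity is cosine similarity, and the `threshold` value is invented for the example. The abstract does not give the actual loss used by ShapeClipper.

```python
import math

def cosine(a, b):
    # Cosine similarity between two non-zero vectors.
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den

def clip_consistency_loss(clip_embs, shape_codes, threshold=0.8):
    """For every pair whose CLIP encodings are at least `threshold` similar,
    penalize the squared distance between their shape codes, weighted by
    the similarity. Returns the mean over selected pairs (0.0 if none)."""
    total, pairs = 0.0, 0
    n = len(clip_embs)
    for i in range(n):
        for j in range(i + 1, n):
            sim = cosine(clip_embs[i], clip_embs[j])
            if sim >= threshold:
                dist = sum((x - y) ** 2 for x, y in zip(shape_codes[i], shape_codes[j]))
                total += sim * dist
                pairs += 1
    return total / pairs if pairs else 0.0

clip_embs = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
shape_codes = [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]]
loss = clip_consistency_loss(clip_embs, shape_codes)
```

Only the first two objects are similar enough in CLIP space to be compared; the third object's very different shape code does not contribute to the penalty.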
ZipLoRA: Any Subject in Any Style by Effectively Merging LoRAs
Methods for finetuning generative models for concept-driven personalization
generally achieve strong results for subject-driven or style-driven generation.
Recently, low-rank adaptations (LoRA) have been proposed as a
parameter-efficient way of achieving concept-driven personalization. While
recent work explores the combination of separate LoRAs to achieve joint
generation of learned styles and subjects, existing techniques do not reliably
address the problem; they often compromise either subject fidelity or style
fidelity. We propose ZipLoRA, a method to cheaply and effectively merge
independently trained style and subject LoRAs in order to achieve generation of
any user-provided subject in any user-provided style. Experiments on a wide
range of subject and style combinations show that ZipLoRA can generate
compelling results with meaningful improvements over baselines in subject and
style fidelity while preserving the ability to recontextualize. Project page:
https://ziplora.github.io
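To make concrete what merging two LoRAs means, here is a minimal sketch of a plain weighted arithmetic merge, in which each LoRA contributes a low-rank update B·A to a base weight matrix. This is the naive kind of combination, not ZipLoRA's method, which the abstract does not detail. The function names and the scalar weights `w_subject` and `w_style` are assumptions for illustration.

```python
def matmul(a, b):
    # Plain-Python matrix product of two nested lists.
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

def lora_delta(B, A):
    # A LoRA stores a low-rank update as two factors; the update is B @ A.
    return matmul(B, A)

def merge_loras(W0, subject, style, w_subject=1.0, w_style=1.0):
    """Naive merge: add both low-rank updates to the base weights,
    each scaled by a single scalar coefficient."""
    d_sub = lora_delta(*subject)
    d_sty = lora_delta(*style)
    return [
        [W0[i][j] + w_subject * d_sub[i][j] + w_style * d_sty[i][j]
         for j in range(len(W0[0]))]
        for i in range(len(W0))
    ]

W0 = [[0.0, 0.0], [0.0, 0.0]]
subject_lora = ([[1.0], [0.0]], [[1.0, 2.0]])   # rank-1 factors (B, A)
style_lora = ([[0.0], [1.0]], [[3.0, 4.0]])
merged = merge_loras(W0, subject_lora, style_lora)
```

Here the two updates touch different rows, so they do not interfere. When updates overlap, a naive sum like this can trade off one concept against the other, which is the failure mode the abstract attributes to existing techniques.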
Perceptually inspired image estimation and enhancement
Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Brain and Cognitive Sciences, 2009. Includes bibliographical references (p. 137-144).
In this thesis, we present three image estimation and enhancement algorithms inspired by human vision. In the first part of the thesis, we propose an algorithm for mapping one image to another based on the statistics of a training set. Many vision problems can be cast as image mapping problems, such as estimating reflectance from luminance, estimating shape from shading, and separating signal from noise. Such problems are typically under-constrained, and yet humans are remarkably good at solving them. Classic computational theories about the ability of the human visual system to solve such under-constrained problems attribute this feat to the use of some intuitive regularities of the world, e.g., surfaces tend to be piecewise constant. In recent years, there has been considerable interest in deriving more sophisticated statistical constraints from natural images, but because of the high-dimensional nature of images, representing and utilizing the learned models remains a challenge. Our techniques produce models that are very easy to store and to query. We show these techniques to be effective for a number of applications: removing noise from images, estimating a sharp image from a blurry one, decomposing an image into reflectance and illumination, and interpreting lightness illusions.
In the second part of the thesis, we present an algorithm for compressing the dynamic range of an image while retaining important visual detail. The human visual system confronts a serious challenge with dynamic range, in that the physical world has an extremely high dynamic range, while neurons have low dynamic ranges. The human visual system performs dynamic range compression by applying automatic gain control, in both the retina and the visual cortex. Taking inspiration from that, we designed techniques that involve multi-scale subband transforms and smooth gain control on subband coefficients, and resemble the contrast gain control mechanism in the visual cortex. We show our techniques to be successful in producing dynamic-range-compressed images without compromising the visibility of detail or introducing artifacts. We also show that the techniques can be adapted for the related problem of "companding", in which a high dynamic range image is converted to a low dynamic range image and saved using fewer bits, and later expanded back to high dynamic range with minimal loss of visual quality.
In the third part of the thesis, we propose a technique that enables a user to easily localize image and video editing by drawing a small number of rough scribbles. Image segmentation, usually treated as an unsupervised clustering problem, is extremely difficult to solve. With a minimal degree of user supervision, however, we are able to generate selection masks of good quality. Our technique learns a classifier using the user-scribbled pixels as training examples, and uses the classifier to classify the rest of the pixels into distinct classes. It then uses the classification results as per-pixel data terms and combines them with a smoothness term that respects color discontinuities, generating better results than state-of-the-art algorithms for interactive segmentation.
by Yuanzhen Li. Ph.D
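The smooth gain control on subband coefficients mentioned in the second part can be illustrated with a one-dimensional sketch. It assumes a single band of coefficients, a box-filtered absolute value as the local activity estimate, and a power-law gain; the parameter names `gamma`, `eps`, `delta`, and `radius` and their values are hypothetical, and the thesis itself applies the idea within a multi-scale subband transform that is not reproduced here.

```python
def activity_map(coeffs, radius=1):
    # Local activity: mean absolute coefficient value in a small window,
    # which keeps the gain smooth across neighboring coefficients.
    n = len(coeffs)
    out = []
    for i in range(n):
        lo, hi = max(0, i - radius), min(n, i + radius + 1)
        window = coeffs[lo:hi]
        out.append(sum(abs(c) for c in window) / len(window))
    return out

def gain_control(coeffs, gamma=0.6, eps=1e-3, delta=1.0, radius=1):
    """Scale each coefficient by ((activity + eps) / delta) ** (gamma - 1).
    With gamma < 1, regions of high activity are attenuated and regions
    of low activity are boosted, compressing the range of magnitudes."""
    act = activity_map(coeffs, radius)
    return [c * ((a + eps) / delta) ** (gamma - 1.0) for c, a in zip(coeffs, act)]

strong = gain_control([10.0] * 5)[2]    # center of a high-contrast band
weak = gain_control([0.1] * 5)[2]       # center of a low-contrast band
negative = gain_control([-10.0] * 5)[2]
```

The input magnitudes differ by a factor of 100, while the outputs differ by a much smaller factor, and the sign of each coefficient is preserved.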
Background Prompting for Improved Object Depth
Estimating the depth of objects from a single image is a valuable task for
many vision, robotics, and graphics applications. However, current methods
often fail to produce accurate depth for objects in diverse scenes. In this
work, we propose a simple yet effective Background Prompting strategy that
adapts the input object image with a learned background. We learn the
background prompts only using small-scale synthetic object datasets. To infer
object depth on a real image, we place the segmented object into the learned
background prompt and run off-the-shelf depth networks. Background Prompting
helps the depth networks focus on the foreground object, as they are made
invariant to background variations. Moreover, Background Prompting minimizes
the domain gap between synthetic and real object images, leading to better
sim2real generalization than simple finetuning. Results on multiple synthetic
and real datasets demonstrate consistent improvements in real object depths for
a variety of existing depth networks. Code and optimized background prompts can
be found at: https://mbaradad.github.io/depth_prompt
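The inference step described above, placing the segmented object into the learned background before running a depth network, amounts to mask-based compositing. A minimal sketch follows, assuming grayscale images stored as nested lists with values in [0, 1] and a mask in which 1.0 marks object pixels. The function name `composite` is hypothetical, and neither the learned background prompt nor the depth network is modeled here.

```python
def composite(obj, mask, background):
    """Blend per pixel: keep the object where mask is 1.0 and show the
    background where mask is 0.0; fractional values give soft edges."""
    return [
        [m * o + (1.0 - m) * b for o, m, b in zip(orow, mrow, brow)]
        for orow, mrow, brow in zip(obj, mask, background)
    ]

obj = [[0.9, 0.9], [0.9, 0.9]]          # segmented object image
mask = [[1.0, 0.0], [0.5, 0.0]]         # segmentation mask
background = [[0.2, 0.2], [0.2, 0.2]]   # stand-in for a learned background prompt
prompted = composite(obj, mask, background)
# `prompted` would then be passed to an off-the-shelf depth network.
```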
Hi-LASSIE: High-Fidelity Articulated Shape and Skeleton Discovery from Sparse Image Ensemble
Automatically estimating 3D skeleton, shape, camera viewpoints, and part
articulation from sparse in-the-wild image ensembles is a severely
under-constrained and challenging problem. Most prior methods rely on
large-scale image datasets, dense temporal correspondence, or human annotations
like camera pose, 2D keypoints, and shape templates. We propose Hi-LASSIE,
which performs 3D articulated reconstruction from only 20-30 online images in
the wild without any user-defined shape or skeleton templates. We follow the
recent work of LASSIE that tackles a similar problem setting and make two
significant advances. First, instead of relying on a manually annotated 3D
skeleton, we automatically estimate a class-specific skeleton from the selected
reference image. Second, we improve the shape reconstructions with novel
instance-specific optimization strategies that allow reconstructions to
faithfully fit each instance while preserving the class-specific priors
learned across all images. Experiments on in-the-wild image ensembles show that
Hi-LASSIE obtains higher-fidelity, state-of-the-art 3D reconstructions despite
requiring minimal user input