A Diffusion model for POI recommendation
Next Point-of-Interest (POI) recommendation is a critical task in
location-based services that aims to provide personalized suggestions for the
user's next destination. Previous works on POI recommendation have focused
on modeling the user's spatial preference. However, existing works that
leverage spatial information rely only on aggregating users'
previously visited positions, which discourages the model from recommending POIs
in novel areas. This limitation of position-based methods harms the model's
performance in many situations. Additionally, incorporating sequential
information into the user's spatial preference remains a challenge. In this
paper, we propose Diff-POI: a Diffusion-based model that samples the user's
spatial preference for the next POI recommendation. Inspired by the wide
application of diffusion algorithm in sampling from distributions, Diff-POI
encodes the user's visiting sequence and spatial character with two
tailor-designed graph encoding modules, followed by a diffusion-based sampling
strategy to explore the user's spatial visiting trends. We leverage the
diffusion process and its reverse form to sample from the posterior
distribution and optimize the corresponding score function. We design a joint
training and inference framework to optimize and evaluate the proposed
Diff-POI. Extensive experiments on four real-world POI recommendation datasets
demonstrate the superiority of our Diff-POI over state-of-the-art baseline
methods. Further ablation and parameter studies on Diff-POI reveal the
functionality and effectiveness of the proposed diffusion-based sampling
strategy for addressing the limitations of existing methods.
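The diffusion-based sampling idea described above can be illustrated with annealed Langevin dynamics driven by a score function. The sketch below is not the Diff-POI model: it uses a toy 2-D Gaussian whose score is known in closed form as a stand-in for the learned score network, and all names and noise levels are illustrative assumptions.

```python
import numpy as np

def score(x, sigma):
    # Score (grad log density) of N(0, I) smoothed with noise level sigma.
    # In Diff-POI this would be a trained network, not a closed form.
    return -x / (1.0 + sigma ** 2)

def langevin_sample(steps=200, step_size=0.05, sigmas=(1.0, 0.5, 0.1), seed=0):
    rng = np.random.default_rng(seed)
    x = rng.normal(size=2) * 5.0           # start far from the mode
    for sigma in sigmas:                   # anneal the noise level downward
        eps = step_size * sigma ** 2
        for _ in range(steps):
            noise = rng.normal(size=2)
            # Langevin update: drift along the score plus injected noise
            x = x + 0.5 * eps * score(x, sigma) + np.sqrt(eps) * noise
    return x

sample = langevin_sample()
print(sample.shape)  # (2,)
```

Each annealing stage shrinks both the step size and the injected noise, so the chain first explores broadly and then settles near high-density regions of the target distribution.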
TaleCrafter: Interactive Story Visualization with Multiple Characters
Accurate story visualization requires several necessary elements, such as
identity consistency across frames, the alignment between plain text and visual
content, and a reasonable layout of objects in images. Most previous works
endeavor to meet these requirements by fitting a text-to-image (T2I) model on a
set of videos in the same style and with the same characters, e.g., the
FlintstonesSV dataset. However, the learned T2I models typically struggle to
adapt to new characters, scenes, and styles, and often lack the flexibility to
revise the layout of the synthesized images. This paper proposes a system for
generic interactive story visualization, capable of handling multiple novel
characters and supporting the editing of layout and local structure. It is
developed by leveraging the prior knowledge of large language and T2I models,
trained on massive corpora. The system comprises four interconnected
components: story-to-prompt generation (S2P), text-to-layout generation (T2L),
controllable text-to-image generation (C-T2I), and image-to-video animation
(I2V). First, the S2P module converts concise story information into detailed
prompts required for subsequent stages. Next, T2L generates diverse and
reasonable layouts based on the prompts, offering users the ability to adjust
and refine the layout to their preference. The core component, C-T2I, enables
the creation of images guided by layouts, sketches, and actor-specific
identifiers to maintain consistency and detail across visualizations. Finally,
I2V enriches the visualization process by animating the generated images.
Extensive experiments and a user study are conducted to validate the
effectiveness and flexibility of interactive editing of the proposed system.
Comment: Github repository: https://github.com/VideoCrafter/TaleCrafte
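The four-stage pipeline described in this abstract (S2P, T2L, C-T2I, I2V) can be sketched as plain function composition. The stubs below are hypothetical stand-ins, not the TaleCrafter implementation; stage names follow the abstract, while their bodies and data shapes are assumptions for illustration only.

```python
# Hypothetical sketch of the S2P -> T2L -> C-T2I -> I2V pipeline.

def story_to_prompts(story):             # S2P: expand story into per-scene prompts
    return [f"detailed prompt for: {scene}" for scene in story]

def text_to_layout(prompt):              # T2L: propose object bounding boxes
    return [{"object": "hero", "box": (0.1, 0.2, 0.4, 0.8)}]

def controllable_t2i(prompt, layout):    # C-T2I: render an image from prompt + layout
    return {"prompt": prompt, "layout": layout, "pixels": None}

def image_to_video(image):               # I2V: animate the still image
    return [image] * 8                   # stand-in: 8 identical "frames"

def visualize(story):
    clips = []
    for prompt in story_to_prompts(story):
        layout = text_to_layout(prompt)
        image = controllable_t2i(prompt, layout)
        clips.append(image_to_video(image))
    return clips

clips = visualize(["The hero enters the cave.", "A dragon appears."])
print(len(clips), len(clips[0]))  # 2 clips, 8 frames each
```

The point of the composition is that each stage's output (prompts, then layouts, then images) is user-editable before the next stage runs, which is what makes the system interactive.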
A Survey on Deep Learning in Medical Image Analysis
Deep learning algorithms, in particular convolutional networks, have rapidly
become a methodology of choice for analyzing medical images. This paper reviews
the major deep learning concepts pertinent to medical image analysis and
summarizes over 300 contributions to the field, most of which appeared in the
last year. We survey the use of deep learning for image classification, object
detection, segmentation, registration, and other tasks and provide concise
overviews of studies per application area. Open challenges and directions for
future research are discussed.
Comment: Revised survey includes expanded discussion section and reworked
introductory section on common deep architectures. Added missed papers from
before Feb 1st 201
ERNIE-ViLG 2.0: Improving Text-to-Image Diffusion Model with Knowledge-Enhanced Mixture-of-Denoising-Experts
Recent progress in diffusion models has revolutionized the popular technology
of text-to-image generation. While existing approaches could produce
photorealistic high-resolution images with text conditions, there are still
several open problems to be solved, which limit the further improvement of
image fidelity and text relevancy. In this paper, we propose ERNIE-ViLG 2.0, a
large-scale Chinese text-to-image diffusion model, which progressively upgrades
the quality of generated images by: (1) incorporating fine-grained textual and
visual knowledge of key elements in the scene, and (2) utilizing different
denoising experts at different denoising stages. With the proposed mechanisms,
ERNIE-ViLG 2.0 not only achieves the state of the art on MS-COCO with a
zero-shot FID score of 6.75, but also significantly outperforms recent models in
image fidelity and image-text alignment, according to side-by-side human
evaluation on the bilingual prompt set ViLG-300.
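The mixture-of-denoising-experts idea, using different denoisers at different denoising stages, can be sketched as routing each diffusion timestep to one of several expert networks. The sketch below is not ERNIE-ViLG 2.0's code: the uniform contiguous split of timesteps and the stand-in experts are assumptions for illustration.

```python
T = 1000           # total diffusion timesteps (assumed)
NUM_EXPERTS = 10   # number of denoising experts (ERNIE-ViLG 2.0 uses 10)

def expert_for_timestep(t, num_experts=NUM_EXPERTS, total_steps=T):
    # Contiguous uniform split: early (noisy) steps and late (refinement)
    # steps are handled by different experts.
    return min(t * num_experts // total_steps, num_experts - 1)

def denoise(x_t, t, experts):
    # Dispatch the denoising step for timestep t to its assigned expert.
    return experts[expert_for_timestep(t)](x_t, t)

# Stand-in experts: each just scales its input differently.
experts = [lambda x, t, k=k: x * (1 - 0.001 * k) for k in range(NUM_EXPERTS)]

print(expert_for_timestep(0), expert_for_timestep(999))  # 0 9
```

Because each expert only ever sees its own slice of timesteps, the experts can specialize (e.g., coarse structure early, fine detail late) without increasing the cost of any single denoising step.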