QR-CLIP: Introducing Explicit Open-World Knowledge for Location and Time Reasoning
Daily images may convey abstract meanings that require us to memorize and infer profound information from them. To encourage such human-like reasoning, in this work we teach machines to predict where and when an image was taken, rather than performing basic tasks such as traditional segmentation or classification.
Inspired by Horn's QR theory, we designed a novel QR-CLIP model consisting of two components: 1) the Quantity module first retrieves additional open-world knowledge as candidate language inputs; 2) the Relevance module carefully weighs the vision and language cues and infers the location and time.
Experiments show QR-CLIP's effectiveness: it outperforms the previous SOTA by an average relative lift of about 10% on location reasoning and 130% on time reasoning. This study lays a technical foundation for location and time reasoning and suggests that effectively introducing open-world knowledge is a key remedy for these tasks.
Comment: Technical Report. Github: https://github.com/Shi-Wm/QR-CLI
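To make the two-stage design concrete, below is a minimal, hypothetical NumPy sketch (not the authors' implementation) of the Quantity/Relevance idea: random vectors stand in for real CLIP embeddings, `quantity_retrieve` pulls the top-k open-world knowledge snippets for an image, and `relevance_score` fuses them with the image embedding (here by a simple mean, an assumption) to rank candidate location/time labels. All function names are illustrative.

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Cosine similarity between one vector a and each row of matrix b."""
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return b @ a

def quantity_retrieve(image_emb, knowledge_embs, k=3):
    """Quantity step (hypothetical): retrieve the top-k open-world
    knowledge embeddings most similar to the image embedding."""
    sims = cosine_sim(image_emb, knowledge_embs)
    top = np.argsort(-sims)[:k]
    return knowledge_embs[top], sims[top]

def relevance_score(image_emb, retrieved_embs, label_embs):
    """Relevance step (hypothetical): fuse the image and retrieved
    knowledge by a simple mean, then score candidate labels."""
    fused = np.vstack([image_emb, retrieved_embs]).mean(axis=0)
    return cosine_sim(fused, label_embs)

rng = np.random.default_rng(0)
d = 16
image_emb = rng.normal(size=d)              # stand-in for a CLIP image embedding
knowledge_embs = rng.normal(size=(100, d))  # candidate open-world knowledge texts
label_embs = rng.normal(size=(5, d))        # e.g. five candidate locations

retrieved, _ = quantity_retrieve(image_emb, knowledge_embs, k=3)
scores = relevance_score(image_emb, retrieved, label_embs)
print("predicted label index:", int(np.argmax(scores)))
```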
Stand for Something or Fall for Everything: Predict Misinformation Spread with Stance-Aware Graph Neural Networks
Although the pervasive spread of misinformation on social media platforms has become a pressing challenge, existing platform interventions have shown limited success in curbing its dissemination. In this study, we propose a stance-aware graph neural network (stance-aware GNN) that leverages users' stances to proactively predict misinformation spread. As different user stances can form unique echo chambers, we customize four information passing paths in stance-aware GNN, while the trainable attention weights provide explainability by highlighting each structure's importance. Evaluated on a real-world dataset, stance-aware GNN outperforms benchmarks by 32.65% and exceeds advanced GNNs without user stance by over 4.69%. Furthermore, the attention weights indicate that users' opposition stances have a higher impact on their neighbors' behaviors than supportive ones, functioning as a social correction that halts misinformation propagation. Overall, our study provides an effective predictive model for platforms to combat misinformation and highlights the impact of user stances on misinformation propagation.
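The "four information passing paths" can be illustrated with a small, hypothetical NumPy sketch (not the paper's code): each stance pair (support-to-support, support-to-oppose, oppose-to-support, oppose-to-oppose) gets its own masked adjacency, and softmaxed attention weights combine one round of mean aggregation over each path. The toy graph, features, and weight values below are assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def path_adjacency(adj, stance, src, dst):
    """Mask edges so messages flow only from stance `src` to stance `dst`
    (row = receiver, column = sender, matching the A @ H convention)."""
    mask = np.outer(stance == dst, stance == src)
    return adj * mask

def stance_aware_layer(H, adj, stance, attn_logits):
    """One hypothetical layer: mean-aggregate along the four
    stance-to-stance paths, combined by softmaxed attention weights."""
    paths = [(1, 1), (1, 0), (0, 1), (0, 0)]  # (src, dst): support=1, oppose=0
    alpha = softmax(attn_logits)              # explains each path's importance
    out = np.zeros_like(H)
    for w, (src, dst) in zip(alpha, paths):
        A = path_adjacency(adj, stance, src, dst)
        deg = A.sum(axis=1, keepdims=True)
        deg[deg == 0] = 1.0                   # avoid divide-by-zero for isolated nodes
        out += w * (A @ H) / deg
    return out

# Toy graph: 4 users with one-hot features; stances (1 = support, 0 = oppose).
adj = np.array([[0, 1, 1, 0],
                [1, 0, 0, 1],
                [1, 0, 0, 1],
                [0, 1, 1, 0]], dtype=float)
H = np.eye(4)
stance = np.array([1, 1, 0, 0])
attn_logits = np.array([0.5, 0.1, 1.2, 0.3])  # would be trained in practice

print(stance_aware_layer(H, adj, stance, attn_logits))
```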
Applications of Large Scale Foundation Models for Autonomous Driving
Since the DARPA Grand Challenges (rural) in 2004/05 and the Urban Challenge in 2007, autonomous driving has been one of the most active fields of AI applications. Recently, powered by large language models (LLMs), chat systems such as ChatGPT and PaLM have emerged and rapidly become a promising direction towards artificial general intelligence (AGI) in natural language processing (NLP). A natural thought follows: could these abilities be employed to reformulate autonomous driving? By combining LLMs with foundation models, it is possible to utilize human knowledge, commonsense, and reasoning to rebuild autonomous driving systems and move beyond the current long-tailed AI dilemma. In this paper, we investigate the techniques of foundation models and LLMs applied to autonomous driving, categorized as simulation, world models, data annotation, and planning or end-to-end (E2E) solutions.
Comment: 23 pages. A survey paper.
Image Translation as Diffusion Visual Programmers
We introduce the novel Diffusion Visual Programmer (DVP), a neuro-symbolic
image translation framework. Our proposed DVP seamlessly embeds a
condition-flexible diffusion model within the GPT architecture, orchestrating a
coherent sequence of visual programs (i.e., computer vision models) for various
pro-symbolic steps, which span RoI identification, style transfer, and position
manipulation, facilitating transparent and controllable image translation
processes. Extensive experiments demonstrate DVP's remarkable performance, surpassing concurrent state-of-the-art methods. This success can be attributed to several key features of DVP: First, DVP achieves condition-flexible translation via instance normalization, enabling the model to eliminate sensitivity caused by manual guidance and to focus on textual descriptions for
high-quality content generation. Second, the framework enhances in-context
reasoning by deciphering intricate high-dimensional concepts in feature spaces
into more accessible low-dimensional symbols (e.g., [Prompt], [RoI object]),
allowing for localized, context-free editing while maintaining overall
coherence. Last but not least, DVP improves systemic controllability and
explainability by offering explicit symbolic representations at each
programming stage, empowering users to intuitively interpret and modify
results. Our research marks a substantial step towards harmonizing artificial image translation processes with cognitive intelligence, promising broader applications.
Comment: 25 pages, 20 figures.
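The neuro-symbolic control flow described above can be sketched as a tiny interpreter. In the hypothetical Python example below (not the authors' code), a GPT-style planner is assumed to have already emitted a symbolic program; a registry maps each symbolic step (RoI identification, style transfer, position manipulation) to a stand-in function, and the interpreter threads an explicit editing state through the sequence, which is what makes every intermediate result inspectable and modifiable.

```python
from typing import Callable, Dict, List, Tuple

# Registry of visual-program steps. A real system would call vision
# models here; these stand-ins just record what each step would do.
REGISTRY: Dict[str, Callable] = {}

def register(name: str):
    def deco(fn):
        REGISTRY[name] = fn
        return fn
    return deco

@register("identify_roi")
def identify_roi(state, target: str):
    state["roi"] = f"<mask of {target}>"  # placeholder for a segmentation model
    return state

@register("style_transfer")
def style_transfer(state, prompt: str):
    state["image"] = f"stylized({state['image']}, prompt={prompt!r})"
    return state

@register("move")
def move(state, dx: int, dy: int):
    state["roi_position"] = (dx, dy)
    return state

def run_program(image: str, program: List[Tuple[str, dict]]):
    """Execute a symbolic program step by step; the explicit state at
    each stage is the hook for user inspection and modification."""
    state = {"image": image}
    for op, kwargs in program:
        state = REGISTRY[op](state, **kwargs)
        print(f"after {op}: {state}")
    return state

# A program a GPT-style planner might emit for "make the dog a pencil
# sketch and shift it right" (contents are illustrative only).
program = [
    ("identify_roi", {"target": "dog"}),
    ("style_transfer", {"prompt": "pencil sketch"}),
    ("move", {"dx": 40, "dy": 0}),
]
run_program("input.png", program)
```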
Leveraging Large Models for Crafting Narrative Visualization: A Survey
Narrative visualization effectively transforms data into engaging stories,
making complex information accessible to a broad audience. Large models,
essential for narrative visualization, inherently facilitate this process
through their superior ability to handle natural language queries and answers,
generate cohesive narratives, and enhance visual communication. Inspired by
previous work in narrative visualization and recent advances in large models,
we synthesized potential tasks and opportunities for large models at various
stages of narrative visualization. In our study, we surveyed 79 papers to
explore the role of large models in automating narrative visualization
creation. We propose a comprehensive pipeline that leverages large models for
crafting narrative visualization, categorizing the reviewed literature into
four essential phases: Data, Narration, Visualization, and Presentation.
Additionally, we identify nine specific tasks where large models are applied
across these stages. This study maps out the landscape of challenges and opportunities in the LM4NV (large models for narrative visualization) process, providing insightful directions for future research and valuable guidance for scholars in the field.
Comment: 20 pages, 6 figures, 2 tables.
Image Anything: Towards Reasoning-coherent and Training-free Multi-modal Image Generation
The multifaceted nature of human perception and comprehension indicates that, when we think, our minds can naturally combine any mixture of senses, i.e., modalities, and form a coherent picture in our brain. For example, when we see a cattery and simultaneously perceive a cat's purring sound, our brain can construct a picture of a cat in the cattery. Intuitively, generative AI models should possess the versatility of humans and be capable of generating images from any combination of modalities efficiently and collaboratively. This paper
presents ImgAny, a novel end-to-end multi-modal generative model that can mimic
human reasoning and generate high-quality images. Our method represents the first attempt to efficiently and flexibly take any combination of seven modalities, ranging from language and audio to vision modalities, including image, point cloud, thermal, depth, and event data. Our
key idea is inspired by human-level cognitive processes and involves the
integration and harmonization of multiple input modalities at both the entity
and attribute levels without specific tuning across modalities. Accordingly,
our method brings two novel training-free technical branches: 1) Entity Fusion
Branch ensures the coherence between inputs and outputs. It extracts entity
features from the multi-modal representations powered by our specially
constructed entity knowledge graph; 2) Attribute Fusion Branch adeptly
preserves and processes the attributes. It efficiently amalgamates distinct
attributes from diverse input modalities via our proposed attribute knowledge
graph. Lastly, the entity and attribute features are adaptively fused as the
conditional inputs to the pre-trained Stable Diffusion model for image
generation. Extensive experiments under diverse modality combinations
demonstrate its exceptional capability for visual content creation.
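As a rough illustration of the training-free, two-branch fusion described above, the hypothetical NumPy sketch below (not the released model) averages L2-normalized per-modality entity embeddings in one branch and attribute embeddings in another, then concatenates the two into a single conditioning vector of the kind that could be handed to a pre-trained generator. The embeddings, modality subsets, and mean-based fusion rule are all assumptions for illustration.

```python
import numpy as np

def l2norm(v):
    return v / np.linalg.norm(v)

def fuse_branch(embs: dict) -> np.ndarray:
    """Training-free fusion (hypothetical): average the L2-normalized
    embeddings contributed by whichever modalities are present."""
    stacked = np.stack([l2norm(e) for e in embs.values()])
    return l2norm(stacked.mean(axis=0))

def build_condition(entity_embs: dict, attribute_embs: dict) -> np.ndarray:
    """Concatenate the entity-level and attribute-level fusions into
    one conditioning vector for a pre-trained image generator."""
    return np.concatenate([fuse_branch(entity_embs), fuse_branch(attribute_embs)])

rng = np.random.default_rng(0)
d = 8
# Any subset of the seven modalities may be present; random stand-ins here.
entity_embs = {"image": rng.normal(size=d), "audio": rng.normal(size=d)}
attribute_embs = {"text": rng.normal(size=d), "depth": rng.normal(size=d)}

cond = build_condition(entity_embs, attribute_embs)
print(cond.shape)  # (16,) -- would serve as the conditional input to the generator
```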
When LLMs step into the 3D world: a survey and meta-analysis of 3D tasks via multi-modal Large Language Models
As large language models (LLMs) evolve, their integration with 3D spatial data (3D-LLMs) has seen rapid progress, offering
unprecedented capabilities for understanding and interacting with physical spaces. This survey provides a comprehensive overview of
the methodologies enabling LLMs to process, understand, and generate 3D data. Highlighting the unique advantages of LLMs, such as
in-context learning, step-by-step reasoning, open-vocabulary capabilities, and extensive world knowledge, we underscore their
potential to significantly advance spatial comprehension and interaction within embodied Artificial Intelligence (AI) systems. Our
investigation spans various 3D data representations, from point clouds to Neural Radiance Fields (NeRFs). It examines their
integration with LLMs for tasks such as 3D scene understanding, captioning, question-answering, and dialogue, as well as LLM-based
agents for spatial reasoning, planning, and navigation. The paper also includes a brief review of other methods that integrate 3D and
language. The meta-analysis presented in this paper reveals significant progress yet underscores the necessity for novel approaches
to harness the full potential of 3D-LLMs. Hence, with this paper, we aim to chart a course for future research that explores and expands
the capabilities of 3D-LLMs in understanding and interacting with the complex 3D world. To support this survey, we have established a
project page where papers related to our topic are organized and listed: https://github.com/ActiveVisionLab/Awesome-LLM-3D
Focusing on non-adopters of broadband: A critical realist perspective
Australia is conducting a substantial nationwide implementation of broadband. It is primarily a fixed-line network but includes wireless and satellite networks in more remote areas. The rollout is under the control of the NBN Co, whose goal is to ensure access to fast broadband for all Australians. The key performance indicators are the number of serviceable and activated premises. Recent reports indicate that activation rates for fixed-line broadband are exceeding expectations, despite increased competition from mobile connections. Whilst this is good news, international experience suggests adoption will plateau. We contend that there needs to be more focus on those disenchanted or uninterested “non-users” who are never likely to adopt. We argue for a critical realist perspective to better represent the adoption context and to provide a grounding for better explanations of the causes behind such decisions. We also tentatively suggest possible common-sense strategies to reverse non-adoption.