82 research outputs found
Multi-scale Traffic Pattern Bank for Cross-city Few-shot Traffic Forecasting
Traffic forecasting is crucial for intelligent transportation systems (ITS),
aiding in efficient resource allocation and effective traffic control. However,
its effectiveness often relies heavily on abundant traffic data, while many
cities lack sufficient data due to limited device support, posing a significant
challenge for traffic forecasting. Recognizing this challenge, we have made a
noteworthy observation: traffic patterns exhibit similarities across diverse
cities. Building on this key insight, we propose a solution for the cross-city
few-shot traffic forecasting problem called Multi-scale Traffic Pattern Bank
(MTPB). MTPB first learns from data-rich source cities, acquiring
comprehensive traffic knowledge through a spatial-temporal-aware pre-training
process. It then applies clustering to the learned knowledge to generate a
multi-scale traffic pattern bank. The traffic data of the data-scarce target
city can then query the pattern bank to aggregate meta-knowledge, which serves
as a robust guide for subsequent graph reconstruction and forecasting.
Empirical assessments conducted on
real-world traffic datasets affirm the superior performance of MTPB, surpassing
existing methods across various categories and exhibiting numerous attributes
conducive to the advancement of cross-city few-shot forecasting methodologies.
The code is available at https://github.com/zhyliu00/MTPB.
Comment: Under review. Text overlap with arXiv:2308.0972
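The bank-building and querying steps described above can be sketched as follows. This is an illustrative toy version (simple k-means clustering and similarity-weighted aggregation over assumed pattern vectors), not the authors' implementation:

```python
import numpy as np

def build_pattern_bank(source_patterns, k, iters=50, seed=0):
    """Cluster source-city traffic patterns into k centroids (toy k-means)."""
    rng = np.random.default_rng(seed)
    bank = source_patterns[rng.choice(len(source_patterns), k, replace=False)]
    for _ in range(iters):
        # assign each pattern to its nearest centroid
        dists = ((source_patterns[:, None] - bank[None]) ** 2).sum(-1)
        labels = dists.argmin(1)
        for j in range(k):
            members = source_patterns[labels == j]
            if len(members):
                bank[j] = members.mean(0)
    return bank

def query_bank(target_pattern, bank, top_m=3):
    """Aggregate meta-knowledge as a similarity-weighted mix of the
    top-m most similar bank entries (hypothetical aggregation rule)."""
    sims = bank @ target_pattern / (
        np.linalg.norm(bank, axis=1) * np.linalg.norm(target_pattern) + 1e-8)
    idx = np.argsort(sims)[-top_m:]
    w = np.exp(sims[idx])
    w /= w.sum()
    return (w[:, None] * bank[idx]).sum(0)
```

In the actual method, the pattern bank is multi-scale and the aggregated meta-knowledge feeds graph reconstruction; the sketch only shows the query-and-aggregate mechanic.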
METransformer: Radiology Report Generation by Transformer with Multiple Learnable Expert Tokens
In clinical scenarios, multi-specialist consultation could significantly
benefit the diagnosis, especially for intricate cases. This inspires us to
explore a "multi-expert joint diagnosis" mechanism to upgrade the existing
"single expert" framework commonly seen in the current literature. To this end,
we propose METransformer, a method to realize this idea with a
transformer-based backbone. The key design of our method is the introduction of
multiple learnable "expert" tokens into both the transformer encoder and
decoder. In the encoder, each expert token interacts with both vision tokens
and other expert tokens, learning to attend to different image regions for
image representation. These expert tokens are encouraged to capture complementary
information by an orthogonal loss that minimizes their overlap. In the decoder,
each attended expert token guides the cross-attention between input words and
visual tokens, thus influencing the generated report. A metrics-based expert
voting strategy is further developed to generate the final report. Through the
multi-expert concept, our model enjoys the merits of an ensemble-based
approach in a manner that is more computationally efficient and supports more
sophisticated interactions among experts. Experimental results
demonstrate the promising performance of our proposed model on two widely used
benchmarks. Last but not least, the framework-level innovation makes our work
ready to incorporate advances on existing "single-expert" models to further
improve its performance.
Comment: Accepted by CVPR202
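The orthogonal loss that encourages expert tokens to capture complementary information can be sketched as a standard orthogonality penalty on their pairwise cosine similarities. This is a common formulation offered for illustration; the paper's exact loss may differ:

```python
import torch

def expert_orthogonality_loss(expert_tokens):
    """Penalize overlap among expert tokens by driving their pairwise
    cosine similarities toward zero.

    expert_tokens: tensor of shape (num_experts, dim)
    """
    e = torch.nn.functional.normalize(expert_tokens, dim=-1)
    gram = e @ e.t()  # pairwise cosine similarities
    eye = torch.eye(e.size(0), device=e.device)
    # average squared off-diagonal similarity
    return ((gram - eye) ** 2).sum() / (e.size(0) * (e.size(0) - 1))
```

The loss is zero when the expert tokens are mutually orthogonal and grows as they collapse onto the same direction, which pushes each expert toward a distinct view of the image.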
Q2ATransformer: Improving Medical VQA via an Answer Querying Decoder
Medical Visual Question Answering (VQA) systems play a supporting role in
understanding clinic-relevant information carried by medical images. The
questions to a medical image fall into two categories: close-end (such as
Yes/No questions) and open-end. To obtain answers, the majority of existing
medical VQA methods rely on classification approaches, while a few works attempt to use
generation approaches or a mixture of the two. The classification approaches
are relatively simple but perform poorly on long open-end questions. To bridge
this gap, in this paper, we propose a new Transformer-based framework for
medical VQA (named Q2ATransformer), which integrates the advantages of both
the classification and the generation approaches and provides a unified
treatment for the close-end and open-end questions. Specifically, we introduce
an additional Transformer decoder with a set of learnable candidate answer
embeddings to query the existence of each answer class to a given
image-question pair. Through the Transformer attention, the candidate answer
embeddings interact with the fused features of the image-question pair to make
the decision. In this way, despite being a classification-based approach, our
method provides a mechanism to interact with the answer information for
prediction like the generation-based approaches. On the other hand, by
classification, we mitigate the task difficulty by reducing the search space of
answers. Our method achieves new state-of-the-art performance on two medical
VQA benchmarks. In particular, for the open-end questions, we achieve 79.19%
on VQA-RAD and 54.85% on PathVQA, with 16.09% and 41.45% absolute
improvements, respectively.
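The answer-querying decoder described above can be sketched with learnable candidate answer embeddings cross-attending to fused image-question features. The dimensions and module layout here are illustrative assumptions, not the authors' implementation:

```python
import torch
import torch.nn as nn

class AnswerQueryDecoder(nn.Module):
    """Minimal sketch: each learnable candidate answer embedding queries the
    fused image-question features via cross-attention, and a shared head
    scores whether that answer class applies."""

    def __init__(self, num_answers, dim, heads=4):
        super().__init__()
        self.answer_emb = nn.Parameter(torch.randn(num_answers, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.score = nn.Linear(dim, 1)

    def forward(self, fused_feats):
        # fused_feats: (batch, seq, dim) fused image-question features
        b = fused_feats.size(0)
        queries = self.answer_emb.unsqueeze(0).expand(b, -1, -1)
        attended, _ = self.cross_attn(queries, fused_feats, fused_feats)
        return self.score(attended).squeeze(-1)  # (batch, num_answers) logits
```

This keeps the classification framing (one logit per candidate answer) while letting each answer embedding interact with the input features, which is the hybrid behavior the abstract describes.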
Sacroiliac screws fixation navigated with three-dimensional printing personalized guide template for the treatment of posterior pelvic ring injury: A case report
Objective: Pelvic injuries refer to the disruption of the inherent structural and mechanical integrity of the pelvic ring. The sacroiliac screw fixation technique is often used for the treatment of posterior pelvic ring injury, but it is prone to iatrogenic injury. Various approaches have been proposed to avoid iatrogenic injuries, yet the procedures involved are usually cumbersome. The patient-personalized guide template based on 3D printing technology has been considered a promising method that can achieve lower deviation and higher accuracy in a simple and convenient way. We report the first case of posterior pelvic ring injury treated using a 3D-printed personalized guide template with the verification of intraoperative CT.
Methods: The subject was a 74-year-old female with posterior pelvic ring injury. Two patient-specific guide templates were customized based on 3D printing technology, one for S1 and the other for S2. We used the guide templates for navigation to place the sacroiliac screws. The placement of the screws was verified by intraoperative CT. Intraoperative and postoperative variables were collected.
Results: The technique helped us successfully insert the sacroiliac screws into the safe zone. The intraoperative blood loss was 23.03 ml, and the duration of the operation was 62 min. The exposure dose during CT scanning was 7.025 mSv. The assessment of screw position was excellent. Furthermore, there was no sign of any functional impairment postoperatively.
Conclusion: Sacroiliac screw fixation with the assistance of a 3D-printed personalized guide template under the verification of intraoperative CT may be a promising method to treat posterior pelvic ring injuries.
Retrieval-augmented Multi-modal Chain-of-Thoughts Reasoning for Large Language Models
The advancement of Large Language Models (LLMs) has brought substantial
attention to the Chain of Thought (CoT) approach, primarily due to its ability
to enhance the capability of LLMs on complex reasoning tasks. Moreover, the
significance of CoT approaches extends to the application of LLMs for
multi-modal tasks. However, the selection of optimal CoT demonstration examples
in multi-modal reasoning remains less explored for LLMs due to the inherent
complexity of multi-modal examples. In this paper, we introduce a novel
approach that addresses this challenge by using retrieval mechanisms to
dynamically and automatically select demonstration examples based on
cross-modal and intra-modal similarities. Furthermore, we employ a Stratified
Sampling method that categorises demonstration examples into groups based on
their types and then retrieves examples from each group to promote the
diversity of demonstration examples. Through a series of
experiments on two popular benchmark datasets: ScienceQA and MathVista, we
demonstrate that our approach significantly improves the performance of GPT-4
by 6% on ScienceQA and 12.9% on MathVista, and enhances the performance of
GPT-4V on two datasets by 2.7%, substantially improving the performance of the
most advanced LLMs and LMMs for complex multi-modal reasoning tasks.
Comment: Work in progress
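The stratified retrieval step can be sketched as follows. This is an illustrative toy version (cosine similarity over assumed example embeddings, one pick per type group), not the paper's code:

```python
import numpy as np

def stratified_retrieve(query_emb, pool_embs, pool_types, per_group=1):
    """Group the candidate demonstration pool by example type, then pick the
    most similar examples within each group so the selected demonstrations
    stay diverse across types."""
    sims = pool_embs @ query_emb / (
        np.linalg.norm(pool_embs, axis=1) * np.linalg.norm(query_emb) + 1e-8)
    selected = []
    for t in sorted(set(pool_types)):
        # indices belonging to this type group, ranked by similarity
        idx = [i for i, pt in enumerate(pool_types) if pt == t]
        ranked = sorted(idx, key=lambda i: sims[i], reverse=True)
        selected.extend(ranked[:per_group])
    return selected
```

Picking per group rather than globally prevents the demonstration set from being dominated by a single example type, which is the diversity effect the abstract describes.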
HumanRecon: Neural Reconstruction of Dynamic Human Using Geometric Cues and Physical Priors
Recent methods for dynamic human reconstruction have attained promising
reconstruction results. Most of these methods rely only on RGB color
supervision without considering explicit geometric constraints. This makes
existing human reconstruction techniques more prone to overfitting to color
and causes inherent geometric ambiguities, especially in the sparse multi-view
setup.
Motivated by recent advances in the field of monocular geometry prediction,
we consider the geometric constraints of estimated depth and normals in the
learning of neural implicit representation for dynamic human reconstruction. As
a geometric regularization, this provides reliable yet explicit supervision
information, and improves reconstruction quality. We also exploit several
beneficial physical priors, such as adding noise to the view direction and
maximizing the density on the human surface. These priors ensure that the
color rendered along rays is robust to view direction and reduce the inherent
ambiguities of the density estimated along rays. Experimental results demonstrate
that depth and normal cues, predicted by human-specific monocular estimators,
can provide effective supervision signals and render more accurate images.
Finally, we also show that the proposed physical priors significantly reduce
overfitting and improve the overall quality of novel view synthesis. Our code
is available at https://github.com/PRIS-CV/HumanRecon
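The view-direction noise prior can be sketched very simply: jitter each unit view direction with Gaussian noise and re-normalize, so the predicted color cannot depend too sharply on the exact viewing angle. The noise scale here is an assumed hyperparameter, and this is a sketch of the idea rather than the authors' implementation:

```python
import numpy as np

def perturb_view_dirs(view_dirs, sigma=0.1, seed=None):
    """Add Gaussian noise to unit view directions and re-normalize.

    view_dirs: array of shape (..., 3), assumed unit-norm
    sigma: assumed noise scale (hyperparameter)
    """
    rng = np.random.default_rng(seed)
    noisy = view_dirs + rng.normal(scale=sigma, size=view_dirs.shape)
    return noisy / np.linalg.norm(noisy, axis=-1, keepdims=True)
```

During training, the color network would be queried with the perturbed directions, regularizing the rendered color to be robust to small view changes.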
CBLab: Supporting the Training of Large-scale Traffic Control Policies with Scalable Traffic Simulation
Traffic simulation provides interactive data for the optimization of traffic
control policies. However, existing traffic simulators are limited by their
lack of scalability and a shortage of input data, which prevents them from
generating interactive simulation data for real large-scale city road
networks.
In this paper, we present City Brain Lab (CBLab), a toolkit for scalable
traffic simulation. CBLab consists of three components:
CBEngine, CBData, and CBScenario. CBEngine is a highly efficient simulator
supporting large-scale traffic simulation. CBData includes a traffic dataset
with road network data of 100 cities all around the world. We also develop a
pipeline to conduct a one-click transformation from raw road networks to input
data of our traffic simulation. Combining CBEngine and CBData allows
researchers to run scalable traffic simulations in the road network of real
large-scale cities. Based on these, CBScenario implements an interactive
environment and a benchmark for each of two traffic control scenarios, with
which traffic control policies adaptable to large-scale urban traffic can be
trained and tuned. To the best of our knowledge, CBLab is
the first infrastructure supporting traffic control policy optimization in
large-scale urban scenarios. CBLab has supported the City Brain Challenge @ KDD
CUP 2021. The project is available on
GitHub:~\url{https://github.com/CityBrainLab/CityBrainLab.git}.Comment: Accepted by KDD2023 (Applied Data Science Track
A Systematic Evaluation of GPT-4V's Multimodal Capability for Medical Image Analysis
This work conducts an evaluation of GPT-4V's multimodal capability for
medical image analysis, with a focus on three representative tasks of radiology
report generation, medical visual question answering, and medical visual
grounding. For the evaluation, a set of prompts is designed for each task to
induce the corresponding capability of GPT-4V to produce sufficiently good
outputs. Three evaluation methods (quantitative analysis, human evaluation,
and case study) are employed to achieve an in-depth and extensive
evaluation. Our evaluation shows that GPT-4V excels in understanding medical
images and is able to generate high-quality radiology reports and effectively
answer questions about medical images. Meanwhile, it is found that its
performance for medical visual grounding needs to be substantially improved. In
addition, we observe the discrepancy between the evaluation outcome from
quantitative analysis and that from human evaluation. This discrepancy suggests
the limitations of conventional metrics in assessing the performance of large
language models like GPT-4V and the necessity of developing new metrics for
automatic quantitative analysis.