44 research outputs found
STAR: A Benchmark for Situated Reasoning in Real-World Videos
Reasoning in the real world is not divorced from situations. How to capture
the present knowledge from surrounding situations and perform reasoning
accordingly is crucial and challenging for machine intelligence. This paper
introduces a new benchmark that evaluates the situated reasoning ability via
situation abstraction and logic-grounded question answering for real-world
videos, called Situated Reasoning in Real-World Videos (STAR Benchmark). This
benchmark is built upon the real-world videos associated with human actions or
interactions, which are naturally dynamic, compositional, and logical. The
dataset includes four types of questions, including interaction, sequence,
prediction, and feasibility. We represent the situations in real-world videos
by hyper-graphs connecting extracted atomic entities and relations (e.g.,
actions, persons, objects, and relationships). Besides visual perception,
situated reasoning also requires structured situation comprehension and logical
reasoning. Questions and answers are procedurally generated. The answering
logic of each question is represented by a functional program based on a
situation hyper-graph. We compare various existing video reasoning models and
find that they all struggle on this challenging situated reasoning task. We
further propose a diagnostic neuro-symbolic model that can disentangle visual
perception, situation abstraction, language understanding, and functional
reasoning to understand the challenges of this benchmark.
Comment: NeurIPS
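The abstract above represents each situation as a hyper-graph of atomic entities and relations, with a question's answering logic given by a functional program over that graph. A minimal sketch of the idea in Python; the entity names and the two program operators below are illustrative assumptions, not the benchmark's actual schema or operator set:

```python
# Toy situation hyper-graph plus a two-step functional question program.
# All names (person1, pick_up, ...) are hypothetical placeholders.
from dataclasses import dataclass, field

@dataclass
class SituationHypergraph:
    persons: set = field(default_factory=set)
    objects: set = field(default_factory=set)
    # An action hyper-edge links a person, a verb, and an object.
    actions: list = field(default_factory=list)    # (person, verb, object)
    relations: list = field(default_factory=list)  # (subject, predicate, object)

def filter_actions(graph, verb):
    """Program operator: keep action hyper-edges whose verb matches."""
    return [a for a in graph.actions if a[1] == verb]

def query_object(actions):
    """Program operator: return the object slot of the first match."""
    return actions[0][2] if actions else None

# Example situation: a person picks up a cup, then sits on a chair.
g = SituationHypergraph(
    persons={"person1"},
    objects={"cup", "chair"},
    actions=[("person1", "pick_up", "cup"), ("person1", "sit_on", "chair")],
)

# "Which object did the person pick up?" as a composed functional program.
answer = query_object(filter_actions(g, "pick_up"))
print(answer)  # cup
```

Grounding the program in the hyper-graph this way is what lets a diagnostic model separate perception (building the graph) from reasoning (executing the program).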
A Simple LLM Framework for Long-Range Video Question-Answering
We present LLoVi, a language-based framework for long-range video
question-answering (LVQA). Unlike prior long-range video understanding methods,
which are often costly and require specialized long-range video modeling design
(e.g., memory queues, state-space layers, etc.), our approach uses a
frame/clip-level visual captioner (e.g., BLIP2, LaViLa, LLaVA) coupled with a
Large Language Model (GPT-3.5, GPT-4) leading to a simple yet surprisingly
effective LVQA framework. Specifically, we decompose short and long-range
modeling aspects of LVQA into two stages. First, we use a short-term visual
captioner to generate textual descriptions of short video clips (0.5-8s in
length) densely sampled from a long input video. Afterward, an LLM aggregates
the densely extracted short-term captions to perform long-range temporal
reasoning needed to understand the whole video and answer a question. To
analyze what makes our simple framework so effective, we thoroughly evaluate
various components of our system. Our empirical analysis reveals that the
choice of the visual captioner and LLM is critical for good LVQA performance.
Furthermore, we show that a specialized prompt that asks the LLM first to
summarize the noisy short-term visual captions and then answer a given input
question leads to a significant LVQA performance boost. On EgoSchema, a
benchmark best known for very long-form video question answering, our method
achieves 50.3% accuracy, outperforming the previous best-performing approach by
18.1% (absolute gain). In addition, our approach outperforms the previous
state-of-the-art by 4.1% and 3.1% on NeXT-QA and IntentQA. We also extend LLoVi
to grounded LVQA and show that it outperforms all prior methods on the NeXT-GQA
dataset. We will release our code at https://github.com/CeeZh/LLoVi
Recurrent bladder urothelial carcinoma complicated with primary bladder large cell neuroendocrine carcinoma: a case report and literature review
Objective: To improve the understanding, diagnosis, and treatment of bladder large cell neuroendocrine carcinoma (LCNEC). Methods: A clinical case of bladder LCNEC admitted to our hospital is reported. The epidemiology, prognosis, diagnosis, and treatment of large cell neuroendocrine carcinoma are reviewed, and its management and prognosis are discussed in light of the literature. Results: The female patient was admitted more than 4 years after TURBT, with intermittent hematuria for more than 2 years. She was diagnosed with recurrent bladder cancer and underwent radical cystectomy with hysterectomy. Postoperative pathology revealed high-grade urothelial carcinoma of the bladder neck and large cell neuroendocrine carcinoma of the bladder. The patient recovered well after surgery but refused radiotherapy and chemotherapy and remains under close follow-up. Conclusion: Bladder LCNEC is clinically rare, has distinctive pathological features, is more aggressive than conventional urothelial carcinoma, and carries a poor prognosis. Multimodal treatment combining surgery, chemotherapy, and radiotherapy should be adopted.
Malignant glomus tumor of prostate: A case report
We report an 85-year-old patient with malignant glomus tumor (GT) of the prostate. He presented with urinary frequency for more than 2 years and gross hematuria for 7 days. Computed tomography showed that the prostate was markedly and irregularly enlarged, with an unclear boundary between the prostate and the posterior wall of the bladder. Both kidneys and ureters were dilated. Biochemical examination showed a serum potassium of 7.24 mmol/L and a serum creatinine of 974.6 μmol/L. Transurethral diagnostic resection was performed after homeostasis was restored through several sessions of bedside hemofiltration. The pathological diagnosis was malignant GT. The patient's renal function recovered after bilateral nephrostomy; he refused further treatment and was lost to contact after 9 months. We summarize the clinical and histopathological features of malignant GT of the prostate to help clinicians recognize the disease earlier.
The Changes and Development Direction of Traditional Chinese Villages after Reform and Opening up —Taking Tunpu, Guizhou as an Example
Since the reform and opening up, China's rural areas have changed at a pace almost unprecedented in the history of China and even the world, a process led by the country's rapid economic development. During this period, this pace of development brought great changes
to the ethnic identity, physical space, and cultural structure of traditional villages. On the whole, these changes occurred passively, alongside the large-scale economic development of the entire country. Such passive village change is mainly positive, but it also has negative
aspects: excessive reliance on exogenous economic forces can easily erode the uniqueness of a village's culture, which is not conducive to the village's sustainable development. Traditional Chinese villages should therefore be developed on the basis of the differences between urban
and rural areas and the uniqueness of specific villages. This can meet a village's economic development needs while also preserving the cultural diversity of traditional Chinese villages, thereby preventing the continued destruction of specific villages' unique cultures in the course of
economic development.
Histological inflammation and activation of M2 type macrophages may cause prostate fibrosis
Self-Chained Image-Language Model for Video Localization and Question Answering
Recent studies have shown promising results on utilizing pre-trained
image-language models for video question answering. While these image-language
models can efficiently bootstrap the representation learning of video-language
models, they typically concatenate uniformly sampled video frames as visual
inputs without explicit language-aware, temporal modeling. When only a portion
of a video input is relevant to the language query, such uniform frame sampling
can often lead to missing important visual cues. Although humans often find a
video moment to focus on and rewind the moment to answer questions, training a
query-aware video moment localizer often requires expensive annotations and
high computational costs. To address this issue, we propose Self-Chained Video
Localization-Answering (SeViLA), a novel framework that leverages a single
image-language model (BLIP-2) to tackle both temporal keyframe localization and
QA on videos. SeViLA framework consists of two modules: Localizer and Answerer,
where both are parameter-efficiently fine-tuned from BLIP-2. We chain these
modules for cascaded inference and self-refinement. First, in the forward
chain, the Localizer finds multiple language-aware keyframes in a video, which
the Answerer uses to predict the answer. Second, in the reverse chain, the
Answerer generates keyframe pseudo-labels to refine the Localizer, alleviating
the need for expensive video moment localization annotations. SeViLA
outperforms several strong baselines/previous works on five video QA and event
prediction tasks, and achieves the state-of-the-art in both fine-tuning
(NExT-QA, STAR) and zero-shot (NExT-QA, STAR, How2QA, VLEP) settings. We show a
comprehensive analysis, e.g., the impact of Localizer, comparisons of Localizer
with other temporal localization models, pre-training/self-refinement of
Localizer, and varying the number of keyframes.
Comment: 20 pages; Our code and checkpoints are available at:
https://github.com/Yui010206/SeViL
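The forward and reverse chains described above can be sketched with toy stand-ins for the two fine-tuned BLIP-2 modules. The lexical scoring and keyword answering below are purely illustrative assumptions, not the model's actual behavior:

```python
# Toy sketch of SeViLA's cascaded inference (forward chain) and
# pseudo-label generation (reverse chain).

def localizer_score(frame, question):
    """Stand-in Localizer: crude lexical relevance of a frame caption."""
    return sum(w in frame for w in question.lower().split())

def answerer(keyframes, question):
    """Stand-in Answerer: trivially answers from keyframe content."""
    return "cooking" if any("stove" in f for f in keyframes) else "unknown"

def forward_chain(frames, question, k=2):
    """Localizer selects k language-aware keyframes; Answerer predicts."""
    ranked = sorted(frames, key=lambda f: -localizer_score(f, question))
    keyframes = ranked[:k]
    return keyframes, answerer(keyframes, question)

def reverse_chain(frames, question, gt_answer):
    """Frames that alone yield the correct answer become keyframe
    pseudo-labels for refining the Localizer, so no human moment
    annotations are needed."""
    return [f for f in frames if answerer([f], question) == gt_answer]

frames = ["person at stove", "empty kitchen", "person at stove stirring"]
keyframes, pred = forward_chain(frames, "what is the person doing at the stove")
pseudo = reverse_chain(frames, "what is the person doing", "cooking")
print(pred, len(pseudo))  # cooking 2
```

The key design choice is that one image-language backbone plays both roles, so the Answerer's supervision signal can flow back to the Localizer through the reverse chain.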
