
    STAR: A Benchmark for Situated Reasoning in Real-World Videos

    Reasoning in the real world is not divorced from situations. How to capture present knowledge from surrounding situations and reason accordingly is crucial and challenging for machine intelligence. This paper introduces a new benchmark, Situated Reasoning in Real-World Videos (STAR), that evaluates situated reasoning ability via situation abstraction and logic-grounded question answering on real-world videos. The benchmark is built upon real-world videos of human actions and interactions, which are naturally dynamic, compositional, and logical. The dataset includes four types of questions: interaction, sequence, prediction, and feasibility. We represent the situations in real-world videos by hyper-graphs connecting extracted atomic entities and relations (e.g., actions, persons, objects, and relationships). Beyond visual perception, situated reasoning also requires structured situation comprehension and logical reasoning. Questions and answers are procedurally generated, and the answering logic of each question is represented by a functional program based on a situation hyper-graph. We compare various existing video reasoning models and find that they all struggle on this challenging situated reasoning task. We further propose a diagnostic neuro-symbolic model that disentangles visual perception, situation abstraction, language understanding, and functional reasoning to characterize the challenges of this benchmark.
    Comment: NeurIPS
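
    The abstract's pairing of situation hyper-graphs with functional question programs can be made concrete with a small sketch. The Python below is illustrative only: the `SituationHypergraph` class and the `filter_actions`/`query_objects` primitives are hypothetical names, not the benchmark's actual API; they show how an answer could be computed by composing program primitives over extracted entities, actions, and relations.

    ```python
    # Illustrative sketch only; class and function names are assumptions,
    # not STAR's released data structures or program vocabulary.
    from dataclasses import dataclass, field

    @dataclass
    class SituationHypergraph:
        persons: set = field(default_factory=set)
        objects: set = field(default_factory=set)
        actions: list = field(default_factory=list)    # (verb, person, objects, t_start, t_end)
        relations: list = field(default_factory=list)  # (person, relation, object)

    def filter_actions(graph, verb):
        """Program primitive: keep actions whose verb matches."""
        return [a for a in graph.actions if a[0] == verb]

    def query_objects(actions):
        """Program primitive: collect the objects touched by actions."""
        return {obj for _, _, objs, *_ in actions for obj in objs}

    # Answer "Which object did the person put down?" by composing primitives.
    g = SituationHypergraph(
        persons={"p1"},
        objects={"cup", "book"},
        actions=[("pick_up", "p1", ("cup",), 0.0, 1.2),
                 ("put_down", "p1", ("cup",), 3.4, 4.0)],
    )
    print(query_objects(filter_actions(g, "put_down")))  # {'cup'}
    ```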

    A Simple LLM Framework for Long-Range Video Question-Answering

    We present LLoVi, a language-based framework for long-range video question-answering (LVQA). Unlike prior long-range video understanding methods, which are often costly and require specialized long-range video modeling designs (e.g., memory queues, state-space layers, etc.), our approach uses a frame/clip-level visual captioner (e.g., BLIP2, LaViLa, LLaVA) coupled with a Large Language Model (GPT-3.5, GPT-4), leading to a simple yet surprisingly effective LVQA framework. Specifically, we decompose the short- and long-range modeling aspects of LVQA into two stages. First, we use a short-term visual captioner to generate textual descriptions of short video clips (0.5-8 s in length) densely sampled from a long input video. Afterward, an LLM aggregates the densely extracted short-term captions to perform the long-range temporal reasoning needed to understand the whole video and answer a question. To analyze what makes our simple framework so effective, we thoroughly evaluate various components of our system. Our empirical analysis reveals that the choice of visual captioner and LLM is critical for good LVQA performance. Furthermore, we show that a specialized prompt that asks the LLM to first summarize the noisy short-term visual captions and then answer a given input question leads to a significant LVQA performance boost. On EgoSchema, best known as a benchmark for very long-form video question answering, our method achieves 50.3% accuracy, outperforming the previous best-performing approach by 18.1% (absolute gain). In addition, our approach outperforms the previous state-of-the-art by 4.1% and 3.1% on NExT-QA and IntentQA. We also extend LLoVi to grounded LVQA and show that it outperforms all prior methods on the NExT-GQA dataset. We will release our code at https://github.com/CeeZh/LLoVi
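
    The two-stage decomposition described above is simple enough to sketch end to end. In the Python below, `caption_clip` and `call_llm` are placeholder stubs standing in for a visual captioner (e.g., LaViLa) and an LLM API call; this is an assumed outline of the pipeline, not the paper's released code.

    ```python
    # Minimal sketch of the two-stage LLoVi pipeline; the two stubs are
    # assumptions, not the authors' implementation.
    def caption_clip(clip) -> str:
        """Stand-in for a short-term visual captioner (BLIP2/LaViLa/LLaVA)."""
        raise NotImplementedError

    def call_llm(prompt: str) -> str:
        """Stand-in for an LLM call (e.g., GPT-3.5 or GPT-4)."""
        raise NotImplementedError

    def answer_long_video(clips, question: str) -> str:
        # Stage 1: densely caption short clips (0.5-8 s) sampled from the video.
        captions = [caption_clip(c) for c in clips]
        # Stage 2: the summarize-then-answer prompt the abstract reports as
        # giving a significant accuracy boost over answering directly.
        summary = call_llm(
            "Summarize these clip captions into one description of the video:\n"
            + "\n".join(captions)
        )
        return call_llm(f"Video summary: {summary}\nQuestion: {question}\nAnswer:")
    ```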

    Recurrent bladder urothelial carcinoma complicated with primary bladder large cell neuroendocrine carcinoma: a case report and literature review

    Objective: To improve the understanding, diagnosis, and treatment of bladder large cell neuroendocrine carcinoma (LCNEC). Methods: A clinical case of bladder LCNEC admitted to our hospital is reported, and the epidemiology, prognosis, diagnosis, and treatment of large cell neuroendocrine carcinoma are reviewed and discussed in light of the literature. Results: The female patient was admitted more than 4 years after TURBT, with intermittent hematuria for more than 2 years. She was diagnosed with recurrent bladder cancer and underwent radical cystectomy with hysterectomy. Postoperative pathology showed high-grade urothelial carcinoma of the bladder neck and large cell neuroendocrine carcinoma of the bladder. The patient recovered well after surgery but refused radiotherapy and chemotherapy and remains under close follow-up. Conclusion: Bladder LCNEC is clinically rare, has distinctive pathological features, is more aggressive than conventional urothelial carcinoma, and carries a poor prognosis. Multimodal treatment combining surgery, chemotherapy, and radiotherapy should be adopted.

    Malignant glomus tumor of prostate: A case report

    We report an 85-year-old patient with a malignant glomus tumor (GT) of the prostate. He presented with urinary frequency for more than 2 years and gross hematuria for 7 days. Computed tomography showed a markedly, irregularly enlarged prostate with an unclear boundary between the prostate and the posterior bladder wall; both kidneys and ureters were dilated. Biochemical examination showed a serum potassium of 7.24 mmol/L and a serum creatinine of 974.6 μmol/L. After homeostasis was restored through several sessions of bedside hemofiltration, transurethral diagnostic resection was performed, and the pathological diagnosis was malignant GT. The patient's renal function recovered after bilateral nephrostomy; he refused further treatment and was lost to follow-up after 9 months. We summarize the clinical and histopathological features of malignant GT of the prostate to improve early recognition of this disease by clinicians.

    The Changes and Development Direction of Traditional Chinese Villages after Reform and Opening up —Taking Tunpu, Guizhou as an Example

    Since the reform and opening up, China's rural areas have changed at a pace rarely seen in the history of China, or indeed the world, a process led by the country's rapid economic development. During this period, the speed of China's development brought great changes to the ethnic identity, physical space, and cultural structure of traditional villages. On the whole, these changes occurred passively, following the large-scale economic development of the country as a whole. Such passive village change is mainly positive, but it also has negative aspects: excessive reliance on exogenous economic forces can easily erode the uniqueness of a village's culture, which is not conducive to its sustainable development. Traditional Chinese villages should therefore be developed with attention to the differences between urban and rural areas and to the uniqueness of specific villages. This can not only meet a village's economic development needs but also sustain the cultural diversity of traditional Chinese villages, thereby preventing the continued destruction of specific villages' unique cultures in the course of economic development.

    Self-Chained Image-Language Model for Video Localization and Question Answering

    Recent studies have shown promising results on utilizing pre-trained image-language models for video question answering. While these image-language models can efficiently bootstrap the representation learning of video-language models, they typically concatenate uniformly sampled video frames as visual inputs without explicit language-aware temporal modeling. When only a portion of a video input is relevant to the language query, such uniform frame sampling can often miss important visual cues. Although humans often find a video moment to focus on and rewind that moment to answer questions, training a query-aware video moment localizer often requires expensive annotations and high computational costs. To address this issue, we propose Self-Chained Video Localization-Answering (SeViLA), a novel framework that leverages a single image-language model (BLIP-2) to tackle both temporal keyframe localization and QA on videos. The SeViLA framework consists of two modules, Localizer and Answerer, both parameter-efficiently fine-tuned from BLIP-2. We chain these modules for cascaded inference and self-refinement. First, in the forward chain, the Localizer finds multiple language-aware keyframes in a video, which the Answerer uses to predict the answer. Second, in the reverse chain, the Answerer generates keyframe pseudo-labels to refine the Localizer, alleviating the need for expensive video moment localization annotations. SeViLA outperforms several strong baselines and prior works on five video QA and event prediction tasks, and achieves state-of-the-art results in both fine-tuning (NExT-QA, STAR) and zero-shot (NExT-QA, STAR, How2QA, VLEP) settings. We present a comprehensive analysis, e.g., of the impact of the Localizer, comparisons of the Localizer with other temporal localization models, pre-training/self-refinement of the Localizer, and varying the number of keyframes.
    Comment: 20 pages; our code and checkpoints are available at: https://github.com/Yui010206/SeViL
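
    The forward and reverse chains described above reduce to two small functions. In the Python sketch below, `localizer` and `answerer` are hypothetical callables standing in for the two parameter-efficiently fine-tuned BLIP-2 modules; this is an assumed outline of the chaining scheme, not the released implementation.

    ```python
    # Minimal sketch of SeViLA's two chained modules; `localizer` and
    # `answerer` are assumed stand-ins, not the authors' code.
    def forward_chain(localizer, answerer, frames, question, k=4):
        """Forward chain: the Localizer scores each frame for the query and
        the Answerer predicts from the top-k language-aware keyframes."""
        scored = sorted(frames, key=lambda f: localizer(f, question), reverse=True)
        return answerer(scored[:k], question)

    def reverse_chain(answerer, frames, question, answer):
        """Reverse chain: the Answerer marks frames that alone support the
        ground-truth answer, yielding keyframe pseudo-labels that refine
        the Localizer without human moment annotations."""
        return [int(answerer([f], question) == answer) for f in frames]
    ```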