370 research outputs found
Advancing Medical Imaging with Language Models: A Journey from N-grams to ChatGPT
In this paper, we aimed to provide a review and tutorial for researchers in
the field of medical imaging using language models to improve their tasks at
hand. We began by providing an overview of the history and concepts of language
models, with a special focus on large language models. We then reviewed the
current literature on how language models are being used to improve medical
imaging, emphasizing different applications such as image captioning, report
generation, report classification, finding extraction, visual question
answering, interpretable diagnosis, and more for various modalities and organs.
The ChatGPT was specially highlighted for researchers to explore more potential
applications. We covered the potential benefits of accurate and efficient
language models for medical imaging analysis, including improving clinical
workflow efficiency, reducing diagnostic errors, and assisting healthcare
professionals in providing timely and accurate diagnoses. Overall, our goal was
to bridge the gap between language models and medical imaging and inspire new
ideas and innovations in this exciting area of research. We hope that this
review paper will serve as a useful resource for researchers in this field and
encourage further exploration of the possibilities of language models in
medical imaging
Towards Interaction-level Video Action Understanding
A huge amount of videos have been created, spread, and viewed daily. Among these massive videos, the actions and activities of humans account for a large part. We desire machines to understand human actions in videos as this is essential to various applications, including but not limited to autonomous driving cars, security systems, human-robot interactions and healthcare. Towards real intelligent system that is able to interact with humans, video understanding must go beyond simply answering ``what is the action in the video", but be more aware of what those actions mean to humans and be more in line with human thinking, which we call interactive-level action understanding. This thesis identifies three main challenges to approaching interactive-level video action understanding: 1) understanding actions given human consensus; 2) understanding actions based on specific human rules; 3) directly understanding actions in videos via human natural language. For the first challenge, we select video summary as a representative task that aims to select informative frames to retain high-level information based on human annotators' experience. Through self-attention architecture and meta-learning, which jointly process dual representations of visual and sequential information for video summarization, the proposed model is capable of understanding video from human consensus (e.g., how humans think which parts of an action sequence are essential). For the second challenge, our works on action quality assessment utilize transformer decoders to parse the input action into several sub-actions and assess the more fine-grained qualities of the given action, yielding the capability of action understanding given specific human rules. (e.g., how well a diving action performs, how well a robot performs surgery) The third key idea explored in this thesis is to use graph neural networks in an adversarial fashion to understand actions through natural language. We demonstrate the utility of this technique for the video captioning task, which takes an action video as input, outputs natural language, and yields state-of-the-art performance. It can be concluded that the research directions and methods introduced in this thesis provide fundamental components toward interactive-level action understanding
Foundations and Recent Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions
Multimodal machine learning is a vibrant multi-disciplinary research field
that aims to design computer agents with intelligent capabilities such as
understanding, reasoning, and learning through integrating multiple
communicative modalities, including linguistic, acoustic, visual, tactile, and
physiological messages. With the recent interest in video understanding,
embodied autonomous agents, text-to-image generation, and multisensor fusion in
application domains such as healthcare and robotics, multimodal machine
learning has brought unique computational and theoretical challenges to the
machine learning community given the heterogeneity of data sources and the
interconnections often found between modalities. However, the breadth of
progress in multimodal research has made it difficult to identify the common
themes and open questions in the field. By synthesizing a broad range of
application domains and theoretical frameworks from both historical and recent
perspectives, this paper is designed to provide an overview of the
computational and theoretical foundations of multimodal machine learning. We
start by defining two key principles of modality heterogeneity and
interconnections that have driven subsequent innovations, and propose a
taxonomy of 6 core technical challenges: representation, alignment, reasoning,
generation, transference, and quantification covering historical and recent
trends. Recent technical achievements will be presented through the lens of
this taxonomy, allowing researchers to understand the similarities and
differences across new approaches. We end by motivating several open problems
for future research as identified by our taxonomy
Adventures of Trustworthy Vision-Language Models: A Survey
Recently, transformers have become incredibly popular in computer vision and
vision-language tasks. This notable rise in their usage can be primarily
attributed to the capabilities offered by attention mechanisms and the
outstanding ability of transformers to adapt and apply themselves to a variety
of tasks and domains. Their versatility and state-of-the-art performance have
established them as indispensable tools for a wide array of applications.
However, in the constantly changing landscape of machine learning, the
assurance of the trustworthiness of transformers holds utmost importance. This
paper conducts a thorough examination of vision-language transformers,
employing three fundamental principles of responsible AI: Bias, Robustness, and
Interpretability. The primary objective of this paper is to delve into the
intricacies and complexities associated with the practical use of transformers,
with the overarching goal of advancing our comprehension of how to enhance
their reliability and accountability.Comment: Accepted in AAAI 202
- …