Analyzing Vision Transformers for Image Classification in Class Embedding Space

Authors: Roig, Gemma; Schaumlöffel, Timothy; Vilas, Martina G.
Publication date: 29/10/2023

Despite the growing use of transformer models in computer vision, a mechanistic understanding of these networks is still needed. This work introduces a method to reverse-engineer Vision Transformers trained to solve image classification tasks. Inspired by previous research in NLP, we demonstrate how the inner representations at any level of the hierarchy can be projected onto the learned class embedding space to uncover how these networks build categorical representations for their predictions. We use our framework to show how image tokens develop class-specific representations that depend on attention mechanisms and contextual information, and give insights on how self-attention and MLP layers differentially contribute to this categorical composition. We additionally demonstrate that this method (1) can be used to determine the parts of an image that would be important for detecting the class of interest, and (2) exhibits significant advantages over traditional linear probing approaches. Taken together, our results position our proposed framework as a powerful tool for mechanistic interpretability and explainability research. Comment: NeurIPS 202
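The projection step mentioned in the abstract can be illustrated with a toy sketch. It is not the authors' implementation. It assumes the class embedding space is given by a matrix `class_embedding` of shape (n_classes, d), for example the weights of a classification head, and that `hidden` holds the token representations from one layer. Both names, the toy values, and the helper functions are hypothetical, and any normalization a real model applies before its head is omitted.

```python
import numpy as np


def project_to_class_space(hidden, class_embedding):
    """Project token representations onto class embeddings.

    hidden:          (n_tokens, d) representations from some layer (assumed)
    class_embedding: (n_classes, d) class embedding matrix (assumed)
    Returns an (n_tokens, n_classes) array of class scores per token.
    """
    return hidden @ class_embedding.T


def target_class_rank(scores, target):
    """Rank of the target class for each token (0 means top-scoring)."""
    order = np.argsort(-scores, axis=1)          # classes sorted by descending score
    return np.argmax(order == target, axis=1)    # position of the target class


# Toy values: 3 classes, embedding dimension 4, 2 image tokens.
class_embedding = np.eye(3, 4)
hidden = np.array([[0.1, 1.0, 0.3, 0.0],
                   [0.2, 0.1, 0.9, 0.0]])

scores = project_to_class_space(hidden, class_embedding)
ranks = target_class_rank(scores, target=2)
print(scores.argmax(axis=1), ranks)
```

Applying the same projection to the output of every layer would give a per-token, per-layer view of how strongly each class is represented, which is the kind of analysis the abstract describes.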

arXiv.org e-Print Archive

Multimodal Vision-Audio-Language Dataset

Authors: Choksi, Bhavin; Roig, Gemma; Schaumlöffel, Timothy
Publication venue: Zenodo
Publication date: 31/10/2023

The Multimodal Vision-Audio-Language Dataset is a large-scale dataset for multimodal learning. It contains 2M video clips with corresponding audio and a textual description of the visual and auditory content. The dataset is an ensemble of existing datasets and fills the gap of missing modalities. Details can be found in the attached report.

Annotation

The annotation files are provided as Parquet files. They can be read in Python with the pandas and pyarrow libraries. The split into train, validation, and test sets follows the split of the original datasets.

Installation

    pip install pandas pyarrow

Example

    import pandas as pd
    df = pd.read_parquet('annotation_train.parquet', engine='pyarrow')
    print(df.iloc[0])

    dataset              AudioSet
    filename             train/---2_BBVHAA.mp3
    captions_visual      [a man in a black hat and glasses.]
    captions_auditory    [a man speaks and dishes clank.]
    tags                 [Speech]

Description

The annotation file consists of the following fields:

filename: Name of the corresponding file (video or audio file).
dataset: Source dataset associated with the data point.
captions_visual: A list of captions related to the visual content of the video. Can be NaN if there is no visual content.
captions_auditory: A list of captions related to the auditory content of the video.
tags: A list of tags classifying the sound of a file. Can be NaN if no tags are provided.

Data files

The raw data files for most datasets are not released due to licensing issues. They must be downloaded from the source. However, because some source files may be missing, we provide them on request. Please contact us at [email protected]
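Because captions_visual and tags can be NaN, code that consumes the annotations may need to tell missing cells apart from list-valued ones. The following is a minimal sketch of one way to do that. It uses a small in-memory DataFrame instead of the real Parquet file so that it runs on its own. The first row mirrors the example record above; the second row, the helper names `is_missing` and `summarize`, and the assumption that list cells may arrive either as Python lists or as array-like objects depending on the reader are all illustrative.

```python
import pandas as pd


def is_missing(value):
    """Return True for a scalar NaN/None cell, False for list-like cells."""
    # List-like cells (lists or arrays) carry captions or tags, so they are present.
    if hasattr(value, "__len__") and not isinstance(value, str):
        return False
    return bool(pd.isna(value))


def summarize(df):
    """Count rows and how many lack visual captions or tags."""
    return {
        "rows": len(df),
        "without_visual_captions": int(df["captions_visual"].map(is_missing).sum()),
        "without_tags": int(df["tags"].map(is_missing).sum()),
    }


# Stand-in for pd.read_parquet('annotation_train.parquet', engine='pyarrow').
df = pd.DataFrame({
    "dataset": ["AudioSet", "ExampleAudioOnly"],            # second value is made up
    "filename": ["train/---2_BBVHAA.mp3", "train/example.mp3"],
    "captions_visual": [["a man in a black hat and glasses."], float("nan")],
    "captions_auditory": [["a man speaks and dishes clank."], ["placeholder caption."]],
    "tags": [["Speech"], float("nan")],
})

summary = summarize(df)
print(summary)

# Keep only rows that have visual captions.
with_visual = df[~df["captions_visual"].map(is_missing)]
```

To run the same check on the real annotations, replace the stand-in DataFrame with the `pd.read_parquet` call shown in the example above.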

ZENODO