Analyzing Vision Transformers for Image Classification in Class Embedding Space
Despite the growing use of transformer models in computer vision, a
mechanistic understanding of these networks is still needed. This work
introduces a method to reverse-engineer Vision Transformers trained to solve
image classification tasks. Inspired by previous research in NLP, we
demonstrate how the inner representations at any level of the hierarchy can be
projected onto the learned class embedding space to uncover how these networks
build categorical representations for their predictions. We use our framework
to show how image tokens develop class-specific representations that depend on
attention mechanisms and contextual information, and give insights on how
self-attention and MLP layers differentially contribute to this categorical
composition. We additionally demonstrate that this method (1) can be used to
determine the parts of an image that would be important for detecting the class
of interest, and (2) exhibits significant advantages over traditional linear
probing approaches. Taken together, our results position our proposed framework
as a powerful tool for mechanistic interpretability and explainability
research.
Comment: NeurIPS 202
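The projection described in the abstract can be illustrated with a toy example. This is only a minimal sketch, assuming the projection is a plain dot product between each token's hidden representation and each row of the class embedding matrix; the abstract does not give the exact procedure (for example, any normalization applied first). The names `project_to_class_space` and `top_class` and all numeric values are hypothetical.

```python
from typing import List, Sequence


def project_to_class_space(
    hidden_tokens: Sequence[Sequence[float]],
    class_embeddings: Sequence[Sequence[float]],
) -> List[List[float]]:
    """Return scores[t][c] = dot(hidden_tokens[t], class_embeddings[c])."""
    return [
        [sum(h * e for h, e in zip(token, class_vec)) for class_vec in class_embeddings]
        for token in hidden_tokens
    ]


def top_class(scores_row: Sequence[float]) -> int:
    """Index of the highest-scoring class for one token."""
    return max(range(len(scores_row)), key=lambda c: scores_row[c])


# Toy values (assumed): 2 classes, hidden size 3, 2 image tokens from some layer.
class_embeddings = [
    [1.0, 0.0, 0.0],  # class 0
    [0.0, 1.0, 0.0],  # class 1
]
hidden_tokens = [
    [0.9, 0.1, 0.0],  # token 0
    [0.2, 0.7, 0.5],  # token 1
]

scores = project_to_class_space(hidden_tokens, class_embeddings)
predicted = [top_class(row) for row in scores]
print(predicted)
```

Applying the same projection to the tokens of every layer would give a per-layer view of which class each token is closest to, which is the kind of analysis the abstract describes.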
Multimodal Vision-Audio-Language Dataset
<p>The Multimodal Vision-Audio-Language Dataset is a large-scale dataset for multimodal learning. It contains 2M video clips, each with its corresponding audio and a textual description of the visual and auditory content. The dataset is assembled from existing datasets and fills in the modalities they lack.</p>
<p>Details can be found in the attached report.</p>
<h3><strong>Annotation</strong></h3>
<p>The annotation files are provided as Parquet files. They can be read in Python with the <i>pandas</i> and <i>pyarrow</i> libraries.</p>
<p>The split into train, validation, and test sets follows the splits of the original datasets.</p>
<h4><strong>Installation</strong></h4>
<blockquote><p>pip install pandas pyarrow</p></blockquote>
<h4><strong>Example</strong></h4>
<blockquote><p>import pandas as pd<br># Load the training annotations with the pyarrow engine<br>df = pd.read_parquet('annotation_train.parquet', engine='pyarrow')<br># Show the first annotation record<br>print(df.iloc[0])</p></blockquote>
<blockquote><p>dataset AudioSet</p><p>filename train/---2_BBVHAA.mp3</p><p>captions_visual [a man in a black hat and glasses.]</p><p>captions_auditory [a man speaks and dishes clank.]</p><p>tags [Speech]</p></blockquote>
<h4><strong>Description</strong></h4>
<p>The annotation file consists of the following fields:<br><br><i>filename</i>: Name of the corresponding video or audio file<br><i>dataset</i>: Source dataset the data point comes from<br><i>captions_visual</i>: A list of captions describing the visual content of the video. NaN if there is no visual content<br><i>captions_auditory</i>: A list of captions describing the auditory content of the video<br><i>tags</i>: A list of tags classifying the sound in the file. NaN if no tags are provided</p>
<h3><strong>Data files</strong></h3>
<p>The raw data files for most datasets are not released due to licensing restrictions and must be downloaded from the original source. However, because some files may be missing there, we provide them on request. Please contact us at [email protected]</p>
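Because captions_visual and tags can be NaN, code that iterates over these fields needs to handle missing values. The following is a minimal sketch, assuming each record is available as a plain dictionary (for example from `df.to_dict("records")`). The helpers `as_list` and `normalize_record` are hypothetical names, and the second record is invented for illustration; only the first record is taken from the example output above.

```python
import math
from typing import Any, Dict, List


def as_list(value: Any) -> List[str]:
    """Convert a list-like field to a plain list; None and NaN become []."""
    if value is None:
        return []
    if isinstance(value, float) and math.isnan(value):
        return []
    # Assumes the field is list-like (list, tuple, or array), not a bare string.
    return [str(item) for item in value]


def normalize_record(record: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of one annotation record with NaN fields replaced by []."""
    return {
        "dataset": record["dataset"],
        "filename": record["filename"],
        "captions_visual": as_list(record.get("captions_visual")),
        "captions_auditory": as_list(record.get("captions_auditory")),
        "tags": as_list(record.get("tags")),
    }


records = [
    {   # Record shown in the description's example output.
        "dataset": "AudioSet",
        "filename": "train/---2_BBVHAA.mp3",
        "captions_visual": ["a man in a black hat and glasses."],
        "captions_auditory": ["a man speaks and dishes clank."],
        "tags": ["Speech"],
    },
    {   # Hypothetical audio-only record, invented for illustration.
        "dataset": "ExampleSet",
        "filename": "train/example.mp3",
        "captions_visual": float("nan"),
        "captions_auditory": ["rain falls on a roof."],
        "tags": float("nan"),
    },
]

normalized = [normalize_record(r) for r in records]
# Keep only the files that have at least one visual caption.
with_visual = [r["filename"] for r in normalized if r["captions_visual"]]
print(with_visual)
```

Normalizing once up front keeps later filtering code free of repeated NaN checks.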