2 research outputs found

    Analyzing Vision Transformers for Image Classification in Class Embedding Space

    Despite the growing use of transformer models in computer vision, a mechanistic understanding of these networks is still needed. This work introduces a method to reverse-engineer Vision Transformers trained to solve image classification tasks. Inspired by previous research in NLP, we demonstrate how the inner representations at any level of the hierarchy can be projected onto the learned class embedding space to uncover how these networks build categorical representations for their predictions. We use our framework to show how image tokens develop class-specific representations that depend on attention mechanisms and contextual information, and give insights into how self-attention and MLP layers differentially contribute to this categorical composition. We additionally demonstrate that this method (1) can be used to determine the parts of an image that would be important for detecting the class of interest, and (2) exhibits significant advantages over traditional linear probing approaches. Taken together, our results position our proposed framework as a powerful tool for mechanistic interpretability and explainability research.
    Comment: NeurIPS 2023
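    The projection described in the abstract can be illustrated with a minimal sketch: intermediate token representations of a ViT are multiplied by the classification head's weight matrix, treated here as the class embedding space (logit-lens style). The model name, layer index, and application of the final LayerNorm before projecting are illustrative assumptions, not details taken from the paper.

        import torch
        import timm

        # Assumption: a standard timm ViT stands in for the models studied in the paper.
        model = timm.create_model('vit_base_patch16_224', pretrained=True)
        model.eval()

        # Capture the token representations output by every transformer block.
        hidden = {}
        def save_hidden(idx):
            def hook(module, inputs, output):
                hidden[idx] = output.detach()  # (batch, tokens, dim)
            return hook

        for i, block in enumerate(model.blocks):
            block.register_forward_hook(save_hidden(i))

        x = torch.randn(1, 3, 224, 224)  # stand-in for a preprocessed image
        with torch.no_grad():
            model(x)

        # Project every token at an intermediate layer onto the class embeddings,
        # i.e. the rows of the classification head's weight matrix.
        layer = 6  # arbitrary choice for illustration
        h = model.norm(hidden[layer])            # (1, tokens, dim); final-norm reuse is an assumption
        class_logits = h @ model.head.weight.T   # (1, tokens, num_classes)
        top_class_per_token = class_logits.argmax(-1)

    Per-token class logits at intermediate depths are what allow tracing how categorical information builds up across layers and which image patches carry class-relevant evidence.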

    Multimodal Vision-Audio-Language Dataset

    <p>The Multimodal Vision-Audio-Language Dataset is a large-scale dataset for multimodal learning. It contains 2M video clips with corresponding audio and a textual description of the visual and auditory content. The dataset is an ensemble of existing datasets and fills the gap of missing modalities.</p><p>Details can be found in the attached report.</p><h3><strong>Annotation</strong></h3><p>The annotation files are provided as Parquet files. They can be read using Python and the <i>pandas</i> and <i>pyarrow</i> library.</p><p>The split into train, validation and test set follows the split of the original datasets.</p><h4><strong>Installation</strong></h4><blockquote><p>pip install pandas pyarrow</p></blockquote><h4><strong>Example</strong></h4><blockquote><p>import pandas as pd<br>df = pd.read_parquet('annotation_train.parquet', engine='pyarrow')<br>print(df.iloc[0])</p></blockquote><blockquote><p>dataset                  AudioSet </p><p>filename                train/---2_BBVHAA.mp3</p><p>captions_visual      [a man in a black hat and glasses.]</p><p>captions_auditory  [a man speaks and dishes clank.]</p><p>tags                       [Speech]</p></blockquote><h4><strong>Description</strong></h4><p>The annotation file consists of the following fields:<br><br><i>filename</i>: Name of the corresponding file (video or audio file)<br><i>dataset</i>: Source dataset associated with the data point<br><i>captions_visual</i>: A list of captions related to the visual content of the video. Can be NaN in case of no visual content<br><i>captions_auditory</i>: A list of captions related to the auditory content of the video<br><i>tags</i>: A list of tags, classifying the sound of a file. It can be NaN if no tags are provided</p><h3><strong>Data files</strong></h3><p>The raw data files for most datasets are not released due to licensing issues. They must be downloaded from the source. However, due to missing files, we provide them on request. Please contact us at [email protected]</p&gt