How to train your instruction-following text encoder without labeling

Abstract

Text embedding models encode the semantic content of natural language inputs into fixed-length vectors. Contrastive learning has been the go-to training strategy to ensure that semantically similar inputs are mapped to nearby vectors in the embedding space. While successful, this training recipe requires large amounts of labeled training data in order to cover diverse domains. In addition, current text embedding models cannot adhere to user instructions when encoding inputs. In this thesis, we provide the first attempt to build text embedding models that can (1) adhere to user instructions, and (2) generalize without domain-specific annotated data. We leverage the recent success of generative large language models (LLMs), which exhibit strong domain generalization and rich latent knowledge. Transferring these properties to text encoders can enrich their contextualized representations and allow for instruction-controlled representations. In this work, we rely on the state-space model (SSM) parametrization to achieve these goals. SSMs are defined through ordinary differential equations with respect to the state vector, capturing the dynamics of an information system over time. State vectors are a fixed-length compression of the system's past trajectory. We show empirically that state vectors in learned, discretized SSMs preserve this information-rich property in language modeling tasks, and can be applied off-the-shelf to perform text embedding tasks. The instruction-following ability of pretrained generative LMs allows the state vectors to be sensitive to user intentions and to compress different information given different prompts.
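
For reference, the standard linear state-space formulation the abstract alludes to can be sketched as follows; this is the generic continuous-time system and its zero-order-hold discretization, and the exact parametrization used in the thesis may differ.

\begin{aligned}
x'(t) &= A\,x(t) + B\,u(t) \\
y(t)  &= C\,x(t)
\end{aligned}

Discretizing with step size \Delta (zero-order hold) gives the recurrence

\begin{aligned}
x_k &= \bar{A}\,x_{k-1} + \bar{B}\,u_k, \qquad
\bar{A} = e^{\Delta A}, \quad
\bar{B} = (\Delta A)^{-1}\big(e^{\Delta A} - I\big)\,\Delta B, \\
y_k &= C\,x_k.
\end{aligned}

Here x_k is the state vector: a fixed-length summary of the input history u_1, \dots, u_k, which is the quantity the abstract proposes to read out as a text embedding.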
