A Hierarchical Multi-task Approach for Learning Embeddings from Semantic Tasks
Much effort has been devoted to evaluating whether multi-task learning can be
leveraged to learn rich representations that can be used in various Natural
Language Processing (NLP) downstream applications. However, there is still a
lack of understanding of the settings in which multi-task learning has a
significant effect. In this work, we introduce a hierarchical model trained in
a multi-task learning setup on a set of carefully selected semantic tasks. The
model is trained in a hierarchical fashion to introduce an inductive bias by
supervising a set of low-level tasks at the bottom layers of the model and more
complex tasks at its top layers. This model achieves
state-of-the-art results on a number of tasks, namely Named Entity Recognition,
Entity Mention Detection, and Relation Extraction, without hand-engineered
features or external NLP tools like syntactic parsers. The hierarchical
training supervision induces a set of shared semantic representations at lower
layers of the model. We show that as we move from the bottom to the top layers
of the model, the hidden states of the layers tend to represent more complex
semantic information.
Comment: 8 pages, 1 figure, to appear in Proceedings of AAAI 2019.
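A minimal sketch of the hierarchical supervision idea, not the authors' exact architecture: each task head reads the hidden states of a different encoder depth, so the bottom layers are shaped by simpler tagging tasks and the top layer by relation extraction. The layer types, sizes, and tag counts below are illustrative assumptions.

```python
# Illustrative sketch of hierarchical multi-task supervision: low-level tasks
# supervise lower encoder layers, a more complex task supervises the top layer.
# Layer choices (BiLSTMs) and dimensions are assumptions, not the paper's model.
import torch.nn as nn

class HierarchicalMTL(nn.Module):
    def __init__(self, vocab_size, emb_dim=128, hidden=128,
                 n_ner_tags=9, n_mention_tags=5, n_relations=10):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # Bottom layer: supervised with Named Entity Recognition.
        self.ner_encoder = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.ner_head = nn.Linear(2 * hidden, n_ner_tags)
        # Middle layer: supervised with Entity Mention Detection.
        self.emd_encoder = nn.LSTM(2 * hidden, hidden, batch_first=True, bidirectional=True)
        self.emd_head = nn.Linear(2 * hidden, n_mention_tags)
        # Top layer: supervised with Relation Extraction (sentence-level for brevity).
        self.re_encoder = nn.LSTM(2 * hidden, hidden, batch_first=True, bidirectional=True)
        self.re_head = nn.Linear(2 * hidden, n_relations)

    def forward(self, token_ids):
        x = self.embed(token_ids)            # (batch, seq, emb)
        h_ner, _ = self.ner_encoder(x)       # shared low-level representations
        h_emd, _ = self.emd_encoder(h_ner)   # built on top of the lower layer
        h_re, _ = self.re_encoder(h_emd)
        return {
            "ner_logits": self.ner_head(h_ner),           # per-token NER tags
            "emd_logits": self.emd_head(h_emd),           # per-token mention tags
            "re_logits": self.re_head(h_re.mean(dim=1)),  # pooled relation prediction
        }

# Training would sum a cross-entropy loss over each head, one per task.
```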
Unlocking the conversion of Web Screenshots into HTML Code with the WebSight Dataset
Using vision-language models (VLMs) in web development presents a promising
strategy to increase efficiency and unblock no-code solutions: by providing a
screenshot or a sketch of a UI, a VLM could generate the code to reproduce it,
for instance in a language like HTML. Despite the advancements in VLMs for
various tasks, the specific challenge of converting a screenshot into the
corresponding HTML code has been minimally explored. We posit that this is mainly
due to the absence of a suitable, high-quality dataset. This work introduces
WebSight, a synthetic dataset consisting of 2 million pairs of HTML codes and
their corresponding screenshots. We fine-tune a foundational VLM on our dataset
and demonstrate its proficiency in converting webpage screenshots to functional HTML code.
To accelerate research in this area, we open-source WebSight.
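A hedged sketch of how the released screenshot/HTML pairs could be streamed for fine-tuning; the dataset id "HuggingFaceM4/WebSight" and the column names "image" and "text" are assumptions and may differ from the actual release.

```python
# Hypothetical sketch of iterating over WebSight pairs for VLM fine-tuning.
# Dataset id and column names are assumptions, not confirmed by the abstract.
from datasets import load_dataset

ds = load_dataset("HuggingFaceM4/WebSight", split="train", streaming=True)

for example in ds.take(3):
    screenshot = example["image"]  # rendered page as a PIL image (assumed column)
    html_code = example["text"]    # corresponding HTML source (assumed column)
    # A typical instruction-tuning pair for a screenshot-to-code VLM:
    prompt = "Convert this webpage screenshot into HTML code."
    target = html_code
    print(screenshot.size, len(target))
```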
What matters when building vision-language models?
The growing interest in vision-language models (VLMs) has been driven by
improvements in large language models and vision transformers. Despite the
abundance of literature on this subject, we observe that critical decisions
regarding the design of VLMs are often not justified. We argue that these
unsupported decisions impede progress in the field by making it difficult to
identify which choices improve model performance. To address this issue, we
conduct extensive experiments around pre-trained models, architecture choice,
data, and training methods. Our consolidation of findings includes the
development of Idefics2, an efficient foundational VLM of 8 billion parameters.
Idefics2 achieves state-of-the-art performance within its size category across
various multimodal benchmarks, and is often on par with models four times its
size. We release the model (base, instructed, and chat) along with the datasets
created for its training.
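A hedged usage sketch for loading a released checkpoint with the transformers library; the repo id "HuggingFaceM4/idefics2-8b", the chat-template call, and the local image path are assumptions based on typical VLM usage, not an official quickstart.

```python
# Sketch of querying an Idefics2 checkpoint; repo id and prompt format assumed.
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

model_id = "HuggingFaceM4/idefics2-8b"  # assumed repo id
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(model_id)

image = Image.open("screenshot.png")    # any local image
messages = [{"role": "user",
             "content": [{"type": "image"},
                         {"type": "text", "text": "Describe this image."}]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")
generated_ids = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```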
What Language Model to Train if You Have One Million GPU Hours?
The crystallization of modeling methods around the Transformer architecture
has been a boon for practitioners. Simple, well-motivated architectural
variations can transfer across tasks and scale, increasing the impact of
modeling research. However, with the emergence of state-of-the-art 100B+
parameter models, large language models are increasingly expensive to
accurately design and train. Notably, it can be difficult to evaluate how
modeling decisions may impact emergent capabilities, given that these
capabilities arise mainly from sheer scale alone. In the process of building
BLOOM--the BigScience Large Open-science Open-access Multilingual language
model--our goal is to identify an architecture and training setup that makes
the best use of our 1,000,000 A100-GPU-hours budget. Specifically, we perform
an ablation study at the billion-parameter scale comparing different modeling
practices and their impact on zero-shot generalization. In addition, we study
the impact of various popular pre-training corpora on zero-shot generalization.
We also study the performance of a multilingual model and how it compares to
the English-only one. Finally, we consider the scaling behaviour of
Transformers to choose the target model size, shape, and training setup. All
our models and code are open-sourced at https://huggingface.co/bigscience.
Comment: Findings of EMNLP 2022.
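As a back-of-the-envelope illustration of how such a budget constrains model size, the sketch below uses the common C ≈ 6ND approximation for training FLOPs; the hardware-utilization figure and the candidate parameter counts are assumptions for illustration, not numbers taken from the paper.

```python
# Rough compute budgeting for 1,000,000 A100-GPU-hours using C ~= 6 * N * D.
# Utilization and the candidate model sizes are assumed, not from the paper.
A100_PEAK_FLOPS = 312e12   # A100 BF16 tensor-core peak, ~312 TFLOP/s
UTILIZATION = 0.40         # assumed fraction of peak actually achieved
GPU_HOURS = 1_000_000

total_flops = GPU_HOURS * 3600 * A100_PEAK_FLOPS * UTILIZATION  # ~4.5e23 FLOPs

# Fixing a parameter count N, the budget implies a token count D = C / (6N):
for n_params in (1e9, 13e9, 176e9):
    tokens = total_flops / (6 * n_params)
    print(f"{n_params / 1e9:>6.0f}B params -> ~{tokens / 1e9:,.0f}B tokens")
```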
PromptSource: An Integrated Development Environment and Repository for Natural Language Prompts
PromptSource is a system for creating, sharing, and using natural language
prompts. Prompts are functions that map an example from a dataset to a natural
language input and target output. Using prompts to train and query language
models is an emerging area in NLP that requires new tools that let users
develop and refine these prompts collaboratively. PromptSource addresses the
emergent challenges in this new setting with (1) a templating language for
defining data-linked prompts, (2) an interface that lets users quickly iterate
on prompt development by observing outputs of their prompts on many examples,
and (3) a community-driven set of guidelines for contributing new prompts to a
common pool. Over 2,000 prompts for roughly 170 datasets are already available
in PromptSource. PromptSource is available at
https://github.com/bigscience-workshop/promptsource.
Comment: ACL 2022 Demo.
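To illustrate the core abstraction (a prompt as a function mapping a dataset example to a natural language input and a target output), here is a minimal Jinja2-based sketch. PromptSource's templating language is Jinja-based, but the exact syntax and this NLI example are illustrative rather than taken from the repository.

```python
# Conceptual sketch of a data-linked prompt: two templates turn one dataset
# example into (input text, target text). Illustrative, not PromptSource syntax.
from jinja2 import Template

input_template = Template(
    "Premise: {{ premise }}\nHypothesis: {{ hypothesis }}\n"
    "Does the premise entail the hypothesis? Yes, no, or maybe?"
)
target_template = Template("{{ ['Yes', 'Maybe', 'No'][label] }}")

example = {"premise": "A dog is running in the park.",
           "hypothesis": "An animal is outside.",
           "label": 0}

prompt_input = input_template.render(**example)
prompt_target = target_template.render(**example)
print(prompt_input)
print(prompt_target)  # "Yes"
```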