Deep Neural Networks for Bot Detection
The problem of detecting bots, automated social media accounts governed by
software but masquerading as human users, has strong implications. For example,
bots have been used to sway political elections by distorting online discourse,
to manipulate the stock market, or to push anti-vaccine conspiracy theories
that caused health epidemics. Most techniques proposed to date detect bots at
the account level, by processing large amounts of social media posts and
leveraging information from network structure, temporal dynamics, sentiment
analysis, etc.
In this paper, we propose a deep neural network based on contextual long
short-term memory (LSTM) architecture that exploits both content and metadata
to detect bots at the tweet level: contextual features are extracted from user
metadata and fed as auxiliary input to LSTM deep nets processing the tweet
text.
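As a rough illustration of this kind of two-input design (not the authors' exact model; the layer sizes, vocabulary size, and metadata feature count below are placeholders), tweet text and account metadata can be combined with the Keras functional API:

```python
# Minimal sketch of a contextual LSTM: tweet tokens pass through an LSTM,
# account metadata enters as an auxiliary dense input, and the two branches
# are concatenated before a final bot/human classifier. Sizes are illustrative.
import tensorflow as tf
from tensorflow.keras import layers, Model

VOCAB_SIZE = 20_000   # assumed tokenizer vocabulary size
MAX_LEN = 50          # assumed maximum tweet length in tokens
N_META = 10           # assumed number of user-metadata features

text_in = layers.Input(shape=(MAX_LEN,), name="tweet_tokens")
meta_in = layers.Input(shape=(N_META,), name="user_metadata")

x = layers.Embedding(VOCAB_SIZE, 128, mask_zero=True)(text_in)
x = layers.LSTM(64)(x)

m = layers.Dense(32, activation="relu")(meta_in)

h = layers.Concatenate()([x, m])
h = layers.Dense(32, activation="relu")(h)
out = layers.Dense(1, activation="sigmoid", name="bot_probability")(h)

model = Model(inputs=[text_in, meta_in], outputs=out)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["AUC"])
```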
Another contribution that we make is proposing a technique based on synthetic
minority oversampling to generate a large labeled dataset, suitable for deep
nets training, from a minimal amount of labeled data (roughly 3,000 examples of
sophisticated Twitter bots). We demonstrate that, from just one single tweet,
our architecture can achieve high classification accuracy (AUC > 96%) in
separating bots from humans.
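The oversampling idea can be illustrated with an off-the-shelf SMOTE implementation; the sketch below uses imbalanced-learn on placeholder feature vectors and is only meant to show how a small minority class is synthetically expanded, not the paper's exact data-generation pipeline:

```python
# Illustrative synthetic minority oversampling (SMOTE): grow ~3,000 labeled
# bot examples into a balanced training set by interpolating between
# nearest-neighbor minority samples in feature space.
import numpy as np
from imblearn.over_sampling import SMOTE

rng = np.random.default_rng(0)
X_human = rng.normal(size=(50_000, 16))   # placeholder human feature vectors
X_bot = rng.normal(size=(3_000, 16))      # placeholder bot feature vectors
X = np.vstack([X_human, X_bot])
y = np.array([0] * len(X_human) + [1] * len(X_bot))

X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print(X_res.shape, np.bincount(y_res))    # both classes now have 50,000 samples
```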
We apply the same architecture to account-level bot detection, achieving
nearly perfect classification accuracy (AUC > 99%). Our system outperforms the
previous state of the art while leveraging a small, interpretable set of
features and requiring minimal training data.
DANTE: Deep AlterNations for Training nEural networks
We present DANTE, a novel method for training neural networks using the
alternating minimization principle. DANTE provides an alternate perspective to
traditional gradient-based backpropagation techniques commonly used to train
deep networks. It utilizes an adaptation of quasi-convexity to cast training a
neural network as a bi-quasi-convex optimization problem. We show that for
neural network configurations with both differentiable (e.g. sigmoid) and
non-differentiable (e.g. ReLU) activation functions, we can perform the
alternations effectively in this formulation. DANTE can also be extended to
networks with multiple hidden layers. In experiments on standard datasets,
neural networks trained using the proposed method were found to be promising
and competitive with traditional backpropagation techniques, both in terms of
solution quality and training speed. Comment: 19 pages.
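Purely to illustrate the alternating-minimization idea on a toy problem (this is not DANTE's quasi-convex formulation; the network, loss, and update rules below are simplified assumptions), one layer of a single-hidden-layer regression network can be optimized while the other is held fixed:

```python
# Toy alternating minimization for y ≈ sigmoid(X @ W1) @ W2: alternately
# solve for W2 with W1 fixed (a convex least-squares problem), then take a
# few gradient steps on W1 with W2 fixed. Not the paper's exact solver.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))                 # toy inputs
Y = np.sin(X @ rng.normal(size=(8, 1)))       # toy regression targets

W1 = rng.normal(scale=0.5, size=(8, 16))
W2 = rng.normal(scale=0.5, size=(16, 1))
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
lr = 0.1

for epoch in range(200):
    # Step 1: with W1 fixed, the hidden activations are fixed, so the best W2
    # is the least-squares solution of H @ W2 ≈ Y.
    H = sigmoid(X @ W1)
    W2, *_ = np.linalg.lstsq(H, Y, rcond=None)

    # Step 2: with W2 fixed, take a few gradient steps on W1.
    for _ in range(5):
        H = sigmoid(X @ W1)
        err = H @ W2 - Y
        grad_W1 = X.T @ ((err @ W2.T) * H * (1 - H)) / len(X)
        W1 -= lr * grad_W1

mse = float(np.mean((sigmoid(X @ W1) @ W2 - Y) ** 2))
print(f"final mean squared error: {mse:.4f}")
```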
BUFFET: Benchmarking Large Language Models for Few-shot Cross-lingual Transfer
Despite remarkable advancements in few-shot generalization in natural
language processing, most models are developed and evaluated primarily in
English. To facilitate research on few-shot cross-lingual transfer, we
introduce a new benchmark, called BUFFET, which unifies 15 diverse tasks across
54 languages in a sequence-to-sequence format and provides a fixed set of
few-shot examples and instructions. BUFFET is designed to establish a rigorous
and equitable evaluation framework for few-shot cross-lingual transfer across a
broad range of tasks and languages. Using BUFFET, we perform thorough
evaluations of state-of-the-art multilingual large language models with
different transfer methods, namely in-context learning and fine-tuning. Our
findings reveal significant room for improvement in few-shot in-context
cross-lingual transfer. In particular, ChatGPT with in-context learning often
performs worse than much smaller mT5-base models fine-tuned on English task
data and few-shot in-language examples. Our analysis suggests various avenues
for future research in few-shot cross-lingual transfer, such as improved
pretraining, understanding, and future evaluations. Comment: The data and code
are available at https://buffetfs.github.io
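As a rough sketch of what few-shot in-context evaluation in a sequence-to-sequence format involves (the task, field names, and instruction text here are invented, not BUFFET's actual schema), a fixed set of demonstrations can be assembled into a single prompt per test instance:

```python
# Toy few-shot prompt construction: a fixed list of in-language demonstrations
# is prepended to each test input before the prompt is sent to a model.
from typing import Dict, List

def build_prompt(instruction: str, demos: List[Dict[str, str]], test_input: str) -> str:
    parts = [instruction]
    for d in demos:
        parts.append(f"Input: {d['input']}\nOutput: {d['output']}")
    parts.append(f"Input: {test_input}\nOutput:")
    return "\n\n".join(parts)

demos = [
    {"input": "Das Essen war fantastisch.", "output": "positive"},
    {"input": "Der Service war enttäuschend.", "output": "negative"},
]
print(build_prompt(
    "Classify the sentiment of the review as positive or negative.",
    demos,
    "Die Lieferung kam viel zu spät.",
))
```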
MADLAD-400: A Multilingual And Document-Level Large Audited Dataset
We introduce MADLAD-400, a manually audited, general domain 3T token
monolingual dataset based on CommonCrawl, spanning 419 languages. We discuss
the limitations revealed by self-auditing MADLAD-400, and the role data
auditing had in the dataset creation process. We then train and release a
10.7B-parameter multilingual machine translation model on 250 billion tokens
covering over 450 languages using publicly available data, and find that it is
competitive with models that are significantly larger, and report the results
on different domains. In addition, we train an 8B-parameter language model and
assess the results on few-shot translation. We make the baseline models
available to the research community. Comment: Preprint.
Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets
With the success of large-scale pre-training and multilingual modeling in
Natural Language Processing (NLP), recent years have seen a proliferation of
large, web-mined text datasets covering hundreds of languages. We manually
audit the quality of 205 language-specific corpora released with five major
public datasets (CCAligned, ParaCrawl, WikiMatrix, OSCAR, mC4). Lower-resource
corpora have systematic issues: at least 15 corpora contain no usable text, and
in a significant fraction, less than 50% of the sentences are of acceptable quality. In
addition, many are mislabeled or use nonstandard/ambiguous language codes. We
demonstrate that these issues are easy to detect even for non-proficient
speakers, and supplement the human audit with automatic analyses. Finally, we
recommend techniques to evaluate and improve multilingual corpora and discuss
potential risks that come with low-quality data releases. Comment: Accepted at
TACL; pre-MIT Press publication version.
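A small example of the kind of automatic check that can supplement a manual audit (the quality heuristics are simplified assumptions, and predict_language is a stand-in for any off-the-shelf language identifier rather than a specific library call):

```python
# Toy automatic corpus audit: estimate the share of lines that look like
# usable sentences in the language the corpus is labeled with.
from typing import Callable, Iterable

def audit_corpus(lines: Iterable[str],
                 expected_lang: str,
                 predict_language: Callable[[str], str]) -> float:
    """Return the fraction of lines passing simple quality checks."""
    total, acceptable = 0, 0
    for line in lines:
        total += 1
        text = line.strip()
        if len(text.split()) < 3:                  # too short to be a usable sentence
            continue
        if predict_language(text) != expected_lang:
            continue                               # mislabeled or wrong-language content
        acceptable += 1
    return acceptable / max(total, 1)

# Usage with a trivially fake identifier, for illustration only:
fake_lid = lambda text: "en"
print(audit_corpus(["Hello world today.", "xx", "Another usable sentence."], "en", fake_lid))
```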
PaLM 2 Technical Report
We introduce PaLM 2, a new state-of-the-art language model that has better
multilingual and reasoning capabilities and is more compute-efficient than its
predecessor PaLM. PaLM 2 is a Transformer-based model trained using a mixture
of objectives. Through extensive evaluations on English and multilingual
language tasks and on reasoning tasks, we demonstrate that PaLM 2 has significantly
improved quality on downstream tasks across different model sizes, while
simultaneously exhibiting faster and more efficient inference compared to PaLM.
This improved efficiency enables broader deployment while also allowing the
model to respond faster, for a more natural pace of interaction. PaLM 2
demonstrates robust reasoning capabilities exemplified by large improvements
over PaLM on BIG-Bench and other reasoning tasks. PaLM 2 exhibits stable
performance on a suite of responsible AI evaluations, and enables
inference-time control over toxicity without additional overhead or impact on
other capabilities. Overall, PaLM 2 achieves state-of-the-art performance
across a diverse set of tasks and capabilities.
When discussing the PaLM 2 family, it is important to distinguish between
pre-trained models (of various sizes), fine-tuned variants of these models, and
the user-facing products that use these models. In particular, user-facing
products typically include additional pre- and post-processing steps.
Additionally, the underlying models may evolve over time. Therefore, one should
not expect the performance of user-facing products to exactly match the results
reported here.