Deep Neural Networks for Bot Detection
The problem of detecting bots, automated social media accounts governed by
software but masquerading as human users, has strong implications. For example,
bots have been used to sway political elections by distorting online discourse,
to manipulate the stock market, or to push anti-vaccine conspiracy theories
that caused health epidemics. Most techniques proposed to date detect bots at
the account level, by processing large amounts of social media posts and
leveraging information from network structure, temporal dynamics, sentiment
analysis, etc.
In this paper, we propose a deep neural network based on contextual long
short-term memory (LSTM) architecture that exploits both content and metadata
to detect bots at the tweet level: contextual features are extracted from user
metadata and fed as auxiliary input to LSTM deep nets processing the tweet
text.
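As a rough illustration of this kind of two-input design (not the authors' exact model; the layer sizes, vocabulary size, and metadata feature count below are placeholders), tweet text and account metadata can be combined with the Keras functional API:

```python
# Minimal sketch of a contextual LSTM: tweet tokens pass through an LSTM,
# account metadata enters as an auxiliary dense input, and the two branches
# are concatenated before a final bot/human classifier. Sizes are illustrative.
import tensorflow as tf
from tensorflow.keras import layers, Model

VOCAB_SIZE = 20_000   # assumed tokenizer vocabulary size
MAX_LEN = 50          # assumed maximum tweet length in tokens
N_META = 10           # assumed number of user-metadata features

text_in = layers.Input(shape=(MAX_LEN,), name="tweet_tokens")
meta_in = layers.Input(shape=(N_META,), name="user_metadata")

x = layers.Embedding(VOCAB_SIZE, 128, mask_zero=True)(text_in)
x = layers.LSTM(64)(x)

m = layers.Dense(32, activation="relu")(meta_in)

h = layers.Concatenate()([x, m])
h = layers.Dense(32, activation="relu")(h)
out = layers.Dense(1, activation="sigmoid", name="bot_probability")(h)

model = Model(inputs=[text_in, meta_in], outputs=out)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["AUC"])
```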
Another contribution that we make is proposing a technique based on synthetic
minority oversampling to generate a large labeled dataset, suitable for deep
nets training, from a minimal amount of labeled data (roughly 3,000 examples of
sophisticated Twitter bots). We demonstrate that, from just one single tweet,
our architecture can achieve high classification accuracy (AUC > 96%) in
separating bots from humans.
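The oversampling idea can be illustrated with an off-the-shelf SMOTE implementation; the sketch below uses imbalanced-learn on placeholder feature vectors and is only meant to show how a small minority class is synthetically expanded, not the paper's exact data-generation pipeline:

```python
# Illustrative synthetic minority oversampling (SMOTE): grow ~3,000 labeled
# bot examples into a balanced training set by interpolating between
# nearest-neighbor minority samples in feature space.
import numpy as np
from imblearn.over_sampling import SMOTE

rng = np.random.default_rng(0)
X_human = rng.normal(size=(50_000, 16))   # placeholder human feature vectors
X_bot = rng.normal(size=(3_000, 16))      # placeholder bot feature vectors
X = np.vstack([X_human, X_bot])
y = np.array([0] * len(X_human) + [1] * len(X_bot))

X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print(X_res.shape, np.bincount(y_res))    # both classes now have 50,000 samples
```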
We apply the same architecture to account-level bot detection, achieving
nearly perfect classification accuracy (AUC > 99%). Our system outperforms the
previous state of the art while leveraging a small, interpretable set of
features and requiring minimal training data.
DANTE: Deep AlterNations for Training nEural networks
We present DANTE, a novel method for training neural networks using the
alternating minimization principle. DANTE provides an alternate perspective to
traditional gradient-based backpropagation techniques commonly used to train
deep networks. It utilizes an adaptation of quasi-convexity to cast training a
neural network as a bi-quasi-convex optimization problem. We show that for
neural network configurations with both differentiable (e.g. sigmoid) and
non-differentiable (e.g. ReLU) activation functions, we can perform the
alternations effectively in this formulation. DANTE can also be extended to
networks with multiple hidden layers. In experiments on standard datasets,
neural networks trained using the proposed method were found to be promising
and competitive with traditional backpropagation techniques, both in terms of
solution quality and training speed. Comment: 19 pages.
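Purely to illustrate the alternating-minimization idea on a toy problem (this is not DANTE's quasi-convex formulation; the network, loss, and update rules below are simplified assumptions), one layer of a single-hidden-layer regression network can be optimized while the other is held fixed:

```python
# Toy alternating minimization for y ≈ sigmoid(X @ W1) @ W2: alternately
# solve for W2 with W1 fixed (a convex least-squares problem), then take a
# few gradient steps on W1 with W2 fixed. Not the paper's exact solver.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))                 # toy inputs
Y = np.sin(X @ rng.normal(size=(8, 1)))       # toy regression targets

W1 = rng.normal(scale=0.5, size=(8, 16))
W2 = rng.normal(scale=0.5, size=(16, 1))
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
lr = 0.1

for epoch in range(200):
    # Step 1: with W1 fixed, the hidden activations are fixed, so the best W2
    # is the least-squares solution of H @ W2 ≈ Y.
    H = sigmoid(X @ W1)
    W2, *_ = np.linalg.lstsq(H, Y, rcond=None)

    # Step 2: with W2 fixed, take a few gradient steps on W1.
    for _ in range(5):
        H = sigmoid(X @ W1)
        err = H @ W2 - Y
        grad_W1 = X.T @ ((err @ W2.T) * H * (1 - H)) / len(X)
        W1 -= lr * grad_W1

mse = float(np.mean((sigmoid(X @ W1) @ W2 - Y) ** 2))
print(f"final mean squared error: {mse:.4f}")
```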
BUFFET: Benchmarking Large Language Models for Few-shot Cross-lingual Transfer
Despite remarkable advancements in few-shot generalization in natural
language processing, most models are developed and evaluated primarily in
English. To facilitate research on few-shot cross-lingual transfer, we
introduce a new benchmark, called BUFFET, which unifies 15 diverse tasks across
54 languages in a sequence-to-sequence format and provides a fixed set of
few-shot examples and instructions. BUFFET is designed to establish a rigorous
and equitable evaluation framework for few-shot cross-lingual transfer across a
broad range of tasks and languages. Using BUFFET, we perform thorough
evaluations of state-of-the-art multilingual large language models with
different transfer methods, namely in-context learning and fine-tuning. Our
findings reveal significant room for improvement in few-shot in-context
cross-lingual transfer. In particular, ChatGPT with in-context learning often
performs worse than much smaller mT5-base models fine-tuned on English task
data and few-shot in-language examples. Our analysis suggests various avenues
for future research in few-shot cross-lingual transfer, such as improved
pretraining, understanding, and future evaluations. Comment: The data and code
are available at https://buffetfs.github.io
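As a rough sketch of what few-shot in-context evaluation in a sequence-to-sequence format involves (the task, field names, and instruction text here are invented, not BUFFET's actual schema), a fixed set of demonstrations can be assembled into a single prompt per test instance:

```python
# Toy few-shot prompt construction: a fixed list of in-language demonstrations
# is prepended to each test input before the prompt is sent to a model.
from typing import Dict, List

def build_prompt(instruction: str, demos: List[Dict[str, str]], test_input: str) -> str:
    parts = [instruction]
    for d in demos:
        parts.append(f"Input: {d['input']}\nOutput: {d['output']}")
    parts.append(f"Input: {test_input}\nOutput:")
    return "\n\n".join(parts)

demos = [
    {"input": "Das Essen war fantastisch.", "output": "positive"},
    {"input": "Der Service war enttäuschend.", "output": "negative"},
]
print(build_prompt(
    "Classify the sentiment of the review as positive or negative.",
    demos,
    "Die Lieferung kam viel zu spät.",
))
```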
MADLAD-400: A Multilingual And Document-Level Large Audited Dataset
We introduce MADLAD-400, a manually audited, general domain 3T token
monolingual dataset based on CommonCrawl, spanning 419 languages. We discuss
the limitations revealed by self-auditing MADLAD-400, and the role data
auditing had in the dataset creation process. We then train and release a
10.7B-parameter multilingual machine translation model on 250 billion tokens
covering over 450 languages using publicly available data, and find that it is
competitive with models that are significantly larger, and report the results
on different domains. In addition, we train an 8B-parameter language model and
assess the results on few-shot translation. We make the baseline models
available to the research community. Comment: Preprint.
Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets
With the success of large-scale pre-training and multilingual modeling in
Natural Language Processing (NLP), recent years have seen a proliferation of
large, web-mined text datasets covering hundreds of languages. We manually
audit the quality of 205 language-specific corpora released with five major
public datasets (CCAligned, ParaCrawl, WikiMatrix, OSCAR, mC4). Lower-resource
corpora have systematic issues: at least 15 corpora contain no usable text, and
in a significant fraction, less than 50% of the sentences are of acceptable quality. In
addition, many are mislabeled or use nonstandard/ambiguous language codes. We
demonstrate that these issues are easy to detect even for non-proficient
speakers, and supplement the human audit with automatic analyses. Finally, we
recommend techniques to evaluate and improve multilingual corpora and discuss
potential risks that come with low-quality data releases. Comment: Accepted at
TACL; pre-MIT Press publication version.
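A small example of the kind of automatic check that can supplement a manual audit (the quality heuristics are simplified assumptions, and predict_language is a stand-in for any off-the-shelf language identifier rather than a specific library call):

```python
# Toy automatic corpus audit: estimate the share of lines that look like
# usable sentences in the language the corpus is labeled with.
from typing import Callable, Iterable

def audit_corpus(lines: Iterable[str],
                 expected_lang: str,
                 predict_language: Callable[[str], str]) -> float:
    """Return the fraction of lines passing simple quality checks."""
    total, acceptable = 0, 0
    for line in lines:
        total += 1
        text = line.strip()
        if len(text.split()) < 3:                  # too short to be a usable sentence
            continue
        if predict_language(text) != expected_lang:
            continue                               # mislabeled or wrong-language content
        acceptable += 1
    return acceptable / max(total, 1)

# Usage with a trivially fake identifier, for illustration only:
fake_lid = lambda text: "en"
print(audit_corpus(["Hello world today.", "xx", "Another usable sentence."], "en", fake_lid))
```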
PaLM 2 Technical Report
We introduce PaLM 2, a new state-of-the-art language model that has better
multilingual and reasoning capabilities and is more compute-efficient than its
predecessor PaLM. PaLM 2 is a Transformer-based model trained using a mixture
of objectives. Through extensive evaluations on English and multilingual
language tasks and on reasoning tasks, we demonstrate that PaLM 2 has significantly
improved quality on downstream tasks across different model sizes, while
simultaneously exhibiting faster and more efficient inference compared to PaLM.
This improved efficiency enables broader deployment while also allowing the
model to respond faster, for a more natural pace of interaction. PaLM 2
demonstrates robust reasoning capabilities exemplified by large improvements
over PaLM on BIG-Bench and other reasoning tasks. PaLM 2 exhibits stable
performance on a suite of responsible AI evaluations, and enables
inference-time control over toxicity without additional overhead or impact on
other capabilities. Overall, PaLM 2 achieves state-of-the-art performance
across a diverse set of tasks and capabilities.
When discussing the PaLM 2 family, it is important to distinguish between
pre-trained models (of various sizes), fine-tuned variants of these models, and
the user-facing products that use these models. In particular, user-facing
products typically include additional pre- and post-processing steps.
Additionally, the underlying models may evolve over time. Therefore, one should
not expect the performance of user-facing products to exactly match the results
reported here.