1,132 research outputs found

    Lost in Translation: Large Language Models in Non-English Content Analysis

    Full text link
    In recent years, large language models (e.g., Open AI's GPT-4, Meta's LLaMa, Google's PaLM) have become the dominant approach for building AI systems to analyze and generate language online. However, the automated systems that increasingly mediate our interactions online -- such as chatbots, content moderation systems, and search engines -- are primarily designed for and work far more effectively in English than in the world's other 7,000 languages. Recently, researchers and technology companies have attempted to extend the capabilities of large language models into languages other than English by building what are called multilingual language models. In this paper, we explain how these multilingual language models work and explore their capabilities and limits. Part I provides a simple technical explanation of how large language models work, why there is a gap in available data between English and other languages, and how multilingual language models attempt to bridge that gap. Part II accounts for the challenges of doing content analysis with large language models in general and multilingual language models in particular. Part III offers recommendations for companies, researchers, and policymakers to keep in mind when considering researching, developing and deploying large and multilingual language models.Comment: 50 pages, 4 figure

    NusaCrowd: Open Source Initiative for Indonesian NLP Resources

    Full text link
    We present NusaCrowd, a collaborative initiative to collect and unify existing resources for Indonesian languages, including opening access to previously non-public resources. Through this initiative, we have brought together 137 datasets and 118 standardized data loaders. The quality of the datasets has been assessed manually and automatically, and their value is demonstrated through multiple experiments. NusaCrowd's data collection enables the creation of the first zero-shot benchmarks for natural language understanding and generation in Indonesian and the local languages of Indonesia. Furthermore, NusaCrowd brings the creation of the first multilingual automatic speech recognition benchmark in Indonesian and the local languages of Indonesia. Our work strives to advance natural language processing (NLP) research for languages that are under-represented despite being widely spoken

    Reclassification of the Leipzig Corpora Collection for Malay and Indonesian

    Get PDF

    A review on deep-learning-based cyberbullying detection

    Get PDF
    Bullying is described as an undesirable behavior by others that harms an individual physically, mentally, or socially. Cyberbullying is a virtual form (e.g., textual or image) of bullying or harassment, also known as online bullying. Cyberbullying detection is a pressing need in today’s world, as the prevalence of cyberbullying is continually growing, resulting in mental health issues. Conventional machine learning models were previously used to identify cyberbullying. However, current research demonstrates that deep learning surpasses traditional machine learning algorithms in identifying cyberbullying for several reasons, including handling extensive data, efficiently classifying text and images, extracting features automatically through hidden layers, and many others. This paper reviews the existing surveys and identifies the gaps in those studies. We also present a deep-learning-based defense ecosystem for cyberbullying detection, including data representation techniques and different deep-learning-based models and frameworks. We have critically analyzed the existing DL-based cyberbullying detection techniques and identified their significant contributions and the future research directions they have presented. We have also summarized the datasets being used, including the DL architecture being used and the tasks that are accomplished for each dataset. Finally, several challenges faced by the existing researchers and the open issues to be addressed in the future have been presented
    corecore