Search CORE

8 research outputs found

LLM-Prop: Predicting Physical And Electronic Properties Of Crystalline Solids From Their Text Descriptions

Author: Arnold Craig
Dieng Adji Bousso
Rand Barry P.
Rubungo Andre Niyongabo
Publication date: 21/10/2023

The prediction of crystal properties plays a crucial role in the crystal design process. Current methods for predicting crystal properties focus on modeling crystal structures using graph neural networks (GNNs). Although GNNs are powerful, accurately modeling the complex interactions between atoms and molecules within a crystal remains a challenge. Surprisingly, predicting crystal properties from crystal text descriptions is understudied, despite the rich information and expressiveness that text data offer. One of the main reasons is the lack of publicly available data for this task. In this paper, we develop and make public a benchmark dataset (called TextEdge) that contains text descriptions of crystal structures with their properties. We then propose LLM-Prop, a method that leverages the general-purpose learning capabilities of large language models (LLMs) to predict the physical and electronic properties of crystals from their text descriptions. LLM-Prop outperforms the current state-of-the-art GNN-based crystal property predictor by about 4% in predicting band gap, 3% in classifying whether the band gap is direct or indirect, and 66% in predicting unit cell volume. LLM-Prop also outperforms a finetuned MatBERT, a domain-specific pre-trained BERT model, despite having 3 times fewer parameters. Our empirical results may highlight the current inability of GNNs to capture information pertaining to space group symmetry and Wyckoff sites for accurate crystal property prediction.

Comment: Code for LLM-Prop can be found at: https://github.com/vertaix/LLM-Pro

arXiv.org e-Print Archive

Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets

Author: Adeyemi Mofetoluwa
Agrawal Sweta
Ahia Oghenefego
Ahia Orevaoghene
Ataman Duygu
Awokoya Ayodele
Azime Israel Abebe
Baljekar Pallavi
Ballı Sakine Çabuk
Bapna Ankur
Baruwa Ahmed
Battisti Alessia
Biderman Stella
Caswell Isaac
de Silva Nisansa
Dlamini Sakhile
Dossou Bonaventure F. P.
Firat Orhan
Jenny Mathias
Jernite Yacine
Kreutzer Julia
Kudugunta Sneha
Lawson Nze
Leong Colin
Matangira Tapiwanashe
Mirzakhalov Jamshidbek
Mnyakeni Ayanda
Muhammad Nanda
Muhammad Shamsuddeen Hassan
Müller André
Müller Mathias
Nguyen Toan Q.
Ogueji Kelechi
Orife Iroro
Osei Salomey
Papadimitriou Isabel
Rios Annette
Rivera Clara
Rubungo Andre Niyongabo
Sagot Benoît
Samb Sokhar
Sarin Supheakmungkol
Setyawan Monang
Sikasote Claytone
Sokolov Artem
Subramani Nishant
Suárez Pedro Ortiz
Tapo Allahsera
Ulzii-Orshikh Nasanbayar
van Esch Daan
Wahab Ahsan
Wang Lisa
Publication date: 23/03/2021

With the success of large-scale pre-training and multilingual modeling in Natural Language Processing (NLP), recent years have seen a proliferation of large, web-mined text datasets covering hundreds of languages. We manually audit the quality of 205 language-specific corpora released with five major public datasets (CCAligned, ParaCrawl, WikiMatrix, OSCAR, mC4). Lower-resource corpora have systematic issues: At least 15 corpora have no usable text, and a significant fraction contains less than 50% sentences of acceptable quality. In addition, many are mislabeled or use nonstandard/ambiguous language codes. We demonstrate that these issues are easy to detect even for non-proficient speakers, and supplement the human audit with automatic analyses. Finally, we recommend techniques to evaluate and improve multilingual corpora and discuss potential risks that come with low-quality data releases.

Comment: Accepted at TACL; pre-MIT Press publication version

arXiv.org e-Print Archive

INRIA a CCSD electronic archive server

HAL-Rennes 1

Participatory Research for Low-resourced Machine Translation: A Case Study in African Languages

Author: Abbott Jade
Adeyemi Mofe
Ahia Orevaoghene
Akinfaderin Adewale
Akinola Solomon Oluwole
Ali Jamiil Toure
Bashir Abdallah
Bassey Blessing Itoro
Biljon Elan van
Dangana Idris Abdulkabir
Degila Kevin
Dossou Bonaventure
Duru Goodness
Elsahar Hady
Emezue Chris
Ezeani Ignatius
Fagbohungbe Taiwo
Fasubaa Timi
Freshia Sackey
Kabongo Salomon
Kamper Herman
Kioko Ghollah
Kolawole Tajudeen
Kreutzer Julia
Macharm Ricky
Marivate Vukosi
Martinus Laura Jane
Matsila Tshinondiwa
Meressa Musie
Mokgesi-Selinga Masabata
Muhammad Shamsuddeen Hassan
Murhabazi Espoir
Nekoto Wilhelmina
Niyongabo Rubungo Andre
Ogayo Perez
Ogueji Kelechi
Okegbemi Lawrence
Olabiyi Ayodele
Onyefuluchi Christopher
Orife Iroro
Osei Salomey
Ramkilowan Arshath
Sibanda Blessing
Siminyu Kathleen
Tajudeen Kolawole
Webster Jason
Whitenack Daniel
Öktem Alp
Publication date: 01/01/2020

Research in NLP lacks geographic diversity, and the question of how NLP can be scaled to low-resourced languages has not yet been adequately solved. "Low-resourced"-ness is a complex problem going beyond data availability and reflects systemic problems in society. In this paper, we focus on the task of Machine Translation (MT), which plays a crucial role for information accessibility and communication worldwide. Despite immense improvements in MT over the past decade, MT is centered around a few high-resourced languages. As MT researchers cannot solve the problem of low-resourcedness alone, we propose participatory research as a means to involve all necessary agents required in the MT development process. We demonstrate the feasibility and scalability of participatory research with a case study on MT for African languages. Its implementation leads to a collection of novel translation datasets, MT benchmarks for over 30 languages, with human evaluations for a third of them, and enables participants without formal training to make a unique scientific contribution. Benchmarks, models, data, code, and evaluation results are released under https://github.com/masakhane-io/masakhane-mt

arXiv.org e-Print Archive

Crossref

Lancaster E-Prints

AfriQA: Cross-lingual Open-Retrieval Question Answering for African Languages

Author: Abdou Aziz DIOP
Adelani David Ifeoluwa
Adeyemi Mofetoluwa
Adhiambo Sonia
Ahia Orevaoghene
Ahmad Ibrahim Said
Ajayi Tunde Oluwaseyi
Ajisafe Daniel A.
Alabi Jesujoba O.
Amuok Priscilla A.
Anuoluwapo Aremu
Arthur Steven
Asai Akari
Awosan Oyinkansola
Ayodele Awokoya
Buzaaba Happy
Chinedu Mbonu
Chukwuneke Chiamaka
Clark Jonathan H.
Dossou Bonaventure F. P.
Emezue Chris
Ezeani Ignatius
Gwadabe Tajuddeen R.
Hacheme Gilles
Iro Ruqayya Nasir
Kahira Albert Njoroge
Lawan Falalu Ibrahim
Mabuya Rooweither
Mbow Habib
Mngoma Ndumiso
Muhammad Shamsuddeen H.
Mukonde Eunice
Mwase Christine
Namukombo Martin
Niyomutabazi Emile
Ogundepo Odunayo
Oladipo Akintunde
Onwuegbuzia Emeka Felix
Opoku Bernard
Osei Salomey
Otiende Verrah
Owodunni Abraham Toluwase
Phiri Mofya
Putini Neo
Rivera Clara E.
Rubungo Andre Niyongabo
Ruder Sebastian
Shode Iyanuoluwa
Sikasote Claytone
Sinkala Boyd
Siro Clemencia
Tonja Atnafu Lambebo
Publication venue: 'Center for Open Science'
Publication date: 11/05/2023

African languages have far less in-language content available digitally, making it challenging for question answering systems to satisfy the information needs of users. Cross-lingual open-retrieval question answering (XOR QA) systems, those that retrieve answer content from other languages while serving people in their native language, offer a means of filling this gap. To this end, we create AfriQA, the first cross-lingual QA dataset with a focus on African languages. AfriQA includes 12,000+ XOR QA examples across 10 African languages. While previous datasets have focused primarily on languages where cross-lingual QA augments coverage from the target language, AfriQA focuses on languages where cross-lingual answer content is the only high-coverage source of answer content. Because of this, we argue that African languages are one of the most important and realistic use cases for XOR QA. Our experiments demonstrate the poor performance of automatic translation and multilingual retrieval methods. Overall, AfriQA proves challenging for state-of-the-art QA models. We hope that the dataset enables the development of more equitable QA technology.

Lancaster E-Prints

AfriQA: Cross-lingual Open-Retrieval Question Answering for African Languages

Author: Adelani David Ifeoluwa
Adeyemi Mofetoluwa
Adhiambo Sonia
Ahia Orevaoghene
Ahmad Ibrahim Said
Ajayi Tunde Oluwaseyi
Ajisafe Daniel A.
Alabi Jesujoba O.
Amuok Priscilla A.
Anuoluwapo Aremu
Arthur Steven
Asai Akari
Awosan Oyinkansola
Ayodele Awokoya
Buzaaba Happy
Chinedu Mbonu
Chukwuneke Chiamaka
Clark Jonathan H.
DIOP Abdou Aziz
Dossou Bonaventure F. P.
Emezue Chris
Ezeani Ignatius
Gwadabe Tajuddeen R.
Hacheme Gilles
Iro Ruqayya Nasir
Kahira Albert Njoroge
Lawan Falalu Ibrahim
Mabuya Rooweither
Mbow Habib
Mngoma Ndumiso
Muhammad Shamsuddeen H.
Mukonde Eunice
Mwase Christine
Namukombo Martin
Niyomutabazi Emile
Ogundepo Odunayo
Oladipo Akintunde
Onwuegbuzia Emeka Felix
Opoku Bernard
Osei Salomey
Otiende Verrah
Owodunni Abraham Toluwase
Phiri Mofya
Putini Neo
Rivera Clara E.
Rubungo Andre Niyongabo
Ruder Sebastian
Shode Iyanuoluwa
Sikasote Claytone
Sinkala Boyd
Siro Clemencia
Tonja Atnafu Lambebo
Publication date: 11/05/2023

arXiv.org e-Print Archive

The GEM Benchmark: Natural Language Generation, its Evaluation and Metrics

Author: Adewumi Tosin
Aggarwal Karmanya
Ammanamanchi Pawan Sasanka
Anuoluwapo Aremu
Bosselut Antoine
Cabezudo Marco Antonio Sobrevilla
Chandu Khyathi Raghavi
Clinciu Miruna
Das Dipanjan
Dhole Kaustubh D.
Du Wanyu
Durmus Esin
Dušek Ondřej
Emezue Chris
Gangal Varun
Garbacea Cristina
Gehrmann Sebastian
Hashimoto Tatsunori
Hou Yufang
Jernite Yacine
Jhamtani Harsh
Ji Yangfeng
Jolly Shailza
Kale Mihir
Kumar Dhruv
Ladhak Faisal
Madaan Aman
Maddela Mounica
Mahajan Khyati
Mahamood Saad
Majumder Bodhisattwa Prasad
Martins Pedro Henrique
McMillan-Major Angelina
Mille Simon
Nadeem Moin
Narayan Shashi
Nikolaev Vitaly
Niyongabo Rubungo Andre
Osei Salomey
Parikh Ankur
Perez-Beltrachini Laura
Rao Niranjan Ramesh
Raunak Vikas
Rodriguez Juan Diego
Santhanam Sashank
Sedoc João
Sellam Thibault
Shaikh Samira
Shimorina Anastasia
Strobelt Hendrik
Subramani Nishant
van Miltenburg Emiel
Xu Wei
Yang Diyi
Yerukola Akhila
Zhou Jiawei
Publication venue: 'Association for Computational Linguistics (ACL)'
Publication date: 01/04/2021

We introduce GEM, a living benchmark for natural language Generation (NLG), its Evaluation, and Metrics. Measuring progress in NLG relies on a constantly evolving ecosystem of automated metrics, datasets, and human evaluation standards. Due to this moving target, new models often still evaluate on divergent anglo-centric corpora with well-established, but flawed, metrics. This disconnect makes it challenging to identify the limitations of current models and opportunities for progress. Addressing this limitation, GEM provides an environment in which models can easily be applied to a wide set of tasks and in which evaluation strategies can be tested. Regular updates to the benchmark will help NLG research become more multilingual and evolve the challenge alongside models. This paper serves as the description of the data for which we are organizing a shared task at our ACL 2021 Workshop and in which we invite the entire NLG community to participate.

arXiv.org e-Print Archive

Heriot Watt Pure

INRIA a CCSD electronic archive server

Tilburg University Repository