Democratizing access to natural language processing (NLP) technology is
crucial, especially for underrepresented and extremely low-resource languages.
Previous research has focused on developing labeled and unlabeled corpora for
these languages through online scraping and document translation. While these
methods have proven effective and cost-efficient, we have identified
limitations in the resulting corpora, including a lack of lexical diversity and
cultural relevance to local communities. To address this gap, we conduct a case
study on Indonesian local languages. We compare the effectiveness of online
scraping, human translation, and paragraph writing by native speakers in
constructing datasets. Our findings demonstrate that datasets generated through
paragraph writing by native speakers exhibit superior quality in terms of
lexical diversity and cultural content. In addition, we present the
\datasetname{} benchmark, encompassing 12 underrepresented and extremely
low-resource languages spoken by millions of individuals in Indonesia. Our
experiments with existing multilingual large language models indicate that
these models need to be extended to more underrepresented languages. We
release the NusaWrites dataset at https://github.com/IndoNLP/nusa-writes.