Polish Natural Language Inference and Factivity -- an Expert-based Dataset and Benchmarks

Seweryn, Karolina; Wróblewska, Anna; Ziembicki, Daniel

Authors: Karolina Seweryn, Anna Wróblewska, Daniel Ziembicki
Publication date: 10 January 2022

Abstract

Despite recent breakthroughs in Machine Learning for Natural Language Processing, Natural Language Inference (NLI) remains a challenge. To this end, we contribute a new dataset that focuses exclusively on the factivity phenomenon; however, our task remains the same as in other NLI tasks, i.e. prediction of entailment, contradiction or neutral (ECN). The dataset consists entirely of natural language utterances in Polish and gathers 2,432 verb-complement pairs and 309 unique verbs. It is based on the National Corpus of Polish (NKJP) and is a representative sample with regard to the frequency of main verbs and other linguistic features (e.g. occurrence of internal negation). We found that transformer BERT-based models working on sentences obtained relatively good results ($\approx 89\%$ F1 score). Even though better results were achieved using linguistic features ($\approx 91\%$ F1 score), this model requires more human labour (humans in the loop) because the features were prepared manually by expert linguists. The BERT-based models that consume only the input sentences capture most of the complexity of NLI/factivity. Complex cases of the phenomenon, e.g. cases with entailment (E) and non-factive verbs, remain an open issue for further research.
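As a rough illustration of how an F1 score over the three ECN labels can be computed, the sketch below implements macro-averaged F1 in plain Python. The abstract does not state which averaging scheme the authors used, so macro averaging is an assumption here; the function name `macro_f1` and the toy labels are hypothetical and are not taken from the dataset.

```python
from collections import Counter

# Label set from the abstract: entailment, contradiction, neutral.
LABELS = ("E", "C", "N")


def macro_f1(gold, pred, labels=LABELS):
    """Return the unweighted mean of per-label F1 scores.

    Assumption: macro averaging. The abstract does not say which
    averaging scheme the reported F1 scores use.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label p is a false positive
            fn[g] += 1  # gold label g was missed

    scores = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the label never occurs.
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels invented for illustration only (not from the dataset).
gold = ["E", "E", "N", "C", "N", "E"]
pred = ["E", "N", "N", "C", "N", "E"]
score = macro_f1(gold, pred)
print(f"macro F1 = {score:.3f}")
```

Macro averaging weights each class equally, which matters when one label is much rarer than the others; a weighted or micro average would give different numbers on the same predictions.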

arXiv.org e-Print Archive

oai:arXiv.org:2201.03521

Last time updated on 20/03/2022