Benchmarking Large Language Models on Floating-Point Error Classification

Taldir, Lisa; Saeed, Muhammad, Ahmad; Defour, David; de Oliveira Castro, Pablo; Petit, Eric

working paper

oai:HAL:hal-05560550v1

Authors: Lisa Taldir
Muhammad, Ahmad Saeed
David Defour
Pablo de Oliveira Castro
Eric Petit
Publication date: 20 March 2026
Publisher: 'Centre pour la Communication Scientifique Directe (CCSD)'

Abstract

This paper investigates the capability of Large Language Models (LLMs) to statically detect and classify floating-point errors in software code. We introduce InterFLOPBench, a benchmark of 90 C kernels with 1,130 test samples, designed to evaluate LLMs across six categories of floating-point error: cancellation, comparison, division by zero, overflow, underflow, and NaN. We use it to compare 14 LLMs. The evaluation framework treats floating-point error detection as a multi-label classification problem and measures performance with the F1-score. Results show that the latest models (Qwen 3 32b, Gemini 2.5 Flash, Phi 4 Reasoning, DeepSeek R1T2, and gpt-oss 20b and 120b) achieve an overall F1-score greater than 0.88. Performance varies across error categories, from explicit operations such as division by zero (average F1-score: 0.8479) to more subtle numerical phenomena such as underflow (average F1-score: 0.6059) and cancellation (average F1-score: 0.6164).
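To make the multi-label framing concrete, the following is a minimal sketch of how an F1-score could be computed when each test sample may carry several error labels at once. It is not the paper's evaluation code. The label strings, the function names `f1_per_category` and `micro_f1`, the choice of micro-averaging for the overall score, and the convention of scoring 0.0 for a category with no occurrences are all assumptions made for illustration; the abstract does not state which averaging scheme is used.

```python
# Hypothetical label names, derived from the six categories named in the abstract.
CATEGORIES = (
    "cancellation", "comparison", "division_by_zero",
    "overflow", "underflow", "nan",
)


def f1_per_category(truth, predicted):
    """Return {category: F1} for two parallel lists of label sets."""
    scores = {}
    for cat in CATEGORIES:
        pairs = list(zip(truth, predicted))
        tp = sum(1 for t, p in pairs if cat in t and cat in p)
        fp = sum(1 for t, p in pairs if cat not in t and cat in p)
        fn = sum(1 for t, p in pairs if cat in t and cat not in p)
        denom = 2 * tp + fp + fn
        # Assumed convention: a category that never occurs and is never
        # predicted scores 0.0 rather than being undefined.
        scores[cat] = 2 * tp / denom if denom else 0.0
    return scores


def micro_f1(truth, predicted):
    """Micro-averaged F1: pool true/false positives over all labels."""
    tp = fp = fn = 0
    for t, p in zip(truth, predicted):
        tp += len(t & p)   # labels correctly reported
        fp += len(p - t)   # labels reported but not present
        fn += len(t - p)   # labels present but missed
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


if __name__ == "__main__":
    # Four made-up samples; an empty set means "no floating-point error".
    truth = [{"cancellation"}, {"division_by_zero", "nan"}, set(), {"overflow"}]
    predicted = [{"cancellation"}, {"division_by_zero"}, {"underflow"}, {"overflow"}]
    print(micro_f1(truth, predicted))  # 0.75
    print(f1_per_category(truth, predicted))
```

In this toy run the model misses one `nan` label and reports one spurious `underflow`, giving 3 true positives, 1 false positive and 1 false negative, hence 2·3 / (2·3 + 1 + 1) = 0.75. Per-category scores computed this way are what make a gap such as the one between division by zero and underflow visible.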

Last time updated on 07/05/2026

This paper was published in HAL Portal UG Universite de Guyane.

Licence: info:eu-repo/semantics/OpenAccess