Towards Orthographic and Grammatical Clinical Text Correction: a First Approach

Abstract

Grammatical Error Correction (GEC) is a subfield of Natural Language Processing that aims to automatically correct texts containing spelling, punctuation, or grammar errors. So far, it has mainly focused on texts produced by second-language learners, mostly in English. This Master's Thesis describes a first approach to Grammatical Error Correction for Spanish health records. This specific field has been little explored so far, neither for Spanish in general nor for the clinical domain specifically. For this purpose, the IMEC corpus (Informes Médicos en Español Corregidos), a manually corrected parallel collection of Electronic Health Records, is introduced. The corpus has been automatically annotated using ERRANT, a toolkit specialized in the automatic annotation of GEC parallel corpora, which was adapted to Spanish for this task. Furthermore, several experiments using neural networks and data augmentation are presented and compared with a baseline system also created specifically for this task.
