
    Holaaa!! Writin like u talk is kewl but kinda hard 4 NLP

    We present work in progress aiming to build tools for the normalization of User-Generated Content (UGC). As we will see, the task requires revisiting the initial steps of NLP processing, since UGC (micro-blogs, blogs and, generally, Web 2.0 user texts) presents a number of non-standard communicative and linguistic characteristics, and is in fact much closer to oral and colloquial language than to edited text. We present and characterize a corpus of UGC text in Spanish from three different sources: Twitter, consumer reviews and blogs. We motivate the need for UGC text normalization by analyzing the problems found when processing this type of text through a conventional language processing pipeline, particularly in the tasks of lemmatization and morphosyntactic tagging, and finally we propose a strategy for automatically normalizing UGC using a selector of correct forms on top of a pre-existing spell-checker.
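
    The abstract leaves the selector unspecified; as a minimal sketch of the idea, assuming a generic suggest() interface over some pre-existing spell-checker and toy unigram counts (both hypothetical, not the authors' implementation):

```python
# Minimal sketch (not the authors' implementation) of a "selector of
# correct forms on top of a pre-existing spell-checker". `suggest` and
# the unigram counts are hypothetical stand-ins.
from typing import Callable

UNIGRAM_FREQ = {"que": 9000, "cool": 4500, "guay": 1200}  # toy counts

def normalize_token(token: str,
                    suggest: Callable[[str], list[str]],
                    lexicon: set[str]) -> str:
    """Keep known forms; otherwise pick the spell-checker suggestion
    with the highest corpus frequency."""
    if token.lower() in lexicon:
        return token
    candidates = suggest(token)
    if not candidates:
        return token  # leave unknown forms untouched
    return max(candidates, key=lambda c: UNIGRAM_FREQ.get(c.lower(), 0))
```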

    Ordering the suggestions of a spellchecker without using context

    Having located a misspelling, a spellchecker generally offers some suggestions for the intended word. Even without using context, a spellchecker can draw on various types of information in ordering its suggestions. A series of experiments is described, beginning with a basic corrector that implements a well-known algorithm for reversing single simple errors, and making successive enhancements to take account of substring matches, pronunciation, known error patterns, syllable structure and word frequency. The improvement in the ordering produced by each enhancement is measured on a large corpus of misspellings. The final version is tested on other corpora against a widely used commercial spellchecker and a research prototype.
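
    A rough illustration of context-free suggestion ordering in the spirit of the abstract's basic corrector: generate dictionary words one simple error away (the four classic error types) and rank them by word frequency plus a first-letter cue. The weighting below is illustrative, not the paper's.

```python
# Sketch of context-free suggestion ordering: candidates are dictionary
# words reachable by one insertion, deletion, substitution or
# transposition; ranking uses frequency and a shared-first-letter bonus.
from string import ascii_lowercase

def single_edit_candidates(word: str, lexicon: set[str]) -> set[str]:
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    edits = set()
    for left, right in splits:
        if right:
            edits.add(left + right[1:])                        # deletion
            for ch in ascii_lowercase:
                edits.add(left + ch + right[1:])               # substitution
        if len(right) > 1:
            edits.add(left + right[1] + right[0] + right[2:])  # transposition
        for ch in ascii_lowercase:
            edits.add(left + ch + right)                       # insertion
    return edits & lexicon

def rank(misspelling: str, lexicon: set[str], freq: dict[str, int]) -> list[str]:
    def score(cand: str) -> float:
        s = float(freq.get(cand, 0))
        if cand[0] == misspelling[0]:
            s *= 2.0  # typists usually get the first letter right
        return s
    return sorted(single_edit_candidates(misspelling, lexicon),
                  key=score, reverse=True)
```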

    A spelling corrector for Basque based on morphology

    This paper describes the components used in the elaboration of the commercial Xuxen spelling checker/corrector for Basque. Because Basque is a highly inflected and agglutinative language, the spelling checker/corrector has been conceived as a by-product of a general purpose morphological analyser/generator. The spelling checker/corrector performs morphological decomposition in order to check misspellings and, to correct them, uses a new strategy which combines the use of an additional two-level morphological subsystem for orthographic errors with the recognition of correct morphemes inside the word-form during the generation of proposals for typographical errors. Due to the late standardization of Basque, Xuxen is also intended as a useful tool for the standardization of present-day written Basque.
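
    Xuxen's two-level morphological subsystem is far richer than can be shown here, but as a toy sketch of spell checking by morphological decomposition in an agglutinative language (stems and suffixes below are illustrative only):

```python
# Toy sketch of checking via morphological decomposition, in the spirit
# of (but far simpler than) Xuxen: a word-form is accepted iff it
# segments into a known stem plus a chain of known suffixes.
STEMS = {"etxe", "mendi"}            # e.g. 'house', 'mountain'
SUFFIXES = {"a", "ak", "tik", "ra"}  # illustrative endings

def accepts(form: str) -> bool:
    """Accept `form` iff it is stem + zero or more known suffixes."""
    def rest_ok(rest: str) -> bool:
        if not rest:
            return True
        return any(rest.startswith(s) and rest_ok(rest[len(s):])
                   for s in SUFFIXES)
    return any(form.startswith(stem) and rest_ok(form[len(stem):])
               for stem in STEMS)

assert accepts("etxetik")      # segments as etxe + tik
assert not accepts("etxetix")  # no valid suffix decomposition
```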

    Linguistic errors in the biomedical domain: Towards a typology of errors for Spanish

    The objective of this work is the analysis of errors contained in a corpus of medical reports in natural language and the design of a typology of errors, as there has been no systematic review of verification and correction of errors in clinical documentation in Spanish. In the development of automatic detection and correction systems, it is of great interest to delve into the nature of the linguistic errors that occur in clinical reports, in order to detect and treat them properly. The results show that omission errors are the most frequent ones in the analyzed sample, and that word length clearly influences error frequency. The typification of the error patterns provided is enabling the development of a module based on linguistic knowledge, currently in progress, which will help to improve the performance of error detection and correction systems for the biomedical domain. This work was supported by the Spanish National Research Agency (AEI) through project LaTe4PSP (PID2019-107652RB-I00/AEI/10.13039/501100011033). Furthermore, the main author is supported by the Ministerio de Universidades of Spain through the national program Ayudas para la formación de profesorado universitario (FPU), under reference FPU16/0332.
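
    The paper's typology is finer-grained than this, but as a minimal illustration of classifying the single-character error types it counts (omission, insertion, substitution, transposition) against an intended correction:

```python
# Minimal illustration (not the paper's full typology) of labelling a
# single-edit spelling error relative to the intended word.
def classify(error: str, target: str) -> str:
    if len(error) == len(target) - 1:
        return "omission"       # e.g. 'paciete' for 'paciente'
    if len(error) == len(target) + 1:
        return "insertion"
    if len(error) == len(target):
        diffs = [i for i, (a, b) in enumerate(zip(error, target)) if a != b]
        if len(diffs) == 1:
            return "substitution"
        if (len(diffs) == 2 and diffs[1] == diffs[0] + 1
                and error[diffs[0]] == target[diffs[1]]
                and error[diffs[1]] == target[diffs[0]]):
            return "transposition"
    return "other"

assert classify("paciete", "paciente") == "omission"
assert classify("hopsital", "hospital") == "transposition"
```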

    Discovering Lexical Similarity Using Articulatory Feature-Based Phonetic Edit Distance

    Lexical Similarity (LS) between two languages uncovers many interesting linguistic insights such as phylogenetic relationship, mutual intelligibility, common etymology, and loan words. There are various methods through which LS is evaluated. This paper presents a method of Phonetic Edit Distance (PED) that uses a soft comparison of letters using the articulatory features associated with their International Phonetic Alphabet (IPA) transcription. In particular, the comparison between the articulatory features of two letters taken from words belonging to different languages is used to compute the cost of replacement in the inner loop of edit distance computation. As an example, PED gives an edit distance of 0.82 between the German word 'vater' ([fa:tər]) and the Persian word 'پدر' ([pedær]), both meaning 'father', and, similarly, a PED of 0.93 between the Hebrew word 'שלום' ([ʃəɭam]) and the Arabic word 'سلام' ([səɭa:m]), both meaning 'peace', whereas the classical edit distances would be 4 and 2, respectively. We report the results of systematic experiments conducted on six languages: Arabic, Hindi, Marathi, Persian, Sanskrit, and Urdu. Universal Dependencies (UD) corpora were used to restrict the comparison to lists of words belonging to the same part of speech. The LS based on the average PED between pairs of words was then computed for each pair of languages, unveiling similarities otherwise masked by the adoption of different alphabets, grammars, and pronunciation rules.
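
    A sketch of the PED idea (not the authors' exact feature set or costs): standard dynamic-programming edit distance in which the substitution cost is the fraction of articulatory features on which two IPA segments disagree. With a toy feature table the absolute values will differ from the paper's 0.82 and 0.93:

```python
# Sketch of Phonetic Edit Distance: DP edit distance whose substitution
# cost is the share of articulatory features on which two segments
# disagree. The (voicing, place, manner) table is a toy stand-in.
FEATURES = {
    "f": (0, "labiodental", "fricative"),
    "p": (0, "bilabial",    "plosive"),
    "t": (0, "alveolar",    "plosive"),
    "d": (1, "alveolar",    "plosive"),
    "r": (1, "alveolar",    "trill"),
    "a": (1, "open",        "vowel"),
    "e": (1, "mid",         "vowel"),
    "ə": (1, "mid",         "vowel"),
    "æ": (1, "near-open",   "vowel"),
}

def sub_cost(x: str, y: str) -> float:
    fx, fy = FEATURES[x], FEATURES[y]
    return sum(a != b for a, b in zip(fx, fy)) / len(fx)

def phonetic_edit_distance(s: list[str], t: list[str]) -> float:
    m, n = len(s), len(t)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = float(i)
    for j in range(1, n + 1):
        d[0][j] = float(j)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + 1,                       # deletion
                          d[i][j - 1] + 1,                       # insertion
                          d[i - 1][j - 1] + sub_cost(s[i - 1], t[j - 1]))
    return d[m][n]

# [f,a,t,ə,r] vs [p,e,d,æ,r]: every aligned substitution costs 1/3 here,
# giving about 1.33 — far below the classical edit distance of 4.
print(phonetic_edit_distance(list("fatər"), list("pedær")))
```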

    A spellchecker based on distributed representations

    The task of a spellchecker is to find and correct errors in words. Typically, the program offers the user a short list of likely intended words, ordered from most to least probable. This work investigates the applicability of binary distributed representations, and of methods for processing them, to the representation, search, and processing of misspelled words. Results of experiments on two sets of words with typical spelling errors are reported, along with a comparative analysis against other methods.
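
    The abstract does not detail the encoding; one hedged sketch of retrieval with binary distributed representations, assuming hashed character-bigram bitmasks and bit-overlap scoring (the vector size and hash are illustrative choices, not the paper's):

```python
# Hedged sketch: each word becomes a binary vector (bitmask) of hashed
# character bigrams; suggestions are dictionary words with the largest
# bit overlap with the misspelled query.
DIM = 256  # toy vector dimensionality

def encode(word: str) -> int:
    padded = f"#{word}#"  # boundary markers
    vec = 0
    for i in range(len(padded) - 1):
        vec |= 1 << (hash(padded[i:i + 2]) % DIM)
    return vec

def suggest(query: str, lexicon: list[str], k: int = 3) -> list[str]:
    q = encode(query)
    def overlap(w: str) -> int:
        return bin(q & encode(w)).count("1")
    return sorted(lexicon, key=overlap, reverse=True)[:k]

# Up to rare hash collisions: ['example', 'exam', 'sample']
print(suggest("exampel", ["example", "exam", "sample", "apple"]))
```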

    Fine-Tuning a MAP Error Correction Algorithm for Five-Key Chording Keyboards

    Different typing devices lead to different typing error patterns. In addition, different persons using the same device have different error patterns. Considering this, we propose and evaluate a spelling algorithm specifically designed for a five-key chording keyboard. It uses the maximum a posteriori probability rule, the probabilities that one character is typed for another (named confusion probabilities), and a dictionary model. Our study shows that the proposed algorithm reduces the substitution error rate from 7.60% to 1.25%. In comparison, MsWord and iSpell reduce the substitution error rates to 3.12% and 3.94%, respectively. The error rate can be further reduced to 1.15% by using individual confusion matrices for each user.
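
    A compact sketch of the MAP rule described: choose the dictionary word w maximizing P(w) · ∏ P(typed_i | w_i), with per-character confusion probabilities. All probabilities below are toy values, not the measured confusion matrices:

```python
# Sketch of MAP correction with a confusion matrix and dictionary model.
# All probabilities are toy values for illustration.
import math

CONFUSION = {  # P(typed | intended)
    ("a", "a"): 0.92, ("a", "e"): 0.08,
    ("e", "e"): 0.90, ("e", "a"): 0.10,
    ("t", "t"): 0.95, ("t", "d"): 0.05,
    ("d", "d"): 0.95, ("d", "t"): 0.05,
}

DICTIONARY = {"date": 0.6, "data": 0.4}  # P(w), toy dictionary model

def map_correct(typed: str) -> str:
    def log_posterior(word: str) -> float:
        if len(word) != len(typed):  # substitution-only error model
            return float("-inf")
        lp = math.log(DICTIONARY[word])
        for t, w in zip(typed, word):
            lp += math.log(CONFUSION.get((t, w), 1e-6))
        return lp
    return max(DICTIONARY, key=log_posterior)

print(map_correct("deta"))  # -> 'data' under these toy probabilities
```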

    Speling Successful Sucesfuly: Statistical Learning in Spelling

    Many spelling errors in English are doubling errors, as when people are stumped by the double ‹l› in ‹trellis›. In Study 1, we tabulated statistical patterns with regard to doubling in English. In Study 2, we collected behavioral data to see whether people were sensitive to these statistical doubling patterns and to explore other factors that might influence doubling, such as context, individual differences (language background and spelling ability), and task. We gave two nonword spelling tasks to US college students (N=68) and bilingual Singaporean college students from an English-based education system but with diverse language backgrounds: Mandarin (N=54), Malay (N=44), or Tamil (N=42). In the choice task, participants heard a nonword and chose between two spelling options, e.g. dremmib/dremib. In the free task, they wrote down its best spelling. We found a vowel length effect (more doubling after short vowels than long vowels) that was moderated by spelling ability (better spellers were more influenced by vowel length) and language background. Americans had the largest vowel length effect and Tamil Singaporeans had none, possibly because they associated consonant doubling with the lengthening of doubled consonants in Tamil rather than with the preceding vowel. The Mandarin group spelled nonwords least accurately, and greater knowledge of pinyin, a phoneme-based writing system, was associated with higher nonword spelling accuracy. These and other findings reflect how linguistic factors and language background moderate the role of statistical learning and context in spelling.