
    Holaaa!! Writin like u talk is kewl but kinda hard 4 NLP

    We present work in progress aiming to build tools for the normalization of User-Generated Content (UGC). As we will see, the task requires revisiting the initial steps of NLP processing, since UGC (micro-blogs, blogs and, generally, Web 2.0 user texts) presents a number of non-standard communicative and linguistic characteristics, and is in fact much closer to oral and colloquial language than to edited text. We present and characterize a corpus of UGC text in Spanish from three different sources: Twitter, consumer reviews and blogs. We motivate the need for UGC text normalization by analyzing the problems found when processing this type of text through a conventional language processing pipeline, particularly in the tasks of lemmatization and morphosyntactic tagging, and finally we propose a strategy for automatically normalizing UGC using a selector of correct forms on top of a pre-existing spell-checker.
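
    The abstract leaves the selector unspecified; as a minimal sketch of the idea, assuming a generic suggest() interface over some pre-existing spell-checker and toy unigram counts (both hypothetical, not the authors' implementation):

```python
# Minimal sketch (not the authors' implementation) of a "selector of
# correct forms on top of a pre-existing spell-checker". `suggest` and
# the unigram counts are hypothetical stand-ins.
from typing import Callable

UNIGRAM_FREQ = {"que": 9000, "cool": 4500, "guay": 1200}  # toy counts

def normalize_token(token: str,
                    suggest: Callable[[str], list[str]],
                    lexicon: set[str]) -> str:
    """Keep known forms; otherwise pick the spell-checker suggestion
    with the highest corpus frequency."""
    if token.lower() in lexicon:
        return token
    candidates = suggest(token)
    if not candidates:
        return token  # leave unknown forms untouched
    return max(candidates, key=lambda c: UNIGRAM_FREQ.get(c.lower(), 0))
```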

    Ordering the suggestions of a spellchecker without using context

    Having located a misspelling, a spellchecker generally offers some suggestions for the intended word. Even without using context, a spellchecker can draw on various types of information in ordering its suggestions. A series of experiments is described, beginning with a basic corrector that implements a well-known algorithm for reversing single simple errors, and making successive enhancements to take account of substring matches, pronunciation, known error patterns, syllable structure and word frequency. The improvement in the ordering produced by each enhancement is measured on a large corpus of misspellings. The final version is tested on other corpora against a widely used commercial spellchecker and a research prototype.
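
    A rough illustration of context-free suggestion ordering in the spirit of the abstract's basic corrector: generate dictionary words one simple error away (the four classic error types) and rank them by word frequency plus a first-letter cue. The weighting below is illustrative, not the paper's.

```python
# Sketch of context-free suggestion ordering: candidates are dictionary
# words reachable by one insertion, deletion, substitution or
# transposition; ranking uses frequency and a shared-first-letter bonus.
from string import ascii_lowercase

def single_edit_candidates(word: str, lexicon: set[str]) -> set[str]:
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    edits = set()
    for left, right in splits:
        if right:
            edits.add(left + right[1:])                        # deletion
            for ch in ascii_lowercase:
                edits.add(left + ch + right[1:])               # substitution
        if len(right) > 1:
            edits.add(left + right[1] + right[0] + right[2:])  # transposition
        for ch in ascii_lowercase:
            edits.add(left + ch + right)                       # insertion
    return edits & lexicon

def rank(misspelling: str, lexicon: set[str], freq: dict[str, int]) -> list[str]:
    def score(cand: str) -> float:
        s = float(freq.get(cand, 0))
        if cand[0] == misspelling[0]:
            s *= 2.0  # typists usually get the first letter right
        return s
    return sorted(single_edit_candidates(misspelling, lexicon),
                  key=score, reverse=True)
```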

    A spelling corrector for Basque based on morphology

    This paper describes the components used in the elaboration of the commercial Xuxen spelling checker/corrector for Basque. Because Basque is a highly inflected and agglutinative language, the spelling checker/corrector has been conceived as a by-product of a general purpose morphological analyser/generator. The spelling checker/corrector performs morphological decomposition in order to check misspellings and, to correct them, uses a new strategy which combines the use of an additional two-level morphological subsystem for orthographic errors with the recognition of correct morphemes inside the word-form during the generation of proposals for typographical errors. Due to the late standardization of Basque, Xuxen is also intended as a useful tool for the standardization of present-day written Basque.
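
    Xuxen's two-level morphological subsystem is far richer than can be shown here, but as a toy sketch of spell checking by morphological decomposition in an agglutinative language (stems and suffixes below are illustrative only):

```python
# Toy sketch of checking via morphological decomposition, in the spirit
# of (but far simpler than) Xuxen: a word-form is accepted iff it
# segments into a known stem plus a chain of known suffixes.
STEMS = {"etxe", "mendi"}            # e.g. 'house', 'mountain'
SUFFIXES = {"a", "ak", "tik", "ra"}  # illustrative endings

def accepts(form: str) -> bool:
    """Accept `form` iff it is stem + zero or more known suffixes."""
    def rest_ok(rest: str) -> bool:
        if not rest:
            return True
        return any(rest.startswith(s) and rest_ok(rest[len(s):])
                   for s in SUFFIXES)
    return any(form.startswith(stem) and rest_ok(form[len(stem):])
               for stem in STEMS)

assert accepts("etxetik")      # segments as etxe + tik
assert not accepts("etxetix")  # no valid suffix decomposition
```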

    Linguistic errors in the biomedical domain: Towards a typology of errors for Spanish

    The objective of this work is the analysis of errors contained in a corpus of medical reports in natural language and the design of a typology of errors, as there has been no systematic review of verification and correction of errors in clinical documentation in Spanish. In the development of automatic detection and correction systems, it is of great interest to delve into the nature of the linguistic errors that occur in clinical reports, in order to detect and treat them properly. The results show that omission errors are the most frequent ones in the analyzed sample, and that word length clearly influences error frequency. The typification of the error patterns provided is enabling the development of a module based on linguistic knowledge, currently in progress, which will help to improve the performance of error detection and correction systems for the biomedical domain. This work was supported by the Spanish National Research Agency (AEI) through project LaTe4PSP (PID2019-107652RB-I00/AEI/10.13039/501100011033). Furthermore, the main author is supported by the Ministerio de Universidades of Spain through the national program Ayudas para la formación de profesorado universitario (FPU), under reference FPU16/0332.
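
    The paper's typology is finer-grained than this, but as a minimal illustration of classifying the single-character error types it counts (omission, insertion, substitution, transposition) against an intended correction:

```python
# Minimal illustration (not the paper's full typology) of labelling a
# single-edit spelling error relative to the intended word.
def classify(error: str, target: str) -> str:
    if len(error) == len(target) - 1:
        return "omission"       # e.g. 'paciete' for 'paciente'
    if len(error) == len(target) + 1:
        return "insertion"
    if len(error) == len(target):
        diffs = [i for i, (a, b) in enumerate(zip(error, target)) if a != b]
        if len(diffs) == 1:
            return "substitution"
        if (len(diffs) == 2 and diffs[1] == diffs[0] + 1
                and error[diffs[0]] == target[diffs[1]]
                and error[diffs[1]] == target[diffs[0]]):
            return "transposition"
    return "other"

assert classify("paciete", "paciente") == "omission"
assert classify("hopsital", "hospital") == "transposition"
```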

    Discovering Lexical Similarity Using Articulatory Feature-Based Phonetic Edit Distance

    Lexical Similarity (LS) between two languages uncovers many interesting linguistic insights such as phylogenetic relationship, mutual intelligibility, common etymology, and loan words. There are various methods through which LS is evaluated. This paper presents a method of Phonetic Edit Distance (PED) that uses a soft comparison of letters using the articulatory features associated with their International Phonetic Alphabet (IPA) transcription. In particular, the comparison between the articulatory features of two letters taken from words belonging to different languages is used to compute the cost of replacement in the inner loop of edit distance computation. As an example, PED gives an edit distance of 0.82 between the German word 'vater' ([fa:tər]) and the Persian word 'پدر' ([pedær]), both meaning 'father', and, similarly, a PED of 0.93 between the Hebrew word 'שלום' ([ʃəɭam]) and the Arabic word 'سلام' ([səɭa:m]), both meaning 'peace', whereas the classical edit distances would be 4 and 2, respectively. We report the results of systematic experiments conducted on six languages: Arabic, Hindi, Marathi, Persian, Sanskrit, and Urdu. Universal Dependencies (UD) corpora were used to restrict the comparison to lists of words belonging to the same part of speech. The LS based on the average PED between pairs of words was then computed for each pair of languages, unveiling similarities otherwise masked by the adoption of different alphabets, grammars, and pronunciation rules.
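
    A sketch of the PED idea (not the authors' exact feature set or costs): standard dynamic-programming edit distance in which the substitution cost is the fraction of articulatory features on which two IPA segments disagree. With a toy feature table the absolute values will differ from the paper's 0.82 and 0.93:

```python
# Sketch of Phonetic Edit Distance: DP edit distance whose substitution
# cost is the share of articulatory features on which two segments
# disagree. The (voicing, place, manner) table is a toy stand-in.
FEATURES = {
    "f": (0, "labiodental", "fricative"),
    "p": (0, "bilabial",    "plosive"),
    "t": (0, "alveolar",    "plosive"),
    "d": (1, "alveolar",    "plosive"),
    "r": (1, "alveolar",    "trill"),
    "a": (1, "open",        "vowel"),
    "e": (1, "mid",         "vowel"),
    "ə": (1, "mid",         "vowel"),
    "æ": (1, "near-open",   "vowel"),
}

def sub_cost(x: str, y: str) -> float:
    fx, fy = FEATURES[x], FEATURES[y]
    return sum(a != b for a, b in zip(fx, fy)) / len(fx)

def phonetic_edit_distance(s: list[str], t: list[str]) -> float:
    m, n = len(s), len(t)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = float(i)
    for j in range(1, n + 1):
        d[0][j] = float(j)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + 1,                       # deletion
                          d[i][j - 1] + 1,                       # insertion
                          d[i - 1][j - 1] + sub_cost(s[i - 1], t[j - 1]))
    return d[m][n]

# [f,a,t,ə,r] vs [p,e,d,æ,r]: every aligned substitution costs 1/3 here,
# giving about 1.33 — far below the classical edit distance of 4.
print(phonetic_edit_distance(list("fatər"), list("pedær")))
```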

    A spellchecker based on distributed representations

    The task of a spellchecker is to find and correct errors in words. Typically, the program offers the user a short list of likely intended words, ordered from most to least probable. This work investigates the applicability of binary distributed representations, and of methods for processing them, to the representation, search, and processing of misspelled words. Results of experiments on two sets of words with typical spelling errors are reported, along with a comparative analysis against other methods.
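
    The abstract does not detail the encoding; one hedged sketch of retrieval with binary distributed representations, assuming hashed character-bigram bitmasks and bit-overlap scoring (the vector size and hash are illustrative choices, not the paper's):

```python
# Hedged sketch: each word becomes a binary vector (bitmask) of hashed
# character bigrams; suggestions are dictionary words with the largest
# bit overlap with the misspelled query.
DIM = 256  # toy vector dimensionality

def encode(word: str) -> int:
    padded = f"#{word}#"  # boundary markers
    vec = 0
    for i in range(len(padded) - 1):
        vec |= 1 << (hash(padded[i:i + 2]) % DIM)
    return vec

def suggest(query: str, lexicon: list[str], k: int = 3) -> list[str]:
    q = encode(query)
    def overlap(w: str) -> int:
        return bin(q & encode(w)).count("1")
    return sorted(lexicon, key=overlap, reverse=True)[:k]

# Up to rare hash collisions: ['example', 'exam', 'sample']
print(suggest("exampel", ["example", "exam", "sample", "apple"]))
```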

    Fine-Tuning a MAP Error Correction Algorithm for Five-Key Chording Keyboards

    Different typing devices lead to different typing error patterns. In addition, different persons using the same device have different error patterns. Considering this, we propose and evaluate a spelling algorithm specifically designed for a five-key chording keyboard. It uses the maximum a posteriori probability rule, the probabilities that one character is typed for another (named confusion probabilities), and a dictionary model. Our study shows that the proposed algorithm reduces the substitution error rate from 7.60% to 1.25%. In comparison, MsWord and iSpell reduce the substitution error rates to 3.12% and 3.94%, respectively. The error rate can be further reduced to 1.15% by using individual confusion matrices for each user.
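
    A compact sketch of the MAP rule described: choose the dictionary word w maximizing P(w) · ∏ P(typed_i | w_i), with per-character confusion probabilities. All probabilities below are toy values, not the measured confusion matrices:

```python
# Sketch of MAP correction with a confusion matrix and dictionary model.
# All probabilities are toy values for illustration.
import math

CONFUSION = {  # P(typed | intended)
    ("a", "a"): 0.92, ("a", "e"): 0.08,
    ("e", "e"): 0.90, ("e", "a"): 0.10,
    ("t", "t"): 0.95, ("t", "d"): 0.05,
    ("d", "d"): 0.95, ("d", "t"): 0.05,
}

DICTIONARY = {"date": 0.6, "data": 0.4}  # P(w), toy dictionary model

def map_correct(typed: str) -> str:
    def log_posterior(word: str) -> float:
        if len(word) != len(typed):  # substitution-only error model
            return float("-inf")
        lp = math.log(DICTIONARY[word])
        for t, w in zip(typed, word):
            lp += math.log(CONFUSION.get((t, w), 1e-6))
        return lp
    return max(DICTIONARY, key=log_posterior)

print(map_correct("deta"))  # -> 'data' under these toy probabilities
```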

    Speling Successful Sucesfuly: Statistical Learning in Spelling

    Many spelling errors in English are doubling errors, as when people are stumped by the double ‹l› in ‹trellis›. In Study 1, we tabulated statistical patterns with regard to doubling in English. In Study 2, we collected behavioral data to see whether people were sensitive to these statistical doubling patterns and to explore other factors that might influence doubling, such as context, individual differences (language background and spelling ability), and task. We gave two nonword spelling tasks to US college students (N=68) and bilingual Singaporean college students from an English-based education system but with diverse language backgrounds: Mandarin (N=54), Malay (N=44), or Tamil (N=42). In the choice task, participants heard a nonword and chose between two spelling options, e.g. dremmib/dremib. In the free task, they wrote down its best spelling. We found a vowel length effect (more doubling after short vowels than long vowels) that was moderated by spelling ability (better spellers were more influenced by vowel length) and language background. Americans had the largest vowel length effect and Tamil Singaporeans had none, possibly because they associated consonant doubling with the lengthening of doubled consonants in Tamil rather than with the preceding vowel. The Mandarin group spelled nonwords least accurately, and greater knowledge of pinyin, a phoneme-based writing system, was associated with higher nonword spelling accuracy. These and other findings reflect how linguistic factors and language background moderate the role of statistical learning and context in spelling.