32 research outputs found

    Lexical Normalisation of Twitter Data

    Full text link
    Twitter, with over 500 million users globally, generates over 100,000 tweets per minute. The 140-character limit per tweet, perhaps unintentionally, encourages users to use shorthand notations and to strip spellings to their bare minimum "syllables" or elisions, e.g. "srsly". The analysis of Twitter messages, which typically contain misspellings, elisions, and grammatical errors, poses a challenge to established Natural Language Processing (NLP) tools, which are generally designed on the assumption that the data conforms to the basic grammatical structure of the English language. To make sense of Twitter messages it is first necessary to transform them into a canonical form, consistent with the dictionary or grammar. This process, performed at the level of individual tokens ("words"), is called lexical normalisation. This paper investigates various techniques for lexical normalisation of Twitter data and presents the findings as the techniques are applied to process raw data from Twitter.
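    Token-level lexical normalisation as described above can be illustrated with a minimal sketch (an assumption-laden toy, not the techniques evaluated in the paper): out-of-vocabulary tokens are first looked up in a small hand-built shorthand table, then matched fuzzily against a dictionary; the `SLANG` and `DICTIONARY` contents here are invented for illustration.

    ```python
    # Toy lexical normaliser: map each token to a canonical form.
    # SLANG and DICTIONARY are hypothetical, illustration-only data.
    from difflib import get_close_matches

    DICTIONARY = {"seriously", "tomorrow", "you", "are", "going", "there"}
    SLANG = {"srsly": "seriously", "tmrw": "tomorrow", "u": "you", "r": "are"}

    def normalise(tweet: str) -> str:
        out = []
        for tok in tweet.lower().split():
            if tok in DICTIONARY:
                out.append(tok)                 # already canonical
            elif tok in SLANG:
                out.append(SLANG[tok])          # known shorthand/elision
            else:
                # fall back to the closest dictionary word, if any
                match = get_close_matches(tok, DICTIONARY, n=1, cutoff=0.7)
                out.append(match[0] if match else tok)
        return " ".join(out)

    print(normalise("r u going there tmrw srsly"))
    # -> "are you going there tomorrow seriously"
    ```

    Real systems replace the lookup table and edit-distance fallback with statistical or contextual models, but the token-by-token pipeline shape is the same.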

    Contextual Multilingual Spellchecker for User Queries

    Full text link
    Spellchecking is one of the most fundamental and widely used search features. Correcting incorrectly spelled user queries not only enhances the user experience but is expected by the user. However, most widely available spellchecking solutions are either less accurate than state-of-the-art solutions or too slow for search use cases, where latency is a key requirement. Furthermore, most innovative recent architectures focus on English, are not trained in a multilingual fashion, and are trained for spell correction in longer text, which is a different paradigm from spell correction for user queries, where context is sparse (most queries are 1-2 words long). Finally, since most enterprises have unique vocabularies such as product names, off-the-shelf spelling solutions fall short of users' needs. In this work, we build a multilingual spellchecker that is extremely fast and scalable and that adapts its vocabulary, and hence its output, to a specific product's needs. Furthermore, our speller outperforms general-purpose spellers by a wide margin on in-domain datasets. Our multilingual speller is used in search in Adobe products, powering autocomplete in various applications.

    Spellcheckers

    Get PDF
    Techniques of computer spellchecking from the 1950s to the 2000s.

    Detection of semantic errors in Arabic texts

    Get PDF
    Detecting semantic errors in a text is still a challenging area of investigation. A lot of research has been done on lexical and syntactic errors, while fewer studies have tackled semantic errors, as they are more difficult to treat. Compared to other languages, Arabic appears to pose a special challenge for this problem: because words are graphically very similar to each other, the risk of semantic errors in Arabic texts is greater. Moreover, there are special cases and unique complexities for this language. This paper deals with the detection of semantic errors in Arabic texts, but the approach we have adopted can also be applied to texts in other languages. It combines four contextual methods (using statistics and linguistic information) in order to decide on the semantic validity of a word in a sentence. We chose to implement our approach on a distributed architecture, namely a Multi-Agent System (MAS). The implemented system achieved a precision rate of about 90% and a recall rate of about 83%.

    Improving the Accuracy of Mobile Touchscreen QWERTY Keyboards

    Get PDF
    In this thesis, we explore alternative keyboard layouts in hopes of finding one that increases the accuracy of text input on mobile touchscreen devices. In particular, we investigate whether a single swap of 2 keys can significantly improve accuracy on mobile touchscreen QWERTY keyboards. We do so by carefully considering the placement of keys, exploiting a specific vulnerability that occurs within a keyboard layout, namely, that the placement of particular keys next to others may be increasing errors when typing. We simulate the act of typing on a mobile touchscreen QWERTY keyboard, beginning with modeling the typographical errors that can occur when doing so. We then construct a simple autocorrector using Bayesian methods, describing how we can autocorrect user input and evaluate the ability of the keyboard to output the correct text. Then, using our models, we provide methods of testing and define a metric, the WAR rating, which provides us a way of comparing the accuracy of a keyboard layout. After running our tests on all 325 2-key swap layouts against the original QWERTY layout, we show that there exists more than one 2-key swap that increases the accuracy of the current QWERTY layout, and that the best 2-key swap is i ↔ t, increasing accuracy by nearly 0.18 percent.
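    The Bayesian autocorrection idea in this abstract can be sketched as a toy noisy-channel model (a simplification assumed for illustration, not the thesis's actual model): choose the dictionary word w maximising P(w) * P(typed | w), with an error model that favours substitutions between physically adjacent QWERTY keys. The word probabilities, adjacency pairs, and per-character error rates below are all invented placeholders.

    ```python
    # Toy Bayesian (noisy-channel) autocorrector.
    # WORDS, QWERTY_ADJACENT, and the 0.95/0.04/0.001 rates are
    # hypothetical values chosen for illustration only.
    WORDS = {"the": 0.05, "that": 0.03, "this": 0.02, "tie": 0.001}

    QWERTY_ADJACENT = {
        ("t", "r"), ("t", "y"), ("h", "g"), ("h", "j"),
        ("e", "w"), ("e", "r"), ("i", "u"), ("i", "o"),
    }

    def p_typed_given_word(typed: str, word: str) -> float:
        """Error model: per-character substitution probabilities."""
        if len(typed) != len(word):
            return 0.0  # toy model handles only same-length substitutions
        p = 1.0
        for t, w in zip(typed, word):
            if t == w:
                p *= 0.95        # key hit correctly
            elif (t, w) in QWERTY_ADJACENT or (w, t) in QWERTY_ADJACENT:
                p *= 0.04        # finger slipped onto an adjacent key
            else:
                p *= 0.001       # unlikely substitution
        return p

    def autocorrect(typed: str) -> str:
        # Bayes' rule: argmax_w P(w) * P(typed | w); P(typed) is constant.
        return max(WORDS, key=lambda w: WORDS[w] * p_typed_given_word(typed, w))

    print(autocorrect("thr"))  # -> "the" ('r' is adjacent to 'e')
    ```

    Evaluating such a corrector over simulated typing on each of the 325 candidate layouts is what makes an exhaustive 2-key-swap search like the one above tractable.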

    Interactive and context-aware tag spell check and correction

    Full text link