2,438 research outputs found

    Fast, Small and Exact: Infinite-order Language Modelling with Compressed Suffix Trees

    Get PDF
    Efficient methods for storing and querying are critical for scaling high-order n-gram language models to large corpora. We propose a language model based on compressed suffix trees, a representation that is highly compact and can be easily held in memory, while supporting queries needed in computing language model probabilities on-the-fly. We present several optimisations which improve query runtimes up to 2500x, despite only incurring a modest increase in construction time and memory usage. For large corpora and high Markov orders, our method is highly competitive with the state-of-the-art KenLM package. It imposes much lower memory requirements, often by orders of magnitude, and has runtimes that are either similar (for training) or comparable (for querying).Comment: 14 pages in Transactions of the Association for Computational Linguistics (TACL) 201

    Text stylometry for chat bot identification and intelligence estimation.

    Get PDF
    Authorship identification is a technique used to identify the author of an unclaimed document, by attempting to find traits that will match those of the original author. Authorship identification has a great potential for applications in forensics. It can also be used in identifying chat bots, a form of intelligent software created to mimic the human conversations, by their unique style. The online criminal community is utilizing chat bots as a new way to steal private information and commit fraud and identity theft. The need for identifying chat bots by their style is becoming essential to overcome the danger of online criminal activities. Researchers realized the need to advance the understanding of chat bots and design programs to prevent criminal activities, whether it was an identity theft or even a terrorist threat. The more research work to advance chat bots’ ability to perceive humans, the more duties needed to be followed to confront those threats by the research community. This research went further by trying to study whether chat bots have behavioral drift. Studying text for Stylometry has been the goal for many researchers who have experimented many features and combinations of features in their experiments. A novel feature has been proposed that represented Term Frequency Inverse Document Frequency (TFIDF) and implemented that on a Byte level N-Gram. Term Frequency-Inverse Token Frequency (TF-ITF) used these terms and created the feature. The initial experiments utilizing collected data demonstrated the feasibility of this approach. Additional versions of the feature were created and tested for authorship identification. Results demonstrated that the feature was successfully used to identify authors of text, and additional experiments showed that the feature is language independent. The feature successfully identified authors of a German text. Furthermore, the feature was used in text similarities on a book level and a paragraph level. Finally, a selective combination of features was used to classify text that ranges from kindergarten level to scientific researches and novels. The feature combination measured the Quality of Writing (QoW) and the complexity of text, which were the first step to correlate that with the author’s IQ as a future goal

    Error Correction based on Error Signatures applied to automatic speech recognition

    Get PDF

    A large vocabulary online handwriting recognition system for Turkish

    Get PDF
    Handwriting recognition in general and online handwriting recognition in particular has been an active research area for several decades. Most of the research have been focused on English and recently on other scripts like Arabic and Chinese. There is a lack of research on recognition in Turkish text and this work primarily fills that gap with a state-of-the-art recognizer for the first time. It contains design and implementation details of a complete recognition system for recognition of Turkish isolated words. Based on the Hidden Markov Models, the system comprises pre-processing, feature extraction, optical modeling and language modeling modules. It considers the recognition of unconstrained handwriting with a limited vocabulary size first and then evolves to a large vocabulary system. Turkish script has many similarities with other Latin scripts, like English, which makes it possible to adapt strategies that work for them. However, there are some other issues which are particular to Turkish that should be taken into consideration separately. Two of the challenging issues in recognition of Turkish text are determined as delayed strokes which introduce an extra source of variation in the sequence order of the handwritten input and high Out-of-Vocabulary (OOV) rate of Turkish when words are used as vocabulary units in the decoding process. This work examines the problems and alternative solutions at depth and proposes suitable solutions for Turkish script particularly. In delayed stroke handling, first a clear definition of the delayed strokes is developed and then using that definition some alternative handling methods are evaluated extensively on the UNIPEN and Turkish datasets. The best results are obtained by removing all delayed strokes, with up to 2.13% and 2.03% points recognition accuracy increases, over the respective baselines of English and Turkish. The overall system performances are assessed as 86.1% with a 1,000-word lexicon and 83.0% with a 3,500-word lexicon on the UNIPEN dataset and 91.7% on the Turkish dataset. Alternative decoding vocabularies are designed with grammatical sub-lexical units in order to solve the problem of high OOV rate. Additionally, statistical bi-gram and tri-gram language models are applied during the decoding process. The best performance, 67.9% is obtained by the large stem-ending vocabulary that is expanded with a bi-gram model on the Turkish dataset. This result is superior to the accuracy of the word-based vocabulary (63.8%) with the same coverage of 95% on the BOUN Web Corpus

    Ensemble learning using multi-objective optimisation for arabic handwritten words

    Get PDF
    Arabic handwriting recognition is a dynamic and stimulating field of study within pattern recognition. This system plays quite a significant part in today's global environment. It is a widespread and computationally costly function due to cursive writing, a massive number of words, and writing style. Based on the literature, the existing features lack data supportive techniques and building geometric features. Most ensemble learning approaches are based on the assumption of linear combination, which is not valid due to differences in data types. Also, the existing approaches of classifier generation do not support decision-making for selecting the most suitable classifier, and it requires enabling multi-objective optimisation to handle these differences in data types. In this thesis, new type of feature for handwriting using Segments Interpolation (SI) to find the best fitting line in each of the windows with a model for finding the best operating point window size for SI features. Multi-Objective Ensemble Oriented (MOEO) formulated to control the classifier topology and provide feedback support for changing the classifiers' topology and weights based on the extension of Non-dominated Sorting Genetic Algorithm (NSGA-II). It is designated as the Random Subset based Parents Selection (RSPS-NSGA-II) to handle neurons and accuracy. Evaluation metrics from two perspectives classification and Multiobjective optimization. The experimental design based on two subsets of the IFN/ENIT database. The first one consists of 10 classes (C10) and 22 classes (C22). The features were tested with Support Vector Machine (SVM) and Extreme Learning Machine (ELM). This work improved due to the SI feature. SI shows a significant result with SVM with 88.53% for C22. RSPS for C10 at k=2 achieved 91% accuracy with fewer neurons than NSGA-II, and for C22 at k=10, accuracy has been increased 81% compared to NSGA-II 78%. Future work may consider introducing more features to the system, applying them to other languages, and integrating it with sequence learning for more accuracy

    Word Superiority Effects in Dyslexics

    Get PDF
    Distorting the word superiority effect with intraword spacing was used to investigate the processing difference in single-word reading for dyslexics and controls. Perfetti’s Reading model suggests that dyslexics would have reduced processing capacity with intraword spacing. Results from a Covid-modified experimental protocol generally did not support the hypothesis. There was poor differentiation between groups in the word capacity coefficient. Response time by itself was also not informative. However, dyslexics had reduced accuracy in distractor identification across intraword spacings due to the lack of retention in phonological working memory or attention in central executive deficit (Alt, Fox, Levy, et al., 2022; Gray, Green, Alt, et al., 2017) as matching targets was not an issue, only confirmation of an update was problematic. In target identification, early responses and later responses were predictive of WIAT III Pseudoword (phonetic processing) and WAIS-IV Symbol Search (visuospatial matching task). These preliminary results motivate further research regarding word processing differences in dyslexic and controls
    • …
    corecore