872 research outputs found

Context-sensitive Spelling Correction Using Google Web 1T 5-Gram Information

Author: Alwani Mohammad
Bassil Youssef
Publication venue: 'Canadian Center of Science and Education'
Publication date: 01/01/2012

In computing, spell checking is the process of detecting incorrectly spelled words in a text and sometimes suggesting corrections for them. A spell checker is a computer program that uses a dictionary of words to perform spell checking. The bigger the dictionary, the higher the error detection rate. Because spell checkers are based on regular dictionaries, they suffer from a data sparseness problem: they cannot capture a large vocabulary that includes proper names, domain-specific terms, technical jargon, special acronyms, and terminology. As a result, they exhibit a low error detection rate and often fail to catch major errors in the text. This paper proposes a new context-sensitive spelling correction method for detecting and correcting non-word and real-word errors in digital text documents. The approach relies on statistics from the Google Web 1T 5-gram data set, which consists of a large volume of n-gram word sequences extracted from the World Wide Web. The proposed method comprises an error detector that detects misspellings, a candidate spellings generator based on a character 2-gram model that generates correction suggestions, and an error corrector that performs contextual error correction. Experiments conducted on a set of text documents from different domains and containing misspellings showed an outstanding spelling error correction rate and a drastic reduction of both non-word and real-word errors. In a further study, the proposed algorithm is to be parallelized so as to lower the computational cost of the error detection and correction processes.

Comment: LACSC - Lebanese Association for Computational Sciences - http://www.lacsc.or
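The three-stage pipeline described in this abstract (detect, generate candidates with a character 2-gram model, correct using context) can be illustrated with a minimal sketch. It is not the authors' implementation: the Dice similarity measure, the function names, the tiny vocabulary, and the trigram counts below are all assumptions standing in for the real dictionary and the Google Web 1T 5-gram counts.

```python
from collections import Counter

def char_bigrams(word):
    """Multiset of character 2-grams, with ^ and $ marking word boundaries."""
    padded = f"^{word}$"
    return Counter(padded[i:i + 2] for i in range(len(padded) - 1))

def dice(a, b):
    """Dice similarity between the character-bigram multisets of two words."""
    ba, bb = char_bigrams(a), char_bigrams(b)
    overlap = sum((ba & bb).values())
    return 2 * overlap / (sum(ba.values()) + sum(bb.values()))

def candidates(word, vocabulary, k=3):
    """Return the k vocabulary words most similar to `word` by bigram overlap."""
    return sorted(vocabulary, key=lambda v: dice(word, v), reverse=True)[:k]

def correct(tokens, index, vocabulary, ngram_counts):
    """Choose the candidate whose (left, candidate, right) trigram is most frequent."""
    left = tokens[index - 1] if index > 0 else "<s>"
    right = tokens[index + 1] if index + 1 < len(tokens) else "</s>"
    return max(
        candidates(tokens[index], vocabulary),
        key=lambda c: ngram_counts.get((left, c, right), 0),
    )

# Hypothetical toy data; a real system would query the Web 1T counts instead.
vocabulary = ["piece", "peace", "pierce", "place"]
ngram_counts = {("a", "piece", "of"): 900, ("a", "peace", "of"): 5}
tokens = "a peice of cake".split()

# Error detection: a token missing from the vocabulary is flagged as a non-word.
suspect = tokens.index("peice")
print(correct(tokens, suspect, vocabulary, ngram_counts))
```

In this toy example the bigram model alone ranks "peace" above "piece", and it is the context counts that select "piece", which is the role the abstract assigns to the contextual error corrector.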

arXiv.org e-Print Archive

CiteSeerX

Crossref

Improving OCR Post Processing with Machine Learning Tools

Author: Fonseca Cacho Jorge Ramon
Publication venue: Digital Scholarship@UNLV
Publication date: 01/08/2019

Optical Character Recognition (OCR) post processing involves data cleaning steps for documents that were digitized, such as a book or a newspaper article. One step in this process is the identification and correction of spelling and grammar errors caused by flaws in the OCR system. This work is a report on our efforts to enhance the post processing for large repositories of documents. The main contributions of this work are:
• Development of tools and methodologies to build both OCR and ground truth text correspondence for training and testing of proposed techniques in our experiments. In particular, we will explain the alignment problem and tackle it with our de novo algorithm that has shown a high success rate.
• Exploration of the Google Web 1T corpus to correct errors using context. We show that over half of the errors in the OCR text can be detected and corrected.
• Applications of machine learning tools to generalize the past ad hoc approaches to OCR error corrections. As an example, we investigate the use of logistic regression to select the correct replacement for misspellings in the OCR text.
• Use of container technology to address the state of reproducible research in OCR and Computer Science as a whole. Many of the past experiments in the field of OCR are not considered reproducible research, which raises the question of whether the original results were outliers or finessed

University of Nevada, Las Vegas Repository

evoText: A new tool for analyzing the biological sciences

Author: Ashton
Brants
Burns
Burrows
Charles H. Pence
Craig
Einstein Papers Project
Grant Ramsey
He
Ide
Kumar
Manning
Manning
Michel
Pechenick
Pence
Rockwell
Smocovitis
Tsukamoto
van Whye
York
Publication date: 01/01/2016

We introduce here evoText, a new tool for automated analysis of the literature in the biological sciences. evoText contains a database of hundreds of thousands of journal articles and an array of analysis tools for generating quantitative data on the nature and history of life science, especially ecology and evolutionary biology. This article describes the features of evoText, presents a variety of examples of the kinds of analyses that evoText can run, and offers a brief tutorial describing how to use it

PhilPapers

Lirias

Elsevier - Publisher Connector

Crossref

Louisiana State University

DIAL UCLouvain

A Real-Time N-Gram Approach to Choosing Synonyms Based on Context

Author: Moore Brian J
Publication venue: Scholarship@Western
Publication date: 07/01/2015

Synonymy is an important part of all natural language, but not all synonyms are created equal. That two words are synonymous usually does not mean they can always be interchanged. The problem that we attempt to address is that of near-synonymy and choosing the right word based purely on its surrounding words. This new computational method, unlike previous methods used on this problem, is capable of making multiple word suggestions, which more accurately models human choice. It contains a large number of words, does not require training, and can be run in real time. On previous testing data, when able to make multiple suggestions, it improved by over 17 percentage points on the previous best method and 4.5 percentage points on average, with a maximum of 14 percentage points, on the human annotators' near-synonym choice. In addition, this thesis also presents new synonym sets and human-annotated test data that more accurately fit this problem
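The idea of choosing among near-synonyms purely from the surrounding words, and returning several suggestions rather than one, can be sketched as follows. The abstract does not specify the scoring function, so the sum of left and right bigram counts used here is an assumption, as are the function name and the toy counts (a real system would look up n-gram frequencies from a large corpus).

```python
def rank_synonyms(left, right, synonyms, counts, top_k=2):
    """Rank candidate synonyms by how often they co-occur with their neighbors.

    `left` and `right` are the token lists around the gap; `counts` maps
    bigram tuples to frequencies. No training step is involved.
    """
    def score(word):
        total = 0
        if left:
            total += counts.get((left[-1], word), 0)
        if right:
            total += counts.get((word, right[0]), 0)
        return total

    return sorted(synonyms, key=score, reverse=True)[:top_k]

# Hypothetical bigram counts for the gap in "a ___ decision".
counts = {
    ("a", "difficult"): 50, ("difficult", "decision"): 80,
    ("a", "hard"): 90, ("hard", "decision"): 30,
    ("a", "tough"): 40, ("tough", "decision"): 60,
}
suggestions = rank_synonyms(["a"], ["decision"], ["difficult", "hard", "tough"], counts)
print(suggestions)
```

Returning the top `top_k` words instead of a single winner mirrors the multiple-suggestion behavior the abstract highlights; because scoring is a handful of dictionary lookups, it is cheap enough to run interactively.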

Scholarship@Western

BrainGene: computational creativity algorithm that invents novel interesting names

Author: Duch Włodzisław
Pilichowski Maciej
Publication date: 01/01/2013

Human-level intelligence implies creativity, not only on the grand scale but primarily in everyday activity, such as understanding intentions and behavior and inventing new words. Psychological models of creativity have some support in experimental cognitive psychology, but computational models of creative processes are quite rare. This paper presents a model of the creative processes behind the invention of novel words related to the description of products and services

Repository of Nicolaus Copernicus University

Collecting and Exploring Everyday Language for Predicting Psycholinguistic Properties of Words

Author: Paetzold Gustavo Henrique
Specia Lucia

Conference paper: Collecting and Exploring Everyday Language for Predicting Psycholinguistic Properties of Words

ZENODO

Using the Web 1T 5-Gram Database for Attribute Selection in Formal Concept Analysis to Correct Overstemmed Clusters

Author: Hall Guymon
Publication venue: Digital Scholarship@UNLV
Publication date: 01/05/2014

Information retrieval is the process of finding information in an unstructured collection of data. The process of information retrieval involves building an index, commonly called an inverted file. As part of the inverted file, information retrieval algorithms often stem words to a common root. Stemming involves reducing a document term to its root. There are many ways to stem a word: affix removal and successor variety are two common categories of stemmers. The Porter Stemming Algorithm is a suffix removal stemmer that operates as a rule-based process on English words. We can think of stemming as a way to cluster related words together according to one common stem. However, sometimes Porter includes words in a cluster that are unrelated. This experiment attempts to correct these stemming errors through the use of Formal Concept Analysis (FCA). FCA is the process of formulating formal concepts from a given formal context. A formal context consists of a set of objects, G, a set of attributes, M, and a binary relation I that indicates the attributes possessed by each object. A formal concept is formed by computing the closure of a subset of objects and attributes. Attribute selection is of critical importance in FCA; using the Cranfield document collection, this experiment attempted to view attributes as a function of word-relatedness and crafted a comparison measure between each word in the stemmed cluster using the Google Web 1T 5-gram data set. Using FCA to correct the clusters, the results showed a varying level of success for precision and recall values, depending on the error threshold allowed
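The definition of a formal context (G, M, I) and the closure that yields a formal concept can be made concrete with a short sketch. The context below is hypothetical: the objects are words that might share a stemmed cluster, and the attributes are invented co-occurring words rather than values derived from the Web 1T data or the Cranfield collection.

```python
def common_attributes(objects, context, all_attributes):
    """Attributes shared by every object in `objects` (the derivation A')."""
    shared = set(all_attributes)
    for obj in objects:
        shared &= context[obj]
    return shared

def common_objects(attributes, context):
    """Objects that possess every attribute in `attributes` (the derivation B')."""
    return {obj for obj, attrs in context.items() if attributes <= attrs}

def concept_from_objects(objects, context, all_attributes):
    """Close a set of objects into a formal concept (extent, intent) = (A'', A')."""
    intent = common_attributes(objects, context, all_attributes)
    extent = common_objects(intent, context)
    return extent, intent

# Hypothetical formal context: object -> set of attributes it possesses (relation I).
context = {
    "generate": {"power", "energy"},
    "generation": {"power", "energy", "young"},
    "general": {"army", "public"},
}
all_attributes = set().union(*context.values())

extent, intent = concept_from_objects({"generate"}, context, all_attributes)
print(sorted(extent), sorted(intent))
```

Here the closure of {"generate"} groups it with "generation" and leaves "general" outside the concept, which is the kind of separation one would want when splitting an over-stemmed cluster.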

University of Nevada, Las Vegas Repository

Data Analysis With Map Reduce Programming Paradigm

Author: Bozorgi Mandana
Publication venue: Digital Scholarship@UNLV
Publication date: 01/08/2015

In this thesis, we present a summary of our activities associated with the storage and query processing of the Google 1T 5-gram data set. We first give a brief introduction to some of the implementation techniques for the relational algebra, followed by a Map Reduce implementation of the same operators. We then implement a database schema in Hive for the Google 1T 5-gram data set. The thesis further looks into query processing with Hive and Pig in the Hadoop setting. More specifically, we report statistics for our queries in this environment
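How relational operators map onto the Map Reduce paradigm can be shown with a small in-memory simulation. This is a sketch in plain Python, not Hadoop, Hive, or Pig code and not the thesis's implementation; the record layout, the `MIN_COUNT` threshold, and all function names are assumptions. The mapper performs a selection and a projection, and the reducer performs a grouped aggregation.

```python
from collections import defaultdict

MIN_COUNT = 100  # hypothetical selection threshold

def map_phase(records, mapper):
    """Apply the mapper to every input record and stream out (key, value) pairs."""
    for record in records:
        yield from mapper(record)

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reducer):
    """Apply the reducer once per key."""
    return {key: reducer(key, values) for key, values in groups.items()}

def first_word_mapper(record):
    """Selection (count >= MIN_COUNT) plus projection onto (first word, count)."""
    ngram, count = record
    if count >= MIN_COUNT:
        yield ngram.split()[0], count

def sum_reducer(key, values):
    """Aggregation: total count per first word (a GROUP BY ... SUM)."""
    return sum(values)

# Hypothetical (n-gram, count) records standing in for rows of the 5-gram data.
records = [
    ("serve as the initial", 500),
    ("serve as the index", 90),
    ("serve as a guide", 200),
    ("spell checking is a", 300),
]
totals = reduce_phase(shuffle(map_phase(records, first_word_mapper)), sum_reducer)
print(totals)
```

The same query in a SQL-like layer would read roughly as a filtered `GROUP BY` with `SUM`; the sketch only makes the map, shuffle, and reduce steps behind such a query visible.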

University of Nevada, Las Vegas Repository