10,732 research outputs found

Adaptive text mining: Inferring structure from sequences

Author: Witten Ian H.
Publication venue: Elsevier BV
Publication date: 01/01/2004

Text mining is about inferring structure from sequences representing natural language text, and may be defined as the process of analyzing text to extract information that is useful for particular purposes. Although hand-crafted heuristics are a common practical approach for extracting information from text, a general, and generalizable, approach requires adaptive techniques. This paper studies the way in which the adaptive techniques used in text compression can be applied to text mining. It develops several examples: extraction of hierarchical phrase structures from text, identification of keyphrases in documents, locating proper names and quantities of interest in a piece of text, text categorization, word segmentation, acronym extraction, and structure recognition. We conclude that compression forms a sound unifying principle that allows many text mining problems to be tackled adaptively.

Research Commons@Waikato

A note on brain actuated spelling with the Berlin brain-computer interface

Author: A. Kübler
B. Blankertz
C. Neuper
D.J. Krusienski
D.J. Ward
J. Cleary
J. Williamson
J.R. Wolpaw
K.-R. Müller
L.A. Farwell
L.R. Hochberg
M.D. Dunlop
N. Birbaumer
R. Scherer
T. Bell
T. Elbert
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2007

Brain-Computer Interfaces (BCIs) are systems capable of decoding neural activity in real time, thereby allowing a computer application to be directly controlled by the brain. Since the characteristics of such direct brain-to-computer interaction are limited in several aspects, one major challenge in BCI research is intelligent front-end design. Here we present the mental text entry application ‘Hex-o-Spell’, which incorporates principles of Human-Computer Interaction research into BCI feedback design. The system utilises the high visual display bandwidth to help compensate for the extremely limited control bandwidth, which operates with only two mental states, where the timing of the state changes encodes most of the information. The display is visually appealing, and control is robust. The effectiveness and robustness of the interface were demonstrated at CeBIT 2006 (the world’s largest IT fair), where two subjects operated the mental text entry system at a speed of up to 7.6 char/min.

CiteSeerX

Crossref

MURAL - Maynooth University Research Archive Library

Fraunhofer-ePrints

NUI Maynooth Eprint Archive

Maynooth University ePrints and eTheses Archive

Enlighten

Compression-based Parts-of-Speech Tagger for the Arabic Language

Author: Alkhazi Ibrahim
Publication date: 18/12/2019

Bangor University Research Portal

Categorisation of Arabic Twitter Text

Author: Altamimi Mohammed Hamed R
Publication date: 26/02/2020

Bangor University Research Portal

Kolmogorov Complexity in perspective. Part II: Classification, Information Processing and Duality

Author: Ferbus-Zanda Marie
Publication date: 01/01/2010

We survey diverse approaches to the notion of information: from Shannon entropy to Kolmogorov complexity. Two of the main applications of Kolmogorov complexity are presented: randomness and classification. The survey is divided into two parts published in the same volume. Part II is dedicated to the relation between logic and information systems, within the scope of Kolmogorov algorithmic information theory. We present a recent application of Kolmogorov complexity: classification using compression, an idea with provocative implementation by authors such as Bennett, Vitanyi and Cilibrasi. This stresses how Kolmogorov complexity, besides being a foundation for randomness, is also related to classification. Another approach to classification is also considered: the so-called "Google classification". It uses another original and attractive idea which is connected to classification using compression and to Kolmogorov complexity from a conceptual point of view. We present and unify these different approaches to classification in terms of Bottom-Up versus Top-Down operational modes, of which we point out the fundamental principles and the underlying duality. We look at the way these two dual modes are used in different approaches to information systems, particularly the relational model for databases introduced by Codd in the 1970s. This allows us to point out diverse forms of a fundamental duality. These operational modes are also reinterpreted in the context of the comprehension schema of axiomatic set theory ZF. This leads us to develop how Kolmogorov complexity is linked to intensionality, abstraction, classification and information systems. Comment: 43 pages
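
As a rough illustration of classification using compression, the sketch below computes a normalized compression distance between two byte strings. It is a minimal sketch under stated assumptions, not the survey's own method: zlib stands in for the ideal (uncomputable) Kolmogorov complexity, and the names `compressed_size` and `ncd` are hypothetical helpers introduced here.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length of the zlib-compressed data, used as a crude complexity estimate."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance: (C(xy) - min(C(x), C(y))) / max(C(x), C(y)).

    Assumption: zlib approximates the compressor C; any real compressor
    only gives an upper bound on Kolmogorov complexity.
    """
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


if __name__ == "__main__":
    english = b"the quick brown fox jumps over the lazy dog " * 20
    other = bytes(range(256)) * 4
    # A string is expected to be closer to itself than to unrelated data.
    print(ncd(english, english), ncd(english, other))
```

Smaller values suggest that the two inputs share structure the compressor can exploit; objects can then be grouped by pairwise distance. The choice of compressor and its window size strongly affect the result.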

arXiv.org e-Print Archive

Hal-Diderot

Statistical Function Tagging and Grammatical Relations of Myanmar Sentences

Author: Htwe Tin Myat
Thant Win Win
Thein Ni Lar
Publication date: 25/09/2011

This paper describes context-free grammar (CFG) based grammatical relations for Myanmar sentences, combined with a corpus-based function tagging system. Part of the challenge of statistical function tagging for Myanmar sentences comes from the fact that Myanmar has free phrase order and a complex morphological system. Function tagging is a pre-processing step to show grammatical relations of Myanmar sentences. In the task of function tagging, which tags the function of Myanmar sentences with correct segmentation, POS (part-of-speech) tagging and chunking information, we use Naive Bayesian theory to disambiguate the possible function tags of a word. We apply a context-free grammar (CFG) to find the grammatical relations of the function tags. We also create a functionally annotated tagged corpus for Myanmar and propose grammar rules for Myanmar sentences. Experiments show that our analysis achieves good results with simple sentences and complex sentences. Comment: 16 pages, 7 figures, 8 tables, AIAA-2011 (India). arXiv admin note: text overlap with arXiv:0912.1820 by other authors
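
The Naive Bayes disambiguation step mentioned in this abstract can be sketched roughly as follows. This is an illustrative toy under stated assumptions, not the authors' system: the feature strings (`pos=...`, `marker=...`), the tag names `Subj` and `Obj`, and the functions `train` and `predict` are all hypothetical, and add-one smoothing is an assumption of this sketch.

```python
import math
from collections import Counter, defaultdict


def train(samples):
    """Count tags and per-tag features from (features, tag) pairs."""
    tag_counts = Counter()
    feat_counts = defaultdict(Counter)
    vocab = set()
    for feats, tag in samples:
        tag_counts[tag] += 1
        for f in feats:
            feat_counts[tag][f] += 1
            vocab.add(f)
    return tag_counts, feat_counts, vocab


def predict(model, feats):
    """Return the tag maximising log P(tag) + sum of log P(feature | tag)."""
    tag_counts, feat_counts, vocab = model
    total = sum(tag_counts.values())
    best_tag, best_score = None, -math.inf
    for tag, n in tag_counts.items():
        score = math.log(n / total)
        # Add-one (Laplace) smoothing over the feature vocabulary.
        denom = sum(feat_counts[tag].values()) + len(vocab)
        for f in feats:
            score += math.log((feat_counts[tag][f] + 1) / denom)
        if score > best_score:
            best_tag, best_score = tag, score
    return best_tag


if __name__ == "__main__":
    # Hypothetical toy training data, invented for illustration only.
    samples = [
        (("pos=noun", "marker=subj"), "Subj"),
        (("pos=pronoun", "marker=subj"), "Subj"),
        (("pos=noun", "marker=obj"), "Obj"),
        (("pos=noun", "marker=obj"), "Obj"),
    ]
    model = train(samples)
    print(predict(model, ("pos=pronoun", "marker=subj")))
```

In a real tagger the features would come from the segmentation, POS and chunking information the abstract describes, and the probabilities would be estimated from an annotated corpus.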

arXiv.org e-Print Archive

CiteSeerX

MERAL Portal

A Novel Approach to Printed Arabic Optical Character Recognition

Author: Alghamdi Mansoor
Publication date: 25/09/2019

Bangor University Research Portal

Text Augmentation: Inserting markup into natural language text with PPM Models

Author: Yeates Stuart Andrew
Publication venue: The University of Waikato
Publication date: 01/01/2006

This thesis describes a new optimisation and new heuristics for automatically marking up XML documents. These are implemented in CEM, using PPM models. CEM is significantly more general than previous systems, marking up large numbers of hierarchical tags, using n-gram models for large n and a variety of escape methods. Four corpora are discussed, including the bibliography corpus of 14682 bibliographies laid out in seven standard styles using the BIBTEX system and marked up in XML with every field from the original BIBTEX. Other corpora include the ROCLING Chinese text segmentation corpus, the Computists’ Communique corpus and the Reuters’ corpus. A detailed examination is presented of the methods of evaluating markup algorithms, including computational complexity measures and correctness measures from the fields of information retrieval, string processing, machine learning and information theory. A new taxonomy of markup complexities is established and the properties of each taxon are examined in relation to the complexity of marked-up documents. The performance of the new heuristics and optimisation is examined using the four corpora.

CiteSeerX

Research Commons@Waikato

A compression based toolkit for text processing

Author: Teahan William
Publication date: 10/04/2018

Bangor University Research Portal