Search CORE

299 research outputs found

Fast, Small and Exact: Infinite-order Language Modelling with Compressed Suffix Trees

Author: Cohn Trevor
Haffari Gholamreza
Petri Matthias
Shareghi Ehsan
Publication venue
Publication date: 01/01/2016
Field of study

Efficient methods for storing and querying are critical for scaling high-order n-gram language models to large corpora. We propose a language model based on compressed suffix trees, a representation that is highly compact and can be easily held in memory, while supporting queries needed in computing language model probabilities on-the-fly. We present several optimisations which improve query runtimes up to 2500x, despite only incurring a modest increase in construction time and memory usage. For large corpora and high Markov orders, our method is highly competitive with the state-of-the-art KenLM package. It imposes much lower memory requirements, often by orders of magnitude, and has runtimes that are either similar (for training) or comparable (for querying).Comment: 14 pages in Transactions of the Association for Computational Linguistics (TACL) 201

arXiv.org e-Print Archive

University of Melbourne Institutional Repository

Monash University Research Portal

On Prediction Using Variable Order Markov Models

Author: Begleiter R.
El-Yaniv R.
Yona G.
Publication venue: 'AI Access Foundation'
Publication date: 30/06/2011
Field of study

This paper is concerned with algorithms for prediction of discrete sequences over a finite alphabet, using variable order Markov models. The class of such algorithms is large and in principle includes any lossless compression algorithm. We focus on six prominent prediction algorithms, including Context Tree Weighting (CTW), Prediction by Partial Match (PPM) and Probabilistic Suffix Trees (PSTs). We discuss the properties of these algorithms and compare their performance using real life sequences from three domains: proteins, English text and music pieces. The comparison is made with respect to prediction quality as measured by the average log-loss. We also compare classification algorithms based on these predictors with respect to a number of large protein classification tasks. Our results indicate that a "decomposed" CTW (a variant of the CTW algorithm) and PPM outperform all other algorithms in sequence prediction tasks. Somewhat surprisingly, a different algorithm, which is a modification of the Lempel-Ziv compression algorithm, significantly outperforms all algorithms on the protein classification problems

arXiv.org e-Print Archive

Crossref

Hierarchical Bayesian Nonparametric Models for Power-Law Sequences

Author: Gasthaus Jan Alexander
Publication venue: UCL (University College London)
Publication date: 28/03/2020
Field of study

Sequence data that exhibits power-law behavior in its marginal and conditional distributions arises frequently from natural processes, with natural language text being a prominent example. We study probabilistic models for such sequences based on a hierarchical non-parametric Bayesian prior, develop inference and learning procedures for making these models useful in practice and applicable to large, real-world data sets, and empirically demonstrate their excellent predictive performance. In particular, we consider models based on the infinite-depth variant of the hierarchical Pitman-Yor process (HPYP) language model [Teh, 2006b] known as the Sequence Memoizer, as well as Sequence Memoizer-based cache language models and hybrid models combining the HPYP with neural language models. We empirically demonstrate that these models performwell on languagemodelling and data compression tasks

UCL Discovery

Methods and algorithms for unsupervised learning of morphology

Author: A. Gelbukh
A. Gispert de
B. Can
C. Monson
D. Blackwell
D. Harman
D.R. Morrison
E. Arısoy
E. Minkov
H. Ishwaran
H. Poon
H. Poon
J. Goldsmith
K. Järvelin
K. Kettunen
K. Kirchhoff
K. Sirts
K. Toutanova
L. Aunimo
M. Creutz
M. Kurimo
M.A. Hafer
M.R. Brent
N.A. Smith
P.F. Brown
R. Krovetz
S. Bordag
S. Manandhar
S. Neuvel
Z.S. Harris
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2014
Field of study

This is an accepted manuscript of a chapter published by Springer in Computational Linguistics and Intelligent Text Processing. CICLing 2014. Lecture Notes in Computer Science, vol 8403 in 2014 available online: https://doi.org/10.1007/978-3-642-54906-9_15 The accepted version of the publication may differ from the final published version.This paper is a survey of methods and algorithms for unsupervised learning of morphology. We provide a description of the methods and algorithms used for morphological segmentation from a computational linguistics point of view. We survey morphological segmentation methods covering methods based on MDL (minimum description length), MLE (maximum likelihood estimation), MAP (maximum a posteriori), parametric and non-parametric Bayesian approaches. A review of the evaluation schemes for unsupervised morphological segmentation is also provided along with a summary of evaluation results on the Morpho Challenge evaluations.Published versio

Crossref

Wolverhampton Intellectual Repository and E-theses

Nuevas tendencias en fundamentos teóricos y aplicaciones de la minería de datos aplicada a la clasificación de textos en lenguaje natural

Author: Carmona Cejudo José María
Publication venue: 'Universidad de Malaga'
Publication date: 01/01/2013
Field of study

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Repositorio Institucional Universidad de Málaga

Minimal Absent Words in Rooted and Unrooted Trees

Author: B Schieber
C Barton
D Belazzougui
D Belazzougui
F Mignosi
F Mignosi
F Mignosi
G Fici
G Fici
M Béal
M Béal
M Crochemore
M Crochemore
M Crochemore
M-P Béal
MA Bender
P Charalampopoulos
P Charalampopoulos
RM Silva
S Chairungsee
T Shibuya
Y Almirantis
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2019
Field of study

We extend the theory of minimal absent words to (rooted and unrooted) trees, having edges labeled by letters from an alphabet of cardinality. We show that the set of minimal absent words of a rooted (resp. unrooted) tree T with n nodes has cardinality (resp.), and we show that these bounds are realized. Then, we exhibit algorithms to compute all minimal absent words in a rooted (resp. unrooted) tree in output-sensitive time (resp. assuming an integer alphabet of size polynomial in n

arXiv.org e-Print Archive

Crossref

Archivio istituzionale della ricerca - Università di Palermo

Recommended from our members

Minimally supervised induction of morphology through bitexts

Author: Moon Taesun, Ph. D.
Publication venue
Publication date: 01/12/2008
Field of study

textA knowledge of morphology can be useful for many natural language processing systems. Thus, much effort has been expended in developing accurate computational tools for morphology that lemmatize, segment and generate new forms. The most powerful and accurate of these have been manually encoded, such endeavors being without exception expensive and time-consuming. There have been consequently many attempts to reduce this cost in the development of morphological systems through the development of unsupervised or minimally supervised algorithms and learning methods for acquisition of morphology. These efforts have yet to produce a tool that approaches the performance of manually encoded systems. Here, I present a strategy for dealing with morphological clustering and segmentation in a minimally supervised manner but one that will be more linguistically informed than previous unsupervised approaches. That is, this study will attempt to induce clusters of words from an unannotated text that are inflectional variants of each other. Then a set of inflectional suffixes by part-of-speech will be induced from these clusters. This level of detail is made possible by a method known as alignment and transfer (AT), among other names, an approach that uses aligned bitexts to transfer linguistic resources developed for one language–the source language–to another language–the target. This approach has a further advantage in that it allows a reduction in the amount of training data without a significant degradation in performance making it useful in applications targeted at data collected from endangered languages. In the current study, however, I use English as the source and German as the target for ease of evaluation and for certain typlogical properties of German. The two main tasks, that of clustering and segmentation, are approached as sequential tasks with the clustering informing the segmentation to allow for greater accuracy in morphological analysis. While the performance of these methods does not exceed the current roster of unsupervised or minimally supervised approaches to morphology acquisition, it attempts to integrate more learning methods than previous studies. Furthermore, it attempts to learn inflectional morphology as opposed to derivational morphology, which is a crucial distinction in linguistics.Linguistic

Texas ScholarWorks