In this article, we propose the use of suffix arrays to efficiently implement n-gram language models with practically unlimited size n. This approach, which is used with synchronous back-off, allows us to distinguish between alternative sequences using large contexts. We also show that we can build this kind of model with additional information for each symbol, such as part-of-speech tags and dependency information. The approach can also be viewed as a collection of virtual k-testable automata. Once built, we can directly access the results of any k-testable automaton generated from the input training data. Synchronous back-off automatically identifies the k-testable automaton with the largest feasible k. We have used this approach in several classification tasks.

Stehouwer, H.

van Zaanen, M.

English

MPG.PuRe

Enhanced Suffix Arrays as Language Models: Virtual k-Testable Languages

Herman Stehouwer and Menno van Zaanen

TiCC, Tilburg University, Tilburg, The Netherlands
{J.H.Stehouwer,M.M.vanZaanen}@uvt.nl

Abstract. In this article, we propose the use of suffix arrays to efficiently implement n-gram language models with practically unlimited size n. This approach, which is used with synchronous back-off, allows us to distinguish between alternative sequences using large contexts. We also show that we can build this kind of model with additional information for each symbol, such as part-of-speech tags and dependency information.

The approach can also be viewed as a collection of virtual k-testable automata. Once built, we can directly access the results of any k-testable automaton generated from the input training data. Synchronous back-off automatically identifies the k-testable automaton with the largest feasible k. We have used this approach in several classification tasks.

1 Introduction

When writing texts, people often use spelling checkers to reduce the number of mistakes in their texts. Many spelling checkers concentrate on non-word errors. However, there are also types of errors in which words are correct, but used incorrectly in context. These errors are called contextual errors and are much harder to recognize than non-word errors.

In this paper, we describe a novel approach that stores the models in suffix arrays, which are sorted arrays containing all suffixes of a collection of sequences. This approach can be used to make decisions about alternative corrections of contextual errors. The use of suffix arrays allows us to use large, potentially enriched n-grams and as such can be seen as an extension to more conventional n-gram models. The underlying assumption of the language model is that using more (precise) information pertaining to the decision is better [3].

The approach can also be seen as a collection of k-testable automata that we can access using a single query.
As de la Higuera states in [4], choosing the right size k is a crucial issue. When k is too small, over-generalization will occur; conversely, too large a k leads to models that might not generalize enough. The approach described here automatically chooses the largest k applicable to the situation.

J.M. Sempere and P. García (Eds.): ICGI 2010, LNAI 6339, pp. 305–308, 2010. © Springer-Verlag Berlin Heidelberg 2010

2 Approach

To select the best sequence out of a set of alternative sequences, such as in the problem of contextual errors in text, we consider all possible alternatives and use a model to select the most likely sequence. The sequence with the highest probability is selected as the correct form.

The language model we use here is based on n-grams of unbounded size. The probability of a sequence is computed by multiplying the probabilities of the n-gram for each position in the sequence.

$$P_{seq} = \prod_{w \in seq} P_{LM}(w \mid w_{-1} \ldots w_{-n})$$

Because the probabilities are extracted from the training data, data sparseness is an issue when using n-grams with very large n. Long sequences may simply not occur in the data, even though the sequence is correct. This leads to a probability of zero where the correct probability should be non-zero (albeit small).

To reduce the impact of data sparseness, we can use techniques such as smoothing [2], which redistributes probability mass to estimate the probability of previously unseen word sequences (see footnote 1), or back-off, where probabilities of lower-order n-grams are used to approximate the probability of the larger n-gram.

In this article, we use the synchronous back-off method [6] to deal with data sparseness. This method analyzes n-grams of the same size for each of the alternative sequences in parallel. If all n-grams have zero probability, the method backs off to (n − 1)-grams. This continues until at least one n-gram for an alternative has a non-zero probability.
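The back-off loop described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes that each alternative is a short token window ending in the candidate word, that raw counts stand in for probabilities, and that n-grams are counted naively in a dictionary rather than looked up in a suffix array. The names `ngram_counts` and `synchronous_backoff` are hypothetical.

```python
from collections import Counter


def ngram_counts(tokens, max_n):
    """Count every n-gram of length 1..max_n in a token list (naive, for illustration)."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def synchronous_backoff(alternatives, counts):
    """Choose between alternatives using n-grams of the same size for all of them.

    Assumption: the n-gram of an alternative is its last n tokens.
    n only shrinks when every alternative has a zero count.
    Returns (best_alternative, n_used), or (None, 0) if nothing is attested.
    """
    max_n = max(len(alt) for alt in alternatives)
    for n in range(max_n, 0, -1):
        scored = [(counts[tuple(alt[-n:])], alt) for alt in alternatives]
        best_count, best_alt = max(scored, key=lambda pair: pair[0])
        if best_count > 0:
            return best_alt, n
    return None, 0


training = "the cat sat on the mat and then the dog sat on the rug".split()
counts = ngram_counts(training, max_n=4)

# Neither 4-gram occurs in the training data, so the method backs off to 3-grams.
alternatives = [("bird", "sat", "on", "then"), ("bird", "sat", "on", "the")]
best, n_used = synchronous_backoff(alternatives, counts)
```

In this toy example both 4-grams are unseen, so all alternatives are compared again at n = 3, where only "sat on the" is attested; `best` is the second alternative and `n_used` is 3.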
Synchronous back-off implements the idea that, assuming the training data is sufficient, an n-gram combination with zero probability is not in the language. Effectively, this method automatically selects the largest usable n-grams.

Probabilities of all n-grams (from the training data) of all sizes are stored in an enhanced suffix array. A suffix array is a flat data structure containing an implicit suffix tree structure [1]. A suffix tree is a trie-based data structure [5, pp. 492] that stores all suffixes of a sequence in such a way that a suffix (and similarly an infix) can be found in time linear in the length of the suffix. Each suffix occupies a single path from the root of the suffix tree to a leaf. Construction of the data structure only needs to be performed once.

Due to the way suffix arrays are constructed, we can efficiently find the number of occurrences of subsequences (used as n-grams) of the training data. Starting from the entire suffix array, we can quickly identify the interval(s) that pertain to the particular n-gram query. The size of the interval is exactly the number of occurrences of the subsequence in the training data. Effectively, this means that we can find the largest non-zero n-gram efficiently.

Footnote 1. In this paper we do not employ smoothing or interpolation methods, as they modify the probabilities of all alternatives equally and hence will not affect the ordering of alternative sequences.

3 Suffix Arrays as Collections of k-Testable Machines

An enhanced suffix array extends a regular suffix array with a data structure allowing for implicit access to the longest-common-prefix (lcp) intervals [1]. An lcp interval represents a virtual node in the implicit suffix trie. A simple enhanced suffix array with its corresponding implicit suffix trie is shown in Figure 1 as an example.

We can view a suffix array as a virtual DFA in which each state is described by a set of lcp intervals over the suffix array.
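The interval lookup and the DFA view can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the enhanced suffix array of [1]: the array is built by naive sorting, no lcp table is stored, and each query symbol narrows the current interval by binary search. The function names are hypothetical. Because no `$` sentinel is appended, shorter suffixes sort first, so the order differs slightly from Figure 1.

```python
def build_suffix_array(tokens):
    """Return the start positions of all suffixes in sorted order (naive construction)."""
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])


def narrow(tokens, sa, lo, hi, depth, symbol):
    """Shrink the interval [lo, hi) to suffixes whose token at offset `depth` is `symbol`.

    Assumes all suffixes in [lo, hi) already share their first `depth` tokens.
    """
    def token_at(rank):
        pos = sa[rank] + depth
        # "" marks a suffix that has ended; it sorts before any non-empty token.
        return tokens[pos] if pos < len(tokens) else ""

    def first_rank(pred):
        # Binary search for the first rank in [lo, hi) whose token satisfies pred.
        left, right = lo, hi
        while left < right:
            mid = (left + right) // 2
            if pred(token_at(mid)):
                right = mid
            else:
                left = mid + 1
        return left

    return first_rank(lambda t: t >= symbol), first_rank(lambda t: t > symbol)


def count(tokens, sa, query):
    """Number of occurrences of `query`; each narrowing step is one DFA transition."""
    lo, hi = 0, len(sa)
    for depth, symbol in enumerate(query):
        lo, hi = narrow(tokens, sa, lo, hi, depth, symbol)
        if lo == hi:  # empty state: the query is rejected
            return 0
    return hi - lo


tokens = list("acaaacatat")  # the string used in Figure 1, without the $ sentinel
sa = build_suffix_array(tokens)
occurrences_of_at = count(tokens, sa, list("at"))
```

Here the interval after reading `a` covers six suffixes, and reading `t` narrows it to two, so `occurrences_of_at` is 2. A query of length k that ends in a non-empty interval corresponds to acceptance by the k-testable machine.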
Viewing the suffix array as a virtual DFA allows us to determine (by the size of the interval) the number of valid sequences that terminate in each state. If there is no valid path in the DFA for the queried sequence, the query ends in an empty state and the sequence is rejected by the learned grammar.

Since the suffix array stores the n-grams of all sizes n, this comes down to a collection of k-testable machines with k = 1 . . . |T| (with T the training data). Querying with length k automatically results in using a k-testable machine.

An interesting property of the n-gram suffix array approach separates it from collections of regular k-testable machine DFAs: all the states on the suffix array are accepting states. Rejection of a sequence only happens when the query cannot be found in the training data at all. The system also does not support negative training examples, only positive ones.

To enhance the system, we have generalized a state to be described by a set of lcp intervals. This allows for the support of single-position wildcards. In practice, wildcards allow for the integration of additional information. By interleaving the symbol sequences with the additional symbols, we can incorporate, for instance, long-range information such as dependency information, and local, less specific features such as part-of-speech tags. Using wildcards, we can construct queries that either use such additional information on one or more positions or not.

  i   suffix   lcp   S[suffix]
  0     2       0    aaacatat$
  1     3       2    aacatat$
  2     0       1    acaaacatat$
  3     4       3    acatat$
  4     6       1    atat$
  5     8       2    at$
  6     1       0    caaacatat$
  7     5       2    catat$
  8     7       0    tat$
  9     9       1    t$
 10    10       0    $

Fig. 1. An enhanced suffix array on the string S = acaaacatat (table above); the original figure also shows the corresponding lcp-interval tree. From [1].

To evaluate the approach, we ran experiments on three contextual error problems from the natural language domain, namely confusible disambiguation, verb and noun agreement, and adjective ordering.
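The single-position wildcards and interleaved tags described in this section can be illustrated with a small Python sketch. It makes several assumptions that are not from the paper: words and part-of-speech tags are interleaved as `word, tag, word, tag, ...`, the wildcard is written `*`, the suffix array is built by naive sorting, and intervals are split by a linear scan instead of using lcp information. The state is a list of intervals, mirroring the generalized state described above.

```python
WILDCARD = "*"  # hypothetical wildcard symbol, assumed not to occur in the data


def build_suffix_array(tokens):
    """Return the start positions of all suffixes in sorted order (naive construction)."""
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])


def token_at(tokens, sa, rank, depth):
    """Token at offset `depth` of the suffix with the given rank, or "" past the end."""
    pos = sa[rank] + depth
    return tokens[pos] if pos < len(tokens) else ""


def split_interval(tokens, sa, lo, hi, depth):
    """Split [lo, hi) into maximal runs that share the same token at `depth`."""
    runs, start = [], lo
    for rank in range(lo + 1, hi + 1):
        if rank == hi or token_at(tokens, sa, rank, depth) != token_at(tokens, sa, start, depth):
            runs.append((token_at(tokens, sa, start, depth), start, rank))
            start = rank
    return runs


def count_with_wildcards(tokens, sa, query):
    """Count matches of `query`, where WILDCARD matches any single token."""
    state = [(0, len(sa))]  # a state is a set of intervals
    for depth, symbol in enumerate(query):
        next_state = []
        for lo, hi in state:
            for token, run_lo, run_hi in split_interval(tokens, sa, lo, hi, depth):
                if token != "" and (symbol == WILDCARD or token == symbol):
                    next_state.append((run_lo, run_hi))
        state = next_state
    return sum(hi - lo for lo, hi in state)


# Words interleaved with their (assumed) part-of-speech tags.
tokens = ("the DT dog NN barks VBZ the DT cat NN sleeps VBZ "
          "a DT dog NN sleeps VBZ").split()
sa = build_suffix_array(tokens)

any_determiner_dog = count_with_wildcards(tokens, sa, ["*", "DT", "dog", "NN"])
words_only = count_with_wildcards(tokens, sa, ["dog", "*", "sleeps", "*"])
```

The first query uses the tag but leaves the determiner itself open and matches twice ("the dog" and "a dog"); the second ignores the tags by placing wildcards on the tag positions and matches once.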
The synchronous back-off method automatically selects the k-testable machine that has the right amount of specificity for selecting between the alternative sequences. These experiments were run with a simple words-only approach and also with part-of-speech tags. The experiments show that the approach is feasible and efficient.

When trained on the first 675 thousand sequences of the British National Corpus, building the enhanced suffix array takes 2.3 minutes on average. These sequences contain about 27 million tokens. When loaded into memory, the enhanced suffix array uses roughly 500 megabytes. We ran speed tests using 10,000 randomly selected sequences of length 10. The system has an average runtime of 10.2 minutes over tens of runs, with extremes of 8.1 and 12.1 minutes. This means that we can expect the enhanced suffix array to process around 1200 queries per minute. All tests were run on a 2 GHz Opteron system with 32 GB of main memory. The suffix array process is single-threaded.

4 Conclusion and Future Work

We have proposed a novel approach which implements a collection of k-testable automata using an enhanced suffix array. This approach describes automata that have no explicit reject states and do not require (or support) negative examples during training. Nevertheless, this approach allows for an efficient implementation of many concurrent k-testable machines of various k using suffix arrays.

The implementation will be applied as a practical system in the context of text correction, allowing additional linguistic information to be added when needed. In this context, the effectiveness of the additional information in combination with the limitations of k-testable languages still needs to be evaluated.

References

1. Abouelhoda, M.I., Kurtz, S., Ohlebusch, E.: Replacing suffix trees with enhanced suffix arrays. Journal of Discrete Algorithms 2(1), 53–86 (2004)
2. Chen, S., Goodman, J.: An empirical study of smoothing techniques for language modelling. In: Proceedings of the 34th Annual Meeting of the ACL, pp. 310–318. ACL (June 1996)
3. Daelemans, W., Van den Bosch, A., Zavrel, J.: Forgetting exceptions is harmful in language learning. Machine Learning, Special issue on Natural Language Learning 34, 11–41 (1999)
4. de la Higuera, C.: Grammatical Inference: Learning Automata and Grammars. Cambridge University Press, Cambridge (2010)
5. Knuth, D.E.: The art of computer programming. Sorting and searching, vol. 3. Addison-Wesley, Reading (1973)
6. Stehouwer, H., Van den Bosch, A.: Putting the t where it belongs: Solving a confusion problem in Dutch. In: Verberne, S., van Halteren, H., Coppen, P.A. (eds.) Computational Linguistics in the Netherlands 2007: Selected Papers from the 18th CLIN Meeting, pp. 21–36. Nijmegen, The Netherlands (2009)

Enhanced suffix arrays as language models: Virtual k-testable languages

http://hdl.handle.net/11858/00-001M-0000-0012-3E92-3
