1,288 research outputs found

    Parallel Suffix Arrays for Corpus Exploration

    Get PDF

    Handling Massive N-Gram Datasets Efficiently

    Get PDF
    This paper deals with the two fundamental problems concerning the handling of large n-gram language models: indexing, that is, compressing the n-gram strings and associated satellite data without compromising their retrieval speed; and estimation, that is, computing the probability distribution of the strings from a large textual source. Regarding the problem of indexing, we describe compressed, exact and lossless data structures that achieve, at the same time, high space reductions and no time degradation with respect to state-of-the-art solutions and related software packages. In particular, we present a compressed trie data structure in which each word following a context of fixed length k, i.e., its preceding k words, is encoded as an integer whose value is proportional to the number of words that follow such context. Since the number of words following a given context is typically very small in natural languages, we lower the space of representation to compression levels that were never achieved before. Despite the significant savings in space, our technique introduces a negligible penalty at query time. Regarding the problem of estimation, we present a novel algorithm for estimating modified Kneser-Ney language models, which have emerged as the de facto choice for language modeling in both academia and industry thanks to their relatively low perplexity performance. Estimating such models from large textual sources poses the challenge of devising algorithms that make a parsimonious use of the disk. The state-of-the-art algorithm uses three sorting steps in external memory: we show an improved construction that requires only one sorting step by exploiting the properties of the extracted n-gram strings. With an extensive experimental analysis performed on billions of n-grams, we show an average improvement of 4.5X on the total running time of the state-of-the-art approach. (Comment: Published in ACM Transactions on Information Systems (TOIS), February 2019, Article No. 2.)
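    A minimal sketch of the context-based remapping idea described in the abstract: each word that follows a fixed-length context is re-coded as its rank within the (typically small) set of words observed after that context, so the stored integers are bounded by the number of successors rather than by the vocabulary size. The function and variable names are illustrative, not the paper's actual implementation.

```python
from collections import defaultdict

def remap_successors(ngrams, k):
    """ngrams: iterable of (k+1)-tuples of word ids.
    Returns a per-context codebook (word -> small code) and the re-coded n-grams."""
    successors = defaultdict(set)
    for gram in ngrams:
        successors[tuple(gram[:k])].add(gram[k])
    # Each successor gets a code no larger than the number of words observed
    # after its context, which is usually tiny in natural language.
    codebook = {ctx: {w: i for i, w in enumerate(sorted(ws))}
                for ctx, ws in successors.items()}
    recoded = [(tuple(g[:k]), codebook[tuple(g[:k])][g[k]]) for g in ngrams]
    return codebook, recoded

# With context length k = 2, the trailing word of each trigram is stored as a
# rank in a successor set that is far smaller than the vocabulary.
codebook, recoded = remap_successors([(1, 2, 7), (1, 2, 9), (3, 4, 7)], k=2)
```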

    Optimized Indexes for Data Structured Retrieval

    Get PDF
    The aim of this work is to present a novel index structure based on a suffix array and a ternary search tree, combined with rank and select succinct data structures. Suffix arrays were originally developed to reduce memory consumption compared to suffix trees, while ternary search trees combine the time efficiency of digital tries with the space efficiency of binary search trees. The rank of a symbol at a given position equals the number of times the symbol appears in the corresponding prefix of the sequence; select is the inverse operation, retrieving the positions of the symbol's occurrences. These operations are widely used in information retrieval and management, being the basis of several data structures and algorithms for text collections, graphs, trees, etc. The resulting structure is faster than hashing for many typical search problems and supports a broader range of useful problems and operations. We therefore implement a path index based on these data structures, which is shown to be highly efficient when dealing with digital collections consisting of structured documents. We describe how the index architecture works, compare its searching algorithms with others, and finally present experiments showing that it outperforms earlier approaches.
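    The rank and select operations defined in the abstract can be illustrated with a naive sketch over a plain Python string; actual succinct structures (bit vectors, wavelet trees) answer the same queries in near-constant time with little extra space, which this toy version does not attempt.

```python
def rank(seq, symbol, pos):
    """Number of occurrences of `symbol` in the prefix seq[:pos]."""
    return sum(1 for s in seq[:pos] if s == symbol)

def select(seq, symbol, k):
    """0-based position of the k-th occurrence of `symbol` (k starts at 1)."""
    count = 0
    for i, s in enumerate(seq):
        if s == symbol:
            count += 1
            if count == k:
                return i
    raise ValueError("fewer than k occurrences")

text = "abracadabra"
assert rank(text, "a", 5) == 2    # 'a' appears twice in "abrac"
assert select(text, "a", 2) == 3  # the second 'a' is at position 3
```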

    Genuine phrase-based statistical machine translation with supervision

    Get PDF
    This thesis addresses mainly two issues that have not been addressed in Statistical Machine Translation. One issue is that, even though research has been evolving from word-based approaches to phrase-based ones because words were consistently found to be inappropriate translation units, words are still considered in the composition of phrases, either to determine translation equivalents or to check language fluency. Such consideration might result in the attempt to establish relations between words within a phrase translation equivalent even when such phrases should be considered as a whole. Attempts to further partition such phrases would produce incorrect translation units that would introduce unwanted noise in the translation process. Besides, the internal fluency of an identified multi-word phrase should not require checking. As such, phrases should indeed be considered units, avoiding incorrect translation equivalents that might be identified from their partition, and only considering the fluency of a phrase with respect to other phrases, not within the phrase itself. The other issue is that supervision, in the form of translation lexica, is generally overlooked, with SMT research focusing mainly on the identification of translation units without any human intervention and without considering already known translation units. As such, no importance has been attributed to the inclusion of verified lexica, with dictionaries only rarely used to score translation candidates and not really as a source of translation units. Indeed, translation equivalents should be memorized, checked and used as a source of translation units, avoiding the need to keep identifying the same translation units, in particular if those are frequently used. This thesis presents a truly phrase-based approach to SMT, using contiguous and non-contiguous phrases, along with supervision, in which phrases are not divided and verified lexica are built, kept and used to propose translations of complete sentences.
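    As a toy illustration of the central point, the sketch below matches translation equivalents from a verified lexicon against a sentence as whole units, never splitting them into words. The lexicon entries, the greedy longest-match strategy, and the restriction to contiguous phrases are assumptions made for the example, not the thesis's actual procedure.

```python
# Hypothetical verified lexicon: whole phrases mapped to whole phrases.
verified_lexicon = {
    ("new", "york"): ("nova", "iorque"),
    ("thank", "you"): ("obrigado",),
}

def match_whole_phrases(sentence):
    """Greedily cover the sentence with the longest lexicon phrases, kept whole."""
    words, i, units = sentence.lower().split(), 0, []
    while i < len(words):
        for length in range(len(words) - i, 0, -1):   # longest match first
            phrase = tuple(words[i:i + length])
            if phrase in verified_lexicon:
                units.append(verified_lexicon[phrase])
                i += length
                break
        else:
            units.append((words[i],))                  # unknown word kept as-is
            i += 1
    return units

print(match_whole_phrases("thank you for visiting new york"))
# [('obrigado',), ('for',), ('visiting',), ('nova', 'iorque')]
```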

    Parallel texts alignment

    Get PDF
    Work presented in the scope of the Master's programme in Engenharia Informática (Computer Engineering), as a partial requirement for obtaining the degree of Master in Computer Engineering. Alignment of parallel texts (texts that are translations of each other) is a required step for many applications that use parallel texts, including statistical machine translation, automatic extraction of translation equivalents, automatic creation of concordances, etc. This dissertation presents a new methodology for parallel text alignment that departs from previous work in several ways. One important departure is a shift of goals concerning the use of lexicons for obtaining correspondences between the texts. Previous methods try to infer a bilingual lexicon as part of the alignment process and use it to obtain correspondences between the texts. Some of those methods can use external lexicons to complement the inferred one, but they tend to consider them as secondary. This dissertation presents several arguments supporting the thesis that lexicon inference should not be embedded in the alignment process. The method described complies with this statement and relies exclusively on externally managed lexicons to obtain correspondences. Moreover, the algorithms presented can handle very large lexicons containing terms of arbitrary length. Besides the exclusive use of external lexicons, this dissertation presents a new method for obtaining correspondences between translation equivalents found in the texts. It uses a decision criterion based on features that have been overlooked by prior work. The proposed method is iterative and refines the alignment at each iteration. It uses the alignment obtained in one iteration as a guide to obtaining new correspondences in the next iteration, which in turn are used to compute a finer alignment. This iterative scheme allows the method to correct correspondence errors from previous iterations in the face of new information.
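    The iterative scheme can be pictured with a small sketch: candidate correspondences come only from an externally supplied lexicon, and each pass keeps the candidates that agree with the alignment estimated on the previous pass. The single-token lexicon, the linear position model and the tolerance are simplifying assumptions for illustration, not the dissertation's actual decision criterion.

```python
def align(src_tokens, tgt_tokens, lexicon, iterations=3, tol=0.2):
    """lexicon: dict mapping a source term to its target term (single tokens here,
    although the external lexicons may contain terms of arbitrary length)."""
    # Candidate correspondences: positions where a lexicon pair occurs in both texts.
    candidates = [(i, j)
                  for i, s in enumerate(src_tokens) if s in lexicon
                  for j, t in enumerate(tgt_tokens) if t == lexicon[s]]
    kept = candidates
    slope = len(tgt_tokens) / max(len(src_tokens), 1)
    for _ in range(iterations):
        # Re-estimate a crude linear alignment from the correspondences kept so
        # far, then drop the candidates that deviate too much from it.
        if kept:
            slope = sum(j for _, j in kept) / max(sum(i for i, _ in kept), 1)
        kept = [(i, j) for (i, j) in candidates
                if abs(j - slope * i) <= tol * len(tgt_tokens)]
    return sorted(kept)
```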

    Statistical language models for alternative sequence selection

    No full text

    K-repeating Substrings: a String-Algorithmic Approach to Privacy-Preserving Publishing of Textual Data

    Get PDF