
    Efficient Construction of the BWT for Repetitive Text Using String Compression

    Funding Information: Diego Díaz-Domínguez: Academy of Finland Grant 323233; Gonzalo Navarro: ANID Basal Funds FB0001 and Fondecyt Grant 1-200038, Chile. Publisher Copyright: © Diego Díaz-Domínguez and Gonzalo Navarro; licensed under Creative Commons License CC-BY 4.0.
    We present a new semi-external algorithm that builds the Burrows-Wheeler transform variant of Bauer et al. (a.k.a. the BCR BWT) in linear expected time. Our method uses compression techniques to reduce the computational costs when the input is massive and repetitive. Concretely, we build on induced suffix sorting (ISS) and resort to run-length and grammar compression to maintain our intermediate results in compact form. Our compression format not only saves space, but also speeds up the required computations. Our experiments show important savings in both space and computation time when the text is repetitive. On average, we are 3.7x faster than the baseline compressed approach, while maintaining a similar memory consumption. These results make our method stand out as the only one (to our knowledge) that can build the BCR BWT of a collection of 25 human genomes (75 GB) in about 7.3 hours, using only 27 GB of working memory.
    Peer reviewed
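    As a rough point of reference for the concepts this abstract builds on (and emphatically not the authors' ISS-based, semi-external algorithm), the combination of a Burrows-Wheeler transform with run-length encoding can be sketched naively as follows; the function names are invented for the example and the rotation-sorting construction is quadratic, suitable only for tiny inputs.

# Illustrative only: a naive BWT via explicit rotation sorting, followed by
# run-length encoding. Repetitive texts produce BWTs with long runs, which is
# what run-length compression exploits.

def bwt_naive(text: str, sentinel: str = "$") -> str:
    """Return the BWT of `text` by sorting all rotations of text + sentinel."""
    s = text + sentinel                      # unique end marker, smaller than all symbols
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)

def run_length_encode(s: str):
    """Collapse maximal runs of equal symbols into (symbol, length) pairs."""
    runs = []
    for ch in s:
        if runs and runs[-1][0] == ch:
            runs[-1] = (ch, runs[-1][1] + 1)
        else:
            runs.append((ch, 1))
    return runs

if __name__ == "__main__":
    bwt = bwt_naive("banana")
    print(bwt)                     # annb$aa
    print(run_length_encode(bwt))  # [('a', 1), ('n', 2), ('b', 1), ('$', 1), ('a', 2)]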

    Kernel methods in machine learning

    We review machine learning methods employing positive definite kernels. These methods formulate learning and estimation problems in a reproducing kernel Hilbert space (RKHS) of functions defined on the data domain, expanded in terms of a kernel. Working in linear spaces of functions has the benefit of facilitating the construction and analysis of learning algorithms while at the same time allowing large classes of functions. The latter include nonlinear functions as well as functions defined on nonvectorial data. We cover a wide range of methods, ranging from binary classifiers to sophisticated methods for estimation with structured data.
    Comment: Published at http://dx.doi.org/10.1214/009053607000000677 in the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org).
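    To make the RKHS viewpoint concrete: a kernel method replaces a linear predictor w·x with a kernel expansion f(x) = Σ_i α_i k(x_i, x). Below is a minimal kernel ridge regression sketch with a Gaussian (RBF) kernel, written in NumPy as an illustration of this family of methods rather than any specific algorithm from the survey; the function names are invented for the example.

# Minimal kernel ridge regression sketch (illustrative only).
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    """Gaussian (RBF) kernel matrix K[i, j] = exp(-gamma * ||x_i - y_j||^2)."""
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def kernel_ridge_fit(X, y, lam=1e-2, gamma=1.0):
    """Solve (K + lam*I) alpha = y; the fitted function lives in the kernel's RKHS."""
    K = rbf_kernel(X, X, gamma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def kernel_ridge_predict(X_train, alpha, X_new, gamma=1.0):
    """Evaluate f(x) = sum_i alpha_i k(x_i, x): nonlinear in x, linear in the RKHS."""
    return rbf_kernel(X_new, X_train, gamma) @ alpha

# Fit a nonlinear 1-D function from a handful of samples.
X = np.linspace(0, 1, 20)[:, None]
y = np.sin(2 * np.pi * X[:, 0])
alpha = kernel_ridge_fit(X, y, lam=1e-3, gamma=50.0)
print(kernel_ridge_predict(X, alpha, np.array([[0.25]]), gamma=50.0))  # close to sin(pi/2) = 1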

    Computer-Assisted Language Learning and the Revolution in Computational Linguistics

    For a long period, Computational Linguistics (CL) and Computer-Assisted Language Learning (CALL) developed almost entirely independently of each other. A brief historical survey shows that the main reason for this state of affairs was the long preoccupation in CL with the general problem of Natural Language Understanding (NLU). As a consequence, much effort was directed to fields such as Machine Translation (MT), which were perceived as incorporating and testing NLU. CALL does not fit this model very well, so it was hardly considered worth pursuing in CL. In the 1990s the realization that products could not live up to expectations, even in the domain of MT, led to a crisis. After this crisis the dominant approach in CL became much more problem-oriented. From this perspective, many of the earlier differences disadvantaging CALL with respect to MT have now disappeared. Therefore the revolution in CL offers promising perspectives for CALL.

    Block trees

    Let string S[1..n] be parsed into z phrases by the Lempel-Ziv algorithm. The corresponding compression algorithm encodes S in O(z) space, but it does not support random access to S. We introduce a data structure, the block tree, that represents S in O(z log(n/z)) space and extracts any symbol of S in time O(log(n/z)), among other space-time tradeoffs. The structure also supports other queries that are useful for building compressed data structures on top of S. Further, block trees can be built in linear time and in a scalable manner. Our experiments show that block trees offer relevant space-time tradeoffs compared to other compressed string representations for highly repetitive strings. (C) 2020 Elsevier Inc. All rights reserved.
    Peer reviewed
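    The parameter z above is the number of phrases in the Lempel-Ziv parse of S. As an illustration of that quantity only (not of the block tree itself, nor of its linear-time construction), a naive greedy LZ-style parser that counts phrases might look as follows; the parsing variant chosen here (source occurring strictly before the phrase) is one of several common ones and is quadratic.

# Illustrative greedy Lempel-Ziv parse: each phrase is the longest prefix of the
# remaining suffix that already occurs in the processed text, plus the next symbol.
# Naive and quadratic; meant only to make z = number of phrases concrete.

def lz_parse(s: str):
    phrases, i = [], 0
    while i < len(s):
        length = 0
        # Extend the phrase while s[i:i+length+1] occurs somewhere before position i.
        while i + length < len(s) and s[:i].find(s[i:i + length + 1]) != -1:
            length += 1
        phrases.append(s[i:i + length + 1])   # copied part plus one following symbol
        i += length + 1
    return phrases

text = "abababababab"
phrases = lz_parse(text)
print(phrases, "z =", len(phrases))   # ['a', 'b', 'aba', 'babab', 'ab'] z = 5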

    A Systematic Approach to English to Bangla Sentence Translator

    This paper deals with the design and development of an expert sentence translation system. In this translator, the source language is English, and the target language is Bangla. The implemented translation system determines the relationship among different forms of English and Bengali sentences and makes the appropriate correspondence between English and Bengali grammar. Here, we have developed a top-down parsing program. The system incorporates a dictionary and gives the corresponding Bengali meaning. The system performs the translation in three steps. The lexical analyzer reads the English sentence, tokenizes it into words, and stores the information in a stack; it uses the English-to-Bangla dictionary and word morphology to find lexical information. The parser parses the input sentence, identifies its type, and finds the tense, phrases, clauses, etc. The generator produces a Bangla sentence equivalent to the given English input sentence, using the output of the lexical analyzer and the parser. The system can translate all kinds of sentences, but its limitation is that it cannot handle semantic and contextual problems.
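    The three-step pipeline described above (lexical analysis, parsing, generation) can be pictured with a toy sketch. The dictionary entries, tag names, and reordering rule below are invented for illustration and are not the system described in the paper; a real translator needs morphology, clause handling, and a far larger lexicon.

# Toy illustration of a lexical-analyzer -> parser -> generator pipeline.
# Lexicon, tags, and the single SVO -> SOV rule are invented examples.

LEXICON = {                      # English word -> (part of speech, Bangla gloss)
    "i":    ("PRON", "আমি"),
    "eat":  ("VERB", "খাই"),
    "rice": ("NOUN", "ভাত"),
}

def lexical_analyzer(sentence: str):
    """Tokenize the English sentence and look each word up in the dictionary."""
    return [(w, *LEXICON[w]) for w in sentence.lower().split()]

def parser(tokens):
    """Recognize a very small subject-verb-object pattern."""
    tags = [tag for _, tag, _ in tokens]
    if tags == ["PRON", "VERB", "NOUN"]:
        return {"subject": tokens[0], "verb": tokens[1], "object": tokens[2]}
    raise ValueError("unsupported sentence pattern")

def generator(parse):
    """Reorder to subject-object-verb, the usual Bangla constituent order."""
    return " ".join(tok[2] for tok in (parse["subject"], parse["object"], parse["verb"]))

print(generator(parser(lexical_analyzer("I eat rice"))))   # আমি ভাত খাই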

    Three Studies on Model Transformations - Parsing, Generation and Ease of Use

    Transformations play an important part in both software development and the automatic processing of natural languages. We present three publications rooted in the multi-disciplinary research of Language Technology and Software Engineering and relate their contribution to the literature on syntactical transformations.
    Parsing Linear Context-Free Rewriting Systems. The first publication describes four different parsing algorithms for the mildly context-sensitive grammar formalism Linear Context-Free Rewriting Systems. The algorithms automatically transform a text into a chart. As a result, the parse chart contains the (possibly partial) analysis of the text according to a grammar with a lower level of abstraction than the original text. The uni-directional and endogenous transformations are described within the framework of parsing as deduction.
    Natural Language Generation from Class Diagrams. Using the framework of Model-Driven Architecture, we generate natural language from class diagrams. The transformation is done in two steps. In the first step we transform the class diagram, defined by Executable and Translatable UML, to grammars specified by the Grammatical Framework. The grammars are then used to generate the desired text. Overall, the transformation is uni-directional, automatic, and an example of a reverse engineering translation.
    Executable and Translatable UML - How Difficult Can it Be? Within Model-Driven Architecture there has been substantial research on the transformation from Platform-Independent Models (PIM) into Platform-Specific Models, less so on the transformation from Computationally Independent Models (CIM) into PIMs. This publication reflects on the outcomes of letting novice software developers transform CIMs specified by UML into PIMs defined in Executable and Translatable UML.
    Conclusion. The three publications show how model transformations can be used within both Language Technology and Software Engineering to tackle the challenges of natural language processing and software development.
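    As a small illustration of the "text to chart" transformation mentioned for the first publication, the sketch below fills a CKY-style chart for a toy context-free grammar in Chomsky normal form. LCFRS parsing generalizes this idea to discontinuous constituents and is considerably more involved; the grammar, rule names, and sentence here are invented for the example.

# Minimal CKY chart construction for a toy context-free grammar in CNF.
# Only meant to show how parsing-as-deduction fills a chart from a text.

from itertools import product

RULES = {                          # right-hand side -> possible left-hand sides
    ("the",):     {"Det"},
    ("cat",):     {"N"},
    ("sleeps",):  {"V", "VP"},
    ("Det", "N"): {"NP"},
    ("NP", "VP"): {"S"},
}

def cky_chart(words):
    """Fill chart[i][j] with all nonterminals deriving words[i..j]."""
    n = len(words)
    chart = [[set() for _ in range(n)] for _ in range(n)]
    for i, w in enumerate(words):
        chart[i][i] = set(RULES.get((w,), set()))
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span - 1
            for k in range(i, j):                       # try every split point
                for b, c in product(chart[i][k], chart[k + 1][j]):
                    chart[i][j] |= RULES.get((b, c), set())
    return chart

chart = cky_chart("the cat sleeps".split())
print("S" in chart[0][2])   # True: the whole sentence is derived from S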