14 research outputs found

    French-English Terminology Extraction from Comparable Corpora

    Mining Parenthetical Translations for Polish-English Lexica

    Chinese Ancient-Modern Sentence Alignment

    Optimal designs for two-color microarray experiments in multi-factorial models

    Two-color microarray experiments are an important tool in gene expression analysis. They are often used to identify candidate genes that may be responsible for the genesis of a certain disease. Because microarray experiments are costly, it is essential to design them carefully and, in particular, to specify which samples should be allocated to the same microarray. Two samples are hybridized together on one array, and the assignment of samples to arrays influences the precision of the results. Design issues for microarray experiments have therefore been investigated intensively in recent years. However, only a few authors (e.g. Stanzel (2007)) have focused on more than one factor of interest. We extend Stanzel's work and derive approximate optimal designs for estimating interactions in multi-factorial settings. Optimality of candidate designs is shown using equivalence theorems (Pukelsheim (1993)). Another practically important but less studied topic is the derivation of exact optimal designs. Most research considers approximate designs, or exact designs for special contrast sets and selected numbers of arrays. We therefore focus on exact designs and present a method to construct A-optimal microarray designs for arbitrary numbers of arrays and arbitrary contrast sets. This method is applied to derive optimal designs for estimating treatment-control comparisons, all-to-next contrasts, Helmert contrasts and all pairwise comparisons. Furthermore, we derive robust designs, which achieve efficient results even if observations are missing. Missing values are a crucial topic in the context of microarray experiments, since they often occur due to scratches on the slide or other damage. In applications, recommendations for the choice of efficient experimental layouts can be derived from our constructed designs.

    A two-level structure for compressing aligned bitexts

    A bitext, or bilingual parallel corpus, consists of two texts, each in a different language, that are mutual translations. Bitexts are very useful in linguistic engineering because they serve as a source of knowledge for different purposes. In this paper we propose a strategy to efficiently compress and use bitexts, saving not only space but also processing time when exploiting them. Our strategy is based on a two-level structure for the vocabularies, and on the use of biwords (pairs of associated words, one from each language) as the basic symbols to be encoded with an ETDC compressor. The resulting compressed bitext needs around 20% of the original space and allows more efficient implementations of the different types of searches and operations that linguistic engineers need to perform on it. In this paper we discuss and provide results for compression, decompression, different types of searches, and bilingual snippet extraction.
    Funding: Spanish projects TIN2006-15071-C03-01, TIN2006-15071-C03-02 and TIN2006-15071-C03-03; Regional Government of Castilla y León and the European Social Fund.
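
As a rough illustration of encoding biwords with ETDC (End-Tagged Dense Code), the sketch below ranks symbols by frequency and assigns each rank a byte codeword whose last byte has its high bit set. It does not reproduce the two-level vocabulary structure from the abstract; the sample biwords and the names `etdc_codeword` and `build_codebook` are assumptions made for this example.

```python
from collections import Counter


def etdc_codeword(rank: int) -> bytes:
    """Return the ETDC codeword for a frequency rank (0 = most frequent).

    The last byte lies in 128..255 (high bit set, marking the end of the
    codeword); all preceding bytes lie in 0..127.
    """
    out = [128 + rank % 128]      # final byte carries the end tag
    rank //= 128
    while rank > 0:
        rank -= 1                 # dense code: no codeword is wasted
        out.append(rank % 128)
        rank //= 128
    return bytes(reversed(out))


def build_codebook(symbols):
    """Map each distinct symbol to a codeword, most frequent first."""
    counts = Counter(symbols)
    ranked = sorted(counts, key=lambda s: (-counts[s], s))
    return {sym: etdc_codeword(i) for i, sym in enumerate(ranked)}


# Hypothetical biwords: (source word, target word) pairs from an aligned bitext.
biwords = [("la", "the"), ("casa", "house"), ("la", "the"), ("verde", "green")]
codebook = build_codebook(biwords)
encoded = b"".join(codebook[b] for b in biwords)
```

Because every codeword ends in a tagged byte, codeword boundaries can be found in the compressed stream without decoding from the start, which is what makes searching the compressed text practical.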

    Adaptive Bilingual Sentence Alignment

    N-gram similarity and distance

    In many applications, it is necessary to algorithmically quantify the similarity exhibited by two strings composed of symbols from a finite alphabet. Numerous string similarity measures have been proposed. Particularly well-known measures are based on edit distance and the length of the longest common subsequence. We develop a notion of n-gram similarity and distance. We show that edit distance and the length of the longest common subsequence are special cases of n-gram distance and similarity, respectively. We provide formal, recursive definitions of n-gram similarity and distance, together with efficient algorithms for computing them. We formulate a family of word similarity measures based on n-grams, and report the results of experiments that suggest that the new measures outperform their unigram equivalents.
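
For orientation, the two unigram measures named in the abstract can be computed with standard dynamic programming. The sketch below shows only these textbook baselines, not the paper's n-gram generalization; the function names are chosen for this example.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))          # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        curr = [i]                          # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, 1):
            if ca == cb:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]
```

Both functions keep only two rows of the table, so memory use is linear in the length of the second string.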

    Alignment of Paragraphs in Bilingual Texts Using Bilingual Dictionaries and Dynamic Programming

    Parallel text alignment is a special type of pattern recognition task that aims to discover the similarity between two sequences of symbols. Given the same text in two different languages, the task is to decide which elements (paragraphs, in the case of paragraph alignment) in one text are translations of which elements of the other text. One of the applications is training statistical machine translation algorithms. The task is not trivial unless detailed text understanding can be afforded. In our previous work we presented a simple technique that relies on bilingual dictionaries but does not perform any syntactic analysis of the texts. In this paper we give a formal definition of the task and present an exact optimization algorithm for finding the best alignment.
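
To make the idea of dictionary-based alignment by dynamic programming concrete, here is a minimal sketch. It is not the authors' algorithm: the scoring function, the restriction to one-to-one matches and skips, the `skip_penalty` value and the toy dictionary are all assumptions made for this example.

```python
def similarity(src_par, tgt_par, dictionary):
    """Fraction of source words with a dictionary translation in the target."""
    src = src_par.lower().split()
    tgt = set(tgt_par.lower().split())
    if not src:
        return 0.0
    hits = sum(1 for w in src if dictionary.get(w, set()) & tgt)
    return hits / len(src)


def align(src_pars, tgt_pars, dictionary, skip_penalty=0.2):
    """Return the best-scoring list of (source index, target index) pairs."""
    n, m = len(src_pars), len(tgt_pars)
    neg = float("-inf")
    score = [[neg] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    score[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            moves = []
            if i > 0 and j > 0:   # pair source i-1 with target j-1
                sim = similarity(src_pars[i - 1], tgt_pars[j - 1], dictionary)
                moves.append((score[i - 1][j - 1] + sim, "match"))
            if i > 0:             # leave source paragraph unaligned
                moves.append((score[i - 1][j] - skip_penalty, "skip_src"))
            if j > 0:             # leave target paragraph unaligned
                moves.append((score[i][j - 1] - skip_penalty, "skip_tgt"))
            if moves:
                score[i][j], back[i][j] = max(moves)
    # Trace back from the final cell to recover the matched pairs.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        move = back[i][j]
        if move == "match":
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif move == "skip_src":
            i -= 1
        else:
            j -= 1
    return pairs[::-1]


# Toy French-English dictionary and paragraphs (hypothetical data).
dictionary = {"maison": {"house"}, "rouge": {"red"}, "chat": {"cat"}}
src = ["la maison rouge", "le chat"]
tgt = ["the red house", "an unrelated paragraph", "the cat"]
alignment = align(src, tgt, dictionary)
```

With this toy input, the unrelated target paragraph is skipped and the two source paragraphs are paired with the first and third target paragraphs.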