5,550 research outputs found

Noisy-parallel and comparable corpora filtering methodology for the extraction of bi-lingual equivalent data at sentence level

Author: Wołk Krzysztof
Publication venue: 'AGHU University of Science and Technology Press'
Publication date: 01/01/2015

Text alignment and text quality are critical to the accuracy of Machine Translation (MT) systems, some NLP tools, and any other text processing tasks requiring bilingual data. This research proposes a language-independent bi-sentence filtering approach based on Polish-to-English experiments (Polish is not a position-sensitive language). This cleaning approach was developed on the TED Talks corpus and also initially tested on the Wikipedia comparable corpus, but it can be used for any text domain or language pair. The proposed approach implements various heuristics for sentence comparison. Some of them leverage synonyms and semantic and structural analysis of text as additional information. Minimization of data loss was ensured. An improvement in MT system score with text processed using the tool is discussed. Comment: arXiv admin note: text overlap with arXiv:1509.09093, arXiv:1509.0888

arXiv.org e-Print Archive

AGH (Akademia Górniczo-Hutnicza) University of Science and Technology: Journals

Computer Science Journal (AGH University of Science and Technology, Krakow)

Biblioteka Nauki - repozytorium artykułów

Crossref

Modeling and replicating statistical topology, and evidence for CMB non-homogeneity

Author: Adler Robert J.
Agami Sarit
Pranav Pratyush
Publication date: 26/04/2017

Under the banner of 'Big Data', the detection and classification of structure in extremely large, high dimensional data sets is one of the central statistical challenges of our times. Among the most intriguing approaches to this challenge is 'TDA', or 'Topological Data Analysis', one of the primary aims of which is providing non-metric, but topologically informative, pre-analyses of data sets which make later, more quantitative analyses feasible. While TDA rests on strong mathematical foundations from Topology, in applications it has faced challenges due to an inability to handle issues of statistical reliability and robustness and, most importantly, an inability to make scientific claims with verifiable levels of statistical confidence. We propose a methodology for the parametric representation, estimation, and replication of persistence diagrams, the main diagnostic tool of TDA. The power of the methodology lies in the fact that even if only one persistence diagram is available for analysis (the typical case for big data applications), replications can be generated to allow for conventional statistical hypothesis testing. The methodology is conceptually simple and computationally practical, and provides a broadly effective statistical procedure for persistence diagram TDA analysis. We demonstrate the basic ideas on a toy example, and the power of the approach in a novel and revealing analysis of CMB non-homogeneity

arXiv.org e-Print Archive

Concentration Bounds for Stochastic Approximations

Author: Frikha Noufel
Menozzi Stephane
Publication date: 01/01/2012

We obtain non-asymptotic concentration bounds for two kinds of stochastic approximations. We first consider the deviations between the expectation of a given function of the Euler scheme of some diffusion process at a fixed deterministic time and its empirical mean obtained by the Monte Carlo procedure. We then give some estimates concerning the deviation between the value at a given time-step of a stochastic approximation algorithm and its target. Under suitable assumptions both concentration bounds turn out to be Gaussian. The key tool consists in accurately exploiting the concentration properties of the increments of the schemes. For the first case, as opposed to the previous work of Lemaire and Menozzi (EJP, 2010), we do not have any systematic bias in our estimates. Also, no specific non-degeneracy conditions are assumed. Comment: 14 page
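
As a rough illustration of the first quantity in this abstract, the sketch below simulates an Euler scheme for a diffusion and compares the Monte Carlo empirical mean of a function of its terminal value with the exact expectation of the scheme. It is not taken from the paper: the Ornstein-Uhlenbeck coefficients, the test function, the step and path counts, and the helper names `euler_terminal_value` and `monte_carlo_mean` are all assumptions chosen so that the scheme's expectation is known in closed form.

```python
import math
import random


def euler_terminal_value(x0, drift, diffusion, horizon, n_steps, rng):
    """Simulate one Euler path on [0, horizon] and return its final value."""
    dt = horizon / n_steps
    x = x0
    for _ in range(n_steps):
        # Brownian increment over one step: Gaussian with variance dt.
        dw = rng.gauss(0.0, math.sqrt(dt))
        x = x + drift(x) * dt + diffusion(x) * dw
    return x


def monte_carlo_mean(f, x0, drift, diffusion, horizon, n_steps, n_paths, seed=0):
    """Empirical mean of f(X_T) over independent Euler paths."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_paths):
        total += f(euler_terminal_value(x0, drift, diffusion, horizon, n_steps, rng))
    return total / n_paths


# Hypothetical example: dX = -X dt + 0.5 dW with X_0 = 1 and f(x) = x.
n_steps = 50
estimate = monte_carlo_mean(
    f=lambda x: x,
    x0=1.0,
    drift=lambda x: -x,
    diffusion=lambda x: 0.5,
    horizon=1.0,
    n_steps=n_steps,
    n_paths=5000,
    seed=42,
)

# For this linear drift the Euler scheme satisfies E[X_{k+1}] = (1 - dt) E[X_k],
# so the expectation of the scheme itself (not of the true diffusion) is:
euler_expectation = (1.0 - 1.0 / n_steps) ** n_steps

deviation = abs(estimate - euler_expectation)
print(f"empirical mean = {estimate:.4f}, scheme expectation = {euler_expectation:.4f}")
print(f"deviation = {deviation:.4f}")
```

The comparison is made against the expectation of the Euler scheme rather than of the continuous-time diffusion, which is the deviation the abstract describes; the discretization error of the scheme is a separate matter and is not measured here.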

arXiv.org e-Print Archive

HAL Evry

CiteSeerX

Hal-Diderot

Cryptanalysis of McEliece Cryptosystem Based on Algebraic Geometry Codes and their subcodes

Author: Couvreur Alain
Márquez-Corbella Irene
Pellikaan Ruud
Publication date: 01/03/2016

We give polynomial time attacks on the McEliece public key cryptosystem based either on algebraic geometry (AG) codes or on small codimensional subcodes of AG codes. These attacks consist in the blind reconstruction either of an Error Correcting Pair (ECP), or an Error Correcting Array (ECA) from the single data of an arbitrary generator matrix of a code. An ECP provides a decoding algorithm that corrects up to $\frac{d^*-1-g}{2}$ errors, where $d^*$ denotes the designed distance and $g$ denotes the genus of the corresponding curve, while with an ECA the decoding algorithm corrects up to $\frac{d^*-1}{2}$ errors. Roughly speaking, for a public code of length $n$ over $\mathbb F_q$, these attacks run in $O(n^4\log (n))$ operations in $\mathbb F_q$ for the reconstruction of an ECP and $O(n^5)$ operations for the reconstruction of an ECA. A probabilistic shortcut allows to reduce the complexities respectively to $O(n^{3+\varepsilon} \log (n))$ and $O(n^{4+\varepsilon})$. Compared to the previous known attack due to Faure and Minder, our attack is efficient on codes from curves of arbitrary genus. Furthermore, we investigate how far these methods apply to subcodes of AG codes. Comment: A part of the material of this article has been published at the conferences ISIT 2014 with title "A polynomial time attack against AG code based PKC" and 4ICMCTA with title "Crypt. of PKC that use subcodes of AG codes". This long version includes detailed proofs and new results: the proceedings articles only considered the reconstruction of ECP while we discuss here the reconstruction of EC
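
To make the two decoding radii quoted in this abstract concrete, here is a small arithmetic sketch. The function names and the sample values (designed distance 21, genus 2) are hypothetical, and rounding down to a whole number of errors is an assumption, since the abstract only gives the fractions.

```python
def ecp_decoding_radius(designed_distance: int, genus: int) -> int:
    """Errors correctable with an Error Correcting Pair: (d* - 1 - g) / 2, rounded down."""
    return (designed_distance - 1 - genus) // 2


def eca_decoding_radius(designed_distance: int) -> int:
    """Errors correctable with an Error Correcting Array: (d* - 1) / 2, rounded down."""
    return (designed_distance - 1) // 2


# Hypothetical parameters: designed distance d* = 21 on a curve of genus g = 2.
ecp_errors = ecp_decoding_radius(21, 2)
eca_errors = eca_decoding_radius(21)
print(ecp_errors, eca_errors)  # prints "9 10"
```

With these sample values the ECA radius exceeds the ECP radius by g/2 = 1 error, which is the gap between the two formulas.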

arXiv.org e-Print Archive

INRIA a CCSD electronic archive server

HAL-Polytechnique

A symmetry-adapted numerical scheme for SDEs

Author: De Vecchi Francesco C.
Romano Andrea
Ugolini Stefania
Publication venue: 'American Institute of Mathematical Sciences (AIMS)'
Publication date: 31/07/2019

We propose a geometric numerical analysis of SDEs admitting Lie symmetries, which allows us to identify a symmetry-adapted coordinate system in which the given SDE has notable invariance properties. An approximation scheme preserving the symmetry properties of the equation is introduced. Our algorithmic procedure is applied to the family of general linear SDEs, for which two theoretical estimates of the numerical forward error are established. Comment: A numerical example adde

arXiv.org e-Print Archive

AIR Universita degli studi di Milano