Noisy-parallel and comparable corpora filtering methodology for the extraction of bi-lingual equivalent data at sentence level
Text alignment and text quality are critical to the accuracy of Machine
Translation (MT) systems, some NLP tools, and any other text processing tasks
requiring bilingual data. This research proposes a language-independent
bi-sentence filtering approach, evaluated in Polish-to-English experiments
(Polish is not a position-sensitive language). This cleaning approach was developed on the
TED Talks corpus and also initially tested on the Wikipedia comparable corpus,
but it can be used for any text domain or language pair. The proposed approach
implements various heuristics for sentence comparison. Some of them leverage
synonyms and semantic and structural analysis of text as additional
information. Minimization of data loss was ensured. An improvement in MT system
score with text processed using the tool is discussed.
Comment: arXiv admin note: text overlap with arXiv:1509.09093,
arXiv:1509.0888
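The abstract does not list its heuristics, so the following is only an illustrative sketch of what one sentence-pair filter could look like. It uses a token-length ratio, which is an assumption of this example and not the paper's method; the names `length_ratio`, `filter_pairs`, and the threshold `min_ratio` are hypothetical.

```python
def length_ratio(src: str, tgt: str) -> float:
    """Ratio of the shorter to the longer sentence, counted in whitespace tokens."""
    a, b = len(src.split()), len(tgt.split())
    if a == 0 or b == 0:
        return 0.0  # an empty side can never be a valid translation pair
    return min(a, b) / max(a, b)


def filter_pairs(pairs, min_ratio=0.5):
    """Keep only pairs whose lengths are roughly comparable (hypothetical heuristic)."""
    return [(s, t) for s, t in pairs if length_ratio(s, t) >= min_ratio]


# Toy Polish-English candidates: the second pair is clearly misaligned.
candidates = [
    ("Ala ma kota", "Alice has a cat"),
    ("Tak", "This sentence is far too long to be a translation of one word"),
]
kept = filter_pairs(candidates)
```

A real filter would presumably combine several such checks, including the synonym-based and semantic comparisons the abstract mentions, so that well-aligned pairs are not lost to a single crude threshold.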
Modeling and replicating statistical topology, and evidence for CMB non-homogeneity
Under the banner of `Big Data', the detection and classification of structure
in extremely large, high-dimensional data sets is one of the central
statistical challenges of our times. Among the most intriguing approaches to
this challenge is `TDA', or `Topological Data Analysis', one of the primary
aims of which is providing non-metric, but topologically informative,
pre-analyses of data sets which make later, more quantitative analyses
feasible. While TDA rests on strong mathematical foundations from Topology, in
applications it has faced challenges due to an inability to handle issues of
statistical reliability and robustness and, most importantly, an inability
to make scientific claims with verifiable levels of statistical confidence. We
propose a methodology for the parametric representation, estimation, and
replication of persistence diagrams, the main diagnostic tool of TDA. The power
of the methodology lies in the fact that even if only one persistence diagram
is available for analysis -- the typical case for big data applications --
replications can be generated to allow for conventional statistical hypothesis
testing. The methodology is conceptually simple and computationally practical,
and provides a broadly effective statistical procedure for persistence diagram
TDA analysis. We demonstrate the basic ideas on a toy example, and the power of
the approach in a novel and revealing analysis of CMB non-homogeneity.
Concentration Bounds for Stochastic Approximations
We obtain non-asymptotic concentration bounds for two kinds of stochastic
approximations. We first consider the deviations between the expectation of a
given function of the Euler scheme of some diffusion process at a fixed
deterministic time and its empirical mean obtained by the Monte-Carlo
procedure. We then give some estimates concerning the deviation between the
value at a given time-step of a stochastic approximation algorithm and its
target. Under suitable assumptions both concentration bounds turn out to be
Gaussian. The key tool is to exploit accurately the concentration
properties of the increments of the schemes. For the first case, as opposed to
the previous work of Lemaire and Menozzi (EJP, 2010), we do not have any
systematic bias in our estimates. Also, no specific non-degeneracy conditions
are assumed.
Comment: 14 page
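As a minimal sketch of the first setting, the code below simulates an Euler scheme and compares the Monte Carlo empirical mean of a function of its terminal value with the expectation of that same scheme. The Ornstein-Uhlenbeck dynamics, the parameter values, and all function names are assumptions chosen for illustration because the Euler expectation is then known in closed form; they are not taken from the paper.

```python
import math
import random


def euler_terminal_value(x0, drift, diffusion, horizon, n_steps, rng):
    """Run one Euler path of dX = drift(X) dt + diffusion(X) dW and return X at the horizon."""
    dt = horizon / n_steps
    x = x0
    for _ in range(n_steps):
        dw = rng.gauss(0.0, math.sqrt(dt))  # Brownian increment over one step
        x = x + drift(x) * dt + diffusion(x) * dw
    return x


def monte_carlo_mean(f, n_paths, x0, drift, diffusion, horizon, n_steps, rng):
    """Empirical mean of f over independent Euler paths."""
    total = 0.0
    for _ in range(n_paths):
        total += f(euler_terminal_value(x0, drift, diffusion, horizon, n_steps, rng))
    return total / n_paths


# Assumed test case: dX = -X dt + 0.5 dW, X_0 = 1, f(x) = x.
rng = random.Random(0)
n_steps, horizon = 50, 1.0
estimate = monte_carlo_mean(
    f=lambda x: x,
    n_paths=20000,
    x0=1.0,
    drift=lambda x: -x,
    diffusion=lambda x: 0.5,
    horizon=horizon,
    n_steps=n_steps,
    rng=rng,
)

# For this linear drift the Euler scheme satisfies E[X_{k+1}] = (1 - dt) E[X_k].
exact_euler_mean = (1.0 - horizon / n_steps) ** n_steps
deviation = abs(estimate - exact_euler_mean)
```

The quantity `deviation` is the kind of gap the concentration bounds control: it measures the Monte Carlo error against the expectation of the scheme itself, separately from the discretization error of the scheme relative to the diffusion.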
Cryptanalysis of McEliece Cryptosystem Based on Algebraic Geometry Codes and their subcodes
We give polynomial time attacks on the McEliece public key cryptosystem based
either on algebraic geometry (AG) codes or on small codimensional subcodes of
AG codes. These attacks consist in the blind reconstruction of either an Error
Correcting Pair (ECP) or an Error Correcting Array (ECA) from the sole knowledge
of an arbitrary generator matrix of a code. An ECP provides a decoding
algorithm that corrects up to $\frac{d^*-1-g}{2}$ errors, where $d^*$ denotes
the designed distance and $g$ denotes the genus of the corresponding curve,
while with an ECA the decoding algorithm corrects up to $\frac{d^*-1}{2}$
errors. Roughly speaking, for a public code of length $n$ over $\mathbb{F}_q$,
these attacks run in $O(n^4 \log(n))$ operations in $\mathbb{F}_q$ for the
reconstruction of an ECP and $O(n^5)$ operations for the reconstruction of an
ECA. A probabilistic shortcut reduces these complexities to
$O(n^{3+\varepsilon} \log(n))$ and $O(n^{4+\varepsilon})$, respectively. Compared to the
previously known attack due to Faure and Minder, our attack is efficient on codes
from curves of arbitrary genus. Furthermore, we investigate how far these
methods apply to subcodes of AG codes.
Comment: A part of the material of this article has been published at the
conferences ISIT 2014 with title "A polynomial time attack against AG code
based PKC" and 4ICMCTA with title "Crypt. of PKC that use subcodes of AG
codes". This long version includes detailed proofs and new results: the
proceedings articles only considered the reconstruction of ECP while we
discuss here the reconstruction of EC
A symmetry-adapted numerical scheme for SDEs
We propose a geometric numerical analysis of SDEs admitting Lie symmetries
which allows us to identify a symmetry-adapted coordinate system in which the
given SDE has notable invariance properties. An approximation scheme preserving
the symmetry properties of the equation is introduced. Our algorithmic
procedure is applied to the family of general linear SDEs for which two
theoretical estimates of the numerical forward error are established.
Comment: A numerical example adde