24,854 research outputs found
Comparative study of Arabic and french statistical language models
International audienceIn this paper, we propose a comparative study of statistical language models of Arabic and French. The objective of this study is to understand how to better model both Arabic and French. Several experiments using different smoothing techniques have been carried out. For French, trigram models are most appropriate whatever the smoothing technique used. For Arabic, the n-gram models of higher order smoothed with Witten Bell method are more efficient. Tests are achieved with comparable corpora and vocabularies in terms of siz
A Survey of Word Reordering in Statistical Machine Translation: Computational Models and Language Phenomena
Word reordering is one of the most difficult aspects of statistical machine
translation (SMT), and an important factor of its quality and efficiency.
Despite the vast amount of research published to date, the interest of the
community in this problem has not decreased, and no single method appears to be
strongly dominant across language pairs. Instead, the choice of the optimal
approach for a new translation task still seems to be mostly driven by
empirical trials. To orientate the reader in this vast and complex research
area, we present a comprehensive survey of word reordering viewed as a
statistical modeling challenge and as a natural language phenomenon. The survey
describes in detail how word reordering is modeled within different
string-based and tree-based SMT frameworks and as a stand-alone task, including
systematic overviews of the literature in advanced reordering modeling. We then
question why some approaches are more successful than others in different
language pairs. We argue that, besides measuring the amount of reordering, it
is important to understand which kinds of reordering occur in a given language
pair. To this end, we conduct a qualitative analysis of word reordering
phenomena in a diverse sample of language pairs, based on a large collection of
linguistic knowledge. Empirical results in the SMT literature are shown to
support the hypothesis that a few linguistic facts can be very useful to
anticipate the reordering characteristics of a language pair and to select the
SMT framework that best suits them.Comment: 44 pages, to appear in Computational Linguistic
Letter counting: a stem cell for Cryptology, Quantitative Linguistics, and Statistics
Counting letters in written texts is a very ancient practice. It has
accompanied the development of Cryptology, Quantitative Linguistics, and
Statistics. In Cryptology, counting frequencies of the different characters in
an encrypted message is the basis of the so called frequency analysis method.
In Quantitative Linguistics, the proportion of vowels to consonants in
different languages was studied long before authorship attribution. In
Statistics, the alternation vowel-consonants was the only example that Markov
ever gave of his theory of chained events. A short history of letter counting
is presented. The three domains, Cryptology, Quantitative Linguistics, and
Statistics, are then examined, focusing on the interactions with the other two
fields through letter counting. As a conclusion, the eclectism of past
centuries scholars, their background in humanities, and their familiarity with
cryptograms, are identified as contributing factors to the mutual enrichment
process which is described here
Assumptions behind grammatical approaches to code-switching: when the blueprint is a red herring
Many of the so-called ‘grammars’ of code-switching are based on various underlying assumptions, e.g. that informal speech can be adequately or appropriately described in terms of ‘‘grammar’’; that deep, rather than surface, structures are involved in code-switching; that one ‘language’ is the ‘base’ or ‘matrix’; and that constraints derived from existing data are universal and predictive. We question these assumptions on several grounds. First, ‘grammar’ is arguably distinct from the processes driving speech production. Second, the role of grammar is mediated by the variable, poly-idiolectal repertoires of bilingual speakers. Third, in many instances of CS the notion of a ‘base’ system is either irrelevant, or fails to explain the facts. Fourth, sociolinguistic factors frequently override ‘grammatical’ factors, as evidence from the same language pairs in different settings has shown. No principles proposed to date account for all the facts, and it seems unlikely that ‘grammar’, as conventionally conceived, can provide definitive answers. We conclude that rather than seeking universal, predictive grammatical rules, research on CS should focus on the variability of bilingual grammars
A Unified Multilingual Handwriting Recognition System using multigrams sub-lexical units
We address the design of a unified multilingual system for handwriting
recognition. Most of multi- lingual systems rests on specialized models that
are trained on a single language and one of them is selected at test time.
While some recognition systems are based on a unified optical model, dealing
with a unified language model remains a major issue, as traditional language
models are generally trained on corpora composed of large word lexicons per
language. Here, we bring a solution by con- sidering language models based on
sub-lexical units, called multigrams. Dealing with multigrams strongly reduces
the lexicon size and thus decreases the language model complexity. This makes
pos- sible the design of an end-to-end unified multilingual recognition system
where both a single optical model and a single language model are trained on
all the languages. We discuss the impact of the language unification on each
model and show that our system reaches state-of-the-art methods perfor- mance
with a strong reduction of the complexity.Comment: preprin
- …