1,213 research outputs found

Beyond English text: Multilingual and multimedia information retrieval.

Author: Jones Gareth J.F.
Publication venue: 'Springer Fachmedien Wiesbaden GmbH'
Publication date: 01/01/2005

CiteSeerX

DCU Online Research Access Service

Implicit Language Model in LSTM for OCR

Author: Natarajan Prem
Rawls Stephen
Sabir Ekraam
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 23/05/2018

Neural networks have become the technique of choice for OCR, but many aspects of how and why they deliver superior performance are still unknown. One key difference between current neural network techniques using LSTMs and the previous state-of-the-art HMM systems is that HMM systems have a strong independence assumption. In comparison, LSTMs have no explicit constraints on the amount of context that can be considered during decoding. In this paper we show that they learn an implicit LM and attempt to characterize the strength of the LM in terms of equivalent n-gram context. We show that this implicitly learned language model provides a 2.4% CER improvement on our synthetic test set when compared against a test set of random characters (i.e., not naturally occurring sequences), and that the LSTM learns to use up to 5 characters of context (which is roughly 88 frames in our configuration). We believe that this is the first-ever attempt at characterizing the strength of the implicit LM in LSTM-based OCR systems.

arXiv.org e-Print Archive

Crossref

Russian Lexicographic Landscape: a Tale of 12 Dictionaries

Author: Braslavski P.
Kiselev Yu.
Krizhanovskaya N.
Krizhanovsky A.
Menshikov I.
Mukhin M.
Publication venue: Rossiiskii Gosudarstvennyi Gumanitarnyi Universitet
Publication date: 01/01/2015

The paper reports on a quantitative analysis of 12 Russian dictionaries at three levels: 1) headwords: the size and overlap of word lists, coverage of large corpora, and presence of neologisms; 2) synonyms: overlap of synsets in different dictionaries; 3) definitions: distribution of definition lengths and numbers of senses, as well as textual similarity of same-headword definitions in different dictionaries. The total amount of data in the study is 805,900 dictionary entries, 892,900 definitions, and 84,500 synsets. The study reveals multiple connections and mutual influences between dictionaries, uncovers differences between modern electronic and traditional printed resources, and suggests directions for developing new lexical semantic resources and improving existing ones.

Institutional repository of Ural Federal University named after the first President of Russia B.N.Yeltsin

Machine Reading the Primeros Libros

Author: Hannah Alpert-Abrams
Publication venue: 'Modern Language Association'
Publication date: 01/01/2016

Early modern printed books pose particular challenges for automatic transcription: uneven inking, irregular orthographies, radically multilingual texts. As a result, modern efforts to transcribe these documents tend to produce the textual gibberish commonly known as "dirty OCR" (Optical Character Recognition). This noisy output is most frequently seen as a barrier to access for scholars interested in the computational analysis or digital display of transcribed documents. This article, however, proposes that a closer analysis of dirty OCR can reveal both historical and cultural factors at play in the practice of automatic transcription. To make this argument, it focuses on tools developed for the automatic transcription of the Primeros Libros collection of sixteenth-century Mexican printed books. By bringing together the history of the collection with that of the OCR tool, it illustrates how the colonial history of these documents is embedded in, and transformed by, the statistical models used for automatic transcription. It argues that automatic transcription, itself a mechanical and practical tool, also has an interpretive effect on transcribed texts that can have practical consequences for scholarly work.

Humanities Commons