Querying out-of-vocabulary words in lexicon-based keyword spotting
The final publication is available at Springer via http://dx.doi.org/10.1007/s00521-016-2197-8
[EN] Lexicon-based handwritten text keyword spotting (KWS) has proven to be a faster and more accurate alternative to lexicon-free methods. Nevertheless, since lexicon-based KWS relies on a predefined vocabulary, fixed in the training phase, it does not support queries involving out-of-vocabulary (OOV) keywords. In this paper, we outline previous work aimed at solving this problem and present a new approach based on smoothing the (null) scores of OOV keywords by means of the information provided by "similar" in-vocabulary words. The good results achieved using this approach are compared with previously published alternatives on different data sets.
This work was partially supported by the Spanish MEC under FPU Grant FPU13/06281, by the Generalitat Valenciana under the Prometeo/2009/014 Project Grant ALMA-MATER, and through the EU Projects: HIMANIS (JPICH programme, Spanish grant Ref. PCIN-2015-068) and READ (Horizon-2020 programme, grant Ref. 674943).
Puigcerver, J.; Toselli, AH.; Vidal, E. (2016). Querying out-of-vocabulary words in lexicon-based keyword spotting. Neural Computing and Applications. 1-10. https://doi.org/10.1007/s00521-016-2197-8
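The smoothing idea in the abstract above (an OOV keyword, whose lexicon-based score would otherwise be null, borrows evidence from "similar" in-vocabulary words) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: similarity is modelled as exp(-alpha * edit distance), the smoothed score is the best similarity-weighted in-vocabulary score, and the names `smoothed_score`, `line_scores` and `alpha` are hypothetical.

```python
import math


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def smoothed_score(query: str, line_scores: dict, alpha: float = 1.0) -> float:
    """Relevance score of `query` for one text line.

    `line_scores` maps each in-vocabulary word to its KWS score on that line.
    In-vocabulary queries keep their score. For an OOV query, the (assumed)
    smoothing takes the best in-vocabulary score, discounted by
    exp(-alpha * edit distance) so that closer words contribute more.
    """
    if query in line_scores:
        return line_scores[query]
    return max(
        (math.exp(-alpha * levenshtein(query, word)) * score
         for word, score in line_scores.items()),
        default=0.0,  # empty lexicon: nothing to borrow from
    )
```

With this choice, an OOV query that differs by one character from a confidently spotted word receives a non-zero score, while unrelated words contribute almost nothing. Other similarity measures or combination rules (a sum instead of a maximum, for example) would fit the same description equally well.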
Query by String word spotting based on character bi-gram indexing
In this paper we propose a segmentation-free query by string word spotting
method. Both the documents and query strings are encoded using a recently
proposed word representation that projects images and strings into a common
attribute space based on a pyramidal histogram of characters (PHOC). These
attribute models are learned using linear SVMs over the Fisher Vector
representation of the images along with the PHOC labels of the corresponding
strings. In order to search through the whole page, document regions are
indexed per character bi-gram using a similar attribute representation. On top
of that, we propose an integral image representation of the document using a
simplified version of the attribute model for efficient computation. Finally, we
introduce a re-ranking step in order to boost retrieval performance. We show
state-of-the-art results for segmentation-free query by string word spotting in
single-writer and multi-writer standard datasets.
Comment: To be published in ICDAR201
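To make the string side of the PHOC representation mentioned above concrete, here is a minimal sketch of encoding a query string as a binary pyramidal histogram of characters. The pyramid levels, the alphabet, and the rule that a character belongs to a region when at least half of its span falls inside it are assumptions chosen for illustration; the abstract does not specify them, and the function name `phoc` is hypothetical.

```python
import string

# Assumed alphabet: lowercase letters plus digits.
ALPHABET = string.ascii_lowercase + string.digits


def phoc(word, levels=(1, 2, 3), alphabet=ALPHABET):
    """Binary PHOC vector of `word`.

    At pyramid level L the word is split into L equal regions. A character
    is marked in a region when at least 50% of the character's own span
    overlaps that region. Spans are measured in integer units of
    1 / (len(word) * L) so the 50% test is exact.
    """
    word = word.lower()
    n = len(word)
    vector = []
    for level in levels:
        for region in range(level):
            r_start, r_end = region * n, (region + 1) * n
            bits = [0] * len(alphabet)
            for k, ch in enumerate(word):
                if ch not in alphabet:
                    continue  # characters outside the alphabet are ignored
                c_start, c_end = k * level, (k + 1) * level
                overlap = min(c_end, r_end) - max(c_start, r_start)
                if 2 * overlap >= level:  # character width is `level` units
                    bits[alphabet.index(ch)] = 1
            vector.extend(bits)
    return vector
```

A query string encoded this way can then be compared, for example by dot product or cosine similarity, against attribute scores predicted from image regions, which is what places strings and images in a common space.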
Cross-document word matching for segmentation and retrieval of Ottoman divans
Cataloged from PDF version of article.
Motivated by the need for automatic indexing and analysis of the huge number
of documents in Ottoman divan poetry, and for discovering new knowledge
to preserve this heritage and keep it alive, in this study we
propose a novel method for segmenting and retrieving
words in Ottoman divans. Documents in Ottoman are difficult
to segment into words without prior knowledge of
the word. In this study, using the idea that divans have
multiple copies (versions) by different writers in different
writing styles, and word segmentation in some of those
versions may be relatively easier to achieve than in other
versions, segmentation of the harder versions (which is difficult,
if not impossible, with traditional techniques) is performed
using information carried over from the simpler version. One
version of a document is used as the source dataset and the
other version of the same document is used as the target
dataset. Words in the source dataset are automatically
extracted and used as queries to be spotted in the target
dataset for detecting word boundaries. We present the idea
of cross-document word matching for a novel task of
segmenting historical documents into words. We propose a
matching scheme based on possible combinations of
sequences of sub-words. We improve the performance of
simple features by considering the words in context.
The method is applied to two versions of the Layla and
Majnun divan by Fuzuli. The results show that the proposed
word-matching-based segmentation method is
promising for finding word boundaries and for retrieving
words across documents.
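The matching scheme described above, in which a query word from the source version is spotted among combinations of consecutive sub-words in the target version, might be sketched as follows. This is only an illustration under stated assumptions: each sub-word is represented by a hypothetical 1-D feature profile (a list of numbers), a candidate word is the concatenation of up to `max_subwords` consecutive profiles, and candidates are compared with dynamic time warping. None of these choices is confirmed by the abstract, and the names `dtw` and `spot_word` are hypothetical.

```python
def dtw(a, b):
    """Dynamic time warping cost between two numeric sequences."""
    inf = float("inf")
    prev = [0.0] + [inf] * len(b)
    for x in a:
        curr = [inf]
        for j, y in enumerate(b, start=1):
            step = min(prev[j], curr[j - 1], prev[j - 1])
            curr.append(abs(x - y) + step)
        prev = curr
    return prev[-1]


def spot_word(query, subwords, max_subwords=3):
    """Find the run of consecutive sub-words that best matches `query`.

    `query` is the feature profile of a word from the source version and
    `subwords` is the ordered list of sub-word profiles on a target line.
    Returns (cost, start, end), where subwords[start:end] is the best run;
    `start` and `end` then act as the proposed word boundaries.
    """
    best = (float("inf"), 0, 0)
    for start in range(len(subwords)):
        last = min(start + max_subwords, len(subwords))
        for end in range(start + 1, last + 1):
            # Concatenate the profiles of the sub-words in this run.
            candidate = [v for sw in subwords[start:end] for v in sw]
            cost = dtw(query, candidate)
            if cost < best[0]:
                best = (cost, start, end)
    return best
```

For example, `spot_word([5, 6, 7], [[1, 1], [5, 6], [7], [2, 2]])` selects the run made of the second and third sub-words, so the word boundaries fall before the second and after the third sub-word.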
- …