5 research outputs found

    Zero-Shot Event Detection by Multimodal Distributional Semantic Embedding of Videos

    Full text link
    We propose a new zero-shot event detection method based on multi-modal distributional semantic embedding of videos. Our model embeds object and action concepts, as well as other available modalities, from videos into a distributional semantic space. To our knowledge, this is the first zero-shot event detection model built on top of distributional semantics, and it extends that framework in the following directions: (a) semantic embedding of multimodal information in videos (with a focus on the visual modalities); (b) automatic determination of the relevance of concepts/attributes to a free-text query, which could be useful for other applications; and (c) retrieval of videos by a free-text event query (e.g., "changing a vehicle tire") based on their content. We embed videos into a distributional semantic space and then measure the similarity between the videos and the event query in free-text form. We validated our method on the large TRECVID MED (Multimedia Event Detection) challenge. Using only the event title as a query, our method outperformed the state of the art, which uses full event descriptions, improving MAP from 12.6% to 13.5% and ROC-AUC from 0.73 to 0.83. It is also an order of magnitude faster.
    Comment: To appear in AAAI 201
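    The retrieval recipe above can be sketched compactly: embed the free-text query and each video's detected concepts into the same word-vector space, then rank videos by cosine similarity. The Python below is an illustrative sketch, not the authors' implementation; `word_vecs` is an assumed dictionary of pretrained word embeddings (e.g., word2vec vectors), and the confidence-weighted pooling is one simple way to realize the described embedding.

```python
import numpy as np

def embed_text(text, word_vecs):
    """Mean of the pretrained word vectors of the in-vocabulary words."""
    vecs = [word_vecs[w] for w in text.lower().split() if w in word_vecs]
    return np.mean(vecs, axis=0) if vecs else None

def embed_video(concept_scores, word_vecs):
    """Confidence-weighted average of concept-name embeddings.

    concept_scores maps a detected concept name (e.g., "tire") to the
    confidence of its detector on this video.
    """
    acc, total = None, 0.0
    for concept, score in concept_scores.items():
        v = embed_text(concept, word_vecs)
        if v is None:
            continue
        acc = score * v if acc is None else acc + score * v
        total += score
    return acc / total if total > 0 else None

def rank_videos(query, videos, word_vecs):
    """Rank video ids by cosine similarity to the embedded query."""
    q = embed_text(query, word_vecs)
    q = q / np.linalg.norm(q)
    sims = {}
    for vid, concept_scores in videos.items():
        v = embed_video(concept_scores, word_vecs)
        if v is not None:
            sims[vid] = float(v @ q) / float(np.linalg.norm(v))
    return sorted(sims, key=sims.get, reverse=True)
```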

    A Book Reader Design for Persons with Visual Impairment and Blindness

    Get PDF
    The objective of this dissertation is to provide a new design approach to a fully automated book reader for individuals with visual impairment and blindness that is portable and cost-effective. The approach relies on the geometry of the design setup and provides the mathematical foundation for integrating, in a unique way, a 3-D surface map from a low-resolution time-of-flight (ToF) device with a high-resolution image, as a means to enhance the reading accuracy of images warped by the page curvature of bound books and magazines. The merits of this low-cost but effective automated book reader design include: (1) a seamless registration process for the two imaging modalities, so that the low-resolution (160 x 120 pixels) height map acquired by an Argos3D-P100 camera accurately covers the entire book spread as captured by the high-resolution image (3072 x 2304 pixels) of a Canon G6 camera; (2) a mathematical framework for overcoming the difficulties associated with the curvature of open bound books, a process referred to as dewarping of the book spread images; and (3) a comparison of image correction performance between the uniform and full height maps to determine which map yields the highest Optical Character Recognition (OCR) reading accuracy. The design concept could also be applied to the challenging process of book digitization. The method depends on the geometry of the book reader setup for acquiring a 3-D map that yields high reading accuracy once appropriately fused with the high-resolution image. Experiments were performed on a dataset of 200 pages with their corresponding computed and co-registered height maps, which are made available to the research community (cate-book3dmaps.fiu.edu). Improvements in character reading accuracy due to the correction steps were quantified by feeding the corrected images to an OCR engine and tabulating the number of misrecognized characters. Furthermore, the resilience of the book reader was tested by introducing a rotational misalignment to the book spreads and comparing the OCR accuracy to that obtained with the standard alignment. The standard alignment yielded an average reading accuracy of 95.55% with the uniform height map (i.e., the height values of the central row of the 3-D map are replicated to approximate all other rows) and 96.11% with the full height map (i.e., each row has its own height values as obtained from the 3-D camera). With the rotational misalignments, the average accuracies were 90.63% and 94.75% for the same respective height maps, demonstrating the added resilience of the full-height-map method to potential misalignments.
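    Two details above lend themselves to a short sketch: the "uniform" height map (the central row of the 3-D map replicated to all rows) and the resampling of the low-resolution ToF map onto the high-resolution image grid. The Python below is a minimal illustration under the assumption that the two modalities are already registered; it is not the dissertation's registration pipeline.

```python
import numpy as np

def uniform_height_map(full_map):
    """Replicate the central row of the height map to every row,
    giving the 'uniform' variant described above."""
    center_row = full_map[full_map.shape[0] // 2]
    return np.tile(center_row, (full_map.shape[0], 1))

def upsample_height_map(height_map, out_shape):
    """Bilinearly resample a low-resolution height map (e.g., 120x160
    from a ToF camera) onto a high-resolution image grid (e.g.,
    2304x3072), assuming the two views cover the same book spread."""
    h, w = height_map.shape
    H, W = out_shape
    rows = np.linspace(0.0, h - 1, H)
    cols = np.linspace(0.0, w - 1, W)
    r0 = np.floor(rows).astype(int)
    c0 = np.floor(cols).astype(int)
    r1 = np.minimum(r0 + 1, h - 1)
    c1 = np.minimum(c0 + 1, w - 1)
    fr = (rows - r0)[:, None]  # fractional row offsets, shape (H, 1)
    fc = (cols - c0)[None, :]  # fractional col offsets, shape (1, W)
    top = (1 - fc) * height_map[r0][:, c0] + fc * height_map[r0][:, c1]
    bot = (1 - fc) * height_map[r1][:, c0] + fc * height_map[r1][:, c1]
    return (1 - fr) * top + fr * bot
```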

    ์ •๋ ฌ ํŠน์„ฑ๋“ค ๊ธฐ๋ฐ˜์˜ ๋ฌธ์„œ ๋ฐ ์žฅ๋ฉด ํ…์ŠคํŠธ ์˜์ƒ ํ‰ํ™œํ™” ๊ธฐ๋ฒ•

    Get PDF
    ํ•™์œ„๋…ผ๋ฌธ (๋ฐ•์‚ฌ)-- ์„œ์šธ๋Œ€ํ•™๊ต ๋Œ€ํ•™์› ๊ณต๊ณผ๋Œ€ํ•™ ์ „๊ธฐยท์ปดํ“จํ„ฐ๊ณตํ•™๋ถ€, 2017. 8. ์กฐ๋‚จ์ต.์นด๋ฉ”๋ผ๋กœ ์ดฌ์˜ํ•œ ํ…์ŠคํŠธ ์˜์ƒ์— ๋Œ€ํ•ด์„œ, ๊ด‘ํ•™ ๋ฌธ์ž ์ธ์‹(OCR)์€ ์ดฌ์˜๋œ ์žฅ๋ฉด์„ ๋ถ„์„ํ•˜๋Š”๋ฐ ์žˆ์–ด์„œ ๋งค์šฐ ์ค‘์š”ํ•˜๋‹ค. ํ•˜์ง€๋งŒ ์˜ฌ๋ฐ”๋ฅธ ํ…์ŠคํŠธ ์˜์—ญ ๊ฒ€์ถœ ํ›„์—๋„, ์ดฌ์˜ํ•œ ์˜์ƒ์— ๋Œ€ํ•œ ๋ฌธ์ž ์ธ์‹์€ ์—ฌ์ „ํžˆ ์–ด๋ ค์šด ๋ฌธ์ œ๋กœ ์—ฌ๊ฒจ์ง„๋‹ค. ์ด๋Š” ์ข…์ด์˜ ๊ตฌ๋ถ€๋Ÿฌ์ง๊ณผ ์นด๋ฉ”๋ผ ์‹œ์ ์— ์˜ํ•œ ๊ธฐํ•˜ํ•™์ ์ธ ์™œ๊ณก ๋•Œ๋ฌธ์ด๊ณ , ๋”ฐ๋ผ์„œ ์ด๋Ÿฌํ•œ ํ…์ŠคํŠธ ์˜์ƒ์— ๋Œ€ํ•œ ํ‰ํ™œํ™”๋Š” ๋ฌธ์ž ์ธ์‹์— ์žˆ์–ด์„œ ํ•„์ˆ˜์ ์ธ ์ „์ฒ˜๋ฆฌ ๊ณผ์ •์œผ๋กœ ์—ฌ๊ฒจ์ง„๋‹ค. ์ด๋ฅผ ์œ„ํ•œ ์™œ๊ณก๋œ ์ดฌ์˜ ์˜์ƒ์„ ์ •๋ฉด ์‹œ์ ์œผ๋กœ ๋ณต์›ํ•˜๋Š” ํ…์ŠคํŠธ ์˜์ƒ ํ‰ํ™œํ™” ๋ฐฉ๋ฒ•๋“ค์€ ํ™œ๋ฐœํžˆ ์—ฐ๊ตฌ๋˜์–ด์ง€๊ณ  ์žˆ๋‹ค. ์ตœ๊ทผ์—๋Š”, ํ‰ํ™œํ™”๊ฐ€ ์ž˜ ๋œ ํ…์ŠคํŠธ์˜ ์„ฑ์งˆ์— ์ดˆ์ ์„ ๋งž์ถ˜ ์—ฐ๊ตฌ๋“ค์ด ์ฃผ๋กœ ์ง„ํ–‰๋˜๊ณ  ์žˆ๋‹ค. ์ด๋Ÿฌํ•œ ๊ด€์ ์—์„œ, ๋ณธ ํ•™์œ„ ๋…ผ๋ฌธ์€ ํ…์ŠคํŠธ ์˜์ƒ ํ‰ํ™œํ™”๋ฅผ ์œ„ํ•˜์—ฌ ์ƒˆ๋กœ์šด ์ •๋ ฌ ํŠน์„ฑ๋“ค์„ ๋‹ค๋ฃฌ๋‹ค. ์ด๋Ÿฌํ•œ ์ •๋ ฌ ํŠน์„ฑ๋“ค์€ ๋น„์šฉ ํ•จ์ˆ˜๋กœ ์„ค๊ณ„๋˜์–ด์ง€๊ณ , ๋น„์šฉ ํ•จ์ˆ˜๋ฅผ ์ตœ์†Œํ™”ํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ํ†ตํ•ด์„œ ํ‰ํ™œํ™”์— ์‚ฌ์šฉ๋˜์–ด์ง€๋Š” ํ‰ํ™œํ™” ๋ณ€์ˆ˜๋“ค์ด ๊ตฌํ•ด์ง„๋‹ค. ๋ณธ ํ•™์œ„ ๋…ผ๋ฌธ์€ ๋ฌธ์„œ ์˜์ƒ ํ‰ํ™œํ™”, ์žฅ๋ฉด ํ…์ŠคํŠธ ํ‰ํ™œํ™”, ์ผ๋ฐ˜ ๋ฐฐ๊ฒฝ ์†์˜ ํœ˜์–ด์ง„ ํ‘œ๋ฉด ํ‰ํ™œํ™”์™€ ๊ฐ™์ด 3๊ฐ€์ง€ ์„ธ๋ถ€ ์ฃผ์ œ๋กœ ๋‚˜๋ˆ ์ง„๋‹ค. ์ฒซ ๋ฒˆ์งธ๋กœ, ๋ณธ ํ•™์œ„ ๋…ผ๋ฌธ์€ ํ…์ŠคํŠธ ๋ผ์ธ๋“ค๊ณผ ์„ ๋ถ„๋“ค์˜ ์ •๋ ฌ ํŠน์„ฑ์— ๊ธฐ๋ฐ˜์˜ ๋ฌธ์„œ ์˜์ƒ ํ‰ํ™œํ™” ๋ฐฉ๋ฒ•์„ ์ œ์•ˆํ•œ๋‹ค. ๊ธฐ์กด์˜ ํ…์ŠคํŠธ ๋ผ์ธ ๊ธฐ๋ฐ˜์˜ ๋ฌธ์„œ ์˜์ƒ ํ‰ํ™œํ™” ๋ฐฉ๋ฒ•๋“ค์˜ ๊ฒฝ์šฐ, ๋ฌธ์„œ๊ฐ€ ๋ณต์žกํ•œ ๋ ˆ์ด์•„์›ƒ ํ˜•ํƒœ์ด๊ฑฐ๋‚˜ ์ ์€ ์ˆ˜์˜ ํ…์ŠคํŠธ ๋ผ์ธ์„ ํฌํ•จํ•˜๊ณ  ์žˆ์„ ๋•Œ ๋ฌธ์ œ๊ฐ€ ๋ฐœ์ƒํ•œ๋‹ค. ์ด๋Š” ๋ฌธ์„œ์— ํ…์ŠคํŠธ ๋Œ€์‹  ๊ทธ๋ฆผ, ๊ทธ๋ž˜ํ”„ ํ˜น์€ ํ‘œ์™€ ๊ฐ™์€ ์˜์—ญ์ด ๋งŽ์€ ๊ฒฝ์šฐ์ด๋‹ค. ๋”ฐ๋ผ์„œ ๋ ˆ์ด์•„์›ƒ์— ๊ฐ•์ธํ•œ ๋ฌธ์„œ ์˜์ƒ ํ‰ํ™œํ™”๋ฅผ ์œ„ํ•˜์—ฌ ์ œ์•ˆํ•˜๋Š” ๋ฐฉ๋ฒ•์€ ์ •๋ ฌ๋œ ํ…์ŠคํŠธ ๋ผ์ธ๋ฟ๋งŒ ์•„๋‹ˆ๋ผ ์„ ๋ถ„๋“ค๋„ ์ด์šฉํ•œ๋‹ค. ์˜ฌ๋ฐ”๋ฅด๊ฒŒ ํ‰ํ™œํ™” ๋œ ์„ ๋ถ„๋“ค์€ ์—ฌ์ „ํžˆ ์ผ์ง์„ ์˜ ํ˜•ํƒœ์ด๊ณ , ๋Œ€๋ถ€๋ถ„ ๊ฐ€๋กœ ํ˜น์€ ์„ธ๋กœ ๋ฐฉํ–ฅ์œผ๋กœ ์ •๋ ฌ๋˜์–ด ์žˆ๋‹ค๋Š” ๊ฐ€์ • ๋ฐ ๊ด€์ธก์— ๊ทผ๊ฑฐํ•˜์—ฌ, ์ œ์•ˆํ•˜๋Š” ๋ฐฉ๋ฒ•์€ ์ด๋Ÿฌํ•œ ์„ฑ์งˆ๋“ค์„ ์ˆ˜์‹ํ™”ํ•˜๊ณ  ์ด๋ฅผ ํ…์ŠคํŠธ ๋ผ์ธ ๊ธฐ๋ฐ˜์˜ ๋น„์šฉ ํ•จ์ˆ˜์™€ ๊ฒฐํ•ฉํ•œ๋‹ค. ๊ทธ๋ฆฌ๊ณ  ๋น„์šฉ ํ•จ์ˆ˜๋ฅผ ์ตœ์†Œํ™” ํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ํ†ตํ•ด, ์ œ์•ˆํ•˜๋Š” ๋ฐฉ๋ฒ•์€ ์ข…์ด์˜ ๊ตฌ๋ถ€๋Ÿฌ์ง, ์นด๋ฉ”๋ผ ์‹œ์ , ์ดˆ์  ๊ฑฐ๋ฆฌ์™€ ๊ฐ™์€ ํ‰ํ™œํ™” ๋ณ€์ˆ˜๋“ค์„ ์ถ”์ •ํ•œ๋‹ค. ๋˜ํ•œ, ์˜ค๊ฒ€์ถœ๋œ ํ…์ŠคํŠธ ๋ผ์ธ๋“ค๊ณผ ์ž„์˜์˜ ๋ฐฉํ–ฅ์„ ๊ฐ€์ง€๋Š” ์„ ๋ถ„๋“ค๊ณผ ๊ฐ™์€ ์ด์ƒ์ (outlier)์„ ๊ณ ๋ คํ•˜์—ฌ, ์ œ์•ˆํ•˜๋Š” ๋ฐฉ๋ฒ•์€ ๋ฐ˜๋ณต์ ์ธ ๋‹จ๊ณ„๋กœ ์„ค๊ณ„๋œ๋‹ค. ๊ฐ ๋‹จ๊ณ„์—์„œ, ์ •๋ ฌ ํŠน์„ฑ์„ ๋งŒ์กฑํ•˜์ง€ ์•Š๋Š” ์ด์ƒ์ ๋“ค์€ ์ œ๊ฑฐ๋˜๊ณ , ์ œ๊ฑฐ๋˜์ง€ ์•Š์€ ํ…์ŠคํŠธ ๋ผ์ธ ๋ฐ ์„ ๋ถ„๋“ค๋งŒ์ด ๋น„์šฉํ•จ์ˆ˜ ์ตœ์ ํ™”์— ์ด์šฉ๋œ๋‹ค. ์ˆ˜ํ–‰ํ•œ ์‹คํ—˜ ๊ฒฐ๊ณผ๋“ค์€ ์ œ์•ˆํ•˜๋Š” ๋ฐฉ๋ฒ•์ด ๋‹ค์–‘ํ•œ ๋ ˆ์ด์•„์›ƒ์— ๋Œ€ํ•˜์—ฌ ๊ฐ•์ธํ•จ์„ ๋ณด์—ฌ์ค€๋‹ค. ๋‘ ๋ฒˆ์งธ๋กœ๋Š”, ๋ณธ ๋…ผ๋ฌธ์€ ์žฅ๋ฉด ํ…์ŠคํŠธ ํ‰ํ™œํ™” ๋ฐฉ๋ฒ•์„ ์ œ์•ˆํ•œ๋‹ค. ๊ธฐ์กด ์žฅ๋ฉด ํ…์ŠคํŠธ ํ‰ํ™œํ™” ๋ฐฉ๋ฒ•๋“ค์˜ ๊ฒฝ์šฐ, ๊ฐ€๋กœ/์„ธ๋กœ ๋ฐฉํ–ฅ์˜ ํš, ๋Œ€์นญ ํ˜•ํƒœ์™€ ๊ฐ™์€ ๋ฌธ์ž๊ฐ€ ๊ฐ€์ง€๋Š” ๊ณ ์œ ์˜ ์ƒ๊น€์ƒˆ์— ๊ด€๋ จ๋œ ํŠน์„ฑ์„ ์ด์šฉํ•œ๋‹ค. ํ•˜์ง€๋งŒ, ์ด๋Ÿฌํ•œ ๋ฐฉ๋ฒ•๋“ค์€ ๋ฌธ์ž๋“ค์˜ ์ •๋ ฌ ํ˜•ํƒœ๋Š” ๊ณ ๋ คํ•˜์ง€ ์•Š๊ณ , ๊ฐ๊ฐ ๊ฐœ๋ณ„ ๋ฌธ์ž์— ๋Œ€ํ•œ ํŠน์„ฑ๋“ค๋งŒ์„ ์ด์šฉํ•˜๊ธฐ ๋•Œ๋ฌธ์— ์—ฌ๋Ÿฌ ๋ฌธ์ž๋“ค๋กœ ๊ตฌ์„ฑ๋œ ํ…์ŠคํŠธ์— ๋Œ€ํ•ด์„œ ์ž˜ ์ •๋ ฌ๋˜์ง€ ์•Š์€ ๊ฒฐ๊ณผ๋ฅผ ์ถœ๋ ฅํ•œ๋‹ค. 
์ด๋Ÿฌํ•œ ๋ฌธ์ œ์ ์„ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•˜์—ฌ, ์ œ์•ˆํ•˜๋Š” ๋ฐฉ๋ฒ•์€ ๋ฌธ์ž๋“ค์˜ ์ •๋ ฌ ์ •๋ณด๋ฅผ ์ด์šฉํ•œ๋‹ค. ์ •ํ™•ํ•˜๊ฒŒ๋Š”, ๋ฌธ์ž ๊ณ ์œ ์˜ ๋ชจ์–‘๋ฟ๋งŒ ์•„๋‹ˆ๋ผ ์ •๋ ฌ ํŠน์„ฑ๋“ค๋„ ํ•จ๊ป˜ ๋น„์šฉํ•จ์ˆ˜๋กœ ์ˆ˜์‹ํ™”๋˜๊ณ , ๋น„์šฉํ•จ์ˆ˜๋ฅผ ์ตœ์†Œํ™”ํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ํ†ตํ•ด์„œ ํ‰ํ™œํ™”๊ฐ€ ์ง„ํ–‰๋œ๋‹ค. ๋˜ํ•œ, ๋ฌธ์ž๋“ค์˜ ์ •๋ ฌ ํŠน์„ฑ์„ ์ˆ˜์‹ํ™”ํ•˜๊ธฐ ์œ„ํ•˜์—ฌ, ์ œ์•ˆํ•˜๋Š” ๋ฐฉ๋ฒ•์€ ํ…์ŠคํŠธ๋ฅผ ๊ฐ๊ฐ ๊ฐœ๋ณ„ ๋ฌธ์ž๋“ค๋กœ ๋ถ„๋ฆฌํ•˜๋Š” ๋ฌธ์ž ๋ถ„๋ฆฌ ๋˜ํ•œ ์ˆ˜ํ–‰ํ•œ๋‹ค. ๊ทธ ๋’ค, ํ…์ŠคํŠธ์˜ ์œ„, ์•„๋ž˜ ์„ ๋“ค์„ RANSAC ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์ด์šฉํ•œ ์ตœ์†Œ ์ œ๊ณฑ๋ฒ•์„ ํ†ตํ•ด ์ถ”์ •ํ•œ๋‹ค. ์ฆ‰, ์ „์ฒด ์•Œ๊ณ ๋ฆฌ์ฆ˜์€ ๋ฌธ์ž ๋ถ„๋ฆฌ์™€ ์„  ์ถ”์ •, ํ‰ํ™œํ™”๊ฐ€ ๋ฐ˜๋ณต์ ์œผ๋กœ ์ˆ˜ํ–‰๋œ๋‹ค. ์ œ์•ˆํ•˜๋Š” ๋น„์šฉํ•จ์ˆ˜๋Š” ๋ณผ๋ก(convex)ํ˜•ํƒœ๊ฐ€ ์•„๋‹ˆ๊ณ  ๋˜ํ•œ ๋งŽ์€ ๋ณ€์ˆ˜๋“ค์„ ํฌํ•จํ•˜๊ณ  ์žˆ๊ธฐ ๋•Œ๋ฌธ์—, ์ด๋ฅผ ์ตœ์ ํ™”ํ•˜๊ธฐ ์œ„ํ•˜์—ฌ Augmented Lagrange Multiplier ๋ฐฉ๋ฒ•์„ ์ด์šฉํ•œ๋‹ค. ์ œ์•ˆํ•˜๋Š” ๋ฐฉ๋ฒ•์€ ์ผ๋ฐ˜ ์ดฌ์˜ ์˜์ƒ๊ณผ ํ•ฉ์„ฑ๋œ ํ…์ŠคํŠธ ์˜์ƒ์„ ํ†ตํ•ด ์‹คํ—˜์ด ์ง„ํ–‰๋˜์—ˆ๊ณ , ์‹คํ—˜ ๊ฒฐ๊ณผ๋“ค์€ ์ œ์•ˆํ•˜๋Š” ๋ฐฉ๋ฒ•์ด ๊ธฐ์กด ๋ฐฉ๋ฒ•๋“ค์— ๋น„ํ•˜์—ฌ ๋†’์€ ์ธ์‹ ์„ฑ๋Šฅ์„ ๋ณด์ด๋ฉด์„œ ๋™์‹œ์— ์‹œ๊ฐ์ ์œผ๋กœ๋„ ์ข‹์€ ๊ฒฐ๊ณผ๋ฅผ ๋ณด์ž„์„ ๋ณด์—ฌ์ค€๋‹ค. ๋งˆ์ง€๋ง‰์œผ๋กœ, ์ œ์•ˆํ•˜๋Š” ๋ฐฉ๋ฒ•์€ ์ผ๋ฐ˜ ๋ฐฐ๊ฒฝ ์†์˜ ํœ˜์–ด์ง„ ํ‘œ๋ฉด ํ‰ํ™œํ™” ๋ฐฉ๋ฒ•์œผ๋กœ๋„ ํ™•์žฅ๋œ๋‹ค. ์ผ๋ฐ˜ ๋ฐฐ๊ฒฝ์— ๋Œ€ํ•ด์„œ, ์•ฝ๋ณ‘์ด๋‚˜ ์Œ๋ฃŒ์ˆ˜ ์บ”๊ณผ ๊ฐ™์ด ์›ํ†ต ํ˜•ํƒœ์˜ ๋ฌผ์ฒด๋Š” ๋งŽ์ด ์กด์žฌํ•œ๋‹ค. ๊ทธ๋“ค์˜ ํ‘œ๋ฉด์€ ์ผ๋ฐ˜ ์›ํ†ต ํ‘œ๋ฉด(GCS)์œผ๋กœ ๋ชจ๋ธ๋ง์ด ๊ฐ€๋Šฅํ•˜๋‹ค. ์ด๋Ÿฌํ•œ ํœ˜์–ด์ง„ ํ‘œ๋ฉด๋“ค์€ ๋งŽ์€ ๋ฌธ์ž์™€ ๊ทธ๋ฆผ๋“ค์„ ํฌํ•จํ•˜๊ณ  ์žˆ์ง€๋งŒ, ํฌํ•จ๋œ ๋ฌธ์ž๋Š” ๋ฌธ์„œ์— ๋น„ํ•ด์„œ ๋งค์šฐ ๋ถˆ๊ทœ์น™์ ์ธ ๊ตฌ์กฐ๋ฅผ ๊ฐ€์ง€๊ณ  ์žˆ๋‹ค. ๋”ฐ๋ผ์„œ ๊ธฐ์กด์˜ ๋ฌธ์„œ ์˜์ƒ ํ‰ํ™œํ™” ๋ฐฉ๋ฒ•๋“ค๋กœ๋Š” ์ผ๋ฐ˜ ๋ฐฐ๊ฒฝ ์† ํœ˜์–ด์ง„ ํ‘œ๋ฉด ์˜์ƒ์„ ํ‰ํ™œํ™”ํ•˜๊ธฐ ํž˜๋“ค๋‹ค. ๋งŽ์€ ํœ˜์–ด์ง„ ํ‘œ๋ฉด์€ ์ž˜ ์ •๋ ฌ๋œ ์„ ๋ถ„๋“ค (ํ…Œ๋‘๋ฆฌ ์„  ํ˜น์€ ๋ฐ”์ฝ”๋“œ)์„ ํฌํ•จํ•˜๊ณ  ์žˆ๋‹ค๋Š” ๊ด€์ธก์— ๊ทผ๊ฑฐํ•˜์—ฌ, ์ œ์•ˆํ•˜๋Š” ๋ฐฉ๋ฒ•์€ ์•ž์„œ ์ œ์•ˆํ•œ ์„ ๋ถ„๋“ค์— ๋Œ€ํ•œ ํ•จ์ˆ˜๋ฅผ ์ด์šฉํ•˜์—ฌ ํœ˜์–ด์ง„ ํ‘œ๋ฉด์„ ํ‰ํ™œํ™”ํ•œ๋‹ค. ๋‹ค์–‘ํ•œ ๋‘ฅ๊ทผ ๋ฌผ์ฒด์˜ ํœ˜์–ด์ง„ ํ‘œ๋ฉด ์˜์ƒ๋“ค์— ๋Œ€ํ•œ ์‹คํ—˜ ๊ฒฐ๊ณผ๋“ค์€ ์ œ์•ˆํ•˜๋Š” ๋ฐฉ๋ฒ•์ด ํ‰ํ™œํ™”๋ฅผ ์ •ํ™•ํ•˜๊ฒŒ ์ˆ˜ํ–‰ํ•จ์„ ๋ณด์—ฌ์ค€๋‹ค.The optical character recognition (OCR) of text images captured by cameras plays an important role for scene understanding. However, the OCR of camera-captured image is still considered a challenging problem, even after the text detection (localization). It is mainly due to the geometric distortions caused by page curve and perspective view, therefore their rectification has been an essential pre-processing step for their recognition. Thus, there have been many text image rectification methods which recover the fronto-parallel view image from a single distorted image. Recently, many researchers have focused on the properties of the well-rectified text. In this respect, this dissertation presents novel alignment properties for text image rectification, which are encoded into the proposed cost functions. By minimizing the cost functions, the transformation parameters for rectification are obtained. In detail, they are applied to three topics: document image dewarping, scene text rectification, and curved surface dewarping in real scene. First, a document image dewarping method is proposed based on the alignments of text-lines and line segments. Conventional text-line based document dewarping methods have problems when handling complex layout and/or very few text-lines. 
When there are few aligned text-lines in an image, photos, graphics, and/or tables usually occupy a large portion of the input instead. Hence, for robust document dewarping, the proposed method uses line segments in the image in addition to the aligned text-lines. Based on the assumption and observation that transformed line segments remain straight (line-to-line mapping) and that many of them are horizontally or vertically aligned in well-rectified images, the proposed method encodes these properties into the cost function in addition to the text-line-based cost. By minimizing this function, the proposed method obtains the transformation parameters for page curve, camera pose, and focal length, which are used for document image rectification. Considering that line segment directions contain many outliers and that text-lines are sometimes mis-detected, the overall algorithm is designed in an iterative manner: at each step, the proposed method removes the text-lines and line segments that are not well aligned and then minimizes the cost function with the updated information. Experimental results show that the proposed method is robust to a variety of page layouts. This dissertation also presents a method for scene text rectification. Conventional methods for scene text rectification mainly exploit glyph properties, i.e., the fact that characters in many languages have horizontal/vertical strokes and some symmetric shapes. However, since they consider only the shape properties of individual characters, without considering the alignment of characters, they work well only for images with a single character and still yield misaligned results for images with multiple characters. To alleviate this problem, the proposed method explicitly imposes alignment constraints on the rectified results. To be precise, character alignments as well as glyph properties are encoded in the proposed cost function, and the transformation parameters are obtained by minimizing it. Also, in order to encode the alignment of characters into the cost function, the proposed method separates the text into individual characters using a projection-profile method before optimizing the cost function. Then, top and bottom lines are estimated using least squares line fitting with RANSAC. The overall algorithm performs character segmentation, line fitting, and rectification iteratively. Since the cost function is non-convex and involves many variables, the proposed method also develops an optimization scheme based on the Augmented Lagrange Multiplier method. The proposed method is evaluated on real and synthetic text images, and experimental results show that it achieves higher OCR accuracy than conventional approaches while also yielding visually pleasing results. Finally, the proposed method is extended to curved surface dewarping in real scenes. Real scenes contain many circular objects, such as medicine bottles or beverage cans, whose curved surfaces can be modeled as Generalized Cylindrical Surfaces (GCS). These curved surfaces carry significant text and figures, but their text has an irregular structure compared to documents. Therefore, conventional dewarping methods based on the properties of well-rectified text have problems rectifying them.
Based on the observation that many curved surfaces include well-aligned line segments (boundary lines of objects or barcodes), the proposed method rectifies curved surfaces by exploiting the proposed line segment terms. Experimental results on a range of images with curved surfaces of circular objects show that the proposed method performs rectification robustly.
    Contents: 1 Introduction (1.1 Document image dewarping, 1.2 Scene text rectification, 1.3 Curved surface dewarping in real scene, 1.4 Contents); 2 Related work (2.1 Document image dewarping, 2.1.1 Dewarping methods using additional information, 2.1.2 Text-line based dewarping methods, 2.2 Scene text rectification, 2.3 Curved surface dewarping in real scene); 3 Document image dewarping (3.1 Proposed cost function, 3.1.1 Parametric model of dewarping process, 3.1.2 Cost function design, 3.1.3 Line segment properties and cost function, 3.2 Outlier removal and optimization, 3.2.1 Jacobian matrix of the proposed cost function, 3.3 Document region detection and dewarping, 3.4 Experimental results, 3.4.1 Experimental results on text-abundant document images, 3.4.2 Experimental results on non-conventional document images, 3.5 Summary); 4 Scene text rectification (4.1 Proposed cost function for rectification, 4.1.1 Cost function design, 4.1.2 Character alignment properties and alignment terms, 4.2 Overall algorithm, 4.2.1 Initialization, 4.2.2 Character segmentation, 4.2.3 Estimation of the alignment parameters, 4.2.4 Cost function optimization for rectification, 4.3 Experimental results, 4.4 Summary); 5 Curved surface dewarping in real scene (5.1 Proposed curved surface dewarping method, 5.1.1 Pre-processing, 5.2 Experimental results, 5.3 Summary); 6 Conclusions; Bibliography; Abstract (Korean)
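    One step of the scene text rectification pipeline above is concrete enough to sketch: estimating the top and bottom lines of a text string with RANSAC followed by a least-squares refit on the inliers. The following is a generic illustration of that step, not the dissertation's code; the iteration count and inlier tolerance are assumed values.

```python
import numpy as np

def ransac_line(points, n_iters=200, inlier_tol=2.0, seed=None):
    """Fit y = a*x + b to 2-D points (e.g., character top or bottom
    corners) with RANSAC, then refine by least squares on the inliers."""
    rng = np.random.default_rng(seed)
    pts = np.asarray(points, dtype=float)
    best = None
    for _ in range(n_iters):
        p, q = pts[rng.choice(len(pts), size=2, replace=False)]
        if np.isclose(p[0], q[0]):
            continue  # near-vertical sample pair: skip
        a = (q[1] - p[1]) / (q[0] - p[0])
        b = p[1] - a * p[0]
        inliers = np.abs(pts[:, 1] - (a * pts[:, 0] + b)) < inlier_tol
        if best is None or inliers.sum() > best.sum():
            best = inliers
    if best is None or best.sum() < 2:
        best = np.ones(len(pts), dtype=bool)  # degenerate fallback
    x, y = pts[best, 0], pts[best, 1]
    A = np.stack([x, np.ones_like(x)], axis=1)
    (a, b), *_ = np.linalg.lstsq(A, y, rcond=None)
    return a, b  # slope and intercept of the refined line
```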

    Development of a text reading system on video images

    Get PDF
    Since the early days of computer science, researchers have sought to devise a machine that could automatically read text to help people with visual impairments. The problem of extracting and recognising text in document images has been largely resolved, but reading text in images of natural scenes remains a challenge. Scene text can present uneven lighting, complex backgrounds, or perspective and lens distortion; it usually appears as short sentences or isolated words and shows a very diverse set of typefaces. However, video sequences of natural scenes provide a temporal redundancy that can be exploited to compensate for some of these deficiencies. Here we present a complete end-to-end, real-time scene text reading system for video images based on perspective-aware text tracking. The main contribution of this work is a system that automatically detects, recognises, and tracks text in videos of natural scenes in real time. The focus of our method is on large text found in outdoor environments, such as shop signs, street names, and billboards. We introduce novel efficient techniques for text detection, text aggregation, and text perspective estimation. Furthermore, we propose using a set of Unscented Kalman Filters (UKF) to maintain each text region's identity and to continuously track the homography transformation of the text into a fronto-parallel view, thereby being resilient to erratic camera motion and wide baseline changes in orientation. The orientation of each text line is estimated using a method that relies on the geometry of the characters themselves to estimate a rectifying homography. This is done irrespective of the view of the text over a large range of orientations. We also demonstrate a wearable head-mounted device for text reading that encases a camera for image acquisition and a pair of headphones for synthesized speech output. Our system is designed for continuous and unsupervised operation over long periods of time. It is completely automatic and features quick failure recovery and interactive text reading. It is also highly parallelised in order to maximize the usage of available processing power and to achieve real-time operation. We show comparative results that improve the current state of the art when correcting perspective deformation of scene text. The end-to-end system performance is demonstrated on sequences recorded in outdoor scenarios. Finally, we also release a dataset of text tracking videos along with the annotated ground-truth of text regions.
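    The perspective correction at the heart of the system maps each tracked text region to a fronto-parallel view via a homography. As a self-contained illustration, here is a plain Direct Linear Transform from four region corners; it stands in for, and is much simpler than, the character-geometry estimation and UKF-based tracking the system actually uses, and all names are illustrative.

```python
import numpy as np

def rectifying_homography(quad, width, height):
    """Homography mapping the four corners of a perspective-distorted
    text region (clockwise from top-left) onto an axis-aligned
    width x height rectangle, i.e., a fronto-parallel view."""
    src = np.asarray(quad, dtype=float)
    dst = np.array([[0, 0], [width, 0], [width, height], [0, height]],
                   dtype=float)
    A = []
    for (x, y), (u, v) in zip(src, dst):
        # Each point correspondence contributes two rows of the DLT system.
        A.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        A.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    _, _, vt = np.linalg.svd(np.asarray(A))
    H = vt[-1].reshape(3, 3)  # null-space vector holds the homography entries
    return H / H[2, 2]
```

A tracker such as the UKF described above would then smooth the parameters of this transformation over time rather than re-estimating it independently in every frame.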

    Extraction of Text from Images and Videos

    Get PDF
    Ph.D. (Doctor of Philosophy)