Semantic relatedness based re-ranker for text spotting
Applications such as textual entailment, plagiarism detection or document clustering rely on the notion of semantic similarity, and are usually approached with dimension reduction techniques like LDA or with embedding-based neural approaches. We present a scenario where semantic similarity is not enough, and we devise a neural approach to learn semantic relatedness. The scenario is text spotting in the wild, where a text in an image (e.g. street sign, advertisement or bus destination) must be identified and recognized. Our goal is to improve the performance of vision systems by leveraging semantic information. Our rationale is that the text to be spotted is often related to the image context in which it appears (word pairs such as Delta–airplane, or quarters–parking are not similar, but are clearly related). We show how learning a word-to-word or word-to-sentence relatedness score can improve the performance of text spotting systems by up to 2.9 points, outperforming other measures on a benchmark dataset. Peer reviewed. Postprint (author's final draft).
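The re-ranking idea above can be sketched in a few lines. The vectors and the cosine score below are toy stand-ins (the paper learns a dedicated relatedness model, not these made-up values), but they illustrate how a related context word such as "airplane" can promote the right candidate:

```python
import math

# Toy word vectors standing in for learned relatedness embeddings
# (hypothetical values, chosen only for illustration).
VECTORS = {
    "delta":    [0.9, 0.1, 0.3],
    "delia":    [0.2, 0.8, 0.1],
    "airplane": [0.8, 0.2, 0.4],
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def rerank(candidates, context_word):
    """Re-order OCR candidate words by relatedness to the visual context."""
    ctx = VECTORS[context_word]
    return sorted(candidates, key=lambda w: cosine(VECTORS[w], ctx), reverse=True)

print(rerank(["delia", "delta"], "airplane"))  # → ['delta', 'delia']
```

Here the context word pulls "delta" ahead of the visually similar but unrelated "delia", which is exactly the effect a relatedness-based re-ranker exploits.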
Object Detection in 20 Years: A Survey
Object detection, as one of the most fundamental and challenging problems in
computer vision, has received great attention in recent years. Its development
in the past two decades can be regarded as an epitome of computer vision
history. If we think of today's object detection as a technical aesthetics
under the power of deep learning, then turning back the clock 20 years we would
witness the wisdom of cold weapon era. This paper extensively reviews 400+
papers of object detection in the light of its technical evolution, spanning
over a quarter-century's time (from the 1990s to 2019). A number of topics have
been covered in this paper, including the milestone detectors in history,
detection datasets, metrics, fundamental building blocks of the detection
system, speed-up techniques, and recent state-of-the-art detection methods.
This paper also reviews some important detection applications, such as
pedestrian detection, face detection, text detection, etc., and makes an in-depth
analysis of their challenges as well as technical improvements in recent years. Comment: This work has been submitted to the IEEE TPAMI for possible
publication.
Enhancing scene text recognition with visual context information
This thesis addresses the problem of improving text spotting systems, which aim to detect and recognize text in unrestricted images (e.g. a street sign, an advertisement, a bus destination, etc.). The goal is to improve the performance of off-the-shelf vision systems by exploiting the semantic information derived from the image itself. The rationale is that knowing the content of the image, or the visual context, can help to decide which words are the correct candidate words.
For example, the fact that an image shows a coffee shop makes it more likely that a word on a signboard reads as Dunkin and not unkind.
We address this problem by drawing on successful developments in natural language processing and machine learning, in particular, learning to re-rank and neural networks, to present post-process frameworks that improve state-of-the-art text spotting systems without the need for costly data-driven re-training or tuning procedures.
Discovering the degree of semantic relatedness of candidate words and their image context is a task related to assessing the semantic similarity between words or text fragments. However, semantic relatedness is more general than similarity (e.g. car, road, and traffic light are related but not similar) and requires certain adaptations. To meet the requirements of these broader perspectives of semantic relatedness, we develop two approaches to learn the semantic relatedness of the spotted word and its environmental context: word-to-word (object) or word-to-sentence (caption). In the word-to-word approach, word-embedding-based re-rankers are developed. The re-ranker takes the words from the text spotting baseline and re-ranks them based on the visual context from the object classifier. For the second approach, an end-to-end neural approach is designed to exploit the image description (caption) at the sentence level as well as the word level (objects) and re-rank the candidates based not only on the visual context but also on the co-occurrence between them.
As an additional contribution, to meet the requirements of data-driven approaches such as neural networks, we propose a visual context dataset for this task, in which the publicly available COCO-text dataset [Veit et al. 2016] has been extended with information about the scene (including the objects and places appearing in the image) to enable researchers to include the semantic relations between texts and scene in their text spotting systems, and to offer a common evaluation baseline for such approaches.
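A minimal sketch of the re-ranking step described above, assuming a simple linear interpolation between the spotter's confidence and the relatedness score. The mixing weight and all scores here are hypothetical (the thesis learns the combination rather than fixing a weight), but the sketch reproduces the coffee-shop example:

```python
def combine(baseline_score, relatedness, alpha=0.7):
    """Interpolate the spotter's confidence with the context relatedness.
    alpha is a hypothetical mixing weight for illustration only."""
    return alpha * baseline_score + (1 - alpha) * relatedness

# (word, OCR confidence, relatedness to a 'coffee shop' visual context);
# all numbers are made up for the example.
candidates = [("unkind", 0.62, 0.05), ("dunkin", 0.58, 0.90)]
best = max(candidates, key=lambda c: combine(c[1], c[2]))
print(best[0])  # → dunkin
```

Even though the OCR baseline slightly prefers "unkind", the visual context ("coffee shop") lifts "dunkin" to the top, which is the behaviour the re-ranker is trained to produce.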
All You Need Is Boundary: Toward Arbitrary-Shaped Text Spotting
Recently, end-to-end text spotting that aims to detect and recognize text
from cluttered images simultaneously has received particularly growing interest
in computer vision. Different from the existing approaches that formulate text
detection as bounding box extraction or instance segmentation, we localize a
set of points on the boundary of each text instance. With the representation of
such boundary points, we establish a simple yet effective scheme for end-to-end
text spotting, which can read the text of arbitrary shapes. Experiments on
three challenging datasets, including ICDAR2015, TotalText and COCO-Text,
demonstrate that the proposed method consistently surpasses the
state of the art in both scene text detection and end-to-end text recognition
tasks. Comment: Accepted to AAAI 2020.
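The boundary-point representation can be contrasted with plain box extraction in a few lines. This toy helper is not the paper's method; it only shows that an axis-aligned bounding box is the degenerate case of a set of boundary points, which is why boundary points can describe arbitrary (e.g. curved) text while boxes cannot:

```python
def boundary_to_bbox(points):
    """Collapse a set of boundary points to the tightest axis-aligned box.
    The box loses the curved outline that the points preserve."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs), min(ys), max(xs), max(ys))

# A curved text instance described by eight boundary points (toy coordinates).
pts = [(10, 5), (30, 2), (50, 5), (50, 15), (30, 18), (10, 15), (8, 10), (52, 10)]
print(boundary_to_bbox(pts))  # → (8, 2, 52, 18)
```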
Mask TextSpotter: An End-to-End Trainable Neural Network for Spotting Text with Arbitrary Shapes
Recently, models based on deep neural networks have dominated the fields of
scene text detection and recognition. In this paper, we investigate the problem
of scene text spotting, which aims at simultaneous text detection and
recognition in natural images. An end-to-end trainable neural network model for
scene text spotting is proposed. The proposed model, named Mask TextSpotter,
is inspired by the recently published Mask R-CNN. Different from previous
methods that also accomplish text spotting with end-to-end trainable deep
neural networks, Mask TextSpotter takes advantage of a simple and smooth
end-to-end learning procedure, in which precise text detection and recognition
are acquired via semantic segmentation. Moreover, it is superior to previous
methods in handling text instances of irregular shapes, for example, curved
text. Experiments on ICDAR2013, ICDAR2015 and Total-Text demonstrate that the
proposed method achieves state-of-the-art results in both scene text detection
and end-to-end text recognition tasks. Comment: To appear in ECCV 2018.
Storm Spotting and Amateur Radio: a Field Guide for Volunteer Storm Spotters
Volunteers have played a critical part in relaying weather information since the mid-19th century. The effort of these volunteers has helped safeguard life and property when severe weather threatens. For almost 100 years, Amateur Radio operators have played a critical part in severe weather preparedness and response. Amateur Radio operators not only bring a willingness to serve, but also communications skills that provide an added benefit to any storm spotter program. The National Weather Service recognized this when developing the SKYWARN program during the 1960s. Over the last 40 years, Amateur Radio and the National Weather Service have developed a solid relationship that has been beneficial to communities across the country that face the threat of severe weather. This project, the first of its kind, seeks to gather information that will help better prepare Amateur Radio operators who serve as volunteer storm spotters.
PMODE: Prototypical Mask based Object Dimension Estimation
Can a neural network estimate an object's dimension in the wild? In this
paper, we propose a method and deep learning architecture to estimate the
dimensions of a quadrilateral object of interest in videos using a monocular
camera. The proposed technique does not use camera calibration or handcrafted
geometric features; however, features are learned with the help of coefficients
of a segmentation neural network during the training process. A real-time
instance segmentation-based Deep Neural Network with a ResNet50 backbone is
employed, yielding the object's prototype mask and thus providing a region of
interest to regress its dimensions. The instance segmentation network is
trained to look at only the nearest object of interest. The regression is
performed using an MLP head which looks only at the mask coefficients of the
bounding box detector head and the prototype segmentation mask. We trained the
system with three different random cameras, achieving 22% MAPE on the test
dataset for the dimension estimation. Comment: 10 pages.
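The 22% figure above is Mean Absolute Percentage Error. For reference, MAPE averages the per-sample relative errors and expresses them as a percentage; the dimension values below are made up purely to show the computation:

```python
def mape(y_true, y_pred):
    """Mean Absolute Percentage Error: mean of |true - pred| / |true|, in %."""
    return 100.0 * sum(abs(t - p) / abs(t)
                       for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical object dimensions in centimetres (ground truth vs. estimates).
true_dims = [100.0, 50.0, 80.0]
pred_dims = [110.0, 45.0, 88.0]
print(round(mape(true_dims, pred_dims), 1))  # → 10.0
```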