25 research outputs found
Text Detection and Recognition in the Wild
Text detection and recognition (TDR) in highly structured environments with clean backgrounds and consistent fonts (e.g., office documents, postal addresses, and bank cheques) is a well-understood problem (i.e., OCR); however, this is not the case for unstructured environments.
The main objective for scene text detection is to locate text within images captured in the wild.
For scene text recognition, the techniques map each detected or cropped word image into a character string.
Nowadays, convolutional neural network (CNN) and recurrent neural network (RNN) deep learning architectures dominate most recent state-of-the-art (SOTA) scene TDR methods.
Most reported accuracies of current SOTA TDR methods fall in the range of 80% to 90% on benchmark datasets with regular, clear text instances. However, these results deteriorate drastically, by roughly 10% in detection F-measure and 30% in word recognition accuracy, on images with irregular or occluded text.
Transformers and their variants are newer deep learning architectures that mitigate the above-mentioned issues of CNN- and RNN-based pipelines. Unlike RNNs, transformers learn to encode and decode data by looking not only backward but also forward, in order to extract relevant information from the whole sequence.
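The bidirectional nature of attention can be sketched with a toy scaled dot-product attention computation. This plain-Python example is illustrative only; the vectors and names are made up and not taken from the thesis:

```python
import math

def attention_weights(query, keys):
    """Scaled dot-product attention weights for one query over all keys.

    Unlike a recurrent step, the query attends to every position in the
    sequence, both before and after its own, i.e. bidirectional context.
    """
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    m = max(scores)                      # subtract max for stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Toy 4-step sequence of 2-d vectors; position 1 receives non-zero
# weight from positions after it, which an RNN reading left to right
# would not yet have seen.
seq = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
weights = attention_weights(seq[1], seq)
```

The softmax guarantees the weights form a distribution over the whole sequence, forward and backward positions alike.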
This thesis utilizes the transformer architecture to address the challenges of irregular (multi-oriented and arbitrarily shaped) and occluded text in wild images. Our main contributions are as follows:
(1) We first target solving irregular TDR in two separate architectures, as follows:
In Chapter 4, unlike SOTA text detection frameworks that have complex pipelines and rely on many hand-designed components and post-processing stages, we design a conceptually simpler, end-to-end trainable transformer-based detector for multi-oriented scene text detection, which directly predicts the set of detections (i.e., text and box regions) in the input image. A central contribution of our work is a loss function tailored to the rotated text detection problem, which leverages a rotated version of the generalized intersection-over-union score to adequately capture rotated text instances.
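For orientation, the underlying generalized IoU score can be sketched for the axis-aligned case. The thesis's loss uses a rotated variant over rotated rectangles; the plain-Python version below only illustrates the GIoU formula itself, and the box coordinates are made up:

```python
def giou(box_a, box_b):
    """Generalized IoU for axis-aligned boxes given as (x1, y1, x2, y2).

    GIoU = IoU - (area(C) - area(union)) / area(C), where C is the
    smallest region enclosing both boxes. The thesis's rotated variant
    replaces the boxes with rotated rectangles; this axis-aligned form
    is only a sketch of the score.
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    # Smallest axis-aligned region enclosing both boxes.
    area_c = ((max(ax2, bx2) - min(ax1, bx1)) *
              (max(ay2, by2) - min(ay1, by1)))
    return inter / union - (area_c - union) / area_c

# A detection loss is typically 1 - GIoU, so disjoint boxes are
# penalized more the farther apart they lie (example boxes made up).
loss = 1.0 - giou((0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 3.0, 3.0))
```

Unlike plain IoU, GIoU stays informative (and negative) for non-overlapping boxes, which is what makes it usable as a regression loss.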
In Chapter 5, we extend our previous architecture to arbitrary shaped scene text detection.
We design a new text detection technique that infers the n vertices of a polygon or the degree of a Bezier curve to better represent irregular text instances.
We also propose a loss function based on a generalized split intersection-over-union loss defined over the piecewise polygons.
In Chapter 6, we show that our transformer-based architecture, without rectifying the input curved text instances, is more suitable for irregular text recognition in wild images than SOTA RNN-based frameworks equipped with rectification modules.
Our main contribution in this chapter is leveraging a 2D Learnable Sinusoidal frequencies Positional Encoding (2LSPE) with a modified feed-forward neural network to better encode the 2D spatial dependencies of characters in irregular text instances.
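For context, the fixed 2D sinusoidal positional encoding that 2LSPE builds on can be sketched as follows. This is the common fixed-frequency baseline, not 2LSPE itself (which makes the frequencies learnable); the function name and grid sizes are illustrative:

```python
import math

def pe_2d(height, width, d):
    """Fixed 2-D sinusoidal positional encoding over a height x width grid.

    The first d/2 channels encode the column index and the last d/2 the
    row index, each as sine/cosine pairs at geometrically spaced
    frequencies. 2LSPE replaces these fixed frequencies with learnable
    parameters; this sketch shows only the fixed baseline it modifies.
    """
    assert d % 4 == 0          # sine/cosine pairs in each half
    half = d // 2
    enc = [[[0.0] * d for _ in range(width)] for _ in range(height)]
    for y in range(height):
        for x in range(width):
            for i in range(0, half, 2):
                freq = 1.0 / (10000.0 ** (i / half))
                enc[y][x][i] = math.sin(x * freq)             # column
                enc[y][x][i + 1] = math.cos(x * freq)
                enc[y][x][half + i] = math.sin(y * freq)      # row
                enc[y][x][half + i + 1] = math.cos(y * freq)
    return enc

enc = pe_2d(4, 4, 8)  # one d-channel code per grid cell
```

Splitting the channels between row and column lets attention distinguish characters that share a row but not a column and vice versa, which matters for curved text layouts.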
(2) Since TDR tasks encounter the same challenging problems (e.g., irregular text, illumination variations, low-resolution text, etc.), we present a new transformer model that can detect and recognize the individual characters of text instances in an end-to-end manner. Reading individual characters yields a robust model for occluded and arbitrarily shaped text spotting without needing polygon annotations or the multiple detection and recognition stages used in SOTA text spotting architectures.
In Chapter 7, unlike SOTA methods that combine two different pipelines of detection and recognition modules for complete text reading, we utilize our text detection framework, leveraging a recent transformer-based technique, namely the Deformable Patch-based Transformer (DPT), as a feature-extracting backbone, to robustly read the classes and box coordinates of irregular characters in wild images.
(3) Finally, we address the occlusion problem by using a multi-task end-to-end scene text spotting framework.
In Chapter 8, we leverage a recent transformer-based framework in deep learning, namely the Masked Auto-Encoder (MAE), as a backbone for scene text recognition and end-to-end scene text spotting pipelines to overcome the partial occlusion limitation. We design a new multitask end-to-end transformer network that directly outputs characters, word instances, and their bounding box representations, saving computational overhead by eliminating multiple processing steps. The proposed unified framework can also detect and recognize arbitrarily shaped text instances without using polygon annotations.
Object Recognition
Vision-based object recognition tasks are ubiquitous in our everyday activities, such as keeping our car in the correct lane while driving. We perform these tasks effortlessly in real time. In recent decades, with the advancement of computer technology, researchers and application developers have been trying to mimic the human capability of visual recognition. Such a capability would allow machines to free humans from boring or dangerous jobs.
Media streams--representing video for retrieval and repurposing
Thesis (Ph.D.), Massachusetts Institute of Technology, Program in Media Arts & Sciences, 1995. Includes bibliographical references (p. 325-344). By Marc Eliot Davis.
The evaluation of Corona and Ikonos satellite imagery for archaeological applications in a semi-arid environment
Archaeologists have been aware of the potential of satellite imagery as a tool almost since the first Earth remote sensing satellite. Initially, sensors such as Landsat had a ground resolution too coarse for thorough archaeological prospection, although the imagery was used for geo-archaeological and enviro-archaeological analyses. In the intervening years the spatial and spectral resolution of these sensing devices has improved. In recent years two important occurrences enhanced the archaeological applicability of imagery from satellite platforms: the declassification of high-resolution photography by the American and Russian governments, and the deregulation of commercial remote sensing systems allowing the collection of sub-metre resolution imagery. This thesis aims to evaluate the archaeological application of three potentially important resources: Corona space photography, and Ikonos panchromatic and multispectral imagery. These resources are evaluated in conjunction with Landsat Thematic Mapper (TM) imagery over a 600 square km study area in the semi-arid environment around Homs, Syria. The archaeological resource in this area is poorly understood, mapped, and documented. The images are evaluated for their ability to create thematic layers and to locate archaeological residues in different environmental zones. Further consideration is given to the physical factors that allow archaeological residues to be identified and to how satellite imagery and modern technology may impact Cultural Resource Management. This research demonstrates that modern high-resolution and historic satellite imagery can be important tools for archaeologists working in semi-arid environments. The imagery has allowed a representative range of archaeological features and landscape themes to be identified. The research shows that the use of satellite imagery can have a significant impact on the design of archaeological surveys in the Middle East and perhaps in other environments.
QUIS-CAMPI: Biometric Recognition in Surveillance Scenarios
The concerns about individuals' security have justified the increasing number of surveillance
cameras deployed both in private and public spaces. However, contrary to popular belief,
these devices are in most cases used solely for recording, instead of feeding intelligent analysis
processes capable of extracting information about the observed individuals. Thus, even though
video surveillance has already proved to be essential for solving multiple crimes, obtaining relevant
details about the subjects that took part in a crime depends on the manual inspection
of recordings. As such, the current goal of the research community is the development of
automated surveillance systems capable of monitoring and identifying subjects in surveillance
scenarios. Accordingly, the main goal of this thesis is to improve the performance of biometric
recognition algorithms in data acquired from surveillance scenarios. In particular, we aim at
designing a visual surveillance system capable of acquiring biometric data at a distance (e.g.,
face, iris or gait) without requiring human intervention in the process, as well as devising biometric
recognition methods robust to the degradation factors resulting from the unconstrained
acquisition process.
Regarding the first goal, the analysis of the data acquired by typical surveillance systems
shows that large acquisition distances significantly decrease the resolution of biometric samples,
and thus their discriminability is not sufficient for recognition purposes. In the literature,
diverse works point out Pan Tilt Zoom (PTZ) cameras as the most practical way for acquiring
high-resolution imagery at a distance, particularly when using a master-slave configuration. In
the master-slave configuration, the video acquired by a typical surveillance camera is analyzed
for obtaining regions of interest (e.g., car, person) and these regions are subsequently imaged
at high-resolution by the PTZ camera. Several methods have already shown that this configuration
can be used for acquiring biometric data at a distance. Nevertheless, these methods
failed to provide effective solutions to the typical challenges of this strategy, restraining its
use in surveillance scenarios. Accordingly, this thesis proposes two methods to support the development
of a biometric data acquisition system based on the cooperation of a PTZ camera
with a typical surveillance camera. The first proposal is a camera calibration method capable
of accurately mapping the coordinates of the master camera to the pan/tilt angles of the PTZ
camera. The second proposal is a camera scheduling method for determining - in real-time -
the sequence of acquisitions that maximizes the number of different targets obtained, while
minimizing the cumulative transition time. In order to achieve the first goal of this thesis,
both methods were combined with state-of-the-art approaches of the human monitoring field
to develop a fully automated surveillance system capable of acquiring biometric data at a distance and
without human cooperation, designated as QUIS-CAMPI system.
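The camera scheduling idea, maximizing the number of distinct targets imaged while keeping cumulative transition time low, can be illustrated with a greedy nearest-target heuristic. This is a simple sketch under assumed inputs, not the thesis's real-time algorithm; the target names and coordinates are hypothetical:

```python
def dist(a, b):
    """Euclidean distance in pan/tilt space, used here as a rough
    stand-in for the camera's transition time between orientations."""
    return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5

def greedy_schedule(start, targets):
    """Order PTZ acquisitions by repeatedly visiting the nearest
    unvisited target, so every target is imaged once while cumulative
    transition time stays low. A heuristic sketch only; the thesis's
    real-time scheduling method is not specified here.
    """
    pos, order, total = start, [], 0.0
    remaining = dict(targets)            # target name -> (pan, tilt)
    while remaining:
        name = min(remaining, key=lambda n: dist(pos, remaining[n]))
        total += dist(pos, remaining[name])
        pos = remaining.pop(name)
        order.append(name)
    return order, total

# Hypothetical targets detected by the master camera.
order, total = greedy_schedule(
    (0.0, 0.0), {"p1": (10.0, 0.0), "p2": (1.0, 0.0), "p3": (5.0, 0.0)})
```

A greedy tour is not optimal in general, but it captures the trade-off the thesis describes: visiting all targets versus minimizing the pan/tilt travel between them.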
The QUIS-CAMPI system is the basis for pursuing the second goal of this thesis. The analysis
of the performance of the state-of-the-art biometric recognition approaches shows that these
approaches attain almost ideal recognition rates in unconstrained data. However, this performance
is incongruous with the recognition rates observed in surveillance scenarios. Taking into
account the drawbacks of current biometric datasets, this thesis introduces a novel dataset comprising
biometric samples (face images and gait videos) acquired by the QUIS-CAMPI system at a
distance ranging from 5 to 40 meters and without human intervention in the acquisition process.
This set allows an objective assessment of the performance of state-of-the-art biometric recognition
methods in data that truly encompass the covariates of surveillance scenarios. As such, this set
was exploited for promoting the first international challenge on biometric recognition in the wild. This thesis describes the evaluation protocols adopted, along with the results obtained
by the nine methods specially designed for this competition. In addition, the data acquired by
the QUIS-CAMPI system were crucial for accomplishing the second goal of this thesis, i.e., the
development of methods robust to the covariates of surveillance scenarios. The first proposal
regards a method for detecting corrupted features in biometric signatures inferred by a redundancy
analysis algorithm. The second proposal is a caricature-based face recognition approach
capable of enhancing the recognition performance by automatically generating a caricature
from a 2D photo. The experimental evaluation of these methods shows that both approaches
contribute to improving the recognition performance in unconstrained data.
Photogrammetry as a surveying technique applied to heritage constructions recording - advantages and limitations
Dissertation for the Mestrado Integrado em Arquitetura, with a specialization in Architecture, presented at the Faculdade de Arquitetura da Universidade de Lisboa for the degree of Mestre.
The present dissertation aims to research and demonstrate the advantages of the application of photogrammetry, and its possible integrations with other survey methods, such as terrestrial laser scanning and GPS positioning, to perform surveys of heritage or erudite buildings and the respective production of base documentation to enable interventions of conservation, restoration, or rehabilitation.
The motivation for this research is the flexible, versatile, simple, affordable, and low-cost application of photogrammetry in small and extensive survey projects. It is also intended to overcome the traditional disadvantages of photogrammetry, such as the transition between interior and exterior spaces and the difficulty of recording narrow, hard-to-access, and geometrically complex spaces, in a single project. These challenges are addressed by maximizing the potential of photogrammetry through the use of fisheye images, resorting to other survey instruments only as a last resort.
In the main case study, the Castle of the Convent of Christ, the application of the investigated methods is demonstrated. In the secondary case studies, partial problems are addressed, ranging from decorative elements to entire buildings: Convento dos Capuchos, in Sintra; the citadel and a section of wall of the Castle of Sesimbra; Igreja de Stº André, in Mafra; among others. The case studies aided in determining procedures to be generalized later. Finally, algorithms that accelerate the production of documentation are proposed.
Automatic grammar induction from free text using insights from cognitive grammar
Automatic identification of the grammatical structure of a sentence is useful in many Natural Language
Processing (NLP) applications such as Document Summarisation, Question Answering systems and
Machine Translation. With the availability of syntactic treebanks, supervised parsers have been
developed successfully for many major languages. However, for low-resourced minority languages with
fewer digital resources, this poses more of a challenge. Moreover, there are a number of syntactic
annotation schemes motivated by different linguistic theories and formalisms; these are sometimes
language-specific and cannot always be adapted for developing syntactic parsers across different
language families.
This project aims to develop a linguistically motivated approach to the automatic induction of
grammatical structures from raw sentences. Such an approach can be readily adapted to different
languages including low-resourced minority languages. We draw the basic approach to linguistic analysis
from usage-based, functional theories of grammar such as Cognitive Grammar, Computational Paninian
Grammar and insights from psycholinguistic studies. Our approach identifies the grammatical structure of a
sentence by recognising domain-independent, general, cognitive patterns of conceptual organisation
that occur in natural language. It also reflects some of the general psycholinguistic properties of parsing
by humans, such as incrementality, connectedness, and expectation.
Our implementation has three components: Schema Definition, Schema Assembly and Schema
Prediction. Schema Definition and Schema Assembly components were implemented algorithmically as
a dictionary and rules. An Artificial Neural Network was trained for Schema Prediction. By using
part-of-speech tags to bootstrap the simplest case of token-level schema definitions, a sentence is passed
through all three components incrementally until all the words are exhausted and the entire
sentence is analysed as an instance of one final construction schema. The order in which all intermediate
schemas are assembled to form the final schema can be viewed as the parse of the sentence. Parsers
for English and Welsh (a low-resource minority language) were developed using the same approach with
some changes to the Schema Definition component. We evaluated the parser performance by (a)
quantitative evaluation, comparing the parsed chunks against the constituents in a phrase-structure
tree; (b) manual evaluation, listing the range of linguistic constructions covered by the parser and
performing error analysis on the parser outputs; (c) evaluation by identifying the number of edits
required for a correct assembly; and (d) qualitative evaluation based on Likert scales in online surveys.
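The edit-count evaluation in (c) resembles a standard edit-distance computation. A minimal sketch, assuming parser output and gold assembly are compared as sequences of schema labels; the labels below are hypothetical:

```python
def edit_distance(a, b):
    """Minimum number of insertions, deletions, and substitutions that
    turn sequence a into sequence b (classic Levenshtein dynamic
    program, kept to two rows of the DP table)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute
        prev = cur
    return prev[-1]

# Hypothetical schema-label sequences: parser output vs. gold assembly.
edits = edit_distance(["NP", "VP", "PP"], ["NP", "PP"])
```

Counting edits this way gives a single scalar per sentence, which makes the (c) criterion easy to aggregate across a test set.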