12 research outputs found

    Investigation on advanced image search techniques

    Content-based image search for retrieval of images based on the similarity in their visual contents, such as color, texture, and shape, to a query image is an active research area due to its broad applications. Color, for example, provides powerful information for image search and classification. This dissertation investigates advanced image search techniques and presents new color descriptors for image search and classification and robust image enhancement and segmentation methods for iris recognition. First, several new color descriptors have been developed for color image search. Specifically, a new oRGB-SIFT descriptor, which integrates the oRGB color space and the Scale-Invariant Feature Transform (SIFT), is proposed for image search and classification. The oRGB-SIFT descriptor is further integrated with other color SIFT features to produce the novel Color SIFT Fusion (CSF), the Color Grayscale SIFT Fusion (CGSF), and the CGSF+PHOG descriptors for image category search with applications to biometrics. Image classification is implemented using a novel EFM-KNN classifier, which combines the Enhanced Fisher Model (EFM) and the K Nearest Neighbor (KNN) decision rule. Experimental results on four large scale, grand challenge datasets have shown that the proposed oRGB-SIFT descriptor improves recognition performance upon other color SIFT descriptors, and the CSF, the CGSF, and the CGSF+PHOG descriptors perform better than the other color SIFT descriptors. The fusion of both Color SIFT descriptors (CSF) and Color Grayscale SIFT descriptor (CGSF) shows significant improvement in the classification performance, which indicates that various color-SIFT descriptors and grayscale-SIFT descriptor are not redundant for image search. Second, four novel color Local Binary Pattern (LBP) descriptors are presented for scene image and image texture classification. Specifically, the oRGB-LBP descriptor is derived in the oRGB color space. 
The other three color LBP descriptors, namely, the Color LBP Fusion (CLF), the Color Grayscale LBP Fusion (CGLF), and the CGLF+PHOG descriptors, are obtained by integrating the oRGB-LBP descriptor with some additional image features. Experimental results on three large scale, grand challenge datasets have shown that the proposed descriptors can improve scene image and image texture classification performance. Finally, a new iris recognition method based on a robust iris segmentation approach is presented for improving iris recognition performance. The proposed robust iris segmentation approach applies power-law transformations for more accurate detection of the pupil region, which significantly reduces the candidate limbic boundary search space for increasing detection accuracy and efficiency. As the limbic circle, which has a center within a close range of the pupil center, is selectively detected, the eyelid detection approach leads to improved iris recognition performance. Experiments using the Iris Challenge Evaluation (ICE) database show the effectiveness of the proposed method
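
The power-law transformation mentioned in the abstract above can be illustrated with a short sketch. This is not the dissertation's implementation: the function names, the gamma value, and the darkness threshold are assumptions chosen for illustration. With gamma greater than 1, mid-tones are pushed toward black faster than bright pixels, so a dark pupil-like region becomes easier to separate with a simple threshold.

```python
import numpy as np

def power_law_transform(image, gamma, c=1.0):
    """Apply s = c * r**gamma to an 8-bit grayscale image."""
    r = image.astype(np.float64) / 255.0          # normalise intensities to [0, 1]
    s = c * np.power(r, gamma)
    return np.clip(s * 255.0, 0, 255).round().astype(np.uint8)

def dark_region_mask(image, gamma=2.0, threshold=40):
    """Boolean mask of pixels that stay dark after the transform.

    gamma and threshold are illustrative values, not taken from the source.
    """
    return power_law_transform(image, gamma) < threshold

# Toy "eye": a bright background with a dark 3x3 pupil-like block.
eye = np.full((5, 5), 180, dtype=np.uint8)
eye[1:4, 1:4] = 30
mask = dark_region_mask(eye)
print(int(mask.sum()))  # 9 pixels are flagged as candidate pupil pixels
```

On real iris images the mask would only be a starting point; boundary fitting for the pupil and limbic circles is a separate step that this sketch does not attempt.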

    Investigation of new feature descriptors for image search and classification

    Content-based image search, classification and retrieval is an active and important research area due to its broad applications as well as the complexity of the problem. Understanding the semantics and contents of images for recognition remains one of the most difficult and prevailing problems in the machine intelligence and computer vision community. With large variations in size, pose, illumination and occlusions, image classification is a very challenging task. A good classification framework should address the key issues of discriminatory feature extraction as well as efficient and accurate classification. Towards that end, this dissertation focuses on exploring new image descriptors by incorporating cues from the human visual system, and integrating local, texture, shape as well as color information to construct robust and effective feature representations for advancing content-based image search and classification. Based on the Gabor wavelet transformation, whose kernels are similar to the 2D receptive field profiles of the mammalian cortical simple cells, a series of new image descriptors is developed. Specifically, first, a new color Gabor-HOG (GHOG) descriptor is introduced by concatenating the Histograms of Oriented Gradients (HOG) of the component images produced by applying Gabor filters in multiple scales and orientations to encode shape information. Second, the GHOG descriptor is analyzed in six different color spaces and grayscale to propose different color GHOG descriptors, which are further combined to present a new Fused Color GHOG (FC-GHOG) descriptor. Third, a novel Gabor-PHOG (GPHOG) descriptor is proposed, which improves upon the Pyramid Histograms of Oriented Gradients (PHOG) descriptor, and subsequently a new FC-GPHOG descriptor is constructed by combining the multiple color GPHOG descriptors and employing Principal Component Analysis (PCA). 
Next, the Gabor-LBP (GLBP) descriptor is derived by accumulating the Local Binary Patterns (LBP) histograms of the local Gabor filtered images to encode texture and local information of an image. Furthermore, a novel Gabor-LBP-PHOG (GLP) image descriptor is proposed, which integrates the GLBP and the GPHOG descriptors as a feature set, and an innovative Fused Color Gabor-LBP-PHOG (FC-GLP) descriptor is constructed by fusing the GLP from multiple color spaces. The GLBP and the GHOG descriptors are then combined to produce the Gabor-LBP-HOG (GLH) feature vector, which performs well on different object and scene image categories. The six color GLH vectors are further concatenated to form the Fused Color GLH (FC-GLH) descriptor. Finally, the Wigner based Local Binary Patterns (WLBP) descriptor is proposed, which combines multi-neighborhood LBP, the Pseudo-Wigner distribution of images and the popular bag of words model to effectively classify scene images. To assess the feasibility of the proposed new image descriptors, two classification methods are used: one method applies the PCA and the Enhanced Fisher Model (EFM) for feature extraction and the nearest neighbor rule for classification, while the other method employs the Support Vector Machine (SVM). The classification performance of the proposed descriptors is tested on several publicly available popular image datasets. 
The experimental results show that the proposed new image descriptors achieve image search and classification results better than or on par with other popular image descriptors, such as the Scale Invariant Feature Transform (SIFT), the Pyramid Histograms of visual Words (PHOW), the Pyramid Histograms of Oriented Gradients (PHOG), the Spatial Envelope (SE), the Color SIFT four Concentric Circles (C4CC), the Object Bank (OB), the Context Aware Topic Model (CA-TM), the Hierarchical Matching Pursuit (HMP), the Kernel Spatial Pyramid Matching (KSPM), the SIFT Sparse-coded Spatial Pyramid Matching (Sc-SPM), the Kernel Codebook (KC) and the LBP.
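
The LBP histogram step that several of the descriptors above build on can be sketched as follows. This is a minimal illustration of the basic 8-neighbour Local Binary Pattern only: the Gabor filtering stage is omitted, and the function names and the fusion by plain concatenation are assumptions, not the dissertation's code.

```python
import numpy as np

def lbp_codes(gray):
    """8-neighbour LBP codes for the interior pixels of a 2-D array."""
    g = np.asarray(gray, dtype=np.int32)
    h, w = g.shape
    center = g[1:-1, 1:-1]
    # Neighbour offsets, clockwise starting from the top-left pixel.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros_like(center)
    for bit, (dy, dx) in enumerate(offsets):
        neighbour = g[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        # Set this bit where the neighbour is at least as bright as the centre.
        codes |= (neighbour >= center).astype(np.int32) << bit
    return codes

def lbp_histogram(gray):
    """Normalised 256-bin histogram of LBP codes."""
    hist = np.bincount(lbp_codes(gray).ravel(), minlength=256)
    return hist / hist.sum()

def fused_lbp_descriptor(images):
    """Concatenate LBP histograms of several images (e.g. color channels
    or filter responses). Plain concatenation is an assumed fusion rule."""
    return np.concatenate([lbp_histogram(im) for im in images])
```

Passing the three channels of a color image, or a stack of filter responses, to `fused_lbp_descriptor` yields one 256-bin block per input image.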

    An uncertainty prediction approach for active learning - application to earth observation

    Mapping land cover and land use dynamics is crucial in remote sensing, since farmers are encouraged to either intensify or extend crop use due to the ongoing rise in the world’s population. A major issue in this area is interpreting and classifying a scene captured in high-resolution satellite imagery. Several methods have been put forth, including neural networks, which generate data-dependent models (i.e. the model is biased toward the data), and static rule-based approaches with thresholds, which are limited in terms of diversity (i.e. the model lacks diversity in terms of rules). However, the problem of having a machine learning model that, given a large amount of training data, can classify multiple classes over different geographic Sentinel-2 imagery and outperform existing approaches remains open. On the other hand, supervised machine learning has evolved into an essential part of many areas due to the increasing number of labeled datasets. Examples include creating classifiers for applications that recognize images and voices, anticipate traffic, propose products, act as a virtual personal assistant and detect online fraud, among many more. Since these classifiers are highly dependent on the training datasets, without human interaction or accurate labels, their performance on unseen observations is uncertain. Thus, researchers have attempted to evaluate a number of independent models using a statistical distance. However, the problem of, given a train-test split and classifiers modeled over the train set, identifying a prediction error using the relation between train and test sets remains open. Moreover, while some training data is essential for supervised machine learning, what happens if there is insufficient labeled data? After all, assigning labels to unlabeled datasets is a time-consuming process that may need significant expert human involvement. 
When there are not enough expert manual labels available for the vast amount of openly available data, active learning becomes crucial. However, given a large amount of training and unlabeled datasets, having an active learning model that can reduce the training cost of the classifier and at the same time assist in labeling new data points remains an open problem. From the experimental approaches and findings, the main research contributions, which concentrate on the issue of optical satellite image scene classification, include: building labeled Sentinel-2 datasets with surface reflectance values; proposal of machine learning models for pixel-based image scene classification; proposal of a statistical distance based Evidence Function Model (EFM) to detect misclassification by ML models; and proposal of a generalised sampling approach for active learning that, together with the EFM, enables a way of determining the most informative examples. Firstly, using a manually annotated Sentinel-2 dataset, Machine Learning (ML) models for scene classification were developed and their performance was compared to Sen2Cor, the reference package from the European Space Agency: a micro-F1 value of 84% was attained by the ML model, which is a significant improvement over the corresponding Sen2Cor performance of 59%. Secondly, to quantify the misclassification of the ML models, the Mahalanobis distance-based EFM was devised. This model achieved, for the labeled Sentinel-2 dataset, a micro-F1 of 67.89% for misclassification detection. Lastly, EFM was engineered as a sampling strategy for active learning, leading to an approach that attains the same level of accuracy with only 0.02% of the total training samples when compared to a classifier trained with the full training set. 
With the help of the above-mentioned research contributions, we were able to provide an open-source Sentinel-2 image scene classification package, which consists of ready-to-use Python scripts and an ML model that classifies Sentinel-2 L1C images, generating a 20m-resolution RGB image with the six studied classes (Cloud, Cirrus, Shadow, Snow, Water, and Other) and giving academics a straightforward method for rapidly and effectively classifying Sentinel-2 scene images. Additionally, an active learning approach that uses, as sampling strategy, the observed prediction uncertainty given by EFM will allow labeling only the most informative points to be used as input to build classifiers.
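
The idea of using a Mahalanobis distance to pick informative samples for active learning can be sketched roughly as below. This is not the thesis's Evidence Function Model: it is a simplified stand-in that ranks unlabeled points by their distance from the labeled training distribution, and all function names are hypothetical.

```python
import numpy as np

def mahalanobis_to_training(train, candidates):
    """Mahalanobis distance of each candidate row to the training mean."""
    mean = train.mean(axis=0)
    cov = np.cov(train, rowvar=False)
    cov_inv = np.linalg.pinv(cov)      # pseudo-inverse tolerates a singular covariance
    diff = candidates - mean
    # Quadratic form diff_i^T * cov_inv * diff_i for every row i.
    d2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)
    return np.sqrt(np.maximum(d2, 0.0))

def select_most_informative(train, pool, k):
    """Indices of the k pool samples farthest from the training data
    (an assumed sampling rule: far from the data means more uncertain)."""
    dist = mahalanobis_to_training(train, pool)
    return np.argsort(dist)[::-1][:k]
```

In an active learning loop, the selected indices would be sent for labeling, added to the training set, and the classifier retrained before the next selection round.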

    Linear subspace methods in face recognition

    Despite over 30 years of research, face recognition is still one of the most difficult problems in the field of Computer Vision. The challenge comes from many factors affecting the performance of a face recognition system: noisy input, training data collection, speed-accuracy trade-off, variations in expression, illumination, pose, or ageing. Although relatively successful attempts have been made for special cases, such as frontal faces, no satisfactory methods exist that work under completely unconstrained conditions. This thesis proposes solutions to three important problems: lack of training data, speed-accuracy requirement, and unconstrained environments. The problem of lacking training data has been solved in the worst case: single sample per person. Whitened Principal Component Analysis is proposed as a simple but effective solution. Whitened PCA performs consistently well on multiple face datasets. The speed-accuracy trade-off problem is the second focus of this thesis. Two solutions are proposed to tackle this problem. The first solution is a new feature extraction method called Compact Binary Patterns, which is about three times faster than Local Binary Patterns. The second solution is a multi-patch classifier, which performs much better than a single classifier without compromising speed. Two metric learning methods are introduced to solve the problem of unconstrained face recognition. The first method, called Indirect Neighbourhood Component Analysis, combines the best ideas from Neighbourhood Component Analysis and One-shot learning. The second method, Cosine Similarity Metric Learning, uses Cosine Similarity instead of the more popular Euclidean distance to form the objective function in the learning process. This Cosine Similarity Metric Learning method produces the best result in the literature on the state-of-the-art face dataset: the Labelled Faces in the Wild dataset. 
Finally, a full face verification system based on our real experience taking part in the ICPR 2010 Face Verification contest is described. Many practical points are discussed.
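
Two of the ingredients named in the abstract above, whitened PCA and cosine similarity, can be combined in a small sketch. The function names are assumptions, and this shows only the generic textbook operations, not the thesis's trained metric learning method.

```python
import numpy as np

def fit_whitened_pca(X, n_components):
    """Return the mean, principal directions, and per-component std of X."""
    mean = X.mean(axis=0)
    Xc = X - mean
    # SVD of the centred data: rows of Vt are the principal directions.
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    std = S[:n_components] / np.sqrt(len(X) - 1)
    return mean, Vt[:n_components], std

def transform(X, mean, components, std):
    """Project onto the principal directions and rescale to unit variance."""
    return (X - mean) @ components.T / std

def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

After whitening, every retained component has unit variance on the training data, so no single high-variance direction dominates the similarity score between two projected faces.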

    Robust approaches for face recognition

    This thesis gave answers to a number of important questions regarding face classification. Via this research, new methods were introduced to represent four facial attributes: three related to the demographic information of the human face (gender, age and race) and a fourth related to facial expression. It stated that discriminative facial features regarding demographic information (gender, age and race) and expression information can be obtained by applying texture analysis techniques to polar raster sampled images. In addition, it found that multi-label classification (MLC) is more suitable in the real world, as a human face can be associated with multiple labels.

    Analysis of the differences between Western and Eastern basic facial expressions in automatic facial expression recognition

    Facial Expression Recognition (FER) has been one of the main targets of the well-known Human Computer Interaction (HCI) research field. Recent developments on this topic have attained high recognition rates under controlled and “in-the-wild” environments, overcoming some of the main problems attached to FER systems, such as illumination changes, individual differences, partial occlusion, and so on. However, to the best of the author’s knowledge, all of those proposals have taken for granted the cultural universality of basic facial expressions of emotion. This hypothesis has recently been questioned and to some degree refuted by part of the research community from the psychological viewpoint. In this dissertation, an analysis of the differences between Western-Caucasian (WSN) and East-Asian (ASN) prototypic facial expressions is presented in order to assess the cultural universality from an HCI viewpoint. In addition, a fully automated FER system is proposed for this analysis. This system is based on hybrid features of specific facial regions of forehead, eyes-eyebrows, mouth and nose, which are described by Fourier coefficients calculated individually from appearance and geometric features. The proposal takes advantage of the static structure of individual faces, which are finally classified by Support Vector Machines. The culture-specific analysis consists of automatic facial expression recognition and visual analysis of facial expression images from different standard databases divided into two different cultural datasets. Additionally, a human study conducted with 40 subjects from both ethnic groups is presented as a baseline. Evaluation results aid in identifying culture-specific facial expression differences based on individual and combined facial regions. Finally, two possible solutions for solving these differences are proposed. 
The first one builds on an early ethnicity detection which is based on the extraction of color, shape and texture representative features from each culture. The second approach independently considers the culture-specific basic expressions for the final classification process. In summary, the main contributions of this dissertation are: 1) Qualitative and quantitative analysis of appearance and geometric feature differences between Western-Caucasian and East-Asian facial expressions. 2) A fully automated FER system based on facial region segmentation and hybrid features. 3) The prior considerations for working with multicultural databases on FER. 4) Two possible solutions for FER with multicultural environments. This dissertation is organized as follows. Chapter 1 introduced the motivation, objectives and contributions of this dissertation. Chapter 2 presented, in detail, the background of FER and reviewed the related works from the psychological viewpoint along with the proposals which work with multicultural databases for FER from HCI. Chapter 3 explained the proposed FER method based on facial region segmentation. The automatic segmentation is focused on four facial regions. This proposal is capable of recognizing the six basic expressions by using only one part of the face. Therefore, it is useful for dealing with the problem of partial occlusion. Finally, a modal value approach is proposed for unifying the different results obtained by facial regions of the same face image. Chapter 4 described the proposed fully automated FER method based on Fourier coefficients of hybrid features. This method takes advantage of information extracted from pixel intensities (appearance features) and facial shapes (geometric features) of three different facial regions. Hence, it also overcomes the problem of partial occlusion. 
This proposal is based on a combination of Local Fourier Coefficients (LFC) and Facial Fourier Descriptors (FFD) of appearance and geometric information, respectively. In addition, this method takes into account the effect of the static structure of the faces by subtracting the neutral face from the expressive face at the feature extraction level. Chapter 5 introduced the proposed analysis of differences between Western-Caucasian (WSN) and East-Asian (ASN) basic facial expressions; it consists of FER and visual analysis, which are divided into appearance, geometric and hybrid features. The FER analysis is focused on in- and out-group performance as well as multicultural tests. The proposed human study, which shows cultural differences in perceiving the basic facial expressions, is also described in this chapter. Finally, the two possible solutions for working with multicultural environments are detailed, which are based on an early ethnicity detection and the consideration of previously found culture-specific expressions, respectively. Chapter 6 drew the conclusion and the future works of this research. 電気通信大学201

    Facial expression recognition and intensity estimation.

    Doctoral Degree. University of KwaZulu-Natal, Durban. Facial expression is one of the profound non-verbal channels through which human emotional state is inferred from the deformation or movement of face components when facial muscles are activated. Facial Expression Recognition (FER) is one of the relevant research fields in Computer Vision (CV) and Human-Computer Interaction (HCI). Its applications include, but are not limited to, robotics, games, medicine, education, security and marketing. FER involves a wealth of information. Categorising the information into primary emotion states only limits its performance. This thesis investigates an approach that simultaneously predicts the emotional state of facial expression images and the corresponding degree of intensity. The task also extends to resolving the ambiguous nature of FER and annotation inconsistencies with a label distribution learning method that considers correlation among data. We first proposed a multi-label approach for FER and its intensity estimation using advanced machine learning techniques. According to our findings, this approach has not been considered for emotion and intensity estimation in the field before. The approach used problem transformation to present FER as a multilabel task, such that every facial expression image has unique emotion information alongside the corresponding degree of intensity at which the emotion is displayed. A Convolutional Neural Network (CNN) with a sigmoid function at the final layer is the classifier for the model. The model, termed ML-CNN (Multilabel Convolutional Neural Network), successfully achieves concurrent prediction of emotion and intensity estimation. ML-CNN prediction is challenged by overfitting and by intraclass and interclass variations. 
We employ the Visual Geometry Group-16 (VGG-16) pretrained network to resolve the overfitting challenge, and the aggregation of island loss and binary cross-entropy loss to minimise the effect of intraclass and interclass variations. The enhanced ML-CNN model shows promising results and better performance than other standard multilabel algorithms. Finally, we approach data annotation inconsistency and ambiguity in FER data using isomap manifold learning with Graph Convolutional Networks (GCN). The GCN uses the distance along the isomap manifold as the edge weight, which appropriately models the similarity between adjacent nodes for emotion predictions. The proposed method produces a promising result in comparison with the state-of-the-art methods. The author's list of publications is on page xi of this thesis.
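
The multilabel formulation described above, a sigmoid at the final layer trained with binary cross-entropy, can be sketched without a deep learning framework. The function names and the 0.5 decision threshold are assumptions for illustration, and the island loss term mentioned in the abstract is not included.

```python
import numpy as np

def sigmoid(z):
    """Element-wise logistic function: one independent probability per label."""
    return 1.0 / (1.0 + np.exp(-z))

def binary_cross_entropy(logits, targets, eps=1e-12):
    """Mean binary cross-entropy over all labels and samples."""
    p = np.clip(sigmoid(logits), eps, 1.0 - eps)   # avoid log(0)
    return float(-np.mean(targets * np.log(p) + (1 - targets) * np.log(1 - p)))

def predict_labels(logits, threshold=0.5):
    """Turn logits into a 0/1 label vector; several labels may be active."""
    return (sigmoid(logits) >= threshold).astype(int)
```

Because each output is thresholded independently, one image can receive both an emotion label and an intensity label in the same prediction, which a softmax over mutually exclusive classes would not allow.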

    A VISION-BASED QUALITY INSPECTION SYSTEM FOR FABRIC DEFECT DETECTION AND CLASSIFICATION

    Published Thesis. Quality inspection of textile products is an important issue for fabric manufacturers. It is desirable to produce the highest quality goods in the shortest amount of time possible. Fabric faults or defects are responsible for nearly 85% of the defects found by the garment industry. Manufacturers recover only 45 to 65% of their profits from second or off-quality goods. There is a need for reliable automated woven fabric inspection methods in the textile industry. Numerous methods have been proposed for detecting defects in textiles. The methods are generally grouped into three main categories according to the techniques they use for texture feature extraction, namely statistical approaches, spectral approaches and model-based approaches. In this thesis, we study one method from each category and propose their combinations in order to get improved fabric defect detection and classification accuracy. The three chosen methods are the grey level co-occurrence matrix (GLCM) from the statistical category, the wavelet transform from the spectral category and the Markov random field (MRF) from the model-based category. We identify the most effective texture features for each of those methods and for different fabric types in order to combine them. Using GLCM, we identify the optimal number of features, the optimal quantisation level of the original image and the optimal intersample distance to use. We identify the optimal GLCM features for different types of fabrics and for three different classifiers. Using the wavelet transform, we compare the defect detection and classification performance of features derived from the undecimated discrete wavelet transform and those derived from the dual-tree complex wavelet transform. We identify the best features for different types of fabrics. 
Using the Markov random field, we study the performance for fabric defect detection and classification of features derived from different models of Gaussian Markov random fields of order from 1 through 9. For each fabric type we identify the best model order. Finally, we propose three combination schemes of the best features identified from the three methods and study their fabric defect detection and classification performance. They lead generally to improved performance as compared to the individual methods, but two of them need further improvement.
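
The GLCM parameters discussed above (quantisation level and intersample distance) are easier to picture with a small sketch. The function names are assumptions, the matrix is made symmetric by convention, and only the contrast feature is shown; the thesis's actual feature set and settings are not reproduced here.

```python
import numpy as np

def quantise(image, levels):
    """Map 8-bit intensities onto `levels` grey levels (0 .. levels-1)."""
    img = np.asarray(image, dtype=np.float64)
    return np.floor(img * levels / 256.0).astype(np.intp)

def glcm(image, levels, dy=0, dx=1):
    """Normalised symmetric co-occurrence matrix for a non-negative
    offset (dy, dx); the offset length is the intersample distance."""
    img = np.asarray(image, dtype=np.intp)
    h, w = img.shape
    ref = img[:h - dy, :w - dx]        # reference pixels
    nbr = img[dy:, dx:]                # their neighbours at the offset
    counts = np.zeros((levels, levels), dtype=np.float64)
    np.add.at(counts, (ref.ravel(), nbr.ravel()), 1.0)
    counts += counts.T                 # count each pair in both directions
    return counts / counts.sum()

def contrast(p):
    """GLCM contrast: sum of p(i, j) * (i - j)**2."""
    i, j = np.indices(p.shape)
    return float(np.sum(p * (i - j) ** 2))
```

For example, a patch of vertical stripes `[[0, 1], [0, 1]]` gives a horizontal-offset contrast of 1.0, while horizontal stripes `[[0, 0], [1, 1]]` give 0.0 for the same offset.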

    Document Classification in Support of Automated Metadata Extraction From Heterogeneous Collections

    A number of federal agencies, universities, laboratories, and companies are placing their documents online and making them searchable via metadata fields such as author, title, and publishing organization. To enable this, every document in the collection must be catalogued using the metadata fields. Though time consuming, the task of identifying metadata fields by inspecting the document is easy for a human. The visual cues in the formatting of the document along with accumulated knowledge and intelligence make it easy for a human to identify various metadata fields. Even with the best possible automated procedures, numerous sources of error exist, including some that cannot be controlled, such as scanned documents with text obscured by smudges, signatures, or stamps. A commercially viable process for metadata extraction must remain robust in the presence of these external sources of error as well as in the face of the uncertainty that accompanies any attempts to automate intelligent behavior. While extraction accuracy and completeness must be the primary goal of an extraction system, the ability to detect and report questionable results is equally important for a production quality system, since it promotes confidence in the system. We have developed and demonstrated a novel system for extracting metadata. First, a document is examined in an attempt to recognize it as an instance of a known document layout. Then a template, a scripted description of how to associate blocks of text in the layout with metadata fields, is applied to the document to extract the metadata. The extraction is validated after post-processing to evaluate the quality of the extraction and, if necessary, to flag untrusted extractions for human recognition. The success or failure of the template approach is directly tied to document classification, which is the ability to match the document to the proper template correctly and consistently. 
Document classification in our system is implemented as a module which applies every template available in the system to a document to find candidate templates that extract any data at all. The candidate templates are evaluated by a validation module to select the best performing template. This method is called post hoc classification. Post hoc classification is not only effective at selecting the correct class but it also excels at minimizing false positives. It is, however, very sensitive to changes in the template collection and to poorly written templates. While this dissertation examines the evolution and all the major components of an automated metadata extraction system, the primary focus is on the problem of document classification. The main thrust of my research has been investigating alternative methods of document classification to replace or supplement post hoc classification. I experimented with machine learning techniques as an additional input factor for the post hoc classification script or the final validation script
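
The post hoc classification loop described above (apply every template, keep those that extract something, let a validator pick the best) can be sketched as follows. Everything here is hypothetical: the two regex templates, the validator that scores the fraction of expected fields, and the `min_score` cut-off stand in for the system's scripted templates and validation module, whose details the abstract does not give.

```python
import re

def title_author_template(doc):
    """Hypothetical template: expects 'Title:' and 'Author:' on adjacent lines."""
    m = re.search(r"^Title: (.+)\nAuthor: (.+)$", doc, re.MULTILINE)
    return {"title": m.group(1), "author": m.group(2)} if m else {}

def report_number_template(doc):
    """Hypothetical template: extracts only a report number."""
    m = re.search(r"Report No\. (\S+)", doc)
    return {"report_number": m.group(1)} if m else {}

def fraction_of_expected_fields(fields, expected=("title", "author")):
    """Hypothetical validator: share of expected fields that were filled."""
    return sum(1 for key in expected if fields.get(key)) / len(expected)

def post_hoc_classify(document, templates, validate, min_score=0.5):
    """Apply every template, validate each candidate, return the best one.

    Returns (template_name, fields, score); template_name is None when no
    candidate reaches min_score, which flags the document for human review.
    """
    best_name, best_fields, best_score = None, {}, 0.0
    for name, template in templates.items():
        fields = template(document)
        if not fields:                 # extracted nothing: not a candidate
            continue
        score = validate(fields)
        if score > best_score:
            best_name, best_fields, best_score = name, fields, score
    if best_score < min_score:
        return None, best_fields, best_score
    return best_name, best_fields, best_score
```

Because every template runs on every document, the cost grows with the size of the template collection, and a single loosely written template can become a spurious candidate, which matches the sensitivity the abstract describes.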