
    Adaptive Methods for Robust Document Image Understanding

    A vast amount of digital document material is continuously being produced as part of major digitization efforts around the world. In this context, generic and efficient automatic solutions for document image understanding represent a pressing necessity. We propose a generic framework for document image understanding systems, usable for practically any document type available in digital form. Following the introduced workflow, we shift our attention to each of the following processing stages in turn: quality assurance, image enhancement, color reduction and binarization, skew and orientation detection, page segmentation, and logical layout analysis. We review the state of the art in each area, identify current deficiencies, point out promising directions, and give specific guidelines for future investigation. We address some of the identified issues by means of novel algorithmic solutions, putting special focus on generality, computational efficiency, and the exploitation of all available sources of information. More specifically, we introduce the following original methods: fully automatic detection of color reference targets in digitized material, accurate foreground extraction from color historical documents, font enhancement for hot-metal typeset prints, a theoretically optimal solution to the document binarization problem from both the computational-complexity and the threshold-selection points of view, layout-independent skew and orientation detection, a robust and versatile page segmentation method, a semi-automatic front page detection algorithm, and a complete framework for article segmentation in periodical publications. The proposed methods are experimentally evaluated on large datasets consisting of real-life heterogeneous document scans. The obtained results show that a document understanding system combining these modules is able to robustly process a wide variety of documents with good overall accuracy.
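
    As a concrete illustration of the binarization stage discussed above, the sketch below implements Otsu's classic between-class-variance threshold selection in Python/NumPy. This is a standard global-threshold baseline, not the theoretically optimal binarization method proposed in the work, whose details the abstract does not give.

        import numpy as np

        def otsu_threshold(gray):
            """Return the gray level maximizing between-class variance (Otsu)."""
            hist, _ = np.histogram(gray, bins=256, range=(0, 256))
            probs = hist / gray.size
            omega = np.cumsum(probs)                # class-0 probability per threshold
            mu = np.cumsum(probs * np.arange(256))  # cumulative mean gray level
            mu_total = mu[-1]
            # between-class variance for every candidate threshold
            with np.errstate(divide="ignore", invalid="ignore"):
                sigma_b = (mu_total * omega - mu) ** 2 / (omega * (1.0 - omega))
            return int(np.argmax(np.nan_to_num(sigma_b)))

        def binarize(gray):
            """Map an 8-bit grayscale page image to 1 = background, 0 = ink."""
            return (gray > otsu_threshold(gray)).astype(np.uint8)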

    Semantic integration of geospatial concepts - a study on land use land cover classification systems

    In GI Science, one of the most pressing interoperability needs concerns land use and land cover (LULC) data, because such data are key to evaluating LULC's many environmental impacts throughout the globe (Foley et al. 2005). Accordingly, this research aims to address the interoperability of LULC information derived by different authorities using different classificatory approaches. LULC data are described by LULC classification systems, so the interoperability of LULC data hinges on the semantic integration of those classification systems. Existing work on semantically integrating LULC classification systems has a major drawback: finding comparable semantic representations from textual descriptions. To tackle this problem, we borrowed the document-comparison methods of information retrieval and applied them to comparing LULC category names and descriptions. The results showed significant improvement compared to previous work. However, lexical semantic methods cannot resolve two semantic heterogeneities in LULC classification systems: the confounding conflict, where LULC categories with similar labels and descriptions have different LULC status in reality, and the naming conflict, where LULC categories with different labels represent similar LULC types. Without confirmation of actual land cover status from remote sensing, lexical semantic methods cannot achieve reliable matching. To discover confounding conflicts and reconcile naming conflicts, we developed an innovative method that applies remote sensing to the integration of LULC classification systems. Remote sensing provides observations of the actual LULC status of individual parcels. We calculated parcel-level statistics from spectral and textural data and used these statistics to compute category similarity. The matching results showed that this approach fulfilled its goal: it overcame semantic heterogeneities and achieved more reliable and accurate matching between LULC classifications in the majority of cases. To overcome the limitations of either method alone, we combined the two by aggregating their output similarities and achieved better integration. LULC categories that show noticeable differences between lexical semantics and remote sensing once again remind us of the semantic heterogeneities in LULC classification systems that must be overcome before LULC data from different sources become interoperable and serve as the key to understanding our highly interrelated Earth system.
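
    The information-retrieval flavor of the category comparison can be sketched briefly: represent each category description as a TF-IDF vector and score pairs by cosine similarity. The two toy classification systems below are invented for illustration; the study's actual categories, and its aggregation with remote-sensing similarity, are not reproduced here.

        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.metrics.pairwise import cosine_similarity

        # Hypothetical category descriptions from two LULC classification systems.
        system_a = {
            "Deciduous Forest": "areas dominated by trees that shed foliage seasonally",
            "Cropland": "areas used for the production of annual crops",
        }
        system_b = {
            "Broadleaf Woodland": "tree-covered land with broadleaved species that shed leaves",
            "Arable Land": "land under annual crops, including fallow",
        }

        docs = list(system_a.values()) + list(system_b.values())
        tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)
        n_a = len(system_a)
        sim = cosine_similarity(tfidf[:n_a], tfidf[n_a:])  # rows: A, columns: B

        for i, name_a in enumerate(system_a):
            for j, name_b in enumerate(system_b):
                print(f"{name_a} <-> {name_b}: {sim[i, j]:.2f}")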

    Character Recognition

    Character recognition is one of the pattern recognition technologies most widely used in practical applications. This book presents recent advances relevant to character recognition, from technical topics such as image processing, feature extraction, and classification to new applications including human-computer interfaces. The goal of this book is to provide a reference source for academic research and for professionals working in the character recognition field.

    Chemometric Curve Resolution for Quantitative Liquid Chromatographic Analysis

    In chemical analyses, it is crucial to distinguish between chemical species. This is often accomplished via chromatographic separations. These separations are often pushed to their limits in terms of the number of analytes that can be sufficiently resolved from one another, particularly when a quantitative analysis of these compounds is needed. Very often, complicated methods or new technology are required to provide adequate separation of samples arising from fields such as metabolomics, environmental science, and food analysis. An often overlooked means of improving analysis is the use of chemometric data analysis techniques. In particular, chemometric curve resolution techniques can mathematically resolve analyte signals that are overlapped in the instrumental data, facilitating quantitation, pattern recognition, or any other desired analyses. Unfortunately, these methods have seen little use outside of traditionally chemometrics-focused research groups. In this dissertation, we attempt to show the utility of one of these methods, multivariate curve resolution-alternating least squares (MCR-ALS), for liquid chromatography, as well as its application to more advanced separation techniques. First, a general characterization of the performance of MCR-ALS for the analysis of liquid chromatography-diode array detection (LC-DAD) data is carried out. It is shown that under a wide range of conditions (low chromatographic resolution, low signal-to-noise ratio, and high similarity between analyte spectra), MCR-ALS is able to increase the number of quantitatively analyzable peaks, up to five-fold in many cases. Second, a novel methodology for MCR-ALS analysis of comprehensive two-dimensional liquid chromatography (LC x LC) is described. This method, called two-dimensional assisted liquid chromatography (2DALC), aims to improve quantitation in LC x LC by combining the advantages of both one-dimensional and two-dimensional chromatographic data. We show that 2DALC can provide superior quantitation to both LC x LC and one-dimensional LC under certain conditions. Finally, we apply MCR-ALS to an LC x LC analysis of fourteen furanocoumarins in three apiaceous vegetables and determine the optimal implementation of MCR-ALS and subsequent integration. For these data, simply performing MCR-ALS on the two-dimensional chromatogram and manually integrating the results proved to be the superior method. These results demonstrate the usefulness of these curve resolution techniques as a complement to advanced chromatographic techniques.
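
    At its core, MCR-ALS alternates two least-squares fits of the bilinear model D ~ C S^T, where D holds the measured chromatographic data, C the concentration profiles, and S the spectra. The sketch below shows a bare-bones iteration with non-negativity imposed by simple clipping; production implementations use proper non-negative least squares, additional constraints, and convergence tests, and the initial guess C0 would come from, e.g., evolving factor analysis.

        import numpy as np

        def mcr_als(D, C0, n_iter=50):
            """Alternating least squares for D ~ C @ S.T.
            D: (times x wavelengths) data; C0: initial (times x k) profiles."""
            C = C0.copy()
            for _ in range(n_iter):
                S = np.linalg.lstsq(C, D, rcond=None)[0].T    # spectra given profiles
                S = np.clip(S, 0.0, None)                     # crude non-negativity
                C = np.linalg.lstsq(S, D.T, rcond=None)[0].T  # profiles given spectra
                C = np.clip(C, 0.0, None)
            return C, S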

    Extracting product development intelligence from web reviews

    Product development managers are constantly challenged to learn what the consumer product experience really is, and specifically how the product is performing in the field. Traditionally, they have used methods such as prototype testing, customer quality monitoring instruments, field testing with sample customers, and independent assessment companies. These methods are limited in that (i) the number of customer evaluations is small, and (ii) they are driven by a restrictive structured format. Today the web has created a new source of product intelligence: unsolicited reviews from actual product users, posted across hundreds of websites. The basic hypothesis of this research is that web reviews contain a significant amount of information that is of value to the product design community. This research developed the DFOC (Design-Feature-Opinion-Cause Relationship) method for integrating the evaluation of unstructured web reviews into the structured product design process. The key data element in this research is a Web review and its associated opinion polarity (positive, negative, or neutral). Hundreds of Web reviews are collected to form a review database representing a population of customers. The DFOC method (a) identifies a set of design features that are of interest to the product design community, (b) mines the Web review database to identify which features are significant to customer evaluations, (c) extracts and estimates the sentiment or opinion of the set of significant features, and (d) identifies the likely cause of the customer opinion. To support the DFOC method, we develop an association-rule-based opinion mining procedure for capturing and extracting noun-verb-adjective relationships in the Web review database. This procedure exploits existing opinion mining methods to deconstruct the Web reviews and capture feature-opinion pair polarity. A Design Level Information Quality (DLIQ) measure is introduced that evaluates three components: (a) Content, (b) Complexity, and (c) Relevancy. DLIQ is indicative of the content, complexity, and relevancy of the design contextual information that can be extracted from an analysis of Web reviews for a given product. Application of this measure confirms the hypothesis that significant levels of quality design information can be efficiently extracted from Web reviews for a wide variety of product types. Application of the DFOC method and the DLIQ measure to a wide variety of product classes (electronic, automobile, service domain) is demonstrated. Specifically, Web review databases for ten products/services are created from real data. Validation occurs by analyzing and presenting the extracted product design information. Examples of extracted features and feature-cause associations for negative-polarity opinions are shown along with the observed significance.
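
    To make the feature-opinion pairing concrete, here is a deliberately small sketch that matches hypothetical design features to nearby polarity-bearing opinion words. The feature set and polarity lexicon are invented for the example; the actual DFOC procedure mines association rules over noun-verb-adjective relationships rather than using a fixed token window.

        # Toy lexicons; a real system would learn these from the review database.
        POLARITY = {"great": 1, "sharp": 1, "slow": -1, "dim": -1}
        FEATURES = {"battery", "screen", "shutter"}  # hypothetical design features

        def extract_pairs(review):
            tokens = review.lower().replace(",", " ").replace(".", " ").split()
            pairs = []
            for i, tok in enumerate(tokens):
                if tok not in FEATURES:
                    continue
                # pick the closest opinion word within a +/-3 token window
                window = range(max(0, i - 3), min(len(tokens), i + 4))
                hits = sorted((abs(j - i), tokens[j]) for j in window
                              if tokens[j] in POLARITY)
                if hits:
                    _, word = hits[0]
                    pairs.append((tok, word, POLARITY[word]))
            return pairs

        print(extract_pairs("The screen is sharp but the battery is slow to charge."))
        # [('screen', 'sharp', 1), ('battery', 'slow', -1)]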

    Information retrieval and text mining technologies for chemistry

    Efficient access to chemical information contained in scientific literature, patents, technical reports, or the web is a pressing need shared by researchers and patent attorneys from different chemical disciplines. Retrieval of important chemical information in most cases starts with finding relevant documents for a particular chemical compound or family. Targeted retrieval of chemical documents is closely connected to the automatic recognition of chemical entities in the text, which commonly involves the extraction of the entire list of chemicals mentioned in a document, including any associated information. In this Review, we provide a comprehensive and in-depth description of fundamental concepts, technical implementations, and current technologies for meeting these information demands. A strong focus is placed on community challenges addressing system performance, in particular the CHEMDNER and CHEMDNER-patents tasks of BioCreative IV and V, respectively. Considering the growing interest in the construction of automatically annotated chemical knowledge bases that integrate chemical information and biological data, cheminformatics approaches for mapping the extracted chemical names onto chemical structures and their subsequent annotation, together with text mining applications for linking chemistry with biological information, are also presented. Finally, future trends and current challenges are highlighted as a roadmap proposal for research in this emerging field.

    A.V. and M.K. acknowledge funding from the European Community's Horizon 2020 Program (project reference: 654021 - OpenMinted). M.K. additionally acknowledges the Encomienda MINETAD-CNIO as part of the Plan for the Advancement of Language Technology. O.R. and J.O. thank the Foundation for Applied Medical Research (FIMA), University of Navarra (Pamplona, Spain). This work was partially funded by Consellería de Cultura, Educación e Ordenación Universitaria (Xunta de Galicia), and FEDER (European Union), and the Portuguese Foundation for Science and Technology (FCT) under the scope of the strategic funding of the UID/BIO/04469/2013 unit and COMPETE 2020 (POCI-01-0145-FEDER-006684). We thank Iñigo García-Yoldi for useful feedback and discussions during the preparation of the manuscript.
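
    The entity-recognition step surveyed in the Review can be caricatured with a small dictionary matcher like the one below; the chemical names are illustrative, and CHEMDNER-grade systems rely on statistical sequence models and far richer resources, but the input-output shape, mentions with character offsets, is the same.

        import re

        # Toy chemical dictionary; real systems combine curated lexicons with learned models.
        CHEMICALS = {"aspirin", "acetylsalicylic acid", "ibuprofen", "caffeine"}
        # Longest-first alternation so multi-word names win over substrings.
        pattern = re.compile(
            r"\b(" + "|".join(sorted(map(re.escape, CHEMICALS), key=len, reverse=True)) + r")\b",
            re.IGNORECASE,
        )

        def tag_chemicals(text):
            """Return (mention, start, end) tuples for dictionary hits."""
            return [(m.group(0), m.start(), m.end()) for m in pattern.finditer(text)]

        print(tag_chemicals("Aspirin (acetylsalicylic acid) inhibits COX enzymes."))
        # [('Aspirin', 0, 7), ('acetylsalicylic acid', 9, 29)]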

    Techniques for document image processing in compressed domain

    The main objective of image compression is usually considered to be the minimization of storage space. However, as the need to frequently access images increases, it is becoming more important to process the compressed representation directly. In this work, techniques that can be applied directly and efficiently to digital information encoded by a given compression algorithm are investigated. Lossless compression schemes and information processing algorithms for binary document images and text data are two closely related areas, bridged together by the fast processing of coded data. The compressed domains addressed in this work, the ITU fax standards and the JBIG standard, are two major schemes used for document compression. Based on ITU Group IV, a modified coding scheme, MG4, which exploits the two-dimensional correlation between scan lines, is developed. The MG4 coding principle and its feature-preserving behavior in the compressed domain are investigated and examined from the viewpoints of compression efficiency and processing flexibility of image operations. Two popular coding schemes in the area of bi-level image compression, run-length and Group IV, are studied and compared with MG4 in three aspects: compression complexity, compression ratio, and feasibility of compressed-domain algorithms. In particular, for the operations of connected component extraction, skew detection, and rotation, MG4 shows a significant speed advantage over conventional algorithms. Some useful techniques for processing JBIG-encoded images directly in the compressed domain, or concurrently while they are being decoded, are proposed and generalized.

    In the second part of this work, the possibility of facilitating image processing in the wavelet transform domain is investigated. Textured images can be distinguished from each other by examining their wavelet transforms. The basic idea is that highly textured regions can be segmented using feature vectors extracted from the high-frequency bands. This rests on the observation that textured images have large energies in both the high and middle frequencies, while images in which the grey level varies smoothly are dominated by the low-frequency channels of the wavelet transform domain. As a result, a new method is developed and implemented to detect textures and abnormalities in document images using polynomial wavelets. Segmentation experiments indicate that this approach is superior to other traditional methods in terms of memory space and processing time.
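
    The subband-energy criterion behind the texture segmentation can be sketched as follows, using PyWavelets with an off-the-shelf Daubechies wavelet in place of the polynomial wavelets developed in the work; the point is the feature vector, one energy per subband, on which smooth and textured blocks separate.

        import numpy as np
        import pywt

        def wavelet_energy_features(block):
            """Energy per subband of a 2-level 2-D wavelet decomposition.
            Textured blocks put energy into the detail (high/middle frequency)
            subbands; smooth blocks are dominated by the approximation band."""
            coeffs = pywt.wavedec2(block, "db2", level=2)
            feats = [np.sum(coeffs[0] ** 2)]                 # approximation energy
            for level in coeffs[1:]:
                feats.extend(np.sum(d ** 2) for d in level)  # LH, HL, HH details
            return np.array(feats) / block.size

        smooth = np.ones((32, 32))
        textured = np.random.default_rng(0).normal(size=(32, 32))
        print(wavelet_energy_features(smooth))    # energy concentrated in feats[0]
        print(wavelet_energy_features(textured))  # energy spread into detail bands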