127 research outputs found

    Entity Local Structure Graph Matching for Mislabeling Correction

    This paper proposes an entity local structure comparison approach based on inexact subgraph matching, whose results are used to correct mislabeling in the local structure. A local structure is a set of entity attribute labels that are physically close in a document image. It is modeled as an attributed graph in which the nodes describe the content and presentation features of the labels and the arcs describe their geometrical features. A local structure graph is matched against a structure model, which represents a set of local structure model graphs. The structure model is initially built from a set of well-chosen local structures using a graph clustering algorithm and is then updated incrementally. The subgraph matching uses a specific cost function that integrates the feature dissimilarities. The matched model graph is used to recover missed labels, prune extraneous ones, and correct erroneous label fields in the local structure. Evaluated on 525 local structures extracted from 200 business documents, the structure comparison approach achieves about 90% recall and 95% precision; the mislabeling correction rates in these local structures vary between 73% and 100%.
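The core idea, a matching cost that combines node (label) dissimilarities with structural (edge) mismatches, can be sketched as follows. This is a minimal brute-force illustration under assumed unit costs, not the paper's actual cost function or matching algorithm:

```python
import itertools

def node_cost(a, b):
    # Dissimilarity between attribute labels: 0 if equal, 1 otherwise
    # (the paper integrates richer content/presentation features here).
    return 0.0 if a == b else 1.0

def match_cost(g1_nodes, g1_edges, g2_nodes, g2_edges):
    """Brute-force inexact matching cost between two small attributed graphs.

    g*_nodes: list of node labels; g*_edges: set of (i, j) index pairs.
    Returns the minimum total cost over all injective mappings of g1 into g2.
    Assumes len(g1_nodes) <= len(g2_nodes); only feasible for tiny graphs.
    """
    best = float("inf")
    for perm in itertools.permutations(range(len(g2_nodes)), len(g1_nodes)):
        cost = sum(node_cost(g1_nodes[i], g2_nodes[perm[i]])
                   for i in range(len(g1_nodes)))
        # Penalize geometric-structure (edge) mismatches between mapped graphs.
        for (i, j) in g1_edges:
            if (perm[i], perm[j]) not in g2_edges and \
               (perm[j], perm[i]) not in g2_edges:
                cost += 1.0
        best = min(best, cost)
    return best
```

A low matching cost against a model graph indicates that the local structure conforms to that model, so label differences at matched nodes point to candidate mislabelings.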

    Inexact graph matching for entity recognition in OCRed documents

    This paper proposes an entity recognition system for document images recognized by OCR. The system is based on a graph matching technique and is guided by a database whose records describe the entities. The input is a document labeled with entity attributes. A first grouping of these labels, based on a scoring function, yields a set of candidate entities. Entity labels that are locally close are modeled as a structure graph, which is matched against model graphs learned for this purpose. The graph matching technique relies on a specific cost function that integrates the feature dissimilarities. The matching results are used to correct mislabeling errors and then to validate the entity recognition task. Evaluated on three datasets covering different kinds of entities, the system reaches between 88.3% and 95% recall and between 94.3% and 95.7% precision.
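The first step, grouping labels and selecting candidate entities by scoring them against database records, can be illustrated with a simple fraction-of-matching-fields score. The scoring function and threshold below are hypothetical placeholders, not the one defined in the paper:

```python
def candidate_score(record, labels):
    """Score a database record against the labels recognized in the document.

    record: dict field -> value; labels: dict field -> recognized text.
    Hypothetical score: the fraction of record fields whose value matches
    a recognized label.
    """
    if not record:
        return 0.0
    hits = sum(1 for field, value in record.items() if labels.get(field) == value)
    return hits / len(record)

def select_candidates(records, labels, threshold=0.5):
    # Keep records whose score passes the threshold, best first; the
    # surviving candidates are then verified by structure graph matching.
    scored = [(candidate_score(r, labels), r) for r in records]
    return [r for s, r in sorted(scored, key=lambda x: -x[0]) if s >= threshold]
```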

    Semantic Label and Structure Model based Approach for Entity Recognition in Database Context

    This paper proposes an entity recognition approach for scanned documents that refers to entity descriptions in database records. First, the document fields corresponding to database record values are labeled. Second, entities are identified by their labels and ranked using a TF/IDF-based score. For each entity, locally close labels are grouped into a graph, which is matched, using a specific cost function, against a graph model (the structure model) representing the geometric structure of local entity labels. This model is trained on a set of well-chosen, semi-automatically annotated entities. Finally, a correction step fixes possible entity mislabeling using geometrical relationships between labels. The evaluation on 200 business documents containing 500 entities reaches about 93% recall and 97% precision.
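The TF/IDF-based ranking of candidate entities can be sketched as below. The exact weighting used in the paper is not given here; this is a standard smoothed TF-IDF formulation shown for illustration only:

```python
import math

def tf_idf_score(entity_labels, document_labels, corpus):
    """Rank a candidate entity by a TF/IDF-style score over its labels.

    entity_labels: labels of one candidate entity; document_labels: all
    labels found in the current document; corpus: list of label sets,
    one per document. (Illustrative formulation, not the paper's exact one.)
    """
    n_docs = len(corpus)
    score = 0.0
    for lab in entity_labels:
        tf = document_labels.count(lab) / max(len(document_labels), 1)
        df = sum(1 for doc in corpus if lab in doc)
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0  # smoothed IDF
        score += tf * idf
    return score
```

Entities whose labels are both frequent in the document and discriminative across the corpus rank highest, and the top-ranked candidates proceed to graph matching.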

    Adaptive Methods for Robust Document Image Understanding

    A vast amount of digital document material is continuously being produced as part of major digitization efforts around the world. In this context, generic and efficient automatic solutions for document image understanding are a stringent necessity. We propose a generic framework for document image understanding systems, usable for practically any document type available in digital form. Following the introduced workflow, we address each processing stage in turn: quality assurance, image enhancement, color reduction and binarization, skew and orientation detection, page segmentation, and logical layout analysis. We review the state of the art in each area, identify current deficiencies, point out promising directions, and give specific guidelines for future investigation. We address some of the identified issues with novel algorithmic solutions, placing special focus on generality, computational efficiency, and the exploitation of all available sources of information. More specifically, we introduce the following original methods: fully automatic detection of color reference targets in digitized material, accurate foreground extraction from color historical documents, font enhancement for hot-metal typeset prints, a theoretically optimal solution to the document binarization problem from both a computational-complexity and a threshold-selection point of view, layout-independent skew and orientation detection, a robust and versatile page segmentation method, a semi-automatic front-page detection algorithm, and a complete framework for article segmentation in periodical publications. The proposed methods are experimentally evaluated on large datasets of real-life heterogeneous document scans. The results show that a document understanding system combining these modules can robustly process a wide variety of documents with good overall accuracy.
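To make the binarization stage concrete, threshold selection for a global binarization can be illustrated with the classic Otsu method, which picks the threshold maximizing between-class variance of the grayscale histogram. This is a standard textbook algorithm shown for illustration; it is not the specific optimal solution proposed in the thesis:

```python
def otsu_threshold(pixels, levels=256):
    """Return the global threshold maximizing between-class variance
    (Otsu's method) for a flat list of integer gray values in [0, levels).
    Pixels with value <= threshold form one class, the rest the other.
    """
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))
    sum_bg, w_bg, best_t, best_var = 0.0, 0, 0, -1.0
    for t in range(levels):
        w_bg += hist[t]          # weight of the "background" class
        if w_bg == 0:
            continue
        w_fg = total - w_bg      # weight of the "foreground" class
        if w_fg == 0:
            break
        sum_bg += t * hist[t]
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        var_between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t
```

A single pass over the histogram suffices, which is why histogram-based threshold selection is attractive from a computational-complexity point of view.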

    Collaborative Cross Graphical Genome

    Reference genomes are the foundation of most bioinformatics pipelines. They are conventionally represented as a set of single-sequence assembled contigs, referred to as linear genomes. The rapid growth of sequencing technologies has driven the advent of pangenomes, which integrate multiple genome assemblies in a single representation. Graphs are commonly used in pangenome models; however, graph-based pangenome representations and operations pose challenges. This dissertation introduces methods for reference pangenome construction and genomic feature annotation, and tools for analyzing population-scale sequence data based on a graphical pangenome model. We first develop a genome registration tool that constructs a reference pangenome model by merging multiple linear genome assemblies and their annotations into a graphical genome. Second, we develop a graph-based coordinate framework and discuss strategies for referring to, annotating, and comparing genomic features in a graphical pangenome model. We demonstrate that the graph coordinate system simplifies assembly and annotation updates by identifying and segmenting updated sequences in a specific genomic region. Third, we develop an alignment-free method to analyze population-scale sequence data based on a pangenome model. We demonstrate these methods by constructing pangenome models for a mouse genetic reference population, the Collaborative Cross. The pangenome framework proposed in this dissertation simplifies the maintenance and management of massive genomic data and establishes a novel data structure for analyzing, visualizing, and comparing genomic features in an intra-specific population. Doctor of Philosophy.
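The idea of a graph-based coordinate framework can be sketched with a minimal lookup that maps a linear coordinate on one assembly's path through the graph to a (node, offset) graph coordinate. The names and representation below are hypothetical, not the dissertation's actual data model:

```python
def linear_to_graph(path, node_lengths, pos):
    """Convert a 0-based linear coordinate on one assembly into a graph
    coordinate (node id, offset within that node).

    path: ordered node ids the assembly traverses through the graph;
    node_lengths: node id -> sequence length. A feature anchored this way
    keeps its (node, offset) address even when other assemblies are added.
    """
    for node in path:
        length = node_lengths[node]
        if pos < length:
            return (node, pos)
        pos -= length
    raise ValueError("position beyond the end of the path")
```

Because shared sequence lives in shared nodes, annotations expressed in (node, offset) form transfer between assemblies without re-running a liftover against each new linear reference.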

    Adaptive Analysis and Processing of Structured Multilingual Documents

    Digital document processing is becoming popular in applications such as office and library automation, bank and postal services, publishing houses, and communication management. In recent years, the demand for tools capable of searching written and spoken sources of multilingual information has increased tremendously, and the bilingual dictionary is one of the important resources for providing the required information. Processing and analyzing bilingual dictionaries raises the challenge of dealing with many different scripts, some of which are unknown to the designer. We present a framework to adaptively analyze and process structured multilingual documents, with adaptability applied at every step. The proposed framework involves: (1) general word-level script identification using Gabor filters; (2) font classification using the grating cell operator; (3) general word-level style identification using a Gaussian mixture model; (4) an adaptable Hindi OCR based on generalized Hausdorff image comparison; (5) a retargetable OCR with automatic training sample creation and its applications to different scripts; and (6) bootstrapped entry segmentation, which segments each page into functional entries for parsing. Experimental results on different scripts, such as Chinese, Korean, Arabic, Devanagari, and Khmer, demonstrate that the proposed framework can save significant human effort by making each phase adaptive.
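Step (1), Gabor-filter-based script identification, rests on filtering word images with an oriented filter bank and using the response energies as features, since scripts differ in their dominant stroke orientations. A minimal sketch of one real Gabor kernel and its response energy follows; the parameter choices are arbitrary illustrations, not those of the framework:

```python
import math

def gabor_kernel(size, theta, wavelength, sigma):
    """Real part of a Gabor kernel: a Gaussian envelope times a cosine
    carrier, rotated to orientation theta (radians). Returns a size x size
    list of lists (size should be odd)."""
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            xr = x * math.cos(theta) + y * math.sin(theta)
            yr = -x * math.sin(theta) + y * math.cos(theta)
            envelope = math.exp(-(xr * xr + yr * yr) / (2 * sigma * sigma))
            carrier = math.cos(2 * math.pi * xr / wavelength)
            row.append(envelope * carrier)
        kernel.append(row)
    return kernel

def filter_energy(image, kernel):
    """Mean absolute response of a valid (no-padding) 2-D convolution;
    one such energy per orientation forms the word's feature vector."""
    kh, kw = len(kernel), len(kernel[0])
    h, w = len(image), len(image[0])
    responses = []
    for i in range(h - kh + 1):
        for j in range(w - kw + 1):
            acc = sum(kernel[u][v] * image[i + u][j + v]
                      for u in range(kh) for v in range(kw))
            responses.append(abs(acc))
    return sum(responses) / len(responses)
```

A bank of such kernels at several orientations and wavelengths yields a small feature vector per word image, which a standard classifier can then map to a script label.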

    DEVELOPMENT OF TOOLS FOR ATOM-LEVEL INTERPRETATION OF STABLE ISOTOPE-RESOLVED METABOLOMICS DATASETS

    Metabolomics is the global study of small molecules in living systems under a given state, emerging as a new 'omics' discipline in systems biology. It has shown great promise in elucidating biological mechanisms in various areas. Many diseases, especially cancers, are closely linked to reprogrammed metabolism. As the end point of biological processes, metabolic profiles are more representative of the biological phenotype than genomic or proteomic profiles. Therefore, characterizing the metabolic phenotype of various diseases will help clarify metabolic mechanisms and promote the development of novel and effective treatment strategies. Advances in analytical technologies such as nuclear magnetic resonance and mass spectrometry greatly contribute to the detection and characterization of global metabolites in a biological system. Furthermore, applying these analytical tools to stable isotope-resolved metabolomics (SIRM) experiments can generate large-scale, high-quality metabolomics data tracing isotopic flow through cellular metabolism. However, the lack of corresponding computational analysis tools hinders the characterization of metabolic phenotypes and downstream applications. Both detailed metabolic modeling and quantitative analysis are required for proper interpretation of these complex metabolomics data. For metabolic modeling, there is currently no comprehensive metabolic network at an atom-resolved level that can be used to derive context-specific metabolic models for SIRM metabolomics datasets. For quantitative analysis, most available tools conduct metabolic flux analysis based on a well-defined metabolic model, which is hard to achieve for complex biological systems due to the limitations of our knowledge. Here, we developed a set of methods to address these problems. First, we developed a neighborhood-specific coloring method that creates an identifier for each atom in a specific compound. With these atom identifiers, we successfully harmonized compounds and reactions across the KEGG and MetaCyc databases at various levels, and we evaluated the atom mappings of the harmonized metabolic reactions. These results will contribute to the construction of a comprehensive atom-resolved metabolic network. The method can be easily applied to any metabolic database that provides a molfile representation of compounds, which will greatly facilitate future expansion. Second, we developed a moiety modeling framework to deconvolute metabolite isotopologue profiles using moiety models, together with the analysis and selection of the best moiety model(s) based on the experimental data. To our knowledge, this is the first method that can analyze datasets involving multiple isotope tracers. Furthermore, instead of a single predefined metabolic model, this method allows the comparison of multiple metabolic models derived from a given metabolic profile, and we have demonstrated the robust performance of the moiety modeling framework in model selection on a 13C-labeled UDP-GlcNAc isotopologue dataset. We further explored the data quality requirements and the factors that affect model selection. Collectively, these methods and tools help interpret SIRM metabolomics datasets from metabolic modeling to quantitative analysis.
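The neighborhood-specific coloring idea can be sketched in the style of Morgan-like iterative refinement: each atom's identifier is repeatedly rehashed together with the sorted identifiers of its neighbors, so atoms in equivalent local environments receive equal colors. This is a generic sketch of the technique, not the authors' exact identifier scheme:

```python
import hashlib

def atom_colors(atoms, bonds, rounds=2):
    """Neighborhood-specific atom coloring (Morgan-style sketch).

    atoms: list of element symbols; bonds: set of (i, j) index pairs.
    After each round, an atom's color encodes its element plus its
    neighborhood out to that round's radius; symmetric atoms (e.g. the
    two hydrogens in water) end up with identical colors.
    """
    neighbors = {i: [] for i in range(len(atoms))}
    for i, j in bonds:
        neighbors[i].append(j)
        neighbors[j].append(i)
    # Round 0: color = hash of the element symbol alone.
    colors = [hashlib.sha256(a.encode()).hexdigest() for a in atoms]
    for _ in range(rounds):
        colors = [
            hashlib.sha256(
                (colors[i] + "|" +
                 ",".join(sorted(colors[j] for j in neighbors[i]))).encode()
            ).hexdigest()
            for i in range(len(atoms))
        ]
    return colors
```

Because the colors depend only on local structure, the same atom in the same compound gets the same identifier regardless of which database's molfile it came from, which is what enables cross-database harmonization.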

    Multilingual sentiment analysis in social media.

    252 p. This thesis addresses the task of analysing sentiment in messages from social media. The ultimate goal was to develop a sentiment analysis system for Basque. However, because of the socio-linguistic reality of the Basque language, a tool providing analysis only for Basque would not be enough for a real-world application, so we set out to develop a multilingual system covering Basque, English, French, and Spanish. The thesis addresses the following challenges in building such a system:
    - Analysing methods for creating sentiment lexicons suitable for less-resourced languages.
    - Analysing social media (specifically Twitter): tweets pose several challenges for understanding and extracting opinions; language identification and microtext normalization are addressed.
    - Researching the state of the art in polarity classification, and developing a supervised classifier tested against well-known social media benchmarks.
    - Developing a social media monitor capable of analysing sentiment with respect to specific events, products, or organizations.
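A sentiment lexicon feeds polarity classification most simply as a weighted word count with negation handling, the usual baseline against which supervised classifiers like the thesis's are compared. The sketch below is that generic baseline with a hypothetical toy lexicon, not the thesis's classifier:

```python
def polarity(tokens, lexicon, negators=frozenset({"not", "no", "never"})):
    """Lexicon-based polarity score with simple negation flipping.

    lexicon maps words to sentiment weights (positive or negative);
    a negator flips the sign of the next sentiment-bearing word.
    Positive return value = positive sentiment, negative = negative.
    """
    score, flip = 0.0, False
    for tok in tokens:
        tok = tok.lower()
        if tok in negators:
            flip = True
            continue
        if tok in lexicon:
            score += -lexicon[tok] if flip else lexicon[tok]
            flip = False
    return score
```

For less-resourced languages, the hard part is building `lexicon` in the first place, which is why the thesis studies lexicon-creation methods separately from the classifier.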

    Cortical Surface Registration and Shape Analysis

    A population analysis of human cortical morphometry is critical for insights into brain development and degeneration, allowing investigation of sulcal and gyral folding patterns. Such an analysis requires both a well-established cortical correspondence and a well-defined quantification of cortical morphometry, and the highly folded and convoluted structures make a reliable and consistent population analysis challenging. Three key challenges have been identified: 1) consistent sulcal landmark extraction from the cortical surface to guide better cortical correspondence; 2) establishing a correspondence for a reliable and stable population analysis; and 3) quantifying cortical folding in a reliable and biologically meaningful fashion. The main focus of this dissertation is a fully automatic pipeline that supports a population analysis of local cortical folding changes. The proposed pipeline consists of three novel components developed to overcome these challenges: 1) automatic sulcal curve extraction for stable and reliable anatomical landmark selection; 2) group-wise registration establishing cortical shape correspondence across a population with no template selection bias; and 3) quantification of local cortical folding using a novel cortical-shape-adaptive kernel. To evaluate these methodological contributions, all of them were applied to early postnatal brain development: human cortical morphological development was studied, using the proposed quantification of local cortical folding, from the neonatal stage to 1 and 2 years of age, alongside quantitative developmental assessments. This study revealed a novel pattern of associations between cortical gyrification and cognitive development. Doctor of Philosophy.
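Kernel-based quantification of a local surface quantity can be sketched as a Gaussian-weighted average over geodesic (here, graph) distance on the mesh. This illustrates only the generic kernel-averaging idea; the dissertation's shape-adaptive kernel additionally adapts its support to the folding geometry:

```python
import heapq
import math

def geodesic_kernel_average(vertices, edges, values, center, sigma):
    """Gaussian-kernel-weighted average of a per-vertex quantity
    (e.g. a curvature-based folding measure) around a center vertex.

    edges: dict vertex -> list of (neighbor, edge_length) pairs;
    values: dict vertex -> quantity. Graph distances computed by Dijkstra
    stand in for geodesic distances on the cortical surface.
    """
    dist = {v: math.inf for v in vertices}
    dist[center] = 0.0
    heap = [(0.0, center)]
    while heap:
        d, v = heapq.heappop(heap)
        if d > dist[v]:
            continue
        for u, w in edges.get(v, []):
            nd = d + w
            if nd < dist[u]:
                dist[u] = nd
                heapq.heappush(heap, (nd, u))
    num = den = 0.0
    for v in vertices:
        if math.isinf(dist[v]):
            continue  # unreachable vertices get zero weight
        weight = math.exp(-dist[v] ** 2 / (2 * sigma ** 2))
        num += weight * values[v]
        den += weight
    return num / den
```

Evaluating this at every vertex turns a noisy pointwise folding measure into a smooth local map suitable for vertex-wise population statistics.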
