21 research outputs found

    Bounding the Probability of Error for High Precision Recognition

    We consider models for which it is important, early in processing, to estimate some variables with high precision, but perhaps at relatively low rates of recall. If some variables can be identified with near certainty, then they can be conditioned upon, allowing further inference to be done efficiently. Specifically, we consider optical character recognition (OCR) systems that can be bootstrapped by identifying a subset of correctly translated document words with very high precision. This "clean set" is subsequently used as document-specific training data. While many current OCR systems produce measures of confidence for the identity of each letter or word, thresholding these confidence values, even at very high values, still produces some errors. We introduce a novel technique for identifying a set of correct words with very high precision. Rather than estimating posterior probabilities, we bound the probability that any given word is incorrect under very general assumptions, using an approximate worst-case analysis. As a result, the parameters of the model are nearly irrelevant, and we are able to identify a subset of words, even in noisy documents, of which we are highly confident. On our set of 10 documents, we are able to identify about 6% of the words on average without making a single error. This ability to produce word lists with very high precision allows us to use a family of models which depends upon such clean word lists.

    Learning to Read by Spelling: Towards Unsupervised Text Recognition

    This work presents a method for visual text recognition without using any paired supervisory data. We formulate the text recognition task as one of aligning the conditional distribution of strings predicted from given text images with lexically valid strings sampled from target corpora. This enables fully automated and unsupervised learning from just line-level text images and unpaired text-string samples, obviating the need for large aligned datasets. We present detailed analysis for various aspects of the proposed method, namely: (1) impact of the length of training sequences on convergence, (2) relation between character frequencies and the order in which they are learnt, (3) generalisation ability of our recognition network to inputs of arbitrary lengths, and (4) impact of varying the text corpus on recognition accuracy. Finally, we demonstrate excellent text recognition accuracy on both synthetically generated text images and scanned images of real printed books, using no labelled training examples.

    Analysis of pan-genome content and its application in microbial identification


    Sequence Analysis and Related Approaches

    This open access book provides innovative methods and original applications of sequence analysis (SA) and related methods for analysing longitudinal data describing life trajectories such as professional careers, family paths, the succession of health statuses, or time use. The applications as well as the methodological contributions proposed in this book pay special attention to the combined use of SA and other methods for longitudinal data such as event history analysis, Markov modelling, and sequence networks. The methodological contributions in this book include, among others, original propositions for measuring the precarity of work trajectories, Markov-based methods for clustering sequences, fuzzy and monothetic clustering of sequences, network-based SA, joint use of SA and hidden Markov models, and of SA and survival models. The applications cover the comparison of gendered occupational trajectories in Germany, the study of the changes in women's market participation in Denmark, the study of the typical day of dual-earner couples in Italy, of mobility patterns in Togo, of internet addiction in Switzerland, and of the quality of employment careers after a first unemployment spell. As such, this book provides a wealth of information for social scientists interested in quantitative life course analysis, and all those working in sociology, demography, economics, health, psychology, social policy, and statistics. It provides new perspectives and methods for sequence analysis; focuses on the link between sequence analysis and other methods for longitudinal data, especially event history analysis and Markov models; stresses the complementarity of sequence analysis and other models for longitudinal data; and presents applications of sequence analysis in a whole range of different domains.

    The MGX framework for microbial community analysis

    Jaenicke S. The MGX framework for microbial community analysis. Bielefeld: Universität Bielefeld; 2020

    Information management applied to bioinformatics

    Bioinformatics, the discipline concerned with biological information management, is essential in the post-genome era, where the complexity of data processing allows for contemporaneous multi-level research including that at the genome level, transcriptome level, proteome level, the metabolome level, and the integration of these -omic studies towards gaining an understanding of biology at the systems level. This research is also having a major impact on disease research and drug discovery, particularly through pharmacogenomics studies. In this study innovative resources have been generated via the use of two case studies. One was of the Research & Development Genetics (RDG) department at AstraZeneca, Alderley Park, and the other was of the Pharmacogenomics Group at the Sanger Institute in Cambridge, UK. In the AstraZeneca case study senior scientists were interviewed using semi-structured interviews to determine information behaviour through the study of scientific workflows. Document analysis was used to generate an understanding of the underpinning concepts and formed one of the sources of context-dependent information on which the interview questions were based. The objectives of the Sanger Institute case study were slightly different, as interviews were carried out with eight scientists together with the use of participant observation, to collect data to develop a database standard for one process of their Pharmacogenomics workflow. The results indicated that AstraZeneca would benefit through upgrading their data management solutions in the laboratory and by development of resources for the storage of data from larger scale projects such as whole genome scans. These studies will also generate very large amounts of data and the analysis of these will require more sophisticated statistical methods.
    At the Sanger Institute a minimum information standard was reported for the manual design of primers and included in a decision-making tree developed for Polymerase Chain Reactions (PCRs). This tree also illustrates problems that can be encountered when designing primers along with procedures that can be taken to address such issues. Source: EThOS - Electronic Theses Online Service, United Kingdom.

    Phylogenomics of vertebrate serpins

    Kumar A. Phylogenomics of vertebrate serpins. Bielefeld (Germany): Bielefeld University; 2010. The serpins constitute a superfamily of proteins that fold into a conserved tertiary structure and employ a sophisticated, irreversible suicide-mechanism of inhibition. More than 6000 serpins have been identified, occurring in all three forms of life: the eukaryotes, the prokaryotes and the archaea. Vertebrate serpins can be conveniently classified into six groups (V1 - V6), based on three independent biological features: gene organization, diagnostic amino acid sites and rare indels. In the present work, the phylogenetic relationships of serpins from Nematostella vectensis, Strongylocentrotus purpuratus, Ciona intestinalis, four fish species, frog, chicken and mammals were investigated, using gene architecture analyses and stringent criteria for identification of orthologs. With some deviations, all vertebrate serpin genes fit into one of the six exon/intron gene classes previously identified, dating the existence and maintenance of these gene organizations before or close to the divergence of fishes. Group V1 and V2 gene families underwent rapid adaptive radiation along the lineages leading to mammals, as indicated by an up to nine-fold increased number of family members, accompanied by a rapid functional diversification. In contrast, gene groups V3 to V6 display a rather conservative evolution with little change since the divergence of fishes and the other vertebrates. The orthology assessment indicates that all vertebrates are equipped with a subset of strongly conserved serpins with functions that can be clearly correlated with basic vertebrate-specific physiology. None of the serpin genes from C. intestinalis shares a common exon-intron architecture organisation with any of the vertebrate serpin gene classes, nor was it possible to identify orthologs of vertebrates.
    The lack of gene architecture similarity and the complete absence of orthology between urochordate and vertebrate serpins indicate that major changes with bursts of character acquisition must have occurred during evolution of serpins in the time interval separating urochordates from chordates, indicating massive intron gains or losses and events providing C- and N-terminal sequence extensions characteristic of today's vertebrate serpins. Lancelet and sea urchin genomes, in contrast, share one orthologous serpin with vertebrates. Rare genomic characters are used to show that orthologs of neuroserpin, a prominent representative of vertebrate group V3 serpin genes, exist in early diverging deuterostomes and probably also in cnidarians, indicating that the origin of a mammalian serpin can be traced back far in the history of eumetazoans. A C-terminal address code assigning association with secretory pathway organelles is present in all neuroserpin orthologs, suggesting that supervision of cellular export/import routes by antiproteolytic serpins is an ancient trait. Phylogenomic comparisons show that, after establishment of canonical exon-intron patterns in the serpin superfamily at the dawn of vertebrate evolution, multiple intron acquisition events have occurred during diversification of a lineage of actinopterygian fishes. The novel introns were acquired within a limited time interval (on an evolutionary timescale), and no such events were observed in other groups of vertebrates. Examination of the sequences flanking the intron insertion points revealed that the genetic requirements for acquisition of novel introns might be less stringent than previously suggested. Finally, we argue that genome compaction, a phenomenon associated with the fish lineage depicting preferential intron gain, might promote intron acquisition.