88,907 research outputs found

    Extraction of Projection Profile, Run-Histogram and Entropy Features Straight from Run-Length Compressed Text-Documents

    Full text link
    Document Image Analysis, like any Digital Image Analysis requires identification and extraction of proper features, which are generally extracted from uncompressed images, though in reality images are made available in compressed form for the reasons such as transmission and storage efficiency. However, this implies that the compressed image should be decompressed, which indents additional computing resources. This limitation induces the motivation to research in extracting features directly from the compressed image. In this research, we propose to extract essential features such as projection profile, run-histogram and entropy for text document analysis directly from run-length compressed text-documents. The experimentation illustrates that features are extracted directly from the compressed image without going through the stage of decompression, because of which the computing time is reduced. The feature values so extracted are exactly identical to those extracted from uncompressed images.Comment: Published by IEEE in Proceedings of ACPR-2013. arXiv admin note: text overlap with arXiv:1403.778

    Layout Inference and Table Detection in Spreadsheet Documents

    Get PDF
    Spreadsheets have found wide use in many different domains and settings. They provide a broad range of both basic and advanced functionalities. In this way, they can support data collection, transformation, analysis, and reporting. Nevertheless, at the same time spreadsheets maintain a friendly and intuitive interface. Additionally, they entail no to very low cost. Well-known spreadsheet applications, such as OpenOffice, LibreOffice, Google Sheets, and Gnumeric, are free to use. Moreover, Microsoft Excel is widely available, with millions of users worldwide. Thus, spreadsheets are not only powerful tools, but also have a very low entrance barrier. Therefore, they have become very popular with novices and professionals alike. As a result, a large volume of valuable data resides in these documents. From spreadsheets, of particular interest are data coming in tabular form, since they provide concise, factual, and to a large extend structured information. One natural progression is to transfer tabular data from spreadsheets to databases. This would allow spreadsheets to become a direct source of data for existing or new business processes. It would be easier to digest them into data warehouses and to integrate them with other sources. Nevertheless, besides databases, there are other means to work with spreadsheet data. New paradigms, like NoDB, advocate querying directly from raw documents. Going one step further, spreadsheets together with other raw documents can be stored in a sophisticated centralized repository, i.e., a data lake. From then on they can serve (on-demand) various tasks and applications. All in all, by making spreadsheet data easily accessible, we can prevent information silos, i.e., valuable knowledge being isolated and scattered in multiple spreadsheet documents. Yet, there are considerable challenges to the automatic processing and understanding of these documents. After all, spreadsheets are designed primarily for human consumption, and as such, they favor customization and visual comprehension. Data are often intermingled with formatting, formulas, layout artifacts, and textual metadata, which carry domain-specific or even user-specific information (i.e., personal preferences). Multiple tables, with different layout and structure, can be found on the same sheet. Most importantly, the structure of the tables is not known, i.e., not explicitly given by the spreadsheet documents. Altogether, spreadsheets are better described as partially structured, with a significant degree of implicit information. In literature, the automatic understanding of spreadsheet data has only been scarcely investigated, often assuming just the same uniform table layout. However, due to the manifold possibilities to structure tabular data in spreadsheets, the assumption of a uniform layout either excludes a substantial number of tables from the extraction process or leads to inaccurate results. In this thesis, we primarily address two fundamental tasks that can lead to more accurate information extraction from spreadsheet documents. Namely, we propose intuitive and effective approaches for layout analysis and table detection in spreadsheets. Nevertheless, our overall solution is designed as a processing pipeline, where specialized steps build on top of each other to discover the tabular data. One of our main objectives is to eliminate most of the assumptions from related work. Instead, we target highly diverse sheet layouts, with one or multiple tables. On the same time, we foresee the presence of textual metadata and other non-tabular data in the sheet. Furthermore, we make use of sophisticated machine learning and optimization techniques. This brings flexibility to our approach, allowing it to work even with complex or malformed tables. Moreover, this intended flexibility makes our approaches transferable to new spreadsheet datasets. Thus, we are not bounded to specific domains or settings.:1 INTRODUCTION 1.1 Motivation 1.2 Contributions 1.3 Outline 2 FOUNDATIONS AND RELATED WORK 2.1 The Evolution of Spreadsheet Documents 2.1.1 Spreadsheet User Interface and Functionalities 2.1.2 Spreadsheet File Formats 2.1.3 Spreadsheets Are Partially-Structured 2.2 Analysis and Recognition in Electronic Documents 2.2.1 A General Overview of DAR 2.2.2 DAR in Spreadsheets 2.3 Spreadsheet Research Areas 2.3.1 Layout Inference and Table Recognition 2.3.2 Unifying Databases and Spreadsheets 2.3.3 Spreadsheet Software Engineering 2.3.4 Data Wrangling Approaches 3 AN EMPIRICAL STUDY OF SPREADSHEET DOCUMENTS 3.1 Available Corpora 3.2 Creating a Gold Standard Dataset 3.2.1 Initial Selection 3.2.2 Annotation Methodology 3.3 Dataset Analysis 3.3.1 Takeaways from Business Spreadsheets 3.3.2 Comparison Between Domains 3.4 Summary and Discussion 3.4.1 Datasets for Experimental Evaluation 3.4.2 A Processing Pipeline 4 LAYOUT ANALYSIS 4.1 A Method for Layout Analysis in Spreadsheets 4.2 Feature Extraction 4.2.1 Content Features 4.2.2 Style Features 4.2.3 Font Features 4.2.4 Formula and Reference Features 4.2.5 Spatial Features 4.2.6 Geometrical Features 4.3 Cell Classification 4.3.1 Classification Datasets 4.3.2 Classifiers and Assessment Methods 4.3.3 Optimum Under-Sampling 4.3.4 Feature Selection 4.3.5 Parameter Tuning 4.3.6 Classification Evaluation 4.4 Layout Regions 4.5 Summary and Discussions 5 CLASSIFICATION POST-PROCESSING 5.1 Dataset for Post-Processing 5.2 Pattern-Based Revisions 5.2.1 Misclassification Patterns 5.2.2 Relabeling Cells 5.2.3 Evaluating the Patterns 5.3 Region-Based Revisions 5.3.1 Standardization Procedure 5.3.2 Extracting Features from Regions 5.3.3 Identifying Misclassified Regions 5.3.4 Relabeling Misclassified Regions 5.4 Summary and Discussion 6 TABLE DETECTION 6.1 A Method for Table Detection in Spreadsheets 6.2 Preliminaries 6.2.1 Introducing a Graph Model 6.2.2 Graph Partitioning for Table Detection 6.2.3 Pre-Processing for Table Detection 6.3 Rule-Based Detection 6.3.1 Remove and Conquer 6.4 Genetic-Based Detection 6.4.1 Undirected Graph 6.4.2 Header Cluster 6.4.3 Quality Metrics 6.4.4 Objective Function 6.4.5 Weight Tuning 6.4.6 Genetic Search 6.5 Experimental Evaluation 6.5.1 Testing Datasets 6.5.2 Training Datasets 6.5.3 Tuning Rounds 6.5.4 Search and Assessment 6.5.5 Evaluation Results 6.6 Summary and Discussions 7 XLINDY: A RESEARCH PROTOTYPE 7.1 Interface and Functionalities 7.1.1 Front-end Walkthrough 7.2 Implementation Details 7.2.1 Interoperability 7.2.2 Efficient Reads 7.3 Information Extraction 7.4 Summary and Discussions 8 CONCLUSION 8.1 Summary of Contributions 8.2 Directions of Future Work BIBLIOGRAPHY LIST OF FIGURES LIST OF TABLES A ANALYSIS OF REDUCED SAMPLES B TABLE DETECTION WITH TIRS B.1 Tables in TIRS B.2 Pairing Fences with Data Regions B.3 Heuristics Framewor

    Toward Entity-Aware Search

    Get PDF
    As the Web has evolved into a data-rich repository, with the standard "page view," current search engines are becoming increasingly inadequate for a wide range of query tasks. While we often search for various data "entities" (e.g., phone number, paper PDF, date), today's engines only take us indirectly to pages. In my Ph.D. study, we focus on a novel type of Web search that is aware of data entities inside pages, a significant departure from traditional document retrieval. We study the various essential aspects of supporting entity-aware Web search. To begin with, we tackle the core challenge of ranking entities, by distilling its underlying conceptual model Impression Model and developing a probabilistic ranking framework, EntityRank, that is able to seamlessly integrate both local and global information in ranking. We also report a prototype system built to show the initial promise of the proposal. Then, we aim at distilling and abstracting the essential computation requirements of entity search. From the dual views of reasoning--entity as input and entity as output, we propose a dual-inversion framework, with two indexing and partition schemes, towards efficient and scalable query processing. Further, to recognize more entity instances, we study the problem of entity synonym discovery through mining query log data. The results we obtained so far have shown clear promise of entity-aware search, in its usefulness, effectiveness, efficiency and scalability

    Relation Discovery from Web Data for Competency Management

    Get PDF
    This paper describes a technique for automatically discovering associations between people and expertise from an analysis of very large data sources (including web pages, blogs and emails), using a family of algorithms that perform accurate named-entity recognition, assign different weights to terms according to an analysis of document structure, and access distances between terms in a document. My contribution is to add a social networking approach called BuddyFinder which relies on associations within a large enterprise-wide "buddy list" to help delimit the search space and also to provide a form of 'social triangulation' whereby the system can discover documents from your colleagues that contain pertinent information about you. This work has been influential in the information retrieval community generally, as it is the basis of a landmark system that achieved overall first place in every category in the Enterprise Search Track of TREC2006
    • …
    corecore