Document Flow Segmentation for Business Applications

Author: Abdel Belaïd
Daher Hani
Publication venue: HAL CCSD
Publication date: 03/02/2014

The aim of this paper is to propose a supervised document flow segmentation approach applied to real-world heterogeneous documents. Our algorithm treats the flow of documents as pairs of consecutive pages and studies the relationship between them. First, sets of features are extracted from the pages, and each pair of pages is modeled as a single feature vector. This representation is passed to a binary classifier, which classifies the relationship as either segmentation or continuity. In the case of segmentation, we consider the document complete and the analysis of the flow continues by starting a new document. In the case of continuity, the two pages are assigned to the same document and the analysis continues on the flow. If it is uncertain whether the relationship between the pair of pages should be classified as continuity or segmentation, a rejection is decided and the pages analyzed up to this point are considered a "fragment". The first classification already provides good results, approaching 90% on certain documents, which is high at this level of the system.
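The pairwise decision loop described in this abstract could be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the feature layout in `pair_features`, the probability-style `score` callable, and the `low`/`high` rejection thresholds are all hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Page = Sequence[float]  # hypothetical per-page feature vector


def pair_features(a: Page, b: Page) -> List[float]:
    """Model a pair of pages as one vector (assumed layout:
    both pages' features followed by their absolute differences)."""
    return list(a) + list(b) + [abs(x - y) for x, y in zip(a, b)]


def segment_flow(
    pages: Sequence[Page],
    score: Callable[[List[float]], float],
    low: float = 0.3,
    high: float = 0.7,
) -> List[Tuple[str, List[int]]]:
    """Group page indices into documents and fragments.

    `score` stands in for the binary classifier and is assumed to return
    the probability that a boundary (segmentation) lies between two pages.
    Scores strictly between `low` and `high` are treated as a rejection.
    """
    groups: List[Tuple[str, List[int]]] = []
    if not pages:
        return groups
    current = [0]
    for i in range(1, len(pages)):
        p = score(pair_features(pages[i - 1], pages[i]))
        if p >= high:
            # Segmentation: the document is complete, start a new one.
            groups.append(("document", current))
            current = [i]
        elif p <= low:
            # Continuity: the page belongs to the current document.
            current.append(i)
        else:
            # Rejection: pages analyzed so far are set aside as a fragment.
            groups.append(("fragment", current))
            current = [i]
    # Assumption: the end of the flow closes the last group as a document.
    groups.append(("document", current))
    return groups
```

Any classifier that outputs a boundary probability could be plugged in as `score`; the two thresholds control how often the system rejects rather than commits to a decision.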

INRIA a CCSD electronic archive server

Bridging Cross-Modal Alignment for OCR-Free Content Retrieval in Scanned Historical Documents

Author: Molina Rodríguez Adrià
Universitat Autònoma de Barcelona. Departament de Ciències de la Computació
Universitat Autònoma de Barcelona. Escola d'Enginyeria
Publication date: 01/01/2023

In this work, we address the limitations of current approaches to document retrieval by incorporating vision-based topic extraction. While previous methods have primarily focused on visual elements or relied on optical character recognition (OCR) for text extraction, we propose a paradigm shift by directly incorporating vision into the topic space. We demonstrate that recognizing all visual elements within a document is unnecessary for identifying its underlying topic. Visual cues such as icons, writing style, and font can serve as sufficient indicators. By leveraging ranking loss functions and convolutional neural networks (CNNs), we learn complex topological representations that mimic the behavior of text representations. Our approach aims to eliminate the need for OCR and its associated challenges, including efficiency, performance, data hunger, and expensive annotation. Furthermore, we highlight the significance of incorporating vision in historical documentation, where visually antiquated documents contain valuable cues. Our research contributes to the understanding of topic extraction from a vision perspective and offers insights into annotation-cheap document retrieval systems.

Diposit Digital de Documents de la UAB

Web Usability Guidelines for Air Force Knowledge Now Web Site

Author: Felax Gary A.
Publication venue: AFIT Scholar
Publication date: 01/03/2005

The number one key attribute of the Department of Defense Net-Centric Data Strategy is to ensure data is visible, available, and usable when and where needed to accelerate decision-making. The Internet provides opportunities for quick and efficient dissemination of information to the public, distribution of information throughout the Air Force, and access to information from a variety of sources. In 2002, the Air Force CIO designated Air Force Knowledge Now (AFKN) as the center of excellence for Knowledge Management. The site is a one-stop resource, providing access to a great depth and breadth of information. This study seeks to determine how usable and accessible the web interface is to its customers. A literature review determined the usability inspection method called Heuristic Evaluation to be the most suitable for this type of evaluation. The researcher conducted a case study using heuristic evaluation to determine the site's usability compliance rate. A second case study using web content accessibility guidelines was then performed to determine the site's accessibility compliance rate. Finally, the study presented a comparative analysis of the usability and accessibility checklists to determine whether the two overlap or whether one is a subset of the other. This exploratory research finds that more emphasis on web usability and accessibility should be explored in the future for AFKN.

Improving Retrieval Accuracy in Main Content Extraction from HTML Web Documents

Author: Mohammadzadeh Hadi
Publication date: 27/11/2013

The rapid growth of text-based information on the World Wide Web, and the variety of applications making use of this data, motivates the need for efficient and effective methods to identify and separate the “main content” from additional content items such as navigation menus, advertisements, design elements or legal disclaimers. Firstly, in this thesis, we study, develop, and evaluate R2L, DANA, DANAg, and AdDANAg, a family of novel algorithms for extracting the main content of web documents. The main concept behind R2L, which also provided the initial idea and motivation for the other three algorithms, is to exploit the particularities of Right-to-Left languages to obtain the main content of web pages. As the English character set and the Right-to-Left character set are encoded in different intervals of the Unicode character set, we can efficiently distinguish the Right-to-Left characters from the English ones in an HTML file. This enables the R2L approach to recognize areas of the HTML file with a high density of Right-to-Left characters and a low density of characters from the English character set. Having recognized these areas, R2L can successfully separate only the Right-to-Left characters. The first extension of R2L, DANA, improves the effectiveness of the baseline algorithm by employing an HTML parser in a post-processing phase of R2L to extract the main content from areas with a high density of Right-to-Left characters. DANAg, the second extension of R2L, generalizes the idea of R2L to render it language independent. AdDANAg, the third extension of R2L, integrates a new preprocessing step to normalize the hyperlink tags. The presented approaches are analyzed with respect to efficiency and effectiveness. We compare them to several established main content extraction algorithms and show that we extend the state of the art in terms of both efficiency and effectiveness.
Secondly, automatically extracting the headline of web articles has many applications. We develop and evaluate a content-based and language-independent approach, TitleFinder, for unsupervised extraction of the headline of web articles. The proposed method achieves high performance in terms of effectiveness and efficiency and outperforms approaches operating on structural and visual features.
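The character-density idea behind R2L, as described in this abstract, might be sketched as below. This is a simplified illustration and not the thesis's algorithm: treating each line of the HTML file as an "area", restricting Right-to-Left characters to the Hebrew and Arabic Unicode blocks (U+0590 to U+06FF), and the `threshold` value are all assumptions made for the example.

```python
from typing import List


def is_rtl(ch: str) -> bool:
    """True for characters in the Hebrew (U+0590-U+05FF) or
    Arabic (U+0600-U+06FF) Unicode blocks (assumed RTL range)."""
    return 0x0590 <= ord(ch) <= 0x06FF


def is_latin(ch: str) -> bool:
    """True for ASCII letters, standing in for the English character set."""
    return ch.isascii() and ch.isalpha()


def rtl_density(line: str) -> float:
    """Share of RTL letters among all RTL and Latin letters in a line."""
    rtl = sum(1 for c in line if is_rtl(c))
    latin = sum(1 for c in line if is_latin(c))
    total = rtl + latin
    return rtl / total if total else 0.0


def main_content_lines(html: str, threshold: float = 0.5) -> List[int]:
    """Indices of lines whose RTL density reaches the (hypothetical) threshold."""
    return [
        i for i, line in enumerate(html.splitlines())
        if rtl_density(line) >= threshold
    ]


def extract_rtl_text(line: str) -> str:
    """Keep only RTL characters and whitespace, dropping markup letters."""
    return "".join(c for c in line if is_rtl(c) or c.isspace()).strip()
```

Lines dominated by markup (tag names, attributes, URLs) score close to zero, while body text in an RTL language scores close to one, which is what makes the separation cheap to compute.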

Parts and Wholes in Long Non-narrative Poems of the Eighteenth Century

Author: Stenke Katarina
Publication venue: University of Cambridge
Publication date: 17/09/2012

This dissertation examines early-eighteenth-century understandings of literary length in order to shed new light on the structures of three long non-narrative poems of the period, James Thomson’s The Seasons, Mark Akenside’s The Pleasures of Imagination and Edward Young’s Night Thoughts. Readings of these poems demonstrate the sophistication with which British eighteenth-century writers used extensive literary structures to represent, explicate and communicate objects and ideas that seemed too vast or complex for comprehensive description or narration. Part I of the dissertation surveys, in chronological order, earlier and contemporary critical theories which inform the three poems, in particular those found in the writings of two major Whig critics, John Dennis and Joseph Addison (discussed in Chapter 1) and in the poetry of Alexander Pope (Chapter 2). Considered collectively, these may be understood to describe a ‘poetics of greatness’ whereby extensive verse is progressively abstracted from its traditional generic loci and becomes associated more broadly with ambitions and potential failures of comprehensive representation and perception, with the sublime, and with playful or witty complexity. Part II covers the three long poems. Chapter 3 argues that in The Seasons Thomson uses the figure of the maze to modulate allusively between stasis and motion, sublimity and playfulness, gesturing circumspectly towards a vast providential order. Chapter 4 offers close readings of two early Akenside poems and passages from Shaftesbury’s Characteristics, a key source for The Pleasures of Imagination. These reveal Akenside’s abiding concern with the fine line distinguishing sublime inspiration from ridiculous delusion, which informs self-reflexively the very structure of his sublime long poem. In Chapter 5, perceptions of Night Thoughts as too long provide the starting point for an account of how Young’s belief in the didactic function of poetry translates into a temporal, cumulative poetics designed to wear its repetitive aperçus on ‘life, death and immortality’, through the time of reading, into the heart of the reader. Just as in extensive classical genres like epic and georgic, these works invest structure with the task of transmitting an articulated experience or body of knowledge to the reader. As such, their parts are arranged coherently, if complexly, within the whole.

Yavaa: supporting data workflows from discovery to visualization

Author: Schindler Sirko
Publication date: 01/01/2022

Recent years have witnessed an increasing number of data silos being opened up both within organizations and to the general public: Scientists publish their raw data as supplements to articles or even as standalone artifacts to enable others to verify and extend their work. Governments pass laws to open up formerly protected data treasures to improve accountability and transparency as well as to enable new business ideas based on this public good. Even companies share structured information about their products and services to advertise their use and thus increase revenue. Exploiting this wealth of information poses many challenges for users, though. Data is often provided as tables whose seemingly endless rows of daunting numbers are barely accessible. InfoVis can bridge this gap. However, the visualization options offered are generally very limited, and next to no support is given in applying any of them. The same holds true for data wrangling. Very few options exist to adjust the data to current needs, and barely any safeguards are in place to prevent even the most obvious mistakes. When it comes to data from multiple providers, the situation gets even bleaker. Only recently have tools emerged to search for datasets across institutional borders in a reasonable way. Easy-to-use ways to combine these datasets are still missing, though. Finally, results generally lack proper documentation of their provenance. So even the most compelling visualizations can be called into question when it remains unclear how they came about. The foundations for a lively exchange and exploitation of open data are set, but the barrier to entry remains relatively high, especially for non-expert users. This thesis aims to lower that barrier by providing tools and assistance, reducing the amount of prior experience and skills required. It covers the whole workflow, ranging from identifying suitable datasets, through possible transformations, to the export of the result in the form of suitable visualizations.