
    Article Segmentation in Digitised Newspapers

    Digitisation projects preserve and make available vast quantities of historical text. Among these, newspapers are an invaluable resource for the study of human culture and history. Article segmentation identifies each region in a digitised newspaper page that contains an article. Digital humanities, information retrieval (IR), and natural language processing (NLP) applications over digitised archives improve access to text and allow automatic information extraction. The lack of article segmentation impedes these applications. We contribute a thorough review of the existing approaches to article segmentation. Our analysis reveals divergent interpretations of the task, and inconsistent and often ambiguously defined evaluation metrics, making comparisons between systems challenging. We solve these issues by contributing a detailed task definition that examines the nuances and intricacies of article segmentation that are not immediately apparent. We provide practical guidelines on handling borderline cases and devise a new evaluation framework that allows insightful comparison of existing and future approaches. Our review also reveals that the lack of large datasets hinders meaningful evaluation and limits machine learning approaches. We solve these problems by contributing a distant supervision method for generating large datasets for article segmentation. We manually annotate a portion of our dataset and show that our method produces article segmentations over characters nearly as well as costly human annotators. We reimplement the seminal textual approach to article segmentation (Aiello and Pegoretti, 2006) and show that it does not generalise well when evaluated on a large dataset. We contribute a framework for textual article segmentation that divides the task into two distinct phases: block representation and clustering. 
    We propose several techniques for block representation and contribute a novel, highly compressed semantic representation called similarity embeddings. We evaluate and compare different clustering techniques, and innovatively apply label propagation (Zhu and Ghahramani, 2002) to spread headline labels to similar blocks. Our similarity embeddings and label propagation approach substantially outperforms Aiello and Pegoretti but still falls short of human performance. Exploring visual approaches to article segmentation, we reimplement and analyse the state-of-the-art Bansal et al. (2014) approach. We contribute an innovative 2D Markov model approach that captures reading-order dependencies and reduces the structured labelling problem to a Markov chain that we decode with the Viterbi algorithm (Viterbi, 1967). Our approach substantially outperforms Bansal et al., achieves accuracy as good as human annotators, and establishes a new state of the art in article segmentation. Our task definition, evaluation framework, and distant supervision dataset will encourage progress in the task of article segmentation. Our state-of-the-art textual and visual approaches will allow sophisticated IR and NLP applications over digitised newspaper archives, supporting research in the digital humanities.
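The label propagation idea described above can be sketched in a few lines. This is a minimal illustration in the spirit of Zhu and Ghahramani (2002), not the thesis's implementation: the similarity matrix, the block representation, and the iteration count are all assumptions made for the example.

```python
import numpy as np

def propagate_labels(sim, labels, n_iter=50):
    """Spread labels from labelled nodes to unlabelled ones over a
    similarity graph (in the spirit of Zhu and Ghahramani, 2002).

    sim    : (n, n) symmetric similarity matrix between text blocks
    labels : length-n sequence; class id >= 0 for labelled (headline)
             blocks, -1 for unlabelled blocks
    """
    labels = np.asarray(labels)
    classes = sorted(set(labels[labels >= 0].tolist()))
    # Row-normalise similarities into transition probabilities.
    P = sim / sim.sum(axis=1, keepdims=True)
    # Soft label matrix: one row per block, one column per class.
    F = np.zeros((len(labels), len(classes)))
    clamped = labels >= 0
    F[clamped, [classes.index(y) for y in labels[clamped]]] = 1.0
    for _ in range(n_iter):
        F = P @ F                # diffuse beliefs along similarity edges
        F[clamped] = 0.0         # re-clamp the known headline labels
        F[clamped, [classes.index(y) for y in labels[clamped]]] = 1.0
    return [classes[j] for j in F.argmax(axis=1)]

# Two tight clusters; one headline label seeded in each.
sim = np.array([[1.0, 0.9, 0.1, 0.1],
                [0.9, 1.0, 0.1, 0.1],
                [0.1, 0.1, 1.0, 0.9],
                [0.1, 0.1, 0.9, 1.0]])
result = propagate_labels(sim, [0, -1, 1, -1])
```

Each unlabelled block inherits the label of the headline it is most strongly connected to, which is exactly the behaviour the abstract exploits to group article blocks under their headlines.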

    Explorative Graph Visualization

    Network structures (graphs) have become a natural part of everyday life, and their analysis helps to gain an understanding of their inherent structure and the real-world aspects they express. The exploration of graphs is largely supported and driven by visual means. The aim of this thesis is to give a comprehensive view of the problems associated with these visual means and to detail concrete solution approaches for them. Concrete visualization techniques are introduced to underline the value of this comprehensive discussion for supporting explorative graph visualization.

    Leveraging distant supervision for improved named entity recognition

    Recent years have seen a leap in deep learning techniques that greatly changed the way Natural Language Processing (NLP) tasks are tackled. In a couple of years, neural networks and word embeddings quickly became central components to be adopted in the domain. Distant supervision (DS) is a well-established technique in NLP for producing labeled data from partially annotated examples. Traditionally, such data were used for training in the absence of manual annotations, or as additional training data to improve generalization. In this thesis, we study how distant supervision can be employed within a modern deep-learning-based NLP framework. Since deep learning algorithms improve when massive amounts of data are provided (especially for representation learning), we revisit the task of generating distant supervision data from Wikipedia. We apply post-processing to the original dump to further increase the quantity of labeled examples, while introducing only a reasonable amount of noise. We then explore different methods for using distantly supervised data for representation learning, mainly to learn classic (static) and contextualized word representations. Because of its importance as a basic component of many NLP applications, we choose named-entity recognition (NER) as our main task. We experiment on standard NER benchmarks and achieve state-of-the-art performance. In doing so, we also investigate a more interesting setting: improving cross-domain (generalization) performance.
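A common way to realise distant supervision for NER, as described above, is to match text against a gazetteer harvested from Wikipedia anchor texts. The sketch below is illustrative only: the tiny `GAZETTEER`, the entity types, and the longest-match strategy are assumptions for the example, not the thesis's pipeline.

```python
# Toy gazetteer mapping surface forms to entity types, as might be
# harvested from Wikipedia anchor texts and categories (hypothetical).
GAZETTEER = {
    "Montreal": "LOC",
    "University of Montreal": "ORG",
}

def distant_label(tokens):
    """Assign BIO labels by longest-match lookup in the gazetteer."""
    labels = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        # Try the longest span starting at position i first.
        for j in range(len(tokens), i, -1):
            span = " ".join(tokens[i:j])
            if span in GAZETTEER:
                ent = GAZETTEER[span]
                labels[i] = "B-" + ent
                for k in range(i + 1, j):
                    labels[k] = "I-" + ent
                i = j
                break
        else:
            i += 1
    return labels

tags = distant_label("She studies at University of Montreal".split())
```

Matching longest spans first prevents "Montreal" (LOC) from shadowing "University of Montreal" (ORG); the resulting noisy labels are then usable as large-scale training data.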

    Assessment of an enterprise employee portal using dashboard monitoring system: a case study

    A portal is a browser-based application that provides a web platform for users to improve inter-department collaboration and customer service. Portals are classified either as internal-facing portals or external public-facing portals. This study addresses the problems facing an internal portal related to its contents, functions and usability, and provides a list of essential contents and functions that it should include by integrating theories and industry best practices. The theoretical framework is based on a literature review, and the industry best practices are based on the analysis of a number of internal portals of companies used as case studies. These two were compared to develop an information mapping grid to identify gaps between theories and practices. A case company was used to uncover additional insights on employee portal content and functionalities through the analysis of actual and perceived user portal usage. The results were then compared using an information mapping grid to derive a set of content and functionalities to improve usability of an internal employee portal. Results of this study indicate that customization and personalization are important features of an employee portal; however, features pertaining to communication and collaboration support, search support, help systems and employee self-services appear to be more important in practice. The information mapping grid derived, the data warehouse architecture developed and the Dashboard Monitoring system created to assess usability of an employee portal are applicable to similar enterprises.

    User Interfaces and Difference Visualizations for Alternatives

    Designers often create multiple iterations to evaluate alternatives. Today's computer-based tools do not support such easy exploration of a design space, despite the fact that such support has long been advocated. This dissertation is centered on closing that gap. I begin by investigating the effectiveness of various forms of difference visualizations and support for merging changes within a system targeted at diagrams with node and edge attributes. I evaluated the benefits of the introduced difference visualization techniques in two user studies. I found that the basic side-by-side juxtaposition visualization was not effective and also not well received. For comparing diagrams with matching node positions, participants preferred the side-by-side option with a difference layer. For diagrams with non-matching positions, animation was beneficial, but the combination with a difference layer was preferred. Thus, the difference layer technique was useful and a good complement to animation. I continue by investigating whether explicit support for design alternatives better supports exploration and creativity in a generative design system. To investigate the new techniques to better support exploration, I built a new system that supports parallel exploration of alternative designs and generation of new structural combinations. I investigate the usefulness of my prototype in two user studies and interviews. The results and feedback suggest and confirm that supporting design alternatives explicitly enables designers to work more creatively. Generative models are often represented as DAGs (directed acyclic graphs) in a dataflow programming environment. Existing approaches to compare such DAGs do not generalize to multiple alternatives. Informed by and building on the first part of my dissertation, I introduce a novel user interface that enables visual differencing and editing of alternative graphs, specifically more than two alternatives simultaneously, something that has not been presented before.
    I also explore multi-monitor support to demonstrate that the difference visualization technique scales well to up to 18 alternatives. The novel jamming space feature makes organizing alternatives on a 2×3 monitor system easier. To investigate the usability of the new difference visualization method, I conducted an exploratory interview with three expert designers. The received comments confirmed that it meets their design goals.
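The core of a difference visualization for diagrams with node and edge attributes is a structural diff between two alternatives. The sketch below shows one minimal way to compute such a diff; the `(nodes, edges)` tuple representation is an assumption for the example, and the dissertation's data model is certainly richer.

```python
def diff_graphs(g1, g2):
    """Summarise the differences between two diagram alternatives.

    Each graph is a pair (nodes, edges), where nodes maps a node id to
    an attribute dict and edges is a set of (source, target) pairs.
    The result feeds a difference layer: added, removed, and changed
    elements can each be drawn in a distinct visual style.
    """
    nodes1, edges1 = g1
    nodes2, edges2 = g2
    shared = set(nodes1) & set(nodes2)
    return {
        "added_nodes":   sorted(set(nodes2) - set(nodes1)),
        "removed_nodes": sorted(set(nodes1) - set(nodes2)),
        "changed_nodes": sorted(n for n in shared if nodes1[n] != nodes2[n]),
        "added_edges":   sorted(edges2 - edges1),
        "removed_edges": sorted(edges1 - edges2),
    }

g1 = ({"a": {"color": "red"}, "b": {}}, {("a", "b")})
g2 = ({"a": {"color": "blue"}, "c": {}}, {("a", "c")})
diff = diff_graphs(g1, g2)
```

Extending this pairwise diff to n alternatives, for example by diffing each alternative against a common base, is precisely where the abstract notes that existing approaches stop generalizing.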

    Adaptive Methods for Robust Document Image Understanding

    A vast amount of digital document material is continuously being produced as part of major digitization efforts around the world. In this context, generic and efficient automatic solutions for document image understanding represent a stringent necessity. We propose a generic framework for document image understanding systems, usable for practically any document type available in digital form. Following the introduced workflow, we shift our attention to each of the following processing stages in turn: quality assurance, image enhancement, color reduction and binarization, skew and orientation detection, page segmentation and logical layout analysis. We review the state of the art in each area, identify current deficiencies, point out promising directions and give specific guidelines for future investigation. We address some of the identified issues by means of novel algorithmic solutions, putting special focus on generality, computational efficiency and the exploitation of all available sources of information. More specifically, we introduce the following original methods: fully automatic detection of color reference targets in digitized material, accurate foreground extraction from color historical documents, font enhancement for hot-metal typeset prints, a theoretically optimal solution to the document binarization problem from both the computational-complexity and threshold-selection points of view, layout-independent skew and orientation detection, a robust and versatile page segmentation method, a semi-automatic front page detection algorithm and a complete framework for article segmentation in periodical publications. The proposed methods are experimentally evaluated on large datasets consisting of real-life heterogeneous document scans. The obtained results show that a document understanding system combining these modules is able to robustly process a wide variety of documents with good overall accuracy.
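To make the binarization stage concrete, the sketch below implements classic Otsu threshold selection, which maximizes between-class variance over a grayscale histogram. It is a stand-in for illustration: the thesis's own binarization method is described only as theoretically optimal in complexity and threshold selection, and may differ from Otsu's.

```python
import numpy as np

def otsu_threshold(gray):
    """Pick the gray level that maximizes between-class variance
    over a 256-bin histogram (Otsu's method)."""
    hist, _ = np.histogram(gray, bins=256, range=(0, 256))
    total = hist.sum()
    sum_all = np.dot(np.arange(256), hist)   # sum of all pixel values
    best_t, best_var = 0, -1.0
    w_bg, sum_bg = 0.0, 0.0
    for t in range(256):
        w_bg += hist[t]                      # background pixel count
        if w_bg == 0:
            continue
        w_fg = total - w_bg                  # foreground pixel count
        if w_fg == 0:
            break
        sum_bg += t * hist[t]
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if between > best_var:
            best_var, best_t = between, t
    return best_t

# A synthetic bimodal "page": dark ink at level 10, light paper at 200.
page = np.array([10] * 50 + [200] * 50)
threshold = otsu_threshold(page)
```

Pixels at or below the returned threshold are treated as foreground ink, the rest as background paper, which is the binary input the later segmentation stages consume.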

    Development of a Novel Media-independent Communication Theology for Accessing Local & Web-based Data: Case Study with Robotic Subsystems

    Realizing media independence in today’s communication systems remains, by and large, an open problem. Information retrieval, mostly through the Internet, is becoming the most demanded feature of technological progress, and this web-based data access should ideally be in user-selective form. While blind-folded access of data through the World Wide Web is quite streamlined, the other half of the problem, namely seamless access to the information database pertaining to a specific end-device, e.g. robotic systems, is still in a formative stage. This paradigm of access, as well as systematic query-based retrieval of data related to the physical end-device, is crucial in designing real-time Internet-based network control of that device. Moreover, this control of the end-device is directly linked to the characteristics of three coupled metrics, namely ‘multiple databases’, ‘multiple servers’ and ‘multiple inputs’ (to each server). This triad, viz. database-input-server (DIS), plays a significant role in the overall performance of the system, the background details of which are still very sketchy in the global research community. This work addresses the technical issues associated with this theology, with specific reference to the formalism of a customized DIS considering real-time delay analysis. The present paper delineates the development of novel multi-input multi-output communication semantics for retrieving web-based information from physical devices, namely two representative robotic subsystems, in a coherent and homogeneous mode. The developed protocol can be trusted for use in real time in a completely user-friendly manner.