2,024 research outputs found

    Article Segmentation in Digitised Newspapers

    Get PDF
    Digitisation projects preserve and make available vast quantities of historical text. Among these, newspapers are an invaluable resource for the study of human culture and history. Article segmentation identifies each region in a digitised newspaper page that contains an article. Digital humanities, information retrieval (IR), and natural language processing (NLP) applications over digitised archives improve access to text and allow automatic information extraction. The lack of article segmentation impedes these applications. We contribute a thorough review of the existing approaches to article segmentation. Our analysis reveals divergent interpretations of the task, and inconsistent and often ambiguously defined evaluation metrics, making comparisons between systems challenging. We solve these issues by contributing a detailed task definition that examines the nuances and intricacies of article segmentation that are not immediately apparent. We provide practical guidelines on handling borderline cases and devise a new evaluation framework that allows insightful comparison of existing and future approaches. Our review also reveals that the lack of large datasets hinders meaningful evaluation and limits machine learning approaches. We solve these problems by contributing a distant supervision method for generating large datasets for article segmentation. We manually annotate a portion of our dataset and show that our method produces article segmentations over characters nearly as well as costly human annotators. We reimplement the seminal textual approach to article segmentation (Aiello and Pegoretti, 2006) and show that it does not generalise well when evaluated on a large dataset. We contribute a framework for textual article segmentation that divides the task into two distinct phases: block representation and clustering. We propose several techniques for block representation and contribute a novel highly-compressed semantic representation called similarity embeddings. We evaluate and compare different clustering techniques, and innovatively apply label propagation (Zhu and Ghahramani, 2002) to spread headline labels to similar blocks. Our similarity embeddings and label propagation approach substantially outperforms Aiello and Pegoretti but still falls short of human performance. Exploring visual approaches to article segmentation, we reimplement and analyse the state-of-the-art Bansal et al. (2014) approach. We contribute an innovative 2D Markov model approach that captures reading order dependencies and reduces the structured labelling problem to a Markov chain that we decode with Viterbi (1967). Our approach substantially outperforms Bansal et al., achieves accuracy as good as human annotators, and establishes a new state of the art in article segmentation. Our task definition, evaluation framework, and distant supervision dataset will encourage progress in the task of article segmentation. Our state-of-the-art textual and visual approaches will allow sophisticated IR and NLP applications over digitised newspaper archives, supporting research in the digital humanities

    Survey of the State of the Art in Natural Language Generation: Core tasks, applications and evaluation

    Get PDF
    This paper surveys the current state of the art in Natural Language Generation (NLG), defined as the task of generating text or speech from non-linguistic input. A survey of NLG is timely in view of the changes that the field has undergone over the past decade or so, especially in relation to new (usually data-driven) methods, as well as new applications of NLG technology. This survey therefore aims to (a) give an up-to-date synthesis of research on the core tasks in NLG and the architectures adopted in which such tasks are organised; (b) highlight a number of relatively recent research topics that have arisen partly as a result of growing synergies between NLG and other areas of artificial intelligence; (c) draw attention to the challenges in NLG evaluation, relating them to similar challenges faced in other areas of Natural Language Processing, with an emphasis on different evaluation methods and the relationships between them.Comment: Published in Journal of AI Research (JAIR), volume 61, pp 75-170. 118 pages, 8 figures, 1 tabl

    toward modeling kinetics of biomolecular complexes

    Get PDF
    In order to advance the mission of in silico cell biology, modeling the interactions of large and complex biological systems becomes increasingly relevant. The combination of molecular dynamics (MD) and Markov state models (MSMs) have enabled the construction of simplified models of molecular kinetics on long timescales. Despite its success, this approach is inherently limited by the size of the molecular system. With increasing size of macromolecular complexes, the number of independent or weakly coupled subsystems increases, and the number of global system states increase exponentially, making the sampling of all distinct global states unfeasible. In this work, we present a technique called Independent Markov Decomposition (IMD) that leverages weak coupling between subsystems in order to compute a global kinetic model without requiring to sample all combinatorial states of subsystems. We give a theoretical basis for IMD and propose an approach for finding and validating such a decomposition. Using empirical few-state MSMs of ion channel models that are well established in electrophysiology, we demonstrate that IMD can reproduce experimental conductance measurements with a major reduction in sampling compared with a standard MSM approach. We further show how to find the optimal partition of all-atom protein simulations into weakly coupled subunits

    Interactive, multi-purpose traffic prediction platform using connected vehicles dataset

    Get PDF
    Traffic congestion is a perennial issue because of the increasing traffic demand yet limited budget for maintaining current transportation infrastructure; let alone expanding them. Many congestion management techniques require timely and accurate traffic estimation and prediction. Examples of such techniques include incident management, real-time routing, and providing accurate trip information based on historical data. In this dissertation, a speech-powered traffic prediction platform is proposed, which deploys a new deep learning algorithm for traffic prediction using Connected Vehicles (CV) data. To speed-up traffic forecasting, a Graph Convolution -- Gated Recurrent Unit (GC-GRU) architecture is proposed and analysis of its performance on tabular data is compared to state-of-the-art models. GC-GRU's Mean Absolute Percentage Error (MAPE) was very close to Transformer (3.16 vs 3.12) while achieving the fastest inference time and a six-fold faster training time than Transformer, although Long-Short-Term Memory (LSTM) was the fastest in training. Such improved performance in traffic prediction with a shorter inference time and competitive training time allows the proposed architecture to better cater to real-time applications. This is the first study to demonstrate the advantage of using multiscale approach by combining CV data with conventional sources such as Waze and probe data. CV data was better at detecting short duration, Jam and stand-still incidents and detected them earlier as compared to probe. CV data excelled at detecting minor incidents with a 90 percent detection rate versus 20 percent for probes and detecting them 3 minutes faster. To process the big CV data faster, a new algorithm is proposed to extract the spatial and temporal features from the CSV files into a Multiscale Data Analysis (MDA). The algorithm also leverages Graphics Processing Unit (GPU) using the Nvidia Rapids framework and Dask parallel cluster in Python. The results show a seventy-fold speedup in the data Extract, Transform, Load (ETL) of the CV data for the State of Missouri of an entire day for all the unique CV journeys (reducing the processing time from about 48 hours to 25 minutes). The processed data is then fed into a customized UNet model that learns highlevel traffic features from network-level images to predict large-scale, multi-route, speed and volume of CVs. The accuracy and robustness of the proposed model are evaluated by taking different road types, times of day and image snippets of the developed model and comparable benchmarks. To visually analyze the historical traffic data and the results of the prediction model, an interactive web application powered by speech queries is built to offer accurate and fast insights of traffic performance, and thus, allow for better positioning of traffic control strategies. The product of this dissertation can be seamlessly deployed by transportation authorities to understand and manage congestions in a timely manner.Includes bibliographical references

    Explorative Graph Visualization

    Get PDF
    Netzwerkstrukturen (Graphen) sind heutzutage weit verbreitet. Ihre Untersuchung dient dazu, ein besseres Verständnis ihrer Struktur und der durch sie modellierten realen Aspekte zu gewinnen. Die Exploration solcher Netzwerke wird zumeist mit Visualisierungstechniken unterstützt. Ziel dieser Arbeit ist es, einen Überblick über die Probleme dieser Visualisierungen zu geben und konkrete Lösungsansätze aufzuzeigen. Dabei werden neue Visualisierungstechniken eingeführt, um den Nutzen der geführten Diskussion für die explorative Graphvisualisierung am konkreten Beispiel zu belegen.Network structures (graphs) have become a natural part of everyday life and their analysis helps to gain an understanding of their inherent structure and the real-world aspects thereby expressed. The exploration of graphs is largely supported and driven by visual means. The aim of this thesis is to give a comprehensive view on the problems associated with these visual means and to detail concrete solution approaches for them. Concrete visualization techniques are introduced to underline the value of this comprehensive discussion for supporting explorative graph visualization

    Leveraging distant supervision for improved named entity recognition

    Full text link
    Les techniques d'apprentissage profond ont fait un bond au cours des dernières années, et ont considérablement changé la manière dont les tâches de traitement automatique du langage naturel (TALN) sont traitées. En quelques années, les réseaux de neurones et les plongements de mots sont rapidement devenus des composants centraux à adopter dans le domaine. La supervision distante (SD) est une technique connue en TALN qui consiste à générer automatiquement des données étiquetées à partir d'exemples partiellement annotés. Traditionnellement, ces données sont utilisées pour l'entraînement en l'absence d'annotations manuelles, ou comme données supplémentaires pour améliorer les performances de généralisation. Dans cette thèse, nous étudions comment la supervision distante peut être utilisée dans un cadre d'un TALN moderne basé sur l'apprentissage profond. Puisque les algorithmes d'apprentissage profond s'améliorent lorsqu'une quantité massive de données est fournie (en particulier pour l'apprentissage des représentations), nous revisitons la génération automatique des données avec la supervision distante à partir de Wikipédia. On applique des post-traitements sur Wikipédia pour augmenter la quantité d'exemples annotés, tout en introduisant une quantité raisonnable de bruit. Ensuite, nous explorons différentes méthodes d'utilisation de données obtenues par supervision distante pour l'apprentissage des représentations, principalement pour apprendre des représentations de mots classiques (statistiques) et contextuelles. À cause de sa position centrale pour de nombreuses applications du TALN, nous choisissons la reconnaissance d'entité nommée (NER) comme tâche principale. Nous expérimentons avec des bancs d’essai NER standards et nous observons des performances état de l’art. Ce faisant, nous étudions un cadre plus intéressant, à savoir l'amélioration des performances inter-domaines (généralisation).Recent years have seen a leap in deep learning techniques that greatly changed the way Natural Language Processing (NLP) tasks are tackled. In a couple of years, neural networks and word embeddings quickly became central components to be adopted in the domain. Distant supervision (DS) is a well-used technique in NLP to produce labeled data from partially annotated examples. Traditionally, it was mainly used as training data in the absence of manual annotations, or as additional training data to improve generalization performances. In this thesis, we study how distant supervision can be employed within a modern deep learning based NLP framework. As deep learning algorithms gets better when massive amount of data is provided (especially for representation learning), we revisit the task of generating distant supervision data from Wikipedia. We apply post-processing treatments on the original dump to further increase the quantity of labeled examples, while introducing a reasonable amount of noise. Then, we explore different methods for using distant supervision data for representation learning, mainly to learn classic and contextualized word representations. Due to its importance as a basic component in many NLP applications, we choose Named-Entity Recognition (NER) as our main task. We experiment on standard NER benchmarks showing state-of-the-art performances. By doing so, we investigate a more interesting setting, that is, improving the cross-domain (generalization) performances

    Out Of Place: Stone Architecture And Pastoral Nomadism In Prehistoric Inner Asia

    Get PDF
    How architecture reflects the configuration of physical and social spaces among prehistoric pastoral nomads is a topic scarcely explored in the archaeology of Inner Asia, not least because the common preconception is that structural remains are not in keeping with the mobile lifestyle. Yet, the juxtaposition of these two seemingly contrasting strategies of human subsistence forms an interesting paradox that underlies precisely the nature of nomadism. Accordingly, this study questions how pastoral nomads relate to stationary structures and the idea of a locale. To do so, it draws on the archaeological record of stone architecture in the Bortala River Valley of Xinjiang Uyghur Autonomous Region, an area where pastoral nomadism developed in the second and first millennia BCE. With data collected from survey and excavation, this study employs GIS, statistics, and 3D photogrammetry to examine the environment and building patterns of these stone structures on three spatial scales. Built in simple geometric forms recurring in space and time, they correspond typologically to different epochs of human habitation, funerary and ritual activities. Instead of approaching the material typologically, however, this study questions the connection between site selection and architectural design and how the prehistoric landscape of Western Tian Shan was shaped. Three characteristics of place-making and space use are identified. First, the significance of these sites is reinforced through recurring access of specific locations and the adherence to certain building codes. Second, the aggregation of building components over time, like the symbolisms they carry, is cumulative and continuously reconfigured. Third, spatial knowledge is communal. It is anchored to a cartographic palimpsest comprising diverse forms of architecture and art. These preliminary observations form the basis for further modeling, in future research, the logistics of building and cultures of space use among early pastoral societies in Inner Asia on more explicit timescales and in more defined spatial forms

    Out Of Place: Stone Architecture And Pastoral Nomadism In Prehistoric Inner Asia

    Get PDF
    How architecture reflects the configuration of physical and social spaces among prehistoric pastoral nomads is a topic scarcely explored in the archaeology of Inner Asia, not least because the common preconception is that structural remains are not in keeping with the mobile lifestyle. Yet, the juxtaposition of these two seemingly contrasting strategies of human subsistence forms an interesting paradox that underlies precisely the nature of nomadism. Accordingly, this study questions how pastoral nomads relate to stationary structures and the idea of a locale. To do so, it draws on the archaeological record of stone architecture in the Bortala River Valley of Xinjiang Uyghur Autonomous Region, an area where pastoral nomadism developed in the second and first millennia BCE. With data collected from survey and excavation, this study employs GIS, statistics, and 3D photogrammetry to examine the environment and building patterns of these stone structures on three spatial scales. Built in simple geometric forms recurring in space and time, they correspond typologically to different epochs of human habitation, funerary and ritual activities. Instead of approaching the material typologically, however, this study questions the connection between site selection and architectural design and how the prehistoric landscape of Western Tian Shan was shaped. Three characteristics of place-making and space use are identified. First, the significance of these sites is reinforced through recurring access of specific locations and the adherence to certain building codes. Second, the aggregation of building components over time, like the symbolisms they carry, is cumulative and continuously reconfigured. Third, spatial knowledge is communal. It is anchored to a cartographic palimpsest comprising diverse forms of architecture and art. These preliminary observations form the basis for further modeling, in future research, the logistics of building and cultures of space use among early pastoral societies in Inner Asia on more explicit timescales and in more defined spatial forms
    corecore