Article Segmentation in Digitised Newspapers

Author: Naoum Andrew
Publication venue: Faculty of Engineering and Information Technologies, School of Computer Science
Publication date: 01/01/2020

Digitisation projects preserve and make available vast quantities of historical text. Among these, newspapers are an invaluable resource for the study of human culture and history. Article segmentation identifies each region in a digitised newspaper page that contains an article. Digital humanities, information retrieval (IR), and natural language processing (NLP) applications over digitised archives improve access to text and allow automatic information extraction. The lack of article segmentation impedes these applications. We contribute a thorough review of the existing approaches to article segmentation. Our analysis reveals divergent interpretations of the task, and inconsistent and often ambiguously defined evaluation metrics, making comparisons between systems challenging. We solve these issues by contributing a detailed task definition that examines the nuances and intricacies of article segmentation that are not immediately apparent. We provide practical guidelines on handling borderline cases and devise a new evaluation framework that allows insightful comparison of existing and future approaches. Our review also reveals that the lack of large datasets hinders meaningful evaluation and limits machine learning approaches. We solve these problems by contributing a distant supervision method for generating large datasets for article segmentation. We manually annotate a portion of our dataset and show that our method produces article segmentations over characters nearly as well as costly human annotators. We reimplement the seminal textual approach to article segmentation (Aiello and Pegoretti, 2006) and show that it does not generalise well when evaluated on a large dataset. We contribute a framework for textual article segmentation that divides the task into two distinct phases: block representation and clustering. 
We propose several techniques for block representation and contribute a novel highly-compressed semantic representation called similarity embeddings. We evaluate and compare different clustering techniques, and innovatively apply label propagation (Zhu and Ghahramani, 2002) to spread headline labels to similar blocks. Our similarity embeddings and label propagation approach substantially outperforms Aiello and Pegoretti but still falls short of human performance. Exploring visual approaches to article segmentation, we reimplement and analyse the state-of-the-art Bansal et al. (2014) approach. We contribute an innovative 2D Markov model approach that captures reading order dependencies and reduces the structured labelling problem to a Markov chain that we decode with Viterbi (1967). Our approach substantially outperforms Bansal et al., achieves accuracy as good as human annotators, and establishes a new state of the art in article segmentation. Our task definition, evaluation framework, and distant supervision dataset will encourage progress in the task of article segmentation. Our state-of-the-art textual and visual approaches will allow sophisticated IR and NLP applications over digitised newspaper archives, supporting research in the digital humanities.
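The abstract mentions decoding a Markov chain with Viterbi. As a rough illustration only, the sketch below runs Viterbi decoding over a toy chain of text blocks in reading order. The label set (`B` for a block that begins an article, `I` for one that continues it), the two observation types, and every probability are hypothetical values chosen for this example; they are not taken from the thesis, whose actual model and features are not described here.

```python
import math


def to_log(table):
    """Convert a (possibly nested) dict of probabilities to log-probabilities."""
    return {
        key: to_log(value) if isinstance(value, dict) else math.log(value)
        for key, value in table.items()
    }


def viterbi(observations, states, log_start, log_trans, log_emit):
    """Return the most probable state sequence for a first-order Markov chain."""
    # best[t][s] is the best log-score of any path that ends in state s at step t.
    best = [{s: log_start[s] + log_emit[s][observations[0]] for s in states}]
    back = []  # back[t][s] is the predecessor of s on that best path
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda p: best[-1][p] + log_trans[p][s])
            scores[s] = best[-1][prev] + log_trans[prev][s] + log_emit[s][obs]
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final state to recover the path.
    path = [max(states, key=lambda s: best[-1][s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Hypothetical toy model: all numbers below are illustrative assumptions.
states = ["B", "I"]
log_start = to_log({"B": 0.9, "I": 0.1})
log_trans = to_log({"B": {"B": 0.1, "I": 0.9}, "I": {"B": 0.3, "I": 0.7}})
log_emit = to_log({
    "B": {"headline": 0.8, "body": 0.2},
    "I": {"headline": 0.1, "body": 0.9},
})

blocks = ["headline", "body", "body", "headline", "body"]
path = viterbi(blocks, states, log_start, log_trans, log_emit)
print(path)
```

With these toy numbers the decoder starts a new article at each headline-like block and attaches the following body blocks to it. Working in log space avoids numerical underflow on long block sequences.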

Sydney eScholarship

Semantification of text through summarisation

Author: Joshi Monika
Publication date: 01/03/2019

Ulster University's Research Portal

Deep Learning With Sentiment Inference For Discourse-Oriented Opinion Analysis

Author: Marasovic Ana
Publication date: 01/01/2020

Opinions are omnipresent in written and spoken text ranging from editorials, reviews, blogs, guides, and informal conversations to written and broadcast news. However, past research in NLP has mainly addressed explicit opinion expressions, ignoring implicit opinions. As a result, research in opinion analysis has plateaued at a somewhat superficial level, providing methods that only recognize what is explicitly said and do not understand what is implied. In this dissertation, we develop machine learning models for two tasks that presumably support propagation of sentiment in discourse, beyond one sentence. The first task we address is opinion role labeling, i.e., the task of detecting who expressed a given attitude toward what or whom. The second task is abstract anaphora resolution, i.e., the task of finding a (typically) non-nominal antecedent of pronouns and noun phrases that refer to abstract objects like facts, events, actions, or situations in the preceding discourse. We propose a neural model for labeling of opinion holders and targets and circumvent the problems that arise from the limited labeled data. In particular, we extend the baseline model with different multi-task learning frameworks. We obtain clear performance improvements using semantic role labeling as the auxiliary task. We conduct a thorough analysis to demonstrate how multi-task learning helps, what has been solved for the task, and what is next. We show that future developments should improve the ability of the models to capture long-range dependencies and consider other auxiliary tasks such as dependency parsing or recognizing textual entailment. We emphasize that future improvements can be measured more reliably if opinion expressions with missing roles are curated and if the evaluation considers all mentions in opinion role coreference chains as well as discontinuous roles.
To the best of our knowledge, we propose the first abstract anaphora resolution model that handles the unrestricted phenomenon in a realistic setting. We cast abstract anaphora resolution as the task of learning attributes of the relation that holds between the sentence with the abstract anaphor and its antecedent. We propose a Mention-Ranking siamese-LSTM model (MR-LSTM) for learning what characterizes the mentioned relation in a data-driven fashion. The current resources for abstract anaphora resolution are quite limited. However, we can train our models without conventional data for abstract anaphora resolution. In particular, we can train our models on many instances of antecedent-anaphoric sentence pairs. Such pairs can be automatically extracted from parsed corpora by searching for a common construction which consists of a verb with an embedded sentence (complement or adverbial), applying a simple transformation that replaces the embedded sentence with an abstract anaphor, and using the cut-off embedded sentence as the antecedent. We refer to the extracted data as silver data. We evaluate our MR-LSTM models in a realistic task setup in which models need to rank embedded sentences and verb phrases from the sentence with the anaphor as well as a few preceding sentences. We report the first benchmark results on an abstract anaphora subset of the ARRAU corpus (Uryupina et al., 2016) which presents a greater challenge due to a mixture of nominal and pronominal anaphors as well as a greater range of confounders. We also use two additional evaluation datasets: a subset of the CoNLL-12 shared task dataset (Pradhan et al., 2012) and a subset of the ASN corpus (Kolhatkar et al., 2013). We show that our MR-LSTM models outperform the baselines in all evaluation datasets, except for events in the CoNLL-12 dataset. We conclude that training on the small-scale gold data works well if we encounter the same type of anaphors at the evaluation time.
However, the gold training data contains only six shell nouns and events and thus resolution of anaphors in the ARRAU corpus that covers a variety of anaphor types benefits from the silver data. Our MR-LSTM models for resolution of abstract anaphors outperform the prior work for shell noun resolution (Kolhatkar et al., 2013) in their restricted task setup. Finally, we try to get the best out of the gold and silver training data by mixing them. Moreover, we speculate that we could improve the training on a mixture if we: (i) handle artifacts in the silver data with adversarial training and (ii) use multi-task learning to enable our models to make ranking decisions dependent on the type of anaphor. These proposals give us mixed results and hence a robust mixed training strategy remains a challenge.
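The silver-data construction described in this abstract (find a verb with an embedded sentence, replace the embedded sentence with an abstract anaphor, keep the cut-off clause as the antecedent) can be illustrated with a deliberately simplified sketch. The abstract says the pairs are extracted from parsed corpora; the regular expression below is only a stand-in for a syntactic parser, and the verb list, the function name `make_silver_pair`, and the default anaphor `this` are all assumptions made for this example.

```python
import re

# Hypothetical stand-in for a parser: a small list of verbs followed by a
# "that"-complement. A real pipeline would read this from a syntactic parse.
EMBEDDED = re.compile(
    r"^(?P<head>.*?\b(?:said|believes|knew|reported))\s+that\s+(?P<clause>.+?)\.?$"
)


def make_silver_pair(sentence, anaphor="this"):
    """Return (anaphoric_sentence, antecedent), or None if no match is found."""
    match = EMBEDDED.match(sentence.strip())
    if match is None:
        return None
    # Replace the embedded clause with the anaphor ...
    anaphoric_sentence = f"{match.group('head')} {anaphor}."
    # ... and keep the cut-off clause as the antecedent.
    antecedent = match.group("clause")
    return anaphoric_sentence, antecedent


pair = make_silver_pair("The minister said that the bank would close.")
print(pair)
```

Sentences without the targeted construction return `None`, so only sentences that yield a pair would contribute to such a training set.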

Heidelberger Dokumentenserver

From Texts to Prerequisites. Identifying and Annotating Propaedeutic Relations in Educational Textual Resources

Publication venue: Università degli studi di Genova
Publication date: 12/07/2021

Prerequisite Relations (PRs) are dependency relations established between two distinct concepts expressing which piece(s) of information a student has to learn first in order to understand a certain target concept. Such relations are among the most fundamental in Education, playing a crucial role not only in the acquisition of new knowledge, but also in the novel applications of Artificial Intelligence to distance and e-learning. Indeed, resources annotated with such information could be used to develop automatic systems able to acquire and organize the knowledge embodied in educational resources, possibly fostering educational applications personalized to, e.g., students' needs and prior knowledge. The present thesis discusses the issues and challenges of identifying PRs in educational textual materials with the purpose of building a shared understanding of the relation among the research community. To this aim, we present a methodology for dealing with prerequisite relations as established in educational textual resources which aims at providing a systematic approach for uncovering PRs in textual materials, both when manually annotating and automatically extracting the PRs. The fundamental principles of our methodology guided the development of a novel framework for PR identification which comprises three components, each tackling a different task: (i) an annotation protocol (PREAP), reporting the set of guidelines and recommendations for building PR-annotated resources; (ii) an annotation tool (PRET), supporting the creation of manually annotated datasets reflecting the principles of PREAP; (iii) an automatic PR learning method based on machine learning (PREL).
The main novelty of our methodology and framework lies in the fact that we propose to uncover PRs from textual resources relying solely on the content of the instructional material: unlike other works, rather than creating de-contextualised PRs, we acknowledge the presence of a PR between two concepts only if it emerges from the way they are presented in the text. By doing so, we anchor relations to the text while modelling the knowledge structure entailed in the resource. As an original contribution of this work, we explore whether linguistic complexity of the text influences the task of manual identification of PRs. To this aim, we investigate the interplay between text and content in educational texts through a crowd-sourcing experiment on concept sequencing. Our methodology values the content of educational materials as it incorporates the evidence acquired from such investigation, which suggests that PR recognition is highly influenced by the way in which concepts are introduced in the resource and by the complexity of the texts. The thesis reports a case study dealing with every component of the PR framework which produced a novel manually-labelled PR-annotated dataset.
XXXIII CICLO - DIGITAL HUMANITIES. TECNOLOGIE DIGITALI, ARTI, LINGUE, CULTURE E COMUNICAZIONE - Lingue, culture e tecnologie digitali. Alzetta, Chiar

Archivio istituzionale della ricerca - Università di Genova

Proceedings of the Fifth Italian Conference on Computational Linguistics CLiC-it 2018 : 10-12 December 2018, Torino

Author: Alessandro Mazzei, Elena Cabrio, Fabio Tamburini
Publication venue: OpenEdition
Publication date: 01/01/2018

On behalf of the Program Committee, a very warm welcome to the Fifth Italian Conference on Computational Linguistics (CLiC-it 2018). This edition of the conference is held in Torino. The conference is locally organised by the University of Torino and hosted in its prestigious main lecture hall “Cavallerizza Reale”. The CLiC-it conference series is an initiative of the Italian Association for Computational Linguistics (AILC) which, after five years of activity, has clearly established itself as the premier national forum for research and development in the fields of Computational Linguistics and Natural Language Processing, where leading researchers and practitioners from academia and industry meet to share their research results, experiences, and challenges.

Directory of Open Access Books (DOAB)

Machine Learning in Resource-constrained Devices: Algorithms, Strategies, and Applications

Author: Ragusa Edoardo
Publication venue: Università degli studi di Genova
Publication date: 21/02/2019

The ever-increasing growth of technologies is changing people's everyday life. As a major consequence: 1) the amount of available data is growing and 2) several applications rely on battery-supplied devices that are required to process data in real time. In this scenario, the need for ad-hoc strategies for the development of low-power and low-latency intelligent systems capable of learning inductive rules from data using a modest amount of computational resources is becoming vital. At the same time, one needs to develop specific methodologies to manage complex patterns such as text and images. This Thesis presents different approaches and techniques for the development of fast learning models explicitly designed to be hosted on embedded systems. The proposed methods proved able to achieve state-of-the-art performance in terms of the trade-off between generalization capabilities and area requirements when implemented in low-cost digital devices. In addition, advanced strategies for efficient sentiment analysis in text and images are proposed.

Archivio istituzionale della ricerca - Università di Genova

Topical scientific researches into resource-saving technologies of mineral mining and processing

Author: Multi-authored monograph
Publication venue: Publishing House “St. Ivan Rilski”
Publication date: 01/01/2020

Table of contents
Preface
Malanchuk Z.R., Soroka V.S., Lahodniuk O.A., Marchuk M.M.: Physical-mechanical and technological features of amber extraction in the Rivne-Volyn region of Ukraine
Moshynskyi V.S., Korniyenko V.Ya., Khrystyuk A.O., Solvar L.M.: Research of energy effective parameters of the process of hydro mechanical extraction of amber from sandy deposits
Mohamed Tafsir Diallo, Mamadou Oury Fatoumata Diallo: Tidal Park – Modeling and Control Strategy
Savina N.B., Malanchuk L.O., Ignatiuk I.Z., Moshchych S.Z.: Institutional basis and trends of management of the use of the subsoil in Ukraine
Dedelyanova Kr.Y.: Column flotation machine – innovative aeration, vibratory – acoustic and technological researches
Makarenko V.D., Manhura A.M., Lartseva I.I., Manhura S.I.: Magnetic field on asphalt, resin, paraffin and salt deposits
Krzysztof Tomiczek: The problem of beds stability in the conditions of undermining higher deposited beds in the context of selected analytical solutions
Safonyk A.P., Koziar M.M., Martyniuk P.M., Fylypchuk V.L.: Management of pollution - purification system for mining plants
Marinela Panayotova, Vladko Panayotov: Recent developments in the flotation of sulfide ores of base metals - bioflotation
Remez N., Dychko A., Bronytskyi V., Kraychuk S.: Simulation of shock waves from explosion of mixture explosive charges
Melodi M.M., Akande V.O.: Analysis of productivity and technical efficiency in granite aggregate production in selected quarries in south-western, Nigeria
Doroshenko Ya.V., Karpash O.M., Rybitskyi I.V.: Investigation of dispersed contaminates influence on the hydraulic energy consumption of elements of gas pipeline systems with complex geometry
Skipochka S.I., Krukovskyi O.P., Krukovska V.V., Palamarchuk T.A.: Features of methane emission in coal mines at high speed longwall face advance
Daouda Keita, Valery Pozdnyakov: Statistical analysis of experimental data on the indices of operation of the loading units of the bauxite company of Guinea (CBG)
Yevhenii Malanchuk, Sergiy Stets, Ruslan Zhomyruk, Andriy Stets: Modeling of the process of mining of zeolite-smectite tuffs by hydro-well method
Samusia V.I., Kyrychenko Y.O., Cheberiachko I.M., Trofymova O.P.: Development of experimental methods to study heterogenic flows in the context of hydraulic hoisting design
Makarenko V.D., Kharchenko M.O., Manhura A.M., Petrash O.V.: Magnetic treatment of production fluid with high content of asphalt-resin-paraffin deposits
Kovshun N.E., Ignatiuk I.Z., Moshchych S.Z., Malanchuk L.O.: Innovative model of development of fuel and energy complex of Ukraine
Bondarenko A.O., Ostapchuk O.V.: Design and implementation of a jet pump dredge
Sotskov V.O., Dereviahina N.I.: Research of dependencies of stope stress-strain state change under various conditions of partial stowing of developed space
Sakhno S., Liulchenko Y., Chyrva T., Pischikova O.: Determination of bearing capacity and calculation of the gain of the damaged span of a railway overpass by the finite element method
Melodi M.M., Ojulari M.K., Oluwafemi V.I.: Economic and environmental impacts of artisanal gold mining on near-by community of Sauka-Kahuta, Nigeria
Kruchkov A.I., Besarabets Y.J., Yevtieieva L.I.: Energy saving modes of excavators type power shovel
Hryhorash M.V., Kuzminskyi V.P., Ovchinnikova O.V., Kukhar V.Yu.: Energy saving through quality of technical water: new types of mechanical screen filters for various links of water treatment
Didenko M.: The modeling of the interaction of rock mass and compliant lining while it is expanded
Makarenko V.D., Liashenko A.V.: Complex approach to research and selection of hydrocarbon solvents for asphaltene-resin-paraffin-hydrate deposits control
Mykhailovska O.V., Zotsenko M.L.: Investigation of the oscillations amplitudes bases and foundations of the forming machine
Inkin O.V., Puhach A.M., Dereviahina N.I.: Physical-chemical and technological parameters of improving profitability of underground coal burning

Цифровий репозиторій Національного університету водного господарства та природокористування

Legal argumentation concerning Almost Identical Expressions (AIE) in statutory texts

Author: Araszkiewicz Michał, Łopatkiewicz Agata
Publication venue: CEUR
Publication date: 01/01/2015

Jagiellonian University Repository