    Multi-Hypothesis Parsing of Tabular Data in Comma-Separated Values (CSV) Files

    Tabular data on the web comes in various formats and shapes. Preparing such data for analysis and integration requires manual steps that go beyond simple parsing: correct configuration of the parser, removal of meaningless rows, casting of data types, and reshaping of the table structure. The goal of this thesis is the development of a robust and modular system that automatically transforms messy CSV data sources into a tidy tabular data structure. The highly diverse corpus of CSV files from the UK open data hub serves as the basis for the evaluation of the system.
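    The multi-hypothesis idea described above can be sketched minimally: generate a few candidate parser configurations, score each by how consistent the resulting table is, and keep the best. The configuration list and the scoring heuristic below are invented for illustration; the thesis's actual system is far more elaborate.

    ```python
    import csv
    import io

    def parse_with_hypotheses(raw_text):
        """Try several parser configurations and keep the most consistent one."""
        hypotheses = [
            {"delimiter": ","},
            {"delimiter": ";"},
            {"delimiter": "\t"},
        ]
        best_rows, best_score = [], -1.0
        for config in hypotheses:
            rows = list(csv.reader(io.StringIO(raw_text), **config))
            widths = [len(r) for r in rows if r]
            if not widths:
                continue
            dominant = max(set(widths), key=widths.count)
            # Favor configurations whose rows mostly share one width,
            # weighted by that width (so one-column parses score low).
            score = widths.count(dominant) / len(widths) * dominant
            if score > best_score:
                best_rows, best_score = rows, score
        if not best_rows:
            return []
        # Drop rows that deviate from the dominant width (e.g. comment lines).
        widths = [len(r) for r in best_rows]
        dominant = max(set(widths), key=widths.count)
        return [r for r in best_rows if len(r) == dominant]
    ```

    On a semicolon-separated file with a stray comment line, the semicolon hypothesis wins and the comment row is dropped.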

    Marrying Universal Dependencies and Universal Morphology

    The Universal Dependencies (UD) and Universal Morphology (UniMorph) projects each present schemata for annotating the morphosyntactic details of language. Each project also provides corpora of annotated text in many languages - UD at the token level and UniMorph at the type level. As each corpus is built by different annotators, language-specific decisions hinder the goal of universal schemata. With compatible tags, each project's annotations could be used to validate the other's. Additionally, the availability of both type- and token-level resources would be a boon to tasks such as parsing and homograph disambiguation. To ease this interoperability, we present a deterministic mapping from Universal Dependencies v2 features into the UniMorph schema. We validate our approach by lookup in the UniMorph corpora and find a macro-average of 64.13% recall. We also note incompatibilities due to paucity of data on either side. Finally, we present a critical evaluation of the foundations, strengths, and weaknesses of the two annotation projects.
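    A deterministic feature mapping of the kind the abstract describes can be sketched as a lookup table from UD feature-value pairs to UniMorph tags. The handful of pairs below is an illustrative slice only; the paper's full mapping covers far more features and handles interactions between them.

    ```python
    # Illustrative UD (feature, value) -> UniMorph tag pairs.
    UD_TO_UNIMORPH = {
        ("Number", "Sing"): "SG",
        ("Number", "Plur"): "PL",
        ("Tense", "Past"): "PST",
        ("Tense", "Pres"): "PRS",
        ("Person", "1"): "1",
        ("Person", "3"): "3",
    }

    def convert(ud_feats):
        """Map a UD feature string like 'Number=Plur|Tense=Past' to a
        UniMorph-style tag bundle, skipping features with no mapping."""
        tags = []
        for pair in ud_feats.split("|"):
            feat, _, val = pair.partition("=")
            tag = UD_TO_UNIMORPH.get((feat, val))
            if tag:
                tags.append(tag)
        return ";".join(tags)
    ```

    Unmapped features are silently dropped here; the paper instead reports such gaps as incompatibilities between the schemata.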

    Extracting Causal Claims from Information Systems Papers with Natural Language Processing for Theory Ontology Learning

    The number of scientific papers published each year is growing exponentially. How can computational tools support scientists in better understanding and processing this data? This paper presents a software prototype that automatically extracts causes, effects, signs, moderators, mediators, conditions, and interaction signs from propositions and hypotheses of full-text scientific papers. The prototype uses natural language processing methods and a set of linguistic rules for causal information extraction. It is evaluated on a manually annotated corpus of 270 Information Systems papers containing 723 hypotheses and propositions from the AIS basket of eight. F1 scores for the detection and extraction of the different causal variables range between 0.71 and 0.90. The presented automatic causal theory extraction allows for the analysis of scientific papers based on a theory ontology and therefore contributes to the creation and comparison of inter-nomological networks.
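    Rule-based causal extraction of the kind the prototype performs can be illustrated with a single cue pattern: match a causal verb and split the hypothesis into cause and effect spans. The cue list and pattern below are invented for illustration; the actual prototype applies a much richer set of linguistic rules over parsed sentences.

    ```python
    import re

    # Hypothetical causal cue pattern; the real rule set is far larger
    # and operates on syntactic parses, not raw strings.
    CAUSAL_PATTERN = re.compile(
        r"(?P<cause>.+?)\s+(?:leads to|increases|decreases|influences)\s+(?P<effect>.+)",
        re.IGNORECASE,
    )

    def extract_causal_claim(hypothesis):
        """Return a (cause, effect) pair if a causal cue is found, else None."""
        m = CAUSAL_PATTERN.match(hypothesis)
        if m:
            return m.group("cause").strip(), m.group("effect").strip(" .!")
        return None
    ```

    Hypotheses without a recognized cue simply yield no claim, which is one source of the recall ceiling such rule-based systems face.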

    TransParsCit: A Transformer-Based Citation Parser Trained on Large-Scale Synthesized Data

    Accurately parsing citation strings is key to automatically building large-scale citation graphs, so a robust citation parser is an essential module in academic search engines. One limitation of state-of-the-art models (such as ParsCit and Neural-ParsCit) is the lack of a large-scale training corpus, since manually annotating hundreds of thousands of citation strings is laborious and time-consuming. This thesis presents a novel transformer-based citation parser that leverages the GIANT dataset, consisting of 1 billion synthesized citation strings covering over 1500 citation styles. As opposed to handcrafted features, our model benefits from word embeddings and character-based embeddings by combining a bidirectional long short-term memory (BiLSTM) network with the Transformer and a Conditional Random Field (CRF). We varied the training data size from 500 to 1M and investigated the impact of training size on performance. We evaluated our models on the standard CORA benchmark and observed an increase in F1 score as the training size increased. The best performance was achieved with a training size of around 220K, reaching an F1 score of up to 100% on key citation fields. To the best of our knowledge, this is the first citation parser trained on a large-scale synthesized dataset. Project code and documentation can be found in this GitHub repository: https://github.com/lamps-lab/Citation-Parser
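    The task itself, segmenting a citation string into fields such as author, year, and title, can be illustrated with a tiny rule-based baseline. The field names and the year-anchored splitting heuristic below are invented for illustration; the thesis solves the same task with a learned BiLSTM/Transformer + CRF sequence tagger rather than rules.

    ```python
    import re

    def parse_citation_baseline(citation):
        """Split a citation string into coarse fields using the
        parenthesized year as an anchor (illustrative baseline only)."""
        fields = {}
        year = re.search(r"\((\d{4})\)", citation)
        if year:
            fields["year"] = year.group(1)
            before, after = citation.split(year.group(0), 1)
            fields["authors"] = before.strip(" .,")
            # Take the first sentence-like chunk after the year as the title.
            fields["title"] = after.strip(" .,").split(". ")[0].rstrip(".")
        return fields
    ```

    Such heuristics break as soon as a citation style deviates from the expected layout, which is precisely why training a tagger on 1500 synthesized styles is attractive.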