
    Exploring the viability of semi-automated document markup

    Digital humanities scholarship has long acknowledged the abundant theoretical advantages of text encoding; more questionable is whether those advantages can, in practice and in general, outweigh the costs of the usually labor-intensive task of encoding. Markup of literary texts has not yet been undertaken on a scale large enough to realize many of its potential applications and benefits. If we can reduce the human labor required to encode texts, libraries and their users can take greater advantage of the hosts of texts being produced by various mass digitization projects, and can focus more attention on implementing tools that use the underlying encodings. How far can automation take an encoding effort? And what implications might that have for libraries and their users? Compelled by such questions, this paper explores the viability of semi-automated text encoding.

    Implementing a Portable Clinical NLP System with a Common Data Model - a Lisp Perspective

    This paper presents a Lisp architecture for a portable NLP system, termed LAPNLP, for processing clinical notes. LAPNLP integrates multiple standard, customized, and in-house developed NLP tools. Our system facilitates portability across different institutions and data systems by incorporating an enriched Common Data Model (CDM) to standardize the necessary data elements. It utilizes UMLS to perform domain adaptation when integrating generic-domain NLP tools. It also features stand-off annotations that are specified by positional reference to the original document. We built an interval tree based search engine to efficiently query and retrieve the stand-off annotations by specifying positional requirements. We also developed a utility to convert an inline annotation format to stand-off annotations to enable the reuse of clinical text datasets with inline annotations. We experimented with our system on several NLP-facilitated tasks, including computational phenotyping for lymphoma patients and semantic relation extraction for clinical notes. These experiments showcased the broader applicability and utility of LAPNLP. Comment: 6 pages, accepted by IEEE BIBM 2018 as a regular paper.
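
    The interval-tree lookup of stand-off annotations can be illustrated with a short, self-contained sketch. This is not LAPNLP's Lisp code; it is a Python approximation that assumes the third-party intervaltree package, and the annotation payloads are invented for illustration.

        # Index stand-off annotations (start, end, payload) that reference the
        # original clinical note by character position, then retrieve them by
        # positional window.
        from intervaltree import IntervalTree   # pip install intervaltree

        annotations = [
            (0, 7, {"type": "Disease", "cui": "C0024299"}),   # hypothetical payloads
            (5, 31, {"type": "Sentence"}),
            (23, 31, {"type": "Drug"}),
        ]

        tree = IntervalTree()
        for start, end, payload in annotations:
            tree.addi(start, end, payload)

        # All annotations overlapping character positions 4..10 of the document.
        for hit in sorted(tree.overlap(4, 10)):
            print(hit.begin, hit.end, hit.data)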

    A Semantic Web Annotation Tool for a Web-Based Audio Sequencer

    Music and sound have a rich semantic structure that is clear to the composer and the listener but remains mostly hidden to computing machinery. Nevertheless, in recent years, the introduction of software tools for music production has enabled new opportunities for migrating this knowledge from humans to machines. A new generation of these tools may exploit the coupling of sound samples and semantic information for the creation not only of a musical composition but also of a "semantic" one. In this paper we describe an ontology-driven content annotation framework for a web-based audio editing tool. In a supervised approach, during the editing process, the graphical web interface allows the user to annotate any part of the composition with concepts from publicly available ontologies. As a test case, we developed a collaborative web-based audio sequencer that provides users with the functionality to remix the audio samples from the Freesound website and subsequently annotate them. The annotation tool can load any ontology and thus gives users the opportunity to augment the work with annotations on the structure of the composition, the musical materials, and the creator's reasoning and intentions. We believe this approach will provide several novel ways to make not only the final audio product, but also the creative process, first-class citizens of the Semantic Web.
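
    As a rough illustration of what such an ontology-driven annotation might look like in practice, the following sketch builds a tiny RDF graph for one region of a composition. It assumes the rdflib package; the Freesound sample URL, the example namespace and the property names are placeholders, not the vocabulary actually used by the tool described in the paper.

        from rdflib import Graph, Literal, Namespace, URIRef
        from rdflib.namespace import RDF

        MO = Namespace("http://purl.org/ontology/mo/")       # Music Ontology (publicly available)
        EX = Namespace("http://example.org/composition/")    # hypothetical project namespace

        g = Graph()
        region = EX["track1/region42"]
        sample = URIRef("https://freesound.org/s/123456/")   # hypothetical Freesound sample

        g.add((region, RDF.type, MO.Signal))
        g.add((region, EX.usesSample, sample))
        g.add((region, EX.creatorIntent, Literal("build tension before the drop")))

        print(g.serialize(format="turtle"))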

    A flexible software architecture concept for the creation of accessible PDF documents

    This paper presents a flexible software architecture concept that allows the automatic generation of fully accessible PDF documents originating from various authoring tools such as Adobe InDesign or Microsoft Word. The architecture can be extended to include any authoring tool capable of creating PDF documents. For each authoring tool, a software accessibility plug-in must be implemented that analyzes the logical structure of the document and creates an XML representation of it. This XML file is used in combination with an untagged, non-accessible PDF to create an accessible PDF version of the document. The implemented accessibility plug-in prototype allows authors of documents to check for accessibility issues while creating their documents and to add the additional semantic information needed to generate a fully accessible PDF document.
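
    The kind of intermediate logical-structure XML such a plug-in might emit can be sketched briefly; the element and attribute names below are illustrative assumptions, not the schema defined by the paper.

        # Build a toy logical-structure export that would later be merged with
        # the untagged PDF to produce a tagged, accessible version.
        import xml.etree.ElementTree as ET

        doc = ET.Element("document", lang="en")
        ET.SubElement(doc, "heading", level="1").text = "Annual Report"
        fig = ET.SubElement(doc, "figure")
        ET.SubElement(fig, "alt-text").text = "Bar chart of quarterly revenue"
        ET.SubElement(doc, "paragraph").text = "Revenue grew in every quarter."

        ET.indent(doc)  # requires Python 3.9+
        print(ET.tostring(doc, encoding="unicode"))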

    Enhanced Integrated Scoring for Cleaning Dirty Texts

    An increasing number of approaches for ontology engineering from text are gearing towards the use of online sources such as company intranets and the World Wide Web. Despite this rise, not much work can be found on preprocessing and cleaning dirty texts from online sources. This paper presents an enhancement of Integrated Scoring for Spelling error correction, Abbreviation expansion and Case restoration (ISSAC). ISSAC is implemented as part of a text preprocessing phase in an ontology engineering system. New evaluations performed on the enhanced ISSAC using 700 chat records reveal an improved accuracy of 98%, compared to 96.5% and 71% based on the use of only basic ISSAC and of Aspell, respectively. Comment: More information is available at http://explorer.csse.uwa.edu.au/reference
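
    The abstract does not spell out ISSAC's scoring model, but the general "integrated scoring" idea can be sketched: each noisy token receives candidate replacements from spelling correction, abbreviation expansion and case restoration, and the highest-scoring candidate is kept. The lexicons and weights below are illustrative assumptions, not ISSAC's actual model.

        ABBREVIATIONS = {"u": "you", "pls": "please"}    # toy lexicon
        PROPER_NOUNS = {"monday", "perth"}               # toy lexicon

        def candidates(token):
            cands = [(token, 0.1)]                                       # keep-as-is baseline
            if token.lower() in ABBREVIATIONS:
                cands.append((ABBREVIATIONS[token.lower()], 0.9))        # abbreviation expansion
            if token.lower() in PROPER_NOUNS:
                cands.append((token.capitalize(), 0.8))                  # case restoration
            # A real system would also add spell-checker suggestions here.
            return cands

        def clean(tokens):
            return [max(candidates(t), key=lambda c: c[1])[0] for t in tokens]

        print(clean("pls meet me monday".split()))   # ['please', 'meet', 'me', 'Monday']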

    Social media analytics: a survey of techniques, tools and platforms

    This paper is written for (social science) researchers seeking to analyze the wealth of social media now available. It presents a comprehensive review of software tools for social networking media, wikis, Really Simple Syndication (RSS) feeds, blogs, newsgroups, chat and news feeds. For completeness, it also includes introductions to social media scraping, storage, data cleaning and sentiment analysis. Although principally a review, the paper also provides a methodology and a critique of social media tools. Analyzing social media, in particular Twitter feeds for sentiment analysis, has become a major research and business activity due to the availability of web-based application programming interfaces (APIs) provided by Twitter, Facebook and news services. This has led to an ‘explosion’ of data services, software tools for scraping and analysis, and social media analytics platforms. It is also a research area undergoing rapid change and evolution due to commercial pressures and the potential for using social media data for computational (social science) research. Using a simple taxonomy, this paper provides a review of leading software tools and how to use them to scrape, cleanse and analyze the spectrum of social media. In addition, it discusses the requirements of an experimental computational environment for social media research and presents, as an illustration, the system architecture of a social media (analytics) platform built by University College London. The principal contribution of this paper is to provide an overview (including code fragments) for scientists seeking to utilize social media scraping and analytics either in their research or business. The data retrieval techniques presented in this paper are valid at the time of writing (June 2014), but they are subject to change since social media data scraping APIs are rapidly changing.
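
    In the spirit of the paper's code fragments, a minimal API-based scraping call might look like the sketch below. The endpoint, parameters and token handling are placeholders: real platform APIs (e.g. Twitter) require registered credentials and change frequently, as the paper itself warns.

        import requests

        API_URL = "https://api.example.com/v1/search"   # hypothetical endpoint
        TOKEN = "YOUR_ACCESS_TOKEN"                      # issued by the platform

        def search_posts(query, max_results=100):
            resp = requests.get(
                API_URL,
                params={"q": query, "count": max_results},
                headers={"Authorization": f"Bearer {TOKEN}"},
                timeout=30,
            )
            resp.raise_for_status()
            return resp.json()

        # Print the text of each returned post (response layout is illustrative).
        for post in search_posts("#opendata").get("results", []):
            print(post.get("text", ""))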

    Generating summary documents for a variable-quality PDF document collection

    The Cochrane Schizophrenia Group’s Register of studies details all aspects of the effects of treating people with schizophrenia. It has been gathered over the last 20 years and consists of around 20,000 documents, overwhelmingly in PDF. Document collections of this sort, on a given theme but gathered from a wide range of sources, will generally have huge variability in the quality of the PDF, particularly with respect to the key property of text searchability. Summarising the results from the best of these papers, to allow evidence-based health care decision making, has so far been done by manually creating a summary document, starting from a visual inspection of the relevant PDF file. This labour-intensive process has resulted, to date, in only 4,000 of the papers being summarised, with enormous duplication of effort and with many issues around the validity and reliability of the data extraction. This paper describes a pilot project to provide a computer-assisted framework in which any of the PDF documents can be searched for the occurrence of some 8,000 keywords and key phrases. Once keyword tagging has been completed, the framework assists in the generation of a standard summary document, thereby greatly speeding up the production of these summaries. Early examples of the framework are described and its capabilities illustrated.
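
    A minimal sketch of the keyword-tagging step is given below. It extracts whatever text a PDF exposes and records where each keyword occurs; the pdfminer.six package and the tiny keyword list are assumptions, since the paper does not name its extraction tooling.

        import re
        from pdfminer.high_level import extract_text   # pip install pdfminer.six

        KEYWORDS = ["haloperidol", "randomised", "double-blind"]   # illustrative subset of ~8,000

        def tag_keywords(pdf_path):
            text = extract_text(pdf_path).lower()
            hits = {}
            for kw in KEYWORDS:
                positions = [m.start() for m in re.finditer(re.escape(kw), text)]
                if positions:
                    hits[kw] = positions
            return hits

        print(tag_keywords("trial_report.pdf"))   # e.g. {'randomised': [1042, 8810]}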

    Supporting text mining for e-Science: the challenges for Grid-enabled natural language processing

    Over the last few years, language technology has moved rapidly from 'applied research' to 'engineering', and from small-scale to large-scale engineering. Applications such as advanced text mining systems are feasible, but very resource-intensive, while research seeking to address the underlying language processing questions faces very real practical and methodological limitations. The e-Science vision, and the creation of the e-Science Grid, promises the level of integrated, large-scale technological support required to sustain this important and successful new technology area. In this paper, we discuss the foundations for the deployment of text mining and other language technology on the Grid: the protocols and tools required to build distributed, large-scale language technology systems that meet the needs of users, application builders and researchers.

    Improving the Representation and Conversion of Mathematical Formulae by Considering their Textual Context

    Mathematical formulae represent complex semantic information in a concise form. Especially in Science, Technology, Engineering, and Mathematics, mathematical formulae are crucial to communicate information, e.g., in scientific papers, and to perform computations using computer algebra systems. Enabling computers to access the information encoded in mathematical formulae requires machine-readable formats that can represent both the presentation and the content, i.e., the semantics, of formulae. Exchanging such information between systems additionally requires conversion methods for mathematical representation formats. We analyze how the semantic enrichment of formulae improves the format conversion process and show that considering the textual context of formulae reduces the error rate of such conversions. Our main contributions are: (1) providing an openly available benchmark dataset for the mathematical format conversion task, consisting of a newly created test collection, an extensive, manually curated gold standard, and task-specific evaluation metrics; (2) performing a quantitative evaluation of state-of-the-art tools for mathematical format conversions; (3) presenting a new approach that considers the textual context of formulae to reduce the error rate of mathematical format conversions. Our benchmark dataset facilitates future research on mathematical format conversions as well as research on many problems in mathematical information retrieval. Because we annotated and linked all components of formulae, e.g., identifiers, operators and other entities, to Wikidata entries, the gold standard can, for instance, be used to train methods for formula concept discovery and recognition. Such methods can then be applied to improve mathematical information retrieval systems, e.g., for semantic formula search, recommendation of mathematical content, or detection of mathematical plagiarism. Comment: 10 pages, 4 figures.
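
    How textual context can steer a conversion can be illustrated with a deliberately simple sketch: pick the semantic reading of an identifier whose cue word appears near the formula, and fall back to a context-free conversion otherwise. The lexicon and the matching rule below are illustrative assumptions, not the method evaluated in the paper.

        CONTEXT_LEXICON = {
            "E": {"energy": "energy of the system", "expectation": "expected value"},
            "m": {"mass": "mass", "slope": "slope of a line"},
        }

        def disambiguate(identifier, context_words):
            for cue, sense in CONTEXT_LEXICON.get(identifier, {}).items():
                if cue in context_words:          # cue word found near the formula
                    return sense
            return None                            # fall back to context-free conversion

        context = "the energy E of a particle with mass m".lower().split()
        print(disambiguate("E", context))   # 'energy of the system'
        print(disambiguate("m", context))   # 'mass'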