76 research outputs found

    Integrating deep and shallow natural language processing components: representations and hybrid architectures

    We describe basic concepts and software architectures for the integration of shallow and deep (linguistics-based, semantics-oriented) natural language processing (NLP) components. The main goal of this novel, hybrid integration paradigm is to improve the robustness of deep processing. After an introduction to constraint-based natural language parsing, we give an overview of typical shallow processing tasks. We introduce XML standoff markup as an additional abstraction layer that eases integration of NLP components, and propose the use of XSLT as a standardized and efficient transformation language for online NLP integration. In the main part of the thesis, we describe our contributions to three hybrid architecture frameworks that build on these fundamentals. SProUT is a shallow system that uses elements of deep constraint-based processing, namely a type hierarchy and typed feature structures. WHITEBOARD is the first hybrid architecture to integrate not only part-of-speech tagging but also named entity recognition and topological parsing with deep parsing. Finally, we present Heart of Gold, a middleware architecture that generalizes WHITEBOARD along several dimensions such as configurability, multilinguality and flexible processing strategies. We describe various applications that have been implemented using the hybrid frameworks, such as structured named entity recognition, information extraction, creative document authoring support and deep question analysis, as well as evaluations. In WHITEBOARD, for example, it could be shown that shallow pre-processing increases both coverage and efficiency of deep parsing by a factor of more than two. Heart of Gold not only forms the basis for applications that utilize semantics-oriented natural language analysis, but also constitutes a complex research instrument for experimenting with novel processing strategies combining deep and shallow methods, and it eases replication and comparability of results.
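    To make the integration idea above concrete, the following is a minimal sketch of XML standoff markup and an XSLT transformation in Python (using lxml). The element and attribute names are hypothetical illustrations, not the actual formats used in SProUT, WHITEBOARD or Heart of Gold.

```python
# Minimal sketch: standoff annotations over raw text, merged via XSLT.
# All element/attribute names below are hypothetical.
from lxml import etree

text = "Angela Merkel visited Paris."

# Standoff annotations refer to character offsets in the raw text instead of
# wrapping it, so several NLP components can annotate the same text independently.
standoff = etree.XML(b"""
<annotations>
  <ne start="0" end="13" type="person"/>
  <ne start="22" end="27" type="location"/>
</annotations>
""")

# A small XSLT stylesheet that turns named-entity standoff annotations into
# the kind of input constraints a deep parser might consume (hypothetical format).
stylesheet = etree.XML(b"""
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/annotations">
    <parser-input>
      <xsl:for-each select="ne">
        <token-span from="{@start}" to="{@end}" category="{@type}"/>
      </xsl:for-each>
    </parser-input>
  </xsl:template>
</xsl:stylesheet>
""")

transform = etree.XSLT(stylesheet)
result = transform(standoff)
print(etree.tostring(result, pretty_print=True).decode())

# The offsets can always be resolved back against the raw text:
for ne in standoff.iter("ne"):
    start, end = int(ne.get("start")), int(ne.get("end"))
    print(ne.get("type"), "->", text[start:end])
```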

    Natural language search of structured documents

    Thesis (M. Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2008. Includes bibliographical references (leaves 45-47). This thesis focuses on techniques with which natural language can be used to search for specific elements in a structured document, such as an XML file. The goal is to create a system that can be trained to identify features of written English sentences describing (in natural language) part of an XML document, features that help identify which sections of the document are being discussed. In particular, this thesis revolves around the problem of searching through XML documents, each of which describes the play-by-play events of a baseball game. These events are collected from Major League Baseball games between 2004 and 2008 and contain information detailing the outcome of every pitch thrown. My techniques are trained and tested on written (newspaper) summaries of these games, which often refer to specific game events and statistics. The choice of these training data makes the task more complex in two ways. First, the summaries come from multiple authors, each with a distinct writing style that uses language in a unique and often complex way. Second, large portions of the summaries discuss facts outside the context of the play-by-play events in the XML documents. Training the system on these portions of a summary can create a sparse-data problem, which has the potential to reduce the effectiveness of the system. The end result is a system capable of building classifiers for natural language search of these XML documents. This system is able to overcome the two aforementioned problems, as well as several more subtle challenges. In addition, several limitations of alternative, strictly feature-based classifiers are illustrated, and applications of this research to related problems (outside of baseball and sports) are discussed. By Stephen W. Oney. M.Eng.
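    As an illustration of the kind of matching problem this thesis addresses, the following is a small, hypothetical Python sketch that scores play-by-play events in a game XML file by lexical overlap with a summary sentence. The XML structure, field names and scoring function are invented stand-ins, not the thesis's actual data format or trained classifiers.

```python
# Illustrative sketch (not the thesis system): rank play-by-play <event>
# elements by word overlap with a summary sentence. XML layout is hypothetical.
import re
import xml.etree.ElementTree as ET

game_xml = """
<game>
  <event inning="3" id="e17" description="Ortiz homers to right field, scoring two"/>
  <event inning="7" id="e52" description="Rivera strikes out Ramirez swinging"/>
</game>
"""

def tokens(text):
    """Lower-cased word tokens; a crude stand-in for real feature extraction."""
    return set(re.findall(r"[a-z]+", text.lower()))

def rank_events(sentence, root):
    """Rank <event> elements by word overlap with the summary sentence."""
    sent = tokens(sentence)
    scored = []
    for ev in root.iter("event"):
        overlap = len(sent & tokens(ev.get("description", "")))
        scored.append((overlap, ev.get("id")))
    return sorted(scored, reverse=True)

root = ET.fromstring(game_xml)
sentence = "Ortiz's two-run homer in the third put the game out of reach."
print(rank_events(sentence, root))   # e17 should score highest
```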

    A Syntactical Reverse Engineering Approach to Fourth Generation Programming Languages Using Formal Methods

    Fourth-generation programming languages (4GLs) feature rapid development with minimal configuration required by developers. However, 4GLs can suffer from limitations such as high maintenance cost and legacy software practices. Reverse engineering an existing large legacy 4GL system into a currently maintainable programming language can be a cheaper and more effective solution than rewriting from scratch. No tools exist so far for reverse engineering proprietary XML-like and model-driven 4GLs whose full language specification is not in the public domain. This research has developed a novel method of reverse engineering some of the syntax of such 4GLs (with Uniface as an exemplar) derived from a particular system, with a view to providing a reliable method to translate/transpile that system's code and data structures into a modern object-oriented language (such as C#). The method was also applied, although only to a limited extent, to some other 4GLs, Informix and Apex, to show that it was in principle more broadly applicable. A novel testing method, based on abstract syntax trees, was provided to verify that the syntax had been successfully translated. The novel method took manually crafted grammar rules, together with Encapsulated Document Object Model-based data from the source language, and then used parsers to produce syntactically valid and equivalent code in the target/output language. This proof-of-concept research has provided a methodology plus sample code to automate part of the process. The methodology comprised a set of manual or semi-automated steps; further automation is left for future research. In principle, the author's method could be extended to allow recovery of the syntax of systems developed in other proprietary 4GLs. This would reduce the time and cost of ongoing maintenance of such systems by enabling their software engineers to work with modern object-oriented languages, methodologies, tools and techniques.
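    The following toy Python sketch illustrates the general shape of the approach described above: hand-written grammar rules translate an XML-like 4GL fragment into target code, and the result is then checked structurally via an abstract syntax tree. The 4GL fragment is invented (not real Uniface syntax), and Python stands in for the thesis's C# target only so the sketch stays self-contained and runnable.

```python
# Toy sketch of rule-based translation plus AST-based checking.
# The source fragment and rules are hypothetical, not the thesis's method.
import ast
import xml.etree.ElementTree as ET

source_4gl = '<assign target="total"><add left="price" right="tax"/></assign>'

def translate(node):
    """Recursively translate the toy 4GL element tree into target code."""
    if node.tag == "assign":
        return f'{node.get("target")} = {translate(node[0])}'
    if node.tag == "add":
        return f'{node.get("left")} + {node.get("right")}'
    raise ValueError(f"no grammar rule for <{node.tag}>")

generated = translate(ET.fromstring(source_4gl))
print(generated)                       # total = price + tax

# AST-based check: the generated code must parse to the same tree shape as a
# hand-written reference, independent of whitespace or formatting.
reference = "total = price + tax"
assert ast.dump(ast.parse(generated)) == ast.dump(ast.parse(reference))
```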

    Acta Cybernetica: Volume 21, Number 3.
