4 research outputs found
Recovering Grammar Relationships for the Java Language Specification
Grammar convergence is a method that helps discovering relationships between
different grammars of the same language or different language versions. The key
element of the method is the operational, transformation-based representation
of those relationships. Given input grammars for convergence, they are
transformed until they are structurally equal. The transformations are composed
from primitive operators; properties of these operators and the composed chains
provide quantitative and qualitative insight into the relationships between the
grammars at hand. We describe a refined method for grammar convergence, and we
use it in a major study, where we recover the relationships between all the
grammars that occur in the different versions of the Java Language
Specification (JLS). The relationships are represented as grammar
transformation chains that capture all accidental or intended differences
between the JLS grammars. This method is mechanized and driven by nominal and
structural differences between pairs of grammars that are subject to
asymmetric, binary convergence steps. We present the underlying operator suite
for grammar transformation in detail, and we illustrate the suite with many
examples of transformations on the JLS grammars. We also describe the
extraction effort, which was needed to make the JLS grammars amenable to
automated processing. We include substantial metadata about the convergence
process for the JLS so that the effort becomes reproducible and transparent
Recovering grammar relationships for the Java language specification
Grammar convergence is a method that helps in discovering relationships between different grammars of the same language or different language versions. The key element of the method is the operational, transformation-based representation of those relationships. Given input grammars for convergence, they are transformed until they are structurally equal. The transformations are composed from primitive operators; properties of these operators and the composed chains provide quantitative and qualitative insight into the relationships between the grammars at hand. We describe a refined method for grammar convergence, and we use it in a major study, where we recover the relationships between all the grammars that occur in the different versions of the Java Language Specification (JLS). The relationships are represented as grammar transformation chains that capture all accidental or intended differences between the JLS grammars. This method is mechanized and driven by nominal and structural differences between pairs of grammars that are subject to asymmetric, binary convergence steps. We present the underlying operator suite for grammar transformation in detail, and we illustrate the suite with many examples of transformations on the JLS grammars. We also describe the extraction effort, which was needed to make the JLS grammars amenable to automated processing. We include substantial metadata about the convergence process for the JLS so that the effort becomes reproducible and transparent
Μηχανική Μάθηση στην Επεξεργασία Φυσικής Γλώσσας
Η διατριβή εξετάζει την χρήση τεχνικών μηχανικής μάθησης σε διάφορα στάδια της
επεξεργασίας φυσικής γλώσσας, κυρίως για σκοπούς εξαγωγής πληροφορίας από
κείμενα. Στόχος είναι τόσο η βελτίωση της προσαρμοστικότητας των συστημάτων
εξαγωγής πληροφορίας σε νέες θεματικές περιοχές (ή ακόμα και γλώσσες), όσο και
η επίτευξη καλύτερης απόδοσης χρησιμοποιώντας όσο το δυνατό λιγότερους πόρους
(τόσο γλωσσικούς όσο και ανθρώπινους). Η διατριβή κινείται σε δύο κύριους
άξονες: α) την έρευνα και αποτίμηση υπαρχόντων αλγορίθμων μηχανικής μάθησης
κυρίως στα στάδια της προ-επεξεργασίας (όπως η αναγνώριση μερών του λόγου) και
της αναγνώρισης ονομάτων οντοτήτων, και β) τη δημιουργία ενός νέου αλγορίθμου
μηχανικής μάθησης και αποτίμησής του, τόσο σε συνθετικά δεδομένα, όσο και σε
πραγματικά δεδομένα από το στάδιο της εξαγωγής σχέσεων μεταξύ ονομάτων
οντοτήτων. Ο νέος αλγόριθμος μηχανικής μάθησης ανήκει στην κατηγορία της
επαγωγικής εξαγωγής γραμματικών, και εξάγει γραμματικές ανεξάρτητες από τα
συμφραζόμενα χρησιμοποιώντας μόνο θετικά παραδείγματα.This thesis examines the use of machine learning techniques in various tasks of
natural language processing, mainly for the task of information extraction from
texts. The objectives are the improvement of adaptability of information
extraction systems to new thematic domains (or even languages), and the
improvement of their performance using as fewer resources (either linguistic or
human) as possible. This thesis has examined two main axes: a) the research and
assessment of existing algorithms of machine learning mainly in the stages of
linguistic pre-processing (such as part of speech tagging) and named-entity
recognition, and b) the creation of a new machine learning algorithm and its
assessment on synthetic data, as well as in real world data from the task of
relation extraction between named entities. This new algorithm belongs to the
category of inductive grammar learning, and can infer context free grammars
from positive examples only