11 research outputs found

    Integrating deep and shallow natural language processing components : representations and hybrid architectures

    We describe basic concepts and software architectures for the integration of shallow and deep (linguistics-based, semantics-oriented) natural language processing (NLP) components. The main goal of this novel, hybrid integration paradigm is improving the robustness of deep processing. After an introduction to constraint-based natural language parsing, we give an overview of typical shallow processing tasks. We introduce XML standoff markup as an additional abstraction layer that eases integration of NLP components, and propose the use of XSLT as a standardized and efficient transformation language for online NLP integration. In the main part of the thesis, we describe our contributions to three hybrid architecture frameworks that make use of these fundamentals. SProUT is a shallow system that uses elements of deep constraint-based processing, namely a type hierarchy and typed feature structures. WHITEBOARD is the first hybrid architecture to integrate not only part-of-speech tagging, but also named entity recognition and topological parsing, with deep parsing. Finally, we present Heart of Gold, a middleware architecture that generalizes WHITEBOARD along various dimensions such as configurability, multilinguality and flexible processing strategies. We describe various applications that have been implemented using the hybrid frameworks, such as structured named entity recognition, information extraction, creative document authoring support and deep question analysis, as well as evaluations. In WHITEBOARD, for example, it could be shown that shallow pre-processing increases both the coverage and the efficiency of deep parsing by a factor of more than two. Heart of Gold not only forms the basis for applications that utilize semantics-oriented natural language analysis, but also constitutes a complex research instrument for experimenting with novel processing strategies combining deep and shallow methods, and eases replication and comparability of results.
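    As a rough illustration of the standoff markup and XSLT integration idea described above (a sketch, not the Heart of Gold or WHITEBOARD implementation), the following Python snippet uses lxml to apply a small XSLT stylesheet to an invented standoff annotation; the element names, attributes and example sentence are made up for this example.

```python
# Minimal sketch of XSLT-driven integration of standoff NLP annotation.
# The annotation format below is invented for illustration and does not
# reproduce the Heart of Gold or WHITEBOARD schemas.
from lxml import etree

# Standoff markup: annotations refer to character offsets of the base text
# instead of being embedded in it.
standoff = etree.XML(b"""
<annotation text="Angela Merkel visited Paris.">
  <ne type="person" start="0" end="13"/>
  <ne type="location" start="22" end="27"/>
</annotation>
""")

# An XSLT stylesheet transforms the shallow standoff annotation into a
# simple structure that a downstream (deep) component could consume.
stylesheet = etree.XML(b"""
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/annotation">
    <entities>
      <xsl:for-each select="ne">
        <entity class="{@type}">
          <xsl:value-of select="substring(/annotation/@text, @start + 1, @end - @start)"/>
        </entity>
      </xsl:for-each>
    </entities>
  </xsl:template>
</xsl:stylesheet>
""")

transform = etree.XSLT(stylesheet)
print(etree.tostring(transform(standoff), pretty_print=True).decode())
```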

    Representation and Processing of Composition, Variation and Approximation in Language Resources and Tools

    In my habilitation dissertation, meant to validate my capacity of and maturity for directing research activities, I present a panorama of several topics in computational linguistics, linguistics and computer science. Over the past decade, I was notably concerned with the phenomena of compositionality and variability of linguistic objects. I illustrate the advantages of a compositional approach to language in the domain of emotion detection, and I explain how some linguistic objects, most prominently multi-word expressions (MWEs), defy the compositionality principles. I demonstrate that the complex properties of MWEs, notably variability, are partially regular and partially idiosyncratic. This fact places MWEs on the frontiers between different levels of linguistic processing, such as the lexicon and syntax. I show the highly heterogeneous nature of MWEs by citing their two existing taxonomies. After an extensive state-of-the-art study of MWE description and processing, I summarize Multiflex, a formalism and a tool for high-quality lexical morphosyntactic description of multi-word units (MWUs). It uses a graph-based approach in which the inflection of an MWU is expressed as a function of the morphology of its components and of morphosyntactic transformation patterns. Due to unification, the inflection paradigms are represented compactly. Orthographic, inflectional and syntactic variants are treated within the same framework. The proposal is multilingual: it has been tested on six European languages of three different origins (Germanic, Romance and Slavic), and I believe that many others can also be successfully covered. Multiflex proves interoperable. It adapts to different morphological language models, token boundary definitions, and underlying modules for the morphology of single words. It has been applied to the creation and enrichment of linguistic resources, as well as to morphosyntactic analysis and generation. It can be integrated into other NLP applications requiring the conflation of different surface realizations of the same concept. Another chapter of my activity concerns named entities, most of which are particular types of MWEs. Their rich semantic load turned them into a hot topic in the NLP community, which is documented in my state-of-the-art survey. I present the main assumptions, processes and results resulting from large annotation tasks at two levels (for named entities and for coreference), which were part of the construction of the National Corpus of Polish. I have also contributed to the development of both rule-based and probabilistic named entity recognition tools, and to an automated enrichment of Prolexbase, a large multilingual database of proper names, from open sources. With respect to multi-word expressions, named entities and coreference mentions, I pay special attention to nested structures. This problem sheds new light on the treatment of complex linguistic units in NLP. When these units start being modeled as trees (or, more generally, as acyclic graphs) rather than as flat sequences of tokens, long-distance dependencies, discontinuities, overlapping and other frequent linguistic properties become easier to represent. This calls for more complex processing methods which control larger contexts than what usually happens in sequential processing. Thus, both named entity recognition and coreference resolution come very close to parsing, and named entities or mentions with their nested structures are analogous to multi-word expressions with embedded complements. My parallel activity concerns finite-state methods for natural language and XML processing. My main contribution in this field, co-authored with two colleagues, is the first full-fledged method for tree-to-language correction, and more precisely for correcting XML documents with respect to a DTD. We have also produced interesting results in incremental finite-state algorithmics, particularly relevant to data evolution contexts such as dynamic vocabularies or user updates. Multilingualism is the leitmotif of my research. I have applied my methods to several natural languages, most importantly to Polish, Serbian, English and French. I have been among the initiators of a highly multilingual European scientific network dedicated to parsing and multi-word expressions. I have used multilingual linguistic data in experimental studies. I believe that it is particularly worthwhile to design NLP solutions taking declension-rich (e.g. Slavic) languages into account, since this leads to more universal solutions, at least as far as nominal constructions (MWUs, NEs, mentions) are concerned. For instance, although Multiflex was developed with Polish in mind, it could be applied as such to French, English, Serbian and Greek. Also, a French-Serbian collaboration led to substantial modifications in morphological modeling in Prolexbase in its early development stages. This allowed for its later application to Polish with very few adaptations of the existing model. Other researchers also stress the advantages of NLP studies on highly inflected languages, since their morphology encodes much more syntactic information than is the case, e.g., in English. In this dissertation I am also supposed to demonstrate my ability to play an active role in shaping the scientific landscape on a local, national and international scale. I describe my: (i) various scientific collaborations and supervision activities, (ii) roles in over 10 regional, national and international projects, (iii) responsibilities in collective bodies such as program and organizing committees of conferences and workshops, PhD juries, and the National University Council (CNU), (iv) activity as an evaluator and a reviewer of European collaborative projects. The issues addressed in this dissertation open interesting scientific perspectives, in which a special emphasis is placed on links among various domains and communities. These perspectives include: (i) integrating fine-grained language data into the linked open data, (ii) deep parsing of multi-word expressions, (iii) modeling multi-word expression identification in a treebank as a tree-to-language correction problem, and (iv) a taxonomy and an experimental benchmark for tree-to-language correction approaches
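    The component-wise inflection idea behind Multiflex can be pictured with a deliberately tiny sketch; the rule format, paradigm table and English example below are invented and far simpler than the actual graph-based Multiflex formalism.

```python
# Toy illustration of component-wise MWU inflection; the rule format is
# invented and far simpler than Multiflex's graph-based formalism.

# Single-word morphology module: paradigm tables for the components.
PARADIGMS = {
    "attorney": {"sg": "attorney", "pl": "attorneys"},
    "general":  {"sg": "general",  "pl": "general"},   # uninflected in this MWU
}

# A transformation pattern states, per component, whether it agrees with
# the requested features or keeps a fixed form.
PATTERN = [("attorney", "agree"), ("general", "fixed")]

def inflect_mwu(pattern, number):
    """Inflect the MWU by inflecting each component according to the pattern."""
    forms = []
    for lemma, behaviour in pattern:
        key = number if behaviour == "agree" else "sg"
        forms.append(PARADIGMS[lemma][key])
    return " ".join(forms)

print(inflect_mwu(PATTERN, "pl"))   # -> "attorneys general"
```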

    Inquiries into the lexicon-syntax relations in Basque

    Index: - Foreword. B. Oyharçabal. - Morphosyntactic disambiguation and shallow parsing in computational processing in Basque. I. Aduriz, A. Díaz de Ilarraza. - The transitivity of borrowed verbs in Basque: an outline. X. Alberdi. - Patrixa: a unification-based parser for Basque and its application to the automatic analysis of verbs. I. Aldezabal, M. J. Aranzabe, A. Atutxa, K. Gojenola, K. Sarasola. - Learning argument/adjunct distinction for Basque. I. Aldezabal, M. J. Aranzabe, K. Gojenola, K. Sarasola, A. Atutxa. - Analyzing verbal subcategorization aimed at its computational application. I. Aldezabal, P. Goenaga. - Automatic extraction of verb patterns from “hauta-lanerako euskal hiztegia”. J. M. Arriola, X. Artola, A. Soroa. - The case of an enlightening, provoking and admirable Basque derivational suffix with implications for the theory of argument structure. X. Artiagoitia. - Verb-deriving processes in Basque. J. C. Odriozola. - Lexical causatives and causative alternation in Basque. B. Oyharçabal. - Causation and semantic control: diagnosis of incorrect use in minorized languages. I. Zabala. - Subject index. - Contributions

    Extraction of ontology schema components from financial news

    In this thesis we describe an incremental multi-layer rule-based methodology for the extraction of ontology schema components from German financial newspaper text. By extraction of ontology schema components we mean the detection of new concepts and of relations between these concepts for ontology building. The process of detecting concepts and relations between these concepts corresponds to the intensional part of an ontology and is often referred to as ontology learning. We present the process of rule generation for the extraction of ontology schema components as well as the application of the generated rules.
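    To make the idea of rule-based extraction of schema components more concrete, here is a deliberately minimal sketch of what a single lexico-syntactic extraction rule could look like; the pattern, the relation label and the example sentence are invented and do not reproduce the rules developed in the thesis.

```python
# Schematic sketch of one rule-based extraction step; the pattern and the
# example sentence are invented, not the rules developed in the thesis.
import re

# A Hearst-style pattern for German copula sentences:
# "<X> ist eine <Y>" suggests a taxonomic relation (X subClassOf Y).
RULE = re.compile(r"(?P<sub>[A-ZÄÖÜ]\w+) ist eine (?P<sup>[A-ZÄÖÜ]\w+)")

def extract_schema_components(sentence):
    """Return candidate (concept, relation, concept) triples for one sentence."""
    return [(m.group("sub"), "subClassOf", m.group("sup"))
            for m in RULE.finditer(sentence)]

print(extract_schema_components("Die Commerzbank ist eine Bank."))
# -> [('Commerzbank', 'subClassOf', 'Bank')]
```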

    JACY - a grammar for annotating syntax, semantics and pragmatics of written and spoken Japanese for NLP application purposes

    In this text, we describe the development of a broad coverage grammar for Japanese that has been built for and used in different application contexts. The grammar is based on work done in the Verbmobil project (Siegel 2000) on machine translation of spoken dialogues in the domain of travel planning. The second application for JACY was the automatic email response task. Grammar development was described in Oepen et al. (2002a). Third, it was applied to the task of understanding material on mobile phones available on the internet, while embedded in the project DeepThought (Callmeier et al. 2004, Uszkoreit et al. 2004). Currently, it is being used for treebanking and ontology extraction from dictionary definition sentences by the Japanese company NTT (Bond et al. 2004)

    An automatic morphological analysis system for Indonesian

    This thesis reports the creation of SANTI-morf (Sistem Analisis Teks Indonesia – morfologi), a rule-based system that performs morphological annotation for Indonesian. The system has been built across three stages, namely preliminaries, annotation scheme creation (the linguistic aspect of the project), and system implementation (the computational aspect of the project). The preliminary matters covered include the necessary key concepts in morphology and Natural Language Processing (NLP), as well as a concise description of Indonesian morphology (largely based on the two primary reference grammars of Indonesian, Alwi et al. 1998 and Sneddon et al. 2010, together with work in the linguistic literature on Indonesian morphology, e.g. Kridalaksana 1989; Chaer 2008). As part of this preliminary stage, I created a testbed corpus for evaluation purposes. The design of the testbed is justified by considering the design of existing evaluation corpora, such as the testbed used by the English Constraint Grammar (EngCG) system (Voutilainen 1992), the British National Corpus (BNC) 1994 evaluation data, and the training data used by MorphInd (Larasati et al. 2011), a morphological analyser (MA) for Indonesian. The dataset for this testbed was created by narrowing down an existing very large but unbalanced collection of texts (drawn from the Leipzig corpora; see Goldhahn et al. 2012). The initial collection was reduced to a corpus composed of nine domains, following the domain categorisation of the BNC. A set of texts from each domain, proportional in size, was extracted and combined to form a testbed that complies with the design considerations informed by the prior literature. The second stage, scheme creation, involved the creation of a new Morphological Annotation Scheme (MAS) for Indonesian, for use in the SANTI-morf system. First, a review of MASs in different languages (Finnish, Turkish, Arabic, Indonesian) as well as the Universal Dependencies MAS identifies the best practices in the field. From these, 15 design principles for the novel MAS were devised. This MAS consists of a morphological tagset, together with comprehensive justification of the morphological analyses used in the system. It achieves full morpheme-level annotation, presenting each morpheme's orthographic and citation forms in the defined output, accompanied by robust morphological analyses, both formal and functional; to my knowledge, this is the first MAS of its kind for Indonesian. The MAS's design is based not only on reference grammars of Indonesian and other linguistic sources, but also on the anticipated needs of researchers and other users of texts and corpora annotated using this scheme of analysis. The third stage of the project, implementation, consisted of three parts: a benchmarking evaluation exercise, a survey of frameworks and tools, and ultimately the actual implementation and evaluation of SANTI-morf. MorphInd (Larasati et al. 2012) is the prior state-of-the-art MA for Indonesian. That being the case, I evaluated MorphInd's performance against the aforementioned testbed, both as justification of the need for an improved system and to serve as a benchmark for SANTI-morf. MorphInd scored 93% on lexical coverage and 89% on tagging accuracy. Next, I surveyed existing MA frameworks and tools. This survey justifies my choice of the rule-based approach (inspired by Koskenniemi's 1983 Two-Level Morphology) and of NooJ (Silberztein 2003) as, respectively, the framework and the software tool for SANTI-morf. After selection of this approach and tool, the language resources that constitute the SANTI-morf system were created. These are, primarily, a number of lexicons and sets of analysis rules, as well as the necessary NooJ system configuration files. SANTI-morf's three lexicon files (in total 86,590 entries) and 15 rule files (in total 659 rules) are organised into four modules, namely the Annotator, the Guesser, the Improver and the Disambiguator. These modules are applied one after another in a pipeline. The Annotator provides initial morpheme-level annotation for Indonesian words by identifying how they have been built by various morphological processes (affixation, reduplication, compounding, and cliticisation). The Guesser ensures that words not covered by the Annotator, because they are missing from its lexicons, receive best guesses as to the correct analysis from the application of a set of probable but not exceptionless rules. The Improver improves the existing annotation by adding probable analyses that the Annotator might have missed. Finally, the Disambiguator resolves ambiguities, that is, words for which the earlier elements of the pipeline have generated two or more possible analyses in terms of the morphemes identified or their annotation. NooJ annotations are saved in a binary file, but for evaluation purposes, plain-text output is required. I thus developed a system for data export, using an in-NooJ mapping to and from a modified, exportable expression of the MAS, and wrote a small program to enable re-conversion of the output into plain-text format. For the purposes of the evaluation, I created a 10,000-word gold-standard dataset, manually annotated with the SANTI-morf scheme. The outcome of the evaluation is that SANTI-morf has 100% coverage (because a best-guess analysis is always provided for unrecognised word forms), and 99% precision and recall for the morphological annotations, with a 1% rate of remaining ambiguity in the final output. SANTI-morf is thus shown to present a number of advancements over MorphInd, the previous state-of-the-art MA for Indonesian, exhibiting more robust annotation and better coverage. Other performance indicators, namely the high precision and recall, make SANTI-morf a concrete advance in the field of automated morphological annotation for Indonesian, and in consequence a substantive contribution to the field of Indonesian linguistics overall
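    The four-module pipeline described above can be pictured with a minimal sketch; the stage internals below are placeholders, since the real modules are implemented as NooJ lexicons and rules rather than Python code.

```python
# Minimal sketch of the four-stage SANTI-morf pipeline architecture; the
# stage internals are placeholders for the actual NooJ lexicons and rules.

def annotator(token, analyses):
    # Initial morpheme-level analyses from the lexicons (toy entry only).
    return analyses + ([("makan", "root;verb")] if token == "makanan" else [])

def guesser(token, analyses):
    # Best-guess analysis for words the lexicons do not cover.
    return analyses or [(token, "root;guessed")]

def improver(token, analyses):
    # Add probable analyses the Annotator may have missed (placeholder).
    return analyses

def disambiguator(token, analyses):
    # Keep a single analysis when several compete (placeholder heuristic).
    return analyses[:1]

PIPELINE = [annotator, guesser, improver, disambiguator]

def analyse(token):
    analyses = []
    for stage in PIPELINE:          # the modules are applied in sequence
        analyses = stage(token, analyses)
    return analyses

print(analyse("makanan"))   # covered by the toy lexicon
print(analyse("zzz"))       # falls through to the Guesser
```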

    Lexical database enrichment through semi-automated morphological analysis

    Derivational morphology proposes meaningful connections between words and is largely unrepresented in lexical databases. This thesis presents a project to enrich a lexical database with morphological links and to evaluate their contribution to disambiguation. A lexical database with sense distinctions was required. WordNet was chosen because of its free availability and widespread use. Its suitability was assessed through critical evaluation with respect to specifications and criticisms, using a transparent, extensible model. The identification of serious shortcomings suggested a portable enrichment methodology, applicable to alternative resources. Although 40% of the most frequent words are prepositions, they have been largely ignored by computational linguists, so addition of prepositions was also required. The preferred approach to morphological enrichment was to infer relations from phenomena discovered algorithmically. Both existing databases and existing algorithms can capture regular morphological relations, but cannot capture exceptions correctly; neither of them provides any semantic information. Some morphological analysis algorithms are subject to the fallacy that morphological analysis can be performed simply by segmentation. Morphological rules, grounded in observation and etymology, govern associations between and attachment of suffixes and contribute to defining the meaning of morphological relationships. Specifying character substitutions circumvents the segmentation fallacy. Morphological rules are prone to undergeneration, minimised through a variable lexical validity requirement, and overgeneration, minimised by rule reformulation and by restricting monosyllabic output. Rules take into account the morphology of ancestor languages through co-occurrences of morphological patterns. Multiple rules applicable to an input suffix need their precedence established. The resistance of prefixations to segmentation has been addressed by identifying linking vowel exceptions and irregular prefixes. The automatic affix discovery algorithm applies heuristics to identify meaningful affixes and is combined with morphological rules into a hybrid model, fed only with empirical data, collected without supervision. Further algorithms apply the rules optimally to automatically pre-identified suffixes and break words into their component morphemes. To handle exceptions, stoplists were created in response to initial errors and fed back into the model through iterative development, leading to 100% precision, contestable only on lexicographic criteria. Stoplist length is minimised by special treatment of monosyllables and reformulation of rules. 96% of words and phrases are analysed. 218,802 directed derivational links have been encoded in the lexicon rather than the wordnet component of the model because the lexicon provides the optimal clustering of word senses. Both links and analyser are portable to an alternative lexicon. The evaluation uses the extended gloss overlaps disambiguation algorithm. The enriched model outperformed WordNet in terms of recall without loss of precision. Failure of all experiments to outperform disambiguation by frequency reflects on WordNet sense distinctions
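    The interplay of character-substitution rules, the lexical validity requirement and stoplists can be illustrated with a small sketch; the rule, the lexicon and the stoplist below are invented examples, not the resources built in the thesis.

```python
# Simplified sketch of a morphological rule as a character substitution,
# with a lexical validity check and a stoplist; the rule and word lists
# are illustrative only.

LEXICON = {"happy", "happiness", "busy", "business"}
STOPLIST = {"business"}   # exceptions fed back from observed errors

def apply_rule(word):
    """Relate '-iness' nouns to '-y' adjectives via a character substitution.

    The substitution ('iness' -> 'y') avoids the segmentation fallacy:
    'happiness' is not simply 'happi' + 'ness'.
    """
    if word in STOPLIST or not word.endswith("iness"):
        return None
    candidate = word[:-len("iness")] + "y"
    # Lexical validity: the proposed base must itself be in the lexicon.
    return (word, "derivedFrom", candidate) if candidate in LEXICON else None

print(apply_rule("happiness"))   # ('happiness', 'derivedFrom', 'happy')
print(apply_rule("business"))    # None -- blocked by the stoplist
```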

    Metrics of Graph-Based Meaning Representations with Applications from Parsing Evaluation to Explainable NLG Evaluation and Semantic Search

    "Who does what to whom?" The goal of a graph-based meaning representation (in short: MR) is to represent the meaning of a text in a structured format. With an MR, we can explicate the meaning of a text, describe occurring events and entities, and their semantic relations. Thus, a metric of MRs would measure a distance (or similarity) between MRs. We believe that such a meaning-focused similarity measurement can be useful for several important AI tasks, for instance, testing the capability of systems to produce meaningful output (system evaluation), or when searching for similar texts (information retrieval). Moreover, due to the natural explicitness of MRs, we hypothesize that MR metrics could provide us with valuable explainability of their similarity measurement. Indeed, if texts reside in a space where their meaning has been isolated and structured, we might directly see in which aspects two texts are actually similar (or dissimilar). However, we find that there is not much previous work on MR metrics, and thus we lack fundamental knowledge about them and their potential applications. Therefore, we make first steps to explore MR metrics and MR spaces, focusing on two key goals: 1. Develop novel and generally applicable methods for conducting similarity measurements in the space of MRs; 2. Explore potential applications that can profit from similarity assessments in MR spaces, including, but (by far) not limited to, their "classic" purpose of evaluating the quality of a text-to-MR system against a reference (aka parsing evaluation). We start by analyzing contributions from previous works that have proposed MR metrics for parsing evaluation. Then, we move beyond this restricted setup and start to develop novel and more general MR metrics based on i) insights from our analysis of the previous parsing evaluation metrics and ii) our motivation to extend MR metrics to similarity assessment of natural language texts. To empirically evaluate and assess our generalized MR metrics, and to open the door for future improvements, we propose the first benchmark of MR metrics. With our benchmark, we can study MR metrics through the lens of multiple metric-objectives such as sentence similarity and robustness. Then, we investigate novel applications of MR metrics. First, we explore new ways of applying MR metrics to evaluate systems that produce i) text from MRs (MR-to-text evaluation) and ii) MRs from text (MR parsing). We call our new setting MR projection-based, since we presume that one MR (at least) is unobserved and needs to be approximated. An advantage of such projection-based MR metric methods is that we can ablate a costly human reference. Notably, when visiting the MR-to-text scenario, we touch on a much broader application scenario for MR metrics: explainable MR-grounded evaluation of text generation systems. Moving steadily towards the application of MR metrics to general text similarity, we study MR metrics for measuring the meaning similarity of natural language arguments, which is an important task in argument mining, a new and surging area of natural language processing (NLP). In particular, we show that MRs and MR metrics can support an explainable and unsupervised argument similarity analysis and inform us about the quality of argumentative conclusions. Ultimately, we seek even more generality and are also interested in practical aspects such as efficiency. 
To this end, we distill the insights from our explorations of MR metric spaces into an explainable, state-of-the-art machine learning model for semantic search, a task for which we would like to achieve both high accuracy and great efficiency. Specifically, we develop a controllable metric distillation approach that can explain how the similarity decisions in the neural text embedding space are modulated through interpretable features, while maintaining (and sometimes improving) the efficiency and accuracy of a high-performance neural semantic search method. This is an important contribution, since it shows i) that we can alleviate the efficiency bottleneck of computationally costly MR graph metrics and, vice versa, ii) that MR metrics can help mitigate a crucial limitation of large "black box" neural methods by eliciting explanations for their decisions
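    As a much-simplified illustration of what a similarity measurement over MRs can look like, the sketch below scores the overlap of relation triples between two toy graphs; the MR metrics studied in the thesis additionally solve node alignment and graded similarity, which this toy version omits.

```python
# Toy sketch of graph-overlap scoring between two meaning representations;
# real MR metrics additionally handle node alignment and graded similarity.

def triple_f1(mr_a, mr_b):
    """Harmonic mean of triple precision and recall between two MRs,
    each given as a collection of (source, relation, target) triples."""
    a, b = set(mr_a), set(mr_b)
    overlap = len(a & b)
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(a), overlap / len(b)
    return 2 * precision * recall / (precision + recall)

gold = [("give-01", "ARG0", "boy"), ("give-01", "ARG1", "ball"),
        ("give-01", "ARG2", "girl")]
pred = [("give-01", "ARG0", "boy"), ("give-01", "ARG1", "ball")]

print(round(triple_f1(pred, gold), 3))   # -> 0.8
```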