171 research outputs found

    Parallel Natural Language Parsing: From Analysis to Speedup

    Electrical Engineering, Mathematics and Computer Science

    Parser as a Novel Reliability Implementation in the Pervasive Computing

    The ubiquity of mobile devices affects the way society works, beyond voice and text messaging. Smartphone capabilities have become similar to those of computers, letting users engage in social networking, flash reports, and other vital applications; these mobile devices can also be used to control other devices. However, the heterogeneity of operating systems, hardware, and protocols poses the challenge of ensuring that messages can be transferred reliably between mobile devices with different communication equipment. In this paper, we show how a parser can be used as a reliability mechanism in our proposed system, and how a message queue enables interoperation in pervasive computing
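    The combination described in this abstract can be pictured with a small sketch: a parser validates every incoming message before it is admitted to the queue, so malformed payloads from heterogeneous devices are rejected (and can be retransmitted) instead of propagated. This is a minimal illustration under assumed names and message format; parse_message, receive, and the JSON schema are invented here and do not reproduce the paper's implementation.

```python
# Minimal sketch (not the paper's system): a parser acts as a reliability
# gate in front of a message queue. All field names are illustrative.
import json
import queue

inbox = queue.Queue()  # stands in for the middleware message queue

def parse_message(raw: str):
    """Parse and validate a raw message; return None if it is malformed."""
    try:
        msg = json.loads(raw)
    except json.JSONDecodeError:
        return None
    # Reliability check: every message must carry these (assumed) fields.
    if not {"sender", "device_type", "payload"} <= msg.keys():
        return None
    return msg

def receive(raw: str) -> bool:
    """Enqueue a message only if the parser accepts it."""
    msg = parse_message(raw)
    if msg is None:
        return False  # caller may retransmit
    inbox.put(msg)
    return True

receive('{"sender": "phone-1", "device_type": "android", "payload": "ON"}')
receive('not a message')  # rejected by the parser
print(inbox.qsize())      # -> 1
```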

    Treebank-based acquisition of Chinese LFG resources for parsing and generation

    This thesis describes a treebank-based approach to automatically acquire robust, wide-coverage Lexical-Functional Grammar (LFG) resources for Chinese parsing and generation, which is part of a larger project on the rapid construction of deep, large-scale, constraint-based, multilingual grammatical resources. I present an application-oriented LFG analysis for Chinese core linguistic phenomena and (in cooperation with PARC) develop a gold-standard dependency bank of Chinese f-structures for evaluation. Based on the Penn Chinese Treebank, I design and implement two architectures for inducing Chinese LFG resources, one annotation-based and the other dependency-conversion-based. I then apply the f-structure acquisition algorithm together with external, state-of-the-art parsers to parsing new text into "proto" f-structures. In order to convert "proto" f-structures into "proper" f-structures or deep dependencies, I present a novel Non-Local Dependency (NLD) recovery algorithm using subcategorisation frames and f-structure paths linking antecedents and traces in NLDs extracted from the automatically-built LFG f-structure treebank. Based on the grammars extracted from the f-structure annotated treebank, I develop a PCFG-based chart generator and a new n-gram-based pure dependency generator to realise Chinese sentences from LFG f-structures. The work reported in this thesis is the first effort to scale treebank-based, probabilistic Chinese LFG resources from proof-of-concept research to unrestricted, real text. Although this thesis concentrates on Chinese and LFG, many of the methodologies, e.g. the acquisition of predicate-argument structures, NLD resolution and the PCFG- and dependency-n-gram-based generation models, are largely language- and formalism-independent and should generalise to diverse languages as well as to labelled bilexical dependency representations other than LFG
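    The NLD recovery step lends itself to a toy illustration. The sketch below is not the thesis algorithm (which learns subcategorisation frames and antecedent-trace paths from the f-structure treebank); it only shows the core move of following a known path from a trace to its antecedent. The f-structure, the paths in known_paths, and resolve_traces are all invented for this example.

```python
# Toy illustration of NLD recovery: f-structures as nested dicts.
# "What did you say?" with the object of 'say' displaced:
fstr = {
    "PRED": "say<SUBJ,OBJ>",
    "SUBJ": {"PRED": "you"},
    "FOCUS": {"PRED": "what"},  # dislocated antecedent
    "OBJ": {"PRED": "TRACE"},   # gap left by the displacement
}

# Attribute paths linking antecedents to traces, with scores as if
# extracted from an f-structure treebank (values invented).
known_paths = {("FOCUS",): 0.9, ("TOPIC",): 0.1}

def resolve_traces(f):
    """Replace each TRACE with the antecedent found along the
    highest-scoring known path, if any."""
    for gf, sub in f.items():
        if isinstance(sub, dict) and sub.get("PRED") == "TRACE":
            for path, _score in sorted(known_paths.items(),
                                       key=lambda kv: -kv[1]):
                ante = f
                for attr in path:        # walk the path from the top
                    ante = ante.get(attr, {})
                if ante:                 # antecedent found: share it
                    f[gf] = ante
                    break
    return f

print(resolve_traces(fstr)["OBJ"])  # -> {'PRED': 'what'}
```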

    Natural language processing

    Beginning with the basic issues of NLP, this chapter aims to chart the major research activities in this area since the last ARIST chapter in 1996 (Haas, 1996), including: (i) natural language text processing systems - text summarization, information extraction, information retrieval, etc., including domain-specific applications; (ii) natural language interfaces; (iii) NLP in the context of the WWW and digital libraries; and (iv) evaluation of NLP systems

    JACY - a grammar for annotating syntax, semantics and pragmatics of written and spoken Japanese for NLP application purposes

    In this text, we describe the development of a broad-coverage grammar for Japanese that has been built for and used in different application contexts. The grammar is based on work done in the Verbmobil project (Siegel 2000) on machine translation of spoken dialogues in the domain of travel planning. The second application for JACY was the automatic email response task; grammar development for this task is described in Oepen et al. (2002a). Third, it was applied to the task of understanding material on mobile phones available on the internet, embedded in the DeepThought project (Callmeier et al. 2004, Uszkoreit et al. 2004). Currently, it is being used for treebanking and ontology extraction from dictionary definition sentences by the Japanese company NTT (Bond et al. 2004)

    Statistical Deep Parsing for Spanish

    This document presents the development of a statistical HPSG parser for Spanish. HPSG is a deep linguistic formalism that combines syntactic and semantic information in the same representation, and is capable of elegantly modeling many linguistic phenomena. Our research consists of the following steps: design of the HPSG grammar, construction of the corpus, implementation of the parsing algorithms, and evaluation of the parsers' performance. We created a simple yet powerful HPSG grammar for Spanish that models morphosyntactic information of words, syntactic combinatorial valence, and semantic argument structures in its lexical entries. The grammar uses thirteen very broad rules for attaching specifiers, complements, modifiers, clitics, relative clauses and punctuation symbols, and for modeling coordinations. In a simplification from standard HPSG, the only type of long-range dependency we model is the relative clause that modifies a noun phrase, and we use semantic role labeling as our semantic representation. We transformed the Spanish AnCora corpus using a semi-automatic process and analyzed it using our grammar implementation, creating a Spanish HPSG corpus of 517,237 words in 17,328 sentences (all of AnCora). We implemented several statistical parsing algorithms and trained them over this corpus; in particular, we aimed to test approaches based on neural networks. The implemented strategies are: a bottom-up baseline using bilexical comparisons or a multilayer perceptron; a CKY approach that uses the results of a supertagger; and a top-down approach that encodes word sequences using an LSTM network. We evaluated the performance of the implemented parsers and compared them with each other and against other existing Spanish parsers. Our LSTM top-down approach seems to be the best-performing parser on our test data, obtaining the highest scores (compared to our strategies and also to external parsers) according to constituency metrics (87.57 unlabeled F1, 82.06 labeled F1), dependency metrics (91.32 UAS, 88.96 LAS), and SRL (87.68 unlabeled, 80.66 labeled), although the comparison against the external parsers may be noisy due to the post-processing we needed to do in order to adapt them to our format. We also defined a set of metrics to evaluate the identification of some particular language phenomena, and the LSTM top-down parser outperformed the baselines on almost all of these metrics as well
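    The CKY-plus-supertagger strategy mentioned in the abstract can be sketched in a few lines. The version below is schematic and far simpler than the thesis parsers: the supertagger's output is reduced to a set of candidate categories per word, the grammar to a handful of binary rules, and there is no scoring. The rules, tags, and sentence are toy assumptions.

```python
# Schematic CKY over supertagger output (illustrative only).
# rules: (left child, right child) -> parent
rules = {("Det", "N"): "NP", ("NP", "VP"): "S", ("V", "NP"): "VP"}

def cky(supertags):
    n = len(supertags)
    # chart[i][j] = set of categories spanning words i..j inclusive
    chart = [[set() for _ in range(n)] for _ in range(n)]
    for i, tags in enumerate(supertags):
        chart[i][i] = set(tags)          # lexical categories per word
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span - 1
            for k in range(i, j):        # split point
                for l in chart[i][k]:
                    for r in chart[k + 1][j]:
                        if (l, r) in rules:
                            chart[i][j].add(rules[(l, r)])
    return chart[0][n - 1]

# "the cat saw the dog", with supertagger output per word:
tags = [["Det"], ["N"], ["V"], ["Det"], ["N"]]
print(cky(tags))  # -> {'S'}
```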

    Proceedings

    Proceedings of the Ninth International Workshop on Treebanks and Linguistic Theories. Editors: Markus Dickinson, Kaili Müürisep and Marco Passarotti. NEALT Proceedings Series, Vol. 9 (2010), 268 pages. © 2010 The editors and contributors. Published by the Northern European Association for Language Technology (NEALT; http://omilia.uio.no/nealt ). Electronically published at Tartu University Library (Estonia): http://hdl.handle.net/10062/15891

    Essays on the use of e-Learning in statistics and the implementation of statistical software

    This doctoral thesis collects the papers the author has written with his coauthors on e-Learning and statistical software. Chapters 2 to 5 are devoted to selected aspects of e-Learning; chapters 6 to 9 describe the development of the statistical programming environment Yxilon. Chapter 2, coauthored by Wolfgang Härdle and Sigbert Klinke, discusses whether and how computational elements should be integrated into the canon of methodological education, and where e-techniques have their limits in statistics education. Chapter 3, coauthored by Wolfgang Härdle and Sigbert Klinke, reviews different e-learning platforms for statistics and identifies points that should be taken into account for future e-learning platforms in statistics and related fields. Chapter 4, written with Wolfgang Härdle and Sigbert Klinke, discusses two papers published in the International Statistical Review which both offer a technical solution to improve students' understanding of statistics. Chapter 5, coauthored by Wolfgang Härdle and Sigbert Klinke, describes web-related techniques for teaching statistics; it furthermore introduces the Quantnet platform, a framework to manage scientific code and data. In chapter 6, coauthored by Wolfgang Härdle and Sigbert Klinke, the requirements for a statistical engine are discussed. Chapter 7, written jointly with Yuval Guri and Sigbert Klinke, explains the ideas that led to the reimplementation of the XploRe language and discusses selected technical aspects of the Yxilon platform, such as the object database and the generation of compilable code for high-level languages. In chapter 8, coauthored by Wolfgang Härdle and Sigbert Klinke, the implemented client/server structure of the Yxilon platform is laid out in terms of technical features: the server and the communication protocol are described together with the developed Java client featuring the Jasplot graphics engine. Finally, chapter 9 describes the structure of the Yxilon environment in its present form
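    The client/server split described for chapter 8 can be illustrated with a minimal request/response exchange. The sketch below is purely illustrative: Yxilon's actual protocol, server, and Java/Jasplot client are not reproduced; the one-line text protocol and the toy engine function are assumptions made for the example.

```python
# Minimal client/server sketch: the client sends one line holding a
# command, the server evaluates it with a toy "statistical engine"
# and sends one line back. Not Yxilon's real protocol.
import socket
import threading

def engine(command: str) -> str:
    """Toy stand-in for the statistical engine: mean of numbers."""
    if command.startswith("mean "):
        nums = [float(x) for x in command.split()[1:]]
        return str(sum(nums) / len(nums))
    return "error: unknown command"

def serve(sock):
    conn, _ = sock.accept()
    with conn:
        request = conn.makefile().readline().strip()
        conn.sendall((engine(request) + "\n").encode())

server = socket.create_server(("127.0.0.1", 0))   # ephemeral port
threading.Thread(target=serve, args=(server,), daemon=True).start()

client = socket.create_connection(server.getsockname())
client.sendall(b"mean 1 2 3 4\n")
print(client.makefile().readline().strip())        # -> 2.5
client.close()
```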