40 research outputs found
Automatic grammar induction from free text using insights from cognitive grammar
Automatic identification of the grammatical structure of a sentence is useful in many Natural Language
Processing (NLP) applications such as Document Summarisation, Question Answering systems and
Machine Translation. With the availability of syntactic treebanks, supervised parsers have been
developed successfully for many major languages. However, for low-resourced minority languages with
fewer digital resources, this poses more of a challenge. Moreover, there are a number of syntactic
annotation schemes motivated by different linguistic theories and formalisms which are sometimes
language specific and they cannot always be adapted for developing syntactic parsers across different
language families.
This project aims to develop a linguistically motivated approach to the automatic induction of
grammatical structures from raw sentences. Such an approach can be readily adapted to different
languages including low-resourced minority languages. We draw the basic approach to linguistic analysis
from usage-based, functional theories of grammar such as Cognitive Grammar, Computational Paninian
Grammar and insights from psycholinguistic studies. Our approach identifies grammatical structure of a
sentence by recognising domain-independent, general, cognitive patterns of conceptual organisation
that occur in natural language. It also reflects some of the general psycholinguistic properties of parsing
by humans - such as incrementality, connectedness and expectation.
Our implementation has three components: Schema Definition, Schema Assembly and Schema
Prediction. Schema Definition and Schema Assembly components were implemented algorithmically as
a dictionary and rules. An Artificial Neural Network was trained for Schema Prediction. By using Parts of
Speech tags to bootstrap the simplest case of token level schema definitions, a sentence is passed
through all the three components incrementally until all the words are exhausted and the entire
sentence is analysed as an instance of one final construction schema. The order in which all intermediate
schemas are assembled to form the final schema can be viewed as the parse of the sentence. Parsers
for English and Welsh (a low-resource minority language) were developed using the same approach with
some changes to the Schema Definition component. We evaluated the parser performance by (a)
Quantitative evaluation by comparing the parsed chunks against the constituents in a phrase structure
tree (b) Manual evaluation by listing the range of linguistic constructions covered by the parser and by
performing error analysis on the parser outputs (c) Evaluation by identifying the number of edits
required for a correct assembly (d) Qualitative evaluation based on Likert scales in online surveys
Handbook of Lexical Functional Grammar
Lexical Functional Grammar (LFG) is a nontransformational theory of
linguistic structure, first developed in the 1970s by Joan Bresnan and
Ronald M. Kaplan, which assumes that language is best described and
modeled by parallel structures representing different facets of
linguistic organization and information, related by means of
functional correspondences. This volume has five parts. Part I,
Overview and Introduction, provides an introduction to core syntactic
concepts and representations. Part II, Grammatical Phenomena, reviews
LFG work on a range of grammatical phenomena or constructions. Part
III, Grammatical modules and interfaces, provides an overview of LFG
work on semantics, argument structure, prosody, information structure,
and morphology. Part IV, Linguistic disciplines, reviews LFG work in
the disciplines of historical linguistics, learnability,
psycholinguistics, and second language learning. Part V, Formal and
computational issues and applications, provides an overview of
computational and formal properties of the theory, implementations,
and computational work on parsing, translation, grammar induction, and
treebanks. Part VI, Language families and regions, reviews LFG work
on languages spoken in particular geographical areas or in particular
language families. The final section, Comparing LFG with other
linguistic theories, discusses LFG work in relation to other
theoretical approaches
Noun phrase chunker for Turkish using dependency parser
Ankara : The Department of Computer Engineering and the Institute of Engineering and Science of Bilkent University, 2010.Thesis (Master's) -- Bilkent University, 2010.Includes bibliographical references leaves 89-97.Noun phrase chunking is a sub-category of shallow parsing that can be used for many natural language processing tasks. In this thesis, we propose a noun phrase chunker system for Turkish texts. We use a weighted constraint dependency parser to represent the relationship between sentence components and to determine noun phrases.
The dependency parser uses a set of hand-crafted rules which can combine morphological and semantic information for constraints. The rules are suitable for handling complex noun phrase structures because of their flexibility. The developed dependency parser can be easily used for shallow parsing of all phrase types by changing the employed rule set.
The lack of reliable human tagged datasets is a significant problem for natural language studies about Turkish. Therefore, we constructed the first noun phrase dataset for Turkish. According to our evaluation results, our noun phrase chunker gives promising results on this dataset.
The correct morphological disambiguation of words is required for the correctness of the dependency parser. Therefore, in this thesis, we propose a hybrid morphological disambiguation technique which combines statistical information, hand-crafted grammar rules, and transformation based learning rules. We have also constructed a dataset for testing the performance of our disambiguation system. According to tests, the disambiguation system is highly effective.Kutlu, MücahidM.S
Proceedings of the Conference on Natural Language Processing 2010
This book contains state-of-the-art contributions to the 10th
conference on Natural Language Processing, KONVENS 2010
(Konferenz zur Verarbeitung natürlicher Sprache), with a focus
on semantic processing.
The KONVENS in general aims at offering a broad perspective
on current research and developments within the interdisciplinary
field of natural language processing. The central theme
draws specific attention towards addressing linguistic aspects
ofmeaning, covering deep as well as shallow approaches to semantic
processing. The contributions address both knowledgebased
and data-driven methods for modelling and acquiring
semantic information, and discuss the role of semantic information
in applications of language technology.
The articles demonstrate the importance of semantic processing,
and present novel and creative approaches to natural
language processing in general. Some contributions put their
focus on developing and improving NLP systems for tasks like
Named Entity Recognition or Word Sense Disambiguation, or
focus on semantic knowledge acquisition and exploitation with
respect to collaboratively built ressources, or harvesting semantic
information in virtual games. Others are set within the
context of real-world applications, such as Authoring Aids, Text
Summarisation and Information Retrieval. The collection highlights
the importance of semantic processing for different areas
and applications in Natural Language Processing, and provides
the reader with an overview of current research in this field
Natural language processing for semiautomatic semantics extractio: encyclopedic entry disambiguation and relationship extraction using wikipedia and wordnet
Tesis doctoral inédita. Universidad Autónoma de Madrid, Escuela Politécnica Superior, septiembre de 200
Discourse analysis of arabic documents and application to automatic summarization
Dans un discours, les textes et les conversations ne sont pas seulement une juxtaposition de mots et de phrases. Ils sont plutôt organisés en une structure dans laquelle des unités de discours sont liées les unes aux autres de manière à assurer à la fois la cohérence et la cohésion du discours. La structure du discours a montré son utilité dans de nombreuses applications TALN, y compris la traduction automatique, la génération de texte et le résumé automatique. L'utilité du discours dans les applications TALN dépend principalement de la disponibilité d'un analyseur de discours performant. Pour aider à construire ces analyseurs et à améliorer leurs performances, plusieurs ressources ont été annotées manuellement par des informations de discours dans des différents cadres théoriques. La plupart des ressources disponibles sont en anglais. Récemment, plusieurs efforts ont été entrepris pour développer des ressources discursives pour d'autres langues telles que le chinois, l'allemand, le turc, l'espagnol et le hindi. Néanmoins, l'analyse de discours en arabe standard moderne (MSA) a reçu moins d'attention malgré le fait que MSA est une langue de plus de 422 millions de locuteurs dans 22 pays. Le sujet de thèse s'intègre dans le cadre du traitement automatique de la langue arabe, plus particulièrement, l'analyse de discours de textes arabes. Cette thèse a pour but d'étudier l'apport de l'analyse sémantique et discursive pour la génération de résumé automatique de documents en langue arabe. Pour atteindre cet objectif, nous proposons d'étudier la théorie de la représentation discursive segmentée (SDRT) qui propose un cadre logique pour la représentation sémantique de phrases ainsi qu'une représentation graphique de la structure du texte où les relations de discours sont de nature sémantique plutôt qu'intentionnelle. Cette théorie a été étudiée pour l'anglais, le français et l'allemand mais jamais pour la langue arabe. Notre objectif est alors d'adapter la SDRT à la spécificité de la langue arabe afin d'analyser sémantiquement un texte pour générer un résumé automatique. Nos principales contributions sont les suivantes : Une étude de la faisabilité de la construction d'une structure de discours récursive et complète de textes arabes. En particulier, nous proposons : Un schéma d'annotation qui couvre la totalité d'un texte arabe, dans lequel chaque constituant est lié à d'autres constituants. Un document est alors représenté par un graphe acyclique orienté qui capture les relations explicites et les relations implicites ainsi que des phénomènes de discours complexes, tels que l'attachement, la longue distance du discours pop-ups et les dépendances croisées. Une nouvelle hiérarchie des relations de discours. Nous étudions les relations rhétoriques d'un point de vue sémantique en se concentrant sur leurs effets sémantiques et non pas sur la façon dont elles sont déclenchées par des connecteurs de discours, qui sont souvent ambigües en arabe.
o une analyse quantitative (en termes de connecteurs de discours, de fréquences de relations, de proportion de relations implicites, etc.) et une analyse qualitative (accord inter-annotateurs et analyse des erreurs) de la campagne d'annotation. Un outil d'analyse de discours où nous étudions à la fois la segmentation automatique de textes arabes en unités de discours minimales et l'identification automatique des relations explicites et implicites du discours. L'utilisation de notre outil pour résumer des textes arabes. Nous comparons la représentation de discours en graphes et en arbres pour la production de résumés.Within a discourse, texts and conversations are not just a juxtaposition of words and sentences. They are rather organized in a structure in which discourse units are related to each other so as to ensure both discourse coherence and cohesion. Discourse structure has shown to be useful in many NLP applications including machine translation, natural language generation and language technology in general. The usefulness of discourse in NLP applications mainly depends on the availability of powerful discourse parsers. To build such parsers and improve their performances, several resources have been manually annotated with discourse information within different theoretical frameworks. Most available resources are in English. Recently, several efforts have been undertaken to develop manually annotated discourse information for other languages such as Chinese, German, Turkish, Spanish and Hindi. Surprisingly, discourse processing in Modern Standard Arabic (MSA) has received less attention despite the fact that MSA is a language with more than 422 million speakers in 22 countries. Computational processing of Arabic language has received a great attention in the literature for over twenty years. Several resources and tools have been built to deal with Arabic non concatenative morphology and Arabic syntax going from shallow to deep parsing. However, the field is still very vacant at the layer of discourse. As far as we know, the sole effort towards Arabic discourse processing was done in the Leeds Arabic Discourse Treebank that extends the Penn Discourse TreeBank model to MSA. In this thesis, we propose to go beyond the annotation of explicit relations that link adjacent units, by completely specifying the semantic scope of each discourse relation, making transparent an interpretation of the text that takes into account the semantic effects of discourse relations. In particular, we propose the first effort towards a semantically driven approach of Arabic texts following the Segmented Discourse Representation Theory (SDRT). Our main contributions are: A study of the feasibility of building a recursive and complete discourse structures of Arabic texts. In particular, we propose: An annotation scheme for the full discourse coverage of Arabic texts, in which each constituent is linked to other constituents. A document is then represented by an oriented acyclic graph, which captures explicit and implicit relations as well as complex discourse phenomena, such as long-distance attachments, long-distance discourse pop-ups and crossed dependencies. A novel discourse relation hierarchy. We study the rhetorical relations from a semantic point of view by focusing on their effect on meaning and not on how they are lexically triggered by discourse connectives that are often ambiguous, especially in Arabic. A thorough quantitative analysis (in terms of discourse connectives, relation frequencies, proportion of implicit relations, etc.) and qualitative analysis (inter-annotator agreements and error analysis) of the annotation campaign. An automatic discourse parser where we investigate both automatic segmentation of Arabic texts into elementary discourse units and automatic identification of explicit and implicit Arabic discourse relations. An application of our discourse parser to Arabic text summarization. We compare tree-based vs. graph-based discourse representations for producing indicative summaries and show that the full discourse coverage of a document is definitively a plus