1,448 research outputs found

    Decision problems for Clark-congruential languages

    A common question when studying a class of context-free grammars is whether equivalence is decidable within this class. We answer this question positively for the class of Clark-congruential grammars, which are of interest to grammatical inference. We also consider the problem of checking whether a given CFG is Clark-congruential, and show that it is decidable given that the CFG is a DCFG.
    Comment: Version 2 incorporates revisions prompted by the comments of anonymous referees at ICGI and LearnAu

    Using Contextual Representations to Efficiently Learn Context-Free Languages

    We present a polynomial update time algorithm for the inductive inference of a large class of context-free languages using the paradigm of positive data and a membership oracle. We achieve this result by moving to a novel representation, called Contextual Binary Feature Grammars (CBFGs), which are capable of representing richly structured context-free languages as well as some context-sensitive languages. These representations explicitly model the lattice structure of the distribution of a set of substrings and can be inferred using a generalisation of distributional learning. This formalism is an attempt to bridge the gap between simple learnable classes and the sorts of highly expressive representations necessary for linguistic representation: it allows the learnability of a large class of context-free languages, which includes all regular languages and those context-free languages that satisfy two simple constraints. The formalism and the algorithm seem well suited to natural language and in particular to the modeling of first language acquisition. Preliminary experimental results confirm the effectiveness of this approach.
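    The distributional core of this approach can be made concrete with a small sketch. The Python fragment below is illustrative only, not the authors' algorithm: it assumes a hypothetical membership oracle (here a toy balanced-parentheses language) and simply tabulates which observed (context, substring) pairs the oracle accepts, the kind of binary feature relation a CBFG is built over.

        # Illustrative sketch of the (context, substring) relation behind CBFGs.
        # `in_language` is a stand-in membership oracle; it accepts the toy
        # language of balanced parentheses purely for demonstration.
        def in_language(word):
            depth = 0
            for ch in word:
                depth += 1 if ch == "(" else -1
                if depth < 0:
                    return False
            return depth == 0

        def substrings(word):
            return {word[i:j] for i in range(len(word)) for j in range(i, len(word) + 1)}

        def contexts(word):
            return {(word[:i], word[j:]) for i in range(len(word) + 1) for j in range(i, len(word) + 1)}

        def feature_table(positive_sample):
            """Map each observed substring to the observed contexts that,
            according to the oracle, it can fill while staying in the language."""
            subs = set().union(*(substrings(w) for w in positive_sample))
            ctxs = set().union(*(contexts(w) for w in positive_sample))
            return {s: {(l, r) for (l, r) in ctxs if in_language(l + s + r)} for s in subs}

        table = feature_table(["()", "(())", "()()"])
        print(table["()"])  # contexts in which "()" can be wrapped grammatically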

    Model-Based Evaluation of Multilinguality

    Learning Pomset Automata

    We extend the L* algorithm to learn bimonoids recognising pomset languages. We then identify a class of pomset automata that accepts precisely the pomset languages recognised by bimonoids, and show how to convert between bimonoids and automata.
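    For orientation, the word-level L* loop that this work generalises can be sketched as follows. The teacher here is a hypothetical membership oracle for a toy regular language (words over {a, b} with an even number of a's), not the pomset/bimonoid setting of the paper, and the counterexample-driven refinement step is omitted.

        # Minimal sketch of the classic L* observation table over words.
        ALPHABET = "ab"

        def member(word):  # membership oracle for the toy target language
            return word.count("a") % 2 == 0

        def row(prefix, suffixes):
            """Signature of a prefix: oracle answers for prefix + each suffix."""
            return tuple(member(prefix + e) for e in suffixes)

        def close_table(S, E):
            """Extend S until every one-letter extension's row already occurs in S."""
            changed = True
            while changed:
                changed = False
                rows_of_S = {row(s, E) for s in S}
                for s in list(S):
                    for a in ALPHABET:
                        if row(s + a, E) not in rows_of_S:
                            S.append(s + a)
                            rows_of_S.add(row(s + a, E))
                            changed = True
            return S

        S, E = [""], [""]
        S = close_table(S, E)

        # Read a hypothesis automaton off the closed table: states are distinct rows.
        states = {row(s, E): s for s in S}
        transitions = {(row(s, E), a): row(s + a, E) for s in S for a in ALPHABET}
        accepting = {r for r in states if member(states[r])}

        def hypothesis_accepts(word):
            state = row("", E)
            for a in word:
                state = transitions[(state, a)]
            return state in accepting

        print(all(hypothesis_accepts(w) == member(w) for w in ["", "a", "ab", "aab", "baba"]))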

    SHOE: The extraction of hierarchical structure for machine learning of natural language

    Maintaining regularity and generalization in data using the minimum description length principle and genetic algorithm: case of grammatical inference

    In this paper, a genetic algorithm with minimum description length (GAWMDL) is proposed for grammatical inference. The primary challenge in identifying a language of infinite cardinality from a finite set of examples is knowing when to generalize and when to specialize with respect to the training data. The minimum description length principle incorporated here addresses this issue. Previously, the e-GRIDS learning model was proposed, which enjoyed the merits of the minimum description length principle but is limited to positive examples only. The proposed GAWMDL incorporates a traditional genetic algorithm, whose powerful global exploration capability can produce good offspring; this is an effective approach for problems with a large search space, such as the grammatical inference problem. The computational capability of the genetic algorithm is not in question, but it still suffers from premature convergence, arising mainly from a lack of population diversity. The proposed GAWMDL therefore uses a bit-mask-oriented data structure for the reproduction operations: a mask is created, and a Boolean-based procedure is then applied to create offspring in a generative manner. The Boolean-based procedure introduces diversity into the population, thereby alleviating premature convergence. The proposed GAWMDL is applied to context-free as well as regular languages of varying complexity. The computational experiments show that GAWMDL finds an optimal or close-to-optimal grammar. A two-fold performance analysis has been performed. First, GAWMDL is evaluated against the elite mating pool genetic algorithm, which was proposed to introduce diversity and to address premature convergence. GAWMDL is also tested against the improved tabular representation algorithm. In addition, the authors evaluate the performance of GAWMDL against a genetic algorithm not using the minimum description length principle. Statistical tests demonstrate the superiority of the proposed algorithm. Overall, the proposed GAWMDL algorithm greatly improves performance in three main aspects: it maintains regularity of the data, alleviates premature convergence, and is capable of grammatical inference from both positive and negative corpora.
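    As a rough illustration of how the two ingredients fit together, the sketch below scores a candidate rule set with a crude two-part description length (rules kept plus exceptions left in the data) and breeds offspring through a random Boolean bit mask, in the spirit of the mask-oriented reproduction described above. The encoding, fitness weights and data are hypothetical simplifications, not the paper's implementation.

        import random

        random.seed(0)

        PATTERNS = ["ab", "ba", "aa", "bb", "aba", "bab"]   # candidate rules
        POS = ["abab", "abba", "ab"]                         # strings the grammar should cover
        NEG = ["bbbb", "aaaa"]                               # strings it should reject

        def covers(genome, word):
            return any(bit and pat in word for bit, pat in zip(genome, PATTERNS))

        def mdl(genome):
            model_cost = sum(genome)                         # cost of the rules kept
            data_cost = (sum(not covers(genome, w) for w in POS)
                         + sum(covers(genome, w) for w in NEG))  # exceptions to encode
            return model_cost + 3 * data_cost                # weight data errors higher

        def mask_crossover(a, b):
            mask = [random.randint(0, 1) for _ in a]         # Boolean mask, one bit per gene
            return [x if m else y for m, x, y in zip(mask, a, b)]

        def mutate(genome, rate=0.1):
            return [1 - g if random.random() < rate else g for g in genome]

        population = [[random.randint(0, 1) for _ in PATTERNS] for _ in range(20)]
        for _ in range(50):
            population.sort(key=mdl)                         # lower description length is better
            parents = population[:10]
            children = [mutate(mask_crossover(random.choice(parents), random.choice(parents)))
                        for _ in range(10)]
            population = parents + children

        best = min(population, key=mdl)
        print([p for bit, p in zip(best, PATTERNS) if bit], mdl(best))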

    An investigation into deviant morphology : issues in the implementation of a deep grammar for Indonesian

    This thesis investigates deviant morphology in Indonesian for the implementation of a deep grammar. In particular, we focus on the implementation of the verbal suffix -kan. This suffix has been described as having many functions, which alter the kinds of arguments and the number of arguments the verb takes (Dardjowidjojo 1971; Chung 1976; Arka 1993; Vamarasi 1999; Kroeger 2007; Son and Cole 2008). Deep grammars or precision grammars (Butt et al. 1999a; Butt et al. 2003; Bender et al. 2011) have been shown to be useful for natural language processing (NLP) tasks such as machine translation and generation (Oepen et al. 2004; Cahill and Riester 2009; Graham 2011) and information extraction (MacKinlay et al. 2012), demonstrating the need for linguistically rich information to aid NLP tasks. Although these linguistically motivated grammars are invaluable resources to the NLP community, their biggest drawback is the time required for the manual creation and curation of the lexicon. Our work aims to expedite this process by applying methods to assign syntactic information to kan-affixed verbs automatically. The method we employ exploits the hypothesis that semantic similarity is tightly connected with syntactic behaviour (Levin 1993). Our endeavour to automatically acquire verbal information for an Indonesian deep grammar poses a number of linguistic challenges. First of all, Indonesian verbs exhibit voice marking that is characteristic of the subgrouping of its language family. In order to characterise verbal behaviour in Indonesian, we first need to devise a detailed analysis of voice for implementation. Another challenge we face is the claim that all open-class words in Indonesian, at least as it is spoken in some varieties (Gil 1994; Gil 2010), cannot linguistically be analysed as being distinct from each other. That is, there is no distinction between nouns, verbs or adjectives in Indonesian, and all words from the open-class categories should be analysed uniformly. This poses difficulties for implementing a grammar in a linguistically motivated way, as well as for discovering the syntactic behaviour of verbs, if verbs cannot be distinguished from nouns. As part of our investigation we conduct experiments to verify the need to employ word class categories, and we find that these are indeed linguistically motivated labels in Indonesian. Through our investigation into deviant morphological behaviour, we gain a better characterisation of the morphosyntactic effects of -kan, and we discover that, although Indonesian has been labelled as a language with no open word class distinctions, word classes can be established as being linguistically motivated.